mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-24 23:27:25 -04:00
[[synthetic-source]]
==== Synthetic `_source`

Though very handy to have around, the source field takes up a significant amount
of space on disk. Instead of storing source documents on disk exactly as you
send them, Elasticsearch can reconstruct source content on the fly upon retrieval.
To enable this https://www.elastic.co/subscriptions[subscription] feature, use the value `synthetic` for the index setting `index.mapping.source.mode`:

[source,console,id=enable-synthetic-source-example]
----
PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  }
}
----
// TESTSETUP

While this on-the-fly reconstruction is _generally_ slower than saving the source
documents verbatim and loading them at query time, it saves a lot of storage
space. Additional latency can be avoided by not loading the `_source` field in queries when it is not needed.

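As a rough illustration of not loading `_source`, the Python sketch below builds a search request body that sets `"_source": false` and asks for individual values through the `fields` option instead. The helper `build_search_body` and the field name `host.name` are hypothetical, and the body is only constructed here, not sent to a cluster.

```python
import json


def build_search_body(query, wanted_fields):
    """Build a search body that does not return `_source` (illustrative helper)."""
    return {
        "query": query,
        "_source": False,               # do not return _source for the hits
        "fields": list(wanted_fields),  # request specific fields instead
    }


# `host.name` is a made-up field name used only for this example.
body = build_search_body({"match_all": {}}, ["host.name"])
print(json.dumps(body, indent=2))
```

Whether this helps depends on the query: any feature that still needs the full document triggers source reconstruction regardless.
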
[[synthetic-source-fields]]
===== Supported fields
Synthetic `_source` is supported by all field types. Depending on implementation details, field types have different
properties when used with synthetic `_source`.

<<synthetic-source-fields-native-list, Most field types>> construct synthetic `_source` using existing data, most
commonly <<doc-values,`doc_values`>> and <<stored-fields, stored fields>>. For these field types, no additional space
is needed to store the contents of the `_source` field. Due to the storage layout of <<doc-values,`doc_values`>>, the
generated `_source` field undergoes <<synthetic-source-modifications, modifications>> compared to the original document.

For all other field types, the original value of the field is stored as is, in the same way as the `_source` field in
non-synthetic mode. In this case there are no modifications and field data in `_source` is the same as in the original
document. Similarly, malformed values of fields that use <<ignore-malformed,`ignore_malformed`>> or
<<ignore-above,`ignore_above`>> need to be stored as is. This approach is less storage efficient since data needed for
`_source` reconstruction is stored in addition to other data required to index the field (like `doc_values`).

[[synthetic-source-restrictions]]
===== Synthetic `_source` restrictions

Some field types have additional restrictions. These restrictions are documented in the **synthetic `_source`** section
of the field type's <<mapping-types,documentation>>.

[[synthetic-source-modifications]]
===== Synthetic `_source` modifications

When synthetic `_source` is enabled, retrieved documents undergo some
modifications compared to the original JSON.

[[synthetic-source-modifications-leaf-arrays]]
====== Arrays moved to leaf fields
With synthetic `_source`, arrays of objects are moved down to the leaf fields. For example:

[source,console,id=synthetic-source-leaf-arrays-example]
----
PUT idx/_doc/1
{
  "foo": [
    {
      "bar": 1
    },
    {
      "bar": 2
    }
  ]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
  "foo": {
    "bar": [1, 2]
  }
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]

This can cause some arrays to vanish:

[source,console,id=synthetic-source-leaf-arrays-example-sneaky]
----
PUT idx/_doc/1
{
  "foo": [
    {
      "bar": 1
    },
    {
      "baz": 2
    }
  ]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
  "foo": {
    "bar": 1,
    "baz": 2
  }
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]

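The two transformations above can be modelled with a small Python sketch. This is only an illustration of the structural change, not how Elasticsearch implements it: the function name `move_arrays_to_leaves` is hypothetical, and other effects of reconstruction (such as value ordering) are deliberately left out.

```python
def _unwrap(values):
    """Return a single value instead of a one-element list."""
    if isinstance(values, list) and len(values) == 1:
        return values[0]
    return values


def move_arrays_to_leaves(value):
    """Merge arrays of objects so that arrays only remain on leaf fields."""
    if isinstance(value, dict):
        return {key: move_arrays_to_leaves(child) for key, child in value.items()}
    if isinstance(value, list):
        items = [move_arrays_to_leaves(item) for item in value]
        if items and all(isinstance(item, dict) for item in items):
            merged = {}
            for item in items:
                for key, child in item.items():
                    bucket = merged.setdefault(key, [])
                    # Collect every value seen under the same key.
                    if isinstance(child, list):
                        bucket.extend(child)
                    else:
                        bucket.append(child)
            return {
                key: _unwrap(move_arrays_to_leaves(children))
                for key, children in merged.items()
            }
        return items
    return value


print(move_arrays_to_leaves({"foo": [{"bar": 1}, {"bar": 2}]}))
# {'foo': {'bar': [1, 2]}}
print(move_arrays_to_leaves({"foo": [{"bar": 1}, {"baz": 2}]}))
# {'foo': {'bar': 1, 'baz': 2}}
```

In the second call each key appears only once, so no array survives, which matches the "vanishing" array in the example above.
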
[[synthetic-source-modifications-field-names]]
====== Fields named as they are mapped
Synthetic source names fields as they are named in the mapping. When used
with <<dynamic,dynamic mapping>>, fields with dots (`.`) in their names are, by
default, interpreted as multiple objects, while dots in field names are
preserved within objects that have <<subobjects>> disabled. For example:

[source,console,id=synthetic-source-objecty-example]
----
PUT idx/_doc/1
{
  "foo.bar.baz": 1
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:

[source,console-result]
----
{
  "foo": {
    "bar": {
      "baz": 1
    }
  }
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]

This impacts how source contents can be referenced in <<modules-scripting-using,scripts>>. For instance, referencing
a field in a script by its original dotted name returns null:

[source,js]
----
"script": { "source": """  emit(params._source['foo.bar.baz'])  """ }
----
// NOTCONSOLE

Instead, source references need to be in line with the mapping structure:

[source,js]
----
"script": { "source": """  emit(params._source['foo']['bar']['baz'])  """ }
----
// NOTCONSOLE

or simply

[source,js]
----
"script": { "source": """  emit(params._source.foo.bar.baz)  """ }
----
// NOTCONSOLE

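The lookup behaviour can be reproduced outside Elasticsearch with a short Python sketch. It assumes the default case, where dotted names are expanded into nested objects; a plain Python dictionary stands in for `params._source`, and `expand_dots` is a hypothetical helper.

```python
def expand_dots(flat_doc):
    """Turn {"foo.bar.baz": 1} into {"foo": {"bar": {"baz": 1}}}."""
    nested = {}
    for dotted_name, value in flat_doc.items():
        node = nested
        *parents, leaf = dotted_name.split(".")
        for part in parents:
            # Create intermediate objects on the way down.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


source = expand_dots({"foo.bar.baz": 1})
print(source.get("foo.bar.baz"))    # None: no key with the dotted name exists
print(source["foo"]["bar"]["baz"])  # 1: follows the mapping structure
```
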
The following <<modules-scripting-fields, field APIs>> are preferable as, in addition to being agnostic to the
mapping structure, they make use of doc values if available and fall back to synthetic source only when needed. This
reduces source synthesizing, a slow and costly operation.

[source,js]
----
"script": { "source": """  emit(field('foo.bar.baz').get(null))   """ }
"script": { "source": """  emit($('foo.bar.baz', null))   """ }
----
// NOTCONSOLE

[[synthetic-source-modifications-alphabetical]]
====== Alphabetical sorting
Synthetic `_source` fields are sorted alphabetically. The
https://www.rfc-editor.org/rfc/rfc7159.html[JSON RFC] defines objects as
"an unordered collection of zero or more name/value pairs" so applications
shouldn't care, but without synthetic `_source` the original ordering is
preserved and some applications may, counter to the spec, do something with
that ordering.

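A small Python sketch shows why spec-compliant consumers are unaffected. Here `json.dumps(..., sort_keys=True)` merely stands in for the alphabetical ordering; it is an assumption for illustration, not a claim about how Elasticsearch orders keys internally.

```python
import json

original = '{"b": 1, "a": {"d": 2, "c": 3}}'

# Re-serialize with sorted keys to imitate an alphabetically ordered _source.
synthetic = json.dumps(json.loads(original), sort_keys=True)
print(synthetic)  # {"a": {"c": 3, "d": 2}, "b": 1}

# Comparing parsed objects ignores key order, so this check passes.
same_when_parsed = json.loads(original) == json.loads(synthetic)

# Comparing raw strings depends on key order, so this check fails.
same_as_text = original == synthetic

print(same_when_parsed, same_as_text)  # True False
```

Applications that compare or hash the raw JSON text are the ones that may notice the reordering.
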
[[synthetic-source-modifications-ranges]]
====== Representation of ranges
Range field values (e.g. `long_range`) are always represented as inclusive on both sides with bounds adjusted
accordingly. See <<range-synthetic-source-inclusive, examples>>.

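For integer-valued ranges, the bound adjustment can be sketched in Python as follows. The helper `to_inclusive` is hypothetical and only covers whole-number bounds; other range types (dates, IP addresses, floating point) and open-ended ranges are not modelled here.

```python
def to_inclusive(bounds):
    """Rewrite integer range bounds (gt/gte/lt/lte) as inclusive gte/lte."""
    result = {}
    if "gte" in bounds:
        result["gte"] = bounds["gte"]
    elif "gt" in bounds:
        # An exclusive lower bound becomes the next integer up.
        result["gte"] = bounds["gt"] + 1
    if "lte" in bounds:
        result["lte"] = bounds["lte"]
    elif "lt" in bounds:
        # An exclusive upper bound becomes the next integer down.
        result["lte"] = bounds["lt"] - 1
    return result


print(to_inclusive({"gt": 200, "lt": 300}))  # {'gte': 201, 'lte': 299}
```
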
[[synthetic-source-precision-loss-for-point-types]]
====== Reduced precision of `geo_point` values
Values of `geo_point` fields are represented in synthetic `_source` with reduced precision. See
<<geo-point-synthetic-source, examples>>.

[[synthetic-source-keep]]
====== Minimizing source modifications

It is possible to avoid synthetic source modifications for a particular object or field, at extra storage cost.
This is controlled through the `synthetic_source_keep` parameter, which accepts the following options:

- `none`: synthetic source diverges from the original source as described above (default).
- `arrays`: arrays of the corresponding field or object preserve the original element ordering and duplicate elements.
The synthetic source fragment for such arrays is not guaranteed to match the original source exactly, e.g. array
`[1, 2, [5], [[4, [3]]], 5]` may appear as-is or in an equivalent format like `[1, 2, 5, 4, 3, 5]`. The exact format
may change in the future, in an effort to reduce the storage overhead of this option.
- `all`: the source for both singleton instances and arrays of the corresponding field or object gets recorded. When
applied to objects, the source of all sub-objects and sub-fields gets captured. Furthermore, the original source of
arrays gets captured and appears in synthetic source with no modifications.

For instance:

[source,console,id=create-index-with-synthetic-source-keep]
----
PUT idx_keep
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "path": {
        "type": "object",
        "synthetic_source_keep": "all"
      },
      "ids": {
        "type": "integer",
        "synthetic_source_keep": "arrays"
      }
    }
  }
}
----
// TEST

[source,console,id=synthetic-source-keep-example]
----
PUT idx_keep/_doc/1
{
  "path": {
    "to": [
      { "foo": [3, 2, 1] },
      { "foo": [30, 20, 10] }
    ],
    "bar": "baz"
  },
  "ids": [ 200, 100, 300, 100 ]
}
----
// TEST[s/$/\nGET idx_keep\/_doc\/1?filter_path=_source\n/]

returns the original source, with no array deduplication and sorting:

[source,console-result]
----
{
  "path": {
    "to": [
      { "foo": [3, 2, 1] },
      { "foo": [30, 20, 10] }
    ],
    "bar": "baz"
  },
  "ids": [ 200, 100, 300, 100 ]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]

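Because the `arrays` option may return nested arrays in an equivalent flattened form, a client that compares retrieved arrays with the originals may want to normalize both sides first. The following Python sketch, with a hypothetical `flatten` helper, shows one way to do that while keeping element order and duplicates.

```python
def flatten(values):
    """Flatten arbitrarily nested lists, preserving order and duplicates."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(flatten(value))
        else:
            flat.append(value)
    return flat


original = [1, 2, [5], [[4, [3]]], 5]
print(flatten(original))  # [1, 2, 5, 4, 3, 5]

# Both representations compare equal once normalized.
print(flatten(original) == flatten([1, 2, 5, 4, 3, 5]))  # True
```
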
The option for capturing the source of arrays can be applied at index level, by setting
`index.mapping.synthetic_source_keep` to `arrays`. This applies to all objects and fields in the index, except for
the ones with explicit overrides of `synthetic_source_keep` set to `none`. In this case, the storage overhead grows
with the number and size of the arrays present in the source of each document.

[[synthetic-source-fields-native-list]]
===== Field types that support synthetic source with no storage overhead
The following field types support synthetic source using data from <<doc-values,`doc_values`>> or
<<stored-fields, stored fields>>, and require no additional storage space to construct the `_source` field.

NOTE: If you enable the <<ignore-malformed,`ignore_malformed`>> or <<ignore-above,`ignore_above`>> settings, then
additional storage is required to store ignored field values for these types.

** <<aggregate-metric-double-synthetic-source, `aggregate_metric_double`>>
** {plugins}/mapper-annotated-text-usage.html#annotated-text-synthetic-source[`annotated-text`]
** <<binary-synthetic-source,`binary`>>
** <<boolean-synthetic-source,`boolean`>>
** <<numeric-synthetic-source,`byte`>>
** <<date-synthetic-source,`date`>>
** <<date-nanos-synthetic-source,`date_nanos`>>
** <<dense-vector-synthetic-source,`dense_vector`>>
** <<numeric-synthetic-source,`double`>>
** <<flattened-synthetic-source, `flattened`>>
** <<numeric-synthetic-source,`float`>>
** <<geo-point-synthetic-source,`geo_point`>>
** <<numeric-synthetic-source,`half_float`>>
** <<histogram-synthetic-source,`histogram`>>
** <<numeric-synthetic-source,`integer`>>
** <<ip-synthetic-source,`ip`>>
** <<keyword-synthetic-source,`keyword`>>
** <<numeric-synthetic-source,`long`>>
** <<range-synthetic-source,`range` types>>
** <<numeric-synthetic-source,`scaled_float`>>
** <<numeric-synthetic-source,`short`>>
** <<text-synthetic-source,`text`>>
** <<version-synthetic-source,`version`>>
** <<wildcard-synthetic-source,`wildcard`>>