elasticsearch/docs/reference/mapping/fields/synthetic-source.asciidoc
Oleksandr Kolomiiets e1d902d33b
Implement synthetic source support for annotated text field (#107735)
This PR adds synthetic source support for annotated_text fields. Existing implementation for text is reused including test infrastructure so the majority of the change is moving and making things accessible.

Contributes to #106460, #78744.
2024-04-25 10:31:27 -07:00

178 lines
4.8 KiB
Text

[[synthetic-source]]
==== Synthetic `_source`
IMPORTANT: Synthetic `_source` is Generally Available only for TSDB indices
(indices that have `index.mode` set to `time_series`). For other indices
synthetic `_source` is in technical preview. Features in technical preview may
be changed or removed in a future release. Elastic will work to fix
any issues, but features in technical preview are not subject to the support SLA
of official GA features.
Though very handy to have around, the source field takes up a significant amount
of space on disk. Instead of storing source documents on disk exactly as you
send them, Elasticsearch can reconstruct source content on the fly upon retrieval.
Enable this by setting `mode: synthetic` in `_source`:
[source,console,id=enable-synthetic-source-example]
----
PUT idx
{
"mappings": {
"_source": {
"mode": "synthetic"
}
}
}
----
// TESTSETUP
While this on the fly reconstruction is *generally* slower than saving the source
documents verbatim and loading them at query time, it saves a lot of storage
space.
[[synthetic-source-restrictions]]
===== Synthetic `_source` restrictions
There are a couple of restrictions to be aware of:
* When you retrieve synthetic `_source` content it undergoes minor
<<synthetic-source-modifications,modifications>> compared to the original JSON.
* Synthetic `_source` can be used with indices that contain only these field
types:
** <<aggregate-metric-double-synthetic-source, `aggregate_metric_double`>>
** {plugins}/mapper-annotated-text-usage.html#annotated-text-synthetic-source[`annotated-text`]
** <<binary-synthetic-source,`binary`>>
** <<boolean-synthetic-source,`boolean`>>
** <<numeric-synthetic-source,`byte`>>
** <<date-synthetic-source,`date`>>
** <<date-nanos-synthetic-source,`date_nanos`>>
** <<dense-vector-synthetic-source,`dense_vector`>>
** <<numeric-synthetic-source,`double`>>
** <<flattened-synthetic-source, `flattened`>>
** <<numeric-synthetic-source,`float`>>
** <<geo-point-synthetic-source,`geo_point`>>
** <<numeric-synthetic-source,`half_float`>>
** <<histogram-synthetic-source,`histogram`>>
** <<numeric-synthetic-source,`integer`>>
** <<ip-synthetic-source,`ip`>>
** <<keyword-synthetic-source,`keyword`>>
** <<numeric-synthetic-source,`long`>>
** <<range-synthetic-source,`range` types>>
** <<numeric-synthetic-source,`scaled_float`>>
** <<numeric-synthetic-source,`short`>>
** <<text-synthetic-source,`text`>>
** <<version-synthetic-source,`version`>>
** <<wildcard-synthetic-source,`wildcard`>>
[[synthetic-source-modifications]]
===== Synthetic `_source` modifications
When synthetic `_source` is enabled, retrieved documents undergo some
modifications compared to the original JSON.
[[synthetic-source-modifications-leaf-arrays]]
====== Arrays moved to leaf fields
Synthetic `_source` arrays are moved to leaves. For example:
[source,console,id=synthetic-source-leaf-arrays-example]
----
PUT idx/_doc/1
{
"foo": [
{
"bar": 1
},
{
"bar": 2
}
]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
Will become:
[source,console-result]
----
{
"foo": {
"bar": [1, 2]
}
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
This can cause some arrays to vanish:
[source,console,id=synthetic-source-leaf-arrays-example-sneaky]
----
PUT idx/_doc/1
{
"foo": [
{
"bar": 1
},
{
"baz": 2
}
]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
Will become:
[source,console-result]
----
{
"foo": {
"bar": 1,
"baz": 2
}
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
[[synthetic-source-modifications-field-names]]
====== Fields named as they are mapped
Synthetic source names fields as they are named in the mapping. When used
with <<dynamic,dynamic mapping>>, fields with dots (`.`) in their names are, by
default, interpreted as multiple objects, while dots in field names are
preserved within objects that have <<subobjects>> disabled. For example:
[source,console,id=synthetic-source-objecty-example]
----
PUT idx/_doc/1
{
"foo.bar.baz": 1
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
Will become:
[source,console-result]
----
{
"foo": {
"bar": {
"baz": 1
}
}
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
[[synthetic-source-modifications-alphabetical]]
====== Alphabetical sorting
Synthetic `_source` fields are sorted alphabetically. The
https://www.rfc-editor.org/rfc/rfc7159.html[JSON RFC] defines objects as
"an unordered collection of zero or more name/value pairs" so applications
shouldn't care but without synthetic `_source` the original ordering is
preserved and some applications may, counter to the spec, do something with
that ordering.
[[synthetic-source-modifications-ranges]]
====== Representation of ranges
Range field vales (e.g. `long_range`) are always represented as inclusive on both sides with bounds adjusted accordingly. See <<range-synthetic-source-inclusive, examples>>.