Implement synthetic source support for annotated text field (#107735)

This PR adds synthetic source support for `annotated_text` fields. The existing implementation for `text` is reused, including its test infrastructure, so most of the change consists of moving code and making it accessible. Contributes to #106460, #78744.
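
As a rough illustration of the behaviour documented in this change, the sketch below models how a list of values is expected to come back under synthetic `_source`. It is a toy model and not the Elasticsearch implementation: the function name `synthesize_text_values` and its `stored` flag are hypothetical, and plain Python string sorting stands in for `keyword` ordering (the two agree for the ASCII values used here).

```python
def synthesize_text_values(values, stored):
    """Toy model of synthetic _source output for an annotated_text field.

    stored=True  -> the field sets "store": true, so order and duplicates
                    are preserved.
    stored=False -> values are rebuilt from a keyword sub-field, which by
                    default yields them sorted with duplicates removed.
    """
    if stored:
        return list(values)
    # Assumption: Python's default string ordering approximates keyword
    # ordering well enough for plain ASCII input.
    return sorted(set(values))


docs = [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog",
]

print(synthesize_text_values(docs, stored=False))
print(synthesize_text_values(docs, stored=True))
```

The first call mirrors the `keyword` sub-field example in the documentation diff below (sorted, de-duplicated), and the second mirrors the `store: true` example (input returned unchanged).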
2025-06-28 17:34:17 -04:00 · 2024-04-25 10:31:27 -07:00
commit e1d902d33b
parent 4ef8b3825e
16 changed files with 824 additions and 300 deletions
--- a/docs/plugins/mapper-annotated-text.asciidoc
+++ b/docs/plugins/mapper-annotated-text.asciidoc
@@ -6,7 +6,7 @@ experimental[]
 The mapper-annotated-text plugin provides the ability to index text that is a
 combination of free-text and special markup that is typically used to identify
 items of interest such as people or organisations (see NER or Named Entity Recognition
-tools). 
+tools).


 The elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token
@@ -18,7 +18,7 @@ include::install_remove.asciidoc[]
 [[mapper-annotated-text-usage]]
 ==== Using the `annotated-text` field

-The `annotated-text` tokenizes text content as per the more common {ref}/text.html[`text`] field (see 
+The `annotated-text` tokenizes text content as per the more common {ref}/text.html[`text`] field (see
 "limitations" below) but also injects any marked-up annotation tokens directly into
 the search index:

@@ -49,7 +49,7 @@ in the search index:
 --------------------------
 GET my-index-000001/_analyze
 {
-  "field": "my_field", 
+  "field": "my_field",
  "text":"Investors in [Apple](Apple+Inc.) rejoiced."
 }
 --------------------------
@@ -76,7 +76,7 @@ Response:
      "position": 1
    },
    {
-      "token": "Apple Inc.", <1> 
+      "token": "Apple Inc.", <1>
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
@@ -106,7 +106,7 @@ the token stream and at the same position (position 2) as the text token (`apple


 We can now perform searches for annotations using regular `term` queries that don't tokenize
-the provided search values. Annotations are a more precise way of matching as can be seen 
+the provided search values. Annotations are a more precise way of matching as can be seen
 in this example where a search for `Beck` will not match `Jeff Beck` :

 [source,console]
@@ -133,18 +133,119 @@ GET my-index-000001/_search
 }
 --------------------------

-<1> As well as tokenising the plain text into single words e.g. `beck`, here we 
+<1> As well as tokenising the plain text into single words e.g. `beck`, here we
 inject the single token value `Beck` at the same position as `beck` in the token stream.
 <2> Note annotations can inject multiple tokens at the same position - here we inject both
 the very specific value `Jeff Beck` and the broader term `Guitarist`. This enables
 broader positional queries e.g. finding mentions of a `Guitarist` near to `strat`.
-<3> A benefit of searching with these carefully defined annotation tokens is that a query for 
+<3> A benefit of searching with these carefully defined annotation tokens is that a query for
 `Beck` will not match document 2 that contains the tokens `jeff`, `beck` and `Jeff Beck`

-WARNING: Any use of `=` signs in annotation values eg `[Prince](person=Prince)` will 
+WARNING: Any use of `=` signs in annotation values eg `[Prince](person=Prince)` will
 cause the document to be rejected with a parse failure. In future we hope to have a use for
 the equals signs, so we will actively reject documents that contain them today.

+[[annotated-text-synthetic-source]]
+===== Synthetic `_source`
+
+IMPORTANT: Synthetic `_source` is Generally Available only for TSDB indices
+(indices that have `index.mode` set to `time_series`). For other indices
+synthetic `_source` is in technical preview. Features in technical preview may
+be changed or removed in a future release. Elastic will work to fix
+any issues, but features in technical preview are not subject to the support SLA
+of official GA features.
+
+`annotated_text` fields support {ref}/mapping-source-field.html#synthetic-source[synthetic `_source`] if they have
+a {ref}/keyword.html#keyword-synthetic-source[`keyword`] sub-field that supports synthetic
+`_source` or if the `annotated_text` field sets `store` to `true`. Either way, the field may
+not have {ref}/copy-to.html[`copy_to`].
+
+If a `keyword` sub-field is used, the values are sorted in the same way as
+a `keyword` field's values are sorted. By default, that means sorted with
+duplicates removed. So:
+[source,console,id=synthetic-source-text-example-default]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "mode": "synthetic" },
+    "properties": {
+      "text": {
+        "type": "annotated_text",
+        "fields": {
+          "raw": {
+            "type": "keyword"
+          }
+        }
+      }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "text": [
+    "the quick brown fox",
+    "the quick brown fox",
+    "jumped over the lazy dog"
+  ]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+[source,console-result]
+----
+{
+  "text": [
+    "jumped over the lazy dog",
+    "the quick brown fox"
+  ]
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+
+NOTE: Reordering text fields can have an effect on {ref}/query-dsl-match-query-phrase.html[phrase]
+and {ref}/span-queries.html[span] queries. See the discussion about {ref}/position-increment-gap.html[`position_increment_gap`] for more detail. You
+can avoid this by making sure the `slop` parameter on the phrase queries
+is lower than the `position_increment_gap`, which is the case by default.
+
+If the `annotated_text` field sets `store` to `true`, then order and duplicates
+are preserved.
+[source,console,id=synthetic-source-text-example-stored]
+----
+PUT idx
+{
+  "mappings": {
+    "_source": { "mode": "synthetic" },
+    "properties": {
+      "text": { "type": "annotated_text", "store": true }
+    }
+  }
+}
+PUT idx/_doc/1
+{
+  "text": [
+    "the quick brown fox",
+    "the quick brown fox",
+    "jumped over the lazy dog"
+  ]
+}
+----
+// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]
+
+Will become:
+[source,console-result]
+----
+{
+  "text": [
+    "the quick brown fox",
+    "the quick brown fox",
+    "jumped over the lazy dog"
+  ]
+}
+----
+// TEST[s/^/{"_source":/ s/\n$/}/]
+

 [[mapper-annotated-text-tips]]
 ==== Data modelling tips
@@ -153,13 +254,13 @@ the equals signs so wil actively reject documents that contain this today.
 Annotations are normally a way of weaving structured information into unstructured text for
 higher-precision search.

-`Entity resolution` is a form of document enrichment undertaken by specialist software or people 
+`Entity resolution` is a form of document enrichment undertaken by specialist software or people
 where references to entities in a document are disambiguated by attaching a canonical ID.
 The ID is used to resolve any number of aliases or distinguish between people with the
-same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved 
-entity IDs woven into text. 
+same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved
+entity IDs woven into text.

-These IDs can be embedded as annotations in an annotated_text field but it often makes 
+These IDs can be embedded as annotations in an annotated_text field but it often makes
 sense to include them in dedicated structured fields to support discovery via aggregations:

 [source,console]
@@ -214,20 +315,20 @@ GET my-index-000001/_search
 --------------------------

 <1> Note the `my_twitter_handles` contains a list of the annotation values
-also used in the unstructured text. (Note the annotated_text syntax requires escaping). 
-By repeating the annotation values in a structured field this application has ensured that 
-the tokens discovered in the structured field can be used for search and highlighting 
-in the unstructured field.  
+also used in the unstructured text. (Note the annotated_text syntax requires escaping).
+By repeating the annotation values in a structured field this application has ensured that
+the tokens discovered in the structured field can be used for search and highlighting
+in the unstructured field.
 <2> In this example we search for documents that talk about components of the elastic stack
 <3> We use the `my_twitter_handles` field here to discover people who are significantly
 associated with the elastic stack.

 ===== Avoiding over-matching annotations
-By design, the regular text tokens and the annotation tokens co-exist in the same indexed 
+By design, the regular text tokens and the annotation tokens co-exist in the same indexed
 field but in rare cases this can lead to some over-matching.

 The value of an annotation often denotes a _named entity_ (a person, place or company).
-The tokens for these named entities are inserted untokenized, and differ from typical text 
+The tokens for these named entities are inserted untokenized, and differ from typical text
 tokens because they are normally:

 * Mixed case e.g. `Madonna`
@@ -235,19 +336,19 @@ tokens because they are normally:
 * Can have punctuation or numbers e.g. `Apple Inc.` or `@kimchy`

 This means, for the most part, a search for a named entity in the annotated text field will
-not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result 
-you can drill down to highlight uses in the text without "over matching" on any text tokens 
+not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result
+you can drill down to highlight uses in the text without "over matching" on any text tokens
 like the word `apple` in this context:

    the apple was very juicy
-    
-However, a problem arises if your named entity happens to be a single term and lower-case e.g. the 
+
+However, a problem arises if your named entity happens to be a single term and lower-case e.g. the
 company `elastic`. In this case, a search on the annotated text field for the token `elastic`
 may match a text document such as this:

    they fired an elastic band

-To avoid such false matches users should consider prefixing annotation values to ensure 
+To avoid such false matches users should consider prefixing annotation values to ensure
 they don't name clash with text tokens e.g.

    [elastic](Company_elastic) released version 7.0 of the elastic stack today
@@ -273,7 +374,7 @@ GET my-index-000001/_search
 {
  "query": {
    "query_string": {
-        "query": "cats" 
+        "query": "cats"
    }
  },
  "highlight": {
@@ -291,21 +392,21 @@ GET my-index-000001/_search

 The annotated highlighter is based on the `unified` highlighter and supports the same
 settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using
-html-like markup such as `<em>cat</em>` the annotated highlighter uses the same 
+html-like markup such as `<em>cat</em>` the annotated highlighter uses the same
 markdown-like syntax used for annotations and injects a key=value annotation where `_hit_term`
-is the key and the matched search term is the value e.g. 
+is the key and the matched search term is the value e.g.

    The [cat](_hit_term=cat) sat on the [mat](sku3578)

-The annotated highlighter tries to be respectful of any existing markup in the original 
+The annotated highlighter tries to be respectful of any existing markup in the original
 text:

-* If the search term matches exactly the location of an existing annotation then the 
+* If the search term matches exactly the location of an existing annotation then the
 `_hit_term` key is merged into the url-like syntax used in the `(...)` part of the
-existing annotation. 
+existing annotation.
 * However, if the search term overlaps the span of an existing annotation it would break
 the markup formatting so the original annotation is removed in favour of a new annotation
-with just the search hit information in the results. 
+with just the search hit information in the results.
 * Any non-overlapping annotations in the original text are preserved in highlighter
 selections