mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-06-28 17:34:17 -04:00
Implement synthetic source support for annotated text field (#107735)
This PR adds synthetic source support for annotated_text fields. Existing implementation for text is reused including test infrastructure so the majority of the change is moving and making things accessible. Contributes to #106460, #78744.
This commit is contained in:
parent
4ef8b3825e
commit
e1d902d33b
16 changed files with 824 additions and 300 deletions
@ -6,7 +6,7 @@ experimental[]
The mapper-annotated-text plugin provides the ability to index text that is a
combination of free-text and special markup that is typically used to identify
items of interest such as people or organisations (see NER or Named Entity Recognition
tools).

The elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token
@ -18,7 +18,7 @@ include::install_remove.asciidoc[]
[[mapper-annotated-text-usage]]
==== Using the `annotated-text` field

The `annotated-text` field tokenizes text content as per the more common {ref}/text.html[`text`] field (see
"limitations" below) but also injects any marked-up annotation tokens directly into
the search index:

@ -49,7 +49,7 @@ in the search index:
--------------------------
GET my-index-000001/_analyze
{
  "field": "my_field",
  "text":"Investors in [Apple](Apple+Inc.) rejoiced."
}
--------------------------
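
To make the markup concrete, here is a minimal Python sketch of how an annotated string could be split into plain text plus annotation spans. It is not the plugin's implementation: the function name `parse_annotated`, the regular expression, and the use of `&` to separate several values inside one annotation are assumptions made for illustration. The `+` to space decoding and the offsets mirror the `_analyze` example above.

```python
import re
from urllib.parse import unquote_plus

# Simplified pattern for [anchor text](value1&value2); an assumption, not the plugin's parser.
ANNOTATION = re.compile(r"\[([^\[\]]*)\]\(([^()]*)\)")


def parse_annotated(text):
    """Return (plain_text, annotations), where annotations is a list of
    (start_offset, end_offset, [values]) relative to the plain text."""
    plain_parts = []
    annotations = []
    pos = 0       # read position in the marked-up input
    out_len = 0   # length of the plain text produced so far
    for match in ANNOTATION.finditer(text):
        before = text[pos:match.start()]
        plain_parts.append(before)
        out_len += len(before)
        anchor, raw_values = match.group(1), match.group(2)
        if "=" in raw_values:
            # Mirrors the documented behaviour: `=` in a value is rejected.
            raise ValueError("'=' is not allowed in annotation values")
        # Values are URL-encoded, so `Apple+Inc.` decodes to `Apple Inc.`.
        values = [unquote_plus(v) for v in raw_values.split("&") if v]
        annotations.append((out_len, out_len + len(anchor), values))
        plain_parts.append(anchor)
        out_len += len(anchor)
        pos = match.end()
    plain_parts.append(text[pos:])
    return "".join(plain_parts), annotations


plain, spans = parse_annotated("Investors in [Apple](Apple+Inc.) rejoiced.")
print(plain)   # Investors in Apple rejoiced.
print(spans)   # [(13, 18, ['Apple Inc.'])]
```

The span `(13, 18)` matches the `start_offset` and `end_offset` reported for the `Apple Inc.` annotation token in the response.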
@ -76,7 +76,7 @@ Response:
      "position": 1
    },
    {
      "token": "Apple Inc.", <1>
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
@ -106,7 +106,7 @@ the token stream and at the same position (position 2) as the text token (`apple

We can now perform searches for annotations using regular `term` queries that don't tokenize
the provided search values. Annotations are a more precise way of matching as can be seen
in this example where a search for `Beck` will not match `Jeff Beck`:

[source,console]
@ -133,18 +133,119 @@ GET my-index-000001/_search
}
--------------------------

<1> As well as tokenising the plain text into single words e.g. `beck`, here we
inject the single token value `Beck` at the same position as `beck` in the token stream.
<2> Note annotations can inject multiple tokens at the same position - here we inject both
the very specific value `Jeff Beck` and the broader term `Guitarist`. This enables
broader positional queries e.g. finding mentions of a `Guitarist` near to `strat`.
<3> A benefit of searching with these carefully defined annotation tokens is that a query for
`Beck` will not match document 2 that contains the tokens `jeff`, `beck` and `Jeff Beck`.

WARNING: Any use of `=` signs in annotation values e.g. `[Prince](person=Prince)` will
cause the document to be rejected with a parse failure. In future we hope to have a use for
the equals signs so will actively reject documents that contain this today.

[[annotated-text-synthetic-source]]
===== Synthetic `_source`

IMPORTANT: Synthetic `_source` is Generally Available only for TSDB indices
(indices that have `index.mode` set to `time_series`). For other indices
synthetic `_source` is in technical preview. Features in technical preview may
be changed or removed in a future release. Elastic will work to fix
any issues, but features in technical preview are not subject to the support SLA
of official GA features.

`annotated_text` fields support {ref}/mapping-source-field.html#synthetic-source[synthetic `_source`] if they have
a {ref}/keyword.html#keyword-synthetic-source[`keyword`] sub-field that supports synthetic
`_source` or if the `annotated_text` field sets `store` to `true`. Either way, the field may
not have {ref}/copy-to.html[`copy_to`].

If using a sub-`keyword` field then the values are sorted in the same way as
a `keyword` field's values are sorted. By default, that means sorted with
duplicates removed. So:
[source,console,id=synthetic-source-text-example-default]
----
PUT idx
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "text": {
        "type": "annotated_text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:
[source,console-result]
----
{
  "text": [
    "jumped over the lazy dog",
    "the quick brown fox"
  ]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]

NOTE: Reordering text fields can have an effect on {ref}/query-dsl-match-query-phrase.html[phrase]
and {ref}/span-queries.html[span] queries. See the discussion about {ref}/position-increment-gap.html[`position_increment_gap`] for more detail. You
can avoid this by making sure the `slop` parameter on the phrase queries
is lower than the `position_increment_gap`. This is the default.

If the `annotated_text` field sets `store` to true then order and duplicates
are preserved.
[source,console,id=synthetic-source-text-example-stored]
----
PUT idx
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "text": { "type": "annotated_text", "store": true }
    }
  }
}
PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
----
// TEST[s/$/\nGET idx\/_doc\/1?filter_path=_source\n/]

Will become:
[source,console-result]
----
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
----
// TEST[s/^/{"_source":/ s/\n$/}/]
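
The two behaviours described in this section can be summarised with a small Python model. This is only an illustration of the documented outcome, not Elasticsearch code: the function name `synthetic_text_values` and its `store` flag are hypothetical, and Python's default string ordering stands in for the ordering of a `keyword` field with default settings.

```python
def synthetic_text_values(values, store=False):
    """Approximate the array returned in synthetic _source for one field."""
    if store:
        # `store: true` on the annotated_text field: order and duplicates are kept.
        return list(values)
    # Loaded from a keyword sub-field with default settings:
    # sorted, with duplicates removed.
    return sorted(set(values))


indexed = [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog",
]
print(synthetic_text_values(indexed))
# ['jumped over the lazy dog', 'the quick brown fox']
print(synthetic_text_values(indexed, store=True) == indexed)
# True
```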


[[mapper-annotated-text-tips]]
==== Data modelling tips

@ -153,13 +254,13 @@ the equals signs so wil actively reject documents that contain this today.
Annotations are normally a way of weaving structured information into unstructured text for
higher-precision search.

`Entity resolution` is a form of document enrichment undertaken by specialist software or people
where references to entities in a document are disambiguated by attaching a canonical ID.
The ID is used to resolve any number of aliases or distinguish between people with the
same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved
entity IDs woven into text.

These IDs can be embedded as annotations in an annotated_text field but it often makes
sense to include them in dedicated structured fields to support discovery via aggregations:

[source,console]
@ -214,20 +315,20 @@ GET my-index-000001/_search
--------------------------

<1> Note the `my_twitter_handles` field contains a list of the annotation values
also used in the unstructured text. (Note the annotated_text syntax requires escaping).
By repeating the annotation values in a structured field this application has ensured that
the tokens discovered in the structured field can be used for search and highlighting
in the unstructured field.
<2> In this example we search for documents that talk about components of the elastic stack.
<3> We use the `my_twitter_handles` field here to discover people who are significantly
associated with the elastic stack.
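
As a rough sketch of the indexing side, an application could build both fields from the same list of resolved entities, so the structured values always match the annotation tokens. The helper `build_document`, the span format, and the text field name `my_unstructured_text_field` are assumptions for illustration; only `my_twitter_handles` appears in the callouts above. `quote_plus` from the Python standard library handles the escaping that the annotation syntax requires.

```python
from urllib.parse import quote_plus


def build_document(text, mentions):
    """Build a document body from plain text and resolved mentions.

    mentions: non-overlapping (start, end, handle) spans into `text`.
    """
    parts = []
    handles = []
    pos = 0
    for start, end, handle in sorted(mentions):
        parts.append(text[pos:start])
        # The annotation value is URL-encoded, e.g. "@kimchy" becomes "%40kimchy".
        parts.append(f"[{text[start:end]}]({quote_plus(handle)})")
        if handle not in handles:
            handles.append(handle)  # the structured field keeps the raw value
        pos = end
    parts.append(text[pos:])
    return {
        "my_unstructured_text_field": "".join(parts),  # hypothetical field name
        "my_twitter_handles": handles,
    }


doc = build_document("Shay created elasticsearch", [(0, 4, "@kimchy")])
print(doc["my_unstructured_text_field"])  # [Shay](%40kimchy) created elasticsearch
print(doc["my_twitter_handles"])          # ['@kimchy']
```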

===== Avoiding over-matching annotations
By design, the regular text tokens and the annotation tokens co-exist in the same indexed
field but in rare cases this can lead to some over-matching.

The value of an annotation often denotes a _named entity_ (a person, place or company).
The tokens for these named entities are inserted untokenized, and differ from typical text
tokens because they are normally:

* Mixed case e.g. `Madonna`
@ -235,19 +336,19 @@ tokens because they are normally:
* Can have punctuation or numbers e.g. `Apple Inc.` or `@kimchy`

This means, for the most part, a search for a named entity in the annotated text field will
not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result
you can drill down to highlight uses in the text without "over matching" on any text tokens
like the word `apple` in this context:

    the apple was very juicy

However, a problem arises if your named entity happens to be a single term and lower-case e.g. the
company `elastic`. In this case, a search on the annotated text field for the token `elastic`
may match a text document such as this:

    they fired an elastic band

To avoid such false matches users should consider prefixing annotation values to ensure
they don't clash with text tokens e.g.

    [elastic](Company_elastic) released version 7.0 of the elastic stack today
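
One way to apply the prefixing convention consistently is to generate the markup through a single helper. The sketch below is an assumption about how an application might do this: the function `annotate` and the `<type>_<id>` layout are illustrative, with the layout modelled on the `Company_elastic` example above.

```python
from urllib.parse import quote_plus


def annotate(anchor_text, entity_type, entity_id):
    """Wrap `anchor_text` in annotation markup with a type-prefixed value."""
    # The prefix keeps the token distinct from ordinary lower-case text tokens.
    value = f"{entity_type}_{entity_id}"
    return f"[{anchor_text}]({quote_plus(value)})"


sentence = annotate("elastic", "Company", "elastic") + \
    " released version 7.0 of the elastic stack today"
print(sentence)
# [elastic](Company_elastic) released version 7.0 of the elastic stack today
```

A `term` query for `Company_elastic` would then target only the annotation token, not the plain text token `elastic`.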
@ -273,7 +374,7 @@ GET my-index-000001/_search
{
  "query": {
    "query_string": {
      "query": "cats"
    }
  },
  "highlight": {
@ -291,21 +392,21 @@ GET my-index-000001/_search

The annotated highlighter is based on the `unified` highlighter and supports the same
settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using
HTML-like markup such as `<em>cat</em>` the annotated highlighter uses the same
markdown-like syntax used for annotations and injects a key=value annotation where `_hit_term`
is the key and the matched search term is the value e.g.

    The [cat](_hit_term=cat) sat on the [mat](sku3578)

The annotated highlighter tries to be respectful of any existing markup in the original
text:

* If the search term matches exactly the location of an existing annotation then the
`_hit_term` key is merged into the URL-like syntax used in the `(...)` part of the
existing annotation.
* However, if the search term overlaps the span of an existing annotation it would break
the markup formatting so the original annotation is removed in favour of a new annotation
with just the search hit information in the results.
* Any non-overlapping annotations in the original text are preserved in highlighter
selections.
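
The three rules above can be modelled in a few lines of Python. This is a simplified illustration, not the highlighter's code: the function `highlight`, the span tuples, and in particular the merged form `value&_hit_term=term` are assumptions, since the text above does not show the exact merged syntax. Values are not URL-encoded here, and hits are assumed not to overlap each other.

```python
def highlight(text, annotations, hits):
    """Re-create annotated-highlighter style markup.

    annotations: (start, end, value) spans already present in the text.
    hits:        (start, end, term) spans matched by the search.
    """
    hit_terms = {(start, end): term for start, end, term in hits}
    spans = []
    merged = set()
    for start, end, value in annotations:
        if (start, end) in hit_terms:
            # Rule 1: exact match, merge _hit_term into the existing annotation
            # (the `&` separator is an assumption).
            spans.append((start, end, f"{value}&_hit_term={hit_terms[(start, end)]}"))
            merged.add((start, end))
        elif any(hs < end and start < he for hs, he in hit_terms):
            # Rule 2: partial overlap, the original annotation is dropped.
            continue
        else:
            # Rule 3: non-overlapping annotations are preserved.
            spans.append((start, end, value))
    for (start, end), term in hit_terms.items():
        if (start, end) not in merged:
            spans.append((start, end, f"_hit_term={term}"))

    out, pos = [], 0
    for start, end, value in sorted(spans):
        out.append(text[pos:start])
        out.append(f"[{text[start:end]}]({value})")
        pos = end
    out.append(text[pos:])
    return "".join(out)


text = "The cat sat on the mat"
print(highlight(text, [(19, 22, "sku3578")], [(4, 7, "cat")]))
# The [cat](_hit_term=cat) sat on the [mat](sku3578)
```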