mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 07:37:19 -04:00
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
217 lines
6.6 KiB
Text
217 lines
6.6 KiB
Text
[[query-dsl-text-expansion-query]]
|
|
== Text expansion query
|
|
++++
|
|
<titleabbrev>Text expansion</titleabbrev>
|
|
++++
|
|
|
|
The text expansion query uses a {nlp} model to convert the query text into a
|
|
list of token-weight pairs which are then used in a query against a
|
|
<<sparse-vector,sparse vector>> or <<rank-features,rank features>> field.
|
|
|
|
[discrete]
|
|
[[text-expansion-query-ex-request]]
|
|
=== Example request
|
|
|
|
|
|
[source,console]
|
|
----
|
|
GET _search
|
|
{
|
|
"query":{
|
|
"text_expansion":{
|
|
"<sparse_vector_field>":{
|
|
"model_id":"the model to produce the token weights",
|
|
"model_text":"the query string"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
----
|
|
// TEST[skip: TBD]
|
|
|
|
[discrete]
|
|
[[text-expansion-query-params]]
|
|
=== Top level parameters for `text_expansion`
|
|
|
|
`<sparse_vector_field>`:::
|
|
(Required, object)
|
|
The name of the field that contains the token-weight pairs the NLP model created
|
|
based on the input text.
|
|
|
|
[discrete]
|
|
[[text-expansion-rank-feature-field-params]]
|
|
=== Top level parameters for `<sparse_vector_field>`
|
|
|
|
`model_id`::::
|
|
(Required, string)
|
|
The ID of the model to use to convert the query text into token-weight pairs. It
|
|
must be the same model ID that was used to create the tokens from the input
|
|
text.
|
|
|
|
`model_text`::::
|
|
(Required, string)
|
|
The query text you want to use for search.
|
|
|
|
`pruning_config` ::::
|
|
(Optional, object)
|
|
preview:[]
|
|
Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance.
|
|
Default: Disabled.
|
|
+
|
|
--
|
|
Parameters for `<pruning_config>` are:
|
|
|
|
`tokens_freq_ratio_threshold`::
|
|
(Optional, float)
|
|
preview:[]
|
|
Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned.
|
|
This value must between 1 and 100.
|
|
Default: `5`.
|
|
|
|
`tokens_weight_threshold`::
|
|
(Optional, float)
|
|
preview:[]
|
|
Tokens whose weight is less than `tokens_weight_threshold` are considered nonsignificant and pruned.
|
|
This value must be between 0 and 1.
|
|
Default: `0.4`.
|
|
|
|
`only_score_pruned_tokens`::
|
|
(Optional, boolean)
|
|
preview:[]
|
|
If `true` we only input pruned tokens into scoring, and discard non-pruned tokens.
|
|
It is strongly recommended to set this to `false` for the main query, but this can be set to `true` for a rescore query to get more relevant results.
|
|
Default: `false`.
|
|
|
|
NOTE: The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen based on tests using ELSER that provided the most optimal results.
|
|
--
|
|
|
|
[discrete]
|
|
[[text-expansion-query-example]]
|
|
=== Example ELSER query
|
|
|
|
The following is an example of the `text_expansion` query that references the
|
|
ELSER model to perform semantic search. For a more detailed description of how
|
|
to perform semantic search by using ELSER and the `text_expansion` query, refer
|
|
to <<semantic-search-elser,this tutorial>>.
|
|
|
|
[source,console]
|
|
----
|
|
GET my-index/_search
|
|
{
|
|
"query":{
|
|
"text_expansion":{
|
|
"ml.tokens":{
|
|
"model_id":".elser_model_2",
|
|
"model_text":"How is the weather in Jamaica?"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
----
|
|
// TEST[skip: TBD]
|
|
|
|
[discrete]
|
|
[[text-expansion-query-with-pruning-config-example]]
|
|
=== Example ELSER query with pruning configuration
|
|
|
|
The following is an extension to the above example that adds a preview:[] pruning configuration to the `text_expansion` query.
|
|
The pruning configuration identifies non-significant tokens to prune from the query in order to improve query performance.
|
|
[source,console]
|
|
----
|
|
GET my-index/_search
|
|
{
|
|
"query":{
|
|
"text_expansion":{
|
|
"ml.tokens":{
|
|
"model_id":".elser_model_2",
|
|
"model_text":"How is the weather in Jamaica?"
|
|
},
|
|
"pruning_config": {
|
|
"tokens_freq_ratio_threshold": 5,
|
|
"tokens_weight_threshold": 0.4,
|
|
"only_score_pruned_tokens": false
|
|
}
|
|
}
|
|
}
|
|
}
|
|
----
|
|
// TEST[skip: TBD]
|
|
|
|
[discrete]
|
|
[[text-expansion-query-with-pruning-config-and-rescore-example]]
|
|
=== Example ELSER query with pruning configuration and rescore
|
|
|
|
The following is an extension to the above example that adds a <<rescore>> function on top of the preview:[] pruning configuration to the `text_expansion` query.
|
|
The pruning configuration identifies non-significant tokens to prune from the query in order to improve query performance.
|
|
Rescoring the query with the tokens that were originally pruned from the query may improve overall search relevance when using this pruning strategy.
|
|
|
|
[source,console]
|
|
----
|
|
GET my-index/_search
|
|
{
|
|
"query":{
|
|
"text_expansion":{
|
|
"ml.tokens":{
|
|
"model_id":".elser_model_2",
|
|
"model_text":"How is the weather in Jamaica?"
|
|
},
|
|
"pruning_config": {
|
|
"tokens_freq_ratio_threshold": 5,
|
|
"tokens_weight_threshold": 0.4,
|
|
"only_score_pruned_tokens": false
|
|
}
|
|
}
|
|
},
|
|
"rescore": {
|
|
"window_size": 100,
|
|
"query": {
|
|
"rescore_query": {
|
|
"text_expansion": {
|
|
"ml.tokens": {
|
|
"model_id": ".elser_model_2",
|
|
"model_text": "How is the weather in Jamaica?"
|
|
},
|
|
"pruning_config": {
|
|
"tokens_freq_ratio_threshold": 5,
|
|
"tokens_weight_threshold": 0.4,
|
|
"only_score_pruned_tokens": false
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
----
|
|
//TEST[skip: TBD]
|
|
|
|
[NOTE]
|
|
====
|
|
Depending on your data, the text expansion query may be faster with `track_total_hits: false`.
|
|
====
|
|
|
|
[discrete]
|
|
[[weighted-tokens-query-example]]
|
|
=== Example Weighted token query
|
|
|
|
In order to quickly iterate during tests, we exposed a new preview:[] `weighted_tokens` query for evaluation of tokenized datasets.
|
|
While this is not a query that is intended for production use, it can be used to quickly evaluate relevance using various pruning configurations.
|
|
|
|
[source,console]
|
|
----
|
|
POST /docs/_search
|
|
{
|
|
"query": {
|
|
"weighted_tokens": {
|
|
"query_expansion": {
|
|
"tokens": {"2161": 0.4679, "2621": 0.307, "2782": 0.1299, "2851": 0.1056, "3088": 0.3041, "3376": 0.1038, "3467": 0.4873, "3684": 0.8958, "4380": 0.334, "4542": 0.4636, "4633": 2.2805, "4785": 1.2628, "4860": 1.0655, "5133": 1.0709, "7139": 1.0016, "7224": 0.2486, "7387": 0.0985, "7394": 0.0542, "8915": 0.369, "9156": 2.8947, "10505": 0.2771, "11464": 0.3996, "13525": 0.0088, "14178": 0.8161, "16893": 0.1376, "17851": 1.5348, "19939": 0.6012},
|
|
"pruning_config": {
|
|
"tokens_freq_ratio_threshold": 5,
|
|
"tokens_weight_threshold": 0.4,
|
|
"only_score_pruned_tokens": false
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
----
|
|
//TEST[skip: TBD]
|