mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-06-28 17:34:17 -04:00
Update sparse_vector field mapping to include default setting for token pruning (#129089)
* Initial checkin of refactored index_options code
* [CI] Auto commit changes from spotless
* initial unit testing
* complete unit tests; add yaml tests
* [CI] Auto commit changes from spotless
* register test feature for sparse vector
* Update docs/changelog/129089.yaml
* update changelog
* add docs
* explicit set default index_options if null
* [CI] Auto commit changes from spotless
* update yaml tests; update docs
* fix yaml tests
* readd auth for teardown
* only serialize index options if not default
* [CI] Auto commit changes from spotless
* serialization refactor; pass index version around
* [CI] Auto commit changes from spotless
* fix transport versions merge
* fix up docs
* [CI] Auto commit changes from spotless
* fix docs; add include_defaults unit and yaml test
* [CI] Auto commit changes from spotless
* override getIndexReaderManager for SemanticQueryBuilderTests
* [CI] Auto commit changes from spotless
* cleanup mapper/builder/tests; index vers. in type still need to refactor / clean YAML tests
* [CI] Auto commit changes from spotless
* cleanups to mapper tests for clarity
* [CI] Auto commit changes from spotless
* move feature into mappers; fix yaml tests
* cleanups; add comments; remove redundant test
* [CI] Auto commit changes from spotless
* escape more periods in the YAML tests
* cleanup mapper and type tests
* [CI] Auto commit changes from spotless
* rename mapping for previous index test
* set explicit number of shards for yaml test

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Kathleen DeRusso <kathleen.derusso@elastic.co>
This commit is contained in:
parent
a324853d43
commit
a671505c8a
17 changed files with 2408 additions and 50 deletions
@ -24,6 +24,33 @@ PUT my-index
}
```

## Token pruning
```{applies_to}
stack: preview 9.1
```

Token pruning is turned on by default for newly created indices, using the default pruning configuration. You can control this behavior with the optional `index_options` parameter of the field:

```console
PUT my-index
{
  "mappings": {
    "properties": {
      "text.tokens": {
        "type": "sparse_vector",
        "index_options": {
          "prune": true,
          "pruning_config": {
            "tokens_freq_ratio_threshold": 5,
            "tokens_weight_threshold": 0.4
          }
        }
      }
    }
  }
}
```

See [semantic search with ELSER](docs-content://solutions/search/semantic-search/semantic-search-elser-ingest-pipelines.md) for a complete example on adding documents to a `sparse_vector` mapped field using ELSER.

## Parameters for `sparse_vector` fields [sparse-vectors-params]

@ -36,6 +63,38 @@ The following parameters are accepted by `sparse_vector` fields:
* Exclude the field from [_source](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering).
* Use [synthetic `_source`](/reference/elasticsearch/mapping-reference/mapping-source-field.md#synthetic-source).

index_options {applies_to}`stack: preview 9.1`
: (Optional, object) Index options for the `sparse_vector` field. They control whether tokens are pruned and how pruning is configured. If pruning options are not set in your [`sparse_vector` query](/reference/query-languages/query-dsl/query-dsl-sparse-vector-query.md), Elasticsearch uses the default options configured for the field, if any.

Parameters for `index_options` are:

`prune` {applies_to}`stack: preview 9.1`
: (Optional, boolean) Whether to perform pruning, omitting non-significant tokens from the query to improve query performance. If `prune` is `true` but `pruning_config` is not specified, pruning occurs with default values. Default: `true`.

`pruning_config` {applies_to}`stack: preview 9.1`
: (Optional, object) Pruning configuration, used only when `prune` is `true`. It determines which non-significant tokens are omitted from the query to improve query performance. If `prune` is `true` but `pruning_config` is not specified, default values are used. If `prune` is `false` but `pruning_config` is specified, an exception occurs.

Parameters for `pruning_config` include:

`tokens_freq_ratio_threshold` {applies_to}`stack: preview 9.1`
: (Optional, integer) Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned. This value must be between 1 and 100. Default: `5`.

`tokens_weight_threshold` {applies_to}`stack: preview 9.1`
: (Optional, float) Tokens whose weight is less than `tokens_weight_threshold` are considered insignificant and pruned. This value must be between 0 and 1. Default: `0.4`.

::::{note}
The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen based on tests using ELSERv2, where they produced the best results.
::::
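
The constraints on `index_options` listed in the parameter descriptions can be summarized in a few lines of Python. This is only a sketch of the documented rules, not Elasticsearch's own validation code: `validate_index_options` is a hypothetical helper, and treating the ranges as inclusive and an absent `prune` alongside a `pruning_config` as valid are assumptions.

```python
from typing import Any, Mapping, Optional


def validate_index_options(index_options: Optional[Mapping[str, Any]]) -> dict:
    """Return the effective options or raise ValueError (hypothetical helper)."""
    options = dict(index_options or {})
    prune = options.get("prune", True)  # documented default: true
    config = options.get("pruning_config")

    if not prune:
        # Documented rule: a pruning_config without pruning is an error.
        if config is not None:
            raise ValueError("pruning_config requires prune to be true")
        return {"prune": False}

    config = dict(config or {})
    ratio = config.get("tokens_freq_ratio_threshold", 5)  # documented default
    weight = config.get("tokens_weight_threshold", 0.4)   # documented default

    # Assumption: both documented ranges include their end points.
    if not 1 <= ratio <= 100:
        raise ValueError("tokens_freq_ratio_threshold must be between 1 and 100")
    if not 0 <= weight <= 1:
        raise ValueError("tokens_weight_threshold must be between 0 and 1")

    return {
        "prune": True,
        "pruning_config": {
            "tokens_freq_ratio_threshold": ratio,
            "tokens_weight_threshold": weight,
        },
    }
```

Under these assumptions, calling `validate_index_options(None)` yields pruning enabled with the default thresholds, while `{"prune": False, "pruning_config": {...}}` raises an error.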

When token pruning is applied, non-significant tokens are pruned from the query.
Non-significant tokens are tokens that meet both of the following criteria:
* The token appears much more frequently than most tokens, indicating that it is a very common word that may not contribute much to the overall search results.
* The weight/score is so low that the token is likely not very relevant to the original term.

Both the token frequency threshold and the weight threshold must show that the token is non-significant for it to be pruned.
This ensures that:
* The tokens that are kept are frequent enough and have significant scoring.
* Very infrequent tokens that may not have as high a score are removed.
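
The two-criteria rule can be illustrated with a short Python sketch. It follows only the wording of this page and is not the Elasticsearch implementation: `is_pruned`, the sample tokens, their frequencies, and the average frequency of 100 are all made up for illustration.

```python
def is_pruned(
    token_freq: float,
    average_freq: float,
    weight: float,
    tokens_freq_ratio_threshold: float = 5,
    tokens_weight_threshold: float = 0.4,
) -> bool:
    """Return True when a token meets BOTH pruning criteria (illustrative only)."""
    # Criterion 1: the token is an outlier in frequency.
    too_frequent = token_freq > tokens_freq_ratio_threshold * average_freq
    # Criterion 2: the token's weight is below the weight threshold.
    low_weight = weight < tokens_weight_threshold
    return too_frequent and low_weight


# Hypothetical query tokens: name -> (frequency in the field, weight).
sample_tokens = {
    "the": (600, 0.1),     # very frequent and low weight
    "search": (600, 0.9),  # very frequent but high weight
    "zebra": (3, 0.2),     # low weight but not frequent
}
AVERAGE_FREQ = 100  # assumed average token frequency for the field

kept = [
    name
    for name, (freq, weight) in sample_tokens.items()
    if not is_pruned(freq, AVERAGE_FREQ, weight)
]
print(kept)
```

With the default thresholds, only `the` satisfies both criteria in this sample, so it is the only token dropped.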

## Multi-value sparse vectors [index-multi-value-sparse-vectors]
