This commit introduces `MappedFieldType#getDefaultHighlighter`, which allows a field type to enforce a specific highlighter.
The semantic field mapper utilizes this new functionality to set the `semantic` highlighter as the default.
All other fields will continue to use the `unified` highlighter by default.
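As a rough sketch of the effect, assuming a hypothetical index `my-index` with a `semantic_text` field named `content` (both names are made up for illustration), a highlight request that leaves out `type` would now pick up the `semantic` highlighter for that field:
```
POST my-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "how do I tune vector search?"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}
```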
With the new backing algorithm in place and rescoring made easier by
the `rescore_vector` API, let's mark BBQ as GA.
Additionally, this commit adds rolling upgrade tests to ensure
stability.
Semantic text fields now support multi-fields, either as part of a multi-field structure or containing multi-fields internally.
This enhancement aligns with the semantic text field's current behavior as a standard text field.
Note: Multi-field support is only available for the new index format. Attempting to set a multi-field on an index created with the older format will still result in a failure.
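A minimal mapping sketch covering both directions, assuming a hypothetical index `my-index` and a hypothetical inference endpoint `my-inference-endpoint`; the field and subfield names are illustrative only:
```
PUT my-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "semantic": {
            "type": "semantic_text",
            "inference_id": "my-inference-endpoint"
          }
        }
      },
      "body": {
        "type": "semantic_text",
        "inference_id": "my-inference-endpoint",
        "fields": {
          "plain": {
            "type": "text"
          }
        }
      }
    }
  }
}
```
Here `title.semantic` is a semantic text field used as a sub-field, while `body` is a semantic text field that carries its own sub-field.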
Late-interaction models are powerful rerankers. While their size and
overall cost don't lend themselves to HNSW indexing, using them for
second-order "brute-force" reranking can provide excellent boosts in
relevance, generally at lower inference times than large cross-encoders.
This commit exposes a new experimental `rank_vectors` field that allows
for maxSim operations. This unlocks the initial, and most common, use of
late-interaction dense models.
For example, this is how you would use it via the API:
```
PUT index
{
  "mappings": {
    "properties": {
      "late_interaction_vectors": {
        "type": "rank_vectors"
      }
    }
  }
}
```
Then to index:
```
POST index/_doc
{
  "late_interaction_vectors": [[0.1, ...],...]
}
```
For querying, scoring can be exposed with scripting:
```
POST index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "maxSimDotProduct(params.query_vector, 'late_interaction_vectors')",
        "params": {
          "query_vector": [[0.42, ...], ...]
        }
      }
    }
  }
}
```
Of course, an initial ranking should run first. The maxSim scoring can then
be applied on top of it, either via the `rescore` parameter or by simply
passing whatever first-phase retrieval you want as the inner query in
`script_score`.
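A sketch of the `rescore` variant, assuming a hypothetical `title` text field for the first phase and made-up two-dimensional query vectors (real vectors must match the dimensions of the indexed ones):
```
POST index/_search
{
  "query": {
    "match": {
      "title": "late interaction"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": {
            "match_all": {}
          },
          "script": {
            "source": "maxSimDotProduct(params.query_vector, 'late_interaction_vectors')",
            "params": {
              "query_vector": [[0.42, 0.17], [0.08, 0.91]]
            }
          }
        }
      }
    }
  }
}
```
Only the top `window_size` hits from the `match` query are rescored, which keeps the brute-force maxSim cost bounded. The window size of 50 is an arbitrary illustrative value.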
Enhance documentation to explain that the "_index_prefix" subfield must
be added to the `matched_fields` param when highlighting a main field.
When running prefix queries on fields that are indexed with prefixes,
the "_index_prefix" subfield is used. If we try to highlight the main
field, we may not get any results. The "_index_prefix" subfield must
be added to `matched_fields`, which instructs ES to use matches
from "_index_prefix" to highlight the main field.
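For illustration, a sketch assuming a hypothetical index `my-index` with a `body` text field that has `index_prefixes` enabled with default settings:
```
PUT my-index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "index_prefixes": {}
      }
    }
  }
}

GET my-index/_search
{
  "query": {
    "prefix": {
      "body": {
        "value": "ela"
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {
        "matched_fields": ["body._index_prefix"]
      }
    }
  }
}
```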
Removes the old `_knn_search` API, which never left tech preview and was
deprecated throughout the v8 cycle.
To keep using the API, `compatible-with=8` can be specified.
* Adds new default inference information
* Update docs/reference/mapping/types/semantic-text.asciidoc
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
* Update docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
* Update docs/reference/mapping/types/semantic-text.asciidoc
Co-authored-by: David Kyle <david.kyle@elastic.co>
---------
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-authored-by: David Kyle <david.kyle@elastic.co>
This PR introduces a new highlighter, `semantic`, tailored for semantic text fields.
It extracts the most relevant fragments by scoring nested chunks using the original semantic query.
In this initial version, the highlighter returns only the original chunks computed during ingestion. However, this is an implementation detail, and future enhancements could combine multiple chunks to generate the fragments.
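A sketch of requesting the highlighter explicitly, assuming a hypothetical index `my-index` with a `semantic_text` field named `content`; the fragment count and ordering are illustrative choices:
```
POST my-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "how do I tune vector search?"
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "type": "semantic",
        "number_of_fragments": 2,
        "order": "score"
      }
    }
  }
}
```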
This PR introduces an option for `sparse_vector` to store its values separately from `_source` by using term vectors.
This capability is primarily needed by the semantic text field.
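A mapping sketch, assuming the option is exposed as a `store` mapping parameter (the parameter name is not stated above and is an assumption here); the index and field names are hypothetical:
```
PUT my-index
{
  "mappings": {
    "properties": {
      "tokens": {
        "type": "sparse_vector",
        "store": true
      }
    }
  }
}
```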
* docs: update synthetic source docs
* fix: also doc values false works
* Revert "fix: also doc values false works"
This reverts commit 0895a76758.
* fix: update synthetic source documentation
* fix: all field types support it
* fix: no need to explicitly mention it
* fix: synthetic source sorting
* fix: may instead of might
We will deprecate the `_source.mode` mapping level configuration
in favor of the index-level `index.mapping.source.mode` setting.
As a result, this commit goes through the documentation and updates it to
reflect the introduction of the setting.
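For reference, a sketch of the index-level form on a hypothetical index `my-index`, using `synthetic` as an example mode:
```
PUT my-index
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  }
}
```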
Here we introduce a new index-level setting, `ignore_above`, similar to what we have
for `ignore_malformed`. The setting will apply to all `keyword`, `wildcard` and `flattened`
fields. Each field mapping will still be allowed to override the index-level setting using a
mapping-level `ignore_above` value.
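A sketch of how the two levels might combine, assuming the setting is registered as `index.mapping.ignore_above` (the full setting key is an assumption; only `ignore_above` is named above). Index name, field names and limits are illustrative:
```
PUT my-index
{
  "settings": {
    "index.mapping.ignore_above": 256
  },
  "mappings": {
    "properties": {
      "tag": {
        "type": "keyword"
      },
      "url": {
        "type": "keyword",
        "ignore_above": 2048
      }
    }
  }
}
```
Under that assumption, `tag` would inherit the index-level limit of 256, while `url` overrides it with its own mapping-level value.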
Closes https://github.com/elastic/elasticsearch/issues/110387
Having this in now spares us from introducing version checks in
the ES exporter later. We can simply use the same serialization logic
for metric attributes as we do for other signals. This also enables us
to properly map `*.ip` fields to the ip field type, as ip fields
containing a list of IPs are not converted to a comma-separated list.
JDK 23 removes the COMPAT locale provider, leaving CLDR as the only option. This commit configures Elasticsearch
to use the CLDR provider when on JDK 23, but still use the existing COMPAT provider when on JDK 22 and below.
This causes some differences in locale behaviour; this also adapts various tests to still work whether run on COMPAT or CLDR.
* [DOCS] Clarify copy_to behavior with strict dynamic mappings
* Add id
* De-verbosify
* Delete pesky comma
* More info about root and nest
* Fixes per review, clarify non-recursive explanation
* Skip tests for illustrative example
* Fix example syntax
* Fix typo