For float32, there is no compelling reason to use all the memory
required by default for HNSW. Using `int8_hnsw` provides a much saner
default when it comes to cost vs relevancy.
So, on all new indices, `dense_vector` fields that are indexed for fast
search will default to `int8_hnsw`.
Users can still customize their parameters, or prefer `hnsw` over
float32 if they so desire.
* Text fields are stored by default with synthetic source
Synthetic source requires text fields to be stored or to have a keyword
sub-field that supports synthetic source. If there are no keyword fields,
users currently have to explicitly set `store` to `true` or get a
validation exception. This is not the best experience. It is quite
likely that setting `store` to `true` is the correct thing to do, but
users still get an error and need to investigate it. With this change, if
the `store` setting is not specified in such a context, it will be set to
`true` by default. Setting it explicitly to `false` still results in the
exception.
Closes #97039
* [DOCS] `time_series_dimension` fields do not support `ignore_above`
There is existing validation for this combination of parameters but
it was not documented.
Closes #99044
* Remove maximum size constraint
* Add reasoning for constraints
This adds two new vector index types:
- `flat`
- `int8_flat`
Both store the vectors in a flat space and search is brute-force over
the vectors in the index. For the regular `flat` index, this can be
considered syntactic sugar that allows `knn` queries without having to
index the vectors in an HNSW graph.
For `int8_flat`, this allows float vectors to be stored in a flat
manner, but also automatically quantized.
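To illustrate what "brute-force over the vectors" means, here is a minimal sketch in Python. It is not the Lucene or Elasticsearch implementation; the function name `brute_force_knn`, the cosine scoring, and the in-memory `docs` dictionary are all assumptions made for the example.

```python
import math


def brute_force_knn(query, vectors, k):
    """Score every stored vector against the query and keep the k best.

    Hypothetical helper: `vectors` maps a document id to a list of floats.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # No graph traversal: every vector is visited exactly once.
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
top = brute_force_knn([1.0, 0.0], docs, k=2)
```

The cost grows linearly with the number of vectors, which is the trade-off a flat index accepts in exchange for not building a graph.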
Fixed a typo and a small grammatical error in the explanation of the `null_value` option
(cherry picked from commit fa52f82838)
Co-authored-by: Nimrod Dolev <nimrodavid@gmail.com>
Adds new `quantization_options` to `dense_vector`. This allows for
vectors to be automatically quantized to `byte` when indexed.
Example:
```
PUT vectors
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "index": true,
        "index_options": {
          "type": "int8_hnsw"
        }
      }
    }
  }
}
```
When querying, the query vector is automatically quantized and used when
querying the HNSW graph. This reduces the memory required to only `25%`
of what was previously required for `float` vectors at a slight loss of
accuracy.
This is currently only available when `index: true` and when using
`hnsw`.
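The 25% figure follows from storing one byte per dimension instead of a four-byte float. The sketch below shows that arithmetic with a generic linear quantizer in Python. The `quantize_int8` helper and its fixed `lo`/`hi` bounds are assumptions for illustration only and do not reproduce how Lucene chooses its quantization range.

```python
import struct


def quantize_int8(vector, lo, hi):
    """Map floats in [lo, hi] linearly onto signed bytes in [-128, 127].

    Hypothetical helper; values outside [lo, hi] are clamped first.
    """
    scale = 255.0 / (hi - lo)
    out = []
    for x in vector:
        clamped = min(max(x, lo), hi)
        out.append(int(round((clamped - lo) * scale)) - 128)
    return out


vec = [-1.0, 0.0, 0.5, 1.0]
q = quantize_int8(vec, lo=-1.0, hi=1.0)

# Four bytes per float32 dimension versus one byte per int8 dimension.
float_bytes = len(struct.pack(f"{len(vec)}f", *vec))
int8_bytes = len(struct.pack(f"{len(q)}b", *q))
ratio = int8_bytes / float_bytes
```

The rounding step is where the "slight loss of accuracy" comes from: nearby float values can collapse onto the same byte.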
* Represent histogram value count as long
Histograms currently use integers to store the count of each value,
which can overflow. Switch to using long integers to avoid this.
TDigestState was updated to use long for centroid value count in #99491.
Fixes #99820
* Update docs/changelog/99912.yaml
* spotless fix
* Nested dense_vector support
* Adjust nested support based on new lucene version
* fixing after rebase
* fixing some code
* fixing tests adding transport version
* spotless
* [Automated] Update Lucene snapshot to 9.9.0-snapshot-b3e67403aaf
* Adds new max_inner_product vector similarity function (#99527)
Adds new max_inner_product vector similarity function. This differs from dot_product in the following ways:
- Doesn't require vectors to be normalized
- Scales the similarity between vectors differently to prevent negative scores
* requiring top level filter to be parent filter
* adding docs & fixing tests
* adding and fixing docs
* adding changelog
* removing unnecessary file changes
* removing unused imports
* fixing test
* maybe fix doc tests
* continue tests in docs
* fixing more tests
* fixing tests
---------
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Adds new max_inner_product vector similarity function. This differs from dot_product in the following ways:
- Doesn't require vectors to be normalized
- Scales the similarity between vectors differently to prevent negative scores
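A sketch of how a raw inner product could be scaled into a non-negative score, written in Python. The piecewise formula in `scaled_score` is an assumption for illustration (negative values mapped into (0, 1), non-negative values shifted up by 1); this message does not state the exact formula, so check the `dense_vector` reference docs before relying on it.

```python
def dot(a, b):
    """Plain inner product; no normalization of either vector is required."""
    return sum(x * y for x, y in zip(a, b))


def scaled_score(raw):
    """Hypothetical scaling that keeps scores positive and order-preserving."""
    if raw < 0:
        # Assumed mapping: negative inner products land in (0, 1).
        return 1.0 / (1.0 - raw)
    # Assumed mapping: non-negative inner products land in [1, inf).
    return raw + 1.0


query = [1.0, 2.0]
scores = [scaled_score(dot(query, v)) for v in ([-1.0, -1.0], [0.0, 0.0], [1.0, 0.5])]
```

Under this assumed mapping the ordering of the raw inner products is preserved while every score stays above zero.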
`dot_product` requires vectors to be unit-length. Previously, we would
check that vectors were unit-length and throw if they were not.
Instead, we will now auto-normalize vectors as they are indexed.
`cosine` will continue to behave as usual, not normalizing the vectors.
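For reference, normalizing to unit length means dividing each component by the vector's magnitude. A minimal Python sketch follows; the `normalize` helper and its rejection of zero-magnitude vectors are assumptions for the example, not the actual indexing code.

```python
import math


def normalize(vector):
    """Return a unit-length copy of `vector` (hypothetical helper)."""
    magnitude = math.sqrt(sum(x * x for x in vector))
    if magnitude == 0.0:
        # Assumption: a zero vector has no direction, so it is rejected here.
        raise ValueError("cannot normalize a zero-magnitude vector")
    return [x / magnitude for x in vector]


unit = normalize([3.0, 4.0])
unit_length = math.sqrt(sum(x * x for x in unit))
```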
closes: https://github.com/elastic/elasticsearch/issues/98935
* First version
* Spotless, I liked my version better
* Fix param default values
* Add a supplier for default value to ensure it's calculated correctly
* Can't improve this without breaking tests
* Added checks for not specifying a body in PUT requests
* Fix default provider for enum params
* Added yaml test
* Changed docs and fix TODO
* Removing synonyms changes
* Added separate methods for providing default value as suppliers in enums
* Fixed test
* Add a supplier for default value to ensure it's calculated correctly
* Added checks for not specifying a body in PUT requests
* Remove synonyms changes
* Remove some supplier changes
* Better call enumParam with supplier version
* Fix compiler error on supplier
* Apply validators or requires depending on index version
* Solved BWC tests that involved using validators instead of requiresParameters
* Add tests
* Spotless
* Update docs/changelog/98268.yaml
* Update changelog
* Update docs/changelog/98268.yaml
* PR comments
* PR feedback
* Serialize index only for new index versions
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
* Update field-mapping.asciidoc to note that epoch format is not supported as a dynamic date format
* Update docs/reference/mapping/dynamic/field-mapping.asciidoc
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
---------
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
* Documentation for time-series geo_line
* Fix incorrect ids in geoline docs
* Some updates from review
Added image of Kibana map, improved first example, linked to TSDS and added section on line simplification with link to Wikipedia.
* Diagrams of truncation versus simplification
* Allow multiple field names/patterns for (path_)(un)match (#66364)
Arrays of patterns are now allowed for dynamic_templates in the match,
unmatch, path_match and path_unmatch fields. DynamicTemplate has been modified to
support List<String> for these fields. The patterns can be either simple wildcards
or regex. As with previous functionality, when match_pattern="regex", simple wildcards
will be flagged with an error, but when match_pattern="simple", using regular expressions
in the match will not throw an error.
One new error pathway was added: if a user specifies a list of non-strings for
one of these pattern fields (e.g., "match": [10, false]) a MapperParserException
will be thrown.
A dynamic_template yamlRestTest was added. This is a BWC change, so the REST test
that uses arrays of patterns is limited to v8.9 and above.
Closes #66364.
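The list handling described above can be sketched in Python as follows. This is not the DynamicTemplate code: the helper names are hypothetical, only `*` is treated as a wildcard, and the rule "any `match` pattern hits and no `unmatch` pattern hits" is an assumption about how multiple patterns combine.

```python
import re


def simple_match(pattern, name):
    """Match `name` against a simple wildcard pattern where `*` matches anything."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, name) is not None


def as_pattern_list(value):
    """Accept a single string or a list of strings; reject anything else."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list) and all(isinstance(p, str) for p in value):
        return value
    # Stands in for the MapperParserException mentioned in this message.
    raise ValueError(f"patterns must be strings, got {value!r}")


def template_applies(field, match, unmatch=None):
    """Assumed semantics: any match pattern hits and no unmatch pattern hits."""
    unmatch_patterns = [] if unmatch is None else as_pattern_list(unmatch)
    matched = any(simple_match(p, field) for p in as_pattern_list(match))
    excluded = any(simple_match(p, field) for p in unmatch_patterns)
    return matched and not excluded
```

For example, `template_applies("ip_addr", ["ip_*", "*_ip"])` is true, while adding `unmatch=["*_addr"]` makes it false.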
Currently Lucene limits the max number of vector dimensions to 1024.
This commit overrides KnnFloatVectorField and KnnByteVectorField
classes to increase the limit to 2048 for indexed vectors in ES.
Here we add synthetic source support for fields whose type is flattened.
Note that flattened fields and synthetic source have the following limitations,
all arising from the fact that in synthetic source we just see key/value pairs
when reconstructing the original object and have no type information in mappings:
* flattened fields use sorted set doc values of keywords, which means two things:
first we do not allow duplicate values, second we treat all values as keywords
* reconstructing array of objects results in nested objects (no array)
* reconstructing arrays with just one element results in a single-valued field, since we
have no way to distinguish single-valued from multi-valued fields other than looking
at the count of values
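The effect of these limitations on the values of a single key can be sketched in Python. The `reconstruct` helper is hypothetical and only models the behavior listed above (keyword conversion, de-duplication, sorting, single-value collapse); it is not the synthetic source implementation, and converting with `str` is an assumption about how non-string values become keywords.

```python
def reconstruct(values):
    """Model what comes back for one flattened key (hypothetical helper)."""
    # Everything is treated as a keyword; a sorted set drops duplicates.
    unique_sorted = sorted({str(v) for v in values})
    # With only the value count to go on, one value looks like a plain field.
    if len(unique_sorted) == 1:
        return unique_sorted[0]
    return unique_sorted


deduplicated = reconstruct(["b", "a", "b"])
collapsed = reconstruct(["only"])
as_keywords = reconstruct([10, 2])
```

Note that `as_keywords` sorts lexicographically, so `"10"` comes before `"2"`.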
`runtime_mappings` is the name of the param in the search request. In the
document `put` statement, it's called `runtime`
Co-authored-by: Matthew Hinea <matthew.hinea@gmail.com>
This PR enables the `ignore_malformed` parameter to be accepted as an option in
boolean field mappings. Support for synthetic source is not added yet, so if
`ignore_malformed` is set to true, synthetic source isn't supported.
Closes #89542