Introduce an optional `k` param for the `knn` query
If `k` is not set, the knn query keeps its previous behaviour:
- `num_candidates` docs are collected from each shard. These `num_candidates` docs
are used for combining with the results of other queries and aggregations on each shard.
- docs from all shards are merged to produce the top global `size` results
If `k` is set, the behaviour is instead the following:
- `k` docs are collected from each shard. These `k` docs are used for
combining results with other queries and aggregations on each shard.
- similarly, docs from all shards are merged to produce the top global `size`
results.
Having a `k` param makes it more intuitive for users to express their needs.
They also don't need to care about, and can skip, the `num_candidates` param
for this query, as it is more of an internal detail that tunes how knn search operates.
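As a sketch (index and field names here are hypothetical), a `knn` query that relies on `k` and simply omits `num_candidates` might look like:

```
POST my-index/_search
{
  "size": 10,
  "query": {
    "knn": {
      "field": "image_vector",
      "query_vector": [0.12, -0.45, 0.91],
      "k": 10
    }
  }
}
```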
Closes #108473
- Added a new `AbstractAggregationTestCase` base class for tests that shares most of the code of function tests, adapted for aggregations, including both testing and docs generation.
- Reused the `AbstractFunctionTestCase` class to also let us test evaluators if the aggregation is foldable
- Added a `TopListTests` example
- This includes the docs for Top_list _(Also added a missing include of Ip_prefix docs)_
- Adapted Kibana docs to use `type: "agg"` (@drewdaemon)
The current tests are very basic: consume a page and generate an output,
all in single-aggregation mode (no intermediates, no grouping). More
complex testing will be added in future PRs.
Initial PR of https://github.com/elastic/elasticsearch/issues/109917
This commit adds `bit` vector support by adding `element_type: bit` for
vectors. This new element type works for indexed and non-indexed
vectors. Additionally, it works with `hnsw` and `flat` index types. No
quantization-based codec works with this element type; this is
consistent with `byte` vectors.
`bit` vectors accept up to `32768` dimensions in size and expect vectors
that are being indexed to be encoded either as a hexadecimal string or a
`byte[]` array where each element of the `byte` array represents `8`
bits of the vector.
`bit` vectors support script usage and regular query usage. When
indexed, all comparisons done are `xor` and `popcount` summations (aka,
hamming distance), and the scores are transformed and normalized given
the vector dimensions. Note, indexed bit vectors require `l2_norm` to be
the similarity.
For scripts, `l1norm` is the same as `hamming` distance and `l2norm` is
`sqrt(l1norm)`. `dotProduct` and `cosineSimilarity` are not supported.
Note, the dimensions expected by this element_type must always be
divisible by `8`, and the `byte[]` vectors provided at index time must
have size `dim/8`, where each byte element represents `8` bits of
the vector.
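As a sketch (index and field names hypothetical): with `dims: 40`, each vector is `40/8 = 5` bytes, so a document can carry either a 10-character hexadecimal string or a 5-element `byte[]` array:

```
PUT my-bit-index
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "element_type": "bit",
        "dims": 40
      }
    }
  }
}

PUT my-bit-index/_doc/1
{
  "embedding": "127e8a9b4f"
}

PUT my-bit-index/_doc/2
{
  "embedding": [18, 126, -118, -101, 79]
}
```

Both documents above encode the same 40 bits.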
closes: https://github.com/elastic/elasticsearch/issues/48322
PR #99445 introduced automatic normalization of dense vectors with
cosine similarity. This adds a note about this in the documentation.
Relates to #99445
* WIP Started refactoring in preparation for ST_DISTANCE
* Initial evaluators for ST_DISTANCE
* Update docs/changelog/108764.yaml
* Fix invalid changelog generated by CI
* Register function and get unit tests working
* Fixed failing meta function description tests, and refined descriptions
* Added initial CsvTests and calculate Geo differently to Cartesian
* Added more csv-spec tests and changed to arcDistance for accuracy
* Added generated docs files
* Link to generated docs
* Fix examples tag for linking from generated docs
* Skip wrapper function
And note that we might instead want to include some of the related intelligence from the Circle2D::HaversineDistance class
* Added ST_DWITHIN and more tests for ST_DISTANCE and ST_DWITHIN
* Code style
* Added more tests, this time for sorting on distance
* Fixes after rebase on main
* The ST_DWITHIN cannot use BinarySpatialFunction because it is ternary
So we moved the common code to a separate SpatialTypeResolver, and made a simpler TernarySpatialFunction based on a simple TernaryScalarFunction. This had additional consequences, simplifying the points-only cases.
The main reason for this change was to support StDWithinTests which need to test a lot of things that involve varying all three input types, generating expected error strings, etc. The original hack of just adding to BinarySpatialFunction worked for the actual integration tests, but clearly did not satisfy all the use cases tested by the unit tests.
We also restricted ST_DWITHIN to take only a double as the third argument, because otherwise the number of evaluators would explode, since we need a separate evaluator for each Block type, and Integer and Double use different block types.
* Fixed function count after rebasing on main
* Update docs/changelog/108764.yaml
* Added generated docs for ST_DWITHIN
* Connect docs for ST_DWITHIN
* Add back issue link
* Remove support for ST_DWITHIN
* Update docs/changelog/108764.yaml
* Bring back link to issue in changelog
* Update x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/spatial/StDistance.java
Co-authored-by: Ignacio Vera <iverase@gmail.com>
* Revert reformatting of function descriptions
We should put this into a separate PR
* Github merged commit with incorrectly formatted whitespace
---------
Co-authored-by: Ignacio Vera <iverase@gmail.com>
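As a sketch of the resulting functionality (index and field names hypothetical, assuming a `geo_point` field named `location`), ST_DISTANCE can now be evaluated and sorted on in ES|QL:

```
POST /_query
{
  "query": "FROM airports | EVAL distance = ST_DISTANCE(location, TO_GEOPOINT(\"POINT(12.568 55.676)\")) | SORT distance ASC | LIMIT 5"
}
```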
Wholesale fix of every `TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` in
`o.e.snapshots` and `o.e.repositories`, just pulling them up to the REST
layer (where they become API params), the test suite (where they become
`TEST_REQUEST_TIMEOUT`), or some other place where an explicit value is
available.
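For example (repository name hypothetical), a snapshot repository API call now takes the timeout as an explicit `?master_timeout` param instead of relying on the trappy implicit default:

```
PUT _snapshot/my_repository?master_timeout=30s
{
  "type": "fs",
  "settings": {
    "location": "backups"
  }
}
```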
Relates #107984
Rather than initializing the failure store right away when a new
data stream is created, we leave it empty and mark it for lazy
rollover. This results in the failure store only being initialized
(i.e. an index created) when a failure has actually occurred.
The exception to the rule is when a failure occurs while the data
stream is being auto-created. In that case, we do want to initialize
the failure store right away.
The cluster-level dense vector stats returns the total number of dense vector values globally, including replicas.
This commit fixes the total to only include the value count of the primary indices.
This change aligns with the docs stats, which also report the number of primary documents when used in cluster stats.
The indices stats API still reports granular results for replicas and primaries, so the information is not lost.
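For example, the corrected primary-only total can be inspected under the `indices.dense_vector` section of the cluster stats response (the `filter_path` here just trims the output):

```
GET /_cluster/stats?filter_path=indices.dense_vector
```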
This adds a test that generates
`docs/reference/esql/functions/kibana/inline_cast.json`, which is a JSON
object whose keys are the names of valid inline casts and whose values
are the resulting data types.
I also moved one of the maps we use to make the inline casts to
`DataType`, which is where we want it.
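For reference, an inline cast in ES|QL looks like the following sketch; the generated JSON maps each valid cast name (e.g. `long`) to the data type it produces:

```
POST /_query
{
  "query": "ROW s = \"42\" | EVAL i = s::long"
}
```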
Handle the "output memory allocator bytes" field if and only if it is present in the model size stats, as reported by the C++ backend.
This PR _must_ be merged prior to the corresponding `ml-cpp` one, to keep CI tests happy.
This adds `hamming` distance, the pop-count of `xor`'ed byte vectors, as a
first-class citizen in Painless.
For byte vectors, this means that we can compute hamming distances via
script_score (aka, brute-force).
The implementation of `hamming` is the same as the one available in
Lucene, and when Lucene 9.11 is merged, we should update our logic where
applicable to utilize it.
NOTE: this does not yet add hamming distance as a metric for indexed
vectors. This will be a future PR after the Lucene 9.11 upgrade.
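A minimal `script_score` sketch (index, field, and vectors hypothetical, assuming a `dense_vector` field with `element_type: byte`); since a smaller hamming distance means more similar vectors, the distance is inverted so that closer vectors score higher:

```
POST my-byte-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "1.0 / (1.0 + hamming(params.query_vector, 'embedding'))",
        "params": {
          "query_vector": [12, -34, 56]
        }
      }
    }
  }
}
```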
* Add SparseVectorStats
* Update to use mappings in engine
* Update to be unique to primary shards
* Fix doc
* Fix null error in test
* Cleanup
* fix yaml
* remove comment
* add version to yaml
* Revert whitespace changes to stats doc
* fix yml test
* Checkstyle
* Fix NPE in test
* Update docs/changelog/108793.yaml
* Add link to sparse_vector field type in docs
* PR feedback
* Flesh out test a bit more
* PR feedback - alphabetize placement in docs
* Fix doc change
This adds a new quantization mechanism for HNSW and flat indices. Here
we add `int4` quantization via the `int4_hnsw` and `int4_flat` index
types. This quantization methodology further reduces the memory required
for fast HNSW: the footprint is 8x smaller than with regular float32
values.
An 8x reduction means that 1M 1024-dimension vectors go from requiring
3.8GB to 477MB.
Recall continues to stay steady; there is some reduction, but it is
recoverable via slight oversampling and reranking. For example, over
500k CohereV3 vectors, only 5 extra vectors need to be gathered to
achieve over 0.98 recall in a brute-force scenario.
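Opting in is a mapping choice; a sketch (index and field names hypothetical):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "index_options": {
          "type": "int4_hnsw"
        }
      }
    }
  }
}
```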

When you divide two integers or two longs, we round towards 0, like
Postgres, Java, Rust, or C. Other systems, like MySQL, SPL, Javascript,
or Python, always produce a floating-point number. We should warn folks
about this. It's genuinely unexpected for some folks. OTOH, converting
into a floating-point number would be unexpected for other folks. Oh
well, let's document what we've got.
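A quick ES|QL illustration of the difference:

```
POST /_query
{
  "query": "ROW a = 5, b = -5 | EVAL int_div = a / 2, neg_div = b / 2, float_div = a / 2.0"
}
```

Here `int_div` is `2`, `neg_div` is `-2` (rounded towards 0, not floored to `-3`), and `float_div` is `2.5`.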
This is a refactoring of the internal logic that's used to translate
query-level field names into index-level ones for the query APIs of
security entities (i.e. users, API keys and, soon, roles).
The objective here is to have and reuse a single class that handles
all the translations for the different security query APIs.
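For context, these are the APIs behind endpoints such as the following sketch, where a query-level field like `name` must be translated to whatever field actually backs it in the security index:

```
GET /_security/_query/api_key
{
  "query": {
    "term": { "name": "my-api-key" }
  }
}
```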
Fixing MvAppendTests circuit breaker (CB) exceptions by generating smaller geometries: the
test generates a lot of documents, and the CB is too small for multiple
big shapes.
Fixes https://github.com/elastic/elasticsearch/issues/109409