Adds new bit element_type for dense_vectors (#110059)

This commit adds `bit` vector support by adding `element_type: bit` for
vectors. This new element type works for indexed and non-indexed
vectors. Additionally, it works with `hnsw` and `flat` index types. No
quantization based codec works with this element type, this is
consistent with `byte` vectors.

`bit` vectors accept up to `32768` dimensions in size and expect vectors
that are being indexed to be encoded either as a hexidecimal string or a
`byte[]` array where each element of the `byte` array represents `8`
bits of the vector.

`bit` vectors support script usage and regular query usage. When
indexed, all comparisons done are `xor` and `popcount` summations (aka,
hamming distance), and the scores are transformed and normalized given
the vector dimensions. Note, indexed bit vectors require `l2_norm` to be
the similarity.

For scripts, `l1norm` is the same as `hamming` distance and `l2norm` is
`sqrt(l1norm)`. `dotProduct` and `cosineSimilarity` are not supported.

Note, the dimensions expected by this element_type are always to be
divisible by `8`, and the `byte[]` vectors provided for index must be
have size `dim/8` size, where each byte element represents `8` bits of
the vectors.

closes: https://github.com/elastic/elasticsearch/issues/48322
This commit is contained in:
Benjamin Trent 2024-06-26 14:48:41 -04:00 committed by GitHub
parent 97651dfb9f
commit 5add44d7d1
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
38 changed files with 2713 additions and 187 deletions

View file

@ -449,7 +449,10 @@ module org.elasticsearch.server {
with
org.elasticsearch.index.codec.vectors.ES813FlatVectorFormat,
org.elasticsearch.index.codec.vectors.ES813Int8FlatVectorFormat,
org.elasticsearch.index.codec.vectors.ES814HnswScalarQuantizedVectorsFormat;
org.elasticsearch.index.codec.vectors.ES814HnswScalarQuantizedVectorsFormat,
org.elasticsearch.index.codec.vectors.ES815HnswBitVectorsFormat,
org.elasticsearch.index.codec.vectors.ES815BitFlatVectorFormat;
provides org.apache.lucene.codecs.Codec with Elasticsearch814Codec;
provides org.apache.logging.log4j.core.util.ContextDataProvider with org.elasticsearch.common.logging.DynamicContextDataProvider;