mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 15:47:23 -04:00
This commit adds `bit` vector support by adding `element_type: bit` for vectors. This new element type works for indexed and non-indexed vectors. Additionally, it works with `hnsw` and `flat` index types. No quantization-based codec works with this element type; this is consistent with `byte` vectors. `bit` vectors accept up to `32768` dimensions and expect vectors that are being indexed to be encoded either as a hexadecimal string or as a `byte[]` array where each element of the `byte` array represents `8` bits of the vector. `bit` vectors support script usage and regular query usage. When indexed, all comparisons are `xor` and `popcount` summations (that is, Hamming distance), and the scores are transformed and normalized given the vector dimensions. Note that indexed bit vectors require `l2_norm` to be the similarity. For scripts, `l1norm` is the same as `hamming` distance and `l2norm` is `sqrt(l1norm)`. `dotProduct` and `cosineSimilarity` are not supported. Note that the dimensions expected by this element_type must always be divisible by `8`, and the `byte[]` vectors provided for indexing must have size `dim/8`, where each byte element represents `8` bits of the vector. closes: https://github.com/elastic/elasticsearch/issues/48322
337 lines
9.8 KiB
Text
[[vector-functions]]
===== Functions for vector fields

NOTE: Vector functions linearly scan all matched documents, so
expect the query time to grow linearly
with the number of matched documents. For this reason, we recommend
limiting the number of matched documents with a `query` parameter.

This is the list of available vector functions and vector access methods:

1. <<vector-functions-cosine,`cosineSimilarity`>> – calculates cosine similarity
2. <<vector-functions-dot-product,`dotProduct`>> – calculates dot product
3. <<vector-functions-l1,`l1norm`>> – calculates L^1^ distance
4. <<vector-functions-hamming,`hamming`>> – calculates Hamming distance
5. <<vector-functions-l2,`l2norm`>> – calculates L^2^ distance
6. <<vector-functions-accessing-vectors,`doc[<field>].vectorValue`>> – returns a vector's value as an array of floats
7. <<vector-functions-accessing-vectors,`doc[<field>].magnitude`>> – returns a vector's magnitude

NOTE: The `cosineSimilarity` and `dotProduct` functions are not supported for `bit` vectors.

NOTE: The recommended way to access dense vectors is through the
`cosineSimilarity`, `dotProduct`, `l1norm` or `l2norm` functions. Call
each of these functions only once per script. For example,
don’t use them in a loop to calculate the similarity between a
document vector and multiple other vectors. If you need that functionality,
reimplement these functions yourself by
<<vector-functions-accessing-vectors,accessing vector values directly>>.

Let's create an index with a `dense_vector` mapping and index a couple
of documents into it.

[source,console]
--------------------------------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_dense_vector": {
        "type": "dense_vector",
        "index": false,
        "dims": 3
      },
      "my_byte_dense_vector": {
        "type": "dense_vector",
        "index": false,
        "dims": 3,
        "element_type": "byte"
      },
      "status" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "my_dense_vector": [0.5, 10, 6],
  "my_byte_dense_vector": [0, 10, 6],
  "status" : "published"
}

PUT my-index-000001/_doc/2
{
  "my_dense_vector": [-0.5, 10, 10],
  "my_byte_dense_vector": [0, 10, 10],
  "status" : "published"
}

POST my-index-000001/_refresh

--------------------------------------------------
// TESTSETUP

[[vector-functions-cosine]]
====== Cosine similarity

The `cosineSimilarity` function calculates the
cosine similarity between a given query vector and document vectors.

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published" <1>
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0", <2>
        "params": {
          "query_vector": [4, 3.4, -0.2] <3>
        }
      }
    }
  }
}
--------------------------------------------------

<1> To restrict the number of documents on which script score calculation is applied, provide a filter.
<2> The script adds 1.0 to the cosine similarity to prevent the score from being negative.
<3> To take advantage of the script optimizations, provide a query vector as a script parameter.

NOTE: If a document's dense vector field has a number of dimensions
different from the query's vector, an error will be thrown.

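To make the scoring concrete, the following Python sketch reproduces the `cosineSimilarity(...) + 1.0` arithmetic outside Elasticsearch for the two sample documents. It is an illustration only: the helper name `cosine_similarity` is hypothetical, and it assumes the usual definition of cosine similarity (dot product divided by the product of the magnitudes).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity under the usual definition: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        # Mirrors the documented behaviour: mismatched dimensions are an error.
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

query_vector = [4, 3.4, -0.2]
docs = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}

# Adding 1.0 shifts the range from [-1, 1] to [0, 2], so scores are never negative.
scores = {doc_id: cosine_similarity(query_vector, vec) + 1.0
          for doc_id, vec in docs.items()}
```

With the sample data, document 1 scores higher than document 2 because its vector points in a direction closer to that of the query vector.
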
[[vector-functions-dot-product]]
====== Dot product

The `dotProduct` function calculates the
dot product between a given query vector and document vectors.

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": """
          double value = dotProduct(params.query_vector, 'my_dense_vector');
          return sigmoid(1, Math.E, -value); <1>
        """,
        "params": {
          "query_vector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
--------------------------------------------------

<1> Using the standard sigmoid function prevents scores from being negative.

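The sketch below, written in Python rather than Painless, shows why `sigmoid(1, Math.E, -value)` behaves like the standard logistic function. It assumes that the Painless `sigmoid(value, k, a)` helper computes `value^a / (k^a + value^a)`; verify that against the Painless documentation for your version. The function names `painless_style_sigmoid` and `score` are hypothetical.

```python
import math

def painless_style_sigmoid(value, k, a):
    # Assumed definition of sigmoid(value, k, a): value^a / (k^a + value^a).
    return value ** a / (k ** a + value ** a)

def score(dot_product):
    # Same call shape as the script: sigmoid(1, Math.E, -value).
    # With value = 1 this reduces to 1 / (1 + e^(-dot_product)).
    return painless_style_sigmoid(1.0, math.e, -dot_product)

positive = score(34.8)  # dot product of the query vector with document 1
negative = score(-2.0)  # a hypothetical negative dot product
neutral = score(0.0)    # a dot product of zero maps to the midpoint
```

Under that assumption, negative dot products map to scores between 0 and 0.5 and positive ones to scores between 0.5 and 1, so the score is never negative.
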
[[vector-functions-l1]]
====== L^1^ distance (Manhattan distance)

The `l1norm` function calculates L^1^ distance
(Manhattan distance) between a given query vector and
document vectors.

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": "1 / (1 + l1norm(params.queryVector, 'my_dense_vector'))", <1>
        "params": {
          "queryVector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
--------------------------------------------------

<1> Unlike `cosineSimilarity`, which represents similarity, `l1norm` and
`l2norm` (shown below) represent distances or differences. This means that
the more similar the vectors are, the lower the values produced by the
`l1norm` and `l2norm` functions.
Because more similar vectors should score higher,
we invert the output of `l1norm` and `l2norm`. Also, to avoid
division by 0 when a document vector matches the query exactly,
we add `1` to the denominator.

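As a rough illustration of the `1 / (1 + l1norm(...))` scoring, here is a small Python sketch that computes the same quantity outside Elasticsearch for the sample documents. The helper names `l1_distance` and `l1_score` are hypothetical.

```python
def l1_distance(a, b):
    """Manhattan distance: the sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l1_score(query, doc_vector):
    # Invert the distance so that closer vectors score higher; the +1 avoids
    # division by zero when the two vectors are identical.
    return 1 / (1 + l1_distance(query, doc_vector))

query_vector = [4, 3.4, -0.2]
score_doc1 = l1_score(query_vector, [0.5, 10, 6])
score_doc2 = l1_score(query_vector, [-0.5, 10, 10])
score_exact = l1_score(query_vector, query_vector)
```

An exact match yields the maximum score of 1, and the score approaches 0 as the distance grows.
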
[[vector-functions-hamming]]
====== Hamming distance

The `hamming` function calculates {wikipedia}/Hamming_distance[Hamming distance] between a given query vector and
document vectors. It is only available for byte and bit vectors.

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": "(24 - hamming(params.queryVector, 'my_byte_dense_vector')) / 24.0", <1>
        "params": {
          "queryVector": [4, 3, 0]
        }
      }
    }
  }
}
--------------------------------------------------

<1> Calculate the Hamming distance and normalize it by the number of bits (3 bytes of 8 bits each, so 24) to get a score between 0 and 1. Dividing by `24.0` keeps the division in floating point.

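The following Python sketch shows the XOR-and-popcount calculation behind the Hamming distance for byte vectors, applied to the sample documents. It is an illustration outside Elasticsearch; the helper names `hamming_bytes` and `hamming_score` are hypothetical.

```python
def hamming_bytes(a, b):
    """Count differing bits between two equal-length vectors of 8-bit values."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Mask to 8 bits so that negative (signed) byte values are handled too.
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))

def hamming_score(query, doc_vector):
    total_bits = 8 * len(doc_vector)  # 3 bytes -> 24 bits in the example
    return (total_bits - hamming_bytes(query, doc_vector)) / total_bits

query_vector = [4, 3, 0]
score_doc1 = hamming_score(query_vector, [0, 10, 6])
score_doc2 = hamming_score(query_vector, [0, 10, 10])
```

Both sample documents differ from the query in 5 of 24 bits, so both score 19/24.
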
[[vector-functions-l2]]
====== L^2^ distance (Euclidean distance)

The `l2norm` function calculates L^2^ distance
(Euclidean distance) between a given query vector and
document vectors.

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": "1 / (1 + l2norm(params.queryVector, 'my_dense_vector'))",
        "params": {
          "queryVector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
--------------------------------------------------

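For comparison with the L^1^ case, this Python sketch computes the `1 / (1 + l2norm(...))` score outside Elasticsearch for the sample documents. The helper names `l2_distance` and `l2_score` are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l2_score(query, doc_vector):
    # Same inversion as for l1norm: smaller distance means a higher score.
    return 1 / (1 + l2_distance(query, doc_vector))

query_vector = [4, 3.4, -0.2]
score_doc1 = l2_score(query_vector, [0.5, 10, 6])
score_doc2 = l2_score(query_vector, [-0.5, 10, 10])
```
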
[[vector-functions-missing-values]]
====== Checking for missing values

If a document doesn't have a value for a vector field on which a vector function
is executed, an error will be thrown.

You can check whether a document is missing a value for the field `my_vector` with
`doc['my_vector'].size() == 0`. Your overall script can look like this:

[source,js]
--------------------------------------------------
"source": "doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, 'my_vector')"
--------------------------------------------------
// NOTCONSOLE

[[vector-functions-accessing-vectors]]
====== Accessing vectors directly

You can access vector values directly through the following functions:

- `doc[<field>].vectorValue` – returns a vector's value as an array of floats

NOTE: For `bit` vectors, `vectorValue` still returns a `float[]`, where each element represents 8 bits.

- `doc[<field>].magnitude` – returns a vector's magnitude as a float.
For vectors created prior to version 7.5, the magnitude is not stored,
so this function calculates it anew every time it is called.

NOTE: For `bit` vectors, the magnitude is the square root of the number of `1` bits.

For example, the script below implements a cosine similarity using these
two functions:

[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": """
          float[] v = doc['my_dense_vector'].vectorValue;
          float vm = doc['my_dense_vector'].magnitude;
          float dotProduct = 0;
          for (int i = 0; i < v.length; i++) {
            dotProduct += v[i] * params.queryVector[i];
          }
          return dotProduct / (vm * (float) params.queryVectorMag);
        """,
        "params": {
          "queryVector": [4, 3.4, -0.2],
          "queryVectorMag": 5.25357
        }
      }
    }
  }
}
--------------------------------------------------

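The `queryVectorMag` parameter in the request above has to be computed by the client. A minimal Python sketch of that step follows, assuming the magnitude is the Euclidean norm of the query vector. The helper name `magnitude` is hypothetical.

```python
import math

def magnitude(vector):
    """Euclidean norm: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))

query_vector = [4, 3.4, -0.2]

# Script parameters as they would be placed in the "params" object.
params = {
    "queryVector": query_vector,
    "queryVectorMag": round(magnitude(query_vector), 5),
}
```

Precomputing the magnitude once on the client avoids recalculating it for every scored document.
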
[[vector-functions-bit-vectors]]
====== Bit vectors and vector functions

When using `bit` vectors, not all the vector functions are available. The supported functions are:

* <<vector-functions-hamming,`hamming`>> – calculates Hamming distance, the number of `1` bits in the bitwise XOR of the two vectors
* <<vector-functions-l1,`l1norm`>> – calculates L^1^ distance, which for `bit` vectors is simply the `hamming` distance
* <<vector-functions-l2,`l2norm`>> – calculates L^2^ distance, which for `bit` vectors is the square root of the `hamming` distance

Currently, the `cosineSimilarity` and `dotProduct` functions are not supported for `bit` vectors.

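The relationships above can be sketched in Python for bit vectors packed 8 bits per byte and written as hexadecimal strings. This is an illustration outside Elasticsearch: the two vectors are made up, and the helper name `bits_differing` is hypothetical.

```python
import math

def bits_differing(a: bytes, b: bytes) -> int:
    """Hamming distance: number of 1 bits in the bitwise XOR of two bit vectors."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same number of dimensions")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two made-up 16-dimensional bit vectors, each packed into 2 bytes.
doc_vector = bytes.fromhex("ff00")
query_vector = bytes.fromhex("0f0f")

dims = 8 * len(doc_vector)                 # each byte carries 8 dimensions
hamming = bits_differing(query_vector, doc_vector)
l1norm = hamming                           # L1 distance equals the Hamming distance
l2norm = math.sqrt(hamming)                # L2 distance is its square root
# Magnitude of a bit vector: square root of the number of 1 bits.
doc_magnitude = math.sqrt(sum(bin(x).count("1") for x in doc_vector))
```

Because every dimension is either 0 or 1, each per-dimension difference is also 0 or 1, which is why the L^1^ distance collapses to a count of differing bits.
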