elasticsearch/docs/reference/vectors/vector-functions.asciidoc
Benjamin Trent 8d16529169
Correct bit * byte and bit * float script comparisons (#117404) (#117507)
I goofed on the bit * byte and bit * float comparisons. Naturally, these
should be bigendian and compare the dimensions with the binary ones
appropriately.

Additionally, I added a test to ensure that this is handled correctly.

(cherry picked from commit 374c88a832)
2024-11-26 06:07:32 +11:00

427 lines
13 KiB
Text
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

[[vector-functions]]
===== Functions for vector fields
NOTE: During vector functions' calculation, all matched documents are
linearly scanned. Thus, expect the query time grow linearly
with the number of matched documents. For this reason, we recommend
to limit the number of matched documents with a `query` parameter.
This is the list of available vector functions and vector access methods:
1. <<vector-functions-cosine,`cosineSimilarity`>> calculates cosine similarity
2. <<vector-functions-dot-product,`dotProduct`>> calculates dot product
3. <<vector-functions-l1,`l1norm`>> calculates L^1^ distance
4. <<vector-functions-hamming,`hamming`>> calculates Hamming distance
5. <<vector-functions-l2,`l2norm`>> - calculates L^2^ distance
6. <<vector-functions-accessing-vectors,`doc[<field>].vectorValue`>> returns a vector's value as an array of floats
7. <<vector-functions-accessing-vectors,`doc[<field>].magnitude`>> returns a vector's magnitude
NOTE: The `cosineSimilarity` function is not supported for `bit` vectors.
NOTE: The recommended way to access dense vectors is through the
`cosineSimilarity`, `dotProduct`, `l1norm` or `l2norm` functions. Please note
however, that you should call these functions only once per script. For example,
dont use these functions in a loop to calculate the similarity between a
document vector and multiple other vectors. If you need that functionality,
reimplement these functions yourself by
<<vector-functions-accessing-vectors,accessing vector values directly>>.
Let's create an index with a `dense_vector` mapping and index a couple
of documents into it.
[source,console]
--------------------------------------------------
PUT my-index-000001
{
"mappings": {
"properties": {
"my_dense_vector": {
"type": "dense_vector",
"index": false,
"dims": 3
},
"my_byte_dense_vector": {
"type": "dense_vector",
"index": false,
"dims": 3,
"element_type": "byte"
},
"status" : {
"type" : "keyword"
}
}
}
}
PUT my-index-000001/_doc/1
{
"my_dense_vector": [0.5, 10, 6],
"my_byte_dense_vector": [0, 10, 6],
"status" : "published"
}
PUT my-index-000001/_doc/2
{
"my_dense_vector": [-0.5, 10, 10],
"my_byte_dense_vector": [0, 10, 10],
"status" : "published"
}
POST my-index-000001/_refresh
--------------------------------------------------
// TESTSETUP
[[vector-functions-cosine]]
====== Cosine similarity
The `cosineSimilarity` function calculates the measure of
cosine similarity between a given query vector and document vectors.
[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
"query": {
"script_score": {
"query" : {
"bool" : {
"filter" : {
"term" : {
"status" : "published" <1>
}
}
}
},
"script": {
"source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0", <2>
"params": {
"query_vector": [4, 3.4, -0.2] <3>
}
}
}
}
}
--------------------------------------------------
<1> To restrict the number of documents on which script score calculation is applied, provide a filter.
<2> The script adds 1.0 to the cosine similarity to prevent the score from being negative.
<3> To take advantage of the script optimizations, provide a query vector as a script parameter.
NOTE: If a document's dense vector field has a number of dimensions
different from the query's vector, an error will be thrown.
[[vector-functions-dot-product]]
====== Dot product
The `dotProduct` function calculates the measure of
dot product between a given query vector and document vectors.
[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
"query": {
"script_score": {
"query" : {
"bool" : {
"filter" : {
"term" : {
"status" : "published"
}
}
}
},
"script": {
"source": """
double value = dotProduct(params.query_vector, 'my_dense_vector');
return sigmoid(1, Math.E, -value); <1>
""",
"params": {
"query_vector": [4, 3.4, -0.2]
}
}
}
}
}
--------------------------------------------------
<1> Using the standard sigmoid function prevents scores from being negative.
[[vector-functions-l1]]
====== L^1^ distance (Manhattan distance)
The `l1norm` function calculates L^1^ distance
(Manhattan distance) between a given query vector and
document vectors.
[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
"query": {
"script_score": {
"query" : {
"bool" : {
"filter" : {
"term" : {
"status" : "published"
}
}
}
},
"script": {
"source": "1 / (1 + l1norm(params.queryVector, 'my_dense_vector'))", <1>
"params": {
"queryVector": [4, 3.4, -0.2]
}
}
}
}
}
--------------------------------------------------
<1> Unlike `cosineSimilarity` that represent similarity, `l1norm` and
`l2norm` shown below represent distances or differences. This means, that
the more similar the vectors are, the lower the scores will be that are
produced by the `l1norm` and `l2norm` functions.
Thus, as we need more similar vectors to score higher,
we reversed the output from `l1norm` and `l2norm`. Also, to avoid
division by 0 when a document vector matches the query exactly,
we added `1` in the denominator.
[[vector-functions-hamming]]
====== Hamming distance
The `hamming` function calculates {wikipedia}/Hamming_distance[Hamming distance] between a given query vector and
document vectors. It is only available for byte and bit vectors.
[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
"query": {
"script_score": {
"query" : {
"bool" : {
"filter" : {
"term" : {
"status" : "published"
}
}
}
},
"script": {
"source": "(24 - hamming(params.queryVector, 'my_byte_dense_vector')) / 24", <1>
"params": {
"queryVector": [4, 3, 0]
}
}
}
}
}
--------------------------------------------------
<1> Calculate the Hamming distance and normalize it by the bits to get a score between 0 and 1.
[[vector-functions-l2]]
====== L^2^ distance (Euclidean distance)
The `l2norm` function calculates L^2^ distance
(Euclidean distance) between a given query vector and
document vectors.
[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
"query": {
"script_score": {
"query" : {
"bool" : {
"filter" : {
"term" : {
"status" : "published"
}
}
}
},
"script": {
"source": "1 / (1 + l2norm(params.queryVector, 'my_dense_vector'))",
"params": {
"queryVector": [4, 3.4, -0.2]
}
}
}
}
}
--------------------------------------------------
[[vector-functions-missing-values]]
====== Checking for missing values
If a document doesn't have a value for a vector field on which a vector function
is executed, an error will be thrown.
You can check if a document has a value for the field `my_vector` with
`doc['my_vector'].size() == 0`. Your overall script can look like this:
[source,js]
--------------------------------------------------
"source": "doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, 'my_vector')"
--------------------------------------------------
// NOTCONSOLE
[[vector-functions-accessing-vectors]]
====== Accessing vectors directly
You can access vector values directly through the following functions:
- `doc[<field>].vectorValue` returns a vector's value as an array of floats
NOTE: For `bit` vectors, it does return a `float[]`, where each element represents 8 bits.
- `doc[<field>].magnitude` returns a vector's magnitude as a float
(for vectors created prior to version 7.5 the magnitude is not stored.
So this function calculates it anew every time it is called).
NOTE: For `bit` vectors, this is just the square root of the sum of `1` bits.
For example, the script below implements a cosine similarity using these
two functions:
[source,console]
--------------------------------------------------
GET my-index-000001/_search
{
"query": {
"script_score": {
"query" : {
"bool" : {
"filter" : {
"term" : {
"status" : "published"
}
}
}
},
"script": {
"source": """
float[] v = doc['my_dense_vector'].vectorValue;
float vm = doc['my_dense_vector'].magnitude;
float dotProduct = 0;
for (int i = 0; i < v.length; i++) {
dotProduct += v[i] * params.queryVector[i];
}
return dotProduct / (vm * (float) params.queryVectorMag);
""",
"params": {
"queryVector": [4, 3.4, -0.2],
"queryVectorMag": 5.25357
}
}
}
}
}
--------------------------------------------------
[[vector-functions-bit-vectors]]
====== Bit vectors and vector functions
When using `bit` vectors, not all the vector functions are available. The supported functions are:
* <<vector-functions-hamming,`hamming`>> calculates Hamming distance, the sum of the bitwise XOR of the two vectors
* <<vector-functions-l1,`l1norm`>> calculates L^1^ distance, this is simply the `hamming` distance
* <<vector-functions-l2,`l2norm`>> - calculates L^2^ distance, this is the square root of the `hamming` distance
* <<vector-functions-dot-product,`dotProduct`>> calculates dot product. When comparing two `bit` vectors,
this is the sum of the bitwise AND of the two vectors. If providing `float[]` or `byte[]`, who has `dims` number of elements, as a query vector, the `dotProduct` is
the sum of the floating point values using the stored `bit` vector as a mask.
NOTE: When comparing `floats` and `bytes` with `bit` vectors, the `bit` vector is treated as a mask in big-endian order.
For example, if the `bit` vector is `10100001` (e.g. the single byte value `161`) and its compared
with array of values `[1, 2, 3, 4, 5, 6, 7, 8]` the `dotProduct` will be `1 + 3 + 8 = 16`.
Here is an example of using dot-product with bit vectors.
[source,console]
--------------------------------------------------
PUT my-index-bit-vectors
{
"mappings": {
"properties": {
"my_dense_vector": {
"type": "dense_vector",
"index": false,
"element_type": "bit",
"dims": 40 <1>
}
}
}
}
PUT my-index-bit-vectors/_doc/1
{
"my_dense_vector": [8, 5, -15, 1, -7] <2>
}
PUT my-index-bit-vectors/_doc/2
{
"my_dense_vector": [-1, 115, -3, 4, -128]
}
PUT my-index-bit-vectors/_doc/3
{
"my_dense_vector": [2, 18, -5, 0, -124]
}
POST my-index-bit-vectors/_refresh
--------------------------------------------------
// TEST[continued]
<1> The number of dimensions or bits for the `bit` vector.
<2> This vector represents 5 bytes, or `5 * 8 = 40` bits, which equals the configured dimensions
[source,console]
--------------------------------------------------
GET my-index-bit-vectors/_search
{
"query": {
"script_score": {
"query" : {
"match_all": {}
},
"script": {
"source": "dotProduct(params.query_vector, 'my_dense_vector')",
"params": {
"query_vector": [8, 5, -15, 1, -7] <1>
}
}
}
}
}
--------------------------------------------------
// TEST[continued]
<1> This vector is 40 bits, and thus will compute a bitwise `&` operation with the stored vectors.
[source,console]
--------------------------------------------------
GET my-index-bit-vectors/_search
{
"query": {
"script_score": {
"query" : {
"match_all": {}
},
"script": {
"source": "dotProduct(params.query_vector, 'my_dense_vector')",
"params": {
"query_vector": [0.23, 1.45, 3.67, 4.89, -0.56, 2.34, 3.21, 1.78, -2.45, 0.98, -0.12, 3.45, 4.56, 2.78, 1.23, 0.67, 3.89, 4.12, -2.34, 1.56, 0.78, 3.21, 4.12, 2.45, -1.67, 0.34, -3.45, 4.56, -2.78, 1.23, -0.67, 3.89, -4.34, 2.12, -1.56, 0.78, -3.21, 4.45, 2.12, 1.67] <1>
}
}
}
}
}
--------------------------------------------------
// TEST[continued]
<1> This vector is 40 individual dimensions, and thus will sum the floating point values using the stored `bit` vector as a mask.
Currently, the `cosineSimilarity` function is not supported for `bit` vectors.