[[query-dsl-knn-query]]
=== Knn query
++++
<titleabbrev>Knn</titleabbrev>
++++
Finds the _k_ nearest vectors to a query vector, as measured by a similarity
metric. The `knn` query finds the nearest vectors through approximate search on
indexed `dense_vector` fields. The preferred way to run an approximate kNN search
is through the <<knn-search,top-level `knn` section>> of a search request. The
`knn` query is reserved for expert cases that need to combine it with other queries.
[[knn-query-ex-request]]
==== Example request
. Create an index with a `dense_vector` field.
+
[source,console]
----
PUT my-image-index
{
  "mappings": {
    "properties": {
      "image-vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      },
      "file-type": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      }
    }
  }
}
----
. Index your data.
+
[source,console]
----
POST my-image-index/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "image-vector": [1, 5, -20], "file-type": "jpg", "title": "mountain lake" }
{ "index": { "_id": "2" } }
{ "image-vector": [42, 8, -15], "file-type": "png", "title": "frozen lake"}
{ "index": { "_id": "3" } }
{ "image-vector": [15, 11, 23], "file-type": "jpg", "title": "mountain lake lodge" }
----
//TEST[continued]
. Run the search using the `knn` query, asking for the top 3 nearest vectors.
+
[source,console]
----
POST my-image-index/_search
{
  "size" : 3,
  "query" : {
    "knn": {
      "field": "image-vector",
      "query_vector": [-5, 9, -12],
      "num_candidates": 10
    }
  }
}
----
//TEST[continued]
NOTE: The `knn` query doesn't have a separate `k` parameter. As with other
queries, `k` is defined by the `size` parameter of the search request. The `knn`
query collects `num_candidates` results from each shard, then merges them to get
the top `size` results.
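
The collect-then-merge behavior described in the note can be sketched in plain Python. This is only an illustration: it uses exact (brute-force) squared L2 distance in place of the approximate search, and the two-shard layout and the helper names `shard_candidates` and `knn_query` are assumptions made for the example, not part of {es}.

```python
import heapq

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def shard_candidates(shard_docs, query_vector, num_candidates):
    """Return the num_candidates nearest (distance, doc_id) pairs on one shard."""
    scored = [(l2_squared(vec, query_vector), doc_id)
              for doc_id, vec in shard_docs.items()]
    return heapq.nsmallest(num_candidates, scored)

def knn_query(shards, query_vector, num_candidates, size):
    """Collect candidates per shard, then merge them into the top `size` hits."""
    merged = []
    for shard_docs in shards:
        merged.extend(shard_candidates(shard_docs, query_vector, num_candidates))
    return [doc_id for _, doc_id in heapq.nsmallest(size, merged)]

# Hypothetical placement of the three sample documents on two shards.
shards = [
    {"1": [1, 5, -20], "2": [42, 8, -15]},
    {"3": [15, 11, 23]},
]

top = knn_query(shards, [-5, 9, -12], num_candidates=10, size=3)
```

In this sketch, lowering `num_candidates` to 1 leaves only one candidate per shard, so fewer than `size` documents can come back even though three documents exist.
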
[[knn-query-top-level-parameters]]
==== Top-level parameters for `knn`
`field`::
+
--
(Required, string) The name of the vector field to search against. Must be a
<<index-vectors-knn-search, `dense_vector` field with indexing enabled>>.
--
`query_vector`::
+
--
(Optional, array of floats or string) Query vector. Must have the same number of dimensions
as the vector field you are searching against. Must be either an array of floats or a hex-encoded byte vector.
Either this or `query_vector_builder` must be provided.
--
`query_vector_builder`::
+
--
(Optional, object) Query vector builder.
include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=knn-query-vector-builder]
--
`num_candidates`::
+
--
(Optional, integer) The number of nearest neighbor candidates to consider per shard.
Cannot exceed 10,000. {es} collects `num_candidates` results from each shard, then
merges them to find the top results. Increasing `num_candidates` tends to improve the
accuracy of the final results. Defaults to `Math.min(1.5 * size, 10_000)`.
--
`filter`::
+
--
(Optional, query object) Query to filter the documents that can match.
The kNN search will return the top documents that also match this filter.
The value can be a single query or a list of queries. If `filter` is not provided,
all documents are allowed to match.
The filter is a pre-filter, meaning that it is applied **during** the approximate
kNN search to ensure that `num_candidates` matching documents are returned.
--
`similarity`::
+
--
(Optional, float) The minimum similarity required for a document to be considered
a match. The similarity value relates to the raw
<<dense-vector-similarity, `similarity`>> used, not to the document score. The matched
documents are then scored according to <<dense-vector-similarity, `similarity`>>
and the provided `boost` is applied.
--
`boost`::
+
--
(Optional, float) Floating point number used to multiply the
scores of matched documents. This value cannot be negative. Defaults to `1.0`.
--
`_name`::
+
--
(Optional, string) Name used to identify the query.
--
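
The constraints listed above can be collected in a small client-side helper that assembles the `knn` clause as a Python dictionary. This is a minimal sketch, not an official client API: the names `build_knn_query` and `default_num_candidates` are hypothetical, and rounding the default down to an integer is an assumption, since the parameter description only gives `Math.min(1.5 * size, 10_000)`.

```python
MAX_NUM_CANDIDATES = 10_000

def default_num_candidates(size):
    # Documented default: Math.min(1.5 * size, 10_000).
    # Truncating to an integer is an assumption of this sketch.
    return min(int(1.5 * size), MAX_NUM_CANDIDATES)

def build_knn_query(field, query_vector=None, query_vector_builder=None,
                    num_candidates=None, filter_query=None,
                    similarity=None, boost=None):
    """Assemble a `knn` query clause, checking the documented constraints."""
    if query_vector is None and query_vector_builder is None:
        raise ValueError("provide query_vector or query_vector_builder")
    if num_candidates is not None and num_candidates > MAX_NUM_CANDIDATES:
        raise ValueError("num_candidates cannot exceed 10,000")
    if boost is not None and boost < 0:
        raise ValueError("boost cannot be negative")

    optional = {
        "query_vector": query_vector,
        "query_vector_builder": query_vector_builder,
        "num_candidates": num_candidates,
        "filter": filter_query,
        "similarity": similarity,
        "boost": boost,
    }
    # Only parameters that were actually set are sent in the request body.
    body = {"field": field}
    body.update({k: v for k, v in optional.items() if v is not None})
    return {"knn": body}

query = build_knn_query("image-vector", query_vector=[-5, 9, -12],
                        num_candidates=10)
```

The resulting dictionary has the same shape as the `query` object in the example request and can be serialized to JSON as the body of a search request.
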
[[knn-query-filtering]]
==== Pre-filters and post-filters in knn query
There are two ways to filter documents that match a kNN query:

. **pre-filtering**: the filter is applied during the approximate kNN search
to ensure that `k` matching documents are returned.
. **post-filtering**: the filter is applied after the approximate kNN search
completes, which can result in fewer than `k` results, even when there are enough
matching documents.

Pre-filtering is supported through the `filter` parameter of the `knn` query.
Filters from <<filter-alias,aliases>> are also applied as pre-filters.

All other filters found in the Query DSL tree are applied as post-filters.
For example, the following `knn` query finds the top 3 documents with the nearest
vectors (`num_candidates=3`) and combines them with a `term` filter, which is
applied as a post-filter. The final set of documents contains only the single
document that passes the post-filter.

[source,console]
----
POST my-image-index/_search
{
  "size" : 10,
  "query" : {
    "bool" : {
      "must" : {
        "knn": {
          "field": "image-vector",
          "query_vector": [-5, 9, -12],
          "num_candidates": 3
        }
      },
      "filter" : {
        "term" : { "file-type" : "png" }
      }
    }
  }
}
----
//TEST[continued]
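
The difference between the two filtering modes can be illustrated with a small simulation over the three sample documents. This is a sketch under simplifying assumptions: it uses exact search instead of the approximate search, it models pre-filtering by restricting the candidate set before searching, and the function names are hypothetical.

```python
DOCS = {
    "1": {"image-vector": [1, 5, -20], "file-type": "jpg"},
    "2": {"image-vector": [42, 8, -15], "file-type": "png"},
    "3": {"image-vector": [15, 11, 23], "file-type": "jpg"},
}
QUERY_VECTOR = [-5, 9, -12]

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(doc_ids, k):
    """Return the k doc ids whose vectors are closest to QUERY_VECTOR."""
    ranked = sorted(
        doc_ids,
        key=lambda d: l2_squared(DOCS[d]["image-vector"], QUERY_VECTOR),
    )
    return ranked[:k]

def is_png(doc_id):
    return DOCS[doc_id]["file-type"] == "png"

def pre_filtered(k):
    # Only matching documents are eligible, so k matches come back if they exist.
    allowed = [d for d in DOCS if is_png(d)]
    return nearest(allowed, k)

def post_filtered(k):
    # The k nearest documents are found first; non-matching ones are dropped after.
    return [d for d in nearest(DOCS, k) if is_png(d)]

pre_hits = pre_filtered(1)
post_hits = post_filtered(1)
```

With `k = 1`, the pre-filtered search returns document `2`, while the post-filtered search returns nothing, because the single nearest document is a `jpg` and is discarded after the search.
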
[[knn-query-in-hybrid-search]]
==== Hybrid search with knn query
The `knn` query can be used as part of a hybrid search, where it is combined
with lexical queries. For example, the query below finds documents whose
`title` matches `mountain lake` and combines them with the top 10 documents
whose image vectors are closest to the `query_vector`. The combined documents
are then scored, and the 3 top-scoring documents are returned.

[source,console]
----
POST my-image-index/_search
{
  "size" : 3,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "mountain lake",
              "boost": 1
            }
          }
        },
        {
          "knn": {
            "field": "image-vector",
            "query_vector": [-5, 9, -12],
            "num_candidates": 10,
            "boost": 2
          }
        }
      ]
    }
  }
}
----
//TEST[continued]
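
How the two `should` clauses might combine can be sketched as follows. Several things here are assumptions for illustration: the lexical scores are made-up numbers rather than real BM25 output, the vector score uses the `1 / (1 + l2_norm^2)` formula documented for `l2_norm` similarity on `dense_vector` fields, and the combined score is taken as the sum of the boosted clause scores.

```python
VECTORS = {
    "1": [1, 5, -20],
    "2": [42, 8, -15],
    "3": [15, 11, 23],
}
# Hypothetical lexical scores for the `match` clause; not real BM25 values.
LEXICAL_SCORES = {"1": 0.9, "2": 0.3, "3": 0.8}

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_score(vector, query_vector):
    # Score for l2_norm similarity: 1 / (1 + l2_norm(query, vector)^2).
    return 1.0 / (1.0 + l2_squared(vector, query_vector))

def hybrid_ranking(query_vector, lexical_boost=1.0, knn_boost=2.0,
                   num_candidates=10, size=3):
    """Sum the boosted scores of both clauses and return the top `size` ids."""
    knn_hits = sorted(
        VECTORS, key=lambda d: l2_squared(VECTORS[d], query_vector)
    )[:num_candidates]

    combined = {d: lexical_boost * s for d, s in LEXICAL_SCORES.items()}
    for doc_id in knn_hits:
        score = knn_boost * knn_score(VECTORS[doc_id], query_vector)
        combined[doc_id] = combined.get(doc_id, 0.0) + score

    return sorted(combined, key=combined.get, reverse=True)[:size]

ranking = hybrid_ranking([-5, 9, -12])
```

Because the raw `l2_norm` scores in this example are small compared with the assumed lexical scores, the lexical clause dominates the ranking even with `boost` set to 2 on the `knn` clause. When the two score ranges differ this much, the boosts may need tuning.
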
[[knn-query-with-nested-query]]
==== Knn query inside a nested query
The `knn` query can be used inside a nested query. The behavior is similar
to <<nested-knn-search, top-level nested kNN search>>:

* kNN search over nested `dense_vector` fields diversifies the top results over
the top-level documents.
* A `filter` over the top-level document metadata is supported and acts as a
post-filter.
* A `filter` over `nested` field metadata is not supported.

A sample query might look like this:

[source,js]
----
{
  "query" : {
    "nested" : {
      "path" : "paragraph",
      "query" : {
        "knn": {
          "query_vector": [
            0.45,
            45
          ],
          "field": "paragraph.vector",
          "num_candidates": 2
        }
      }
    }
  }
}
----
// NOTCONSOLE
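
One way to picture the diversification over top-level documents is to keep only the best nested vector per parent before taking the top results. The sketch below assumes that reading of the behavior. It uses exact search, and the parent ids, paragraph vectors, and function names are hypothetical.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_vectors(paragraphs, query_vector, k):
    """Naive top-k over nested vectors: one parent can appear several times."""
    ranked = sorted(paragraphs, key=lambda p: l2_squared(p[1], query_vector))
    return [parent_id for parent_id, _ in ranked[:k]]

def top_parents(paragraphs, query_vector, k):
    """Diversified top-k: each parent is ranked by its single closest vector."""
    best = {}
    for parent_id, vector in paragraphs:
        distance = l2_squared(vector, query_vector)
        if parent_id not in best or distance < best[parent_id]:
            best[parent_id] = distance
    return sorted(best, key=best.get)[:k]

# (parent document id, paragraph vector) pairs; illustrative data only.
paragraphs = [
    ("doc-a", [0.4, 44]),
    ("doc-a", [0.5, 46]),
    ("doc-b", [3, 40]),
    ("doc-c", [10, 10]),
]

naive = top_vectors(paragraphs, [0.45, 45], k=2)
diversified = top_parents(paragraphs, [0.45, 45], k=2)
```

Here the naive ranking returns `doc-a` twice, while the diversified ranking returns `doc-a` and `doc-b`.
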
[[knn-query-aggregations]]
==== Knn query with aggregations
The `knn` query calculates aggregations on the top `num_candidates` documents
from each shard. As a result, the final aggregation results cover
`num_candidates * number_of_shards` documents. This is different from
the <<knn-search,top-level `knn` section>>, where aggregations are
calculated on the global top `k` nearest documents.
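
The difference in aggregation scope can be sketched with a simple count per `file-type`. As before, this uses exact search, and the two-shard layout and function names are assumptions made for illustration.

```python
from collections import Counter

QUERY_VECTOR = [-5, 9, -12]

# Hypothetical placement of the sample documents on two shards.
SHARDS = [
    {
        "1": {"image-vector": [1, 5, -20], "file-type": "jpg"},
        "2": {"image-vector": [42, 8, -15], "file-type": "png"},
    },
    {
        "3": {"image-vector": [15, 11, 23], "file-type": "jpg"},
    },
]

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distance(doc):
    return l2_squared(doc["image-vector"], QUERY_VECTOR)

def knn_query_agg(shards, num_candidates):
    """Aggregate over the num_candidates nearest documents of every shard."""
    counts = Counter()
    for shard in shards:
        for doc in sorted(shard.values(), key=distance)[:num_candidates]:
            counts[doc["file-type"]] += 1
    return counts

def top_level_knn_agg(shards, k):
    """Aggregate over the global top k nearest documents only."""
    all_docs = [doc for shard in shards for doc in shard.values()]
    return Counter(doc["file-type"] for doc in sorted(all_docs, key=distance)[:k])

per_shard = knn_query_agg(SHARDS, num_candidates=1)
global_top = top_level_knn_agg(SHARDS, k=1)
```

With `num_candidates` set to 1 and two shards, the `knn` query aggregation counts two documents, while the top-level variant with `k = 1` counts one.
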