[[tune-knn-search]]
== Tune approximate kNN search

{es} supports <<approximate-knn, approximate k-nearest neighbor search>> for
efficiently finding the _k_ nearest vectors to a query vector. Since
approximate kNN search works differently from other queries, there are special
considerations around its performance.

Many of these recommendations help improve search speed. With approximate kNN,
the indexing algorithm runs searches under the hood to create the vector index
structures. So these same recommendations also help with indexing speed.

[discrete]
=== Reduce vector memory footprint

The default <<dense-vector-element-type,`element_type`>> is `float`. But this
can be automatically quantized during index time through
<<dense-vector-quantization,`quantization`>>. Quantization reduces the required
memory by 4x, 8x, or as much as 32x, but it also reduces the precision of the
vectors and increases disk usage for the field (by up to 25%, 12.5%, or
3.125%, respectively). The increased disk usage is a result of {es} storing
both the quantized and the unquantized vectors. For example, when int8
quantizing 40GB of floating point vectors, an extra 10GB of data is stored for
the quantized vectors. The total disk usage amounts to 50GB, but the memory
usage for fast search is reduced to 10GB.

For `float` vectors with `dim` greater than or equal to `384`, using a
<<dense-vector-quantization,`quantized`>> index is highly recommended.
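
For example, this sketch of a mapping enables `int8` quantization through the
`index_options` of a `dense_vector` field. The index name, field name, and
`dims` value are hypothetical and depend on your embedding model:

[source,console]
----
PUT my-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index_options": {
          "type": "int8_hnsw"
        }
      }
    }
  }
}
----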

[discrete]
=== Reduce vector dimensionality

The speed of kNN search scales linearly with the number of vector dimensions,
because each similarity computation considers each element in the two vectors.
Whenever possible, it's better to use vectors with a lower dimension. Some
embedding models come in different "sizes", with both lower and higher
dimensional options available. You could also experiment with dimensionality
reduction techniques like PCA. When experimenting with different approaches,
it's important to measure the impact on relevance to ensure the search
quality is still acceptable.

[discrete]
=== Exclude vector fields from `_source`

{es} stores the original JSON document that was passed at index time in the
<<mapping-source-field, `_source` field>>. By default, each hit in the search
results contains the full document `_source`. When the documents contain
high-dimensional `dense_vector` fields, the `_source` can be quite large and
expensive to load. This can significantly slow down kNN search.

NOTE: <<docs-reindex, reindex>>, <<docs-update, update>>,
and <<docs-update-by-query, update by query>> operations generally
require the `_source` field. Disabling `_source` for a field might result in
unexpected behavior for these operations. For example, the reindexed
destination index might not actually contain the `dense_vector` field.

You can disable storing `dense_vector` fields in the `_source` through the
<<include-exclude, `excludes`>> mapping parameter. This prevents loading and
returning large vectors during search, and also cuts down on the index size.
Vectors that have been omitted from `_source` can still be used in kNN search,
since it relies on separate data structures to perform the search. Before
using the <<include-exclude, `excludes`>> parameter, make sure to review the
downsides of omitting fields from `_source`.
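
A minimal sketch of such a mapping (index and field names are hypothetical):

[source,console]
----
PUT my-index
{
  "mappings": {
    "_source": {
      "excludes": [
        "my_vector"
      ]
    },
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  }
}
----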

Another option is to use <<synthetic-source,synthetic `_source`>>.

[discrete]
=== Ensure data nodes have enough memory

{es} uses the https://arxiv.org/abs/1603.09320[HNSW] algorithm for approximate
kNN search. HNSW is a graph-based algorithm which only works efficiently when
most vector data is held in memory. You should ensure that data nodes have at
least enough RAM to hold the vector data and index structures. To check the
size of the vector data, you can use the <<indices-disk-usage>> API.
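
For example, this request reports the on-disk size of each field in a
hypothetical `my-index`, including its `dense_vector` fields:

[source,console]
----
POST my-index/_disk_usage?run_expensive_tasks=true
----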

Here are estimates for different element types and quantization levels:

* `element_type: float`: `num_vectors * num_dimensions * 4`
* `element_type: float` with `quantization: int8`: `num_vectors * (num_dimensions + 4)`
* `element_type: float` with `quantization: int4`: `num_vectors * (num_dimensions/2 + 4)`
* `element_type: float` with `quantization: bbq`: `num_vectors * (num_dimensions/8 + 12)`
* `element_type: byte`: `num_vectors * num_dimensions`
* `element_type: bit`: `num_vectors * (num_dimensions/8)`

If utilizing HNSW, the graph must also be in memory. To estimate the required
bytes, use `num_vectors * 4 * HNSW.m`. The default value for `HNSW.m` is 16,
so by default `num_vectors * 4 * 16`.
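
As an illustrative estimate, 1 million 768-dimensional `float` vectors
quantized to `int8` need about `1,000,000 * (768 + 4) ≈ 772MB` for the
quantized vectors, plus `1,000,000 * 4 * 16 = 64MB` for the HNSW graph with
the default `m`, or roughly 836MB of filesystem cache in total.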

Note that the required RAM is for the filesystem cache, which is separate from
the Java heap.

The data nodes should also leave a buffer for other ways that RAM is needed.
For example, your index might also include text fields and numerics, which
also benefit from using the filesystem cache. It's recommended to run
benchmarks with your specific dataset to ensure there's a sufficient amount of
memory to give good search performance. You can find
https://elasticsearch-benchmarks.elastic.co/#tracks/so_vector[here] and
https://elasticsearch-benchmarks.elastic.co/#tracks/dense_vector[here] some
examples of datasets and configurations that we use for our nightly
benchmarks.

[discrete]
[[dense-vector-preloading]]
include::search-speed.asciidoc[tag=warm-fs-cache]

The following file extensions are used for approximate kNN search.
Each extension is broken down by quantization type:

* `vex` for the HNSW graph
* `vec` for all non-quantized vector values. This includes all element types: `float`, `byte`, and `bit`.
* `veq` for quantized vectors indexed with <<dense-vector-quantization,`quantization`>>: `int4` or `int8`
* `veb` for binary vectors indexed with <<dense-vector-quantization,`quantization`>>: `bbq`
* `vem`, `vemf`, `vemq`, and `vemb` for metadata, usually small and not a concern for preloading

Generally, if you are using a quantized index, you should only preload the
relevant quantized values and the HNSW graph. Preloading the raw vectors is
not necessary and might be counterproductive.
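
For example, this sketch of a create-index request preloads only the
`int8`-quantized vectors (`veq`) and the HNSW graph (`vex`). The index name is
hypothetical, and `index.store.preload` is a static setting that must be set
at index creation:

[source,console]
----
PUT my-index
{
  "settings": {
    "index.store.preload": ["veq", "vex"]
  }
}
----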

[discrete]
=== Reduce the number of index segments

{es} shards are composed of segments, which are internal storage elements in
the index. For approximate kNN search, {es} stores the vector values of
each segment as a separate HNSW graph, so kNN search must check each segment.
The recent parallelization of kNN search made it much faster to search across
multiple segments, but kNN search can still be up to several times faster if
there are fewer segments. By default, {es} periodically merges smaller
segments into larger ones through a background
<<index-modules-merge, merge process>>. If this isn't sufficient, you can take
explicit steps to reduce the number of index segments.

[discrete]
==== Increase maximum segment size

{es} provides many tunable settings for controlling the merge process. One
important setting is `index.merge.policy.max_merged_segment`. This controls
the maximum size of the segments that are created during the merge process.
By increasing the value, you can reduce the number of segments in the index.
The default value is `5GB`, but that might be too small for larger
dimensional vectors. Increasing this value to `10GB` or `20GB` can help
reduce the number of segments.
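
A minimal sketch of updating the setting on an existing index (the index name
and the `10gb` value are illustrative):

[source,console]
----
PUT my-index/_settings
{
  "index.merge.policy.max_merged_segment": "10gb"
}
----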

[discrete]
==== Create large segments during bulk indexing

A common pattern is to first perform an initial bulk upload, then make an
index available for searches. Instead of force merging, you can adjust the
index settings to encourage {es} to create larger initial segments:

* Ensure there are no searches during the bulk upload and disable
<<index-refresh-interval-setting,`index.refresh_interval`>> by setting it to
`-1`. This prevents refresh operations and avoids creating extra segments (see
the sketch after this list).
* Give {es} a large indexing buffer so it can accept more documents before
flushing. By default, the <<indexing-buffer,`indices.memory.index_buffer_size`>>
is set to 10% of the heap size. With a substantial heap size like 32GB, this
is often enough. To allow the full indexing buffer to be used, you should also
increase the <<index-modules-translog,`index.translog.flush_threshold_size`>>
limit.
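
A sketch of these index-level settings for the bulk upload, with illustrative
values (`indices.memory.index_buffer_size` is a node-level setting configured
in `elasticsearch.yml`, so it doesn't appear here):

[source,console]
----
PUT my-index/_settings
{
  "index.refresh_interval": "-1",
  "index.translog.flush_threshold_size": "2gb"
}
----

Remember to restore <<index-refresh-interval-setting,`index.refresh_interval`>>
once the bulk upload completes so new documents become searchable.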

[discrete]
=== Avoid heavy indexing during searches

Actively indexing documents can have a negative impact on approximate kNN
search performance, since indexing threads steal compute resources from
search. When indexing and searching at the same time, {es} also refreshes
frequently, which creates several small segments. This also hurts search
performance, since approximate kNN search is slower when there are more
segments.

When possible, it's best to avoid heavy indexing during approximate kNN
search. If you need to reindex all the data, perhaps because the vector
embedding model changed, then it's better to reindex the new documents into a
separate index rather than update them in-place. This helps avoid the slowdown
mentioned above, and prevents expensive merge operations due to frequent
document updates.
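
For example, here is a sketch of reindexing into a fresh index after an
embedding model change. The index and pipeline names are hypothetical; the
destination index would be created with the new vector mapping first, and the
ingest pipeline would compute the new embeddings:

[source,console]
----
POST _reindex
{
  "source": {
    "index": "my-index"
  },
  "dest": {
    "index": "my-index-v2",
    "pipeline": "my-embedding-pipeline"
  }
}
----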

[discrete]
include::search-speed.asciidoc[tag=readahead]