* [DOCS] Clarify copy_to behavior with strict dynamic mappings
* Add id
* De-verbosify
* Delete pesky comma
* More info about root and nest
* Fixes per review, clarify non-recursive explanation
* Skip tests for illustrative example
* Fix example syntax
* Fix typo
This commit adds `bit` vector support by adding `element_type: bit` for
vectors. This new element type works for indexed and non-indexed
vectors. Additionally, it works with the `hnsw` and `flat` index types. No
quantization-based codec works with this element type, which is
consistent with `byte` vectors.
`bit` vectors accept up to `32768` dimensions and expect vectors
that are being indexed to be encoded either as a hexadecimal string or a
`byte[]` array where each element of the `byte` array represents `8`
bits of the vector.
`bit` vectors support script usage and regular query usage. When
indexed, all comparisons are `xor` and `popcount` summations (i.e.,
Hamming distance), and the scores are transformed and normalized given
the vector dimensions. Note that indexed bit vectors require `l2_norm`
as the similarity.
For scripts, `l1norm` is the same as `hamming` distance and `l2norm` is
`sqrt(l1norm)`. `dotProduct` and `cosineSimilarity` are not supported.
Note that the dimensions expected by this element type must always be
divisible by `8`, and the `byte[]` vectors provided at index time must
have size `dim/8`, where each byte element represents `8` bits of the
vector.
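As a minimal sketch (the index name, field name, and dimension count are illustrative, not from this commit), a `bit` vector field could be mapped and populated like this:

```json
PUT my-bit-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "element_type": "bit",
        "dims": 64,
        "index": true,
        "similarity": "l2_norm"
      }
    }
  }
}

PUT my-bit-index/_doc/1
{
  "my_vector": "127e6f7f32b50ff3"
}
```

Here `dims` counts bits, so `64` dims is `8` bytes: the hexadecimal payload is 16 characters, and the equivalent `byte[]` form would be an array of 8 byte values.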
closes: https://github.com/elastic/elasticsearch/issues/48322
PR #99445 introduced automatic normalization of dense vectors with
cosine similarity. This commit adds a note about that behavior to the
documentation.
Relates to #99445
This adds a new quantization mechanism for HNSW and flat indices. Here
we add `int4` quantization via the `int4_hnsw` and `int4_flat` index
types. This quantization method further reduces the memory required for
fast HNSW: memory usage is 8x smaller than with regular float32 values.
An 8x reduction means that 1M 1024-dimension vectors go from requiring
3.8GB to 477MB.
Recall stays steady; there is some reduction, but it is recoverable via
slight oversampling and reranking. For example, over 500k CohereV3
vectors, only 5 extra vectors need to be gathered to achieve over 0.98
recall in a brute-force scenario.
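As a hedged sketch of opting in (index and field names are illustrative), the new index type is selected via `index_options`:

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "int4_hnsw"
        }
      }
    }
  }
}
```

`int4` packs two values per byte, which is where the 8x-smaller-than-float32 framing above comes from.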

Updated LuceneDocument to look up feature values on existing features and select the max when parsing multi-value sparse vectors.
This PR uses infrastructure from #107567 to implement a fallback implementation of synthetic source for field mappers that don't support it natively. In that case, the source of such a field is stored as is in a separate stored field.
This PR adds synthetic source support for annotated_text fields. The existing implementation for text fields is reused, including test infrastructure, so the majority of the change consists of moving code and making it accessible.
Contributes to #106460, #78744.
* Implement synthetic source support for range fields
This PR adds basic synthetic source support for range fields. The
synthetic source produced has the following notable properties:
* Ranges are always normalized to be inclusive on both ends (this is how
they are stored).
* Original order of ranges is not preserved.
* Date ranges are always expressed in epoch millis; the original format
is not preserved.
* IP ranges are always expressed as a range of IPs, even if originally
provided as a CIDR.
This PR only implements retrieval of data for source reconstruction from
doc values.
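To illustrate the normalization (a hedged sketch; the index name, field name, and synthetic-source syntax are assumptions, not from this PR's text), a date range indexed with exclusive bounds comes back inclusive and in epoch millis:

```json
PUT range-idx
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "window": {
        "type": "date_range",
        "format": "strict_date_optional_time"
      }
    }
  }
}

PUT range-idx/_doc/1
{
  "window": { "gt": "2024-01-01T00:00:00Z", "lt": "2024-02-01T00:00:00Z" }
}
```

When `_source` is reconstructed for this document, the bounds would come back as inclusive `gte`/`lte` values in epoch millis (the exclusive originals shifted by one millisecond), not the strings that were sent.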
* Remove `es-test-dir` book-scoped variable
* Remove `plugins-examples-dir` book-scoped variable
* Remove `:dependencies-dir:` and `:xes-repo-dir:` book-scoped variables
- In `index.asciidoc`, two variables (`:dependencies-dir:` and `:xes-repo-dir:`) were removed.
- In `sql/index.asciidoc`, the `:sql-tests:` path was updated to a fuller path.
- In `esql/index.asciidoc`, the `:esql-tests:` path was updated likewise.
* Replace `es-repo-dir` with `es-ref-dir`
* Move `:include-xpack: true` to the few files that use it, remove it from index.asciidoc
For float32, there is no compelling reason to use all the memory
required by default for HNSW. Using `int8_hnsw` provides a much saner
default when it comes to cost vs relevancy.
So, on all new indices that use `dense_vector` and want to index them
for fast search, we will default to `int8_hnsw`.
Users can still customize their parameters, or explicitly choose float32
`hnsw` if they so desire, as sketched below.
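A minimal sketch of opting back out of the new default (index and field names are illustrative):

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "index_options": {
          "type": "hnsw"
        }
      }
    }
  }
}
```

Omitting `index_options` entirely on a new index now yields `int8_hnsw` instead.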
* Text fields are stored by default with synthetic source
Synthetic source requires text fields to be stored or to have a keyword
sub-field that supports synthetic source. If there are no keyword
sub-fields, users currently have to explicitly set `store` to `true` or
get a validation exception. This is not the best experience. It is quite
likely that setting `store` to `true` is the correct thing to do, but
users still get an error and need to investigate it. With this change,
if the `store` setting is not specified in such a context, it is set to
`true` by default. Setting it explicitly to `false` still results in the
exception.
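A hedged sketch of the new behavior (index and field names are illustrative, and the synthetic-source syntax is an assumption):

```json
PUT my-index
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "message": { "type": "text" }
    }
  }
}
```

Since `message` has no keyword sub-field, `store` now defaults to `true` and the request succeeds; an explicit `"store": false` still produces the validation exception.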
Closes #97039
* [DOCS] `time_series_dimension` fields do not support `ignore_above`
There is existing validation for this combination of parameters but
it was not documented.
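For illustration (a hedged sketch; the field name is made up), this is the combination that the existing validation rejects:

```json
PUT tsdb-index
{
  "mappings": {
    "properties": {
      "host": {
        "type": "keyword",
        "time_series_dimension": true,
        "ignore_above": 256
      }
    }
  }
}
```

This mapping request fails because `ignore_above` is not supported on `time_series_dimension` fields.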
Closes #99044
* Remove maximum size constraint
* Add reasoning for constraints
This adds two new vector index types:
- `flat`
- `int8_flat`
Both store the vectors in a flat space, and search is brute-force over
the vectors in the index. For the regular `flat` index, this can be
considered syntactic sugar that allows `knn` queries without having to
build an HNSW graph.
For `int8_flat`, this allows float vectors to be stored in a flat
manner, but also automatically quantized.
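A minimal sketch of the quantized variant (index and field names are illustrative):

```json
PUT my-flat-index
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "index_options": {
          "type": "int8_flat"
        }
      }
    }
  }
}
```

`knn` searches against this field scan every vector rather than traversing a graph, with the vectors held in quantized `int8` form.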
Fixed a typo and a small grammatical error in the explanation of the `null_value` option.
(cherry picked from commit fa52f82838)
Co-authored-by: Nimrod Dolev <nimrodavid@gmail.com>