Commit graph

869 commits

Author SHA1 Message Date
Liam Thompson
4034615e29
[DOCS] Clarify copy_to behavior with strict dynamic mappings (#111408)
* [DOCS] Clarify copy_to behavior with strict dynamic mappings

* Add id

* De-verbosify

* Delete pesky comma

* More info about root and nest

* Fixes per review, clarify non-recursive explanation

* Skip tests for illustrative example

* Fix example syntax

* Fix typo
2024-08-01 14:37:17 +02:00
Felix Barnsteiner
3090438037
Add support for boolean dimensions (#111457)
Closes #111338
2024-07-31 23:00:32 +10:00
István Zoltán Szabó
1a5b008921
[DOCS] Clarifies semantic query behavior on sparse and dense vector fields (#111339)
* [DOCS] Clarifies semantic query behavior on sparse and dense vector fields.

* [DOCS] Adds a NOTE to the semantic query docs.
2024-07-26 16:53:38 +02:00
Carlos Delgado
ff3a77ca46
Clarify some semantic_text docs (#111329) 2024-07-26 16:45:29 +02:00
István Zoltán Szabó
22ead8d106
[DOCS] Documents automatic text chunking behavior for semantic text. (#111331) 2024-07-26 12:02:47 +02:00
Tommaso Teofili
9b86fd17aa
Document how to update dense vector field type (#111038) 2024-07-23 09:55:31 +02:00
Ioana Tagirta
e99aaad800
Document how to query for a specific feature within rank_features (#110749) 2024-07-11 16:19:14 +02:00
Oleksandr Kolomiiets
276ae121c2
Reflect latest changes in synthetic source documentation (#109501) 2024-07-04 09:48:04 -07:00
Carlos Delgado
30b32b6a46
semantic_text: Updated copy-to docs (#110350) 2024-07-03 10:18:40 +02:00
Kathleen DeRusso
7a1d532ffb
Pass over Sparse Vector docs for correctness (#110282)
* Remove legacy mentions of text expansion queries

* Add missing query_vector param to sparse_vector query docs

* Fix formatting errors in sparse vector query dsl doc

* Remove unnecessary test setup block
2024-07-02 13:37:25 -04:00
Felix Barnsteiner
cdbe092d90
Update docs now that keyword dimensions support ignore_above (#110385)
This is a follow-up from https://github.com/elastic/elasticsearch/pull/110337
2024-07-02 17:04:57 +02:00
Benjamin Trent
5add44d7d1
Adds new bit element_type for dense_vectors (#110059)
This commit adds `bit` vector support by adding `element_type: bit` for
vectors. This new element type works for indexed and non-indexed
vectors. Additionally, it works with `hnsw` and `flat` index types. No
quantization-based codec works with this element type, which is
consistent with `byte` vectors.

`bit` vectors accept up to `32768` dimensions in size and expect vectors
that are being indexed to be encoded either as a hexadecimal string or a
`byte[]` array where each element of the `byte` array represents `8`
bits of the vector.

`bit` vectors support script usage and regular query usage. When
indexed, all comparisons done are `xor` and `popcount` summations (aka,
hamming distance), and the scores are transformed and normalized given
the vector dimensions. Note, indexed bit vectors require `l2_norm` to be
the similarity.

For scripts, `l1norm` is the same as `hamming` distance and `l2norm` is
`sqrt(l1norm)`. `dotProduct` and `cosineSimilarity` are not supported.

Note, the dimensions expected by this element_type must always be
divisible by `8`, and the `byte[]` vectors provided for indexing must
have size `dim/8`, where each byte element represents `8` bits of
the vector.
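
The encoding and distance described above can be illustrated with a minimal Python sketch. It is not the Elasticsearch implementation: the helper names `pack_bits` and `hamming` are hypothetical, and the most-significant-bit-first packing order is an assumption made for the example.

```python
import math

def pack_bits(bits):
    """Pack 0/1 values into bytes, 8 bits per byte (MSB-first order is an assumption)."""
    if len(bits) % 8 != 0:
        raise ValueError("dimension count must be divisible by 8")
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | (bit & 1)
        out.append(byte)
    return bytes(out)

def hamming(a, b):
    """Hamming distance: popcount of the XOR of each pair of bytes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two 16-dimension bit vectors, each packed into dim/8 = 2 bytes.
v1 = pack_bits([1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0])
v2 = pack_bits([1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0])

hex_form = v1.hex()          # hexadecimal string form of the packed vector
l1 = hamming(v1, v2)         # per the message, l1norm equals hamming distance
l2 = math.sqrt(l1)           # and l2norm is sqrt(l1norm)
print(hex_form, l1, l2)
```

Here the two vectors differ in two bit positions, so the hamming distance is 2.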

closes: https://github.com/elastic/elasticsearch/issues/48322
2024-06-27 04:48:41 +10:00
Mayya Sharipova
5c87eef89d
[DOCS] Vectors with cosine automatically normalized (#110071)
PR #99445 introduced automatic normalization of dense vectors with
cosine similarity. This adds a note about this in the documentation.
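
As a rough illustration of why normalization pairs naturally with cosine similarity, the following Python sketch shows that the dot product of unit-length vectors equals the cosine similarity of the originals. The helper names are hypothetical and this is not how Elasticsearch implements it.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
# After normalization, a plain dot product gives the same value as cosine.
print(cosine(a, b), dot(normalize(a), normalize(b)))
```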

Relates to #99445
2024-06-22 22:32:25 +10:00
Oleksandr Kolomiiets
8bc5ecdc31
Support synthetic source together with ignore_malformed in histogram fields (#109882) 2024-06-20 09:09:45 -07:00
Oleksandr Kolomiiets
5440f178aa
Support synthetic source for geo_point when ignore_malformed is used (#109651) 2024-06-18 08:37:27 -07:00
Benjamin Trent
3aed0afb2b
Add new int4 quantization to dense_vector (#109317)
This adds a new quantization mechanism for HNSW and flat indices. Here
we add `int4` quantization via the `int4_hnsw` and `int4_flat` index
types. This quantization methodology further reduces the memory required
for fast HNSW, meaning that the memory required is 8x smaller than with
regular float32 values. 

An 8x reduction means that 1M 1024-dimension vectors go from requiring
3.8GB to 477MB.
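
The arithmetic can be checked with a short Python sketch. It counts raw vector storage only (no graph structures or per-vector quantization metadata, which is an assumption of this example), and the helper name `vector_bytes` is hypothetical. The quoted figures line up when sizes are expressed in binary units (GiB).

```python
def vector_bytes(num_vectors, dims, bits_per_dim):
    """Raw storage for the vector values alone, in bytes."""
    return num_vectors * dims * bits_per_dim // 8

GIB = 1024 ** 3

float32_bytes = vector_bytes(1_000_000, 1024, 32)  # 4 bytes per dimension
int4_bytes = vector_bytes(1_000_000, 1024, 4)      # half a byte per dimension

print(float32_bytes / GIB)         # about 3.8 GiB
print(int4_bytes / GIB)            # about 0.477 GiB
print(float32_bytes / int4_bytes)  # 8x reduction
```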

Recall stays largely steady; there is some reduction, which is
recoverable via slight oversampling and reranking. For example, over
500k CohereV3 vectors, only 5 extra vectors need to be gathered
to achieve over 0.98 recall in a brute-force scenario.

[Image: recall]
2024-06-18 00:15:43 +10:00
Carlos Delgado
d10dfb4ac5
Add limitations section to semantic_text field type docs (#109666) 2024-06-13 15:19:00 +02:00
Oleksandr Kolomiiets
c847235ed0
Support synthetic source for scaled_float and unsigned_long when ignore_malformed is used (#109506) 2024-06-12 11:05:23 -07:00
Benjamin Trent
29288d6590
Merge remote-tracking branch 'upstream/main' into lucene_snapshot_9_11 2024-06-11 06:54:23 -04:00
Carlos Delgado
d975997a3a
Add semantic-text warning about inference endpoints removal (#109561) 2024-06-11 18:33:25 +10:00
Oleksandr Kolomiiets
a9f31bd2aa
Support synthetic source for date fields when ignore_malformed is used (#109410) 2024-06-10 10:26:31 -07:00
john-wagster
dd83b5b8d0
Multivalue Sparse Vector Support (#109007)
Updated LuceneDocument to take advantage of looking up feature values on existing features and selecting the max when parsing multi-value sparse vectors.
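
The "select the max" behavior can be sketched in Python as follows. This only illustrates the merge rule the message describes; the function name `merge_sparse_vectors` and the dict representation are assumptions, not the `LuceneDocument` code.

```python
def merge_sparse_vectors(values):
    """Merge several feature->weight maps, keeping the largest weight per feature."""
    merged = {}
    for vector in values:
        for feature, weight in vector.items():
            # Keep the existing value unless the new one is larger.
            if feature not in merged or weight > merged[feature]:
                merged[feature] = weight
    return merged

multi_value = [
    {"cat": 1.0, "dog": 0.5},
    {"cat": 0.2, "bird": 2.0},
]
print(merge_sparse_vectors(multi_value))
```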
2024-06-04 12:50:58 -04:00
István Zoltán Szabó
95ce898436
[DOCS] Adds docs to semantic text (#108311)
Co-authored-by: Carlos Delgado <6339205+carlosdelest@users.noreply.github.com>
Co-authored-by: Mike Pellegrini <mike.pellegrini@elastic.co>
Co-authored-by: Kathleen DeRusso <kathleen.derusso@elastic.co>
2024-05-31 16:56:07 +02:00
Oleksandr Kolomiiets
42f4294a86
Enable fallback synthetic source for token_count (#109044) 2024-05-27 10:22:59 -07:00
Oleksandr Kolomiiets
eea996c172
Add synthetic source support for geo_shape via fallback implementation (#108881)
This PR enables geo_shape mapper to use fallback synthetic source infrastructure and as such adds synthetic source support for this field type.
2024-05-24 10:19:22 -07:00
Oleksandr Kolomiiets
8cfdbcc9a4
Documentation for ignore_malformed support with synthetic source for aggregate_metric_double (#108983) 2024-05-24 09:49:38 -07:00
Kathleen DeRusso
7f35f1bed0
Add sparse_vector query (#108254)
---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
2024-05-22 17:06:57 -04:00
Oleksandr Kolomiiets
91d502cec6
Add generic fallback implementation for synthetic source (#108222)
This PR uses the infrastructure from #107567 to provide a fallback implementation of synthetic source for field mappers that don't support it natively. In that case, the source of such a field is stored as is in a separate stored field.
2024-05-21 11:30:30 -07:00
Oleksandr Kolomiiets
a454ac1987
Do not produce infinity values in synthetic source for range fields (#108699) 2024-05-17 09:19:14 -07:00
Thomas Neirynck
6020bc7e06
[Docs] Add warning kibana has incomplete support for nested fields (#107971) 2024-05-13 08:42:21 -04:00
Oleksandr Kolomiiets
c3d45b99f2
Document binary field defaults in TSDB indices (#108046) 2024-04-30 08:02:16 -07:00
Benjamin Trent
67748cf616
Adding docs about scaled_float saturation with long values (#107966) 2024-04-30 08:25:37 -04:00
eyalkoren
ee262954ee
Adding aggregations support for the _ignored field (#101373)
Enables aggregations on the _ignored metadata field replacing the stored field
with doc values.
2024-04-29 16:41:34 +02:00
Oleksandr Kolomiiets
e1d902d33b
Implement synthetic source support for annotated text field (#107735)
This PR adds synthetic source support for annotated_text fields. Existing implementation for text is reused including test infrastructure so the majority of the change is moving and making things accessible.

Contributes to #106460, #78744.
2024-04-25 10:31:27 -07:00
Oleksandr Kolomiiets
cde894a5ce
Implement synthetic source support for range fields (#107081)
* Implement synthetic source support for range fields

This PR adds basic synthetic source support for range fields. The
synthetic source produced has the following notable properties:
* Ranges are always normalized to be inclusive on both ends (this is how
they are stored).
* Original order of ranges is not preserved.
* Date ranges are always expressed in epoch millis; the format is not
preserved.
* IP ranges are always expressed as a range of IPs, even if they were
originally provided as a CIDR.
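
Two of these properties can be illustrated with a small Python sketch. The helper names are hypothetical, the integer case assumes both bounds are present, and the code is only a model of the described normalization, not the mapper implementation.

```python
import ipaddress

def normalize_int_range(r):
    """Rewrite an integer range so both ends are inclusive (gte/lte).

    Assumes each end is given as either an inclusive (gte/lte)
    or an exclusive (gt/lt) bound.
    """
    lower = r["gte"] if "gte" in r else r["gt"] + 1
    upper = r["lte"] if "lte" in r else r["lt"] - 1
    return {"gte": lower, "lte": upper}

def cidr_to_ip_range(cidr):
    """Express a CIDR block as an inclusive range of IP addresses."""
    network = ipaddress.ip_network(cidr)
    return {"gte": str(network.network_address), "lte": str(network.broadcast_address)}

print(normalize_int_range({"gt": 10, "lt": 20}))
print(cidr_to_ip_range("10.0.0.0/30"))
```

A reader of the synthetic source therefore sees an equivalent range, but not necessarily the form that was originally indexed.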

This PR only implements retrieval of data for source reconstruction from
 doc values.
2024-04-24 11:32:20 -07:00
Oleksandr Kolomiiets
8ed92db288
Add synthetic source support for binary fields (#107549)
Add synthetic source support for binary fields
2024-04-22 10:06:39 -07:00
Liam Thompson
33a71e3289
[DOCS] Refactor book-scoped variables in docs/reference/index.asciidoc (#107413)
* Remove `es-test-dir` book-scoped variable

* Remove `plugins-examples-dir` book-scoped variable

* Remove `:dependencies-dir:` and `:xes-repo-dir:` book-scoped variables

- In `index.asciidoc`, two variables (`:dependencies-dir:` and `:xes-repo-dir:`) were removed.
- In `sql/index.asciidoc`, the `:sql-tests:` path was updated to a fuller path.
- In `esql/index.asciidoc`, the `:esql-tests:` path was updated likewise.

* Replace `es-repo-dir` with `es-ref-dir`

* Move `:include-xpack: true` to the few files that use it, remove from index.asciidoc
2024-04-17 14:37:07 +02:00
Carlos Delgado
f8e516eb9c
Update sparse_vector docs on index version availability (#107315) 2024-04-10 17:41:42 +02:00
Benjamin Trent
89bf4b33e8
Make int8_hnsw our default index for new dense-vector fields (#106836)
For float32, there is no compelling reason to use all the memory
required by default for HNSW. Using `int8_hnsw` provides a much saner
default when it comes to cost vs relevancy. 

So, on all new indices that use `dense_vector` and want to index them
for fast search, we will default to `int8_hnsw`. 

Users can still customize their parameters, or prefer `hnsw` over
float32 if they so desire.
2024-04-01 08:23:32 -04:00
Oleksandr Kolomiiets
9e6b893896
Text fields are stored by default in TSDB indices (#106338)
* Text fields are stored by default with synthetic source

Synthetic source requires text fields to be stored or to have a keyword
sub-field that supports synthetic source. If there are no keyword fields,
users currently have to explicitly set `store` to `true` or get a
validation exception. This is not the best experience. It is quite
likely that setting `store` to `true` is the correct thing to do, but
users still get an error and need to investigate it. With this change, if
the `store` setting is not specified in such a context, it will be set to
`true` by default. Setting it explicitly to `false` results in the
exception.
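
The defaulting rule can be modeled with a short Python sketch. The function `resolve_text_store` and its parameters are hypothetical names chosen for this example; the real validation lives in the text field mapper and may differ in detail.

```python
def resolve_text_store(store, synthetic_source, has_keyword_subfield):
    """Return the effective `store` value for a text field.

    `store` is None when the setting was not specified in the mapping.
    """
    # Synthetic source needs the text to be recoverable from somewhere.
    needs_store = synthetic_source and not has_keyword_subfield
    if store is None:
        # New behavior: default to true only in the context that requires it.
        return needs_store
    if needs_store and store is False:
        raise ValueError("text field must be stored or have a keyword sub-field")
    return store

print(resolve_text_store(None, synthetic_source=True, has_keyword_subfield=False))
print(resolve_text_store(None, synthetic_source=False, has_keyword_subfield=False))
```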

Closes #97039
2024-03-26 13:37:19 -07:00
István Zoltán Szabó
5afc59b07e
[DOCS] Creates a semantic_text field type docs page. (#106528) 2024-03-20 11:05:52 +01:00
Oleksandr Kolomiiets
28f3977a2e
[DOCS] time_series_dimension fields do not support ignore_above (#106203)
* [DOCS] `time_series_dimension` fields do not support `ignore_above`

There is existing validation for this combination of parameters but
it was not documented.

Closes #99044

* Remove maximum size constraint

* Add reasoning for constraints
2024-03-13 08:40:16 -07:00
Benjamin Trent
61b3d98227
Add note about optional times and epochs (#105786) 2024-03-05 08:44:03 -05:00
Liam Thompson
4bea4a7a10
[Docs] Tiny format fix (#105820) 2024-02-29 09:32:42 +01:00
Felix Barnsteiner
dee0be589c
Flatten object mappings when subobjects is false (#103542) 2024-02-22 11:43:12 +01:00
Andrew Wilkins
5f90978296
Add unmatch_mapping_type, and support array of types (#103171)
Add an `unmatch_mapping_type` condition to dynamic templates (supporting
one or more types), and add support for specifying a list of types to
`match_mapping_type`.

Closes https://github.com/elastic/elasticsearch/issues/102795
Closes https://github.com/elastic/elasticsearch/issues/102807
2024-02-09 10:42:26 -05:00
Benjamin Trent
43362d5de5
Add new int8_flat and flat vector index types (#104872)
This adds two new vector index types:
- flat
- int8_flat

Both store the vectors in a flat space and search is brute-force over
the vectors in the index.   For the regular `flat` index, this can be
considered syntactic sugar that allows `knn` queries without having to
put indices within HNSW. 

For `int8_flat`, this allows float vectors to be stored in a flat
manner, but also automatically quantized.
2024-02-05 12:56:13 -05:00
Felix Barnsteiner
f642b8a3aa
Add setting to ignore dynamic fields when field limit is reached (#96235)
Adds a new `index.mapping.total_fields.ignore_dynamic_beyond_limit`
index setting.

When set to `true`, new fields are added to the mapping as long as the
field limit (`index.mapping.total_fields.limit`) is not exceeded. Fields
that would exceed the limit are not added to the mapping, similar to
`dynamic: false`.  Ignored fields are added to the `_ignored` metadata
field.
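
A simplified Python model of this behavior is sketched below. The function `add_dynamic_fields` is hypothetical, the mapping is reduced to a list of field names, and raising an error when the setting is `false` is this example's stand-in for the document being rejected.

```python
def add_dynamic_fields(mapping, doc_fields, limit, ignore_dynamic_beyond_limit):
    """Add new fields to `mapping` up to `limit`; return the ignored field names."""
    ignored = []
    for field in doc_fields:
        if field in mapping:
            continue  # already mapped, does not count against the limit again
        if len(mapping) < limit:
            mapping.append(field)
        elif ignore_dynamic_beyond_limit:
            # Not mapped; in the real feature the name is recorded in `_ignored`.
            ignored.append(field)
        else:
            raise ValueError("total fields limit exceeded")
    return ignored

mapping = ["a", "b"]
ignored = add_dynamic_fields(mapping, ["b", "c", "d"], limit=3,
                             ignore_dynamic_beyond_limit=True)
print(mapping, ignored)
```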

Relates to https://github.com/elastic/elasticsearch/issues/89911

To make this easier to review, this is split into the following PRs:
- [x] https://github.com/elastic/elasticsearch/pull/102915
- [x] https://github.com/elastic/elasticsearch/pull/102936
- [x] https://github.com/elastic/elasticsearch/pull/104769

Related but not a prerequisite:
- [ ] https://github.com/elastic/elasticsearch/pull/102885
2024-02-02 05:53:52 -05:00
Abdon Pijpelink
1612ad1d65
fix typo (#103149) (#103381)
Fixed a typo and a small grammatical error in the explanation of the `null_value` option

(cherry picked from commit fa52f82838)

Co-authored-by: Nimrod Dolev <nimrodavid@gmail.com>
2023-12-13 07:17:00 -05:00
Chris Hegarty
ff22c90735
Merge branch 'main' into lucene_snapshot_9_9 2023-12-02 09:42:22 +00:00