The test as set up assumes a single shard. Since the test uses so
few vectors with so few dimensions, the scoring statistics are quite
sensitive. CCS tests seem to allow more than one write shard (via more
than one cluster), so the detected similarity can vary wildly.
However, through empirical testing I found that the desired vector
consistently scores > 0.0034 while all the other vectors score < 0.001.
This commit adjusts the similarity threshold accordingly, which should
eliminate the test flakiness in CCS testing.
closes: https://github.com/elastic/elasticsearch/issues/109881
Wholesale fix of every `TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` in
`o.e.snapshots` and `o.e.repositories`, just pulling them up to the REST
layer (where they become API params), the test suite (where they become
`TEST_REQUEST_TIMEOUT`), or some other place where an explicit value is
available.
Relates #107984
The cluster-level dense vector stats return the total number of dense vector values globally, including replicas.
This commit fixes the total to only include the value count of the primary shards.
This aligns with the docs stats, which also report the number of primary documents when used in cluster stats.
The indices stats API still reports granular results for replicas and primaries, so no information is lost.
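The primary-only aggregation described above can be sketched as follows. This is an illustrative model, not the actual Elasticsearch classes: `ShardStats` and its fields are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class ShardStats:
    primary: bool
    dense_vector_value_count: int

def cluster_dense_vector_count(shards: list[ShardStats]) -> int:
    # Before the fix, replicas were counted too, inflating the global
    # total; summing only primaries matches the docs stats behaviour.
    return sum(s.dense_vector_value_count for s in shards if s.primary)

shards = [
    ShardStats(primary=True, dense_vector_value_count=100),
    ShardStats(primary=False, dense_vector_value_count=100),  # replica copy
    ShardStats(primary=True, dense_vector_value_count=50),
]
print(cluster_dense_vector_count(shards))  # 150, not 250
```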
This indicator depends on `HealthMetadata` being present in
the cluster state, which we cannot guarantee in this test,
potentially resulting in an `unknown` status.
This change adds a synthetic source mode for nested fields that recursively loads nested objects from stored fields and doc values.
The order of the sub-objects is preserved since they are indexed in separate Lucene documents.
This change also introduces the `store_array_source` option for nested fields. This option is disabled by default when synthetic source is used, but users can opt in to this behaviour.
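A minimal sketch of what such a mapping could look like, expressed as the JSON body one would send when creating an index. The field names (`comments`, `author`) are examples, and the exact placement of `store_array_source` is as described above, not verified against the final API:

```python
import json

# Hypothetical index mapping: a nested field opting in to
# `store_array_source` under synthetic _source.
mapping = {
    "mappings": {
        "_source": {"mode": "synthetic"},
        "properties": {
            "comments": {
                "type": "nested",
                # Disabled by default under synthetic source; opt in to
                # keep the original array source for this nested field.
                "store_array_source": True,
                "properties": {"author": {"type": "keyword"}},
            }
        },
    }
}
print(json.dumps(mapping, indent=2))
```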
* Add SparseVectorStats
* Update to use mappings in engine
* Update to be unique to primary shards
* Fix doc
* Fix null error in test
* Cleanup
* fix yaml
* remove comment
* add version to yaml
* Revert whitespace changes to stats doc
* fix yml test
* Checkstyle
* Fix NPE in test
* Update docs/changelog/108793.yaml
* Add link to sparse_vector field type in docs
* PR feedback
* Flesh out test a bit more
* PR feedback - alphabetize placement in docs
* Fix doc change
This adds a new quantization mechanism for HNSW and flat indices. Here
we add `int4` quantization via the `int4_hnsw` and `int4_flat` index
types. This quantization methodology further reduces the memory required
for fast HNSW: the memory required is 8x smaller than with regular
float32 values.
An 8x reduction means that 1M 1024-dimension vectors go from requiring
3.8GB to 477MB.
Recall stays steady; there is some reduction, but it is recoverable by
slightly oversampling and reranking. For example, over 500k CohereV3
vectors, only 5 extra vectors need to be gathered to achieve over 0.98
recall in a brute-force scenario.
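A quick back-of-the-envelope check of the memory figures above: float32 uses 4 bytes per dimension, while int4 packs two dimensions per byte. This counts raw vector payload only and ignores the per-vector correction terms and graph structure the real index also stores.

```python
n_vectors, dims = 1_000_000, 1024

float32_bytes = n_vectors * dims * 4   # 4 bytes per dimension
int4_bytes = n_vectors * dims // 2     # 4 bits per dimension

print(float32_bytes / 2**30)  # ~3.8 (GiB)
print(int4_bytes / 2**20)     # ~488 (MiB raw payload; close to the quoted 477MB)
print(float32_bytes // int4_bytes)  # 8
```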

Here we introduce a `cluster.logsdb.enabled` setting that controls activation of the new `logs` index mode in `logs@settings`. The setting defaults to `false`, preventing use of the new index mode in `logs@settings` by default. We also change `hostname` to `host.name` as the default field used for sorting (other than `@timestamp`) and include it in `logs@mappings`.
Since we are only indexing 3 docs, we need to ensure it's a single shard for score repeatability.
Additionally, this adds back all the flushes that were removed, to ensure we exercise the merging paths.
* Add priority to the query rule index, and merge rule updates into existing rulesets by priority
* Don't require double specification of rule_id
* Initial addition of get and delete API calls
* Add tests
* Update docs/changelog/109554.yaml
* D'oh! Removed commented out code
* Add test
* Update URI for requests and add test
* Ensure URIs are consistent for individual query rule API calls and update constant names to be more explicit that they are rules within a ruleset
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
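The "merge rule updates into existing rulesets by priority" step above can be sketched like this. The function name and rule shape are hypothetical illustrations, not the actual query rules implementation:

```python
def upsert_rule(ruleset: list[dict], rule: dict) -> list[dict]:
    # Replace any existing rule with the same rule_id (so the rule_id
    # does not need to be specified twice), then keep the ruleset
    # ordered by ascending priority.
    merged = [r for r in ruleset if r["rule_id"] != rule["rule_id"]]
    merged.append(rule)
    merged.sort(key=lambda r: r["priority"])
    return merged

rules = [{"rule_id": "a", "priority": 1}, {"rule_id": "b", "priority": 3}]
rules = upsert_rule(rules, {"rule_id": "c", "priority": 2})
print([r["rule_id"] for r in rules])  # ['a', 'c', 'b']
```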
Introduces new cluster settings that allow only a certain set of scripts in scripted metrics aggregations:
- `search.aggs.only_allowed_metric_scripts`, defaults to `false`
- `search.aggs.allowed_inline_metric_scripts`, defaults to an empty list
- `search.aggs.allowed_stored_metric_scripts`, defaults to an empty list
* Add dry run and force to json spec
* Rewording
Co-authored-by: Tim Grein <tim.grein@elastic.co>
---------
Co-authored-by: Tim Grein <tim.grein@elastic.co>
Updated LuceneDocument to take advantage of looking up feature values on existing features, selecting the max value when parsing multi-value sparse vectors.
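The max-value semantics described above can be sketched as follows: when the same feature name appears more than once while parsing, keep the maximum weight rather than the last one seen. The function name is illustrative:

```python
def merge_sparse_features(pairs: list[tuple[str, float]]) -> dict[str, float]:
    features: dict[str, float] = {}
    for name, value in pairs:
        existing = features.get(name)
        # Select the max for duplicate feature names instead of overwriting.
        features[name] = value if existing is None else max(existing, value)
    return features

print(merge_sparse_features([("impact", 1.5), ("impact", 3.0), ("hype", 0.2)]))
# {'impact': 3.0, 'hype': 0.2}
```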
After running the elastic/logs track with logs index mode enabled, I noticed that _source was still getting stored.
The issue was that index modes other than time_series weren't propagated to the IndexMetadata and IndexSettings classes. Additionally, the synthetic source defaults in SourceFieldMapper were geared towards the time series index mode only. This change addresses both.
This PR introduces a new index mode, `logs`, which enables usage of LogsDB in Elasticsearch.
As a result of adopting the `logs` index mode, default index sorting is applied using the hostname
and @timestamp fields. Users may still override the index sort settings.
By default, it also uses synthetic source and the same codecs used by TSDB.
Note: the logs index mode is a Tech Preview feature.
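For illustration, the index settings an opted-in index could carry, expressed as the JSON body of a create-index request. Setting names follow the description above (`index.mode`, sorting on `host.name` then `@timestamp`) but may differ from the final API, so treat this as a sketch:

```python
import json

# Hypothetical settings for an index using the new logs mode.
settings = {
    "settings": {
        "index": {
            "mode": "logs",
            # Default sort applied by the logs mode, per the description:
            "sort.field": ["host.name", "@timestamp"],
        }
    }
}
print(json.dumps(settings, indent=2))
```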