This allows the configuration to specify a `-1` dim in order to read files
that store the `dim` in the file itself.
Additionally, it allows setting numQuerys to `0` to skip the search phase
easily.
The RunningSnapshotIT upgrade test adds shutdown markers to all nodes
and removes them once all nodes are upgraded. If an index gets created
in a mixed cluster, for example by ILM or deprecation messages, the
index cannot be allocated because all nodes are shutting down. Since the
cluster ready check between node upgrades expects a yellow cluster, the
unassigned index prevents the ready check from succeeding, and it eventually
times out. This PR fixes it by removing the shutdown marker from the first
upgraded node so that it can host new indices.
Resolves: #129644
Resolves: #129645
Resolves: #129646
This PR introduces several fixes to various IT tests, related to the use and misuse of the version identifier for the start cluster:
- wherever we can, we replace the use of versions in test code with features
- where we can't, we make sure we use the actual stack version (the one provided by -Dtests.bwc.main.version and not the bogus "0.0.0" version string)
- when requesting the cluster version, we make sure we use the "unresolved" version identifier (the value of the tests.old_cluster_version system property, e.g. 0.0.0) so we resolve the right distribution

These changes enable the tests to be used in BC upgrade tests (and potentially in serverless upgrade tests too, where they would have also failed).
Relates to ES-12010
Precedes #128614, #128823 and #128983
All we care about is whether reindex is true or false. We shouldn't worry
about force merge, because if reindex is true we will create the
directory, and if it's false we won't.
This adds some testing tools for verifying vector recall and latency
directly without having to spin up an entire ES node and running a rally
track.
It's pretty barebones and takes inspiration from lucene-util, but I
wanted access to our own formats and tooling to make our lives easier.
Here is an example config file. This will build the initial index, run
queries at num_candidates: 50, then again at num_candidates 100 (without
reindexing, and re-using the cached nearest neighbors).
```
[
  {
    "doc_vectors" : "path",
    "query_vectors" : "path",
    "num_docs" : 10000,
    "num_queries" : 10,
    "index_type" : "hnsw",
    "num_candidates" : 50,
    "k" : 10,
    "hnsw_m" : 16,
    "hnsw_ef_construction" : 200,
    "index_threads" : 4,
    "reindex" : true,
    "force_merge" : false,
    "vector_space" : "maximum_inner_product",
    "dimensions" : 768
  },
  {
    "doc_vectors" : "path",
    "query_vectors" : "path",
    "num_docs" : 10000,
    "num_queries" : 10,
    "index_type" : "hnsw",
    "num_candidates" : 100,
    "k" : 10,
    "hnsw_m" : 16,
    "hnsw_ef_construction" : 200,
    "vector_space" : "maximum_inner_product",
    "dimensions" : 768
  }
]
```
To execute:
```
./gradlew :qa:vector:checkVec --args="/Path/to/knn_tester_config.json"
```
Calling `./gradlew :qa:vector:checkVecHelp` gives some guidance on how
to use it, additionally providing a way to run it via java directly
(useful to bypass gradlew guff).
This change enables security in a number of tsdb and logsdb integration tests: a number of Java/YAML REST tests in the logsdb module, as well as the logsdb and tsdb rolling upgrade tests.
A recent bug (#128050) wouldn't have happened if logsdb rolling upgrade tests ran with security enabled.
Before this PR, sorting on integer, short and byte field types used
SortField.Type.LONG. This made sort optimization impossible for these
field types.
This PR uses SortField.Type.INT for integer, short and byte fields. This
enables sort optimization.
There are several caveats with changing the sort type that are addressed:
- Before, mixed sort on integer and long fields was automatically
  supported, as both field types used SortField.Type.LONG. Now, when
  merging results from different shards, we need to convert the sort to LONG
  and the results to long values.
- Similarly for collapsing when there are mixed INT and LONG sort types.
- Index sorting. Similarly, before, index sorting on an integer field used
  SortField.Type.LONG. This sort type is stored in the index writer config
  on disk and can't be modified. Now, when providing sortField() for index
  sorting, we need to account for the index version: for older indices
  return a sort with SortField.Type.LONG, and for new indices return
  SortField.Type.INT.
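
The mixed INT and LONG merge caveat can be pictured with a small standalone sketch. This is not the code from this PR: `MixedSortMergeSketch`, `widen` and `mergeAscending` are hypothetical names, and the sketch only assumes that sort values from INT-sorted shards arrive as `Integer` and those from LONG-sorted shards as `Long`. Everything is widened to `long` before comparing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: merging sort values from shards that sorted with
// different numeric sort types (INT on one shard, LONG on another).
final class MixedSortMergeSketch {

    // Widen any numeric sort value to long so that values are comparable.
    static long widen(Number sortValue) {
        return sortValue.longValue();
    }

    // Merge two shard result lists into one ascending list of long values.
    static List<Long> mergeAscending(List<? extends Number> intShard, List<? extends Number> longShard) {
        List<Long> merged = new ArrayList<>();
        for (Number n : intShard) {
            merged.add(widen(n));
        }
        for (Number n : longShard) {
            merged.add(widen(n));
        }
        merged.sort(Comparator.naturalOrder());
        return merged;
    }

    public static void main(String[] args) {
        List<Integer> fromIntShard = List.of(1, 7, Integer.MAX_VALUE);
        List<Long> fromLongShard = List.of(3L, 5_000_000_000L);
        System.out.println(mergeAscending(fromIntShard, fromLongShard));
    }
}
```

Without the widening step, comparing an `Integer` with a `Long` directly would fail or give inconsistent ordering, which is why the merged sort is converted to LONG.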
---
There is only 1 change that may be considered not backwards compatible:
Before, if an integer field was [missing a
value](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/sort-search-results#_missing_values),
its sort value was returned as Long.MAX_VALUE in a search response. With
this change, its sort value will be returned as Integer.MAX_VALUE. But I think
this change is ok, as our documentation doesn't specify which value will
be returned; it just says it will be sorted last.
---
Also closes #127965 (as the same type validation is added for collapse
queries)
Restructures docker files for docker distributions
- Put Dockerfiles in distro-specific folders, keeping the "Dockerfile" naming convention
- Allows better IDE support
- Allows easier renovate integration
- Explicitly set base image in Dockerfile
- Simplify renovate configuration
- Clean up DockerBase file to not contain ess fips base image information.
  This now lives in the Dockerfile content directly
* Workaround docker test issue
* Fix labels for fips image
* [Test] Rework detecting elasticsearch process in docker tests
This tweaks detecting the elasticsearch process id by using jps instead of ps. Using ps has been problematic in the past, as its output exceeded the available COLUMN sizes because the es command line invocation keeps getting longer and longer
* Remove few muted tests
* Reuse ps for detecting processes but use pipe to find the right one
jps doesn't work well with different users
* Tweak java command running lookup to work with wolfi
* Cleanup changes
* [CI] Auto commit changes from spotless
---------
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
This test occasionally fails on version 8.19 clusters when we are waiting
for system index migration to finish. This changes the wait time from the
10s default to 30s to account for the occasional super slow cluster in tests.
Closes #127245
This method would default to starting a new node when the cluster was
empty. This is pretty trappy as `getClient()` (or things like
`getMaster()` that depend on `getClient()`) don't look at all like
something that would start a new node.
In any case, the intention of tests is much clearer when they explicitly
define a cluster configuration.
This change enhances the dense_vector section of the Nodes stats and Index stats APIs so that they report the desired size of off-heap memory for all indexed vectors. The dense_vector section of the Cluster stats API remains unchanged.
The retrieval mechanism and structure of the new stats are the same across the three stats APIs, but more fine-grained information is disclosed when moving from Cluster -> Node -> Index API.
For Node stats, we aggregate the total byte sizes for all vectors, categorised by the data type. For example:

    "dense_vector" : {
      "value_count" : 5,
      "off_heap" : {
        "total_size_in_bytes" : 27,
        "total_veb_size_in_bytes" : 3,
        "total_vec_size_in_bytes" : 23,
        "total_veq_size_in_bytes" : 0,
        "total_vex_size_in_bytes" : 1
      }
    }

For Index stats, the output is the same as for Node stats, with an additional per-field breakdown. For example:

    "dense_vector" : {
      "value_count" : 5,
      "off_heap" : {
        "total_size_in_bytes" : 27,
        "total_veb_size_in_bytes" : 3,
        "total_vec_size_in_bytes" : 23,
        "total_veq_size_in_bytes" : 0,
        "total_vex_size_in_bytes" : 1,
        "fielddata" : {
          "bar" : {
            "veb_size_in_bytes" : 3,
            "vec_size_in_bytes" : 14,
            "vex_size_in_bytes" : 1
          },
          "foo" : {
            "vec_size_in_bytes" : 9
          }
        }
      }
    }

The implementation accesses the actual statistics through reflection. This will be removed completely once Lucene exposes them, which is expected in Lucene 10.3.
Now that entitlements are always used, there is no need to run tests
with security manager (a future enhancement will run tests with
entitlements). This commit removes setting up security manager from
tests.
With this PR we restrict the paths we allow access to, forbidding plugins from specifying or requesting entitlements for reading or writing to specific protected directories.
I added this validation to EntitlementInitialization, as I wanted to fail fast and this is the earliest occurrence where we have all we need: PathLookup to resolve relative paths, policies (for plugins, server, agents) and the Paths for the specific directories we want to protect.
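
As a rough illustration of this kind of check (a sketch under stated assumptions, not the validation added to EntitlementInitialization): `ProtectedPathSketch` and `isForbidden` are hypothetical names, the directories used are made up, and the sketch assumes a requested path is rejected when it resolves to a location inside a protected directory.

```java
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch: reject requested paths that resolve into a protected directory.
final class ProtectedPathSketch {

    // Resolve a (possibly relative) requested path against a base directory,
    // normalize it, and check it against each protected directory.
    static boolean isForbidden(Path baseDir, String requested, List<Path> protectedDirs) {
        Path resolved = baseDir.resolve(requested).toAbsolutePath().normalize();
        for (Path dir : protectedDirs) {
            // Path.startsWith compares whole name elements, so "/es/configuration"
            // does not count as being inside "/es/config".
            if (resolved.startsWith(dir.toAbsolutePath().normalize())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Path pluginDir = Path.of("/es/plugins/example");
        List<Path> protectedDirs = List.of(Path.of("/es/config"), Path.of("/es/modules"));
        System.out.println(isForbidden(pluginDir, "../../config/elasticsearch.yml", protectedDirs));
        System.out.println(isForbidden(pluginDir, "data/cache", protectedDirs));
    }
}
```

Normalizing before comparing matters here: a relative path with `..` segments can otherwise appear harmless while still pointing into a protected directory.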
Relates to ES-10918
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.
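
The idea can be sketched in isolation as follows. This is not the actual implementation: `ChannelTaskTrackerSketch`, `register` and `onChannelClosed` are hypothetical names, and cancelling a late registration immediately (rather than handling it some other way) is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: once the registrations for a channel have been drained
// for cancellation, no further task is registered against that channel.
final class ChannelTaskTrackerSketch {
    private final List<Runnable> cancelActions = new ArrayList<>();
    private boolean drained = false;

    // Returns true if the task was registered, false if the channel was already
    // closed and the task was cancelled right away instead.
    boolean register(Runnable cancelAction) {
        synchronized (this) {
            if (drained == false) {
                cancelActions.add(cancelAction);
                return true;
            }
        }
        // Too late to register: cancel immediately, outside the lock.
        cancelAction.run();
        return false;
    }

    // Called when the channel closes: drain all registrations and cancel them.
    void onChannelClosed() {
        final List<Runnable> toCancel;
        synchronized (this) {
            drained = true;
            toCancel = new ArrayList<>(cancelActions);
            cancelActions.clear();
        }
        toCancel.forEach(Runnable::run);
    }
}
```

Setting the flag and draining the list under the same lock means a concurrent `register` call either lands in the drained list or sees the flag, so no task is left registered against a closed channel.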
Closes #88201
Update the PerFieldFormatSupplier so that new standard indices use the
Lucene101PostingsFormat instead of the current default ES812PostingsFormat.
Currently, use of the new codec is gated behind a feature flag.
* [main] Move system indices migration to migrate plugin
It seems the best way to fix #122949 is to use the existing data stream reindex API. However, this API is located in the migrate x-pack plugin. This commit moves the system indices migration logic (REST handlers, transport actions, and task) to the migrate plugin.
Port of #123551
* [CI] Auto commit changes from spotless
* Fix compilation
* Fix tests
* Fix test
---------
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false.
This action solely needs the cluster state, so it can run on any node.
Since this is the last class/action that extends the `ClusterInfo`
abstract classes, we remove those classes too as they're not required
anymore.
Relates #101805
This action solely needs the cluster state, so it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.
Relates #101805
This PR replaces the parsing and formatting of SecurityManager policies with the parsing and formatting of Entitlements policy during plugin installation.
Relates to ES-10923
This change moves the query phase to a single roundtrip per node, just like can_match or field_caps already work.
As a result of executing multiple shard queries from a single request, we can also partially reduce each node's query results on the data node side before responding to the coordinating node.
As a result this change significantly reduces the impact of network latencies on the end-to-end query performance, reduces the amount of work done (memory and cpu) on the coordinating node and the network traffic by factors of up to the number of shards per data node!
Benchmarking shows up to orders of magnitude improvements in heap and network traffic dimensions in querying across a larger number of shards.
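
A toy sketch of the partial reduce, assuming a plain top-k by score (the real query phase reduces far more than scores, and `PerNodeReduceSketch` and `topK` are hypothetical names): each data node merges the hits from its own shards and sends back at most k of them, and the coordinating node then merges one result per node instead of one per shard.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: reduce shard results on the data node before the
// coordinating node performs the final reduce.
final class PerNodeReduceSketch {

    // Merge several score lists and keep the k highest scores.
    static List<Double> topK(List<List<Double>> lists, int k) {
        List<Double> all = new ArrayList<>();
        lists.forEach(all::addAll);
        all.sort(Comparator.reverseOrder());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        int k = 2;
        // Each node holds two shards; each inner list is one shard's hits.
        List<Double> node1 = topK(List.of(List.of(9.0, 4.0), List.of(8.0, 1.0)), k);
        List<Double> node2 = topK(List.of(List.of(7.0, 6.0), List.of(5.0, 3.0)), k);
        // The coordinating node only sees k hits per node.
        System.out.println(topK(List.of(node1, node2), k));
    }
}
```

Because top-k of per-node top-k lists equals the global top-k, the coordinating node gets the same answer while receiving and merging fewer hits.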
This PR adds project-id to both SnapshotsInProgress and Snapshot so that
they are aware of projects and ready to handle snapshots from multiple
projects.
Relates: ES-10224
Remove `-da:org.elasticsearch.index.mapper.DocumentMapper` and `-da:org.elasticsearch.index.mapper.MapperService` from mixed cluster/bwc cluster setups, given that #122606 has now been backported to the 8.18 branch.
This makes using usesDefaultDistribution in our test setup more explicit by requiring a reason why it's needed.
This is helpful as part of revisiting the need for all those usages in our code base.