Fixes two bugs in _resolve/cluster.
First, the code that detects older cluster versions and falls back to the _resolve/index
endpoint was using an outdated string match for error detection. That has been adjusted.
Second, upon security exceptions, the _resolve/cluster endpoint was marking clusters as connected: true,
under the assumption that all security exceptions related to cross-cluster calls and remote index access were
coming from the remote cluster, but that is not always the case. Some cross-cluster security violations can
be detected on the local querying cluster after the remoteClient.execute call is issued but before the transport
layer actually sends the request remotely. We now mark the connected status as false for all
ElasticsearchSecurityException cases. End-user docs have been updated with this information.
* fix: do not let `_resolve/cluster` hang if remote is unresponsive
Previously, `_resolve/cluster` would wait for a response from a remote
as part of the connection strategy. If the remote was unresponsive,
this API would wait until `netty` terminated the connection with a
handshake exception. The threshold for terminating the connection is
`10s`, meaning the API would wait `10s` before determining that the
remote is unresponsive. This strategy is now replaced with a fail-fast
approach: a response is sent back to the user immediately rather than
waiting for the connection to be terminated.
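For illustration, here is a minimal sketch of how a client might check the per-cluster connected flag returned by this endpoint, using the low-level Java REST client. The remote alias `my_remote` and the local address are assumptions for the example, not part of the change itself.

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ResolveClusterExample {
    public static void main(String[] args) throws Exception {
        // Assumes a local cluster on localhost:9200 with a remote cluster named "my_remote" configured.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("GET", "/_resolve/cluster/my_remote:*");
            Response response = client.performRequest(request);
            // The body reports one entry per cluster; with these fixes, a security exception or an
            // unresponsive remote is reported promptly with "connected": false instead of hanging
            // or being misreported as connected.
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```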
* Update docs/changelog/119516.yaml
This updates the Gradle wrapper to 8.12.
We addressed deprecation warnings caused by the update, including:
- Fix change in the TestOutputEvent API
- Fix deprecation in Groovy syntax
- Use the latest ospackage plugin containing our fix
- Remove project usages at execution time
- Fix deprecated project references in repository-old-versions
(cherry picked from commit ba61f8c7f7)
This test would fail to see the expected response headers if the task
timed out before it started executing, which could happen very rarely.
It's also not a very good test because it never actually executed any of
the paths involving acking.
This commit fixes the rare failure and tightens up the assertions to
verify that the right thread context is seen while handling the end of
the acking process, and that the acking process always completes.
Closes #118914
* Handle all exceptions in data nodes can match (#117469)
During the can match phase, prior to the query phase, we may get exceptions
that are returned to the coordinating node and handled gracefully, as if the
shard had returned canMatch=true.
During the query phase, we perform an additional rewrite and can match phase
to eventually shortcut the query phase for the shard. That needs to handle
exceptions as well. Currently, an exception there causes shard failures, whereas
we should rather go ahead and execute the query on the shard.
Instead of adding another try/catch in the consumers' code, this commit adds exception handling to the method itself, so that it can no longer throw exceptions and similar mistakes cannot be made in the future.
At the same time, this commit makes the can match method more easily testable without requiring a full-blown SearchService instance.
Closes #104994
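A minimal, self-contained sketch of the defensive pattern described above (the names `CanMatchHelper` and `canMatchSafely` are hypothetical, not the actual SearchService API): rather than relying on every caller to wrap the call in try/catch, the method itself swallows failures and reports the shard as potentially matching, so the query phase runs on the shard and surfaces any real error there.

```java
import java.util.function.Supplier;

public final class CanMatchHelper {

    /** Result of the can-match check: whether the shard could match, and a short explanation. */
    public record CanMatchResult(boolean canMatch, String reason) {}

    /**
     * Runs the given rewrite/can-match computation. Any exception is caught here rather than
     * propagated, and the shard is reported as matching so that the query phase decides.
     */
    public static CanMatchResult canMatchSafely(Supplier<Boolean> rewriteAndCheck) {
        try {
            return new CanMatchResult(rewriteAndCheck.get(), "rewrite succeeded");
        } catch (Exception e) {
            // Do not fail the shard during can-match; fall back to executing the query phase.
            return new CanMatchResult(true, "can-match failed, deferring to query phase: " + e);
        }
    }
}
```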
* fix compile
* Don't skip shards in coord rewrite if timestamp is an alias (#117271)
The coordinator rewrite has logic to skip indices if the provided date range
filter is not within the min and max range of all of their shards. This mechanism
is enabled for the event.ingested and @timestamp fields, against searchable snapshots.
We have basic checks that such fields need to be of date field type, yet if they
are defined as an alias of a date field, their range will be empty, which indicates
that the shards are empty; the coordinator rewrite logic then resolves the alias and
ends up skipping shards that may have matching docs.
This commit adds an explicit check that declares the range UNKNOWN instead of EMPTY
in these circumstances. The same check is also performed in the coordinator rewrite logic,
so that shards are no longer skipped by mistake.
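A small illustrative sketch of the distinction (the types and names below are simplified stand-ins, not the actual IndexLongFieldRange API): an alias of a date field must yield an unknown range, which the skip logic cannot act on, rather than an empty range, which it would interpret as "no documents".

```java
public final class TimestampRangeExample {

    /** Simplified stand-in for the shard-level field range used by the coordinator rewrite. */
    enum FieldRange { EMPTY, UNKNOWN, PRESENT }

    /**
     * EMPTY means "the field has no values, the shard can be skipped";
     * UNKNOWN means "we cannot tell, never skip based on this range".
     */
    static FieldRange rangeFor(boolean isConcreteDateField, boolean hasValues) {
        if (isConcreteDateField == false) {
            // e.g. @timestamp mapped as an alias of a date field: report UNKNOWN, not EMPTY.
            return FieldRange.UNKNOWN;
        }
        return hasValues ? FieldRange.PRESENT : FieldRange.EMPTY;
    }

    static boolean canSkipShard(FieldRange range) {
        // Only an explicit EMPTY range allows skipping; UNKNOWN must keep the shard in the search.
        return range == FieldRange.EMPTY;
    }
}
```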
* fix compile
Relates to #115811, but applies to resize requests.
The index.mode, source.mode, and index.sort.* settings cannot be
modified during resize, as this may lead to data corruption or issues
retrieving _source. This change enforces a restriction on modifying
these settings during resize. While a fine-grained check could allow
equivalent settings, it seems simpler and safer to reject resize
requests if any of these settings are specified.
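As an illustration (index names are hypothetical), a shrink request that tries to set one of these settings on the target index is now rejected up front; the same request without the `index.sort.*` override continues to work.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

public class ResizeRejectionExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request shrink = new Request("POST", "/source-index/_shrink/target-index");
            // Changing index.sort.* (or index.mode / source.mode) as part of the resize is now
            // rejected instead of silently producing a potentially broken target index.
            shrink.setJsonEntity("""
                {
                  "settings": {
                    "index.number_of_shards": 1,
                    "index.sort.field": "@timestamp"
                  }
                }
                """);
            try {
                client.performRequest(shrink);
            } catch (ResponseException e) {
                System.out.println("resize rejected: " + e.getMessage());
            }
        }
    }
}
```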
* Propagate scoring function through random sampler.
* Update docs/changelog/116957.yaml
* Correct score mode in random sampler weight
* Fix random sampling with scores and p=1.0
* Unit test with scores
* YAML test
* Add capability
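A request-level sketch of what these commits cover (the index name and field are hypothetical): a scoring query combined with a `random_sampler` aggregation, whose sampled sub-aggregations should now see the proper scores, including the `probability: 1.0` edge case.

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RandomSamplerScoreExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request search = new Request("POST", "/my-index/_search");
            search.setJsonEntity("""
                {
                  "query": { "match": { "message": "error" } },
                  "aggs": {
                    "sample": {
                      "random_sampler": { "probability": 0.5, "seed": 42 },
                      "aggs": {
                        "top_docs": { "top_hits": { "size": 3 } }
                      }
                    }
                  }
                }
                """);
            Response response = client.performRequest(search);
            // With the fix, hits inside the sampled top_hits carry real scores from the match query.
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```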
* Split searchable snapshot into multiple repo operations
Each operation on a snapshot repository uses the same `Repository`,
`BlobStore`, etc. instances throughout, in order to avoid the complexity
arising from handling metadata updates that occur while an operation is
running. Today we model the entire lifetime of a searchable snapshot
shard as a single repository operation since there should be no metadata
updates that matter in this context (other than those that are handled
dynamically via other mechanisms) and some metadata updates might be
positively harmful to a searchable snapshot shard.
It turns out that there are some undocumented legacy settings which _do_
matter to searchable snapshots, and which are still in use, so with this
commit we move to a finer-grained model of repository operations within
a searchable snapshot.
Backport of #116918 to 8.16
* Add end-to-end test for reloading S3 credentials
We don't seem to have a test that completely verifies that an S3
repository can reload credentials from an updated keystore. This commit
adds such a test.
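As a rough outline of what such an end-to-end check exercises (client and repository names are assumptions): after rewriting the S3 client entries in the keystore, a call to the reload-secure-settings API makes the repository pick up the new credentials without a restart.

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ReloadS3CredentialsExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Assumes s3.client.default.access_key / secret_key were just rewritten in the keystore,
            // e.g. via `bin/elasticsearch-keystore add s3.client.default.access_key`.
            Request reload = new Request("POST", "/_nodes/reload_secure_settings");
            Response response = client.performRequest(reload);
            System.out.println(EntityUtils.toString(response.getEntity()));
            // A subsequent snapshot against the S3 repository should now use the new credentials.
        }
    }
}
```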
Backport of #116762 to 8.16.
Clarifies that insecure settings are stored in plaintext and must not be
used. Also removes the mention of the (wrong) system property from the
error message if insecure settings are not permitted.
Backport of #116915 to `8.16`
The fetch phase is subject to timeouts like any other search phase. Timeouts
may happen when low level cancellation is enabled (true by default), hence the
directory reader is wrapped into ExitableDirectoryReader and a timeout is
provided to the search request.
The exception that is used is TimeExceededException, but it is an internal
exception that should never be returned to the user. When it is thrown, we
need to catch it and either throw an error or mark the response as timed out,
depending on whether partial results are allowed or not.
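A simplified sketch of the handling described above (the exception and result types below are stand-ins, not the internal SearchService code): the internal timeout signal is translated either into a timed-out-but-partial response or into a user-facing error, depending on whether partial results are allowed.

```java
import java.util.concurrent.Callable;

public final class FetchTimeoutHandlingExample {

    /** Stand-in for the internal, never-user-facing timeout signal. */
    static final class TimeExceededSignal extends RuntimeException {}

    record FetchResult(boolean timedOut, String hits) {}

    static FetchResult runFetchPhase(Callable<String> fetch, boolean allowPartialResults) throws Exception {
        try {
            return new FetchResult(false, fetch.call());
        } catch (TimeExceededSignal e) {
            if (allowPartialResults) {
                // Surface whatever was collected so far and flag the response as timed out.
                return new FetchResult(true, "<partial hits>");
            }
            // Partial results are not allowed: fail the request with a user-facing error instead
            // of leaking the internal timeout exception.
            throw new IllegalStateException("Time exceeded during fetch phase and partial results are disallowed");
        }
    }
}
```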
Several file-settings ITs fail (rarely) with exceptions like:
```
java.nio.file.AccessDeniedException: C:\Users\jenkins\workspace\platform-support\14\server\build\testrun\internalClusterTest\temp\org.elasticsearch.reservedstate.service.SnaphotsAndFileSettingsIT_5733F2A737542BE-001\tempFile-001.tmp -> C:\Users\jenkins\workspace\platform-support\14\server\build\testrun\internalClusterTest\temp\org.elasticsearch.reservedstate.service.SnaphotsAndFileSettingsIT_5733F2A737542BE-001\tempDir-002\config\operator\settings.json
	at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:89)
	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
	at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:317)
	at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:293)
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:144)
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:144)
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:144)
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:144)
	at java.nio.file.Files.move(Files.java:1430)
	at org.elasticsearch.reservedstate.service.SnaphotsAndFileSettingsIT.writeJSONFile(SnaphotsAndFileSettingsIT.java:86)
	at org.elasticsearch.reservedstate.service.SnaphotsAndFileSettingsIT.testRestoreWithPersistedFileSettings(SnaphotsAndFileSettingsIT.java:321)
```
This happens on Windows file systems, due to a race condition where the
file settings service is reading the settings file concurrently with the
test trying to modify it (a no-go on Windows). It turns out we have
already addressed this with a retry for one test suite
(https://github.com/elastic/elasticsearch/pull/91863), and addressed a
related issue around mock Windows file systems misbehaving
(https://github.com/elastic/elasticsearch/pull/92653).
This PR extends the above fixes to all file-settings related ITs.
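The retry boils down to something like the following JDK-only sketch (the helper name and retry count are illustrative): retry the atomic move for a short while when Windows reports the target as in use.

```java
import java.io.IOException;
import java.nio.file.AccessDeniedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class RetryingMove {

    /**
     * Moves tempFile to target, retrying a few times if the target is concurrently open
     * (on Windows this surfaces as AccessDeniedException).
     */
    static void moveWithRetries(Path tempFile, Path target) throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                Files.move(tempFile, target, StandardCopyOption.ATOMIC_MOVE);
                return;
            } catch (AccessDeniedException e) {
                last = e;
                Thread.sleep(100L); // another reader (the file settings service) still has the file open
            }
        }
        throw last;
    }
}
```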
(cherry picked from commit 91559da015)
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
* Validate missing shards after the coordinator rewrite (#116382)
The coordinator rewrite can skip searching shards when the query filters
on `@timestamp`, `event.ingested`, or the `_tier` field.
We currently check for missing shards across all the indices that the
query is running against; however, some shards/indices might not
play a role in the query at all after the coordinator rewrite.
This moves the check for missing shards **after** we've run the
coordinator rewrite so we validate only the shards that will be
searched by the query.
(cherry picked from commit cd2433d60c)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* imports
* Adapt unit test for 8.16 to use @timestamp rewrite
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This test was somewhat difficult to write in the first place. We had to come up
with a threshold for the maximum number of tasks that are going to be created, but that is
not easy to calculate, as it depends on how quickly such tasks can be created
and executed.
We should have rather used a higher threshold to start with; the important part
is that the total number of tasks created is no longer dependent on the
number of segments, given there are far fewer threads available to execute them.
Closes #116048
* Resolve pipelines from template if lazy rollover write (#116031)
If the data stream rollover-on-write flag is set in cluster state, resolve pipelines from templates rather than from metadata. This fixes the following bug: when a pipeline reroutes every document to another index and rollover is called with lazy=true (setting the rollover-on-write flag), changes to the pipeline do not take effect, because the lack of writes means the data stream never rolls over and pipelines in metadata are not updated. The fix is to resolve pipelines from templates if the lazy rollover flag is set. To improve efficiency, we only resolve pipelines once per index in the bulk request, caching the value and reusing it for other requests to the same index.
Fixes: #112781
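For context (the data stream name is hypothetical), the lazy rollover flag in question is set by a request like the one below; with this fix, pipeline changes made after that call apply to subsequent writes even though the data stream has not physically rolled over yet.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class LazyRolloverExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Mark the data stream to roll over on its next write instead of right away.
            Request rollover = new Request("POST", "/my-data-stream/_rollover");
            rollover.addParameter("lazy", "true");
            client.performRequest(rollover);
            // Later bulk writes now resolve pipelines from the index templates, so an updated
            // default/final pipeline takes effect even before the rollover actually happens.
        }
    }
}
```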
* Remute tests blocking merge
* Remute tests blocking merge
Since we removed the search workers thread pool with #111099, we execute many
more tasks in the search thread pool, given that each shard search request
parallelizes across slices or even segments (e.g. knn query rewrite). There are also
rare situations where segment-level tasks may parallelize further
(e.g. createWeight), which cause the creation of very many tasks for a single
top-level request. These are rather small tasks that previously queued up in
the unbounded search workers queue. With recent improvements in Lucene,
these tasks queue up in the search queue, yet they get executed by the caller
thread while they are still in the queue, and remain in the queue as no-ops
until they are pulled out of it. We have protection against rejections
based on turning off search concurrency when we have more than maxPoolSize
items in the queue, yet that is not enough if enough parallel requests see
an empty queue and manage to submit enough tasks to fill the queue at once.
That will cause rejections for top-level searches that should not be rejected.
This commit introduces a wrapper for the executor that limits the number of tasks
a single search instance can submit, to prevent the situation
where a single search submits far more tasks than there are threads available.
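A self-contained sketch of the idea (not the actual implementation): wrap the shared executor once per top-level search and stop forwarding to the pool once that search has submitted a fixed number of tasks, executing the surplus on the caller thread instead.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Wraps a shared search executor so that a single search can enqueue at most
 * {@code maxTasks} tasks; additional work runs on the calling thread, which keeps
 * one request from flooding the shared queue and causing rejections for others.
 */
public final class ThrottlingSearchExecutor implements Executor {

    private final Executor delegate;
    private final int maxTasks;
    private final AtomicInteger submitted = new AtomicInteger();

    public ThrottlingSearchExecutor(Executor delegate, int maxTasks) {
        this.delegate = delegate;
        this.maxTasks = maxTasks;
    }

    @Override
    public void execute(Runnable command) {
        if (submitted.incrementAndGet() <= maxTasks) {
            delegate.execute(command);
        } else {
            // Over budget for this search: run inline rather than growing the shared queue.
            command.run();
        }
    }
}
```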
Co-authored-by: Adrien Grand <jpountz@gmail.com>
The index.mode, source.mode, and index.sort.* settings cannot be
modified during restore, as this may lead to data corruption or issues
retrieving _source. This change enforces a restriction on modifying
these settings during restore. While a fine-grained check could permit
equivalent settings, it seems simpler and safer to reject restore
requests if any of these settings are specified.
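Mirroring the resize case above, a restore request that overrides one of these settings (repository, snapshot, and index names below are hypothetical) is now rejected rather than applied.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

public class RestoreRejectionExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request restore = new Request("POST", "/_snapshot/my-repo/my-snapshot/_restore");
            restore.setJsonEntity("""
                {
                  "indices": "my-index",
                  "index_settings": {
                    "index.sort.field": "@timestamp"
                  }
                }
                """);
            try {
                client.performRequest(restore);
            } catch (ResponseException e) {
                // index.mode, source.mode, and index.sort.* overrides are rejected during restore.
                System.out.println("restore rejected: " + e.getMessage());
            }
        }
    }
}
```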
* fix: correctly update search status for a nonexistent local index
* Check for cluster existence before updating it
* Remove unnecessary `println`
* Address review comment: add an explanatory code comment
* Further clarify code comment
(cherry picked from commit ad9c5a0a06)
The blob store may be triggered to create a local directory while in a
reduced privilege context. This commit guards the creation of
directories with doPrivileged.
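The guard amounts to something along these lines (a minimal JDK-only sketch; the helper name is illustrative): the directory creation runs inside a doPrivileged block so that it uses the plugin's own permissions rather than the caller's reduced ones.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.AccessController;
import java.security.PrivilegedAction;

public final class PrivilegedDirectoryCreation {

    /** Creates the directory (and parents) with the code's own privileges, not the caller's. */
    static Path createDirectoriesPrivileged(Path dir) {
        return AccessController.doPrivileged((PrivilegedAction<Path>) () -> {
            try {
                return Files.createDirectories(dir);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```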
* Allow for queries on _tier to skip shards during coordinator rewrite (#114990)
The `_tier` metadata field was not used on the coordinator when
rewriting queries in order to exclude shards that don't match. This led
to queries of the following form continuing to report failures even
though the only unavailable shards were in the tier that was excluded
from the search (frozen tier in this example):
```
POST testing/_search
{
"query": {
"bool": {
"must_not": [
{
"term": {
"_tier": "data_frozen"
}
}
]
}
}
}
```
This PR addresses this by having the queries that can execute on `_tier`
(term, match, query string, simple query string, prefix, wildcard)
execute a coordinator rewrite to exclude the indices that don't match
the `_tier` query **before** attempting to reach out to the shards (shards
that might not be available and would raise errors).
Fixes #114910
* Don't use getFirst
* Test compile