A while ago we enabled using ccs_minimize_roundtrips in async search.
This makes it possible for users of async search to send a single search
request per remote cluster, and minimize the impact of network latency.
With non-minimized roundtrips, cancellation checks happen fairly often:
as part of the execution, we detect that a task has expired whenever each
shard comes back with its results.
In a scenario where the coordinating node does not hold data, or an async
search targets only remote data, we have much less chance of detecting
cancellation if roundtrips are minimized: the local coordinator does nothing
other than wait for the minimized results from each remote cluster.
One place where we can check for cancellation is when each cluster comes
back with its full set of results. This commit adds such a check, plus some
testing for async search cancellation with minimized roundtrips.
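A minimal sketch of what such a check can look like, assuming a callback that fires once per remote cluster (all names here are illustrative, not the actual implementation):

```java
// Illustrative only: `task`, `merge` and the callback name are assumptions.
// When a remote cluster returns its full set of results, bail out early if
// the task has been cancelled instead of waiting on the remaining clusters.
void onClusterResponse(SearchResponse clusterResponse, ActionListener<SearchResponse> listener) {
    if (task.isCancelled()) {
        listener.onFailure(new TaskCancelledException("task cancelled [" + task.getReasonCancelled() + "]"));
        return;
    }
    merge(clusterResponse); // accumulate this cluster's results for the final reduce
}
```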
Transport actions have associated request and response classes. However,
there is no need to duplicate the base type restrictions when creating
a map of transport actions. Relatedly, the ActionHandler class doesn't
actually need a strongly typed action type and classes, since those types
are lost when shoved into the node client map. This commit removes these
type restrictions and generic parameters.
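A before/after sketch of the idea (simplified; the real ActionHandler differs in detail):

```java
// Simplified illustration, not the actual class. Once handlers are stored in
// a map keyed by action name, the generic parameters are erased anyway, so
// the wildcard form carries exactly the same information.

// Before: strongly typed, but the types cannot survive the map.
record ActionHandler<Request extends ActionRequest, Response extends ActionResponse>(
    ActionType<Response> action,
    Class<? extends TransportAction<Request, Response>> transportAction
) {}

// After: no generic parameters, nothing lost.
record SimplifiedActionHandler(
    ActionType<?> action,
    Class<? extends TransportAction<?, ?>> transportAction
) {}
```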
We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false.
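A rough sketch of the idea, with assumed helper and flag names (not the code from the change):

```java
// Sketch only: on the data node, log the failure with its full stack trace
// first, then strip the traces before transport when error_trace=false, so
// the coordinating node still receives a lean response.
void respondWithFailure(Exception failure, boolean errorTrace, ActionListener<Void> listener) {
    if (errorTrace == false) {
        logger.warn(() -> "search execution failed on shard [" + shardId + "]", failure);
        stripStackTraces(failure); // hypothetical helper clearing traces recursively
    }
    listener.onFailure(failure);
}
```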
This change moves the query phase to a single roundtrip per node, just like can_match and field_caps already work.
As a result of executing multiple shard queries from a single request, we can also partially reduce each node's query results on the data node side before responding to the coordinating node.
Consequently, this change significantly reduces the impact of network latency on end-to-end query performance, and it reduces both the work done (memory and CPU) on the coordinating node and the network traffic, by factors of up to the number of shards per data node!
Benchmarking shows up to orders-of-magnitude improvements in heap usage and network traffic when querying across a larger number of shards.
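Conceptually, the batching looks something like this (all names are assumptions, not the actual code):

```java
// Sketch: group the target shards by data node and send one query request per
// node instead of one per shard. Each node runs all of its shard queries and
// partially reduces them before responding, which is where the memory, CPU
// and network savings come from.
Map<String, List<ShardId>> shardsByNode = targetShards.stream()
    .collect(Collectors.groupingBy(shard -> nodeIdForShard(shard)));
for (Map.Entry<String, List<ShardId>> entry : shardsByNode.entrySet()) {
    sendBatchedQueryRequest(entry.getKey(), entry.getValue(), partialReduceListener);
}
```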
This makes using usesDefaultDistribution in our test setup explicit by requiring a reason why it's needed.
This is helpful as part of revisiting the need for all those usages in our code base.
Avoid some indirection and volatile reads, and move the listener
functionality that needlessly kept iterating an empty copy-on-write list
(creating iterator instances, volatile reads, more code), in an effort to
improve the low IPC on transport threads.
This logic will need a bit of adjustment for bulk query execution.
Let's DRY it up beforehand so we don't have to copy and paste the fix,
which will be a couple of lines.
* Fix NPE caused by race condition in async search when minimise round trips is true
Previously, the `notifyListShards()` method initialised and updated the
required prerequisites (`searchResponse` being amongst them) when a
search op began. This method takes arguments that contain
shard-specific details, amongst others. Because this information is not
immediately available when the search begins, the method is not called
immediately. In some specific cases, there can be a race condition that
causes the prerequisites (such as `searchResponse`) to be accessed
before they're initialised, causing an NPE.
This fix addresses the race condition by splitting the initialisation
and the subsequent updates across two different methods. This way, the
prerequisites are always initialised and do not lead to an NPE.
* Try: call `notifyListShards()` after `notifySearchStart()` when minimize round trips is true
* Add removed code comment
* Pass `Clusters` to `SearchTask` rather than using the progress listener to
signify search start.
To prevent polluting the progress listener with unnecessary search-specific
details, we now pass the `Clusters` object to `SearchTask` when
a search op begins. This lets `AsyncSearchTask` access it and use it to
initialise `MutableSearchResponse` appropriately (see the sketch after this commit list).
* Use appropriate `clusters` object rather than re-building it
* Do not double set `mutableSearchResponse`
* Move mutable entities such as shard counts out of `MutableSearchResponse`
* Address PR review: revert moving out mutable entities from
`MutableSearchResponse`
* Update docs/changelog/117504.yaml
* Get rid of `SetOnce` for `searchResponse`
* Drop redundant check around shards count
* Add a test that calls `onListShards()` at last and clarify `updateShardsAndClusters()`'s comment
* Fix test: ref count
* Address review comment: rewrite comment and test
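A condensed sketch of the shape of the final fix (method bodies and names approximate the description above, not the exact code):

```java
// Approximate sketch of the split. notifySearchStart(...) runs as soon as the
// search op begins, so `searchResponse` can never be observed as null; the
// shard-specific details are applied later, once they become available.
void notifySearchStart(SearchResponse.Clusters clusters) {
    searchResponse = new MutableSearchResponse(clusters, threadContext);
}

void onListShards(List<SearchShard> shards, List<SearchShard> skippedShards) {
    searchResponse.updateShardsAndClusters(shards, skippedShards);
}
```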
This updates the gradle wrapper to 8.12
We addressed deprecation warnings due to the update; the fixes include:
- Fix change in TestOutputEvent api
- Fix deprecation in groovy syntax
- Use latest ospackage plugin containing our fix
- Remove project usages at execution time
- Fix deprecated project references in repository-old-versions
* first iterations
* added tests
* Update docs/changelog/118266.yaml
* constant for error_trace and typos
* centralized putHeader
* moved threadContext to parent class
* uses NodeClient.threadpool
* updated async tests to retrieve final result
* moved test to avoid starting up a node
* added transport version to avoid sending useless bytes
* more async tests
We can parallelize starting the clusters and a few other things
to effectively speed up these tests by 2x, which comes out to about a minute
of execution time saved across all of those in :server:internalClusterTests
on my workstation.
Static fields don't do well in Gradle with configuration cache enabled.
- Use buildParams extension in build scripts
- Keep BuildParams.ci for now for easy serverless migration
- Tweak testing doc
The most relevant ES changes that upgrading to Lucene 10 requires are:
- use the appropriate IOContext
- Scorer / ScorerSupplier breaking changes
- Regex automata are no longer determinized by default
- minimize moved to test classes
- introduce Elasticsearch900Codec
- adjust slicing code according to the added support for intra-segment concurrency
- disable intra-segment concurrency in tests
- adjust accessor methods for many Lucene classes that became records (see the example below)
- adapt to breaking changes in the analysis area
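For instance, `TotalHits` is one of the classes that became a record, so field reads turn into accessor calls:

```java
// Lucene 10: TotalHits is a record, so fields become accessor methods.
TotalHits totalHits = new TotalHits(42, TotalHits.Relation.EQUAL_TO);
long hitCount = totalHits.value();                  // Lucene 9: totalHits.value
TotalHits.Relation relation = totalHits.relation(); // Lucene 9: totalHits.relation
```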
Co-authored-by: Christoph Büscher <christophbuescher@posteo.de>
Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
Co-authored-by: ChrisHegarty <chegar999@gmail.com>
Co-authored-by: Brian Seeders <brian.seeders@elastic.co>
Co-authored-by: Armin Braun <me@obrown.io>
Co-authored-by: Panagiotis Bailis <pmpailis@gmail.com>
Co-authored-by: Benjamin Trent <4357155+benwtrent@users.noreply.github.com>
This removes the `ccs_telemetry` feature flag and instead introduces an
undocumented setting, `search.ccs.collect_telemetry` (true by default),
which enables CCS search telemetry collection and
`_cluster/stats?include_remote=true`. It can be disabled if it causes
any problems.
Add the async execution ID and "is running" flag in the response as HTTP headers.
This allows users to know the request status without having to parse the response body.
It was also implemented in the `/_async_search/status/<id>` endpoint for consistency.
Continuation of https://github.com/elastic/elasticsearch/pull/111840, which implemented this same thing for ESQL.
Fixes https://github.com/elastic/elasticsearch/issues/109576
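A minimal sketch of the mechanism (the header names follow the earlier ESQL change, but treat the exact names and values here as assumptions):

```java
// Sketch only: attach the async execution id and the running flag as response
// headers so clients can read the request status without parsing the body.
ThreadContext threadContext = threadPool.getThreadContext();
threadContext.addResponseHeader("X-Elasticsearch-Async-Id", asyncExecutionId.getEncoded());
threadContext.addResponseHeader("X-Elasticsearch-Async-Is-Running", Boolean.toString(isRunning));
```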
* This creates the CCSUsage and CCSUsageTelemetry classes and wires them up to the UsageService.
An initial set of telemetry metrics are now being gathered in TransportSearchAction.
Many more will be added later to meet all the requirements for the CCS Telemetry epic of work.
Co-authored-by: Michael Peterson <michael.peterson@elastic.co>
Handling the PIT id as a `BytesReference` instead of as base64 encoded string
saves about a third of network traffic for these. We know that PIT ids can be a significant
source of traffic so the savings are well worth it.
Also, this saves cycles and memory on all nodes involved. A follow-up here would be to explore
slicing these IDs out of the network buffer instead of copying them, to reduce memory usage and large
allocations.
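A standalone illustration of the wire-size difference (not ES code): base64 encodes every 3 bytes as 4 characters, so the encoded string is about a third larger than the raw bytes.

```java
import java.util.Base64;

class PitIdSize {
    public static void main(String[] args) {
        byte[] rawPitId = new byte[1500]; // a realistically sized PIT id
        String base64 = Base64.getUrlEncoder().withoutPadding().encodeToString(rawPitId);
        System.out.println(rawPitId.length + " raw bytes vs " + base64.length() + " base64 chars");
        // prints: 1500 raw bytes vs 2000 base64 chars
    }
}
```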
Today various test utility methods construct a random string that can be
parsed as a `TimeValue`, but in practice almost everywhere we just parse
it straight away. With this change we have the utility methods return a
`TimeValue` directly.
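A simplified sketch of the new shape (the real helper lives in the test framework and draws from more units):

```java
import java.util.concurrent.ThreadLocalRandom;
import org.elasticsearch.core.TimeValue;

// Sketch only: return a TimeValue directly instead of a string that the
// caller would immediately re-parse with TimeValue.parseTimeValue(...).
final class TimeValueTestUtils {
    static TimeValue randomTimeValue() {
        return TimeValue.timeValueMillis(ThreadLocalRandom.current().nextLong(1, 10_000));
    }
}
```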
Fixes a bug in the async-search status endpoint where a user with monitor privileges
is not able to access the status endpoint when setting the keep_alive state of the async search.
We no longer need `NamedWriteableRegistry` to parse search requests;
it was an unused parameter. Removing it from search request parsing allows
us to remove it as a dependency from a number of places.
This consists of 3 changes:
1. Refactoring the code so that all the security logic in the async search code is moved to AsyncSearchSecurity
2. Changing TransportGetAsyncStatusAction to check for ownership if the user does not have explicit access to the GetAsyncStatusAction (if they have such access, it means they can get the status of all async searches); see the sketch after this list
3. In RBACEngine, if a user does not have permission to GetAsyncStatusAction but does have permission to submit async searches, then let them run the action, relying on point 2 above.
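An approximate sketch of the ownership fallback in point 2 (names and signatures are guesses at the shape, not the actual AsyncSearchSecurity code):

```java
// Sketch only: explicit access to the status action means any async search's
// status may be read; otherwise fall back to an ownership check.
void getStatus(String asyncId, ActionListener<AsyncStatusResponse> listener) {
    if (hasExplicitAccessToStatusAction()) {
        loadStatus(asyncId, listener);
    } else {
        ensureCurrentUserOwns(asyncId, listener.delegateFailureAndWrap((l, ignored) -> loadStatus(asyncId, l)));
    }
}
```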
Co-authored-by: Michael Peterson <michael.peterson@elastic.co>
This restores the functionality that was removed from 8.13
(waiting for a change on the Kibana side). The work for this feature
was added in #103134 but we had to remove the yaml changelog when
we turned off the functionality in #105455. So this PR restores the
changelog yaml as well.
Fixing all of these muted test classes; I tried my best to keep indentation changes to a minimum,
but unfortunately it wasn't possible to avoid them in all cases.
This is a temporary change to avoid doing incremental merges in
cross-cluster async-search when minimize_roundtrips=true.
Currently, Kibana polls for status of the async-search via the
_async_search endpoint, which (without this change) will
do an incremental merge of all search results. Once Kibana
moves to polling status via _async_search/status, then
we will undo the change in this commit.
This change adds additional plumbing to pipe the available cluster features through to
SearchSourceBuilder. A number of different APIs use SearchSourceBuilder, so they had to make this
available through their parsers as well, often through ParserContext. This change is largely
mechanical, passing a Predicate into existing REST actions to check for feature availability.
Note that this change was pulled mostly from this PR (#105040).
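A simplified sketch of the plumbing (signatures approximate; `SOME_NEW_FEATURE` is a hypothetical placeholder):

```java
// Sketch only: the REST layer hands the cluster-features predicate down to
// the parser, which can gate new syntax on whole-cluster support.
SearchSourceBuilder parseSource(XContentParser parser, Predicate<NodeFeature> clusterSupportsFeature)
    throws IOException {
    SearchSourceBuilder source = new SearchSourceBuilder();
    if (clusterSupportsFeature.test(SOME_NEW_FEATURE)) {
        // accept syntax that only fully upgraded clusters understand
    }
    return source;
}
```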
Loads of code here is only used in tests, plus there is one duplicate unused
class that only served as an indirection for parsing the
`AsyncSearchResponse`. Moved what I could easily move via automated
refactoring to `SearchResponseUtils` in tests and removed the now-unused
duplicate class from the client codebase.
A predicate to check whether the cluster supports a feature is available
to rest handlers defined in server. This commit adds that predicate to
plugins defining rest handlers as well.
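A hedged sketch of what this looks like for a plugin (the parameter list is abbreviated and `MyPluginRestAction` is hypothetical):

```java
// Sketch only: plugin rest handlers now receive the same feature-availability
// predicate that handlers defined in server already had.
@Override
public List<RestHandler> getRestHandlers(Settings settings, RestController restController,
        /* ...other dependencies elided... */
        Predicate<NodeFeature> clusterSupportsFeature) {
    return List.of(new MyPluginRestAction(clusterSupportsFeature));
}
```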