Commit graph

267 commits

Author SHA1 Message Date
David Turner
1461820dac
Fix race condition in RestCancellableNodeClient (#126686)
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201
2025-04-12 00:59:46 +10:00
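A minimal Java sketch of the idea described in #126686, not the actual `RestCancellableNodeClient` code: once the registrations have been drained for cancellation, a late registration is cancelled immediately instead of being recorded. The class `ChannelTaskTracker` and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.LongConsumer;

// Hypothetical illustration only; names and structure are assumptions.
final class ChannelTaskTracker {
    private final Set<Long> taskIds = new HashSet<>();
    private final LongConsumer canceller;
    private boolean drained; // guarded by 'this'

    ChannelTaskTracker(LongConsumer canceller) {
        this.canceller = canceller;
    }

    // Register a task, unless the channel has already been closed and drained.
    void register(long taskId) {
        synchronized (this) {
            if (drained == false) {
                taskIds.add(taskId);
                return;
            }
        }
        // The channel closed before this registration: cancel right away
        // rather than recording a task that nobody would ever cancel.
        canceller.accept(taskId);
    }

    // Forget a task that completed normally.
    void unregister(long taskId) {
        synchronized (this) {
            taskIds.remove(taskId);
        }
    }

    // Called when the channel closes: drain the registrations, then cancel them.
    void onChannelClosed() {
        final List<Long> toCancel;
        synchronized (this) {
            drained = true;
            toCancel = new ArrayList<>(taskIds);
            taskIds.clear();
        }
        for (long taskId : toCancel) {
            canceller.accept(taskId);
        }
    }
}
```

Setting the `drained` flag and copying the registrations under the same lock is what closes the window: a concurrent `register` call either lands before the drain and is cancelled with the rest, or lands after it and is cancelled on the spot.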
Ben Chaplin
9f6eb1d4e3
Log stack traces on data nodes before they are cleared for transport (#125732)
We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false.
2025-04-03 13:45:09 -04:00
Niels Bauman
483f97915c
Run TransportGetIndexAction on local node (#125652)
This action solely needs the cluster state, it can run on any node.
Since this is the last class/action that extends the `ClusterInfo`
abstract classes, we remove those classes too as they're not required
anymore.

Relates #101805
2025-04-02 18:41:35 +01:00
Niels Bauman
eb4d64f94a
Run TransportGetSettingsAction on local node (#126051)
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.

Relates #101805
2025-04-02 15:05:31 +01:00
Armin Braun
fd2cc97541
Introduce batched query execution and data-node side reduce (#121885)
This change moves the query phase to a single roundtrip per node, just like can_match or field_caps work already.
As a result of executing multiple shard queries from a single request, we can also partially reduce each node's query results on the data node side before responding to the coordinating node.

As a result this change significantly reduces the impact of network latencies on the end-to-end query performance, reduces the amount of work done (memory and cpu) on the coordinating node and the network traffic by factors of up to the number of shards per data node!

Benchmarking shows up to orders of magnitude improvements in heap and network traffic dimensions in querying across a larger number of shards.
2025-03-29 16:53:18 +01:00
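The per-node batching described in #121885 can be pictured with a small Java sketch. It only shows the grouping step (one request per node instead of one per shard), under the assumption that shard-to-node assignments are available as a simple map; `PerNodeBatchingSketch` and `groupShardsByNode` are hypothetical names, not Elasticsearch APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the actual search transport code.
final class PerNodeBatchingSketch {

    // Groups shard ids by the node that holds them, so that a single request
    // carrying all of a node's shards can be sent instead of one per shard.
    static Map<String, List<Integer>> groupShardsByNode(Map<Integer, String> shardToNode) {
        Map<String, List<Integer>> byNode = new TreeMap<>();
        // Iterate in shard-id order so each node's list is sorted.
        for (Map.Entry<Integer, String> entry : new TreeMap<>(shardToNode).entrySet()) {
            byNode.computeIfAbsent(entry.getValue(), node -> new ArrayList<>()).add(entry.getKey());
        }
        return byNode;
    }
}
```

With three shards spread over two nodes, the grouping yields two entries, so the coordinating node would send two requests rather than three; each node could then reduce its own shard results before responding.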
Mark Vieira
0388a5980c
Migrate legacy QA projects to new test clusters framework (#125545) 2025-03-26 10:05:56 -07:00
Niels Bauman
481d91c428
Run TransportGetMappingsAction on local node (#122921)
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.

Relates #101805
2025-03-15 07:59:28 +00:00
Armin Braun
425823cb5c
Remove some overhead from TransportService message handling (#124428)
Avoiding some indirection, volatile-reads and moving the listener
functionality that needlessly kept iterating an empty CoW list (creating
iterator instances, volatile reads, more code) in an effort to improve
the low IPC on transport threads.
2025-03-09 16:00:11 +01:00
Armin Braun
d3abf9d5ba
Dry up search error trace ITs (#122138)
This logic will need a bit of adjustment for bulk query execution.
Let's dry it up first so we don't have to copy and paste the fix, which
will be a couple of lines.
2025-02-10 08:48:49 +01:00
Artem Prigoda
62f0fe869a
Remove the failures field from snapshot responses (#114496)
Failure handling for snapshots was made stricter in #107191 (8.15), so this field has always been empty since then. Clients no longer need to check it for failure handling, so we can remove it from API responses in 9.0.
2025-02-05 15:35:38 +01:00
Niels Bauman
5efe216958
Run GetPipelineTransportAction on local node (#120445)
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.

Relates #101805
2025-01-22 08:16:31 +10:00
Niels Bauman
4ccd377d27
Run TransportClusterGetSettingsAction on local node (#119831)
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary
work after a client failure or timeout. The `?local` parameter
becomes a no-op and is marked as deprecated.
2025-01-14 03:45:58 +00:00
Niels Bauman
27a9c4d911
Run template simulation actions on local node (#120038)
The actions `TransportSimulateTemplateAction` and
`TransportSimulateIndexTemplateAction` solely need the cluster state,
they can run on any node. Additionally, they need to be cancellable
to avoid doing unnecessary work after a client failure or timeout.

As a drive-by, this removes more usages of the trappy default master
node timeout.
2025-01-14 12:41:05 +10:00
Niels Bauman
80e8017bb6
Run TransportGetIndexTemplatesAction on local node (#119837)
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.

As a drive-by, this removes another usage of the trappy default master
node timeout.
2025-01-10 00:20:16 +00:00
Niels Bauman
65e4ec129c
Run TransportGetComposableIndexTemplate on local node (#119830)
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.

As a drive-by, this removes another usage of the trappy default master
node timeout.
2025-01-10 09:00:31 +10:00
Niels Bauman
9641c7623f
Run TransportGetComponentTemplateAction on local node (#116868)
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.

The `?local` parameter becomes a no-op and is marked as deprecated.

Relates #101805
Relates #107984
2024-12-23 20:01:21 +00:00
Matteo Piergiovanni
97bc2919ff
Prevent data nodes from sending stack traces to coordinator when error_trace=false (#118266)
* first iterations

* added tests

* Update docs/changelog/118266.yaml

* constant for error_trace and typos

* centralized putHeader

* moved threadContext to parent class

* uses NodeClient.threadpool

* updated async tests to retrieve final result

* moved test to avoid starting up a node

* added transport version to avoid sending useless bytes

* more async tests
2024-12-18 15:29:35 +01:00
Henrique Paes
4740b02a9b
Wrap jackson exception on malformed json string (#114445)
This commit hides the underlying Jackson parse exception when encountered while parsing string tokens.
2024-12-05 09:22:48 -08:00
Simon Cooper
04e04ceaf1
Remove Version from system index descriptors (#115793)
Now it just uses mapping versions
2024-10-31 11:12:15 +00:00
Tim Brooks
e144184896
Standardize error code when bulk body is invalid (#114869)
Currently the incremental and non-incremental bulk variations will
return different error codes when the json body provided is invalid.
This commit ensures both versions return status code 400. Additionally,
this renames the incremental rest tests to bulk tests and ensures that
all tests work with both bulk api versions. We set these tests to
randomize which version of the api we test each run.
2024-10-16 12:18:35 -06:00
David Turner
cd427198dc
More verbose logging in IndicesSegmentsRestCancellationIT (#113844)
Relates #88201
2024-10-03 19:53:47 +10:00
Tim Brooks
6759ae2e89
Introduce watermarks for indexing pressure backoff (#113912)
Currently we have a relatively basic decider about when to throttle
indexing. This commit adds two levels of watermarks with configurable
bulk size deciders. It also adds settings to control primary,
coordinating, and replica rejection limits.
2024-10-02 10:06:33 -06:00
David Turner
e9d0dd9e28
Fix testClusterHealthRestCancellation (#113680)
This test was failing due to a race between an early cancellation check
and the cancel operation. With this commit we wait until the action is
definitely blocked before cancelling the task.

Closes #100062
2024-09-30 07:47:09 +01:00
Tim Brooks
d146b27a26
Default incremental bulk functionality to false (#113416)
This commit flips the incremental bulk setting to false. Additionally,
it removes some test code which intermittently causes issues with
security test cases.
2024-09-24 06:26:48 +10:00
Tim Brooks
c5caf84e2d
Move raw path into HttpPreRequest (#113231)
Currently, the raw path is only available from the RestRequest. This
makes the logic to determine if a handler supports streaming more
challenging to evaluate. This commit moves the raw path into pre request
to allow easier streaming support logic.
2024-09-21 05:32:45 +10:00
David Turner
6ff138f558
Drop useless AckedRequest interface (#113255)
Almost every implementation of `AckedRequest` is an
`AcknowledgedRequest` too, and the distinction is rather confusing.
Moreover the other implementations of `AckedRequest` are a potential
source of `null` timeouts that we'd like to get rid of. This commit
simplifies the situation by dropping the unnecessary `AckedRequest`
interface entirely.
2024-09-20 12:33:07 +01:00
Tim Brooks
92daeeba11
Properly handle empty incremental bulk requests (#112974)
This commit ensures we properly throw exceptions when an empty bulk
request is received with the incremental handling enabled.
2024-09-18 13:52:10 -06:00
Mikhail Berezovskiy
dce8a0bfd3
merge main 2024-09-18 13:52:10 -06:00
Tim Brooks
95b42a7129
Ensure incremental bulk setting is set atomically (#112479)
Currently the rest.incremental_bulk setting is read in two different places.
This means that it will be employed in two steps introducing
unpredictable behavior. This commit ensures that it is only read in a
single place.
2024-09-18 13:40:39 -06:00
Tim Brooks
a03fb12b09
Incremental bulk integration with rest layer (#112154)
Integrate the incremental bulks into RestBulkAction
2024-09-18 13:40:39 -06:00
Mark Vieira
a59c182f9f
Add AGPLv3 as a supported license 2024-09-13 15:29:46 -07:00
Mikhail Berezovskiy
c1c5fe64b3
Use opaque id in task cancellation assertion (#110680)
Add use of the Opaque ID HTTP header in the task cancellation assertion. In some
tests, like `testCatSegmentsRestCancellation` in #88201, we assert
that all tasks related to a specific HTTP request are cancelled. But the
assertion block takes a blanket approach, catching all tasks by action name. I
think narrowing the assertion down to the specific HTTP request in this case
would be more accurate.

It is still not clear why the test mentioned above is failing, but after hours
of investigation and injecting random delays, I'm leaning more towards
@DaveCTurner's comment about interference from other tests or cluster
activity. I added an additional log message that will report when we spot a task with
a different opaque id.
2024-07-12 14:35:57 +10:00
David Turner
5662f988b2
Remove trappy timeouts in snapshot APIs (#109828)
Wholesale fix of every `TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` in
`o.e.snapshots` and `o.e.repositories`, just pulling them up to the REST
layer (where they become API params), the test suite (where they become
`TEST_REQUEST_TIMEOUT`), or some other place where an explicit value is
available.

Relates #107984
2024-06-21 07:11:12 +10:00
Patrick Doyle
43b2e877e0
Revert "Move PluginsService to its own internal package (#109872)" (#109946)
This reverts commit b9e7965184.
2024-06-19 18:10:50 -04:00
Patrick Doyle
b9e7965184
Move PluginsService to its own internal package (#109872)
* Mechanical package change in IntelliJ
* A couple of manual fixups
* Export plugins.loading to deprecation
* Put plugin-cli in a module so can export PluginsUtils to it.
2024-06-19 15:23:47 -04:00
Ievgen Degtiarenko
d3a285e1c7
Fix testDanglingIndicesCanBeListed (#108599)
The test started failing because of the recent changes to allow closing (and deleting shards) asynchronously. As a result, the dangling indices API now sees a directory in a partially deleted state, cannot interpret the partial data, and fails.
The fix retries the failure on the client.
2024-05-14 11:40:27 +02:00
David Turner
30d31bffb2
Introduce RestUtils#getMasterNodeTimeout (#107986)
Many APIs accept a `?master_timeout` parameter, but reading this
parameter requires a little unnecessary boilerplate to specify the
literal parameter name and default value. Moreover, today's convention
is to construct a `MasterNodeRequest` and then read the default master
timeout from the freshly-created request. In practice this results in a
default of 30s, but we specify in the docs that this default is _always_
30s, and in principle one could create a transport request with a
different initial value which would deviate from the documented
behaviour.

This commit introduces a utility method for reading this parameter in a
fashion which is completely consistent with the documented behaviour.

Relates #107984
2024-04-29 08:03:32 +01:00
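A hedged Java sketch of what such a utility might look like, assuming request parameters arrive as a plain `Map<String, String>`. Only the `master_timeout` parameter name and the 30s default come from the commit message above; `MasterTimeoutSketch`, `readMasterTimeout`, and the simplified time-value parsing are assumptions and do not reproduce the real `RestUtils#getMasterNodeTimeout`.

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical illustration only; not the real RestUtils implementation.
final class MasterTimeoutSketch {
    static final String MASTER_TIMEOUT_PARAM = "master_timeout";
    // The documented default, fixed in one place rather than read from a request object.
    static final Duration DEFAULT_MASTER_TIMEOUT = Duration.ofSeconds(30);

    // Reads ?master_timeout, falling back to the documented 30s default.
    // Assumption: only the "<n>ms", "<n>s" and "<n>m" forms are handled here.
    static Duration readMasterTimeout(Map<String, String> params) {
        String raw = params.get(MASTER_TIMEOUT_PARAM);
        if (raw == null) {
            return DEFAULT_MASTER_TIMEOUT;
        }
        if (raw.endsWith("ms")) {
            return Duration.ofMillis(Long.parseLong(raw.substring(0, raw.length() - 2)));
        }
        if (raw.endsWith("s")) {
            return Duration.ofSeconds(Long.parseLong(raw.substring(0, raw.length() - 1)));
        }
        if (raw.endsWith("m")) {
            return Duration.ofMinutes(Long.parseLong(raw.substring(0, raw.length() - 1)));
        }
        throw new IllegalArgumentException("unsupported time value [" + raw + "]");
    }
}
```

Keeping the parameter name and the default in a single helper means a handler cannot accidentally inherit a different default from however its transport request happens to be constructed.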
Armin Braun
05a2ff0375
Remove some more ActionType implementations (#107664)
Cleaning up a couple more of these.
2024-04-20 20:01:04 +02:00
Jonathan Buttner
d8348560a9
muting (#107496)
Muting https://github.com/elastic/elasticsearch/issues/100062
2024-04-15 17:17:34 -04:00
Ievgen Degtiarenko
32bcb13ac4
Introduce an easy way to get node id by its name (#107392)
Our test utility returns the node name when starting a new node.
A lot of APIs (such as routing table or node shutdown) require a node id.
This change introduces a simple way to retrieve the node id based on its name.
2024-04-12 10:50:11 +02:00
David Turner
9a907704b7
Move XContent -> SnapshotInfo parsing out of prod (#106669)
The code to parse a `SnapshotInfo` object out of an `XContent` response
body is only used in tests, so this commit moves it out of the
production codebase and into the test framework.
2024-03-22 09:46:46 -04:00
David Turner
12e567d29e
Consolidate get-snapshots ?after logic (#106038)
Today the handling of the `?after` param is kinda spread out over
`TransportGetSnapshotsAction` and `GetSnapshotsRequest` making it hard
to follow and adding unnecessary complexity to these two classes. This
commit moves it into `SnapshotSortKey` which is a better fit since the
behaviour varies so much for different sort keys.
2024-03-12 05:16:46 -04:00
David Turner
1fae3e7501
Extract SnapshotSortKey (#106015)
The behaviour of the get-snapshots API varies quite considerably
depending on the sort key chosen. Today this logic is implemented using
scattered `switch` statements and other conditionals but it'd be clearer
if we delegated this stuff to the sort key instances themselves. This
commit moves the sort key enum to the top level and replaces one of the
`switch` statements with a method on the enum instances.
2024-03-06 15:27:57 +00:00
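The refactoring pattern described in #106015, replacing a `switch` over the sort key with a method on the enum instances, might look roughly like the Java sketch below. `SnapshotStub` and `SortKeySketch` are hypothetical stand-ins, not the real `SnapshotInfo` or `SnapshotSortKey` types, and only two illustrative keys are shown.

```java
import java.util.Comparator;

// Hypothetical stand-in for a snapshot description.
final class SnapshotStub {
    final String name;
    final long startTimeMillis;

    SnapshotStub(String name, long startTimeMillis) {
        this.name = name;
        this.startTimeMillis = startTimeMillis;
    }
}

// Each constant carries its own behaviour, so callers never switch on the key.
enum SortKeySketch {
    START_TIME {
        @Override
        Comparator<SnapshotStub> comparator() {
            return Comparator.comparingLong((SnapshotStub s) -> s.startTimeMillis);
        }
    },
    NAME {
        @Override
        Comparator<SnapshotStub> comparator() {
            return Comparator.comparing((SnapshotStub s) -> s.name);
        }
    };

    // Key-specific ordering, implemented by every constant.
    abstract Comparator<SnapshotStub> comparator();
}
```

A caller would then write `snapshots.sort(sortKey.comparator())`, and adding a new sort key forces the author to supply its ordering at compile time instead of hunting down every `switch`.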
David Turner
7cbdb6cc19
Drop dead code from get-snapshots request & response (#105608)
Removes all the now-dead code related to reading pre-7.16 get-snapshots
requests and responses, and also moves the `XContent` response parsing
out of production and into the only test suite that uses it.
2024-02-21 07:57:50 +00:00
Ryan Ernst
b67f5a6b57
Make cluster feature predicate available to plugins (#105022)
A predicate to check whether the cluster supports a feature is available
to rest handlers defined in server. This commit adds that predicate to
plugins defining rest handlers as well.
2024-02-01 09:11:18 -08:00
Simon Cooper
016c778321
Remove NamedWriteableRegistry from NodeClient, pass it directly through to rest actions (#103277) 2024-01-11 12:42:22 +00:00
Lee Hinman
d297d79927
Fix require_alias implicit true value on presence (#104099)
* Fix `require_alias` implicit true value on presence

This commit brings the `require_alias` query-string parameter into line with the rest of our parameters where its presence indicates an implicit "true" value (so a user can do `POST /_bulk?require_alias` to enable the check).

Resolves #103945

* Update docs/changelog/104099.yaml
2024-01-09 10:08:41 -07:00
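The "presence implies true" convention described in #104099 can be illustrated with a short Java sketch, assuming query-string parameters are exposed as a `Map<String, String>` in which a bare `?require_alias` maps to an empty string. `FlagParamSketch` and `paramAsBoolean` here are hypothetical helpers, not the actual REST request API.

```java
import java.util.Map;

// Hypothetical illustration only; not the real RestRequest parameter parsing.
final class FlagParamSketch {

    // Returns the default when the key is absent, true when it is present
    // without a value, and otherwise the explicit "true"/"false" value.
    static boolean paramAsBoolean(Map<String, String> params, String key, boolean defaultValue) {
        if (params.containsKey(key) == false) {
            return defaultValue;
        }
        String raw = params.get(key);
        if (raw == null || raw.isEmpty()) {
            return true; // bare flag, e.g. POST /_bulk?require_alias
        }
        if (raw.equals("true")) {
            return true;
        }
        if (raw.equals("false")) {
            return false;
        }
        throw new IllegalArgumentException("cannot parse [" + raw + "] as a boolean for [" + key + "]");
    }
}
```

Under these assumptions, `?require_alias`, `?require_alias=true`, and `?require_alias=false` resolve to true, true, and false respectively, while omitting the parameter yields the default.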
Mary Gouseti
046cdeae23
Introduce lazy rollover for mapping updates in data streams (#103309)
This PR introduces a flag indicating that a data stream needs to be rolled over before the next document is indexed.
2024-01-08 15:07:16 +02:00
David Turner
60b833bb6d
Add utils for general XContent REST requests (#103711)
Tests that send REST requests with bodies must today build up a separate
`String` containing the body contents as JSON. This is kinda ugly, and
also means we do not cover the other supported body formats in these
tests. This commit introduces a utility to allow construction of REST
requests with `XContent` bodies directly, and generalizes things to
choose randomly between JSON and other supported body formats.
2024-01-02 13:39:21 +00:00
David Turner
2e592a3416
More logging for ClusterHealthRestCancellationIT (#103193)
Relates #100062
2023-12-08 08:48:52 -05:00