Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.
Closes #88201
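A minimal sketch of the idea described above, assuming a hypothetical `ChannelTasks` tracker (this is not the actual Elasticsearch class): once the registrations have been drained on close, a late registration is cancelled immediately instead of being added.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-channel tracker; names are illustrative only.
final class ChannelTasks {
    private final List<Runnable> cancellers = new ArrayList<>();
    private boolean closed = false;

    /** Registers a cancel action, or runs it right away if the channel already closed. */
    void register(Runnable canceller) {
        synchronized (this) {
            if (closed == false) {
                cancellers.add(canceller);
                return;
            }
        }
        // The registrations were already drained, so never register: cancel now.
        canceller.run();
    }

    /** Called when the channel closes: drains and cancels everything registered so far. */
    void onClose() {
        final List<Runnable> drained;
        synchronized (this) {
            closed = true;
            drained = new ArrayList<>(cancellers);
            cancellers.clear();
        }
        drained.forEach(Runnable::run);
    }
}
```

Running the cancel actions outside the lock is a design choice of this sketch, so that a slow cancellation cannot block concurrent registrations.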
We recently cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false to reduce unnecessary data transfer and memory on the coordinating node (#118266). However, all logging of exceptions happens on the coordinating node, so stack traces disappeared from any logs. This change logs stack traces directly on the data node when error_trace=false.
This action solely needs the cluster state, it can run on any node.
Since this is the last class/action that extends the `ClusterInfo`
abstract classes, we remove those classes too as they're not required
anymore.
Relates #101805
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.
Relates #101805
This change moves the query phase to a single roundtrip per node, just like can_match or field_caps already work.
As a result of executing multiple shard queries from a single request we can also partially reduce each node's query results on the data node side before responding to the coordinating node.
As a result this change significantly reduces the impact of network latencies on the end-to-end query performance, reduces the amount of work done (memory and cpu) on the coordinating node and the network traffic by factors of up to the number of shards per data node!
Benchmarking shows up to orders of magnitude improvements in heap and network traffic dimensions in querying across a larger number of shards.
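To illustrate why a per-node partial reduce shrinks the response, here is a rough sketch assuming a hypothetical `PartialReduce` helper in which plain scores stand in for full shard query results. Each data node keeps only its own top hits, and the coordinating node merges one small list per node rather than one per shard.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; scores stand in for full query results.
final class PartialReduce {
    /** Merges several score lists and keeps the top {@code size} scores, highest first. */
    static List<Double> topHits(List<List<Double>> results, int size) {
        return results.stream()
            .flatMap(List::stream)
            .sorted(Comparator.reverseOrder())
            .limit(size)
            .collect(Collectors.toList());
    }
}
```

In this sketch the same function is called once per data node over that node's shard results, and once more on the coordinating node over the per-node results.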
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.
Relates #101805
Avoiding some indirection, volatile-reads and moving the listener
functionality that needlessly kept iterating an empty CoW list (creating
iterator instances, volatile reads, more code) in an effort to improve
the low IPC on transport threads.
This logic will need a bit of adjustment for bulk query execution.
Let's dry it up first so we don't have to copy and paste the fix, which
will be a couple of lines.
Failure handling for snapshots was made stricter in #107191 (8.15), so this field has always been empty since then. Clients no longer need to check it for failure handling, so we can remove it from API responses in 9.0.
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.
Relates #101805
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary
work after a client failure or timeout. The `?local` parameter
becomes a no-op and is marked as deprecated.
The actions `TransportSimulateTemplateAction` and
`TransportSimulateIndexTemplateAction` solely need the cluster state,
they can run on any node. Additionally, they need to be cancellable
to avoid doing unnecessary work after a client failure or timeout.
As a drive-by, this removes more usages of the trappy default master
node timeout.
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.
As a drive-by, this removes another usage of the trappy default master
node timeout.
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.
As a drive-by, this removes another usage of the trappy default master
node timeout.
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.
The `?local` parameter becomes a no-op and is marked as deprecated.
Relates #101805
Relates #107984
* first iterations
* added tests
* Update docs/changelog/118266.yaml
* constant for error_trace and typos
* centralized putHeader
* moved threadContext to parent class
* uses NodeClient.threadpool
* updated async tests to retrieve final result
* moved test to avoid starting up a node
* added transport version to avoid sending useless bytes
* more async tests
Currently the incremental and non-incremental bulk variations will
return different error codes when the json body provided is invalid.
This commit ensures both versions return status code 400. Additionally,
this renames the incremental rest tests to bulk tests and ensures that
all tests work with both bulk api versions. We set these tests to
randomize which version of the api we test each run.
Currently we have a relatively basic decider about when to throttle
indexing. This commit adds two levels of watermarks with configurable
bulk size deciders. Additionally, it adds settings to control
primary, coordinating, and replica rejection limits.
This test was failing due to a race between an early cancellation check
and the cancel operation. With this commit we wait until the action is
definitely blocked before cancelling the task.
Closes #100062
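The race and its fix can be sketched roughly as follows, assuming a hypothetical `BlockingAction` (not the real test code): the action signals a latch once it is past its early cancellation check, and the caller waits on that latch before cancelling.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical action used only to illustrate the ordering; not the real test.
final class BlockingAction implements Runnable {
    final CountDownLatch blocked = new CountDownLatch(1);   // counted down once past the early check
    private final CountDownLatch release = new CountDownLatch(1);
    private final AtomicBoolean cancelled = new AtomicBoolean();
    volatile String outcome = "not-run";

    @Override
    public void run() {
        if (cancelled.get()) {          // early cancellation check
            outcome = "rejected-early";
            return;
        }
        blocked.countDown();            // from here on the action is definitely blocked
        try {
            release.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        outcome = cancelled.get() ? "cancelled-while-blocked" : "completed";
    }

    void cancel() {
        cancelled.set(true);
        release.countDown();
    }

    /** Starts the action, waits until it is blocked, then cancels it. */
    static String cancelOnceBlocked() {
        BlockingAction action = new BlockingAction();
        Thread worker = new Thread(action);
        worker.start();
        try {
            action.blocked.await();     // without this wait, cancel() could win the race
            action.cancel();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return action.outcome;
    }
}
```

Without the `blocked.await()` call the outcome could be either `rejected-early` or `cancelled-while-blocked`, depending on thread scheduling.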
This commit flips the incremental bulk setting to false. Additionally,
it removes some test code which intermittently causes issues with
security test cases.
Currently, the raw path is only available from the RestRequest. This
makes the logic to determine if a handler supports streaming more
challenging to evaluate. This commit moves the raw path into pre request
to allow easier streaming support logic.
Almost every implementation of `AckedRequest` is an
`AcknowledgedRequest` too, and the distinction is rather confusing.
Moreover the other implementations of `AckedRequest` are a potential
source of `null` timeouts that we'd like to get rid of. This commit
simplifies the situation by dropping the unnecessary `AckedRequest`
interface entirely.
Currently the `rest.incremental_bulk` setting is read in two different places.
This means that it takes effect in two steps, introducing
unpredictable behavior. This commit ensures that it is only read in a
single place.
Add use of the Opaque ID HTTP header in the task cancellation assertion. In some
tests, such as `testCatSegmentsRestCancellation` (#88201), we assert
that all tasks related to a specific HTTP request are cancelled. But the
assertion block takes a blanket approach and catches all tasks by action name. I
think narrowing the assertion down to the specific HTTP request in this case
would be more accurate.
It is still not clear why the test mentioned above is failing, but after hours
of investigation and injecting random delays, I'm leaning more towards
@DaveCTurner's comment about interference from other tests or cluster
activity. I added an additional log message that reports when we spot a task with
a different opaque ID.
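As a rough illustration of the narrower assertion, assuming hypothetical `TaskView` and `CancellationAssertions` types rather than the real test framework classes, the check can filter by both action name and opaque ID before asserting cancellation:

```java
import java.util.List;

// Hypothetical, simplified view of a running task; not the real task info class.
final class TaskView {
    final String action;
    final String opaqueId;
    final boolean cancelled;

    TaskView(String action, String opaqueId, boolean cancelled) {
        this.action = action;
        this.opaqueId = opaqueId;
        this.cancelled = cancelled;
    }
}

final class CancellationAssertions {
    /** True if every task with the given action AND opaque ID is cancelled; other tasks are ignored. */
    static boolean allCancelledFor(List<TaskView> tasks, String action, String opaqueId) {
        return tasks.stream()
            .filter(t -> t.action.equals(action))
            .filter(t -> opaqueId.equals(t.opaqueId))
            .allMatch(t -> t.cancelled);
    }
}
```

A task with the same action name but a different opaque ID, for example one started by another test, no longer affects the result.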
Wholesale fix of every `TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` in
`o.e.snapshots` and `o.e.repositories`, just pulling them up to the REST
layer (where they become API params), the test suite (where they become
`TEST_REQUEST_TIMEOUT`), or some other place where an explicit value is
available.
Relates #107984
* Mechanical package change in IntelliJ
* A couple of manual fixups
* Export plugins.loading to deprecation
* Put plugin-cli in a module so we can export PluginsUtils to it.
The test started failing because of the recent changes to allow closing (and deleting) shards asynchronously. As a result, the dangling index API now sees a directory in a partially deleted state, fails to interpret the partial data, and fails as a result.
The fix retries the failure on the client.
Many APIs accept a `?master_timeout` parameter, but reading this
parameter requires a little unnecessary boilerplate to specify the
literal parameter name and default value. Moreover, today's convention
is to construct a `MasterNodeRequest` and then read the default master
timeout from the freshly-created request. In practice this results in a
default of 30s, but we specify in the docs that this default is _always_
30s, and in principle one could create a transport request with a
different initial value which would deviate from the documented
behaviour.
This commit introduces a utility method for reading this parameter in a
fashion which is completely consistent with the documented behaviour.
Relates #107984
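A minimal sketch of such a utility, assuming a hypothetical `RestParams` helper that reads from a plain parameter map and accepts only whole-second values such as `45s` (the real method and its time-value parsing may differ):

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical helper; the parameter map and the simplified parsing are assumptions.
final class RestParams {
    /** The documented default, fixed in one place instead of read from a fresh request. */
    static final Duration DEFAULT_MASTER_TIMEOUT = Duration.ofSeconds(30);

    /** Reads {@code master_timeout}, falling back to the documented 30s when absent. */
    static Duration masterTimeout(Map<String, String> params) {
        String raw = params.get("master_timeout");
        if (raw == null) {
            return DEFAULT_MASTER_TIMEOUT;
        }
        if (raw.endsWith("s") == false) {
            throw new IllegalArgumentException("unsupported master_timeout in this sketch: " + raw);
        }
        // Only whole seconds are handled here; anything else fails with NumberFormatException.
        return Duration.ofSeconds(Long.parseLong(raw.substring(0, raw.length() - 1)));
    }
}
```

Because the default is a constant in the helper, it cannot drift from the documented value if a request class is later constructed with a different initial timeout.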
Our test utility returns the node name when starting a new node.
A lot of APIs (such as routing table or node shutdown) require a node id.
This change introduces a simple way to retrieve the node id based on its name.
The code to parse a `SnapshotInfo` object out of an `XContent` response
body is only used in tests, so this commit moves it out of the
production codebase and into the test framework.
Today the handling of the `?after` param is kinda spread out over
`TransportGetSnapshotsAction` and `GetSnapshotsRequest` making it hard
to follow and adding unnecessary complexity to these two classes. This
commit moves it into `SnapshotSortKey` which is a better fit since the
behaviour varies so much for different sort keys.
The behaviour of the get-snapshots API varies quite considerably
depending on the sort key chosen. Today this logic is implemented using
scattered `switch` statements and other conditionals but it'd be clearer
if we delegated this stuff to the sort key instances themselves. This
commit moves the sort key enum to the top level and replaces one of the
`switch` statements with a method on the enum instances.
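The general shape of this refactoring can be sketched as below, assuming hypothetical `SnapshotSummary` and `SortKey` types that are much simpler than the real ones: each enum instance carries its own ordering, so callers no longer need a `switch` on the key.

```java
import java.util.Comparator;

// Hypothetical, simplified snapshot summary used only in this sketch.
final class SnapshotSummary {
    final String name;
    final long startTimeMillis;

    SnapshotSummary(String name, long startTimeMillis) {
        this.name = name;
        this.startTimeMillis = startTimeMillis;
    }
}

// Hypothetical sort key enum; the real enum has more keys and more behaviour.
enum SortKey {
    NAME(Comparator.comparing((SnapshotSummary s) -> s.name)),
    START_TIME(Comparator.comparingLong((SnapshotSummary s) -> s.startTimeMillis));

    private final Comparator<SnapshotSummary> comparator;

    SortKey(Comparator<SnapshotSummary> comparator) {
        this.comparator = comparator;
    }

    /** Each key supplies its own ordering instead of being switched on by the caller. */
    Comparator<SnapshotSummary> comparator() {
        return comparator;
    }
}
```

A caller would then write `snapshots.sort(sortKey.comparator())` regardless of which key was requested.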
Removes all the now-dead code related to reading pre-7.16 get-snapshots
requests and responses, and also moves the `XContent` response parsing
out of production and into the only test suite that uses it.
A predicate to check whether the cluster supports a feature is available
to rest handlers defined in server. This commit adds that predicate to
plugins defining rest handlers as well.
* Fix `require_alias` implicit true value on presence
This commit brings the `require_alias` query-string parameter into line with the rest of our parameters where its presence indicates an implicit "true" value (so a user can do `POST /_bulk?require_alias` to enable the check).
Resolves #103945
* Update docs/changelog/104099.yaml
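The "presence implies true" convention for `require_alias` can be sketched as follows, assuming a hypothetical `FlagParams` helper over a plain parameter map (not the actual `RestRequest` API):

```java
import java.util.Map;

// Hypothetical helper illustrating the convention; names are assumptions.
final class FlagParams {
    /** Absent: default. Present with no value: true. Otherwise: strict "true" or "false". */
    static boolean paramAsBoolean(Map<String, String> params, String key, boolean defaultValue) {
        if (params.containsKey(key) == false) {
            return defaultValue;
        }
        String raw = params.get(key);
        if (raw == null || raw.isEmpty()) {
            return true;    // e.g. POST /_bulk?require_alias
        }
        if (raw.equals("true")) {
            return true;
        }
        if (raw.equals("false")) {
            return false;
        }
        throw new IllegalArgumentException("expected [true] or [false] for [" + key + "] but got [" + raw + "]");
    }
}
```

With this shape, `?require_alias` and `?require_alias=true` behave the same, while omitting the parameter keeps the default.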
Tests that send REST requests with bodies must today build up a separate
`String` containing the body contents as JSON. This is kinda ugly, and
also means we do not cover the other supported body formats in these
tests. This commit introduces a utility to allow construction of REST
requests with `XContent` bodies directly, and generalizes things to
choose randomly between JSON and other supported body formats.