This API can be quite heavy in large clusters, and might spam the
`MANAGEMENT` threadpool queue with work for clients that have long-since
given up. This commit adds some basic cancellability checks to reduce
the problem.
Backport of #96551 to 7.17
This commit adds a new test framework for configuring and orchestrating
test clusters for both Java and YAML REST testing. This will eventually
replace the existing "test-clusters" Gradle plugin and the build-time
cluster orchestration.
The check used to skip the parent lookup entirely relies on
ConcurrentHashMap#isEmpty(), which can return inconsistent results and
may therefore skip the cancellation of a task with a banned parent upon
registration. The check also doesn't seem to have a benefit, given that
all it avoids is a hash code computation.
Closes #88201
The following two failures happen rarely, but both fail in the same
`assertBusy` block. I don't have a clue why, and couldn't reproduce
them. Considering the number of checks in that block, maybe a larger
timeout is more suitable. (Also it seems from the test history, it is
not uncommon for those tests to take 2-3s, so every few thousand runs
hitting the 10s timeout seems likely, IMO!) Relates
https://github.com/elastic/elasticsearch/issues/88884,
https://github.com/elastic/elasticsearch/issues/88201
The get-indices API does some nontrivial work on the master and at high
index counts the response may be very large and could take a long time
to compute. Some clients will time out and retry if it takes too long.
Today this API is not properly cancellable which leads to a good deal of
wasted work in this situation, and the potentially-enormous response is
serialized on a transport worker thread.
With this commit we make the API cancellable and move the serialization
to a `MANAGEMENT` thread.
Backport of #87681
Relates #77466
When a gzip-encoded response is decompressed, the response should no longer
have a content-encoding header and content-length should be set to
"unknown". GzipDecompressingEntity correctly does this for the entity
but the response still reported the original response's content-encoding
and content-length headers.
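The intended header handling can be sketched as follows. This is a minimal illustration only: `DecompressedHeaders` and `afterDecompression` are hypothetical names, headers are modelled as a plain map rather than the client's real response type, and an "unknown" content length is represented by simply dropping the header.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the actual REST client code.
final class DecompressedHeaders {

    // Returns the headers a response should report once its gzip body
    // has been decompressed. Header names are matched case-insensitively.
    static Map<String, String> afterDecompression(Map<String, String> original) {
        Map<String, String> result = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        result.putAll(original);
        String encoding = result.get("Content-Encoding");
        if (encoding != null && encoding.trim().equalsIgnoreCase("gzip")) {
            // The body is no longer encoded, and its decompressed length
            // is not known up front, so both headers are dropped.
            result.remove("Content-Encoding");
            result.remove("Content-Length");
        }
        return result;
    }
}
```

Responses that were not gzip-encoded pass through unchanged.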
This method's name is trappy: it is easy to misinterpret it as returning
an instance from the elected master, but in fact it uses any
master-eligible node. If you want an instance from the elected master,
you have to use `getCurrentMasterNodeInstance()` instead.
This commit renames the method to clarify that it might not get an
instance from the elected master, and adds docs with cross-refs to help
developers choose the right method.
Start asserting on snapshots in progress only once they reach
a stable state (the first index has finished, the second has been
blocked).
* Move LARGE_SNAPSHOT_SETTINGS to AbstractSnapshotRestTestCase to be reused
* Check that test-index-2 is blocked
* Be more clear that the 2nd index is blocked
Fixes #79779
Relates #78507
Migrate to persistent cluster settings in Java tests
We are deprecating transient settings, therefore this
PR changes uses of transient cluster settings to
persistent cluster settings.
* Enforce common dependency configuration setup
* Tweak dependencies for plugin sql server tests
* Fix test runtime dependencies after disabling transitive support
* Do not create unused testCluster (#77581)
* Do not create unused testCluster
This avoids creating test clusters that are not required during the build.
We use lazy configuration here on testClusters and only instantiate them as they're
* Do not fail on run task (debug)
* Create more test cluster lazy
* Make more test cluster lazy
* Avoid creating unused testcluster
* Fix PluginBuildPlugin
* Fix disabling geo db download
* Fix cluster setup in repository-multi-version
* Polishing
* Fix issue with erratic Groovy logic
* Fix bwc tests
* Fix more bwcTests
* Fix more bwc tests
* Fix more bwc tests
* Fix more bwc tests
* Fix typo
* Minor polishing
* Fix rolling upgrade tests
* Fix cluster config in sql qa mixedcluster project
* Fix more bwc tests
* Clean up before review
* Document test cluster usage
* API polishing after review
provide useCluster(Provider) method to TestClusterAware
Ideally we take this a step further and realize those test clusters only on use.
But out of scope of this PR.
* Allow gradle provider as value for nonSystemProperties
* Some simplification on test configuration
* Fix typo in rest test config
* Fix more typos
* Fix another typo
* Fix more typos
* Fix runEqlCorrectnessNode run task and cluster configuration (#78249)
* Fix merge issue
* Fix bwc tests after backporting
Short-circuit the failure method when cancelled just like in the fail fast case.
Also, remove the special-case handling for ignoring unavailable, which asserts but swallows
exceptions in production, so that the task cancellation exception is no longer swallowed.
Closes #77980
* Return Total Result Count and Remaining Count in Get Snapshots Response (#76150)
Add total result count and remaining count to get snapshots response.
* Implement Numeric Offset Parameter in Get Snapshots API (#76233)
Add numeric offset parameter to this API.
Relates #74350
When backporting get-snapshots pagination I missed the cat snapshots API that needed adjustment
to be in line with how `8.x` works as well, leading to an NPE. Fixed by making the code the same
as in `8.x` and adding a test (that should be forward-ported to 8.x as well).
closes #76158
Found this to be the easiest fix, the alternative would have been to actually
wait for all snapshot meta threads to become blocked but that's kind of hacky.
closes #74743
Backport of the recently introduced snapshot pagination and scalability improvements listed below.
Merged as a single backport because the snapshot status API logic had massively diverged between `master` and `7.x`. With the work in the PRs below, the logic in `master` and `7.x` is once again closely aligned.
#72842, #73172, #73199, #73570, #73952, #74236, #74451 (this one is only partly applicable as it was mainly a change to master to align the `master` and `7.x` branches)
When libs/core was created, several classes were moved from server's
o.e.common package, but they were not moved to a new package. Split
packages need to go away long term, so that Elasticsearch can even think
about modularization. This commit moves all the classes under o.e.common
in core to o.e.core.
relates #73784
backport #73909
Same as #72644. This is a much longer-running action than a normal
get snapshots, so it should definitely be cancellable.
Parallelization for this action will be introduced in a separate PR.
If this runs needlessly for large repositories (especially in timeout/retry situations)
it's a significant memory+cpu hit => made it cancellable like we recently did for many
other endpoints.
back porting #72470 to 7.x
Extract usage of internal API from TestClustersPlugin and PluginBuildPlugin and
related plugins and build logic
This includes a refactoring of ElasticsearchDistribution to handle types
better, so that we can differentiate between Elasticsearch distribution
types supported in TestClustersPlugin and types only supported in
internal plugins.
It also introduces a set of internal versions of public plugins.
As part of this we also generate the plugin descriptors now.
As a follow-up we can actually move these publicly used classes into
an extra project (declared as an included build).
We keep LoggedExec and VersionProperties effectively public and add a workaround for RestTestBase.
* Warn users if security is implicitly disabled (#70114)
Elasticsearch has security features implicitly disabled by default for
Basic and Trial licenses, unless explicitly set in the configuration
file.
This may be good for onboarding, but it can also lead to unintentionally
insecure clusters.
This change introduces clear warnings when security features are
implicitly disabled:
- a warning header in each REST response if security is implicitly
disabled;
- a log message during cluster boot.
System index descriptors are used to describe system indices, which are
expected to change as new versions are developed. As part of this, the
descriptors had a minimum supported version field so that the contents
within that descriptor would not be applied if there were nodes older
than that version. However, this falls short of being able to
accurately describe what a system index should look like in a given
cluster where there are mixed node versions.
This change moves us towards being able to accurately describe and
know what the system index should look like. A system index is now
able to accept a list of the prior system index descriptor objects
so that clusters with mixed versions can select the appropriate
descriptor and ensure the index is created properly. As the node
versions change during a rolling upgrade, the cluster will then be
able to adapt the system index to the most recent version once all
master and data nodes have been upgraded.
Co-authored-by: Tim Vernum <tim@adjective.org>
Co-authored-by: Yang Wang <ywangd@gmail.com>
Backport of #71144
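The selection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `SystemIndexDescriptor` API: `DescriptorSelection`, `Descriptor`, and `select` are hypothetical names, and versions are modelled as plain integers.

```java
import java.util.List;

// Hypothetical sketch: names and integer versions are illustrative
// assumptions, not the real Elasticsearch classes.
final class DescriptorSelection {

    static final class Descriptor {
        final int minimumNodeVersion; // oldest node version that understands this descriptor
        final String label;

        Descriptor(int minimumNodeVersion, String label) {
            this.minimumNodeVersion = minimumNodeVersion;
            this.label = label;
        }
    }

    // Expects the current descriptor first, followed by prior descriptors
    // from newest to oldest. Returns the newest descriptor that every node
    // in the cluster can handle.
    static Descriptor select(List<Descriptor> newestFirst, int minNodeVersionInCluster) {
        for (Descriptor descriptor : newestFirst) {
            if (descriptor.minimumNodeVersion <= minNodeVersionInCluster) {
                return descriptor;
            }
        }
        throw new IllegalStateException(
            "no descriptor supports node version " + minNodeVersionInCluster);
    }
}
```

As the oldest node version in the cluster rises during a rolling upgrade, repeated calls pick progressively newer descriptors.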
Today by default the `MANAGEMENT` threadpool always permits 5 threads
even if the node has a single CPU, which unfairly prioritises management
activities on small nodes. With this commit we limit the size of this
threadpool to the number of processors if less than 5.
Relates #70435
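The sizing rule amounts to a simple clamp. The sketch below is illustrative only; `ManagementPoolSizing` and `maxThreads` are hypothetical names rather than the actual thread pool code.

```java
// Hypothetical helper illustrating the sizing rule described above.
final class ManagementPoolSizing {

    // Upper bound on MANAGEMENT threads mentioned in the commit message.
    static final int MAX_MANAGEMENT_THREADS = 5;

    // Returns the maximum pool size for a node with the given processor count.
    static int maxThreads(int processors) {
        if (processors < 1) {
            throw new IllegalArgumentException("processors must be positive, got " + processors);
        }
        return Math.min(MAX_MANAGEMENT_THREADS, processors);
    }
}
```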
Today when creating an internal test cluster, we allow the test to
supply the node settings that are applied. The extension point to
provide these settings has a single integer parameter, indicating the
index (zero-based) of the node being constructed. This allows the test
to make some decisions about the settings to return, but it is too
simplistic. For example, imagine a test that wants to provide a setting,
but some values for that setting are not valid on non-data nodes. Since
the only information the test has about the node being constructed is
its index, it does not have sufficient information to determine if the
node being constructed is a non-data node or not, since this is done by
the test framework externally by overriding the final settings with
specific settings that dictate the roles of the node. This commit changes
the test framework so that the test has information about what settings
are going to be overridden by the test framework after the test provides
its test-specific settings. This allows the test to make informed
decisions about what values it can return to the test framework.
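A test could use the extra information along these lines. This is a sketch under assumptions: settings are modelled as a plain string map instead of the framework's settings type, `example.data_only.enabled` and `example.node_label` are made-up setting names, and a node with no role override is assumed to be a data node.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a test-side settings provider.
final class NodeSettingsExample {

    // Made-up setting that is assumed to be valid only on data nodes.
    static final String DATA_ONLY_SETTING = "example.data_only.enabled";

    // otherSettings holds what the framework will apply on top of the
    // settings returned here, including the node's roles.
    static Map<String, String> nodeSettings(int nodeOrdinal, Map<String, String> otherSettings) {
        Map<String, String> settings = new HashMap<>();
        settings.put("example.node_label", "node-" + nodeOrdinal);
        // Assumption: roles arrive as a comma-separated list.
        String roles = otherSettings.getOrDefault("node.roles", "data").trim();
        boolean isDataNode = Arrays.asList(roles.split("\\s*,\\s*")).contains("data");
        if (isDataNode) {
            settings.put(DATA_ONLY_SETTING, "true");
        }
        return settings;
    }
}
```

With only the node index available, the test could not have made the `isDataNode` decision.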
This commit introduces system index types that will be used to
differentiate behavior. Previously system indices were all treated the
same regardless of whether they belonged to Elasticsearch, a stack
component, or one of our solutions. Upon further discussion and
analysis, this decision turned out not to be in the best interest of the
various teams, and a new type of system index was needed instead. These system
indices will be referred to as external system indices. Within external
system indices, an option exists for these indices to be managed by
Elasticsearch or to be managed by the external product.
In order to represent this within Elasticsearch, each system index will
have a type and this type will be used to control behavior.
Closes #67383
Backport of #68919
Opening a Lucene index that supports soft-deletes currently creates the liveDocs bitset eagerly. This requires scanning
the doc values to materialize the liveDocs bitset from the soft-delete doc values. In order for searchable snapshot shards
to be available for searches as quickly as possible (i.e. on recovery, or in case of FrozenEngine whenever a search comes
in), they should read as little as possible from the Lucene files.
This commit introduces a LazySoftDeletesDirectoryReaderWrapper, a variant of Lucene's
SoftDeletesDirectoryReaderWrapper that loads the liveDocs bitset lazily on first access. It is specially tailored to
ReadOnlyEngine / FrozenEngine as it only operates on non-NRT readers.
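The lazy-loading idea can be illustrated without Lucene. The sketch below is not the wrapper's implementation: `LazyLiveDocs` is a hypothetical class, `java.util.BitSet` stands in for Lucene's bitset types, and the supplier stands in for the scan over the soft-delete doc values.

```java
import java.util.BitSet;
import java.util.function.Supplier;

// Hypothetical sketch of deferring an expensive bitset computation
// until the first lookup.
final class LazyLiveDocs {

    private final Supplier<BitSet> loader; // stands in for scanning doc values
    private volatile BitSet liveDocs;      // null until first access

    LazyLiveDocs(Supplier<BitSet> loader) {
        this.loader = loader;
    }

    boolean isLive(int docId) {
        return materialize().get(docId);
    }

    // Double-checked locking so the loader runs at most once, even when
    // several search threads hit the reader at the same time.
    private BitSet materialize() {
        BitSet result = liveDocs;
        if (result == null) {
            synchronized (this) {
                result = liveDocs;
                if (result == null) {
                    result = loader.get();
                    liveDocs = result;
                }
            }
        }
        return result;
    }
}
```

Constructing the object reads nothing, which is the property that matters for getting a shard searchable quickly.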
A small followup to #67413 and #68965: the underlying actions of the
`GET /_cat/segments` API are now cancellable, so we may as well cancel
them if needed.
The response to an `IndicesSegmentsAction` might be large, perhaps 10s
of MBs of JSON, and today it is serialized on a transport thread. It
also might take so long to respond that the client times out, resulting
in the work needed to compute the response being wasted.
This commit introduces the `DispatchingRestToXContentListener` which
dispatches the work of serializing an `XContent` response to a
non-transport thread, and also makes `TransportBroadcastByNodeAction`
sensitive to the cancellability of its tasks.
It uses these two features to make the `RestIndicesSegmentsAction`
serialize its response on a `MANAGEMENT` thread, and to abort its work
more promptly if the client's channel is closed before the response is
sent.
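The combination of dispatching and cancellation checks can be sketched as follows. This is a simplified illustration rather than the `DispatchingRestToXContentListener` implementation: `DispatchingSerializer` is a hypothetical class, the executor stands in for the `MANAGEMENT` pool, the `BooleanSupplier` stands in for the task's cancellation state, and the JSON building does no escaping.

```java
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: serialize a potentially large response away from
// the calling (transport) thread, giving up early once cancelled.
final class DispatchingSerializer {

    static Future<String> serialize(ExecutorService managementPool,
                                    List<String> segments,
                                    BooleanSupplier isCancelled) {
        // submit() hands the work to the pool; the caller returns immediately.
        return managementPool.submit(() -> {
            StringBuilder json = new StringBuilder("[");
            for (int i = 0; i < segments.size(); i++) {
                // Check between items so abandoned requests stop promptly.
                if (isCancelled.getAsBoolean()) {
                    throw new CancellationException("client channel closed");
                }
                if (i > 0) {
                    json.append(',');
                }
                json.append('"').append(segments.get(i)).append('"');
            }
            return json.append(']').toString();
        });
    }
}
```

If the task is cancelled, `Future.get()` fails with an `ExecutionException` whose cause is the `CancellationException` thrown inside the worker.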
This PR expands the meaning of `include_global_state` for snapshots to include system indices. If `include_global_state` is `true` on creation, system indices will be included in the snapshot regardless of the contents of the `indices` field. If `include_global_state` is `true` on restoration, system indices will be restored (if included in the snapshot), regardless of the contents of the `indices` field. Index renaming is not applied to system indices, as system indices rely on their names matching certain patterns. If restored system indices are already present, they are automatically deleted prior to restoration from the snapshot to avoid conflicts.
This behavior can be overridden to an extent by including a new field in the snapshot creation or restoration call, `feature_states`, which contains an array of strings indicating the "feature" for which system indices should be snapshotted or restored. For example, this call will only restore the `watcher` and `security` system indices (in addition to `index_1`):
```
POST /_snapshot/my_repository/snapshot_2/_restore
{
  "indices": "index_1",
  "include_global_state": true,
  "feature_states": ["watcher", "security"]
}
```
If `feature_states` is present, the system indices associated with those features will be snapshotted or restored regardless of the value of `include_global_state`. All system indices can be omitted by providing a special value of `none` (`"feature_states": ["none"]`), or included by omitting the field or explicitly providing an empty array (`"feature_states": []`), similar to the `indices` field.
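The resolution rules above can be summarised in a short sketch. It reflects one reading of this description, not the actual implementation: `FeatureStateResolver` and `resolve` are hypothetical names, validation of unknown feature names is omitted, and an omitted or empty `feature_states` is assumed to select nothing when `include_global_state` is `false`.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of how feature_states and include_global_state
// combine to pick the features whose system indices are handled.
final class FeatureStateResolver {

    static Set<String> resolve(List<String> requested,
                               boolean includeGlobalState,
                               Set<String> availableFeatures) {
        // Omitted or empty: fall back to include_global_state.
        if (requested == null || requested.isEmpty()) {
            return includeGlobalState ? new TreeSet<>(availableFeatures) : new TreeSet<>();
        }
        // The special value "none" excludes all system indices.
        if (requested.size() == 1 && "none".equals(requested.get(0))) {
            return new TreeSet<>();
        }
        // Explicit features win regardless of include_global_state.
        return new TreeSet<>(requested);
    }
}
```

For the restore call shown earlier, this yields `security` and `watcher`, so only those features' system indices are restored alongside `index_1`.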
The list of currently available features can be retrieved via a new "Get Snapshottable Features" API:
```
GET /_snapshottable_features
```
which returns a response of the form:
```
{
  "features": [
    {
      "name": "tasks",
      "description": "Manages task results"
    },
    {
      "name": "kibana",
      "description": "Manages Kibana configuration and reports"
    }
  ]
}
```
Features currently map one-to-one with `SystemIndexPlugin`s, but this should be considered an implementation detail. The Get Snapshottable Features API and snapshot creation rely upon all relevant plugins being installed on the master node.
Further, the list of feature states included in a given snapshot is exposed by the Get Snapshot API, which now includes a new field, `feature_states`, which contains a list of the feature states and their associated system indices which are included in the snapshot. All system indices in feature states are also included in the `indices` array for backwards compatibility, although explicitly requesting system indices included in a feature state is deprecated. For example, an excerpt from the Get Snapshot API showing `feature_states`:
```
"feature_states": [
{
"feature_name": "tasks",
"indices": [
".tasks"
]
}
],
"indices": [
".tasks",
"test1",
"test2"
]
```
Co-authored-by: William Brafford <william.brafford@elastic.co>