This commit adds a new set of classes that compute a peer recovery plan
based on the source files, the target files, and the available snapshots.
When possible, the plan maximizes the number of files reused from a
snapshot. Only repositories with the `use_for_peer_recovery` setting set
to `true` are considered.
It adds a new recovery setting, `indices.recovery.use_snapshots`.
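A minimal sketch of how this might look in `elasticsearch.yml` (the value shown is illustrative; `use_for_peer_recovery` is a repository setting supplied when the repository is registered, not in this file):

```yaml
# elasticsearch.yml -- opt in to snapshot-based peer recovery
indices.recovery.use_snapshots: true
# Note: only repositories registered with `use_for_peer_recovery: true`
# are considered when computing the recovery plan.
```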
Relates #73496
In the upcoming Lucene 9 release, `indices.query.bool.max_clause_count` is
going to apply to the entire query tree rather than per `bool` query. In order
to avoid breaks, the limit has been bumped from 1024 to 4096.
The semantics will effectively change when we upgrade to Lucene 9; this PR
is only about agreeing on a migration strategy and documenting this change.
To avoid further breaks, I am leaning towards keeping the current setting name
even though it contains `bool`. I believe that it still makes sense given that
`bool` queries are typically the main contributors to high numbers of clauses.
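For illustration, a node could pin the limit explicitly in `elasticsearch.yml` (redundant with the new default, shown only as a sketch):

```yaml
# elasticsearch.yml -- maximum number of clauses; once Lucene 9 lands,
# this budget applies to the whole query tree, not per `bool` query
indices.query.bool.max_clause_count: 4096
```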
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Today the docs for remote cluster connections use `ping_schedule` fairly
liberally, and don't mention that you should prefer TCP keepalives
wherever possible. This commit reduces the use of this setting in the
examples and adjusts the description of the setting to include a note
about TCP keepalives instead.
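As a sketch of the preferred approach (setting names from the network settings docs; values illustrative):

```yaml
# elasticsearch.yml -- prefer OS-level TCP keepalives on transport connections
network.tcp.keep_alive: true
network.tcp.keep_idle: 300      # seconds of idle time before the first probe

# ...rather than application-level pings on each remote connection:
# cluster.remote.my_remote.transport.ping_schedule: 30s
```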
This commit is related to #73497. It adds a new setting,
`transport.compression_scheme`, which allows the user to configure LZ4 or
DEFLATE as the transport compression. Additionally, it modifies
`transport.compress` to support the value `indexing_data`. When this
setting is set to `indexing_data`, only messages which are primarily
composed of raw source data will be compressed: bulk, operations-based
recovery, and shard changes messages.
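A sketch of the two settings together in `elasticsearch.yml` (assuming the lowercase value names):

```yaml
# elasticsearch.yml -- compress only messages dominated by raw source data,
# using LZ4 instead of the DEFLATE default
transport.compress: indexing_data
transport.compression_scheme: lz4
```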
Today if sending file chunks is CPU-bound (e.g. when using compression)
then we tend to concentrate all that work onto relatively few threads,
even if `indices.recovery.max_concurrent_file_chunks` is increased. With
this commit we fork the transmission of each chunk onto its own thread
so that the CPU-bound work can happen in parallel.
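Since each chunk is now sent on its own thread, raising the concurrency setting can translate into real CPU parallelism. A sketch (value illustrative):

```yaml
# elasticsearch.yml -- allow more file chunks in flight per recovery,
# each compressed and transmitted on its own thread
indices.recovery.max_concurrent_file_chunks: 4
```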
In #55805, we added a setting to allow single data node clusters to
respect the high watermark. In #73733 we added the related deprecations.
This commit ensures the only valid value for the setting is true and
adds deprecations if the setting is set. The setting will be removed
in a future release.
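For reference, the one remaining valid configuration looks like this (full setting name per #55805; setting it at all is deprecated):

```yaml
# elasticsearch.yml -- `true` is now the only accepted value, and even
# setting it explicitly emits a deprecation warning
cluster.routing.allocation.disk.watermark.enable_for_single_data_node: true
```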
Co-authored-by: David Turner <david.turner@elastic.co>
* Add new thread pool for critical operations
* Split critical thread pool into read and write
* Add POJO to hold thread pool names
* Add tests for critical thread pools
* Add thread pools to data streams
* Update settings for security plugin
* Retrieve ExecutorSelector from SystemIndices where possible
* Use a singleton ExecutorSelector
Adds a new snapshot meta pool that is used to speed up the get snapshots API
by making `SnapshotInfo` load in parallel. Also use this pool to load
`RepositoryData`.
A follow-up to this would expand the use of this pool to the snapshot status
API and make it run in parallel as well.
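Assuming the pool is named `snapshot_meta` and is configurable like other scaling pools, its size could be capped like so (a sketch, not a recommendation):

```yaml
# elasticsearch.yml -- cap the scaling pool used for SnapshotInfo /
# RepositoryData loads
thread_pool.snapshot_meta.max: 10
```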
If a node is partitioned away from the rest of the cluster then the
`ClusterFormationFailureHelper` periodically reports that it cannot
discover the expected collection of nodes, but does not indicate why. To
prove it's a connectivity problem, users today must restart the node
with `DEBUG` logging on `org.elasticsearch.discovery.PeerFinder` to see
further details.
With this commit we log messages at `WARN` level if the node remains
disconnected for longer than a configurable timeout, which defaults to 5
minutes.
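For reference, the old workaround amounted to something like the following, which the new WARN-level message makes unnecessary:

```yaml
# elasticsearch.yml -- previously required to see why discovery failed
logger.org.elasticsearch.discovery.PeerFinder: DEBUG
```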
Relates #72968
Dedicated frozen nodes can survive with less headroom than other data nodes.
This commit introduces a separate flood stage threshold for frozen nodes, as
well as an accompanying `max_headroom` setting that caps the amount of free
space necessary on them.
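A sketch of the new pair of settings in `elasticsearch.yml` (values illustrative, not necessarily the shipped defaults):

```yaml
# elasticsearch.yml -- flood stage for dedicated frozen nodes
cluster.routing.allocation.disk.watermark.flood_stage.frozen: 95%
cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom: 20GB
```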
Relates #71844
Frozen indices (partial searchable snapshots) require less heap per shard,
so the limit can be raised for those. We pick 3000 frozen shards per frozen
data node, since we think 2000 is reasonable to use in production, which
leaves some headroom.
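Assuming the limit is exposed as `cluster.max_shards_per_node.frozen`, the new default corresponds to:

```yaml
# elasticsearch.yml -- frozen shards allowed per frozen data node
cluster.max_shards_per_node.frozen: 3000
```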
Relates #71042 and #34021
* Removing security overview and condensing.
* Adding new security file.
* Minor changes.
* Removing link to pass build.
* Adding minimal security page.
* Changes to intro.
* Add basic and basic + http configurations.
* Lots of changes, removed files, and redirects.
* Moving some AD and LDAP sections, plus more redirects.
* Redirects for SAML.
* Updating snippet languages and redirects.
* Adding another SAML redirect.
* Hopefully fixing the ci/2 error.
* Fixing another broken link for SAML.
* Adding what's next sections and some cleanup.
* Removes both security tutorials from the TOC.
* Adding redirect for removed tutorial.
* Add graphic for Elastic Security layers.
* Incorporating reviewer feedback.
* Update x-pack/docs/en/security/securing-communications/security-basic-setup.asciidoc
Co-authored-by: Ioannis Kakavas <ikakavas@protonmail.com>
* Update x-pack/docs/en/security/securing-communications/security-minimal-setup.asciidoc
Co-authored-by: Yang Wang <ywangd@gmail.com>
* Update x-pack/docs/en/security/securing-communications/security-basic-setup.asciidoc
Co-authored-by: Yang Wang <ywangd@gmail.com>
* Update x-pack/docs/en/security/index.asciidoc
Co-authored-by: Ioannis Kakavas <ikakavas@protonmail.com>
* Update x-pack/docs/en/security/securing-communications/security-basic-setup-https.asciidoc
Co-authored-by: Ioannis Kakavas <ikakavas@protonmail.com>
* Apply suggestions from code review
Co-authored-by: Ioannis Kakavas <ikakavas@protonmail.com>
Co-authored-by: Yang Wang <ywangd@gmail.com>
* Additional changes from review feedback.
* Incorporating reviewer feedback.
* Incorporating more reviewer feedback.
* Clarify that TLS is for authenticating nodes
Co-authored-by: Tim Vernum <tim@adjective.org>
* Clarify security between nodes
Co-authored-by: Tim Vernum <tim@adjective.org>
* Clarify that TLS is between nodes
Co-authored-by: Tim Vernum <tim@adjective.org>
* Update title for configuring Kibana with a password
Co-authored-by: Tim Vernum <tim@adjective.org>
* Move section for enabling passwords between Kibana and ES to minimal security.
* Add section for transport description, plus incorporate more reviewer feedback.
* Moving operator privileges lower in the navigation.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Ioannis Kakavas <ikakavas@protonmail.com>
Co-authored-by: Yang Wang <ywangd@gmail.com>
Co-authored-by: Tim Vernum <tim@adjective.org>
We document that master nodes should have a persistent data path but
it's a bit hard to understand that this is what the docs are saying and
we don't really say why it's important. This commit clarifies this
paragraph.
Relates 49d0f3406c
Today the docs on node roles say that you shouldn't use dedicated
masters for heavy requests such as indexing and searching, but as per
the "designing for resilience" docs this guidance applies to all client
requests. This commit generalises the node roles docs slightly to
clarify this.
Relates #70435
This commit addresses two aspects of the description in the docs of
configuring a local node to be a remote cluster client. First, the
documentation was referring to the legacy setting for configuring a
remote cluster client. Secondly, we clarify that additional features,
not only cross-cluster search, have requirements around the usage of the
`remote_cluster_client` role.
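A sketch of the non-legacy configuration the docs now describe (the other roles listed are illustrative):

```yaml
# elasticsearch.yml -- grant the role explicitly via node.roles
node.roles: [ data, ingest, remote_cluster_client ]
```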
Co-authored-by: Przemysław Witek <przemyslaw.witek@elastic.co>
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Today if an index is set to `auto_expand_replicas: N-all` then we will
try to create a shard copy on every node that matches the applicable
allocation filters. This conflicts with shard allocation awareness and
the same-host allocation decider if there is an uneven distribution of
nodes across zones or hosts, since these deciders prevent shard copies
from being allocated unevenly and may therefore leave some unassigned
shards.
The point of these two deciders is to improve resilience given a limited
number of shard copies, but there is no need for this behaviour when the
number of shard copies is not limited, so this commit suppresses them in
that case.
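For illustration, the behaviour applies to index settings such as the following (shown as YAML for brevity; normally part of a create-index request body):

```yaml
# expand replicas to every eligible node; with this change the awareness
# and same-host deciders no longer hold copies back in this case
index.auto_expand_replicas: 0-all
```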
Closes #54151
Closes #2869
This commit adds the `data_frozen` node role as part of the formalization of data tiers. It also
adds the `"frozen"` phase to ILM, currently allowing the same actions as the existing cold phase.
The frozen phase is intended to be used for data even less frequently searched than the cold phase,
and will eventually be loosely tied to data using partial searchable snapshots (as opposed to full
searchable snapshots in the cold phase).
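A dedicated frozen tier node would then be configured along these lines (the single-role choice is illustrative):

```yaml
# elasticsearch.yml -- a node dedicated to the frozen tier
node.roles: [ data_frozen ]
```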
Relates to #60848
This commit sets the recovery rate for dedicated cold nodes. The goal here is
to enhance the performance of recovery in a dedicated cold tier, where
we expect such nodes to be predominantly using searchable snapshots to
back the indices located on them. This commit follows a simple approach
where we increase the recovery rate as a function of the node size, for
nodes that appear to be dedicated cold nodes.
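An explicit setting still takes precedence over the size-derived default; assuming the underlying setting is `indices.recovery.max_bytes_per_sec`, an override would look like:

```yaml
# elasticsearch.yml -- fixed recovery rate, bypassing the node-size heuristic
indices.recovery.max_bytes_per_sec: 250mb
```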
This commit removes the following deprecated settings in v8:
- `gateway.expected_nodes`
- `gateway.expected_master_nodes`
- `gateway.recover_after_nodes`
- `gateway.recover_after_master_nodes`
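The data-node-based variants are assumed to remain available as replacements, e.g.:

```yaml
# elasticsearch.yml -- recover once enough data nodes have joined
gateway.expected_data_nodes: 3
gateway.recover_after_data_nodes: 2
```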
Co-authored-by: ShawnLi1014 <shawnli1014@gmail.com>
* [DOCS] Minor rewording for HTTP settings.
* Revert "[DOCS] Minor rewording for HTTP settings."
This reverts commit 9a831adca6.
* Adds advanced wording to HTTP & transport settings.
Today's network config docs are split into "Network", "HTTP" and
"Transport" pages, with unclear relationships between them. We often
encounter users with weird configs that indicate they don't really
understand how these settings all relate. In fact these pages are all
very interrelated, and the HTTP and Transport pages are almost all only
for advanced users. This commit brings these docs into a single page and
rewords some things to try and guide users away from the advanced
settings unless their configuration needs all the extra complexity.
It also adds a section entitled "Binding and publishing" which clarifies
the meanings of the `bind_host` and `publish_host` parameters. This is
also a common source of confusion amongst users.
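A sketch of the two styles (addresses illustrative):

```yaml
# elasticsearch.yml -- the simple case: one setting drives both defaults
network.host: 192.168.1.10

# the advanced case: bind and publish addresses split explicitly
# network.bind_host: 0.0.0.0
# network.publish_host: 192.168.1.10
```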
It also clarifies that many of these settings accept a list of
addresses, and warns that this may not be what you want.
Closes #67956.
Co-authored-by: Adam Locke <adam.locke@elastic.co>
Today the discovery phase has a short 1-second timeout for handshaking
with a remote node after connecting, which allows it to quickly move on
and retry in the case of connecting to something that doesn't respond
straight away (e.g. it isn't an Elasticsearch node).
This short timeout was necessary when the component was first developed
because each connection attempt would block a thread. Since #42636 the
connection attempt is now nonblocking so we can apply a more relaxed
timeout.
If transport security is enabled then our handshake timeout applies to
the TLS handshake followed by the Elasticsearch handshake. If the TLS
handshake alone takes over a second then the whole handshake times out
with a `ConnectTransportException`, but this does not tell us which of
the two individual handshakes took so long.
TLS handshakes have their own 10-second timeout, which if reached yields
a `SslHandshakeTimeoutException` that allows us to distinguish a problem
at the TLS level from one at the Elasticsearch level. Therefore this
commit extends the discovery probe timeouts.
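Assuming the probe timeouts are exposed as the settings below, the relaxed values would be configured along these lines (values illustrative):

```yaml
# elasticsearch.yml -- discovery probe timeouts, now safe to relax since
# connection attempts no longer block a thread
discovery.probe.connect_timeout: 30s
discovery.probe.handshake_timeout: 30s
```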
Limit the depth of nested bool queries
Introduce a new node level setting `indices.query.bool.max_nested_depth`
that controls the depth of nested bool queries.
Throw an error if the nested depth of a bool query exceeds the maximum
allowed depth.
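A sketch of the new node-level setting (value illustrative, not necessarily the default):

```yaml
# elasticsearch.yml -- reject bool queries nested deeper than this
indices.query.bool.max_nested_depth: 20
```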
Closes #55303
This makes sure that we only serve a hit from the request cache if it
was built using the same mapping, and that the same mapping is used for
the entire "query phase" of the search.
Closes #62033