If you start up a freshly-unpacked Elasticsearch tarball, security
auto-configuration will set `http.host: 0.0.0.0` in `elasticsearch.yml`.
This overrides the documented default behaviour, which is to fall back
to `network.host`, which itself defaults to `localhost`. This commit
adds a note to the docs about this.
Today the docs on balancing settings describe what the settings all do
but offer little guidance about how to configure them. This commit adds
some extra detail to avoid some common misunderstandings and reorders
the docs a little so that more commonly-adjusted settings are mentioned
earlier.
The docs for forced awareness indicate that no replicas will be assigned
until all zones are available, which is definitely undesirable and also
not the actual behaviour. This commit fixes the wording to match what
really happens.
Closes #104777
There's a note in the docs saying we only consider shard count and not
disk usage which is no longer true. This commit fixes the note to
reflect today's implementation.
I have struggled several times to find the docs about restoring from a
snapshot if a quorum cannot be found. That info is on the discovery
troubleshooting page, but I keep expecting to find it somewhere like
the quorums or voting docs pages instead. This commit adds links from
those pages to the troubleshooting page.
In the note on forming a single cluster we describe what to do if
extra clusters form inadvertently, but we can be more explicit about
what to do with `cluster.initial_master_nodes` in these instructions.
This commit adds the missing details.
**Problem:**
For historical reasons, source files for the Elasticsearch Guide's security, watcher, and Logstash API docs are housed in the `x-pack/docs` directory. This can confuse new contributors who expect Elasticsearch Guide docs to be located in `docs/reference`.
**Solution:**
- Move the security, watcher, and Logstash API doc source files to the `docs/reference` directory
- Update doc snippet tests to use security
Rel: https://github.com/elastic/platform-docs-team/issues/208
* [DOCS] Add 'Troubleshooting an unstable cluster' to nav
* Adjust docs links in code
* Revert "Adjust docs links in code"
This reverts commit f3846b1d78.
---------
Co-authored-by: David Turner <david.turner@elastic.co>
* [DOCS] Remote cluster troubleshooting guide
* Fix test failures
* Apply suggestions from code review
Co-authored-by: Yang Wang <ywangd@gmail.com>
* Review feedback
* Group issues under 'common' and 'API key'
* Apply suggestions from code review
Co-authored-by: Yang Wang <ywangd@gmail.com>
---------
Co-authored-by: Yang Wang <ywangd@gmail.com>
* [DOCS] Expand the step that enables the remote cluster server
* Update docs/reference/modules/cluster/remote-clusters-api-key.asciidoc
* Reword
* Reword
* [DOCS] Remote cluster migration guide
* Review feedback
* Clarify that any extra local privileges will be suppressed by the cross-cluster API key’s privileges
* New docs structure for remote clusters
* Fix broken cross-book link errors
* More broken cross-book link errors
* Remove redirects for new pages
* Link to generic remote cluster docs instead
* Drop 'API' from the abbreviated title
* Add 'Establish trust with a remote cluster' section
* Restructure 'Establish trust' section into Prerequisite/local/remote instructions
* Add 'Configure roles and users' section
* Add 'Connect to a remote cluster' section
* Move version compatibility to prerequisites
* Fix test errors
* Incorporate review feedback
* Mention version 8.10 or later in the intro for API keys
* Add license prerequisite
This commit enables concurrent search execution in the DFS phase, which is going to improve resource usage as well as performance of knn queries which benefit from both concurrent rewrite and collection.
We will enable concurrent execution for the query phase in a subsequent commit. While this commit does not introduce parallelism for the query phase, it does offload sequential computation to the newly introduced executor. This applies both when a single slice needs to be searched and when a specific request does not support concurrency (currently only the DFS phase supports it, regardless of the request). The only case in which sequential collection is not offloaded is when the request includes aggregations that don't support offloading: composite, nested and cardinality. Their post-collection method must be executed in the same thread as the collection, or we'll trip a Lucene assertion that verifies that doc values are pulled and consumed from the same thread.
## Technical details
This commit introduces a secondary executor, used exclusively to execute the concurrent bits of search. The search threads are still the ones that coordinate the search (where the caller search will originate from), but the actual work will be offloaded to the newly introduced executor.
We are offloading not only parallel execution but also sequential execution, to make the workload more predictable, as it would be surprising to have bits of search executed in either of the two thread pools. Splitting the work would also make it possible to suddenly run a larger number of heavy operations overall (some in the caller thread and some in the separate threads), which could overload the system as well as make sizing of thread pools more difficult.
Note that fetch, together with other actions, is still executed in the search thread pool. This commit does not turn the search thread pool into a coordinating-only thread pool. It does so only for the `IndexSearcher#search` operation itself, which is nevertheless a big portion of the different phases of search API execution.
Given that the searcher blocks waiting for all tasks to be completed, we take a simple approach of introducing a thread pool executor that has the same size as the existing search thread pool but relies on an unbounded queue. This simplifies handling of thread pool queue and rejections. In fact, we'd like to guarantee that the secondary thread pool won't reject, and delegate queuing entirely to the search thread pool which is the entry point for every search operation anyway. The principle behind this is that if you got a slot in the search thread pool, you should be able to complete your search, and rather quickly.
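As a rough illustration of the sizing and queueing choice described above, here is a minimal sketch using plain `java.util.concurrent` classes. The class name `SearchWorkerPoolSketch` and the `searchPoolSize` parameter are hypothetical and are not taken from the actual change; the real executor is built through Elasticsearch's own thread pool infrastructure.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a fixed-size pool backed by an unbounded queue.
final class SearchWorkerPoolSketch {

    // searchPoolSize is assumed to be the size of the existing search thread pool.
    static ThreadPoolExecutor create(int searchPoolSize) {
        return new ThreadPoolExecutor(
            searchPoolSize,               // core size matches the search pool
            searchPoolSize,               // max size is the same, so the pool never grows
            0L, TimeUnit.MILLISECONDS,    // keep-alive is irrelevant for a fixed-size pool
            new LinkedBlockingQueue<>()   // unbounded queue: work is queued rather than rejected
        );
    }
}
```

With an unbounded `LinkedBlockingQueue`, a running pool accepts every submitted task, so back-pressure and rejections are left entirely to the search thread pool in front of it.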
As part of this commit we are also introducing the ability to cancel tasks that have not started yet, so that if any task throws an exception, other tasks are prevented from starting needless computation.
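The cancellation behaviour can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in this commit: `CancellableTaskRunner` and `invokeAll` are hypothetical names, and a shared flag stands in for whatever mechanism the real implementation uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: run tasks on an executor, block until all are done,
// and skip any task that has not started once another task has failed.
final class CancellableTaskRunner {

    static <T> List<T> invokeAll(Executor executor, List<Callable<T>> tasks) {
        AtomicBoolean cancelled = new AtomicBoolean(false);
        List<FutureTask<T>> futures = new ArrayList<>();
        for (Callable<T> task : tasks) {
            futures.add(new FutureTask<T>(() -> {
                if (cancelled.get()) {
                    return null; // not started yet, so skip the needless work
                }
                try {
                    return task.call();
                } catch (Exception e) {
                    cancelled.set(true); // prevent tasks that have not started from running
                    throw e;
                }
            }));
        }
        for (FutureTask<T> future : futures) {
            executor.execute(future);
        }
        // The caller blocks until every task has either completed or been skipped.
        List<T> results = new ArrayList<>();
        RuntimeException failure = null;
        for (FutureTask<T> future : futures) {
            try {
                results.add(future.get());
            } catch (ExecutionException e) {
                if (failure == null) {
                    failure = new RuntimeException(e.getCause());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        if (failure != null) {
            throw failure;
        }
        return results;
    }
}
```

Tasks that are already running when another task fails are left to finish in this sketch; only tasks that have not started are skipped.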
Relates to #80693
Relates to #90700
Today by default the `SEARCH_COORDINATION` pool is sized at half the
allocated processors, or five if there are more than ten CPUs. Yet, if
we scale up a node to have more than ten CPUs, we probably want to scale
up the number of search coordination threads to match. This commit
removes the limit of five threads.
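
To make the arithmetic concrete, here is a small sketch of the sizing before and after this change. The class and method names are hypothetical, and rounding half the processors upwards (with a floor of one thread) is an assumption rather than something stated above.

```java
// Hypothetical sketch of the SEARCH_COORDINATION pool sizing rule.
final class SearchCoordinationSizing {

    // Half the allocated processors, rounded up (rounding is an assumption).
    static int halfProcessors(int allocatedProcessors) {
        return (allocatedProcessors + 1) / 2;
    }

    // Before: half the processors, capped at five threads.
    static int sizeBefore(int allocatedProcessors) {
        return Math.max(1, Math.min(5, halfProcessors(allocatedProcessors)));
    }

    // After: half the processors, with no upper limit.
    static int sizeAfter(int allocatedProcessors) {
        return Math.max(1, halfProcessors(allocatedProcessors));
    }
}
```

Under these assumptions a 16-CPU node goes from 5 search coordination threads to 8, while a 4-CPU node keeps 2 threads either way.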
Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this
commit duplicates the troubleshooting info across both places.
A completely idle `transport_worker` thread is reported as `0.0%` idle,
which is confusing. Moreover the docs on the network threading model do
not reflect the changes made in #90482. This commit fixes both of those
things.
Suggest calling `jstack` every 15s to ensure that at least one capture
shows a stuck thread. Also adds a link to this guide to the list on the
troubleshooting overview page.
Explains why you should remove `cluster.initial_master_nodes`, and
rewords some of the other sections a little for (subjectively) improved
readability.
* Fixes CORS headers needed by Elastic clients
Updates the default value for the `http.cors.allow-headers`
setting to include headers used by Elastic client libraries.
Also adds the `access-control-expose-headers` header to responses to
CORS requests so that clients can successfully perform their product
check.
In #92309 we aligned the size of the `search` and the `get` thread
pools, but the docs still contain the prior `get` thread pool size.
This commit aligns the docs too.
Relates #92309
We sometimes see a `ShardLockObtainFailedException` when a shard failed
to shut down as fast as we expected, often because a node left and
rejoined the cluster. Sometimes this is because it was held open by
ongoing scrolls or PITs, but other times it may be because the shutdown
process itself is too slow. With this commit we add the ability to
capture and log a thread dump at the time of the failure to give us more
information about where the shutdown process might be running slowly.
Relates #93226
If debug logging is enabled then the lag detector will capture and
report the hot threads of a lagging node. In some cases the resulting
log message can be very large, exceeding 10kiB, which means it is
truncated in most logging setups. The relevant thread(s) may be waiting
on I/O, which is not considered "hot" and therefore may not appear in
the first 10kiB.
This commit adjusts this logging mechanism to split the message into
chunks of size at most 2kiB (after compression and base64-encoding) to
ensure that the entire hot threads output can be faithfully
reconstructed from these logs.
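
The compress, encode, and split steps, and the reverse reconstruction, can be sketched as below. This is only an illustration under stated assumptions: `ChunkedLogSketch`, `toChunks` and `fromChunks` are hypothetical names, gzip is assumed as the compression format, and the sketch ignores how each chunk is labelled in the actual log output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: compress a large message, base64-encode it, and split
// the encoded text into chunks of at most 2 KiB that can be logged separately.
final class ChunkedLogSketch {

    static final int MAX_CHUNK_CHARS = 2048; // 2 KiB of base64 (ASCII) text per chunk

    static List<String> toChunks(String message) {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(message.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String encoded = Base64.getEncoder().encodeToString(compressed.toByteArray());
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < encoded.length(); start += MAX_CHUNK_CHARS) {
            int end = Math.min(encoded.length(), start + MAX_CHUNK_CHARS);
            chunks.add(encoded.substring(start, end));
        }
        return chunks;
    }

    // Reverses toChunks: concatenate the chunks in order, decode, then decompress.
    static String fromChunks(List<String> chunks) {
        byte[] compressed = Base64.getDecoder().decode(String.join("", chunks));
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the split happens after compression and encoding, no single chunk exceeds the size limit, and concatenating the chunks in order recovers the full message.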
Closes #88126
Today to troubleshoot an unstable cluster we ask the users to parse the
rather complex `node-join` and `node-left` messages emitted by the
`MasterService`. These messages may refer to many nodes, may be
truncated, and are generally pretty hard to work with.
With this commit we start to emit a simplified log message about each
node added and removed. It also renames the respective executor classes:
- `JoinTaskExecutor` -> `NodeJoinExecutor`
- `NodeRemovalClusterStateTaskExecutor` -> `NodeLeftExecutor`
This brings their names in line with each other, and the messages that
they emit, whilst preserving the older `node-join` and `node-left`
terminology as reported by the `MasterService`.
Finally, it updates the troubleshooting logs to reflect these new and
simplified logs.
Relates #92741