We duplicated these docs in order to avoid breaking older links, but
this makes it confusing and hard to link to the right copy of the
information. This commit removes the duplication by replacing the docs
at the old locations with stubs that link to the new locations.
A misplaced `//end::` tag meant that the docs added in #112271 are only
included in the page on fault detection and not the equivalent
troubleshooting docs. This commit fixes the problem.
Basically the same as for nodes that leave the cluster with reason
`disconnected`, except that these disconnects don't involve the master
so don't cause any nodes to leave the cluster.
Clarify that it's best to analyse the captures alongside the node logs,
and spell out in a bit more detail how to use packet captures and logs
to pin down the cause of a `disconnected` node.
* [DOCS] Add 'Troubleshooting an unstable cluster' to nav
* Adjust docs links in code
* Revert "Adjust docs links in code"
This reverts commit f3846b1d78.
---------
Co-authored-by: David Turner <david.turner@elastic.co>
Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this
commit duplicates the troubleshooting info across both places.
Suggest calling `jstack` every 15s to ensure that at least one capture
shows a stuck thread. Also adds a link to this guide to the list on the
troubleshooting overview page.
We sometimes see a `ShardLockObtainFailedException` when a shard failed
to shut down as fast as we expected, often because a node left and
rejoined the cluster. Sometimes this is because it was held open by
ongoing scrolls or PITs, but other times it may be because the shutdown
process itself is too slow. With this commit we add the ability to
capture and log a thread dump at the time of the failure to give us more
information about where the shutdown process might be running slowly.
Relates #93226
If debug logging is enabled then the lag detector will capture and
report the hot threads of a lagging node. In some cases the resulting
log message can be very large, exceeding 10kiB, which means it is
truncated in most logging setups. The relevant thread(s) may be waiting
on I/O, which is not considered "hot" and therefore may not appear in
the first 10kiB.
This commit adjusts this logging mechanism to split the message into
chunks of size at most 2kiB (after compression and base64-encoding) to
ensure that the entire hot threads output can be faithfully
reconstructed from these logs.
Closes#88126
Today to troubleshoot an unstable cluster we ask the users to parse the
rather complex `node-join` and `node-left` messages emitted by the
`MasterService`. These messages may refer to many nodes, may be
truncated, and are generally pretty hard to work with.
With this commit we start to emit a simplified log message about each
node added and removed. It also renames the respective executor classes:
- `JoinTaskExecutor` -> `NodeJoinExecutor`
- `NodeRemovalClusterStateTaskExecutor` -> `NodeLeftExecutor`
This brings their names in line with each other, and the messages that
they emit, whilst preserving the older `node-join` and `node-left`
terminology as reported by the `MasterService`.
Finally, it updates the troubleshooting logs to reflect these new and
simplified logs.
Relates #92741
This commit overhauls the documentation of discovery and cluster coordination,
removing mention of the Zen Discovery module and replacing it with docs for the
new cluster coordination mechanism introduced in 7.0.
Relates #32006