Commit graph

17 commits

Author SHA1 Message Date
David Turner
9387ce3357
Deduplicate unstable-cluster troubleshooting docs (#112333)
We duplicated these docs in order to avoid breaking older links, but
this makes it confusing and hard to link to the right copy of the
information. This commit removes the duplication by replacing the docs
at the old locations with stubs that link to the new locations.
2024-08-29 13:16:37 +01:00
David Turner
59a42ed41b
Include network disconnect info in troubleshooting docs (#112323)
A misplaced `//end::` tag meant that the docs added in #112271 are only
included in the page on fault detection and not the equivalent
troubleshooting docs. This commit fixes the problem.
2024-08-29 15:03:13 +10:00
David Turner
42d650b9bb
Add docs for troubleshooting network disconnects (#112271)
Basically the same as for nodes that leave the cluster with reason
`disconnected`, except that these disconnects don't involve the master
so don't cause any nodes to leave the cluster.
2024-08-28 18:59:11 +10:00
David Turner
e5fd63bbb8
More detail around packet captures (#111835)
Clarify that it's best to analyse the captures alongside the node logs,
and spell out in a bit more detail how to use packet captures and logs
to pin down the cause of a `disconnected` node.
2024-08-13 21:55:38 +01:00
Abdon Pijpelink
af76a3a436
[DOCS] Add 'Troubleshooting an unstable cluster' to nav (#99287)
* [DOCS] Add 'Troubleshooting an unstable cluster' to nav

* Adjust docs links in code

* Revert "Adjust docs links in code"

This reverts commit f3846b1d78.

---------

Co-authored-by: David Turner <david.turner@elastic.co>
2023-09-08 13:42:50 +02:00
David Turner
09e53f9ad9
Enhance docs around network troubleshooting (#97305)
Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this
commit duplicates the troubleshooting info across both places.
2023-07-10 10:57:44 +01:00
David Turner
7a517cb4a0
Add note on jstack frequency for troubleshooting (#95764)
Suggest calling `jstack` every 15s to ensure that at least one capture
shows a stuck thread. Also adds a link to this guide to the list on the
troubleshooting overview page.
2023-05-03 10:04:13 +01:00
David Turner
4c68382065
Capture thread dump on ShardLockObtainFailedException (#93458)
We sometimes see a `ShardLockObtainFailedException` when a shard failed
to shut down as fast as we expected, often because a node left and
rejoined the cluster. Sometimes this is because it was held open by
ongoing scrolls or PITs, but other times it may be because the shutdown
process itself is too slow. With this commit we add the ability to
capture and log a thread dump at the time of the failure to give us more
information about where the shutdown process might be running slowly.

Relates #93226
2023-02-02 11:17:40 -05:00
David Turner
dfab580976
Limit length of lag detector hot threads log lines (#92851)
If debug logging is enabled then the lag detector will capture and
report the hot threads of a lagging node. In some cases the resulting
log message can be very large, exceeding 10kiB, which means it is
truncated in most logging setups. The relevant thread(s) may be waiting
on I/O, which is not considered "hot" and therefore may not appear in
the first 10kiB.

This commit adjusts this logging mechanism to split the message into
chunks of size at most 2kiB (after compression and base64-encoding) to
ensure that the entire hot threads output can be faithfully
reconstructed from these logs.

Closes #88126
2023-01-13 13:11:26 +00:00
David Turner
6203560983
Fix docs for fault detection troubleshooting (#92749)
In #92742 we changed the logging around cluster membership changes but the docs
don't quite match the final version. This commit addresses that.
2023-01-09 10:17:06 +00:00
David Turner
5182748318
Improve node-{join,left} logging for troubleshooting (#92742)
Today to troubleshoot an unstable cluster we ask the users to parse the
rather complex `node-join` and `node-left` messages emitted by the
`MasterService`. These messages may refer to many nodes, may be
truncated, and are generally pretty hard to work with.

With this commit we start to emit a simplified log message about each
node added and removed. It also renames the respective executor classes:

- `JoinTaskExecutor` -> `NodeJoinExecutor`
- `NodeRemovalClusterStateTaskExecutor` -> `NodeLeftExecutor`

This brings their names in line with each other, and the messages that
they emit, whilst preserving the older `node-join` and `node-left`
terminology as reported by the `MasterService`.

Finally, it updates the troubleshooting logs to reflect these new and
simplified logs.

Relates #92741
2023-01-09 04:34:41 -05:00
David Turner
6a273886e9
Add technical docs on diagnosing instability etc (#85074)
Copies some internal troubleshooting docs to the reference manual for
wider use.

Co-authored-by: James Rodewig <james.rodewig@gmail.com>
2022-03-31 09:01:10 +01:00
Martijn van Groningen
8a1deff75a
Improve fault-detection.asciidoc (#76821)
Add section to fault-detection.asciidoc about nodes being removed from cluster
due to slow cluster state applying.
2021-08-23 14:31:06 +02:00
David Turner
c661a40083
Add docs for filesystem health checks (#59134)
Documents the feature and settings introduced in #52680.

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-07-07 14:14:35 +01:00
Lisa Cawley
f307847f29
[DOCS] Adds overview and API ref for cluster voting configurations (#36954) 2019-01-07 09:11:14 -08:00
Lisa Cawley
33e9cf3892
[DOCS] Merges list of discovery and cluster formation settings (#36909) 2018-12-21 11:24:48 -08:00
David Turner
1a23417aeb
[Zen2] Update documentation for Zen2 (#34714)
This commit overhauls the documentation of discovery and cluster coordination,
removing mention of the Zen Discovery module and replacing it with docs for the
new cluster coordination mechanism introduced in 7.0.

Relates #32006
2018-12-20 13:02:44 +00:00