elasticsearch

mirror of https://github.com/elastic/elasticsearch.git synced 2025-06-30 10:23:41 -04:00

Author	SHA1	Message	Date
David Turner	9387ce3357	Deduplicate unstable-cluster troubleshooting docs (#112333) We duplicated these docs in order to avoid breaking older links, but this makes it confusing and hard to link to the right copy of the information. This commit removes the duplication by replacing the docs at the old locations with stubs that link to the new locations.	2024-08-29 13:16:37 +01:00
David Turner	59a42ed41b	Include network disconnect info in troubleshooting docs (#112323) A misplaced `//end::` tag meant that the docs added in #112271 are only included in the page on fault detection and not the equivalent troubleshooting docs. This commit fixes the problem.	2024-08-29 15:03:13 +10:00
David Turner	42d650b9bb	Add docs for troubleshooting network disconnects (#112271) Basically the same as for nodes that leave the cluster with reason `disconnected`, except that these disconnects don't involve the master so don't cause any nodes to leave the cluster.	2024-08-28 18:59:11 +10:00
David Turner	e5fd63bbb8	More detail around packet captures (#111835) Clarify that it's best to analyse the captures alongside the node logs, and spell out in a bit more detail how to use packet captures and logs to pin down the cause of a `disconnected` node.	2024-08-13 21:55:38 +01:00
Abdon Pijpelink	af76a3a436	[DOCS] Add 'Troubleshooting an unstable cluster' to nav (#99287) * [DOCS] Add 'Troubleshooting an unstable cluster' to nav * Adjust docs links in code * Revert "Adjust docs links in code" This reverts commit `f3846b1d78`. --------- Co-authored-by: David Turner <david.turner@elastic.co>	2023-09-08 13:42:50 +02:00
David Turner	09e53f9ad9	Enhance docs around network troubleshooting (#97305) Discovery, like cluster membership, can also be affected by network-like issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this commit duplicates the troubleshooting info across both places.	2023-07-10 10:57:44 +01:00
David Turner	7a517cb4a0	Add note on jstack frequency for troubleshooting (#95764) Suggest calling `jstack` every 15s to ensure that at least one capture shows a stuck thread. Also adds a link to this guide to the list on the troubleshooting overview page.	2023-05-03 10:04:13 +01:00
David Turner	4c68382065	Capture thread dump on ShardLockObtainFailedException (#93458) We sometimes see a `ShardLockObtainFailedException` when a shard failed to shut down as fast as we expected, often because a node left and rejoined the cluster. Sometimes this is because it was held open by ongoing scrolls or PITs, but other times it may be because the shutdown process itself is too slow. With this commit we add the ability to capture and log a thread dump at the time of the failure to give us more information about where the shutdown process might be running slowly. Relates #93226	2023-02-02 11:17:40 -05:00
David Turner	dfab580976	Limit length of lag detector hot threads log lines (#92851) If debug logging is enabled then the lag detector will capture and report the hot threads of a lagging node. In some cases the resulting log message can be very large, exceeding 10kiB, which means it is truncated in most logging setups. The relevant thread(s) may be waiting on I/O, which is not considered "hot" and therefore may not appear in the first 10kiB. This commit adjusts this logging mechanism to split the message into chunks of size at most 2kiB (after compression and base64-encoding) to ensure that the entire hot threads output can be faithfully reconstructed from these logs. Closes #88126	2023-01-13 13:11:26 +00:00
David Turner	6203560983	Fix docs for fault detection troubleshooting (#92749) In #92742 we changed the logging around cluster membership changes but the docs don't quite match the final version. This commit addresses that.	2023-01-09 10:17:06 +00:00
David Turner	5182748318	Improve node-{join,left} logging for troubleshooting (#92742) Today to troubleshoot an unstable cluster we ask the users to parse the rather complex `node-join` and `node-left` messages emitted by the `MasterService`. These messages may refer to many nodes, may be truncated, and are generally pretty hard to work with. With this commit we start to emit a simplified log message about each node added and removed. It also renames the respective executor classes (`JoinTaskExecutor` -> `NodeJoinExecutor`, `NodeRemovalClusterStateTaskExecutor` -> `NodeLeftExecutor`). This brings their names in line with each other, and the messages that they emit, whilst preserving the older `node-join` and `node-left` terminology as reported by the `MasterService`. Finally, it updates the troubleshooting logs to reflect these new and simplified logs. Relates #92741	2023-01-09 04:34:41 -05:00
David Turner	6a273886e9	Add technical docs on diagnosing instability etc (#85074) Copies some internal troubleshooting docs to the reference manual for wider use. Co-authored-by: James Rodewig <james.rodewig@gmail.com>	2022-03-31 09:01:10 +01:00
Martijn van Groningen	8a1deff75a	Improve fault-detection.asciidoc (#76821) Add section to fault-detection.asciidoc about nodes being removed from cluster due to slow cluster state applying.	2021-08-23 14:31:06 +02:00
David Turner	c661a40083	Add docs for filesystem health checks (#59134) Documents the feature and settings introduced in #52680. Co-authored-by: James Rodewig <james.rodewig@elastic.co>	2020-07-07 14:14:35 +01:00
Lisa Cawley	f307847f29	[DOCS] Adds overview and API ref for cluster voting configurations (#36954)	2019-01-07 09:11:14 -08:00
Lisa Cawley	33e9cf3892	[DOCS] Merges list of discovery and cluster formation settings (#36909)	2018-12-21 11:24:48 -08:00
David Turner	1a23417aeb	[Zen2] Update documentation for Zen2 (#34714) This commit overhauls the documentation of discovery and cluster coordination, removing mention of the Zen Discovery module and replacing it with docs for the new cluster coordination mechanism introduced in 7.0. Relates #32006	2018-12-20 13:02:44 +00:00

17 commits