elasticsearch

mirror of https://github.com/elastic/elasticsearch.git synced 2025-06-30 10:23:41 -04:00

Author	SHA1	Message	Date
David Turner	4c68382065	Capture thread dump on ShardLockObtainFailedException (#93458) We sometimes see a `ShardLockObtainFailedException` when a shard failed to shut down as fast as we expected, often because a node left and rejoined the cluster. Sometimes this is because it was held open by ongoing scrolls or PITs, but other times it may be because the shutdown process itself is too slow. With this commit we add the ability to capture and log a thread dump at the time of the failure to give us more information about where the shutdown process might be running slowly. Relates #93226	2023-02-02 11:17:40 -05:00
David Turner	dfab580976	Limit length of lag detector hot threads log lines (#92851) If debug logging is enabled then the lag detector will capture and report the hot threads of a lagging node. In some cases the resulting log message can be very large, exceeding 10kiB, which means it is truncated in most logging setups. The relevant thread(s) may be waiting on I/O, which is not considered "hot" and therefore may not appear in the first 10kiB. This commit adjusts this logging mechanism to split the message into chunks of size at most 2kiB (after compression and base64-encoding) to ensure that the entire hot threads output can be faithfully reconstructed from these logs. Closes #88126	2023-01-13 13:11:26 +00:00
David Turner	6203560983	Fix docs for fault detection troubleshooting (#92749) In #92742 we changed the logging around cluster membership changes but the docs don't quite match the final version. This commit addresses that.	2023-01-09 10:17:06 +00:00
David Turner	5182748318	Improve node-{join,left} logging for troubleshooting (#92742) Today to troubleshoot an unstable cluster we ask the users to parse the rather complex `node-join` and `node-left` messages emitted by the `MasterService`. These messages may refer to many nodes, may be truncated, and are generally pretty hard to work with. With this commit we start to emit a simplified log message about each node added and removed. It also renames the respective executor classes: `JoinTaskExecutor` -> `NodeJoinExecutor`; `NodeRemovalClusterStateTaskExecutor` -> `NodeLeftExecutor`. This brings their names in line with each other, and the messages that they emit, whilst preserving the older `node-join` and `node-left` terminology as reported by the `MasterService`. Finally, it updates the troubleshooting logs to reflect these new and simplified logs. Relates #92741	2023-01-09 04:34:41 -05:00
David Turner	6a273886e9	Add technical docs on diagnosing instability etc (#85074) Copies some internal troubleshooting docs to the reference manual for wider use. Co-authored-by: James Rodewig <james.rodewig@gmail.com>	2022-03-31 09:01:10 +01:00
Martijn van Groningen	8a1deff75a	Improve fault-detection.asciidoc (#76821) Add section to fault-detection.asciidoc about nodes being removed from cluster due to slow cluster state applying.	2021-08-23 14:31:06 +02:00
David Turner	c661a40083	Add docs for filesystem health checks (#59134) Documents the feature and settings introduced in #52680. Co-authored-by: James Rodewig <james.rodewig@elastic.co>	2020-07-07 14:14:35 +01:00
Lisa Cawley	f307847f29	[DOCS] Adds overview and API ref for cluster voting configurations (#36954)	2019-01-07 09:11:14 -08:00
Lisa Cawley	33e9cf3892	[DOCS] Merges list of discovery and cluster formation settings (#36909)	2018-12-21 11:24:48 -08:00
David Turner	1a23417aeb	[Zen2] Update documentation for Zen2 (#34714) This commit overhauls the documentation of discovery and cluster coordination, removing mention of the Zen Discovery module and replacing it with docs for the new cluster coordination mechanism introduced in 7.0. Relates #32006	2018-12-20 13:02:44 +00:00

10 commits