mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-06-30 10:23:41 -04:00
Add section to fault-detection.asciidoc about nodes being removed from cluster due to slow cluster state applying.
33 lines
1.8 KiB
Text
33 lines
1.8 KiB
Text
[[cluster-fault-detection]]
|
|
=== Cluster fault detection
|
|
|
|
The elected master periodically checks each of the nodes in the cluster to
|
|
ensure that they are still connected and healthy. Each node in the cluster also
|
|
periodically checks the health of the elected master. These checks are known
|
|
respectively as _follower checks_ and _leader checks_.
|
|
|
|
Elasticsearch allows these checks to occasionally fail or timeout without
|
|
taking any action. It considers a node to be faulty only after a number of
|
|
consecutive checks have failed. You can control fault detection behavior with
|
|
<<modules-discovery-settings,`cluster.fault_detection.*` settings>>.
|
|
|
|
If the elected master detects that a node has disconnected, however, this
|
|
situation is treated as an immediate failure. The master bypasses the timeout
|
|
and retry setting values and attempts to remove the node from the cluster.
|
|
Similarly, if a node detects that the elected master has disconnected, this
|
|
situation is treated as an immediate failure. The node bypasses the timeout and
|
|
retry settings and restarts its discovery phase to try and find or elect a new
|
|
master.
|
|
|
|
[[cluster-fault-detection-filesystem-health]]
|
|
Additionally, each node periodically verifies that its data path is healthy by
|
|
writing a small file to disk and then deleting it again. If a node discovers
|
|
its data path is unhealthy then it is removed from the cluster until the data
|
|
path recovers. You can control this behavior with the
|
|
<<modules-discovery-settings,`monitor.fs.health` settings>>.
|
|
|
|
[[cluster-fault-detection-cluster-state-publishing]]
|
|
The elected master node will also remove nodes from the cluster if nodes are unable
|
|
to apply an updated cluster state within a reasonable time, which is by default
|
|
120s after the cluster state update started. See
|
|
<<cluster-state-publishing, cluster state publishing>> for a more detailed description.
|