// Discovery, like cluster membership, can also be affected by network-like issues (e.g. GC/VM pauses, dropped packets and blocked threads), so this commit duplicates the troubleshooting info across both places.
[[discovery-troubleshooting]]
== Troubleshooting discovery

In most cases, the discovery and election process completes quickly, and the
master node remains elected for a long period of time.

If your cluster doesn't have a stable master, many of its features won't work
correctly and {es} will report errors to clients and in its logs. You must fix
the master node's instability before addressing these other issues. It will not
be possible to solve any other issues while there is no elected master node or
the elected master node is unstable.

If your cluster has a stable master but some nodes can't discover or join it,
these nodes will report errors to clients and in their logs. You must address
the obstacles preventing these nodes from joining the cluster before addressing
other issues. It will not be possible to solve any other issues reported by
these nodes while they are unable to join the cluster.

If the cluster has no elected master node for more than a few seconds, the
master is unstable, or some nodes are unable to discover or join a stable
master, then {es} will record information in its logs explaining why. If the
problems persist for more than a few minutes, {es} will record additional
information in its logs. To properly troubleshoot discovery and election
problems, collect and analyse logs covering at least five minutes from all
nodes.

The following sections describe some common discovery and election problems.

[discrete]
[[discovery-no-master]]
=== No master is elected

When a node wins the master election, it logs a message containing
`elected-as-master` and all nodes log a message containing
`master node changed` identifying the new elected master node.

If there is no elected master node and no node can win an election, all
nodes will repeatedly log messages about the problem using a logger called
`org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By
default, this happens every 10 seconds.
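As a rough aid when reading collected logs, the following Python sketch counts
how many lines mention each of the messages named above. It is only an
illustration: the function name `count_markers` is hypothetical, and matching
on the short logger name `ClusterFormationFailureHelper` assumes your log
layout prints at least that part of the logger name.

```python
from collections import Counter

# Substrings taken from the text above. Matching the short logger name is an
# assumption: some log layouts abbreviate the package prefix.
MARKERS = (
    "elected-as-master",
    "master node changed",
    "ClusterFormationFailureHelper",
)


def count_markers(lines):
    """Return a Counter of how many log lines contain each marker."""
    counts = Counter({marker: 0 for marker in MARKERS})
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts
```

Run it once per node, for example over the lines of each node's log file, and
compare the results. Many `ClusterFormationFailureHelper` lines with no
`elected-as-master` line on any node is consistent with the situation described
in this section.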

Master elections only involve master-eligible nodes, so focus your attention on
the master-eligible nodes in this situation. These nodes' logs will indicate
the requirements for a master election, such as the discovery of a certain set
of nodes. The <<health-api,health API>> on these nodes will also provide useful
information about the situation.
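The following Python sketch shows one way to query the health API on a single
node and pull out the master stability result. Treat it as a starting point
rather than a reference: the `_health_report/master_is_stable` path and the
`indicators`, `status` and `symptom` response fields are assumptions to verify
against the health API documentation for your version, the function names are
hypothetical, and authentication and TLS settings are omitted.

```python
import json
import urllib.request


def fetch_master_stability(base_url, timeout=10):
    """GET the master stability indicator from one node.

    Assumes the node is reachable at base_url without authentication and
    that the indicator path below exists in your version.
    """
    url = base_url.rstrip("/") + "/_health_report/master_is_stable"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize(report):
    """Return (status, symptom) from a parsed health report.

    The field names are assumptions; missing fields fall back to defaults.
    """
    indicator = report.get("indicators", {}).get("master_is_stable", {})
    return indicator.get("status", "unknown"), indicator.get("symptom", "")
```

Call `fetch_master_stability` against each master-eligible node in turn, since
each node reports its own view of the situation, and pass the result to
`summarize`.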

If the logs or the health report indicate that {es} can't discover enough nodes
to form a quorum, you must address the reasons preventing {es} from discovering
the missing nodes. The missing nodes are needed to reconstruct the cluster
metadata. Without the cluster metadata, the data in your cluster is
meaningless. The cluster metadata is stored on a subset of the master-eligible
nodes in the cluster. If a quorum can't be discovered, the missing nodes were
the ones holding the cluster metadata.

Ensure there are enough nodes running to form a quorum and that every node can
communicate with every other node over the network. {es} will report additional
details about network connectivity if the election problems persist for more
than a few minutes. If you can't start enough nodes to form a quorum, start a
new cluster and restore data from a recent snapshot. Refer to
<<modules-discovery-quorums>> for more information.
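To make "enough nodes to form a quorum" concrete, the following Python sketch
works through the arithmetic. It assumes the usual majority rule, where a
quorum is more than half of the nodes in the voting configuration. The function
names are hypothetical, and the voting configuration is not always identical to
the full set of master-eligible nodes, so take the node count from your logs or
the health report.

```python
def quorum_size(voting_nodes):
    """Smallest number of nodes that is more than half of voting_nodes."""
    if voting_nodes < 1:
        raise ValueError("voting_nodes must be at least 1")
    return voting_nodes // 2 + 1


def can_form_quorum(voting_nodes, reachable_nodes):
    """True if the reachable voting nodes are a majority."""
    return reachable_nodes >= quorum_size(voting_nodes)
```

With three voting nodes, two must be running and able to communicate. With only
one of the three available, no election can succeed.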

If the logs or the health report indicate that {es} _has_ discovered a possible
quorum of nodes, the typical reason that the cluster can't elect a master is
that one of the other nodes can't discover a quorum. Inspect the logs on the
other master-eligible nodes and ensure that they have all discovered enough
nodes to form a quorum.

If the logs suggest that discovery or master elections are failing due to
timeouts or network-related issues, then narrow down the problem as follows.

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]

[discrete]
[[discovery-master-unstable]]
=== Master is elected but unstable

When a node wins the master election, it logs a message containing
`elected-as-master`. If this happens repeatedly, the elected master node is
unstable. In this situation, focus on the logs from the master-eligible nodes
to understand why the election winner stops being the master and triggers
another election. If the logs suggest that the master is unstable due to
timeouts or network-related issues, then narrow down the problem as follows.

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
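To judge whether the `elected-as-master` messages described in this section
occur repeatedly, it can help to count how many elections fall within a short
time window. The following Python sketch does that. It assumes each log line
starts with a timestamp such as `[2024-01-01T12:00:00,000]`, which is an
assumption about your log layout rather than a guarantee, and the function
names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: [YYYY-MM-DDTHH:MM:SS,mmm]. Adjust for your log layout.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}),\d{3}\]")


def election_times(lines):
    """Return the timestamps of lines that contain elected-as-master."""
    times = []
    for line in lines:
        if "elected-as-master" not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S"))
    return times


def max_elections_within(times, window):
    """Return the largest number of elections inside any span of length window."""
    times = sorted(times)
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Shrink the span from the left until it fits inside the window.
        while current - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

For example, `max_elections_within(election_times(lines), timedelta(minutes=5))`
returns the highest number of elections seen in any five-minute span of the
combined logs. A result greater than one suggests the master did not stay
elected for long.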

[discrete]
[[discovery-cannot-join-master]]
=== Node cannot discover or join stable master

If there is a stable elected master but a node can't discover or join its
cluster, it will repeatedly log messages about the problem using the
`ClusterFormationFailureHelper` logger. The <<health-api,health API>> on the
affected node will also provide useful information about the situation. Other
log messages on the affected node and the elected master may provide additional
information about the problem. If the logs suggest that the node cannot
discover or join the cluster due to timeouts or network-related issues, then
narrow down the problem as follows.

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]

include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]

[discrete]
[[discovery-node-leaves]]
=== Node joins cluster and leaves again

If a node joins the cluster but {es} determines it to be faulty, then it will be
removed from the cluster again. See <<cluster-fault-detection-troubleshooting>>
for more information.