Enhance docs around network troubleshooting (#97305)
Discovery, like cluster membership, can also be affected by network-like issues (e.g. GC/VM pauses, dropped packets, and blocked threads), so this commit duplicates the troubleshooting info across both places.
parent 52a6820813
commit 09e53f9ad9
3 changed files with 101 additions and 148 deletions
@@ -39,18 +39,19 @@ nodes will repeatedly log messages about the problem using a logger called
 `org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By
 default, this happens every 10 seconds.

-Master elections only involve master-eligible nodes, so focus on the logs from
-master-eligible nodes in this situation. These nodes' logs will indicate the
-requirements for a master election, such as the discovery of a certain set of
-nodes.
+Master elections only involve master-eligible nodes, so focus your attention on
+the master-eligible nodes in this situation. These nodes' logs will indicate
+the requirements for a master election, such as the discovery of a certain set
+of nodes. The <<health-api>> API on these nodes will also provide useful
+information about the situation.

-If the logs indicate that {es} can't discover enough nodes to form a quorum,
-you must address the reasons preventing {es} from discovering the missing
-nodes. The missing nodes are needed to reconstruct the cluster metadata.
-Without the cluster metadata, the data in your cluster is meaningless. The
-cluster metadata is stored on a subset of the master-eligible nodes in the
-cluster. If a quorum can't be discovered, the missing nodes were the ones
-holding the cluster metadata.
+If the logs or the health report indicate that {es} can't discover enough nodes
+to form a quorum, you must address the reasons preventing {es} from discovering
+the missing nodes. The missing nodes are needed to reconstruct the cluster
+metadata. Without the cluster metadata, the data in your cluster is
+meaningless. The cluster metadata is stored on a subset of the master-eligible
+nodes in the cluster. If a quorum can't be discovered, the missing nodes were
+the ones holding the cluster metadata.

 Ensure there are enough nodes running to form a quorum and that every node can
 communicate with every other node over the network. {es} will report additional
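For readers following along, the <<health-api>> check referenced above can be exercised directly on a master-eligible node. The host, port, and lack of authentication below are assumptions for a default local test install:

[source,sh]
----
# Query the health API's master-stability indicator for its diagnosis
# (host/port and missing credentials are assumptions; adjust for your cluster).
curl -s "http://localhost:9200/_health_report/master_is_stable?pretty"
----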
@@ -59,10 +60,20 @@ than a few minutes. If you can't start enough nodes to form a quorum, start a
 new cluster and restore data from a recent snapshot. Refer to
 <<modules-discovery-quorums>> for more information.

-If the logs indicate that {es} _has_ discovered a possible quorum of nodes, the
-typical reason that the cluster can't elect a master is that one of the other
-nodes can't discover a quorum. Inspect the logs on the other master-eligible
-nodes and ensure that they have all discovered enough nodes to form a quorum.
+If the logs or the health report indicate that {es} _has_ discovered a possible
+quorum of nodes, the typical reason that the cluster can't elect a master is
+that one of the other nodes can't discover a quorum. Inspect the logs on the
+other master-eligible nodes and ensure that they have all discovered enough
+nodes to form a quorum.
+
+If the logs suggest that discovery or master elections are failing due to
+timeouts or network-related issues then narrow down the problem as follows.
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]

 [discrete]
 [[discovery-master-unstable]]
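When inspecting those logs, a quick filter for the periodic cluster-formation warnings can help. The log path and cluster name below are assumptions for a default Linux package install:

[source,sh]
----
# Show the most recent cluster-formation warnings on a master-eligible node
# (log path and cluster name are assumptions; adjust for your installation).
grep "ClusterFormationFailureHelper" /var/log/elasticsearch/elasticsearch.log | tail -n 5
----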
@@ -72,7 +83,14 @@ When a node wins the master election, it logs a message containing
 `elected-as-master`. If this happens repeatedly, the elected master node is
 unstable. In this situation, focus on the logs from the master-eligible nodes
 to understand why the election winner stops being the master and triggers
-another election.
+another election. If the logs suggest that the master is unstable due to
+timeouts or network-related issues then narrow down the problem as follows.
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]

 [discrete]
 [[discovery-cannot-join-master]]
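To get a feel for how frequently elections are recurring, counting the `elected-as-master` messages on each master-eligible node is often enough. The log path is again an assumption for a default install:

[source,sh]
----
# Count election wins recorded on this node; a value that keeps climbing over a
# short window points at an unstable master (log path is an assumption).
grep -c "elected-as-master" /var/log/elasticsearch/elasticsearch.log
----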
@@ -80,8 +98,18 @@ another election.

 If there is a stable elected master but a node can't discover or join its
 cluster, it will repeatedly log messages about the problem using the
-`ClusterFormationFailureHelper` logger. Other log messages on the affected node
-and the elected master may provide additional information about the problem.
+`ClusterFormationFailureHelper` logger. The <<health-api>> API on the affected
+node will also provide useful information about the situation. Other log
+messages on the affected node and the elected master may provide additional
+information about the problem. If the logs suggest that the node cannot
+discover or join the cluster due to timeouts or network-related issues then
+narrow down the problem as follows.
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]

 [discrete]
 [[discovery-node-leaves]]
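A simple sanity check in this scenario is to confirm from the affected node that the elected master's transport port is reachable at all. The hostname and the default port 9300 below are assumptions:

[source,sh]
----
# From the node that cannot join, test TCP connectivity to the elected master's
# transport port (hostname and port 9300 are assumptions for a default setup).
nc -vz master-node.example.com 9300
----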
@@ -89,4 +117,4 @@ and the elected master may provide additional information about the problem.

 If a node joins the cluster but {es} determines it to be faulty then it will be
 removed from the cluster again. See <<cluster-fault-detection-troubleshooting>>
-for more information.
+for more information.
docs/reference/troubleshooting/network-timeouts.asciidoc (new file, 45 lines)

@@ -0,0 +1,45 @@
tag::troubleshooting-network-timeouts-gc-vm[]
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not GC is resulting in delays.

* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs.
end::troubleshooting-network-timeouts-gc-vm[]
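As a concrete starting point, both sources mentioned above can be scanned for pauses. The file names below assume a default archive installation with the standard log directory:

[source,sh]
----
# Look for collection pauses in the GC log and for JvmMonitorService warnings
# about long pauses in the main node log (paths are assumptions).
grep -i "pause" logs/gc.log | tail -n 5
grep "JvmMonitorService" logs/elasticsearch.log | tail -n 5
----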

tag::troubleshooting-network-timeouts-packet-capture-elections[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at all relevant nodes. You
should be able to observe any retransmissions, packet loss, or other delays on
the connections between the nodes.
end::troubleshooting-network-timeouts-packet-capture-elections[]
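For instance, a capture such as the following, run on each relevant node at the same time, can later be lined up across nodes in a tool like Wireshark. Port 9300 is the default transport port and is an assumption here:

[source,sh]
----
# Record inter-node transport traffic for later comparison across nodes
# (port 9300 is the default transport port; adjust if yours differs).
tcpdump -i any -w /tmp/es-transport.pcap port 9300
----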

tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
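A sketch of such a capture on the elected master, restricted to the single faulty node so the once-per-second follower-check exchange stands out (hostname and port are assumptions):

[source,sh]
----
# On the elected master, capture only traffic to and from the faulty node; the
# regular one-request-one-response pattern of follower checks is easy to spot.
tcpdump -i any -w /tmp/follower-checks.pcap host faulty-node.example.com and port 9300
----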

tag::troubleshooting-network-timeouts-threads[]
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to the relevant log
message.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in discovery and cluster membership are mainly
`transport_worker` and `cluster_coordination` threads, for which there should
never be a long wait. There may also be evidence of long waits for threads in
the {es} logs. See <<modules-network-threading-model>> for more information.
end::troubleshooting-network-timeouts-threads[]
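For example, stack dumps a couple of seconds apart plus a hot-threads call might be collected as follows. The PID lookup assumes a single {es} JVM per host, and the host and port are assumptions for a default install:

[source,sh]
----
# Take several stack dumps of the Elasticsearch JVM a few seconds apart, then
# try the hot threads API as a cross-check (it may itself be blocked by the
# very problem being investigated).
ES_PID=$(pgrep -f "org.elasticsearch" | head -n 1)
for i in 1 2 3; do jstack "$ES_PID" > "jstack-$(date +%s).txt"; sleep 2; done
curl -s "http://localhost:9200/_nodes/hot_threads?threads=9999"
----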