Enhance docs around network troubleshooting (#97305)

Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this
commit duplicates the troubleshooting info across both places.
David Turner 2023-07-10 10:57:44 +01:00 committed by GitHub
parent 52a6820813
commit 09e53f9ad9
3 changed files with 101 additions and 148 deletions


@@ -39,18 +39,19 @@ nodes will repeatedly log messages about the problem using a logger called
`org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By
default, this happens every 10 seconds.
Master elections only involve master-eligible nodes, so focus your attention on
the master-eligible nodes in this situation. These nodes' logs will indicate
the requirements for a master election, such as the discovery of a certain set
of nodes. The <<health-api>> API on these nodes will also provide useful
information about the situation.
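
For example, a minimal way to check the health report on a master-eligible node
might look like the following sketch, assuming the node listens on
`localhost:9200` over plain HTTP with security disabled (adjust the address,
credentials and TLS settings to match your deployment):

[source,sh]
----
# Ask this node for its own view of cluster health, including any
# problems it has found with master stability and discovery.
curl -s "http://localhost:9200/_health_report?pretty"
----
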
If the logs or the health report indicate that {es} can't discover enough nodes
to form a quorum, you must address the reasons preventing {es} from discovering
the missing nodes. The missing nodes are needed to reconstruct the cluster
metadata. Without the cluster metadata, the data in your cluster is
meaningless. The cluster metadata is stored on a subset of the master-eligible
nodes in the cluster. If a quorum can't be discovered, the missing nodes were
the ones holding the cluster metadata.
Ensure there are enough nodes running to form a quorum and that every node can
communicate with every other node over the network. {es} will report additional
@@ -59,10 +60,20 @@ than a few minutes. If you can't start enough nodes to form a quorum, start a
new cluster and restore data from a recent snapshot. Refer to
<<modules-discovery-quorums>> for more information.
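
If you do restore into a new cluster, the restore request might look like the
following sketch, assuming you have already registered a snapshot repository
named `my_repository` containing a recent snapshot named `my_snapshot` (both
names are placeholders):

[source,sh]
----
# Restore all data streams and indices from the snapshot into the new cluster.
curl -s -X POST "http://localhost:9200/_snapshot/my_repository/my_snapshot/_restore?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "*" }'
----
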
If the logs or the health report indicate that {es} _has_ discovered a possible
quorum of nodes, the typical reason that the cluster can't elect a master is
that one of the other nodes can't discover a quorum. Inspect the logs on the
other master-eligible nodes and ensure that they have all discovered enough
nodes to form a quorum.
If the logs suggest that discovery or master elections are failing due to
timeouts or network-related issues then narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
[discrete]
[[discovery-master-unstable]]
@@ -72,7 +83,14 @@ When a node wins the master election, it logs a message containing
`elected-as-master`. If this happens repeatedly, the elected master node is
unstable. In this situation, focus on the logs from the master-eligible nodes
to understand why the election winner stops being the master and triggers
another election. If the logs suggest that the master is unstable due to
timeouts or network-related issues then narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
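
As a quick check on how frequently elections are happening, you can also search
the server logs on each master-eligible node. For example, assuming the default
log location of an archive installation and a cluster named `my-cluster` (both
assumptions; adjust the path for your installation):

[source,sh]
----
# Count the master elections recorded in this node's server log.
grep -c "elected-as-master" logs/my-cluster.log
----
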
[discrete]
[[discovery-cannot-join-master]]
@@ -80,8 +98,18 @@ another election.
If there is a stable elected master but a node can't discover or join its
cluster, it will repeatedly log messages about the problem using the
`ClusterFormationFailureHelper` logger. The <<health-api>> API on the affected
node will also provide useful information about the situation. Other log
messages on the affected node and the elected master may provide additional
information about the problem. If the logs suggest that the node cannot
discover or join the cluster due to timeouts or network-related issues then
narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
[discrete]
[[discovery-node-leaves]]
@@ -89,4 +117,4 @@ and the elected master may provide additional information about the problem.
If a node joins the cluster but {es} determines it to be faulty then it will be
removed from the cluster again. See <<cluster-fault-detection-troubleshooting>>
for more information.


@@ -0,0 +1,45 @@
tag::troubleshooting-network-timeouts-gc-vm[]
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not GC is resulting in delays.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs.
end::troubleshooting-network-timeouts-gc-vm[]
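
As a rough illustration, you might scan these logs as follows, assuming the
default log locations of an archive installation (`logs/gc.log` for the GC log
and `logs/my-cluster.log` for the server log of a cluster named `my-cluster`);
the paths differ for package and Docker installations:

[source,sh]
----
# Show the most recent collections that paused the application threads.
grep -i "pause" logs/gc.log | tail -n 20

# Show recent GC overhead warnings from the JVM monitor in the server log.
grep "MonitorService" logs/my-cluster.log | tail -n 20
----
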
tag::troubleshooting-network-timeouts-packet-capture-elections[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at all relevant nodes. You
should be able to observe any retransmissions, packet loss, or other delays on
the connections between the nodes.
end::troubleshooting-network-timeouts-packet-capture-elections[]
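
For example, a capture might be taken as follows, assuming the default
transport port of `9300` and an arbitrary output file name; run the equivalent
command on each relevant node at the same time and compare the captures
afterwards (for instance in Wireshark):

[source,sh]
----
# Capture inter-node transport traffic for later analysis.
# Press Ctrl-C to stop the capture.
sudo tcpdump -i any -w transport.pcap port 9300
----
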
tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
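
As a sketch, on the elected master you might restrict the capture to the
traffic exchanged with the faulty node so that the roughly once-per-second
follower checks stand out, assuming the faulty node's address is `10.0.0.5` (a
placeholder) and the default transport port of `9300`:

[source,sh]
----
# On the elected master: capture only traffic to and from the faulty node.
sudo tcpdump -i any -w follower-checks.pcap host 10.0.0.5 and port 9300
----
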
tag::troubleshooting-network-timeouts-threads[]
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to the relevant log
message.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in discovery and cluster membership are mainly
`transport_worker` and `cluster_coordination` threads, for which there should
never be a long wait. There may also be evidence of long waits for threads in
the {es} logs. See <<modules-network-threading-model>> for more information.
end::troubleshooting-network-timeouts-threads[]
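
For instance, a minimal way to gather this evidence might look like the
following sketch, assuming the {es} process runs as the `elasticsearch` user
and that `jps` and `jstack` from the bundled JDK are on the `PATH` (both are
assumptions about your installation):

[source,sh]
----
# Find the PID of the Elasticsearch server JVM and take several stack dumps
# a couple of seconds apart to spot threads that are stuck or waiting.
ES_PID=$(sudo -u elasticsearch jps | awk '/Elasticsearch/ {print $1}')
for i in 1 2 3; do
  sudo -u elasticsearch jstack "$ES_PID" > "jstack-$i.txt"
  sleep 2
done

# If the cluster is still responsive enough, the hot threads API may also help.
curl -s "http://localhost:9200/_nodes/hot_threads?threads=9999"
----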