Enhance docs around network troubleshooting (#97305)

Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this
commit duplicates the troubleshooting info across both places.
David Turner 2023-07-10 10:57:44 +01:00 committed by GitHub
parent 52a6820813
commit 09e53f9ad9
3 changed files with 101 additions and 148 deletions


@@ -39,18 +39,19 @@ nodes will repeatedly log messages about the problem using a logger called
`org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By
default, this happens every 10 seconds.
Master elections only involve master-eligible nodes, so focus your attention on
the master-eligible nodes in this situation. These nodes' logs will indicate
the requirements for a master election, such as the discovery of a certain set
of nodes. The <<health-api>> API on these nodes will also provide useful
information about the situation.
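
For example, a minimal way to check the health report on a master-eligible node
might look like the following sketch, assuming the node listens on
`localhost:9200` over plain HTTP with security disabled (adjust the address,
credentials and TLS settings to match your deployment):

[source,sh]
----
# Ask this node for its own view of cluster health, including any
# problems it has found with master stability and discovery.
curl -s "http://localhost:9200/_health_report?pretty"
----
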
If the logs or the health report indicate that {es} can't discover enough nodes
to form a quorum, you must address the reasons preventing {es} from discovering
the missing nodes. The missing nodes are needed to reconstruct the cluster
metadata. Without the cluster metadata, the data in your cluster is
meaningless. The cluster metadata is stored on a subset of the master-eligible
nodes in the cluster. If a quorum can't be discovered, the missing nodes were
the ones holding the cluster metadata.
Ensure there are enough nodes running to form a quorum and that every node can
communicate with every other node over the network. {es} will report additional
@@ -59,10 +60,20 @@ than a few minutes. If you can't start enough nodes to form a quorum, start a
new cluster and restore data from a recent snapshot. Refer to
<<modules-discovery-quorums>> for more information.
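
If you do restore into a new cluster, the restore request might look like the
following sketch, assuming you have already registered a snapshot repository
named `my_repository` containing a recent snapshot named `my_snapshot` (both
names are placeholders):

[source,sh]
----
# Restore all data streams and indices from the snapshot into the new cluster.
curl -s -X POST "http://localhost:9200/_snapshot/my_repository/my_snapshot/_restore?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "*" }'
----
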
If the logs or the health report indicate that {es} _has_ discovered a possible
quorum of nodes, the typical reason that the cluster can't elect a master is
that one of the other nodes can't discover a quorum. Inspect the logs on the
other master-eligible nodes and ensure that they have all discovered enough
nodes to form a quorum.
If the logs suggest that discovery or master elections are failing due to
timeouts or network-related issues then narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
[discrete]
[[discovery-master-unstable]]
@@ -72,7 +83,14 @@ When a node wins the master election, it logs a message containing
`elected-as-master`. If this happens repeatedly, the elected master node is
unstable. In this situation, focus on the logs from the master-eligible nodes
to understand why the election winner stops being the master and triggers
another election. If the logs suggest that the master is unstable due to
timeouts or network-related issues then narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
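
As a quick check on how frequently elections are happening, you can also search
the server logs on each master-eligible node. For example, assuming the default
log location of an archive installation and a cluster named `my-cluster` (both
assumptions; adjust the path for your installation):

[source,sh]
----
# Count the master elections recorded in this node's server log.
grep -c "elected-as-master" logs/my-cluster.log
----
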
[discrete]
[[discovery-cannot-join-master]]
@@ -80,8 +98,18 @@ another election.
If there is a stable elected master but a node can't discover or join its
cluster, it will repeatedly log messages about the problem using the
`ClusterFormationFailureHelper` logger. The <<health-api>> API on the affected
node will also provide useful information about the situation. Other log
messages on the affected node and the elected master may provide additional
information about the problem. If the logs suggest that the node cannot
discover or join the cluster due to timeouts or network-related issues then
narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
[discrete]
[[discovery-node-leaves]]
@@ -89,4 +117,4 @@ and the elected master may provide additional information about the problem.
If a node joins the cluster but {es} determines it to be faulty then it will be
removed from the cluster again. See <<cluster-fault-detection-troubleshooting>>
for more information.


@@ -0,0 +1,45 @@
tag::troubleshooting-network-timeouts-gc-vm[]
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not GC is resulting in delays.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs.
end::troubleshooting-network-timeouts-gc-vm[]
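
As a rough illustration, you might scan these logs as follows, assuming the
default log locations of an archive installation (`logs/gc.log` for the GC log
and `logs/my-cluster.log` for the server log of a cluster named `my-cluster`);
the paths differ for package and Docker installations:

[source,sh]
----
# Show the most recent collections that paused the application threads.
grep -i "pause" logs/gc.log | tail -n 20

# Show recent GC overhead warnings from the JVM monitor in the server log.
grep "MonitorService" logs/my-cluster.log | tail -n 20
----
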
tag::troubleshooting-network-timeouts-packet-capture-elections[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at all relevant nodes. You
should be able to observe any retransmissions, packet loss, or other delays on
the connections between the nodes.
end::troubleshooting-network-timeouts-packet-capture-elections[]
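
For example, a capture might be taken as follows, assuming the default
transport port of `9300` and an arbitrary output file name; run the equivalent
command on each relevant node at the same time and compare the captures
afterwards (for instance in Wireshark):

[source,sh]
----
# Capture inter-node transport traffic for later analysis.
# Press Ctrl-C to stop the capture.
sudo tcpdump -i any -w transport.pcap port 9300
----
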
tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
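
As a sketch, on the elected master you might restrict the capture to the
traffic exchanged with the faulty node so that the roughly once-per-second
follower checks stand out, assuming the faulty node's address is `10.0.0.5` (a
placeholder) and the default transport port of `9300`:

[source,sh]
----
# On the elected master: capture only traffic to and from the faulty node.
sudo tcpdump -i any -w follower-checks.pcap host 10.0.0.5 and port 9300
----
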
tag::troubleshooting-network-timeouts-threads[]
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to the relevant log
message.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in discovery and cluster membership are mainly
`transport_worker` and `cluster_coordination` threads, for which there should
never be a long wait. There may also be evidence of long waits for threads in
the {es} logs. See <<modules-network-threading-model>> for more information.
end::troubleshooting-network-timeouts-threads[]
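
For instance, a minimal way to gather this evidence might look like the
following sketch, assuming the {es} process runs as the `elasticsearch` user
and that `jps` and `jstack` from the bundled JDK are on the `PATH` (both are
assumptions about your installation):

[source,sh]
----
# Find the PID of the Elasticsearch server JVM and take several stack dumps
# a couple of seconds apart to spot threads that are stuck or waiting.
ES_PID=$(sudo -u elasticsearch jps | awk '/Elasticsearch/ {print $1}')
for i in 1 2 3; do
  sudo -u elasticsearch jstack "$ES_PID" > "jstack-$i.txt"
  sleep 2
done

# If the cluster is still responsive enough, the hot threads API may also help.
curl -s "http://localhost:9200/_nodes/hot_threads?threads=9999"
----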