Enhance docs around network troubleshooting (#97305)

Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this
commit duplicates the troubleshooting info across both places.
David Turner 2023-07-10 10:57:44 +01:00 committed by GitHub
parent 52a6820813
commit 09e53f9ad9
3 changed files with 101 additions and 148 deletions


@@ -26,7 +26,8 @@ its data path is unhealthy then it is removed from the cluster until the data
path recovers. You can control this behavior with the
<<modules-discovery-settings,`monitor.fs.health` settings>>.
[[cluster-fault-detection-cluster-state-publishing]]
The elected master node
will also remove nodes from the cluster if nodes are unable to apply an updated
cluster state within a reasonable time. The timeout defaults to 2 minutes
starting from the beginning of the cluster state update. Refer to
@@ -120,6 +121,9 @@ When it rejoins, the `NodeJoinExecutor` will log that it processed a
is unexpectedly restarting, look at the node's logs to see why it is shutting
down.
The <<health-api>> API on the affected node will also provide some useful
information about the situation.
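
For illustration, a hedged sketch of requesting the health report directly from the affected node; the endpoint, credentials, and certificate path shown here are assumptions to adapt to your deployment:

[source,sh]
----
# Fetch the health report from the affected node. Adjust the URL, credentials
# and certificate path to match your cluster's security configuration.
curl -s -u elastic:"$ELASTIC_PASSWORD" --cacert config/certs/http_ca.crt \
  "https://localhost:9200/_health_report?pretty"
----
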
If the node did not restart then you should look at the reason for its
departure more closely. Each reason has different troubleshooting steps,
described below. There are three possible reasons:
@@ -244,141 +248,17 @@ a possible cause for this kind of instability. Log messages containing
If the last check failed with an exception then the exception is reported, and
typically indicates the problem that needs to be addressed. If any of the
checks timed out then narrow down the problem as follows.
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection]
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
checks timed out, it may be necessary to understand the detailed sequence of
steps involved in a successful check. Here is an example of such a sequence:
. The master's `FollowerChecker`, running on thread
`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send
the check request message to a follower node.
. The master's `TransportService` running on thread
`elasticsearch[master][transport_worker][T#2]` passes the check request message
onto the operating system.
. The operating system on the master converts the message into one or more
packets and sends them out over the network.
. Miscellaneous routers, firewalls, and other devices between the master node
and the follower node forward the packets, possibly fragmenting or
defragmenting them on the way.
. The operating system on the follower node receives the packets and notifies
{es} that they've been received.
. The follower's `TransportService`, running on thread
`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets.
It then reconstructs and processes the check request. Usually, the check
quickly succeeds. If so, the same thread immediately constructs a response and
passes it back to the operating system.
. If the check doesn't immediately succeed (for example, an election started
recently) then:
.. The follower's `FollowerChecker`, running on thread
`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It
constructs a response and tells the `TransportService` to send the response
back to the master.
.. The follower's `TransportService`, running on thread
`elasticsearch[follower][transport_worker][T#3]`, passes the response to the
operating system.
. The operating system on the follower converts the response into one or more
packets and sends them out over the network.
. Miscellaneous routers, firewalls, and other devices between master and
follower forward the packets, possibly fragmenting or defragmenting them on the
way.
. The operating system on the master receives the packets and notifies {es}
that they've been received.
. The master's `TransportService`, running on thread
`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets,
reconstructs the check response, and processes it as long as the check didn't
already time out.
There are a lot of different things that can delay the completion of a check
and cause it to time out. Here are some examples for each step:
. There may be a long garbage collection (GC) or virtual machine (VM) pause
after passing the check request to the `TransportService`.
. There may be a long wait for the specific `transport_worker` thread to become
available, or there may be a long GC or VM pause before passing the check
request onto the operating system.
. A system fault (for example, a broken network card) on the master may delay
sending the message over the network, possibly indefinitely.
. Intermediate devices may delay, drop, or corrupt packets along the way. The
operating system for the master will wait and retransmit any unacknowledged or
corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend
<<system-config-tcpretries,reducing this value>> since the default represents a
very long delay (see the configuration sketch after this list).
. A system fault (for example, a broken network card) on the follower may delay
receiving the message from the network.
. There may be a long wait for the specific `transport_worker` thread to become
available, or there may be a long GC or VM pause during the processing of the
request on the follower.
. There may be a long wait for the `cluster_coordination` thread to become
available, or for the specific `transport_worker` thread to become available
again. There may also be a long GC or VM pause during the processing of the
request.
. A system fault (for example, a broken network card) on the follower may delay
sending the response over the network.
. Intermediate devices may delay, drop, or corrupt packets along the way again,
causing retransmissions.
. A system fault (for example, a broken network card) on the master may delay
receiving the message from the network.
. There may be a long wait for the specific `transport_worker` thread to become
available to process the response, or a long GC or VM pause.
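
For the `net.ipv4.tcp_retries2` item in the list above, here is a minimal sketch of checking and reducing the retransmission limit on Linux; the value `5` follows the guidance linked via <<system-config-tcpretries,the system configuration docs>>, but verify it against that page for your version:

[source,sh]
----
# Inspect the current retransmission limit (Linux only; run as root).
sysctl net.ipv4.tcp_retries2

# Reduce it at runtime; persist the change under /etc/sysctl.d/ so that it
# survives a reboot.
sysctl -w net.ipv4.tcp_retries2=5
----
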
To determine why follower checks are timing out, we can narrow down the reason
for the delay as follows:
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not GC is resulting in delays.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs.
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to a node departure.
+
By default the follower checks will time out after 30s, so if node departures
are unpredictable then capture stack dumps every 15s to be sure that at least
one stack dump was taken at the right time.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in the follower checks are `transport_worker` and
`cluster_coordination` threads, for which there should never be a long wait.
There may also be evidence of long waits for threads in the {es} logs. See
<<modules-network-threading-model>> for more information.
===== Diagnosing `ShardLockObtainFailedException` failures


@@ -39,18 +39,19 @@ nodes will repeatedly log messages about the problem using a logger called
`org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By
default, this happens every 10 seconds.
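
As a quick, hedged illustration, these messages can be pulled out of the server logs like so; the log path assumes a package-based install, and archive installs write to `$ES_HOME/logs` instead:

[source,sh]
----
# Show the most recent cluster-formation warnings on a master-eligible node.
grep "ClusterFormationFailureHelper" /var/log/elasticsearch/*.log | tail -n 20
----
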
Master elections only involve master-eligible nodes, so focus your attention on
the master-eligible nodes in this situation. These nodes' logs will indicate
the requirements for a master election, such as the discovery of a certain set
of nodes. The <<health-api>> API on these nodes will also provide useful
information about the situation.
If the logs or the health report indicate that {es} can't discover enough nodes
to form a quorum, you must address the reasons preventing {es} from discovering
the missing nodes. The missing nodes are needed to reconstruct the cluster
metadata. Without the cluster metadata, the data in your cluster is
meaningless. The cluster metadata is stored on a subset of the master-eligible
nodes in the cluster. If a quorum can't be discovered, the missing nodes were
the ones holding the cluster metadata.
Ensure there are enough nodes running to form a quorum and that every node can
communicate with every other node over the network. {es} will report additional
@@ -59,10 +60,20 @@ than a few minutes. If you can't start enough nodes to form a quorum, start a
new cluster and restore data from a recent snapshot. Refer to
<<modules-discovery-quorums>> for more information.
If the logs or the health report indicate that {es} _has_ discovered a possible
quorum of nodes, the typical reason that the cluster can't elect a master is
that one of the other nodes can't discover a quorum. Inspect the logs on the
other master-eligible nodes and ensure that they have all discovered enough
nodes to form a quorum.
If the logs suggest that discovery or master elections are failing due to
timeouts or network-related issues then narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
[discrete]
[[discovery-master-unstable]]
@@ -72,7 +83,14 @@ When a node wins the master election, it logs a message containing
`elected-as-master`. If this happens repeatedly, the elected master node is
unstable. In this situation, focus on the logs from the master-eligible nodes
to understand why the election winner stops being the master and triggers
another election. If the logs suggest that the master is unstable due to
timeouts or network-related issues then narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
[discrete]
[[discovery-cannot-join-master]]
@@ -80,8 +98,18 @@ another election.
If there is a stable elected master but a node can't discover or join its
cluster, it will repeatedly log messages about the problem using the
`ClusterFormationFailureHelper` logger. The <<health-api>> API on the affected
node will also provide useful information about the situation. Other log
messages on the affected node and the elected master may provide additional
information about the problem. If the logs suggest that the node cannot
discover or join the cluster due to timeouts or network-related issues then
narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
[discrete]
[[discovery-node-leaves]]
@@ -89,4 +117,4 @@ and the elected master may provide additional information about the problem.
If a node joins the cluster but {es} determines it to be faulty then it will be
removed from the cluster again. See <<cluster-fault-detection-troubleshooting>>
for more information.


@@ -0,0 +1,45 @@
tag::troubleshooting-network-timeouts-gc-vm[]
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not GC is resulting in delays.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs.
end::troubleshooting-network-timeouts-gc-vm[]
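
A hedged sketch of the checks this snippet describes; the GC log file name follows the default `jvm.options`, and the exact wording of the overhead messages may vary between versions:

[source,sh]
----
# Look for long collections in the GC logs and for GC overhead warnings from
# the JVM monitor in the main server log. Run from the Elasticsearch home
# directory, or adjust the paths for your installation.
grep -i "pause" logs/gc.log* | tail -n 20
grep "overhead, spent" logs/*.log | tail -n 20
----
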
tag::troubleshooting-network-timeouts-packet-capture-elections[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at all relevant nodes. You
should be able to observe any retransmissions, packet loss, or other delays on
the connections between the nodes.
end::troubleshooting-network-timeouts-packet-capture-elections[]
tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
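
A minimal packet-capture sketch for the approach above, assuming the default transport port of `9300`; run it simultaneously on the elected master and the faulty node and compare the two captures:

[source,sh]
----
# Capture transport-layer traffic into a file for later analysis (for example
# in Wireshark). Repeat on both nodes so that retransmissions and packet loss
# are visible from either side. Requires root privileges.
tcpdump -i any -w transport-$(hostname).pcap port 9300
----
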
tag::troubleshooting-network-timeouts-threads[]
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to the relevant log
message.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in discovery and cluster membership are mainly
`transport_worker` and `cluster_coordination` threads, for which there should
never be a long wait. There may also be evidence of long waits for threads in
the {es} logs. See <<modules-network-threading-model>> for more information.
end::troubleshooting-network-timeouts-threads[]
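
For reference, a hedged example of the hot threads request mentioned above; as noted, prefer `jstack` when the cluster itself is struggling, and adjust the URL, credentials, and certificate path for your deployment:

[source,sh]
----
# Ask every node for its hottest threads. This depends on transport_worker and
# generic threads being responsive, so it may fail during the very problem
# being diagnosed.
curl -s -u elastic:"$ELASTIC_PASSWORD" --cacert config/certs/http_ca.crt \
  "https://localhost:9200/_nodes/hot_threads?threads=9999"
----
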