Enhance docs around network troubleshooting (#97305)

Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads), so this
commit duplicates the troubleshooting info across both places.
David Turner 2023-07-10 10:57:44 +01:00 committed by GitHub
parent 52a6820813
commit 09e53f9ad9
3 changed files with 101 additions and 148 deletions


@@ -26,7 +26,8 @@ its data path is unhealthy then it is removed from the cluster until the data
path recovers. You can control this behavior with the
<<modules-discovery-settings,`monitor.fs.health` settings>>.
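For example, to check the effective values of these settings on a running cluster you could query the cluster settings API. This is a minimal sketch, assuming the cluster is reachable on `localhost:9200` without security options; adjust the address and add authentication as needed:

[source,sh]
----
# Sketch only: show the effective monitor.fs.health settings, including defaults.
# Assumes the cluster is reachable on localhost:9200; add authentication options
# (for example -u or --cacert) if security is enabled.
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
  | grep "monitor.fs.health"
----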
[[cluster-fault-detection-cluster-state-publishing]] The elected master node
[[cluster-fault-detection-cluster-state-publishing]]
The elected master node
will also remove nodes from the cluster if nodes are unable to apply an updated
cluster state within a reasonable time. The timeout defaults to 2 minutes
starting from the beginning of the cluster state update. Refer to
@@ -120,6 +121,9 @@ When it rejoins, the `NodeJoinExecutor` will log that it processed a
is unexpectedly restarting, look at the node's logs to see why it is shutting
down.
The <<health-api>> API on the affected node will also provide some useful
information about the situation.
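For instance, you can call the health API directly on the affected node to get its view of the situation. A minimal sketch, assuming the node's HTTP interface is reachable on `localhost:9200` and that no security options are needed (adjust both for your cluster):

[source,sh]
----
# Sketch only: request a health report from the affected node.
# Assumes the node's HTTP interface is reachable on localhost:9200; add
# authentication options if security is enabled.
curl -s "localhost:9200/_health_report?pretty"
----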
If the node did not restart then you should look at the reason for its
departure more closely. Each reason has different troubleshooting steps,
described below. There are three possible reasons:
@@ -244,141 +248,17 @@ a possible cause for this kind of instability. Log messages containing
If the last check failed with an exception then the exception is reported, and
typically indicates the problem that needs to be addressed. If any of the
checks timed out, it may be necessary to understand the detailed sequence of
steps involved in a successful check. Here is an example of such a sequence:
checks timed out then narrow down the problem as follows.
. The master's `FollowerChecker`, running on thread
`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send
the check request message to a follower node.
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
. The master's `TransportService` running on thread
`elasticsearch[master][transport_worker][T#2]` passes the check request message
onto the operating system.
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection]
. The operating system on the master converts the message into one or more
packets and sends them out over the network.
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
. Miscellaneous routers, firewalls, and other devices between the master node
and the follower node forward the packets, possibly fragmenting or
defragmenting them on the way.
. The operating system on the follower node receives the packets and notifies
{es} that they've been received.
. The follower's `TransportService`, running on thread
`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets.
It then reconstructs and processes the check request. Usually, the check
quickly succeeds. If so, the same thread immediately constructs a response and
passes it back to the operating system.
. If the check doesn't immediately succeed (for example, an election started
recently) then:
.. The follower's `FollowerChecker`, running on thread
`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It
constructs a response and tells the `TransportService` to send the response
back to the master.
.. The follower's `TransportService`, running on thread
`elasticsearch[follower][transport_worker][T#3]`, passes the response to the
operating system.
. The operating system on the follower converts the response into one or more
packets and sends them out over the network.
. Miscellaneous routers, firewalls, and other devices between master and
follower forward the packets, possibly fragmenting or defragmenting them on the
way.
. The operating system on the master receives the packets and notifies {es}
that they've been received.
. The master's `TransportService`, running on thread
`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets,
reconstructs the check response, and processes it as long as the check didn't
already time out.
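The thread names used in these steps appear verbatim in stack dumps, so you can inspect them on a live node. Here is a minimal sketch, assuming the JDK's `jstack` tool is on the `PATH` and that `ES_PID` holds the {es} process ID (both are assumptions for illustration, not part of the sequence above):

[source,sh]
----
# Sketch only: list the threads involved in follower checks on a live node.
# Assumes jstack is available and ES_PID contains the Elasticsearch process ID.
jstack "$ES_PID" | grep -E 'transport_worker|cluster_coordination|scheduler'
----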
There are a lot of different things that can delay the completion of a check
and cause it to time out. Here are some examples for each step:
. There may be a long garbage collection (GC) or virtual machine (VM) pause
after passing the check request to the `TransportService`.
. There may be a long wait for the specific `transport_worker` thread to become
available, or there may be a long GC or VM pause before passing the check
request onto the operating system.
. A system fault (for example, a broken network card) on the master may delay
sending the message over the network, possibly indefinitely.
. Intermediate devices may delay, drop, or corrupt packets along the way. The
operating system for the master will wait and retransmit any unacknowledged or
corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend
<<system-config-tcpretries,reducing this value>> since the default represents a
very long delay. A sketch of how to inspect and change it appears after this list.
. A system fault (for example, a broken network card) on the follower may delay
receiving the message from the network.
. There may be a long wait for the specific `transport_worker` thread to become
available, or there may be a long GC or VM pause during the processing of the
request on the follower.
. There may be a long wait for the `cluster_coordination` thread to become
available, or for the specific `transport_worker` thread to become available
again. There may also be a long GC or VM pause during the processing of the
request.
. A system fault (for example, a broken network card) on the follower may delay
sending the response over the network.
. Intermediate devices may delay, drop, or corrupt packets along the way again,
causing retransmissions.
. A system fault (for example, a broken network card) on the master may delay
receiving the response from the network.
. There may be a long wait for the specific `transport_worker` thread to become
available to process the response, or a long GC or VM pause.
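As a concrete illustration of the retransmission point above, here is a sketch of how you might inspect and reduce `net.ipv4.tcp_retries2` on a Linux host. It assumes root privileges, and the value shown is only an example; follow <<system-config-tcpretries>> for the recommended procedure and value:

[source,sh]
----
# Sketch only: inspect and reduce the kernel's TCP retransmission limit on Linux.
# Requires root privileges; the value 5 is illustrative, see the referenced
# documentation for the recommended setting.
sysctl net.ipv4.tcp_retries2                # show the current value
sysctl -w net.ipv4.tcp_retries2=5           # apply immediately
echo 'net.ipv4.tcp_retries2=5' | tee /etc/sysctl.d/90-tcp-retries2.conf  # persist across reboots
----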
To determine why follower checks are timing out, we can narrow down the reason
for the delay as follows (a combined sketch of these checks appears after this
list):
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not GC is resulting in delays.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs.
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to a node departure.
+
By default the follower checks will time out after 30s, so if node departures
are unpredictable then capture stack dumps every 15s to be sure that at least
one stack dump was taken at the right time.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in the follower checks are `transport_worker` and
`cluster_coordination` threads, for which there should never be a long wait.
There may also be evidence of long waits for threads in the {es} logs. See
<<modules-network-threading-model>> for more information.
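The following sketch pulls these suggestions together. It assumes a Linux host with `tcpdump` and the JDK's `jstack` on the `PATH`, an `ES_PID` variable holding the {es} process ID, transport traffic on the default port 9300, logs under `/var/log/elasticsearch`, and an HTTP interface on `localhost:9200`; all of these are assumptions for illustration, and each step is intended to be run separately:

[source,sh]
----
# Sketch only; adapt paths, ports, addresses and the ES_PID variable to your
# deployment, and run each step in its own terminal as needed.

# 1. Look for long GC pauses in the GC logs and for JVM monitor warnings in the
#    main node log (log locations and formats vary by installation).
grep -i "pause" /var/log/elasticsearch/gc.log* | tail -n 20
grep -E "MonitorService|overhead, spent" /var/log/elasticsearch/*.log | tail -n 20

# 2. Capture transport traffic (default port 9300), ideally at the same time on
#    the elected master and the faulty node; stop the capture with Ctrl-C.
tcpdump -i any -w /tmp/transport.pcap 'tcp port 9300'

# 3. Take a stack dump every 15 seconds so that at least one falls inside a
#    30-second follower-check timeout window (here, for about two minutes).
for i in $(seq 1 8); do
  jstack "$ES_PID" > "/tmp/jstack-$(date +%s).txt"
  sleep 15
done

# 4. Optionally ask for hot threads, bearing in mind that this API needs working
#    transport_worker and generic threads and may be affected by the problem itself.
curl -s "localhost:9200/_nodes/hot_threads?threads=9999"
----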
===== Diagnosing `ShardLockObtainFailedException` failures