Enhance docs around network troubleshooting (#97305)

Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads), so this
commit duplicates the troubleshooting info across both places.
David Turner 2023-07-10 10:57:44 +01:00 committed by GitHub
parent 52a6820813
commit 09e53f9ad9
3 changed files with 101 additions and 148 deletions


@@ -26,7 +26,8 @@ its data path is unhealthy then it is removed from the cluster until the data
path recovers. You can control this behavior with the
<<modules-discovery-settings,`monitor.fs.health` settings>>.
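For example, to check the effective values of these settings on a running cluster you could query the cluster settings API. This is a minimal sketch, assuming the cluster is reachable on `localhost:9200` without security options; adjust the address and add authentication as needed:

[source,sh]
----
# Sketch only: show the effective monitor.fs.health settings, including defaults.
# Assumes the cluster is reachable on localhost:9200; add authentication options
# (for example -u or --cacert) if security is enabled.
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
  | grep "monitor.fs.health"
----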
[[cluster-fault-detection-cluster-state-publishing]] The elected master node
[[cluster-fault-detection-cluster-state-publishing]]
The elected master node
will also remove nodes from the cluster if nodes are unable to apply an updated
cluster state within a reasonable time. The timeout defaults to 2 minutes
starting from the beginning of the cluster state update. Refer to
@@ -120,6 +121,9 @@ When it rejoins, the `NodeJoinExecutor` will log that it processed a
is unexpectedly restarting, look at the node's logs to see why it is shutting
down.
The <<health-api>> API on the affected node will also provide some useful
information about the situation.
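For instance, you can call the health API directly on the affected node to get its view of the situation. A minimal sketch, assuming the node's HTTP interface is reachable on `localhost:9200` and that no security options are needed (adjust both for your cluster):

[source,sh]
----
# Sketch only: request a health report from the affected node.
# Assumes the node's HTTP interface is reachable on localhost:9200; add
# authentication options if security is enabled.
curl -s "localhost:9200/_health_report?pretty"
----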
If the node did not restart then you should look at the reason for its
departure more closely. Each reason has different troubleshooting steps,
described below. There are three possible reasons:
@@ -244,141 +248,17 @@ a possible cause for this kind of instability. Log messages containing
If the last check failed with an exception then the exception is reported, and
typically indicates the problem that needs to be addressed. If any of the
checks timed out, it may be necessary to understand the detailed sequence of
steps involved in a successful check. Here is an example of such a sequence:
checks timed out then narrow down the problem as follows.
. The master's `FollowerChecker`, running on thread
`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send
the check request message to a follower node.
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
. The master's `TransportService` running on thread
`elasticsearch[master][transport_worker][T#2]` passes the check request message
onto the operating system.
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection]
. The operating system on the master converts the message into one or more
packets and sends them out over the network.
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
. Miscellaneous routers, firewalls, and other devices between the master node
and the follower node forward the packets, possibly fragmenting or
defragmenting them on the way.
. The operating system on the follower node receives the packets and notifies
{es} that they've been received.
. The follower's `TransportService`, running on thread
`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets.
It then reconstructs and processes the check request. Usually, the check
quickly succeeds. If so, the same thread immediately constructs a response and
passes it back to the operating system.
. If the check doesn't immediately succeed (for example, an election started
recently) then:
.. The follower's `FollowerChecker`, running on thread
`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It
constructs a response and tells the `TransportService` to send the response
back to the master.
.. The follower's `TransportService`, running on thread
`elasticsearch[follower][transport_worker][T#3]`, passes the response to the
operating system.
. The operating system on the follower converts the response into one or more
packets and sends them out over the network.
. Miscellaneous routers, firewalls, and other devices between master and
follower forward the packets, possibly fragmenting or defragmenting them on the
way.
. The operating system on the master receives the packets and notifies {es}
that they've been received.
. The master's `TransportService`, running on thread
`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets,
reconstructs the check response, and processes it as long as the check didn't
already time out.
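The thread names used in these steps appear verbatim in stack dumps, so you can inspect them on a live node. Here is a minimal sketch, assuming the JDK's `jstack` tool is on the `PATH` and that `ES_PID` holds the {es} process ID (both are assumptions for illustration, not part of the sequence above):

[source,sh]
----
# Sketch only: list the threads involved in follower checks on a live node.
# Assumes jstack is available and ES_PID contains the Elasticsearch process ID.
jstack "$ES_PID" | grep -E 'transport_worker|cluster_coordination|scheduler'
----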
There are a lot of different things that can delay the completion of a check
and cause it to time out. Here are some examples for each step:
. There may be a long garbage collection (GC) or virtual machine (VM) pause
after passing the check request to the `TransportService`.
. There may be a long wait for the specific `transport_worker` thread to become
available, or there may be a long GC or VM pause before passing the check
request onto the operating system.
. A system fault (for example, a broken network card) on the master may delay
sending the message over the network, possibly indefinitely.
. Intermediate devices may delay, drop, or corrupt packets along the way. The
operating system for the master will wait and retransmit any unacknowledged or
corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend
<<system-config-tcpretries,reducing this value>> since the default represents a
very long delay. A sketch of how to inspect and change it appears after this list.
. A system fault (for example, a broken network card) on the follower may delay
receiving the message from the network.
. There may be a long wait for the specific `transport_worker` thread to become
available, or there may be a long GC or VM pause during the processing of the
request on the follower.
. There may be a long wait for the `cluster_coordination` thread to become
available, or for the specific `transport_worker` thread to become available
again. There may also be a long GC or VM pause during the processing of the
request.
. A system fault (for example, a broken network card) on the follower may delay
sending the response over the network.
. Intermediate devices may delay, drop, or corrupt packets along the way again,
causing retransmissions.
. A system fault (for example, a broken network card) on the master may delay
receiving the response from the network.
. There may be a long wait for the specific `transport_worker` thread to become
available to process the response, or a long GC or VM pause.
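As a concrete illustration of the retransmission point above, here is a sketch of how you might inspect and reduce `net.ipv4.tcp_retries2` on a Linux host. It assumes root privileges, and the value shown is only an example; follow <<system-config-tcpretries>> for the recommended procedure and value:

[source,sh]
----
# Sketch only: inspect and reduce the kernel's TCP retransmission limit on Linux.
# Requires root privileges; the value 5 is illustrative, see the referenced
# documentation for the recommended setting.
sysctl net.ipv4.tcp_retries2                # show the current value
sysctl -w net.ipv4.tcp_retries2=5           # apply immediately
echo 'net.ipv4.tcp_retries2=5' | tee /etc/sysctl.d/90-tcp-retries2.conf  # persist across reboots
----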
To determine why follower checks are timing out, we can narrow down the reason
for the delay as follows (a combined sketch of these checks appears after this
list):
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not GC is resulting in delays.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs.
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to a node departure.
+
By default the follower checks will time out after 30s, so if node departures
are unpredictable then capture stack dumps every 15s to be sure that at least
one stack dump was taken at the right time.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in the follower checks are `transport_worker` and
`cluster_coordination` threads, for which there should never be a long wait.
There may also be evidence of long waits for threads in the {es} logs. See
<<modules-network-threading-model>> for more information.
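The following sketch pulls these suggestions together. It assumes a Linux host with `tcpdump` and the JDK's `jstack` on the `PATH`, an `ES_PID` variable holding the {es} process ID, transport traffic on the default port 9300, logs under `/var/log/elasticsearch`, and an HTTP interface on `localhost:9200`; all of these are assumptions for illustration, and each step is intended to be run separately:

[source,sh]
----
# Sketch only; adapt paths, ports, addresses and the ES_PID variable to your
# deployment, and run each step in its own terminal as needed.

# 1. Look for long GC pauses in the GC logs and for JVM monitor warnings in the
#    main node log (log locations and formats vary by installation).
grep -i "pause" /var/log/elasticsearch/gc.log* | tail -n 20
grep -E "MonitorService|overhead, spent" /var/log/elasticsearch/*.log | tail -n 20

# 2. Capture transport traffic (default port 9300), ideally at the same time on
#    the elected master and the faulty node; stop the capture with Ctrl-C.
tcpdump -i any -w /tmp/transport.pcap 'tcp port 9300'

# 3. Take a stack dump every 15 seconds so that at least one falls inside a
#    30-second follower-check timeout window (here, for about two minutes).
for i in $(seq 1 8); do
  jstack "$ES_PID" > "/tmp/jstack-$(date +%s).txt"
  sleep 15
done

# 4. Optionally ask for hot threads, bearing in mind that this API needs working
#    transport_worker and generic threads and may be affected by the problem itself.
curl -s "localhost:9200/_nodes/hot_threads?threads=9999"
----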
===== Diagnosing `ShardLockObtainFailedException` failures