Enhance docs around network troubleshooting (#97305)
Discovery, like cluster membership, can also be affected by network-like issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this commit duplicates the troubleshooting info across both places.
parent 52a6820813
commit 09e53f9ad9
3 changed files with 101 additions and 148 deletions
@@ -26,7 +26,8 @@ its data path is unhealthy then it is removed from the cluster until the data
 path recovers. You can control this behavior with the
 <<modules-discovery-settings,`monitor.fs.health` settings>>.
 
-[[cluster-fault-detection-cluster-state-publishing]] The elected master node
+[[cluster-fault-detection-cluster-state-publishing]]
+The elected master node
 will also remove nodes from the cluster if nodes are unable to apply an updated
 cluster state within a reasonable time. The timeout defaults to 2 minutes
 starting from the beginning of the cluster state update. Refer to
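A quick way to inspect the current values of these filesystem health settings, including their defaults, is the cluster settings API with `include_defaults`. This is a minimal sketch, assuming an unsecured node listening on `localhost:9200`:

[source,sh]
----
# List the monitor.fs.health settings, including default values.
curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
  | grep "monitor.fs.health"
----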
@@ -120,6 +121,9 @@ When it rejoins, the `NodeJoinExecutor` will log that it processed a
 is unexpectedly restarting, look at the node's logs to see why it is shutting
 down.
 
+The <<health-api>> API on the affected node will also provide some useful
+information about the situation.
+
 If the node did not restart then you should look at the reason for its
 departure more closely. Each reason has different troubleshooting steps,
 described below. There are three possible reasons:
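The health API mentioned above can be called directly on the affected node. A minimal sketch, assuming {es} 8.7 or later (where the API is served at `_health_report`) and an unsecured HTTP endpoint on `localhost:9200`:

[source,sh]
----
# Ask the node for its view of cluster health, with per-indicator details.
curl -s "http://localhost:9200/_health_report?pretty"
----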
@@ -244,141 +248,17 @@ a possible cause for this kind of instability. Log messages containing
 
 If the last check failed with an exception then the exception is reported, and
 typically indicates the problem that needs to be addressed. If any of the
-checks timed out, it may be necessary to understand the detailed sequence of
-steps involved in a successful check. Here is an example of such a sequence:
+checks timed out then narrow down the problem as follows.
 
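When the last check failed with an exception, the exception appears in the elected master's server log alongside the node departure event. A rough sketch for finding those entries, assuming the default log directory and a cluster named `my-cluster`; both the path and the exact message wording are assumptions that vary by deployment and version:

[source,sh]
----
# Show node departures and the reason the follower checks gave up.
grep -E "node-left|followers check retry count exceeded" logs/my-cluster.log
----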
-. The master's `FollowerChecker`, running on thread
-`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send
-the check request message to a follower node.
+include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
 
-. The master's `TransportService` running on thread
-`elasticsearch[master][transport_worker][T#2]` passes the check request message
-onto the operating system.
+include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection]
 
-. The operating system on the master converts the message into one or more
-packets and sends them out over the network.
+include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
 
-. Miscellaneous routers, firewalls, and other devices between the master node
-and the follower node forward the packets, possibly fragmenting or
-defragmenting them on the way.
 
-. The operating system on the follower node receives the packets and notifies
-{es} that they've been received.
 
-. The follower's `TransportService`, running on thread
-`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets.
-It then reconstructs and processes the check request. Usually, the check
-quickly succeeds. If so, the same thread immediately constructs a response and
-passes it back to the operating system.
 
-. If the check doesn't immediately succeed (for example, an election started
-recently) then:
 
-.. The follower's `FollowerChecker`, running on thread
-`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It
-constructs a response and tells the `TransportService` to send the response
-back to the master.
 
-.. The follower's `TransportService`, running on thread
-`elasticsearch[follower][transport_worker][T#3]`, passes the response to the
-operating system.
 
-. The operating system on the follower converts the response into one or more
-packets and sends them out over the network.
 
-. Miscellaneous routers, firewalls, and other devices between master and
-follower forward the packets, possibly fragmenting or defragmenting them on the
-way.
 
-. The operating system on the master receives the packets and notifies {es}
-that they've been received.
 
-. The master's `TransportService`, running on thread
-`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets,
-reconstructs the check response, and processes it as long as the check didn't
-already time out.
 
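The sequence above runs over long-lived TCP connections between the master and each follower. To confirm that those transport connections exist and stay established, standard socket tooling can be used on either node. A sketch, assuming a Linux host with `ss` available and the default transport port `9300`:

[source,sh]
----
# List established transport connections to or from this node,
# including the owning process.
ss -tnp state established '( sport = :9300 or dport = :9300 )'
----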
-There are a lot of different things that can delay the completion of a check
-and cause it to time out. Here are some examples for each step:
 
-. There may be a long garbage collection (GC) or virtual machine (VM) pause
-after passing the check request to the `TransportService`.
 
-. There may be a long wait for the specific `transport_worker` thread to become
-available, or there may be a long GC or VM pause before passing the check
-request onto the operating system.
 
-. A system fault (for example, a broken network card) on the master may delay
-sending the message over the network, possibly indefinitely.
 
-. Intermediate devices may delay, drop, or corrupt packets along the way. The
-operating system for the master will wait and retransmit any unacknowledged or
-corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend
-<<system-config-tcpretries,reducing this value>> since the default represents a
-very long delay.
 
-. A system fault (for example, a broken network card) on the follower may delay
-receiving the message from the network.
 
-. There may be a long wait for the specific `transport_worker` thread to become
-available, or there may be a long GC or VM pause during the processing of the
-request on the follower.
 
-. There may be a long wait for the `cluster_coordination` thread to become
-available, or for the specific `transport_worker` thread to become available
-again. There may also be a long GC or VM pause during the processing of the
-request.
 
-. A system fault (for example, a broken network card) on the follower may delay
-sending the response from the network.
 
-. Intermediate devices may delay, drop, or corrupt packets along the way again,
-causing retransmissions.
 
-. A system fault (for example, a broken network card) on the master may delay
-receiving the message from the network.
 
-. There may be a long wait for the specific `transport_worker` thread to become
-available to process the response, or a long GC or VM pause.
 
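The `net.ipv4.tcp_retries2` recommendation in the list above can be checked and applied on Linux with `sysctl`. A sketch, assuming root access; the value `5` follows the referenced system configuration guidance, and the file name under `/etc/sysctl.d/` is only an example:

[source,sh]
----
# Inspect the current retransmission limit (the kernel default of 15
# allows retries to continue for many minutes).
sysctl net.ipv4.tcp_retries2

# Apply the lower limit immediately...
sysctl -w net.ipv4.tcp_retries2=5

# ...and persist it across reboots.
echo "net.ipv4.tcp_retries2=5" > /etc/sysctl.d/90-tcp-retries.conf
----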
-To determine why follower checks are timing out, we can narrow down the reason
-for the delay as follows:
 
-* GC pauses are recorded in the GC logs that {es} emits by default, and also
-usually by the `JvmMonitorService` in the main node logs. Use these logs to
-confirm whether or not GC is resulting in delays.
 
-* VM pauses also affect other processes on the same host. A VM pause also
-typically causes a discontinuity in the system clock, which {es} will report in
-its logs.
 
-* Packet captures will reveal system-level and network-level faults, especially
-if you capture the network traffic simultaneously at the elected master and the
-faulty node. The connection used for follower checks is not used for any other
-traffic so it can be easily identified from the flow pattern alone, even if TLS
-is in use: almost exactly every second there will be a few hundred bytes sent
-each way, first the request by the master and then the response by the
-follower. You should be able to observe any retransmissions, packet loss, or
-other delays on such a connection.
 
-* Long waits for particular threads to be available can be identified by taking
-stack dumps (for example, using `jstack`) or a profiling trace (for example,
-using Java Flight Recorder) in the few seconds leading up to a node departure.
-+
-By default the follower checks will time out after 30s, so if node departures
-are unpredictable then capture stack dumps every 15s to be sure that at least
-one stack dump was taken at the right time.
-+
-The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
-bear in mind that this API also requires a number of `transport_worker` and
-`generic` threads across all the nodes in the cluster. The API may be affected
-by the very problem you're trying to diagnose. `jstack` is much more reliable
-since it doesn't require any JVM threads.
-+
-The threads involved in the follower checks are `transport_worker` and
-`cluster_coordination` threads, for which there should never be a long wait.
-There may also be evidence of long waits for threads in the {es} logs. See
-<<modules-network-threading-model>> for more information.
 
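For the GC bullet in the list above, long collections are usually visible directly in the GC logs. A rough sketch, assuming the default JVM logging configuration that writes `gc.log` files into the {es} logs directory:

[source,sh]
----
# Show GC pause lines; correlate long pauses with the time of the node departure.
grep "Pause" logs/gc.log*
----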
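For the packet-capture bullet, filtering the capture to the peer node keeps the file manageable. A sketch, assuming the default transport port `9300` and a placeholder peer address `192.0.2.10`; ideally run the equivalent capture simultaneously on the elected master and the faulty node:

[source,sh]
----
# Record the transport traffic exchanged with the other node for later
# inspection in Wireshark or tshark.
tcpdump -i any -w follower-checks.pcap host 192.0.2.10 and port 9300
----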
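For the thread-wait bullet, stack dumps can be taken on a schedule so that at least one falls within the 30s follower check timeout. A sketch, assuming the {es} process ID is in `ES_PID`, `jstack` from the bundled JDK is on the path, and the command runs as the same user as {es}:

[source,sh]
----
# Take a stack dump every 15 seconds until interrupted (Ctrl-C to stop).
while true; do
  jstack "$ES_PID" > "jstack-$(date +%Y%m%dT%H%M%S).txt"
  sleep 15
done
----

The <<cluster-nodes-hot-threads>> API (`GET /_nodes/hot_threads`) is an alternative when the cluster can still serve it, with the caveats described in the text above.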
 
 ===== Diagnosing `ShardLockObtainFailedException` failures
 