Enhance docs around network troubleshooting (#97305)

Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this
commit duplicates the troubleshooting info across both places.
David Turner 2023-07-10 10:57:44 +01:00 committed by GitHub
parent 52a6820813
commit 09e53f9ad9
3 changed files with 101 additions and 148 deletions


@@ -26,7 +26,8 @@ its data path is unhealthy then it is removed from the cluster until the data
path recovers. You can control this behavior with the
<<modules-discovery-settings,`monitor.fs.health` settings>>.
[[cluster-fault-detection-cluster-state-publishing]]
The elected master node
will also remove nodes from the cluster if nodes are unable to apply an updated
cluster state within a reasonable time. The timeout defaults to 2 minutes
starting from the beginning of the cluster state update. Refer to
@@ -120,6 +121,9 @@ When it rejoins, the `NodeJoinExecutor` will log that it processed a
is unexpectedly restarting, look at the node's logs to see why it is shutting
down.
The <<health-api>> API on the affected node will also provide some useful
information about the situation.
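
For illustration, a hedged sketch of requesting the health report directly from the affected node; the endpoint, credentials, and certificate path shown here are assumptions to adapt to your deployment:

[source,sh]
----
# Fetch the health report from the affected node. Adjust the URL, credentials
# and certificate path to match your cluster's security configuration.
curl -s -u elastic:"$ELASTIC_PASSWORD" --cacert config/certs/http_ca.crt \
  "https://localhost:9200/_health_report?pretty"
----
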
If the node did not restart then you should look at the reason for its
departure more closely. Each reason has different troubleshooting steps,
described below. There are three possible reasons:
@@ -244,141 +248,17 @@ a possible cause for this kind of instability. Log messages containing
If the last check failed with an exception then the exception is reported, and
typically indicates the problem that needs to be addressed. If any of the
checks timed out then narrow down the problem as follows.
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection]
include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
checks timed out, it may be necessary to understand the detailed sequence of
steps involved in a successful check. Here is an example of such a sequence:
. The master's `FollowerChecker`, running on thread
`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send
the check request message to a follower node.
. The master's `TransportService` running on thread
`elasticsearch[master][transport_worker][T#2]` passes the check request message
onto the operating system.
. The operating system on the master converts the message into one or more
packets and sends them out over the network.
. Miscellaneous routers, firewalls, and other devices between the master node
and the follower node forward the packets, possibly fragmenting or
defragmenting them on the way.
. The operating system on the follower node receives the packets and notifies
{es} that they've been received.
. The follower's `TransportService`, running on thread
`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets.
It then reconstructs and processes the check request. Usually, the check
quickly succeeds. If so, the same thread immediately constructs a response and
passes it back to the operating system.
. If the check doesn't immediately succeed (for example, an election started
recently) then:
.. The follower's `FollowerChecker`, running on thread
`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It
constructs a response and tells the `TransportService` to send the response
back to the master.
.. The follower's `TransportService`, running on thread
`elasticsearch[follower][transport_worker][T#3]`, passes the response to the
operating system.
. The operating system on the follower converts the response into one or more
packets and sends them out over the network.
. Miscellaneous routers, firewalls, and other devices between master and
follower forward the packets, possibly fragmenting or defragmenting them on the
way.
. The operating system on the master receives the packets and notifies {es}
that they've been received.
. The master's `TransportService`, running on thread
`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets,
reconstructs the check response, and processes it as long as the check didn't
already time out.
There are a lot of different things that can delay the completion of a check
and cause it to time out. Here are some examples for each step:
. There may be a long garbage collection (GC) or virtual machine (VM) pause
after passing the check request to the `TransportService`.
. There may be a long wait for the specific `transport_worker` thread to become
available, or there may be a long GC or VM pause before passing the check
request onto the operating system.
. A system fault (for example, a broken network card) on the master may delay
sending the message over the network, possibly indefinitely.
. Intermediate devices may delay, drop, or corrupt packets along the way. The
operating system for the master will wait and retransmit any unacknowledged or
corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend
<<system-config-tcpretries,reducing this value>> since the default represents a
very long delay (see the configuration sketch after this list).
. A system fault (for example, a broken network card) on the follower may delay
receiving the message from the network.
. There may be a long wait for the specific `transport_worker` thread to become
available, or there may be a long GC or VM pause during the processing of the
request on the follower.
. There may be a long wait for the `cluster_coordination` thread to become
available, or for the specific `transport_worker` thread to become available
again. There may also be a long GC or VM pause during the processing of the
request.
. A system fault (for example, a broken network card) on the follower may delay
sending the response over the network.
. Intermediate devices may delay, drop, or corrupt packets along the way again,
causing retransmissions.
. A system fault (for example, a broken network card) on the master may delay
receiving the message from the network.
. There may be a long wait for the specific `transport_worker` thread to become
available to process the response, or a long GC or VM pause.
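
For the `net.ipv4.tcp_retries2` item in the list above, here is a minimal sketch of checking and reducing the retransmission limit on Linux; the value `5` follows the guidance linked via <<system-config-tcpretries,the system configuration docs>>, but verify it against that page for your version:

[source,sh]
----
# Inspect the current retransmission limit (Linux only; run as root).
sysctl net.ipv4.tcp_retries2

# Reduce it at runtime; persist the change under /etc/sysctl.d/ so that it
# survives a reboot.
sysctl -w net.ipv4.tcp_retries2=5
----
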
To determine why follower checks are timing out, we can narrow down the reason
for the delay as follows:
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not GC is resulting in delays.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs.
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to a node departure.
+
By default the follower checks will time out after 30s, so if node departures
are unpredictable then capture stack dumps every 15s to be sure that at least
one stack dump was taken at the right time.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in the follower checks are `transport_worker` and
`cluster_coordination` threads, for which there should never be a long wait.
There may also be evidence of long waits for threads in the {es} logs. See
<<modules-network-threading-model>> for more information.
===== Diagnosing `ShardLockObtainFailedException` failures


@@ -39,18 +39,19 @@ nodes will repeatedly log messages about the problem using a logger called
`org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By
default, this happens every 10 seconds.
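
As a quick, hedged illustration, these messages can be pulled out of the server logs like so; the log path assumes a package-based install, and archive installs write to `$ES_HOME/logs` instead:

[source,sh]
----
# Show the most recent cluster-formation warnings on a master-eligible node.
grep "ClusterFormationFailureHelper" /var/log/elasticsearch/*.log | tail -n 20
----
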
Master elections only involve master-eligible nodes, so focus your attention on
the master-eligible nodes in this situation. These nodes' logs will indicate
the requirements for a master election, such as the discovery of a certain set
of nodes. The <<health-api>> API on these nodes will also provide useful
information about the situation.
If the logs or the health report indicate that {es} can't discover enough nodes
to form a quorum, you must address the reasons preventing {es} from discovering
the missing nodes. The missing nodes are needed to reconstruct the cluster
metadata. Without the cluster metadata, the data in your cluster is
meaningless. The cluster metadata is stored on a subset of the master-eligible
nodes in the cluster. If a quorum can't be discovered, the missing nodes were
the ones holding the cluster metadata.
Ensure there are enough nodes running to form a quorum and that every node can
communicate with every other node over the network. {es} will report additional
@@ -59,10 +60,20 @@ than a few minutes. If you can't start enough nodes to form a quorum, start a
new cluster and restore data from a recent snapshot. Refer to
<<modules-discovery-quorums>> for more information.
If the logs or the health report indicate that {es} _has_ discovered a possible
quorum of nodes, the typical reason that the cluster can't elect a master is
that one of the other nodes can't discover a quorum. Inspect the logs on the
other master-eligible nodes and ensure that they have all discovered enough
nodes to form a quorum.
If the logs suggest that discovery or master elections are failing due to
timeouts or network-related issues then narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
[discrete]
[[discovery-master-unstable]]
@@ -72,7 +83,14 @@ When a node wins the master election, it logs a message containing
`elected-as-master`. If this happens repeatedly, the elected master node is
unstable. In this situation, focus on the logs from the master-eligible nodes
to understand why the election winner stops being the master and triggers
another election. If the logs suggest that the master is unstable due to
timeouts or network-related issues then narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
[discrete]
[[discovery-cannot-join-master]]
@@ -80,8 +98,18 @@ another election.
If there is a stable elected master but a node can't discover or join its
cluster, it will repeatedly log messages about the problem using the
`ClusterFormationFailureHelper` logger. The <<health-api>> API on the affected
node will also provide useful information about the situation. Other log
messages on the affected node and the elected master may provide additional
information about the problem. If the logs suggest that the node cannot
discover or join the cluster due to timeouts or network-related issues then
narrow down the problem as follows.
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
[discrete]
[[discovery-node-leaves]]
@@ -89,4 +117,4 @@ and the elected master may provide additional information about the problem.
If a node joins the cluster but {es} determines it to be faulty then it will be
removed from the cluster again. See <<cluster-fault-detection-troubleshooting>>
for more information.


@@ -0,0 +1,45 @@
tag::troubleshooting-network-timeouts-gc-vm[]
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not GC is resulting in delays.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs.
end::troubleshooting-network-timeouts-gc-vm[]
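
A hedged sketch of the checks this snippet describes; the GC log file name follows the default `jvm.options`, and the exact wording of the overhead messages may vary between versions:

[source,sh]
----
# Look for long collections in the GC logs and for GC overhead warnings from
# the JVM monitor in the main server log. Run from the Elasticsearch home
# directory, or adjust the paths for your installation.
grep -i "pause" logs/gc.log* | tail -n 20
grep "overhead, spent" logs/*.log | tail -n 20
----
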
tag::troubleshooting-network-timeouts-packet-capture-elections[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at all relevant nodes. You
should be able to observe any retransmissions, packet loss, or other delays on
the connections between the nodes.
end::troubleshooting-network-timeouts-packet-capture-elections[]
tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
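
A minimal packet-capture sketch for the approach above, assuming the default transport port of `9300`; run it simultaneously on the elected master and the faulty node and compare the two captures:

[source,sh]
----
# Capture transport-layer traffic into a file for later analysis (for example
# in Wireshark). Repeat on both nodes so that retransmissions and packet loss
# are visible from either side. Requires root privileges.
tcpdump -i any -w transport-$(hostname).pcap port 9300
----
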
tag::troubleshooting-network-timeouts-threads[]
* Long waits for particular threads to be available can be identified by taking
stack dumps (for example, using `jstack`) or a profiling trace (for example,
using Java Flight Recorder) in the few seconds leading up to the relevant log
message.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in discovery and cluster membership are mainly
`transport_worker` and `cluster_coordination` threads, for which there should
never be a long wait. There may also be evidence of long waits for threads in
the {es} logs. See <<modules-network-threading-model>> for more information.
end::troubleshooting-network-timeouts-threads[]
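
For reference, a hedged example of the hot threads request mentioned above; as noted, prefer `jstack` when the cluster itself is struggling, and adjust the URL, credentials, and certificate path for your deployment:

[source,sh]
----
# Ask every node for its hottest threads. This depends on transport_worker and
# generic threads being responsive, so it may fail during the very problem
# being diagnosed.
curl -s -u elastic:"$ELASTIC_PASSWORD" --cacert config/certs/http_ca.crt \
  "https://localhost:9200/_nodes/hot_threads?threads=9999"
----
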