Enhance docs around network troubleshooting (#97305)
Discovery, like cluster membership, can also be affected by network-like issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this commit duplicates the troubleshooting info across both places.
parent 52a6820813
commit 09e53f9ad9

3 changed files with 101 additions and 148 deletions
First changed file:

@@ -26,7 +26,8 @@ its data path is unhealthy then it is removed from the cluster until the data
 path recovers. You can control this behavior with the
 <<modules-discovery-settings,`monitor.fs.health` settings>>.
 
-[[cluster-fault-detection-cluster-state-publishing]] The elected master node
+[[cluster-fault-detection-cluster-state-publishing]]
+The elected master node
 will also remove nodes from the cluster if nodes are unable to apply an updated
 cluster state within a reasonable time. The timeout defaults to 2 minutes
 starting from the beginning of the cluster state update. Refer to

@@ -120,6 +121,9 @@ When it rejoins, the `NodeJoinExecutor` will log that it processed a
 is unexpectedly restarting, look at the node's logs to see why it is shutting
 down.
 
+The <<health-api>> API on the affected node will also provide some useful
+information about the situation.
+
 If the node did not restart then you should look at the reason for its
 departure more closely. Each reason has different troubleshooting steps,
 described below. There are three possible reasons:

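The paragraph added above points readers at the health API without showing a call. As a quick illustration only, the report can be fetched from the shell, assuming a node reachable on the default HTTP port, suitable credentials, and a release that exposes the `_health_report` endpoint; the `master_is_stable` indicator name is likewise an assumption to verify against your version's docs:

[source,sh]
----
# Fetch the full health report from the local node (adjust or omit -u to
# match how security is configured on your cluster).
curl -s -u elastic "http://localhost:9200/_health_report?pretty"

# Or fetch a single indicator, for example master stability.
curl -s -u elastic "http://localhost:9200/_health_report/master_is_stable?pretty"
----
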
@@ -244,141 +248,17 @@ a possible cause for this kind of instability. Log messages containing
 
 If the last check failed with an exception then the exception is reported, and
 typically indicates the problem that needs to be addressed. If any of the
-checks timed out, it may be necessary to understand the detailed sequence of
-steps involved in a successful check. Here is an example of such a sequence:
+checks timed out then narrow down the problem as follows.
 
-. The master's `FollowerChecker`, running on thread
-`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send
-the check request message to a follower node.
+include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
 
-. The master's `TransportService` running on thread
-`elasticsearch[master][transport_worker][T#2]` passes the check request message
-onto the operating system.
+include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection]
 
-. The operating system on the master converts the message into one or more
-packets and sends them out over the network.
+include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
 
-. Miscellaneous routers, firewalls, and other devices between the master node
-and the follower node forward the packets, possibly fragmenting or
-defragmenting them on the way.
-
-. The operating system on the follower node receives the packets and notifies
-{es} that they've been received.
-
-. The follower's `TransportService`, running on thread
-`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets.
-It then reconstructs and processes the check request. Usually, the check
-quickly succeeds. If so, the same thread immediately constructs a response and
-passes it back to the operating system.
-
-. If the check doesn't immediately succeed (for example, an election started
-recently) then:
-
-.. The follower's `FollowerChecker`, running on thread
-`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It
-constructs a response and tells the `TransportService` to send the response
-back to the master.
-
-.. The follower's `TransportService`, running on thread
-`elasticsearch[follower][transport_worker][T#3]`, passes the response to the
-operating system.
-
-. The operating system on the follower converts the response into one or more
-packets and sends them out over the network.
-
-. Miscellaneous routers, firewalls, and other devices between master and
-follower forward the packets, possibly fragmenting or defragmenting them on the
-way.
-
-. The operating system on the master receives the packets and notifies {es}
-that they've been received.
-
-. The master's `TransportService`, running on thread
-`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets,
-reconstructs the check response, and processes it as long as the check didn't
-already time out.
-
-There are a lot of different things that can delay the completion of a check
-and cause it to time out. Here are some examples for each step:
-
-. There may be a long garbage collection (GC) or virtual machine (VM) pause
-after passing the check request to the `TransportService`.
-
-. There may be a long wait for the specific `transport_worker` thread to become
-available, or there may be a long GC or VM pause before passing the check
-request onto the operating system.
-
-. A system fault (for example, a broken network card) on the master may delay
-sending the message over the network, possibly indefinitely.
-
-. Intermediate devices may delay, drop, or corrupt packets along the way. The
-operating system for the master will wait and retransmit any unacknowledged or
-corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend
-<<system-config-tcpretries,reducing this value>> since the default represents a
-very long delay.
-
-. A system fault (for example, a broken network card) on the follower may delay
-receiving the message from the network.
-
-. There may be a long wait for the specific `transport_worker` thread to become
-available, or there may be a long GC or VM pause during the processing of the
-request on the follower.
-
-. There may be a long wait for the `cluster_coordination` thread to become
-available, or for the specific `transport_worker` thread to become available
-again. There may also be a long GC or VM pause during the processing of the
-request.
-
-. A system fault (for example, a broken network card) on the follower may delay
-sending the response from the network.
-
-. Intermediate devices may delay, drop, or corrupt packets along the way again,
-causing retransmissions.
-
-. A system fault (for example, a broken network card) on the master may delay
-receiving the message from the network.
-
-. There may be a long wait for the specific `transport_worker` thread to become
-available to process the response, or a long GC or VM pause.
-
-To determine why follower checks are timing out, we can narrow down the reason
-for the delay as follows:
-
-* GC pauses are recorded in the GC logs that {es} emits by default, and also
-usually by the `JvmMonitorService` in the main node logs. Use these logs to
-confirm whether or not GC is resulting in delays.
-
-* VM pauses also affect other processes on the same host. A VM pause also
-typically causes a discontinuity in the system clock, which {es} will report in
-its logs.
-
-* Packet captures will reveal system-level and network-level faults, especially
-if you capture the network traffic simultaneously at the elected master and the
-faulty node. The connection used for follower checks is not used for any other
-traffic so it can be easily identified from the flow pattern alone, even if TLS
-is in use: almost exactly every second there will be a few hundred bytes sent
-each way, first the request by the master and then the response by the
-follower. You should be able to observe any retransmissions, packet loss, or
-other delays on such a connection.
-
-* Long waits for particular threads to be available can be identified by taking
-stack dumps (for example, using `jstack`) or a profiling trace (for example,
-using Java Flight Recorder) in the few seconds leading up to a node departure.
-+
 By default the follower checks will time out after 30s, so if node departures
 are unpredictable then capture stack dumps every 15s to be sure that at least
 one stack dump was taken at the right time.
-+
-The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
-bear in mind that this API also requires a number of `transport_worker` and
-`generic` threads across all the nodes in the cluster. The API may be affected
-by the very problem you're trying to diagnose. `jstack` is much more reliable
-since it doesn't require any JVM threads.
-+
-The threads involved in the follower checks are `transport_worker` and
-`cluster_coordination` threads, for which there should never be a long wait.
-There may also be evidence of long waits for threads in the {es} logs. See
-<<modules-network-threading-model>> for more information.
 
 ===== Diagnosing `ShardLockObtainFailedException` failures
 

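The retained guidance above asks for a stack dump every 15 seconds because follower checks time out after 30 seconds by default. A rough sketch of such a capture loop, assuming `jstack` from the bundled JDK is on the `PATH`, the loop runs as the same user as {es}, and the `pgrep` pattern matches your install (otherwise supply the node's PID directly):

[source,sh]
----
# Take a stack dump every 15s for ten minutes so that at least one dump
# falls inside any 30s follower-check window that times out.
ES_PID="$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)"
for i in $(seq 1 40); do
  jstack "$ES_PID" > "jstack-$(date +%Y%m%dT%H%M%S).txt"
  sleep 15
done
----
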
Second changed file:

@@ -39,18 +39,19 @@ nodes will repeatedly log messages about the problem using a logger called
 `org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By
 default, this happens every 10 seconds.
 
-Master elections only involve master-eligible nodes, so focus on the logs from
-master-eligible nodes in this situation. These nodes' logs will indicate the
-requirements for a master election, such as the discovery of a certain set of
-nodes.
+Master elections only involve master-eligible nodes, so focus your attention on
+the master-eligible nodes in this situation. These nodes' logs will indicate
+the requirements for a master election, such as the discovery of a certain set
+of nodes. The <<health-api>> API on these nodes will also provide useful
+information about the situation.
 
-If the logs indicate that {es} can't discover enough nodes to form a quorum,
-you must address the reasons preventing {es} from discovering the missing
-nodes. The missing nodes are needed to reconstruct the cluster metadata.
-Without the cluster metadata, the data in your cluster is meaningless. The
-cluster metadata is stored on a subset of the master-eligible nodes in the
-cluster. If a quorum can't be discovered, the missing nodes were the ones
-holding the cluster metadata.
+If the logs or the health report indicate that {es} can't discover enough nodes
+to form a quorum, you must address the reasons preventing {es} from discovering
+the missing nodes. The missing nodes are needed to reconstruct the cluster
+metadata. Without the cluster metadata, the data in your cluster is
+meaningless. The cluster metadata is stored on a subset of the master-eligible
+nodes in the cluster. If a quorum can't be discovered, the missing nodes were
+the ones holding the cluster metadata.
 
 Ensure there are enough nodes running to form a quorum and that every node can
 communicate with every other node over the network. {es} will report additional

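When working through the section above it can help to pull every cluster-formation warning out of a master-eligible node's log in one go. A minimal sketch, assuming a package installation writing to the default log path with the default cluster name (adjust both to your environment):

[source,sh]
----
# Show the most recent cluster-formation failure messages on this node.
grep "ClusterFormationFailureHelper" /var/log/elasticsearch/elasticsearch.log | tail -n 20
----
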
@@ -59,10 +60,20 @@ than a few minutes. If you can't start enough nodes to form a quorum, start a
 new cluster and restore data from a recent snapshot. Refer to
 <<modules-discovery-quorums>> for more information.
 
-If the logs indicate that {es} _has_ discovered a possible quorum of nodes, the
-typical reason that the cluster can't elect a master is that one of the other
-nodes can't discover a quorum. Inspect the logs on the other master-eligible
-nodes and ensure that they have all discovered enough nodes to form a quorum.
+If the logs or the health report indicate that {es} _has_ discovered a possible
+quorum of nodes, the typical reason that the cluster can't elect a master is
+that one of the other nodes can't discover a quorum. Inspect the logs on the
+other master-eligible nodes and ensure that they have all discovered enough
+nodes to form a quorum.
+
+If the logs suggest that discovery or master elections are failing due to
+timeouts or network-related issues then narrow down the problem as follows.
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
 
 [discrete]
 [[discovery-master-unstable]]

@@ -72,7 +83,14 @@ When a node wins the master election, it logs a message containing
 `elected-as-master`. If this happens repeatedly, the elected master node is
 unstable. In this situation, focus on the logs from the master-eligible nodes
 to understand why the election winner stops being the master and triggers
-another election.
+another election. If the logs suggest that the master is unstable due to
+timeouts or network-related issues then narrow down the problem as follows.
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
 
 [discrete]
 [[discovery-cannot-join-master]]

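To confirm that elections really are repeating, it can be useful to count the `elected-as-master` messages per minute. A sketch under the same assumptions about log path and cluster name, and additionally assuming the default log timestamp layout:

[source,sh]
----
# Count election wins per minute; a rapid series of lines here points to an
# unstable master rather than a one-off election.
grep "elected-as-master" /var/log/elasticsearch/elasticsearch.log \
  | awk '{ print substr($0, 2, 16) }' | uniq -c
----
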
@@ -80,8 +98,18 @@ another election.
 
 If there is a stable elected master but a node can't discover or join its
 cluster, it will repeatedly log messages about the problem using the
-`ClusterFormationFailureHelper` logger. Other log messages on the affected node
-and the elected master may provide additional information about the problem.
+`ClusterFormationFailureHelper` logger. The <<health-api>> API on the affected
+node will also provide useful information about the situation. Other log
+messages on the affected node and the elected master may provide additional
+information about the problem. If the logs suggest that the node cannot
+discover or join the cluster due to timeouts or network-related issues then
+narrow down the problem as follows.
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections]
+
+include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads]
 
 [discrete]
 [[discovery-node-leaves]]

@@ -89,4 +117,4 @@ and the elected master may provide additional information about the problem.
 
 If a node joins the cluster but {es} determines it to be faulty then it will be
 removed from the cluster again. See <<cluster-fault-detection-troubleshooting>>
 for more information.

Third changed file (new): docs/reference/troubleshooting/network-timeouts.asciidoc (45 lines)

@@ -0,0 +1,45 @@
+tag::troubleshooting-network-timeouts-gc-vm[]
+* GC pauses are recorded in the GC logs that {es} emits by default, and also
+usually by the `JvmMonitorService` in the main node logs. Use these logs to
+confirm whether or not GC is resulting in delays.
+
+* VM pauses also affect other processes on the same host. A VM pause also
+typically causes a discontinuity in the system clock, which {es} will report in
+its logs.
+end::troubleshooting-network-timeouts-gc-vm[]
+
+tag::troubleshooting-network-timeouts-packet-capture-elections[]
+* Packet captures will reveal system-level and network-level faults, especially
+if you capture the network traffic simultaneously at all relevant nodes. You
+should be able to observe any retransmissions, packet loss, or other delays on
+the connections between the nodes.
+end::troubleshooting-network-timeouts-packet-capture-elections[]
+
+tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
+* Packet captures will reveal system-level and network-level faults, especially
+if you capture the network traffic simultaneously at the elected master and the
+faulty node. The connection used for follower checks is not used for any other
+traffic so it can be easily identified from the flow pattern alone, even if TLS
+is in use: almost exactly every second there will be a few hundred bytes sent
+each way, first the request by the master and then the response by the
+follower. You should be able to observe any retransmissions, packet loss, or
+other delays on such a connection.
+end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
+
+tag::troubleshooting-network-timeouts-threads[]
+* Long waits for particular threads to be available can be identified by taking
+stack dumps (for example, using `jstack`) or a profiling trace (for example,
+using Java Flight Recorder) in the few seconds leading up to the relevant log
+message.
++
+The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
+bear in mind that this API also requires a number of `transport_worker` and
+`generic` threads across all the nodes in the cluster. The API may be affected
+by the very problem you're trying to diagnose. `jstack` is much more reliable
+since it doesn't require any JVM threads.
++
+The threads involved in discovery and cluster membership are mainly
+`transport_worker` and `cluster_coordination` threads, for which there should
+never be a long wait. There may also be evidence of long waits for threads in
+the {es} logs. See <<modules-network-threading-model>> for more information.
+end::troubleshooting-network-timeouts-threads[]

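The packet-capture snippets above deliberately avoid naming a tool. As one possible approach, `tcpdump` run simultaneously on the elected master and on the affected node can record the transport traffic for later comparison, assuming the default transport port of 9300 and with `10.0.0.5` standing in for the other node's address:

[source,sh]
----
# Record transport traffic to and from the other node; run a capture like
# this on both ends at the same time, then compare the two pcap files
# (for example in Wireshark) for drops, retransmissions and delays.
tcpdump -i any -w /tmp/es-transport.pcap 'tcp port 9300 and host 10.0.0.5'
----
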
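The threads snippet above prefers `jstack` but also mentions the nodes hot threads API as a secondary source of information. A minimal way to call it, again assuming a node reachable on the default HTTP port and suitable credentials:

[source,sh]
----
# Dump the busiest threads on every node; if the cluster is too unhealthy to
# answer this request, fall back to jstack as recommended above.
curl -s -u elastic "http://localhost:9200/_nodes/hot_threads?threads=9999"
----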