diff --git a/docs/reference/modules/discovery/fault-detection.asciidoc b/docs/reference/modules/discovery/fault-detection.asciidoc index 32dfc601c330..001763430cf4 100644 --- a/docs/reference/modules/discovery/fault-detection.asciidoc +++ b/docs/reference/modules/discovery/fault-detection.asciidoc @@ -26,7 +26,8 @@ its data path is unhealthy then it is removed from the cluster until the data path recovers. You can control this behavior with the <>. -[[cluster-fault-detection-cluster-state-publishing]] The elected master node +[[cluster-fault-detection-cluster-state-publishing]] +The elected master node will also remove nodes from the cluster if nodes are unable to apply an updated cluster state within a reasonable time. The timeout defaults to 2 minutes starting from the beginning of the cluster state update. Refer to @@ -120,6 +121,9 @@ When it rejoins, the `NodeJoinExecutor` will log that it processed a is unexpectedly restarting, look at the node's logs to see why it is shutting down. +The <> API on the affected node will also provide some useful +information about the situation. + If the node did not restart then you should look at the reason for its departure more closely. Each reason has different troubleshooting steps, described below. There are three possible reasons: @@ -244,141 +248,17 @@ a possible cause for this kind of instability. Log messages containing If the last check failed with an exception then the exception is reported, and typically indicates the problem that needs to be addressed. If any of the -checks timed out, it may be necessary to understand the detailed sequence of -steps involved in a successful check. Here is an example of such a sequence: +checks timed out then narrow down the problem as follows. -. The master's `FollowerChecker`, running on thread -`elasticsearch[master][scheduler][T#1]`, tells the `TransportService` to send -the check request message to a follower node. +include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm] -. The master's `TransportService` running on thread -`elasticsearch[master][transport_worker][T#2]` passes the check request message -onto the operating system. +include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-fault-detection] -. The operating system on the master converts the message into one or more -packets and sends them out over the network. +include::../../troubleshooting/network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads] -. Miscellaneous routers, firewalls, and other devices between the master node -and the follower node forward the packets, possibly fragmenting or -defragmenting them on the way. - -. The operating system on the follower node receives the packets and notifies -{es} that they've been received. - -. The follower's `TransportService`, running on thread -`elasticsearch[follower][transport_worker][T#3]`, reads the incoming packets. -It then reconstructs and processes the check request. Usually, the check -quickly succeeds. If so, the same thread immediately constructs a response and -passes it back to the operating system. - -. If the check doesn't immediately succeed (for example, an election started -recently) then: - -.. The follower's `FollowerChecker`, running on thread -`elasticsearch[follower][cluster_coordination][T#4]`, processes the request. It -constructs a response and tells the `TransportService` to send the response -back to the master. - -.. 
The follower's `TransportService`, running on thread -`elasticsearch[follower][transport_worker][T#3]`, passes the response to the -operating system. - -. The operating system on the follower converts the response into one or more -packets and sends them out over the network. - -. Miscellaneous routers, firewalls, and other devices between master and -follower forward the packets, possibly fragmenting or defragmenting them on the -way. - -. The operating system on the master receives the packets and notifies {es} -that they've been received. - -. The master's `TransportService`, running on thread -`elasticsearch[master][transport_worker][T#2]`, reads the incoming packets, -reconstructs the check response, and processes it as long as the check didn't -already time out. - -There are a lot of different things that can delay the completion of a check -and cause it to time out. Here are some examples for each step: - -. There may be a long garbage collection (GC) or virtual machine (VM) pause -after passing the check request to the `TransportService`. - -. There may be a long wait for the specific `transport_worker` thread to become -available, or there may be a long GC or VM pause before passing the check -request onto the operating system. - -. A system fault (for example, a broken network card) on the master may delay -sending the message over the network, possibly indefinitely. - -. Intermediate devices may delay, drop, or corrupt packets along the way. The -operating system for the master will wait and retransmit any unacknowledged or -corrupted packets up to `net.ipv4.tcp_retries2` times. We recommend -<> since the default represents a -very long delay. - -. A system fault (for example, a broken network card) on the follower may delay -receiving the message from the network. - -. There may be a long wait for the specific `transport_worker` thread to become -available, or there may be a long GC or VM pause during the processing of the -request on the follower. - -. There may be a long wait for the `cluster_coordination` thread to become -available, or for the specific `transport_worker` thread to become available -again. There may also be a long GC or VM pause during the processing of the -request. - -. A system fault (for example, a broken network card) on the follower may delay -sending the response from the network. - -. Intermediate devices may delay, drop, or corrupt packets along the way again, -causing retransmissions. - -. A system fault (for example, a broken network card) on the master may delay -receiving the message from the network. - -. There may be a long wait for the specific `transport_worker` thread to become -available to process the response, or a long GC or VM pause. - -To determine why follower checks are timing out, we can narrow down the reason -for the delay as follows: - -* GC pauses are recorded in the GC logs that {es} emits by default, and also -usually by the `JvmMonitorService` in the main node logs. Use these logs to -confirm whether or not GC is resulting in delays. - -* VM pauses also affect other processes on the same host. A VM pause also -typically causes a discontinuity in the system clock, which {es} will report in -its logs. - -* Packet captures will reveal system-level and network-level faults, especially -if you capture the network traffic simultaneously at the elected master and the -faulty node. 
The connection used for follower checks is not used for any other -traffic so it can be easily identified from the flow pattern alone, even if TLS -is in use: almost exactly every second there will be a few hundred bytes sent -each way, first the request by the master and then the response by the -follower. You should be able to observe any retransmissions, packet loss, or -other delays on such a connection. - -* Long waits for particular threads to be available can be identified by taking -stack dumps (for example, using `jstack`) or a profiling trace (for example, -using Java Flight Recorder) in the few seconds leading up to a node departure. -+ By default the follower checks will time out after 30s, so if node departures are unpredictable then capture stack dumps every 15s to be sure that at least one stack dump was taken at the right time. -+ -The <> API sometimes yields useful information, but -bear in mind that this API also requires a number of `transport_worker` and -`generic` threads across all the nodes in the cluster. The API may be affected -by the very problem you're trying to diagnose. `jstack` is much more reliable -since it doesn't require any JVM threads. -+ -The threads involved in the follower checks are `transport_worker` and -`cluster_coordination` threads, for which there should never be a long wait. -There may also be evidence of long waits for threads in the {es} logs. See -<> for more information. ===== Diagnosing `ShardLockObtainFailedException` failures diff --git a/docs/reference/troubleshooting/discovery-issues.asciidoc b/docs/reference/troubleshooting/discovery-issues.asciidoc index 1220de696b17..53c54e264c9e 100644 --- a/docs/reference/troubleshooting/discovery-issues.asciidoc +++ b/docs/reference/troubleshooting/discovery-issues.asciidoc @@ -39,18 +39,19 @@ nodes will repeatedly log messages about the problem using a logger called `org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper`. By default, this happens every 10 seconds. -Master elections only involve master-eligible nodes, so focus on the logs from -master-eligible nodes in this situation. These nodes' logs will indicate the -requirements for a master election, such as the discovery of a certain set of -nodes. +Master elections only involve master-eligible nodes, so focus your attention on +the master-eligible nodes in this situation. These nodes' logs will indicate +the requirements for a master election, such as the discovery of a certain set +of nodes. The <> API on these nodes will also provide useful +information about the situation. -If the logs indicate that {es} can't discover enough nodes to form a quorum, -you must address the reasons preventing {es} from discovering the missing -nodes. The missing nodes are needed to reconstruct the cluster metadata. -Without the cluster metadata, the data in your cluster is meaningless. The -cluster metadata is stored on a subset of the master-eligible nodes in the -cluster. If a quorum can't be discovered, the missing nodes were the ones -holding the cluster metadata. +If the logs or the health report indicate that {es} can't discover enough nodes +to form a quorum, you must address the reasons preventing {es} from discovering +the missing nodes. The missing nodes are needed to reconstruct the cluster +metadata. Without the cluster metadata, the data in your cluster is +meaningless. The cluster metadata is stored on a subset of the master-eligible +nodes in the cluster. 
If a quorum can't be discovered, the missing nodes were +the ones holding the cluster metadata. Ensure there are enough nodes running to form a quorum and that every node can communicate with every other node over the network. {es} will report additional @@ -59,10 +60,20 @@ than a few minutes. If you can't start enough nodes to form a quorum, start a new cluster and restore data from a recent snapshot. Refer to <> for more information. -If the logs indicate that {es} _has_ discovered a possible quorum of nodes, the -typical reason that the cluster can't elect a master is that one of the other -nodes can't discover a quorum. Inspect the logs on the other master-eligible -nodes and ensure that they have all discovered enough nodes to form a quorum. +If the logs or the health report indicate that {es} _has_ discovered a possible +quorum of nodes, the typical reason that the cluster can't elect a master is +that one of the other nodes can't discover a quorum. Inspect the logs on the +other master-eligible nodes and ensure that they have all discovered enough +nodes to form a quorum. + +If the logs suggest that discovery or master elections are failing due to +timeouts or network-related issues then narrow down the problem as follows. + +include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm] + +include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections] + +include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads] [discrete] [[discovery-master-unstable]] @@ -72,7 +83,14 @@ When a node wins the master election, it logs a message containing `elected-as-master`. If this happens repeatedly, the elected master node is unstable. In this situation, focus on the logs from the master-eligible nodes to understand why the election winner stops being the master and triggers -another election. +another election. If the logs suggest that the master is unstable due to +timeouts or network-related issues then narrow down the problem as follows. + +include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm] + +include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections] + +include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads] [discrete] [[discovery-cannot-join-master]] @@ -80,8 +98,18 @@ another election. If there is a stable elected master but a node can't discover or join its cluster, it will repeatedly log messages about the problem using the -`ClusterFormationFailureHelper` logger. Other log messages on the affected node -and the elected master may provide additional information about the problem. +`ClusterFormationFailureHelper` logger. The <> API on the affected +node will also provide useful information about the situation. Other log +messages on the affected node and the elected master may provide additional +information about the problem. If the logs suggest that the node cannot +discover or join the cluster due to timeouts or network-related issues then +narrow down the problem as follows. + +include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-gc-vm] + +include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-packet-capture-elections] + +include::network-timeouts.asciidoc[tag=troubleshooting-network-timeouts-threads] [discrete] [[discovery-node-leaves]] @@ -89,4 +117,4 @@ and the elected master may provide additional information about the problem. 
If a node joins the cluster but {es} determines it to be faulty then it will be removed from the cluster again. See <> -for more information. \ No newline at end of file +for more information. diff --git a/docs/reference/troubleshooting/network-timeouts.asciidoc b/docs/reference/troubleshooting/network-timeouts.asciidoc new file mode 100644 index 000000000000..df961c83a292 --- /dev/null +++ b/docs/reference/troubleshooting/network-timeouts.asciidoc @@ -0,0 +1,45 @@ +tag::troubleshooting-network-timeouts-gc-vm[] +* GC pauses are recorded in the GC logs that {es} emits by default, and also +usually by the `JvmMonitorService` in the main node logs. Use these logs to +confirm whether or not GC is resulting in delays. + +* VM pauses also affect other processes on the same host. A VM pause also +typically causes a discontinuity in the system clock, which {es} will report in +its logs. +end::troubleshooting-network-timeouts-gc-vm[] + +tag::troubleshooting-network-timeouts-packet-capture-elections[] +* Packet captures will reveal system-level and network-level faults, especially +if you capture the network traffic simultaneously at all relevant nodes. You +should be able to observe any retransmissions, packet loss, or other delays on +the connections between the nodes. +end::troubleshooting-network-timeouts-packet-capture-elections[] + +tag::troubleshooting-network-timeouts-packet-capture-fault-detection[] +* Packet captures will reveal system-level and network-level faults, especially +if you capture the network traffic simultaneously at the elected master and the +faulty node. The connection used for follower checks is not used for any other +traffic so it can be easily identified from the flow pattern alone, even if TLS +is in use: almost exactly every second there will be a few hundred bytes sent +each way, first the request by the master and then the response by the +follower. You should be able to observe any retransmissions, packet loss, or +other delays on such a connection. +end::troubleshooting-network-timeouts-packet-capture-fault-detection[] + +tag::troubleshooting-network-timeouts-threads[] +* Long waits for particular threads to be available can be identified by taking +stack dumps (for example, using `jstack`) or a profiling trace (for example, +using Java Flight Recorder) in the few seconds leading up to the relevant log +message. ++ +The <> API sometimes yields useful information, but +bear in mind that this API also requires a number of `transport_worker` and +`generic` threads across all the nodes in the cluster. The API may be affected +by the very problem you're trying to diagnose. `jstack` is much more reliable +since it doesn't require any JVM threads. ++ +The threads involved in discovery and cluster membership are mainly +`transport_worker` and `cluster_coordination` threads, for which there should +never be a long wait. There may also be evidence of long waits for threads in +the {es} logs. See <> for more information. +end::troubleshooting-network-timeouts-threads[]
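+
+To illustrate the packet-capture guidance above, the following is a minimal
+sketch of one way to take such a capture with `tcpdump`, assuming a Linux host
+and the default transport port of `9300`; the interface, port, and output path
+are illustrative and may need adjusting for your environment.
+
+[source,sh]
+----
+# Capture transport traffic on all interfaces into a file for later analysis,
+# for example with Wireshark. Usually requires root privileges. Run this
+# simultaneously on the elected master and on the faulty node, then compare
+# the two captures to see where packets are delayed, lost, or retransmitted.
+tcpdump -i any -w /tmp/es-transport.pcap tcp port 9300
+----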
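+
+Similarly, for the stack-dump guidance above, the following sketch captures a
+stack dump with `jstack` every 15 seconds, assuming a POSIX shell on the node
+and that you know the process ID of the {es} JVM; the `ES_PID` value shown is
+a placeholder, not something provided by {es}.
+
+[source,sh]
+----
+# Take a stack dump of the Elasticsearch JVM every 15 seconds so that at least
+# one dump falls within the default 30-second check timeout window.
+ES_PID=12345   # placeholder: substitute the real process ID of the JVM
+while true; do
+  jstack "$ES_PID" > "jstack-$(date +%Y%m%dT%H%M%S).txt"
+  sleep 15
+done
+----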