tag::troubleshooting-network-timeouts-gc-vm[]
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether or not the node is experiencing high heap usage with long GC
pauses. If so, <<high-jvm-memory-pressure,the troubleshooting guide for high
heap usage>> has some suggestions for further investigation but typically you
will need to capture a heap dump and the <<gc-logging,garbage collector logs>>
during a time of high heap usage to fully understand the problem, as shown in
the example below.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs. If you see evidence of other processes pausing at the same time, or
unexpected clock discontinuities, investigate the infrastructure on which you
are running {es}.
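+
For the GC-pause case described above, a minimal sketch of the data to gather
(assuming a package-based installation with its default log locations and that
`$ES_PID` holds the process ID of the {es} JVM; adjust the paths to match your
deployment) might look like this:
+
[source,sh]
----
# Review the GC logs that Elasticsearch writes by default
less /var/log/elasticsearch/gc.log

# Capture a heap dump of the node while heap usage is high; the :live option
# triggers a full GC first so the dump only contains reachable objects
jmap -dump:live,format=b,file=/tmp/es-heap.hprof $ES_PID
----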
end::troubleshooting-network-timeouts-gc-vm[]
tag::troubleshooting-network-timeouts-packet-capture-elections[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at all relevant nodes and
analyse it alongside the {es} logs from those nodes. You should be able to
observe any retransmissions, packet loss, or other delays on the connections
between the nodes.
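+
For example, you might record the transport traffic on each node with
`tcpdump` (a sketch assuming the default transport port of `9300`; adjust the
port and interface to match your deployment) and then line the capture up
against the {es} logs from the same nodes:
+
[source,sh]
----
# Capture Elasticsearch transport traffic on all interfaces into a pcap file
tcpdump -i any -w /tmp/es-transport.pcap port 9300
----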
end::troubleshooting-network-timeouts-packet-capture-elections[]
tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node and analyse it alongside the {es} logs from those nodes. The
connection used for follower checks is not used for any other traffic so it can
be easily identified from the flow pattern alone, even if TLS is in use: almost
exactly every second there will be a few hundred bytes sent each way, first the
request by the master and then the response by the follower. You should be able
to observe any retransmissions, packet loss, or other delays on such a
connection.
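+
To pick the follower-check connection out of a capture, a per-connection
summary is often enough. A sketch using `tshark` (assuming the transport
traffic was captured to `/tmp/es-transport.pcap`, for example with `tcpdump`;
Wireshark offers the same statistics interactively):
+
[source,sh]
----
# Summarise TCP conversations: the follower-check connection stands out as a
# long-lived flow of many small frames, roughly one request and response per
# second of its duration
tshark -r /tmp/es-transport.pcap -q -z conv,tcp
----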
end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
tag::troubleshooting-network-timeouts-threads[]
* Long waits for particular threads to be available can be identified by taking
stack dumps of the main {es} process (for example, using `jstack`) or a
profiling trace (for example, using Java Flight Recorder) in the few seconds
leading up to the relevant log message.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster. The API may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in discovery and cluster membership are mainly
`transport_worker` and `cluster_coordination` threads, for which there should
never be a long wait. The {es} logs may also contain evidence of long waits
for threads, particularly in warning logs from
`org.elasticsearch.transport.InboundHandler`. See
<<modules-network-threading-model>> for more information.
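+
A sketch of the commands involved (assuming `$ES_PID` holds the process ID of
the {es} JVM, the node answers HTTP on `localhost:9200`, and a package-based
installation with default log locations; add authentication and TLS options as
needed):
+
[source,sh]
----
# Take a stack dump of the main Elasticsearch process
jstack $ES_PID > /tmp/es-stack-dump.txt

# The hot threads API needs transport_worker and generic threads on every
# node, so it may be affected by the very problem under investigation
curl -s 'localhost:9200/_nodes/hot_threads'

# Look for warnings about slow handling of inbound transport messages
grep InboundHandler /var/log/elasticsearch/*.log
----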
end::troubleshooting-network-timeouts-threads[]