elasticsearch/docs/reference/troubleshooting/network-timeouts.asciidoc

tag::troubleshooting-network-timeouts-gc-vm[]
* GC pauses are recorded in the GC logs that {es} emits by default, and also
usually by the `JvmMonitorService` in the main node logs. Use these logs to
confirm whether the node is experiencing high heap usage with long GC pauses.
If so, <<high-jvm-memory-pressure,the troubleshooting guide for high heap
usage>> has some suggestions for further investigation, but typically you will
need to capture a heap dump and the <<gc-logging,garbage collector logs>>
during a time of high heap usage to fully understand the problem. A sketch of
the relevant commands follows this list.
* VM pauses also affect other processes on the same host. A VM pause also
typically causes a discontinuity in the system clock, which {es} will report in
its logs. If you see evidence of other processes pausing at the same time, or
unexpected clock discontinuities, investigate the infrastructure on which you
are running {es}.
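
To capture a heap dump and the matching garbage collector logs during a period
of high heap usage, you can use commands along the following lines. This is a
minimal sketch: the process lookup, GC log location, and output paths are
assumptions to adapt to your installation.

[source,sh]
----
# Find the PID of the main Elasticsearch process.
ES_PID=$(jps -l | awk '/org.elasticsearch.bootstrap.Elasticsearch/ {print $1}')

# Capture a heap dump while heap usage is high.
jcmd "$ES_PID" GC.heap_dump /tmp/es-heap.hprof

# Collect the GC logs that Elasticsearch writes by default (the default
# package location is assumed here; adjust for your installation).
cp /var/log/elasticsearch/gc.log* /tmp/
----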
end::troubleshooting-network-timeouts-gc-vm[]
tag::troubleshooting-network-timeouts-packet-capture-elections[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at all relevant nodes. You
should be able to observe any retransmissions, packet loss, or other delays on
the connections between the nodes.
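+
For example, you might run a capture like the following simultaneously on each
relevant node. This is a sketch only: the interface, output path, and the
default transport port `9300` are assumptions to adjust for your environment.
+
[source,sh]
----
# Capture inter-node transport traffic (default transport port 9300).
# Run at the same time on every node involved in the election problems.
tcpdump -i any -w /tmp/es-transport-$(hostname).pcap port 9300
----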
end::troubleshooting-network-timeouts-packet-capture-elections[]
tag::troubleshooting-network-timeouts-packet-capture-fault-detection[]
* Packet captures will reveal system-level and network-level faults, especially
if you capture the network traffic simultaneously at the elected master and the
faulty node. The connection used for follower checks is not used for any other
traffic so it can be easily identified from the flow pattern alone, even if TLS
is in use: almost exactly every second there will be a few hundred bytes sent
each way, first the request by the master and then the response by the
follower. You should be able to observe any retransmissions, packet loss, or
other delays on such a connection.
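+
As an illustration, a capture like the following on the elected master or the
faulty node narrows the trace down to the traffic between the two nodes. The
IP address `10.0.0.5` and port `9300` are placeholders: substitute the other
node's address and your transport port.
+
[source,sh]
----
# Capture only the traffic exchanged with the other node under investigation.
# 10.0.0.5 is a placeholder for the other node's IP address.
tcpdump -i any -w /tmp/es-follower-checks.pcap 'port 9300 and host 10.0.0.5'
----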
end::troubleshooting-network-timeouts-packet-capture-fault-detection[]
tag::troubleshooting-network-timeouts-threads[]
* Long waits for particular threads to be available can be identified by taking
stack dumps of the main {es} process (for example, using `jstack`) or a
profiling trace (for example, using Java Flight Recorder) in the few seconds
leading up to the relevant log message.
+
The <<cluster-nodes-hot-threads>> API sometimes yields useful information, but
bear in mind that this API also requires a number of `transport_worker` and
`generic` threads across all the nodes in the cluster, so it may be affected
by the very problem you're trying to diagnose. `jstack` is much more reliable
since it doesn't require any JVM threads.
+
The threads involved in discovery and cluster membership are mainly
`transport_worker` and `cluster_coordination` threads, for which there should
never be a long wait. There may also be evidence of long waits for threads in
the {es} logs, particularly in warning logs from
`org.elasticsearch.transport.InboundHandler`. See
<<modules-network-threading-model>> for more information.
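+
For example, you could take a handful of stack dumps a few seconds apart, or
start a short flight recording, with commands along these lines. The process
lookup, interval, and output paths are assumptions to adapt as needed.
+
[source,sh]
----
# Find the PID of the main Elasticsearch process.
ES_PID=$(jps -l | awk '/org.elasticsearch.bootstrap.Elasticsearch/ {print $1}')

# Take several stack dumps a few seconds apart to see which threads are busy
# or blocked in the run-up to the relevant log message.
for i in 1 2 3 4 5; do
  jstack "$ES_PID" > "/tmp/jstack-$i.txt"
  sleep 5
done

# Alternatively, record a short profiling trace with Java Flight Recorder.
jcmd "$ES_PID" JFR.start duration=60s filename=/tmp/es-recording.jfr
----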
end::troubleshooting-network-timeouts-threads[]