mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 07:37:19 -04:00
This commit introduces a watchdog timer to monitor for long-running tasks on network threads. If a network thread is active and has not made progress for two consecutive ticks of the timer then the watchdog logs a warning and a thread dump.
157 lines
8.7 KiB
Text
157 lines
8.7 KiB
Text
[[modules-network-threading-model]]
|
|
==== Networking threading model
|
|
|
|
This section describes the threading model used by the networking subsystem in
|
|
{es}. This information isn't required to use {es}, but it may be useful to
|
|
advanced users who are diagnosing network problems in a cluster.
|
|
|
|
{es} nodes communicate over a collection of TCP channels that together form a
|
|
transport connection. {es} clients communicate with the cluster over HTTP,
|
|
which also uses one or more TCP channels. Each of these TCP channels is owned
|
|
by exactly one of the `transport_worker` threads in the node. This owning
|
|
thread is chosen when the channel is opened and remains the same for the
|
|
lifetime of the channel.
|
|
|
|
Each `transport_worker` thread has sole responsibility for sending and
|
|
receiving data over the channels it owns. Additionally, each http and transport
|
|
server socket is assigned to one of the `transport_worker` threads. That worker
|
|
has the responsibility of accepting new incoming connections to the server
|
|
socket it owns.
|
|
|
|
If a thread in {es} wants to send data over a particular channel, it passes the
|
|
data to the owning `transport_worker` thread for the actual transmission.
|
|
|
|
Normally the `transport_worker` threads will not completely handle the messages
|
|
they receive. Instead, they will do a small amount of preliminary processing
|
|
and then dispatch (hand off) the message to a different
|
|
<<modules-threadpool,threadpool>> for the rest of their handling. For instance,
|
|
bulk messages are dispatched to the `write` threadpool, searches are dispatched
|
|
to one of the `search` threadpools, and requests for statistics and other
|
|
management tasks are mostly dispatched to the `management` threadpool. However
|
|
in some cases the processing of a message is expected to be so quick that {es}
|
|
will do all of the processing on the `transport_worker` thread rather than
|
|
incur the overhead of dispatching it elsewhere.
|
|
|
|
By default, there is one `transport_worker` thread per CPU. In contrast, there
|
|
may sometimes be tens-of-thousands of TCP channels. If data arrives on a TCP
|
|
channel and its owning `transport_worker` thread is busy, the data isn't
|
|
processed until the thread finishes whatever it is doing. Similarly, outgoing
|
|
data are not sent over a channel until the owning `transport_worker` thread is
|
|
free. This means that we require every `transport_worker` thread to be idle
|
|
frequently. An idle `transport_worker` looks something like this in a stack
|
|
dump:
|
|
|
|
[source,text]
|
|
----
|
|
"elasticsearch[instance-0000000004][transport_worker][T#1]" #32 daemon prio=5 os_prio=0 cpu=9645.94ms elapsed=501.63s tid=0x00007fb83b6307f0 nid=0x1c4 runnable [0x00007fb7b8ffe000]
|
|
java.lang.Thread.State: RUNNABLE
|
|
at sun.nio.ch.EPoll.wait(java.base@17.0.2/Native Method)
|
|
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.2/EPollSelectorImpl.java:118)
|
|
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.2/SelectorImpl.java:129)
|
|
- locked <0x00000000c443c518> (a sun.nio.ch.Util$2)
|
|
- locked <0x00000000c38f7700> (a sun.nio.ch.EPollSelectorImpl)
|
|
at sun.nio.ch.SelectorImpl.select(java.base@17.0.2/SelectorImpl.java:146)
|
|
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:813)
|
|
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
|
|
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
|
|
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
|
|
at java.lang.Thread.run(java.base@17.0.2/Thread.java:833)
|
|
----
|
|
|
|
In the <<cluster-nodes-hot-threads>> API an idle `transport_worker` thread is
|
|
reported like this:
|
|
|
|
[source,text]
|
|
----
|
|
0.0% [cpu=0.0%, idle=100.0%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000004][transport_worker][T#1]'
|
|
10/10 snapshots sharing following 9 elements
|
|
java.base@17.0.2/sun.nio.ch.EPoll.wait(Native Method)
|
|
java.base@17.0.2/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:118)
|
|
java.base@17.0.2/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:129)
|
|
java.base@17.0.2/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:146)
|
|
io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:813)
|
|
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
|
|
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
|
|
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
|
|
java.base@17.0.2/java.lang.Thread.run(Thread.java:833)
|
|
----
|
|
|
|
Note that `transport_worker` threads should always be in state `RUNNABLE`, even
|
|
when waiting for input, because they block in the native `EPoll#wait` method. The `idle=`
|
|
time reports the proportion of time the thread spent waiting for input, whereas the `cpu=` time
|
|
reports the proportion of time the thread spent processing input it has received.
|
|
|
|
If a `transport_worker` thread is not frequently idle, it may build up a
|
|
backlog of work. This can cause delays in processing messages on the channels
|
|
that it owns. It's hard to predict exactly which work will be delayed:
|
|
|
|
* There are many more channels than threads. If work related to one channel is
|
|
causing delays to its worker thread, all other channels owned by that thread
|
|
will also suffer delays.
|
|
|
|
* The mapping from TCP channels to worker threads is fixed but arbitrary. Each
|
|
channel is assigned an owning thread in a round-robin fashion when the channel
|
|
is opened. Each worker thread is responsible for many different kinds of
|
|
channel.
|
|
|
|
* There are many channels open between each pair of nodes. For each request,
|
|
{es} will choose from the appropriate channels in a round-robin fashion. Some
|
|
requests may end up on a channel owned by a delayed worker while other
|
|
identical requests will be sent on a channel that's working smoothly.
|
|
|
|
If the backlog builds up too far, some messages may be delayed by many seconds.
|
|
The node might even <<cluster-fault-detection,fail its health checks>> and be
|
|
removed from the cluster. Sometimes, you can find evidence of busy
|
|
`transport_worker` threads using the <<cluster-nodes-hot-threads>> API.
|
|
However, this API itself sends network messages so may not work correctly if
|
|
the `transport_worker` threads are too busy. It is more reliable to use
|
|
`jstack` to obtain stack dumps or use Java Flight Recorder to obtain a
|
|
profiling trace. These tools are independent of any work the JVM is performing.
|
|
|
|
It may also be possible to identify some reasons for delays from the server
|
|
logs. See for instance the following loggers:
|
|
|
|
`org.elasticsearch.transport.InboundHandler`:: This logger reports a warning if
|
|
processing an inbound message occupies a network thread for unreasonably long,
|
|
which is almost certainly a bug. The warning includes some information which
|
|
can be used to identify the message that took unreasonably long to process.
|
|
|
|
`org.elasticsearch.transport.OutboundHandler`:: This logger reports a warning
|
|
if sending an outbound message takes longer than expected. This duration
|
|
includes time spent waiting for network congestion to clear, and time spent
|
|
processing other work on the same network thread, so does not always indicate
|
|
the presence of a bug related to the outbound message specified in the log
|
|
entry.
|
|
|
|
`org.elasticsearch.common.network.ThreadWatchdog`:: This logger reports a
|
|
warning and a thread dump when it notices that a network thread has not made
|
|
progress between two consecutive checks, which is almost certainly a bug:
|
|
+
|
|
--
|
|
[source,text]
|
|
----
|
|
[WARN ][o.e.c.n.ThreadWatchdog ] the following threads are active but did not make progress in the preceding [5s]: [elasticsearch[instance-0000000004][transport_worker][T#1]]]
|
|
[WARN ][o.e.c.n.ThreadWatchdog ] hot threads dump due to active threads not making progress [part 1]: H4sIAAAAAAAA/+1aa2/bOBb93l8hYLUYFWgYvWw5AQbYpEkn6STZbJyiwAwGA1qiY8US6ZJUHvPr90qk/JJky41TtDMuUIci...
|
|
[WARN ][o.e.c.n.ThreadWatchdog ] hot threads dump due to active threads not making progress [part 2]: LfXL/x70a3eL8ve6Ral74ZBrp5x7HmUD9KXQz1MaXUNfFC6SeEysxSw1cNXL9JXYl3AigAE7ywbm/AZ+ll3Ox4qXJHNjVr6h...
|
|
[WARN ][o.e.c.n.ThreadWatchdog ] hot threads dump due to active threads not making progress (gzip compressed, base64-encoded, and split into 2 parts on preceding log lines; ...
|
|
----
|
|
|
|
To reconstruct the thread dump, base64-decode the data and decompress it using `gzip`. For instance, on Unix-like systems:
|
|
|
|
[source,sh]
|
|
----
|
|
cat watchdog.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
|
|
----
|
|
|
|
This mechanism can be controlled with the following settings:
|
|
|
|
`network.thread.watchdog.interval`:::
|
|
(<<static-cluster-setting,Static>>, <<time-units,time value>>)
|
|
Defines the interval between watchdog checks. Defaults to `5s`. Set to `0` to
|
|
disable the network thread watchdog.
|
|
|
|
`network.thread.watchdog.quiet_time`:::
|
|
(<<static-cluster-setting,Static>>, <<time-units,time value>>)
|
|
Defines the interval between watchdog warnings. Defaults to `10m`.
|
|
|
|
--
|