elasticsearch/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc

[[task-queue-backlog]]
=== Backlogged task queue
*******************************
*Product:* Elasticsearch +
*Deployment type:* Elastic Cloud Enterprise, Elastic Cloud Hosted, Elastic Cloud on Kubernetes, Elastic Self-Managed +
*Versions:* All
*******************************

A backlogged task queue can prevent tasks from completing and lead to an
unhealthy cluster state. Contributing factors include resource constraints,
a large number of tasks triggered at once, and long-running tasks.

[discrete]
[[diagnose-task-queue-backlog]]
==== Diagnose a backlogged task queue
To identify the cause of the backlog, try these diagnostic actions.

* <<diagnose-task-queue-thread-pool>>
* <<diagnose-task-queue-hot-thread>>
* <<diagnose-task-queue-long-running-node-tasks>>
* <<diagnose-task-queue-long-running-cluster-tasks>>

[discrete]
[[diagnose-task-queue-thread-pool]]
===== Check the thread pool status
A <<high-cpu-usage,depleted thread pool>> can result in
<<rejected-requests,rejected requests>>.
Use the <<cat-thread-pool,cat thread pool API>> to monitor
active threads, queued tasks, rejections, and completed tasks:

[source,console]
----
GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
----

* Look for high `active` and `queue` metrics, which indicate potential bottlenecks
and opportunities to <<reduce-cpu-usage,reduce CPU usage>>.
* Determine whether thread pool issues are specific to a <<data-tiers,data tier>>.
* Check whether a specific node's thread pool is depleting faster than others. This
might indicate <<resolve-task-queue-backlog-hotspotting,hot spotting>>.
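
To narrow the output, you can limit the request to specific thread pools. This is a
minimal sketch; the pool names `search` and `write` are illustrative:

[source,console]
----
GET /_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed
----
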
[discrete]
[[diagnose-task-queue-hot-thread]]
===== Inspect hot threads on each node
If a particular thread pool queue is backed up, periodically poll the
<<cluster-nodes-hot-threads,nodes hot threads API>> to check whether the
busy threads are making progress and have sufficient resources:

[source,console]
----
GET /_nodes/hot_threads
----

Although the hot threads API response does not list the specific tasks running on a thread,
it provides a summary of the thread's activities. You can correlate a hot threads response
with a <<tasks,task management API response>> to identify any overlap with specific tasks. For
example, if the hot threads response indicates the thread is `performing a search query`, you can
<<diagnose-task-queue-long-running-node-tasks,check for long-running search tasks>> using the task management API.
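
For example, to sample a single busy node in more depth, you can target it by name and
adjust the sampling parameters. This is a sketch; the node name `my-node-1` is illustrative:

[source,console]
----
GET /_nodes/my-node-1/hot_threads?threads=5&interval=1s
----
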
[discrete]
[[diagnose-task-queue-long-running-node-tasks]]
===== Identify long-running node tasks
Long-running tasks can also cause a backlog. Use the <<tasks,task
management API>> to check for excessive `running_time_in_nanos` values:

[source,console]
----
GET /_tasks?pretty=true&human=true&detailed=true
----

You can filter on specific `action` types, such as <<docs-bulk,bulk indexing>> or
search-related tasks, which tend to be long-running.

* Filter on <<docs-bulk,bulk index>> actions:
+
[source,console]
----
GET /_tasks?human&detailed&actions=indices:data/write/bulk
----

* Filter on search actions:
+
[source,console]
----
GET /_tasks?human&detailed&actions=indices:data/read/search
----

Long-running tasks might need to be <<resolve-task-queue-backlog-stuck-tasks,canceled>>.
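
If the thread pool check pointed at specific nodes, you can also scope the task listing to
those nodes with the `nodes` parameter. The node names below are illustrative:

[source,console]
----
GET /_tasks?human&detailed&actions=indices:data/read/search&nodes=node-1,node-2
----
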
[discrete]
[[diagnose-task-queue-long-running-cluster-tasks]]
===== Look for long-running cluster tasks
Use the <<cluster-pending,cluster pending tasks API>> to identify delays
in cluster state synchronization:

[source,console]
----
GET /_cluster/pending_tasks
----

Tasks with a high `time_in_queue` value are likely contributing to the backlog and might
need to be <<resolve-task-queue-backlog-stuck-tasks,canceled>>.
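
An abbreviated, illustrative response looks like the following. The `source` field describes
the queued cluster state update, and `time_in_queue_millis` shows how long it has been waiting:

[source,console-result]
----
{
  "tasks": [
    {
      "insert_order": 101,
      "priority": "URGENT",
      "source": "create-index [my-index], cause [api]",
      "time_in_queue_millis": 86,
      "time_in_queue": "86ms"
    }
  ]
}
----
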
[discrete]
[[resolve-task-queue-backlog]]
==== Recommendations
After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.

[discrete]
[[resolve-task-queue-backlog-resources]]
===== Increase available resources
If tasks are progressing slowly, try <<reduce-cpu-usage,reducing CPU usage>>.
In some cases, you might need to increase the thread pool size. For example, the `force_merge` thread pool defaults to a single thread.
Increasing the size to 2 might help reduce a backlog of force merge requests.
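
Thread pool sizes are static node settings (for example, `thread_pool.force_merge.size` in
`elasticsearch.yml`), so increasing one requires updating the configuration and restarting the
affected nodes. As a quick check before and after the change, you can confirm each node's
configured pool size and current queue depth. This is a minimal sketch using the cat thread
pool API:

[source,console]
----
GET /_cat/thread_pool/force_merge?v&h=node_name,name,size,queue,rejected
----
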
[discrete]
[[resolve-task-queue-backlog-stuck-tasks]]
===== Cancel stuck tasks
If an active task's <<diagnose-task-queue-hot-thread,hot thread>> shows no progress, consider <<task-cancellation,canceling the task>>.
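
For example, to cancel a single task by ID, use the task cancellation API. The task ID shown
here is illustrative; use the `node_id:task_number` value returned by the task management API:

[source,console]
----
POST /_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
----
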
[discrete]
[[resolve-task-queue-backlog-hotspotting]]
===== Address hot spotting
If a specific node's thread pool is depleting faster than others, try addressing
uneven node resource utilization, also known as hot spotting.
For details on actions you can take, such as rebalancing shards, see <<hotspotting>>.
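
As a quick first check, you can compare shard counts and disk usage across nodes; a large
imbalance can indicate hot spotting. This is a minimal sketch using the cat allocation API:

[source,console]
----
GET /_cat/allocation?v&s=shards:desc
----
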
[discrete]
==== Resources
Related symptoms:

* <<high-cpu-usage>>
* <<rejected-requests>>
* <<hotspotting>>

// TODO add link to standard Additional resources when that topic exists