[ResponseOps][TaskManager] followups from resource based scheduling PR (#192124)

Towards https://github.com/elastic/kibana/issues/190095,
https://github.com/elastic/kibana/issues/192183,
https://github.com/elastic/kibana/issues/192185

## Summary

This PR:

- updates the heap-to-capacity converter to account for larger amounts of RAM, raising the upper heap bound to 16GB
- sets the initial `maxAllowedCost` to the default capacity of 10
- adds `xpack.alerting.maxScheduledPerMinute`,
`xpack.discovery.active_nodes_lookback`, `xpack.discovery.interval`
configs to docker
- updates the TM docs for `xpack.task_manager.capacity`

---------

Co-authored-by: Lisa Cawley <lcawley@elastic.co>
Alexi Doak 2024-09-12 11:42:33 -07:00 committed by GitHub
parent 4254adbbec
commit 850cdf0275
10 changed files with 28 additions and 23 deletions


@@ -23,6 +23,7 @@ How often, in milliseconds, the task manager will look for more work. Defaults
How many requests can Task Manager buffer before it rejects new requests. Defaults to 1000.
`xpack.task_manager.max_workers`::
+deprecated:[8.16.0]
The maximum number of tasks that this Kibana instance will run simultaneously. Defaults to 10.
Starting in 8.0, it will not be possible to set the value greater than 100.
@@ -48,6 +49,9 @@ Enables event loop delay monitoring, which will log a warning when a task causes
`xpack.task_manager.event_loop_delay.warn_threshold`::
Sets the amount of event loop delay during a task execution which will cause a warning to be logged. Defaults to 5000 milliseconds (5 seconds).
+`xpack.task_manager.capacity`::
+Controls the number of tasks that can be run at one time. The minimum value is 5 and the maximum is 50. Defaults to 10.
[float]
[[task-manager-health-settings]]
==== Task Manager Health settings
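
Note on the new `xpack.task_manager.capacity` docs: the documented bounds (minimum 5, maximum 50, default 10) can be sketched in TypeScript as below. This is illustrative only; the actual plugin enforces the bounds through its config schema, which may reject out-of-range values rather than clamping them.

```ts
// Illustrative sketch of the documented bounds for
// `xpack.task_manager.capacity`: min 5, max 50, default 10.
const MIN_CAPACITY = 5;
const MAX_CAPACITY = 50;
const DEFAULT_CAPACITY = 10;

function resolveCapacity(configured?: number): number {
  // Fall back to the default when the setting is absent.
  if (configured === undefined) return DEFAULT_CAPACITY;
  // Clamp into the documented range (the real config schema may reject instead).
  return Math.min(MAX_CAPACITY, Math.max(MIN_CAPACITY, configured));
}
```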


@@ -197,7 +197,7 @@ If cluster performance becomes degraded from excessive or expensive rules and {k
[source,txt]
--------------------------------------------------
-xpack.task_manager.max_workers: 1
+xpack.task_manager.capacity: 5
xpack.task_manager.poll_interval: 1h
--------------------------------------------------
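
For context: with `capacity: 5` and a one-hour `poll_interval`, each Kibana instance claims at most 5 tasks per poll, so this configuration throttles background work to roughly 5 task executions per hour per instance.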


@@ -2,6 +2,6 @@ include::production.asciidoc[]
include::security-production-considerations.asciidoc[]
include::alerting-production-considerations.asciidoc[]
include::reporting-production-considerations.asciidoc[]
-include::task-manager-production-considerations.asciidoc[]
+include::task-manager-production-considerations.asciidoc[leveloffset=+1]
include::task-manager-health-monitoring.asciidoc[]
include::task-manager-troubleshooting.asciidoc[]


@@ -1,6 +1,5 @@
[role="xpack"]
[[task-manager-production-considerations]]
-== Task Manager
+= Task Manager
{kib} Task Manager is leveraged by features such as Alerting, Actions, and Reporting to run mission critical work as persistent background tasks.
These background tasks distribute work across multiple {kib} instances.
@@ -21,7 +20,7 @@ If you lose this index, all scheduled alerts and actions are lost.
[float]
[[task-manager-background-tasks]]
-=== Running background tasks
+== Running background tasks
{kib} background tasks are managed as follows:
@@ -47,13 +46,13 @@ For detailed troubleshooting guidance, see <<task-manager-troubleshooting>>.
==============================================
[float]
-=== Deployment considerations
+== Deployment considerations
{es} and {kib} instances use the system clock to determine the current time. To ensure schedules are triggered when expected, synchronize the clocks of all nodes in the cluster using a time service such as http://www.ntp.org/[Network Time Protocol].
[float]
[[task-manager-scaling-guidance]]
-=== Scaling guidance
+== Scaling guidance
How you deploy {kib} largely depends on your use case. Predicting the throughput a deployment might require in order to support Task Management is difficult because features can schedule an unpredictable number of tasks at a variety of scheduled cadences.
@@ -61,7 +60,7 @@ However, there is a relatively straightforward method you can follow to produce
[float]
[[task-manager-default-scaling]]
-==== Default scale
+=== Default scale
By default, {kib} polls for tasks at a rate of 10 tasks every 3 seconds.
This means that you can expect a single {kib} instance to support up to 200 _tasks per minute_ (`200/tpm`).
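The 200/tpm figure is just the per-poll capacity (10 tasks) times the polls per minute (60s / 3s = 20). A rough sketch of that arithmetic (an illustrative helper, not Task Manager code):

```ts
// Rough throughput estimate: tasks per minute a deployment can run, given
// per-instance capacity, the poll interval, and the number of instances.
function estimateTasksPerMinute(
  capacity: number,
  pollIntervalMs: number,
  kibanaInstances: number = 1
): number {
  return capacity * (60_000 / pollIntervalMs) * kibanaInstances;
}

estimateTasksPerMinute(10, 3_000);    // 200 tpm, the documented default
estimateTasksPerMinute(10, 3_000, 2); // 400 tpm when scaling horizontally
```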
@@ -74,24 +73,24 @@ For details on monitoring the health of {kib} Task Manager, follow the guidance
[float]
[[task-manager-scaling-horizontally]]
-==== Scaling horizontally
+=== Scaling horizontally
At times, the sustainable approach might be to expand the throughput of your cluster by provisioning additional {kib} instances.
By default, each additional {kib} instance will add an additional 10 tasks that your cluster can run concurrently, but you can also scale each {kib} instance vertically, if your diagnosis indicates that they can handle the additional workload.
[float]
[[task-manager-scaling-vertically]]
-==== Scaling vertically
+=== Scaling vertically
Other times, it might be preferable to increase the throughput of individual {kib} instances.
-Tweak the *Max Workers* via the <<task-manager-settings,`xpack.task_manager.max_workers`>> setting, which allows each {kib} to pull a higher number of tasks per interval. This could impact the performance of each {kib} instance as the workload will be higher.
+Tweak the capacity with the <<task-manager-settings,`xpack.task_manager.capacity`>> setting, which enables each {kib} instance to pull a higher number of tasks per interval. This setting can impact the performance of each instance as the workload will be higher.
-Tweak the *Poll Interval* via the <<task-manager-settings,`xpack.task_manager.poll_interval`>> setting, which allows each {kib} to pull scheduled tasks at a higher rate. This could impact the performance of the {es} cluster as the workload will be higher.
+Tweak the poll interval with the <<task-manager-settings,`xpack.task_manager.poll_interval`>> setting, which enables each {kib} instance to pull scheduled tasks at a higher rate. This setting can impact the performance of the {es} cluster as the workload will be higher.
[float]
[[task-manager-choosing-scaling-strategy]]
-==== Choosing a scaling strategy
+=== Choosing a scaling strategy
Each scaling strategy comes with its own considerations, and the appropriate strategy largely depends on your use case.
@@ -113,7 +112,7 @@ A higher frequency suggests {kib} instances conflict at a high rate, which you c
[float]
[[task-manager-rough-throughput-estimation]]
-==== Rough throughput estimation
+=== Rough throughput estimation
Predicting the required throughput a deployment might need to support Task Management is difficult, as features can schedule an unpredictable number of tasks at a variety of scheduled cadences.
However, a rough lower bound can be estimated, which is then used as a guide.
@@ -123,7 +122,7 @@ Throughput is best thought of as a measurement in tasks per minute.
A default {kib} instance can support up to `200/tpm`.
[float]
-===== Automatic estimation
+==== Automatic estimation
experimental[]
@@ -145,7 +144,7 @@ When evaluating the proposed {kib} instance number under `proposed.provisioned_k
============================================================================
[float]
-===== Manual estimation
+==== Manual estimation
By <<task-manager-health-evaluate-the-workload,evaluating the workload>>, you can make a rough estimate as to the required throughput as a _tasks per minute_ measurement.
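Once you have such an estimate, turning it into a provisioning lower bound is simple division. A minimal sketch, assuming the documented default of 200 tasks per minute per instance:

```ts
// Lower bound on default-sized Kibana instances for an estimated workload,
// assuming the documented default throughput of 200 tasks per minute.
const DEFAULT_TPM_PER_INSTANCE = 200;

function minimumInstances(requiredTasksPerMinute: number): number {
  return Math.max(1, Math.ceil(requiredTasksPerMinute / DEFAULT_TPM_PER_INSTANCE));
}

minimumInstances(450); // => 3 default-capacity instances
```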


@@ -1002,7 +1002,7 @@ server log [12:41:33.672] [warn][plugins][taskManager][taskManager] taskManager
This log message tells us that Task Manager is not managing to keep up with the sheer amount of work it has been tasked with completing. This might mean that rules are not running at the frequency that was expected (instead of running every 5 minutes, it runs every 7-8 minutes, just as an example).
-By default Task Manager is limited to 10 tasks and this can be bumped up by setting a higher number in the kibana.yml file using the `xpack.task_manager.max_workers` configuration. It is important to keep in mind that a higher number of tasks running at any given time means more load on both Kibana and Elasticsearch, so only change this setting if increasing load in your environment makes sense.
+By default Task Manager is limited to 10 tasks and this can be bumped up by setting a higher number in the `kibana.yml` file using the `xpack.task_manager.capacity` configuration. It is important to keep in mind that a higher number of tasks running at any given time means more load on both Kibana and Elasticsearch; only change this setting if increasing load in your environment makes sense.
Another approach to addressing this might be to tell workers to run at a higher rate, rather than adding more of them, which would be configured using `xpack.task_manager.poll_interval`. This value dictates how often Task Manager checks to see if there's more work to be done and uses milliseconds (by default it is 3000, which means an interval of 3 seconds).
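
Both levers multiply out the same way: per-instance throughput is roughly `capacity * (60000 / poll_interval)` with the interval in milliseconds, so the default capacity of 10 at the default 3000 ms interval gives the 200 tasks-per-minute figure used in the scaling docs.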


@@ -238,6 +238,7 @@ kibana_vars=(
xpack.alerting.rules.run.actions.max
xpack.alerting.rules.run.alerts.max
xpack.alerting.rules.run.actions.connectorTypeOverrides
+xpack.alerting.maxScheduledPerMinute
xpack.alerts.healthCheck.interval
xpack.alerts.invalidateApiKeysTask.interval
xpack.alerts.invalidateApiKeysTask.removalDelay
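
Each entry in `kibana_vars` allowlists a setting so it can be supplied to the Docker image as an environment variable, uppercased with dots replaced by underscores, for example `docker run -e XPACK_ALERTING_MAXSCHEDULEDPERMINUTE=5000 ...` (illustrative value).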


@@ -19,8 +19,8 @@ interface GetDefaultCapacityOpts {
const HEAP_TO_CAPACITY_MAP = [
{ minHeap: 0, maxHeap: 1, capacity: 10 },
{ minHeap: 1, maxHeap: 2, capacity: 15 },
-  { minHeap: 2, maxHeap: 4, capacity: 25, backgroundTaskNodeOnly: false },
-  { minHeap: 2, maxHeap: 4, capacity: 50, backgroundTaskNodeOnly: true },
+  { minHeap: 2, maxHeap: 16, capacity: 25, backgroundTaskNodeOnly: false },
+  { minHeap: 2, maxHeap: 16, capacity: 50, backgroundTaskNodeOnly: true },
];
export function getDefaultCapacity({
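For reference, a simplified sketch of how a heap-to-capacity map like this can be consulted. The bucket boundary handling is illustrative, and the real `getDefaultCapacity` also considers additional options (such as the claim strategy) beyond what is shown here:

```ts
// Simplified lookup over a heap-to-capacity map. Boundary handling is
// illustrative; the real getDefaultCapacity takes more options into account.
interface HeapCapacityEntry {
  minHeap: number; // lower bound of the heap bucket, in GB
  maxHeap: number; // upper bound of the heap bucket, in GB
  capacity: number;
  backgroundTaskNodeOnly?: boolean;
}

function capacityForHeap(
  map: HeapCapacityEntry[],
  heapSizeGb: number,
  isBackgroundTaskNodeOnly: boolean
): number | undefined {
  return map.find(
    (entry) =>
      heapSizeGb > entry.minHeap &&
      heapSizeGb <= entry.maxHeap &&
      // Entries without the flag apply to any node; flagged entries only
      // apply when the node's background-task-only role matches.
      (entry.backgroundTaskNodeOnly === undefined ||
        entry.backgroundTaskNodeOnly === isBackgroundTaskNodeOnly)
  )?.capacity;
}
```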


@@ -22,7 +22,7 @@ describe('CostCapacity', () => {
const capacity$ = new Subject<number>();
const pool = new CostCapacity({ capacity$, logger });
-    expect(pool.capacity).toBe(0);
+    expect(pool.capacity).toBe(10);
capacity$.next(20);
expect(pool.capacity).toBe(40);
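
The doubling in this test (a pushed capacity of 20 yields a cost capacity of 40) reflects the cost model: values emitted on `capacity$` are translated into cost units via `getCapacityInCost`, apparently at 2 cost units per "normal" task. A minimal sketch of that translation, under that assumption:

```ts
// Sketch of the capacity-to-cost translation implied by this test,
// assuming a "normal" task costs 2 cost units (assumed constant).
const NORMAL_TASK_COST = 2;

function capacityInCost(capacity: number): number {
  return capacity * NORMAL_TASK_COST;
}

capacityInCost(20); // => 40, matching the updated expectation above
```

Note that the initial value differs: `maxAllowedCost` now starts at the raw `DEFAULT_CAPACITY` of 10 (see the `cost_capacity.ts` change below), so the pool reports usable capacity before the first config value arrives, rather than 0.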


@@ -6,13 +6,14 @@
*/
import { Logger } from '@kbn/core/server';
+import { DEFAULT_CAPACITY } from '../config';
import { TaskDefinition } from '../task';
import { TaskRunner } from '../task_running';
import { CapacityOpts, ICapacity } from './types';
import { getCapacityInCost } from './utils';
export class CostCapacity implements ICapacity {
-  private maxAllowedCost: number = 0;
+  private maxAllowedCost: number = DEFAULT_CAPACITY;
private logger: Logger;
constructor(opts: CapacityOpts) {


@@ -517,11 +517,11 @@ describe('TaskPool', () => {
expect(pool.availableCapacity()).toEqual(14);
});
-  test('availableCapacity is 0 until capacity$ pushes a value', async () => {
+  test('availableCapacity is 10 until capacity$ pushes a value', async () => {
const capacity$ = new Subject<number>();
const pool = new TaskPool({ capacity$, definitions, logger, strategy: CLAIM_STRATEGY_MGET });
-    expect(pool.availableCapacity()).toEqual(0);
+    expect(pool.availableCapacity()).toEqual(10);
capacity$.next(20);
expect(pool.availableCapacity()).toEqual(40);
});
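
Same rationale as the `CostCapacity` change above: seeding the pool with the default capacity means a freshly constructed `TaskPool` reports available capacity (and can claim work) before the first `capacity$` emission, instead of starting at 0.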