[Response Ops][Docs] Alerting circuit breaker docs (#131459)

* Circuit breaker docs

* Apply suggestions from code review

Co-authored-by: Lisa Cawley <lcawley@elastic.co>
Ying Mao 2022-05-04 15:04:25 -04:00 committed by GitHub
parent c43a51d7ab
commit 2dcbcb45d1
GPG key ID: 4AEE18F83AFDEB23
2 changed files with 49 additions and 2 deletions

@@ -198,13 +198,13 @@ Specifies the minimum schedule interval for rules. This minimum is applied to al
+
`<count>[s,m,h,d]`
+
For example, `20m`, `24h`, `7d`. Default: `1m`.
For example, `20m`, `24h`, `7d`. This duration cannot exceed `1d`. Default: `1m`.
`xpack.alerting.rules.minimumScheduleInterval.enforce`::
Specifies the behavior when a new or changed rule has a schedule interval less than the value defined in `xpack.alerting.rules.minimumScheduleInterval.value`. If `false`, rules with schedules less than the interval will be created but warnings will be logged. If `true`, rules with schedules less than the interval cannot be created. Default: `false`.
`xpack.alerting.rules.run.actions.max`::
Specifies the maximum number of actions that a rule can trigger each time detection checks run.
Specifies the maximum number of actions that a rule can generate each time detection checks run.
`xpack.alerting.rules.run.timeout`::
Specifies the default timeout for tasks associated with all types of rules. The time is formatted as:

@@ -64,3 +64,50 @@ Because {kib} uses the documents to display historic data, you should set the de
For more information on index lifecycle management, see:
{ref}/index-lifecycle-management.html[Index Lifecycle Policies].
[float]
[[alerting-circuit-breakers]]
=== Circuit breakers
Running alerting rules and actions can degrade the overall health of a {kib} instance, either by clogging up Task Manager throughput or by consuming so much CPU or memory that other operations cannot complete in a reasonable amount of time. Several <<alert-settings,configurable>> circuit breakers help minimize these effects.
[float]
==== Rules with very short intervals
Running large numbers of rules at very short intervals can quickly clog up Task Manager throughput, leading to higher schedule drift. Use `xpack.alerting.rules.minimumScheduleInterval.value` to set a minimum schedule interval for rules. The default (and recommended) value for this configuration is `1m`. Use `xpack.alerting.rules.minimumScheduleInterval.enforce` to specify whether to strictly enforce this minimum. While the default value for this setting is `false` to maintain backwards compatibility with existing rules, set this to `true` to prevent new and updated rules from running at an interval below the minimum.
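As a minimal sketch, assuming these settings are placed in `kibana.yml` and that the recommended `1m` minimum is wanted, the following configuration enforces the minimum interval for new and updated rules. The nested layout is an illustrative choice that mirrors the other examples in this section.
[source,yaml]
--
xpack.alerting.rules.minimumScheduleInterval:
  value: '1m'    # minimum schedule interval (this is also the default)
  enforce: true  # reject new or changed rules with a shorter interval
--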
[float]
==== Rules that run for a long time
Rules that run for a long time typically do so because they are issuing resource-intensive {es} queries or performing CPU-intensive processing. This can block the event loop, making {kib} inaccessible while the rule runs. By default, rule processing is cancelled after `5m`, but this can be overridden using the `xpack.alerting.rules.run.timeout` configuration. This value can also be configured per rule type using `xpack.alerting.rules.run.ruleTypeOverrides`. For example, the following configuration sets the global timeout value to `1m` while allowing *Index Threshold* rules to run for `10m` before being cancelled.
[source,yaml]
--
xpack.alerting.rules.run:
  timeout: '1m'
  ruleTypeOverrides:
    - id: '.index-threshold'
      timeout: '10m'
--
When a rule run is cancelled, any alerts and actions that were generated during the run are discarded. This behavior is controlled by the `xpack.alerting.cancelAlertsOnRuleTimeout` configuration, which defaults to `true`. Set this to `false` to receive alerts and actions after the timeout, although be aware that these may be incomplete and possibly inaccurate.
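For example, assuming the setting is placed in `kibana.yml`, the following sketch keeps the alerts and actions generated before a timeout instead of discarding them. The value shown is illustrative and should be chosen with the accuracy caveat above in mind.
[source,yaml]
--
# Illustrative value; the default is true (discard alerts and actions on timeout).
xpack.alerting.cancelAlertsOnRuleTimeout: false
--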
[float]
==== Rules that spawn too many actions
Rules that spawn too many actions can quickly clog up Task Manager throughput. This can occur if:
* A rule configured with a single action generates many alerts. For example, if a rule configured to run a single email action generates 100,000 alerts, then 100,000 actions will be scheduled during a run.
* A rule configured with multiple actions generates alerts. For example, if a rule configured to run an email action, a server log action and a webhook action generates 30,000 alerts, then 90,000 actions will be scheduled during a run.
Use `xpack.alerting.rules.run.actions.max` to limit the maximum number of actions a rule can generate per run. This value can also be configured by connector type using `xpack.alerting.rules.run.actions.connectorTypeOverrides`. For example, the following config sets the global maximum number of actions to 100 while allowing rules with *Email* actions to generate up to 200 actions.
[source,yaml]
--
xpack.alerting.rules.run:
  actions:
    max: 100
    connectorTypeOverrides:
      - id: '.email'
        max: 200
--