[DOCS] Create and manage rule action frequencies (#150957)

Lisa Cawley 2023-02-23 13:16:46 -08:00 committed by GitHub
parent e0bc286a75
commit b37258e19c
GPG key ID: 4AEE18F83AFDEB23
15 changed files with 184 additions and 200 deletions

@ -6,7 +6,7 @@ Include a short description of the rule type.
[float]
==== Create the rule
Fill in the <<defining-rules-general-details, rule details>>, then select *<RULE TYPE>*.
Fill in the name and optional tags, then select *<RULE TYPE>*.
[float]
==== Define the conditions

@ -0,0 +1,126 @@
[[rule-action-variables]]
== Rule action variables
Alerting rules can use the https://mustache.github.io/[Mustache] template syntax
(`{{variable name}}`) to pass values when their actions run.
The available variables differ by rule type; however, there are some common variables:
* <<general-rule-action-variables>>
* <<alert-summary-action-variables>>
* <<alert-action-variables>>
In some cases, variable values are "escaped" when they are used in a context that requires escaping. For example:
- For the <<email-action-type,email connector>>, the `message` action configuration property escapes any characters that would be interpreted as Markdown.
- For the <<slack-action-type,Slack connector>>, the `message` action configuration property escapes any characters that would be interpreted as Slack Markdown.
- For the <<webhook-action-type,Webhook connector>>, the `body` action configuration property escapes any characters that are invalid in JSON string values.
Mustache also supports "triple braces" of the form `{{{variable name}}}`, which indicates that no escaping should be done. Use this form with caution, since unescaped content can make the resulting parameter invalid or incorrectly formatted.
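For example, in the `message` property of a Slack connector action, the following two lines render the same variable with and without escaping. This is an illustrative sketch; the labels are arbitrary text:
[source]
--------------------------------------------------
Escaped:   {{rule.name}}
Unescaped: {{{rule.name}}}
--------------------------------------------------
If the rule name contains characters that Slack Markdown would interpret, only the first line escapes them; the second inserts the value as-is.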
[float]
[[general-rule-action-variables]]
=== General
All rule types pass the following variables:
`date`:: The date the rule scheduled the action, in ISO format.
`kibanaBaseUrl`:: The configured <<server-publicBaseUrl,`server.publicBaseUrl`>>. If not configured, this will be empty.
`rule.id`:: The ID of the rule.
`rule.name`:: The name of the rule.
`rule.spaceId`:: The ID of the space for the rule.
`rule.tags`:: The list of tags applied to the rule.
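A minimal sketch of an action message that uses only these general variables follows. The wording and layout are illustrative, and the last line assumes `server.publicBaseUrl` is configured; otherwise `kibanaBaseUrl` renders as an empty string:
[source]
--------------------------------------------------
Rule "{{rule.name}}" ({{rule.id}}) in space {{rule.spaceId}} scheduled this action at {{date}}.
Tags: {{#rule.tags}}{{.}} {{/rule.tags}}
Kibana: {{kibanaBaseUrl}}
--------------------------------------------------
Because `rule.tags` is a list, the Mustache section `{{#rule.tags}}...{{/rule.tags}}` repeats once per tag, and `{{.}}` refers to the current tag.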
[float]
[role="child_attributes"]
[[alert-summary-action-variables]]
=== Action frequency: Summary of alerts
If the rule's action frequency is a summary of alerts, it passes the following variables:
`alerts.all.count`:: The count of all alerts.
`alerts.all.data`::
An array of objects for all alerts. The following object properties are examples; it is not a comprehensive list.
+
.Properties of the alerts.all.data objects
[%collapsible%open]
=====
//# tag::alerts-data[]
`kibana.alert.end`:: Datetime stamp of alert end. preview:[]
`kibana.alert.flapping`:: A flag on the alert that indicates whether the alert status is changing repeatedly. preview:[]
`kibana.alert.instance.id`:: ID of the source that generates the alert. preview:[]
`kibana.alert.reason`:: The reason for the alert (generated with the rule conditions). preview:[]
`kibana.alert.start`:: Datetime stamp of alert start. preview:[]
`kibana.alert.status`:: Alert status (for example, active or OK). preview:[]
//# end::alerts-data[]
=====
`alerts.new.count`:: The count of new alerts.
`alerts.new.data`::
An array of objects for new alerts. The following object properties are examples; it is not a comprehensive list.
+
.Properties of the alerts.new.data objects
[%collapsible]
=====
include::action-variables.asciidoc[tag=alerts-data]
=====
`alerts.ongoing.count`:: The count of ongoing alerts.
`alerts.ongoing.data`::
An array of objects for ongoing alerts. The following object properties are examples; it is not a comprehensive list.
+
.Properties of the alerts.ongoing.data objects
[%collapsible]
=====
include::action-variables.asciidoc[tag=alerts-data]
=====
`alerts.recovered.count`:: The count of recovered alerts.
`alerts.recovered.data`::
An array of objects for recovered alerts. The following object properties are examples; it is not a comprehensive list.
+
.Properties of the alerts.recovered.data objects
[%collapsible]
=====
include::action-variables.asciidoc[tag=alerts-data]
=====
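As an illustration, the following sketch of a summary message combines the counts with a loop over the new alerts. It assumes that each object in `alerts.new.data` resolves the dotted property names listed above; because those properties are in technical preview and the list is not comprehensive, confirm which properties your rule type provides before relying on them:
[source]
--------------------------------------------------
New: {{alerts.new.count}}, ongoing: {{alerts.ongoing.count}}, recovered: {{alerts.recovered.count}}, all: {{alerts.all.count}}
{{#alerts.new.data}}
- {{kibana.alert.instance.id}}: {{kibana.alert.reason}}
{{/alerts.new.data}}
--------------------------------------------------
The `{{#alerts.new.data}}...{{/alerts.new.data}}` section repeats once per object in the array, and names inside it are looked up on the current object first.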
[float]
[[alert-action-variables]]
=== Action frequency: For each alert
If the rule's action frequency is not a summary of alerts, it passes the following variables:
`alert.actionGroup`:: The ID of the action group of the alert that scheduled the action.
`alert.actionGroupName`:: The name of the action group of the alert that scheduled the action.
`alert.actionSubgroup`:: The action subgroup of the alert that scheduled the action.
`alert.flapping`:: A flag on the alert that indicates whether the alert status is changing repeatedly.
`alert.id`:: The ID of the alert that scheduled the action.
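For example, a per-alert message can identify the alert and its action group, and use a Mustache section to add a line only when the alert is flapping. This is a sketch that assumes `alert.flapping` is a boolean, as its description suggests:
[source]
--------------------------------------------------
Alert {{alert.id}} is in action group "{{alert.actionGroupName}}" ({{alert.actionGroup}}).
{{#alert.flapping}}
Note: this alert's status is changing repeatedly.
{{/alert.flapping}}
--------------------------------------------------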
[float]
[[defining-rules-actions-variable-context]]
==== Context
If the rule's action frequency is not a summary of alerts, the rule defines additional variables as properties of the variable `context`. For example, if a rule type defines a variable `value`, it can be used in an action parameter as `{{context.value}}`.
For diagnostic or exploratory purposes, action variables whose values are objects, such as `context`, can be referenced directly as variables. The resulting value will be a JSON representation of the object. For example, if an action parameter includes `{{context}}`, it will expand to the JSON representation of all the variables and values provided by the rule type. To see alert-specific variables, use `{{.}}`.
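For example, while testing a rule you might temporarily set an action message to the following to see which variables are available. The labels are illustrative:
[source]
--------------------------------------------------
Context: {{context}}
All variables for this alert: {{.}}
--------------------------------------------------
The JSON output can be long, so consider removing these lines once you know which variables you need.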
For situations where your rule response returns arrays of data, you can loop through the `context`:
[source]
--------------------------------------------------
{{#context}}{{.}}{{/context}}
--------------------------------------------------
For example, looping through search result hits:
[source]
--------------------------------------------------
triggering data was:
{{#context.hits}} - {{_source.message}}
{{/context.hits}}
--------------------------------------------------
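Mustache inverted sections (`{{^name}}...{{/name}}`) render only when a value is missing, false, or an empty list, so you can pair a loop with a fallback. This sketch assumes, as in the previous example, that the rule type provides a `context.hits` array; the fallback text is illustrative:
[source]
--------------------------------------------------
triggering data was:
{{#context.hits}} - {{_source.message}}
{{/context.hits}}
{{^context.hits}}(no matching documents){{/context.hits}}
--------------------------------------------------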

@ -22,233 +22,91 @@ available, go to <<alerting-getting-started>>.
[float]
=== Required permissions
Access to rules is granted based on your privileges to {alert-features}. For
Access to rules is granted based on your {alert-features} privileges. For
more information, go to <<alerting-security>>.
[float]
[[create-edit-rules]]
=== Create and edit rules
Many rules must be created within the context of a {kib} app like
Some rules must be created within the context of a {kib} app like
<<metrics-app,Metrics>>, <<xpack-apm,APM>>, or <<uptime-app,Uptime>>, but others
are generic. Generic rule types can be created in *{rules-ui}* by clicking the
*Create rule* button. This will launch a flyout that guides you through selecting
a rule type and configuring its conditions and action type. For details on what
types of rules are available and how to configure them, refer to <<stack-rules>>.
a rule type and configuring its conditions and actions.
After a rule is created, you can open the action menu (…) and select *Edit rule*
to re-open the flyout and change the rule properties.
[float]
[[defining-rules-general-details]]
==== General rule details
[[defining-rules-type-conditions]]
==== Rule type and conditions
All rules share the following four properties:
Depending on the {kib} app and context, you might be prompted to choose the type of rule to create. Some apps will preselect the type of rule for you.
Each rule type provides its own way of defining the conditions to detect, but an expression formed by a series of clauses is a common pattern. For example, in an index threshold rule, the `WHEN` clause enables you to select an aggregation operation to apply to a numeric field.
[role="screenshot"]
image::images/rule-flyout-rule-conditions.png[UI for defining rule conditions on an index threshold rule,500]
All rules must have a check interval, which defines how often to evaluate the rule conditions. Checks are queued; they run as close to the defined value as capacity allows.
For details on what types of rules are available and how to configure them, refer to <<rule-types>>.
[float]
[[defining-rules-actions-details]]
==== Actions
You can add one or more actions to your rule to generate notifications when its
conditions are met and when they are no longer met.
Each action uses a connector, which provides connection information for a {kib} service or third-party integration, depending on where you want to send the notifications. If no connectors exist, click **Add connector** to create one.
After you select a connector, set the action frequency. If the rule type supports alert summaries, you can choose to create a summary of alerts on each check interval or on a custom interval. For example, if you create a metrics threshold rule, you can send email notifications that summarize the new, ongoing, and recovered alerts each day:
[role="screenshot"]
image::images/rule-flyout-action-summary.png[UI for defining an action that sends a summary of alerts,500]
TIP: If you choose a custom action interval, it cannot be shorter than the rule's check interval.
Alternatively, you can set the action frequency such that the action runs for each alert. If the rule type does not support alert summaries, this is your only available option. You must choose when the action runs (for example, at each check interval, only when the alert status changes, or at a custom action interval). You must also choose an action group, which affects whether the action runs (for example, the action runs when the issue is detected or when it is recovered). Each rule type has a specific set of valid action groups.
[role="screenshot"]
image::images/rule-flyout-action-details.png[UI for defining an email action,500]
Each connector enables different action properties. For example, an email connector enables you to set the recipients, the subject, and a message body in markdown format. For more information about connectors, refer to <<action-types>>.
Name:: The name of the rule. While this name does not have to be unique, a
distinctive name can help you identify a rule.
Tags:: A list of tag names that can be applied to a rule. Tags can help you
organize and find rules.
Check every:: Defines how often to evaluate the rule condition. Checks are
queued; they run as close to the defined value as capacity allows. For more
details, go to <<alerting-production-considerations,Alerting production considerations>>.
Notify:: Defines how often alerts generate actions. Options include running
actions at each check interval, only when the alert status changes, or at a
custom action interval.
+
--
[[alerting-concepts-suppressing-duplicate-notifications]]
[TIP]
==============================================
Since actions are triggered per alert, a rule can end up generating a large
number of actions. Take the following example where a rule is monitoring three
servers every minute for CPU usage > 0.9, and the rule is set to notify
`On check intervals`:
If you are not using alert summaries, actions are triggered per alert and a rule can end up generating a large number of actions. Take the following example where a rule is monitoring three servers every minute for CPU usage > 0.9, and the rule is set to notify `On check intervals`:
* Minute 1: server X123 > 0.9. _One email_ is sent for server X123.
* Minute 2: X123 and Y456 > 0.9. _Two emails_ are sent, one for X123 and one for Y456.
* Minute 3: X123, Y456, Z789 > 0.9. _Three emails_ are sent, one for each of X123, Y456, Z789.
In this example, three emails are sent for server X123 in the span of 3 minutes
for the same rule. Often, it's desirable to suppress these re-notifications. If
you set the rule notify setting to `On custom action intervals` with an interval
of 5 minutes, you reduce noise by getting emails only every 5 minutes for
In this example, three emails are sent for server X123 in the span of 3 minutes for the same rule. Often, it's desirable to suppress these re-notifications. If
you set the rule notify setting to `On custom action intervals` with an interval of 5 minutes, you reduce noise by getting emails only every 5 minutes for
servers that continue to exceed the threshold:
* Minute 1: server X123 > 0.9. _One email_ is sent for server X123.
* Minute 2: X123 and Y456 > 0.9. _One email_ is sent for Y456.
* Minute 3: X123, Y456, Z789 > 0.9. _One email_ is sent for Z789.
To get notified only once when a server exceeds the threshold, you can set the
rule notify setting to `On status changes`.
To get notified only once when a server exceeds the threshold, you can set the rule notify setting to `On status changes`.
==============================================
--
[role="screenshot"]
image::images/rule-flyout-general-details.png[alt='All rules have name, tags, check every, and notify properties in common']
[float]
[[defining-rules-type-conditions]]
==== Rule type and conditions
Depending upon the {kib} app and context, you might be prompted to choose the type of rule to create. Some apps will preselect the type of rule for you.
[role="screenshot"]
image::images/rule-flyout-rule-type-selection.png[Choosing the type of rule to create]
Each rule type provides its own way of defining the conditions to detect, but an expression formed by a series of clauses is a common pattern. Each clause has a UI control that allows you to define the clause. For example, in an index threshold rule, the `WHEN` clause allows you to select an aggregation operation to apply to a numeric field.
[role="screenshot"]
image::images/rule-flyout-rule-conditions.png[UI for defining rule conditions on an index threshold rule]
[float]
[[defining-rules-actions-details]]
==== Action type and details
Actions are optional when you create a rule. However, to receive notifications when a rule meets the defined conditions, you must add one or more actions. Start by selecting a type of connector for your action:
[role="screenshot"]
image::images/rule-flyout-connector-type-selection.png[UI for selecting an action type]
Each action must specify a <<alerting-concepts-connectors, connector>> instance. If no connectors exist for the selected type, click **Add connector** to create one.
After you have selected a connector, use the **Run When** dropdown to choose the action group to associate with this action. When a rule meets the defined condition, it is marked as **Active** and alerts are created and assigned to an action group. In addition to the action groups defined by the selected rule type, each rule also has a **Recovered** action group that is assigned when a rule's conditions are no longer detected.
Each action type exposes different properties. For example, an email action allows you to set the recipients, the subject, and a message body in markdown format. See <<action-types>> for details on the types of actions provided by {kib} and their properties.
[role="screenshot"]
image::images/rule-flyout-action-details.png[UI for defining an email action]
You can attach more than one action. Clicking the *Add action* button will prompt you to select another rule type and repeat the above steps again.
[float]
[[defining-rules-actions-variables]]
===== Action variables
==== Action variables
Using the https://mustache.github.io/[Mustache] template syntax `{{variable name}}`, you can pass rule values to an action at the time a condition is detected. You can access the list of available variables using the "add rule variable" button:
You can pass rule values to an action at the time a condition is detected.
To view the list of variables available for your rule, click the "add rule variable" button:
[role="screenshot"]
image::images/rule-flyout-action-variables.png[Passing rule values to an action]
image::images/rule-flyout-action-variables.png[Passing rule values to an action,500]
All rule types pass the following variables:
`date`:: The date the rule scheduled the action, in ISO format.
`kibanaBaseUrl`:: The configured <<server-publicBaseUrl,`server.publicBaseUrl`>>. If not configured, this will be empty.
`rule.id`:: The ID of the rule.
`rule.name`:: The name of the rule.
`rule.spaceId`:: The ID of the space for the rule.
`rule.tags`:: The list of tags applied to the rule.
There is also a set of action variables specific to the action frequency:
- <<alert-action-variables,For each alert>>
- <<alert-summary-action-variables,Summary of alerts>>
[float]
[[alert-action-variables]]
===== Action variables for each alert
Although available variables differ by rule type, when the action frequency is
**For each alert**, all rule types pass the following variables:
`alert.actionGroup`:: The ID of the action group of the alert that scheduled the action.
`alert.actionGroupName`:: The name of the action group of the alert that scheduled the action.
`alert.actionSubgroup`:: The action subgroup of the alert that scheduled the action.
`alert.flapping`:: A flag on the alert that indicates whether the alert status is changing repeatedly.
`alert.id`:: The ID of the alert that scheduled the action.
[float]
[role="child_attributes"]
[[alert-summary-action-variables]]
===== Action variables for summary of alerts
NOTE: This type of action frequency is not available for all rule types.
When the action frequency is **Summary of alerts**, rules pass the following
variables:
`alerts.all.count`:: The count of all alerts.
`alerts.all.data`::
An array of objects for all alerts. The following object properties are examples; it is not a comprehensive list.
+
.Properties of the alerts.all.data objects
[%collapsible%open]
=====
//# tag::alerts-data[]
`kibana.alert.end`:: Datetime stamp of alert end. preview:[]
`kibana.alert.flapping`:: A flag on the alert that indicates whether the alert status is changing repeatedly. preview:[]
`kibana.alert.instance.id`:: ID of the source that generates the alert. preview:[]
`kibana.alert.reason`:: The reason of the alert (generated with the rule conditions). preview:[]
`kibana.alert.start`:: Datetime stamp of alert start. preview:[]
`kibana.alert.status`:: Alert status (for example, active or OK). preview:[]
//# end::alerts-data[]
=====
`alerts.new.count`:: The count of new alerts.
`alerts.new.data`::
An array of objects for new alerts. The following object properties are examples; it is not a comprehensive list.
+
.Properties of the alerts.new.data objects
[%collapsible]
=====
include::create-and-manage-rules.asciidoc[tag=alerts-data]
=====
`alerts.ongoing.count`:: The count of ongoing alerts.
`alerts.ongoing.data`::
An array of objects for ongoing alerts. The following object properties are examples; it is not a comprehensive list.
+
.Properties of the alerts.ongoing.data objects
[%collapsible]
=====
include::create-and-manage-rules.asciidoc[tag=alerts-data]
=====
`alerts.recovered.count`:: The count of recovered alerts.
`alerts.recovered.data`::
An array of objects for recovered alerts. The following object properties are examples; it is not a comprehensive list.
+
.Properties of the alerts.recovered.data objects
[%collapsible]
=====
include::create-and-manage-rules.asciidoc[tag=alerts-data]
=====
Some cases exist where the variable values will be "escaped", when used in a context where escaping is needed:
- For the <<email-action-type,Email>> connector, the `message` action configuration property escapes any characters that would be interpreted as Markdown.
- For the <<slack-action-type,Slack>> connector, the `message` action configuration property escapes any characters that would be interpreted as Slack Markdown.
- For the <<webhook-action-type,Webhook>> connector, the `body` action configuration property escapes any characters that are invalid in JSON string values.
Mustache also supports "triple braces" of the form `{{{variable name}}}`, which indicates no escaping should be done at all. Care should be used when using this form, as it could end up rendering the variable content in such a way as to make the resulting parameter invalid or formatted incorrectly.
[float]
[[defining-rules-actions-variable-context]]
===== Action variable context
When the action frequency is **For each alert**, each rule type defines additional variables as properties of the variable `context`. For example, if a rule type defines a variable `value`, it can be used in an action parameter as `{{context.value}}`.
For diagnostic or exploratory purposes, action variables whose values are objects, such as `context`, can be referenced directly as variables. The resulting value will be a JSON representation of the object. For example, if an action parameter includes `{{context}}`, it will expand to the JSON representation of all the variables and values provided by the rule type. To see alert-specific variables, use `{{.}}`.
For situations where your rule response returns arrays of data, you can loop through the `context`:
[source]
--------------------------------------------------
{{#context}}{{.}}{{/context}}
--------------------------------------------------
For example, looping through search result hits:
[source]
--------------------------------------------------
triggering data was:
{{#context.hits}} - {{_source.message}}
{{/context.hits}}
--------------------------------------------------
For more information about common action variables, refer to <<rule-action-variables>>.
[float]
[[controlling-rules]]
@ -298,7 +156,7 @@ Some rule types cannot be exported through this interface:
Rules are disabled on export. You are prompted to re-enable the rule on successful import.
[role="screenshot"]
image::images/rules-imported-banner.png[Rules import banner, width=50%]
image::images/rules-imported-banner.png[Rules import banner,500]
[float]
[[rule-details]]

Binary file not shown. Before: 329 KiB. After: 94 KiB.

Binary file not shown. After: 53 KiB.

Binary file not shown. Before: 497 KiB. After: 170 KiB.

Binary file not shown. Before: 271 KiB.

Binary file not shown. Before: 154 KiB.

Binary file not shown. Before: 171 KiB. After: 154 KiB.

Binary file not shown. Before: 419 KiB.

@ -2,4 +2,5 @@ include::alerting-getting-started.asciidoc[]
include::alerting-setup.asciidoc[]
include::create-and-manage-rules.asciidoc[]
include::rule-types.asciidoc[]
include::action-variables.asciidoc[]
include::alerting-troubleshooting.asciidoc[]

@ -9,7 +9,7 @@ threshold condition is met.
[float]
=== Create the rule
Fill in the <<defining-rules-general-details, rule details>>, then select
Fill in the name and optional tags, then select
*{es} query*. An {es} query rule can be defined using KQL/Lucene or Query DSL.
[float]
@ -30,8 +30,7 @@ Threshold:: Defines a threshold value and a comparison operator (`is above`,
calculated by the aggregation is compared to this threshold.
Time window:: Defines how far back to search for documents, using the
*time field* set in the *index* clause. Generally this value should be set to a
value higher than the *check every* value in the
<<defining-rules-general-details, general rule details>>, to avoid gaps in
value higher than the *check every* value, to avoid gaps in
detection.
Size:: Specifies the number of documents to pass to the configured actions when
the threshold condition is met.

@ -29,7 +29,7 @@ than the current time minus the amount of the interval. If data older than
[float]
==== Create the rule
Fill in the <<defining-rules-general-details, rule details>>, then select Tracking containment.
Fill in the name and optional tags, then select Tracking containment.
[float]
==== Define the conditions

@ -7,7 +7,7 @@ The index threshold rule type runs an {es} query. It aggregates field values fro
[float]
==== Create the rule
Fill in the <<defining-rules-general-details, rule details>>, then select *Index Threshold*.
Fill in the name and optional tags, then select *Index Threshold*.
[float]
==== Define the conditions
@ -21,7 +21,7 @@ Index:: This clause requires an *index or data view* and a *time field* that wil
When:: This clause specifies how the value to be compared to the threshold is calculated. The value is calculated by aggregating a numeric field over the *time window*. The aggregation options are: `count`, `average`, `sum`, `min`, and `max`. When using `count`, the document count is used, and an aggregation field is not necessary.
Over/Grouped Over:: This clause lets you configure whether the aggregation is applied over all documents, or should be split into groups using a grouping field. If grouping is used, an <<alerting-concepts-alerts, alert>> will be created for each group when it exceeds the threshold. To limit the number of alerts on high cardinality fields, you must specify the number of groups to check against the threshold. Only the *top* groups are checked.
Threshold:: This clause defines a threshold value and a comparison operator (one of `is above`, `is above or equals`, `is below`, `is below or equals`, or `is between`). The result of the aggregation is compared to this threshold.
Time window:: This clause determines how far back to search for documents, using the *time field* set in the *index* clause. Generally this value should be to a value higher than the *check every* value in the <<defining-rules-general-details, general rule details>>, to avoid gaps in detection.
Time window:: This clause determines how far back to search for documents, using the *time field* set in the *index* clause. Generally this value should be set to a value higher than the *check every* value, to avoid gaps in detection.
If data is available and all clauses have been defined, a preview chart will render the threshold value and display a line chart showing the value for the last 30 intervals. This can provide an indication of recent values and their proximity to the threshold, and help you tune the clauses.

@ -19,7 +19,7 @@ When relying on rules and actions as mission critical services, make sure you fo
By default, each {kib} instance polls for work at three second intervals, and can run a maximum of ten concurrent tasks.
These tasks are then run on the {kib} server.
Rules are recurring background tasks which are rescheduled according to the <<defining-rules-general-details, check interval>> on completion.
Rules are recurring background tasks which are rescheduled according to the check interval on completion.
Actions are non-recurring background tasks which are deleted on completion.
For more details on Task Manager, see <<task-manager-background-tasks>>.
@ -42,8 +42,8 @@ As rules and actions leverage background tasks to perform the majority of work,
When estimating the required task throughput, keep the following in mind:
* Each rule uses a single recurring task that is scheduled to run at the cadence defined by its <<defining-rules-general-details,check interval>>.
* Each action uses a single task. However, because <<alerting-concepts-suppressing-duplicate-notifications,actions are taken per instance>>, alerts can generate a large number of non-recurring tasks.
* Each rule uses a single recurring task that is scheduled to run at the cadence defined by its check interval.
* Each action uses a single task. However, because actions are taken per instance, alerts can generate a large number of non-recurring tasks.
It is difficult to predict how much throughput is needed to ensure all rules and actions are executed at consistent schedules.
By counting rules as recurring tasks and actions as non-recurring tasks, a rough throughput <<task-manager-rough-throughput-estimation,can be estimated>> as a _tasks per minute_ measurement.