[DOCS] Replace execution terminology in Alerting (#131357)

Lisa Cawley 2022-05-04 15:11:53 -07:00 committed by GitHub
parent 542b381fa5
commit 1591bfba24
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
10 changed files with 69 additions and 59 deletions

@ -92,7 +92,7 @@ URLs can use both the `ssl` and `smtp` options.
+
No other URL values should be part of this URL, including paths,
query strings, and authentication information. When an http or smtp request
is made as part of executing an action, only the protocol, hostname, and
is made as part of running an action, only the protocol, hostname, and
port of the URL for that request are used to look up these configuration
values.
@ -188,10 +188,14 @@ For example, `20m`, `24h`, `7d`, `1w`. Default: `60s`.
==== Alerting settings
`xpack.alerting.maxEphemeralActionsPerAlert`::
Sets the number of actions that will be executed ephemerally. To use this, enable ephemeral tasks in task manager first with <<task-manager-settings,`xpack.task_manager.ephemeral_tasks.enabled`>>
Sets the number of actions that will run ephemerally. To use this, enable
ephemeral tasks in task manager first with
<<task-manager-settings,`xpack.task_manager.ephemeral_tasks.enabled`>>
`xpack.alerting.cancelAlertsOnRuleTimeout`::
Specifies whether to skip writing alerts and scheduling actions if rule execution is cancelled due to timeout. Default: `true`. This setting can be overridden by individual rule types.
Specifies whether to skip writing alerts and scheduling actions if rule
processing was cancelled due to a timeout. Default: `true`. This setting can be
overridden by individual rule types.
`xpack.alerting.rules.minimumScheduleInterval.value`::
Specifies the minimum schedule interval for rules. This minimum is applied to all rules created or updated after you set this value. The time is formatted as:

@ -63,7 +63,7 @@ Rule schedules are defined as an interval between subsequent checks, and can ran
[IMPORTANT]
==============================================
The intervals of rule checks in {kib} are approximate. The timing of their execution is affected by factors such as the frequency at which tasks are claimed and the task load on the system. See <<alerting-production-considerations, Alerting production considerations>> for more information.
The intervals of rule checks in {kib} are approximate. Their timing is affected by factors such as the frequency at which tasks are claimed and the task load on the system. Refer to <<alerting-production-considerations>> for more information.
==============================================
[float]
@ -82,7 +82,7 @@ The result is a template: all the parameters needed to invoke a service are supp
In the server monitoring example, the `email` connector type is used, and `server` is mapped to the body of the email, using the template string `CPU on {{server}} is high`.
When the rule detects the condition, it creates an <<alerting-concepts-alerts, alert>> containing the details of the condition, renders the template with these details such as server name, and executes the action on the {kib} server by invoking the `email` connector type.
When the rule detects the condition, it creates an <<alerting-concepts-alerts,alert>> containing the details of the condition, renders the template with these details such as server name, and runs the action on the {kib} server by invoking the `email` connector type.
image::images/what-is-an-action.svg[Actions are like templates that are rendered when an alert detects a condition]

@ -62,7 +62,7 @@ Rules and connectors are isolated to the {kib} space in which they were created.
[[alerting-authorization]]
=== Authorization
Rules are authorized using an <<api-keys, API key>> associated with the last user to edit the rule. This API key captures a snapshot of the user's privileges at the time of edit and is subsequently used to run all background tasks associated with the rule, including condition checks, like {es} queries, and action executions. The following rule actions will re-generate the API key:
Rules are authorized using an <<api-keys,API key>> associated with the last user to edit the rule. This API key captures a snapshot of the user's privileges at the time of edit and is subsequently used to run all background tasks associated with the rule, including condition checks like {es} queries and triggered actions. The following rule actions will re-generate the API key:
* Creating a rule
* Enabling a disabled rule

@ -52,7 +52,7 @@ Diagnosing these may be difficult - but there may be log messages for error cond
=== Use the REST APIs
There is a rich set of HTTP endpoints to introspect and manage rules and connectors.
One of the http endpoints available for actions is the POST <<execute-connector-api,_execute API>>. You can use this to “test” an action. For instance, if you have a server log action created, you can execute it via curling the endpoint:
One of the http endpoints available for actions is the POST <<execute-connector-api,_execute API>>. You can use this to “test” an action. For instance, if you have a server log action created, you can run it via curling the endpoint:
[source, txt]
--------------------------------------------------
curl -X POST -k \
@ -75,7 +75,7 @@ The same REST POST _execute API command will be:
kbn-action execute a692dc89-15b9-4a3c-9e47-9fb6872e49ce {"params":{"subject":"hallo","message":"hallo!","to":["me@example.com"]}}
--------------------------------------------------
The result of this http request (and printed to stdout by https://github.com/pmuellr/kbn-action[kbn-action]) will be data returned by the action execution, along with error messages if errors were encountered.
The result of this http request (and printed to stdout by https://github.com/pmuellr/kbn-action[kbn-action]) will be data returned by the action, along with error messages if errors were encountered.
[float]
[[alerting-error-banners]]
@ -92,8 +92,8 @@ image::images/rules-details-health.png[Rule details page with the errors banner]
[[task-manager-diagnostics]]
=== Task Manager diagnostics
Under the hood, *Rules and Connectors* uses a plugin called Task Manager, which handles the scheduling, execution, and error handling of the tasks.
This means that failure cases in Rules or Connectors will, at times, be revealed by the Task Manager mechanism, rather than the Rules mechanism.
Under the hood, {rules-ui} uses a plugin called Task Manager, which handles the scheduling, running, and error handling of the tasks.
This means that failure cases in {rules-ui} will, at times, be revealed by the Task Manager mechanism, rather than the Rules mechanism.
Task Manager provides a visible status which can be used to diagnose issues and is very well documented <<task-manager-health-monitoring,health monitoring>> and <<task-manager-troubleshooting,troubleshooting>>.
Task Manager uses the `.kibana_task_manager` index, an internal index that contains all the saved objects that represent the tasks in the system.

@ -44,7 +44,7 @@ Notify:: This value limits how often actions are repeated when an alert rem
[[alerting-concepts-suppressing-duplicate-notifications]]
[NOTE]
==============================================
Since actions are executed per alert, a rule can end up generating a large number of actions. Take the following example where a rule is monitoring three servers every minute for CPU usage > 0.9, and the rule is set to notify **Every time alert is active**:
Since actions are triggered per alert, a rule can end up generating a large number of actions. Take the following example where a rule is monitoring three servers every minute for CPU usage > 0.9, and the rule is set to notify **Every time alert is active**:
* Minute 1: server X123 > 0.9. *One email* is sent for server X123.
* Minute 2: X123 and Y456 > 0.9. *Two emails* are sent, one for X123 and one for Y456.
@ -163,8 +163,8 @@ A rule can have one of the following statuses:
`active`:: The conditions for the rule have been met, and the associated actions should be invoked.
`ok`:: The conditions for the rule have not been met, and the associated actions are not invoked.
`error`:: An error was encountered during rule execution.
`pending`:: The rule has not yet executed. The rule was either just created, or enabled after being disabled.
`error`:: An error was encountered by the rule.
`pending`:: The rule has not yet run. The rule was either just created, or enabled after being disabled.
`unknown`:: A problem occurred when calculating the status. Most likely, something went wrong with the alerting code.
[float]

@ -26,7 +26,7 @@ Index:: Specifies an *index or data view* and a *time field* that is used for
the *time window*.
Size:: Specifies the number of documents to pass to the configured actions when
the threshold condition is met.
{es} query:: Specifies the ES DSL query to execute. The number of documents that
{es} query:: Specifies the ES DSL query. The number of documents that
match this query is evaluated against the threshold condition. Only the `query`
field is used, other DSL fields are not considered.
Threshold:: Defines a threshold value and a comparison operator (`is above`,
@ -81,7 +81,7 @@ image::images/rule-types-es-query-example-action-variable.png[Iterate over hits
Use the *Test query* feature to verify that your query DSL is valid.
* Valid queries are executed against the configured *index* using the configured
* Valid queries are run against the configured *index* using the configured
*time window*. The number of documents that match the query is displayed.
+
[role="screenshot"]
@ -95,16 +95,14 @@ image::user/alerting/images/rule-types-es-query-invalid.png[Test {es} query show
[float]
==== Handling multiple matches of the same document
This rule type checks for duplication of document matches across rule
executions. If you configure the rule with a schedule interval smaller than the
time window, and a document matches a query in multiple rule executions, it is
alerted on only once.
This rule type checks for duplication of document matches across multiple runs.
If you configure the rule with a schedule interval smaller than the time window,
and a document matches a query in multiple runs, it is alerted on only once.
The rule uses the timestamp of the matches to avoid alerting on the same match
multiple times. The timestamp of the latest match is used for evaluating the
rule conditions when the rule is executed. Only matches between the latest
timestamp from the previous execution and the actual rule execution are
considered.
rule conditions when the rule runs. Only matches between the latest timestamp
from the previous run and the current run are considered.
Suppose you have a rule configured to run every minute. The rule uses a time
window of 1 hour and checks if there are more than 99 matches for the query. The
@ -112,16 +110,16 @@ window of 1 hour and checks if there are more than 99 matches for the query. The
[cols="3*<"]
|===
| `Execution 1 (0:00)`
| `Run 1 (0:00)`
| Rule finds 113 matches in the last hour: `113 > 99`
| Rule is active and user is alerted.
| `Execution 2 (0:01)`
| `Run 2 (0:01)`
| Rule finds 127 matches in the last hour. 105 of the matches are duplicates that were already alerted on previously, so you actually have 22 matches: `22 !> 99`
| No alert.
| `Execution 3 (0:02)`
| `Run 3 (0:02)`
| Rule finds 159 matches in the last hour. 88 of the matches are duplicates that were already alerted on previously, so you actually have 71 matches: `71 !> 99`
| No alert.
| `Execution 4 (0:03)`
| `Run 4 (0:03)`
| Rule finds 190 matches in the last hour. 71 of them are duplicates that were already alerted on previously, so you actually have 119 matches: `119 > 99`
| Rule is active and user is alerted.
|===
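The arithmetic in the table above can be sketched in a few lines of Python. This is an illustration only, not how {kib} implements the rule type: the helper name `evaluate_run` and its parameters are hypothetical, the duplicate counts are copied from the table rather than derived from timestamps, and the first run is assumed to have no duplicates.

```python
from typing import List, Tuple

THRESHOLD = 99  # the example rule alerts when new matches exceed this value


def evaluate_run(total_matches: int, duplicates: int,
                 threshold: int = THRESHOLD) -> Tuple[int, bool]:
    """Return (new_matches, alerted) for a single run.

    Hypothetical helper: `duplicates` is the number of matches that were
    already alerted on in earlier runs, as listed in the table.
    """
    new_matches = total_matches - duplicates
    return new_matches, new_matches > threshold


# (matches found in the last hour, duplicates) for runs 1 to 4.
# The zero for run 1 is an assumption: nothing was alerted on before it.
runs: List[Tuple[int, int]] = [(113, 0), (127, 105), (159, 88), (190, 71)]

results = [evaluate_run(total, dupes) for total, dupes in runs]
for number, (new_matches, alerted) in enumerate(results, start=1):
    print(f"Run {number}: {new_matches} new matches, alerted={alerted}")
```

Only runs 1 and 4 exceed the threshold after duplicates are subtracted, which matches the rows where the rule is active.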

@ -52,7 +52,7 @@ In this example, you will use the {kib} <<add-sample-data, sample weblog dataset
. Open the main menu, then click **Stack Management > Rules and Connectors**.
. Create a new rule that is checked every four hours and executes actions when the rule status changes.
. Create a new rule that is checked every four hours and triggers actions when the rule status changes.
+
[role="screenshot"]
image::user/alerting/images/rule-types-index-threshold-select.png[Choosing an index threshold rule type]

@ -40,35 +40,34 @@ When diagnosing issues related to alerting, focus on the tasks that begin with `
Alerting tasks always begin with `alerting:`. For example, the `alerting:.index-threshold` tasks back the <<rule-type-index-threshold, index threshold stack rule>>.
Action tasks always begin with `actions:`. For example, the `actions:.index` tasks back the <<index-action-type, index action>>.
For more details on monitoring and diagnosing task execution in Task Manager, see <<task-manager-health-monitoring>>.
For more details on monitoring and diagnosing tasks in Task Manager, refer to <<task-manager-health-monitoring>>.
[float]
[[connector-tls-settings]]
==== Connectors have TLS errors when executing actions
==== Connectors have TLS errors when running actions
*Problem*
When executing actions, a connector gets a TLS socket error when connecting to
the server.
A connector gets a TLS socket error when connecting to the server to run an action.
*Solution*
Configuration options are available to specialize connections to TLS servers,
including ignoring server certificate validation, and providing certificate
authority data to verify servers using custom certificates. For more details,
see <<action-settings,Action settings>>.
including ignoring server certificate validation and providing certificate
authority data to verify servers using custom certificates. For more details,
see <<action-settings>>.
[float]
[[rules-long-execution-time]]
[[rules-long-run-time]]
==== Rules take a long time to run
*Problem*
Rules are taking a long time to execute and are impacting the overall health of your deployment.
Rules are taking a long time to run and are impacting the overall health of your deployment.
[IMPORTANT]
==============================================
By default, only users with a `superuser` role can query the experimental[] {kib} event log because it is a system index. To enable additional users to execute this query, assign `read` privileges to the `.kibana-event-log*` index.
By default, only users with a `superuser` role can query the experimental[] {kib} event log because it is a system index. To enable additional users to run this query, assign `read` privileges to the `.kibana-event-log*` index.
==============================================
*Solution*
@ -87,9 +86,9 @@ image::images/rule-details-timeout-error.png[Rule details page with timeout erro
If you want your rules to run longer, update the `xpack.alerting.rules.run.timeout` configuration in your <<alert-settings>>. You can also target a specific rule type by using `xpack.alerting.rules.run.ruleTypeOverrides`.
Rules that consistently run longer than their <<create-edit-rules, check interval>> may produce unexpected results. If the average run duration, visible on the <<rule-details,details page>>, is greater than the check interval, consider increasing the check interval.
Rules that consistently run longer than their <<create-edit-rules,check interval>> may produce unexpected results. If the average run duration, visible on the <<rule-details,details page>>, is greater than the check interval, consider increasing the check interval.
To get all long-running rules, you can query for a list of rule ids, bucketed by their execution times:
To get all long-running rules, you can query for a list of rule ids, bucketed by their run times:
[source,console]
--------------------------------------------------
@ -160,9 +159,9 @@ GET /.kibana-event-log*/_search
--------------------------------------------------
// TEST
<1> This queries for rules executed in the last day. Update the values of `lte` and `gte` to query over a different time range.
<2> Use `event.provider: actions` to query for long-running action executions.
<3> Execution durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
<1> This queries for rules run in the last day. Update the values of `lte` and `gte` to query over a different time range.
<2> Use `event.provider: actions` to query for long-running actions.
<3> Run durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
<4> This interval buckets the `event.duration_in_seconds` runtime field into 1 second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets `event.duration` and use nanoseconds for the interval.
<5> This retrieves the top 10 rule ids for this duration interval. Update this value to retrieve more rule ids.
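The conversion and bucketing described in callouts 3 and 4 can be sketched outside of {es} as follows. This is a minimal Python illustration, not the query itself: the function names and the sample duration are hypothetical, and integer arithmetic is used so that bucket boundaries are exact.

```python
NANOS_PER_SECOND = 1_000_000_000  # durations are stored as nanoseconds


def duration_in_seconds(duration_ns: int) -> float:
    """Convert a stored nanosecond duration to seconds."""
    return duration_ns / NANOS_PER_SECOND


def bucket_start(duration_ns: int, interval_seconds: int = 1) -> int:
    """Return the lower bound, in seconds, of the bucket a duration falls into."""
    interval_ns = interval_seconds * NANOS_PER_SECOND
    return (duration_ns // interval_ns) * interval_seconds


# Hypothetical sample: a run that took 30.5 seconds.
sample_ns = 30_500_000_000
print(duration_in_seconds(sample_ns))  # 30.5
print(bucket_start(sample_ns))         # 30
```

With a 1 second interval, a 30.5 second run lands in the bucket that starts at 30. Passing a larger `interval_seconds` coarsens the buckets in the same way that changing the interval in the aggregation does.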
@ -237,10 +236,10 @@ This query returns the following:
}
}
--------------------------------------------------
<1> Most rule execution durations fall within the first bucket (0 - 1 seconds).
<2> A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to execute.
<1> Most run durations fall within the first bucket (0 - 1 seconds).
<2> A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to run.
Use the <<get-rule-api,Get Rule API>> to retrieve additional information about rules that take a long time to execute.
Use the <<get-rule-api,get rule API>> to retrieve additional information about rules that take a long time to run.
[float]
[[rule-cannot-decrypt-api-key]]
@ -248,11 +247,11 @@ Use the <<get-rule-api,Get Rule API>> to retrieve additional information about r
*Problem*:
The rule fails to execute and has an `Unable to decrypt attribute "apiKey"` error.
The rule fails to run and has an `Unable to decrypt attribute "apiKey"` error.
*Solution*:
This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value used to create the rule does not match the value used during rule execution. Depending on the scenario, there are different ways to solve this problem:
This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value used to create the rule does not match the value used when the rule runs. Depending on the scenario, there are different ways to solve this problem:
[cols="2*<"]
|===

@ -6,15 +6,16 @@ experimental[]
Use the event log index to determine:
* Whether a rule successfully ran but its associated actions did not
* Whether a rule ran successfully but its associated actions did not
* Whether a rule was ever activated
* Additional information about rule execution errors
* Duration times for rule and action executions
* Additional information about errors when the rule ran
* Run durations for the rules and actions
[float]
==== Example Event Log Queries
==== Example event log queries
The following event log query looks at all events related to a specific rule id:
Event log query to look at all event related to a specific rule id:
[source, txt]
--------------------------------------------------
GET /.kibana-event-log*/_search
@ -77,7 +78,9 @@ GET /.kibana-event-log*/_search
}
--------------------------------------------------
Event log query to look at all events related to executing a rule or action. These events include duration.
The following event log query looks at all events related to running a rule or
action. These events include duration:
[source, txt]
--------------------------------------------------
GET /.kibana-event-log*/_search
@ -124,8 +127,10 @@ GET /.kibana-event-log*/_search
}
--------------------------------------------------
Event log query to look at the errors.
You should see an `error.message` property in that event, with a message from the action executor that might provide more detail on why the action encountered an error:
The following event log query looks at the errors. You should see an
`error.message` property in that event, with a message that might provide more
details about why the action encountered an error:
[source, txt]
--------------------------------------------------
{
@ -150,7 +155,9 @@ You should see an `error.message` property in that event, with a message from th
}
--------------------------------------------------
And see the errors for the rules you might provide the next search query:
You might also see the errors for the rules, which can use in the next search
query. For example:
[source, txt]
--------------------------------------------------
{

@ -15,9 +15,10 @@ image::user/alerting/images/email-connector-test.png[Rule management page with t
image::user/alerting/images/teams-connector-test.png[Five clauses define the condition to detect]
[float]
==== experimental[] Troubleshooting Connectors with `kbn-action` tool
==== experimental[] Troubleshooting connectors with the `kbn-action` tool
Executing an Email action via https://github.com/pmuellr/kbn-action[kbn-action]. In this example, is using a cloud deployment of the stack:
You can run an email action via https://github.com/pmuellr/kbn-action[kbn-action].
In this example, it is a Cloud deployment of the {stack}:
[source, txt]
--------------------------------------------------
@ -44,7 +45,8 @@ $ kbn-action ls
}
]
--------------------------------------------------
and then execute this:
You can then run the following test:
[source, txt]
--------------------------------------------------