mirror of
https://github.com/elastic/kibana.git
synced 2025-04-24 01:38:56 -04:00
[DOCS] Replace execution terminology in Alerting (#131357)
parent 542b381fa5
commit 1591bfba24
10 changed files with 69 additions and 59 deletions
|
@ -92,7 +92,7 @@ URLs can use both the `ssl` and `smtp` options.
|
|||
+
|
||||
No other URL values should be part of this URL, including paths,
|
||||
query strings, and authentication information. When an http or smtp request
|
||||
is made as part of executing an action, only the protocol, hostname, and
|
||||
is made as part of running an action, only the protocol, hostname, and
|
||||
port of the URL for that request are used to look up these configuration
|
||||
values.
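The lookup described above can be illustrated with a minimal Python sketch. It only shows which parts of a request URL would be kept (protocol, hostname, and port) and which would be ignored (credentials, path, query string). The function name `lookup_key` and the example hosts are hypothetical and are not part of {kib}.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def lookup_key(url: str) -> Tuple[str, Optional[str], Optional[int]]:
    """Return (protocol, hostname, port) for a URL.

    Hypothetical helper for illustration only: it drops user info,
    paths, and query strings, mirroring the parts the text says are
    used to look up the configuration values.
    """
    parts = urlsplit(url)
    # parts.port is None when the URL does not state a port explicitly.
    return (parts.scheme, parts.hostname, parts.port)


if __name__ == "__main__":
    for candidate in (
        "https://user:secret@example.com:8443/api/hook?x=1",
        "smtp://mail.example.com:587",
    ):
        print(candidate, "->", lookup_key(candidate))
```

Both example URLs reduce to a three-part key, so any path, query string, or authentication information in a configured URL would have no effect on the match.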
|
||||
|
||||
|
@ -188,10 +188,14 @@ For example, `20m`, `24h`, `7d`, `1w`. Default: `60s`.
|
|||
==== Alerting settings
|
||||
|
||||
`xpack.alerting.maxEphemeralActionsPerAlert`::
|
||||
Sets the number of actions that will be executed ephemerally. To use this, enable ephemeral tasks in task manager first with <<task-manager-settings,`xpack.task_manager.ephemeral_tasks.enabled`>>
|
||||
Sets the number of actions that will run ephemerally. To use this, enable
|
||||
ephemeral tasks in task manager first with
|
||||
<<task-manager-settings,`xpack.task_manager.ephemeral_tasks.enabled`>>
|
||||
|
||||
`xpack.alerting.cancelAlertsOnRuleTimeout`::
|
||||
Specifies whether to skip writing alerts and scheduling actions if rule execution is cancelled due to timeout. Default: `true`. This setting can be overridden by individual rule types.
|
||||
Specifies whether to skip writing alerts and scheduling actions if rule
|
||||
processing was cancelled due to a timeout. Default: `true`. This setting can be
|
||||
overridden by individual rule types.
|
||||
|
||||
`xpack.alerting.rules.minimumScheduleInterval.value`::
|
||||
Specifies the minimum schedule interval for rules. This minimum is applied to all rules created or updated after you set this value. The time is formatted as:
|
||||
|
|
|
@ -63,7 +63,7 @@ Rule schedules are defined as an interval between subsequent checks, and can ran
|
|||
|
||||
[IMPORTANT]
|
||||
==============================================
|
||||
The intervals of rule checks in {kib} are approximate. The timing of their execution is affected by factors such as the frequency at which tasks are claimed and the task load on the system. See <<alerting-production-considerations, Alerting production considerations>> for more information.
|
||||
The intervals of rule checks in {kib} are approximate. Their timing is affected by factors such as the frequency at which tasks are claimed and the task load on the system. Refer to <<alerting-production-considerations>> for more information.
|
||||
==============================================
|
||||
|
||||
[float]
|
||||
|
@ -82,7 +82,7 @@ The result is a template: all the parameters needed to invoke a service are supp
|
|||
|
||||
In the server monitoring example, the `email` connector type is used, and `server` is mapped to the body of the email, using the template string `CPU on {{server}} is high`.
|
||||
|
||||
When the rule detects the condition, it creates an <<alerting-concepts-alerts, alert>> containing the details of the condition, renders the template with these details such as server name, and executes the action on the {kib} server by invoking the `email` connector type.
|
||||
When the rule detects the condition, it creates an <<alerting-concepts-alerts,alert>> containing the details of the condition, renders the template with these details such as server name, and runs the action on the {kib} server by invoking the `email` connector type.
|
||||
|
||||
image::images/what-is-an-action.svg[Actions are like templates that are rendered when an alert detects a condition]
|
||||
|
||||
|
|
|
@ -62,7 +62,7 @@ Rules and connectors are isolated to the {kib} space in which they were created.
|
|||
[[alerting-authorization]]
|
||||
=== Authorization
|
||||
|
||||
Rules are authorized using an <<api-keys, API key>> associated with the last user to edit the rule. This API key captures a snapshot of the user's privileges at the time of edit and is subsequently used to run all background tasks associated with the rule, including condition checks, like {es} queries, and action executions. The following rule actions will re-generate the API key:
|
||||
Rules are authorized using an <<api-keys,API key>> associated with the last user to edit the rule. This API key captures a snapshot of the user's privileges at the time of edit and is subsequently used to run all background tasks associated with the rule, including condition checks like {es} queries and triggered actions. The following rule actions will re-generate the API key:
|
||||
|
||||
* Creating a rule
|
||||
* Enabling a disabled rule
|
||||
|
|
|
@ -52,7 +52,7 @@ Diagnosing these may be difficult - but there may be log messages for error cond
|
|||
=== Use the REST APIs
|
||||
|
||||
There is a rich set of HTTP endpoints to introspect and manage rules and connectors.
|
||||
One of the http endpoints available for actions is the POST <<execute-connector-api,_execute API>>. You can use this to “test” an action. For instance, if you have a server log action created, you can execute it via curling the endpoint:
|
||||
One of the http endpoints available for actions is the POST <<execute-connector-api,_execute API>>. You can use this to “test” an action. For instance, if you have a server log action created, you can run it via curling the endpoint:
|
||||
[source, txt]
|
||||
--------------------------------------------------
|
||||
curl -X POST -k \
|
||||
|
@ -75,7 +75,7 @@ The same REST POST _execute API command will be:
|
|||
kbn-action execute a692dc89-15b9-4a3c-9e47-9fb6872e49ce ‘{"params":{"subject":"hallo","message":"hallo!","to":["me@example.com"]}}’
|
||||
--------------------------------------------------
|
||||
|
||||
The result of this http request (and printed to stdout by https://github.com/pmuellr/kbn-action[kbn-action]) will be data returned by the action execution, along with error messages if errors were encountered.
|
||||
The result of this http request (and printed to stdout by https://github.com/pmuellr/kbn-action[kbn-action]) will be data returned by the action, along with error messages if errors were encountered.
|
||||
|
||||
[float]
|
||||
[[alerting-error-banners]]
|
||||
|
@ -92,8 +92,8 @@ image::images/rules-details-health.png[Rule details page with the errors banner]
|
|||
[[task-manager-diagnostics]]
|
||||
=== Task Manager diagnostics
|
||||
|
||||
Under the hood, *Rules and Connectors* uses a plugin called Task Manager, which handles the scheduling, execution, and error handling of the tasks.
|
||||
This means that failure cases in Rules or Connectors will, at times, be revealed by the Task Manager mechanism, rather than the Rules mechanism.
|
||||
Under the hood, {rules-ui} uses a plugin called Task Manager, which handles the scheduling, running, and error handling of the tasks.
|
||||
This means that failure cases in {rules-ui} will, at times, be revealed by the Task Manager mechanism, rather than the Rules mechanism.
|
||||
|
||||
Task Manager provides a visible status which can be used to diagnose issues and is very well documented in <<task-manager-health-monitoring,health monitoring>> and <<task-manager-troubleshooting,troubleshooting>>.
|
||||
Task Manager uses the `.kibana_task_manager` index, an internal index that contains all the saved objects that represent the tasks in the system.
|
||||
|
|
|
@ -44,7 +44,7 @@ Notify:: This value limits how often actions are repeated when an alert rem
|
|||
[[alerting-concepts-suppressing-duplicate-notifications]]
|
||||
[NOTE]
|
||||
==============================================
|
||||
Since actions are executed per alert, a rule can end up generating a large number of actions. Take the following example where a rule is monitoring three servers every minute for CPU usage > 0.9, and the rule is set to notify **Every time alert is active**:
|
||||
Since actions are triggered per alert, a rule can end up generating a large number of actions. Take the following example where a rule is monitoring three servers every minute for CPU usage > 0.9, and the rule is set to notify **Every time alert is active**:
|
||||
|
||||
* Minute 1: server X123 > 0.9. *One email* is sent for server X123.
|
||||
* Minute 2: X123 and Y456 > 0.9. *Two emails* are sent, one for X123 and one for Y456.
|
||||
|
@ -163,8 +163,8 @@ A rule can have one of the following statuses:
|
|||
|
||||
`active`:: The conditions for the rule have been met, and the associated actions should be invoked.
|
||||
`ok`:: The conditions for the rule have not been met, and the associated actions are not invoked.
|
||||
`error`:: An error was encountered during rule execution.
|
||||
`pending`:: The rule has not yet executed. The rule was either just created, or enabled after being disabled.
|
||||
`error`:: An error was encountered by the rule.
|
||||
`pending`:: The rule has not yet run. The rule was either just created, or enabled after being disabled.
|
||||
`unknown`:: A problem occurred when calculating the status. Most likely, something went wrong with the alerting code.
|
||||
|
||||
[float]
|
||||
|
|
|
@ -26,7 +26,7 @@ Index:: Specifies an *index or data view* and a *time field* that is used for
|
|||
the *time window*.
|
||||
Size:: Specifies the number of documents to pass to the configured actions when
|
||||
the threshold condition is met.
|
||||
{es} query:: Specifies the ES DSL query to execute. The number of documents that
|
||||
{es} query:: Specifies the ES DSL query. The number of documents that
|
||||
match this query is evaluated against the threshold condition. Only the `query`
|
||||
field is used, other DSL fields are not considered.
|
||||
Threshold:: Defines a threshold value and a comparison operator (`is above`,
|
||||
|
@ -81,7 +81,7 @@ image::images/rule-types-es-query-example-action-variable.png[Iterate over hits
|
|||
|
||||
Use the *Test query* feature to verify that your query DSL is valid.
|
||||
|
||||
* Valid queries are executed against the configured *index* using the configured
|
||||
* Valid queries are run against the configured *index* using the configured
|
||||
*time window*. The number of documents that match the query is displayed.
|
||||
+
|
||||
[role="screenshot"]
|
||||
|
@ -95,16 +95,14 @@ image::user/alerting/images/rule-types-es-query-invalid.png[Test {es} query show
|
|||
[float]
|
||||
==== Handling multiple matches of the same document
|
||||
|
||||
This rule type checks for duplication of document matches across rule
|
||||
executions. If you configure the rule with a schedule interval smaller than the
|
||||
time window, and a document matches a query in multiple rule executions, it is
|
||||
alerted on only once.
|
||||
This rule type checks for duplication of document matches across multiple runs.
|
||||
If you configure the rule with a schedule interval smaller than the time window,
|
||||
and a document matches a query in multiple runs, it is alerted on only once.
|
||||
|
||||
The rule uses the timestamp of the matches to avoid alerting on the same match
|
||||
multiple times. The timestamp of the latest match is used for evaluating the
|
||||
rule conditions when the rule is executed. Only matches between the latest
|
||||
timestamp from the previous execution and the actual rule execution are
|
||||
considered.
|
||||
rule conditions when the rule runs. Only matches between the latest timestamp
|
||||
from the previous run and the current run are considered.
|
||||
|
||||
Suppose you have a rule configured to run every minute. The rule uses a time
|
||||
window of 1 hour and checks if there are more than 99 matches for the query. The
|
||||
|
@ -112,16 +110,16 @@ window of 1 hour and checks if there are more than 99 matches for the query. The
|
|||
|
||||
[cols="3*<"]
|
||||
|===
|
||||
| `Execution 1 (0:00)`
|
||||
| `Run 1 (0:00)`
|
||||
| Rule finds 113 matches in the last hour: `113 > 99`
|
||||
| Rule is active and user is alerted.
|
||||
| `Execution 2 (0:01)`
|
||||
| `Run 2 (0:01)`
|
||||
| Rule finds 127 matches in the last hour. 105 of the matches are duplicates that were already alerted on previously, so you actually have 22 matches: `22 !> 99`
|
||||
| No alert.
|
||||
| `Execution 3 (0:02)`
|
||||
| `Run 3 (0:02)`
|
||||
| Rule finds 159 matches in the last hour. 88 of the matches are duplicates that were already alerted on previously, so you actually have 71 matches: `71 !> 99`
|
||||
| No alert.
|
||||
| `Execution 4 (0:03)`
|
||||
| `Run 4 (0:03)`
|
||||
| Rule finds 190 matches in the last hour. 71 of them are duplicates that were already alerted on previously, so you actually have 119 matches: `119 > 99`
|
||||
| Rule is active and user is alerted.
|
||||
|===
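The arithmetic in the table above can be reproduced with a short Python sketch. It is not the rule type's implementation; it only subtracts the already-alerted duplicates from each run's total and compares the remainder with the threshold of 99. The duplicate count of 0 for the first run is an assumption, since the table does not state it.

```python
THRESHOLD = 99

# (label, total matches in the last hour, duplicates already alerted on)
# Values are taken from the table; the 0 for Run 1 is an assumption.
RUNS = [
    ("Run 1 (0:00)", 113, 0),
    ("Run 2 (0:01)", 127, 105),
    ("Run 3 (0:02)", 159, 88),
    ("Run 4 (0:03)", 190, 71),
]


def evaluate(total, duplicates, threshold=THRESHOLD):
    """Return (new_matches, is_active) for one run."""
    new_matches = total - duplicates
    return new_matches, new_matches > threshold


if __name__ == "__main__":
    for label, total, duplicates in RUNS:
        new_matches, is_active = evaluate(total, duplicates)
        outcome = "rule is active" if is_active else "no alert"
        print(f"{label}: {new_matches} new matches, {outcome}")
```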
|
|
@ -52,7 +52,7 @@ In this example, you will use the {kib} <<add-sample-data, sample weblog dataset
|
|||
|
||||
. Open the main menu, then click **Stack Management > Rules and Connectors**.
|
||||
|
||||
. Create a new rule that is checked every four hours and executes actions when the rule status changes.
|
||||
. Create a new rule that is checked every four hours and triggers actions when the rule status changes.
|
||||
+
|
||||
[role="screenshot"]
|
||||
image::user/alerting/images/rule-types-index-threshold-select.png[Choosing an index threshold rule type]
|
||||
|
|
|
@ -40,35 +40,34 @@ When diagnosing issues related to alerting, focus on the tasks that begin with `
|
|||
Alerting tasks always begin with `alerting:`. For example, the `alerting:.index-threshold` tasks back the <<rule-type-index-threshold, index threshold stack rule>>.
|
||||
Action tasks always begin with `actions:`. For example, the `actions:.index` tasks back the <<index-action-type, index action>>.
|
||||
|
||||
For more details on monitoring and diagnosing task execution in Task Manager, see <<task-manager-health-monitoring>>.
|
||||
For more details on monitoring and diagnosing tasks in Task Manager, refer to <<task-manager-health-monitoring>>.
|
||||
|
||||
[float]
|
||||
[[connector-tls-settings]]
|
||||
==== Connectors have TLS errors when executing actions
|
||||
==== Connectors have TLS errors when running actions
|
||||
|
||||
*Problem*
|
||||
|
||||
When executing actions, a connector gets a TLS socket error when connecting to
|
||||
the server.
|
||||
A connector gets a TLS socket error when connecting to the server to run an action.
|
||||
|
||||
*Solution*
|
||||
|
||||
Configuration options are available to specialize connections to TLS servers,
|
||||
including ignoring server certificate validation, and providing certificate
|
||||
authority data to verify servers using custom certificates. For more details,
|
||||
see <<action-settings,Action settings>>.
|
||||
including ignoring server certificate validation and providing certificate
|
||||
authority data to verify servers using custom certificates. For more details,
|
||||
see <<action-settings>>.
|
||||
|
||||
[float]
|
||||
[[rules-long-execution-time]]
|
||||
[[rules-long-run-time]]
|
||||
==== Rules take a long time to run
|
||||
|
||||
*Problem*
|
||||
|
||||
Rules are taking a long time to execute and are impacting the overall health of your deployment.
|
||||
Rules are taking a long time to run and are impacting the overall health of your deployment.
|
||||
|
||||
[IMPORTANT]
|
||||
==============================================
|
||||
By default, only users with a `superuser` role can query the experimental[] {kib} event log because it is a system index. To enable additional users to execute this query, assign `read` privileges to the `.kibana-event-log*` index.
|
||||
By default, only users with a `superuser` role can query the experimental[] {kib} event log because it is a system index. To enable additional users to run this query, assign `read` privileges to the `.kibana-event-log*` index.
|
||||
==============================================
|
||||
|
||||
*Solution*
|
||||
|
@ -87,9 +86,9 @@ image::images/rule-details-timeout-error.png[Rule details page with timeout erro
|
|||
|
||||
If you want your rules to run longer, update the `xpack.alerting.rules.run.timeout` configuration in your <<alert-settings>>. You can also target a specific rule type by using `xpack.alerting.rules.run.ruleTypeOverrides`.
|
||||
|
||||
Rules that consistently run longer than their <<create-edit-rules, check interval>> may produce unexpected results. If the average run duration, visible on the <<rule-details,details page>>, is greater than the check interval, consider increasing the check interval.
|
||||
Rules that consistently run longer than their <<create-edit-rules,check interval>> may produce unexpected results. If the average run duration, visible on the <<rule-details,details page>>, is greater than the check interval, consider increasing the check interval.
|
||||
|
||||
To get all long-running rules, you can query for a list of rule ids, bucketed by their execution times:
|
||||
To get all long-running rules, you can query for a list of rule ids, bucketed by their run times:
|
||||
|
||||
[source,console]
|
||||
--------------------------------------------------
|
||||
|
@ -160,9 +159,9 @@ GET /.kibana-event-log*/_search
|
|||
--------------------------------------------------
|
||||
// TEST
|
||||
|
||||
<1> This queries for rules executed in the last day. Update the values of `lte` and `gte` to query over a different time range.
|
||||
<2> Use `event.provider: actions` to query for long-running action executions.
|
||||
<3> Execution durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
|
||||
<1> This queries for rules run in the last day. Update the values of `lte` and `gte` to query over a different time range.
|
||||
<2> Use `event.provider: actions` to query for long-running actions.
|
||||
<3> Run durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
|
||||
<4> This interval buckets the `event.duration_in_seconds` runtime field into 1 second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets `event.duration` and use nanoseconds for the interval.
|
||||
<5> This retrieves the top 10 rule ids for this duration interval. Update this value to retrieve more rule ids.
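As a rough illustration of callouts 3 and 4, the following Python sketch converts a duration from nanoseconds to seconds and finds the 1 second bucket it would fall into. The function names are hypothetical; in the query itself this work is done by the runtime field and the aggregation, not by client code.

```python
NANOS_PER_SECOND = 1_000_000_000


def duration_in_seconds(duration_ns):
    """Convert a duration stored in nanoseconds to seconds."""
    return duration_ns / NANOS_PER_SECOND


def bucket_start(duration_ns, interval_seconds=1):
    """Return the lower bound, in seconds, of the bucket for a duration.

    Integer arithmetic is used so that bucket boundaries are exact.
    """
    interval_ns = interval_seconds * NANOS_PER_SECOND
    return (duration_ns // interval_ns) * interval_seconds


if __name__ == "__main__":
    sample_ns = 30_400_000_000  # a made-up run that took 30.4 seconds
    print(duration_in_seconds(sample_ns), bucket_start(sample_ns))
```

If runtime fields are unavailable, the same bucketing applies directly to the nanosecond values, with the interval expressed in nanoseconds.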
|
||||
|
||||
|
@ -237,10 +236,10 @@ This query returns the following:
|
|||
}
|
||||
}
|
||||
--------------------------------------------------
|
||||
<1> Most rule execution durations fall within the first bucket (0 - 1 seconds).
|
||||
<2> A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to execute.
|
||||
<1> Most run durations fall within the first bucket (0 - 1 seconds).
|
||||
<2> A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to run.
|
||||
|
||||
Use the <<get-rule-api,Get Rule API>> to retrieve additional information about rules that take a long time to execute.
|
||||
Use the <<get-rule-api,get rule API>> to retrieve additional information about rules that take a long time to run.
|
||||
|
||||
[float]
|
||||
[[rule-cannot-decrypt-api-key]]
|
||||
|
@ -248,11 +247,11 @@ Use the <<get-rule-api,Get Rule API>> to retrieve additional information about r
|
|||
|
||||
*Problem*:
|
||||
|
||||
The rule fails to execute and has an `Unable to decrypt attribute "apiKey"` error.
|
||||
The rule fails to run and has an `Unable to decrypt attribute "apiKey"` error.
|
||||
|
||||
*Solution*:
|
||||
|
||||
This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value used to create the rule does not match the value used during rule execution. Depending on the scenario, there are different ways to solve this problem:
|
||||
This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value used to create the rule does not match the value used when the rule runs. Depending on the scenario, there are different ways to solve this problem:
|
||||
|
||||
[cols="2*<"]
|
||||
|===
|
||||
|
|
|
@ -6,15 +6,16 @@ experimental[]
|
|||
|
||||
Use the event log index to determine:
|
||||
|
||||
* Whether a rule successfully ran but its associated actions did not
|
||||
* Whether a rule ran successfully but its associated actions did not
|
||||
* Whether a rule was ever activated
|
||||
* Additional information about rule execution errors
|
||||
* Duration times for rule and action executions
|
||||
* Additional information about errors when the rule ran
|
||||
* Run durations for the rules and actions
|
||||
|
||||
[float]
|
||||
==== Example Event Log Queries
|
||||
==== Example event log queries
|
||||
|
||||
The following event log query looks at all events related to a specific rule id:
|
||||
|
||||
Event log query to look at all event related to a specific rule id:
|
||||
[source, txt]
|
||||
--------------------------------------------------
|
||||
GET /.kibana-event-log*/_search
|
||||
|
@ -77,7 +78,9 @@ GET /.kibana-event-log*/_search
|
|||
}
|
||||
--------------------------------------------------
|
||||
|
||||
Event log query to look at all events related to executing a rule or action. These events include duration.
|
||||
The following event log query looks at all events related to running a rule or
|
||||
action. These events include duration:
|
||||
|
||||
[source, txt]
|
||||
--------------------------------------------------
|
||||
GET /.kibana-event-log*/_search
|
||||
|
@ -124,8 +127,10 @@ GET /.kibana-event-log*/_search
|
|||
}
|
||||
--------------------------------------------------
|
||||
|
||||
Event log query to look at the errors.
|
||||
You should see an `error.message` property in that event, with a message from the action executor that might provide more detail on why the action encountered an error:
|
||||
The following event log query looks at the errors. You should see an
|
||||
`error.message` property in that event, with a message that might provide more
|
||||
details about why the action encountered an error:
|
||||
|
||||
[source, txt]
|
||||
--------------------------------------------------
|
||||
{
|
||||
|
@ -150,7 +155,9 @@ You should see an `error.message` property in that event, with a message from th
|
|||
}
|
||||
--------------------------------------------------
|
||||
|
||||
And see the errors for the rules you might provide the next search query:
|
||||
You might also see the errors for the rules, which you can use in the next search
query. For example:
|
||||
|
||||
[source, txt]
|
||||
--------------------------------------------------
|
||||
{
|
||||
|
|
|
@ -15,9 +15,10 @@ image::user/alerting/images/email-connector-test.png[Rule management page with t
|
|||
image::user/alerting/images/teams-connector-test.png[Five clauses define the condition to detect]
|
||||
|
||||
[float]
|
||||
==== experimental[] Troubleshooting Connectors with `kbn-action` tool
|
||||
==== experimental[] Troubleshooting connectors with the `kbn-action` tool
|
||||
|
||||
Executing an Email action via https://github.com/pmuellr/kbn-action[kbn-action]. In this example, is using a cloud deployment of the stack:
|
||||
You can run an email action via https://github.com/pmuellr/kbn-action[kbn-action].
|
||||
In this example, it is a Cloud deployment of the {stack}:
|
||||
|
||||
[source, txt]
|
||||
--------------------------------------------------
|
||||
|
@ -44,7 +45,8 @@ $ kbn-action ls
|
|||
}
|
||||
]
|
||||
--------------------------------------------------
|
||||
and then execute this:
|
||||
|
||||
You can then run the following test:
|
||||
|
||||
[source, txt]
|
||||
--------------------------------------------------
|
||||
|
|