[[ml-configuring-alerts]]
= Generating alerts for {anomaly-jobs}
:frontmatter-description: Create {anomaly-detect} alert and {anomaly-jobs} health rules.
:frontmatter-tags-products: [ml, alerting]
:frontmatter-tags-content-type: [how-to]
:frontmatter-tags-user-goals: [configure]

{kib} {alert-features} include support for {ml} rules, which run scheduled
checks for anomalies in one or more {anomaly-jobs} or check the health of the
job with certain conditions. If the conditions of the rule are met, an alert is
created and the associated action is triggered. For example, you can create a
rule to check an {anomaly-job} every fifteen minutes for critical anomalies and
to notify you in an email. To learn more about {kib} {alert-features}, refer to
{kibana-ref}/alerting-getting-started.html#alerting-getting-started[Alerting].

The following {ml} rules are available:

{anomaly-detect-cap} alert::
Checks if the {anomaly-job} results contain anomalies that match the rule
conditions.

{anomaly-jobs-cap} health::
Monitors job health and alerts if an operational issue occurs that might
prevent the job from detecting anomalies.

TIP: If you have created rules for specific {anomaly-jobs}, {anomaly-jobs}
health rules are ideal for monitoring whether those jobs work as expected.

In *{stack-manage-app} > {rules-ui}*, you can create both types of {ml} rules.
In the *{ml-app}* app, you can create only {anomaly-detect} alert rules; create
them from the {anomaly-job} wizard after you start the job or from the
{anomaly-job} list.

[[creating-anomaly-alert-rules]]
== {anomaly-detect-cap} alert rules

When you create an {anomaly-detect} alert rule, you must select the job that
the rule applies to.

You must also select a type of {ml} result. In particular, you can create rules
based on bucket, record, or influencer results.

[role="screenshot"]
image::images/ml-anomaly-alert-severity.png["Selecting result type, severity, and test interval", 500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

For each rule, you can configure the `anomaly_score` that triggers the action.
The `anomaly_score` indicates the significance of a given anomaly compared to
previous anomalies. The default severity threshold is 75, which means every
anomaly with an `anomaly_score` of 75 or higher triggers the associated action.

You can select whether you want to include interim results. Interim results are
created by the {anomaly-job} before a bucket is finalized. These results might
disappear after the bucket is fully processed. Include interim results if you
want to be notified earlier about a potential anomaly even if it might be a
false positive. If you want to get notified only about anomalies of fully
processed buckets, do not include interim results.

You can also configure advanced settings. _Lookback interval_ sets an interval
that is used to query previous anomalies during each condition check. By
default, its value is derived from the bucket span of the job and the query
delay of the {dfeed}. Setting the lookback interval lower than the default
value is not recommended, as it might result in missed anomalies. _Number of
latest buckets_ sets how many buckets to check to obtain the highest anomaly
from all the anomalies that are found during the _Lookback interval_. An alert
is created based on the anomaly with the highest anomaly score from the most
anomalous bucket.

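The following sketch illustrates how these settings interact during a single
condition check. It is a conceptual illustration only, not the rule's actual
implementation; the result fields and the helper name are assumptions.

[source,python]
----
# Conceptual sketch: how severity, interim results, the lookback interval, and
# the number of latest buckets could combine to pick the triggering anomaly.
from dataclasses import dataclass

@dataclass
class BucketResult:
    timestamp: int          # bucket start time (epoch ms)
    anomaly_score: float    # normalized score, 0-100
    is_interim: bool        # True while the bucket is not yet finalized

def pick_triggering_anomaly(results, severity=75, include_interim=False,
                            lookback_start=0, top_n_buckets=1):
    """Return the result that would trigger the alert, or None."""
    candidates = [
        r for r in results
        if r.timestamp >= lookback_start            # within the lookback interval
        and (include_interim or not r.is_interim)   # optionally skip interim results
        and r.anomaly_score >= severity             # at or above the severity threshold
    ]
    # Only the latest N buckets are considered when looking for the top anomaly.
    latest_buckets = sorted({r.timestamp for r in candidates})[-top_n_buckets:]
    candidates = [r for r in candidates if r.timestamp in latest_buckets]
    # The alert is based on the highest anomaly score from the most anomalous bucket.
    return max(candidates, key=lambda r: r.anomaly_score, default=None)
----
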
You can also test the configured conditions against your existing data and check
the sample results by providing a valid interval for your data. The generated
preview contains the number of potentially created alerts during the relative
time range you defined.

TIP: You must also provide a _check interval_ that defines how often to
evaluate the rule conditions. It is recommended to select an interval that is
close to the bucket span of the job.

As the last step in the rule creation process, define its <<ml-configuring-alert-actions,actions>>.

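If you prefer to script rule creation rather than use the UI, you can also
create rules through the {kib} alerting REST API. The following sketch is
illustrative only: the rule type ID, the `consumer` value, and the `params`
field names are assumptions based on the settings described above, so verify
them against a rule that you saved from the UI in your version.

[source,python]
----
# Hedged sketch: creating an anomaly detection alert rule through the Kibana
# alerting API. The rule type ID and "params" field names are assumptions;
# inspect a rule created in the UI for the exact schema.
import requests

KIBANA_URL = "http://localhost:5601"    # assumption: local Kibana instance
AUTH = ("elastic", "changeme")          # assumption: basic authentication

rule = {
    "name": "cpu-anomalies-critical",
    "rule_type_id": "xpack.ml.anomaly_detection_alert",  # assumed rule type ID
    "consumer": "alerts",                # assumed consumer
    "schedule": {"interval": "15m"},     # the check interval
    "params": {                          # illustrative parameter names
        "jobSelection": {"jobIds": ["cpu_usage_job"]},
        "resultType": "bucket",
        "severity": 75,
        "includeInterim": False,
        "lookbackInterval": None,        # null = derive from bucket span and query delay
        "topNBuckets": 1,
    },
    "actions": [],                       # actions are covered in the Actions section below
}

response = requests.post(
    f"{KIBANA_URL}/api/alerting/rule",
    json=rule,
    auth=AUTH,
    headers={"kbn-xsrf": "ml-rules"},    # required by Kibana HTTP APIs
)
response.raise_for_status()
print(response.json()["id"])             # the ID of the newly created rule
----
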
[[creating-anomaly-jobs-health-rules]]
== {anomaly-jobs-cap} health rules

When you create an {anomaly-jobs} health rule, you must select the job or group
that the rule applies to. If you assign more jobs to the group, they are
included the next time the rule conditions are checked.

You can also use the wildcard character (`*`) to apply the rule to all your
jobs. Jobs created after the rule are automatically included. You can exclude
jobs that are not critically important by using the _Exclude_ field.

Enable the health check types that you want to apply. All checks are enabled by
default. At least one check must be enabled to create the rule. The following
health checks are available:

_Datafeed is not started_::
Notifies if the corresponding {dfeed} of the job is not started but the job is
in an opened state. The notification message recommends the necessary
actions to solve the error.
_Model memory limit reached_::
Notifies if the model memory status of the job reaches the soft or hard model
memory limit. Optimize your job by following
<<detector-configuration,these guidelines>> or consider
<<set-model-memory-limit,amending the model memory limit>>.
_Data delay has occurred_::
Notifies when the job missed some data. You can define the threshold for the
number of missing documents that triggers an alert by setting
_Number of documents_. You can control the lookback interval for checking
delayed data with _Time interval_. Refer to the
<<ml-delayed-data-detection>> page to see what to do about delayed data.
_Errors in job messages_::
Notifies when the job messages contain error messages. Review the
notification; it contains the error messages, the corresponding job IDs, and
recommendations on how to fix the issue. This check looks for job errors
that occur after the rule is created; it does not look at historic behavior.

[role="screenshot"]
|
|
image::images/ml-health-check-config.png["Selecting health checkers",500]
|
|
// NOTE: This is an autogenerated screenshot. Do not edit it directly.
|
|
|
|
TIP: You must also provide a _check interval_ that defines how often to
|
|
evaluate the rule conditions. It is recommended to select an interval that is
|
|
close to the bucket span of the job.
|
|
|
|
As the last step in the rule creation process, define its actions.
|
|
|
|
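As with {anomaly-detect} alert rules, you can create {anomaly-jobs} health
rules through the {kib} alerting REST API by using the same request shape as
in the earlier sketch. A possible `params` block is shown below; the rule type
ID and all field names are assumptions, so verify them against a rule that you
saved from the UI.

[source,python]
----
# Hedged sketch: a possible "params" block for an anomaly detection jobs health
# rule (rule type ID assumed to be "xpack.ml.anomaly_detection_jobs_health").
# All field names are illustrative assumptions.
health_rule_params = {
    "includeJobs": {"jobIds": ["*"]},            # `*` applies the rule to all jobs
    "excludeJobs": {"jobIds": ["sandbox_job"]},  # jobs that are not critical
    "testsConfig": {                             # enable or disable individual checks
        "datafeed": {"enabled": True},           # Datafeed is not started
        "mml": {"enabled": True},                # Model memory limit reached
        "delayedData": {                         # Data delay has occurred
            "enabled": True,
            "docsCount": 1,                      # Number of documents threshold
            "timeInterval": None,                # Time interval; None = default lookback
        },
        "errorMessages": {"enabled": True},      # Errors in job messages
    },
}
----
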
[[ml-configuring-alert-actions]]
== Actions

You can optionally send notifications when the rule conditions are met and when
they are no longer met. In particular, these rules support:

* alert summaries
* actions that run when the anomaly score matches the conditions (for {anomaly-detect} alert rules)
* actions that run when an issue is detected (for {anomaly-jobs} health rules)
* recovery actions that run when the conditions are no longer met

Each action uses a connector, which stores connection information for a {kib}
service or supported third-party integration, depending on where you want to
send the notifications. For example, you can use a Slack connector to send a
message to a channel. Or you can use an index connector that writes a JSON
object to a specific index. For details about creating connectors, refer to
{kibana-ref}/action-types.html[Connectors].

After you select a connector, you must set the action frequency. You can choose
to create a summary of alerts on each check interval or on a custom interval.
For example, send Slack notifications that summarize the new, ongoing, and
recovered alerts:

[role="screenshot"]
image::images/ml-anomaly-alert-action-summary.png["Adding an alert summary action to the rule",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

TIP: If you choose a custom action interval, it cannot be shorter than the
rule's check interval.

Alternatively, you can set the action frequency such that actions run for each
alert. Choose how often the action runs (at each check interval, only when the
alert status changes, or at a custom action interval). For {anomaly-detect}
alert rules, you must also choose whether the action runs when the anomaly score
matches the condition or when the alert recovers:

[role="screenshot"]
image::images/ml-anomaly-alert-action-score-matched.png["Adding an action for each alert in the rule",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

In {anomaly-jobs} health rules, choose whether the action runs when the issue is
detected or when it is recovered:

[role="screenshot"]
image::images/ml-health-check-action.png["Adding an action for each alert in the rule",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

You can further refine the rule by specifying that actions run only when they
match a KQL query or when an alert occurs within a specific time frame.

There is a set of variables that you can use to customize the notification
messages for each action. Click the icon above the message text box to get the
list of variables or refer to <<action-variables>>. For example:

[role="screenshot"]
image::images/ml-anomaly-alert-messages.png["Customizing your message",500]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

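When you define actions through the alerting API instead of the UI, the same
kind of message can be provided in the action parameters. The sketch below is
illustrative: the connector ID is a placeholder, and the action group and
`frequency` field names are assumptions; only the `{{context.*}}` variables
come from the list in <<action-variables>>.

[source,python]
----
# Illustrative sketch: an action that posts a customized message through a
# Slack connector when the anomaly score matches the rule conditions. The
# connector ID, action group, and frequency fields are assumptions.
slack_action = {
    "id": "my-slack-connector-id",       # placeholder connector ID
    "group": "anomaly_score_match",      # assumed group for matched anomaly scores
    "params": {
        "message": (
            "[{{rule.name}}] {{context.message}}\n"
            "Jobs: {{context.jobIds}}\n"
            "Top anomaly score: {{context.score}} at {{context.timestampIso8601}}\n"
            "Open in Anomaly Explorer: {{context.anomalyExplorerUrl}}"
        )
    },
    "frequency": {                       # assumed action frequency fields
        "summary": False,                # per-alert action rather than a summary
        "notify_when": "onActionGroupChange",  # only when the alert status changes
        "throttle": None,
    },
}

# The action could then be appended to the rule body from the earlier sketch:
# rule["actions"].append(slack_action)
----
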
After you save the configurations, the rule appears in the
*{stack-manage-app} > {rules-ui}* list, where you can check its status and see
an overview of its configuration information.

When an alert occurs for an {anomaly-detect} alert rule, it always has the same
name as the job ID of the associated {anomaly-job} that triggered it. You can
review how these alerts correlate with the {anomaly-detect} results in the
**Anomaly explorer** by using the **Anomaly timeline** swimlane and the
**Alerts** panel.

If necessary, you can snooze rules to prevent them from generating actions. For
more details, refer to
{kibana-ref}/create-and-manage-rules.html#controlling-rules[Snooze and disable rules].

[[action-variables]]
== Action variables

The following variables are specific to the {ml} rule types. An asterisk (`*`)
marks the variables that you can use in actions related to recovered alerts.

You can also specify {kibana-ref}/rule-action-variables.html[variables common to all rules].

[[anomaly-alert-action-variables]]
=== {anomaly-detect-cap} alert action variables

Every {anomaly-detect} alert has the following action variables:

`context`.`anomalyExplorerUrl` ^*^::
URL to open in the Anomaly Explorer.

`context`.`isInterim`::
Indicates if top hits contain interim results.

`context`.`jobIds` ^*^::
List of job IDs that triggered the alert.

`context`.`message` ^*^::
A preconstructed message for the alert.

`context`.`score`::
Anomaly score at the time of the notification action.

`context`.`timestamp`::
The bucket timestamp of the anomaly.

`context`.`timestampIso8601`::
The bucket timestamp of the anomaly in ISO8601 format.

`context`.`topInfluencers`::
The list of top influencers.
+
.Properties of `context.topInfluencers`
[%collapsible%open]
====
`influencer_field_name`:::
The field name of the influencer.

`influencer_field_value`:::
The entity that influenced, contributed to, or was to blame for the anomaly.

`score`:::
The influencer score. A normalized score between 0-100, which shows the
influencer's overall contribution to the anomalies.
====

`context`.`topRecords`::
The list of top records.
+
.Properties of `context.topRecords`
[%collapsible%open]
====
`actual`:::
The actual value for the bucket.

`by_field_value`:::
The value of the by field.

`field_name`:::
Certain functions require a field to operate on, for example, `sum()`. For those
functions, this value is the name of the field to be analyzed.

`function`:::
The function in which the anomaly occurs, as specified in the detector
configuration. For example, `max`.

`over_field_name`:::
The field used to split the data.

`partition_field_value`:::
The field used to segment the analysis.

`score`:::
A normalized score between 0-100, which is based on the probability of the
anomalousness of this record.

`typical`:::
The typical value for the bucket, according to analytical modeling.
====

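For illustration, a single element of `context.topRecords` might look like the
following; the values are invented and only the field names listed above are
meaningful.

[source,python]
----
# Illustrative only: one possible element of context.topRecords, using the
# documented field names with invented values.
top_record_example = {
    "function": "max",                     # detector function
    "field_name": "system.cpu.total.pct",  # field the function operates on
    "actual": 0.98,                        # actual value for the bucket
    "typical": 0.21,                       # typical value for the bucket
    "score": 92.4,                         # normalized record score, 0-100
}
----
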
[[anomaly-jobs-health-action-variables]]
=== {anomaly-jobs-cap} health action variables

Every health check has two main variables: `context.message` and
`context.results`. The properties of `context.results` may vary based on the
type of check. You can find the possible properties for all the checks below.

==== _Datafeed is not started_

`context.message` ^*^::
A preconstructed message for the alert.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`datafeed_id` ^*^:::
The {dfeed} identifier.

`datafeed_state` ^*^:::
The state of the {dfeed}. It can be `starting`, `started`,
`stopping`, or `stopped`.

`job_id` ^*^:::
The job identifier.

`job_state` ^*^:::
The state of the job. It can be `opening`, `opened`, `closing`,
`closed`, or `failed`.
====

==== _Model memory limit reached_

`context.message` ^*^::
A preconstructed message for the rule.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`job_id` ^*^:::
The job identifier.

`memory_status` ^*^:::
The status of the mathematical model. It can have one of the following values:

* `soft_limit`: The model used more than 60% of the configured memory limit and
older unused models will be pruned to free up space. In categorization jobs, no
further category examples will be stored.
* `hard_limit`: The model used more space than the configured memory limit. As a
result, not all incoming data was processed.

The `memory_status` is `ok` for recovered alerts.

`model_bytes` ^*^:::
The number of bytes of memory used by the models.

`model_bytes_exceeded` ^*^:::
The number of bytes over the high limit for memory usage at the last allocation
failure.

`model_bytes_memory_limit` ^*^:::
The upper limit for model memory usage.

`log_time` ^*^:::
The timestamp of the model size statistics according to server time. Time
formatting is based on the {kib} settings.

`peak_model_bytes` ^*^:::
The peak number of bytes of memory ever used by the model.
====

==== _Data delay has occurred_

`context.message` ^*^::
A preconstructed message for the rule.

`context.results`::
For recovered alerts, `context.results` is either empty (when there is no
delayed data) or the same as for an active alert (when the number of missing
documents is less than the _Number of documents_ threshold set by the user).
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`annotation` ^*^:::
The annotation corresponding to the data delay in the job.

`end_timestamp` ^*^:::
Timestamp of the latest finalized buckets with missing documents. Time
formatting is based on the {kib} settings.

`job_id` ^*^:::
The job identifier.

`missed_docs_count` ^*^:::
The number of missed documents.
====

==== _Errors in job messages_

`context.message` ^*^::
A preconstructed message for the rule.

`context.results`::
Contains the following properties:
+
.Properties of `context.results`
[%collapsible%open]
====
`timestamp`:::
The timestamp of the job message.

`job_id`:::
The job identifier.

`message`:::
The error message.

`node_name`:::
The name of the node that runs the job.
====