mirror of
https://github.com/elastic/kibana.git
synced 2025-04-24 17:59:23 -04:00
288 lines
9.9 KiB
Text
288 lines
9.9 KiB
Text
[[alerting-common-issues]]
|
|
=== Common Issues
|
|
|
|
This page describes how to resolve common problems you might encounter with Alerting.
|
|
|
|
[float]
|
|
[[rules-small-check-interval-run-late]]
|
|
==== Rules with small check intervals run late
|
|
|
|
*Problem*
|
|
|
|
Rules with a small check interval, such as every two seconds, run later than scheduled.
|
|
|
|
*Solution*
|
|
|
|
Rules run as background tasks at a cadence defined by their *check interval*.
|
|
When a Rule *check interval* is smaller than the Task Manager <<task-manager-settings,`poll_interval`>>, the rule will run late.
|
|
|
|
Either tweak the <<task-manager-settings,{kib} Task Manager settings>> or increase the *check interval* of the rules in question.
|
|
|
|
For more details, see <<task-manager-health-scheduled-tasks-small-schedule-interval-run-late>>.
|
|
|
|
|
|
[float]
|
|
[[scheduled-rules-run-late]]
|
|
==== Rules with the inconsistent cadence
|
|
|
|
*Problem*
|
|
|
|
Scheduled rules run at an inconsistent cadence, often running late.
|
|
|
|
Actions run long after the status of a rule changes, sending a notification of the change too late.
|
|
|
|
*Solution*
|
|
|
|
Rules and actions run as background tasks by each {kib} instance at a default rate of ten tasks every three seconds.
|
|
When diagnosing issues related to alerting, focus on the tasks that begin with `alerting:` and `actions:`.
|
|
|
|
Alerting tasks always begin with `alerting:`. For example, the `alerting:.index-threshold` tasks back the <<rule-type-index-threshold, index threshold stack rule>>.
|
|
Action tasks always begin with `actions:`. For example, the `actions:.index` tasks back the <<index-action-type, index action>>.
|
|
|
|
For more details on monitoring and diagnosing tasks in Task Manager, refer to <<task-manager-health-monitoring>>.
|
|
|
|
[float]
|
|
[[connector-tls-settings]]
|
|
==== Connectors have TLS errors when running actions
|
|
|
|
*Problem*
|
|
|
|
A connector gets a TLS socket error when connecting to the server to run an action.
|
|
|
|
*Solution*
|
|
|
|
Configuration options are available to specialize connections to TLS servers,
|
|
including ignoring server certificate validation and providing certificate
|
|
authority data to verify servers using custom certificates. For more details,
|
|
see <<action-settings>>.
|
|
|
|
[float]
|
|
[[rules-long-run-time]]
|
|
==== Rules take a long time to run
|
|
|
|
*Problem*
|
|
|
|
Rules are taking a long time to run and are impacting the overall health of your deployment.
|
|
|
|
[IMPORTANT]
|
|
==============================================
|
|
By default, only users with a `superuser` role can query the experimental[] {kib} event log because it is a system index. To enable additional users to run this query, assign `read` privileges to the `.kibana-event-log*` index.
|
|
==============================================
|
|
|
|
*Solution*
|
|
|
|
By default, rules have a `5m` timeout. Rules that run longer than this timeout are automatically cancelled to prevent them from consuming too much of {kib}'s resources. Alerts and actions that may have been scheduled before the rule timed out are discarded. When a rule times out, you will see this error in the {kib} logs:
|
|
|
|
[source,sh]
|
|
--------------------------------------------------
|
|
[2022-03-28T13:14:04.062-04:00][WARN ][plugins.taskManager] Cancelling task alerting:.index-threshold "a6ea0070-aec0-11ec-9985-dd576a3fe205" as it expired at 2022-03-28T17:14:03.980Z after running for 05m 10s (with timeout set at 5m).
|
|
--------------------------------------------------
|
|
|
|
and in the <<rule-details,details page>>:
|
|
|
|
[role="screenshot"]
|
|
image::images/rule-details-timeout-error.png[Rule details page with timeout error]
|
|
|
|
If you want your rules to run longer, update the `xpack.alerting.rules.run.timeout` configuration in your <<alert-settings>>. You can also target a specific rule type by using `xpack.alerting.rules.run.ruleTypeOverrides`.
|
|
|
|
Rules that consistently run longer than their <<create-edit-rules,check interval>> may produce unexpected results. If the average run duration, visible on the <<rule-details,details page>>, is greater than the check interval, consider increasing the check interval.
|
|
|
|
To get all long-running rules, you can query for a list of rule ids, bucketed by their run times:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /.kibana-event-log*/_search
|
|
{
|
|
"size": 0,
|
|
"query": {
|
|
"bool": {
|
|
"filter": [
|
|
{
|
|
"range": {
|
|
"@timestamp": {
|
|
"gte": "now-1d", <1>
|
|
"lte": "now"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"term": {
|
|
"event.action": {
|
|
"value": "execute"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"term": {
|
|
"event.provider": {
|
|
"value": "alerting" <2>
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
},
|
|
"runtime_mappings": { <3>
|
|
"event.duration_in_seconds": {
|
|
"type": "double",
|
|
"script": {
|
|
"source": "emit(doc['event.duration'].value / 1E9)"
|
|
}
|
|
}
|
|
},
|
|
"aggs": {
|
|
"ruleIdsByExecutionDuration": {
|
|
"histogram": {
|
|
"field": "event.duration_in_seconds",
|
|
"min_doc_count": 1,
|
|
"interval": 1 <4>
|
|
},
|
|
"aggs": {
|
|
"ruleId": {
|
|
"nested": {
|
|
"path": "kibana.saved_objects"
|
|
},
|
|
"aggs": {
|
|
"ruleId": {
|
|
"terms": {
|
|
"field": "kibana.saved_objects.id",
|
|
"size": 10 <5>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST
|
|
|
|
<1> This queries for rules run in the last day. Update the values of `lte` and `gte` to query over a different time range.
|
|
<2> Use `event.provider: actions` to query for long-running actions.
|
|
<3> Run durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
|
|
<4> This interval buckets the `event.duration_in_seconds` runtime field into 1 second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets `event.duration` and use nanoseconds for the interval.
|
|
<5> This retrieves the top 10 rule ids for this duration interval. Update this value to retrieve more rule ids.
|
|
|
|
This query returns the following:
|
|
|
|
[source,json]
|
|
--------------------------------------------------
|
|
{
|
|
"took" : 322,
|
|
"timed_out" : false,
|
|
"_shards" : {
|
|
"total" : 1,
|
|
"successful" : 1,
|
|
"skipped" : 0,
|
|
"failed" : 0
|
|
},
|
|
"hits" : {
|
|
"total" : {
|
|
"value" : 326,
|
|
"relation" : "eq"
|
|
},
|
|
"max_score" : null,
|
|
"hits" : [ ]
|
|
},
|
|
"aggregations" : {
|
|
"ruleIdsByExecutionDuration" : {
|
|
"buckets" : [
|
|
{
|
|
"key" : 0.0, <1>
|
|
"doc_count" : 320,
|
|
"ruleId" : {
|
|
"doc_count" : 320,
|
|
"ruleId" : {
|
|
"doc_count_error_upper_bound" : 0,
|
|
"sum_other_doc_count" : 0,
|
|
"buckets" : [
|
|
{
|
|
"key" : "1923ada0-a8f3-11eb-a04b-13d723cdfdc5",
|
|
"doc_count" : 140
|
|
},
|
|
{
|
|
"key" : "15415ecf-cdb0-4fef-950a-f824bd277fe4",
|
|
"doc_count" : 130
|
|
},
|
|
{
|
|
"key" : "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2",
|
|
"doc_count" : 50
|
|
}
|
|
]
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"key" : 30.0, <2>
|
|
"doc_count" : 6,
|
|
"ruleId" : {
|
|
"doc_count" : 6,
|
|
"ruleId" : {
|
|
"doc_count_error_upper_bound" : 0,
|
|
"sum_other_doc_count" : 0,
|
|
"buckets" : [
|
|
{
|
|
"key" : "41893910-6bca-11eb-9e0d-85d233e3ee35",
|
|
"doc_count" : 6
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
<1> Most run durations fall within the first bucket (0 - 1 seconds).
|
|
<2> A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to run.
|
|
|
|
Use the <<get-rule-api,get rule API>> to retrieve additional information about rules that take a long time to run.
|
|
|
|
[float]
|
|
[[rule-cannot-decrypt-api-key]]
|
|
==== Rule cannot decrypt apiKey
|
|
|
|
*Problem*:
|
|
|
|
The rule fails to run and has an `Unable to decrypt attribute "apiKey"` error.
|
|
|
|
*Solution*:
|
|
|
|
This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value used to create the rule does not match the value used when the rule runs. Depending on the scenario, there are different ways to solve this problem:
|
|
|
|
[cols="2*<"]
|
|
|===
|
|
|
|
| If the value in `xpack.encryptedSavedObjects.encryptionKey` was manually changed, and the previous encryption key is still known.
|
|
| Ensure any previous encryption key is included in the keys used for <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only>>.
|
|
|
|
| If another {kib} instance with a different encryption key connects to the cluster.
|
|
| The other {kib} instance might be trying to run the rule using a different encryption key than what the rule was created with. Ensure the encryption keys among all the {kib} instances are the same, and setting <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only keys>> for previously used encryption keys.
|
|
|
|
| If other scenarios don't apply.
|
|
| Generate a new API key for the rule by disabling then enabling the rule.
|
|
|
|
|===
|
|
|
|
[float]
|
|
[[known-issue-upgrade-rule]]
|
|
==== Rules stop running after upgrade
|
|
|
|
*Problem*:
|
|
|
|
Alerting rules that were created or edited in 8.2 stop running after you upgrade
|
|
to 8.3.0 or 8.3.1. The following error occurs:
|
|
|
|
[source,text]
|
|
----
|
|
<rule-type>:<UUID>: execution failed - security_exception: [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges], caused by: ""
|
|
----
|
|
|
|
*Solution*:
|
|
|
|
Upgrade to 8.3.2 or later releases to avoid the problem. To fix failing rules,
|
|
go to *{stack-manage-app} > {rac-ui}* and generate new API keys by selecting
|
|
**Update API key** from the actions menu. For more details about API key
|
|
authorization, refer to <<alerting-authorization>>.
|