[[alerting-common-issues]]
=== Common Issues

This page describes how to resolve common problems you might encounter with Alerting.

[float]
[[rules-small-check-interval-run-late]]
==== Rules with small check intervals run late

*Problem*

Rules with a small check interval, such as every two seconds, run later than scheduled.

*Solution*

Rules run as background tasks at a cadence defined by their *check interval*.
When a rule's *check interval* is smaller than the Task Manager <<task-manager-settings,`poll_interval`>>, the rule runs late.
Either tweak the <<task-manager-settings,{kib} Task Manager settings>> or increase the *check interval* of the rules in question.

For more details, see <<task-manager-health-scheduled-tasks-small-schedule-interval-run-late>>.
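
For example, a minimal `kibana.yml` sketch, assuming you have reviewed the Task Manager settings and can accept the extra polling load (the value is illustrative, not a recommendation):

[source,yaml]
----
# Illustrative value only: poll for ready tasks every second instead of every
# 3 seconds (the default), so rules with very small check intervals are picked
# up sooner. Lower values increase the load Task Manager puts on Elasticsearch.
xpack.task_manager.poll_interval: 1000
----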
[float]
[[scheduled-rules-run-late]]
==== Rules run at an inconsistent cadence

*Problem*

Scheduled rules run at an inconsistent cadence, often running late. Actions run long after the status of a rule changes, sending a notification of the change too late.

*Solution*

Rules and actions run as background tasks by each {kib} instance at a default rate of ten tasks every three seconds.

When diagnosing issues related to alerting, focus on the tasks that begin with `alerting:` and `actions:`. Alerting tasks always begin with `alerting:`. For example, the `alerting:.index-threshold` tasks back the <<rule-type-index-threshold, index threshold stack rule>>. Action tasks always begin with `actions:`. For example, the `actions:.index` tasks back the <<index-action-type, index action>>.

For more details on monitoring and diagnosing tasks in Task Manager, refer to <<task-manager-health-monitoring>>.
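
To see how these tasks are behaving on a given {kib} instance, you can query the Task Manager health API; the host, port, and credentials below are placeholders for your deployment:

[source,sh]
----
# Placeholders: replace the host, port, and credentials with values for your deployment.
curl -s -u "elastic:changeme" "http://localhost:5601/api/task_manager/_health"
----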
[float]
[[connector-tls-settings]]
==== Connectors have TLS errors when running actions

*Problem*

A connector gets a TLS socket error when connecting to the server to run an action.

*Solution*

Configuration options are available to specialize connections to TLS servers,
including ignoring server certificate validation and providing certificate
authority data to verify servers using custom certificates. For more details,
see <<action-settings>>.
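
For example, a `kibana.yml` sketch, assuming the connector target uses a certificate signed by a custom certificate authority; the URL and file path are placeholders, and <<action-settings>> has the authoritative option names:

[source,yaml]
----
# Skip hostname verification but keep certificate chain validation for all connectors.
xpack.actions.ssl.verificationMode: certificate

# Or scope the change to a single host and provide its custom certificate authority.
xpack.actions.customHostSettings:
  - url: "https://webhook.example.com:443"
    ssl:
      certificateAuthoritiesFiles: [ "/path/to/custom-ca.pem" ]
----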
[float]
[[rules-long-run-time]]
==== Rules take a long time to run

*Problem*

Rules are taking a long time to run and are impacting the overall health of your deployment.

[IMPORTANT]
==============================================
By default, only users with a `superuser` role can query the experimental[] {kib} event log because it is a system index. To enable additional users to run this query, assign `read` privileges to the `.kibana-event-log*` index.
==============================================
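
For example, a role sketch that grants that read access (the role name `kibana_event_log_reader` is arbitrary; assign the role to the users who need to run the queries below):

[source,console]
----
PUT /_security/role/kibana_event_log_reader
{
  "indices": [
    {
      "names": [ ".kibana-event-log*" ],
      "privileges": [ "read" ]
    }
  ]
}
----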
*Solution*

By default, rules have a `5m` timeout. Rules that run longer than this timeout are automatically cancelled to prevent them from consuming too much of {kib}'s resources. Alerts and actions that may have been scheduled before the rule timed out are discarded. When a rule times out, you will see this error in the {kib} logs:

[source,sh]
--------------------------------------------------
[2022-03-28T13:14:04.062-04:00][WARN ][plugins.taskManager] Cancelling task alerting:.index-threshold "a6ea0070-aec0-11ec-9985-dd576a3fe205" as it expired at 2022-03-28T17:14:03.980Z after running for 05m 10s (with timeout set at 5m).
--------------------------------------------------

and in the <<rule-details,details page>>:

[role="screenshot"]
image::images/rule-details-timeout-error.png[Rule details page with timeout error]

If you want your rules to run longer, update the `xpack.alerting.rules.run.timeout` configuration in your <<alert-settings>>. You can also target a specific rule type by using `xpack.alerting.rules.run.ruleTypeOverrides`.
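
For example, a `kibana.yml` sketch, assuming a longer global timeout is acceptable and a single rule type needs even more time; the values are illustrative:

[source,yaml]
----
# Illustrative values: raise the default rule timeout for every rule type,
# and give the index threshold rule type even longer.
xpack.alerting.rules.run:
  timeout: '10m'
  ruleTypeOverrides:
    - id: '.index-threshold'
      timeout: '15m'
----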
Rules that consistently run longer than their <<create-edit-rules,check interval>> may produce unexpected results. If the average run duration, visible on the <<rule-details,details page>>, is greater than the check interval, consider increasing the check interval.

To get all long-running rules, you can query for a list of rule ids, bucketed by their run times:

[source,console]
--------------------------------------------------
GET /.kibana-event-log*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1d", <1>
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "event.action": {
              "value": "execute"
            }
          }
        },
        {
          "term": {
            "event.provider": {
              "value": "alerting" <2>
            }
          }
        }
      ]
    }
  },
  "runtime_mappings": { <3>
    "event.duration_in_seconds": {
      "type": "double",
      "script": {
        "source": "emit(doc['event.duration'].value / 1E9)"
      }
    }
  },
  "aggs": {
    "ruleIdsByExecutionDuration": {
      "histogram": {
        "field": "event.duration_in_seconds",
        "min_doc_count": 1,
        "interval": 1 <4>
      },
      "aggs": {
        "ruleId": {
          "nested": {
            "path": "kibana.saved_objects"
          },
          "aggs": {
            "ruleId": {
              "terms": {
                "field": "kibana.saved_objects.id",
                "size": 10 <5>
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST
<1> This queries for rules run in the last day. Update the values of `lte` and `gte` to query over a different time range.
<2> Use `event.provider: actions` to query for long-running actions.
<3> Run durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.
<4> This interval buckets the `event.duration_in_seconds` runtime field into 1 second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets `event.duration` and use nanoseconds for the interval.
<5> This retrieves the top 10 rule ids for this duration interval. Update this value to retrieve more rule ids.

This query returns the following:

[source,json]
--------------------------------------------------
{
  "took" : 322,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 326,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "ruleIdsByExecutionDuration" : {
      "buckets" : [
        {
          "key" : 0.0, <1>
          "doc_count" : 320,
          "ruleId" : {
            "doc_count" : 320,
            "ruleId" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "1923ada0-a8f3-11eb-a04b-13d723cdfdc5",
                  "doc_count" : 140
                },
                {
                  "key" : "15415ecf-cdb0-4fef-950a-f824bd277fe4",
                  "doc_count" : 130
                },
                {
                  "key" : "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2",
                  "doc_count" : 50
                }
              ]
            }
          }
        },
        {
          "key" : 30.0, <2>
          "doc_count" : 6,
          "ruleId" : {
            "doc_count" : 6,
            "ruleId" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "41893910-6bca-11eb-9e0d-85d233e3ee35",
                  "doc_count" : 6
                }
              ]
            }
          }
        }
      ]
    }
  }
}
--------------------------------------------------
<1> Most run durations fall within the first bucket (0 - 1 seconds).
<2> A single rule with id `41893910-6bca-11eb-9e0d-85d233e3ee35` took between 30 and 31 seconds to run.

Use the <<get-rule-api,get rule API>> to retrieve additional information about rules that take a long time to run.
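
For example, a sketch that retrieves the slow rule from the previous response; the {kib} host, port, and credentials are placeholders:

[source,sh]
----
# Placeholders: replace the host, port, and credentials with values for your deployment.
curl -s -u "elastic:changeme" \
  "http://localhost:5601/api/alerting/rule/41893910-6bca-11eb-9e0d-85d233e3ee35"
----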
[float]
[[rule-cannot-decrypt-api-key]]
==== Rule cannot decrypt apiKey

*Problem*:

The rule fails to run and has an `Unable to decrypt attribute "apiKey"` error.

*Solution*:

This error happens when the `xpack.encryptedSavedObjects.encryptionKey` value used to create the rule does not match the value used when the rule runs. Depending on the scenario, there are different ways to solve this problem:

[cols="2*<"]
|===

| If the value in `xpack.encryptedSavedObjects.encryptionKey` was manually changed, and the previous encryption key is still known.
| Ensure any previous encryption key is included in the keys used for <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only>>.

| If another {kib} instance with a different encryption key connects to the cluster.
| The other {kib} instance might be trying to run the rule using a different encryption key than the one the rule was created with. Ensure the encryption keys among all the {kib} instances are the same, and set <<xpack-encryptedSavedObjects-keyRotation-decryptionOnlyKeys, decryption only keys>> for previously used encryption keys, as shown in the sketch after this table.

| If other scenarios don't apply.
| Generate a new API key for the rule by disabling then enabling the rule.

|===
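
For the first two scenarios, a minimal `kibana.yml` sketch, assuming the previously used key is still known; the key values are placeholders and must be at least 32 characters long:

[source,yaml]
----
# Placeholders: use the same current key on every Kibana instance, and list any
# previously used keys so that existing rules can still be decrypted.
xpack.encryptedSavedObjects:
  encryptionKey: "current-encryption-key-at-least-32-chars"
  keyRotation:
    decryptionOnlyKeys: [ "previous-encryption-key-at-least-32-chars" ]
----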
[float]
[[known-issue-upgrade-rule]]
==== Rules stop running after upgrade

*Problem*:

Alerting rules that were created or edited in 8.2 stop running after you upgrade
to 8.3.0 or 8.3.1. The following error occurs:

[source,text]
----
<rule-type>:<UUID>: execution failed - security_exception: [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges], caused by: ""
----

*Solution*:

Upgrade to 8.3.2 or later releases to avoid the problem. To fix failing rules,
go to *{stack-manage-app} > {rac-ui}* and generate new API keys by selecting
**Update API key** from the actions menu. For more details about API key
authorization, refer to <<alerting-authorization>>.