elasticsearch/docs/reference/ilm/error-handling.asciidoc

[role="xpack"]
[[index-lifecycle-error-handling]]
== Troubleshooting {ilm} errors

When {ilm-init} executes a lifecycle policy, errors can occur
while performing the index operations required by a step.
When this happens, {ilm-init} moves the index to an `ERROR` step.
If {ilm-init} cannot resolve the error automatically, execution is halted
until you resolve the underlying issues with the policy, index, or cluster.

For example, you might have a `shrink-index` policy that shrinks an index to four shards once it
is at least five days old:

[source,console]
--------------------------------------------------
PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 4
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST

Nothing prevents you from applying the `shrink-index` policy to a new
index that has only two shards:

[source,console]
--------------------------------------------------
PUT /my-index-000001
{
  "settings": {
    "index.number_of_shards": 2,
    "index.lifecycle.name": "shrink-index"
  }
}
--------------------------------------------------
// TEST[continued]

After five days, {ilm-init} attempts to shrink `my-index-000001` from two shards to four shards.
Because the shrink action cannot _increase_ the number of shards, this operation fails
and {ilm-init} moves `my-index-000001` to the `ERROR` step.

You can use the <<ilm-explain-lifecycle,{ilm-init} Explain API>> to get information about
what went wrong:

[source,console]
--------------------------------------------------
GET /my-index-000001/_ilm/explain
--------------------------------------------------
// TEST[continued]

The API returns the following information:

[source,console-result]
--------------------------------------------------
{
  "indices" : {
    "my-index-000001" : {
      "index" : "my-index-000001",
      "managed" : true,
      "policy" : "shrink-index",                <1>
      "lifecycle_date_millis" : 1541717265865,
      "age": "5.1d",                            <2>
      "phase" : "warm",                         <3>
      "phase_time_millis" : 1541717272601,
      "action" : "shrink",                      <4>
      "action_time_millis" : 1541717272601,
      "step" : "ERROR",                         <5>
      "step_time_millis" : 1541717272688,
      "failed_step" : "shrink",                 <6>
      "step_info" : {
        "type" : "illegal_argument_exception",  <7>
        "reason" : "the number of target shards [4] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "shrink-index",
        "phase_definition" : {                  <8>
          "min_age" : "5d",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 4
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1541717264230
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[skip:no way to know if we will get this response immediately]

<1> The policy being used to manage the index: `shrink-index`
<2> The index age: 5.1 days
<3> The phase the index is currently in: `warm`
<4> The current action: `shrink`
<5> The step the index is currently in: `ERROR`
<6> The step that failed to execute: `shrink`
<7> The type of error and a description of that error
<8> The definition of the current phase from the `shrink-index` policy

To resolve this error, you could update the policy to shrink the index to a single shard after five days:

[source,console]
--------------------------------------------------
PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]

[discrete]
=== Retrying failed lifecycle policy steps

Once you fix the problem that put an index in the `ERROR` step,
you might need to explicitly tell {ilm-init} to retry the step:

[source,console]
--------------------------------------------------
POST /my-index-000001/_ilm/retry
--------------------------------------------------
// TEST[skip:we can't be sure the index is ready to be retried at this point]

{ilm-init} subsequently attempts to re-run the step that failed.
You can use the <<ilm-explain-lifecycle,{ilm-init} Explain API>> to monitor the progress.
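
To check whether any managed indices remain in the `ERROR` step after a retry,
a request along the following lines may help.
This is a sketch that assumes the explain API's `only_errors` query parameter is available in your version:

[source,console]
--------------------------------------------------
GET /_all/_ilm/explain?only_errors=true  <1>
--------------------------------------------------
// TEST[skip:illustrative sketch]

<1> Assumption: `_all` targets every index; narrow the target to a specific index or pattern as needed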

[discrete]
=== Common {ilm-init} errors

Here's how to resolve the most common errors reported in the `ERROR` step.

TIP: Problems with rollover aliases are a common cause of errors.
Consider using <<data-streams, data streams>> instead of managing rollover with aliases.

[discrete]
==== Rollover alias [x] can point to multiple indices, found duplicated alias [x] in index template [z]

The target rollover alias is specified in an index template's `index.lifecycle.rollover_alias` setting.
You need to explicitly configure this alias _one time_ when you
<<ilm-gs-alias-bootstrap, bootstrap the initial index>>.
The rollover action then manages setting and updating the alias to
<<rollover-index-api-desc, roll over>> to each subsequent index.

Do not explicitly configure this same alias in the aliases section of an index template.
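
As a sketch of this setup, the following template sets `index.lifecycle.rollover_alias`
in its settings but defines no `aliases` section.
The alias is created only once, when the initial index is bootstrapped.
The names `my-template`, `my-policy`, and `my-alias` and the `my-index-*` pattern are placeholders chosen for illustration,
and `my-policy` is assumed to already exist with a rollover action:

[source,console]
--------------------------------------------------
PUT _index_template/my-template
{
  "index_patterns": ["my-index-*"],                 <1>
  "template": {
    "settings": {
      "index.lifecycle.name": "my-policy",          <2>
      "index.lifecycle.rollover_alias": "my-alias"  <3>
    }
  }
}

PUT /my-index-000001
{
  "aliases": {
    "my-alias": {
      "is_write_index": true                        <4>
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative sketch with placeholder names]

<1> Hypothetical index pattern
<2> Hypothetical policy, assumed to define a rollover action
<3> The rollover alias is named in the settings only; the template has no `aliases` section
<4> The alias is configured explicitly one time, on the bootstrap index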

[discrete]
==== index.lifecycle.rollover_alias [x] does not point to index [y]

Either the index is using the wrong alias or the alias does not exist.

Check the `index.lifecycle.rollover_alias` <<indices-get-settings, index setting>>.
To see what aliases are configured, use <<cat-alias, _cat/aliases>>.
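
For example, the following requests retrieve the setting for one index and list the indices behind one alias.
The index name `my-index-000001` and the alias name `my-alias` are placeholders:

[source,console]
--------------------------------------------------
GET /my-index-000001/_settings/index.lifecycle.rollover_alias  <1>

GET _cat/aliases/my-alias?v=true                               <2>
--------------------------------------------------
// TEST[skip:illustrative sketch with placeholder names]

<1> Returns only the `index.lifecycle.rollover_alias` setting for the (hypothetical) index
<2> Lists the indices that the (hypothetical) alias points to, with column headers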

[discrete]
==== Setting [index.lifecycle.rollover_alias] for index [y] is empty or not defined

The `index.lifecycle.rollover_alias` setting must be configured for the rollover action to work.

Update the index settings to set `index.lifecycle.rollover_alias`.
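
A minimal sketch of that update, assuming an index named `my-index-000001`
and an existing alias named `my-alias` that already points to it (both names are placeholders):

[source,console]
--------------------------------------------------
PUT /my-index-000001/_settings
{
  "index.lifecycle.rollover_alias": "my-alias"  <1>
}
--------------------------------------------------
// TEST[skip:illustrative sketch with placeholder names]

<1> Hypothetical alias name; replace it with the alias that points to this index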

[discrete]
==== Alias [x] has more than one write index [y,z]

Only one index can be designated as the write index for a particular alias.

Use the <<indices-aliases, aliases>> API to set `is_write_index:false` for all but one index.
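
For example, assuming a hypothetical alias `my-alias` that points to both `my-index-000001` and `my-index-000002`,
the following request leaves `my-index-000002` as the only write index:

[source,console]
--------------------------------------------------
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my-index-000001",
        "alias": "my-alias",
        "is_write_index": false  <1>
      }
    },
    {
      "add": {
        "index": "my-index-000002",
        "alias": "my-alias",
        "is_write_index": true   <2>
      }
    }
  ]
}
--------------------------------------------------
// TEST[skip:illustrative sketch with placeholder names]

<1> The older (hypothetical) index no longer accepts writes through the alias
<2> The newest (hypothetical) index is the single write index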

[discrete]
==== index name [x] does not match pattern ^.*-\d+

The index name must match the regex pattern `^.*-\d+` for the rollover action to work.
The most common problem is that the index name does not contain trailing digits.
For example, `my-index` does not match the pattern requirement.

Append a numeric value to the index name, for example `my-index-000001`.

[discrete]
==== CircuitBreakingException: [x] data too large, data for [y]

This indicates that the cluster is hitting resource limits.

Before continuing to set up {ilm-init}, you'll need to take steps to alleviate the resource issues.
For more information, see <<circuit-breaker-errors>>.
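
As a starting point for diagnosis, the node stats API can report the state of each circuit breaker.
This is only a sketch of where to look, not a fix:

[source,console]
--------------------------------------------------
GET _nodes/stats/breaker
--------------------------------------------------
// TEST[skip:illustrative sketch]

For each breaker on each node, the response includes the configured limit, the estimated memory in use,
and the number of times the breaker has tripped.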

[discrete]
==== High disk watermark [x] exceeded on [y]

This indicates that the cluster is running out of disk space.
This can happen when you don't have {ilm} set up to move older indices from hot to warm nodes.

Consider adding nodes, upgrading your hardware, or deleting unneeded indices.
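
To see which nodes are short on disk space before deciding what to do,
you could start with the cat allocation API, as in this sketch:

[source,console]
--------------------------------------------------
GET _cat/allocation?v=true
--------------------------------------------------
// TEST[skip:illustrative sketch]

The response lists, for each node, the number of shards it holds and its disk usage, including the percentage of disk used.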