Disk indicator troubleshooting guides (#90504)

Mary Gouseti 2022-10-14 16:24:21 +03:00 committed by GitHub
parent 1d4c96c9ae
commit cfd23d512f
20 changed files with 655 additions and 1 deletions

Binary files not shown: seven images added (149 KiB, 129 KiB, 177 KiB, 107 KiB, 200 KiB, 183 KiB, 324 KiB).

View file

@ -16,7 +16,7 @@ is not recommended to change any of these from their default values.
a master at all, before moving on with other checks. Defaults to `30s` (30 seconds).
`master_history.max_age`::
(<<static-cluster-setting,Static>>) The timeframe we record the master history
to be used for diagnosing the cluster health. Master node changes older than this time will not be considered when
diagnosing the cluster health. Defaults to `30m` (30 minutes).
@ -27,3 +27,11 @@ Defaults to `4`.
`health.master_history.no_master_transitions_threshold`::
(<<static-cluster-setting,Static>>) The number of transitions to no master witnessed by a node that indicates the cluster is not healthy.
Defaults to `4`.
`health.node.enabled`::
(<<cluster-update-settings,Dynamic>>) Enables the health node, which allows the health API to provide indications about
cluster-wide health aspects such as disk space.
`health.reporting.local.monitor.interval`::
(<<cluster-update-settings,Dynamic>>) Determines the interval at which each node of the cluster monitors aspects that
comprise its local health, such as its disk usage.

View file

@ -0,0 +1,40 @@
++++
<div class="tabs" data-tab-group="host">
<div role="tablist" aria-label="Decrease disk usage">
<button role="tab"
aria-selected="true"
aria-controls="cloud-tab-decrease-disk-usage"
id="cloud-decrease-disk-usage">
Elasticsearch Service
</button>
<button role="tab"
aria-selected="false"
aria-controls="self-managed-tab-decrease-disk-usage"
id="self-managed-decrease-disk-usage"
tabindex="-1">
Self-managed
</button>
</div>
<div tabindex="0"
role="tabpanel"
id="cloud-tab-decrease-disk-usage"
aria-labelledby="cloud-decrease-disk-usage">
++++
include::decrease-data-node-disk-usage.asciidoc[tag=cloud]
++++
</div>
<div tabindex="0"
role="tabpanel"
id="self-managed-tab-decrease-disk-usage"
aria-labelledby="self-managed-decrease-disk-usage"
hidden="">
++++
include::decrease-data-node-disk-usage.asciidoc[tag=self-managed]
++++
</div>
</div>
++++

View file

@ -0,0 +1,140 @@
// tag::cloud[]
**Use {kib}**
//tag::kibana-api-ex[]
. Log in to the {ess-console}[{ecloud} console].
+
. On the **Elasticsearch Service** panel, click the name of your deployment.
+
NOTE: If the name of your deployment is disabled, your {kib} instances might be
unhealthy, in which case please contact https://support.elastic.co[Elastic Support].
If your deployment doesn't include {kib}, all you need to do is
{cloud}/ec-access-kibana.html[enable it first].
+
. Open your deployment's side navigation menu (located under the Elastic logo in the upper left corner)
and go to **Stack Management > Index Management**.
. In the list of all your indices, click the `Replicas` column twice to sort the indices by their number of
replicas in descending order. Go through the indices one by one and pick those that are least important and
have the most replicas.
+
WARNING: Reducing the replicas of an index can potentially reduce search throughput and data redundancy.
+
. For each index you chose, click its name, then click `Edit settings` on the panel that appears, lower the
value of `index.number_of_replicas` to the desired number, and click `Save`.
+
[role="screenshot"]
image::images/troubleshooting/disk/reduce_replicas.png[Reducing replicas,align="center"]
+
. Continue this process until the cluster is healthy again.
// end::cloud[]
// tag::self-managed[]
To estimate how many replicas need to be removed, first estimate the amount of disk space that
needs to be released.
. First, retrieve the relevant disk thresholds, which indicate how much space should be released. The
relevant thresholds are the <<cluster-routing-watermark-high, high watermark>> for all tiers apart from the frozen
one and the <<cluster-routing-flood-stage-frozen, frozen flood stage watermark>> for the frozen tier. The following
example demonstrates a disk shortage in the hot tier, so we only retrieve the high watermark:
+
[source,console]
----
GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
----
+
The response will look like this:
+
[source,console-result]
----
{
  "defaults": {
    "cluster": {
      "routing": {
        "allocation": {
          "disk": {
            "watermark": {
              "high": "90%",
              "high.max_headroom": "150GB"
            }
          }
        }
      }
    }
  }
}
----
// TEST[skip:illustration purposes only]
+
This means that to resolve the disk shortage we need to either drop our disk usage below 90% or have
more than 150GB available. Read more about how this threshold works <<cluster-routing-watermark-high, here>>.
. The next step is to find out the current disk usage; this will indicate how much space should be freed. For simplicity,
our example has one node, but you can apply the same steps to every node that is over the relevant threshold.
+
[source,console]
----
GET _cat/allocation?v&s=disk.avail&h=node,disk.percent,disk.avail,disk.total,disk.used,disk.indices,shards
----
+
The response will look like this:
+
[source,console-result]
----
node                disk.percent disk.avail disk.total disk.used disk.indices shards
instance-0000000000           91      4.6gb       35gb    31.1gb       29.9gb    111
----
// TEST[skip:illustration purposes only]
. The high watermark configuration indicates that the disk usage needs to drop below 90%. Consider allowing some
padding, so the node does not go over the threshold again in the near future. In this example, let's release approximately 7GB.
. The next step is to list all the indices and choose which replicas to reduce.
+
NOTE: The following command orders the indices by descending number of replicas and primary store size. This is meant to
help you choose which replicas to reduce, under the assumption that the more replicas an index has, the smaller the risk of
removing a copy, and the bigger the replica, the more space is released. This does not take any
functional requirements into consideration, so please treat it as a mere suggestion.
+
[source,console]
----
GET _cat/indices?v&s=rep:desc,pri.store.size:desc&h=health,index,pri,rep,store.size,pri.store.size
----
+
The response will look like this:
+
[source,console-result]
----
health index           pri rep store.size pri.store.size
green  my_index          2   3      9.9gb          3.3gb
green  my_other_index    2   3      1.8gb        470.3mb
green  search-products   2   3    278.5kb         69.6kb
green  logs-000001       1   0      7.7gb          7.7gb
----
// TEST[skip:illustration purposes only]
+
. In the list above we see that reducing the replicas of the indices `my_index` and `my_other_index` to 1 will
release the required disk space. It is not necessary to reduce the replicas of `search-products`, and `logs-000001` does
not have any replicas anyway. Reduce the replicas of one or more indices with the <<indices-update-settings,
index update settings API>>:
+
WARNING: Reducing the replicas of an index can potentially reduce search throughput and data redundancy.
+
[source,console]
----
PUT my_index,my_other_index/_settings
{
"index.number_of_replicas": 1
}
----
// TEST[skip:illustration purposes only]
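
The arithmetic behind the steps above can be sketched in Python. This is a minimal illustration, not part of {es}: the helper names `space_to_free_gb` and `space_released_gb` are hypothetical, the 70% target is an assumption chosen to match the "approximately 7GB" in the example, and the sizes are typed in by hand from the example responses instead of being read from a cluster.

[source,python]
----
def space_to_free_gb(disk_total_gb, disk_used_gb, target_percent):
    """Return how many GB must be released so that usage drops to target_percent."""
    allowed_used_gb = disk_total_gb * target_percent / 100
    return max(0.0, disk_used_gb - allowed_used_gb)


def space_released_gb(pri_store_size_gb, current_replicas, new_replicas):
    """Estimate the GB released by lowering the replica count of one index.

    Assumption: each full replica copy is about as large as the primary store.
    """
    removed_copies = max(0, current_replicas - new_replicas)
    return pri_store_size_gb * removed_copies


# Figures copied from the illustrative responses above.
needed = space_to_free_gb(disk_total_gb=35, disk_used_gb=31.1, target_percent=70)
released = (
    space_released_gb(3.3, current_replicas=3, new_replicas=1)             # my_index
    + space_released_gb(470.3 / 1024, current_replicas=3, new_replicas=1)  # my_other_index
)
print(f"need to free about {needed:.1f} GB, replicas release about {released:.1f} GB")
----

With the example figures this reports about 6.6 GB to free and about 7.5 GB released. The released space is a cluster-wide estimate; how much of it lands on a particular node depends on where the removed replica shards were allocated.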
// end::self-managed[]
IMPORTANT: After reducing the replicas, please make sure there are still enough replicas to meet your search
performance and reliability requirements. If not, at your earliest convenience (i) consider using
<<overview-index-lifecycle-management, Index Lifecycle Management>> to manage the
retention of your time series data more efficiently, (ii) reduce the amount of data you have by disabling the `_source` or removing
less important data, or (iii) increase your disk capacity.

View file

@ -0,0 +1,40 @@
++++
<div class="tabs" data-tab-group="host">
<div role="tablist" aria-label="Increase data node capacity">
<button role="tab"
aria-selected="true"
aria-controls="cloud-tab-increase-data-node-capacity"
id="cloud-increase-data-node-capacity">
Elasticsearch Service
</button>
<button role="tab"
aria-selected="false"
aria-controls="self-managed-tab-increase-data-node-capacity"
id="self-managed-increase-data-node-capacity"
tabindex="-1">
Self-managed
</button>
</div>
<div tabindex="0"
role="tabpanel"
id="cloud-tab-increase-data-node-capacity"
aria-labelledby="cloud-increase-data-node-capacity">
++++
include::increase-data-node-capacity.asciidoc[tag=cloud]
++++
</div>
<div tabindex="0"
role="tabpanel"
id="self-managed-tab-increase-data-node-capacity"
aria-labelledby="self-managed-increase-data-node-capacity"
hidden="">
++++
include::increase-data-node-capacity.asciidoc[tag=self-managed]
++++
</div>
</div>
++++

View file

@ -0,0 +1,110 @@
// tag::cloud[]
In order to increase the disk capacity of the data nodes in your cluster:
. Log in to the {ess-console}[{ecloud} console].
+
. On the **Elasticsearch Service** panel, click the gear under the `Manage deployment` column that corresponds to the
name of your deployment.
+
. If autoscaling is available but not enabled, please enable it. You can do this by clicking the button
`Enable autoscaling` on a banner like the one below:
+
[role="screenshot"]
image::images/troubleshooting/disk/autoscaling_banner.png[Autoscaling banner,align="center"]
+
Or you can go to `Actions > Edit deployment`, check the checkbox `Autoscale` and click `save` at the bottom of the page.
+
[role="screenshot"]
image::images/troubleshooting/disk/enable_autoscaling.png[Enabling autoscaling,align="center"]
. If autoscaling has succeeded, the cluster should return to `healthy` status. If the cluster is still out of disk,
please check whether autoscaling has reached its limits. You will be notified about this by the following banner:
+
[role="screenshot"]
image::images/troubleshooting/disk/autoscaling_limits_banner.png[Autoscaling limits banner,align="center"]
+
or you can go to `Actions > Edit deployment` and look for the label `LIMIT REACHED` as shown below:
+
[role="screenshot"]
image::images/troubleshooting/disk/reached_autoscaling_limits.png[Autoscaling limits reached,align="center"]
+
If you are seeing the banner, click `Update autoscaling settings` to go to the `Edit` page. Otherwise, you are already
on the `Edit` page; click `Edit settings` to increase the autoscaling limits. After you make the change, click `save`
at the bottom of the page.
// end::cloud[]
// tag::self-managed[]
In order to increase the data node capacity in your cluster, you will need to calculate the amount of extra disk space
needed.
. First, retrieve the relevant disk thresholds, which indicate how much space should be available. The
relevant thresholds are the <<cluster-routing-watermark-high, high watermark>> for all tiers apart from the frozen
one and the <<cluster-routing-flood-stage-frozen, frozen flood stage watermark>> for the frozen tier. The following
example demonstrates a disk shortage in the hot tier, so we only retrieve the high watermark:
+
[source,console]
----
GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
----
+
The response will look like this:
+
[source,console-result]
----
{
  "defaults": {
    "cluster": {
      "routing": {
        "allocation": {
          "disk": {
            "watermark": {
              "high": "90%",
              "high.max_headroom": "150GB"
            }
          }
        }
      }
    }
  }
}
----
// TEST[skip:illustration purposes only]
+
This means that to resolve the disk shortage we need to either drop our disk usage below 90% or have
more than 150GB available. Read more about how this threshold works <<cluster-routing-watermark-high, here>>.
. The next step is to find out the current disk usage; this will indicate how much extra space is needed. For simplicity,
our example has one node, but you can apply the same steps to every node that is over the relevant threshold.
+
[source,console]
----
GET _cat/allocation?v&s=disk.avail&h=node,disk.percent,disk.avail,disk.total,disk.used,disk.indices,shards
----
+
The response will look like this:
+
[source,console-result]
----
node                disk.percent disk.avail disk.total disk.used disk.indices shards
instance-0000000000           91      4.6gb       35gb    31.1gb       29.9gb    111
----
// TEST[skip:illustration purposes only]
. The high watermark configuration indicates that the disk usage needs to drop below 90%. To achieve this, you have
two options:
- add an extra data node to the cluster (this requires that you have more than one shard in your cluster), or
- extend the disk space of the current node so that its usage drops by approximately 20 percentage points, to around 70%.
This gives the node enough room that it does not run out of space again soon.
. If you add another data node, the cluster will not recover immediately. It might take some time to
relocate some shards to the new node. You can check the progress with the following request:
+
[source,console]
----
GET /_cat/shards?v&h=state,node&s=state
----
+
If the response shows shards in the `RELOCATING` state, shards are still moving. Wait until all shards are
`STARTED` or until the disk health indicator turns `green`.
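
The threshold check and the capacity estimate from the steps above can be sketched in Python. This is an illustration under stated assumptions, not {es} code: the function names are hypothetical, the watermark rule is modelled exactly as it is worded above (usage below 90%, or more than 150GB available), and the 70% target is an assumed amount of padding.

[source,python]
----
def within_high_watermark(disk_percent, disk_avail_gb, high_percent=90, max_headroom_gb=150):
    """Apply the rule as worded above: usage below high_percent, or more than max_headroom_gb free."""
    return disk_percent < high_percent or disk_avail_gb > max_headroom_gb


def extra_capacity_gb(disk_total_gb, disk_used_gb, target_percent):
    """Return the GB to add so that the current usage equals target_percent of the new total."""
    required_total_gb = disk_used_gb * 100 / target_percent
    return max(0.0, required_total_gb - disk_total_gb)


# Figures copied from the illustrative _cat/allocation response above.
print(within_high_watermark(disk_percent=91, disk_avail_gb=4.6))  # False: the node is over the threshold
extra = extra_capacity_gb(disk_total_gb=35, disk_used_gb=31.1, target_percent=70)
print(f"add about {extra:.1f} GB of disk")
----

With the example figures the estimate is about 9.4 GB. A lower target leaves more padding but requires a larger disk, so the target is a judgment call for your own growth rate.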
// end::self-managed[]

View file

@ -0,0 +1,40 @@
++++
<div class="tabs" data-tab-group="host">
<div role="tablist" aria-label="Increase master node capacity">
<button role="tab"
aria-selected="true"
aria-controls="cloud-tab-increase-master-node-capacity"
id="cloud-increase-master-node-capacity">
Elasticsearch Service
</button>
<button role="tab"
aria-selected="false"
aria-controls="self-managed-tab-increase-master-node-capacity"
id="self-managed-increase-master-node-capacity"
tabindex="-1">
Self-managed
</button>
</div>
<div tabindex="0"
role="tabpanel"
id="cloud-tab-increase-master-node-capacity"
aria-labelledby="cloud-increase-master-node-capacity">
++++
include::increase-master-node-capacity.asciidoc[tag=cloud]
++++
</div>
<div tabindex="0"
role="tabpanel"
id="self-managed-tab-increase-master-node-capacity"
aria-labelledby="self-managed-increase-master-node-capacity"
hidden="">
++++
include::increase-master-node-capacity.asciidoc[tag=self-managed]
++++
</div>
</div>
++++

View file

@ -0,0 +1,89 @@
// tag::cloud[]
. Log in to the {ess-console}[{ecloud} console].
+
. On the **Elasticsearch Service** panel, click the gear under the `Manage deployment` column that corresponds to the
name of your deployment.
+
. Go to `Actions > Edit deployment` and then go to the `Master instances` section:
+
[role="screenshot"]
image::images/troubleshooting/disk/increase-disk-capacity-master-node.png[Increase disk capacity of master nodes,align="center"]
. From the drop-down menu, choose a capacity configuration larger than the pre-selected one and click `save`. Wait for
the plan to be applied and the problem should be resolved.
// end::cloud[]
// tag::self-managed[]
In order to increase the disk capacity of a master node, you will need to replace *all* the master nodes with
master nodes of higher disk capacity.
. First, retrieve the disk threshold that will indicate how much disk space is needed. The relevant threshold is
the <<cluster-routing-watermark-high, high watermark>> and can be retrieved via the following command:
+
[source,console]
----
GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
----
+
The response will look like this:
+
[source,console-result]
----
{
  "defaults": {
    "cluster": {
      "routing": {
        "allocation": {
          "disk": {
            "watermark": {
              "high": "90%",
              "high.max_headroom": "150GB"
            }
          }
        }
      }
    }
  }
}
----
// TEST[skip:illustration purposes only]
+
This means that to resolve the disk shortage we need to either drop our disk usage below 90% or have
more than 150GB available. Read more about how this threshold works <<cluster-routing-watermark-high, here>>.
. The next step is to find out the current disk usage; this lets you calculate how much extra space is needed.
In the following example, we show only the master nodes for readability:
+
[source,console]
----
GET /_cat/nodes?v&h=name,master,node.role,disk.used_percent,disk.used,disk.avail,disk.total
----
+
The response will look like this:
+
[source,console-result]
----
name                master node.role disk.used_percent disk.used disk.avail disk.total
instance-0000000000 *      m                     85.31     3.4gb      500mb        4gb
instance-0000000001 -      m                     50.02     2.1gb      1.9gb        4gb
instance-0000000002 -      m                     50.02     1.9gb      2.1gb        4gb
----
// TEST[skip:illustration purposes only]
. The goal is to bring the disk usage below the relevant threshold, in our example 90%. Consider adding
some padding, so it does not go over the threshold again soon. If you have multiple master nodes, you need to ensure that *all*
master nodes have this capacity. Assuming you have the new nodes ready, follow the next three steps for every
master node.
. Bring down one of the master nodes.
. Start up one of the new master nodes and wait for it to join the cluster. You can check this via:
+
[source,console]
----
GET /_cat/nodes?v&h=name,master,node.role,disk.used_percent,disk.used,disk.avail,disk.total
----
+
. Only after you have confirmed that your cluster has its initial number of master nodes, move on to the next one
until all the initial master nodes have been replaced.
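
The confirmation in the last step can be sketched in Python. This is a minimal illustration that assumes the rows come from `GET /_cat/nodes?format=json&h=name,master,node.role`; the function names and the sample rows are hypothetical, and fetching the rows from the cluster is left out.

[source,python]
----
def master_eligible_nodes(cat_nodes_rows):
    """Return the names of nodes whose node.role string contains the master role (m)."""
    return [row["name"] for row in cat_nodes_rows if "m" in row["node.role"]]


def safe_to_replace_next(cat_nodes_rows, expected_master_count):
    """Return True when all expected master-eligible nodes are present and one is elected."""
    eligible = master_eligible_nodes(cat_nodes_rows)
    has_elected_master = any(row["master"] == "*" for row in cat_nodes_rows)
    return len(eligible) == expected_master_count and has_elected_master


# Hypothetical rows after the first master node has been replaced.
rows = [
    {"name": "instance-0000000001", "master": "*", "node.role": "m"},
    {"name": "instance-0000000002", "master": "-", "node.role": "m"},
    {"name": "instance-0000000003", "master": "-", "node.role": "m"},
]
print(safe_to_replace_next(rows, expected_master_count=3))
----

The check deliberately counts master-eligible nodes instead of comparing node names, because the replacement nodes are expected to have new names.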
// end::self-managed[]

View file

@ -0,0 +1,40 @@
++++
<div class="tabs" data-tab-group="host">
<div role="tablist" aria-label="Increase other node capacity">
<button role="tab"
aria-selected="true"
aria-controls="cloud-tab-increase-other-node-capacity"
id="cloud-increase-other-node-capacity">
Elasticsearch Service
</button>
<button role="tab"
aria-selected="false"
aria-controls="self-managed-tab-increase-other-node-capacity"
id="self-managed-increase-other-node-capacity"
tabindex="-1">
Self-managed
</button>
</div>
<div tabindex="0"
role="tabpanel"
id="cloud-tab-increase-other-node-capacity"
aria-labelledby="cloud-increase-other-node-capacity">
++++
include::increase-other-node-capacity.asciidoc[tag=cloud]
++++
</div>
<div tabindex="0"
role="tabpanel"
id="self-managed-tab-increase-other-node-capacity"
aria-labelledby="self-managed-increase-other-node-capacity"
hidden="">
++++
include::increase-other-node-capacity.asciidoc[tag=self-managed]
++++
</div>
</div>
++++

View file

@ -0,0 +1,94 @@
// tag::cloud[]
. Log in to the {ess-console}[{ecloud} console].
+
. On the **Elasticsearch Service** panel, click the gear under the `Manage deployment` column that corresponds to the
name of your deployment.
+
. Go to `Actions > Edit deployment` and then go to the `Coordinating instances` or the `Machine Learning instances`
section depending on the roles listed in the diagnosis:
+
[role="screenshot"]
image::images/troubleshooting/disk/increase-disk-capacity-other-node.png[Increase disk capacity of other nodes,align="center"]
. From the drop-down menu, choose a capacity configuration larger than the pre-selected one and click `save`. Wait for
the plan to be applied and the problem should be resolved.
// end::cloud[]
// tag::self-managed[]
In order to increase the disk capacity of any other node, you will need to replace the instance that has run out of
space with one of higher disk capacity.
. First, retrieve the disk threshold that will indicate how much disk space is needed. The relevant threshold is
the <<cluster-routing-watermark-high, high watermark>> and can be retrieved via the following command:
+
[source,console]
----
GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
----
+
The response will look like this:
+
[source,console-result]
----
{
  "defaults": {
    "cluster": {
      "routing": {
        "allocation": {
          "disk": {
            "watermark": {
              "high": "90%",
              "high.max_headroom": "150GB"
            }
          }
        }
      }
    }
  }
}
----
// TEST[skip:illustration purposes only]
+
This means that to resolve the disk shortage we need to either drop our disk usage below 90% or have
more than 150GB available. Read more about how this threshold works <<cluster-routing-watermark-high, here>>.
. The next step is to find out the current disk usage; this lets you calculate how much extra space is needed.
In the following example, we show only a machine learning node for readability:
+
[source,console]
----
GET /_cat/nodes?v&h=name,node.role,disk.used_percent,disk.used,disk.avail,disk.total
----
+
The response will look like this:
+
[source,console-result]
----
name                node.role disk.used_percent disk.used disk.avail disk.total
instance-0000000000 l                     85.31     3.4gb      500mb        4gb
----
// TEST[skip:illustration purposes only]
. The goal is to bring the disk usage below the relevant threshold, in our example 90%. Consider adding
some padding, so it does not go over the threshold again soon. Assuming you have the new node ready, add this node to the
cluster.
. Verify that the new node has joined the cluster:
+
[source,console]
----
GET /_cat/nodes?v&h=name,node.role,disk.used_percent,disk.used,disk.avail,disk.total
----
+
The response will look like this:
+
[source,console-result]
----
name                node.role disk.used_percent disk.used disk.avail disk.total
instance-0000000000 l                     85.31     3.4gb      500mb        4gb
instance-0000000001 l                     41.31     3.4gb      4.5gb        8gb
----
// TEST[skip:illustration purposes only]
. Now you can remove the instance that ran out of disk space.
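
Before removing the old instance, the verification step above can be sketched in Python. This is an illustration only: it assumes the rows come from `GET /_cat/nodes?format=json&h=name,node.role,disk.used_percent`, where values arrive as strings, and the function name `replacement_ready` is hypothetical.

[source,python]
----
def replacement_ready(cat_nodes_rows, new_node_name, threshold_percent):
    """Return True if the new node is listed and its disk usage is below the threshold."""
    for row in cat_nodes_rows:
        if row["name"] == new_node_name:
            return float(row["disk.used_percent"]) < threshold_percent
    return False  # the new node has not joined the cluster yet


# Rows shaped like the illustrative response above.
rows = [
    {"name": "instance-0000000000", "node.role": "l", "disk.used_percent": "85.31"},
    {"name": "instance-0000000001", "node.role": "l", "disk.used_percent": "41.31"},
]
print(replacement_ready(rows, "instance-0000000001", threshold_percent=90))
----

Passing the threshold as a parameter lets you check against a stricter value than the high watermark if you want the new node to start with some padding.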
// end::self-managed[]

View file

@ -31,6 +31,13 @@ fix problems that an {es} deployment might encounter.
* <<start-ilm,Start index lifecycle management>>
* <<start-slm,Start snapshot lifecycle management>>
[discrete]
[[troubleshooting-capacity]]
=== Capacity
* <<fix-data-node-out-of-disk, Fix data nodes out of disk>>
* <<fix-master-node-out-of-disk, Fix master nodes out of disk>>
* <<fix-other-node-out-of-disk, Fix other role nodes out of disk>>
[discrete]
[[troubleshooting-snapshot]]
=== Snapshot and restore
@ -90,6 +97,12 @@ include::troubleshooting/data/increase-cluster-shard-limit.asciidoc[]
include::troubleshooting/corruption-issues.asciidoc[]
include::troubleshooting/disk/fix-data-node-out-of-disk.asciidoc[]
include::troubleshooting/disk/fix-master-node-out-of-disk.asciidoc[]
include::troubleshooting/disk/fix-other-node-out-of-disk.asciidoc[]
include::troubleshooting/data/start-ilm.asciidoc[]
include::troubleshooting/data/start-slm.asciidoc[]

View file

@ -0,0 +1,23 @@
[[fix-data-node-out-of-disk]]
== Fix data nodes out of disk
{es} uses data nodes to distribute your data inside the cluster. If one or more of these nodes are running
out of space, {es} takes action to redistribute your data among the nodes so that all nodes have enough available
disk space. If {es} cannot free up enough space on a node, you will need to intervene in one
of two ways:
. <<increase-capacity-data-node, Increase the disk capacity of your cluster>>
. <<decrease-disk-usage-data-node, Reduce the disk usage by decreasing your data volume>>
[[increase-capacity-data-node]]
=== Increase the disk capacity of data nodes
include::{es-repo-dir}/tab-widgets/troubleshooting/disk/increase-data-node-capacity-widget.asciidoc[]
[[decrease-disk-usage-data-node]]
=== Decrease the disk usage of data nodes
In order to decrease the disk usage in your cluster without losing any data, you can try reducing the replicas of indices.
NOTE: Reducing the replicas of an index can potentially reduce search throughput and data redundancy. However, it
can quickly give the cluster breathing room until a more permanent solution is in place.
include::{es-repo-dir}/tab-widgets/troubleshooting/disk/decrease-data-node-disk-usage-widget.asciidoc[]

View file

@ -0,0 +1,8 @@
[[fix-master-node-out-of-disk]]
== Fix master nodes out of disk
{es} uses master nodes to coordinate the cluster. If the master or any master-eligible nodes are running
out of space, you need to ensure that they have enough disk space to function. If the <<health-api, health API>>
reports that your master node is out of space, you need to increase the disk capacity of your master nodes.
include::{es-repo-dir}/tab-widgets/troubleshooting/disk/increase-master-node-capacity-widget.asciidoc[]

View file

@ -0,0 +1,9 @@
[[fix-other-node-out-of-disk]]
== Fix other role nodes out of disk
{es} can use dedicated nodes to execute functions other than storing data or coordinating the cluster,
for example machine learning. If one or more of these nodes are running out of space, you need to ensure that they have
enough disk space to function. If the <<health-api, health API>> reports that a node that is not a master and does not
contain data is out of space, you need to increase the disk capacity of this node.
include::{es-repo-dir}/tab-widgets/troubleshooting/disk/increase-other-node-capacity-widget.asciidoc[]