Disk indicator troubleshooting guides (#90504)
Binary image files added under docs/reference/images/troubleshooting/disk/ (including reduce_replicas.png).
@ -16,7 +16,7 @@ is not recommended to change any of these from their default values.
a master at all, before moving on with other checks. Defaults to `30s` (30 seconds).

`master_history.max_age`::
(<<static-cluster-setting,Static>>) The timeframe we record the master history to be used for diagnosing the cluster
health. Master node changes older than this time will not be considered when diagnosing the cluster health.
Defaults to `30m` (30 minutes).

@ -27,3 +27,11 @@ Defaults to `4`.
`health.master_history.no_master_transitions_threshold`::
(<<static-cluster-setting,Static>>) The number of transitions to no master witnessed by a node that indicates the
cluster is not healthy. Defaults to `4`.

`health.node.enabled`::
(<<cluster-update-settings,Dynamic>>) Enables the health node, which allows the health API to provide indications
about cluster-wide health aspects such as disk space.

`health.reporting.local.monitor.interval`::
(<<cluster-update-settings,Dynamic>>) Determines the interval at which each node of the cluster monitors aspects
that comprise its local health, such as its disk usage.
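The two dynamic settings above can be changed at runtime with the cluster settings API; the values below are purely illustrative:

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "health.node.enabled": true,
    "health.reporting.local.monitor.interval": "30s"
  }
}
----
// TEST[skip:illustration purposes only]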
@ -0,0 +1,40 @@
++++
<div class="tabs" data-tab-group="host">
  <div role="tablist" aria-label="Decrease data node disk usage">
    <button role="tab"
            aria-selected="true"
            aria-controls="cloud-tab-decrease-disk-usage"
            id="cloud-decrease-disk-usage">
      Elasticsearch Service
    </button>
    <button role="tab"
            aria-selected="false"
            aria-controls="self-managed-tab-decrease-disk-usage"
            id="self-managed-decrease-disk-usage"
            tabindex="-1">
      Self-managed
    </button>
  </div>
  <div tabindex="0"
       role="tabpanel"
       id="cloud-tab-decrease-disk-usage"
       aria-labelledby="cloud-decrease-disk-usage">
++++

include::decrease-data-node-disk-usage.asciidoc[tag=cloud]

++++
  </div>
  <div tabindex="0"
       role="tabpanel"
       id="self-managed-tab-decrease-disk-usage"
       aria-labelledby="self-managed-decrease-disk-usage"
       hidden="">
++++

include::decrease-data-node-disk-usage.asciidoc[tag=self-managed]

++++
  </div>
</div>
++++
@ -0,0 +1,140 @@
// tag::cloud[]
**Use {kib}**

//tag::kibana-api-ex[]
. Log in to the {ess-console}[{ecloud} console].
+
. On the **Elasticsearch Service** panel, click the name of your deployment.
+
NOTE: If the name of your deployment is disabled, your {kib} instances might be
unhealthy, in which case please contact https://support.elastic.co[Elastic Support].
If your deployment doesn't include {kib}, all you need to do is
{cloud}/ec-access-kibana.html[enable it first].
+
. Open your deployment's side navigation menu (located under the Elastic logo in the upper left corner)
and go to **Stack Management > Index Management**.

. In the list of all your indices, click the `Replicas` column twice to sort the indices in descending order of their
number of replicas. Go through the indices one by one, and pick the least important index with the highest number of
replicas.
+
WARNING: Reducing the replicas of an index can potentially reduce search throughput and data redundancy.
+
. For each index you chose, click on its name, then on the panel that appears click `Edit settings`, reduce the
value of `index.number_of_replicas` to the desired value, and then click `Save`.
+
[role="screenshot"]
image::images/troubleshooting/disk/reduce_replicas.png[Reducing replicas,align="center"]
+
. Continue this process until the cluster is healthy again.

// end::cloud[]

// tag::self-managed[]
In order to estimate how many replicas need to be removed, first you need to calculate the amount of disk space that
needs to be released.

. First, retrieve the relevant disk thresholds that will indicate how much space should be released. The
relevant thresholds are the <<cluster-routing-watermark-high, high watermark>> for all the tiers apart from the frozen
one, and the <<cluster-routing-flood-stage-frozen, frozen flood stage watermark>> for the frozen tier. The following
example demonstrates disk shortage in the hot tier, so we will only retrieve the high watermark:
+
[source,console]
----
GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
----
+
The response will look like this:
+
[source,console-result]
----
{
  "defaults": {
    "cluster": {
      "routing": {
        "allocation": {
          "disk": {
            "watermark": {
              "high": "90%",
              "high.max_headroom": "150GB"
            }
          }
        }
      }
    }
  }
}
----
// TEST[skip:illustration purposes only]
+
The above means that in order to resolve the disk shortage we need to either drop our disk usage below 90% or have
more than 150GB available. Read more on how this threshold works <<cluster-routing-watermark-high, here>>.

. The next step is to find out the current disk usage; this will indicate how much space should be freed. For
simplicity, our example has one node, but you can apply the same to every node over the relevant threshold.
+
[source,console]
----
GET _cat/allocation?v&s=disk.avail&h=node,disk.percent,disk.avail,disk.total,disk.used,disk.indices,shards
----
+
The response will look like this:
+
[source,console-result]
----
node                disk.percent disk.avail disk.total disk.used disk.indices shards
instance-0000000000           91      4.6gb       35gb    31.1gb       29.9gb    111
----
// TEST[skip:illustration purposes only]

. The high watermark configuration indicates that the disk usage needs to drop below 90%. Consider allowing some
padding, so the node will not go over the threshold in the near future. In this example, let's release approximately 7GB.

. The next step is to list all the indices and choose which replicas to reduce.
+
NOTE: The following command orders the indices by descending number of replicas and primary store size. We do this to
help you choose which replicas to reduce, under the assumption that the more replicas you have, the smaller the risk if
you remove a copy, and the bigger the replica, the more space will be released. This does not take into consideration
any functional requirements, so please see it as a mere suggestion.
+
[source,console]
----
GET _cat/indices?v&s=rep:desc,pri.store.size:desc&h=health,index,pri,rep,store.size,pri.store.size
----
+
The response will look like:
+
[source,console-result]
----
health index           pri rep store.size pri.store.size
green  my_index          2   3      9.9gb          3.3gb
green  my_other_index    2   3      1.8gb        470.3mb
green  search-products   2   3    278.5kb         69.6kb
green  logs-000001       1   0      7.7gb          7.7gb
----
// TEST[skip:illustration purposes only]

. In the list above we see that if we reduce the number of replicas of the indices `my_index` and `my_other_index` to
1, we will release the required disk space. It is not necessary to reduce the replicas of `search-products`, and
`logs-000001` does not have any replicas anyway. Reduce the replicas of one or more indices with the
<<indices-update-settings, index update settings API>>:
+
WARNING: Reducing the replicas of an index can potentially reduce search throughput and data redundancy.
+
[source,console]
----
PUT my_index,my_other_index/_settings
{
  "index.number_of_replicas": 1
}
----
// TEST[skip:illustration purposes only]
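+
To confirm that the new replica count took effect (using the same illustrative index names), you can read the setting
back with the get index settings API:
+
[source,console]
----
GET my_index,my_other_index/_settings?filter_path=*.settings.index.number_of_replicas
----
// TEST[skip:illustration purposes only]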
// end::self-managed[]

IMPORTANT: After reducing the replicas, please verify that there are enough replicas to meet your search
performance and reliability requirements. If not, at your earliest convenience: (i) consider using
<<overview-index-lifecycle-management, Index Lifecycle Management>> to manage the retention of your time series data
more efficiently, (ii) reduce the amount of data you have by disabling the `_source` or removing
less important data, or (iii) increase your disk capacity.
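As an illustration of option (i), a minimal lifecycle policy with only a delete phase caps the retention of time series
data; the policy name and `min_age` below are hypothetical values:

[source,console]
----
PUT _ilm/policy/my-retention-policy
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
----
// TEST[skip:illustration purposes only]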

@ -0,0 +1,40 @@
++++
<div class="tabs" data-tab-group="host">
  <div role="tablist" aria-label="Increase data node capacity">
    <button role="tab"
            aria-selected="true"
            aria-controls="cloud-tab-increase-data-node-capacity"
            id="cloud-increase-data-node-capacity">
      Elasticsearch Service
    </button>
    <button role="tab"
            aria-selected="false"
            aria-controls="self-managed-tab-increase-data-node-capacity"
            id="self-managed-increase-data-node-capacity"
            tabindex="-1">
      Self-managed
    </button>
  </div>
  <div tabindex="0"
       role="tabpanel"
       id="cloud-tab-increase-data-node-capacity"
       aria-labelledby="cloud-increase-data-node-capacity">
++++

include::increase-data-node-capacity.asciidoc[tag=cloud]

++++
  </div>
  <div tabindex="0"
       role="tabpanel"
       id="self-managed-tab-increase-data-node-capacity"
       aria-labelledby="self-managed-increase-data-node-capacity"
       hidden="">
++++

include::increase-data-node-capacity.asciidoc[tag=self-managed]

++++
  </div>
</div>
++++

@ -0,0 +1,110 @@
// tag::cloud[]
In order to increase the disk capacity of the data nodes in your cluster:

. Log in to the {ess-console}[{ecloud} console].
+
. On the **Elasticsearch Service** panel, click the gear under the `Manage deployment` column that corresponds to the
name of your deployment.
+
. If autoscaling is available but not enabled, please enable it. You can do this by clicking the button
`Enable autoscaling` on a banner like the one below:
+
[role="screenshot"]
image::images/troubleshooting/disk/autoscaling_banner.png[Autoscaling banner,align="center"]
+
Or you can go to `Actions > Edit deployment`, check the `Autoscale` checkbox, and click `save` at the bottom of the page.
+
[role="screenshot"]
image::images/troubleshooting/disk/enable_autoscaling.png[Enabling autoscaling,align="center"]

. If autoscaling has succeeded, the cluster should return to `healthy` status. If the cluster is still out of disk,
please check if autoscaling has reached its limits. You will be notified about this by the following banner:
+
[role="screenshot"]
image::images/troubleshooting/disk/autoscaling_limits_banner.png[Autoscaling banner,align="center"]
+
Or you can go to `Actions > Edit deployment` and look for the label `LIMIT REACHED` as shown below:
+
[role="screenshot"]
image::images/troubleshooting/disk/reached_autoscaling_limits.png[Autoscaling limits reached,align="center"]
+
If you see the banner, click `Update autoscaling settings` to go to the `Edit` page. Otherwise, you are already
on the `Edit` page; click `Edit settings` to increase the autoscaling limits. After you perform the change, click
`save` at the bottom of the page.

// end::cloud[]

// tag::self-managed[]
In order to increase the data node capacity in your cluster, you will need to calculate the amount of extra disk space
needed.

. First, retrieve the relevant disk thresholds that will indicate how much space should be available. The
relevant thresholds are the <<cluster-routing-watermark-high, high watermark>> for all the tiers apart from the frozen
one, and the <<cluster-routing-flood-stage-frozen, frozen flood stage watermark>> for the frozen tier. The following
example demonstrates disk shortage in the hot tier, so we will only retrieve the high watermark:
+
[source,console]
----
GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
----
+
The response will look like this:
+
[source,console-result]
----
{
  "defaults": {
    "cluster": {
      "routing": {
        "allocation": {
          "disk": {
            "watermark": {
              "high": "90%",
              "high.max_headroom": "150GB"
            }
          }
        }
      }
    }
  }
}
----
// TEST[skip:illustration purposes only]
+
The above means that in order to resolve the disk shortage we need to either drop our disk usage below 90% or have
more than 150GB available. Read more on how this threshold works <<cluster-routing-watermark-high, here>>.

. The next step is to find out the current disk usage; this will indicate how much extra space is needed. For
simplicity, our example has one node, but you can apply the same to every node over the relevant threshold.
+
[source,console]
----
GET _cat/allocation?v&s=disk.avail&h=node,disk.percent,disk.avail,disk.total,disk.used,disk.indices,shards
----
+
The response will look like this:
+
[source,console-result]
----
node                disk.percent disk.avail disk.total disk.used disk.indices shards
instance-0000000000           91      4.6gb       35gb    31.1gb       29.9gb    111
----
// TEST[skip:illustration purposes only]

. The high watermark configuration indicates that the disk usage needs to drop below 90%. To achieve this, there are
two options:
- add an extra data node to the cluster (this requires that you have more than one shard in your cluster), or
- extend the disk space of the current node by approximately 20%, to allow this node to drop to 70% disk usage. This
will give the node enough space to not run out of disk again soon.

. In the case of adding another data node, the cluster will not recover immediately. It might take some time to
relocate some shards to the new node. You can check the progress here:
+
[source,console]
----
GET /_cat/shards?v&h=state,node&s=state
----
+
If the response shows shards in the `RELOCATING` state, it means that shards are still moving. Wait until all shards
transition to `STARTED` or until the health disk indicator turns `green`.
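+
Alternatively, the cluster health API can block until relocation has finished; the timeout value below is illustrative:
+
[source,console]
----
GET _cluster/health?wait_for_no_relocating_shards=true&timeout=60s
----
// TEST[skip:illustration purposes only]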
// end::self-managed[]

@ -0,0 +1,40 @@
++++
<div class="tabs" data-tab-group="host">
  <div role="tablist" aria-label="Increase master node capacity">
    <button role="tab"
            aria-selected="true"
            aria-controls="cloud-tab-increase-master-node-capacity"
            id="cloud-increase-master-node-capacity">
      Elasticsearch Service
    </button>
    <button role="tab"
            aria-selected="false"
            aria-controls="self-managed-tab-increase-master-node-capacity"
            id="self-managed-increase-master-node-capacity"
            tabindex="-1">
      Self-managed
    </button>
  </div>
  <div tabindex="0"
       role="tabpanel"
       id="cloud-tab-increase-master-node-capacity"
       aria-labelledby="cloud-increase-master-node-capacity">
++++

include::increase-master-node-capacity.asciidoc[tag=cloud]

++++
  </div>
  <div tabindex="0"
       role="tabpanel"
       id="self-managed-tab-increase-master-node-capacity"
       aria-labelledby="self-managed-increase-master-node-capacity"
       hidden="">
++++

include::increase-master-node-capacity.asciidoc[tag=self-managed]

++++
  </div>
</div>
++++

@ -0,0 +1,89 @@
// tag::cloud[]

. Log in to the {ess-console}[{ecloud} console].
+
. On the **Elasticsearch Service** panel, click the gear under the `Manage deployment` column that corresponds to the
name of your deployment.
+
. Go to `Actions > Edit deployment` and then go to the `Master instances` section:
+
[role="screenshot"]
image::images/troubleshooting/disk/increase-disk-capacity-master-node.png[Increase disk capacity of master nodes,align="center"]

. Choose a capacity configuration larger than the pre-selected one from the drop-down menu and click `save`. Wait for
the plan to be applied, and the problem should be resolved.

// end::cloud[]

// tag::self-managed[]
In order to increase the disk capacity of a master node, you will need to replace *all* the master nodes with
master nodes of higher disk capacity.

. First, retrieve the disk threshold that will indicate how much disk space is needed. The relevant threshold is
the <<cluster-routing-watermark-high, high watermark>> and can be retrieved via the following command:
+
[source,console]
----
GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
----
+
The response will look like this:
+
[source,console-result]
----
{
  "defaults": {
    "cluster": {
      "routing": {
        "allocation": {
          "disk": {
            "watermark": {
              "high": "90%",
              "high.max_headroom": "150GB"
            }
          }
        }
      }
    }
  }
}
----
// TEST[skip:illustration purposes only]
+
The above means that in order to resolve the disk shortage we need to either drop our disk usage below 90% or have
more than 150GB available. Read more on how this threshold works <<cluster-routing-watermark-high, here>>.

. The next step is to find out the current disk usage; this will allow you to calculate how much extra space is needed.
In the following example, we show only the master nodes for readability purposes:
+
[source,console]
----
GET /_cat/nodes?v&h=name,master,node.role,disk.used_percent,disk.used,disk.avail,disk.total
----
+
The response will look like this:
+
[source,console-result]
----
name                master node.role disk.used_percent disk.used disk.avail disk.total
instance-0000000000 *      m                     85.31     3.4gb      500mb        4gb
instance-0000000001 -      m                     50.02     2.1gb      1.9gb        4gb
instance-0000000002 -      m                     50.02     1.9gb      2.1gb        4gb
----
// TEST[skip:illustration purposes only]

. The desired situation is to drop the disk usage below the relevant threshold, in our example 90%. Consider adding
some padding, so it will not go over the threshold soon. If you have multiple master nodes, you need to ensure that
*all* master nodes will have this capacity. Assuming you have the new nodes ready, follow the next three steps for
every master node.

. Bring down one of the master nodes.
. Start up one of the new master nodes and wait for it to join the cluster. You can check this via:
+
[source,console]
----
GET /_cat/nodes?v&h=name,master,node.role,disk.used_percent,disk.used,disk.avail,disk.total
----
+
. Only after you have confirmed that your cluster has the initial number of master nodes, move on to the next one,
until all the initial master nodes have been replaced.
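+
As a final check that the cluster is back to its expected size and health, the following request keeps only the
relevant fields via `filter_path`:
+
[source,console]
----
GET _cluster/health?filter_path=status,number_of_nodes
----
// TEST[skip:illustration purposes only]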
// end::self-managed[]

@ -0,0 +1,40 @@
++++
<div class="tabs" data-tab-group="host">
  <div role="tablist" aria-label="Increase other node capacity">
    <button role="tab"
            aria-selected="true"
            aria-controls="cloud-tab-increase-other-node-capacity"
            id="cloud-increase-other-node-capacity">
      Elasticsearch Service
    </button>
    <button role="tab"
            aria-selected="false"
            aria-controls="self-managed-tab-increase-other-node-capacity"
            id="self-managed-increase-other-node-capacity"
            tabindex="-1">
      Self-managed
    </button>
  </div>
  <div tabindex="0"
       role="tabpanel"
       id="cloud-tab-increase-other-node-capacity"
       aria-labelledby="cloud-increase-other-node-capacity">
++++

include::increase-other-node-capacity.asciidoc[tag=cloud]

++++
  </div>
  <div tabindex="0"
       role="tabpanel"
       id="self-managed-tab-increase-other-node-capacity"
       aria-labelledby="self-managed-increase-other-node-capacity"
       hidden="">
++++

include::increase-other-node-capacity.asciidoc[tag=self-managed]

++++
  </div>
</div>
++++

@ -0,0 +1,94 @@
// tag::cloud[]

. Log in to the {ess-console}[{ecloud} console].
+
. On the **Elasticsearch Service** panel, click the gear under the `Manage deployment` column that corresponds to the
name of your deployment.
+
. Go to `Actions > Edit deployment` and then go to the `Coordinating instances` or the `Machine Learning instances`
section, depending on the roles listed in the diagnosis:
+
[role="screenshot"]
image::images/troubleshooting/disk/increase-disk-capacity-other-node.png[Increase disk capacity of other nodes,align="center"]

. Choose a capacity configuration larger than the pre-selected one from the drop-down menu and click `save`. Wait for
the plan to be applied, and the problem should be resolved.

// end::cloud[]

// tag::self-managed[]
In order to increase the disk capacity of any other node, you will need to replace the instance that has run out of
space with one of higher disk capacity.

. First, retrieve the disk threshold that will indicate how much disk space is needed. The relevant threshold is
the <<cluster-routing-watermark-high, high watermark>> and can be retrieved via the following command:
+
[source,console]
----
GET _cluster/settings?include_defaults&filter_path=*.cluster.routing.allocation.disk.watermark.high*
----
+
The response will look like this:
+
[source,console-result]
----
{
  "defaults": {
    "cluster": {
      "routing": {
        "allocation": {
          "disk": {
            "watermark": {
              "high": "90%",
              "high.max_headroom": "150GB"
            }
          }
        }
      }
    }
  }
}
----
// TEST[skip:illustration purposes only]
+
The above means that in order to resolve the disk shortage we need to either drop our disk usage below 90% or have
more than 150GB available. Read more on how this threshold works <<cluster-routing-watermark-high, here>>.

. The next step is to find out the current disk usage; this will allow you to calculate how much extra space is needed.
In the following example, we show only a machine learning node for readability purposes:
+
[source,console]
----
GET /_cat/nodes?v&h=name,node.role,disk.used_percent,disk.used,disk.avail,disk.total
----
+
The response will look like this:
+
[source,console-result]
----
name                node.role disk.used_percent disk.used disk.avail disk.total
instance-0000000000 l                     85.31     3.4gb      500mb        4gb
----
// TEST[skip:illustration purposes only]

. The desired situation is to drop the disk usage below the relevant threshold, in our example 90%. Consider adding
some padding, so it will not go over the threshold soon. Assuming you have the new node ready, add this node to the
cluster.

. Verify that the new node has joined the cluster:
+
[source,console]
----
GET /_cat/nodes?v&h=name,node.role,disk.used_percent,disk.used,disk.avail,disk.total
----
+
The response will look like this:
+
[source,console-result]
----
name                node.role disk.used_percent disk.used disk.avail disk.total
instance-0000000000 l                     85.31     3.4gb      500mb        4gb
instance-0000000001 l                     41.31     3.4gb      4.5gb        8gb
----
// TEST[skip:illustration purposes only]

. Now you can remove the instance that has run out of disk space.
// end::self-managed[]

@ -31,6 +31,13 @@ fix problems that an {es} deployment might encounter.
* <<start-ilm,Start index lifecycle management>>
* <<start-slm,Start snapshot lifecycle management>>

[discrete]
[[troubleshooting-capacity]]
=== Capacity
* <<fix-data-node-out-of-disk, Fix data nodes out of disk>>
* <<fix-master-node-out-of-disk, Fix master nodes out of disk>>
* <<fix-other-node-out-of-disk, Fix other role nodes out of disk>>

[discrete]
[[troubleshooting-snapshot]]
=== Snapshot and restore

@ -90,6 +97,12 @@ include::troubleshooting/data/increase-cluster-shard-limit.asciidoc[]

include::troubleshooting/corruption-issues.asciidoc[]

include::troubleshooting/disk/fix-data-node-out-of-disk.asciidoc[]

include::troubleshooting/disk/fix-master-node-out-of-disk.asciidoc[]

include::troubleshooting/disk/fix-other-node-out-of-disk.asciidoc[]

include::troubleshooting/data/start-ilm.asciidoc[]

include::troubleshooting/data/start-slm.asciidoc[]

@ -0,0 +1,23 @@
[[fix-data-node-out-of-disk]]
== Fix data nodes out of disk

{es} uses data nodes to distribute your data inside the cluster. If one or more of these nodes are running
out of space, {es} takes action to redistribute your data across the nodes, so that all nodes have enough available
disk space. If {es} cannot free up enough space on a node, then you will need to intervene in one
of two ways:

. <<increase-capacity-data-node, Increase the disk capacity of your cluster>>
. <<decrease-disk-usage-data-node, Reduce the disk usage by decreasing your data volume>>
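
Assuming a version where the health API is served under the `_health_report` endpoint (8.7 and later), you can inspect
just the disk indicator to see which nodes are affected before choosing between the two options:

[source,console]
----
GET _health_report/disk
----
// TEST[skip:illustration purposes only]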
|
||||
|
||||
[[increase-capacity-data-node]]
|
||||
=== Increase the disk capacity of data nodes
|
||||
include::{es-repo-dir}/tab-widgets/troubleshooting/disk/increase-data-node-capacity-widget.asciidoc[]
|
||||
|
||||
[[decrease-disk-usage-data-node]]
|
||||
=== Decrease the disk usage of data nodes
|
||||
In order to decrease the disk usage in your cluster without losing any data, you can try reducing the replicas of indices.
|
||||
|
||||
NOTE: Reducing the replicas of an index can potentially reduce search throughput and data redundancy. However, it
|
||||
can quickly give the cluster breathing room until a more permanent solution is in place.
|
||||
|
||||
include::{es-repo-dir}/tab-widgets/troubleshooting/disk/decrease-data-node-disk-usage-widget.asciidoc[]
|
@ -0,0 +1,8 @@
[[fix-master-node-out-of-disk]]
== Fix master nodes out of disk

{es} uses master nodes to coordinate the cluster. If the master node or any master-eligible nodes are running
out of space, you need to ensure that they have enough disk space to function. If the <<health-api, health API>>
reports that your master node is out of space, you need to increase the disk capacity of your master nodes.

include::{es-repo-dir}/tab-widgets/troubleshooting/disk/increase-master-node-capacity-widget.asciidoc[]

@ -0,0 +1,9 @@
[[fix-other-node-out-of-disk]]
== Fix other role nodes out of disk

{es} can use dedicated nodes to perform functions other than storing data or coordinating the cluster,
for example, machine learning. If one or more of these nodes are running out of space, you need to ensure that they
have enough disk space to function. If the <<health-api, health API>> reports that a node that is not a master and
does not contain data is out of space, you need to increase the disk capacity of this node.

include::{es-repo-dir}/tab-widgets/troubleshooting/disk/increase-other-node-capacity-widget.asciidoc[]