mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 07:37:19 -04:00
There's a note in the docs saying we only consider shard count and not disk usage which is no longer true. This commit fixes the note to reflect today's implementation.
193 lines
11 KiB
Text
193 lines
11 KiB
Text
[[disk-based-shard-allocation]]
|
|
==== Disk-based shard allocation settings
|
|
[[disk-based-shard-allocation-description]]
|
|
// tag::disk-based-shard-allocation-description-tag[]
|
|
|
|
The disk-based shard allocator ensures that all nodes have enough disk space
|
|
without performing more shard movements than necessary. It allocates shards
|
|
based on a pair of thresholds known as the _low watermark_ and the _high
|
|
watermark_. Its primary goal is to ensure that no node exceeds the high
|
|
watermark, or at least that any such overage is only temporary. If a node
|
|
exceeds the high watermark then {es} will solve this by moving some of its
|
|
shards onto other nodes in the cluster.
|
|
|
|
NOTE: It is normal for nodes to temporarily exceed the high watermark from time
|
|
to time.
|
|
|
|
The allocator also tries to keep nodes clear of the high watermark by
|
|
forbidding the allocation of more shards to a node that exceeds the low
|
|
watermark. Importantly, if all of your nodes have exceeded the low watermark
|
|
then no new shards can be allocated and {es} will not be able to move any
|
|
shards between nodes in order to keep the disk usage below the high watermark.
|
|
You must ensure that your cluster has enough disk space in total and that there
|
|
are always some nodes below the low watermark.
|
|
|
|
Shard movements triggered by the disk-based shard allocator must also satisfy
|
|
all other shard allocation rules such as
|
|
<<cluster-shard-allocation-filtering,allocation filtering>> and
|
|
<<forced-awareness,forced awareness>>. If these rules are too strict then they
|
|
can also prevent the shard movements needed to keep the nodes' disk usage under
|
|
control. If you are using <<data-tiers,data tiers>> then {es} automatically
|
|
configures allocation filtering rules to place shards within the appropriate
|
|
tier, which means that the disk-based shard allocator works independently
|
|
within each tier.
|
|
|
|
If a node is filling up its disk faster than {es} can move shards elsewhere
|
|
then there is a risk that the disk will completely fill up. To prevent this, as
|
|
a last resort, once the disk usage reaches the _flood-stage_ watermark {es}
|
|
will block writes to indices with a shard on the affected node. It will also
|
|
continue to move shards onto the other nodes in the cluster. When disk usage
|
|
on the affected node drops below the high watermark, {es} automatically removes
|
|
the write block. Refer to <<fix-watermark-errors,Fix watermark errors>> to
|
|
resolve persistent watermark errors.
|
|
|
|
[[disk-based-shard-allocation-does-not-balance]]
|
|
[TIP]
|
|
====
|
|
It is normal for the nodes in your cluster to be using very different amounts
|
|
of disk space. The <<shards-rebalancing-settings,balance>> of the cluster
|
|
depends on a combination of factors which includes the number of shards on each
|
|
node, the indices to which those shards belong, and the resource needs of each
|
|
shard in terms of its size on disk and its CPU usage. {es} must trade off all
|
|
of these factors against each other, and a cluster which is balanced when
|
|
looking at the combination of all of these factors may not appear to be
|
|
balanced if you focus attention on just one of them.
|
|
====
|
|
|
|
You can use the following settings to control disk-based allocation:
|
|
|
|
[[cluster-routing-disk-threshold]]
|
|
// tag::cluster-routing-disk-threshold-tag[]
|
|
`cluster.routing.allocation.disk.threshold_enabled`::
|
|
(<<dynamic-cluster-setting,Dynamic>>)
|
|
Defaults to `true`. Set to `false` to disable the disk allocation decider. Upon disabling, it will also remove any existing `index.blocks.read_only_allow_delete` index blocks.
|
|
// end::cluster-routing-disk-threshold-tag[]
|
|
|
|
[[cluster-routing-watermark-low]]
|
|
// tag::cluster-routing-watermark-low-tag[]
|
|
`cluster.routing.allocation.disk.watermark.low` {ess-icon}::
|
|
(<<dynamic-cluster-setting,Dynamic>>)
|
|
Controls the low watermark for disk usage. It defaults to `85%`, meaning that {es} will not allocate shards to nodes that have more than 85% disk used. It can alternatively be set to a ratio value, e.g., `0.85`. It can also be set to an absolute byte value (like `500mb`) to prevent {es} from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices but will prevent their replicas from being allocated.
|
|
// end::cluster-routing-watermark-low-tag[]
|
|
|
|
`cluster.routing.allocation.disk.watermark.low.max_headroom`::
|
|
(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the low watermark (in case of a percentage/ratio value).
|
|
Defaults to 200GB when `cluster.routing.allocation.disk.watermark.low` is not explicitly set.
|
|
This caps the amount of free space required.
|
|
|
|
[[cluster-routing-watermark-high]]
|
|
// tag::cluster-routing-watermark-high-tag[]
|
|
`cluster.routing.allocation.disk.watermark.high` {ess-icon}::
|
|
(<<dynamic-cluster-setting,Dynamic>>)
|
|
Controls the high watermark. It defaults to `90%`, meaning that {es} will attempt to relocate shards away from a node whose disk usage is above 90%. It can alternatively be set to a ratio value, e.g., `0.9`. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not.
|
|
// end::cluster-routing-watermark-high-tag[]
|
|
|
|
`cluster.routing.allocation.disk.watermark.high.max_headroom`::
|
|
(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the high watermark (in case of a percentage/ratio value).
|
|
Defaults to 150GB when `cluster.routing.allocation.disk.watermark.high` is not explicitly set.
|
|
This caps the amount of free space required.
|
|
|
|
`cluster.routing.allocation.disk.watermark.enable_for_single_data_node`::
|
|
(<<static-cluster-setting,Static>>)
|
|
In earlier releases, the default behaviour was to disregard disk watermarks for a single
|
|
data node cluster when making an allocation decision. This is deprecated behavior
|
|
since 7.14 and has been removed in 8.0. The only valid value for this setting
|
|
is now `true`. The setting will be removed in a future release.
|
|
|
|
[[cluster-routing-flood-stage]]
|
|
// tag::cluster-routing-flood-stage-tag[]
|
|
`cluster.routing.allocation.disk.watermark.flood_stage` {ess-icon}::
|
|
+
|
|
--
|
|
(<<dynamic-cluster-setting,Dynamic>>)
|
|
Controls the flood stage watermark, which defaults to 95%. {es} enforces a read-only index block (`index.blocks.read_only_allow_delete`) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block is automatically released when the disk utilization falls below the high watermark. Similarly to the low and high watermark values, it can alternatively be set to a ratio value, e.g., `0.95`, or an absolute byte value.
|
|
|
|
An example of resetting the read-only index block on the `my-index-000001` index:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT /my-index-000001/_settings
|
|
{
|
|
"index.blocks.read_only_allow_delete": null
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[setup:my_index]
|
|
--
|
|
// end::cluster-routing-flood-stage-tag[]
|
|
|
|
`cluster.routing.allocation.disk.watermark.flood_stage.max_headroom`::
|
|
(<<dynamic-cluster-setting,Dynamic>>) Controls the max headroom for the flood stage watermark (in case of a percentage/ratio value).
|
|
Defaults to 100GB when
|
|
`cluster.routing.allocation.disk.watermark.flood_stage` is not explicitly set.
|
|
This caps the amount of free space required.
|
|
|
|
NOTE: You cannot mix the usage of percentage/ratio values and byte values across
|
|
the `cluster.routing.allocation.disk.watermark.low`, `cluster.routing.allocation.disk.watermark.high`,
|
|
and `cluster.routing.allocation.disk.watermark.flood_stage` settings. Either all values
|
|
are set to percentage/ratio values, or all are set to byte values. This enforcement is
|
|
so that {es} can validate that the settings are internally consistent, ensuring that the
|
|
low disk threshold is less than the high disk threshold, and the high disk threshold is
|
|
less than the flood stage threshold. A similar comparison check is done for the max
|
|
headroom values.
|
|
|
|
[[cluster-routing-flood-stage-frozen]]
|
|
// tag::cluster-routing-flood-stage-tag[]
|
|
`cluster.routing.allocation.disk.watermark.flood_stage.frozen` {ess-icon}::
|
|
(<<dynamic-cluster-setting,Dynamic>>)
|
|
Controls the flood stage watermark for dedicated frozen nodes, which defaults to
|
|
95%.
|
|
|
|
`cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom` {ess-icon}::
|
|
(<<dynamic-cluster-setting,Dynamic>>)
|
|
Controls the max headroom for the flood stage watermark (in case of a
|
|
percentage/ratio value) for dedicated frozen nodes. Defaults to 20GB when
|
|
`cluster.routing.allocation.disk.watermark.flood_stage.frozen` is not explicitly
|
|
set. This caps the amount of free space required on dedicated frozen nodes.
|
|
|
|
`cluster.info.update.interval`::
|
|
(<<dynamic-cluster-setting,Dynamic>>)
|
|
How often {es} should check on disk usage for each node in the
|
|
cluster. Defaults to `30s`.
|
|
|
|
NOTE: Percentage values refer to used disk space, while byte values refer to
|
|
free disk space. This can be confusing, since it flips the meaning of high and
|
|
low. For example, it makes sense to set the low watermark to 10gb and the high
|
|
watermark to 5gb, but not the other way around.
|
|
|
|
An example of updating the low watermark to at least 100 gigabytes free, a high
|
|
watermark of at least 50 gigabytes free, and a flood stage watermark of 10
|
|
gigabytes free, and updating the information about the cluster every minute:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT _cluster/settings
|
|
{
|
|
"persistent": {
|
|
"cluster.routing.allocation.disk.watermark.low": "100gb",
|
|
"cluster.routing.allocation.disk.watermark.high": "50gb",
|
|
"cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
|
|
"cluster.info.update.interval": "1m"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Concerning the max headroom settings for the watermarks, please note
|
|
that these apply only in the case that the watermark settings are percentages/ratios.
|
|
The aim of a max headroom value is to cap the required free disk space before hitting
|
|
the respective watermark. This is especially useful for servers with larger
|
|
disks, where a percentage/ratio watermark could translate to a big free disk space requirement,
|
|
and the max headroom can be used to cap the required free disk space amount.
|
|
As an example, let us take the default settings for the flood watermark.
|
|
It has a 95% default value, and the flood max headroom setting has a default value of 100GB.
|
|
This means that:
|
|
|
|
* For a smaller disk, e.g., of 100GB, the flood watermark will hit at 95%, meaning at 5GB
|
|
of free space, since 5GB is smaller than the 100GB max headroom value.
|
|
* For a larger disk, e.g., of 100TB, the flood watermark will hit at 100GB of free space.
|
|
That is because the 95% flood watermark alone would require 5TB of free disk space, but
|
|
that is capped by the max headroom setting to 100GB.
|
|
|
|
Finally, the max headroom settings have their default values only if their respective watermark
|
|
settings are not explicitly set (thus, they have their default percentage values).
|
|
If watermarks are explicitly set, then the max headroom settings do not have their default values,
|
|
and would need to be explicitly set if they are desired.
|