[[troubleshooting-unbalanced-cluster]]
== Troubleshooting an unbalanced cluster

Elasticsearch balances shards across data tiers to achieve a good compromise between:

* shard count
* disk usage
* write load (for indices in data streams)

****
If you're using Elastic Cloud Hosted, then you can use AutoOps to monitor your cluster. AutoOps significantly simplifies cluster management with performance recommendations, resource utilization visibility, real-time issue detection and resolution paths. For more information, refer to https://www.elastic.co/guide/en/cloud/current/ec-autoops.html[Monitor with AutoOps].
****

Elasticsearch does not take into account the amount or complexity of search queries when rebalancing shards.
Search load is instead balanced indirectly, through shard count and disk usage.
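
To see the relative weights that the balancer currently gives to shard count, disk usage, and write load, as well as the threshold that triggers shard movement, you can query the cluster settings API with defaults included. This is only a sketch; the `cluster.routing.allocation.balance.*` settings exist in recent Elasticsearch versions, but check <<shards-rebalancing-heuristics>> for the ones available in yours.

[source,console]
--------------------------------------------------
GET _cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.balance.*
--------------------------------------------------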

There is no guarantee that each individual component will be evenly spread across the nodes.
For example, some nodes might have fewer shards, or use less disk space,
but be assigned shards with higher write loads.

Use the <<cat-allocation,cat allocation command>> to list workloads per node:

[source,console]
--------------------------------------------------
GET /_cat/allocation?v
--------------------------------------------------
// TEST[s/^/PUT test\n{"settings": {"number_of_replicas": 0}}\n/]

The API returns the following response:

[source,text]
--------------------------------------------------
shards shards.undesired write_load.forecast disk.indices.forecast disk.indices disk.used disk.avail disk.total disk.percent host ip node node.role
1 0 0.0 260b 260b 47.3gb 43.4gb 100.7gb 46 127.0.0.1 127.0.0.1 CSUXak2 himrst
--------------------------------------------------
// TESTRESPONSE[s/\d+(\.\d+)?[tgmk]?b/\\d+(\\.\\d+)?[tgmk]?b/ s/46/\\d+/]
// TESTRESPONSE[s/CSUXak2 himrst/.+/ non_json]

This response contains the following information that influences balancing:

* `shards` is the current number of shards allocated to the node
* `shards.undesired` is the number of shards that need to be moved to other nodes to finish balancing
* `disk.indices.forecast` is the expected disk usage according to projected shard growth
* `write_load.forecast` is the projected total write load associated with this node

A cluster is considered balanced when all shards are in their desired locations,
which means that no further shard movements are planned (all `shards.undesired` values are equal to 0).
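
To check how far the cluster currently is from the desired balance, you can ask the same cat allocation API for just the columns shown above and sort by the number of undesired shards. A sketch only; `h` and `s` are the standard cat API column-selection and sort parameters:

[source,console]
--------------------------------------------------
GET /_cat/allocation?v&h=node,shards,shards.undesired,write_load.forecast,disk.indices.forecast,disk.percent&s=shards.undesired:desc
--------------------------------------------------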

Some operations, such as restarting or decommissioning a node, or changing cluster allocation settings,
are disruptive and might require multiple shards to move in order to rebalance the cluster.
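
To see how many shard movements such an operation has triggered, you can check the cluster health API; `relocating_shards`, `initializing_shards`, and `unassigned_shards` are standard fields of its response, and `filter_path` merely trims the output:

[source,console]
--------------------------------------------------
GET _cluster/health?filter_path=relocating_shards,initializing_shards,unassigned_shards
--------------------------------------------------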

Shard movement order is not deterministic and is mostly determined by how ready the source and target nodes are to move a shard.
While rebalancing is in progress, some nodes might appear busier than others.
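
To see which shard movements are in flight at a given moment, and which source and target nodes they involve, you can use the cat recovery API with `active_only`:

[source,console]
--------------------------------------------------
GET _cat/recovery?v&active_only=true
--------------------------------------------------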

When a shard is allocated to an undesired node, it uses the resources of its current node instead of the target.
This might cause a hotspot (disk or CPU) when multiple shards that have not yet been
moved to their corresponding targets reside on the same node.
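
One way to spot such a hotspot is to sort nodes by CPU usage with the cat nodes API. This is only a sketch; `cpu`, `load_1m`, and `disk.used_percent` are standard cat nodes columns, but verify the column names in your version:

[source,console]
--------------------------------------------------
GET _cat/nodes?v&h=name,node.role,cpu,load_1m,disk.used_percent&s=cpu:desc
--------------------------------------------------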

If a cluster takes a long time to finish rebalancing, you might find the following log entries:
[source,text]
--------------------------------------------------
[WARN][o.e.c.r.a.a.DesiredBalanceReconciler] [10%] of assigned shards (10/100) are not on their desired nodes, which exceeds the warn threshold of [10%]
--------------------------------------------------
This is not concerning as long as the number of such shards is decreasing and this warning appears occasionally,
for example after rolling restarts or changing allocation settings.

If the cluster logs this warning repeatedly over an extended period of time (multiple hours),
it is possible that the desired balance is diverging too far from the current state.

If so, increase the <<shards-rebalancing-heuristics,`cluster.routing.allocation.balance.threshold`>>
to reduce the sensitivity of the algorithm that tries to even out the shard count and disk usage within the cluster.
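
The threshold is a dynamic setting, so it can be raised through the cluster settings API. The value below is only an illustration; choose one that is appropriate for your cluster:

[source,console]
--------------------------------------------------
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.threshold": 3.0
  }
}
--------------------------------------------------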

Then reset the desired balance using the following API call:

[source,console,id=delete-desired-balance-request-example]
--------------------------------------------------
DELETE /_internal/desired_balance
--------------------------------------------------
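
If you want to inspect the computed desired balance before resetting it, the same internal endpoint can also be queried with GET. Keep in mind that `_internal` APIs are intended for debugging and their output format may change between versions:

[source,console]
--------------------------------------------------
GET /_internal/desired_balance
--------------------------------------------------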