Implements a new histogram aggregation called `variable_width_histogram` which dynamically determines bucket intervals based on document groupings. These groups are determined by running a one-pass clustering algorithm on each shard and then reducing each shard's clusters using an agglomerative clustering algorithm. This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The algorithm was lightly inspired by [this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches a small number of documents to sample the data and determine initial clusters. Subsequent documents are then placed into one of these clusters, or a new one if they are an outlier. This algorithm is described in more detail in the aggregation's docs.

At reduce time, a [hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304) continually merges the closest buckets from all shards (based on their centroids) until the target number of buckets is reached.

The final values produced by this aggregation are approximate. Each bucket's min value is used as its key in the histogram. Furthermore, buckets are merged based on their centroids and not their bounds, so it is possible that adjacent buckets will overlap after reduction. Because each bucket's key is its min, this overlap is not shown in the final histogram. However, when such an overlap occurs, we set the key of the bucket with the larger centroid to the midpoint between its minimum and the smaller bucket's maximum: `min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to increase the accuracy of the clustering.

Nodes are unable to share centroids during the shard-level clustering phase. In the future, resolving https://github.com/elastic/elasticsearch/issues/50863 would let us solve this issue.

It doesn't make sense for this aggregation to support the `min_doc_count` parameter, since clusters are determined dynamically. The `order` parameter is not supported here, to keep this large PR from becoming too complex.

Co-authored-by: James Dorfman <jamesdorfman@users.noreply.github.com>
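
A minimal sketch of that overlap heuristic (Python, purely illustrative; the field names are hypothetical, not the actual implementation):

```python
# After reduction, buckets are sorted by centroid but may still overlap.
# When two adjacent buckets do, move the larger bucket's key (its min) to
# the midpoint: min[large] = (min[large] + max[small]) / 2
def adjust_overlapping_keys(buckets):
    """buckets: list of dicts with 'min' and 'max', sorted by centroid."""
    for small, large in zip(buckets, buckets[1:]):
        if large["min"] < small["max"]:  # adjacent buckets overlap
            large["min"] = (large["min"] + small["max"]) / 2
    return buckets
```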

[[search-aggregations-bucket-variablewidthhistogram-aggregation]]
=== Variable Width Histogram Aggregation

This is a multi-bucket aggregation similar to <<search-aggregations-bucket-histogram-aggregation>>.
However, the width of each bucket is not specified. Rather, a target number of buckets is provided and bucket intervals
are dynamically determined based on the document distribution. This is done using a simple one-pass document clustering algorithm
that aims to obtain low distances between bucket centroids. Unlike other multi-bucket aggregations, the intervals will not
necessarily have a uniform width.

TIP: The number of buckets returned will always be less than or equal to the target number.

Requesting a target of 2 buckets.

[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "variable_width_histogram" : {
                "field" : "price",
                "buckets" : 2
            }
        }
    }
}
--------------------------------------------------
// TEST[setup:sales]

Response:

[source,console-result]
--------------------------------------------------
{
    ...
    "aggregations": {
        "prices" : {
            "buckets": [
                {
                    "min": 10.0,
                    "key": 30.0,
                    "max": 50.0,
                    "doc_count": 2
                },
                {
                    "min": 150.0,
                    "key": 185.0,
                    "max": 200.0,
                    "doc_count": 5
                }
            ]
        }
    }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

==== Clustering Algorithm
Each shard fetches the first `initial_buffer` documents and stores them in memory. Once the buffer is full, these documents
are sorted and linearly separated into `3/4 * shard_size` buckets.
Next, each remaining document is either collected into the nearest bucket, or placed into a new bucket if it is distant
from all the existing ones. At most `shard_size` total buckets are created.
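
The collection phase can be pictured with the following sketch (Python, purely illustrative; the
actual implementation is in Java, and `is_distant` here is a hypothetical stand-in for the real
outlier criterion):

[source,python]
--------------------------------------------------
# Illustrative sketch of the shard-level, one-pass clustering described above.
def collect(values, shard_size, initial_buffer, is_distant):
    buffer = sorted(values[:initial_buffer])
    # Linearly separate the sorted sample into 3/4 * shard_size initial buckets.
    num_initial = max(1, (3 * shard_size) // 4)
    chunk = max(1, len(buffer) // num_initial)
    buckets = [buffer[i:i + chunk] for i in range(0, len(buffer), chunk)]
    for v in values[initial_buffer:]:
        # Find the bucket whose centroid (mean) is nearest to the new value.
        nearest = min(buckets, key=lambda b: abs(v - sum(b) / len(b)))
        if is_distant(v, nearest) and len(buckets) < shard_size:
            buckets.append([v])  # outlier: start a new bucket
        else:
            nearest.append(v)   # otherwise join the nearest existing bucket
    return buckets
--------------------------------------------------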

In the reduce step, the coordinating node sorts the buckets from all shards by their centroids. Then, the two buckets
with the nearest centroids are repeatedly merged until the target number of buckets is achieved.
This merging procedure is a form of https://en.wikipedia.org/wiki/Hierarchical_clustering[agglomerative hierarchical clustering].
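
A minimal sketch of this merge loop, under the simplifying assumption that buckets reduce to
`(centroid, doc_count)` pairs (the actual implementation also tracks bucket bounds and is more
efficient than this linear scan):

[source,python]
--------------------------------------------------
# Repeatedly merge the two buckets with the nearest centroids.
def reduce_buckets(buckets, target):
    """buckets: list of (centroid, doc_count) pairs from all shards."""
    buckets = sorted(buckets)
    while len(buckets) > target:
        # After sorting, the nearest pair of centroids is always adjacent.
        i = min(range(len(buckets) - 1),
                key=lambda j: buckets[j + 1][0] - buckets[j][0])
        (c1, n1), (c2, n2) = buckets[i], buckets[i + 1]
        # Replace the pair with one bucket at their doc-count-weighted centroid.
        buckets[i:i + 2] = [((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)]
    return buckets
--------------------------------------------------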

TIP: A shard can return fewer than `shard_size` buckets, but it cannot return more.

==== Shard size
The `shard_size` parameter specifies the number of buckets that the coordinating node will request from each shard.
A higher `shard_size` leads each shard to produce smaller buckets. This reduces the likelihood of buckets overlapping
after the reduction step. Increasing the `shard_size` will improve the accuracy of the histogram, but it will
also make it more expensive to compute the final result because bigger priority queues will have to be managed on a
shard level, and the data transfers between the nodes and the client will be larger.

TIP: Parameters `buckets`, `shard_size`, and `initial_buffer` are optional. By default, `buckets = 10`, `shard_size = 500` and `initial_buffer = min(50 * shard_size, 50000)`.
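
For example, the following request sets all three parameters explicitly (the values here are purely
illustrative, not recommendations):

[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "variable_width_histogram" : {
                "field" : "price",
                "buckets" : 2,
                "shard_size" : 100,
                "initial_buffer" : 5000
            }
        }
    }
}
--------------------------------------------------
// TEST[setup:sales]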

==== Initial Buffer
The `initial_buffer` parameter can be used to specify the number of individual documents that will be stored in memory
on a shard before the initial bucketing algorithm is run. Bucket distribution is determined using this sample
of `initial_buffer` documents. So, although a higher `initial_buffer` will use more memory, it will lead to more representative
clusters.

==== Bucket bounds are approximate
During the reduce step, the coordinating node repeatedly merges the two buckets with the nearest centroids. If two buckets have
overlapping bounds but distant centroids, then it is possible that they will not be merged. Because of this, after
reduction the maximum value in some interval (`max`) might be greater than the minimum value in the subsequent
bucket (`min`). To reduce the impact of this error, when such an overlap occurs the bound between these intervals is adjusted to be `(max + min) / 2`.
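
For example, if one bucket ends at `max = 160` and the next bucket starts at `min = 150`, the
boundary between them is moved to `(160 + 150) / 2 = 155`.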

TIP: Bucket bounds are very sensitive to outliers.