elasticsearch

mirror of https://github.com/elastic/elasticsearch.git synced 2025-04-25 15:47:23 -04:00

James Dorfman e99d287fbb Add Variable Width Histogram Aggregation (#42035)

Implements a new histogram aggregation called `variable_width_histogram` which dynamically determines bucket intervals based on document groupings. These groups are determined by running a one-pass clustering algorithm on each shard and then reducing each shard's clusters using an agglomerative clustering algorithm. This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The algorithm was lightly inspired by [this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches a small number of documents to sample the data and determine initial clusters. Each subsequent document is then placed into one of these clusters, or into a new cluster if it is an outlier. This algorithm is described in more detail in the aggregation's docs.

At reduce time, a [hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304) repeatedly merges the closest buckets from all shards (based on their centroids) until the target number of buckets is reached.

The final values produced by this aggregation are approximate. Each bucket's min value is used as its key in the histogram. Furthermore, buckets are merged based on their centroids and not their bounds, so adjacent buckets may overlap after reduction. Because each bucket's key is its min, this overlap is not shown in the final histogram. However, when such overlap occurs, we set the key of the bucket with the larger centroid to the midpoint between its minimum and the smaller bucket’s maximum: `min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to increase the accuracy of the clustering.

Nodes are unable to share centroids during the shard-level clustering phase. In the future, resolving https://github.com/elastic/elasticsearch/issues/50863 would let us solve this issue.

It doesn’t make sense for this aggregation to support the `min_doc_count` parameter, since clusters are determined dynamically. The `order` parameter is not supported here to keep this large PR from becoming too complex.		2020-06-23 09:26:54 -04:00
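
The reduce step and the overlap heuristic described in the commit message can be illustrated with a small Python sketch. This is not the Elasticsearch implementation: the `Bucket` class and the `merge`, `reduce_buckets`, and `bucket_keys` functions are hypothetical names, and the document-count-weighted centroid used when merging is an assumption the commit message does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bucket:
    # Hypothetical shard-level cluster summary (not an Elasticsearch class).
    lo: float        # minimum value seen in the cluster
    hi: float        # maximum value seen in the cluster
    centroid: float  # mean of the values in the cluster
    doc_count: int


def merge(a: Bucket, b: Bucket) -> Bucket:
    # Assumption: the merged centroid is the doc-count-weighted mean.
    total = a.doc_count + b.doc_count
    centroid = (a.centroid * a.doc_count + b.centroid * b.doc_count) / total
    return Bucket(min(a.lo, b.lo), max(a.hi, b.hi), centroid, total)


def reduce_buckets(buckets: List[Bucket], target: int) -> List[Bucket]:
    # Agglomerative reduction: repeatedly merge the two buckets whose
    # centroids are closest until only `target` buckets remain.
    if target < 1:
        raise ValueError("target must be at least 1")
    result = sorted(buckets, key=lambda b: b.centroid)
    while len(result) > target:
        # In one dimension the closest pair is always adjacent once sorted.
        i = min(range(len(result) - 1),
                key=lambda j: result[j + 1].centroid - result[j].centroid)
        result[i:i + 2] = [merge(result[i], result[i + 1])]
    return result


def bucket_keys(buckets: List[Bucket]) -> List[float]:
    # Each key is the bucket's min, except when the bucket overlaps its
    # lower neighbour: key = (min[large] + max[small]) / 2.
    keys = []
    for i, b in enumerate(buckets):
        key = b.lo
        if i > 0 and buckets[i - 1].hi > b.lo:
            key = (b.lo + buckets[i - 1].hi) / 2
        keys.append(key)
    return keys


shard_a = [Bucket(0.0, 4.0, 2.0, 10), Bucket(10.0, 14.0, 12.0, 10)]
shard_b = [Bucket(3.0, 7.0, 5.0, 10), Bucket(11.0, 20.0, 15.0, 10)]
reduced = reduce_buckets(shard_a + shard_b, target=3)
print(bucket_keys(reduced))  # [0.0, 10.0, 12.5]
```

In this example the two lowest buckets merge into one, and the top two buckets overlap (14.0 > 11.0), so the last key moves from 11.0 to the midpoint 12.5.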
..
bucket	Add Variable Width Histogram Aggregation (#42035)	2020-06-23 09:26:54 -04:00
matrix	[DOCS] IDs for doc snippets (#49008)	2019-11-25 15:30:00 +01:00
metrics	Missing comma between value types (#58383)	2020-06-19 23:01:25 +02:00
pipeline	Added standard deviation / variance sampling to extended stats (#49782)	2020-06-10 15:00:50 -04:00
bucket.asciidoc	Increase search.max_buckets to 65,535 (#57042)	2020-06-03 11:54:48 -04:00
matrix.asciidoc	refactor matrix agg documentation from modules to main agg section	2016-06-06 07:39:00 -05:00
metrics.asciidoc	[DOCS] Sort metric and pipeline agg docs (#56613)	2020-05-15 16:34:47 -04:00
misc.asciidoc	[DOCS] Links transforms in aggregation docs (#52563)	2020-02-21 08:22:04 +01:00
pipeline.asciidoc	[DOCS] Sort metric and pipeline agg docs (#56613)	2020-05-15 16:34:47 -04:00