mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 15:47:23 -04:00
This adds a new pipeline aggregation for calculating Kolmogorov–Smirnov test for a given sample and buckets path. For now, the buckets path resolution needs to be `_count`. But, this may be relaxed in the future. It accepts a parameter `fractions` that indicates the distribution of documents from some other pre-calculated sample. This particular version of the K-S test is Two-sample, meaning, it calculates if the `fractions` and the distribution of `_count` values in the buckets_path are taken from the same distribution. This in combination with the hypothesis alternatives (`less`, `greater`, `two_sided`) and sampling logic (`upper_tail`, `lower_tail`, `uniform`) allow for flexibility and usefulness when comparing two samples and determining the likelihood of them being from the same overall distribution. Usage: ``` POST correlate_latency/_search?size=0&filter_path=aggregations { "aggs": { "buckets": { "terms": { <1> "field": "version", "size": 2 }, "aggs": { "latency_ranges": { "range": { <2> "field": "latency", "ranges": [ { "to": 0.0 }, { "from": 0, "to": 105 }, { "from": 105, "to": 225 }, { "from": 225, "to": 445 }, { "from": 445, "to": 665 }, { "from": 665, "to": 885 }, { "from": 885, "to": 1115 }, { "from": 1115, "to": 1335 }, { "from": 1335, "to": 1555 }, { "from": 1555, "to": 1775 }, { "from": 1775 } ] } }, "ks_test": { <3> "bucket_count_ks_test": { "buckets_path": "latency_ranges>_count", "alternative": ["less", "greater", "two_sided"] } } } } } } ```
319 lines
8.9 KiB
Text
319 lines
8.9 KiB
Text
[role="xpack"]
|
|
[testenv="basic"]
|
|
[[search-aggregations-bucket-correlation-aggregation]]
|
|
=== Bucket correlation aggregation
|
|
++++
|
|
<titleabbrev>Bucket correlation aggregation</titleabbrev>
|
|
++++
|
|
|
|
experimental::[]
|
|
|
|
A sibling pipeline aggregation which executes a correlation function on the
|
|
configured sibling multi-bucket aggregation.
|
|
|
|
|
|
[[bucket-correlation-agg-syntax]]
|
|
==== Parameters
|
|
|
|
`buckets_path`::
|
|
(Required, string)
|
|
Path to the buckets that contain one set of values to correlate.
|
|
For syntax, see <<buckets-path-syntax>>.
|
|
|
|
`function`::
|
|
(Required, object)
|
|
The correlation function to execute.
|
|
+
|
|
.Properties of `function`
|
|
[%collapsible%open]
|
|
====
|
|
`count_correlation`:::
|
|
(Required^*^, object)
|
|
The configuration to calculate a count correlation. This function is designed for
|
|
determining the correlation of a term value and a given metric. Consequently, it
|
|
needs to meet the following requirements.
|
|
|
|
* The `buckets_path` must point to a `_count` metric.
|
|
* The total count of all the `bucket_path` count values must be less than or equal to `indicator.doc_count`.
|
|
* When utilizing this function, an initial calculation to gather the required `indicator` values is required.
|
|
|
|
.Properties of `count_correlation`
|
|
[%collapsible%open]
|
|
=====
|
|
`indicator`:::
|
|
(Required, object)
|
|
The indicator with which to correlate the configured `bucket_path` values.
|
|
|
|
.Properties of `indicator`
|
|
[%collapsible%open]
|
|
=====
|
|
`expectations`:::
|
|
(Required, array)
|
|
An array of numbers with which to correlate the configured `bucket_path` values. The length of this value must always equal
|
|
the number of buckets returned by the `bucket_path`.
|
|
|
|
`fractions`:::
|
|
(Optional, array)
|
|
An array of fractions to use when averaging and calculating variance. This should be used if the pre-calculated data and the
|
|
`buckets_path` have known gaps. The length of `fractions`, if provided, must equal `expectations`.
|
|
|
|
`doc_count`:::
|
|
(Required, integer)
|
|
The total number of documents that initially created the `expectations`. It's required to be greater than or equal to the sum
|
|
of all values in the `buckets_path` as this is the originating superset of data to which the term values are correlated.
|
|
=====
|
|
=====
|
|
====
|
|
|
|
==== Syntax
|
|
|
|
A `bucket_correlation` aggregation looks like this in isolation:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"bucket_correlation": {
|
|
"buckets_path": "range_values>_count", <1>
|
|
"function": {
|
|
"count_correlation": { <2>
|
|
"expectations": [...],
|
|
"doc_count": 10000
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
<1> The buckets containing the values to correlate against.
|
|
<2> The correlation function definition.
|
|
|
|
|
|
[[bucket-correlation-agg-example]]
|
|
==== Example
|
|
|
|
The following snippet correlates the individual terms in the field `version` with the `latency` metric. Not shown
|
|
is the pre-calculation of the `latency` indicator values, which was done utilizing the
|
|
<<search-aggregations-metrics-percentile-aggregation,percentiles>> aggregation.
|
|
|
|
This example is only using the 10s percentiles.
|
|
|
|
[source,console]
|
|
-------------------------------------------------
|
|
POST correlate_latency/_search?size=0&filter_path=aggregations
|
|
{
|
|
"aggs": {
|
|
"buckets": {
|
|
"terms": { <1>
|
|
"field": "version",
|
|
"size": 2
|
|
},
|
|
"aggs": {
|
|
"latency_ranges": {
|
|
"range": { <2>
|
|
"field": "latency",
|
|
"ranges": [
|
|
{ "to": 0.0 },
|
|
{ "from": 0, "to": 105 },
|
|
{ "from": 105, "to": 225 },
|
|
{ "from": 225, "to": 445 },
|
|
{ "from": 445, "to": 665 },
|
|
{ "from": 665, "to": 885 },
|
|
{ "from": 885, "to": 1115 },
|
|
{ "from": 1115, "to": 1335 },
|
|
{ "from": 1335, "to": 1555 },
|
|
{ "from": 1555, "to": 1775 },
|
|
{ "from": 1775 }
|
|
]
|
|
}
|
|
},
|
|
"bucket_correlation": { <3>
|
|
"bucket_correlation": {
|
|
"buckets_path": "latency_ranges>_count",
|
|
"function": {
|
|
"count_correlation": {
|
|
"indicator": {
|
|
"expectations": [0, 52.5, 165, 335, 555, 775, 1000, 1225, 1445, 1665, 1775],
|
|
"doc_count": 200
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
-------------------------------------------------
|
|
// TEST[setup:correlate_latency]
|
|
|
|
<1> The term buckets containing a range aggregation and the bucket correlation aggregation. Both are utilized to calculate
|
|
the correlation of the term values with the latency.
|
|
<2> The range aggregation on the latency field. The ranges were created referencing the percentiles of the latency field.
|
|
<3> The bucket correlation aggregation that calculates the correlation of the number of term values within each range
|
|
and the previously calculated indicator values.
|
|
|
|
And the following may be the response:
|
|
|
|
[source,console-result]
|
|
----
|
|
{
|
|
"aggregations" : {
|
|
"buckets" : {
|
|
"doc_count_error_upper_bound" : 0,
|
|
"sum_other_doc_count" : 0,
|
|
"buckets" : [
|
|
{
|
|
"key" : "1.0",
|
|
"doc_count" : 100,
|
|
"latency_ranges" : {
|
|
"buckets" : [
|
|
{
|
|
"key" : "*-0.0",
|
|
"to" : 0.0,
|
|
"doc_count" : 0
|
|
},
|
|
{
|
|
"key" : "0.0-105.0",
|
|
"from" : 0.0,
|
|
"to" : 105.0,
|
|
"doc_count" : 1
|
|
},
|
|
{
|
|
"key" : "105.0-225.0",
|
|
"from" : 105.0,
|
|
"to" : 225.0,
|
|
"doc_count" : 9
|
|
},
|
|
{
|
|
"key" : "225.0-445.0",
|
|
"from" : 225.0,
|
|
"to" : 445.0,
|
|
"doc_count" : 0
|
|
},
|
|
{
|
|
"key" : "445.0-665.0",
|
|
"from" : 445.0,
|
|
"to" : 665.0,
|
|
"doc_count" : 0
|
|
},
|
|
{
|
|
"key" : "665.0-885.0",
|
|
"from" : 665.0,
|
|
"to" : 885.0,
|
|
"doc_count" : 0
|
|
},
|
|
{
|
|
"key" : "885.0-1115.0",
|
|
"from" : 885.0,
|
|
"to" : 1115.0,
|
|
"doc_count" : 10
|
|
},
|
|
{
|
|
"key" : "1115.0-1335.0",
|
|
"from" : 1115.0,
|
|
"to" : 1335.0,
|
|
"doc_count" : 20
|
|
},
|
|
{
|
|
"key" : "1335.0-1555.0",
|
|
"from" : 1335.0,
|
|
"to" : 1555.0,
|
|
"doc_count" : 20
|
|
},
|
|
{
|
|
"key" : "1555.0-1775.0",
|
|
"from" : 1555.0,
|
|
"to" : 1775.0,
|
|
"doc_count" : 20
|
|
},
|
|
{
|
|
"key" : "1775.0-*",
|
|
"from" : 1775.0,
|
|
"doc_count" : 20
|
|
}
|
|
]
|
|
},
|
|
"bucket_correlation" : {
|
|
"value" : 0.8402398981360937
|
|
}
|
|
},
|
|
{
|
|
"key" : "2.0",
|
|
"doc_count" : 100,
|
|
"latency_ranges" : {
|
|
"buckets" : [
|
|
{
|
|
"key" : "*-0.0",
|
|
"to" : 0.0,
|
|
"doc_count" : 0
|
|
},
|
|
{
|
|
"key" : "0.0-105.0",
|
|
"from" : 0.0,
|
|
"to" : 105.0,
|
|
"doc_count" : 19
|
|
},
|
|
{
|
|
"key" : "105.0-225.0",
|
|
"from" : 105.0,
|
|
"to" : 225.0,
|
|
"doc_count" : 11
|
|
},
|
|
{
|
|
"key" : "225.0-445.0",
|
|
"from" : 225.0,
|
|
"to" : 445.0,
|
|
"doc_count" : 20
|
|
},
|
|
{
|
|
"key" : "445.0-665.0",
|
|
"from" : 445.0,
|
|
"to" : 665.0,
|
|
"doc_count" : 20
|
|
},
|
|
{
|
|
"key" : "665.0-885.0",
|
|
"from" : 665.0,
|
|
"to" : 885.0,
|
|
"doc_count" : 20
|
|
},
|
|
{
|
|
"key" : "885.0-1115.0",
|
|
"from" : 885.0,
|
|
"to" : 1115.0,
|
|
"doc_count" : 10
|
|
},
|
|
{
|
|
"key" : "1115.0-1335.0",
|
|
"from" : 1115.0,
|
|
"to" : 1335.0,
|
|
"doc_count" : 0
|
|
},
|
|
{
|
|
"key" : "1335.0-1555.0",
|
|
"from" : 1335.0,
|
|
"to" : 1555.0,
|
|
"doc_count" : 0
|
|
},
|
|
{
|
|
"key" : "1555.0-1775.0",
|
|
"from" : 1555.0,
|
|
"to" : 1775.0,
|
|
"doc_count" : 0
|
|
},
|
|
{
|
|
"key" : "1775.0-*",
|
|
"from" : 1775.0,
|
|
"doc_count" : 0
|
|
}
|
|
]
|
|
},
|
|
"bucket_correlation" : {
|
|
"value" : -0.5759855613334943
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
----
|