Commit graph

587 commits

Author SHA1 Message Date
James Rodewig
de59fd2b43
[DOCS] Include index in range agg snippets (#77290) (#77568)
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>

Co-authored-by: xiaozhiliaoo(小知了) <772654204@qq.com>
2021-09-10 12:36:05 -04:00
Benjamin Trent
100f222650
Adds support for the rate aggregation under a composite agg (#76992)
rate aggregation should support being a sub-aggregation
of a composite agg.

The catch is that the composite aggregation source
must be a date histogram. Other sources can be present
but their must be exactly one date histogram source
otherwise the rate aggregation does not know which
interval to compare its unit rate to.

closes https://github.com/elastic/elasticsearch/issues/76988
2021-09-01 07:29:13 -04:00
James Rodewig
8ba07b4b97
[DOCS] Add filter example to nested agg docs (#76118)
Changes:
* Simplifies and formats several snippets in the nested agg docs
* Adds a `filter` sub-aggregration example
2021-08-05 09:48:28 -04:00
James Rodewig
fc0ac1923d
[DOCS] Correct spelling for geo terms (#76028)
Changes:
* Use "geopoint" when not referring to the literal field type
* Use "geoshape" when not referring to the literal field type or query type
* Use "GeoJSON" consistently
2021-08-03 09:55:48 -04:00
István Zoltán Szabó
60f3c77e3f
[DOCS] Adds p-value heuristic to significant terms aggregation (#75369)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2021-07-27 09:12:45 +02:00
Mark Tozzi
7af39dbc35
Remove deprecated date histo interval (#75000)
Date histogram interval parameter was deprecated in 7.2, in favor of the more specific fixed_interval and calendar_interval parameters.  The old logic used some poorly understood guessing to decide if it should operate in fixed or calendar mode.  The new logic requires a specific choice by the user, which is more explicit.  In 7.x REST compatibility mode, we will parse the interval as calendar if possible, and otherwise interpret it as fixed.
2021-07-20 13:08:45 -04:00
James Rodewig
73397f7fb2 [DOCS] Deduplicate docs for search.max_buckets 2021-06-29 08:42:01 -04:00
Benjamin Trent
07b336f1b0
Add support for range aggregations on histogram mapped fields (#74146)
This adds support for the range aggregation over `histogram` mapped fields.

Decisions made for implementation:

 - Sub-aggregations are not allowed. This is to simplify implementation and follows the prior art set by the `histogram` aggregation
 - Nothing fancy is done with the ranges. No filter translations as we cannot easily do a `range` filter query against histogram fields. This may be an optimization in the future.
 - Ranges check the histogram value ONLY. No interpolation of values is done. If we have better statistics around the histogram this MAY be possible.
2021-06-29 07:24:54 -04:00
Nik Everett
1338a11d1c
Document types terms agg can consume (#73272)
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
2021-06-17 14:58:20 -04:00
Igor Motov
db36b6c89a
Add keep_values gap policy (#73297)
Adds a new keep_values gap policy that works like skip, except if the metric
calculated on an empty bucket provides a non-null non-NaN value, this value is
used for the bucket.

Fixes #27377

Co-authored-by: Mark Tozzi <mark.tozzi@gmail.com>
2021-06-08 09:47:29 -10:00
James Rodewig
0360ce48b4
[DOCS] Clarify supported fields for top_metrics agg (#73907)
Changes:
* Notes `metrics.field` supports `boolean` fields and runtime fields.
* Notes `metrics.field` doesn't support array values.

Closes #72889
2021-06-08 13:19:43 -04:00
James Rodewig
ff0cb8ed97
[DOCS] Make doc_count error docs more searchable (#73870)
Changes:
* Combines the `Document counts are approximate` and `Calculating document count
  error` sections.
* Rewrites the section to include `sum_other_doc_count` and
  `doc_count_error_upper_bound` for easier on-page (ctrl+f) searching.

Closes #73200
2021-06-08 09:33:10 -04:00
Mark Tozzi
2d4d3d40a0
Docvalueformat errors (#73121)
Improve the error message when inconsistent mappings cause doc value formatting errors.  For example, trying to format a binary encoded IP address as a UTF8 string often fails with something unexpected, like `ArrayIndexOutOfBounds`.  This change catches that and wraps it with a message suggesting the user check their mappings.  Also gets rid of anonymous instances for doc value formatters, which made it hard to see what format was failing to be applied.
2021-06-07 15:24:27 -04:00
Benjamin Trent
30cf4dc8be
[ML] adding new KS test pipeline aggregation (#73334)
This adds a new pipeline aggregation for calculating Kolmogorov–Smirnov test for a given sample and buckets path.

For now, the buckets path resolution needs to be `_count`. But, this may be relaxed in the future. 

It accepts a parameter `fractions` that indicates the distribution of documents from some other pre-calculated sample. 

This particular version of the K-S test is Two-sample, meaning, it calculates if the `fractions` and the distribution of `_count` values in the buckets_path are taken from the same distribution.

This in combination with the hypothesis alternatives (`less`, `greater`, `two_sided`) and sampling logic (`upper_tail`, `lower_tail`, `uniform`) allow for flexibility and usefulness when comparing two samples and determining the likelihood of them being from the same overall distribution.

Usage:

```
POST correlate_latency/_search?size=0&filter_path=aggregations
{
  "aggs": {
    "buckets": {
      "terms": { <1>
        "field": "version",
        "size": 2
      },
      "aggs": {
        "latency_ranges": {
          "range": { <2>
            "field": "latency",
            "ranges": [
              { "to": 0.0 },
              { "from": 0, "to": 105 },
              { "from": 105, "to": 225 },
              { "from": 225, "to": 445 },
              { "from": 445, "to": 665 },
              { "from": 665, "to": 885 },
              { "from": 885, "to": 1115 },
              { "from": 1115, "to": 1335 },
              { "from": 1335, "to": 1555 },
              { "from": 1555, "to": 1775 },
              { "from": 1775 }
            ]
          }
        },
        "ks_test": { <3>
          "bucket_count_ks_test": {
            "buckets_path": "latency_ranges>_count",
            "alternative": ["less", "greater", "two_sided"]
          }
        }
      }
    }
  }
}
```
2021-06-04 10:04:41 -04:00
Nik Everett
a43b166d11
More debugging info for significant_text (#72727)
Adds some extra debugging information to make it clear that you are
running `significant_text`. Also adds some using timing information
around the `_source` fetch and the `terms` accumulation. This lets you
calculate a third useful timing number: the analysis time. It is
`collect_ns - fetch_ns - accumulation_ns`.

This also adds a half dozen extra REST tests to get a *fairly*
comprehensive set of the operations this supports. It doesn't cover all
of the significance heuristic parsing, but its certainly much better
than what we had.
2021-05-10 12:50:46 -04:00
Benjamin Trent
8069e9b233
[ML] add new bucket_correlation aggregation with initial count_correlation function (#72133)
This commit adds a new pipeline aggregation that allows correlation within the aggregation frame work in bucketed values. 

The initial function is a `count_correlation` function. The purpose of which is to correlate the count in a consistent number of buckets with a pre calculated indicator. The indicator and the aggregated buckets should related to the same metrics with in documents. 

Example for correlating terms within a `service.version.keyword` with latency percentiles. The percentiles and provided correlation indicator both refer to the same source data where the indicator was previously calculated.:
```
GET apm-7.12.0-transaction-generated/_search
{
  "size": 0,
  "aggs": {
    "field_terms": {
      "terms": {
        "field": "service.version.keyword",
        "size": 20
      },
      "aggs": {
        "latency_range": {
          "range": {
            "field": "transaction.duration.us",
            "ranges": [<snip>],
            "keyed": true
          }
        },
        "correlation": {
          "bucket_correlation": {
            "buckets_path": "latency_range>_count",
            "count_correlation": {
              "indicator": {
                 "expectations": [<snip>],
                 "doc_count": 20000
               }
            }
          }
        }
      }
    }
  }
}
```
2021-05-10 12:46:11 -04:00
Nik Everett
5808f2febb
Update docs for filter agg (#72508)
The docs for the `filter` agg seemed to suggest that it was the
preferred way to filter results for aggs but its really mostly for when
you need to filter things under another bucketing agg.

Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
2021-05-06 14:51:16 -04:00
Ignacio Vera
793166fd1f
[GeoPoint] Grid aggregations with bounds should exclude touching tiles (#72493) 2021-04-30 08:43:18 +02:00
Pierre Grimaud
3c44dfec60
[DOCS] Fix typos (#72227) 2021-04-26 12:40:38 -04:00
Nik Everett
6a1220e7f3
Convert metric aggs docs runtime fields (#71260)
This replaces the `script` docs for bucket aggregations with runtime
fields. We expect runtime fields to be nicer to work with because you
can also fetch them or filter on them. We expect them to be faster
because their don't need this sort of `instanceof` tree:
a92a647b9f/server/src/main/java/org/elasticsearch/search/aggregations/support/values/ScriptDoubleValues.java (L42)

Relates to #69291

Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Co-authored-by: Adam Locke <adam.locke@elastic.co>
2021-04-05 13:08:13 -04:00
Nik Everett
a9d9ee0d4b
Convert bucket aggs docs to runtime fields (#71202)
This replaces the `script` docs for bucket aggregations with runtime
fields. We expect runtime fields to be nicer to work with because you
can also fetch them or filter on them. We expect them to be faster
because their don't need this sort of `instanceof` tree:
a92a647b9f/server/src/main/java/org/elasticsearch/search/aggregations/support/values/ScriptDoubleValues.java (L42)

Relates to #69291

Co-authored-by: Adam Locke <adam.locke@elastic.co>
2021-04-02 12:12:06 -04:00
James Rodewig
693807a6d3
[DOCS] Fix double spaces (#71082) 2021-03-31 09:57:47 -04:00
Benjamin Trent
c8415a7924
[ML] adding support for composite aggs in anomaly detection (#69970)
This commit allows for composite aggregations in datafeeds. 

Composite aggs provide a much better solution for having influencers, partitions, etc. on high volume data. Instead of worrying about long scrolls in the datafeed, the calculation is distributed across cluster via the aggregations. 

The restrictions for this support are as follows:

- The composite aggregation must have EXACTLY one `date_histogram` source
- The sub-aggs of the composite aggregation must have a `max` aggregation on the SAME timefield as the aforementioned `date_histogram` source
- The composite agg must be the ONLY top level agg and it cannot have a `composite` or `date_histogram` sub-agg
- If using a `date_histogram` to bucket time, it cannot have a `composite` sub-agg.
- The top-level `composite` agg cannot have a sibling pipeline agg. Pipeline aggregations are supported as a sub-agg (thus a pipeline agg INSIDE the bucket).

Some key user interaction differences:
- Speed + resources used by the cluster should be controlled by the `size` parameter in the `composite` aggregation. Previously, we said if you are using aggs, use a specific `chunking_config`. But, with composite, that is not necessary. 
- Users really shouldn't use nested `terms` aggs anylonger. While this is still a "valid" configuration and MAY be desirable for some users (only wanting the top 10 of certain terms), typically when users want influencers, partition fields, etc. they want the ENTIRE population. Previously, this really wasn't possible with aggs, with `composite` it is.
- I cannot really think of a typical usecase that SHOULD ever use a multi-bucket aggregation that is NOT supported by composite.
2021-03-30 08:25:40 -04:00
István Zoltán Szabó
9a8c6fb66f
[DOCS] Removes beta labels from DFA related docs. (#70808) 2021-03-26 09:46:41 +01:00
Nik Everett
2b9ed7d36f
Docs: Clean doc for agg parameter (#70675)
This adds a heading for `shard_min_doc_count` and merges the paragraphs
for them. I wanted to link to this section earlier today and it wasn't a
"real" section so I couldn't.

Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
2021-03-24 16:22:26 -04:00
Ignacio Vera
b81bb42ed9
Increase search.max_bucket by one (#70645) 2021-03-23 08:54:48 +01:00
James Rodewig
53574d2778
[DOCS] Reformat adjacency matrix agg reference (#70034) 2021-03-08 12:33:46 -05:00
James Rodewig
67288a1e4d [DOCS] Fix gap policy xref 2021-03-03 09:31:02 -05:00
James Rodewig
e21cab640f
[DOCS] Reformat avg bucket agg reference (#69751) 2021-03-02 13:44:43 -05:00
Nik Everett
ea131e5f5a
Docs: Switch terms agg scripting to runtime fields (#69628)
We expect runtime fields to perform a little better than our "native"
aggregation script so we should point folks to them instead of the
"native" aggregation script.
2021-03-02 11:27:21 -05:00
RomainGeffraye
fe7afb9d36
[DOCS] Update example for serial_diff agg (#69635) 2021-03-01 08:37:29 -05:00
Lisa Cawley
efa9b095aa
[DOCS] Adds model alias to inference processor and agg (#69576) 2021-02-24 13:12:39 -08:00
Igor Motov
7ad0201b25
Clarify the intended use case for multi_terms aggs (#69397)
This PR clarifies when multi_terms aggs should be used instead of composite
aggs or nested term aggs.

Relates to #65623
2021-02-23 15:11:53 -05:00
Nik Everett
1195b20a83
Docs: Add example fetching keyword in top_metrics (#69135)
Adds an example of fetching a keyword field.
2021-02-17 12:10:34 -05:00
James Rodewig
9b88ae92e6
[DOCS] Fix typos for duplicate words (#69125) 2021-02-17 10:34:20 -05:00
Dario Gieselaar
a28e45c0c5
[DOCS] Remove keyword/ip from list of unsupported fields in top_metrics agg (#69036) 2021-02-17 08:41:57 -05:00
James Rodewig
ab0f4d51b2
[DOCS] Add missing newline for bulleted list in top_metrics docs (#68481) (#68550)
Co-authored-by: Nathan L Smith <nathan.smith@elastic.co>
2021-02-04 14:49:02 -05:00
Igor Motov
9e3384ebc9
Add multi_terms aggs (#67597)
Adds a multi_terms aggregation support. The multi terms aggregation works
very similarly to the terms aggregation but supports multiple terms. The goal
of this PR is to add the basic functionality so it is not optimized at the
moment. It will be done in follow up PRs.

Closes #65623
2021-02-03 13:13:33 -05:00
James Rodewig
67f113314d
[DOCS] Fix acasting for agg types (#67469) 2021-01-13 14:44:54 -05:00
Adam Locke
82bfbe1195
[DOCS] Adding headers in TOC for aggregation docs. (#66604) 2020-12-18 11:31:42 -05:00
James Rodewig
77dc63b2de
[DOCS] Fix search.max_buckets default (#66311) 2020-12-14 21:55:27 -05:00
Nik Everett
524f39f61e
Drop experimental from variable width histogram (#66055)
Its been several months and we haven't bumped into any good reason to
rework the variable width histogram. So let's drop experimental from it!

Closes #58573
2020-12-08 14:15:21 -05:00
Mike Barretta
12c9ee4d80
Update inference-bucket-aggregation.asciidoc
tiny change to properly align the first code example and to add a missing word
2020-12-03 11:48:45 -05:00
James Rodewig
e955f7752b
[DOCS] Fix typo in histogram agg docs (#65822) 2020-12-03 09:55:47 -05:00
Igor Motov
a065b6d8da
Return an error when a rate aggregation cannot calculate bucket sizes (#65429)
In some cases when the rate aggregation is not a child of a date histogram
aggregation, it is not possible to determine the actual size of the date
histogram bucket. In this case the rate aggregation now throws an exception.

Closes #63703
2020-11-25 10:05:51 -05:00
Tal Levy
a6755c3be8
Add mention of geo_shape support in geotile and geohash grid agg docs (#61129)
Previously, geo_shape support was only mentioned in a dedicated x-pack
section. This may be misleading, as the introductory paragraph only
mentions geo_point.

Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
2020-11-24 13:57:42 -08:00
Tal Levy
b514d9bf2e
Add geo_line aggregation (#41612)
A metric aggregation that aggregates a set of points as 
a GeoJSON LineString ordered by some sort parameter.

#### specifics

A `geo_line` aggregation request would specify a `geo_point` field, as well
as a `sort` field. `geo_point` represents the values used in the LineString, 
while the `sort` values will be used as the total ordering of the points.

the `sort` field would support any numeric field, including date.

#### sample usage

```
{
	"query": {
		"bool": {
			"must": [
				{ "term": { "person": "004" } },
				{ "term": { "trajectory": "20090131002206.plt" } }
			]
		}
	},
	"aggs": {
		"make_line": {
			"geo_line": {
				"point": {"field": "location"},
				"sort": { "field": "timestamp" },
                                "include_sort": true,
                                "sort_order": "desc",
                                "size": 15
			}
		}
	}
}
```

#### sample response

```
{
    "took": 21,
    "timed_out": false,
    "_shards": {...},
    "hits": {...},
    "aggregations": {
        "make_line": {
            "type": "LineString",
            "coordinates": [
                [
                    121.52926194481552,
                    38.92878997139633
                ],
                [
                    121.52922699227929,
                    38.92876998055726
                ],
             ]
        }
    }
}
```

#### visual response

<img width="540" alt="Screen Shot 2019-04-26 at 9 40 07 AM" src="https://user-images.githubusercontent.com/388837/56834977-cf278e00-6827-11e9-9c93-005ed48433cc.png">

#### limitations

Due to the cardinality of points, an initial max of 10k points 
will be used. This should support many use-cases.

One solution to overcome this limitation is to keep a PriorityQueue of
points, and simplifying the line once it hits this max. If simplifying
makes sense, it may be a nice option, in general. The ability to use a parameter
to specify how aggressive one wants to simplify. This parameter could be 
the number of points. Example algorithm one could use with a PriorityQueue:
https://bost.ocks.org/mike/simplify/. This would still require O(m) space, where m
is the number of points returned. And would also require heapifying triangles
sorted by their areas, which would be O(log(m)) operations. Since sorting is done, 
anyways, simplifying would still be a O(n log(m)) operation, where n is the total number 
of points to filter........... something to explore


closes #41649
2020-11-23 10:26:27 -08:00
Wylie Conlon
10ee0f2878
Clarify field data cache behavior in docs (#64375)
* Clarify that field data cache includes global ordinals
* Describe that the cache should be cleared once the limit is reached
* Clarify that the `_id` field does not supported aggregations anymore
* Fold the `fielddata` mapping parameter page into the `text field docs
* Improve cross-linking
2020-11-20 13:53:23 -08:00
Adam Locke
9fdcd79927
Explicitly defining types for sources parameter (#65006) 2020-11-12 16:09:04 -05:00
Mark Tozzi
f666ccb3bc
Add supports for upper and lower values on boxplot based on the IQR value (#63617) 2020-11-04 14:39:05 -05:00