Commit graph

210 commits

Author SHA1 Message Date
Nik Everett
32bdf8549b
Fail variable_width_histogram that collects from many (#58619)
Adds an explicit check to `variable_width_histogram` to stop it from
trying to collect from many buckets because it can't. I tried to make it
do so but that is more than an afternoon's project, sadly. So for now we
just disallow it.

Relates to #42035
2020-06-30 15:42:46 -04:00
Nik Everett
dda78ff760
Docs: Mark variable_width_histogram experimental (#58574)
We're tracking this aggregation's experimental-progress in #58573. We'd
like a little time to be able to make backwards incompatible changes to
the aggregation because we're not 100% sure about the request and
response format yet.
2020-06-25 16:54:37 -04:00
James Dorfman
e99d287fbb
Add Variable Width Histogram Aggregation (#42035)
Implements a new histogram aggregation called `variable_width_histogram` which
dynamically determines bucket intervals based on document groupings. These
groups are determined by running a one-pass clustering algorithm on each shard
and then reducing each shard's clusters using an agglomerative
clustering algorithm.

This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The
algorithm was lightly inspired by
[this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches
a small number of documents to sample the data and determine initial clusters.
Subsequent documents are then placed into one of these clusters, or a new one
if they are an outlier. This algorithm is described in more details in the
aggregation's docs.

At reduce time, a
[hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)
algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304)
continually merges the closest buckets from all shards (based on their
centroids) until the target number of buckets is reached.

The final values produced by this aggregation are approximate. Each bucket's
min value is used as its key in the histogram. Furthermore, buckets are merged
based on their centroids and not their bounds. So it is possible that adjacent
buckets will overlap after reduction. Because each bucket's key is its min,
this overlap is not shown in the final histogram. However, when such overlap
occurs, we set the key of the bucket with the larger centroid to the midpoint
between its minimum and the smaller bucket’s maximum:
`min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to
increases the accuracy of the clustering.

Nodes are unable to share centroids during the shard-level clustering phase. In
the future, resolving https://github.com/elastic/elasticsearch/issues/50863
would let us solve this issue. 

It doesn’t make sense for this aggregation to support the `min_doc_count`
parameter, since clusters are determined dynamically. The `order` parameter is
not supported here to keep this large PR from becoming too complex.
2020-06-23 09:26:54 -04:00
Tal Levy
c765993d82
add geo_shape documentation for supported aggregations (#58284)
This commit adds documentation for geo_shape fields in aggregations

Closes #55495.
2020-06-18 10:17:49 -07:00
Benjamin Trent
484de0cd02
Adding transform docs for geotile_grid (#57000)
transforms and composite aggs support geotile_grid as a source. This adds documentation explaining that support.
2020-06-01 15:32:18 -04:00
Nik Everett
1e5e5e2da2
Update date_histogram docs (#56922)
* Make it more clear that you can use `month` or `1M`.
* Explain rounding rules
* Consistently use "time zone" instead of "timezone". It looks like both
  are right but I see "time zone" much more. And the parameter in
  elasticsearch is `time_zone` so we may as well line up.

Closes #56760

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-05-29 17:13:14 -04:00
Gabriel Petrovay
709ee956d7 Fixed calendar intervals documentation (#56666)
- the 1-letter intervals are not parseable (`m`, `h`, `d`, `w`,  `M`, `q`, `y`)
- fixed formatting broken by new lines
2020-05-15 16:56:27 -04:00
Gabriel Petrovay
4029818c24 [Docs] Correct formatting in datehistogram-aggregation.asciidoc (#56664) 2020-05-13 12:02:36 +02:00
AB Prashanth
785527bb58
[DOCS] Remove approximate document counts example from term agg docs (#55442)
Removes an example from the "Document counts are approximate" section of the
terms agg documentation.

As #52377 details, the example was no longer accurate in 7.x or 6.8. Document
counts were more precise than the example presented.

We've opened issue #56025 to discuss re-adding an example later.

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-04-30 09:49:32 -04:00
Zachary Tong
9f165bd44e
Aggs must specify a field or script (or both) (#52226)
* Aggs must specify a `field` or `script` (or both)

This adds a validation to VSParserHelper to ensure that a field or
script or both are specified by the user.  This is technically
required today already, but throws an exception much deeper
in the agg framework and has a very unintuitive error for the user
(as well as eating more resources instead of failing early)

* Fix StringStats test

* Add yaml test

* Skip test on older versions

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-04-23 14:26:38 -04:00
Anton Dollmaier
e9c8c03fee [DOCS] Fix parameter formatting for GeoHash grid agg docs (#53032)
Adds missing colon (`:`) to the parameter definition list.
2020-03-09 08:17:57 -04:00
Tal Levy
6c86606d2a
Adds support for geo-bounds filtering in geogrid aggregations (#50002)
It is fairly common to filter the geo point candidates in
geohash_grid and geotile_grid aggregations according to some
viewable bounding box. This change introduces the option of
specifying this filter directly in the tiling aggregation.

This is even more relevant to `geo_shape` where the bounds will restrict
the shape to be within the bounds

this optional `bounds` parameter is parsed in an equivalent fashion to 
the bounds specified in the geo_bounding_box query.
2020-01-14 08:29:10 -08:00
Nik Everett
326d696d9a
Support offset in composite aggs (#50609)
Adds support for the `offset` parameter to the `date_histogram` source
of composite aggs. The `offset` parameter is supported by the normal
`date_histogram` aggregation and is useful for folks that need to
measure things from, say, 6am one day to 6am the next day.

This is implemented by creating a new `Rounding` that knows how to
handle offsets and delegates to other rounding implementations. That
implementation doesn't fully implement the `Rounding` contract, namely
`nextRoundingValue`. That method isn't used by composite aggs so I can't
be sure that any implementation that I add will be correct. I propose to
leave it throwing `UnsupportedOperationException` until I need it.

Closes #48757
2020-01-07 14:49:09 -05:00
Nik Everett
a7cc0b0159
Docs: Refine note about after_key (#50475)
* Docs: Refine note about `after_key`

I was curious about composite aggregations, specifically I wanted to
know how to write a composite aggregation that had all of its buckets
filtered out so you *had* to use the `after_key`. Then I saw that we've
declared composite aggregations not to work with pipelines in #44180. So
I'm not sure you *can* do that any more. Which makes the note about
`after_key` inaccurate. This rejiggers that section of the docs a little
so it is more obvious that you send the `after_key` back to us. And so
it is more obvious that you should *only* use the `after_key` that we
give you rather than try to work it out for yourself.

* Apply suggestions from code review

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-01-02 10:02:55 -05:00
Lisa Cawley
6d608e6a0d
[DOCS] Move transform resource definitions into APIs (#50108) 2019-12-17 09:01:31 -08:00
Jim Ferenczi
804a5042e7
Optimize composite aggregation based on index sorting (#48399)
Co-authored-by: Daniel Huang <danielhuang@tencent.com>

This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases:
  * Multi-valued source, in such case early termination is not possible.
  * missing_bucket is set to true
2019-12-17 14:02:06 +01:00
Przemko Robakowski
04f6b6fdb2
[DOCS] IDs for doc snippets (#49008)
* Ids for docs snippets

* Ids for tests

* Ids for docs snippets

* ignoring build folder from idea

* Ignoring build-eclipse
2019-11-25 15:30:00 +01:00
Lisa Cawley
a4efab6ab4
[DOCS] Merge rollup config details into API (#49412) 2019-11-22 08:31:30 -08:00
James Rodewig
f53eba024b
[DOCS] Remove binary gendered language (#48362) 2019-10-23 09:36:31 -05:00
Alan Woodward
566e1b7d33
Remove type field from DocWriteRequest and associated Response objects (#47671)
This commit removes the type field from index, update and delete requests, and their
associated responses.

Relates to #41059
2019-10-11 10:23:55 +01:00
Alan Woodward
7a622f024f
Remove types from BulkRequest (#46983)
This commit removes types entirely from BulkRequest, both as a global
parameter and as individual entries on update/index/delete lines.

Relates to #41059
2019-10-07 13:29:12 +01:00
Mark Tozzi
c26ce1d7f5
DocValueFormat implementation for date range fields (#47472) 2019-10-04 16:01:28 -04:00
Mark Tozzi
57a679fbbb
Documentation notes for Range field histograms (#46890) 2019-10-01 10:46:04 -04:00
Javier Ruiz
e8dac62a4a [DOCS] Fix calendar interval typos for date histo agg (#46911) 2019-09-20 15:22:04 -04:00
James Rodewig
370e434986
[DOCS] Correct several [source,console-result] snippets (#46930) 2019-09-20 11:23:15 -04:00
markharwood
dc0abec595
Remove Adjacency_matrix setting in favour of Lucene Boolean query clause setting (#46327)
Closes #46324
2019-09-19 16:48:04 +01:00
Philipp Krenn
7c5adcc7c1 Minor improvement to the nested aggregation docs (#46475)
* Minor improvement to the nested aggregation docs

* The attributes name and resellers.name were rather confusing,
  especially since the first one was dynamically mapped and not shown
  in the documentation (you had to read the test to see it). This
  change introduces a unique name for the nested attribute and adds
  the example document to the documentation.
* Change the index name from "index" to something more speaking.

* Update docs/reference/aggregations/bucket/nested-aggregation.asciidoc

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Update docs/reference/aggregations/bucket/nested-aggregation.asciidoc

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

* Update docs/reference/aggregations/bucket/nested-aggregation.asciidoc

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
2019-09-11 11:23:39 -04:00
James Rodewig
e43be90e6c
[DOCS] [5 of 5] Change // TESTRESPONSE comments to [source,console-results] (#46449) 2019-09-06 14:05:36 -04:00
James Rodewig
466c59a4a7
[DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result] (#46295) 2019-09-05 16:47:18 -04:00
James Rodewig
f5827ba0ae
[DOCS] Replace "// CONSOLE" comments with [source,console] (#46159) 2019-09-04 12:51:02 -04:00
LHearen
6dadce1112 [DOCS] Correct conditional clause in histogram agg docs (#45643) 2019-08-19 10:09:10 -04:00
LHearen
d1c0ea7833 [DOCS] Fix a 'value' -> 'values' typo in histogram aggregation docs (#45642) 2019-08-19 10:02:44 -04:00
Flavio Pompermaier
e66889635d [DOCS] Correct sum_other_doc_count value in terms agg example (#45028)
Closes issue #41902
2019-07-31 14:10:05 -04:00
Sandeep Kanabar
0e4be837db [Docs] Update daterange-aggregation.asciidoc (#44730)
Correcting the value to be the same as that specified for "missing".
2019-07-29 12:51:15 +02:00
James Rodewig
ea1adb61c2
[DOCS] Update anchors and links for Elasticsearch API relocation (#44500) 2019-07-19 09:16:35 -04:00
Zachary Tong
eac86c9bb8
Document that pipeline aggs are not compatible with composite agg (#44180) 2019-07-12 12:34:34 -04:00
Zachary Tong
baf155dced
Add RareTerms aggregation (#35718)
This adds a `rare_terms` aggregation.  It is an aggregation designed
to identify the long-tail of keywords, e.g. terms that are "rare" or
have low doc counts.

This aggregation is designed to be more memory efficient than the
alternative, which is setting a terms aggregation to size: LONG_MAX
(or worse, ordering a terms agg by count ascending, which has
unbounded error).

This aggregation works by maintaining a map of terms that have
been seen. A counter associated with each value is incremented
when we see the term again.  If the counter surpasses a predefined
threshold, the term is removed from the map and inserted into a cuckoo
filter.  If a future term is found in the cuckoo filter we assume it
was previously removed from the map and is "common".

The map keys are the "rare" terms after collection is done.
2019-07-01 10:02:36 -04:00
Paul Sanwald
6357857bba
Adds a minimum interval to auto_date_histogram. (#42814)
Adds a minimum interval to `auto_date_histogram`. We do this by
restricting the roundings passed into to the aggregator.
2019-06-11 15:53:19 -04:00
Zachary Tong
0192fe7d7c
Add documentation for calendar/fixed intervals (#41919)
Original PR missed documentation for the new calendar/fixed
intervals.  This adds the missing documentation
2019-05-10 15:27:41 -04:00
Zachary Tong
290c8b8256
Force selection of calendar or fixed intervals in date histo agg (#33727)
The date_histogram accepts an interval which can be either a calendar 
interval (DST-aware, leap seconds, arbitrary length of months, etc) or 
fixed interval (strict multiples of SI units). Unfortunately this is inferred
by first trying to parse as a calendar interval, then falling back to fixed
if that fails.

This leads to confusing arrangement where `1d` == calendar, but 
`2d` == fixed.  And if you want a day of fixed time, you have to 
specify `24h` (e.g. the next smallest unit).  This arrangement is very
error-prone for users.

This PR adds `calendar_interval` and `fixed_interval` parameters to any
code that uses intervals (date_histogram, rollup, composite, datafeed, etc).
Calendar only accepts calendar intervals, fixed accepts any combination of
units (meaning `1d` can be used to specify `24h` in fixed time), and both
are mutually exclusive.  

The old interval behavior is deprecated and will throw a deprecation warning.
It is also mutually exclusive with the two new parameters. In the future the 
old dual-purpose interval will be removed.

The change applies to both REST and java clients.
2019-05-06 17:17:11 -04:00
James Rodewig
adf67053f4
[DOCS] Add anchors for Asciidoctor migration (#41648) 2019-04-30 10:19:09 -04:00
Zachary Tong
5ccc0b5a32
Disallow null/empty or duplicate composite sources (#41359)
Adds some validation to prevent duplicate source names from being 
used in the composite agg.

Also refactored to use a ConstructingObjectParser and removed the 
private ctor and setter for sources, making it mandatory.
2019-04-24 13:22:06 -04:00
Jason Tedor
656cc709b2
Fix intervals section of auto date-histogram docs (#41203)
This section should be at the same sub-level as other sections in the
auto date-histogram docs, otherwise it is rendered on to another page
and is confusing for users to understand what it's in reference to.
2019-04-15 11:27:38 -04:00
Antonio Matarrese
badb8559fb Use the breadth first collection mode for significant terms aggs.
This helps avoid memory issues when computing deep sub-aggregations. Because it
should be rare to use sub-aggregations with significant terms, we opted to always
choose breadth first as opposed to exposing a `collect_mode` option.

Closes #28652.
2019-04-11 15:38:25 -07:00
Lisa Cawley
13454376a4
[DOCS] Fixes callout for Asciidoctor migration (#41127) 2019-04-11 12:04:04 -07:00
Ian
98bbb4176e Correct date in daterange-aggregation.asciidoc (#39727) 2019-03-06 11:30:17 +01:00
Samuel Cifuentes García
ff6ffe8ba1 Improved Terms Aggregation documentation (#38892)
Added a note after the first query example talking about fielddata.
2019-03-05 10:17:01 -05:00
Hannes Van De Vreken
b76a380f18 Fix typo in DateRange docs (yyy → yyyy) (#38883) 2019-02-15 10:20:16 -05:00
Alexander Reelsen
5f7168ea74
Remove joda time mentions in documentation (#38720)
This is the forward port of #38720 (not containing the 7.0 migration docs)
2019-02-14 10:18:48 +01:00
Yuri Astrakhan
f3cde06a1d
geotile_grid implementation (#37842)
Implements `geotile_grid` aggregation

This patch refactors previous implementation https://github.com/elastic/elasticsearch/pull/30240

This code uses the same base classes as `geohash_grid` agg, but uses a different hashing
algorithm to allow zoom consistency.  Each grid bucket is aligned to Web Mercator tiles.
2019-01-31 19:11:30 -05:00