Backports #110222 to 7.17.
JDK 23 removes the COMPAT locale provider, leaving CLDR as the only option. This commit configures Elasticsearch
to use the CLDR provider when on JDK 23, but still use the existing COMPAT provider when on JDK 22 and below.
This causes some differences in locale behaviour; various tests are also adapted so they still work whether run on COMPAT or CLDR.
* Update painless-reindex-context.asciidoc (#84444) (#84712)
ctx['op'] should be set to 'noop', not 'none', when specifying no
operation.
Elasticsearch error when using 'none':
```json
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "Operation type [none] not allowed, only [noop, index, delete] are allowed"
}
],
"type" : "illegal_argument_exception",
"reason" : "Operation type [none] not allowed, only [noop, index, delete] are allowed"
},
"status" : 400
}
```
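For reference, a minimal reindex sketch (index names and the `status` field are hypothetical) showing the correct `noop` value:
```js
POST _reindex
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.status == 'archived') { ctx.op = 'noop' }" // 'noop' skips the doc; 'none' throws the error above
  }
}
```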
Co-authored-by: jalvar08 <jeovanny.alvarez@gmail.com>
* [DOCS] Update install instructions for Debian/Ubuntu (#84645) (#84714)
The use of `apt-key` is deprecated and will no longer be available after
Debian 11 and Ubuntu 22.04. This updates the installation instructions
for Debian-based distributions.
Closes #84644
Co-authored-by: er0k <er0k@users.noreply.github.com>
* Fix some typos in plugins & reference docs (#84667) (#84717)
This pull request removes a few instances of duplicate words or
punctuation and erroneous spelling from the docs.
Co-authored-by: Abele Mălan <6689720+AbeleMM@users.noreply.github.com>
Fixes an error in the `sum` aggregation example for histograms and updates its test snippets.
Closes #84491
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
(cherry picked from commit fb45ac9dea)
Co-authored-by: Maja Grubic <maja.grubic@elastic.co>
* Updates the `min` and `max` snippets for histograms. These should now run as docs integration tests.
* Fixes a copy/paste error in the `max` aggregation snippet for histograms.
Relates to https://github.com/elastic/elasticsearch/pull/83384
(cherry picked from commit 280fd2fff7)
Per issue #60780, the team decided to remove the experimental language from HDR histogram percentiles and percentile ranks. The feature has been in production for quite some time.
Closes #60780
Co-authored-by: William Chaparro <william.chaparro@elastic.co>
The documentation states that if the `weight` field is missing, and no
explicit missing configuration is provided, a default value of 1 is used.
This is incorrect and does not match the implementation of the weighted
average aggregator: in that case the document is skipped instead.
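For illustration, a sketch of the workaround (index and field names hypothetical): supplying an explicit `missing` value is what actually yields the documented default of 1.
```js
POST exams/_search?size=0
{
  "aggs": {
    "weighted_grade": {
      "weighted_avg": {
        "value": { "field": "grade" },
        "weight": { "field": "weight", "missing": 1 } // without "missing", docs lacking a weight are skipped
      }
    }
  }
}
```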
The `terms` agg picks the top `size` terms in a single scatter/gather
pass across all the shards. For the default `order` and if you `order`
by `_key` this works quite well. Some errors creep in, but it's fairly
easy to point to them and understand them. But ordering by doc count
ascending is like inviting the error vampire into your agg. It's super
easy to get inaccurate results. This updates the docs to be more stark
about it. Closes #72684
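For example, the kind of request the warning applies to (field name hypothetical):
```js
GET /_search?size=0
{
  "aggs": {
    "rarest_products": {
      "terms": {
        "field": "product",
        "size": 10,
        "order": { "_count": "asc" } // error-prone: per-shard counts are approximate
      }
    }
  }
}
```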
This commit fixes a handful of bugs with categorize_text agg
- The agg now fails on fields that are not text fields
- Limits the number of tokens categorized
- Validates the configuration inputs to disallow settings above static maximums
Backports #79346 to 7.x
When running a rate aggregation without setting the field parameter, the result is computed based on the bucket doc_count.
This PR adds support for a custom _doc_count field.
Closes #77734
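A sketch of the doc_count-based usage (index and field names hypothetical): with no `field` set, the rate is derived from each bucket's doc count, which now honors a custom `_doc_count`.
```js
GET my-index/_search?size=0
{
  "aggs": {
    "by_month": {
      "date_histogram": { "field": "@timestamp", "calendar_interval": "month" },
      "aggs": {
        "per_day": {
          "rate": { "unit": "day" } // no "field": computed from the bucket doc_count
        }
      }
    }
  }
}
```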
This commit adds the new normalize_above parameter to the p_value significant
terms heuristic.
This parameter allows for consistent significance results at various scales. When a total count (in either the subset or the background set) is above the `normalize_above` parameter, both the total set and the set including the term are scaled by `normalize_above/count`, where `count` is the term count in the set or the total set size.
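A sketch of how the parameter might be supplied (index and field names hypothetical; the exact syntax of the `p_value` heuristic object on `significant_terms` is an assumption here):
```js
GET logs/_search?size=0
{
  "query": { "term": { "error": true } },
  "aggs": {
    "suspect_versions": {
      "significant_terms": {
        "field": "version.keyword",
        "p_value": {
          "background_is_superset": false,
          "normalize_above": 1000 // counts above this are scaled down for consistent significance
        }
      }
    }
  }
}
```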
The composite aggregation is considered expensive. Users should perform load testing before deploying it in production.
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Co-authored-by: Stef Nestor <steffanie.nestor@gmail.com>
* [ML] Text/Log categorization multi-bucket aggregation (#71752)
This commit adds a new multi-bucket aggregation: `categorize_text`
The aggregation follows a similar design to significant text in that it reads from `_source`
and re-analyzes the text as it is read.
Key difference is that it does not use the indexed field's analyzer, but instead relies on
the `ml_standard` tokenizer with specialized ML token filters. The tokenizer + filters are the
same that machine learning categorization anomaly jobs utilize.
The high level logical flow is as follows:
- at each shard, read in the text field with a custom analyzer using `ml_standard` tokenizer
- Read in the particular tokens from the analyzer
- Feed these tokens to a token tree algorithm (an adaptation of the drain categorization algorithm)
- Gather the individual log categories (the leaf nodes), sort them by doc_count, ship those buckets to be merged
- Merge all buckets that have the EXACT same key
- Once all buckets are merged, pass those keys + counts to a new token tree for additional merging
- That tree builds the final buckets and that is returned to the user
Algorithm explanation:
- Each log is parsed with the ml-standard tokenizer
- each token is passed into a token tree
- For the first `max_match_token` tokens, each token is stored in the tree; at `max_match_token+1` (or `len(tokens)`) a log group is created
- If another log group exists at that leaf, merge it if they have `similarity_threshold` percentage of tokens in common
- merging simply replaces tokens that are different in the group with `*`
- If a layer in the tree has `max_unique_tokens`, we add a `*` child and any new tokens are passed through there. The catch is that on the final merge, we first attempt to merge together sub-trees with the smallest number of documents, especially if the new sub-tree has more documents counted.
## Aggregation configuration.
Here is an example on some openstack logs
```js
POST openstack/_search?size=0
{
"aggs": {
"categories": {
"categorize_text": {
"field": "message", // The field to categorize
"similarity_threshold": 20, // merge log groups if they are this similar
"max_unique_tokens": 20, // Max Number of children per token position
"max_match_token": 4, // Maximum tokens to build prefix trees
"size": 1
}
}
}
}
```
This will return buckets like
```json
"aggregations" : {
"categories" : {
"buckets" : [
{
"doc_count" : 806,
"key" : "nova-api.log.1.2017-05-16_13 INFO nova.osapi_compute.wsgi.server * HTTP/1.1 status len time"
}
]
}
}
```
* fixing for backport
* fixing test after backport
Related to issue #77823
This does the following:
- Updates several asciidoc files that contained code snippets with
invalid JSON, most involving unnecessary trailing commas.
- Makes the switch from the Groovy JSON parser to the Jackson parser,
pursuant to the general goal of eliminating Groovy dependence.
- Makes testing of JSON validity at build time more strict.
Note that this update still allows backslash escaping for any
character. Currently that matters because of the file
"docs/reference/ml/anomaly-detection/apis/get-datafeed-stats.asciidoc",
specifically this part:
"attributes" : {
"ml.machine_memory" :
"$body.datafeeds.0.node.attributes.ml\.machine_memory",
"ml.max_open_jobs" : "512"
}
It's not clear to me what change, if any, is appropriate there. So,
I've left in the escaped period and configured the parser to ignore
it for the time being.
* Adds support for the rate aggregation under a composite agg (#76992)
rate aggregation should support being a sub-aggregation
of a composite agg.
The catch is that the composite aggregation source
must be a date histogram. Other sources can be present
but there must be exactly one date histogram source,
otherwise the rate aggregation does not know which
interval to compare its unit rate to.
closes https://github.com/elastic/elasticsearch/issues/76988
* Update RateAggregatorTests.java
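A sketch of the newly supported shape (index and field names hypothetical), with the rate agg under a composite that has a single date histogram source:
```js
GET my-index/_search?size=0
{
  "aggs": {
    "buckets": {
      "composite": {
        "sources": [
          { "day": { "date_histogram": { "field": "@timestamp", "calendar_interval": "day" } } }
        ]
      },
      "aggs": {
        "daily_rate": { "rate": { "field": "price", "unit": "day" } } // unit is compared against the date histogram interval
      }
    }
  }
}
```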
Changes:
* Simplifies and formats several snippets in the nested agg docs
* Adds a `filter` sub-aggregation example
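A sketch of the kind of `filter` sub-aggregation example added (index and field names hypothetical):
```js
GET products/_search?size=0
{
  "aggs": {
    "resellers": {
      "nested": { "path": "resellers" },
      "aggs": {
        "filtered_reseller": {
          "filter": { "term": { "resellers.name": "companyB" } },
          "aggs": {
            "min_price": { "min": { "field": "resellers.price" } }
          }
        }
      }
    }
  }
}
```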
Changes:
* Use "geopoint" when not referring to the literal field type
* Use "geoshape" when not referring to the literal field type or query type
* Use "GeoJSON" consistently
* Add support for range aggregations on histogram mapped fields (#74146)
This adds support for the range aggregation over `histogram` mapped fields.
Decisions made for implementation:
- Sub-aggregations are not allowed. This is to simplify implementation and follows the prior art set by the `histogram` aggregation
- Nothing fancy is done with the ranges. No filter translations as we cannot easily do a `range` filter query against histogram fields. This may be an optimization in the future.
- Ranges check the histogram value ONLY. No interpolation of values is done. If we have better statistics around the histogram this MAY be possible.
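A sketch of the new capability (index and field names hypothetical), running a `range` agg directly over a `histogram` mapped field:
```js
GET metrics/_search?size=0
{
  "aggs": {
    "latency_ranges": {
      "range": {
        "field": "latency_histo", // a histogram-mapped field; sub-aggs are not allowed
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200 }
        ]
      }
    }
  }
}
```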
Adds a new keep_values gap policy that works like skip, except that if the metric
calculated on an empty bucket provides a non-null, non-NaN value, that value is
used for the bucket.
Fixes #27377
Co-authored-by: Mark Tozzi <mark.tozzi@gmail.com>
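A sketch of the new policy in use (index and field names hypothetical), here on a `derivative` pipeline agg:
```js
GET /_search?size=0
{
  "aggs": {
    "daily": {
      "date_histogram": { "field": "@timestamp", "calendar_interval": "day" },
      "aggs": {
        "total_bytes": { "sum": { "field": "bytes" } },
        "bytes_deriv": {
          "derivative": {
            "buckets_path": "total_bytes",
            "gap_policy": "keep_values" // like "skip", but keeps non-null, non-NaN metric values from empty buckets
          }
        }
      }
    }
  }
}
```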
Changes:
* Combines the `Document counts are approximate` and `Calculating document count
error` sections.
* Rewrites the section to include `sum_other_doc_count` and
`doc_count_error_upper_bound` for easier on-page (ctrl+f) searching.
Closes #73200
Improve the error message when inconsistent mappings cause doc value formatting errors. For example, trying to format a binary encoded IP address as a UTF8 string often fails with something unexpected, like `ArrayIndexOutOfBounds`. This change catches that and wraps it with a message suggesting the user check their mappings. Also gets rid of anonymous instances for doc value formatters, which made it hard to see what format was failing to be applied.
This adds a new pipeline aggregation for calculating the Kolmogorov–Smirnov test for a given sample and buckets path.
For now, the buckets path resolution needs to be `_count`. But, this may be relaxed in the future.
It accepts a parameter `fractions` that indicates the distribution of documents from some other pre-calculated sample.
This particular version of the K-S test is two-sample, meaning it calculates whether the `fractions` and the distribution of `_count` values in the buckets_path are drawn from the same distribution.
This in combination with the hypothesis alternatives (`less`, `greater`, `two_sided`) and sampling logic (`upper_tail`, `lower_tail`, `uniform`) allow for flexibility and usefulness when comparing two samples and determining the likelihood of them being from the same overall distribution.
Usage:
```
POST correlate_latency/_search?size=0&filter_path=aggregations
{
"aggs": {
"buckets": {
"terms": { <1>
"field": "version",
"size": 2
},
"aggs": {
"latency_ranges": {
"range": { <2>
"field": "latency",
"ranges": [
{ "to": 0.0 },
{ "from": 0, "to": 105 },
{ "from": 105, "to": 225 },
{ "from": 225, "to": 445 },
{ "from": 445, "to": 665 },
{ "from": 665, "to": 885 },
{ "from": 885, "to": 1115 },
{ "from": 1115, "to": 1335 },
{ "from": 1335, "to": 1555 },
{ "from": 1555, "to": 1775 },
{ "from": 1775 }
]
}
},
"ks_test": { <3>
"bucket_count_ks_test": {
"buckets_path": "latency_ranges>_count",
"alternative": ["less", "greater", "two_sided"]
}
}
}
}
}
}
```
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Adds some extra debugging information to make it clear that you are
running `significant_text`. Also adds some using timing information
around the `_source` fetch and the `terms` accumulation. This lets you
calculate a third useful timing number: the analysis time. It is
`collect_ns - fetch_ns - accumulation_ns`.
This also adds a half dozen extra REST tests to get a *fairly*
comprehensive set of the operations this supports. It doesn't cover all
of the significance heuristic parsing, but it's certainly much better
than what we had.
* [ML] add new bucket_correlation aggregation with initial count_correlation function (#72133)
This commit adds a new pipeline aggregation that allows correlation of bucketed values within the aggregation framework.
The initial function is `count_correlation`, whose purpose is to correlate the count in a consistent number of buckets with a pre-calculated indicator. The indicator and the aggregated buckets should relate to the same metrics within documents.
Example for correlating terms within `service.version.keyword` with latency percentiles. The percentiles and the provided correlation indicator both refer to the same source data, where the indicator was previously calculated:
```
GET apm-7.12.0-transaction-generated/_search
{
"size": 0,
"aggs": {
"field_terms": {
"terms": {
"field": "service.version.keyword",
"size": 20
},
"aggs": {
"latency_range": {
"range": {
"field": "transaction.duration.us",
"ranges": [<snip>],
"keyed": true
}
},
"correlation": {
"bucket_correlation": {
"buckets_path": "latency_range>_count",
"count_correlation": {
"indicator": {
"expectations": [<snip>],
"doc_count": 20000
}
}
}
}
}
}
}
}
```
The docs for the `filter` agg seemed to suggest that it was the
preferred way to filter results for aggs, but it's really mostly for when
you need to filter things under another bucketing agg.
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
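For contrast, a sketch of the now-preferred pattern (index and field names hypothetical): use a top-level `query` to narrow results, and reserve `filter` for filtering under another bucketing agg.
```js
POST sales/_search?size=0
{
  "query": { "term": { "type": "t-shirt" } }, // prefer a query over a top-level filter agg
  "aggs": {
    "avg_price": { "avg": { "field": "price" } }
  }
}
```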
This replaces the `script` docs for bucket aggregations with runtime
fields. We expect runtime fields to be nicer to work with because you
can also fetch them or filter on them. We expect them to be faster
because they don't need this sort of `instanceof` tree:
a92a647b9f/server/src/main/java/org/elasticsearch/search/aggregations/support/values/ScriptDoubleValues.java (L42)
Relates to #69291
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Co-authored-by: Adam Locke <adam.locke@elastic.co>
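A sketch of the runtime-field style that replaces the `script` parameter (index, field, and script are hypothetical):
```js
GET sales/_search?size=0
{
  "runtime_mappings": {
    "price.discounted": {
      "type": "double",
      "script": { "source": "emit(doc['price'].value * 0.9)" }
    }
  },
  "aggs": {
    "discounted_ranges": {
      "range": {
        "field": "price.discounted", // aggregate on the runtime field instead of an agg-level script
        "ranges": [ { "to": 100 }, { "from": 100 } ]
      }
    }
  }
}
```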
* [ML] adding support for composite aggs in anomaly detection (#69970)
This commit allows for composite aggregations in datafeeds.
Composite aggs provide a much better solution for having influencers, partitions, etc. on high-volume data. Instead of worrying about long scrolls in the datafeed, the calculation is distributed across the cluster via the aggregations.
The restrictions for this support are as follows:
- The composite aggregation must have EXACTLY one `date_histogram` source
- The sub-aggs of the composite aggregation must have a `max` aggregation on the SAME timefield as the aforementioned `date_histogram` source
- The composite agg must be the ONLY top level agg and it cannot have a `composite` or `date_histogram` sub-agg
- If using a `date_histogram` to bucket time, it cannot have a `composite` sub-agg.
- The top-level `composite` agg cannot have a sibling pipeline agg. Pipeline aggregations are supported as a sub-agg (thus a pipeline agg INSIDE the bucket).
Some key user interaction differences:
- Speed + resources used by the cluster should be controlled by the `size` parameter in the `composite` aggregation. Previously, we said if you are using aggs, use a specific `chunking_config`. But, with composite, that is not necessary.
- Users really shouldn't use nested `terms` aggs any longer. While this is still a "valid" configuration and MAY be desirable for some users (only wanting the top 10 of certain terms), typically when users want influencers, partition fields, etc. they want the ENTIRE population. Previously, this really wasn't possible with aggs; with `composite` it is.
- I cannot really think of a typical use case that SHOULD ever use a multi-bucket aggregation that is NOT supported by composite.
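To make the shape concrete, a sketch of a datafeed using a composite agg (job, index, and field names hypothetical), honoring the restrictions above: one `date_histogram` source and a `max` sub-agg on the same time field.
```js
PUT _ml/datafeeds/datafeed-composite-example
{
  "job_id": "my-anomaly-job",
  "indices": [ "my-metrics" ],
  "aggregations": {
    "buckets": {
      "composite": {
        "size": 1000, // controls speed + resources instead of chunking_config
        "sources": [
          { "time_bucket": { "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" } } }
        ]
      },
      "aggregations": {
        "@timestamp": { "max": { "field": "@timestamp" } },
        "avg_bytes": { "avg": { "field": "bytes" } }
      }
    }
  }
}
```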