Commit graph

8 commits

Author SHA1 Message Date
Liam Thompson
33a71e3289
[DOCS] Refactor book-scoped variables in docs/reference/index.asciidoc (#107413)
* Remove `es-test-dir` book-scoped variable

* Remove `plugins-examples-dir` book-scoped variable

* Remove `:dependencies-dir:` and `:xes-repo-dir:` book-scoped variables

- In `index.asciidoc`, two variables (`:dependencies-dir:` and `:xes-repo-dir:`) were removed.
- In `sql/index.asciidoc`, the `:sql-tests:` path was updated to fuller path
- In `esql/index.asciidoc`, the `:esql-tests:` path was updated idem

* Replace `es-repo-dir` with `es-ref-dir`

* Move `:include-xpack: true` to few files that use it, remove from index.asciidoc
2024-04-17 14:37:07 +02:00
David Roberts
3dbaa3ff23
[ML] Make categorize_text aggregation GA (#88600)
Removes the experimental tag from the categorize_text aggregation.
2022-11-09 13:05:35 +00:00
David Roberts
be006e2eee
[ML] Improve categorize_text docs (#90765)
Adds more detail about the meaning of the results
fields of the `categorize_text` aggregation, and
advice about how to use these fields when searching
for messages that match the categories.

Followup to #90723
2022-10-13 10:46:53 +01:00
David Roberts
bfccd20155
[ML] Add a regex to the output of the categorize_text aggregation (#90723)
The new `regex` field in `categorize_text` output is created in
the same way as the `regex` field that appears in the category
definitions created by anomaly detection jobs that do categorization.

It consists of the terms that occur in the same order for every
message that matches the category, separated with a `.+?` wildcard.
It therefore matches the category messages and enforces the order
of the terms that occurred in the same order for all messages used
to create the category.

It is not recommended to use the regex as the primary mechanism for
searching for the original documents that were categorized. Search
using a regular expression is very slow. Instead the terms of the
category should be used to search for matching documents, as a
terms search can use the inverted index and hence be much faster.
However, there may be situations where it is useful to use the
`regex` field to test whether a small set of messages that have not
been indexed match the category.
2022-10-10 11:41:16 +01:00
David Roberts
93bc2e382f
[ML] Replace the implementation of the categorize_text aggregation (#85872)
This replaces the implementation of the categorize_text aggregation
with the new algorithm that was added in #80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs
(and now includes the fixes of elastic/ml-cpp#2277).

The docs are updated to reflect the workings of the new implementation.
2022-05-23 18:46:13 +01:00
Benjamin Trent
b592d2bf01
New random_sampler aggregation for sampling documents in aggregations (#84363)
This adds a new sampling aggregation that performs a background sampling over all documents in an index. 

The syntax is as follows:
```
{
  "aggregations": {
    "sampling": {
      "random_sampler": {
        "probability": 0.1
      },
      "aggs": {
        "price_percentiles": {
          "percentiles": {
            "field": "taxful_total_price"
          }
        }
      }
    }
  }
}
```

This aggregation provides fast random sampling over the entire document set in order to speed up costly aggregations.

Testing this over a variety of aggregations and data sets, the median speed up when sampling at `0.001` over millions of documents is around 70X speed improvement.

Relative error rate does rely on the size of the data and the aggregation kind. Here are some typically expected numbers when sampling over 10s of millions of documents. `p` is the configured probability and `n` is the number of documents matched by your provided filter query.
2022-03-02 14:32:30 -05:00
Benjamin Trent
f245c477d1
[ML] fail on poor configuration for categorize_text (#79586)
This commit fixes a handful of bugs with categorize_text agg

 - The agg now fails on fields that are not text fields
 - Limits the number of tokens categorized
 - Validates the configuration inputs to disallow settings above static maximums
2021-10-21 12:14:27 -04:00
Benjamin Trent
7a7fffcb5a
[ML] Text/Log categorization multi-bucket aggregation (#71752)
This commit adds a new multi-bucket aggregation: `categorize_text`

The aggregation follows a similar design to significant text in that it reads from `_source`
and re-analyzes the the text as it is read. 

Key difference is that it does not use the indexed field's analyzer, but instead relies on 
the `ml_standard` tokenizer with specialized ML token filters. The tokenizer + filters are the
same that machine learning categorization anomaly jobs utilize.

The high level logical flow is as follows:
 - at each shard, read in the text field with a custom analyzer using `ml_standard` tokenizer
 - Read in the particular tokens from the analyzer
 - Feed these tokens to a token tree algorithm (an adaptation of the drain categorization algorithm)
 - Gather the individual log categories (the leaf nodes), sort them by doc_count, ship those buckets to be merged
 - Merge all buckets that have the EXACT same key
 - Once all buckets are merged, pass those keys + counts to a new token tree for additional merging
 - That tree builds the final buckets and that is returned to the user

Algorithm explanation:

 - Each log is parsed with the ml-standard tokenizer
 - each token is passed into a token tree
 - For `max_match_token` each token is stored in the tree and at `max_match_token+1` (or `len(tokens)`) a log group is created
 - If another log group exists at that leaf, merge it if they have `similarity_threshold` percentage of tokens in common
     - merging simply replaces tokens that are different in the group with `*`
 - If a layer in the tree has `max_unique_tokens` we add a `*` child and any new tokens are passed through there. Catch here is that on the final merge, we first attempt to merge together subtrees with the smallest number of documents. Especially if the new sub tree has more documents counted.

## Aggregation configuration.

Here is an example on some openstack logs
```js
POST openstack/_search?size=0
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message", // The field to categorize
        "similarity_threshold": 20, // merge log groups if they are this similar
        "max_unique_tokens": 20, // Max Number of children per token position
        "max_match_token": 4, // Maximum tokens to build prefix trees
        "size": 1
      }
    }
  }
}
```

This will return buckets like
```json
"aggregations" : {
    "categories" : {
      "buckets" : [
        {
          "doc_count" : 806,
          "key" : "nova-api.log.1.2017-05-16_13 INFO nova.osapi_compute.wsgi.server * HTTP/1.1 status len time"
        }
      ]
    }
  }
```
2021-10-04 11:49:16 -04:00