Commit graph

5 commits

Author SHA1 Message Date
James Rodewig
f56a0f4b66
[DOCS] Remove testenv annotations from doc snippet tests (#80023)
Removes `testenv` annotations and related code. These annotations originally let you skip x-pack snippet tests in the docs. However, that's no longer possible.

Relates to #79309, #31619
2021-11-05 18:38:50 -04:00
Benjamin Trent
281ec58b8d
[ML] add new default char filter first_line_with_letters for machine learning categorization (#77457)
The char filter replaces the previous default of `first_non_blank_line`.

`first_non_blank_line` worked well to figure out what line had characters at all, but log lines 
like the following were handled poorly:
```
--------------------------------------------------------------------------------

Alias 'foo' already exists and this prevents setting up ILM for logs

--------------------------------------------------------------------------------
```
When combined with the `ml_standard` tokenizer, the first line was used:
```
--------------------------------------------------------------------------------
```
This has no valid tokens for our standard tokenizer. Consequently, no tokens were found by `ml_standard` tokenizer.


The new filter, `first_line_with_letters`, returns the first line with any letter character (e.g. `Character#isLetter` returns true).

Given the previously poorly handled log, when combining with our `ml_standard` tokenizer, we get the following, more appropriate, tokens:

```
"tokens" : ["Alias", "foo", "already", "exists", "and", "this", "prevents", "setting", "up", "ILM", "for", "logs"]
```
2021-09-09 10:09:57 -04:00
David Roberts
0059c59e25
[ML] Make ml_standard tokenizer the default for new categorization jobs (#72805)
Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.

It is still possible to config the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.

If no categorization_analyzer is specified but categorization_filters
are specified then the categorization filters are converted to char
filters applied that are applied after first_non_blank_line.

Closes elastic/ml-cpp#1724
2021-06-01 15:11:32 +01:00
Lisa Cawley
f05d8c2b98
[DOCS] Per-partition categorization (#61506) 2020-08-26 17:07:46 -07:00
Lisa Cawley
fb0157460f
[DOCS] Changes level offset of anomaly detection pages (#59911) 2020-07-20 16:33:54 -07:00
Renamed from docs/reference/ml/anomaly-detection/categories.asciidoc (Browse further)