mirror of https://github.com/elastic/elasticsearch.git, synced 2025-06-28 09:28:55 -04:00
Categorization jobs created once the entire cluster is upgraded to version 7.14 or higher will default to using the new ml_standard tokenizer rather than the previous default of the ml_classic tokenizer, and will incorporate the new first_non_blank_line char filter so that categorization is based purely on the first non-blank line of each message. The difference between the ml_classic and ml_standard tokenizers is that ml_classic splits on slashes and colons, so it creates multiple tokens from URLs and filesystem paths, whereas ml_standard attempts to keep URLs, email addresses and filesystem paths as single tokens. It is still possible to configure the ml_classic tokenizer if you prefer: just provide a categorization_analyzer within your analysis_config and whichever tokenizer you choose (which could be ml_classic or any other Elasticsearch tokenizer) will be used. To opt out of using first_non_blank_line as a default char filter, you must explicitly specify a categorization_analyzer that does not include it. If no categorization_analyzer is specified but categorization_filters are specified, then the categorization filters are converted to char filters that are applied after first_non_blank_line. Closes elastic/ml-cpp#1724 | ||
---|---|---|
asyncsearch | ||
ccr | ||
cluster | ||
document | ||
enrich | ||
graph | ||
ilm | ||
indices | ||
ingest | ||
licensing | ||
migration | ||
miscellaneous | ||
ml | ||
rollup | ||
script | ||
search | ||
searchable_snapshots | ||
security | ||
snapshot | ||
tasks | ||
textstructure | ||
transform | ||
watcher | ||
aggs-builders.asciidoc | ||
execution-no-req.asciidoc | ||
execution.asciidoc | ||
getting-started.asciidoc | ||
index.asciidoc | ||
java-builders.asciidoc | ||
migration.asciidoc | ||
query-builders.asciidoc | ||
supported-apis.asciidoc |
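
The commit message above describes opting back into the ml_classic tokenizer, or out of the first_non_blank_line char filter, by supplying an explicit categorization_analyzer inside analysis_config. The following is a minimal sketch of what such a fragment might look like when built as a Python dictionary. The helper name `categorization_analysis_config` and the field name `message` are assumptions made for illustration. The fragment is not a complete job definition, which needs further settings that are omitted here.

```python
import json


def categorization_analysis_config(tokenizer="ml_classic", keep_first_line_filter=True):
    """Build the categorization part of an analysis_config (illustrative helper)."""
    # An explicit analyzer replaces the defaults described in the commit message.
    analyzer = {"tokenizer": tokenizer}
    if keep_first_line_filter:
        # Listing the char filter keeps categorization on the first non-blank line.
        analyzer["char_filter"] = ["first_non_blank_line"]
    return {
        "categorization_field_name": "message",  # assumed field name
        "categorization_analyzer": analyzer,
    }


if __name__ == "__main__":
    # Print the fragment as JSON so it can be inspected or pasted into a request body.
    print(json.dumps(categorization_analysis_config(), indent=2))
```

Calling the helper with `keep_first_line_filter=False` leaves out the char filter entirely, which matches the opt-out route described in the commit message: specify an analyzer that does not include first_non_blank_line.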