mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-06-28 09:28:55 -04:00
[ML] Make ml_standard tokenizer the default for new categorization jobs (#72805)
Categorization jobs created once the entire cluster is upgraded to version 7.14 or higher will default to using the new ml_standard tokenizer rather than the previous default of the ml_classic tokenizer, and will incorporate the new first_non_blank_line char filter so that categorization is based purely on the first non-blank line of each message.

The difference between the ml_classic and ml_standard tokenizers is that ml_classic splits on slashes and colons, so it creates multiple tokens from URLs and filesystem paths, whereas ml_standard attempts to keep URLs, email addresses and filesystem paths as single tokens.

It is still possible to configure the ml_classic tokenizer if you prefer: just provide a categorization_analyzer within your analysis_config, and whichever tokenizer you choose (which could be ml_classic or any other Elasticsearch tokenizer) will be used.

To opt out of using first_non_blank_line as a default char filter, you must explicitly specify a categorization_analyzer that does not include it.

If no categorization_analyzer is specified but categorization_filters are specified, then the categorization filters are converted to char filters that are applied after first_non_blank_line.

Closes elastic/ml-cpp#1724
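As a rough illustration of the opt-out described in the commit message above, the sketch below builds an analysis_config that names ml_classic explicitly and leaves first_non_blank_line out of the char filters. The helper name `classic_analysis_config`, the field name `message`, the `15m` bucket span and the single `count` detector are illustrative assumptions, not part of this commit. Mapping each filter pattern to a `pattern_replace` char filter is also an assumption about how you might carry existing categorization_filters over by hand.

```python
import json


def classic_analysis_config(field_name="message", patterns=None):
    """Build an analysis_config dict that opts back in to ml_classic.

    Hypothetical helper: an explicit categorization_analyzer is supplied,
    so neither the ml_standard default tokenizer nor the default
    first_non_blank_line char filter applies.
    """
    # Assumption: each former categorization filter regex becomes one
    # pattern_replace char filter inside the explicit analyzer.
    char_filters = [
        {"type": "pattern_replace", "pattern": pattern}
        for pattern in (patterns or [])
    ]
    return {
        "bucket_span": "15m",  # illustrative value
        "categorization_field_name": field_name,
        "categorization_analyzer": {
            "char_filter": char_filters,  # first_non_blank_line omitted
            "tokenizer": "ml_classic",
        },
        "detectors": [
            {"function": "count", "by_field_name": "mlcategory"},
        ],
    }


if __name__ == "__main__":
    # Print the JSON fragment that would sit under "analysis_config"
    # in a job creation request body.
    print(json.dumps(classic_analysis_config(patterns=[r"\[.*?\]"]), indent=2))
```

To keep the first-line-only behaviour while still using ml_classic, you would add the string `first_non_blank_line` at the front of the `char_filter` list.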
parent 88dfe1aebf
commit 0059c59e25
22 changed files with 688 additions and 96 deletions
@@ -35,14 +35,13 @@ include-tagged::{doc-tests-file}[{api}-options]
 <2> Updated description.
 <3> Updated analysis limits.
 <4> Updated background persistence interval.
-<5> Updated analysis config's categorization filters.
-<6> Updated detectors through the `JobUpdate.DetectorUpdate` object.
-<7> Updated group membership.
-<8> Updated result retention.
-<9> Updated model plot configuration.
-<10> Updated model snapshot retention setting.
-<11> Updated custom settings.
-<12> Updated renormalization window.
+<5> Updated detectors through the `JobUpdate.DetectorUpdate` object.
+<6> Updated group membership.
+<7> Updated result retention.
+<8> Updated model plot configuration.
+<9> Updated model snapshot retention setting.
+<10> Updated custom settings.
+<11> Updated renormalization window.
 
 Included with these options are specific optional `JobUpdate.DetectorUpdate` updates.
 ["source","java",subs="attributes,callouts,macros"]