[ML] Make ml_standard tokenizer the default for new categorization jobs (#72805)

Categorization jobs created once the entire cluster is upgraded to version 7.14 or higher will default to using the new ml_standard tokenizer rather than the previous default of the ml_classic tokenizer, and will incorporate the new first_non_blank_line char filter so that categorization is based purely on the first non-blank line of each message.

The difference between the ml_classic and ml_standard tokenizers is that ml_classic splits on slashes and colons, and so creates multiple tokens from URLs and filesystem paths, whereas ml_standard attempts to keep URLs, email addresses and filesystem paths as single tokens.

It is still possible to configure the ml_classic tokenizer if you prefer: just provide a categorization_analyzer within your analysis_config and whichever tokenizer you choose (which could be ml_classic or any other Elasticsearch tokenizer) will be used.

To opt out of using first_non_blank_line as a default char filter, you must explicitly specify a categorization_analyzer that does not include it.

If no categorization_analyzer is specified but categorization_filters are specified, then the categorization filters are converted to char filters that are applied after first_non_blank_line.

Closes elastic/ml-cpp#1724
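As an illustration of the opt-out and the filter conversion described above, here is a minimal sketch in Python that only builds request bodies as plain dicts. The job fields (bucket_span, the "message" and "timestamp" field names, the count detector) and the helper names are hypothetical and chosen for the example. The analyzer shown is a bare ml_classic configuration and does not try to reproduce the full previous default. Representing each converted categorization filter as a pattern_replace char filter is an assumption about the equivalent explicit form, not a copy of the actual implementation.

```python
import json


def classic_analyzer():
    """Explicit analyzer (hypothetical helper) that selects ml_classic.

    Because it lists no char_filter, first_non_blank_line is not applied.
    """
    return {"tokenizer": "ml_classic"}


def implied_char_filters(categorization_filters):
    """Sketch of the char filter chain when only categorization_filters are given.

    first_non_blank_line comes first, followed by one char filter per regex.
    The pattern_replace form is an assumption made for illustration.
    """
    chain = ["first_non_blank_line"]
    chain += [{"type": "pattern_replace", "pattern": p} for p in categorization_filters]
    return chain


# Hypothetical job body that opts back in to ml_classic.
job_body = {
    "analysis_config": {
        "bucket_span": "15m",
        "categorization_field_name": "message",
        "categorization_analyzer": classic_analyzer(),
        "detectors": [{"function": "count", "by_field_name": "mlcategory"}],
    },
    "data_description": {"time_field": "timestamp"},
}

if __name__ == "__main__":
    print(json.dumps(job_body, indent=2))
    print(json.dumps(implied_char_filters([r"\[.*?\]"]), indent=2))
```

Jobs that specify neither categorization_analyzer nor categorization_filters need no changes and simply pick up the new defaults once the whole cluster is on 7.14 or higher.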
2025-06-28 09:28:55 -04:00 · 2021-06-01 15:11:32 +01:00 · 2021-06-01 15:11:32 +01:00 · 0059c59e25
commit 0059c59e25
parent 88dfe1aebf
22 changed files with 688 additions and 96 deletions
--- a/docs/java-rest/high-level/ml/update-job.asciidoc
+++ b/docs/java-rest/high-level/ml/update-job.asciidoc
@@ -35,14 +35,13 @@ include-tagged::{doc-tests-file}[{api}-options]
 <2> Updated description.
 <3> Updated analysis limits. 
 <4> Updated background persistence interval.
-<5> Updated analysis config's categorization filters.
-<6> Updated detectors through the `JobUpdate.DetectorUpdate` object.
-<7> Updated group membership.
-<8> Updated result retention.
-<9> Updated model plot configuration.
-<10> Updated model snapshot retention setting.
-<11> Updated custom settings.
-<12> Updated renormalization window.
+<5> Updated detectors through the `JobUpdate.DetectorUpdate` object.
+<6> Updated group membership.
+<7> Updated result retention.
+<8> Updated model plot configuration.
+<9> Updated model snapshot retention setting.
+<10> Updated custom settings.
+<11> Updated renormalization window.

 Included with these options are specific optional `JobUpdate.DetectorUpdate` updates.
 ["source","java",subs="attributes,callouts,macros"]