[ML] Make ml_standard tokenizer the default for new categorization jobs (#72805)

Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, and so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
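
As a rough illustration of that contrast, here is a minimal standalone
sketch. The regular expressions only approximate the splitting behaviour
and are not the real MlClassicTokenizer or MlStandardTokenizer
implementations; the TokenizerContrast class name is invented for the
example, and the sample message is a fragment of the nginx log line used
in the tests further down.

import java.util.Arrays;

public class TokenizerContrast {
    public static void main(String[] args) {
        String msg = "request body buffered to temporary file /tmp/client-body/0000021894";

        // ml_classic-style split: slashes (and colons) act as separators, so the
        // filesystem path breaks apart into several tokens
        System.out.println(Arrays.toString(msg.split("[^A-Za-z0-9_.-]+")));
        // [request, body, buffered, to, temporary, file, tmp, client-body, 0000021894]

        // ml_standard-style split: slashes are allowed inside tokens, so the path
        // survives as a single token
        System.out.println(Arrays.toString(msg.split("[^A-Za-z0-9_./-]+")));
        // [request, body, buffered, to, temporary, file, /tmp/client-body/0000021894]
    }
}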

It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config, and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.

If no categorization_analyzer is specified but categorization_filters
are specified, then the categorization filters are converted to char
filters that are applied after first_non_blank_line.
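
As a sketch of that conversion, the following uses the public helper
added in this commit (CategorizationAnalyzerConfig
.buildStandardCategorizationAnalyzer); the wrapper class name and the
"cat1.*" filter value are only example values, mirroring the REST test
further down.

import java.util.Arrays;

import org.elasticsearch.xpack.core.ml.job.config.CategorizationAnalyzerConfig;

public class DefaultAnalyzerSketch {
    public static void main(String[] args) {
        // Equivalent analyzer for a new job that supplies categorization_filters but
        // no categorization_analyzer: the first_non_blank_line char filter, then the
        // supplied filters as pattern_replace char filters, then the ml_standard
        // tokenizer and the day/month stopword filter.
        CategorizationAnalyzerConfig analyzer =
            CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(Arrays.asList("cat1.*"));
    }
}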

Closes elastic/ml-cpp#1724
David Roberts 2021-06-01 15:11:32 +01:00 committed by GitHub
parent 88dfe1aebf
commit 0059c59e25
22 changed files with 688 additions and 96 deletions


@ -588,14 +588,13 @@ public class MlClientDocumentationIT extends ESRestHighLevelClientTestCase {
.setDescription("My description") // <2>
.setAnalysisLimits(new AnalysisLimits(1000L, null)) // <3>
.setBackgroundPersistInterval(TimeValue.timeValueHours(3)) // <4>
.setCategorizationFilters(Arrays.asList("categorization-filter")) // <5>
.setDetectorUpdates(Arrays.asList(detectorUpdate)) // <6>
.setGroups(Arrays.asList("job-group-1")) // <7>
.setResultsRetentionDays(10L) // <8>
.setModelPlotConfig(new ModelPlotConfig(true, null, true)) // <9>
.setModelSnapshotRetentionDays(7L) // <10>
.setCustomSettings(customSettings) // <11>
.setRenormalizationWindowDays(3L) // <12>
.setDetectorUpdates(Arrays.asList(detectorUpdate)) // <5>
.setGroups(Arrays.asList("job-group-1")) // <6>
.setResultsRetentionDays(10L) // <7>
.setModelPlotConfig(new ModelPlotConfig(true, null, true)) // <8>
.setModelSnapshotRetentionDays(7L) // <9>
.setCustomSettings(customSettings) // <10>
.setRenormalizationWindowDays(3L) // <11>
.build();
// end::update-job-options


@ -35,14 +35,13 @@ include-tagged::{doc-tests-file}[{api}-options]
<2> Updated description.
<3> Updated analysis limits.
<4> Updated background persistence interval.
<5> Updated analysis config's categorization filters.
<6> Updated detectors through the `JobUpdate.DetectorUpdate` object.
<7> Updated group membership.
<8> Updated result retention.
<9> Updated model plot configuration.
<10> Updated model snapshot retention setting.
<11> Updated custom settings.
<12> Updated renormalization window.
<5> Updated detectors through the `JobUpdate.DetectorUpdate` object.
<6> Updated group membership.
<7> Updated result retention.
<8> Updated model plot configuration.
<9> Updated model snapshot retention setting.
<10> Updated custom settings.
<11> Updated renormalization window.
Included with these options are specific optional `JobUpdate.DetectorUpdate` updates.
["source","java",subs="attributes,callouts,macros"]


@ -49,7 +49,10 @@ This is a possible response:
"defaults" : {
"anomaly_detectors" : {
"categorization_analyzer" : {
"tokenizer" : "ml_classic",
"char_filter" : [
"first_non_blank_line"
],
"tokenizer" : "ml_standard",
"filter" : [
{
"type" : "stop",


@ -157,7 +157,10 @@ POST _ml/anomaly_detectors/_validate
{
"analysis_config" : {
"categorization_analyzer" : {
"tokenizer" : "ml_classic",
"char_filter" : [
"first_non_blank_line"
],
"tokenizer" : "ml_standard",
"filter" : [
{ "type" : "stop", "stopwords": [
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@ -182,7 +185,7 @@ POST _ml/anomaly_detectors/_validate
If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.
The `ml_classic` tokenizer and the day and month stopword filter are more or
The `ml_standard` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
@ -201,15 +204,18 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
"detector_description": "Unusual message counts"
}],
"categorization_analyzer":{
"char_filter" : [
"first_non_blank_line" <1>
],
"tokenizer": {
"type" : "simple_pattern_split",
"pattern" : "[^-0-9A-Za-z_.]+" <1>
"pattern" : "[^-0-9A-Za-z_./]+" <2>
},
"filter": [
{ "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
{ "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
{ "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
{ "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
{ "type" : "pattern_replace", "pattern": "^[0-9].*" }, <3>
{ "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <4>
{ "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <5>
{ "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <6>
{ "type" : "stop", "stopwords": [
"",
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@ -232,17 +238,20 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
----------------------------------
// TEST[skip:needs-licence]
<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.
<1> Only consider the first non-blank line of the message for categorization purposes.
<2> Tokens basically consist of hyphens, digits, letters, underscores, dots and slashes.
<3> By default, categorization ignores tokens that begin with a digit.
<4> By default, categorization also ignores tokens that are hexadecimal numbers.
<5> Underscores, hyphens, and dots are removed from the beginning of tokens.
<6> Underscores, hyphens, and dots are also removed from the end of tokens.
The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_classic` tokenizer is several times
faster. The difference in behavior is that this custom analyzer does not include
accented letters in tokens whereas the `ml_classic` tokenizer does, although
that could be fixed by using more complex regular expressions.
example analyzer is that using the `ml_standard` tokenizer is several times
faster. The `ml_standard` tokenizer also tries to preserve URLs, Windows paths
and email addresses as single tokens. Another difference in behavior is that
this custom analyzer does not include accented letters in tokens whereas the
`ml_standard` tokenizer does, although that could be fixed by using more complex
regular expressions.
If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month


@ -1592,11 +1592,17 @@ end::timestamp-results[]
tag::tokenizer[]
The name or definition of the <<analysis-tokenizers,tokenizer>> to use after
character filters are applied. This property is compulsory if
`categorization_analyzer` is specified as an object. Machine learning provides a
tokenizer called `ml_classic` that tokenizes in the same way as the
non-customizable tokenizer in older versions of the product. If you want to use
that tokenizer but change the character or token filters, specify
`"tokenizer": "ml_classic"` in your `categorization_analyzer`.
`categorization_analyzer` is specified as an object. Machine learning provides
a tokenizer called `ml_standard` that tokenizes in a way that has been
determined to produce good categorization results on a variety of log
file formats for logs in English. If you want to use that tokenizer but
change the character or token filters, specify `"tokenizer": "ml_standard"`
in your `categorization_analyzer`. Additionally, the `ml_classic` tokenizer
is available, which tokenizes in the same way as the non-customizable
tokenizer in old versions of the product (before 6.2). `ml_classic` was
the default categorization tokenizer in versions 6.2 to 7.13, so if you
need categorization identical to the default for jobs created in these
versions, specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.
end::tokenizer[]
tag::total-by-field-count[]


@ -117,9 +117,12 @@ tasks.named("yamlRestCompatTest").configure {
'ml/datafeeds_crud/Test update datafeed to point to job already attached to another datafeed',
'ml/datafeeds_crud/Test update datafeed to point to missing job',
'ml/job_cat_apis/Test cat anomaly detector jobs',
'ml/jobs_crud/Test update job',
'ml/jobs_get_stats/Test get job stats after uploading data prompting the creation of some stats',
'ml/jobs_get_stats/Test get job stats for closed job',
'ml/jobs_get_stats/Test no exception on get job stats with missing index',
// TODO: the ml_info mute can be removed from master once the ml_standard tokenizer is in 7.x
'ml/ml_info/Test ml info',
'ml/post_data/Test POST data job api, flush, close and verify DataCounts doc',
'ml/post_data/Test flush with skip_time',
'ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled',


@ -145,38 +145,39 @@ public class CategorizationAnalyzerConfig implements ToXContentFragment, Writeab
}
/**
* Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the ML C++
* code do. This is the default analyzer for categorization to ensure that people upgrading from previous versions
* Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the original ML
* C++ code do. This is the default analyzer for categorization to ensure that people upgrading from old versions
* get the same behaviour from their categorization jobs before and after upgrade.
* @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>.
* @return The default categorization analyzer.
*/
public static CategorizationAnalyzerConfig buildDefaultCategorizationAnalyzer(List<String> categorizationFilters) {
CategorizationAnalyzerConfig.Builder builder = new CategorizationAnalyzerConfig.Builder();
if (categorizationFilters != null) {
for (String categorizationFilter : categorizationFilters) {
Map<String, Object> charFilter = new HashMap<>();
charFilter.put("type", "pattern_replace");
charFilter.put("pattern", categorizationFilter);
builder.addCharFilter(charFilter);
}
return new CategorizationAnalyzerConfig.Builder()
.addCategorizationFilters(categorizationFilters)
.setTokenizer("ml_classic")
.addDateWordsTokenFilter()
.build();
}
builder.setTokenizer("ml_classic");
/**
* Create a <code>categorization_analyzer</code> that will be used for newly created jobs where no categorization
* analyzer is explicitly provided. This analyzer differs from the default one in that it uses the <code>ml_standard</code>
* tokenizer instead of the <code>ml_classic</code> tokenizer, and it only considers the first non-blank line of each message.
* This analyzer is <em>not</em> used for jobs that specify no categorization analyzer, as that would break jobs that were
* originally run in older versions. Instead, this analyzer is explicitly added to newly created jobs once the entire cluster
* is upgraded to version 7.14 or above.
* @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>.
* @return The standard categorization analyzer.
*/
public static CategorizationAnalyzerConfig buildStandardCategorizationAnalyzer(List<String> categorizationFilters) {
Map<String, Object> tokenFilter = new HashMap<>();
tokenFilter.put("type", "stop");
tokenFilter.put("stopwords", Arrays.asList(
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"));
builder.addTokenFilter(tokenFilter);
return builder.build();
return new CategorizationAnalyzerConfig.Builder()
.addCharFilter("first_non_blank_line")
.addCategorizationFilters(categorizationFilters)
.setTokenizer("ml_standard")
.addDateWordsTokenFilter()
.build();
}
private final String analyzer;
@ -311,6 +312,18 @@ public class CategorizationAnalyzerConfig implements ToXContentFragment, Writeab
return this;
}
public Builder addCategorizationFilters(List<String> categorizationFilters) {
if (categorizationFilters != null) {
for (String categorizationFilter : categorizationFilters) {
Map<String, Object> charFilter = new HashMap<>();
charFilter.put("type", "pattern_replace");
charFilter.put("pattern", categorizationFilter);
addCharFilter(charFilter);
}
}
return this;
}
public Builder setTokenizer(String tokenizer) {
this.tokenizer = new NameOrDefinition(tokenizer);
return this;
@ -331,6 +344,19 @@ public class CategorizationAnalyzerConfig implements ToXContentFragment, Writeab
return this;
}
Builder addDateWordsTokenFilter() {
Map<String, Object> tokenFilter = new HashMap<>();
tokenFilter.put("type", "stop");
tokenFilter.put("stopwords", Arrays.asList(
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"));
addTokenFilter(tokenFilter);
return this;
}
/**
* Create a config validating only structure, not exact analyzer/tokenizer/filter names
*/


@ -17,9 +17,11 @@ restResources {
tasks.named("yamlRestTest").configure {
systemProperty 'tests.rest.blacklist', [
// Remove this test because it doesn't call an ML endpoint and we don't want
// Remove these tests because they don't call an ML endpoint and we don't want
// to grant extra permissions to the users used in this test suite
'ml/ml_classic_analyze/Test analyze API with an analyzer that does what we used to do in native code',
'ml/ml_standard_analyze/Test analyze API with the standard 7.14 ML analyzer',
'ml/ml_standard_analyze/Test 7.14 analyzer with blank lines',
// Remove tests that are expected to throw an exception, because we cannot then
// know whether to expect an authorization exception or a validation exception
'ml/calendar_crud/Test get calendar given missing',


@ -45,6 +45,7 @@ import org.elasticsearch.common.util.concurrent.EsExecutors;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.env.Environment;
import org.elasticsearch.env.NodeEnvironment;
import org.elasticsearch.index.analysis.CharFilterFactory;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.SystemIndexDescriptor;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
@ -264,6 +265,8 @@ import org.elasticsearch.xpack.ml.inference.persistence.TrainedModelProvider;
import org.elasticsearch.xpack.ml.job.JobManager;
import org.elasticsearch.xpack.ml.job.JobManagerHolder;
import org.elasticsearch.xpack.ml.job.UpdateJobProcessNotifier;
import org.elasticsearch.xpack.ml.job.categorization.FirstNonBlankLineCharFilter;
import org.elasticsearch.xpack.ml.job.categorization.FirstNonBlankLineCharFilterFactory;
import org.elasticsearch.xpack.ml.job.categorization.MlClassicTokenizer;
import org.elasticsearch.xpack.ml.job.categorization.MlClassicTokenizerFactory;
import org.elasticsearch.xpack.ml.job.categorization.MlStandardTokenizer;
@ -1076,6 +1079,10 @@ public class MachineLearning extends Plugin implements SystemIndexPlugin,
return Arrays.asList(jobComms, utility, datafeed);
}
public Map<String, AnalysisProvider<CharFilterFactory>> getCharFilters() {
return Collections.singletonMap(FirstNonBlankLineCharFilter.NAME, FirstNonBlankLineCharFilterFactory::new);
}
@Override
public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
return Map.of(MlClassicTokenizer.NAME, MlClassicTokenizerFactory::new,


@ -98,7 +98,7 @@ public class TransportMlInfoAction extends HandledTransportAction<MlInfoAction.R
Job.DEFAULT_DAILY_MODEL_SNAPSHOT_RETENTION_AFTER_DAYS);
try {
defaults.put(CategorizationAnalyzerConfig.CATEGORIZATION_ANALYZER.getPreferredName(),
CategorizationAnalyzerConfig.buildDefaultCategorizationAnalyzer(Collections.emptyList())
CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(Collections.emptyList())
.asMap(xContentRegistry).get(CategorizationAnalyzerConfig.CATEGORIZATION_ANALYZER.getPreferredName()));
} catch (IOException e) {
logger.error("failed to convert default categorization analyzer to map", e);


@ -10,6 +10,7 @@ import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.ResourceAlreadyExistsException;
import org.elasticsearch.ResourceNotFoundException;
import org.elasticsearch.Version;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.support.WriteRequest;
@ -39,6 +40,7 @@ import org.elasticsearch.xpack.core.ml.MlTasks;
import org.elasticsearch.xpack.core.ml.action.PutJobAction;
import org.elasticsearch.xpack.core.ml.action.RevertModelSnapshotAction;
import org.elasticsearch.xpack.core.ml.action.UpdateJobAction;
import org.elasticsearch.xpack.core.ml.job.config.AnalysisConfig;
import org.elasticsearch.xpack.core.ml.job.config.AnalysisLimits;
import org.elasticsearch.xpack.core.ml.job.config.CategorizationAnalyzerConfig;
import org.elasticsearch.xpack.core.ml.job.config.DataDescription;
@ -85,6 +87,8 @@ import java.util.regex.Pattern;
*/
public class JobManager {
private static final Version MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER = Version.V_7_14_0;
private static final Logger logger = LogManager.getLogger(JobManager.class);
private static final DeprecationLogger deprecationLogger = DeprecationLogger.getLogger(JobManager.class);
@ -220,17 +224,31 @@ public class JobManager {
/**
* Validate the char filter/tokenizer/token filter names used in the categorization analyzer config (if any).
* This validation has to be done server-side; it cannot be done in a client as that won't have loaded the
* appropriate analysis modules/plugins.
* The overall structure can be validated at parse time, but the exact names need to be checked separately,
* as plugins that provide the functionality can be installed/uninstalled.
* If the user has not provided a categorization analyzer then set the standard one if categorization is
* being used at all and all the nodes in the cluster are running a version that will understand it. This
method must only be called when a job is first created; since it applies a default, calling it
after that could change the meaning of a job that has already run. The validation in this
* method has to be done server-side; it cannot be done in a client as that won't have loaded the appropriate
* analysis modules/plugins. (The overall structure can be validated at parse time, but the exact names need
* to be checked separately, as plugins that provide the functionality can be installed/uninstalled.)
*/
static void validateCategorizationAnalyzer(Job.Builder jobBuilder, AnalysisRegistry analysisRegistry)
throws IOException {
CategorizationAnalyzerConfig categorizationAnalyzerConfig = jobBuilder.getAnalysisConfig().getCategorizationAnalyzerConfig();
static void validateCategorizationAnalyzerOrSetDefault(Job.Builder jobBuilder, AnalysisRegistry analysisRegistry,
Version minNodeVersion) throws IOException {
AnalysisConfig analysisConfig = jobBuilder.getAnalysisConfig();
CategorizationAnalyzerConfig categorizationAnalyzerConfig = analysisConfig.getCategorizationAnalyzerConfig();
if (categorizationAnalyzerConfig != null) {
CategorizationAnalyzer.verifyConfigBuilder(new CategorizationAnalyzerConfig.Builder(categorizationAnalyzerConfig),
analysisRegistry);
} else if (analysisConfig.getCategorizationFieldName() != null
&& minNodeVersion.onOrAfter(MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER)) {
// Any supplied categorization filters are transferred into the new categorization analyzer.
// The user supplied categorization filters will already have been validated when the put job
// request was built, so we know they're valid.
AnalysisConfig.Builder analysisConfigBuilder = new AnalysisConfig.Builder(analysisConfig)
.setCategorizationAnalyzerConfig(
CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(analysisConfig.getCategorizationFilters()))
.setCategorizationFilters(null);
jobBuilder.setAnalysisConfig(analysisConfigBuilder);
}
}
@ -240,10 +258,12 @@ public class JobManager {
public void putJob(PutJobAction.Request request, AnalysisRegistry analysisRegistry, ClusterState state,
ActionListener<PutJobAction.Response> actionListener) throws IOException {
Version minNodeVersion = state.getNodes().getMinNodeVersion();
Job.Builder jobBuilder = request.getJobBuilder();
jobBuilder.validateAnalysisLimitsAndSetDefaults(maxModelMemoryLimit);
jobBuilder.validateModelSnapshotRetentionSettingsAndSetDefaults();
validateCategorizationAnalyzer(jobBuilder, analysisRegistry);
validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, minNodeVersion);
Job job = jobBuilder.build(new Date());


@ -20,10 +20,14 @@ public abstract class AbstractMlTokenizer extends Tokenizer {
protected final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
protected final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
/**
* The internal offset stores the offset in the potentially filtered input to the tokenizer.
* This must be corrected before setting the offset attribute for user-visible output.
*/
protected int nextOffset;
protected int skippedPositions;
AbstractMlTokenizer() {
protected AbstractMlTokenizer() {
}
@Override
@ -31,7 +35,8 @@ public abstract class AbstractMlTokenizer extends Tokenizer {
super.end();
// Set final offset
int finalOffset = nextOffset + (int) input.skip(Integer.MAX_VALUE);
offsetAtt.setOffset(finalOffset, finalOffset);
int correctedFinalOffset = correctOffset(finalOffset);
offsetAtt.setOffset(correctedFinalOffset, correctedFinalOffset);
// Adjust any skipped tokens
posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
}


@ -0,0 +1,98 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/
package org.elasticsearch.xpack.ml.job.categorization;
import org.apache.lucene.analysis.charfilter.BaseCharFilter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
/**
* A character filter that keeps the first non-blank line in the input, and discards everything before and after it.
* Treats both <code>\n</code> and <code>\r\n</code> as line endings. If there is a line ending at the end of the
* first non-blank line this is discarded. A line is considered blank if {@link Character#isWhitespace} returns
* <code>true</code> for all the characters in it.
*
* It is possible to achieve the same effect with a <code>pattern_replace</code> filter, but since this filter
* needs to be run on every single message to be categorized it is worth having a more performant specialization.
*/
public class FirstNonBlankLineCharFilter extends BaseCharFilter {
public static final String NAME = "first_non_blank_line";
private Reader transformedInput;
FirstNonBlankLineCharFilter(Reader in) {
super(in);
}
@Override
public int read(char[] cbuf, int off, int len) throws IOException {
// Buffer all input on the first call.
if (transformedInput == null) {
fill();
}
return transformedInput.read(cbuf, off, len);
}
@Override
public int read() throws IOException {
if (transformedInput == null) {
fill();
}
return transformedInput.read();
}
private void fill() throws IOException {
StringBuilder buffered = new StringBuilder();
char[] temp = new char[1024];
for (int cnt = input.read(temp); cnt > 0; cnt = input.read(temp)) {
buffered.append(temp, 0, cnt);
}
transformedInput = new StringReader(process(buffered).toString());
}
private CharSequence process(CharSequence input) {
boolean seenNonWhitespaceChar = false;
int prevNewlineIndex = -1;
int endIndex = -1;
for (int index = 0; index < input.length(); ++index) {
if (input.charAt(index) == '\n') {
if (seenNonWhitespaceChar) {
// With Windows line endings chop the \r as well as the \n
endIndex = (input.charAt(index - 1) == '\r') ? (index - 1) : index;
break;
}
prevNewlineIndex = index;
} else {
seenNonWhitespaceChar = seenNonWhitespaceChar || Character.isWhitespace(input.charAt(index)) == false;
}
}
if (seenNonWhitespaceChar == false) {
return "";
}
if (endIndex == -1) {
if (prevNewlineIndex == -1) {
// This is pretty likely, as most log messages _aren't_ multiline, so worth optimising
// for even though the return at the end of the method would be functionally identical
return input;
}
endIndex = input.length();
}
addOffCorrectMap(0, prevNewlineIndex + 1);
return input.subSequence(prevNewlineIndex + 1, endIndex);
}
}


@ -0,0 +1,27 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/
package org.elasticsearch.xpack.ml.job.categorization;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractCharFilterFactory;
import java.io.Reader;
public class FirstNonBlankLineCharFilterFactory extends AbstractCharFilterFactory {
public FirstNonBlankLineCharFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {
super(indexSettings, name);
}
@Override
public Reader create(Reader tokenStream) {
return new FirstNonBlankLineCharFilter(tokenStream);
}
}


@ -84,7 +84,7 @@ public class MlClassicTokenizer extends AbstractMlTokenizer {
// Characters that may exist in the term attribute beyond its defined length are ignored
termAtt.setLength(length);
offsetAtt.setOffset(start, start + length);
offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
posIncrAtt.setPositionIncrement(skippedPositions + 1);
return true;


@ -136,7 +136,7 @@ public class MlStandardTokenizer extends AbstractMlTokenizer {
// Characters that may exist in the term attribute beyond its defined length are ignored
termAtt.setLength(length);
offsetAtt.setOffset(start, start + length);
offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
posIncrAtt.setPositionIncrement(skippedPositions + 1);
return true;


@ -49,6 +49,7 @@ import org.elasticsearch.xpack.core.ml.action.PutJobAction;
import org.elasticsearch.xpack.core.ml.action.UpdateJobAction;
import org.elasticsearch.xpack.core.action.util.QueryPage;
import org.elasticsearch.xpack.core.ml.job.config.AnalysisConfig;
import org.elasticsearch.xpack.core.ml.job.config.CategorizationAnalyzerConfig;
import org.elasticsearch.xpack.core.ml.job.config.DataDescription;
import org.elasticsearch.xpack.core.ml.job.config.DetectionRule;
import org.elasticsearch.xpack.core.ml.job.config.Detector;
@ -91,6 +92,7 @@ import static org.hamcrest.Matchers.hasSize;
import static org.hamcrest.Matchers.instanceOf;
import static org.hamcrest.Matchers.is;
import static org.hamcrest.Matchers.lessThanOrEqualTo;
import static org.hamcrest.Matchers.nullValue;
import static org.mockito.Matchers.any;
import static org.mockito.Mockito.doAnswer;
import static org.mockito.Mockito.mock;
@ -584,10 +586,68 @@ public class JobManagerTests extends ESTestCase {
assertThat(capturedUpdateParams.get(1).isUpdateScheduledEvents(), is(true));
}
public void testValidateCategorizationAnalyzer_GivenValid() throws IOException {
List<String> categorizationFilters = randomBoolean() ? Collections.singletonList("query: .*") : null;
CategorizationAnalyzerConfig c = CategorizationAnalyzerConfig.buildDefaultCategorizationAnalyzer(categorizationFilters);
Job.Builder jobBuilder = createCategorizationJob(c, null);
JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.CURRENT);
Job job = jobBuilder.build(new Date());
assertThat(job.getAnalysisConfig().getCategorizationAnalyzerConfig(),
equalTo(CategorizationAnalyzerConfig.buildDefaultCategorizationAnalyzer(categorizationFilters)));
}
public void testValidateCategorizationAnalyzer_GivenInvalid() {
CategorizationAnalyzerConfig c = new CategorizationAnalyzerConfig.Builder().setAnalyzer("does_not_exist").build();
Job.Builder jobBuilder = createCategorizationJob(c, null);
IllegalArgumentException e = expectThrows(IllegalArgumentException.class,
() -> JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.CURRENT));
assertThat(e.getMessage(), equalTo("Failed to find global analyzer [does_not_exist]"));
}
public void testSetDefaultCategorizationAnalyzer_GivenAllNewNodes() throws IOException {
List<String> categorizationFilters = randomBoolean() ? Collections.singletonList("query: .*") : null;
Job.Builder jobBuilder = createCategorizationJob(null, categorizationFilters);
JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.CURRENT);
Job job = jobBuilder.build(new Date());
assertThat(job.getAnalysisConfig().getCategorizationAnalyzerConfig(),
equalTo(CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(categorizationFilters)));
}
// TODO: This test can be deleted from branches that would never have to talk to a 7.13 node
public void testSetDefaultCategorizationAnalyzer_GivenOldNodeInCluster() throws IOException {
List<String> categorizationFilters = randomBoolean() ? Collections.singletonList("query: .*") : null;
Job.Builder jobBuilder = createCategorizationJob(null, categorizationFilters);
JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.V_7_13_0);
Job job = jobBuilder.build(new Date());
assertThat(job.getAnalysisConfig().getCategorizationAnalyzerConfig(), nullValue());
}
private Job.Builder createCategorizationJob(CategorizationAnalyzerConfig categorizationAnalyzerConfig,
List<String> categorizationFilters) {
Detector.Builder d = new Detector.Builder("count", null).setByFieldName("mlcategory");
AnalysisConfig.Builder ac = new AnalysisConfig.Builder(Collections.singletonList(d.build()))
.setCategorizationFieldName("message")
.setCategorizationAnalyzerConfig(categorizationAnalyzerConfig)
.setCategorizationFilters(categorizationFilters);
Job.Builder builder = new Job.Builder();
builder.setId("cat");
builder.setAnalysisConfig(ac);
builder.setDataDescription(new DataDescription.Builder());
return builder;
}
private Job.Builder createJob() {
Detector.Builder d1 = new Detector.Builder("info_content", "domain");
d1.setOverFieldName("client");
AnalysisConfig.Builder ac = new AnalysisConfig.Builder(Collections.singletonList(d1.build()));
Detector.Builder d = new Detector.Builder("info_content", "domain").setOverFieldName("client");
AnalysisConfig.Builder ac = new AnalysisConfig.Builder(Collections.singletonList(d.build()));
Job.Builder builder = new Job.Builder();
builder.setId("foo");


@ -25,6 +25,29 @@ import java.util.Map;
public class CategorizationAnalyzerTests extends ESTestCase {
private static final String NGINX_ERROR_EXAMPLE =
"a client request body is buffered to a temporary file /tmp/client-body/0000021894, client: 10.8.0.12, " +
"server: apm.35.205.226.121.ip.es.io, request: \"POST /intake/v2/events HTTP/1.1\", host: \"apm.35.205.226.121.ip.es.io\"\n" +
"10.8.0.12 - - [29/Nov/2020:21:34:55 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" " +
"\"elasticapm-dotnet/1.5.1 System.Net.Http/4.6.28208.02 .NET_Core/2.2.8\" 27821 0.002 [default-apm-apm-server-8200] [] " +
"10.8.1.19:8200 0 0.001 202 f961c776ff732f5c8337530aa22c7216\n" +
"10.8.0.14 - - [29/Nov/2020:21:34:56 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" " +
"\"elasticapm-python/5.10.0\" 3594 0.002 [default-apm-apm-server-8200] [] 10.8.1.18:8200 0 0.001 202 " +
"61feb8fb9232b1ebe54b588b95771ce4\n" +
"10.8.4.90 - - [29/Nov/2020:21:34:56 +0000] \"OPTIONS /intake/v2/rum/events HTTP/2.0\" 200 0 " +
"\"http://opbeans-frontend:3000/dashboard\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " +
"Cypress/3.3.1 Chrome/61.0.3163.100 Electron/2.0.18 Safari/537.36\" 292 0.001 [default-apm-apm-server-8200] [] " +
"10.8.1.19:8200 0 0.000 200 5fbe8cd4d217b932def1c17ed381c66b\n" +
"10.8.4.90 - - [29/Nov/2020:21:34:56 +0000] \"POST /intake/v2/rum/events HTTP/2.0\" 202 0 " +
"\"http://opbeans-frontend:3000/dashboard\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " +
"Cypress/3.3.1 Chrome/61.0.3163.100 Electron/2.0.18 Safari/537.36\" 3004 0.001 [default-apm-apm-server-8200] [] " +
"10.8.1.18:8200 0 0.001 202 4735f571928595744ac6a9545c3ecdf5\n" +
"10.8.0.11 - - [29/Nov/2020:21:34:56 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" " +
"\"elasticapm-node/3.8.0 elastic-apm-http-client/9.4.2 node/12.20.0\" 4913 10.006 [default-apm-apm-server-8200] [] " +
"10.8.1.18:8200 0 0.002 202 1eac41789ea9a60a8be4e476c54cbbc9\n" +
"10.8.0.14 - - [29/Nov/2020:21:34:57 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" \"elasticapm-python/5.10.0\" 1025 " +
"0.001 [default-apm-apm-server-8200] [] 10.8.1.18:8200 0 0.001 202 d27088936cadd3b8804b68998a5f94fa";
private AnalysisRegistry analysisRegistry;
public static AnalysisRegistry buildTestAnalysisRegistry(Environment environment) throws Exception {
@ -218,6 +241,19 @@ public class CategorizationAnalyzerTests extends ESTestCase {
assertEquals(Arrays.asList("PSYoungGen", "total", "used"),
categorizationAnalyzer.tokenizeField("java",
"PSYoungGen total 2572800K, used 1759355K [0x0000000759500000, 0x0000000800000000, 0x0000000800000000)"));
assertEquals(Arrays.asList("client", "request", "body", "is", "buffered", "to", "temporary", "file", "tmp", "client-body",
"client", "server", "apm.35.205.226.121.ip.es.io", "request", "POST", "intake", "v2", "events", "HTTP", "host",
"apm.35.205.226.121.ip.es.io", "POST", "intake", "v2", "events", "HTTP", "elasticapm-dotnet", "System.Net.Http", "NET_Core",
"default-apm-apm-server-8200", "POST", "intake", "v2", "events", "HTTP", "elasticapm-python", "default-apm-apm-server-8200",
"OPTIONS", "intake", "v2", "rum", "events", "HTTP", "http", "opbeans-frontend", "dashboard", "Mozilla", "X11", "Linux",
"x86_64", "AppleWebKit", "KHTML", "like", "Gecko", "Cypress", "Chrome", "Electron", "Safari", "default-apm-apm-server-8200",
"POST", "intake", "v2", "rum", "events", "HTTP", "http", "opbeans-frontend", "dashboard", "Mozilla", "X11", "Linux",
"x86_64", "AppleWebKit", "KHTML", "like", "Gecko", "Cypress", "Chrome", "Electron", "Safari", "default-apm-apm-server-8200",
"POST", "intake", "v2", "events", "HTTP", "elasticapm-node", "elastic-apm-http-client", "node",
"default-apm-apm-server-8200", "POST", "intake", "v2", "events", "HTTP", "elasticapm-python",
"default-apm-apm-server-8200"),
categorizationAnalyzer.tokenizeField("nginx_error", NGINX_ERROR_EXAMPLE));
}
}
@ -251,6 +287,51 @@ public class CategorizationAnalyzerTests extends ESTestCase {
}
}
public void testMlStandardCategorizationAnalyzer() throws IOException {
CategorizationAnalyzerConfig standardConfig = CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(null);
try (CategorizationAnalyzer categorizationAnalyzer = new CategorizationAnalyzer(analysisRegistry, standardConfig)) {
assertEquals(Arrays.asList("ml13-4608.1.p2ps", "Info", "Source", "ML_SERVICE2", "on", "has", "shut", "down"),
categorizationAnalyzer.tokenizeField("p2ps",
"<ml13-4608.1.p2ps: Info: > Source ML_SERVICE2 on 13122:867 has shut down."));
assertEquals(Arrays.asList("Vpxa", "verbose", "VpxaHalCnxHostagent", "opID", "WFU-ddeadb59", "WaitForUpdatesDone", "Received",
"callback"),
categorizationAnalyzer.tokenizeField("vmware",
"Vpxa: [49EC0B90 verbose 'VpxaHalCnxHostagent' opID=WFU-ddeadb59] [WaitForUpdatesDone] Received callback"));
assertEquals(Arrays.asList("org.apache.coyote.http11.Http11BaseProtocol", "destroy"),
categorizationAnalyzer.tokenizeField("apache",
"org.apache.coyote.http11.Http11BaseProtocol destroy"));
assertEquals(Arrays.asList("INFO", "session", "PROXY", "Session", "DESTROYED"),
categorizationAnalyzer.tokenizeField("proxy",
" [1111529792] INFO session <45409105041220090733@192.168.251.123> - " +
"----------------- PROXY Session DESTROYED --------------------"));
assertEquals(Arrays.asList("PSYoungGen", "total", "used"),
categorizationAnalyzer.tokenizeField("java",
"PSYoungGen total 2572800K, used 1759355K [0x0000000759500000, 0x0000000800000000, 0x0000000800000000)"));
assertEquals(Arrays.asList("first", "line"),
categorizationAnalyzer.tokenizeField("multiline", "first line\nsecond line\nthird line"));
assertEquals(Arrays.asList("first", "line"),
categorizationAnalyzer.tokenizeField("windows_multiline", "first line\r\nsecond line\r\nthird line"));
assertEquals(Arrays.asList("second", "line"),
categorizationAnalyzer.tokenizeField("multiline_first_blank", "\nsecond line\nthird line"));
assertEquals(Arrays.asList("second", "line"),
categorizationAnalyzer.tokenizeField("windows_multiline_first_blank", "\r\nsecond line\r\nthird line"));
assertEquals(Arrays.asList("client", "request", "body", "is", "buffered", "to", "temporary", "file",
"/tmp/client-body/0000021894", "client", "server", "apm.35.205.226.121.ip.es.io", "request", "POST", "/intake/v2/events",
"HTTP/1.1", "host", "apm.35.205.226.121.ip.es.io"),
categorizationAnalyzer.tokenizeField("nginx_error", NGINX_ERROR_EXAMPLE));
}
}
// The Elasticsearch standard analyzer - this is the default for indexing in Elasticsearch, but
// NOT for ML categorization (and you'll see why if you look at the expected results of this test!)
public void testStandardAnalyzer() throws IOException {


@ -0,0 +1,128 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/
package org.elasticsearch.xpack.ml.job.categorization;
import org.elasticsearch.test.ESTestCase;
import java.io.IOException;
import java.io.StringReader;
import static org.hamcrest.Matchers.equalTo;
public class FirstNonBlankLineCharFilterTests extends ESTestCase {
public void testEmpty() throws IOException {
String input = "";
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
assertThat(filter.read(), equalTo(-1));
}
public void testAllBlankOneLine() throws IOException {
String input = "\t";
if (randomBoolean()) {
input = " " + input;
}
if (randomBoolean()) {
input = input + " ";
}
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
assertThat(filter.read(), equalTo(-1));
}
public void testNonBlankNoNewlines() throws IOException {
String input = "the quick brown fox jumped over the lazy dog";
if (randomBoolean()) {
input = " " + input;
}
if (randomBoolean()) {
input = input + " ";
}
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
char[] output = new char[input.length()];
assertThat(filter.read(output, 0, output.length), equalTo(input.length()));
assertThat(filter.read(), equalTo(-1));
assertThat(new String(output), equalTo(input));
}
public void testAllBlankMultiline() throws IOException {
StringBuilder input = new StringBuilder();
String lineEnding = randomBoolean() ? "\n" : "\r\n";
for (int lineNum = randomIntBetween(2, 5); lineNum > 0; --lineNum) {
for (int charNum = randomIntBetween(0, 5); charNum > 0; --charNum) {
input.append(randomBoolean() ? " " : "\t");
}
if (lineNum > 1 || randomBoolean()) {
input.append(lineEnding);
}
}
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input.toString()));
assertThat(filter.read(), equalTo(-1));
}
public void testNonBlankMultiline() throws IOException {
StringBuilder input = new StringBuilder();
String lineEnding = randomBoolean() ? "\n" : "\r\n";
for (int lineBeforeNum = randomIntBetween(2, 5); lineBeforeNum > 0; --lineBeforeNum) {
for (int charNum = randomIntBetween(0, 5); charNum > 0; --charNum) {
input.append(randomBoolean() ? " " : "\t");
}
input.append(lineEnding);
}
String lineToKeep = "the quick brown fox jumped over the lazy dog";
if (randomBoolean()) {
lineToKeep = " " + lineToKeep;
}
if (randomBoolean()) {
lineToKeep = lineToKeep + " ";
}
input.append(lineToKeep).append(lineEnding);
for (int lineAfterNum = randomIntBetween(2, 5); lineAfterNum > 0; --lineAfterNum) {
for (int charNum = randomIntBetween(0, 5); charNum > 0; --charNum) {
input.append(randomBoolean() ? " " : "more");
}
if (lineAfterNum > 1 || randomBoolean()) {
input.append(lineEnding);
}
}
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input.toString()));
char[] output = new char[lineToKeep.length()];
assertThat(filter.read(output, 0, output.length), equalTo(lineToKeep.length()));
assertThat(filter.read(), equalTo(-1));
assertThat(new String(output), equalTo(lineToKeep));
}
public void testCorrect() throws IOException {
String input = " \nfirst line\nsecond line";
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
String expectedOutput = "first line";
char[] output = new char[expectedOutput.length()];
assertThat(filter.read(output, 0, output.length), equalTo(expectedOutput.length()));
assertThat(filter.read(), equalTo(-1));
assertThat(new String(output), equalTo(expectedOutput));
int expectedOutputIndex = input.indexOf(expectedOutput);
for (int i = 0; i <= expectedOutput.length(); ++i) {
assertThat(filter.correctOffset(i), equalTo(expectedOutputIndex + i));
}
}
}


@ -341,6 +341,12 @@
}
}
- match: { job_id: "jobs-crud-update-job" }
- length: { analysis_config.categorization_analyzer.filter: 1 }
- match: { analysis_config.categorization_analyzer.tokenizer: "ml_standard" }
- length: { analysis_config.categorization_analyzer.char_filter: 3 }
- match: { analysis_config.categorization_analyzer.char_filter.0: "first_non_blank_line" }
- match: { analysis_config.categorization_analyzer.char_filter.1.pattern: "cat1.*" }
- match: { analysis_config.categorization_analyzer.char_filter.2.pattern: "cat2.*" }
- do:
ml.open_job:
@ -381,7 +387,6 @@
"background_persist_interval": "3h",
"model_snapshot_retention_days": 30,
"results_retention_days": 40,
"categorization_filters" : ["cat3.*"],
"custom_settings": {
"setting3": "custom3"
}
@ -392,7 +397,6 @@
- match: { model_plot_config.enabled: false }
- match: { model_plot_config.terms: "foobar" }
- match: { model_plot_config.annotations_enabled: false }
- match: { analysis_config.categorization_filters: ["cat3.*"] }
- match: { analysis_config.detectors.0.custom_rules.0.actions: ["skip_result"] }
- length: { analysis_config.detectors.0.custom_rules.0.conditions: 1 }
- match: { analysis_config.detectors.0.detector_index: 0 }


@ -10,7 +10,7 @@ teardown:
"Test ml info":
- do:
ml.info: {}
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" }
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
- match: { defaults.anomaly_detectors.model_memory_limit: "1gb" }
- match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
- match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@ -30,7 +30,7 @@ teardown:
- do:
ml.info: {}
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" }
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
- match: { defaults.anomaly_detectors.model_memory_limit: "512mb" }
- match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
- match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@ -50,7 +50,7 @@ teardown:
- do:
ml.info: {}
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" }
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
- match: { defaults.anomaly_detectors.model_memory_limit: "1gb" }
- match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
- match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@ -70,7 +70,7 @@ teardown:
- do:
ml.info: {}
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" }
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
- match: { defaults.anomaly_detectors.model_memory_limit: "1gb" }
- match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
- match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@ -90,7 +90,7 @@ teardown:
- do:
ml.info: {}
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" }
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
- match: { defaults.anomaly_detectors.model_memory_limit: "1mb" }
- match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
- match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }


@ -0,0 +1,115 @@
---
"Test analyze API with the standard 7.14 ML analyzer":
- do:
indices.analyze:
body: >
{
"char_filter" : [
"first_non_blank_line"
],
"tokenizer" : "ml_standard",
"filter" : [
{ "type" : "stop", "stopwords": [
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"
] }
],
"text" : "[elasticsearch] [2017-12-13T10:46:30,816][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.watcher-history-7-2017.12.13] creating index, cause [auto(bulk api)], templates [.watch-history-7], shards [1]/[1], mappings [doc]"
}
- match: { tokens.0.token: "elasticsearch" }
- match: { tokens.0.start_offset: 1 }
- match: { tokens.0.end_offset: 14 }
- match: { tokens.0.position: 0 }
- match: { tokens.1.token: "INFO" }
- match: { tokens.1.start_offset: 42 }
- match: { tokens.1.end_offset: 46 }
- match: { tokens.1.position: 5 }
- match: { tokens.2.token: "o.e.c.m.MetadataCreateIndexService" }
- match: { tokens.2.start_offset: 49 }
- match: { tokens.2.end_offset: 83 }
- match: { tokens.2.position: 6 }
- match: { tokens.3.token: "node-0" }
- match: { tokens.3.start_offset: 86 }
- match: { tokens.3.end_offset: 92 }
- match: { tokens.3.position: 7 }
- match: { tokens.4.token: "watcher-history-7-2017.12.13" }
- match: { tokens.4.start_offset: 96 }
- match: { tokens.4.end_offset: 124 }
- match: { tokens.4.position: 8 }
- match: { tokens.5.token: "creating" }
- match: { tokens.5.start_offset: 126 }
- match: { tokens.5.end_offset: 134 }
- match: { tokens.5.position: 9 }
- match: { tokens.6.token: "index" }
- match: { tokens.6.start_offset: 135 }
- match: { tokens.6.end_offset: 140 }
- match: { tokens.6.position: 10 }
- match: { tokens.7.token: "cause" }
- match: { tokens.7.start_offset: 142 }
- match: { tokens.7.end_offset: 147 }
- match: { tokens.7.position: 11 }
- match: { tokens.8.token: "auto" }
- match: { tokens.8.start_offset: 149 }
- match: { tokens.8.end_offset: 153 }
- match: { tokens.8.position: 12 }
- match: { tokens.9.token: "bulk" }
- match: { tokens.9.start_offset: 154 }
- match: { tokens.9.end_offset: 158 }
- match: { tokens.9.position: 13 }
- match: { tokens.10.token: "api" }
- match: { tokens.10.start_offset: 159 }
- match: { tokens.10.end_offset: 162 }
- match: { tokens.10.position: 14 }
- match: { tokens.11.token: "templates" }
- match: { tokens.11.start_offset: 166 }
- match: { tokens.11.end_offset: 175 }
- match: { tokens.11.position: 15 }
- match: { tokens.12.token: "watch-history-7" }
- match: { tokens.12.start_offset: 178 }
- match: { tokens.12.end_offset: 193 }
- match: { tokens.12.position: 16 }
- match: { tokens.13.token: "shards" }
- match: { tokens.13.start_offset: 196 }
- match: { tokens.13.end_offset: 202 }
- match: { tokens.13.position: 17 }
- match: { tokens.14.token: "mappings" }
- match: { tokens.14.start_offset: 212 }
- match: { tokens.14.end_offset: 220 }
- match: { tokens.14.position: 21 }
- match: { tokens.15.token: "doc" }
- match: { tokens.15.start_offset: 222 }
- match: { tokens.15.end_offset: 225 }
- match: { tokens.15.position: 22 }
---
"Test 7.14 analyzer with blank lines":
- do:
indices.analyze:
body: >
{
"char_filter" : [
"first_non_blank_line"
],
"tokenizer" : "ml_standard",
"filter" : [
{ "type" : "stop", "stopwords": [
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"
] }
],
"text" : " \nfirst line\nsecond line"
}
- match: { tokens.0.token: "first" }
- match: { tokens.0.start_offset: 4 }
- match: { tokens.0.end_offset: 9 }
- match: { tokens.0.position: 0 }
- match: { tokens.1.token: "line" }
- match: { tokens.1.start_offset: 10 }
- match: { tokens.1.end_offset: 14 }
- match: { tokens.1.position: 1 }