[ML] Make ml_standard tokenizer the default for new categorization jobs (#72805)

Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, and so creates
multiple tokens from URLs and filesystem paths, whereas ml_standard
attempts to keep URLs, email addresses and filesystem paths as single tokens.
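
For illustration, the two tokenizers can be compared with the _analyze
API (a sketch; both tokenizers are registered by the ML plugin, so this
assumes a cluster with ML available, and the sample text is illustrative):

POST _analyze
{
  "tokenizer" : "ml_standard",
  "text" : "POST /intake/v2/events HTTP/1.1"
}

Going by the expectations in the new CategorizationAnalyzerTests,
ml_standard keeps /intake/v2/events and HTTP/1.1 as single tokens,
whereas the same text analyzed with "tokenizer" : "ml_classic" comes
back as the separate tokens POST, intake, v2, events and HTTP.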

It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config, and whichever tokenizer you choose (ml_classic or any
other Elasticsearch tokenizer) will be used.

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.
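
As a sketch, a job like the following (the job ID and detector are purely
illustrative) keeps the pre-7.14 behaviour: its explicit
categorization_analyzer uses ml_classic, omits first_non_blank_line, and
repeats the day/month stop filter from the old default analyzer:

PUT _ml/anomaly_detectors/classic_categorization_example
{
  "analysis_config" : {
    "bucket_span" : "15m",
    "detectors" : [{ "function" : "count", "by_field_name" : "mlcategory" }],
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords" : [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August",
          "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ]}
      ]
    }
  },
  "data_description" : { "time_field" : "time" }
}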

If no categorization_analyzer is specified but categorization_filters
are specified, then the categorization filters are converted to char
filters that are applied after first_non_blank_line.
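
For example, creating a new job with categorization_filters of
["cat1.*", "cat2.*"] (the values exercised by the updated jobs_crud YAML
test) stores a categorization_analyzer equivalent to the sketch below,
and the categorization_filters setting itself is no longer stored
separately:

"categorization_analyzer" : {
  "char_filter" : [
    "first_non_blank_line",
    { "type" : "pattern_replace", "pattern" : "cat1.*" },
    { "type" : "pattern_replace", "pattern" : "cat2.*" }
  ],
  "tokenizer" : "ml_standard",
  "filter" : [
    { "type" : "stop", "stopwords" : [
      "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
      "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
      "January", "February", "March", "April", "May", "June", "July", "August",
      "September", "October", "November", "December",
      "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
      "GMT", "UTC"
    ]}
  ]
}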

Closes elastic/ml-cpp#1724
David Roberts, 2021-06-01 15:11:32 +01:00 (committed by GitHub)
commit 0059c59e25, parent 88dfe1aebf
22 changed files with 688 additions and 96 deletions


@@ -588,14 +588,13 @@ public class MlClientDocumentationIT extends ESRestHighLevelClientTestCase {
         .setDescription("My description") // <2>
         .setAnalysisLimits(new AnalysisLimits(1000L, null)) // <3>
         .setBackgroundPersistInterval(TimeValue.timeValueHours(3)) // <4>
-        .setCategorizationFilters(Arrays.asList("categorization-filter")) // <5>
-        .setDetectorUpdates(Arrays.asList(detectorUpdate)) // <6>
-        .setGroups(Arrays.asList("job-group-1")) // <7>
-        .setResultsRetentionDays(10L) // <8>
-        .setModelPlotConfig(new ModelPlotConfig(true, null, true)) // <9>
-        .setModelSnapshotRetentionDays(7L) // <10>
-        .setCustomSettings(customSettings) // <11>
-        .setRenormalizationWindowDays(3L) // <12>
+        .setDetectorUpdates(Arrays.asList(detectorUpdate)) // <5>
+        .setGroups(Arrays.asList("job-group-1")) // <6>
+        .setResultsRetentionDays(10L) // <7>
+        .setModelPlotConfig(new ModelPlotConfig(true, null, true)) // <8>
+        .setModelSnapshotRetentionDays(7L) // <9>
+        .setCustomSettings(customSettings) // <10>
+        .setRenormalizationWindowDays(3L) // <11>
         .build();
     // end::update-job-options


@@ -35,14 +35,13 @@ include-tagged::{doc-tests-file}[{api}-options]
 <2> Updated description.
 <3> Updated analysis limits.
 <4> Updated background persistence interval.
-<5> Updated analysis config's categorization filters.
-<6> Updated detectors through the `JobUpdate.DetectorUpdate` object.
-<7> Updated group membership.
-<8> Updated result retention.
-<9> Updated model plot configuration.
-<10> Updated model snapshot retention setting.
-<11> Updated custom settings.
-<12> Updated renormalization window.
+<5> Updated detectors through the `JobUpdate.DetectorUpdate` object.
+<6> Updated group membership.
+<7> Updated result retention.
+<8> Updated model plot configuration.
+<9> Updated model snapshot retention setting.
+<10> Updated custom settings.
+<11> Updated renormalization window.
 
 Included with these options are specific optional `JobUpdate.DetectorUpdate` updates.
 ["source","java",subs="attributes,callouts,macros"]


@@ -49,7 +49,10 @@ This is a possible response:
   "defaults" : {
     "anomaly_detectors" : {
       "categorization_analyzer" : {
-        "tokenizer" : "ml_classic",
+        "char_filter" : [
+          "first_non_blank_line"
+        ],
+        "tokenizer" : "ml_standard",
         "filter" : [
           {
             "type" : "stop",


@@ -21,8 +21,8 @@ of possible messages:
 Categorization is tuned to work best on data like log messages by taking token
 order into account, including stop words, and not considering synonyms in its
 analysis. Complete sentences in human communication or literary text (for
 example email, wiki pages, prose, or other human-generated content) can be
 extremely diverse in structure. Since categorization is tuned for machine data,
 it gives poor results for human-generated data. It would create so many
 categories that they couldn't be handled effectively. Categorization is _not_
 natural language processing (NLP).
@@ -32,7 +32,7 @@ volume and pattern is normal for each category over time. You can then detect
 anomalies and surface rare events or unusual types of messages by using
 <<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.
 
 In {kib}, there is a categorization wizard to help you create this type of
 {anomaly-job}. For example, the following job generates categories from the
 contents of the `message` field and uses the count function to determine when
 certain categories are occurring at anomalous rates:
@@ -69,7 +69,7 @@ do not specify this keyword in one of those properties, the API request fails.
 ====
 
 You can use the **Anomaly Explorer** in {kib} to view the analysis results:
 
 [role="screenshot"]
 image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]
@@ -105,7 +105,7 @@ SQL statement from the categorization algorithm.
 If you enable per-partition categorization, categories are determined
 independently for each partition. For example, if your data includes messages
 from multiple types of logs from different applications, you can use a field
 like the ECS {ecs-ref}/ecs-event.html[`event.dataset` field] as the
 `partition_field_name` and categorize the messages for each type of log
 separately.
@@ -116,7 +116,7 @@ create or update a job and enable per-partition categorization, it fails.
 When per-partition categorization is enabled, you can also take advantage of a
 `stop_on_warn` configuration option. If the categorization status for a
 partition changes to `warn`, it doesn't categorize well and can cause a lot of
 unnecessary resource usage. When you set `stop_on_warn` to `true`, the job stops
 analyzing these problematic partitions. You can thus avoid an ongoing
 performance cost for partitions that are unsuitable for categorization.
@@ -128,7 +128,7 @@ performance cost for partitions that are unsuitable for categorization.
 Categorization uses English dictionary words to identify log message categories.
 By default, it also uses English tokenization rules. For this reason, if you use
 the default categorization analyzer, only English language log messages are
 supported, as described in the <<ml-limitations>>.
 
 If you use the categorization wizard in {kib}, you can see which categorization
 analyzer it uses and highlighted examples of the tokens that it identifies. You
@@ -140,7 +140,7 @@ image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in K
 The categorization analyzer can refer to a built-in {es} analyzer or a
 combination of zero or more character filters, a tokenizer, and zero or more
 token filters. In this example, adding a
 {ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
 achieves exactly the same behavior as the `categorization_filters` job
 configuration option described earlier. For more details about these properties,
@@ -157,7 +157,10 @@ POST _ml/anomaly_detectors/_validate
 {
   "analysis_config" : {
     "categorization_analyzer" : {
-      "tokenizer" : "ml_classic",
+      "char_filter" : [
+        "first_non_blank_line"
+      ],
+      "tokenizer" : "ml_standard",
       "filter" : [
         { "type" : "stop", "stopwords": [
           "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@@ -182,8 +185,8 @@ POST _ml/anomaly_detectors/_validate
 If you specify any part of the `categorization_analyzer`, however, any omitted
 sub-properties are _not_ set to default values.
 
-The `ml_classic` tokenizer and the day and month stopword filter are more or
+The `ml_standard` tokenizer and the day and month stopword filter are more or
 less equivalent to the following analyzer, which is defined using only built-in
 {es} {ref}/analysis-tokenizers.html[tokenizers] and
 {ref}/analysis-tokenfilters.html[token filters]:
@@ -201,15 +204,18 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
       "detector_description": "Unusual message counts"
     }],
     "categorization_analyzer":{
+      "char_filter" : [
+        "first_non_blank_line" <1>
+      ],
       "tokenizer": {
         "type" : "simple_pattern_split",
-        "pattern" : "[^-0-9A-Za-z_.]+" <1>
+        "pattern" : "[^-0-9A-Za-z_./]+" <2>
       },
       "filter": [
-        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
-        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
-        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
-        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
+        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <3>
+        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <4>
+        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <5>
+        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <6>
         { "type" : "stop", "stopwords": [
           "",
           "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@@ -232,17 +238,20 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
 ----------------------------------
 // TEST[skip:needs-licence]
 
-<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
-<2> By default, categorization ignores tokens that begin with a digit.
-<3> By default, categorization also ignores tokens that are hexadecimal numbers.
-<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
-<5> Underscores, hyphens, and dots are also removed from the end of tokens.
+<1> Only consider the first non-blank line of the message for categorization purposes.
+<2> Tokens basically consist of hyphens, digits, letters, underscores, dots and slashes.
+<3> By default, categorization ignores tokens that begin with a digit.
+<4> By default, categorization also ignores tokens that are hexadecimal numbers.
+<5> Underscores, hyphens, and dots are removed from the beginning of tokens.
+<6> Underscores, hyphens, and dots are also removed from the end of tokens.
 
 The key difference between the default `categorization_analyzer` and this
-example analyzer is that using the `ml_classic` tokenizer is several times
-faster. The difference in behavior is that this custom analyzer does not include
-accented letters in tokens whereas the `ml_classic` tokenizer does, although
-that could be fixed by using more complex regular expressions.
+example analyzer is that using the `ml_standard` tokenizer is several times
+faster. The `ml_standard` tokenizer also tries to preserve URLs, Windows paths
+and email addresses as single tokens. Another difference in behavior is that
+this custom analyzer does not include accented letters in tokens whereas the
+`ml_standard` tokenizer does, although that could be fixed by using more complex
+regular expressions.
 
 If you are categorizing non-English messages in a language where words are
 separated by spaces, you might get better results if you change the day or month


@@ -1592,11 +1592,17 @@ end::timestamp-results[]
 tag::tokenizer[]
 The name or definition of the <<analysis-tokenizers,tokenizer>> to use after
 character filters are applied. This property is compulsory if
-`categorization_analyzer` is specified as an object. Machine learning provides a
-tokenizer called `ml_classic` that tokenizes in the same way as the
-non-customizable tokenizer in older versions of the product. If you want to use
-that tokenizer but change the character or token filters, specify
-`"tokenizer": "ml_classic"` in your `categorization_analyzer`.
+`categorization_analyzer` is specified as an object. Machine learning provides
+a tokenizer called `ml_standard` that tokenizes in a way that has been
+determined to produce good categorization results on a variety of log
+file formats for logs in English. If you want to use that tokenizer but
+change the character or token filters, specify `"tokenizer": "ml_standard"`
+in your `categorization_analyzer`. Additionally, the `ml_classic` tokenizer
+is available, which tokenizes in the same way as the non-customizable
+tokenizer in old versions of the product (before 6.2). `ml_classic` was
+the default categorization tokenizer in versions 6.2 to 7.13, so if you
+need categorization identical to the default for jobs created in these
+versions, specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.
 end::tokenizer[]
 
 tag::total-by-field-count[]


@@ -117,9 +117,12 @@ tasks.named("yamlRestCompatTest").configure {
     'ml/datafeeds_crud/Test update datafeed to point to job already attached to another datafeed',
     'ml/datafeeds_crud/Test update datafeed to point to missing job',
     'ml/job_cat_apis/Test cat anomaly detector jobs',
+    'ml/jobs_crud/Test update job',
     'ml/jobs_get_stats/Test get job stats after uploading data prompting the creation of some stats',
     'ml/jobs_get_stats/Test get job stats for closed job',
     'ml/jobs_get_stats/Test no exception on get job stats with missing index',
+    // TODO: the ml_info mute can be removed from master once the ml_standard tokenizer is in 7.x
+    'ml/ml_info/Test ml info',
     'ml/post_data/Test POST data job api, flush, close and verify DataCounts doc',
     'ml/post_data/Test flush with skip_time',
     'ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled',


@@ -145,38 +145,39 @@ public class CategorizationAnalyzerConfig implements ToXContentFragment, Writeab
     }
 
     /**
-     * Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the ML C++
-     * code do. This is the default analyzer for categorization to ensure that people upgrading from previous versions
+     * Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the original ML
+     * C++ code do. This is the default analyzer for categorization to ensure that people upgrading from old versions
      * get the same behaviour from their categorization jobs before and after upgrade.
      * @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>.
     * @return The default categorization analyzer.
     */
     public static CategorizationAnalyzerConfig buildDefaultCategorizationAnalyzer(List<String> categorizationFilters) {
-
-        CategorizationAnalyzerConfig.Builder builder = new CategorizationAnalyzerConfig.Builder();
-
-        if (categorizationFilters != null) {
-            for (String categorizationFilter : categorizationFilters) {
-                Map<String, Object> charFilter = new HashMap<>();
-                charFilter.put("type", "pattern_replace");
-                charFilter.put("pattern", categorizationFilter);
-                builder.addCharFilter(charFilter);
-            }
-        }
-
-        builder.setTokenizer("ml_classic");
-
-        Map<String, Object> tokenFilter = new HashMap<>();
-        tokenFilter.put("type", "stop");
-        tokenFilter.put("stopwords", Arrays.asList(
-            "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
-            "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
-            "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
-            "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
-            "GMT", "UTC"));
-        builder.addTokenFilter(tokenFilter);
-
-        return builder.build();
+        return new CategorizationAnalyzerConfig.Builder()
+            .addCategorizationFilters(categorizationFilters)
+            .setTokenizer("ml_classic")
+            .addDateWordsTokenFilter()
+            .build();
+    }
+
+    /**
+     * Create a <code>categorization_analyzer</code> that will be used for newly created jobs where no categorization
+     * analyzer is explicitly provided. This analyzer differs from the default one in that it uses the <code>ml_standard</code>
+     * tokenizer instead of the <code>ml_classic</code> tokenizer, and it only considers the first non-blank line of each message.
+     * This analyzer is <em>not</em> used for jobs that specify no categorization analyzer, as that would break jobs that were
+     * originally run in older versions. Instead, this analyzer is explicitly added to newly created jobs once the entire cluster
+     * is upgraded to version 7.14 or above.
+     * @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>.
+     * @return The standard categorization analyzer.
+     */
+    public static CategorizationAnalyzerConfig buildStandardCategorizationAnalyzer(List<String> categorizationFilters) {
+        return new CategorizationAnalyzerConfig.Builder()
+            .addCharFilter("first_non_blank_line")
+            .addCategorizationFilters(categorizationFilters)
+            .setTokenizer("ml_standard")
+            .addDateWordsTokenFilter()
+            .build();
     }
 
     private final String analyzer;
@@ -311,6 +312,18 @@ public class CategorizationAnalyzerConfig implements ToXContentFragment, Writeab
         return this;
     }
 
+    public Builder addCategorizationFilters(List<String> categorizationFilters) {
+        if (categorizationFilters != null) {
+            for (String categorizationFilter : categorizationFilters) {
+                Map<String, Object> charFilter = new HashMap<>();
+                charFilter.put("type", "pattern_replace");
+                charFilter.put("pattern", categorizationFilter);
+                addCharFilter(charFilter);
+            }
+        }
+        return this;
+    }
+
     public Builder setTokenizer(String tokenizer) {
         this.tokenizer = new NameOrDefinition(tokenizer);
         return this;
@@ -331,6 +344,19 @@ public class CategorizationAnalyzerConfig implements ToXContentFragment, Writeab
         return this;
     }
 
+    Builder addDateWordsTokenFilter() {
+        Map<String, Object> tokenFilter = new HashMap<>();
+        tokenFilter.put("type", "stop");
+        tokenFilter.put("stopwords", Arrays.asList(
+            "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
+            "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
+            "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
+            "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
+            "GMT", "UTC"));
+        addTokenFilter(tokenFilter);
+        return this;
+    }
+
     /**
      * Create a config validating only structure, not exact analyzer/tokenizer/filter names
      */


@@ -17,9 +17,11 @@ restResources {
 tasks.named("yamlRestTest").configure {
   systemProperty 'tests.rest.blacklist', [
-    // Remove this test because it doesn't call an ML endpoint and we don't want
+    // Remove these tests because they don't call an ML endpoint and we don't want
     // to grant extra permissions to the users used in this test suite
     'ml/ml_classic_analyze/Test analyze API with an analyzer that does what we used to do in native code',
+    'ml/ml_standard_analyze/Test analyze API with the standard 7.14 ML analyzer',
+    'ml/ml_standard_analyze/Test 7.14 analyzer with blank lines',
     // Remove tests that are expected to throw an exception, because we cannot then
     // know whether to expect an authorization exception or a validation exception
     'ml/calendar_crud/Test get calendar given missing',


@@ -45,6 +45,7 @@ import org.elasticsearch.common.util.concurrent.EsExecutors;
 import org.elasticsearch.common.xcontent.NamedXContentRegistry;
 import org.elasticsearch.env.Environment;
 import org.elasticsearch.env.NodeEnvironment;
+import org.elasticsearch.index.analysis.CharFilterFactory;
 import org.elasticsearch.index.analysis.TokenizerFactory;
 import org.elasticsearch.indices.SystemIndexDescriptor;
 import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
@@ -264,6 +265,8 @@ import org.elasticsearch.xpack.ml.inference.persistence.TrainedModelProvider;
 import org.elasticsearch.xpack.ml.job.JobManager;
 import org.elasticsearch.xpack.ml.job.JobManagerHolder;
 import org.elasticsearch.xpack.ml.job.UpdateJobProcessNotifier;
+import org.elasticsearch.xpack.ml.job.categorization.FirstNonBlankLineCharFilter;
+import org.elasticsearch.xpack.ml.job.categorization.FirstNonBlankLineCharFilterFactory;
 import org.elasticsearch.xpack.ml.job.categorization.MlClassicTokenizer;
 import org.elasticsearch.xpack.ml.job.categorization.MlClassicTokenizerFactory;
 import org.elasticsearch.xpack.ml.job.categorization.MlStandardTokenizer;
@@ -1076,6 +1079,10 @@ public class MachineLearning extends Plugin implements SystemIndexPlugin,
         return Arrays.asList(jobComms, utility, datafeed);
     }
 
+    public Map<String, AnalysisProvider<CharFilterFactory>> getCharFilters() {
+        return Collections.singletonMap(FirstNonBlankLineCharFilter.NAME, FirstNonBlankLineCharFilterFactory::new);
+    }
+
     @Override
     public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
         return Map.of(MlClassicTokenizer.NAME, MlClassicTokenizerFactory::new,


@@ -98,7 +98,7 @@ public class TransportMlInfoAction extends HandledTransportAction<MlInfoAction.R
             Job.DEFAULT_DAILY_MODEL_SNAPSHOT_RETENTION_AFTER_DAYS);
         try {
             defaults.put(CategorizationAnalyzerConfig.CATEGORIZATION_ANALYZER.getPreferredName(),
-                CategorizationAnalyzerConfig.buildDefaultCategorizationAnalyzer(Collections.emptyList())
+                CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(Collections.emptyList())
                     .asMap(xContentRegistry).get(CategorizationAnalyzerConfig.CATEGORIZATION_ANALYZER.getPreferredName()));
         } catch (IOException e) {
             logger.error("failed to convert default categorization analyzer to map", e);


@@ -10,6 +10,7 @@ import org.apache.logging.log4j.LogManager;
 import org.apache.logging.log4j.Logger;
 import org.elasticsearch.ResourceAlreadyExistsException;
 import org.elasticsearch.ResourceNotFoundException;
+import org.elasticsearch.Version;
 import org.elasticsearch.action.ActionListener;
 import org.elasticsearch.action.index.IndexResponse;
 import org.elasticsearch.action.support.WriteRequest;
@@ -39,6 +40,7 @@ import org.elasticsearch.xpack.core.ml.MlTasks;
 import org.elasticsearch.xpack.core.ml.action.PutJobAction;
 import org.elasticsearch.xpack.core.ml.action.RevertModelSnapshotAction;
 import org.elasticsearch.xpack.core.ml.action.UpdateJobAction;
+import org.elasticsearch.xpack.core.ml.job.config.AnalysisConfig;
 import org.elasticsearch.xpack.core.ml.job.config.AnalysisLimits;
 import org.elasticsearch.xpack.core.ml.job.config.CategorizationAnalyzerConfig;
 import org.elasticsearch.xpack.core.ml.job.config.DataDescription;
@@ -85,6 +87,8 @@ import java.util.regex.Pattern;
  */
 public class JobManager {
 
+    private static final Version MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER = Version.V_7_14_0;
+
     private static final Logger logger = LogManager.getLogger(JobManager.class);
     private static final DeprecationLogger deprecationLogger = DeprecationLogger.getLogger(JobManager.class);
@@ -220,17 +224,31 @@ public class JobManager {
     /**
      * Validate the char filter/tokenizer/token filter names used in the categorization analyzer config (if any).
-     * This validation has to be done server-side; it cannot be done in a client as that won't have loaded the
-     * appropriate analysis modules/plugins.
-     * The overall structure can be validated at parse time, but the exact names need to be checked separately,
-     * as plugins that provide the functionality can be installed/uninstalled.
+     * If the user has not provided a categorization analyzer then set the standard one if categorization is
+     * being used at all and all the nodes in the cluster are running a version that will understand it. This
+     * method must only be called when a job is first created - since it applies a default if it were to be
+     * called after that it could change the meaning of a job that has already run. The validation in this
+     * method has to be done server-side; it cannot be done in a client as that won't have loaded the appropriate
+     * analysis modules/plugins. (The overall structure can be validated at parse time, but the exact names need
+     * to be checked separately, as plugins that provide the functionality can be installed/uninstalled.)
      */
-    static void validateCategorizationAnalyzer(Job.Builder jobBuilder, AnalysisRegistry analysisRegistry)
-        throws IOException {
-        CategorizationAnalyzerConfig categorizationAnalyzerConfig = jobBuilder.getAnalysisConfig().getCategorizationAnalyzerConfig();
+    static void validateCategorizationAnalyzerOrSetDefault(Job.Builder jobBuilder, AnalysisRegistry analysisRegistry,
+                                                           Version minNodeVersion) throws IOException {
+        AnalysisConfig analysisConfig = jobBuilder.getAnalysisConfig();
+        CategorizationAnalyzerConfig categorizationAnalyzerConfig = analysisConfig.getCategorizationAnalyzerConfig();
         if (categorizationAnalyzerConfig != null) {
             CategorizationAnalyzer.verifyConfigBuilder(new CategorizationAnalyzerConfig.Builder(categorizationAnalyzerConfig),
                 analysisRegistry);
+        } else if (analysisConfig.getCategorizationFieldName() != null
+            && minNodeVersion.onOrAfter(MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER)) {
+            // Any supplied categorization filters are transferred into the new categorization analyzer.
+            // The user supplied categorization filters will already have been validated when the put job
+            // request was built, so we know they're valid.
+            AnalysisConfig.Builder analysisConfigBuilder = new AnalysisConfig.Builder(analysisConfig)
+                .setCategorizationAnalyzerConfig(
+                    CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(analysisConfig.getCategorizationFilters()))
+                .setCategorizationFilters(null);
+            jobBuilder.setAnalysisConfig(analysisConfigBuilder);
         }
     }
@@ -240,10 +258,12 @@ public class JobManager {
     public void putJob(PutJobAction.Request request, AnalysisRegistry analysisRegistry, ClusterState state,
                        ActionListener<PutJobAction.Response> actionListener) throws IOException {
 
+        Version minNodeVersion = state.getNodes().getMinNodeVersion();
+
         Job.Builder jobBuilder = request.getJobBuilder();
         jobBuilder.validateAnalysisLimitsAndSetDefaults(maxModelMemoryLimit);
         jobBuilder.validateModelSnapshotRetentionSettingsAndSetDefaults();
-        validateCategorizationAnalyzer(jobBuilder, analysisRegistry);
+        validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, minNodeVersion);
 
         Job job = jobBuilder.build(new Date());


@@ -20,10 +20,14 @@ public abstract class AbstractMlTokenizer extends Tokenizer {
     protected final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
     protected final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
 
+    /**
+     * The internal offset stores the offset in the potentially filtered input to the tokenizer.
+     * This must be corrected before setting the offset attribute for user-visible output.
+     */
     protected int nextOffset;
     protected int skippedPositions;
 
-    AbstractMlTokenizer() {
+    protected AbstractMlTokenizer() {
     }
 
     @Override
@@ -31,7 +35,8 @@ public abstract class AbstractMlTokenizer extends Tokenizer {
         super.end();
         // Set final offset
         int finalOffset = nextOffset + (int) input.skip(Integer.MAX_VALUE);
-        offsetAtt.setOffset(finalOffset, finalOffset);
+        int correctedFinalOffset = correctOffset(finalOffset);
+        offsetAtt.setOffset(correctedFinalOffset, correctedFinalOffset);
         // Adjust any skipped tokens
         posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
     }


@@ -0,0 +1,98 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/
package org.elasticsearch.xpack.ml.job.categorization;
import org.apache.lucene.analysis.charfilter.BaseCharFilter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
/**
* A character filter that keeps the first non-blank line in the input, and discards everything before and after it.
* Treats both <code>\n</code> and <code>\r\n</code> as line endings. If there is a line ending at the end of the
* first non-blank line this is discarded. A line is considered blank if {@link Character#isWhitespace} returns
* <code>true</code> for all the characters in it.
*
* It is possible to achieve the same effect with a <code>pattern_replace</code> filter, but since this filter
* needs to be run on every single message to be categorized it is worth having a more performant specialization.
*/
public class FirstNonBlankLineCharFilter extends BaseCharFilter {
public static final String NAME = "first_non_blank_line";
private Reader transformedInput;
FirstNonBlankLineCharFilter(Reader in) {
super(in);
}
@Override
public int read(char[] cbuf, int off, int len) throws IOException {
// Buffer all input on the first call.
if (transformedInput == null) {
fill();
}
return transformedInput.read(cbuf, off, len);
}
@Override
public int read() throws IOException {
if (transformedInput == null) {
fill();
}
return transformedInput.read();
}
private void fill() throws IOException {
StringBuilder buffered = new StringBuilder();
char[] temp = new char[1024];
for (int cnt = input.read(temp); cnt > 0; cnt = input.read(temp)) {
buffered.append(temp, 0, cnt);
}
transformedInput = new StringReader(process(buffered).toString());
}
private CharSequence process(CharSequence input) {
boolean seenNonWhitespaceChar = false;
int prevNewlineIndex = -1;
int endIndex = -1;
for (int index = 0; index < input.length(); ++index) {
if (input.charAt(index) == '\n') {
if (seenNonWhitespaceChar) {
// With Windows line endings chop the \r as well as the \n
endIndex = (input.charAt(index - 1) == '\r') ? (index - 1) : index;
break;
}
prevNewlineIndex = index;
} else {
seenNonWhitespaceChar = seenNonWhitespaceChar || Character.isWhitespace(input.charAt(index)) == false;
}
}
if (seenNonWhitespaceChar == false) {
return "";
}
if (endIndex == -1) {
if (prevNewlineIndex == -1) {
// This is pretty likely, as most log messages _aren't_ multiline, so worth optimising
// for even though the return at the end of the method would be functionally identical
return input;
}
endIndex = input.length();
}
addOffCorrectMap(0, prevNewlineIndex + 1);
return input.subSequence(prevNewlineIndex + 1, endIndex);
}
}


@@ -0,0 +1,27 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/
package org.elasticsearch.xpack.ml.job.categorization;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractCharFilterFactory;
import java.io.Reader;
public class FirstNonBlankLineCharFilterFactory extends AbstractCharFilterFactory {
public FirstNonBlankLineCharFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {
super(indexSettings, name);
}
@Override
public Reader create(Reader tokenStream) {
return new FirstNonBlankLineCharFilter(tokenStream);
}
}


@@ -84,7 +84,7 @@ public class MlClassicTokenizer extends AbstractMlTokenizer {
 
         // Characters that may exist in the term attribute beyond its defined length are ignored
         termAtt.setLength(length);
-        offsetAtt.setOffset(start, start + length);
+        offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
         posIncrAtt.setPositionIncrement(skippedPositions + 1);
 
         return true;


@@ -136,7 +136,7 @@ public class MlStandardTokenizer extends AbstractMlTokenizer {
 
         // Characters that may exist in the term attribute beyond its defined length are ignored
         termAtt.setLength(length);
-        offsetAtt.setOffset(start, start + length);
+        offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
         posIncrAtt.setPositionIncrement(skippedPositions + 1);
 
         return true;


@@ -49,6 +49,7 @@ import org.elasticsearch.xpack.core.ml.action.PutJobAction;
 import org.elasticsearch.xpack.core.ml.action.UpdateJobAction;
 import org.elasticsearch.xpack.core.action.util.QueryPage;
 import org.elasticsearch.xpack.core.ml.job.config.AnalysisConfig;
+import org.elasticsearch.xpack.core.ml.job.config.CategorizationAnalyzerConfig;
 import org.elasticsearch.xpack.core.ml.job.config.DataDescription;
 import org.elasticsearch.xpack.core.ml.job.config.DetectionRule;
 import org.elasticsearch.xpack.core.ml.job.config.Detector;
@@ -91,6 +92,7 @@ import static org.hamcrest.Matchers.hasSize;
 import static org.hamcrest.Matchers.instanceOf;
 import static org.hamcrest.Matchers.is;
 import static org.hamcrest.Matchers.lessThanOrEqualTo;
+import static org.hamcrest.Matchers.nullValue;
 import static org.mockito.Matchers.any;
 import static org.mockito.Mockito.doAnswer;
 import static org.mockito.Mockito.mock;
@@ -584,10 +586,68 @@ public class JobManagerTests extends ESTestCase {
         assertThat(capturedUpdateParams.get(1).isUpdateScheduledEvents(), is(true));
     }
 
+    public void testValidateCategorizationAnalyzer_GivenValid() throws IOException {
+        List<String> categorizationFilters = randomBoolean() ? Collections.singletonList("query: .*") : null;
+        CategorizationAnalyzerConfig c = CategorizationAnalyzerConfig.buildDefaultCategorizationAnalyzer(categorizationFilters);
+        Job.Builder jobBuilder = createCategorizationJob(c, null);
+        JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.CURRENT);
+
+        Job job = jobBuilder.build(new Date());
+        assertThat(job.getAnalysisConfig().getCategorizationAnalyzerConfig(),
+            equalTo(CategorizationAnalyzerConfig.buildDefaultCategorizationAnalyzer(categorizationFilters)));
+    }
+
+    public void testValidateCategorizationAnalyzer_GivenInvalid() {
+        CategorizationAnalyzerConfig c = new CategorizationAnalyzerConfig.Builder().setAnalyzer("does_not_exist").build();
+        Job.Builder jobBuilder = createCategorizationJob(c, null);
+        IllegalArgumentException e = expectThrows(IllegalArgumentException.class,
+            () -> JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.CURRENT));
+
+        assertThat(e.getMessage(), equalTo("Failed to find global analyzer [does_not_exist]"));
+    }
+
+    public void testSetDefaultCategorizationAnalyzer_GivenAllNewNodes() throws IOException {
+        List<String> categorizationFilters = randomBoolean() ? Collections.singletonList("query: .*") : null;
+        Job.Builder jobBuilder = createCategorizationJob(null, categorizationFilters);
+        JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.CURRENT);
+
+        Job job = jobBuilder.build(new Date());
+        assertThat(job.getAnalysisConfig().getCategorizationAnalyzerConfig(),
+            equalTo(CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(categorizationFilters)));
+    }
+
+    // TODO: This test can be deleted from branches that would never have to talk to a 7.13 node
+    public void testSetDefaultCategorizationAnalyzer_GivenOldNodeInCluster() throws IOException {
+        List<String> categorizationFilters = randomBoolean() ? Collections.singletonList("query: .*") : null;
+        Job.Builder jobBuilder = createCategorizationJob(null, categorizationFilters);
+        JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.V_7_13_0);
+
+        Job job = jobBuilder.build(new Date());
+        assertThat(job.getAnalysisConfig().getCategorizationAnalyzerConfig(), nullValue());
+    }
+
+    private Job.Builder createCategorizationJob(CategorizationAnalyzerConfig categorizationAnalyzerConfig,
+                                                List<String> categorizationFilters) {
+        Detector.Builder d = new Detector.Builder("count", null).setByFieldName("mlcategory");
+        AnalysisConfig.Builder ac = new AnalysisConfig.Builder(Collections.singletonList(d.build()))
+            .setCategorizationFieldName("message")
+            .setCategorizationAnalyzerConfig(categorizationAnalyzerConfig)
+            .setCategorizationFilters(categorizationFilters);
+
+        Job.Builder builder = new Job.Builder();
+        builder.setId("cat");
+        builder.setAnalysisConfig(ac);
+        builder.setDataDescription(new DataDescription.Builder());
+        return builder;
+    }
+
     private Job.Builder createJob() {
-        Detector.Builder d1 = new Detector.Builder("info_content", "domain");
-        d1.setOverFieldName("client");
-        AnalysisConfig.Builder ac = new AnalysisConfig.Builder(Collections.singletonList(d1.build()));
+        Detector.Builder d = new Detector.Builder("info_content", "domain").setOverFieldName("client");
+        AnalysisConfig.Builder ac = new AnalysisConfig.Builder(Collections.singletonList(d.build()));
 
         Job.Builder builder = new Job.Builder();
         builder.setId("foo");


@@ -25,6 +25,29 @@ import java.util.Map;
 
 public class CategorizationAnalyzerTests extends ESTestCase {
 
+    private static final String NGINX_ERROR_EXAMPLE =
+        "a client request body is buffered to a temporary file /tmp/client-body/0000021894, client: 10.8.0.12, " +
+        "server: apm.35.205.226.121.ip.es.io, request: \"POST /intake/v2/events HTTP/1.1\", host: \"apm.35.205.226.121.ip.es.io\"\n" +
+        "10.8.0.12 - - [29/Nov/2020:21:34:55 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" " +
+        "\"elasticapm-dotnet/1.5.1 System.Net.Http/4.6.28208.02 .NET_Core/2.2.8\" 27821 0.002 [default-apm-apm-server-8200] [] " +
+        "10.8.1.19:8200 0 0.001 202 f961c776ff732f5c8337530aa22c7216\n" +
+        "10.8.0.14 - - [29/Nov/2020:21:34:56 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" " +
+        "\"elasticapm-python/5.10.0\" 3594 0.002 [default-apm-apm-server-8200] [] 10.8.1.18:8200 0 0.001 202 " +
+        "61feb8fb9232b1ebe54b588b95771ce4\n" +
+        "10.8.4.90 - - [29/Nov/2020:21:34:56 +0000] \"OPTIONS /intake/v2/rum/events HTTP/2.0\" 200 0 " +
+        "\"http://opbeans-frontend:3000/dashboard\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " +
+        "Cypress/3.3.1 Chrome/61.0.3163.100 Electron/2.0.18 Safari/537.36\" 292 0.001 [default-apm-apm-server-8200] [] " +
+        "10.8.1.19:8200 0 0.000 200 5fbe8cd4d217b932def1c17ed381c66b\n" +
+        "10.8.4.90 - - [29/Nov/2020:21:34:56 +0000] \"POST /intake/v2/rum/events HTTP/2.0\" 202 0 " +
+        "\"http://opbeans-frontend:3000/dashboard\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " +
+        "Cypress/3.3.1 Chrome/61.0.3163.100 Electron/2.0.18 Safari/537.36\" 3004 0.001 [default-apm-apm-server-8200] [] " +
+        "10.8.1.18:8200 0 0.001 202 4735f571928595744ac6a9545c3ecdf5\n" +
+        "10.8.0.11 - - [29/Nov/2020:21:34:56 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" " +
+        "\"elasticapm-node/3.8.0 elastic-apm-http-client/9.4.2 node/12.20.0\" 4913 10.006 [default-apm-apm-server-8200] [] " +
+        "10.8.1.18:8200 0 0.002 202 1eac41789ea9a60a8be4e476c54cbbc9\n" +
+        "10.8.0.14 - - [29/Nov/2020:21:34:57 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" \"elasticapm-python/5.10.0\" 1025 " +
+        "0.001 [default-apm-apm-server-8200] [] 10.8.1.18:8200 0 0.001 202 d27088936cadd3b8804b68998a5f94fa";
+
     private AnalysisRegistry analysisRegistry;
 
     public static AnalysisRegistry buildTestAnalysisRegistry(Environment environment) throws Exception {
@@ -218,6 +241,19 @@ public class CategorizationAnalyzerTests extends ESTestCase {
             assertEquals(Arrays.asList("PSYoungGen", "total", "used"),
                 categorizationAnalyzer.tokenizeField("java",
                     "PSYoungGen total 2572800K, used 1759355K [0x0000000759500000, 0x0000000800000000, 0x0000000800000000)"));
+
+            assertEquals(Arrays.asList("client", "request", "body", "is", "buffered", "to", "temporary", "file", "tmp", "client-body",
+                "client", "server", "apm.35.205.226.121.ip.es.io", "request", "POST", "intake", "v2", "events", "HTTP", "host",
+                "apm.35.205.226.121.ip.es.io", "POST", "intake", "v2", "events", "HTTP", "elasticapm-dotnet", "System.Net.Http", "NET_Core",
+                "default-apm-apm-server-8200", "POST", "intake", "v2", "events", "HTTP", "elasticapm-python", "default-apm-apm-server-8200",
+                "OPTIONS", "intake", "v2", "rum", "events", "HTTP", "http", "opbeans-frontend", "dashboard", "Mozilla", "X11", "Linux",
+                "x86_64", "AppleWebKit", "KHTML", "like", "Gecko", "Cypress", "Chrome", "Electron", "Safari", "default-apm-apm-server-8200",
+                "POST", "intake", "v2", "rum", "events", "HTTP", "http", "opbeans-frontend", "dashboard", "Mozilla", "X11", "Linux",
+                "x86_64", "AppleWebKit", "KHTML", "like", "Gecko", "Cypress", "Chrome", "Electron", "Safari", "default-apm-apm-server-8200",
+                "POST", "intake", "v2", "events", "HTTP", "elasticapm-node", "elastic-apm-http-client", "node",
+                "default-apm-apm-server-8200", "POST", "intake", "v2", "events", "HTTP", "elasticapm-python",
+                "default-apm-apm-server-8200"),
+                categorizationAnalyzer.tokenizeField("nginx_error", NGINX_ERROR_EXAMPLE));
         }
     }
@@ -251,6 +287,51 @@ public class CategorizationAnalyzerTests extends ESTestCase {
         }
     }
 
+    public void testMlStandardCategorizationAnalyzer() throws IOException {
+        CategorizationAnalyzerConfig standardConfig = CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(null);
+        try (CategorizationAnalyzer categorizationAnalyzer = new CategorizationAnalyzer(analysisRegistry, standardConfig)) {
+
+            assertEquals(Arrays.asList("ml13-4608.1.p2ps", "Info", "Source", "ML_SERVICE2", "on", "has", "shut", "down"),
+                categorizationAnalyzer.tokenizeField("p2ps",
+                    "<ml13-4608.1.p2ps: Info: > Source ML_SERVICE2 on 13122:867 has shut down."));
+
+            assertEquals(Arrays.asList("Vpxa", "verbose", "VpxaHalCnxHostagent", "opID", "WFU-ddeadb59", "WaitForUpdatesDone", "Received",
+                "callback"),
+                categorizationAnalyzer.tokenizeField("vmware",
+                    "Vpxa: [49EC0B90 verbose 'VpxaHalCnxHostagent' opID=WFU-ddeadb59] [WaitForUpdatesDone] Received callback"));
+
+            assertEquals(Arrays.asList("org.apache.coyote.http11.Http11BaseProtocol", "destroy"),
+                categorizationAnalyzer.tokenizeField("apache",
+                    "org.apache.coyote.http11.Http11BaseProtocol destroy"));
+
+            assertEquals(Arrays.asList("INFO", "session", "PROXY", "Session", "DESTROYED"),
+                categorizationAnalyzer.tokenizeField("proxy",
+                    " [1111529792] INFO session <45409105041220090733@192.168.251.123> - " +
+                        "----------------- PROXY Session DESTROYED --------------------"));
+
+            assertEquals(Arrays.asList("PSYoungGen", "total", "used"),
+                categorizationAnalyzer.tokenizeField("java",
+                    "PSYoungGen total 2572800K, used 1759355K [0x0000000759500000, 0x0000000800000000, 0x0000000800000000)"));
+
+            assertEquals(Arrays.asList("first", "line"),
+                categorizationAnalyzer.tokenizeField("multiline", "first line\nsecond line\nthird line"));
+
+            assertEquals(Arrays.asList("first", "line"),
+                categorizationAnalyzer.tokenizeField("windows_multiline", "first line\r\nsecond line\r\nthird line"));
+
+            assertEquals(Arrays.asList("second", "line"),
+                categorizationAnalyzer.tokenizeField("multiline_first_blank", "\nsecond line\nthird line"));
+
+            assertEquals(Arrays.asList("second", "line"),
+                categorizationAnalyzer.tokenizeField("windows_multiline_first_blank", "\r\nsecond line\r\nthird line"));
+
+            assertEquals(Arrays.asList("client", "request", "body", "is", "buffered", "to", "temporary", "file",
+                "/tmp/client-body/0000021894", "client", "server", "apm.35.205.226.121.ip.es.io", "request", "POST", "/intake/v2/events",
+                "HTTP/1.1", "host", "apm.35.205.226.121.ip.es.io"),
+                categorizationAnalyzer.tokenizeField("nginx_error", NGINX_ERROR_EXAMPLE));
+        }
+    }
+
     // The Elasticsearch standard analyzer - this is the default for indexing in Elasticsearch, but
     // NOT for ML categorization (and you'll see why if you look at the expected results of this test!)
     public void testStandardAnalyzer() throws IOException {


@@ -0,0 +1,128 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/
package org.elasticsearch.xpack.ml.job.categorization;
import org.elasticsearch.test.ESTestCase;
import java.io.IOException;
import java.io.StringReader;
import static org.hamcrest.Matchers.equalTo;
public class FirstNonBlankLineCharFilterTests extends ESTestCase {
public void testEmpty() throws IOException {
String input = "";
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
assertThat(filter.read(), equalTo(-1));
}
public void testAllBlankOneLine() throws IOException {
String input = "\t";
if (randomBoolean()) {
input = " " + input;
}
if (randomBoolean()) {
input = input + " ";
}
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
assertThat(filter.read(), equalTo(-1));
}
public void testNonBlankNoNewlines() throws IOException {
String input = "the quick brown fox jumped over the lazy dog";
if (randomBoolean()) {
input = " " + input;
}
if (randomBoolean()) {
input = input + " ";
}
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
char[] output = new char[input.length()];
assertThat(filter.read(output, 0, output.length), equalTo(input.length()));
assertThat(filter.read(), equalTo(-1));
assertThat(new String(output), equalTo(input));
}
public void testAllBlankMultiline() throws IOException {
StringBuilder input = new StringBuilder();
String lineEnding = randomBoolean() ? "\n" : "\r\n";
for (int lineNum = randomIntBetween(2, 5); lineNum > 0; --lineNum) {
for (int charNum = randomIntBetween(0, 5); charNum > 0; --charNum) {
input.append(randomBoolean() ? " " : "\t");
}
if (lineNum > 1 || randomBoolean()) {
input.append(lineEnding);
}
}
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input.toString()));
assertThat(filter.read(), equalTo(-1));
}
public void testNonBlankMultiline() throws IOException {
StringBuilder input = new StringBuilder();
String lineEnding = randomBoolean() ? "\n" : "\r\n";
for (int lineBeforeNum = randomIntBetween(2, 5); lineBeforeNum > 0; --lineBeforeNum) {
for (int charNum = randomIntBetween(0, 5); charNum > 0; --charNum) {
input.append(randomBoolean() ? " " : "\t");
}
input.append(lineEnding);
}
String lineToKeep = "the quick brown fox jumped over the lazy dog";
if (randomBoolean()) {
lineToKeep = " " + lineToKeep;
}
if (randomBoolean()) {
lineToKeep = lineToKeep + " ";
}
input.append(lineToKeep).append(lineEnding);
for (int lineAfterNum = randomIntBetween(2, 5); lineAfterNum > 0; --lineAfterNum) {
for (int charNum = randomIntBetween(0, 5); charNum > 0; --charNum) {
input.append(randomBoolean() ? " " : "more");
}
if (lineAfterNum > 1 || randomBoolean()) {
input.append(lineEnding);
}
}
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input.toString()));
char[] output = new char[lineToKeep.length()];
assertThat(filter.read(output, 0, output.length), equalTo(lineToKeep.length()));
assertThat(filter.read(), equalTo(-1));
assertThat(new String(output), equalTo(lineToKeep));
}
public void testCorrect() throws IOException {
String input = " \nfirst line\nsecond line";
FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
String expectedOutput = "first line";
char[] output = new char[expectedOutput.length()];
assertThat(filter.read(output, 0, output.length), equalTo(expectedOutput.length()));
assertThat(filter.read(), equalTo(-1));
assertThat(new String(output), equalTo(expectedOutput));
int expectedOutputIndex = input.indexOf(expectedOutput);
// Offsets in the filtered output must map back to the matching positions in the original input.
for (int i = 0; i <= expectedOutput.length(); ++i) {
assertThat(filter.correctOffset(i), equalTo(expectedOutputIndex + i));
}
}
}
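Taken together, the tests above pin down a small contract: the filter emits only the first line that contains a non-whitespace character, and correctOffset() maps offsets in the filtered output back to positions in the original input. Below is a minimal standalone sketch of that contract in plain Java; it is an illustration only, not the actual Lucene CharFilter implementation, and the class and method names are invented for the example.

import java.util.Optional;

public final class FirstNonBlankLineSketch {

    public static final class Result {
        public final String line; // the first non-blank line, verbatim
        public final int offset;  // its start offset in the original input

        Result(String line, int offset) {
            this.line = line;
            this.offset = offset;
        }

        // Equivalent of the char filter's correctOffset(): map an offset in
        // the filtered output back to an offset in the original input.
        public int correctOffset(int outputOffset) {
            return offset + outputOffset;
        }
    }

    public static Optional<Result> firstNonBlankLine(String input) {
        int lineStart = 0;
        for (int i = 0; i <= input.length(); i++) {
            boolean atEnd = i == input.length();
            if (atEnd || input.charAt(i) == '\n') {
                // Trim a trailing '\r' so "\r\n" line endings behave like "\n".
                int lineEnd = (i > lineStart && input.charAt(i - 1) == '\r') ? i - 1 : i;
                String line = input.substring(lineStart, lineEnd);
                if (line.isBlank() == false) {
                    return Optional.of(new Result(line, lineStart));
                }
                lineStart = i + 1;
            }
        }
        return Optional.empty(); // all-blank input produces empty output
    }

    public static void main(String[] args) {
        // Like testCorrect() above: skip the blank first line, keep "first line",
        // and correct offsets back into the original string.
        Result r = firstNonBlankLine("  \nfirst line\nsecond line").orElseThrow();
        System.out.println(r.line + " starts at offset " + r.offset); // -> offset 3
        System.out.println(r.correctOffset(0)); // -> 3
    }
}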


@@ -341,6 +341,12 @@
}
}
- match: { job_id: "jobs-crud-update-job" }
- length: { analysis_config.categorization_analyzer.filter: 1 }
- match: { analysis_config.categorization_analyzer.tokenizer: "ml_standard" }
- length: { analysis_config.categorization_analyzer.char_filter: 3 }
- match: { analysis_config.categorization_analyzer.char_filter.0: "first_non_blank_line" }
- match: { analysis_config.categorization_analyzer.char_filter.1.pattern: "cat1.*" }
- match: { analysis_config.categorization_analyzer.char_filter.2.pattern: "cat2.*" }
- do:
ml.open_job:
@@ -381,7 +387,6 @@
"background_persist_interval": "3h",
"model_snapshot_retention_days": 30,
"results_retention_days": 40,
"categorization_filters" : ["cat3.*"],
"custom_settings": { "custom_settings": {
"setting3": "custom3" "setting3": "custom3"
} }
@ -392,7 +397,6 @@
- match: { model_plot_config.enabled: false } - match: { model_plot_config.enabled: false }
- match: { model_plot_config.terms: "foobar" } - match: { model_plot_config.terms: "foobar" }
- match: { model_plot_config.annotations_enabled: false } - match: { model_plot_config.annotations_enabled: false }
- match: { analysis_config.categorization_filters: ["cat3.*"] }
- match: { analysis_config.detectors.0.custom_rules.0.actions: ["skip_result"] }
- length: { analysis_config.detectors.0.custom_rules.0.conditions: 1 }
- match: { analysis_config.detectors.0.detector_index: 0 }
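When categorization_filters are supplied without an explicit categorization_analyzer, the assertions above show the result of the conversion: the filters become pattern-based char filters (note the .pattern fields) that run after first_non_blank_line, in the order given. As a rough client-side approximation of that pipeline (plain JDK regex; the class name and the delete-the-match replacement are assumptions for illustration, not the server implementation):

import java.util.List;
import java.util.regex.Pattern;

public final class ConvertedFiltersSketch {

    public static String preprocess(String message, List<String> categorizationFilters) {
        // Step 1: mimic first_non_blank_line - keep only the first non-blank line.
        String line = message.lines()
            .filter(l -> l.isBlank() == false)
            .findFirst()
            .orElse("");
        // Step 2: apply each converted filter in order, deleting whatever matches.
        for (String filter : categorizationFilters) {
            line = Pattern.compile(filter).matcher(line).replaceAll("");
        }
        return line;
    }

    public static void main(String[] args) {
        // "cat1.*" wipes everything from the first "cat1" onward, mirroring the
        // char_filter.1.pattern assertion above.
        System.out.println(preprocess("\n\nstatus ok cat1 noise", List.of("cat1.*", "cat2.*")));
        // -> "status ok "
    }
}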


@@ -10,7 +10,7 @@ teardown:
"Test ml info":
- do:
ml.info: {}
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" } - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
- match: { defaults.anomaly_detectors.model_memory_limit: "1gb" }
- match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
- match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@@ -30,7 +30,7 @@ teardown:
- do:
ml.info: {}
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" } - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
- match: { defaults.anomaly_detectors.model_memory_limit: "512mb" }
- match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
- match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@@ -50,7 +50,7 @@ teardown:
- do:
ml.info: {}
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" } - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
- match: { defaults.anomaly_detectors.model_memory_limit: "1gb" }
- match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
- match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@@ -70,7 +70,7 @@ teardown:
- do:
ml.info: {}
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" } - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
- match: { defaults.anomaly_detectors.model_memory_limit: "1gb" }
- match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
- match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@@ -90,7 +90,7 @@ teardown:
- do:
ml.info: {}
- match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" } - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
- match: { defaults.anomaly_detectors.model_memory_limit: "1mb" }
- match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
- match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
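All five cases now expect ml_standard as the reported default categorization tokenizer. To see which default a particular cluster reports, the same ml.info endpoint these tests call can be queried directly; a minimal sketch with the JDK HTTP client, where the URL and the absence of authentication are assumptions for a local test cluster:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class MlInfoDefaultsCheck {
    public static void main(String[] args) throws Exception {
        // GET _ml/info is the REST endpoint behind the ml.info calls above.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_ml/info")) // assumed local cluster
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // Per the updated expectations above, the body should report "ml_standard"
        // under defaults.anomaly_detectors.categorization_analyzer.tokenizer.
        System.out.println(response.body());
    }
}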


@@ -0,0 +1,115 @@
---
"Test analyze API with the standard 7.14 ML analyzer":
- do:
indices.analyze:
body: >
{
"char_filter" : [
"first_non_blank_line"
],
"tokenizer" : "ml_standard",
"filter" : [
{ "type" : "stop", "stopwords": [
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"
] }
],
"text" : "[elasticsearch] [2017-12-13T10:46:30,816][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.watcher-history-7-2017.12.13] creating index, cause [auto(bulk api)], templates [.watch-history-7], shards [1]/[1], mappings [doc]"
}
- match: { tokens.0.token: "elasticsearch" }
- match: { tokens.0.start_offset: 1 }
- match: { tokens.0.end_offset: 14 }
- match: { tokens.0.position: 0 }
- match: { tokens.1.token: "INFO" }
- match: { tokens.1.start_offset: 42 }
- match: { tokens.1.end_offset: 46 }
- match: { tokens.1.position: 5 }
- match: { tokens.2.token: "o.e.c.m.MetadataCreateIndexService" }
- match: { tokens.2.start_offset: 49 }
- match: { tokens.2.end_offset: 83 }
- match: { tokens.2.position: 6 }
- match: { tokens.3.token: "node-0" }
- match: { tokens.3.start_offset: 86 }
- match: { tokens.3.end_offset: 92 }
- match: { tokens.3.position: 7 }
- match: { tokens.4.token: "watcher-history-7-2017.12.13" }
- match: { tokens.4.start_offset: 96 }
- match: { tokens.4.end_offset: 124 }
- match: { tokens.4.position: 8 }
- match: { tokens.5.token: "creating" }
- match: { tokens.5.start_offset: 126 }
- match: { tokens.5.end_offset: 134 }
- match: { tokens.5.position: 9 }
- match: { tokens.6.token: "index" }
- match: { tokens.6.start_offset: 135 }
- match: { tokens.6.end_offset: 140 }
- match: { tokens.6.position: 10 }
- match: { tokens.7.token: "cause" }
- match: { tokens.7.start_offset: 142 }
- match: { tokens.7.end_offset: 147 }
- match: { tokens.7.position: 11 }
- match: { tokens.8.token: "auto" }
- match: { tokens.8.start_offset: 149 }
- match: { tokens.8.end_offset: 153 }
- match: { tokens.8.position: 12 }
- match: { tokens.9.token: "bulk" }
- match: { tokens.9.start_offset: 154 }
- match: { tokens.9.end_offset: 158 }
- match: { tokens.9.position: 13 }
- match: { tokens.10.token: "api" }
- match: { tokens.10.start_offset: 159 }
- match: { tokens.10.end_offset: 162 }
- match: { tokens.10.position: 14 }
- match: { tokens.11.token: "templates" }
- match: { tokens.11.start_offset: 166 }
- match: { tokens.11.end_offset: 175 }
- match: { tokens.11.position: 15 }
- match: { tokens.12.token: "watch-history-7" }
- match: { tokens.12.start_offset: 178 }
- match: { tokens.12.end_offset: 193 }
- match: { tokens.12.position: 16 }
- match: { tokens.13.token: "shards" }
- match: { tokens.13.start_offset: 196 }
- match: { tokens.13.end_offset: 202 }
- match: { tokens.13.position: 17 }
- match: { tokens.14.token: "mappings" }
- match: { tokens.14.start_offset: 212 }
- match: { tokens.14.end_offset: 220 }
- match: { tokens.14.position: 21 }
- match: { tokens.15.token: "doc" }
- match: { tokens.15.start_offset: 222 }
- match: { tokens.15.end_offset: 225 }
- match: { tokens.15.position: 22 }
---
"Test 7.14 analyzer with blank lines":
- do:
indices.analyze:
body: >
{
"char_filter" : [
"first_non_blank_line"
],
"tokenizer" : "ml_standard",
"filter" : [
{ "type" : "stop", "stopwords": [
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"
] }
],
"text" : " \nfirst line\nsecond line"
}
- match: { tokens.0.token: "first" }
- match: { tokens.0.start_offset: 4 }
- match: { tokens.0.end_offset: 9 }
- match: { tokens.0.position: 0 }
- match: { tokens.1.token: "line" }
- match: { tokens.1.start_offset: 10 }
- match: { tokens.1.end_offset: 14 }
- match: { tokens.1.position: 1 }
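These request bodies can be replayed outside the YAML test framework by POSTing them to the _analyze API. A minimal sketch for the blank-lines case with the JDK HTTP client, where the cluster URL is an assumption; note that the start_offset: 4 assertion implies three spaces precede the first newline in the original text:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class AnalyzeBlankLinesExample {
    public static void main(String[] args) throws Exception {
        // Same char_filter + tokenizer combination as the test above; the stop
        // filter is omitted because this text contains no day or month names.
        String body = """
            {
              "char_filter": [ "first_non_blank_line" ],
              "tokenizer": "ml_standard",
              "text": "   \\nfirst line\\nsecond line"
            }
            """;
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_analyze")) // assumed local cluster
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // Expect two tokens, "first" and "line", whose offsets point into the
        // original text past the blank first line, as asserted above.
        System.out.println(response.body());
    }
}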