Mirror of https://github.com/elastic/elasticsearch.git, synced 2025-06-28 17:34:17 -04:00
[ML] Make ml_standard tokenizer the default for new categorization jobs (#72805)
Categorization jobs created once the entire cluster is upgraded to version 7.14 or higher will default to using the new ml_standard tokenizer rather than the previous default of the ml_classic tokenizer, and will incorporate the new first_non_blank_line char filter so that categorization is based purely on the first non-blank line of each message. The difference between the ml_classic and ml_standard tokenizers is that ml_classic splits on slashes and colons, so it creates multiple tokens from URLs and filesystem paths, whereas ml_standard attempts to keep URLs, email addresses and filesystem paths as single tokens. It is still possible to configure the ml_classic tokenizer if you prefer: just provide a categorization_analyzer within your analysis_config, and whichever tokenizer you choose (which could be ml_classic or any other Elasticsearch tokenizer) will be used. To opt out of using first_non_blank_line as a default char filter, you must explicitly specify a categorization_analyzer that does not include it. If no categorization_analyzer is specified but categorization_filters are specified, then the categorization filters are converted to char filters that are applied after first_non_blank_line. Closes elastic/ml-cpp#1724
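As an illustration only (a minimal sketch using the server-side CategorizationAnalyzerConfig.Builder that this commit touches, not code from the commit itself), opting out of the new defaults means supplying an explicit categorization_analyzer that names the ml_classic tokenizer:

import org.elasticsearch.xpack.core.ml.job.config.CategorizationAnalyzerConfig;

class ClassicCategorizationExample {
    // Hypothetical helper: pin the old ml_classic tokenizer explicitly. Note that the old built-in
    // default also added a day/month stop-word token filter (see buildDefaultCategorizationAnalyzer
    // later in this diff), so add that too if you need behaviour identical to the old default.
    static CategorizationAnalyzerConfig classicAnalyzer() {
        return new CategorizationAnalyzerConfig.Builder()
            .setTokenizer("ml_classic") // any other Elasticsearch tokenizer would also be accepted here
            .build();                   // no first_non_blank_line char filter is added
    }
}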
This commit is contained in:
parent 88dfe1aebf
commit 0059c59e25

22 changed files with 688 additions and 96 deletions
@@ -588,14 +588,13 @@ public class MlClientDocumentationIT extends ESRestHighLevelClientTestCase {
                 .setDescription("My description") // <2>
                 .setAnalysisLimits(new AnalysisLimits(1000L, null)) // <3>
                 .setBackgroundPersistInterval(TimeValue.timeValueHours(3)) // <4>
-                .setCategorizationFilters(Arrays.asList("categorization-filter")) // <5>
-                .setDetectorUpdates(Arrays.asList(detectorUpdate)) // <6>
-                .setGroups(Arrays.asList("job-group-1")) // <7>
-                .setResultsRetentionDays(10L) // <8>
-                .setModelPlotConfig(new ModelPlotConfig(true, null, true)) // <9>
-                .setModelSnapshotRetentionDays(7L) // <10>
-                .setCustomSettings(customSettings) // <11>
-                .setRenormalizationWindowDays(3L) // <12>
+                .setDetectorUpdates(Arrays.asList(detectorUpdate)) // <5>
+                .setGroups(Arrays.asList("job-group-1")) // <6>
+                .setResultsRetentionDays(10L) // <7>
+                .setModelPlotConfig(new ModelPlotConfig(true, null, true)) // <8>
+                .setModelSnapshotRetentionDays(7L) // <9>
+                .setCustomSettings(customSettings) // <10>
+                .setRenormalizationWindowDays(3L) // <11>
                 .build();
             // end::update-job-options
@@ -35,14 +35,13 @@ include-tagged::{doc-tests-file}[{api}-options]
 <2> Updated description.
 <3> Updated analysis limits.
 <4> Updated background persistence interval.
-<5> Updated analysis config's categorization filters.
-<6> Updated detectors through the `JobUpdate.DetectorUpdate` object.
-<7> Updated group membership.
-<8> Updated result retention.
-<9> Updated model plot configuration.
-<10> Updated model snapshot retention setting.
-<11> Updated custom settings.
-<12> Updated renormalization window.
+<5> Updated detectors through the `JobUpdate.DetectorUpdate` object.
+<6> Updated group membership.
+<7> Updated result retention.
+<8> Updated model plot configuration.
+<9> Updated model snapshot retention setting.
+<10> Updated custom settings.
+<11> Updated renormalization window.
 
 Included with these options are specific optional `JobUpdate.DetectorUpdate` updates.
 ["source","java",subs="attributes,callouts,macros"]
@@ -49,7 +49,10 @@ This is a possible response:
   "defaults" : {
     "anomaly_detectors" : {
       "categorization_analyzer" : {
-        "tokenizer" : "ml_classic",
+        "char_filter" : [
+          "first_non_blank_line"
+        ],
+        "tokenizer" : "ml_standard",
         "filter" : [
           {
             "type" : "stop",
@@ -21,8 +21,8 @@ of possible messages:
 Categorization is tuned to work best on data like log messages by taking token
 order into account, including stop words, and not considering synonyms in its
 analysis. Complete sentences in human communication or literary text (for
 example email, wiki pages, prose, or other human-generated content) can be
 extremely diverse in structure. Since categorization is tuned for machine data,
 it gives poor results for human-generated data. It would create so many
 categories that they couldn't be handled effectively. Categorization is _not_
 natural language processing (NLP).
@@ -32,7 +32,7 @@ volume and pattern is normal for each category over time. You can then detect
 anomalies and surface rare events or unusual types of messages by using
 <<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.
 
 In {kib}, there is a categorization wizard to help you create this type of
 {anomaly-job}. For example, the following job generates categories from the
 contents of the `message` field and uses the count function to determine when
 certain categories are occurring at anomalous rates:
@@ -69,7 +69,7 @@ do not specify this keyword in one of those properties, the API request fails.
 ====
 
 You can use the **Anomaly Explorer** in {kib} to view the analysis results:
 
 [role="screenshot"]
 image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]
@@ -105,7 +105,7 @@ SQL statement from the categorization algorithm.
 If you enable per-partition categorization, categories are determined
 independently for each partition. For example, if your data includes messages
 from multiple types of logs from different applications, you can use a field
 like the ECS {ecs-ref}/ecs-event.html[`event.dataset` field] as the
 `partition_field_name` and categorize the messages for each type of log
 separately.
@@ -116,7 +116,7 @@ create or update a job and enable per-partition categorization, it fails.
 
 When per-partition categorization is enabled, you can also take advantage of a
 `stop_on_warn` configuration option. If the categorization status for a
 partition changes to `warn`, it doesn't categorize well and can cause a lot of
 unnecessary resource usage. When you set `stop_on_warn` to `true`, the job stops
 analyzing these problematic partitions. You can thus avoid an ongoing
 performance cost for partitions that are unsuitable for categorization.
@@ -128,7 +128,7 @@ performance cost for partitions that are unsuitable for categorization.
 Categorization uses English dictionary words to identify log message categories.
 By default, it also uses English tokenization rules. For this reason, if you use
 the default categorization analyzer, only English language log messages are
 supported, as described in the <<ml-limitations>>.
 
 If you use the categorization wizard in {kib}, you can see which categorization
 analyzer it uses and highlighted examples of the tokens that it identifies. You
@@ -140,7 +140,7 @@ image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in K
 
 The categorization analyzer can refer to a built-in {es} analyzer or a
 combination of zero or more character filters, a tokenizer, and zero or more
 token filters. In this example, adding a
 {ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
 achieves exactly the same behavior as the `categorization_filters` job
 configuration option described earlier. For more details about these properties,
@@ -157,7 +157,10 @@ POST _ml/anomaly_detectors/_validate
 {
   "analysis_config" : {
     "categorization_analyzer" : {
-      "tokenizer" : "ml_classic",
+      "char_filter" : [
+        "first_non_blank_line"
+      ],
+      "tokenizer" : "ml_standard",
       "filter" : [
         { "type" : "stop", "stopwords": [
           "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@@ -182,8 +185,8 @@ POST _ml/anomaly_detectors/_validate
 If you specify any part of the `categorization_analyzer`, however, any omitted
 sub-properties are _not_ set to default values.
 
-The `ml_classic` tokenizer and the day and month stopword filter are more or
+The `ml_standard` tokenizer and the day and month stopword filter are more or
 less equivalent to the following analyzer, which is defined using only built-in
 {es} {ref}/analysis-tokenizers.html[tokenizers] and
 {ref}/analysis-tokenfilters.html[token filters]:
@@ -201,15 +204,18 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
       "detector_description": "Unusual message counts"
     }],
     "categorization_analyzer":{
+      "char_filter" : [
+        "first_non_blank_line" <1>
+      ],
       "tokenizer": {
         "type" : "simple_pattern_split",
-        "pattern" : "[^-0-9A-Za-z_.]+" <1>
+        "pattern" : "[^-0-9A-Za-z_./]+" <2>
       },
       "filter": [
-        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
-        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
-        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
-        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
+        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <3>
+        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <4>
+        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <5>
+        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <6>
         { "type" : "stop", "stopwords": [
           "",
           "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@@ -232,17 +238,20 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
 ----------------------------------
 // TEST[skip:needs-licence]
 
-<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
-<2> By default, categorization ignores tokens that begin with a digit.
-<3> By default, categorization also ignores tokens that are hexadecimal numbers.
-<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
-<5> Underscores, hyphens, and dots are also removed from the end of tokens.
+<1> Only consider the first non-blank line of the message for categorization purposes.
+<2> Tokens basically consist of hyphens, digits, letters, underscores, dots and slashes.
+<3> By default, categorization ignores tokens that begin with a digit.
+<4> By default, categorization also ignores tokens that are hexadecimal numbers.
+<5> Underscores, hyphens, and dots are removed from the beginning of tokens.
+<6> Underscores, hyphens, and dots are also removed from the end of tokens.
 
 The key difference between the default `categorization_analyzer` and this
-example analyzer is that using the `ml_classic` tokenizer is several times
-faster. The difference in behavior is that this custom analyzer does not include
-accented letters in tokens whereas the `ml_classic` tokenizer does, although
-that could be fixed by using more complex regular expressions.
+example analyzer is that using the `ml_standard` tokenizer is several times
+faster. The `ml_standard` tokenizer also tries to preserve URLs, Windows paths
+and email addresses as single tokens. Another difference in behavior is that
+this custom analyzer does not include accented letters in tokens whereas the
+`ml_standard` tokenizer does, although that could be fixed by using more complex
+regular expressions.
 
 If you are categorizing non-English messages in a language where words are
 separated by spaces, you might get better results if you change the day or month
@@ -1592,11 +1592,17 @@ end::timestamp-results[]
 tag::tokenizer[]
 The name or definition of the <<analysis-tokenizers,tokenizer>> to use after
 character filters are applied. This property is compulsory if
-`categorization_analyzer` is specified as an object. Machine learning provides a
-tokenizer called `ml_classic` that tokenizes in the same way as the
-non-customizable tokenizer in older versions of the product. If you want to use
-that tokenizer but change the character or token filters, specify
-`"tokenizer": "ml_classic"` in your `categorization_analyzer`.
+`categorization_analyzer` is specified as an object. Machine learning provides
+a tokenizer called `ml_standard` that tokenizes in a way that has been
+determined to produce good categorization results on a variety of log
+file formats for logs in English. If you want to use that tokenizer but
+change the character or token filters, specify `"tokenizer": "ml_standard"`
+in your `categorization_analyzer`. Additionally, the `ml_classic` tokenizer
+is available, which tokenizes in the same way as the non-customizable
+tokenizer in old versions of the product (before 6.2). `ml_classic` was
+the default categorization tokenizer in versions 6.2 to 7.13, so if you
+need categorization identical to the default for jobs created in these
+versions, specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.
 end::tokenizer[]
 
 tag::total-by-field-count[]
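For illustration of the "keep the ml_standard tokenizer but change the filters" advice in the tokenizer description above, here is a hedged sketch (not part of this commit) using the Builder methods that appear elsewhere in this diff; the custom stop-word list is purely hypothetical:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.xpack.core.ml.job.config.CategorizationAnalyzerConfig;

class CustomStandardAnalyzerExample {
    static CategorizationAnalyzerConfig standardWithCustomStopWords() {
        Map<String, Object> stopFilter = new HashMap<>();
        stopFilter.put("type", "stop");
        stopFilter.put("stopwords", Arrays.asList("DEBUG", "INFO", "WARN", "ERROR")); // hypothetical list
        return new CategorizationAnalyzerConfig.Builder()
            .addCharFilter("first_non_blank_line") // keep the new default char filter
            .setTokenizer("ml_standard")           // keep the new default tokenizer
            .addTokenFilter(stopFilter)            // but replace the day/month stop-word filter
            .build();
    }
}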
@@ -117,9 +117,12 @@ tasks.named("yamlRestCompatTest").configure {
     'ml/datafeeds_crud/Test update datafeed to point to job already attached to another datafeed',
     'ml/datafeeds_crud/Test update datafeed to point to missing job',
     'ml/job_cat_apis/Test cat anomaly detector jobs',
+    'ml/jobs_crud/Test update job',
     'ml/jobs_get_stats/Test get job stats after uploading data prompting the creation of some stats',
     'ml/jobs_get_stats/Test get job stats for closed job',
     'ml/jobs_get_stats/Test no exception on get job stats with missing index',
+    // TODO: the ml_info mute can be removed from master once the ml_standard tokenizer is in 7.x
+    'ml/ml_info/Test ml info',
     'ml/post_data/Test POST data job api, flush, close and verify DataCounts doc',
     'ml/post_data/Test flush with skip_time',
     'ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled',
@@ -145,38 +145,39 @@ public class CategorizationAnalyzerConfig implements ToXContentFragment, Writeab
     }
 
     /**
-     * Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the ML C++
-     * code do. This is the default analyzer for categorization to ensure that people upgrading from previous versions
+     * Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the original ML
+     * C++ code do. This is the default analyzer for categorization to ensure that people upgrading from old versions
      * get the same behaviour from their categorization jobs before and after upgrade.
      * @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>.
     * @return The default categorization analyzer.
      */
     public static CategorizationAnalyzerConfig buildDefaultCategorizationAnalyzer(List<String> categorizationFilters) {
 
-        CategorizationAnalyzerConfig.Builder builder = new CategorizationAnalyzerConfig.Builder();
-
-        if (categorizationFilters != null) {
-            for (String categorizationFilter : categorizationFilters) {
-                Map<String, Object> charFilter = new HashMap<>();
-                charFilter.put("type", "pattern_replace");
-                charFilter.put("pattern", categorizationFilter);
-                builder.addCharFilter(charFilter);
-            }
-        }
-
-        builder.setTokenizer("ml_classic");
-
-        Map<String, Object> tokenFilter = new HashMap<>();
-        tokenFilter.put("type", "stop");
-        tokenFilter.put("stopwords", Arrays.asList(
-            "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
-            "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
-            "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
-            "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
-            "GMT", "UTC"));
-        builder.addTokenFilter(tokenFilter);
-
-        return builder.build();
+        return new CategorizationAnalyzerConfig.Builder()
+            .addCategorizationFilters(categorizationFilters)
+            .setTokenizer("ml_classic")
+            .addDateWordsTokenFilter()
+            .build();
+    }
+
+    /**
+     * Create a <code>categorization_analyzer</code> that will be used for newly created jobs where no categorization
+     * analyzer is explicitly provided. This analyzer differs from the default one in that it uses the <code>ml_standard</code>
+     * tokenizer instead of the <code>ml_classic</code> tokenizer, and it only considers the first non-blank line of each message.
+     * This analyzer is <em>not</em> used for jobs that specify no categorization analyzer, as that would break jobs that were
+     * originally run in older versions. Instead, this analyzer is explicitly added to newly created jobs once the entire cluster
+     * is upgraded to version 7.14 or above.
+     * @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>.
+     * @return The standard categorization analyzer.
+     */
+    public static CategorizationAnalyzerConfig buildStandardCategorizationAnalyzer(List<String> categorizationFilters) {
+
+        return new CategorizationAnalyzerConfig.Builder()
+            .addCharFilter("first_non_blank_line")
+            .addCategorizationFilters(categorizationFilters)
+            .setTokenizer("ml_standard")
+            .addDateWordsTokenFilter()
+            .build();
     }
 
     private final String analyzer;
@@ -311,6 +312,18 @@ public class CategorizationAnalyzerConfig implements ToXContentFragment, Writeab
             return this;
         }
 
+        public Builder addCategorizationFilters(List<String> categorizationFilters) {
+            if (categorizationFilters != null) {
+                for (String categorizationFilter : categorizationFilters) {
+                    Map<String, Object> charFilter = new HashMap<>();
+                    charFilter.put("type", "pattern_replace");
+                    charFilter.put("pattern", categorizationFilter);
+                    addCharFilter(charFilter);
+                }
+            }
+            return this;
+        }
+
         public Builder setTokenizer(String tokenizer) {
             this.tokenizer = new NameOrDefinition(tokenizer);
             return this;
@@ -331,6 +344,19 @@ public class CategorizationAnalyzerConfig implements ToXContentFragment, Writeab
             return this;
         }
 
+        Builder addDateWordsTokenFilter() {
+            Map<String, Object> tokenFilter = new HashMap<>();
+            tokenFilter.put("type", "stop");
+            tokenFilter.put("stopwords", Arrays.asList(
+                "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
+                "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
+                "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
+                "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
+                "GMT", "UTC"));
+            addTokenFilter(tokenFilter);
+            return this;
+        }
+
         /**
          * Create a config validating only structure, not exact analyzer/tokenizer/filter names
          */
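A short usage sketch (illustrative, not part of the diff) of the new factory method above, showing how any existing categorization_filters end up as pattern_replace char filters applied after first_non_blank_line:

import java.util.Collections;

import org.elasticsearch.xpack.core.ml.job.config.CategorizationAnalyzerConfig;

class StandardAnalyzerUsageExample {
    static CategorizationAnalyzerConfig forFilteredJob() {
        // "query: .*" stands in for a categorization filter carried over from the job's analysis_config.
        // The result is an analyzer with char filters [first_non_blank_line, pattern_replace("query: .*")],
        // the ml_standard tokenizer, and the day/month stop-word token filter added by addDateWordsTokenFilter().
        return CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(Collections.singletonList("query: .*"));
    }
}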
@@ -17,9 +17,11 @@ restResources {
 
 tasks.named("yamlRestTest").configure {
   systemProperty 'tests.rest.blacklist', [
-    // Remove this test because it doesn't call an ML endpoint and we don't want
+    // Remove these tests because they don't call an ML endpoint and we don't want
     // to grant extra permissions to the users used in this test suite
     'ml/ml_classic_analyze/Test analyze API with an analyzer that does what we used to do in native code',
+    'ml/ml_standard_analyze/Test analyze API with the standard 7.14 ML analyzer',
+    'ml/ml_standard_analyze/Test 7.14 analyzer with blank lines',
     // Remove tests that are expected to throw an exception, because we cannot then
     // know whether to expect an authorization exception or a validation exception
     'ml/calendar_crud/Test get calendar given missing',
@@ -45,6 +45,7 @@ import org.elasticsearch.common.util.concurrent.EsExecutors;
 import org.elasticsearch.common.xcontent.NamedXContentRegistry;
 import org.elasticsearch.env.Environment;
 import org.elasticsearch.env.NodeEnvironment;
+import org.elasticsearch.index.analysis.CharFilterFactory;
 import org.elasticsearch.index.analysis.TokenizerFactory;
 import org.elasticsearch.indices.SystemIndexDescriptor;
 import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
@@ -264,6 +265,8 @@ import org.elasticsearch.xpack.ml.inference.persistence.TrainedModelProvider;
 import org.elasticsearch.xpack.ml.job.JobManager;
 import org.elasticsearch.xpack.ml.job.JobManagerHolder;
 import org.elasticsearch.xpack.ml.job.UpdateJobProcessNotifier;
+import org.elasticsearch.xpack.ml.job.categorization.FirstNonBlankLineCharFilter;
+import org.elasticsearch.xpack.ml.job.categorization.FirstNonBlankLineCharFilterFactory;
 import org.elasticsearch.xpack.ml.job.categorization.MlClassicTokenizer;
 import org.elasticsearch.xpack.ml.job.categorization.MlClassicTokenizerFactory;
 import org.elasticsearch.xpack.ml.job.categorization.MlStandardTokenizer;
@@ -1076,6 +1079,10 @@ public class MachineLearning extends Plugin implements SystemIndexPlugin,
         return Arrays.asList(jobComms, utility, datafeed);
     }
 
+    public Map<String, AnalysisProvider<CharFilterFactory>> getCharFilters() {
+        return Collections.singletonMap(FirstNonBlankLineCharFilter.NAME, FirstNonBlankLineCharFilterFactory::new);
+    }
+
     @Override
     public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
         return Map.of(MlClassicTokenizer.NAME, MlClassicTokenizerFactory::new,
@@ -98,7 +98,7 @@ public class TransportMlInfoAction extends HandledTransportAction<MlInfoAction.R
             Job.DEFAULT_DAILY_MODEL_SNAPSHOT_RETENTION_AFTER_DAYS);
         try {
             defaults.put(CategorizationAnalyzerConfig.CATEGORIZATION_ANALYZER.getPreferredName(),
-                CategorizationAnalyzerConfig.buildDefaultCategorizationAnalyzer(Collections.emptyList())
+                CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(Collections.emptyList())
                     .asMap(xContentRegistry).get(CategorizationAnalyzerConfig.CATEGORIZATION_ANALYZER.getPreferredName()));
         } catch (IOException e) {
             logger.error("failed to convert default categorization analyzer to map", e);
@@ -10,6 +10,7 @@ import org.apache.logging.log4j.LogManager;
 import org.apache.logging.log4j.Logger;
 import org.elasticsearch.ResourceAlreadyExistsException;
 import org.elasticsearch.ResourceNotFoundException;
+import org.elasticsearch.Version;
 import org.elasticsearch.action.ActionListener;
 import org.elasticsearch.action.index.IndexResponse;
 import org.elasticsearch.action.support.WriteRequest;
@@ -39,6 +40,7 @@ import org.elasticsearch.xpack.core.ml.MlTasks;
 import org.elasticsearch.xpack.core.ml.action.PutJobAction;
 import org.elasticsearch.xpack.core.ml.action.RevertModelSnapshotAction;
 import org.elasticsearch.xpack.core.ml.action.UpdateJobAction;
+import org.elasticsearch.xpack.core.ml.job.config.AnalysisConfig;
 import org.elasticsearch.xpack.core.ml.job.config.AnalysisLimits;
 import org.elasticsearch.xpack.core.ml.job.config.CategorizationAnalyzerConfig;
 import org.elasticsearch.xpack.core.ml.job.config.DataDescription;
@@ -85,6 +87,8 @@ import java.util.regex.Pattern;
  */
 public class JobManager {
 
+    private static final Version MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER = Version.V_7_14_0;
+
     private static final Logger logger = LogManager.getLogger(JobManager.class);
     private static final DeprecationLogger deprecationLogger = DeprecationLogger.getLogger(JobManager.class);
 
@@ -220,17 +224,31 @@ public class JobManager {
 
     /**
      * Validate the char filter/tokenizer/token filter names used in the categorization analyzer config (if any).
-     * This validation has to be done server-side; it cannot be done in a client as that won't have loaded the
-     * appropriate analysis modules/plugins.
-     * The overall structure can be validated at parse time, but the exact names need to be checked separately,
-     * as plugins that provide the functionality can be installed/uninstalled.
+     * If the user has not provided a categorization analyzer then set the standard one if categorization is
+     * being used at all and all the nodes in the cluster are running a version that will understand it. This
+     * method must only be called when a job is first created - since it applies a default if it were to be
+     * called after that it could change the meaning of a job that has already run. The validation in this
+     * method has to be done server-side; it cannot be done in a client as that won't have loaded the appropriate
+     * analysis modules/plugins. (The overall structure can be validated at parse time, but the exact names need
+     * to be checked separately, as plugins that provide the functionality can be installed/uninstalled.)
      */
-    static void validateCategorizationAnalyzer(Job.Builder jobBuilder, AnalysisRegistry analysisRegistry)
-        throws IOException {
-        CategorizationAnalyzerConfig categorizationAnalyzerConfig = jobBuilder.getAnalysisConfig().getCategorizationAnalyzerConfig();
+    static void validateCategorizationAnalyzerOrSetDefault(Job.Builder jobBuilder, AnalysisRegistry analysisRegistry,
+                                                           Version minNodeVersion) throws IOException {
+        AnalysisConfig analysisConfig = jobBuilder.getAnalysisConfig();
+        CategorizationAnalyzerConfig categorizationAnalyzerConfig = analysisConfig.getCategorizationAnalyzerConfig();
         if (categorizationAnalyzerConfig != null) {
             CategorizationAnalyzer.verifyConfigBuilder(new CategorizationAnalyzerConfig.Builder(categorizationAnalyzerConfig),
                 analysisRegistry);
+        } else if (analysisConfig.getCategorizationFieldName() != null
+            && minNodeVersion.onOrAfter(MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER)) {
+            // Any supplied categorization filters are transferred into the new categorization analyzer.
+            // The user supplied categorization filters will already have been validated when the put job
+            // request was built, so we know they're valid.
+            AnalysisConfig.Builder analysisConfigBuilder = new AnalysisConfig.Builder(analysisConfig)
+                .setCategorizationAnalyzerConfig(
+                    CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(analysisConfig.getCategorizationFilters()))
+                .setCategorizationFilters(null);
+            jobBuilder.setAnalysisConfig(analysisConfigBuilder);
         }
     }
 
@@ -240,10 +258,12 @@ public class JobManager {
     public void putJob(PutJobAction.Request request, AnalysisRegistry analysisRegistry, ClusterState state,
                        ActionListener<PutJobAction.Response> actionListener) throws IOException {
 
+        Version minNodeVersion = state.getNodes().getMinNodeVersion();
+
         Job.Builder jobBuilder = request.getJobBuilder();
         jobBuilder.validateAnalysisLimitsAndSetDefaults(maxModelMemoryLimit);
         jobBuilder.validateModelSnapshotRetentionSettingsAndSetDefaults();
-        validateCategorizationAnalyzer(jobBuilder, analysisRegistry);
+        validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, minNodeVersion);
 
         Job job = jobBuilder.build(new Date());
 
@@ -20,10 +20,14 @@ public abstract class AbstractMlTokenizer extends Tokenizer {
     protected final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
     protected final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
 
+    /**
+     * The internal offset stores the offset in the potentially filtered input to the tokenizer.
+     * This must be corrected before setting the offset attribute for user-visible output.
+     */
     protected int nextOffset;
     protected int skippedPositions;
 
-    AbstractMlTokenizer() {
+    protected AbstractMlTokenizer() {
     }
 
     @Override
@@ -31,7 +35,8 @@ public abstract class AbstractMlTokenizer extends Tokenizer {
         super.end();
         // Set final offset
         int finalOffset = nextOffset + (int) input.skip(Integer.MAX_VALUE);
-        offsetAtt.setOffset(finalOffset, finalOffset);
+        int correctedFinalOffset = correctOffset(finalOffset);
+        offsetAtt.setOffset(correctedFinalOffset, correctedFinalOffset);
         // Adjust any skipped tokens
         posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
     }
@@ -0,0 +1,98 @@
+/*
+ * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
+ * or more contributor license agreements. Licensed under the Elastic License
+ * 2.0; you may not use this file except in compliance with the Elastic License
+ * 2.0.
+ */
+
+package org.elasticsearch.xpack.ml.job.categorization;
+
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+
+/**
+ * A character filter that keeps the first non-blank line in the input, and discards everything before and after it.
+ * Treats both <code>\n</code> and <code>\r\n</code> as line endings. If there is a line ending at the end of the
+ * first non-blank line this is discarded. A line is considered blank if {@link Character#isWhitespace} returns
+ * <code>true</code> for all the characters in it.
+ *
+ * It is possible to achieve the same effect with a <code>pattern_replace</code> filter, but since this filter
+ * needs to be run on every single message to be categorized it is worth having a more performant specialization.
+ */
+public class FirstNonBlankLineCharFilter extends BaseCharFilter {
+
+    public static final String NAME = "first_non_blank_line";
+
+    private Reader transformedInput;
+
+    FirstNonBlankLineCharFilter(Reader in) {
+        super(in);
+    }
+
+    @Override
+    public int read(char[] cbuf, int off, int len) throws IOException {
+        // Buffer all input on the first call.
+        if (transformedInput == null) {
+            fill();
+        }
+
+        return transformedInput.read(cbuf, off, len);
+    }
+
+    @Override
+    public int read() throws IOException {
+        if (transformedInput == null) {
+            fill();
+        }
+
+        return transformedInput.read();
+    }
+
+    private void fill() throws IOException {
+        StringBuilder buffered = new StringBuilder();
+        char[] temp = new char[1024];
+        for (int cnt = input.read(temp); cnt > 0; cnt = input.read(temp)) {
+            buffered.append(temp, 0, cnt);
+        }
+        transformedInput = new StringReader(process(buffered).toString());
+    }
+
+    private CharSequence process(CharSequence input) {
+
+        boolean seenNonWhitespaceChar = false;
+        int prevNewlineIndex = -1;
+        int endIndex = -1;
+
+        for (int index = 0; index < input.length(); ++index) {
+            if (input.charAt(index) == '\n') {
+                if (seenNonWhitespaceChar) {
+                    // With Windows line endings chop the \r as well as the \n
+                    endIndex = (input.charAt(index - 1) == '\r') ? (index - 1) : index;
+                    break;
+                }
+                prevNewlineIndex = index;
+            } else {
+                seenNonWhitespaceChar = seenNonWhitespaceChar || Character.isWhitespace(input.charAt(index)) == false;
+            }
+        }
+
+        if (seenNonWhitespaceChar == false) {
+            return "";
+        }
+
+        if (endIndex == -1) {
+            if (prevNewlineIndex == -1) {
+                // This is pretty likely, as most log messages _aren't_ multiline, so worth optimising
+                // for even though the return at the end of the method would be functionally identical
+                return input;
+            }
+            endIndex = input.length();
+        }
+
+        addOffCorrectMap(0, prevNewlineIndex + 1);
+        return input.subSequence(prevNewlineIndex + 1, endIndex);
+    }
+}
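A hedged sketch (not part of the commit) of how the char filter above behaves on a multi-line message. It assumes same-package access, since the constructor is package-private:

// package org.elasticsearch.xpack.ml.job.categorization; (assumed, for access to the package-private constructor)

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class FirstNonBlankLineExample {
    static String firstNonBlankLine(String message) throws IOException {
        Reader filtered = new FirstNonBlankLineCharFilter(new StringReader(message));
        return new BufferedReader(filtered).readLine(); // only the first non-blank line survives the filter
    }

    public static void main(String[] args) throws IOException {
        // Leading blank lines and the stack-trace-style continuation lines are discarded.
        System.out.println(firstNonBlankLine("\n\nERROR out of memory\n\tat Foo.bar(Foo.java:1)\n"));
        // prints: ERROR out of memory
    }
}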
@@ -0,0 +1,27 @@
+/*
+ * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
+ * or more contributor license agreements. Licensed under the Elastic License
+ * 2.0; you may not use this file except in compliance with the Elastic License
+ * 2.0.
+ */
+
+package org.elasticsearch.xpack.ml.job.categorization;
+
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.env.Environment;
+import org.elasticsearch.index.IndexSettings;
+import org.elasticsearch.index.analysis.AbstractCharFilterFactory;
+
+import java.io.Reader;
+
+public class FirstNonBlankLineCharFilterFactory extends AbstractCharFilterFactory {
+
+    public FirstNonBlankLineCharFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {
+        super(indexSettings, name);
+    }
+
+    @Override
+    public Reader create(Reader tokenStream) {
+        return new FirstNonBlankLineCharFilter(tokenStream);
+    }
+}
@@ -84,7 +84,7 @@ public class MlClassicTokenizer extends AbstractMlTokenizer {
 
         // Characters that may exist in the term attribute beyond its defined length are ignored
         termAtt.setLength(length);
-        offsetAtt.setOffset(start, start + length);
+        offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
         posIncrAtt.setPositionIncrement(skippedPositions + 1);
 
         return true;
@@ -136,7 +136,7 @@ public class MlStandardTokenizer extends AbstractMlTokenizer {
 
         // Characters that may exist in the term attribute beyond its defined length are ignored
         termAtt.setLength(length);
-        offsetAtt.setOffset(start, start + length);
+        offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
         posIncrAtt.setPositionIncrement(skippedPositions + 1);
 
         return true;
@@ -49,6 +49,7 @@ import org.elasticsearch.xpack.core.ml.action.PutJobAction;
 import org.elasticsearch.xpack.core.ml.action.UpdateJobAction;
 import org.elasticsearch.xpack.core.action.util.QueryPage;
 import org.elasticsearch.xpack.core.ml.job.config.AnalysisConfig;
+import org.elasticsearch.xpack.core.ml.job.config.CategorizationAnalyzerConfig;
 import org.elasticsearch.xpack.core.ml.job.config.DataDescription;
 import org.elasticsearch.xpack.core.ml.job.config.DetectionRule;
 import org.elasticsearch.xpack.core.ml.job.config.Detector;
@@ -91,6 +92,7 @@ import static org.hamcrest.Matchers.hasSize;
 import static org.hamcrest.Matchers.instanceOf;
 import static org.hamcrest.Matchers.is;
 import static org.hamcrest.Matchers.lessThanOrEqualTo;
+import static org.hamcrest.Matchers.nullValue;
 import static org.mockito.Matchers.any;
 import static org.mockito.Mockito.doAnswer;
 import static org.mockito.Mockito.mock;
@@ -584,10 +586,68 @@ public class JobManagerTests extends ESTestCase {
         assertThat(capturedUpdateParams.get(1).isUpdateScheduledEvents(), is(true));
     }
 
+    public void testValidateCategorizationAnalyzer_GivenValid() throws IOException {
+
+        List<String> categorizationFilters = randomBoolean() ? Collections.singletonList("query: .*") : null;
+        CategorizationAnalyzerConfig c = CategorizationAnalyzerConfig.buildDefaultCategorizationAnalyzer(categorizationFilters);
+        Job.Builder jobBuilder = createCategorizationJob(c, null);
+        JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.CURRENT);
+
+        Job job = jobBuilder.build(new Date());
+        assertThat(job.getAnalysisConfig().getCategorizationAnalyzerConfig(),
+            equalTo(CategorizationAnalyzerConfig.buildDefaultCategorizationAnalyzer(categorizationFilters)));
+    }
+
+    public void testValidateCategorizationAnalyzer_GivenInvalid() {
+
+        CategorizationAnalyzerConfig c = new CategorizationAnalyzerConfig.Builder().setAnalyzer("does_not_exist").build();
+        Job.Builder jobBuilder = createCategorizationJob(c, null);
+        IllegalArgumentException e = expectThrows(IllegalArgumentException.class,
+            () -> JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.CURRENT));
+
+        assertThat(e.getMessage(), equalTo("Failed to find global analyzer [does_not_exist]"));
+    }
+
+    public void testSetDefaultCategorizationAnalyzer_GivenAllNewNodes() throws IOException {
+
+        List<String> categorizationFilters = randomBoolean() ? Collections.singletonList("query: .*") : null;
+        Job.Builder jobBuilder = createCategorizationJob(null, categorizationFilters);
+        JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.CURRENT);
+
+        Job job = jobBuilder.build(new Date());
+        assertThat(job.getAnalysisConfig().getCategorizationAnalyzerConfig(),
+            equalTo(CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(categorizationFilters)));
+    }
+
+    // TODO: This test can be deleted from branches that would never have to talk to a 7.13 node
+    public void testSetDefaultCategorizationAnalyzer_GivenOldNodeInCluster() throws IOException {
+
+        List<String> categorizationFilters = randomBoolean() ? Collections.singletonList("query: .*") : null;
+        Job.Builder jobBuilder = createCategorizationJob(null, categorizationFilters);
+        JobManager.validateCategorizationAnalyzerOrSetDefault(jobBuilder, analysisRegistry, Version.V_7_13_0);
+
+        Job job = jobBuilder.build(new Date());
+        assertThat(job.getAnalysisConfig().getCategorizationAnalyzerConfig(), nullValue());
+    }
+
+    private Job.Builder createCategorizationJob(CategorizationAnalyzerConfig categorizationAnalyzerConfig,
+                                                List<String> categorizationFilters) {
+        Detector.Builder d = new Detector.Builder("count", null).setByFieldName("mlcategory");
+        AnalysisConfig.Builder ac = new AnalysisConfig.Builder(Collections.singletonList(d.build()))
+            .setCategorizationFieldName("message")
+            .setCategorizationAnalyzerConfig(categorizationAnalyzerConfig)
+            .setCategorizationFilters(categorizationFilters);
+
+        Job.Builder builder = new Job.Builder();
+        builder.setId("cat");
+        builder.setAnalysisConfig(ac);
+        builder.setDataDescription(new DataDescription.Builder());
+        return builder;
+    }
+
     private Job.Builder createJob() {
-        Detector.Builder d1 = new Detector.Builder("info_content", "domain");
-        d1.setOverFieldName("client");
-        AnalysisConfig.Builder ac = new AnalysisConfig.Builder(Collections.singletonList(d1.build()));
+        Detector.Builder d = new Detector.Builder("info_content", "domain").setOverFieldName("client");
+        AnalysisConfig.Builder ac = new AnalysisConfig.Builder(Collections.singletonList(d.build()));
 
         Job.Builder builder = new Job.Builder();
         builder.setId("foo");
@@ -25,6 +25,29 @@ import java.util.Map;
 
 public class CategorizationAnalyzerTests extends ESTestCase {
 
+    private static final String NGINX_ERROR_EXAMPLE =
+        "a client request body is buffered to a temporary file /tmp/client-body/0000021894, client: 10.8.0.12, " +
+        "server: apm.35.205.226.121.ip.es.io, request: \"POST /intake/v2/events HTTP/1.1\", host: \"apm.35.205.226.121.ip.es.io\"\n" +
+        "10.8.0.12 - - [29/Nov/2020:21:34:55 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" " +
+        "\"elasticapm-dotnet/1.5.1 System.Net.Http/4.6.28208.02 .NET_Core/2.2.8\" 27821 0.002 [default-apm-apm-server-8200] [] " +
+        "10.8.1.19:8200 0 0.001 202 f961c776ff732f5c8337530aa22c7216\n" +
+        "10.8.0.14 - - [29/Nov/2020:21:34:56 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" " +
+        "\"elasticapm-python/5.10.0\" 3594 0.002 [default-apm-apm-server-8200] [] 10.8.1.18:8200 0 0.001 202 " +
+        "61feb8fb9232b1ebe54b588b95771ce4\n" +
+        "10.8.4.90 - - [29/Nov/2020:21:34:56 +0000] \"OPTIONS /intake/v2/rum/events HTTP/2.0\" 200 0 " +
+        "\"http://opbeans-frontend:3000/dashboard\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " +
+        "Cypress/3.3.1 Chrome/61.0.3163.100 Electron/2.0.18 Safari/537.36\" 292 0.001 [default-apm-apm-server-8200] [] " +
+        "10.8.1.19:8200 0 0.000 200 5fbe8cd4d217b932def1c17ed381c66b\n" +
+        "10.8.4.90 - - [29/Nov/2020:21:34:56 +0000] \"POST /intake/v2/rum/events HTTP/2.0\" 202 0 " +
+        "\"http://opbeans-frontend:3000/dashboard\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " +
+        "Cypress/3.3.1 Chrome/61.0.3163.100 Electron/2.0.18 Safari/537.36\" 3004 0.001 [default-apm-apm-server-8200] [] " +
+        "10.8.1.18:8200 0 0.001 202 4735f571928595744ac6a9545c3ecdf5\n" +
+        "10.8.0.11 - - [29/Nov/2020:21:34:56 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" " +
+        "\"elasticapm-node/3.8.0 elastic-apm-http-client/9.4.2 node/12.20.0\" 4913 10.006 [default-apm-apm-server-8200] [] " +
+        "10.8.1.18:8200 0 0.002 202 1eac41789ea9a60a8be4e476c54cbbc9\n" +
+        "10.8.0.14 - - [29/Nov/2020:21:34:57 +0000] \"POST /intake/v2/events HTTP/1.1\" 202 0 \"-\" \"elasticapm-python/5.10.0\" 1025 " +
+        "0.001 [default-apm-apm-server-8200] [] 10.8.1.18:8200 0 0.001 202 d27088936cadd3b8804b68998a5f94fa";
+
     private AnalysisRegistry analysisRegistry;
 
     public static AnalysisRegistry buildTestAnalysisRegistry(Environment environment) throws Exception {
@@ -218,6 +241,19 @@ public class CategorizationAnalyzerTests extends ESTestCase {
         assertEquals(Arrays.asList("PSYoungGen", "total", "used"),
             categorizationAnalyzer.tokenizeField("java",
                 "PSYoungGen total 2572800K, used 1759355K [0x0000000759500000, 0x0000000800000000, 0x0000000800000000)"));
+
+        assertEquals(Arrays.asList("client", "request", "body", "is", "buffered", "to", "temporary", "file", "tmp", "client-body",
+            "client", "server", "apm.35.205.226.121.ip.es.io", "request", "POST", "intake", "v2", "events", "HTTP", "host",
+            "apm.35.205.226.121.ip.es.io", "POST", "intake", "v2", "events", "HTTP", "elasticapm-dotnet", "System.Net.Http", "NET_Core",
+            "default-apm-apm-server-8200", "POST", "intake", "v2", "events", "HTTP", "elasticapm-python", "default-apm-apm-server-8200",
+            "OPTIONS", "intake", "v2", "rum", "events", "HTTP", "http", "opbeans-frontend", "dashboard", "Mozilla", "X11", "Linux",
+            "x86_64", "AppleWebKit", "KHTML", "like", "Gecko", "Cypress", "Chrome", "Electron", "Safari", "default-apm-apm-server-8200",
+            "POST", "intake", "v2", "rum", "events", "HTTP", "http", "opbeans-frontend", "dashboard", "Mozilla", "X11", "Linux",
+            "x86_64", "AppleWebKit", "KHTML", "like", "Gecko", "Cypress", "Chrome", "Electron", "Safari", "default-apm-apm-server-8200",
+            "POST", "intake", "v2", "events", "HTTP", "elasticapm-node", "elastic-apm-http-client", "node",
+            "default-apm-apm-server-8200", "POST", "intake", "v2", "events", "HTTP", "elasticapm-python",
+            "default-apm-apm-server-8200"),
+            categorizationAnalyzer.tokenizeField("nginx_error", NGINX_ERROR_EXAMPLE));
         }
     }
 
@ -251,6 +287,51 @@ public class CategorizationAnalyzerTests extends ESTestCase {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
public void testMlStandardCategorizationAnalyzer() throws IOException {
|
||||||
|
CategorizationAnalyzerConfig standardConfig = CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(null);
|
||||||
|
try (CategorizationAnalyzer categorizationAnalyzer = new CategorizationAnalyzer(analysisRegistry, standardConfig)) {
|
||||||
|
|
||||||
|
assertEquals(Arrays.asList("ml13-4608.1.p2ps", "Info", "Source", "ML_SERVICE2", "on", "has", "shut", "down"),
|
||||||
|
categorizationAnalyzer.tokenizeField("p2ps",
|
||||||
|
"<ml13-4608.1.p2ps: Info: > Source ML_SERVICE2 on 13122:867 has shut down."));
|
||||||
|
|
||||||
|
assertEquals(Arrays.asList("Vpxa", "verbose", "VpxaHalCnxHostagent", "opID", "WFU-ddeadb59", "WaitForUpdatesDone", "Received",
|
||||||
|
"callback"),
|
||||||
|
categorizationAnalyzer.tokenizeField("vmware",
|
||||||
|
"Vpxa: [49EC0B90 verbose 'VpxaHalCnxHostagent' opID=WFU-ddeadb59] [WaitForUpdatesDone] Received callback"));
|
||||||
|
|
||||||
|
assertEquals(Arrays.asList("org.apache.coyote.http11.Http11BaseProtocol", "destroy"),
|
||||||
|
categorizationAnalyzer.tokenizeField("apache",
|
||||||
|
"org.apache.coyote.http11.Http11BaseProtocol destroy"));
|
||||||
|
|
||||||
|
assertEquals(Arrays.asList("INFO", "session", "PROXY", "Session", "DESTROYED"),
|
||||||
|
categorizationAnalyzer.tokenizeField("proxy",
|
||||||
|
" [1111529792] INFO session <45409105041220090733@192.168.251.123> - " +
|
||||||
|
"----------------- PROXY Session DESTROYED --------------------"));
|
||||||
|
|
||||||
|
assertEquals(Arrays.asList("PSYoungGen", "total", "used"),
|
||||||
|
categorizationAnalyzer.tokenizeField("java",
|
||||||
|
"PSYoungGen total 2572800K, used 1759355K [0x0000000759500000, 0x0000000800000000, 0x0000000800000000)"));
|
||||||
|
|
||||||
|
assertEquals(Arrays.asList("first", "line"),
|
||||||
|
categorizationAnalyzer.tokenizeField("multiline", "first line\nsecond line\nthird line"));
|
||||||
|
|
||||||
|
assertEquals(Arrays.asList("first", "line"),
|
||||||
|
categorizationAnalyzer.tokenizeField("windows_multiline", "first line\r\nsecond line\r\nthird line"));
|
||||||
|
|
||||||
|
assertEquals(Arrays.asList("second", "line"),
|
||||||
|
categorizationAnalyzer.tokenizeField("multiline_first_blank", "\nsecond line\nthird line"));
|
||||||
|
|
||||||
|
assertEquals(Arrays.asList("second", "line"),
|
||||||
|
categorizationAnalyzer.tokenizeField("windows_multiline_first_blank", "\r\nsecond line\r\nthird line"));
|
||||||
|
|
||||||
|
assertEquals(Arrays.asList("client", "request", "body", "is", "buffered", "to", "temporary", "file",
|
||||||
|
"/tmp/client-body/0000021894", "client", "server", "apm.35.205.226.121.ip.es.io", "request", "POST", "/intake/v2/events",
|
||||||
|
"HTTP/1.1", "host", "apm.35.205.226.121.ip.es.io"),
|
||||||
|
categorizationAnalyzer.tokenizeField("nginx_error", NGINX_ERROR_EXAMPLE));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// The Elasticsearch standard analyzer - this is the default for indexing in Elasticsearch, but
|
// The Elasticsearch standard analyzer - this is the default for indexing in Elasticsearch, but
|
||||||
// NOT for ML categorization (and you'll see why if you look at the expected results of this test!)
|
// NOT for ML categorization (and you'll see why if you look at the expected results of this test!)
|
||||||
public void testStandardAnalyzer() throws IOException {
|
public void testStandardAnalyzer() throws IOException {
|
||||||
|
|
|
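Both nginx_error expectations above tokenize the same NGINX_ERROR_EXAMPLE message: with ml_classic the path segments come out as separate tokens such as "tmp" and "client-body", while ml_standard combined with first_non_blank_line keeps "/tmp/client-body/0000021894" and "/intake/v2/events" whole and only considers the first non-blank line. A sketch of how that could be asserted against the first line alone, assuming it tokenizes the same way in isolation; this extra test method is hypothetical and not part of the change:

    // Hypothetical addition for illustration only: feed just the first line of
    // NGINX_ERROR_EXAMPLE to the 7.14 default analyzer and expect the same tokens that
    // testMlStandardCategorizationAnalyzer sees for the full multi-line message.
    public void testMlStandardKeepsPathsOnFirstLine() throws IOException {
        String firstLine = "a client request body is buffered to a temporary file /tmp/client-body/0000021894, client: 10.8.0.12, " +
            "server: apm.35.205.226.121.ip.es.io, request: \"POST /intake/v2/events HTTP/1.1\", host: \"apm.35.205.226.121.ip.es.io\"";
        CategorizationAnalyzerConfig standardConfig = CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(null);
        try (CategorizationAnalyzer categorizationAnalyzer = new CategorizationAnalyzer(analysisRegistry, standardConfig)) {
            assertEquals(Arrays.asList("client", "request", "body", "is", "buffered", "to", "temporary", "file",
                "/tmp/client-body/0000021894", "client", "server", "apm.35.205.226.121.ip.es.io", "request", "POST", "/intake/v2/events",
                "HTTP/1.1", "host", "apm.35.205.226.121.ip.es.io"),
                categorizationAnalyzer.tokenizeField("nginx_error", firstLine));
        }
    }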
@@ -0,0 +1,128 @@
+/*
+ * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
+ * or more contributor license agreements. Licensed under the Elastic License
+ * 2.0; you may not use this file except in compliance with the Elastic License
+ * 2.0.
+ */
+
+package org.elasticsearch.xpack.ml.job.categorization;
+
+import org.elasticsearch.test.ESTestCase;
+
+import java.io.IOException;
+import java.io.StringReader;
+
+import static org.hamcrest.Matchers.equalTo;
+
+public class FirstNonBlankLineCharFilterTests extends ESTestCase {
+
+    public void testEmpty() throws IOException {
+        String input = "";
+        FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
+
+        assertThat(filter.read(), equalTo(-1));
+    }
+
+    public void testAllBlankOneLine() throws IOException {
+        String input = "\t";
+        if (randomBoolean()) {
+            input = " " + input;
+        }
+        if (randomBoolean()) {
+            input = input + " ";
+        }
+        FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
+
+        assertThat(filter.read(), equalTo(-1));
+    }
+
+    public void testNonBlankNoNewlines() throws IOException {
+        String input = "the quick brown fox jumped over the lazy dog";
+        if (randomBoolean()) {
+            input = " " + input;
+        }
+        if (randomBoolean()) {
+            input = input + " ";
+        }
+        FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
+
+        char[] output = new char[input.length()];
+        assertThat(filter.read(output, 0, output.length), equalTo(input.length()));
+        assertThat(filter.read(), equalTo(-1));
+        assertThat(new String(output), equalTo(input));
+    }
+
+    public void testAllBlankMultiline() throws IOException {
+        StringBuilder input = new StringBuilder();
+        String lineEnding = randomBoolean() ? "\n" : "\r\n";
+        for (int lineNum = randomIntBetween(2, 5); lineNum > 0; --lineNum) {
+            for (int charNum = randomIntBetween(0, 5); charNum > 0; --charNum) {
+                input.append(randomBoolean() ? " " : "\t");
+            }
+            if (lineNum > 1 || randomBoolean()) {
+                input.append(lineEnding);
+            }
+        }
+
+        FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input.toString()));
+
+        assertThat(filter.read(), equalTo(-1));
+    }
+
+    public void testNonBlankMultiline() throws IOException {
+        StringBuilder input = new StringBuilder();
+        String lineEnding = randomBoolean() ? "\n" : "\r\n";
+        for (int lineBeforeNum = randomIntBetween(2, 5); lineBeforeNum > 0; --lineBeforeNum) {
+            for (int charNum = randomIntBetween(0, 5); charNum > 0; --charNum) {
+                input.append(randomBoolean() ? " " : "\t");
+            }
+            input.append(lineEnding);
+        }
+        String lineToKeep = "the quick brown fox jumped over the lazy dog";
+        if (randomBoolean()) {
+            lineToKeep = " " + lineToKeep;
+        }
+        if (randomBoolean()) {
+            lineToKeep = lineToKeep + " ";
+        }
+        input.append(lineToKeep).append(lineEnding);
+        for (int lineAfterNum = randomIntBetween(2, 5); lineAfterNum > 0; --lineAfterNum) {
+            for (int charNum = randomIntBetween(0, 5); charNum > 0; --charNum) {
+                input.append(randomBoolean() ? " " : "more");
+            }
+            if (lineAfterNum > 1 || randomBoolean()) {
+                input.append(lineEnding);
+            }
+        }
+
+        FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input.toString()));
+
+        char[] output = new char[lineToKeep.length()];
+        assertThat(filter.read(output, 0, output.length), equalTo(lineToKeep.length()));
+        assertThat(filter.read(), equalTo(-1));
+        assertThat(new String(output), equalTo(lineToKeep));
+    }
+
+    public void testCorrect() throws IOException {
+        String input = " \nfirst line\nsecond line";
+        FirstNonBlankLineCharFilter filter = new FirstNonBlankLineCharFilter(new StringReader(input));
+
+        String expectedOutput = "first line";
+
+        char[] output = new char[expectedOutput.length()];
+        assertThat(filter.read(output, 0, output.length), equalTo(expectedOutput.length()));
+        assertThat(filter.read(), equalTo(-1));
+        assertThat(new String(output), equalTo(expectedOutput));
+
+        int expectedOutputIndex = input.indexOf(expectedOutput);
+        for (int i = 0; i <= expectedOutput.length(); ++i) {
+            assertThat(filter.correctOffset(i), equalTo(expectedOutputIndex + i));
+        }
+    }
+}
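The tests above pin down the externally visible contract of the new char filter: only the first line that contains a non-whitespace character is emitted, unchanged, and nothing is emitted if every line is blank; offsets are corrected back to positions in the original input. A minimal standalone sketch of that contract for reference, using plain string manipulation rather than a Lucene CharFilter and without offset correction; the class and method names here are hypothetical:

// Standalone sketch for illustration, not the production filter.
public final class FirstNonBlankLineSketch {

    // Return the first line that contains a non-whitespace character, unchanged,
    // or an empty string if every line is blank, mirroring the test expectations above.
    static String firstNonBlankLine(String input) {
        for (String line : input.split("\\r?\\n", -1)) {
            if (line.trim().isEmpty() == false) {
                return line;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(firstNonBlankLine(" \nfirst line\nsecond line")); // prints "first line"
        System.out.println(firstNonBlankLine("\t ").isEmpty());              // prints "true"
    }
}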
@@ -341,6 +341,12 @@
         }
       }
   - match: { job_id: "jobs-crud-update-job" }
+  - length: { analysis_config.categorization_analyzer.filter: 1 }
+  - match: { analysis_config.categorization_analyzer.tokenizer: "ml_standard" }
+  - length: { analysis_config.categorization_analyzer.char_filter: 3 }
+  - match: { analysis_config.categorization_analyzer.char_filter.0: "first_non_blank_line" }
+  - match: { analysis_config.categorization_analyzer.char_filter.1.pattern: "cat1.*" }
+  - match: { analysis_config.categorization_analyzer.char_filter.2.pattern: "cat2.*" }
 
   - do:
       ml.open_job:
@@ -381,7 +387,6 @@
             "background_persist_interval": "3h",
             "model_snapshot_retention_days": 30,
             "results_retention_days": 40,
-            "categorization_filters" : ["cat3.*"],
             "custom_settings": {
               "setting3": "custom3"
             }
@@ -392,7 +397,6 @@
   - match: { model_plot_config.enabled: false }
   - match: { model_plot_config.terms: "foobar" }
   - match: { model_plot_config.annotations_enabled: false }
-  - match: { analysis_config.categorization_filters: ["cat3.*"] }
   - match: { analysis_config.detectors.0.custom_rules.0.actions: ["skip_result"] }
   - length: { analysis_config.detectors.0.custom_rules.0.conditions: 1 }
   - match: { analysis_config.detectors.0.detector_index: 0 }
@@ -10,7 +10,7 @@ teardown:
 "Test ml info":
   - do:
       ml.info: {}
-  - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" }
+  - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
   - match: { defaults.anomaly_detectors.model_memory_limit: "1gb" }
   - match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
   - match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@@ -30,7 +30,7 @@ teardown:
 
   - do:
       ml.info: {}
-  - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" }
+  - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
   - match: { defaults.anomaly_detectors.model_memory_limit: "512mb" }
   - match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
   - match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@@ -50,7 +50,7 @@ teardown:
 
   - do:
       ml.info: {}
-  - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" }
+  - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
   - match: { defaults.anomaly_detectors.model_memory_limit: "1gb" }
   - match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
   - match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@@ -70,7 +70,7 @@ teardown:
 
   - do:
      ml.info: {}
-  - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" }
+  - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
   - match: { defaults.anomaly_detectors.model_memory_limit: "1gb" }
   - match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
   - match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@@ -90,7 +90,7 @@ teardown:
 
   - do:
      ml.info: {}
-  - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_classic" }
+  - match: { defaults.anomaly_detectors.categorization_analyzer.tokenizer: "ml_standard" }
   - match: { defaults.anomaly_detectors.model_memory_limit: "1mb" }
   - match: { defaults.anomaly_detectors.categorization_examples_limit: 4 }
   - match: { defaults.anomaly_detectors.model_snapshot_retention_days: 10 }
@@ -0,0 +1,115 @@
+---
+"Test analyze API with the standard 7.14 ML analyzer":
+  - do:
+      indices.analyze:
+        body: >
+          {
+            "char_filter" : [
+              "first_non_blank_line"
+            ],
+            "tokenizer" : "ml_standard",
+            "filter" : [
+              { "type" : "stop", "stopwords": [
+                "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
+                "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
+                "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
+                "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
+                "GMT", "UTC"
+              ] }
+            ],
+            "text" : "[elasticsearch] [2017-12-13T10:46:30,816][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.watcher-history-7-2017.12.13] creating index, cause [auto(bulk api)], templates [.watch-history-7], shards [1]/[1], mappings [doc]"
+          }
+  - match: { tokens.0.token: "elasticsearch" }
+  - match: { tokens.0.start_offset: 1 }
+  - match: { tokens.0.end_offset: 14 }
+  - match: { tokens.0.position: 0 }
+  - match: { tokens.1.token: "INFO" }
+  - match: { tokens.1.start_offset: 42 }
+  - match: { tokens.1.end_offset: 46 }
+  - match: { tokens.1.position: 5 }
+  - match: { tokens.2.token: "o.e.c.m.MetadataCreateIndexService" }
+  - match: { tokens.2.start_offset: 49 }
+  - match: { tokens.2.end_offset: 83 }
+  - match: { tokens.2.position: 6 }
+  - match: { tokens.3.token: "node-0" }
+  - match: { tokens.3.start_offset: 86 }
+  - match: { tokens.3.end_offset: 92 }
+  - match: { tokens.3.position: 7 }
+  - match: { tokens.4.token: "watcher-history-7-2017.12.13" }
+  - match: { tokens.4.start_offset: 96 }
+  - match: { tokens.4.end_offset: 124 }
+  - match: { tokens.4.position: 8 }
+  - match: { tokens.5.token: "creating" }
+  - match: { tokens.5.start_offset: 126 }
+  - match: { tokens.5.end_offset: 134 }
+  - match: { tokens.5.position: 9 }
+  - match: { tokens.6.token: "index" }
+  - match: { tokens.6.start_offset: 135 }
+  - match: { tokens.6.end_offset: 140 }
+  - match: { tokens.6.position: 10 }
+  - match: { tokens.7.token: "cause" }
+  - match: { tokens.7.start_offset: 142 }
+  - match: { tokens.7.end_offset: 147 }
+  - match: { tokens.7.position: 11 }
+  - match: { tokens.8.token: "auto" }
+  - match: { tokens.8.start_offset: 149 }
+  - match: { tokens.8.end_offset: 153 }
+  - match: { tokens.8.position: 12 }
+  - match: { tokens.9.token: "bulk" }
+  - match: { tokens.9.start_offset: 154 }
+  - match: { tokens.9.end_offset: 158 }
+  - match: { tokens.9.position: 13 }
+  - match: { tokens.10.token: "api" }
+  - match: { tokens.10.start_offset: 159 }
+  - match: { tokens.10.end_offset: 162 }
+  - match: { tokens.10.position: 14 }
+  - match: { tokens.11.token: "templates" }
+  - match: { tokens.11.start_offset: 166 }
+  - match: { tokens.11.end_offset: 175 }
+  - match: { tokens.11.position: 15 }
+  - match: { tokens.12.token: "watch-history-7" }
+  - match: { tokens.12.start_offset: 178 }
+  - match: { tokens.12.end_offset: 193 }
+  - match: { tokens.12.position: 16 }
+  - match: { tokens.13.token: "shards" }
+  - match: { tokens.13.start_offset: 196 }
+  - match: { tokens.13.end_offset: 202 }
+  - match: { tokens.13.position: 17 }
+  - match: { tokens.14.token: "mappings" }
+  - match: { tokens.14.start_offset: 212 }
+  - match: { tokens.14.end_offset: 220 }
+  - match: { tokens.14.position: 21 }
+  - match: { tokens.15.token: "doc" }
+  - match: { tokens.15.start_offset: 222 }
+  - match: { tokens.15.end_offset: 225 }
+  - match: { tokens.15.position: 22 }
+
+---
+"Test 7.14 analyzer with blank lines":
+  - do:
+      indices.analyze:
+        body: >
+          {
+            "char_filter" : [
+              "first_non_blank_line"
+            ],
+            "tokenizer" : "ml_standard",
+            "filter" : [
+              { "type" : "stop", "stopwords": [
+                "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
+                "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
+                "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
+                "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
+                "GMT", "UTC"
+              ] }
+            ],
+            "text" : " \nfirst line\nsecond line"
+          }
+  - match: { tokens.0.token: "first" }
+  - match: { tokens.0.start_offset: 4 }
+  - match: { tokens.0.end_offset: 9 }
+  - match: { tokens.0.position: 0 }
+  - match: { tokens.1.token: "line" }
+  - match: { tokens.1.start_offset: 10 }
+  - match: { tokens.1.end_offset: 14 }
+  - match: { tokens.1.position: 1 }