mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 15:47:23 -04:00
This replaces the implementation of the categorize_text aggregation with the new algorithm that was added in #80867. The new algorithm works in the same way as the ML C++ code used for categorization jobs (and now includes the fixes of elastic/ml-cpp#2277). The docs are updated to reflect the workings of the new implementation.
481 lines
15 KiB
Text
481 lines
15 KiB
Text
[[search-aggregations-bucket-categorize-text-aggregation]]
|
|
=== Categorize text aggregation
|
|
++++
|
|
<titleabbrev>Categorize text</titleabbrev>
|
|
++++
|
|
|
|
experimental::[]
|
|
|
|
A multi-bucket aggregation that groups semi-structured text into buckets. Each `text` field is re-analyzed
|
|
using a custom analyzer. The resulting tokens are then categorized creating buckets of similarly formatted
|
|
text values. This aggregation works best with machine generated text like system logs. Only the first 100 analyzed
|
|
tokens are used to categorize the text.
|
|
|
|
NOTE: If you have considerable memory allocated to your JVM but are receiving circuit breaker exceptions from this
|
|
aggregation, you may be attempting to categorize text that is poorly formatted for categorization. Consider
|
|
adding `categorization_filters` or running under <<search-aggregations-bucket-sampler-aggregation,sampler>>,
|
|
<<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>>, or
|
|
<<search-aggregations-random-sampler-aggregation,random sampler>> to explore the created categories.
|
|
|
|
NOTE: The algorithm used for categorization was completely changed in version 8.3.0. As a result this aggregation
|
|
will not work in a mixed version cluster where some nodes are on version 8.3.0 or higher and others are
|
|
on a version older than 8.3.0. Upgrade all nodes in your cluster to the same version if you experience
|
|
an error related to this change.
|
|
|
|
[[bucket-categorize-text-agg-syntax]]
|
|
==== Parameters
|
|
|
|
`categorization_analyzer`::
|
|
(Optional, object or string)
|
|
The categorization analyzer specifies how the text is analyzed and tokenized before
|
|
being categorized. The syntax is very similar to that used to define the `analyzer` in the
|
|
<<indices-analyze,Analyze endpoint>>. This
|
|
property cannot be used at the same time as `categorization_filters`.
|
|
+
|
|
The `categorization_analyzer` field can be specified either as a string or as an
|
|
object. If it is a string it must refer to a
|
|
<<analysis-analyzers,built-in analyzer>> or one added by another plugin. If it
|
|
is an object it has the following properties:
|
|
+
|
|
.Properties of `categorization_analyzer`
|
|
[%collapsible%open]
|
|
=====
|
|
`char_filter`::::
|
|
(array of strings or objects)
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=char-filter]
|
|
|
|
`tokenizer`::::
|
|
(string or object)
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=tokenizer]
|
|
|
|
`filter`::::
|
|
(array of strings or objects)
|
|
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=filter]
|
|
=====
|
|
|
|
`categorization_filters`::
|
|
(Optional, array of strings)
|
|
This property expects an array of regular expressions. The expressions
|
|
are used to filter out matching sequences from the categorization field values.
|
|
You can use this functionality to fine tune the categorization by excluding
|
|
sequences from consideration when categories are defined. For example, you can
|
|
exclude SQL statements that appear in your log files. This
|
|
property cannot be used at the same time as `categorization_analyzer`. If you
|
|
only want to define simple regular expression filters that are applied prior to
|
|
tokenization, setting this property is the easiest method. If you also want to
|
|
customize the tokenizer or post-tokenization filtering, use the
|
|
`categorization_analyzer` property instead and include the filters as
|
|
`pattern_replace` character filters.
|
|
|
|
`field`::
|
|
(Required, string)
|
|
The semi-structured text field to categorize.
|
|
|
|
`max_matched_tokens`::
|
|
(Optional, integer)
|
|
This parameter does nothing now, but is permitted for compatibility with the original
|
|
pre-8.3.0 implementation.
|
|
|
|
`max_unique_tokens`::
|
|
(Optional, integer)
|
|
This parameter does nothing now, but is permitted for compatibility with the original
|
|
pre-8.3.0 implementation.
|
|
|
|
`min_doc_count`::
|
|
(Optional, integer)
|
|
The minimum number of documents for a bucket to be returned to the results.
|
|
|
|
`shard_min_doc_count`::
|
|
(Optional, integer)
|
|
The minimum number of documents for a bucket to be returned from the shard before
|
|
merging.
|
|
|
|
`shard_size`::
|
|
(Optional, integer)
|
|
The number of categorization buckets to return from each shard before merging
|
|
all the results.
|
|
|
|
`similarity_threshold`::
|
|
(Optional, integer, default: `70`)
|
|
The minimum percentage of token weight that must match for text to be added to the
|
|
category bucket.
|
|
Must be between 1 and 100. The larger the value the narrower the categories.
|
|
Larger values will increase memory usage and create narrower categories.
|
|
|
|
`size`::
|
|
(Optional, integer, default: `10`)
|
|
The number of buckets to return.
|
|
|
|
==== Basic use
|
|
|
|
WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be
|
|
used in conjunction with <<async-search, Async search>>. Additionally, you may consider
|
|
using the aggregation as a child of either the <<search-aggregations-bucket-sampler-aggregation,sampler>> or
|
|
<<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>> aggregation.
|
|
This will typically improve speed and memory use.
|
|
|
|
Example:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
POST log-messages/_search?filter_path=aggregations
|
|
{
|
|
"aggs": {
|
|
"categories": {
|
|
"categorize_text": {
|
|
"field": "message"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[setup:categorize_text]
|
|
|
|
Response:
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
"aggregations" : {
|
|
"categories" : {
|
|
"buckets" : [
|
|
{
|
|
"doc_count" : 3,
|
|
"key" : "Node shutting down",
|
|
"max_matching_length" : 49
|
|
},
|
|
{
|
|
"doc_count" : 1,
|
|
"key" : "Node starting up",
|
|
"max_matching_length" : 47
|
|
},
|
|
{
|
|
"doc_count" : 1,
|
|
"key" : "User foo_325 logging on",
|
|
"max_matching_length" : 52
|
|
},
|
|
{
|
|
"doc_count" : 1,
|
|
"key" : "User foo_864 logged off",
|
|
"max_matching_length" : 52
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Here is an example using `categorization_filters`
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
POST log-messages/_search?filter_path=aggregations
|
|
{
|
|
"aggs": {
|
|
"categories": {
|
|
"categorize_text": {
|
|
"field": "message",
|
|
"categorization_filters": ["\\w+\\_\\d{3}"] <1>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[setup:categorize_text]
|
|
|
|
<1> The filters to apply to the analyzed tokens. It filters
|
|
out tokens like `bar_123`.
|
|
|
|
Note how the `foo_<number>` tokens are not part of the
|
|
category results
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
"aggregations" : {
|
|
"categories" : {
|
|
"buckets" : [
|
|
{
|
|
"doc_count" : 3,
|
|
"key" : "Node shutting down",
|
|
"max_matching_length" : 49
|
|
},
|
|
{
|
|
"doc_count" : 1,
|
|
"key" : "Node starting up",
|
|
"max_matching_length" : 47
|
|
},
|
|
{
|
|
"doc_count" : 1,
|
|
"key" : "User logged off",
|
|
"max_matching_length" : 52
|
|
},
|
|
{
|
|
"doc_count" : 1,
|
|
"key" : "User logging on",
|
|
"max_matching_length" : 52
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Here is an example using `categorization_filters`.
|
|
The default analyzer uses the `ml_standard` tokenizer which is similar to a whitespace tokenizer
|
|
but filters out tokens that could be interpreted as hexadecimal numbers. The default analyzer
|
|
also uses the `first_line_with_letters` character filter, so that only the first meaningful line
|
|
of multi-line messages is considered.
|
|
But, it may be that a token is a known highly-variable token (formatted usernames, emails, etc.). In that case, it is good to supply
|
|
custom `categorization_filters` to filter out those tokens for better categories. These filters may also reduce memory usage as fewer
|
|
tokens are held in memory for the categories. (If there are sufficient examples of different usernames, emails, etc., then
|
|
categories will form that naturally discard them as variables, but for small input data where only one example exists this won't
|
|
happen.)
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
POST log-messages/_search?filter_path=aggregations
|
|
{
|
|
"aggs": {
|
|
"categories": {
|
|
"categorize_text": {
|
|
"field": "message",
|
|
"categorization_filters": ["\\w+\\_\\d{3}"], <1>
|
|
"similarity_threshold": 11 <2>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[setup:categorize_text]
|
|
<1> The filters to apply to the analyzed tokens. It filters
|
|
out tokens like `bar_123`.
|
|
<2> Require 11% of token weight to match before adding a message to an
|
|
existing category rather than creating a new one.
|
|
|
|
The resulting categories are now very broad, merging the log groups.
|
|
(A `similarity_threshold` of 11% is generally too low. Settings over
|
|
50% are usually better.)
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
"aggregations" : {
|
|
"categories" : {
|
|
"buckets" : [
|
|
{
|
|
"doc_count" : 4,
|
|
"key" : "Node",
|
|
"max_matching_length" : 49
|
|
},
|
|
{
|
|
"doc_count" : 2,
|
|
"key" : "User",
|
|
"max_matching_length" : 52
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
This aggregation can have both sub-aggregations and itself be a sub-aggregation. This allows gathering the top daily categories and the
|
|
top sample doc as below.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
POST log-messages/_search?filter_path=aggregations
|
|
{
|
|
"aggs": {
|
|
"daily": {
|
|
"date_histogram": {
|
|
"field": "time",
|
|
"fixed_interval": "1d"
|
|
},
|
|
"aggs": {
|
|
"categories": {
|
|
"categorize_text": {
|
|
"field": "message",
|
|
"categorization_filters": ["\\w+\\_\\d{3}"]
|
|
},
|
|
"aggs": {
|
|
"hit": {
|
|
"top_hits": {
|
|
"size": 1,
|
|
"sort": ["time"],
|
|
"_source": "message"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[setup:categorize_text]
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
"aggregations" : {
|
|
"daily" : {
|
|
"buckets" : [
|
|
{
|
|
"key_as_string" : "2016-02-07T00:00:00.000Z",
|
|
"key" : 1454803200000,
|
|
"doc_count" : 3,
|
|
"categories" : {
|
|
"buckets" : [
|
|
{
|
|
"doc_count" : 2,
|
|
"key" : "Node shutting down",
|
|
"max_matching_length" : 49,
|
|
"hit" : {
|
|
"hits" : {
|
|
"total" : {
|
|
"value" : 2,
|
|
"relation" : "eq"
|
|
},
|
|
"max_score" : null,
|
|
"hits" : [
|
|
{
|
|
"_index" : "log-messages",
|
|
"_id" : "1",
|
|
"_score" : null,
|
|
"_source" : {
|
|
"message" : "2016-02-07T00:00:00+0000 Node 3 shutting down"
|
|
},
|
|
"sort" : [
|
|
1454803260000
|
|
]
|
|
}
|
|
]
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"doc_count" : 1,
|
|
"key" : "Node starting up",
|
|
"max_matching_length" : 47,
|
|
"hit" : {
|
|
"hits" : {
|
|
"total" : {
|
|
"value" : 1,
|
|
"relation" : "eq"
|
|
},
|
|
"max_score" : null,
|
|
"hits" : [
|
|
{
|
|
"_index" : "log-messages",
|
|
"_id" : "2",
|
|
"_score" : null,
|
|
"_source" : {
|
|
"message" : "2016-02-07T00:00:00+0000 Node 5 starting up"
|
|
},
|
|
"sort" : [
|
|
1454803320000
|
|
]
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
},
|
|
{
|
|
"key_as_string" : "2016-02-08T00:00:00.000Z",
|
|
"key" : 1454889600000,
|
|
"doc_count" : 3,
|
|
"categories" : {
|
|
"buckets" : [
|
|
{
|
|
"doc_count" : 1,
|
|
"key" : "Node shutting down",
|
|
"max_matching_length" : 49,
|
|
"hit" : {
|
|
"hits" : {
|
|
"total" : {
|
|
"value" : 1,
|
|
"relation" : "eq"
|
|
},
|
|
"max_score" : null,
|
|
"hits" : [
|
|
{
|
|
"_index" : "log-messages",
|
|
"_id" : "4",
|
|
"_score" : null,
|
|
"_source" : {
|
|
"message" : "2016-02-08T00:00:00+0000 Node 5 shutting down"
|
|
},
|
|
"sort" : [
|
|
1454889660000
|
|
]
|
|
}
|
|
]
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"doc_count" : 1,
|
|
"key" : "User logged off",
|
|
"max_matching_length" : 52,
|
|
"hit" : {
|
|
"hits" : {
|
|
"total" : {
|
|
"value" : 1,
|
|
"relation" : "eq"
|
|
},
|
|
"max_score" : null,
|
|
"hits" : [
|
|
{
|
|
"_index" : "log-messages",
|
|
"_id" : "6",
|
|
"_score" : null,
|
|
"_source" : {
|
|
"message" : "2016-02-08T00:00:00+0000 User foo_864 logged off"
|
|
},
|
|
"sort" : [
|
|
1454889840000
|
|
]
|
|
}
|
|
]
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"doc_count" : 1,
|
|
"key" : "User logging on",
|
|
"max_matching_length" : 52,
|
|
"hit" : {
|
|
"hits" : {
|
|
"total" : {
|
|
"value" : 1,
|
|
"relation" : "eq"
|
|
},
|
|
"max_score" : null,
|
|
"hits" : [
|
|
{
|
|
"_index" : "log-messages",
|
|
"_id" : "5",
|
|
"_score" : null,
|
|
"_source" : {
|
|
"message" : "2016-02-08T00:00:00+0000 User foo_325 logging on"
|
|
},
|
|
"sort" : [
|
|
1454889720000
|
|
]
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|