[[search-aggregations-bucket-categorize-text-aggregation]]
=== Categorize text aggregation
++++
<titleabbrev>Categorize text</titleabbrev>
++++

experimental::[]

A multi-bucket aggregation that groups semi-structured text into buckets. Each `text` field is re-analyzed
using a custom analyzer. The resulting tokens are then categorized, creating buckets of similarly formatted
text values. This aggregation works best with machine-generated text like system logs. Only the first 100 analyzed
tokens are used to categorize the text.

NOTE: If you have considerable memory allocated to your JVM but are receiving circuit breaker exceptions from this
aggregation, you may be attempting to categorize text that is poorly formatted for categorization. Consider
adding `categorization_filters` or running under <<search-aggregations-bucket-sampler-aggregation,sampler>> or
<<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>> to explore the created categories.

[[bucket-categorize-text-agg-syntax]]
==== Parameters

`field`::
(Required, string)
The semi-structured text field to categorize.

`max_unique_tokens`::
(Optional, integer, default: `50`)
The maximum number of unique tokens at any position up to `max_matched_tokens`.
Must be larger than 1. Smaller values use less memory and create fewer categories.
Larger values will use more memory and create narrower categories.
The maximum allowed value is `100`.

`max_matched_tokens`::
(Optional, integer, default: `5`)
The maximum number of token positions to match on before attempting to merge categories.
Larger values will use more memory and create narrower categories.
The maximum allowed value is `100`.

Example:
A `max_matched_tokens` value of 2 would disallow merging of the categories
[`foo` `bar` `baz`] and
[`foo` `baz` `bozo`],
since the first 2 tokens would be required to match for the category.

NOTE: Once `max_unique_tokens` is reached at a given position, a new `*` token is
added and all new tokens at that position are matched by the `*` token.
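
As a minimal sketch, both limits can be lowered to force broader categories. The
values shown here are illustrative, not recommendations:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "max_unique_tokens": 20,
        "max_matched_tokens": 2
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]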

`similarity_threshold`::
(Optional, integer, default: `50`)
The minimum percentage of tokens that must match for text to be added to the
category bucket.
Must be between 1 and 100. The larger the value, the narrower the categories.
Larger values will increase memory usage and create narrower categories.

`categorization_filters`::
(Optional, array of strings)
This property expects an array of regular expressions. The expressions
are used to filter out matching sequences from the categorization field values.
You can use this functionality to fine tune the categorization by excluding
sequences from consideration when categories are defined. For example, you can
exclude SQL statements that appear in your log files. This
property cannot be used at the same time as `categorization_analyzer`. If you
only want to define simple regular expression filters that are applied prior to
tokenization, setting this property is the easiest method. If you also want to
customize the tokenizer or post-tokenization filtering, use the
`categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters.

`categorization_analyzer`::
(Optional, object or string)
The categorization analyzer specifies how the text is analyzed and tokenized before
being categorized. The syntax is very similar to that used to define the `analyzer` in the
<<indices-analyze,Analyze endpoint>>. This
property cannot be used at the same time as `categorization_filters`.
+
The `categorization_analyzer` field can be specified either as a string or as an
object. If it is a string it must refer to a
<<analysis-analyzers,built-in analyzer>> or one added by another plugin. If it
is an object it has the following properties:
+
.Properties of `categorization_analyzer`
[%collapsible%open]
=====
`char_filter`::::
(array of strings or objects)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=char-filter]

`tokenizer`::::
(string or object)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=tokenizer]

`filter`::::
(array of strings or objects)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=filter]
=====
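
For instance, the regular expressions from a `categorization_filters` setting can
instead be supplied as `pattern_replace` char filters inside a custom
`categorization_analyzer`. A sketch, using the same `log-messages` index as the
examples below; the tokenizer and token filter choices here are illustrative:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "categorization_analyzer": {
          "char_filter": [
            {
              "type": "pattern_replace",
              "pattern": "\\w+\\_\\d{3}",
              "replacement": ""
            }
          ],
          "tokenizer": "whitespace",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]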

`shard_size`::
(Optional, integer)
The number of categorization buckets to return from each shard before merging
all the results.

`size`::
(Optional, integer, default: `10`)
The number of buckets to return.

`min_doc_count`::
(Optional, integer)
The minimum number of documents in a bucket for it to be returned in the results.

`shard_min_doc_count`::
(Optional, integer)
The minimum number of documents in a bucket for it to be returned from the shard before
merging.
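
As a sketch, the sizing parameters combine as follows; the values are illustrative
only:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "size": 5,
        "min_doc_count": 2,
        "shard_size": 25
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]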

==== Basic use

WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be
used in conjunction with <<async-search, Async search>>. Additionally, you may consider
using the aggregation as a child of either the <<search-aggregations-bucket-sampler-aggregation,sampler>> or
<<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>> aggregation.
This will typically improve speed and reduce memory use.
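
For example, the aggregation can be run as a child of a
<<search-aggregations-bucket-sampler-aggregation,sampler>> aggregation so that only
a sample of each shard's top documents is re-analyzed. A sketch, with an
illustrative `shard_size`:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]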

Example:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message"
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]

Response:

[source,console-result]
--------------------------------------------------
{
  "aggregations" : {
    "categories" : {
      "buckets" : [
        {
          "doc_count" : 3,
          "key" : "Node shutting down"
        },
        {
          "doc_count" : 1,
          "key" : "Node starting up"
        },
        {
          "doc_count" : 1,
          "key" : "User foo_325 logging on"
        },
        {
          "doc_count" : 1,
          "key" : "User foo_864 logged off"
        }
      ]
    }
  }
}
--------------------------------------------------

Here is an example using `categorization_filters`:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "categorization_filters": ["\\w+\\_\\d{3}"] <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]

<1> The filters to apply to the analyzed tokens. These filter
out tokens like `bar_123`.

Note how the `foo_<number>` tokens are not part of the
category results.

[source,console-result]
--------------------------------------------------
{
  "aggregations" : {
    "categories" : {
      "buckets" : [
        {
          "doc_count" : 3,
          "key" : "Node shutting down"
        },
        {
          "doc_count" : 1,
          "key" : "Node starting up"
        },
        {
          "doc_count" : 1,
          "key" : "User logged off"
        },
        {
          "doc_count" : 1,
          "key" : "User logging on"
        }
      ]
    }
  }
}
--------------------------------------------------

Here is another example using `categorization_filters`.
The default analyzer is a whitespace analyzer with a custom token filter
that filters out tokens starting with any number.
However, a token may be a known highly-variable token (formatted usernames, emails, etc.). In that case, it is good to supply
custom `categorization_filters` to filter out those tokens for better categories. These filters also reduce memory usage, as fewer
tokens are held in memory for the categories.

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "categorization_filters": ["\\w+\\_\\d{3}"], <1>
        "max_matched_tokens": 2, <2>
        "similarity_threshold": 30 <3>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]
<1> The filters to apply to the analyzed tokens. These filter
out tokens like `bar_123`.
<2> Require at least 2 tokens to match before the log categories attempt to merge together
<3> Require 30% of the tokens to match before expanding a log category
to add a new log entry

The resulting categories are now broad, matching the first token
and merging the log groups.

[source,console-result]
--------------------------------------------------
{
  "aggregations" : {
    "categories" : {
      "buckets" : [
        {
          "doc_count" : 4,
          "key" : "Node *"
        },
        {
          "doc_count" : 2,
          "key" : "User *"
        }
      ]
    }
  }
}
--------------------------------------------------

This aggregation can have both sub-aggregations and itself be a sub-aggregation. This allows gathering the top daily categories and the
top sample document for each category, as shown below.

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "daily": {
      "date_histogram": {
        "field": "time",
        "fixed_interval": "1d"
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message",
            "categorization_filters": ["\\w+\\_\\d{3}"]
          },
          "aggs": {
            "hit": {
              "top_hits": {
                "size": 1,
                "sort": ["time"],
                "_source": "message"
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:categorize_text]

[source,console-result]
--------------------------------------------------
{
  "aggregations" : {
    "daily" : {
      "buckets" : [
        {
          "key_as_string" : "2016-02-07T00:00:00.000Z",
          "key" : 1454803200000,
          "doc_count" : 3,
          "categories" : {
            "buckets" : [
              {
                "doc_count" : 2,
                "key" : "Node shutting down",
                "hit" : {
                  "hits" : {
                    "total" : {
                      "value" : 2,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "log-messages",
                        "_id" : "1",
                        "_score" : null,
                        "_source" : {
                          "message" : "2016-02-07T00:00:00+0000 Node 3 shutting down"
                        },
                        "sort" : [
                          1454803260000
                        ]
                      }
                    ]
                  }
                }
              },
              {
                "doc_count" : 1,
                "key" : "Node starting up",
                "hit" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "log-messages",
                        "_id" : "2",
                        "_score" : null,
                        "_source" : {
                          "message" : "2016-02-07T00:00:00+0000 Node 5 starting up"
                        },
                        "sort" : [
                          1454803320000
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        },
        {
          "key_as_string" : "2016-02-08T00:00:00.000Z",
          "key" : 1454889600000,
          "doc_count" : 3,
          "categories" : {
            "buckets" : [
              {
                "doc_count" : 1,
                "key" : "Node shutting down",
                "hit" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "log-messages",
                        "_id" : "4",
                        "_score" : null,
                        "_source" : {
                          "message" : "2016-02-08T00:00:00+0000 Node 5 shutting down"
                        },
                        "sort" : [
                          1454889660000
                        ]
                      }
                    ]
                  }
                }
              },
              {
                "doc_count" : 1,
                "key" : "User logged off",
                "hit" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "log-messages",
                        "_id" : "6",
                        "_score" : null,
                        "_source" : {
                          "message" : "2016-02-08T00:00:00+0000 User foo_864 logged off"
                        },
                        "sort" : [
                          1454889840000
                        ]
                      }
                    ]
                  }
                }
              },
              {
                "doc_count" : 1,
                "key" : "User logging on",
                "hit" : {
                  "hits" : {
                    "total" : {
                      "value" : 1,
                      "relation" : "eq"
                    },
                    "max_score" : null,
                    "hits" : [
                      {
                        "_index" : "log-messages",
                        "_id" : "5",
                        "_score" : null,
                        "_source" : {
                          "message" : "2016-02-08T00:00:00+0000 User foo_325 logging on"
                        },
                        "sort" : [
                          1454889720000
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
--------------------------------------------------