[[esql-process-data-with-dissect-and-grok]]
=== Data processing with DISSECT and GROK
++++
<titleabbrev>Data processing with DISSECT and GROK</titleabbrev>
++++

Your data may contain unstructured strings that you want to structure. This
makes it easier to analyze the data. For example, log messages may contain IP
addresses that you want to extract so you can find the most active IP addresses.

image::images/esql/unstructured-data.png[align="center",width=75%]

{es} can structure your data at index time or query time. At index time, you can
use the <<dissect-processor,Dissect>> and <<grok-processor,Grok>> ingest
processors, or the {ls} {logstash-ref}/plugins-filters-dissect.html[Dissect] and
{logstash-ref}/plugins-filters-grok.html[Grok] filters. At query time, you can
use the {esql} <<esql-dissect>> and <<esql-grok>> commands.

[[esql-grok-or-dissect]]
==== `DISSECT` or `GROK`? Or both?

`DISSECT` works by breaking up a string using a delimiter-based pattern. `GROK`
works similarly, but uses regular expressions. This makes `GROK` more powerful,
but generally also slower. `DISSECT` works well when data is reliably repeated.
`GROK` is a better choice when you really need the power of regular expressions,
for example when the structure of your text varies from row to row.

You can use both `DISSECT` and `GROK` for hybrid use cases, for example when a
section of the line is reliably repeated, but the entire line is not. `DISSECT`
can deconstruct the repeated section of the line, and `GROK` can process the
remaining field values using regular expressions, as shown in the sketch below.
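
For instance, the following sketch (the input line and column names are
invented for illustration) uses `DISSECT` to split off the fixed-format prefix
and `GROK` to parse the remaining free-form text:

[source,esql]
----
ROW message = "1.2.3.4 [2023-01-23T12:15:00.000Z] GET /search HTTP/1.1"
| DISSECT message "%{clientip} [%{@timestamp}] %{rest}"
| GROK rest "%{WORD:verb} %{URIPATHPARAM:request} HTTP/%{NUMBER:version}"
----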

[[esql-process-data-with-dissect]]
==== Process data with `DISSECT`

The <<esql-dissect>> processing command matches a string against a
delimiter-based pattern, and extracts the specified keys as columns.

For example, the following pattern:
[source,txt]
----
%{clientip} [%{@timestamp}] %{status}
----

matches a log line of this format:
[source,txt]
----
1.2.3.4 [2023-01-23T12:15:00.000Z] Connected
----

and results in adding the following columns to the input table:

[%header.monospaced.styled,format=dsv,separator=|]
|===
clientip:keyword | @timestamp:keyword | status:keyword
1.2.3.4 | 2023-01-23T12:15:00.000Z | Connected
|===
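
Putting it together as a minimal {esql} query, with `ROW` supplying a literal
test value for illustration:

[source,esql]
----
ROW message = "1.2.3.4 [2023-01-23T12:15:00.000Z] Connected"
| DISSECT message "%{clientip} [%{@timestamp}] %{status}"
----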

[[esql-dissect-patterns]]
===== Dissect patterns

include::../ingest/processors/dissect.asciidoc[tag=intro-example-explanation]

An empty key (`%{}`) or <<esql-named-skip-key,named skip key>> can be used to
match values, but exclude the value from the output.
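
For example, this sketch (reusing the illustrative log line from above) matches
the timestamp but drops it from the output:

[source,esql]
----
ROW message = "1.2.3.4 [2023-01-23T12:15:00.000Z] Connected"
| DISSECT message "%{clientip} [%{}] %{status}"
----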

All matched values are output as keyword string data types. Use the
<<esql-type-conversion-functions>> to convert to another data type.
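
For example, the following sketch (using the same illustrative input) converts
the extracted `@timestamp` string to a datetime with `TO_DATETIME`:

[source,esql]
----
ROW message = "1.2.3.4 [2023-01-23T12:15:00.000Z] Connected"
| DISSECT message "%{clientip} [%{@timestamp}] %{status}"
| EVAL @timestamp = TO_DATETIME(@timestamp)
----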

Dissect also supports <<esql-dissect-key-modifiers,key modifiers>> that can
change dissect's default behavior. For example, you can instruct dissect to
ignore certain fields, append fields, skip over padding, etc.

[[esql-dissect-terminology]]
===== Terminology

dissect pattern::
the set of fields and delimiters describing the textual
format. Also known as a dissection.
The dissection is described using a set of `%{}` sections:
`%{a} - %{b} - %{c}`

field::
the text from `%{` to `}` inclusive.

delimiter::
the text between `}` and the next `%{` characters.
Any set of characters other than `%{`, `'not }'`, or `}` is a delimiter.

key::
+
--
the text between the `%{` and `}`, exclusive of the `?`, `+`, `&` prefixes
and the ordinal suffix.

Examples:

* `%{?aaa}` - the key is `aaa`
* `%{+bbb/3}` - the key is `bbb`
* `%{&ccc}` - the key is `ccc`
--

[[esql-dissect-examples]]
===== Examples

include::processing-commands/dissect.asciidoc[tag=examples]

[[esql-dissect-key-modifiers]]
===== Dissect key modifiers

include::../ingest/processors/dissect.asciidoc[tag=dissect-key-modifiers]

[[esql-dissect-key-modifiers-table]]
.Dissect key modifiers
[options="header",role="styled"]
|======
| Modifier | Name | Position | Example | Description | Details
| `->` | Skip right padding | (far) right | `%{keyname1->}` | Skips any repeated characters to the right | <<esql-dissect-modifier-skip-right-padding,link>>
| `+` | Append | left | `%{+keyname} %{+keyname}` | Appends two or more fields together | <<esql-append-modifier,link>>
| `+` with `/n` | Append with order | left and right | `%{+keyname/2} %{+keyname/1}` | Appends two or more fields together in the order specified | <<esql-append-order-modifier,link>>
| `?` | Named skip key | left | `%{?ignoreme}` | Skips the matched value in the output. Same behavior as `%{}` | <<esql-named-skip-key,link>>
|======

[[esql-dissect-modifier-skip-right-padding]]
====== Right padding modifier (`->`)
include::../ingest/processors/dissect.asciidoc[tag=dissect-modifier-skip-right-padding]

For example:
[source.merge.styled,esql]
----
include::{esql-specs}/docs.csv-spec[tag=dissectRightPaddingModifier]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/docs.csv-spec[tag=dissectRightPaddingModifier-result]
|===

include::../ingest/processors/dissect.asciidoc[tag=dissect-modifier-empty-right-padding]

For example:
[source.merge.styled,esql]
----
include::{esql-specs}/docs.csv-spec[tag=dissectEmptyRightPaddingModifier]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/docs.csv-spec[tag=dissectEmptyRightPaddingModifier-result]
|===

[[esql-append-modifier]]
====== Append modifier (`+`)
include::../ingest/processors/dissect.asciidoc[tag=append-modifier]

[source.merge.styled,esql]
----
include::{esql-specs}/docs.csv-spec[tag=dissectAppendModifier]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/docs.csv-spec[tag=dissectAppendModifier-result]
|===

[[esql-append-order-modifier]]
====== Append with order modifier (`+` and `/n`)
include::../ingest/processors/dissect.asciidoc[tag=append-order-modifier]

[source.merge.styled,esql]
----
include::{esql-specs}/docs.csv-spec[tag=dissectAppendWithOrderModifier]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/docs.csv-spec[tag=dissectAppendWithOrderModifier-result]
|===

[[esql-named-skip-key]]
====== Named skip key (`?`)
include::../ingest/processors/dissect.asciidoc[tag=named-skip-key]
This can be done with a named skip key using the `%{?name}` syntax. In the
following query, `ident` and `auth` are not added to the output table:

[source.merge.styled,esql]
----
include::{esql-specs}/docs.csv-spec[tag=dissectNamedSkipKey]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/docs.csv-spec[tag=dissectNamedSkipKey-result]
|===

[[esql-dissect-limitations]]
===== Limitations

// tag::dissect-limitations[]
The `DISSECT` command does not support reference keys.
// end::dissect-limitations[]

[[esql-process-data-with-grok]]
==== Process data with `GROK`

The <<esql-grok>> processing command matches a string against a pattern based on
regular expressions, and extracts the specified keys as columns.

For example, the following pattern:
[source,txt]
----
%{IP:ip} \[%{TIMESTAMP_ISO8601:@timestamp}\] %{GREEDYDATA:status}
----

matches a log line of this format:
[source,txt]
----
1.2.3.4 [2023-01-23T12:15:00.000Z] Connected
----

Putting it together as an {esql} query:

[source.merge.styled,esql]
----
include::{esql-specs}/docs.csv-spec[tag=grokWithEscapeTripleQuotes]
----

`GROK` adds the following columns to the input table:

[%header.monospaced.styled,format=dsv,separator=|]
|===
@timestamp:keyword | ip:keyword | status:keyword
2023-01-23T12:15:00.000Z | 1.2.3.4 | Connected
|===

[NOTE]
====

Special regex characters in grok patterns, like `[` and `]`, need to be escaped
with a `\`. For example, in the earlier pattern:
[source,txt]
----
%{IP:ip} \[%{TIMESTAMP_ISO8601:@timestamp}\] %{GREEDYDATA:status}
----

In {esql} queries, when using single quotes for strings, the backslash character
itself is a special character that needs to be escaped with another `\`. For
this example, the corresponding {esql} query becomes:
[source.merge.styled,esql]
----
include::{esql-specs}/docs.csv-spec[tag=grokWithEscape]
----

For this reason, it is generally more convenient to use triple quotes (`"""`)
for `GROK` patterns, which do not require escaping of backslashes:

[source.merge.styled,esql]
----
include::{esql-specs}/docs.csv-spec[tag=grokWithEscapeTripleQuotes]
----
====

[[esql-grok-patterns]]
===== Grok patterns

The syntax for a grok pattern is `%{SYNTAX:SEMANTIC}`.

The `SYNTAX` is the name of the pattern that matches your text. For example,
`3.44` is matched by the `NUMBER` pattern and `55.3.244.1` is matched by the
`IP` pattern. The syntax is how you match.

The `SEMANTIC` is the identifier you give to the piece of text being matched.
For example, `3.44` could be the duration of an event, so you could call it
simply `duration`. Further, a string `55.3.244.1` might identify the `client`
making a request.
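
Combining the two, both example values could be captured with a pattern like:
[source,txt]
----
%{NUMBER:duration} %{IP:client}
----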

By default, matched values are output as keyword string data types. To convert a
semantic's data type, suffix it with the target data type. For example,
`%{NUMBER:num:int}` converts the `num` semantic from a string to an integer.
Currently the only supported conversions are `int` and `float`. For other
types, use the <<esql-type-conversion-functions>>.
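
As a minimal sketch (the input values are illustrative), such a conversion can
be applied inline:

[source,esql]
----
ROW message = "3.44 55.3.244.1"
| GROK message """%{NUMBER:duration:float} %{IP:client}"""
----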

For an overview of the available patterns, refer to
{es-repo}/blob/{branch}/libs/grok/src/main/resources/patterns[GitHub]. You can
also retrieve a list of all patterns using a <<grok-processor-rest-get,REST
API>>.

[[esql-grok-regex]]
===== Regular expressions

Grok is based on regular expressions. Any regular expressions are valid in grok
as well. Grok uses the Oniguruma regular expression library. Refer to
https://github.com/kkos/oniguruma/blob/master/doc/RE[the Oniguruma GitHub
repository] for the full supported regexp syntax.

[[esql-custom-patterns]]
===== Custom patterns

If grok doesn't have a pattern you need, you can use the Oniguruma syntax for
named capture, which lets you match a piece of text and save it as a column:
[source,txt]
----
(?<field_name>the pattern here)
----

For example, postfix logs have a `queue id` that is a 10 or 11-character
hexadecimal value. This can be captured to a column named `queue_id` with:
[source,txt]
----
(?<queue_id>[0-9A-F]{10,11})
----
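
A minimal sketch of using this custom pattern in a query (the input line is
invented for illustration), with triple quotes to avoid escaping:

[source,esql]
----
ROW message = "3E4B7A12F9: message accepted"
| GROK message """(?<queue_id>[0-9A-F]{10,11})"""
----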

[[esql-grok-examples]]
===== Examples

include::processing-commands/grok.asciidoc[tag=examples]

[[esql-grok-debugger]]
===== Grok debugger

To write and debug grok patterns, you can use the
{kibana-ref}/xpack-grokdebugger.html[Grok Debugger]. It provides a UI for
testing patterns against sample data. Under the covers, it uses the same engine
as the `GROK` command.

[[esql-grok-limitations]]
===== Limitations

// tag::grok-limitations[]
The `GROK` command does not support configuring <<custom-patterns,custom
patterns>>, or <<trace-match,multiple patterns>>. The `GROK` command is not
subject to <<grok-watchdog,Grok watchdog settings>>.
// end::grok-limitations[]