elasticsearch/docs/reference/scripting/grok-syntax.asciidoc
Adam Locke 0aa0171ce1
[DOCS] Create a new page for grok content in scripting docs (#73118)
* [DOCS] Moving grok to its own scripting page

* Adding examples

* Updating cross link for grok page

* Adds same runtime field in a search request for #73262

* Clarify titles and shift navigation

* Incorporating review feedback

* Updating cross-link to Painless
2021-05-27 15:18:34 -04:00

235 lines
7.9 KiB
Text

[[grok]]
=== Grokking grok
Grok is a regular expression dialect that supports reusable aliased expressions. Grok works really well with syslog logs, Apache and other webserver
logs, mysql logs, and generally any log format that is written for humans and
not computer consumption.
Grok sits on top of the https://github.com/kkos/oniguruma/blob/master/doc/RE[Oniguruma] regular expression library, so any regular expressions are
valid in grok. Grok uses this regular expression language to allow naming
existing patterns and combining them into more complex patterns that match your
fields.
[[grok-syntax]]
==== Grok patterns
The {stack} ships with numerous https://github.com/elastic/elasticsearch/blob/master/libs/grok/src/main/resources/patterns/grok-patterns[predefined grok patterns] that simplify working with grok. The syntax for reusing grok patterns
takes one of the following forms:
[%autowidth]
|===
|`%{SYNTAX}` | `%{SYNTAX:ID}` |`%{SYNTAX:ID:TYPE}`
|===
`SYNTAX`::
The name of the pattern that will match your text. For example, `NUMBER` and
`IP` are both patterns that are provided within the default patterns set. The
`NUMBER` pattern matches data like `3.44`, and the `IP` pattern matches data
like `55.3.244.1`.
`ID`::
The identifier you give to the piece of text being matched. For example, `3.44`
could be the duration of an event, so you might call it `duration`. The string
`55.3.244.1` might identify the `client` making a request.
`TYPE`::
The data type you want to cast your named field. `int`, `long`, `double`,
`float` and `boolean` are supported types.
For example, let's say you have message data that looks like this:
[source,txt]
----
3.44 55.3.244.1
----
The first value is a number, followed by what appears to be an IP address. You
can match this text by using the following grok expression:
[source,txt]
----
%{NUMBER:duration} %{IP:client}
----
[[grok-patterns]]
==== Use grok patterns in Painless scripts
You can incorporate predefined grok patterns into Painless scripts to extract
data. To test your script, use either the {painless}/painless-execute-api.html#painless-execute-runtime-field-context[field contexts] of the Painless
execute API or create a runtime field that includes the script. Runtime fields
offer greater flexibility and accept multiple documents, but the Painless
execute API is a great option if you don't have write access on a cluster
where you're testing a script.
TIP: If you need help building grok patterns to match your data, use the
{kibana-ref}/xpack-grokdebugger.html[Grok Debugger] tool in {kib}.
For example, if you're working with Apache log data, you can use the
`%{COMMONAPACHELOG}` syntax, which understands the structure of Apache logs. A
sample document might look like this:
// Note to contributors that the line break in the following example is
// intentional to promote better readability in the output
[source,js]
----
"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - -
[30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
----
// NOTCONSOLE
To extract the IP address from the `message` field, you can write a Painless
script that incorporates the `%{COMMONAPACHELOG}` syntax. You can test this
script using the {painless}/painless-execute-api.html#painless-runtime-ip[`ip` field context] of the Painless execute API, but let's use a runtime field
instead.
Based on the sample document, index the `@timestamp` and `message` fields. To
remain flexible, use `wildcard` as the field type for `message`:
[source,console]
----
PUT /my-index/
{
"mappings": {
"properties": {
"@timestamp": {
"format": "strict_date_optional_time||epoch_second",
"type": "date"
},
"message": {
"type": "wildcard"
}
}
}
}
----
Next, use the <<docs-bulk,bulk API>> to index some log data into
`my-index`.
[source,console]
----
POST /my-index/_bulk?refresh
{"index":{}}
{"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}
----
// TEST[continued]
[[grok-patterns-runtime]]
==== Incorporate grok patterns and scripts in runtime fields
Now you can define a runtime field in the mappings that includes your Painless
script and grok pattern. If the pattern matches, the script emits the value of
the matching IP address. If the pattern doesn't match (`clientip != null`), the
script just returns the field value without crashing.
[source,console]
----
PUT my-index/_mappings
{
"runtime": {
"http.clientip": {
"type": "ip",
"script": """
String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
if (clientip != null) emit(clientip);
"""
}
}
}
----
// TEST[continued]
Alternatively, you can define the same runtime field but in the context of a
search request. The runtime definition and the script are exactly the same as
the one defined previously in the index mapping. Just copy that definition into
the search request under the `runtime_mappings` section and include a query
that matches on the runtime field. This query returns the same results as if
you <<grok-pattern-results,defined a search query>> for the `http.clientip`
runtime field in your index mappings, but only in the context of this specific
search:
[source,console]
----
GET my-index/_search
{
"runtime_mappings": {
"http.clientip": {
"type": "ip",
"script": """
String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
if (clientip != null) emit(clientip);
"""
}
},
"query": {
"match": {
"http.clientip": "40.135.0.0"
}
},
"fields" : ["http.clientip"]
}
----
// TEST[continued]
[[grok-pattern-results]]
==== Return calculated results
Using the `http.clientip` runtime field, you can define a simple query to run a
search for a specific IP address and return all related fields. The <<search-fields,`fields`>> parameter on the `_search` API works for all fields,
even those that weren't sent as part of the original `_source`:
[source,console]
----
GET my-index/_search
{
"query": {
"match": {
"http.clientip": "40.135.0.0"
}
},
"fields" : ["http.clientip"]
}
----
// TEST[continued]
// TEST[s/_search/_search\?filter_path=hits/]
The response includes the specific IP address indicated in your search query.
The grok pattern within the Painless script extracted this value from the
`message` field at runtime.
[source,console-result]
----
{
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my-index",
"_id" : "1iN2a3kBw4xTzEDqyYE0",
"_score" : 1.0,
"_source" : {
"timestamp" : "2020-04-30T14:30:17-05:00",
"message" : "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
},
"fields" : {
"http.clientip" : [
"40.135.0.0"
]
}
}
]
}
}
----
// TESTRESPONSE[s/"_id" : "1iN2a3kBw4xTzEDqyYE0"/"_id": $body.hits.hits.0._id/]