mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 07:37:19 -04:00
* [DOCS] Moving grok to its own scripting page * Adding examples * Updating cross link for grok page * Adds same runtime field in a search request for #73262 * Clarify titles and shift navigation * Incorporating review feedback * Updating cross-link to Painless
235 lines
7.9 KiB
Text
235 lines
7.9 KiB
Text
[[grok]]
|
|
=== Grokking grok
|
|
Grok is a regular expression dialect that supports reusable aliased expressions. Grok works really well with syslog logs, Apache and other webserver
|
|
logs, mysql logs, and generally any log format that is written for humans and
|
|
not computer consumption.
|
|
|
|
Grok sits on top of the https://github.com/kkos/oniguruma/blob/master/doc/RE[Oniguruma] regular expression library, so any regular expressions are
|
|
valid in grok. Grok uses this regular expression language to allow naming
|
|
existing patterns and combining them into more complex patterns that match your
|
|
fields.
|
|
|
|
[[grok-syntax]]
|
|
==== Grok patterns
|
|
The {stack} ships with numerous https://github.com/elastic/elasticsearch/blob/master/libs/grok/src/main/resources/patterns/grok-patterns[predefined grok patterns] that simplify working with grok. The syntax for reusing grok patterns
|
|
takes one of the following forms:
|
|
|
|
[%autowidth]
|
|
|===
|
|
|`%{SYNTAX}` | `%{SYNTAX:ID}` |`%{SYNTAX:ID:TYPE}`
|
|
|===
|
|
|
|
`SYNTAX`::
|
|
The name of the pattern that will match your text. For example, `NUMBER` and
|
|
`IP` are both patterns that are provided within the default patterns set. The
|
|
`NUMBER` pattern matches data like `3.44`, and the `IP` pattern matches data
|
|
like `55.3.244.1`.
|
|
|
|
`ID`::
|
|
The identifier you give to the piece of text being matched. For example, `3.44`
|
|
could be the duration of an event, so you might call it `duration`. The string
|
|
`55.3.244.1` might identify the `client` making a request.
|
|
|
|
`TYPE`::
|
|
The data type you want to cast your named field. `int`, `long`, `double`,
|
|
`float` and `boolean` are supported types.
|
|
|
|
For example, let's say you have message data that looks like this:
|
|
|
|
[source,txt]
|
|
----
|
|
3.44 55.3.244.1
|
|
----
|
|
|
|
The first value is a number, followed by what appears to be an IP address. You
|
|
can match this text by using the following grok expression:
|
|
|
|
[source,txt]
|
|
----
|
|
%{NUMBER:duration} %{IP:client}
|
|
----
|
|
|
|
[[grok-patterns]]
|
|
==== Use grok patterns in Painless scripts
|
|
You can incorporate predefined grok patterns into Painless scripts to extract
|
|
data. To test your script, use either the {painless}/painless-execute-api.html#painless-execute-runtime-field-context[field contexts] of the Painless
|
|
execute API or create a runtime field that includes the script. Runtime fields
|
|
offer greater flexibility and accept multiple documents, but the Painless
|
|
execute API is a great option if you don't have write access on a cluster
|
|
where you're testing a script.
|
|
|
|
TIP: If you need help building grok patterns to match your data, use the
|
|
{kibana-ref}/xpack-grokdebugger.html[Grok Debugger] tool in {kib}.
|
|
|
|
For example, if you're working with Apache log data, you can use the
|
|
`%{COMMONAPACHELOG}` syntax, which understands the structure of Apache logs. A
|
|
sample document might look like this:
|
|
|
|
// Note to contributors that the line break in the following example is
|
|
// intentional to promote better readability in the output
|
|
[source,js]
|
|
----
|
|
"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - -
|
|
[30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
|
|
----
|
|
// NOTCONSOLE
|
|
|
|
To extract the IP address from the `message` field, you can write a Painless
|
|
script that incorporates the `%{COMMONAPACHELOG}` syntax. You can test this
|
|
script using the {painless}/painless-execute-api.html#painless-runtime-ip[`ip` field context] of the Painless execute API, but let's use a runtime field
|
|
instead.
|
|
|
|
Based on the sample document, index the `@timestamp` and `message` fields. To
|
|
remain flexible, use `wildcard` as the field type for `message`:
|
|
|
|
[source,console]
|
|
----
|
|
PUT /my-index/
|
|
{
|
|
"mappings": {
|
|
"properties": {
|
|
"@timestamp": {
|
|
"format": "strict_date_optional_time||epoch_second",
|
|
"type": "date"
|
|
},
|
|
"message": {
|
|
"type": "wildcard"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
----
|
|
|
|
Next, use the <<docs-bulk,bulk API>> to index some log data into
|
|
`my-index`.
|
|
|
|
[source,console]
|
|
----
|
|
POST /my-index/_bulk?refresh
|
|
{"index":{}}
|
|
{"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
|
|
{"index":{}}
|
|
{"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
|
|
{"index":{}}
|
|
{"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
|
|
{"index":{}}
|
|
{"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
|
|
{"index":{}}
|
|
{"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
|
|
{"index":{}}
|
|
{"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
|
|
{"index":{}}
|
|
{"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}
|
|
----
|
|
// TEST[continued]
|
|
|
|
[[grok-patterns-runtime]]
|
|
==== Incorporate grok patterns and scripts in runtime fields
|
|
Now you can define a runtime field in the mappings that includes your Painless
|
|
script and grok pattern. If the pattern matches, the script emits the value of
|
|
the matching IP address. If the pattern doesn't match (`clientip != null`), the
|
|
script just returns the field value without crashing.
|
|
|
|
[source,console]
|
|
----
|
|
PUT my-index/_mappings
|
|
{
|
|
"runtime": {
|
|
"http.clientip": {
|
|
"type": "ip",
|
|
"script": """
|
|
String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
|
|
if (clientip != null) emit(clientip);
|
|
"""
|
|
}
|
|
}
|
|
}
|
|
----
|
|
// TEST[continued]
|
|
|
|
Alternatively, you can define the same runtime field but in the context of a
|
|
search request. The runtime definition and the script are exactly the same as
|
|
the one defined previously in the index mapping. Just copy that definition into
|
|
the search request under the `runtime_mappings` section and include a query
|
|
that matches on the runtime field. This query returns the same results as if
|
|
you <<grok-pattern-results,defined a search query>> for the `http.clientip`
|
|
runtime field in your index mappings, but only in the context of this specific
|
|
search:
|
|
|
|
[source,console]
|
|
----
|
|
GET my-index/_search
|
|
{
|
|
"runtime_mappings": {
|
|
"http.clientip": {
|
|
"type": "ip",
|
|
"script": """
|
|
String clientip=grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
|
|
if (clientip != null) emit(clientip);
|
|
"""
|
|
}
|
|
},
|
|
"query": {
|
|
"match": {
|
|
"http.clientip": "40.135.0.0"
|
|
}
|
|
},
|
|
"fields" : ["http.clientip"]
|
|
}
|
|
----
|
|
// TEST[continued]
|
|
|
|
[[grok-pattern-results]]
|
|
==== Return calculated results
|
|
Using the `http.clientip` runtime field, you can define a simple query to run a
|
|
search for a specific IP address and return all related fields. The <<search-fields,`fields`>> parameter on the `_search` API works for all fields,
|
|
even those that weren't sent as part of the original `_source`:
|
|
|
|
[source,console]
|
|
----
|
|
GET my-index/_search
|
|
{
|
|
"query": {
|
|
"match": {
|
|
"http.clientip": "40.135.0.0"
|
|
}
|
|
},
|
|
"fields" : ["http.clientip"]
|
|
}
|
|
----
|
|
// TEST[continued]
|
|
// TEST[s/_search/_search\?filter_path=hits/]
|
|
|
|
The response includes the specific IP address indicated in your search query.
|
|
The grok pattern within the Painless script extracted this value from the
|
|
`message` field at runtime.
|
|
|
|
[source,console-result]
|
|
----
|
|
{
|
|
"hits" : {
|
|
"total" : {
|
|
"value" : 1,
|
|
"relation" : "eq"
|
|
},
|
|
"max_score" : 1.0,
|
|
"hits" : [
|
|
{
|
|
"_index" : "my-index",
|
|
"_id" : "1iN2a3kBw4xTzEDqyYE0",
|
|
"_score" : 1.0,
|
|
"_source" : {
|
|
"timestamp" : "2020-04-30T14:30:17-05:00",
|
|
"message" : "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
|
|
},
|
|
"fields" : {
|
|
"http.clientip" : [
|
|
"40.135.0.0"
|
|
]
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
----
|
|
// TESTRESPONSE[s/"_id" : "1iN2a3kBw4xTzEDqyYE0"/"_id": $body.hits.hits.0._id/]
|