logstash/docs/static/field-reference.asciidoc
Ry Biesemeyer d8454110ba
Field Reference: handle special characters (#14044)
* add failing tests for Event.new with fields that look like field references

* fix: correctly handle FieldReference-special characters in field names.

Keys passed to most methods of `ConvertedMap`, based on `IdentityHashMap`,
depend on identity and not equivalence, and therefore rely on the keys being
_interned_ strings. In order to avoid hitting the JVM's global String intern
pool (which can have performance problems), operations to normalize a string
to its interned counterpart have traditionally relied on the behaviour of
`FieldReference#from` returning a likely-cached `FieldReference`, that had
an interned `key` and an empty `path`.

This is problematic on two points.

First, when `ConvertedMap` was given data with keys that _were_ valid string
field references representing a nested field (such as `[host][geo][location]`),
the implementation of `ConvertedMap#put` effectively silently discarded the
path components because it assumed them to be empty, and only the key was
kept (`location`).

Second, when `ConvertedMap` was given a map whose keys contained what the
field reference parser considered special characters but _were NOT_
valid field references, the resulting `FieldReference.IllegalSyntaxException`
caused the operation to abort.

Instead of using the `FieldReference` cache, which sits on top of objects whose
`key` and `path`-components are known to have been interned, we introduce an
internment helper on our `ConvertedMap` that is also backed by the global string
intern pool, and ensure that our field references are primed through this pool.

In addition to fixing the `ConvertedMap#newFromMap` functionality, this has
three net effects:

 - Our ConvertedMap operations still use strings
   from the global intern pool
 - We have a new, smaller cache of individual field
   names, improving lookup performance
 - Our FieldReference cache no longer is flooded
   with fragments and therefore is more likely to
   remain performant

NOTE: this does NOT create isolated intern pools, as doing so would require
      a careful audit of the possible code-paths to `ConvertedMap#putInterned`.
      The new cache is limited to 10k strings, and when more are used only
      the FIRST 10k strings will be primed into the cache, leaving the
      remainder to always hit the global String intern pool.

NOTE: by fixing this bug, we allow events to be created whose fields _CANNOT_
      be referenced with the existing FieldReference implementation.

Resolves: https://github.com/elastic/logstash/issues/13606
Resolves: https://github.com/elastic/logstash/issues/11608

* field_reference: support escape sequences

Adds a `config.field_reference.escape_style` option and a companion
command-line flag `--field-reference-escape-style` allowing a user
to opt into one of two proposed escape-sequence implementations for field
reference parsing:

 - `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes
 - `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points

The default is `NONE`, which does _not_ process escape sequences.
With this setting a user effectively cannot reference a field whose name
contains FieldReference-reserved characters.

| ESCAPE STYLE | `[`     | `]`     |
| ------------ | ------- | ------- |
| `NONE`       | _N/A_   | _N/A_   |
| `PERCENT`    | `%5B`   | `%5D`   |
| `AMPERSAND`  | `&#91;` | `&#93;` |

* fixup: no need to double-escape HTML-ish escape sequences in docs

* Apply suggestions from code review

Co-authored-by: Karol Bucek <kares@users.noreply.github.com>

* field-reference: load escape style in runner

* docs: sentences over semicolons

* field-reference: faster shortcut for PERCENT escape mode

* field-reference: escape mode control downcase

* field_reference: more s/experimental/technical preview/

* field_reference: still more s/experimental/technical preview/

Co-authored-by: Karol Bucek <kares@users.noreply.github.com>
2022-05-24 07:48:47 -07:00


[role="exclude",id="field-references-deepdive"]
== Field References Deep Dive
It is often useful to be able to refer to a field or collection of fields by name. To do this,
you can use the Logstash field reference syntax.

The syntax to access a field specifies the entire path to the field, with each fragment wrapped in square brackets.
When a field name contains square brackets, they must be properly <<formal-grammar-escape-sequences, _escaped_>>.

_Field References_ can be expressed literally within <<conditionals,_Conditional_>> statements in your pipeline configurations,
as string arguments to your pipeline plugins, or within sprintf statements that will be used by your pipeline plugins:

[source,pipelineconf]
filter {
  #  +----literal----+     +----literal----+
  #  |               |     |               |
  if [@metadata][date] and [@metadata][time] {
    mutate {
      add_field => {
        "[@metadata][timestamp]" => "%{[@metadata][date]} %{[@metadata][time]}"
      # |                      |    |  |               |    |               | |
      # +----string-argument---+    |  +--field-ref----+    +--field-ref----+ |
      #                             +-------- sprintf format string ----------+
      }
    }
  }
}
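
The sprintf usage above can be illustrated with a minimal Ruby sketch. It assumes an event is modelled as a plain nested `Hash`. The `lookup` and `sprintf_sketch` helpers are hypothetical names introduced only for this illustration and are not part of the Logstash Event API. The real sprintf implementation supports more than bracketed field references.

```ruby
# Hypothetical helper (not the Logstash Event API): look up a bracketed
# field reference such as "[@metadata][date]" in a nested Hash.
def lookup(event, reference)
  names = reference.scan(/\[([^\[\]]+)\]/).flatten
  names.reduce(event) { |node, name| node.is_a?(Hash) ? node[name] : nil }
end

# Hypothetical helper: replace each %{[...]} placeholder in the format
# string with the value of the field it references.
def sprintf_sketch(event, format)
  format.gsub(/%\{((?:\[[^\[\]]+\])+)\}/) { lookup(event, Regexp.last_match(1)).to_s }
end

# Sample data invented for this illustration.
event = { "@metadata" => { "date" => "2024-01-31", "time" => "12:00:00" } }
puts sprintf_sketch(event, "%{[@metadata][date]} %{[@metadata][time]}")
# prints "2024-01-31 12:00:00"
```
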
[float]
[[formal-grammar]]
=== Formal Grammar
Below is the formal grammar of the Field Reference, with notes and examples.
[float]
[[formal-grammar-field-reference-literal]]
==== Field Reference Literal
A _Field Reference Literal_ is a sequence of one or more _Path Fragments_ that can be used directly in Logstash pipeline <<conditionals,conditionals>> without any additional quoting (e.g. `[request]`, `[response][status]`).
[source,antlr]
fieldReferenceLiteral
  : ( pathFragment )+
  ;

NOTE: In Logstash 7.x and earlier, a quoted value (such as `["foo"]`) is
considered a field reference and isn't treated as a single-element array. This
behavior might cause confusion in conditionals, such as `[message] in ["foo",
"bar"]` compared to `[message] in ["foo"]`. We discourage using names with
quotes, such as `"\"foo\""`, as this behavior might change in the future.
[float]
[[formal-grammar-field-reference]]
==== Field Reference (Event APIs)
The Event API's methods for manipulating the fields of an event or using the sprintf syntax are more flexible than the pipeline grammar in what they accept as a Field Reference.
Top-level fields can be referenced directly by their _Field Name_ without the square brackets, and there is some support for _Composite Field References_, simplifying use of programmatically-generated Field References.
A _Field Reference_ for use with the Event API is therefore one of:
- a single _Field Reference Literal_; OR
- a single _Field Name_ (referencing a top-level field); OR
- a single _Composite Field Reference_.
[source,antlr]
eventApiFieldReference
  : fieldReferenceLiteral
  | fieldName
  | compositeFieldReference
  ;
[float]
[[formal-grammar-path-fragment]]
==== Path Fragment
A _Path Fragment_ is a _Field Name_ wrapped in square brackets (e.g., `[request]`).
[source,antlr]
pathFragment
  : '[' fieldName ']'
  ;
[float]
[[formal-grammar-field-name]]
==== Field Name
A _Field Name_ is a sequence of characters that are _not_ square brackets (`[` or `]`).
[source,antlr]
fieldName
  : ( ~( '[' | ']' ) )+
  ;
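
Taken together, the _Field Reference Literal_, _Path Fragment_, and _Field Name_ rules can be sketched in a few lines of Ruby. The `field_names` helper and the `FIELD_REFERENCE_LITERAL` constant below are hypothetical names used only for illustration. They are not part of Logstash, and the sketch ignores escape sequences.

```ruby
# Hypothetical pattern for illustration only: one or more "[fieldName]"
# Path Fragments, where a Field Name contains no square brackets.
FIELD_REFERENCE_LITERAL = /\A(?:\[[^\[\]]+\])+\z/

# Hypothetical helper: split a Field Reference Literal into its Field Names,
# raising ArgumentError when the input does not match the grammar.
def field_names(literal)
  unless FIELD_REFERENCE_LITERAL.match?(literal)
    raise ArgumentError, "not a field reference literal: #{literal.inspect}"
  end
  literal.scan(/\[([^\[\]]+)\]/).flatten
end

p field_names("[request]")           # => ["request"]
p field_names("[response][status]")  # => ["response", "status"]
```
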
[float]
[[formal-grammar-event-api-composite-field-reference]]
==== Composite Field Reference
In some cases, it may be necessary to programmatically _compose_ a Field Reference from one or more Field References,
such as when manipulating fields in a plugin or while using the Ruby Filter plugin and the Event API.
[source,ruby]
fieldReference = "[path][to][deep nested field]"
compositeFieldReference = "[@metadata][#{fieldReference}][size]"
# => "[@metadata][[path][to][deep nested field]][size]"
// NOTE: table below uses "plus for passthrough" quoting to prevent double square-brackets
// from being interpreted as asciidoc anchors when converted to HTML.
[float]
===== Canonical Representations of Composite Field References
|===
| Acceptable _Composite Field Reference_ | Canonical _Field Reference_ Representation
| `+[[deep][nesting]][field]+` | `+[deep][nesting][field]+`
| `+[foo][[bar]][bingo]+` | `+[foo][bar][bingo]+`
| `+[[ok]]+` | `+[ok]+`
|===
A _Composite Field Reference_ is a sequence of one or more _Path Fragments_ or _Embedded Field References_.
[source,antlr]
compositeFieldReference
  : ( pathFragment | embeddedFieldReference )+
  ;
_Composite Field References_ are supported by the Event API, but are _not_ supported as literals in the Pipeline Configuration.
[float]
[[formal-grammar-event-api-embedded-field-reference]]
==== Embedded Field Reference
[source,antlr]
embeddedFieldReference
  : '[' fieldReference ']'
  ;
An _Embedded Field Reference_ is a _Field Reference_ that is itself wrapped in square brackets (`[` and `]`), and can be a component of a _Composite Field Reference_.
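
As a rough illustration of how a bracketed _Composite Field Reference_ maps to its canonical representation, the following Ruby sketch flattens nested brackets while rejecting input that does not fit the grammar. The `canonicalize` helper is a hypothetical name introduced here. It is not Logstash's parser, and it does not handle bare top-level _Field Names_ or escape sequences.

```ruby
# Hypothetical helper for illustration; this is not Logstash's parser.
# Flattens a Composite Field Reference into its canonical representation,
# raising ArgumentError when the input does not fit the grammar.
def canonicalize(reference)
  names = []
  name = +""
  depth = 0
  prev = nil
  reference.each_char do |char|
    case char
    when "["
      # A Field Name cannot contain "[" or be followed directly by one.
      raise ArgumentError, "unexpected '[' in #{reference.inspect}" unless name.empty?
      depth += 1
    when "]"
      # Reject an unbalanced "]" and empty fragments such as "[]".
      if depth.zero? || (name.empty? && prev == "[")
        raise ArgumentError, "unexpected ']' in #{reference.inspect}"
      end
      names << name unless name.empty?
      name = +""
      depth -= 1
    else
      # A Field Name may only start immediately after an opening bracket.
      if name.empty? && prev != "["
        raise ArgumentError, "unexpected #{char.inspect} in #{reference.inspect}"
      end
      name << char
    end
    prev = char
  end
  raise ArgumentError, "unbalanced brackets in #{reference.inspect}" unless depth.zero?
  raise ArgumentError, "empty field reference" if names.empty?
  names.map { |n| "[#{n}]" }.join
end

puts canonicalize("[[deep][nesting]][field]")  # prints [deep][nesting][field]
puts canonicalize("[foo][[bar]][bingo]")       # prints [foo][bar][bingo]
puts canonicalize("[[ok]]")                    # prints [ok]
```
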
[float]
[[formal-grammar-escape-sequences]]
=== Escape Sequences
For {ls} to reference a field whose name contains a character that has special meaning in the field reference grammar, the character must be escaped.
Logstash can be globally configured (through the `config.field_reference.escape_style` setting or the `--field-reference-escape-style` command-line flag) to use one of the following field reference escape modes:

- `none` (default): no escape sequence processing is done. Fields containing literal square brackets cannot be referenced by the Event API.
- `percent`: URI-style percent encoding of UTF-8 bytes. The left square bracket (`[`) is expressed as `%5B`, and the right square bracket (`]`) is expressed as `%5D`.
- `ampersand`: HTML-style ampersand encoding (`&#` + decimal Unicode code point + `;`). The left square bracket (`[`) is expressed as `&#91;`, and the right square bracket (`]`) is expressed as `&#93;`.
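
To build a field reference for a field whose name contains square brackets, the brackets in each _Field Name_ need to be escaped before the name is wrapped in a _Path Fragment_. The Ruby sketch below shows one way to do this for the two escape modes. The `PERCENT` and `AMPERSAND` tables and both helper methods are hypothetical names for illustration. They cover only the two square brackets and are not Logstash's implementation, which decodes escape sequences more generally.

```ruby
# Hypothetical escape tables for illustration; only the two square
# brackets are covered.
PERCENT   = { "[" => "%5B", "]" => "%5D" }.freeze
AMPERSAND = { "[" => "&#91;", "]" => "&#93;" }.freeze

# Hypothetical helper: escape square brackets in a single Field Name.
def escape_field_name(name, table)
  name.gsub(/[\[\]]/, table)
end

# Hypothetical helper: reverse escape_field_name for the same table.
def unescape_field_name(escaped, table)
  escaped.gsub(Regexp.union(table.values), table.invert)
end

name = "weird[field]"
puts "[#{escape_field_name(name, PERCENT)}]"    # prints [weird%5Bfield%5D]
puts "[#{escape_field_name(name, AMPERSAND)}]"  # prints [weird&#91;field&#93;]
```

The escaped reference is only meaningful when the matching escape mode is enabled. With the default `none` mode, the sequences are treated as ordinary characters of the field name.
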