mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 07:37:19 -04:00
* Extract AbstractFindStructureRequest * Extract FindStructureResponse * Extract RestFindStructureRequestParser * FindFieldStructure endpoint * FindMessageStructure endpoint * Improve FindTextStructureResponseTests * REST API spec + YAML REST tests * Lint fixes * Remove POST find_field_structure * Update docs/changelog/105660.yaml * Update changelog * Fix text_structure.find_field_structure.json * Fix find_field_structure yaml rest test * Fix FindTextStructureResponseTests * Fix YAML tests with security * Remove unreachable code * DelimitedTextStructureFinder::createFromMessages * NdJsonTextStructureFinderFactory::createFromMessages * XmlTextStructureFinderFactory::createFromMessages * LogTextStructureFinderFactory::createFromMessages * Lint fixes * Add createFromMessages to TextStructureFinderFactory interface * Wire createFromMessages in the endpoints * Uppercase UTF-8 * REST test for semi-structured messages * Restrict query params to applicable endpoints * typo * Polish thread scheduling * Propagate parent task in search request * No header row for find message/field structure * Expose findTextStructure more consistently * Move text structure query params to shared doc * Rename "find structure API" -> "find text structure API" * Find message structure API docs * Find field structure docs * Maybe fix docs error? * bugfix * Fix docs? * Fix find-field-structure test from docs * Improve docs * Add param documents_to_sample to docs * improve docs
215 lines
8.9 KiB
Text
215 lines
8.9 KiB
Text
tag::param-charset[]
|
|
`charset`::
|
|
(Optional, string) The text's character set. It must be a character set that is
|
|
supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`,
|
|
`windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure
|
|
finder chooses an appropriate character set.
|
|
end::param-charset[]
|
|
|
|
tag::param-column-names[]
|
|
`column_names`::
|
|
(Optional, string) If you have set `format` to `delimited`, you can specify the
|
|
column names in a comma-separated list. If this parameter is not specified, the
|
|
structure finder uses the column names from the header row of the text. If the
|
|
text does not have a header row, columns are named "column1", "column2",
|
|
"column3", etc.
|
|
end::param-column-names[]
|
|
|
|
tag::param-delimiter[]
|
|
`delimiter`::
|
|
(Optional, string) If you have set `format` to `delimited`, you can specify the
|
|
character used to delimit the values in each row. Only a single character is
|
|
supported; the delimiter cannot have multiple characters. By default, the API
|
|
considers the following possibilities: comma, tab, semi-colon, and pipe (`|`).
|
|
In this default scenario, all rows must have the same number of fields for the
|
|
delimited format to be detected. If you specify a delimiter, up to 10% of the
|
|
rows can have a different number of columns than the first row.
|
|
end::param-delimiter[]
|
|
|
|
tag::param-explain[]
|
|
`explain`::
|
|
(Optional, Boolean) If `true`, the response includes a
|
|
field named `explanation`, which is an array of strings that indicate how the
|
|
structure finder produced its result. The default value is `false`.
|
|
end::param-explain[]
|
|
|
|
tag::param-format[]
|
|
`format`::
|
|
(Optional, string) The high level structure of the text. Valid values are
|
|
`ndjson`, `xml`, `delimited`, and `semi_structured_text`. By default, the API
|
|
chooses the format. In this default scenario, all rows must have the same number
|
|
of fields for a delimited format to be detected. If the `format` is set to
|
|
`delimited` and the `delimiter` is not set, however, the API tolerates up to 5%
|
|
of rows that have a different number of columns than the first row.
|
|
end::param-format[]
|
|
|
|
tag::param-grok-pattern[]
|
|
`grok_pattern`::
|
|
(Optional, string) If you have set `format` to `semi_structured_text`, you can
|
|
specify a Grok pattern that is used to extract fields from every message in the
|
|
text. The name of the timestamp field in the Grok pattern must match what is
|
|
specified in the `timestamp_field` parameter. If that parameter is not
|
|
specified, the name of the timestamp field in the Grok pattern must match
|
|
"timestamp". If `grok_pattern` is not specified, the structure finder creates a
|
|
Grok pattern.
|
|
end::param-grok-pattern[]
|
|
|
|
tag::param-ecs-compatibility[]
|
|
`ecs_compatibility`::
|
|
(Optional, string) The mode of compatibility with ECS compliant Grok patterns.
|
|
Use this parameter to specify whether to use ECS Grok patterns instead of
|
|
legacy ones when the structure finder creates a Grok pattern. Valid values
|
|
are `disabled` and `v1`. The default value is `disabled`. This setting primarily
|
|
has an impact when a whole message Grok pattern such as `%{CATALINALOG}`
|
|
matches the input. If the structure finder identifies a common structure but
|
|
has no idea of meaning then generic field names such as `path`, `ipaddress`,
|
|
`field1` and `field2` are used in the `grok_pattern` output, with the intention
|
|
that a user who knows the meanings rename these fields before using it.
|
|
end::param-ecs-compatibility[]
|
|
|
|
tag::param-has-header-row[]
|
|
`has_header_row`::
|
|
(Optional, Boolean) If you have set `format` to `delimited`, you can use this
|
|
parameter to indicate whether the column names are in the first row of the text.
|
|
If this parameter is not specified, the structure finder guesses based on the
|
|
similarity of the first row of the text to other rows.
|
|
end::param-has-header-row[]
|
|
|
|
tag::param-line-merge-size-limit[]
|
|
`line_merge_size_limit`::
|
|
(Optional, unsigned integer) The maximum number of characters in a message when
|
|
lines are merged to form messages while analyzing semi-structured text. The
|
|
default is `10000`. If you have extremely long messages you may need to increase
|
|
this, but be aware that this may lead to very long processing times if the way
|
|
to group lines into messages is misdetected.
|
|
end::param-line-merge-size-limit[]
|
|
|
|
tag::param-lines-to-sample[]
|
|
`lines_to_sample`::
|
|
(Optional, unsigned integer) The number of lines to include in the structural
|
|
analysis, starting from the beginning of the text. The minimum is 2; the default
|
|
is `1000`. If the value of this parameter is greater than the number of lines in
|
|
the text, the analysis proceeds (as long as there are at least two lines in the
|
|
text) for all of the lines.
|
|
+
|
|
--
|
|
NOTE: The number of lines and the variation of the lines affects the speed of
|
|
the analysis. For example, if you upload text where the first 1000 lines
|
|
are all variations on the same message, the analysis will find more commonality
|
|
than would be seen with a bigger sample. If possible, however, it is more
|
|
efficient to upload sample text with more variety in the first 1000 lines than
|
|
to request analysis of 100000 lines to achieve some variety.
|
|
|
|
--
|
|
end::param-lines-to-sample[]
|
|
|
|
tag::param-quote[]
|
|
`quote`::
|
|
(Optional, string) If you have set `format` to `delimited`, you can specify the
|
|
character used to quote the values in each row if they contain newlines or the
|
|
delimiter character. Only a single character is supported. If this parameter is
|
|
not specified, the default value is a double quote (`"`). If your delimited text
|
|
format does not use quoting, a workaround is to set this argument to a character
|
|
that does not appear anywhere in the sample.
|
|
end::param-quote[]
|
|
|
|
tag::param-should-trim-fields[]
|
|
`should_trim_fields`::
|
|
(Optional, Boolean) If you have set `format` to `delimited`, you can specify
|
|
whether values between delimiters should have whitespace trimmed from them. If
|
|
this parameter is not specified and the delimiter is pipe (`|`), the default
|
|
value is `true`. Otherwise, the default value is `false`.
|
|
end::param-should-trim-fields[]
|
|
|
|
tag::param-timeout[]
|
|
`timeout`::
|
|
(Optional, <<time-units,time units>>) Sets the maximum amount of time that the
|
|
structure analysis may take. If the analysis is still running when the timeout
|
|
expires then it will be stopped. The default value is 25 seconds.
|
|
end::param-timeout[]
|
|
|
|
tag::param-timestamp-field[]
|
|
`timestamp_field`::
|
|
(Optional, string) The name of the field that contains the primary timestamp of
|
|
each record in the text. In particular, if the text were ingested into an index,
|
|
this is the field that would be used to populate the `@timestamp` field.
|
|
+
|
|
--
|
|
If the `format` is `semi_structured_text`, this field must match the name of the
|
|
appropriate extraction in the `grok_pattern`. Therefore, for semi-structured
|
|
text, it is best not to specify this parameter unless `grok_pattern` is
|
|
also specified.
|
|
|
|
For structured text, if you specify this parameter, the field must exist
|
|
within the text.
|
|
|
|
If this parameter is not specified, the structure finder makes a decision about
|
|
which field (if any) is the primary timestamp field. For structured text,
|
|
it is not compulsory to have a timestamp in the text.
|
|
--
|
|
end::param-timestamp-field[]
|
|
|
|
tag::param-timestamp-format[]
|
|
`timestamp_format`::
|
|
(Optional, string) The Java time format of the timestamp field in the text.
|
|
+
|
|
--
|
|
Only a subset of Java time format letter groups are supported:
|
|
|
|
* `a`
|
|
* `d`
|
|
* `dd`
|
|
* `EEE`
|
|
* `EEEE`
|
|
* `H`
|
|
* `HH`
|
|
* `h`
|
|
* `M`
|
|
* `MM`
|
|
* `MMM`
|
|
* `MMMM`
|
|
* `mm`
|
|
* `ss`
|
|
* `XX`
|
|
* `XXX`
|
|
* `yy`
|
|
* `yyyy`
|
|
* `zzz`
|
|
|
|
Additionally `S` letter groups (fractional seconds) of length one to nine are
|
|
supported providing they occur after `ss` and separated from the `ss` by a `.`,
|
|
`,` or `:`. Spacing and punctuation is also permitted with the exception of `?`,
|
|
newline and carriage return, together with literal text enclosed in single
|
|
quotes. For example, `MM/dd HH.mm.ss,SSSSSS 'in' yyyy` is a valid override
|
|
format.
|
|
|
|
One valuable use case for this parameter is when the format is semi-structured
|
|
text, there are multiple timestamp formats in the text, and you know which
|
|
format corresponds to the primary timestamp, but you do not want to specify the
|
|
full `grok_pattern`. Another is when the timestamp format is one that the
|
|
structure finder does not consider by default.
|
|
|
|
If this parameter is not specified, the structure finder chooses the best
|
|
format from a built-in set.
|
|
|
|
If the special value `null` is specified the structure finder will not look
|
|
for a primary timestamp in the text. When the format is semi-structured text
|
|
this will result in the structure finder treating the text as single-line
|
|
messages.
|
|
|
|
The following table provides the appropriate `timeformat` values for some example timestamps:
|
|
|
|
|===
|
|
| Timeformat | Presentation
|
|
|
|
| yyyy-MM-dd HH:mm:ssZ | 2019-04-20 13:15:22+0000
|
|
| EEE, d MMM yyyy HH:mm:ss Z | Sat, 20 Apr 2019 13:15:22 +0000
|
|
| dd.MM.yy HH:mm:ss.SSS | 20.04.19 13:15:22.285
|
|
|===
|
|
|
|
Refer to
|
|
https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html[the Java date/time format documentation]
|
|
for more information about date and time format syntax.
|
|
|
|
--
|
|
end::param-timestamp-format[]
|