Mirror of https://github.com/elastic/elasticsearch.git, synced 2025-06-28 17:34:17 -04:00
[ML] [DOCS] update find-structure reference docs (#67586)
The text structure finder API documentation had many references to "files". While files are one use of the API, the API now has a more generic name. This commit replaces many references to the word "file" with the more generic word "text".
This commit is contained in:
parent
a794743d43
commit
24ebcc8c24
5 changed files with 62 additions and 63 deletions
@@ -3,7 +3,7 @@
 [[find-structure]]
 = Find structure API
 
-Finds the structure of a text file. The text file must
+Finds the structure of text. The text must
 contain data that is suitable to be ingested into the
 {stack}.
 
@@ -30,25 +30,24 @@ is suitable for subsequent use with other {stack} functionality.
 
 Unlike other {es} endpoints, the data that is posted to this endpoint does not
 need to be UTF-8 encoded and in JSON format. It must, however, be text; binary
-file formats are not currently supported.
+text formats are not currently supported.
 
 The response from the API contains:
 
-* A couple of messages from the beginning of the file.
+* A couple of messages from the beginning of the text.
 * Statistics that reveal the most common values for all fields detected within
-the file and basic numeric statistics for numeric fields.
-* Information about the structure of the file, which is useful when you write
-ingest configurations to index the file contents.
-* Appropriate mappings for an {es} index, which you could use to ingest the file
-contents.
+the text and basic numeric statistics for numeric fields.
+* Information about the structure of the text, which is useful when you write
+ingest configurations to index it or similarly formatted text.
+* Appropriate mappings for an {es} index, which you could use to ingest the text.
 
 All this information can be calculated by the structure finder with no guidance.
-However, you can optionally override some of the decisions about the file
+However, you can optionally override some of the decisions about the text
 structure by specifying one or more query parameters.
 
 Details of the output can be seen in the <<find-structure-examples,examples>>.
 
-If the structure finder produces unexpected results for a particular file,
+If the structure finder produces unexpected results for some text,
 specify the `explain` query parameter. It causes an `explanation` to appear in
 the response, which should help in determining why the returned structure was
 chosen.
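For context beyond the diff itself: below is a minimal sketch of calling this API with curl, assuming a local unsecured cluster on localhost:9200 and the `_text_structure/find_structure` path the renamed API lives at (adjust for your version and security setup). The two records mirror the NDJSON books example further down in these docs.

[source,sh]
----
# Post a small NDJSON sample and pretty-print the detected structure.
curl -s -H "Content-Type: application/json" -X POST \
  "localhost:9200/_text_structure/find_structure?pretty" \
  --data-binary @- <<'EOF'
{"name": "Leviathan Wakes", "author": "James S.A. Corey", "release_date": "2011-06-02", "page_count": 561}
{"name": "Hyperion", "author": "Dan Simmons", "release_date": "1989-05-26", "page_count": 482}
EOF
----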
@@ -58,7 +57,7 @@ chosen.
 == {api-query-parms-title}
 
 `charset`::
-(Optional, string) The file's character set. It must be a character set that is
+(Optional, string) The text's character set. It must be a character set that is
 supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`,
 `windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure
 finder chooses an appropriate character set.
@@ -66,8 +65,8 @@ finder chooses an appropriate character set.
 `column_names`::
 (Optional, string) If you have set `format` to `delimited`, you can specify the
 column names in a comma-separated list. If this parameter is not specified, the
-structure finder uses the column names from the header row of the file. If the
-file does not have a header row, columns are named "column1", "column2",
+structure finder uses the column names from the header row of the text. If the
+text does not have a header row, columns are named "column1", "column2",
 "column3", etc.
 
 `delimiter`::
@@ -85,7 +84,7 @@ field named `explanation`, which is an array of strings that indicate how the
 structure finder produced its result. The default value is `false`.
 
 `format`::
-(Optional, string) The high level structure of the file. Valid values are
+(Optional, string) The high level structure of the text. Valid values are
 `ndjson`, `xml`, `delimited`, and `semi_structured_text`. By default, the API
 chooses the format. In this default scenario, all rows must have the same number
 of fields for a delimited format to be detected. If the `format` is set to
@@ -95,7 +94,7 @@ of rows that have a different number of columns than the first row.
 `grok_pattern`::
 (Optional, string) If you have set `format` to `semi_structured_text`, you can
 specify a Grok pattern that is used to extract fields from every message in the
-file. The name of the timestamp field in the Grok pattern must match what is
+text. The name of the timestamp field in the Grok pattern must match what is
 specified in the `timestamp_field` parameter. If that parameter is not
 specified, the name of the timestamp field in the Grok pattern must match
 "timestamp". If `grok_pattern` is not specified, the structure finder creates a
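A sketch of the naming rule above, with a hypothetical pattern and log file: the timestamp extraction is named `when`, so `timestamp_field` is set to `when` to match. The pattern is percent-encoded with jq before being placed in the query string.

[source,sh]
----
# Hypothetical semi-structured log lines like: 2021-01-14T12:01:05 INFO starting up
GROK='%{TIMESTAMP_ISO8601:when} %{LOGLEVEL:level} %{GREEDYDATA:message}'
curl -s -H "Content-Type: application/json" -X POST \
  "localhost:9200/_text_structure/find_structure?pretty&format=semi_structured_text&timestamp_field=when&grok_pattern=$(jq -rn --arg g "$GROK" '$g|@uri')" \
  --data-binary @app.log
----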
@@ -103,30 +102,30 @@ Grok pattern.
 
 `has_header_row`::
 (Optional, Boolean) If you have set `format` to `delimited`, you can use this
-parameter to indicate whether the column names are in the first row of the file.
+parameter to indicate whether the column names are in the first row of the text.
 If this parameter is not specified, the structure finder guesses based on the
-similarity of the first row of the file to other rows.
+similarity of the first row of the text to other rows.
 
 `line_merge_size_limit`::
 (Optional, unsigned integer) The maximum number of characters in a message when
-lines are merged to form messages while analyzing semi-structured files. The
+lines are merged to form messages while analyzing semi-structured text. The
 default is `10000`. If you have extremely long messages you may need to increase
 this, but be aware that this may lead to very long processing times if the way
 to group lines into messages is misdetected.
 
 `lines_to_sample`::
 (Optional, unsigned integer) The number of lines to include in the structural
-analysis, starting from the beginning of the file. The minimum is 2; the default
+analysis, starting from the beginning of the text. The minimum is 2; the default
 is `1000`. If the value of this parameter is greater than the number of lines in
-the file, the analysis proceeds (as long as there are at least two lines in the
-file) for all of the lines.
+the text, the analysis proceeds (as long as there are at least two lines in the
+text) for all of the lines.
 +
 --
 NOTE: The number of lines and the variation of the lines affects the speed of
-the analysis. For example, if you upload a log file where the first 1000 lines
+the analysis. For example, if you upload text where the first 1000 lines
 are all variations on the same message, the analysis will find more commonality
 than would be seen with a bigger sample. If possible, however, it is more
-efficient to upload a sample file with more variety in the first 1000 lines than
+efficient to upload sample text with more variety in the first 1000 lines than
 to request analysis of 100000 lines to achieve some variety.
 
 --
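A sketch combining several of the delimited-format overrides above (the column names and sample file are hypothetical; `%3B` is a URL-escaped semicolon):

[source,sh]
----
# Headerless, semicolon-delimited text with explicit column names
# and a larger-than-default sample.
curl -s -H "Content-Type: application/json" -X POST \
  "localhost:9200/_text_structure/find_structure?pretty&format=delimited&delimiter=%3B&has_header_row=false&column_names=id,name,price&lines_to_sample=5000" \
  --data-binary @products.csv
----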
@@ -135,7 +134,7 @@ to request analysis of 100000 lines to achieve some variety.
 (Optional, string) If you have set `format` to `delimited`, you can specify the
 character used to quote the values in each row if they contain newlines or the
 delimiter character. Only a single character is supported. If this parameter is
-not specified, the default value is a double quote (`"`). If your delimited file
+not specified, the default value is a double quote (`"`). If your delimited text
 format does not use quoting, a workaround is to set this argument to a character
 that does not appear anywhere in the sample.
 
@@ -152,25 +151,25 @@ expires then it will be aborted. The default value is 25 seconds.
 
 `timestamp_field`::
 (Optional, string) The name of the field that contains the primary timestamp of
-each record in the file. In particular, if the file were ingested into an index,
+each record in the text. In particular, if the text were ingested into an index,
 this is the field that would be used to populate the `@timestamp` field.
 +
 --
 If the `format` is `semi_structured_text`, this field must match the name of the
 appropriate extraction in the `grok_pattern`. Therefore, for semi-structured
-file formats, it is best not to specify this parameter unless `grok_pattern` is
+text, it is best not to specify this parameter unless `grok_pattern` is
 also specified.
 
-For structured file formats, if you specify this parameter, the field must exist
-within the file.
+For structured text, if you specify this parameter, the field must exist
+within the text.
 
 If this parameter is not specified, the structure finder makes a decision about
-which field (if any) is the primary timestamp field. For structured file
-formats, it is not compulsory to have a timestamp in the file.
+which field (if any) is the primary timestamp field. For structured text,
+it is not compulsory to have a timestamp in the text.
 --
 
 `timestamp_format`::
-(Optional, string) The Java time format of the timestamp field in the file.
+(Optional, string) The Java time format of the timestamp field in the text.
 +
 --
 Only a subset of Java time format letter groups are supported:
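A sketch combining the two timestamp overrides for delimited text. The field name is borrowed from the CSV example later in these docs; the format value is hypothetical and must be URL-escaped (`%20` for the space, `%3A` for colons).

[source,sh]
----
# Tell the finder which column holds the primary timestamp and how to parse it.
curl -s -H "Content-Type: application/json" -X POST \
  "localhost:9200/_text_structure/find_structure?pretty&format=delimited&timestamp_field=tpep_pickup_datetime&timestamp_format=yyyy-MM-dd%20HH%3Amm%3Ass" \
  --data-binary @trips.csv
----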
@@ -203,7 +202,7 @@ quotes. For example, `MM/dd HH.mm.ss,SSSSSS 'in' yyyy` is a valid override
 format.
 
 One valuable use case for this parameter is when the format is semi-structured
-text, there are multiple timestamp formats in the file, and you know which
+text, there are multiple timestamp formats in the text, and you know which
 format corresponds to the primary timestamp, but you do not want to specify the
 full `grok_pattern`. Another is when the timestamp format is one that the
 structure finder does not consider by default.
@@ -231,7 +230,7 @@ for more information about date and time format syntax.
 [[find-structure-request-body]]
 == {api-request-body-title}
 
-The text file that you want to analyze. It must contain data that is suitable to
+The text that you want to analyze. It must contain data that is suitable to
 be ingested into {es}. It does not need to be in JSON format and it does not
 need to be UTF-8 encoded. The size is limited to the {es} HTTP receive buffer
 size, which defaults to 100 Mb.
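Because the body is the raw text itself, a common pattern is to stream a file straight from disk with curl's `-T`; the 100 Mb cap presumably corresponds to the `http.max_content_length` setting, whose default is also 100mb. The file name here is hypothetical.

[source,sh]
----
# -T streams the file as the request body; -X POST overrides curl's default PUT.
curl -s -H "Content-Type: application/json" -X POST \
  "localhost:9200/_text_structure/find_structure?pretty" \
  -T access.log
----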
@@ -244,7 +243,7 @@ size, which defaults to 100 Mb.
 [[find-structure-example-nld-json]]
 === Ingesting newline-delimited JSON
 
-Suppose you have a newline-delimited JSON file that contains information about
+Suppose you have newline-delimited JSON text that contains information about
 some books. You can send the contents to the `find_structure` endpoint:
 
 [source,console]
@@ -317,7 +316,7 @@ If the request does not encounter errors, you receive the following result:
         }
       },
       "ingest_pipeline" : {
-        "description" : "Ingest pipeline created by file structure finder",
+        "description" : "Ingest pipeline created by text structure finder",
         "processors" : [
          {
            "date" : {
@@ -525,18 +524,18 @@ If the request does not encounter errors, you receive the following result:
 }
 ----
 // TESTRESPONSE[s/"sample_start" : ".*",/"sample_start" : "$body.sample_start",/]
-// The substitution is because the "file" is pre-processed by the test harness,
+// The substitution is because the text is pre-processed by the test harness,
 // so the fields may get reordered in the JSON the endpoint sees
 
-<1> `num_lines_analyzed` indicates how many lines of the file were analyzed.
+<1> `num_lines_analyzed` indicates how many lines of the text were analyzed.
 <2> `num_messages_analyzed` indicates how many distinct messages the lines
 contained. For NDJSON, this value is the same as `num_lines_analyzed`. For other
-file formats, messages can span several lines.
-<3> `sample_start` reproduces the first two messages in the file verbatim. This
-may help diagnose parse errors or accidental uploads of the wrong file.
-<4> `charset` indicates the character encoding used to parse the file.
+text formats, messages can span several lines.
+<3> `sample_start` reproduces the first two messages in the text verbatim. This
+may help diagnose parse errors or accidental uploads of the wrong text.
+<4> `charset` indicates the character encoding used to parse the text.
 <5> For UTF character encodings, `has_byte_order_marker` indicates whether the
-file begins with a byte order marker.
+text begins with a byte order marker.
 <6> `format` is one of `ndjson`, `xml`, `delimited` or `semi_structured_text`.
 <7> The `timestamp_field` names the field considered most likely to be the
 primary timestamp of each document.
@@ -544,7 +543,7 @@ primary timestamp of each document.
 <9> `java_timestamp_formats` are the Java time formats recognized in the time
 fields. {es} mappings and ingest pipelines use this format.
 <10> If a timestamp format is detected that does not include a timezone,
-`need_client_timezone` will be `true`. The server that parses the file must
+`need_client_timezone` will be `true`. The server that parses the text must
 therefore be told the correct timezone by the client.
 <11> `mappings` contains some suitable mappings for an index into which the data
 could be ingested. In this case, the `release_date` field has been given a
@@ -683,7 +682,7 @@ If the request does not encounter errors, you receive the following result:
         }
       },
       "ingest_pipeline" : {
-        "description" : "Ingest pipeline created by file structure finder",
+        "description" : "Ingest pipeline created by text structure finder",
         "processors" : [
          {
            "csv" : {
@@ -1463,10 +1462,10 @@ lists the column names in the order they appear in the sample.
 <4> `has_header_row` indicates that for this sample the column names were in
 the first row of the sample. (If they hadn't been then it would have been a good
 idea to specify them in the `column_names` query parameter.)
-<5> The `delimiter` for this sample is a comma, as it's a CSV file.
+<5> The `delimiter` for this sample is a comma, as it's CSV formatted text.
 <6> The `quote` character is the default double quote. (The structure finder
-does not attempt to deduce any other quote character, so if you have a delimited
-file that's quoted with some other character you must specify it using the
+does not attempt to deduce any other quote character, so if you have delimited
+text that's quoted with some other character you must specify it using the
 `quote` query parameter.)
 <7> The `timestamp_field` has been chosen to be `tpep_pickup_datetime`.
 `tpep_dropoff_datetime` would work just as well, but `tpep_pickup_datetime` was
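A sketch of overriding the quote character for delimited text that uses single quotes (`%27` is the URL-escaped single quote; the sample file is hypothetical):

[source,sh]
----
curl -s -H "Content-Type: application/json" -X POST \
  "localhost:9200/_text_structure/find_structure?pretty&format=delimited&quote=%27" \
  --data-binary @quoted.csv
----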
@@ -1577,7 +1576,7 @@ this:
         }
       },
       "ingest_pipeline" : {
-        "description" : "Ingest pipeline created by file structure finder",
+        "description" : "Ingest pipeline created by text structure finder",
         "processors" : [
          {
            "grok" : {
@@ -1693,7 +1692,7 @@ calculate `field_stats` for your additional fields.
 
 In the case of the {es} log a more complete Grok pattern is
 `\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{LOGLEVEL:loglevel} *\]\[%{JAVACLASS:class} *\] \[%{HOSTNAME:node}\] %{JAVALOGMESSAGE:message}`.
-You can analyze the same log file again, submitting this `grok_pattern` as a
+You can analyze the same text again, submitting this `grok_pattern` as a
 query parameter (appropriately URL escaped):
 
 [source,js]
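One way to handle that URL escaping is to let jq percent-encode the pattern from the paragraph above before building the query string; the log file name is hypothetical.

[source,sh]
----
GROK='\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{LOGLEVEL:loglevel} *\]\[%{JAVACLASS:class} *\] \[%{HOSTNAME:node}\] %{JAVALOGMESSAGE:message}'
curl -s -H "Content-Type: application/json" -X POST \
  "localhost:9200/_text_structure/find_structure?pretty&format=semi_structured_text&grok_pattern=$(jq -rn --arg g "$GROK" '$g|@uri')" \
  --data-binary @elasticsearch.log
----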
@@ -1745,7 +1744,7 @@ this:
         }
       },
       "ingest_pipeline" : {
-        "description" : "Ingest pipeline created by file structure finder",
+        "description" : "Ingest pipeline created by text structure finder",
         "processors" : [
          {
            "grok" : {