---
navigation_title: Pattern
---
# Pattern analyzer [analysis-pattern-analyzer]
The `pattern` analyzer uses a regular expression to split the text into terms. The regular expression should match the **token separators**, not the tokens themselves. The regular expression defaults to `\W+` (or all non-word characters).
::::{admonition} Beware of Pathological Regular Expressions
:class: warning

The pattern analyzer uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

Read more about pathological regular expressions and how to avoid them.
::::
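For illustration, a minimal sketch of such a pattern (this example is ours, not from the page above): nested quantifiers make the engine retry exponentially many ways of partitioning the input when the overall match fails.

```text
(a+)+b    # nested quantifier: on an input of many "a"s with no trailing "b",
          # each additional "a" roughly doubles the backtracking work
```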
## Example output [_example_output_3]

```console
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```
The above sentence would produce the following terms:
```text
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
```
## Configuration [_configuration_4]
The `pattern` analyzer accepts the following parameters:

`pattern`
:   A Java regular expression, defaults to `\W+`.

`flags`
:   Java regular expression flags. Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`.

`lowercase`
:   Should terms be lowercased or not. Defaults to `true`.

`stopwords`
:   A pre-defined stop words list like `_english_` or an array containing a list of stop words. Defaults to `_none_`.

`stopwords_path`
:   The path to a file containing stop words.

See the Stop Token Filter for more information about stop word configuration.
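To see how these parameters fit together, here is a sketch (the analyzer name `my_pattern_analyzer` is a placeholder of ours, not from the original examples) that sets `pattern`, `flags`, `lowercase`, and `stopwords` in one analyzer:

```console
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "pattern": "\\W+",
          "flags": "CASE_INSENSITIVE|COMMENTS",
          "lowercase": true,
          "stopwords": "_english_"
        }
      }
    }
  }
}
```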
## Example configuration [_example_configuration_3]
In this example, we configure the `pattern` analyzer to split email addresses on non-word characters or on underscores (`\W|_`), and to lower-case the result:
```console
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_", <1>
          "lowercase": true
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
```
1. The backslashes in the pattern need to be escaped when specifying the pattern as a JSON string.
The above example produces the following terms:
```text
[ john, smith, foo, bar, com ]
```
## CamelCase tokenizer [_camelcase_tokenizer]
The following more complicated example splits CamelCase text into tokens:
```console
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my-index-000001/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
```
The above example produces the following terms:
```text
[ moose, x, ftp, class, 2, beta ]
```
The regex above is easier to understand as:
```text
  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
```
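As a reading aid (this trace is ours, derived from the pattern above), each split in the sample input `MooseX::FTPClass2_beta` comes from one branch of the alternation:

```text
Moose|X        lower case followed by upper case
X|::|FTP       "::" swallowed as non letters and numbers
FTP|Class      upper case followed by upper case then lower case
Class|2        non-number followed by number
2|_|beta       "_" swallowed as a non letter or number
```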
## Definition [_definition_3]
The `pattern` analyzer consists of:

Tokenizer
:   Pattern Tokenizer

Token Filters
:   Lower Case Token Filter
:   Stop Token Filter (disabled by default)
If you need to customize the `pattern` analyzer beyond the configuration parameters then you need to recreate it as a `custom` analyzer and modify it, usually by adding token filters. This would recreate the built-in `pattern` analyzer and you can use it as a starting point for further customization:
```console
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+" <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase" <2>
          ]
        }
      }
    }
  }
}
```
1. The default pattern is `\W+` which splits on non-word characters and this is where you’d change it.
2. You’d add other token filters after `lowercase`.
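As a quick check (this request is our sketch, not part of the original page), you can run the rebuilt analyzer against the sample sentence from the first example; it should produce the same terms as the built-in `pattern` analyzer:

```console
POST /pattern_example/_analyze
{
  "analyzer": "rebuilt_pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```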