# ICU tokenizer [analysis-icu-tokenizer]
Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the `standard` tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and by using custom rules to break Myanmar and Khmer text into syllables.
```console
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
```
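To see the dictionary-based word segmentation in action, you can run an `_analyze` request with the analyzer defined above against text in an unspaced script. This is a minimal sketch assuming the `analysis-icu` plugin is installed and the `icu_sample` index from the previous snippet exists; the exact tokens returned depend on the ICU dictionaries bundled with your version.

```console
GET icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "สวัสดีครับ"
}
```

Rather than emitting one token per character, the ICU tokenizer consults its built-in dictionary and returns word-level tokens for Thai, Lao, Chinese, Japanese, and Korean text.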
## Rules customization [_rules_customization]
::::{warning}
This functionality is marked as experimental in Lucene.
::::
You can customize the `icu_tokenizer` behavior by specifying per-script rule files. See the RBBI rules syntax reference for a more detailed explanation.

To add ICU tokenizer rules, set the `rule_files` setting, which should contain a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a rule file name. Rule files are placed in the `ES_HOME/config` directory.
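For example, a configuration covering both Latin and Cyrillic text might look like the following sketch. The index name and the Cyrillic rule file name are hypothetical and only illustrate the comma-separated form of the setting; each referenced file must exist in `ES_HOME/config`.

```console
PUT icu_multi_script_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_multi_script": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi,Cyrl:CyrillicRules.rbbi"
          }
        }
      }
    }
  }
}
```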
As a demonstration of how the rule files can be used, save the following user file to `$ES_HOME/config/KeywordTokenizer.rbbi`:

```text
.+ {200};
```
Then create an analyzer to use this rule file as follows:
```console
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}
```
The above `analyze` request returns the following:
```console-result
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```
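For comparison, running the same text through the default `standard` tokenizer (an additional request not part of the original example) splits on the space and strips the punctuation, which shows what the `.+ {200};` rule is overriding:

```console
GET icu_sample/_analyze
{
  "tokenizer": "standard",
  "text": "Elasticsearch. Wow!"
}
```

This should return two tokens, `Elasticsearch` and `Wow`, instead of the single keyword-style token produced by the custom rule file.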