elasticsearch/docs/reference/elasticsearch-plugins/analysis-icu-folding.md
Liam Thompson d90c6b76a2
[9.0] [docs] Prepare for docs-assembler (#125118) (#125339)
* [docs] Prepare for docs-assembler (#125118)

* reorg files for docs-assembler and create toc.yml files

* fix build error, add redirects

* only toc

* move images

(cherry picked from commit 9bcd59596d)

# Conflicts:
#	docs/reference/aggregations/search-aggregations-pipeline-bucket-script-aggregation.md
#	docs/reference/aggregations/search-aggregations-pipeline-cumulative-cardinality-aggregation.md
#	docs/reference/aggregations/search-aggregations-pipeline-cumulative-sum-aggregation.md
#	docs/reference/aggregations/search-aggregations-pipeline-derivative-aggregation.md
#	docs/reference/aggregations/search-aggregations-pipeline-extended-stats-bucket-aggregation.md
#	docs/reference/aggregations/search-aggregations-pipeline-max-bucket-aggregation.md
#	docs/reference/aggregations/search-aggregations-pipeline-min-bucket-aggregation.md
#	docs/reference/aggregations/search-aggregations-pipeline-percentiles-bucket-aggregation.md
#	docs/reference/aggregations/search-aggregations-pipeline-stats-bucket-aggregation.md
#	docs/reference/aggregations/search-aggregations-pipeline-sum-bucket-aggregation.md
#	docs/reference/query-languages/esql/esql-commands.md
#	docs/reference/query-languages/esql/esql-lookup-join.md
#	docs/reference/query-languages/esql/esql-process-data-with-dissect-grok.md
#	docs/reference/query-languages/images/esql-lookup-join.png
#	docs/reference/query-languages/toc.yml
#	docs/reference/search-connectors/es-connectors-run-from-docker.md
#	docs/reference/text-analysis/analysis-apostrophe-tokenfilter.md
#	docs/reference/toc.yml

* remove markers

---------

Co-authored-by: Colleen McGinnis <colleen.mcginnis@elastic.co>
2025-03-20 22:20:12 +02:00

1.7 KiB

mapped_pages
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-folding.html

ICU folding token filter [analysis-icu-folding]

Case folding of Unicode characters based on UTR#30, like the ASCII-folding token filter on steroids. It registers itself as the icu_folding token filter and is available to all indices:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "folded": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_folding"
            ]
          }
        }
      }
    }
  }
}

The ICU folding token filter already does Unicode normalization, so there is no need to use Normalize character or token filter as well.

Which letters are folded can be controlled by specifying the unicode_set_filter parameter, which accepts a UnicodeSet.

The following example exempts Swedish characters from folding. It is important to note that both upper and lowercase forms should be specified, and that these filtered character are not lowercased which is why we add the lowercase filter as well:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "swedish_folding": {
            "type": "icu_folding",
            "unicode_set_filter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}