---
navigation_title: "Simple pattern"
mapped_pages:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepattern-tokenizer.html
---
# Simple pattern tokenizer [analysis-simplepattern-tokenizer]

The `simple_pattern` tokenizer uses a regular expression to capture matching text as terms. The set of regular expression features it supports is more limited than that of the [`pattern`](/reference/data-analysis/text-analysis/analysis-pattern-tokenizer.md) tokenizer, but the tokenization is generally faster.

Unlike the [`pattern`](/reference/data-analysis/text-analysis/analysis-pattern-tokenizer.md) tokenizer, this tokenizer does not support splitting the input on a pattern match. To split on pattern matches using the same restricted regular expression subset, see the [`simple_pattern_split`](/reference/data-analysis/text-analysis/analysis-simplepatternsplit-tokenizer.md) tokenizer.
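
For comparison, a minimal sketch of the splitting behavior, using a transient tokenizer definition in the analyze API (this request and its sample text are illustrative, not part of the original example):

```console
GET _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "-"
  },
  "text": "fd-786-335-514-x"
}
```

Here `simple_pattern_split` emits the text *between* matches (`fd`, `786`, `335`, `514`, `x`), whereas `simple_pattern` emits only the text that *matches* the pattern.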

This tokenizer uses [Lucene regular expressions](https://lucene.apache.org/core/10_0_0/core/org/apache/lucene/util/automaton/RegExp.html). For an explanation of the supported features and syntax, see [Regular Expression Syntax](/reference/query-languages/regexp-syntax.md).

The default pattern is the empty string, which produces no terms. This tokenizer should always be configured with a non-default pattern.
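
To illustrate, analyzing text with an unconfigured `simple_pattern` tokenizer returns an empty token list (this transient request is illustrative):

```console
GET _analyze
{
  "tokenizer": {
    "type": "simple_pattern"
  },
  "text": "fd-786-335-514-x"
}
```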

## Configuration [_configuration_17]

The `simple_pattern` tokenizer accepts the following parameters:

`pattern`
: A [Lucene regular expression](https://lucene.apache.org/core/10_0_0/core/org/apache/lucene/util/automaton/RegExp.html). Defaults to the empty string.

## Example configuration [_example_configuration_11]

This example configures the `simple_pattern` tokenizer to produce terms that are three-digit numbers:

```console
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}
```

The above example produces these terms:

```text
[ 786, 335, 514 ]
```
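
Only the text that matches the pattern is emitted; the non-matching pieces of the input (the `fd-` prefix, the separating hyphens, and the `-x` suffix) are discarded.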