---
navigation_title: "Pattern"
mapped_pages:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
---

# Pattern tokenizer [analysis-pattern-tokenizer]

The `pattern` tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.

The default pattern is `\W+`, which splits text whenever it encounters non-word characters.

::::{admonition} Beware of Pathological Regular Expressions
:class: warning

The pattern tokenizer uses [Java Regular Expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html).

A badly written regular expression could run very slowly or even throw a `StackOverflowError` and cause the node it is running on to exit suddenly.

Read more about [pathological regular expressions and how to avoid them](https://www.regular-expressions.info/catastrophic.html).
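
For example, nested quantifiers are a classic source of catastrophic backtracking: a pattern like the one below (shown only as an illustration, not taken from the examples on this page) takes exponential time to reject an input consisting of a long run of `x` characters with no trailing `y`.

```
(x+x+)+y
```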
::::

## Example output [_example_output_15]

```console
POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}
```

The above sentence would produce the following terms:

```text
[ The, foo_bar_size, s, default, is, 5 ]
```

## Configuration [_configuration_16]

The `pattern` tokenizer accepts the following parameters:

`pattern`
: A [Java regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html), defaults to `\W+`.

`flags`
: Java regular expression [flags](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary). Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`.

`group`
: Which capture group to extract as tokens. Defaults to `-1` (split).
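
The `flags` parameter can be used, for instance, to make the pattern case-insensitive. Here is a minimal sketch (the index name, pattern, and sample text are illustrative, not part of the examples below) that splits on the word `and` regardless of case:

```console
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\\s+and\\s+",
          "flags": "CASE_INSENSITIVE"
        }
      }
    }
  }
}

POST my-index-000002/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fish AND chips and peas"
}
```

This should produce the following terms:

```text
[ fish, chips, peas ]
```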

## Example configuration [_example_configuration_10]

In this example, we configure the `pattern` tokenizer to break text into tokens when it encounters commas:

```console
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
```

The above example produces the following terms:

```text
[ comma, separated, values ]
```

In the next example, we configure the `pattern` tokenizer to capture values enclosed in double quotes (ignoring embedded escaped quotes `\"`). The regex itself looks like this:

```
"((?:\\"|[^"])+)"
```

And reads as follows:

* A literal `"`
* Start capturing:
    * A literal `\"` OR any character except `"`
    * Repeat until no more characters match
* A literal closing `"`

When the pattern is specified in JSON, the `"` and `\` characters need to be escaped, so the pattern ends up looking like:

```
\"((?:\\\\\"|[^\"])+)\"
```

```console
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"((?:\\\\\"|[^\"])+)\"",
          "group": 1
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "\"value\", \"value with embedded \\\" quote\""
}
```

The above example produces the following two terms:

```text
[ value, value with embedded \" quote ]
```

Note that with a non-negative `group`, the tokenizer emits one token per match of the pattern (the content of the specified capture group), and any text between matches, such as the `, ` separator above, is discarded.