[[analysis-pathhierarchy-tokenizer]]
=== Path hierarchy tokenizer
++++
<titleabbrev>Path hierarchy</titleabbrev>
++++

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree. The `path_hierarchy` tokenizer uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analysis/common/org/apache/lucene/analysis/path/PathHierarchyTokenizer.html[PathHierarchyTokenizer]
underneath.

[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "/one",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "/one/two/three",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 2
    }
  ]
}
----------------------------

/////////////////////

The above text would produce the following terms:

[source,text]
---------------------------
[ /one, /one/two, /one/two/three ]
---------------------------

[discrete]
=== Configuration

The `path_hierarchy` tokenizer accepts the following parameters:

[horizontal]
`delimiter`::
    The character to use as the path separator. Defaults to `/`.

`replacement`::
    An optional replacement character to use for the delimiter.
    Defaults to the `delimiter`.

`buffer_size`::
    The number of characters read into the term buffer in a single pass.
    Defaults to `1024`. The term buffer will grow by this size until all the
    text has been consumed. It is advisable not to change this setting.

`reverse`::
    If `true`, uses Lucene's
    https://lucene.apache.org/core/{lucene_version_path}/analysis/common/org/apache/lucene/analysis/path/ReversePathHierarchyTokenizer.html[ReversePathHierarchyTokenizer],
    which is suitable for domain-like hierarchies (see the sketch after this
    list). Defaults to `false`.

`skip`::
    The number of initial tokens to skip. Defaults to `0`.

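To illustrate the `reverse` setting, here is a minimal sketch that passes an
inline tokenizer definition directly to the `_analyze` API; the domain value
and the expected terms below are illustrative assumptions, not output captured
from a cluster:

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": ".",
    "reverse": true
  },
  "text": "www.example.com"
}
---------------------------

Rather than prefixes like `www` and `www.example`, this should emit suffix
terms along the lines of `www.example.com`, `example.com`, and `com`, which is
what makes the reversed form a fit for domain-like values.
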
[discrete]
=== Example configuration

In this example, we configure the `path_hierarchy` tokenizer to split on `-`
characters, and to replace them with `/`. The first two tokens are skipped:

[source,console]
----------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four-five"
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "/three",
      "start_offset": 7,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four",
      "start_offset": 7,
      "end_offset": 18,
      "type": "word",
      "position": 1
    },
    {
      "token": "/three/four/five",
      "start_offset": 7,
      "end_offset": 23,
      "type": "word",
      "position": 2
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ /three, /three/four, /three/four/five ]
---------------------------

If we were to set `reverse` to `true`, it would produce the following:

[source,text]
---------------------------
[ one/two/three/, two/three/, three/ ]
---------------------------

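To try this without recreating the index, the `reverse` flag can be set on an
inline tokenizer definition passed directly to the `_analyze` API. A minimal
sketch, reusing the same `delimiter`, `replacement`, and `skip` values as
above:

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": "-",
    "replacement": "/",
    "reverse": true,
    "skip": 2
  },
  "text": "one-two-three-four-five"
}
---------------------------
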
[discrete]
[[analysis-pathhierarchy-tokenizer-detailed-examples]]
=== Detailed examples

A common use-case for the `path_hierarchy` tokenizer is filtering results by
file paths. If a file path is indexed along with the data, using the
`path_hierarchy` tokenizer to analyze the path allows filtering the results
by different parts of the file path string.

This example configures an index with two custom analyzers and applies
those analyzers to multifields of the `file_path` text field that will
store filenames. One of the two analyzers uses reverse tokenization.
Some sample documents are then indexed to represent some file paths
for photos inside photo folders of two different users.

[source,console]
--------------------------------------------------
PUT file-path-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": "true"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "text",
        "fields": {
          "tree": {
            "type": "text",
            "analyzer": "custom_path_tree"
          },
          "tree_reversed": {
            "type": "text",
            "analyzer": "custom_path_tree_reversed"
          }
        }
      }
    }
  }
}

POST file-path-test/_doc/1
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_doc/2
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
}

POST file-path-test/_doc/3
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
}

POST file-path-test/_doc/4
{
  "file_path": "/User/alice/photos/2017/05/15/my_photo1.jpg"
}

POST file-path-test/_doc/5
{
  "file_path": "/User/bob/photos/2017/05/16/my_photo1.jpg"
}
--------------------------------------------------

A search for a particular file path string against the text field matches all
the example documents, with Bob's documents ranking highest: `bob` is also one
of the terms created by the standard analyzer, which boosts relevance for
Bob's documents.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "match": {
      "file_path": "/User/bob/photos/2017/05"
    }
  }
}
--------------------------------------------------
// TEST[continued]

It's simple to match or filter documents with file paths that exist within a
particular directory using the `file_path.tree` field.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree": "/User/alice/photos/2017/05/16"
    }
  }
}
--------------------------------------------------
// TEST[continued]

With the `reverse` parameter for this tokenizer, it's also possible to match
from the other end of the file path, such as individual file names or a deep
level subdirectory. The following example shows a search for all files named
`my_photo1.jpg` in any directory, using the `file_path.tree_reversed` field,
which is configured in the mapping to use the `reverse` parameter.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree_reversed": {
        "value": "my_photo1.jpg"
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]

Viewing the tokens generated by both the forward and reverse analyzers is
instructive, showing the different tokens created for the same file path
value.

[source,console]
--------------------------------------------------
POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree_reversed",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}
--------------------------------------------------
// TEST[continued]

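For orientation, the terms look roughly like the following (an illustrative,
abridged listing based on the tokenizer behavior described above, not captured
response output):

[source,text]
---------------------------
custom_path_tree:
[ /User, /User/alice, /User/alice/photos, /User/alice/photos/2017,
  /User/alice/photos/2017/05, /User/alice/photos/2017/05/16,
  /User/alice/photos/2017/05/16/my_photo1.jpg ]

custom_path_tree_reversed:
[ /User/alice/photos/2017/05/16/my_photo1.jpg, ..., 16/my_photo1.jpg,
  my_photo1.jpg ]
---------------------------

Both analyzers emit the full path, but only the reversed field contains the
bare file name, which is why the `file_path.tree_reversed` term query above
can match `my_photo1.jpg` exactly.
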
It's also useful to be able to filter with file paths when combined with other
types of searches, such as this example looking for any file paths with `16`
that also must be in Alice's photo directory.

[source,console]
--------------------------------------------------
GET file-path-test/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "file_path": "16" }
      },
      "filter": {
        "term": { "file_path.tree": "/User/alice" }
      }
    }
  }
}
--------------------------------------------------
// TEST[continued]