[DOCS] Fix double spaces (#71082)

James Rodewig 2021-03-31 09:57:47 -04:00 committed by GitHub
parent 83725e4b38
commit 693807a6d3
282 changed files with 834 additions and 834 deletions

View file

@ -92,7 +92,7 @@ password: `elastic-password`.
=== Test case filtering.
You can run a single test, provided that you specify the Gradle project. See the documentation on
https://docs.gradle.org/current/userguide/userguide_single.html#simple_name_pattern[simple name pattern filtering].
Run a single test case in the `server` project:
@ -385,13 +385,13 @@ vagrant plugin install vagrant-cachier
. You can run all of the OS packaging tests with `./gradlew packagingTest`.
This task includes our legacy `bats` tests. To run only the OS tests that are
written in Java, run `./gradlew distroTest`, which will cause Gradle to build the tar,
zip, and deb packages and all the plugins. It will then run the tests on every
available system. This will take a very long time.
+
Fortunately, the various systems under test have their own Gradle tasks under
`qa/os`. To find the systems tested, do a listing of the `qa/os` directory.
To find out what packaging combinations can be tested on a system, run
the `tasks` task. For example:
+
----------------------------------
./gradlew :qa:os:ubuntu-1804:tasks
@ -558,7 +558,7 @@ fetching the latest from the remote.
== Testing in FIPS 140-2 mode
We have a CI matrix job that periodically runs all our tests with the JVM configured
to be FIPS 140-2 compliant with the use of the BouncyCastle FIPS approved Security Provider.
FIPS 140-2 imposes certain requirements that affect how our tests should be set up or what
can be tested. This section summarizes what one needs to take into consideration so that

View file

@ -150,7 +150,7 @@ Also see the {client}/php-api/current/index.html[official Elasticsearch PHP clie
* https://github.com/nervetattoo/elasticsearch[elasticsearch] PHP client.
* https://github.com/madewithlove/elasticsearcher[elasticsearcher] Agnostic lightweight package on top of the Elasticsearch PHP client. Its main goal is to allow for easier structuring of queries and indices in your application. It does not want to hide or replace functionality of the Elasticsearch PHP client.
[[python]]
== Python

View file

@ -51,7 +51,7 @@ offsets.
payloads.
<6> Set `filterSettings` to filter the terms that can be returned based
on their tf-idf scores.
<7> Set `perFieldAnalyzer` to specify a different analyzer than
the one that the field has.
<8> Set `realtime` to `false` (default is `true`) to retrieve term vectors
near realtime.

View file

@ -20,7 +20,7 @@ The simplest version uses a built-in analyzer:
include-tagged::{doc-tests-file}[{api}-builtin-request]
---------------------------------------------------
<1> A built-in analyzer
<2> The text to include. Multiple strings are treated as a multi-valued field
You can configure a custom analyzer:
["source","java",subs="attributes,callouts,macros"]

View file

@ -38,7 +38,7 @@ include-tagged::{doc-tests-file}[{api}-request-masterTimeout]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request-waitForActiveShards]
--------------------------------------------------
<1> The number of active shard copies to wait for before the freeze index API
returns a response, as an `ActiveShardCount`
["source","java",subs="attributes,callouts,macros"]

View file

@ -25,7 +25,7 @@ The following arguments can optionally be provided:
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request-names]
--------------------------------------------------
<1> One or more settings to be the only settings retrieved. If unset, all settings will be retrieved
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------

View file

@ -43,7 +43,7 @@ include-tagged::{doc-tests-file}[{api}-request-waitForActiveShards]
--------------------------------------------------
<1> The number of active shard copies to wait for before the open index API
returns a response, as an `int`
<2> The number of active shard copies to wait for before the open index API
returns a response, as an `ActiveShardCount`
["source","java",subs="attributes,callouts,macros"]

View file

@ -37,7 +37,7 @@ include-tagged::{doc-tests-file}[{api}-request-masterTimeout]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request-waitForActiveShards]
--------------------------------------------------
<1> The number of active shard copies to wait for before the unfreeze index API
returns a response, as an `ActiveShardCount`
["source","java",subs="attributes,callouts,macros"]

View file

@ -20,7 +20,7 @@ license started. If it was not started, it returns an error message describing
why.
Acknowledgement messages may also be returned if this API was called without
the `acknowledge` flag set to `true`. In this case you need to display the
messages to the end user and if they agree, resubmit the request with the
`acknowledge` flag set to `true`. Please note that the response will still
return a 200 return code even if it requires an acknowledgement. So, it is

View file

@ -23,7 +23,7 @@ license started. If it was not started, it returns an error message describing
why.
Acknowledgement messages may also be returned if this API was called without
the `acknowledge` flag set to `true`. In this case you need to display the
messages to the end user and if they agree, resubmit the request with the
`acknowledge` flag set to `true`. Please note that the response will still
return a 200 return code even if it requires an acknowledgement. So, it is

View file

@ -40,7 +40,7 @@ include-tagged::{doc-tests-file}[x-pack-{api}-execute]
The returned +{response}+ holds lists and maps of values which correspond to the capabilities
of the target index/index pattern (what jobs were configured for the pattern, where the data is stored, what
aggregations are available, etc). It provides essentially the same data as the original job configuration,
just presented in a different manner.
For example, if we had created a job with the following config:

View file

@ -10,7 +10,7 @@
experimental::[]
The Get Rollup Index Capabilities API allows the user to determine if a concrete index or index pattern contains
stored rollup jobs and data. If it contains data stored from rollup jobs, the capabilities of those jobs
are returned. The API accepts a `GetRollupIndexCapsRequest` object as a request and returns a `GetRollupIndexCapsResponse`.
[id="{upid}-x-pack-{api}-request"]
@ -40,7 +40,7 @@ include-tagged::{doc-tests-file}[x-pack-{api}-execute]
The returned +{response}+ holds lists and maps of values which correspond to the capabilities
of the rollup index/index pattern (what jobs are stored in the index, their capabilities, what
aggregations are available, etc). Because multiple jobs can be stored in one index, the
response may include several jobs with different configurations.
The capabilities are essentially the same as the original job configuration, just presented in a different

View file

@ -62,7 +62,7 @@ if the privilege was not part of the request).
A `Map<String, Map<String, Map<String, Boolean>>>>` where each key is the
name of an application (as specified in the +{request}+).
For each application, the value is a `Map` keyed by resource name, with
each value being another `Map` from privilege name to a `Boolean`.
The `Boolean` value is `true` if the user has that privilege on that
resource for that application, and `false` otherwise.
+

View file

@ -34,7 +34,7 @@ include-tagged::{doc-tests}/SnapshotClientDocumentationIT.java[delete-snapshot-e
[[java-rest-high-snapshot-delete-snapshot-async]]
==== Asynchronous Execution
The asynchronous execution of a delete snapshot request requires both the
`DeleteSnapshotRequest` instance and an `ActionListener` instance to be
passed to the asynchronous method:

View file

@ -150,7 +150,7 @@ should be consulted: https://hc.apache.org/httpcomponents-asyncclient-4.1.x/ .
NOTE: If your application runs under the security manager you might be subject
to the JVM default policies of caching positive hostname resolutions
indefinitely and negative hostname resolutions for ten seconds. If the resolved
addresses of the hosts to which you are connecting the client vary with time
then you might want to modify the default JVM behavior. These can be modified by
adding
@ -184,6 +184,6 @@ whenever none of the nodes from the preferred rack is available.
WARNING: Node selectors that do not consistently select the same set of nodes
will make round-robin behaviour unpredictable and possibly unfair. The
preference example above is fine as it reasons about availability of nodes
which already affects the predictability of round-robin. Node selection should
not depend on other external factors or round-robin will not work properly.

View file

@ -97,7 +97,7 @@ include-tagged::{doc-tests}/SnifferDocumentation.java[sniff-on-failure]
failure, but an additional sniffing round is also scheduled sooner than usual,
by default one minute after the failure, assuming that things will go back to
normal and we want to detect that as soon as possible. Said interval can be
customized at `Sniffer` creation time through the `setSniffAfterFailureDelayMillis`
method. Note that this last configuration parameter has no effect in case sniffing
on failure is not enabled like explained above.
<3> Set the `Sniffer` instance to the failure listener

View file

@ -24,7 +24,7 @@ The standard <<painless-api-reference-shared, Painless API>> is available.
To run this example, first follow the steps in <<painless-context-examples, context examples>>.
The painless context in a `bucket_script` aggregation provides a `params` map. This map contains both
user-specified custom values, as well as the values from other aggregations specified in the `buckets_path`
property.
@ -36,7 +36,7 @@ and adds the user-specified base_cost to the result:
(params.max - params.min) + params.base_cost
--------------------------------------------------
Note that the values are extracted from the `params` map. In context, the aggregation looks like this:
[source,console]
--------------------------------------------------

View file

@ -26,7 +26,7 @@ The standard <<painless-api-reference-shared, Painless API>> is available.
To run this example, first follow the steps in <<painless-context-examples, context examples>>.
The painless context in a `bucket_selector` aggregation provides a `params` map. This map contains both
user-specified custom values, as well as the values from other aggregations specified in the `buckets_path`
property.
@ -41,7 +41,7 @@ params.max + params.base_cost > 10
--------------------------------------------------
Note that the values are extracted from the `params` map. The script is in the form of an expression
that returns `true` or `false`. In context, the aggregation looks like this:
[source,console]
--------------------------------------------------

View file

@ -19,7 +19,7 @@ full metric aggregation.
*Side Effects*
`state` (`Map`)::
Add values to this `Map` for use in a map script. Additional values must
be of the type `Map`, `List`, `String` or primitive.
*Return*

View file

@ -32,7 +32,7 @@ part of a full metric aggregation.
primitive. The same `state` `Map` is shared between all aggregated documents
on a given shard. If an initialization script is provided as part of the
aggregation then values added from the initialization script are
available. If no combine script is specified, values must be
directly stored in `state` in a usable form. If no combine script and no
<<painless-metric-agg-reduce-context, reduce script>> are specified, the
`state` values are used as the result.
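For orientation, the four script phases fit together roughly as in the sketch below. This is not taken from this page: the `sales` index and its `type` and `amount` fields are assumed purely for illustration.
[source,console]
--------------------------------------------------
// Hypothetical example: the `sales` index and its `type`/`amount` fields are assumptions.
POST /sales/_search?size=0
{
  "aggs": {
    "profit": {
      "scripted_metric": {
        "init_script": "state.transactions = []",
        "map_script": "state.transactions.add(doc.type.value == 'sale' ? doc.amount.value : -1 * doc.amount.value)",
        "combine_script": "double profit = 0; for (t in state.transactions) { profit += t } return profit",
        "reduce_script": "double profit = 0; for (a in states) { profit += a } return profit"
      }
    }
  }
}
--------------------------------------------------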

View file

@ -11,8 +11,8 @@ score to documents returned from a query.
User-defined parameters passed in as part of the query.
`doc` (`Map`, read-only)::
Contains the fields of the current document. For single-valued fields,
the value can be accessed via `doc['fieldname'].value`. For multi-valued
fields, this returns the first value; other values can be accessed
via `doc['fieldname'].get(index)`
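As a hedged illustration of how this context is typically reached, a `script_score` query can supply such a script; the index, `likes` field, and `bonus` parameter below are assumed for the example.
[source,console]
--------------------------------------------------
// Hypothetical index, field, and parameter names.
GET /my-index-000001/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "doc['likes'].value + params.bonus",
        "params": { "bonus": 10 }
      }
    }
  }
}
--------------------------------------------------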

View file

@ -11,19 +11,19 @@ documents in a query.
The weight as calculated by a <<painless-weight-context,weight script>>
`query.boost` (`float`, read-only)::
The boost value if provided by the query. If this is not provided the
value is `1.0f`.
`field.docCount` (`long`, read-only)::
The number of documents that have a value for the current field.
`field.sumDocFreq` (`long`, read-only)::
The sum of all terms that exist for the current field. If this is not
available the value is `-1`.
`field.sumTotalTermFreq` (`long`, read-only)::
The sum of occurrences in the index for all the terms that exist in the
current field. If this is not available the value is `-1`.
`term.docFreq` (`long`, read-only)::
The number of documents that contain the current term in the index.
@ -32,7 +32,7 @@ documents in a query.
The total occurrences of the current term in the index.
`doc.length` (`long`, read-only)::
The number of tokens the current document has in the current field. This
is decoded from the stored {ref}/norms.html[norms] and may be approximate for
long fields
@ -45,7 +45,7 @@ Note that the `query`, `field`, and `term` variables are also available to the
there, as they are constant for all documents.
For queries that contain multiple terms, the script is called once for each
term with that term's calculated weight, and the results are summed. Note that some
terms might have a `doc.freq` value of `0` on a document, for example if a query
uses synonyms.
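To show where these variables plug in, a scripted similarity can be registered through the index settings along the lines of the following sketch (a minimal TF-IDF-style script; the index and field names are assumptions).
[source,console]
--------------------------------------------------
// Hypothetical index and field names; the script is a minimal TF-IDF-style sketch.
PUT /my-index-000001
{
  "settings": {
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": { "type": "text", "similarity": "scripted_tfidf" }
    }
  }
}
--------------------------------------------------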

View file

@ -10,8 +10,8 @@ Use a Painless script to
User-defined parameters passed in as part of the query.
`doc` (`Map`, read-only)::
Contains the fields of the current document. For single-valued fields,
the value can be accessed via `doc['fieldname'].value`. For multi-valued
fields, this returns the first value; other values can be accessed
via `doc['fieldname'].get(index)`

View file

@ -3,7 +3,7 @@
Use a Painless script to create a
{ref}/index-modules-similarity.html[weight] for use in a
<<painless-similarity-context, similarity script>>. The weight makes up the
part of the similarity calculation that is independent of the document being
scored, and so can be built up front and cached.
@ -12,19 +12,19 @@ Queries that contain multiple terms calculate a separate weight for each term.
*Variables*
`query.boost` (`float`, read-only)::
The boost value if provided by the query. If this is not provided the
value is `1.0f`.
`field.docCount` (`long`, read-only)::
The number of documents that have a value for the current field.
`field.sumDocFreq` (`long`, read-only)::
The sum of all terms that exist for the current field. If this is not
available the value is `-1`.
`field.sumTotalTermFreq` (`long`, read-only)::
The sum of occurrences in the index for all the terms that exist in the
current field. If this is not available the value is `-1`.
`term.docFreq` (`long`, read-only)::
The number of documents that contain the current term in the index.

View file

@ -4,7 +4,7 @@
A cast converts the value of an original type to the equivalent value of a
target type. An implicit cast infers the target type and automatically occurs
during certain <<painless-operators, operations>>. An explicit cast specifies
the target type and forcefully occurs as its own operation. Use the `cast
operator '()'` to specify an explicit cast.
Refer to the <<allowed-casts, cast table>> for a quick reference on all

View file

@ -8,7 +8,7 @@ to repeat its specific task. A parameter is a named type value available as a
function specifies zero-to-many parameters, and when a function is called a
value is specified per parameter. An argument is a value passed into a function
at the point of call. A function specifies a return type value, though if the
type is <<void-type, void>> then no value is returned. Any non-void type return
value is available for use within an <<painless-operators, operation>> or is
discarded otherwise.

View file

@ -11,7 +11,7 @@ Use an integer literal to specify an integer type value in decimal, octal, or
hex notation of a <<primitive-types, primitive type>> `int`, `long`, `float`,
or `double`. Use the following single letter designations to specify the
primitive type: `l` or `L` for `long`, `f` or `F` for `float`, and `d` or `D`
for `double`. If not specified, the type defaults to `int`. Use `0` as a prefix
to specify an integer literal as octal, and use `0x` or `0X` as a prefix to
specify an integer literal as hex.
@ -86,7 +86,7 @@ EXPONENT: ( [eE] [+\-]? [0-9]+ );
Use a string literal to specify a <<string-type, `String` type>> value with
either single-quotes or double-quotes. Use a `\"` token to include a
double-quote as part of a double-quoted string literal. Use a `\'` token to
include a single-quote as part of a single-quoted string literal. Use a `\\`
token to include a backslash as part of any string literal.
*Grammar*

View file

@ -76,7 +76,7 @@ int z = add(1, 2); <2>
==== Cast
An explicit cast converts the value of an original type to the equivalent value
of a target type forcefully as an operation. Use the `cast operator '()'` to
specify an explicit cast. Refer to <<painless-casting, casting>> for more
information.
@ -85,7 +85,7 @@ information.
A conditional consists of three expressions. The first expression is evaluated
with an expected boolean result type. If the first expression evaluates to true
then the second expression will be evaluated. If the first expression evaluates
to false then the third expression will be evaluated. The second and third
expressions will be <<promotion, promoted>> if the evaluated values are not the
same type. Use the `conditional operator '? :'` as a shortcut to avoid the need
@ -254,7 +254,7 @@ V = (T)(V op expression);
The table below shows the available operators for use in a compound assignment.
Each operator follows the casting/promotion rules according to their regular
definition. For numeric operations there is an extra implicit cast when
necessary to return the promoted numeric type value to the original numeric type
value of the variable/field and can result in data loss.

View file

@ -668,7 +668,7 @@ def y = x/2; <2>
==== Remainder
Use the `remainder operator '%'` to calculate the REMAINDER for division
between two numeric type values. Rules for NaN values and division by zero follow the JVM
specification.
*Errors*
@ -809,7 +809,7 @@ def y = x+2; <2>
==== Subtraction
Use the `subtraction operator '-'` to SUBTRACT a right-hand side numeric type
value from a left-hand side numeric type value. Rules for resultant overflow
and NaN values follow the JVM specification.
*Errors*
@ -955,7 +955,7 @@ def y = x << 1; <2>
Use the `right shift operator '>>'` to SHIFT higher order bits to lower order
bits in a left-hand side integer type value by the distance specified in a
right-hand side integer type value. The highest order bit of the left-hand side
integer type value is preserved.
*Errors*

View file

@ -2,10 +2,10 @@
=== Operators
An operator is the most basic action that can be taken to evaluate values in a
script. An expression is one-to-many consecutive operations. Precedence is the
order in which an operator will be evaluated relative to another operator.
Associativity is the direction within an expression in which a specific operator
is evaluated. The following table lists all available operators:
[cols="<6,<3,^3,^2,^4"]
|====

View file

@ -259,7 +259,7 @@ during operations.
Declare a `def` type <<painless-variables, variable>> or access a `def` type
member field (from a reference type instance), and assign it any type of value
for evaluation during later operations. The default value for a newly-declared
`def` type variable is `null`. A `def` type variable or method/function
parameter can change the type it represents during the compilation and
evaluation of a script.
@ -400,7 +400,7 @@ range `[2, d]` where `d >= 2`, each element within each dimension in the range
`[1, d-1]` is also an array type. The element type of each dimension, `n`, is an
array type with the number of dimensions equal to `d-n`. For example, consider
`int[][][]` with 3 dimensions. Each element in the 3rd dimension, `d-3`, is the
primitive type `int`. Each element in the 2nd dimension, `d-2`, is the array
type `int[]`. And each element in the 1st dimension, `d-1` is the array type
`int[][]`.

View file

@ -12,7 +12,7 @@ transliteration.
================================================
From time to time, the ICU library receives updates such as adding new
characters and emojis, and improving collation (sort) orders. These changes
may or may not affect search and sort orders, depending on which character
sets you are using.
@ -38,11 +38,11 @@ The following parameters are accepted:
`method`::
Normalization method. Accepts `nfkc`, `nfc` or `nfkc_cf` (default)
`mode`::
Normalization mode. Accepts `compose` (default) or `decompose`.
[[analysis-icu-normalization-charfilter]]
==== ICU Normalization Character Filter
@ -52,7 +52,7 @@ http://userguide.icu-project.org/transforms/normalization[here].
It registers itself as the `icu_normalizer` character filter, which is
available to all indices without any further configuration. The type of
normalization can be specified with the `name` parameter, which accepts `nfc`,
`nfkc`, and `nfkc_cf` (default). Set the `mode` parameter to `decompose` to
convert `nfc` to `nfd` or `nfkc` to `nfkd` respectively:
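The configuration example that belongs here falls outside this excerpt; a minimal sketch of the idea, with an assumed index name, might look like this.
[source,console]
--------------------------------------------------
// Hypothetical index and filter names; `nfc` + `decompose` yields NFD normalization.
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfd_normalized": {
            "tokenizer": "icu_tokenizer",
            "char_filter": ["nfd_normalizer"]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
--------------------------------------------------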
Which letters are normalized can be controlled by specifying the
@ -328,7 +328,7 @@ PUT icu_sample
[WARNING]
======
This token filter has been deprecated since Lucene 5.0. Please use
<<analysis-icu-collation-keyword-field, ICU Collation Keyword Field>>.
======
@ -404,7 +404,7 @@ The following parameters are accepted by `icu_collation_keyword` fields:
`null_value`::
Accepts a string value which is substituted for any explicit `null`
values. Defaults to `null`, which means the field is treated as missing.
{ref}/ignore-above.html[`ignore_above`]::
@ -434,7 +434,7 @@ The strength property determines the minimum level of difference considered
significant during comparison. Possible values are : `primary`, `secondary`,
`tertiary`, `quaternary` or `identical`. See the
https://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation documentation]
for a more detailed explanation for each value. Defaults to `tertiary`
unless otherwise specified in the collation.
`decomposition`::
@ -483,7 +483,7 @@ Single character or contraction. Controls what is variable for `alternate`.
`hiragana_quaternary_mode`::
Possible values: `true` or `false`. Distinguishing between Katakana and
Hiragana characters in `quaternary` strength.
@ -495,7 +495,7 @@ case mapping, normalization, transliteration and bidirectional text handling.
You can define which transformation you want to apply with the `id` parameter
(defaults to `Null`), and specify text direction with the `dir` parameter
which accepts `forward` (default) for LTR and `reverse` for RTL. Custom
rulesets are not yet supported.
For example:

View file

@ -103,7 +103,7 @@ The `kuromoji_tokenizer` accepts the following settings:
--
The tokenization mode determines how the tokenizer handles compound and
unknown words. It can be set to:
`normal`::
@ -403,11 +403,11 @@ form in either katakana or romaji. It accepts the following setting:
`use_romaji`::
Whether romaji reading form should be output instead of katakana. Defaults to `false`.
When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
to `true`. The default when defining a custom `kuromoji_readingform`, however,
is `false`. The only reason to use the custom form is if you need the
katakana reading form:
[source,console]
@ -521,7 +521,7 @@ GET kuromoji_sample/_analyze
The `ja_stop` token filter filters out Japanese stopwords (`_japanese_`), and
any other custom stopwords specified by the user. This filter only supports
the predefined `_japanese_` stopwords list. If you want to use a different
predefined list, then use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.
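A minimal sketch of wiring `ja_stop` together with custom stopwords could look as follows; the index, analyzer, and filter names are assumed for illustration.
[source,console]
--------------------------------------------------
// Hypothetical index, analyzer, and filter names.
PUT /kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": ["my_ja_stop"]
          }
        },
        "filter": {
          "my_ja_stop": {
            "type": "ja_stop",
            "stopwords": ["_japanese_", "ストップ"]
          }
        }
      }
    }
  }
}
--------------------------------------------------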

View file

@ -16,7 +16,7 @@ The `phonetic` token filter takes the following settings:
`encoder`::
Which phonetic encoder to use. Accepts `metaphone` (default),
`double_metaphone`, `soundex`, `refined_soundex`, `caverphone1`,
`caverphone2`, `cologne`, `nysiis`, `koelnerphonetik`, `haasephonetik`,
`beider_morse`, `daitch_mokotoff`.
@ -24,7 +24,7 @@ The `phonetic` token filter takes the following settings:
`replace`::
Whether or not the original token should be replaced by the phonetic
token. Accepts `true` (default) and `false`. Not supported by
`beider_morse` encoding.
[source,console]
@ -81,7 +81,7 @@ supported:
`max_code_len`::
The maximum length of the emitted metaphone token. Defaults to `4`.
[discrete]
===== Beider Morse settings

View file

@ -46,7 +46,7 @@ PUT /stempel_example
The `polish_stop` token filter filters out Polish stopwords (`_polish_`), and
any other custom stopwords specified by the user. This filter only supports
the predefined `_polish_` stopwords list. If you want to use a different
predefined list, then use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.

View file

@ -14,7 +14,7 @@ The Elasticsearch repository contains examples of:
* a https://github.com/elastic/elasticsearch/tree/master/plugins/examples/script-expert-scoring[Java plugin]
which contains a script plugin.
These examples provide the bare bones needed to get started. For more
information about how to write a plugin, we recommend looking at the plugins
listed in this documentation for inspiration.
@ -74,7 +74,7 @@ in the presence of plugins with the incorrect `elasticsearch.version`.
=== Testing your plugin
When testing a Java plugin, it will only be auto-loaded if it is in the
`plugins/` directory. Use `bin/elasticsearch-plugin install file:///path/to/your/plugin`
to install your plugin for testing.
You may also load your plugin within the test framework for integration tests.

View file

@ -130,7 +130,7 @@ discovery:
We will expose here one strategy which is to hide our Elasticsearch cluster from outside.
With this strategy, only VMs behind the same virtual port can talk to each
other. That means that with this mode, you can use Elasticsearch unicast
discovery to build a cluster, using the Azure API to retrieve information
about your nodes.

View file

@ -416,7 +416,7 @@ gcloud config set project es-cloud
[[discovery-gce-usage-tips-permissions]]
===== Machine Permissions
If you have created a machine without the correct permissions, you will see `403 unauthorized` error messages. To change machine permission on an existing instance, first stop the instance then Edit. Scroll down to `Access Scopes` to change permission. The other way to alter these permissions is to delete the instance (NOT THE DISK). Then create another with the correct permissions.
Creating machines with gcloud::
+

View file

@ -293,7 +293,7 @@ The annotated highlighter is based on the `unified` highlighter and supports the
settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using
html-like markup such as `<em>cat</em>` the annotated highlighter uses the same
markdown-like syntax used for annotations and injects a key=value annotation where `_hit_term`
is the key and the matched search term is the value e.g.
The [cat](_hit_term=cat) sat on the [mat](sku3578)

View file

@ -231,7 +231,7 @@ user for confirmation before continuing with installation.
When running the plugin install script from another program (e.g. install
automation scripts), the plugin script should detect that it is not being
called from the console and skip the confirmation response, automatically
granting all requested permissions. If console detection fails, then batch
mode can be forced by specifying `-b` or `--batch` as follows:
[source,shell]
@ -243,7 +243,7 @@ sudo bin/elasticsearch-plugin install --batch [pluginname]
=== Custom config directory
If your `elasticsearch.yml` config file is in a custom location, you will need
to specify the path to the config file when using the `plugin` script. You
can do this as follows:
[source,sh]

View file

@ -6,7 +6,7 @@ The following pages have moved or been deleted.
[role="exclude",id="discovery-multicast"]
=== Multicast Discovery Plugin
The `multicast-discovery` plugin has been removed. Instead, configure networking
using unicast (see {ref}/modules-network.html[Network settings]) or using
one of the <<discovery,cloud discovery plugins>>.

View file

@ -57,7 +57,7 @@ this configuration (such as Compute Engine, Kubernetes Engine or App Engine).
You have to obtain and provide https://cloud.google.com/iam/docs/overview#service_account[service account credentials]
manually.
For detailed information about generating JSON service account files, see the https://cloud.google.com/storage/docs/authentication?hl=en#service_accounts[Google Cloud documentation].
Note that the PKCS12 format is not supported by this plugin.
Here is a summary of the steps:
@ -88,7 +88,7 @@ A JSON service account file looks like this:
----
// NOTCONSOLE
To provide this file to the plugin, it must be stored in the {ref}/secure-settings.html[Elasticsearch keystore]. You must
add a `file` setting with the name `gcs.client.NAME.credentials_file` using the `add-file` subcommand.
`NAME` is the name of the client configuration for the repository. The implicit client
name is `default`, but a different client name can be specified in the

View file

@ -312,7 +312,7 @@ include::repository-shared-settings.asciidoc[]
https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl[S3
canned ACLs] : `private`, `public-read`, `public-read-write`,
`authenticated-read`, `log-delivery-write`, `bucket-owner-read`,
`bucket-owner-full-control`. Defaults to `private`. You could specify a
canned ACL using the `canned_acl` setting. When the S3 repository creates
buckets and objects, it adds the canned ACL into the buckets and objects.
@ -324,8 +324,8 @@ include::repository-shared-settings.asciidoc[]
Changing this setting on an existing repository only affects the
storage class for newly created objects, resulting in a mixed usage of
storage classes. Additionally, S3 Lifecycle Policies can be used to manage
the storage class of existing objects. Due to the extra complexity with the
Glacier class lifecycle, it is not currently supported by the plugin. For
more information about the different classes, see
https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-class-intro.html[AWS
Storage Classes Guide]
@ -335,9 +335,9 @@ documented below is considered deprecated, and will be removed in a future
version.
In addition to the above settings, you may also specify all non-secure client
settings in the repository settings. In this case, the client settings found in
the repository settings will be merged with those of the named client used by
the repository. Conflicts between client and repository settings are resolved
by the repository settings taking precedence over client settings.
For example:
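The example itself lies outside this excerpt; a hedged sketch of a repository that supplies non-secure client settings inline might look like this (bucket name, client name, and endpoint are placeholders).
[source,console]
--------------------------------------------------
// Hypothetical bucket, client, and endpoint; secure settings still belong in the keystore.
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-bucket",
    "client": "my_client_name",
    "endpoint": "s3.example.com"
  }
}
--------------------------------------------------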

View file

@ -9,4 +9,4 @@
`readonly`::
Makes repository read-only. Defaults to `false`.

View file

@ -28,7 +28,7 @@ other aggregations instead of documents or fields.
=== Run an aggregation
You can run aggregations as part of a <<search-your-data,search>> by specifying the <<search-search,search API>>'s `aggs` parameter. The
following search runs a
<<search-aggregations-bucket-terms-aggregation,terms aggregation>> on
`my-field`:
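The request itself is outside this excerpt; it takes roughly the following shape, with placeholder index, aggregation, and field names.
[source,console]
--------------------------------------------------
// Placeholder index, aggregation, and field names.
GET /my-index-000001/_search
{
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "my-field"
      }
    }
  }
}
--------------------------------------------------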

View file

@ -110,7 +110,7 @@ buckets requested.
==== Time Zone
Date-times are stored in Elasticsearch in UTC. By default, all bucketing and
rounding is also done in UTC. The `time_zone` parameter can be used to indicate
that bucketing should use a different time zone.

View file

@ -291,7 +291,7 @@ GET /_search
*Time Zone*
Date-times are stored in Elasticsearch in UTC. By default, all bucketing and
rounding is also done in UTC. The `time_zone` parameter can be used to indicate
that bucketing should use a different time zone.
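For instance, a `date_histogram` value source inside a `composite` aggregation can carry its own `time_zone`; the field name and zone below are assumptions for illustration.
[source,console]
--------------------------------------------------
// Hypothetical field name and time zone.
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "timestamp",
                "calendar_interval": "1d",
                "time_zone": "America/Los_Angeles"
              }
            }
          }
        ]
      }
    }
  }
}
--------------------------------------------------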
@ -853,7 +853,7 @@ GET /_search
The composite agg is not currently compatible with pipeline aggregations, nor does it make sense in most cases.
E.g. due to the paging nature of composite aggs, a single logical partition (one day for example) might be spread
over multiple pages. Since pipeline aggregations are purely post-processing on the final list of buckets,
running something like a derivative on a composite page could lead to inaccurate results as it is only taking into
account a "partial" result on that page.

View file

@ -51,7 +51,7 @@ This behavior has been deprecated in favor of two new, explicit fields: `calenda
and `fixed_interval`.
By forcing a choice between calendar and intervals up front, the semantics of the interval
are clear to the user immediately and there is no ambiguity. The old `interval` field
will be removed in the future.
==================================

View file

@ -92,7 +92,7 @@ GET logs/_search
// TEST[continued]
The filtered buckets are returned in the same order as provided in the
request. The response for this example would be:
[source,console-result]
--------------------------------------------------

View file

@ -19,7 +19,7 @@ bucket_key = Math.floor((value - offset) / interval) * interval + offset
--------------------------------------------------
For range values, a document can fall into multiple buckets. The first bucket is computed from the lower
bound of the range in the same way as a bucket for a single value is computed. The final bucket is computed in the same
way from the upper bound of the range, and the range is counted in all buckets in between and including those two.
The `interval` must be a positive decimal, while the `offset` must be a decimal in `[0, interval)`
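As a worked example of the bucket formula above (values assumed for illustration): with `interval: 50` and `offset: 10`, a value of `72` falls into the bucket keyed `60`, since `Math.floor((72 - 10) / 50) * 50 + 10 = 60`. A request sketch:
[source,console]
--------------------------------------------------
// Hypothetical index and field names.
POST /sales/_search?size=0
{
  "aggs": {
    "prices": {
      "histogram": {
        "field": "price",
        "interval": 50,
        "offset": 10
      }
    }
  }
}
--------------------------------------------------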
@ -183,7 +183,7 @@ POST /sales/_search?size=0
--------------------------------------------------
// TEST[setup:sales]
When aggregating ranges, buckets are based on the values of the returned documents. This means the response may include
buckets outside of a query's range. For example, if your query looks for values greater than 100, and you have a range
covering 50 to 150, and an interval of 50, that document will land in 3 buckets - 50, 100, and 150. In general, it's
best to think of the query and aggregation steps as independent - the query selects a set of documents, and then the

View file

@ -6,7 +6,7 @@
Since a range represents multiple values, running a bucket aggregation over a
range field can result in the same document landing in multiple buckets. This
can lead to surprising behavior, such as the sum of bucket counts being higher
than the number of matched documents. For example, consider the following
index:
[source, console]
--------------------------------------------------
@ -184,7 +184,7 @@ calculated over the ranges of all matching documents.
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
Depending on the use case, a `CONTAINS` query could limit the documents to only
those that fall entirely in the queried range. In this example, the one
document would not be included and the aggregation would be empty. Filtering
the buckets after the aggregation is also an option, for use cases where the
document should be counted but the out of bounds data can be safely ignored.

View file

@ -5,9 +5,9 @@
++++
A multi-bucket value source based aggregation which finds "rare" terms -- terms that are at the long-tail
of the distribution and are not frequent. Conceptually, this is like a `terms` aggregation that is
sorted by `_count` ascending. As noted in the <<search-aggregations-bucket-terms-aggregation-order,terms aggregation docs>>,
actually ordering a `terms` agg by count ascending has unbounded error. Instead, you should use the `rare_terms`
aggregation
//////////////////////////
@ -78,7 +78,7 @@ A `rare_terms` aggregation looks like this in isolation:
|Parameter Name |Description |Required |Default Value
|`field` |The field we wish to find rare terms in |Required |
|`max_doc_count` |The maximum number of documents a term should appear in. |Optional |`1`
|`precision` |The precision of the internal CuckooFilters. Smaller precision leads to
better approximation, but higher memory usage. Cannot be smaller than `0.00001` |Optional |`0.01`
|`include` |Terms that should be included in the aggregation|Optional |
|`exclude` |Terms that should be excluded from the aggregation|Optional |
@ -124,7 +124,7 @@ Response:
// TESTRESPONSE[s/\.\.\.//]
In this example, the only bucket that we see is the "swing" bucket, because it is the only term that appears in
one document. If we increase the `max_doc_count` to `2`, we'll see some more buckets:
[source,console,id=rare-terms-aggregation-max-doc-count-example]
--------------------------------------------------
@ -169,27 +169,27 @@ This now shows the "jazz" term which has a `doc_count` of 2":
[[search-aggregations-bucket-rare-terms-aggregation-max-doc-count]]
==== Maximum document count
The `max_doc_count` parameter is used to control the upper bound of document counts that a term can have. There
is not a size limitation on the `rare_terms` agg like `terms` agg has. This means that terms
which match the `max_doc_count` criteria will be returned. The aggregation functions in this manner to avoid
the order-by-ascending issues that afflict the `terms` aggregation.
This does, however, mean that a large number of results can be returned if chosen incorrectly.
To limit the danger of this setting, the maximum `max_doc_count` is 100.
[[search-aggregations-bucket-rare-terms-aggregation-max-buckets]]
==== Max Bucket Limit
The Rare Terms aggregation is more liable to trip the `search.max_buckets` soft limit than other aggregations due
to how it works. The `max_bucket` soft-limit is evaluated on a per-shard basis while the aggregation is collecting
results. It is possible for a term to be "rare" on a shard but become "not rare" once all the shard results are
merged together. This means that individual shards tend to collect more buckets than are truly rare, because
they only have their own local view. This list is ultimately pruned to the correct, smaller list of rare
terms on the coordinating node... but a shard may have already tripped the `max_buckets` soft limit and aborted
the request.
When aggregating on fields that have potentially many "rare" terms, you may need to increase the `max_buckets` soft
limit. Alternatively, you might need to find a way to filter the results to return fewer rare values (smaller time
span, filter by category, etc), or re-evaluate your definition of "rare" (e.g. if something
appears 100,000 times, is it truly "rare"?)
@ -197,8 +197,8 @@ appears 100,000 times, is it truly "rare"?)
==== Document counts are approximate
The naive way to determine the "rare" terms in a dataset is to place all the values in a map, incrementing counts
as each document is visited, then return the bottom `n` rows. This does not scale beyond even modestly sized data
sets. A sharded approach where only the "top n" values are retained from each shard (ala the `terms` aggregation)
fails because the long-tail nature of the problem means it is impossible to find the "top n" bottom values without
simply collecting all the values from all shards.
@ -208,16 +208,16 @@ Instead, the Rare Terms aggregation uses a different approximate algorithm:
2. Each additional occurrence of the term increments a counter in the map
3. If the counter > the `max_doc_count` threshold, the term is removed from the map and placed in a
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf[CuckooFilter]
4. The CuckooFilter is consulted on each term. If the value is inside the filter, it is known to be above the
threshold already and skipped.
After execution, the map of values is the map of "rare" terms under the `max_doc_count` threshold. This map and CuckooFilter
are then merged with all other shards. If there are terms that are greater than the threshold (or appear in
a different shard's CuckooFilter) the term is removed from the merged list. The final map of values is returned
to the user as the "rare" terms.
CuckooFilters have the possibility of returning false positives (they can say a value exists in their collection when
it actually does not). Since the CuckooFilter is being used to see if a term is over threshold, this means a false positive
from the CuckooFilter will mistakenly say a value is common when it is not (and thus exclude it from its final list of buckets).
Practically, this means the aggregation exhibits false-negative behavior since the filter is being used "in reverse"
of how people generally think of approximate set membership sketches.
@ -230,14 +230,14 @@ Proceedings of the 10th ACM International on Conference on emerging Networking E
==== Precision
Although the internal CuckooFilter is approximate in nature, the false-negative rate can be controlled with a
`precision` parameter. This allows the user to trade more runtime memory for more accurate results.
The default precision is `0.001`, and the smallest (e.g. most accurate and largest memory overhead) is `0.00001`.
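For reference, `precision` is set directly on the aggregation; the field name below is an assumption.
[source,console]
--------------------------------------------------
// Hypothetical field name.
GET /_search
{
  "aggs": {
    "genres": {
      "rare_terms": {
        "field": "genre",
        "precision": 0.001
      }
    }
  }
}
--------------------------------------------------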
Below are some charts which demonstrate how the accuracy of the aggregation is affected by precision and number
of distinct terms.
The X-axis shows the number of distinct values the aggregation has seen, and the Y-axis shows the percent error.
Each line series represents one "rarity" condition (ranging from one rare item to 100,000 rare items). For example,
the orange "10" line means ten of the values were "rare" (`doc_count == 1`), out of 1-20m distinct values (where the
rest of the values had `doc_count > 1`)
@ -258,14 +258,14 @@ degrades in a controlled, linear fashion as the number of distinct values increa
The default precision of `0.001` has a memory profile of `1.748⁻⁶ * n` bytes, where `n` is the number
of distinct values the aggregation has seen (it can also be roughly eyeballed, e.g. 20 million unique values is about
30mb of memory). The memory usage is linear to the number of distinct values regardless of which precision is chosen,
the precision only affects the slope of the memory profile as seen in this chart:
image:images/rare_terms/memory.png[]
For comparison, an equivalent terms aggregation at 20 million buckets would be roughly
`20m * 69b == ~1.38gb` (with 69 bytes being a very optimistic estimate of an empty bucket cost, far lower than what
the circuit breaker accounts for). So although the `rare_terms` agg is relatively heavy, it is still orders of
magnitude smaller than the equivalent terms aggregation
==== Filtering Values
@ -347,9 +347,9 @@ GET /_search
==== Nested, RareTerms, and scoring sub-aggregations
The RareTerms aggregation has to operate in `breadth_first` mode, since it needs to prune terms as doc count thresholds
are breached. This requirement means the RareTerms aggregation is incompatible with certain combinations of aggregations
that require `depth_first`. In particular, scoring sub-aggregations that are inside a `nested` force the entire aggregation tree to run
in `depth_first` mode. This will throw an exception since RareTerms is unable to process `depth_first`.
As a concrete example, if `rare_terms` aggregation is the child of a `nested` aggregation, and one of the child aggregations of `rare_terms`
needs document scores (like a `top_hits` aggregation), this will throw an exception.

View file

@ -305,7 +305,7 @@ If there is the equivalent of a `match_all` query or no query criteria providing
top-most aggregation - in this scenario the _foreground_ set is exactly the same as the _background_ set and
so there is no difference in document frequencies to observe and from which to make sensible suggestions.
Another consideration is that the significant_terms aggregation produces many candidate results at shard level
that are only later pruned on the reducing node once all statistics from all shards are merged. As a result,
it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms
aggregation that later discards many candidate terms. It is advisable in these cases to perform two searches - the first to provide a rationalized list of
@ -374,7 +374,7 @@ Chi square behaves like mutual information and can be configured with the same p
===== Google normalized distance
Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (https://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter
[source,js]
--------------------------------------------------
@ -448,13 +448,13 @@ size buckets was not returned).
To ensure better accuracy a multiple of the final `size` is used as the number of terms to request from each shard
(`2 * (size * 1.5 + 10)`). To take manual control of this setting the `shard_size` parameter
can be used to control the volumes of candidate terms produced by each shard.
Low-frequency terms can turn out to be the most interesting ones once all results are combined so the
significant_terms aggregation can produce higher-quality results when the `shard_size` parameter is set to
values significantly higher than the `size` setting. This ensures that a bigger volume of promising candidate terms are given
a consolidated review by the reducing node before the final selection. Obviously large candidate term lists
will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced. If `shard_size` is set to -1 (the default) then `shard_size` will be automatically estimated based on the number of shards and the `size` parameter.
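As a rough sketch, raising `shard_size` explicitly on a `significant_terms` aggregation might look like this (the query and field names are hypothetical):

[source,console]
--------------------------------------------------
GET /_search
{
  "query": { "match": { "text": "police" } },
  "aggs": {
    "significant_crime_types": {
      "significant_terms": {
        "field": "crime_type",
        "size": 10,
        "shard_size": 200
      }
    }
  }
}
--------------------------------------------------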
NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, Elasticsearch will

View file

@ -367,13 +367,13 @@ size buckets was not returned).
To ensure better accuracy a multiple of the final `size` is used as the number of terms to request from each shard
(`2 * (size * 1.5 + 10)`). To take manual control of this setting the `shard_size` parameter
can be used to control the volumes of candidate terms produced by each shard.
Low-frequency terms can turn out to be the most interesting ones once all results are combined so the
significant_terms aggregation can produce higher-quality results when the `shard_size` parameter is set to
values significantly higher than the `size` setting. This ensures that a bigger volume of promising candidate terms are given
a consolidated review by the reducing node before the final selection. Obviously large candidate term lists
will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced. If `shard_size` is set to -1 (the default) then `shard_size` will be automatically estimated based on the number of shards and the `size` parameter.
NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, Elasticsearch will

View file

@ -136,7 +136,7 @@ The higher the requested `size` is, the more accurate the results will be, but a
compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data
transfers between the nodes and the client).
The `shard_size` parameter can be used to minimize the extra work that comes with bigger requested `size`. When defined,
it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the
coordinating node will then reduce them to a final result which will be based on the `size` parameter - this way,
one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to
@ -191,7 +191,7 @@ determined and is given a value of -1 to indicate this.
==== Order
The order of the buckets can be customized by setting the `order` parameter. By default, the buckets are ordered by
their `doc_count` descending. It is possible to change this behaviour as documented below:
WARNING: Sorting by ascending `_count` or by sub aggregation is discouraged as it increases the
<<search-aggregations-bucket-terms-aggregation-approximate-counts,error>> on document counts.
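For instance, a minimal sketch that orders buckets alphabetically by term instead of by descending `doc_count` (the field name is hypothetical):

[source,console]
--------------------------------------------------
GET /_search
{
  "aggs": {
    "genres": {
      "terms": {
        "field": "genre",
        "order": { "_key": "asc" }
      }
    }
  }
}
--------------------------------------------------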
@ -283,7 +283,7 @@ GET /_search
=======================================
<<search-aggregations-pipeline,Pipeline aggregations>> are run during the
reduce phase after all other aggregations have already completed. For this
reason, they cannot be used for ordering.
=======================================
@ -606,10 +606,10 @@ WARNING: Partitions cannot be used together with an `exclude` parameter.
==== Multi-field terms aggregation
The `terms` aggregation does not support collecting terms from multiple fields
in the same document. The reason is that the `terms` agg doesn't collect the
string term values themselves, but rather uses
<<search-aggregations-bucket-terms-aggregation-execution-hint,global ordinals>>
to produce a list of all of the unique values in the field. Global ordinals
result in an important performance boost which would not be possible across
multiple fields.
@ -618,7 +618,7 @@ multiple fields:
<<search-aggregations-bucket-terms-aggregation-script,Script>>::
Use a script to retrieve terms from multiple fields. This disables the global
ordinals optimization and will be slower than collecting terms from a single
field, but it gives you the flexibility to implement this option at search
time.
@ -627,7 +627,7 @@ time.
If you know ahead of time that you want to collect the terms from two or more
fields, then use `copy_to` in your mapping to create a new dedicated field at
index time which contains the values from both fields. You can aggregate on
this single field, which will benefit from the global ordinals optimization.
<<search-aggregations-bucket-multi-terms-aggregation, `multi_terms` aggregation>>::

View file

@ -68,15 +68,15 @@ The response will look like this:
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
In this case, the lower and upper whisker values are equal to the min and max. In general, these values are the 1.5 *
IQR range, which is to say the nearest values to `q1 - (1.5 * IQR)` and `q3 + (1.5 * IQR)`. Since this is an approximation, the given values
may not actually be observed values from the data, but should be within a reasonable error bound of them. While the Boxplot aggregation
doesn't directly return outlier points, you can check if `lower > min` or `upper < max` to see if outliers exist on either side, and then
query for them directly.
==== Script
The boxplot metric supports scripting. For example, if our load times
are in milliseconds but we want values calculated in seconds, we could use
a script to convert them on-the-fly:
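A minimal sketch of such a request, assuming a `latency` index with a numeric `load_time` field in milliseconds:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_boxplot": {
      "boxplot": {
        "script": {
          "lang": "painless",
          "source": "doc['load_time'].value / params.timeUnit",
          "params": { "timeUnit": 1000 }
        }
      }
    }
  }
}
--------------------------------------------------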

View file

@ -152,8 +152,8 @@ public static void main(String[] args) {
image:images/cardinality_error.png[]
For all 3 thresholds, counts have been accurate up to the configured threshold.
Although not guaranteed, this is likely to be the case. Accuracy in practice depends
on the dataset in question. In general, most datasets show consistently good
accuracy. Also note that even with a threshold as low as 100, the error
remains very low (1-6% as seen in the above graph) even when counting millions of items.

View file

@ -63,7 +63,7 @@ The name of the aggregation (`grades_stats` above) also serves as the key by whi
==== Standard Deviation Bounds
By default, the `extended_stats` metric will return an object called `std_deviation_bounds`, which provides an interval of plus/minus two standard
deviations from the mean. This can be a useful way to visualize variance of your data. If you want a different boundary, for example
three standard deviations, you can set `sigma` in the request:
[source,console]
@ -84,7 +84,7 @@ GET /exams/_search
// TEST[setup:exams]
<1> `sigma` controls how many standard deviations +/- from the mean should be displayed
`sigma` can be any non-negative double, meaning you can request non-integer values such as `1.5`. A value of `0` is valid, but will simply
return the average for both `upper` and `lower` bounds.
The `upper` and `lower` bounds are calculated as population metrics so they are always the same as `upper_population` and
@ -93,8 +93,8 @@ The `upper` and `lower` bounds are calculated as population metrics so they are
.Standard Deviation and Bounds require normality
[NOTE]
=====
The standard deviation and its bounds are displayed by default, but they are not always applicable to all data-sets. Your data must
be normally distributed for the metrics to make sense. The statistics behind standard deviations assumes normally distributed data, so
if your data is skewed heavily left or right, the value returned will be misleading.
=====

View file

@ -10,19 +10,19 @@ generated by a provided script or extracted from specific numeric or
<<histogram,histogram fields>> in the documents.
Percentiles show the point at which a certain percentage of observed values
occur. For example, the 95th percentile is the value which is greater than 95%
of the observed values.
Percentiles are often used to find outliers. In normal distributions, the
0.13th and 99.87th percentiles represent three standard deviations from the
mean. Any data which falls outside three standard deviations is often considered
an anomaly.
When a range of percentiles are retrieved, they can be used to estimate the
data distribution and determine if the data is skewed, bimodal, etc.
Assume your data consists of website load times. The average and median
load times are not overly useful to an administrator. The max may be interesting,
but it can be easily skewed by a single slow response.
Let's look at a range of percentiles representing load time:
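A minimal sketch of such a request, assuming a `latency` index with a numeric `load_time` field:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": { "field": "load_time" }
    }
  }
}
--------------------------------------------------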
@ -45,7 +45,7 @@ GET latency/_search
<1> The field `load_time` must be a numeric field
By default, the `percentile` metric will generate a range of
percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this:
[source,console-result]
--------------------------------------------------
@ -70,7 +70,7 @@ percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this:
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
As you can see, the aggregation will return a calculated value for each percentile
in the default range. If we assume response times are in milliseconds, it is
immediately obvious that the webpage normally loads in 10-725ms, but occasionally
spikes to 945-985ms.
@ -164,7 +164,7 @@ Response:
==== Script
The percentile metric supports scripting. For example, if our load times
are in milliseconds but we want percentiles calculated in seconds, we could use
a script to convert them on-the-fly:
@ -220,12 +220,12 @@ GET latency/_search
[[search-aggregations-metrics-percentile-aggregation-approximation]]
==== Percentiles are (usually) approximate
There are many different algorithms to calculate percentiles. The naive
implementation simply stores all the values in a sorted array. To find the 50th
percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.
Clearly, the naive implementation does not scale -- the sorted array grows
linearly with the number of values in your dataset. To calculate percentiles
across potentially billions of values in an Elasticsearch cluster, _approximate_
percentiles are calculated.
@ -235,12 +235,12 @@ https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[C
When using this metric, there are a few guidelines to keep in mind:
- Accuracy is proportional to `q(1-q)`. This means that extreme percentiles (e.g. 99%)
are more accurate than less extreme percentiles, such as the median
- For small sets of values, percentiles are highly accurate (and potentially
100% accurate if the data is small enough).
- As the quantity of values in a bucket grows, the algorithm begins to approximate
the percentiles. It is effectively trading accuracy for memory savings. The
exact level of inaccuracy is difficult to generalize, since it depends on your
data distribution and volume of data being aggregated
@ -291,18 +291,18 @@ GET latency/_search
// tag::t-digest[]
The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
more nodes available, the higher the accuracy (and the larger the memory footprint), proportional
to the volume of data. The `compression` parameter limits the maximum number of
nodes to `20 * compression`.
Therefore, by increasing the compression value, you can increase the accuracy of
your percentiles at the cost of more memory. Larger compression values also
make the algorithm slower since the underlying tree data structure grows in size,
resulting in more expensive operations. The default compression value is
`100`.
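A sketch of overriding `compression` on the percentiles aggregation (index and field names are illustrative):

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time",
        "tdigest": { "compression": 200 }
      }
    }
  }
}
--------------------------------------------------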
A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount
of data which arrives sorted and in-order) the default settings will produce a
TDigest roughly 64KB in size. In practice data tends to be more random and
the TDigest will use less memory.
// end::t-digest[]

View file

@ -17,10 +17,10 @@ regarding approximation and memory use of the percentile ranks aggregation
==================================================
Percentile ranks show the percentage of observed values which are below a certain
value. For example, if a value is greater than or equal to 95% of the observed values
it is said to be at the 95th percentile rank.
Assume your data consists of website load times. You may have a service agreement that
95% of page loads complete within 500ms and 99% of page loads complete within 600ms.
Let's look at a range of percentiles representing load time:
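A minimal sketch of such a request, checking the ranks of the two agreement thresholds (index and field names are assumptions):

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [ 500, 600 ]
      }
    }
  }
}
--------------------------------------------------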
@ -120,7 +120,7 @@ Response:
==== Script
The percentile rank metric supports scripting. For example, if our load times
are in milliseconds but we want to specify values in seconds, we could use
a script to convert them on-the-fly:

View file

@ -142,7 +142,7 @@ indices, the term filter on the <<mapping-index-field,`_index`>> field can be us
==== Script
The `t_test` metric supports scripting. For example, if we need to adjust our load times for the before values, we could use
a script to recalculate them on-the-fly:
[source,console]

View file

@ -7,8 +7,8 @@
A `single-value` metrics aggregation that computes the weighted average of numeric values that are extracted from the aggregated documents.
These values can be extracted either from specific numeric fields in the documents, or provided by a script.
When calculating a regular average, each datapoint has an equal "weight" ... it contributes equally to the final value. Weighted averages,
on the other hand, weight each datapoint differently. The amount that each datapoint contributes to the final value is extracted from the
document, or provided by a script.
document, or provided by a script.
As a formula, a weighted average is `∑(value * weight) / ∑(weight)`.
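A minimal request sketch, assuming an `exams` index with numeric `grade` and `weight` fields:

[source,console]
--------------------------------------------------
POST /exams/_search
{
  "size": 0,
  "aggs": {
    "weighted_grade": {
      "weighted_avg": {
        "value": { "field": "grade" },
        "weight": { "field": "weight" }
      }
    }
  }
}
--------------------------------------------------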
@ -35,7 +35,7 @@ The `value` and `weight` objects have per-field specific configuration:
|Parameter Name |Description |Required |Default Value
|`field` | The field that values should be extracted from |Required |
|`missing` | A value to use if the field is missing entirely |Optional |
|`script` | A script which provides the values for the document. This is mutually exclusive with `field` |Optional
|===
[[weight-params]]
@ -45,7 +45,7 @@ The `value` and `weight` objects have per-field specific configuration:
|Parameter Name |Description |Required |Default Value
|`field` | The field that weights should be extracted from |Required |
|`missing` | A weight to use if the field is missing entirely |Optional |
|`script` | A script which provides the weights for the document. This is mutually exclusive with `field` |Optional
|===
@ -91,7 +91,7 @@ Which yields a response like:
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
While multiple values-per-field are allowed, only one weight is allowed. If the aggregation encounters
a document that has more than one weight (e.g. the weight field is a multi-valued field) it will throw an exception.
If you have this situation, you will need to specify a `script` for the weight field, and use the script
to combine the multiple values into a single value to be used.
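One hedged sketch of that workaround, using a Painless script to sum a multi-valued `weight` field (the index and field names are assumptions, not part of the original example):

[source,console]
--------------------------------------------------
POST /exams/_search
{
  "size": 0,
  "aggs": {
    "weighted_grade": {
      "weighted_avg": {
        "value": { "field": "grade" },
        "weight": {
          "script": {
            "lang": "painless",
            "source": "double total = 0; for (def w : doc['weight']) { total += w; } return total;"
          }
        }
      }
    }
  }
}
--------------------------------------------------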
@ -147,7 +147,7 @@ The aggregation returns `2.0` as the result, which matches what we would expect
==== Script
Both the value and the weight can be derived from a script, instead of a field. As a simple example, the following
will add one to the grade and weight in the document using a script:
[source,console]

View file

@ -19,7 +19,7 @@ parameter to indicate the paths to the required metrics. The syntax for defining
<<buckets-path-syntax, `buckets_path` Syntax>> section below.
Pipeline aggregations cannot have sub-aggregations but depending on the type it can reference another pipeline in the `buckets_path`
allowing pipeline aggregations to be chained. For example, you can chain together two derivatives to calculate the second derivative
(i.e. a derivative of a derivative).
NOTE: Because pipeline aggregations only add to the output, when chaining pipeline aggregations the output of each pipeline aggregation
@ -29,7 +29,7 @@ will be included in the final output.
[discrete]
=== `buckets_path` Syntax
Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the `buckets_path`
parameter, which follows a specific format:
// https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form
@ -77,7 +77,7 @@ POST /_search
<2> The `buckets_path` refers to the metric via a relative path `"the_sum"`
`buckets_path` is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets
instead of embedded "inside" them. For example, the `max_bucket` aggregation uses the `buckets_path` to specify
a metric embedded inside a sibling aggregation:
[source,console,id=buckets-path-sibling-example]
@ -112,7 +112,7 @@ POST /_search
`sales_per_month` date histogram.
If a Sibling pipeline agg references a multi-bucket aggregation, such as a `terms` agg, it also has the option to
select specific keys from the multi-bucket. For example, a `bucket_script` could select two specific buckets (via
their bucket keys) to perform the calculation:
[source,console,id=buckets-path-specific-bucket-example]
@ -160,8 +160,8 @@ instead of fetching all the buckets from `sale_type` aggregation
[discrete]
=== Special Paths
Instead of pathing to a metric, `buckets_path` can use a special `"_count"` path. This instructs
the pipeline aggregation to use the document count as its input. For example, a derivative can be calculated
on the document count of each bucket, instead of a specific metric:
[source,console,id=buckets-path-count-example]
@ -246,7 +246,7 @@ may be referred to as:
[discrete]
=== Dealing with gaps in the data
Data in the real world is often noisy and sometimes contains *gaps* -- places where data simply doesn't exist. This can
occur for a variety of reasons, the most common being:
* Documents falling into a bucket do not contain a required field
@ -256,11 +256,11 @@ Some pipeline aggregations have specific requirements that must be met (e.g. a d
first value because there is no previous value, HoltWinters moving average need "warmup" data to begin calculating, etc)
Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing
data is encountered. All pipeline aggregations accept the `gap_policy` parameter. There are currently two gap policies
to choose from:
_skip_::
This option treats missing data as if the bucket does not exist. It will skip the bucket and continue
calculating using the next available value.
_insert_zeros_::

View file

@ -11,8 +11,8 @@ aggregation. The specified metric must be a cardinality aggregation and the encl
must have `min_doc_count` set to `0` (default for `histogram` aggregations).
The `cumulative_cardinality` agg is useful for finding "total new items", like the number of new visitors to your
website each day. A regular cardinality aggregation will tell you how many unique visitors came each day, but doesn't
differentiate between "new" or "repeat" visitors. The Cumulative Cardinality aggregation can be used to determine
how many of each day's unique visitors are "new".
==== Syntax
@ -128,14 +128,14 @@ And the following may be the response:
Note how the second day, `2019-01-02`, has two distinct users but the `total_new_users` metric generated by the
cumulative pipeline agg only increments to three. This means that only one of the two users that day was
new, the other had already been seen on the previous day. This happens again on the third day, where only
one of three users is completely new.
==== Incremental cumulative cardinality
The `cumulative_cardinality` agg will show you the total, distinct count since the beginning of the time period
being queried. Sometimes, however, it is useful to see the "incremental" count; that is, how many new users
are added each day, rather than the total cumulative count.
This can be accomplished by adding a `derivative` aggregation to our query:
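A sketch of that combination, assuming a `user_hits` index with `timestamp` and `user_id` fields (the names are illustrative):

[source,console]
--------------------------------------------------
GET /user_hits/_search
{
  "size": 0,
  "aggs": {
    "users_per_day": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "distinct_users": {
          "cardinality": { "field": "user_id" }
        },
        "total_new_users": {
          "cumulative_cardinality": { "buckets_path": "distinct_users" }
        },
        "incremental_new_users": {
          "derivative": { "buckets_path": "total_new_users" }
        }
      }
    }
  }
}
--------------------------------------------------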

View file

@ -226,7 +226,7 @@ second derivative
==== Units
The derivative aggregation allows the units of the derivative values to be specified. This returns an extra field in the response
`normalized_value` which reports the derivative value in the desired x-axis units. In the below example we calculate the derivative
of the total sales per month but ask for the derivative of the sales as in the units of sales per day:
[source,console]

View file

@ -5,7 +5,7 @@
++++
Given an ordered series of data, the Moving Function aggregation will slide a window across the data and allow the user to specify a custom
script that is executed on each window of data. For convenience, a number of common functions are predefined such as min/max, moving averages,
etc.
==== Syntax
@ -36,7 +36,7 @@ A `moving_fn` aggregation looks like this in isolation:
|`shift` |<<shift-parameter, Shift>> of window position. |Optional | 0
|===
`moving_fn` aggregations must be embedded inside of a `histogram` or `date_histogram` aggregation. They can be
embedded like any other metric aggregation:
[source,console]
@ -69,11 +69,11 @@ POST /_search
// TEST[setup:sales]
<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals
<2> A `sum` metric is used to calculate the sum of a field. This could be any numeric metric (sum, min, max, etc)
<3> Finally, we specify a `moving_fn` aggregation which uses "the_sum" metric as its input.
Moving averages are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally
add numeric metrics, such as a `sum`, inside of that histogram. Finally, the `moving_fn` is embedded inside the histogram.
The `buckets_path` parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
<<buckets-path-syntax>> for a description of the syntax for `buckets_path`).
@ -134,9 +134,9 @@ An example response from the above aggregation may look like:
==== Custom user scripting
The Moving Function aggregation allows the user to specify any arbitrary script to define custom logic. The script is invoked each time a
new window of data is collected. These values are provided to the script in the `values` variable. The script should then perform some
kind of calculation and emit a single `double` as the result. Emitting `null` is not permitted, although `NaN` and +/- `Inf` are allowed.
For example, this script will simply return the first value from the window, or `NaN` if no values are available:
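A sketch of a `moving_fn` request using such a script (the histogram field and metric field names are assumptions):

[source,console]
--------------------------------------------------
POST /_search
{
  "size": 0,
  "aggs": {
    "my_date_histo": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1d"
      },
      "aggs": {
        "the_sum": {
          "sum": { "field": "price" }
        },
        "the_movfn": {
          "moving_fn": {
            "buckets_path": "the_sum",
            "window": 10,
            "script": "return values.length > 0 ? values[0] : Double.NaN"
          }
        }
      }
    }
  }
}
--------------------------------------------------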
@ -195,7 +195,7 @@ For convenience, a number of functions have been prebuilt and are available insi
- `holt()`
- `holtWinters()`
The functions are available from the `MovingFunctions` namespace. E.g. `MovingFunctions.max()`
===== max Function
@ -284,7 +284,7 @@ POST /_search
===== sum Function
This function accepts a collection of doubles and returns the sum of the values in that window. `null` and `NaN` values are ignored;
the sum is only calculated over the real values. If the window is empty, or all values are `null`/`NaN`, `0.0` is returned as the result.
[[sum-params]]
.`sum(double[] values)` Parameters
@ -326,7 +326,7 @@ POST /_search
===== stdDev Function
This function accepts a collection of doubles and average, then returns the standard deviation of the values in that window.
`null` and `NaN` values are ignored; the sum is only calculated over the real values. If the window is empty, or all values are
`null`/`NaN`, `0.0` is returned as the result.
[[stddev-params]]
@ -368,17 +368,17 @@ POST /_search
// TEST[setup:sales]
The `avg` parameter must be provided to the standard deviation function because different styles of averages can be computed on the window
(simple, linearly weighted, etc). The various moving averages that are detailed below can be used to calculate the average for the
standard deviation function.
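For example, a sketch that feeds `unweightedAvg` into `stdDev` (index and field names are illustrative):

[source,console]
--------------------------------------------------
POST /_search
{
  "size": 0,
  "aggs": {
    "my_date_histo": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1d"
      },
      "aggs": {
        "the_sum": {
          "sum": { "field": "price" }
        },
        "the_moving_stddev": {
          "moving_fn": {
            "buckets_path": "the_sum",
            "window": 10,
            "script": "MovingFunctions.stdDev(values, MovingFunctions.unweightedAvg(values))"
          }
        }
      }
    }
  }
}
--------------------------------------------------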
===== unweightedAvg Function
The `unweightedAvg` function calculates the sum of all values in the window, then divides by the size of the window. It is effectively
a simple arithmetic mean of the window. The simple moving average does not perform any time-dependent weighting, which means
the values from a `simple` moving average tend to "lag" behind the real data.
`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is the count of non-`null`, non-`NaN`
values.
[[unweightedavg-params]]
@ -421,7 +421,7 @@ POST /_search
==== linearWeightedAvg Function
The `linearWeightedAvg` function assigns a linear weighting to points in the series, such that "older" datapoints (e.g. those at
the beginning of the window) contribute linearly less to the total average. The linear weighting helps reduce
the "lag" behind the data's mean, since older points have less influence.
If the window is empty, or all values are `null`/`NaN`, `NaN` is returned as the result.
@ -467,13 +467,13 @@ POST /_search
The `ewma` function (aka "single-exponential") is similar to the `linearMovAvg` function,
except older data-points become exponentially less important,
rather than linearly less important. The speed at which the importance decays can be controlled with an `alpha`
setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger
portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the
moving average. This tends to make the moving average track the data more closely but with less smoothing.
`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is the count of non-`null`, non-`NaN`
values.
[[ewma-params]]
@ -518,18 +518,18 @@ POST /_search
==== holt Function
The `holt` function (aka "double exponential") incorporates a second exponential term which
tracks the data's trend. Single exponential does not perform well when the data has an underlying linear trend. The
double exponential model calculates two values internally: a "level" and a "trend".
The level calculation is similar to `ewma`, and is an exponentially weighted view of the data. The difference is
that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series.
The trend calculation looks at the difference between the current and last value (e.g. the slope, or trend, of the
smoothed data). The trend value is also exponentially weighted.
Values are produced by multiplying the level and trend components.
`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is the count of non-`null`, non-`NaN`
values.
[[holt-params]]
@ -572,26 +572,26 @@ POST /_search
// TEST[setup:sales]
In practice, the `alpha` value behaves very similarly in `holtMovAvg` as `ewmaMovAvg`: small values produce more smoothing
and more lag, while larger values produce closer tracking and less lag. The value of `beta` is often difficult
to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger
values emphasize short-term trends.
==== holtWinters Function
The `holtWinters` function (aka "triple exponential") incorporates a third exponential term which
tracks the seasonal aspect of your data. This aggregation therefore smooths based on three components: "level", "trend"
and "seasonality".
The level and trend calculation is identical to `holt`. The seasonal calculation looks at the difference between
the current point, and the point one period earlier.
Holt-Winters requires a little more handholding than the other moving averages. You need to specify the "periodicity"
of your data: e.g. if your data has cyclic trends every 7 days, you would set `period = 7`. Similarly if there was
a monthly trend, you would set it to `30`. There is currently no periodicity detection, although that is planned
for future enhancements.
`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is the count of non-`null`, non-`NaN`
values.
[[holtwinters-params]]
@ -638,20 +638,20 @@ POST /_search
[WARNING]
======
Multiplicative Holt-Winters works by dividing each data point by the seasonal value. This is problematic if any of
your data is zero, or if there are gaps in the data (since this results in a divide-by-zero). To combat this, the
`mult` Holt-Winters pads all values by a very small amount (1*10^-10^) so that all values are non-zero. This affects
the result, but only minimally. If your data is non-zero, or you prefer to see `NaN` when zeros are encountered,
you can disable this behavior with `pad: false`.
======
===== "Cold Start"
Unfortunately, due to the nature of Holt-Winters, it requires two periods of data to "bootstrap" the algorithm. This
means that your `window` must always be *at least* twice the size of your period. An exception will be thrown if it
isn't. It also means that Holt-Winters will not emit a value for the first `2 * period` buckets; the current algorithm
does not backcast.
You'll notice in the above example we have an `if ()` statement checking the size of values. This is checking to make sure
we have two periods worth of data (`5 * 2`, where 5 is the period specified in the `holtWintersMovAvg` function) before calling
the holt-winters function.

View file

@ -37,7 +37,7 @@ A `moving_percentiles` aggregation looks like this in isolation:
|`shift` |<<shift-parameter, Shift>> of window position. |Optional | 0
|===
`moving_percentiles` aggregations must be embedded inside of a `histogram` or `date_histogram` aggregation. They can be
embedded like any other metric aggregation:
[source,console]
@ -75,8 +75,8 @@ POST /_search
<2> A `percentile` metric is used to calculate the percentiles of a field.
<3> Finally, we specify a `moving_percentiles` aggregation which uses "the_percentile" sketch as its input.
Moving percentiles are built by first specifying a `histogram` or `date_histogram` over a field. You then add
a percentile metric inside of that histogram. Finally, the `moving_percentiles` is embedded inside the histogram.
The `buckets_path` parameter is then used to "point" at the percentiles aggregation inside of the histogram (see
<<buckets-path-syntax>> for a description of the syntax for `buckets_path`).

View file

@ -130,5 +130,5 @@ interpolate between data points.
The percentiles are calculated exactly and are not an approximation (unlike the Percentiles Metric). This means
the implementation maintains an in-memory, sorted list of your data to compute the percentiles, before discarding the
data. You may run into memory pressure issues if you attempt to calculate percentiles over many millions of
data-points in a single `percentiles_bucket`.

View file

@ -13,10 +13,10 @@ next. Single periods are useful for removing constant, linear trends.
Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is
plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.
By calculating the first-difference, we de-trend the data (e.g. remove a constant, linear trend). We can see that the
data becomes a stationary series (e.g. the first difference is randomly distributed around zero, and doesn't seem to
exhibit any pattern/behavior). The transformation reveals that the dataset is following a random-walk; the value is the
previous value +/- a random amount. This insight allows selection of further tools for analysis.
[[serialdiff_dow]]
.Dow Jones plotted and made stationary with first-differencing
@ -93,10 +93,10 @@ POST /_search
--------------------------------------------------
<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals
<2> A `sum` metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc)
<3> Finally, we specify a `serial_diff` aggregation which uses "the_sum" metric as its input.
Serial differences are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally
add normal metrics, such as a `sum`, inside of that histogram. Finally, the `serial_diff` is embedded inside the histogram.
The `buckets_path` parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
<<buckets-path-syntax>> for a description of the syntax for `buckets_path`).

View file

@ -13,12 +13,12 @@ lowercases terms, and supports removing stop words.
<<analysis-simple-analyzer,Simple Analyzer>>::
The `simple` analyzer divides text into terms whenever it encounters a
character which is not a letter. It lowercases all terms.
<<analysis-whitespace-analyzer,Whitespace Analyzer>>::
The `whitespace` analyzer divides text into terms whenever it encounters any
whitespace character. It does not lowercase terms.
<<analysis-stop-analyzer,Stop Analyzer>>::

View file

@ -1,8 +1,8 @@
[[configuring-analyzers]]
=== Configuring built-in analyzers
The built-in analyzers can be used directly without any configuration. Some
of them, however, support configuration options to alter their behaviour. For
instance, the <<analysis-standard-analyzer,`standard` analyzer>> can be configured
to support a list of stop words:
@ -53,10 +53,10 @@ POST my-index-000001/_analyze
<1> We define the `std_english` analyzer to be based on the `standard`
analyzer, but configured to remove the pre-defined list of English stopwords.
<2> The `my_text` field uses the `standard` analyzer directly, without
any configuration. No stop words will be removed from this field.
The resulting terms are: `[ the, old, brown, cow ]`
<3> The `my_text.english` field uses the `std_english` analyzer, so
English stop words will be removed. The resulting terms are:
`[ old, brown, cow ]`

View file

@ -38,7 +38,7 @@ The `custom` analyzer accepts the following parameters:
When indexing an array of text values, Elasticsearch inserts a fake "gap"
between the last term of one value and the first term of the next value to
ensure that a phrase query doesn't match two terms from different array
elements. Defaults to `100`. See <<position-increment-gap>> for more.
[discrete]
=== Example configuration

View file

@ -9,7 +9,7 @@ https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fi
which is used by the OpenRefine project to assist in clustering.
Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.
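A quick way to see this behaviour is the `_analyze` API; for example, a request along these lines should produce a single sorted, deduplicated, folded term such as `and consistent godel is said sentence this yes`:

[source,console]
--------------------------------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
--------------------------------------------------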
[discrete]
@ -59,17 +59,17 @@ The `fingerprint` analyzer accepts the following parameters:
[horizontal]
`separator`::
The character to use to concatenate the terms. Defaults to a space.
`max_output_size`::
The maximum token size to emit. Defaults to `255`. Tokens larger than
this size will be discarded.
`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
`stopwords_path`::

View file

@ -55,7 +55,7 @@ more details.
===== Excluding words from stemming
The `stem_exclusion` parameter allows you to specify an array
of lowercase words that should not be stemmed. Internally, this
functionality is implemented by adding the
<<analysis-keyword-marker-tokenfilter,`keyword_marker` token filter>>
with the `keywords` set to the value of the `stem_exclusion` parameter.
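A sketch of configuring this on a language analyzer (the index name and the excluded words are hypothetical):

[source,console]
--------------------------------------------------
PUT /my_english_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "skies", "organization" ]
        }
      }
    }
  }
}
--------------------------------------------------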
@ -427,7 +427,7 @@ PUT /catalan_example
===== `cjk` analyzer
NOTE: You may find that `icu_analyzer` in the ICU analysis plugin works better
for CJK text than the `cjk` analyzer. Experiment with your text and queries.
The `cjk` analyzer could be reimplemented as a `custom` analyzer as follows:

View file

@ -159,8 +159,8 @@ The `pattern` analyzer accepts the following parameters:
`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
`stopwords_path`::

View file

@ -132,8 +132,8 @@ The `standard` analyzer accepts the following parameters:
`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
`stopwords_path`::

View file

@ -5,7 +5,7 @@
++++
The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analyzer>>
but adds support for removing stop words. It defaults to using the
`_english_` stop words.
[discrete]
@ -111,8 +111,8 @@ The `stop` analyzer accepts the following parameters:
[horizontal]
`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_english_`.
`stopwords_path`::

View file

@ -14,7 +14,7 @@ combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
==== Character filters
A _character filter_ receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Hindu-Arabic numerals
(٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML
elements like `<b>` from the stream.
@ -25,10 +25,10 @@ which are applied in order.
[[analyzer-anatomy-tokenizer]]
==== Tokenizer
A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.
The tokenizer is also responsible for recording the order or _position_ of
@ -41,7 +41,7 @@ An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
==== Token filters
A _token filter_ receives the token stream and may add, remove, or change
tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
filter converts all tokens to lowercase, a
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
(_stop words_) like `the` from the token stream, and a

View file

@ -5,7 +5,7 @@ _Character filters_ are used to preprocess the stream of characters before it
is passed to the <<analysis-tokenizers,tokenizer>>.
A character filter receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Hindu-Arabic numerals
(٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML
elements like `<b>` from the stream.

View file

@ -4,7 +4,7 @@
<titleabbrev>Mapping</titleabbrev>
++++
The `mapping` character filter accepts a map of keys and values. Whenever it
encounters a string of characters that is the same as a key, it replaces them
with the value associated with that key.
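A sketch using the `_analyze` API with an inline `mapping` filter (the mappings and sample text are illustrative):

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        ":) => _happy_",
        ":( => _sad_"
      ]
    }
  ],
  "text": "I'm delighted about it :("
}
--------------------------------------------------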

View file

@ -5,7 +5,7 @@
++++
A token filter of type `multiplexer` will emit multiple tokens at the same position,
each version of the token having been run through a different filter. Identical
output tokens at the same position will be removed.
WARNING: If the incoming token stream has duplicate tokens, then these will also be
@ -14,8 +14,8 @@ removed by the multiplexer
[discrete]
=== Options
[horizontal]
filters:: a list of token filters to apply to incoming tokens. These can be any
token filters defined elsewhere in the index mappings. Filters can be chained
using a comma-delimited string, so for example `"lowercase, porter_stem"` would
apply the `lowercase` filter and then the `porter_stem` filter to a single token.
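A sketch of an index configured this way (the index, analyzer, and filter names are illustrative):

[source,console]
--------------------------------------------------
PUT /multiplexer_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "my_multiplexer" ]
        }
      },
      "filter": {
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": [ "lowercase", "lowercase, porter_stem" ]
        }
      }
    }
  }
}
--------------------------------------------------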

View file

@ -8,14 +8,14 @@ The `synonym_graph` token filter allows to easily handle synonyms,
including multi-word synonyms correctly during the analysis process.
In order to properly handle multi-word synonyms this token filter
creates a <<token-graphs,graph token stream>> during processing. For more
information on this topic and its various complexities, please read the
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html[Lucene's TokenStreams are actually graphs] blog post.
["NOTE",id="synonym-graph-index-note"]
===============================
This token filter is designed to be used as part of a search analyzer
only. If you want to apply synonyms during indexing please use the
standard <<analysis-synonym-tokenfilter,synonym token filter>>.
===============================
@ -179,13 +179,13 @@ as well.
==== Parsing synonym files
Elasticsearch will use the token filters preceding the synonym filter
in a tokenizer chain to parse the entries in a synonym file. So, for example, if a
synonym filter is placed after a stemmer, then the stemmer will also be applied
to the synonym entries. Because entries in the synonym map cannot have stacked
positions, some token filters may cause issues here. Token filters that produce
multiple versions of a token may choose which version of the token to emit when
parsing synonyms, e.g. `asciifolding` will only produce the folded version of the
token. Others, e.g. `multiplexer`, `word_delimiter_graph` or `ngram` will throw an
error.
If you need to build analyzers that include both multi-token filters and synonym

View file

@ -170,13 +170,13 @@ as well.
=== Parsing synonym files
Elasticsearch will use the token filters preceding the synonym filter
in a tokenizer chain to parse the entries in a synonym file. So, for example, if a
synonym filter is placed after a stemmer, then the stemmer will also be applied
to the synonym entries. Because entries in the synonym map cannot have stacked
positions, some token filters may cause issues here. Token filters that produce
multiple versions of a token may choose which version of the token to emit when
parsing synonyms, e.g. `asciifolding` will only produce the folded version of the
token. Others, e.g. `multiplexer`, `word_delimiter_graph` or `ngram` will throw an
error.
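As a concrete illustration, here is a minimal sketch (the index name and file path are illustrative, and the file must exist under the node's config directory) in which `lowercase` precedes a file-based synonym filter, so the file's entries are lowercased as they are parsed:

[source,console]
----
PUT /synonym_file_example
{
  "settings": {
    "analysis": {
      "filter": {
        "file_synonyms": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "file_synonyms" ]
        }
      }
    }
  }
}
----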
If you need to build analyzers that include both multi-token filters and synonym

View file

@ -1,10 +1,10 @@
[[analysis-tokenizers]]
== Tokenizer reference
A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.
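A quick way to see this is the `_analyze` API (a sketch, reusing the sample text above):

[source,console]
----
GET /_analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}
----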
The tokenizer is also responsible for recording the following:
@ -90,7 +90,7 @@ text:
<<analysis-keyword-tokenizer,Keyword Tokenizer>>::
The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters like <<analysis-lowercase-tokenfilter,`lowercase`>> to
normalise the analysed terms.

View file

@ -14,7 +14,7 @@ Edge N-Grams are useful for _search-as-you-type_ queries.
TIP: When you need _search-as-you-type_ for text which has a widely known
order, such as movie or song titles, the
<<completion-suggester,completion suggester>> is a much more efficient
choice than edge N-grams. Edge N-grams have the advantage when trying to
autocomplete words that can appear in any order.
[discrete]
@ -67,7 +67,7 @@ The above sentence would produce the following terms:
[ Q, Qu ]
---------------------------
NOTE: These default gram lengths are almost entirely useless. You need to
configure the `edge_ngram` before using it.
[discrete]
@ -76,19 +76,19 @@ configure the `edge_ngram` before using it.
The `edge_ngram` tokenizer accepts the following parameters:
`min_gram`::
Minimum length of characters in a gram. Defaults to `1`.
`max_gram`::
+
--
Maximum length of characters in a gram. Defaults to `2`.
See <<max-gram-limits>>.
--
`token_chars`::
Character classes that should be included in a token. Elasticsearch
will split on characters that don't belong to the classes specified.
Defaults to `[]` (keep all characters).
+
@ -106,7 +106,7 @@ Character classes may be any of the following:
Custom characters that should be treated as part of a token. For example,
setting this to `+-_` will make the tokenizer treat the plus, minus and
underscore sign as part of a token.
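Putting these parameters together, a minimal search-as-you-type style sketch (the index, analyzer and tokenizer names, and the gram lengths, are illustrative):

[source,console]
----
PUT /edge_ngram_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete_edge"
        }
      },
      "tokenizer": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}
----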
[discrete]
[[max-gram-limits]]

View file

@ -4,8 +4,8 @@
<titleabbrev>Keyword</titleabbrev>
++++
The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters to normalise output, e.g. lower-casing email addresses.
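For example, a sketch (the email address is illustrative) that normalises an email address to a single lowercased term:

[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "lowercase" ],
  "text": "John.Smith@Example.COM"
}
----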
[discrete]
@ -104,6 +104,6 @@ The `keyword` tokenizer accepts the following parameters:
`buffer_size`::
The number of characters read into the term buffer in a single pass.
Defaults to `256`. The term buffer will grow by this size until all the
text has been consumed. It is advisable not to change this setting.

View file

@ -7,7 +7,7 @@
The `lowercase` tokenizer, like the
<<analysis-letter-tokenizer, `letter` tokenizer>>, breaks text into terms
whenever it encounters a character which is not a letter, but it also
lowercases all terms. It is functionally equivalent to the
<<analysis-letter-tokenizer, `letter` tokenizer>> combined with the
<<analysis-lowercase-tokenfilter, `lowercase` token filter>>, but is more
efficient as it performs both steps in a single pass.
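For example, a sketch using the `_analyze` API (the sample text is illustrative):

[source,console]
----
GET /_analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes"
}
----

This should produce the lowercased, letter-only terms `[ the, quick, brown, foxes ]`.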

View file

@ -175,14 +175,14 @@ The `ngram` tokenizer accepts the following parameters:
[horizontal]
`min_gram`::
Minimum length of characters in a gram. Defaults to `1`.
`max_gram`::
Maximum length of characters in a gram. Defaults to `2`.
`token_chars`::
Character classes that should be included in a token. Elasticsearch
will split on characters that don't belong to the classes specified.
Defaults to `[]` (keep all characters).
+
@ -200,12 +200,12 @@ Character classes may be any of the following:
Custom characters that should be treated as part of a token. For example,
setting this to `+-_` will make the tokenizer treat the plus, minus and
underscore sign as part of a token.
TIP: It usually makes sense to set `min_gram` and `max_gram` to the same
value. The smaller the length, the more documents will match but the lower
the quality of the matches. The longer the length, the more specific the
matches. A tri-gram (length `3`) is a good place to start.
The index level setting `index.max_ngram_diff` controls the maximum allowed
difference between `max_gram` and `min_gram`.
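Following that tip, a minimal tri-gram sketch (the index, analyzer and tokenizer names are illustrative); because `min_gram` and `max_gram` are equal, the `index.max_ngram_diff` limit is not a concern here:

[source,console]
----
PUT /trigram_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "trigrams": {
          "tokenizer": "trigram_tokenizer"
        }
      },
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}
----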

View file

@ -69,7 +69,7 @@ The `path_hierarchy` tokenizer accepts the following parameters:
[horizontal]
`delimiter`::
The character to use as the path separator. Defaults to `/`.
`replacement`::
An optional replacement character to use for the delimiter.
@ -77,20 +77,20 @@ The `path_hierarchy` tokenizer accepts the following parameters:
`buffer_size`::
The number of characters read into the term buffer in a single pass.
Defaults to `1024`. The term buffer will grow by this size until all the
text has been consumed. It is advisable not to change this setting.
`reverse`::
If set to `true`, emits the tokens in reverse order. Defaults to `false`.
`skip`::
The number of initial tokens to skip. Defaults to `0`.
[discrete]
=== Example configuration
In this example, we configure the `path_hierarchy` tokenizer to split on `-`
characters, and to replace them with `/`. The first two tokens are skipped:
[source,console]
----------------------------

View file

@ -116,7 +116,7 @@ The `pattern` tokenizer accepts the following parameters:
`group`::
Which capture group to extract as tokens. Defaults to `-1` (split).
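For instance, a small `_analyze` sketch (the pattern and sample text are illustrative) that keeps only the first capture group of each match:

[source,console]
----
GET /_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "key=(\\w+)",
    "group": 1
  },
  "text": "key=alpha key=beta"
}
----

This should return just `alpha` and `beta` as tokens.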
[discrete]
=== Example configuration
@ -194,7 +194,7 @@ The above example produces the following terms:
---------------------------
In the next example, we configure the `pattern` tokenizer to capture values
enclosed in double quotes (ignoring embedded escaped quotes `\"`). The regex
itself looks like this:
"((?:\\"|[^"]|\\")*)"

View file

@ -199,7 +199,7 @@ Statistics are returned in a format suitable for humans
The human readable values can be turned off by adding `?human=false`
to the query string. This makes sense when the stats results are
being consumed by a monitoring tool, rather than intended for human
consumption. The default for the `human` flag is
`false`.
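For example, the same stats endpoint can be called either way (a sketch; any stats API behaves similarly):

[source,console]
----
GET /_cluster/stats?human=true

GET /_cluster/stats?human=false
----

The first form adds human-readable values such as `10gb` next to the raw byte counts; the second keeps the output purely numeric, which is usually what a monitoring tool wants.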
[[date-math]]
@ -499,7 +499,7 @@ of supporting the native JSON number types.
==== Time units
Whenever durations need to be specified, e.g. for a `timeout` parameter, the duration must specify
the unit, like `2d` for 2 days. The supported units are:
[horizontal]
`d`:: Days

View file

@ -103,14 +103,14 @@ with `queue`.
==== Numeric formats
Many commands provide a few types of numeric output, either a byte, size
or a time value. By default, these types are human-formatted,
for example, `3.5mb` instead of `3763212`. The human values are not
sortable numerically, so in order to operate on these values where
order is important, you can change it.
Say you want to find the largest index in your cluster (storage used
by all the shards, not number of documents). The `/_cat/indices` API
is ideal. You only need to add three things to the API request:
. The `bytes` query string parameter with a value of `b` to get byte-level resolution.
. The `s` (sort) parameter with a value of `store.size:desc` and a comma with `index:asc` to sort the output
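Combining the two parameters described so far gives a request along these lines (a sketch):

[source,console]
----
GET /_cat/indices?bytes=b&s=store.size:desc,index:asc
----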

View file

@ -25,7 +25,7 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=bytes]
include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=http-format]
`full_id`::
(Optional, Boolean) If `true`, return the full node ID. If `false`, return the
shortened node ID. Defaults to `false`.
include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=cat-h]
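For example, a sketch of a request that returns full node IDs (the column list is illustrative):

[source,console]
----
GET /_cat/nodes?full_id=true&h=id,ip,name&v=true
----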

View file

@ -22,7 +22,7 @@ Provides explanations for shard allocations in the cluster.
==== {api-description-title}
The purpose of the cluster allocation explain API is to provide
explanations for shard allocations in the cluster. For unassigned shards,
the explain API provides an explanation for why the shard is unassigned.
For assigned shards, the explain API provides an explanation for why the
shard is remaining on its current node and has not moved or rebalanced to

View file

@ -40,7 +40,7 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=node-id]
`ignore_idle_threads`::
(Optional, Boolean) If true, known idle threads (e.g. waiting in a socket
select, or to get a task from an empty queue) are filtered out. Defaults to
true.
`interval`::

View file

@ -108,7 +108,7 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=node-id]
`total_indexing_buffer`::
Total heap allowed to be used to hold recently indexed
documents before they must be written to disk. This size is
a shared pool across all shards on this node, and is
controlled by <<indexing-buffer,Indexing Buffer settings>>.
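To see the value itself, one option is a filtered node info request (a sketch; `filter_path` only trims the response and is not required):

[source,console]
----
GET /_nodes?filter_path=nodes.*.name,nodes.*.total_indexing_buffer
----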

View file

@ -1199,7 +1199,7 @@ since the {wikipedia}/Unix_time[Unix Epoch].
`open_file_descriptors`::
(integer)
Number of opened file descriptors associated with the current process, or
`-1` if not supported.
`max_file_descriptors`::

View file

@ -75,7 +75,7 @@ GET _tasks?nodes=nodeId1,nodeId2&actions=cluster:* <3>
// TEST[skip:No tasks to retrieve]
<1> Retrieves all tasks currently running on all nodes in the cluster.
<2> Retrieves all tasks running on nodes `nodeId1` and `nodeId2`. See <<cluster-nodes>> for more info about how to select individual nodes.
<3> Retrieves all cluster-related tasks running on nodes `nodeId1` and `nodeId2`.
The API returns the following result:

View file

@ -41,7 +41,7 @@ manually. It adds an entry for that node in the voting configuration exclusions
list. The cluster then tries to reconfigure the voting configuration to remove
that node and to prevent it from returning.
If the API fails, you can safely retry it. Only a successful response
guarantees that the node has been removed from the voting configuration and will
not be reinstated.
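For example, a sketch (the node name is illustrative) of adding an exclusion and, once maintenance is finished, clearing the exclusions list again:

[source,console]
----
POST /_cluster/voting_config_exclusions?node_names=node-1

DELETE /_cluster/voting_config_exclusions
----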

View file

@ -36,11 +36,11 @@ This tool has a number of modes:
prevents the cluster state from being loaded.
* `elasticsearch-node unsafe-bootstrap` can be used to perform _unsafe cluster
bootstrapping_. It forces one of the nodes to form a brand-new cluster on
its own, using its local copy of the cluster metadata.
* `elasticsearch-node detach-cluster` enables you to move nodes from one
cluster to another. This can be used to move nodes into a new cluster
created with the `elasticsearch-node unsafe-bootstrap` command. If unsafe
cluster bootstrapping was not possible, it also enables you to move nodes
into a brand-new cluster.
@ -218,7 +218,7 @@ node with the same term, pick the one with the largest version.
This information identifies the node with the freshest cluster state, which minimizes the
quantity of data that might be lost. For example, if the first node reports
`(4, 12)` and a second node reports `(5, 3)`, then the second node is preferred
since its term is larger. However, if the second node reports `(3, 17)` then
the first node is preferred since its term is larger. If the second node
reports `(4, 10)` then it has the same term as the first node, but has a
smaller version, so the first node is preferred.

Some files were not shown because too many files have changed in this diff.