Mirror of https://github.com/elastic/elasticsearch.git, synced 2025-04-24 23:27:25 -04:00

[DOCS] Fix double spaces (#71082)

commit 693807a6d3, parent 83725e4b38
282 changed files with 834 additions and 834 deletions
@@ -92,7 +92,7 @@ password: `elastic-password`.

=== Test case filtering.

You can run a single test, provided that you specify the Gradle project. See the documentation on
https://docs.gradle.org/current/userguide/userguide_single.html#simple_name_pattern[simple name pattern filtering].

Run a single test case in the `server` project:
@@ -385,13 +385,13 @@ vagrant plugin install vagrant-cachier

. You can run all of the OS packaging tests with `./gradlew packagingTest`.
This task includes our legacy `bats` tests. To run only the OS tests that are
written in Java, run `./gradlew distroTest`, which will cause Gradle to build the tar,
zip, and deb packages and all the plugins. It will then run the tests on every
available system. This will take a very long time.
+
Fortunately, the various systems under test have their own Gradle tasks under
`qa/os`. To find the systems tested, do a listing of the `qa/os` directory.
To find out what packaging combinations can be tested on a system, run
the `tasks` task. For example:
+
----------------------------------
./gradlew :qa:os:ubuntu-1804:tasks
@@ -558,7 +558,7 @@ fetching the latest from the remote.

== Testing in FIPS 140-2 mode

We have a CI matrix job that periodically runs all our tests with the JVM configured
to be FIPS 140-2 compliant with the use of the BouncyCastle FIPS approved Security Provider.
FIPS 140-2 imposes certain requirements that affect how our tests should be set up or what
can be tested. This section summarizes what one needs to take into consideration so that
@@ -150,7 +150,7 @@ Also see the {client}/php-api/current/index.html[official Elasticsearch PHP clie

* https://github.com/nervetattoo/elasticsearch[elasticsearch] PHP client.

* https://github.com/madewithlove/elasticsearcher[elasticsearcher] Agnostic lightweight package on top of the Elasticsearch PHP client. Its main goal is to allow for easier structuring of queries and indices in your application. It does not want to hide or replace functionality of the Elasticsearch PHP client.

[[python]]
== Python
@@ -51,7 +51,7 @@ offsets.
payloads.
<6> Set `filterSettings` to filter the terms that can be returned based
on their tf-idf scores.
<7> Set `perFieldAnalyzer` to specify a different analyzer than
the one that the field has.
<8> Set `realtime` to `false` (default is `true`) to retrieve term vectors
near realtime.
@@ -20,7 +20,7 @@ The simplest version uses a built-in analyzer:
include-tagged::{doc-tests-file}[{api}-builtin-request]
---------------------------------------------------
<1> A built-in analyzer
<2> The text to include. Multiple strings are treated as a multi-valued field

You can configure a custom analyzer:
["source","java",subs="attributes,callouts,macros"]
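As a point of reference for the callouts above, the REST equivalent of the built-in-analyzer case might look like the following; the analyzer name and sample text are illustrative only, not taken from the changed file.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "analyzer": "standard",
  "text": ["Some text to analyze", "a second string, treated as a multi-valued field"]
}
--------------------------------------------------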
@@ -38,7 +38,7 @@ include-tagged::{doc-tests-file}[{api}-request-masterTimeout]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request-waitForActiveShards]
--------------------------------------------------
<1> The number of active shard copies to wait for before the freeze index API
returns a response, as an `ActiveShardCount`

["source","java",subs="attributes,callouts,macros"]
@@ -25,7 +25,7 @@ The following arguments can optionally be provided:
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request-names]
--------------------------------------------------
<1> One or more settings that will be the only settings retrieved. If unset, all settings will be retrieved

["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
@@ -43,7 +43,7 @@ include-tagged::{doc-tests-file}[{api}-request-waitForActiveShards]
--------------------------------------------------
<1> The number of active shard copies to wait for before the open index API
returns a response, as an `int`
<2> The number of active shard copies to wait for before the open index API
returns a response, as an `ActiveShardCount`

["source","java",subs="attributes,callouts,macros"]

@@ -37,7 +37,7 @@ include-tagged::{doc-tests-file}[{api}-request-masterTimeout]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request-waitForActiveShards]
--------------------------------------------------
<1> The number of active shard copies to wait for before the unfreeze index API
returns a response, as an `ActiveShardCount`

["source","java",subs="attributes,callouts,macros"]
@@ -20,7 +20,7 @@ license started. If it was not started, it returns an error message describing
why.

Acknowledgement messages may also be returned if this API was called without
the `acknowledge` flag set to `true`. In this case you need to display the
messages to the end user and if they agree, resubmit the request with the
`acknowledge` flag set to `true`. Please note that the response will still
return a 200 return code even if it requires an acknowledgement. So, it is

@@ -23,7 +23,7 @@ license started. If it was not started, it returns an error message describing
why.

Acknowledgement messages may also be returned if this API was called without
the `acknowledge` flag set to `true`. In this case you need to display the
messages to the end user and if they agree, resubmit the request with the
`acknowledge` flag set to `true`. Please note that the response will still
return a 200 return code even if it requires an acknowledgement. So, it is
@@ -40,7 +40,7 @@ include-tagged::{doc-tests-file}[x-pack-{api}-execute]

The returned +{response}+ holds lists and maps of values which correspond to the capabilities
of the target index/index pattern (what jobs were configured for the pattern, where the data is stored, what
aggregations are available, etc). It provides essentially the same data as the original job configuration,
just presented in a different manner.

For example, if we had created a job with the following config:

@@ -10,7 +10,7 @@
experimental::[]

The Get Rollup Index Capabilities API allows the user to determine if a concrete index or index pattern contains
stored rollup jobs and data. If it contains data stored from rollup jobs, the capabilities of those jobs
are returned. The API accepts a `GetRollupIndexCapsRequest` object as a request and returns a `GetRollupIndexCapsResponse`.

[id="{upid}-x-pack-{api}-request"]

@@ -40,7 +40,7 @@ include-tagged::{doc-tests-file}[x-pack-{api}-execute]

The returned +{response}+ holds lists and maps of values which correspond to the capabilities
of the rollup index/index pattern (what jobs are stored in the index, their capabilities, what
aggregations are available, etc). Because multiple jobs can be stored in one index, the
response may include several jobs with different configurations.

The capabilities are essentially the same as the original job configuration, just presented in a different
@@ -62,7 +62,7 @@ if the privilege was not part of the request).
A `Map<String, Map<String, Map<String, Boolean>>>` where each key is the
name of an application (as specified in the +{request}+).
For each application, the value is a `Map` keyed by resource name, with
each value being another `Map` from privilege name to a `Boolean`.
The `Boolean` value is `true` if the user has that privilege on that
resource for that application, and `false` otherwise.
+
@@ -34,7 +34,7 @@ include-tagged::{doc-tests}/SnapshotClientDocumentationIT.java[delete-snapshot-e
[[java-rest-high-snapshot-delete-snapshot-async]]
==== Asynchronous Execution

The asynchronous execution of a delete snapshot request requires both the
`DeleteSnapshotRequest` instance and an `ActionListener` instance to be
passed to the asynchronous method:
@@ -150,7 +150,7 @@ should be consulted: https://hc.apache.org/httpcomponents-asyncclient-4.1.x/ .

NOTE: If your application runs under the security manager you might be subject
to the JVM default policies of caching positive hostname resolutions
indefinitely and negative hostname resolutions for ten seconds. If the resolved
addresses of the hosts to which you are connecting the client vary with time
then you might want to modify the default JVM behavior. These can be modified by
adding
@@ -184,6 +184,6 @@ whenever none of the nodes from the preferred rack is available.

WARNING: Node selectors that do not consistently select the same set of nodes
will make round-robin behaviour unpredictable and possibly unfair. The
preference example above is fine as it reasons about availability of nodes
which already affects the predictability of round-robin. Node selection should
not depend on other external factors or round-robin will not work properly.
@@ -97,7 +97,7 @@ include-tagged::{doc-tests}/SnifferDocumentation.java[sniff-on-failure]
failure, but an additional sniffing round is also scheduled sooner than usual,
by default one minute after the failure, assuming that things will go back to
normal and we want to detect that as soon as possible. Said interval can be
customized at `Sniffer` creation time through the `setSniffAfterFailureDelayMillis`
method. Note that this last configuration parameter has no effect in case sniffing
on failure is not enabled, as explained above.
<3> Set the `Sniffer` instance to the failure listener
@@ -24,7 +24,7 @@ The standard <<painless-api-reference-shared, Painless API>> is available.

To run this example, first follow the steps in <<painless-context-examples, context examples>>.

The painless context in a `bucket_script` aggregation provides a `params` map. This map contains both
user-specified custom values, as well as the values from other aggregations specified in the `buckets_path`
property.
@@ -36,7 +36,7 @@ and adds the user-specified base_cost to the result:
(params.max - params.min) + params.base_cost
--------------------------------------------------

Note that the values are extracted from the `params` map. In context, the aggregation looks like this:

[source,console]
--------------------------------------------------
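The diff cuts off at the start of the aggregation it refers to. A minimal sketch of such a `bucket_script` aggregation is shown below; the index, field names, and `base_cost` value are assumptions for illustration, not the example from the changed file.

[source,console]
--------------------------------------------------
GET my-index/_search?size=0
{
  "aggs": {
    "daily_buckets": {
      "date_histogram": { "field": "timestamp", "calendar_interval": "day" },
      "aggs": {
        "min_cost": { "min": { "field": "cost" } },
        "max_cost": { "max": { "field": "cost" } },
        "spread_plus_base": {
          "bucket_script": {
            "buckets_path": { "min": "min_cost", "max": "max_cost" },
            "script": {
              "params": { "base_cost": 5 },
              "source": "(params.max - params.min) + params.base_cost"
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------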
@@ -26,7 +26,7 @@ The standard <<painless-api-reference-shared, Painless API>> is available.

To run this example, first follow the steps in <<painless-context-examples, context examples>>.

The painless context in a `bucket_selector` aggregation provides a `params` map. This map contains both
user-specified custom values, as well as the values from other aggregations specified in the `buckets_path`
property.
@@ -41,7 +41,7 @@ params.max + params.base_cost > 10
--------------------------------------------------

Note that the values are extracted from the `params` map. The script is in the form of an expression
that returns `true` or `false`. In context, the aggregation looks like this:

[source,console]
--------------------------------------------------
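Again the diff truncates the example; a sketch of a `bucket_selector` along these lines, with hypothetical index and field names, could be:

[source,console]
--------------------------------------------------
GET my-index/_search?size=0
{
  "aggs": {
    "daily_buckets": {
      "date_histogram": { "field": "timestamp", "calendar_interval": "day" },
      "aggs": {
        "max_cost": { "max": { "field": "cost" } },
        "keep_expensive_days": {
          "bucket_selector": {
            "buckets_path": { "max": "max_cost" },
            "script": {
              "params": { "base_cost": 5 },
              "source": "params.max + params.base_cost > 10"
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------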
@@ -19,7 +19,7 @@ full metric aggregation.
*Side Effects*

`state` (`Map`)::
Add values to this `Map` for use in a map. Additional values must
be of the type `Map`, `List`, `String` or primitive.

*Return*
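To give the `state` map some concrete context, here is a minimal `scripted_metric` aggregation that seeds `state` in the init script and reads it in the later phases; the index and field names are hypothetical.

[source,console]
--------------------------------------------------
GET my-index/_search?size=0
{
  "aggs": {
    "total_amount": {
      "scripted_metric": {
        "init_script": "state.amounts = []",
        "map_script": "state.amounts.add(doc['amount'].value)",
        "combine_script": "double total = 0; for (t in state.amounts) { total += t } return total",
        "reduce_script": "double total = 0; for (t in states) { total += t } return total"
      }
    }
  }
}
--------------------------------------------------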
@@ -32,7 +32,7 @@ part of a full metric aggregation.
primitive. The same `state` `Map` is shared between all aggregated documents
on a given shard. If an initialization script is provided as part of the
aggregation then values added from the initialization script are
available. If no combine script is specified, values must be
directly stored in `state` in a usable form. If no combine script and no
<<painless-metric-agg-reduce-context, reduce script>> are specified, the
`state` values are used as the result.
@@ -11,8 +11,8 @@ score to documents returned from a query.
User-defined parameters passed in as part of the query.

`doc` (`Map`, read-only)::
Contains the fields of the current document. For single-valued fields,
the value can be accessed via `doc['fieldname'].value`. For multi-valued
fields, this returns the first value; other values can be accessed
via `doc['fieldname'].get(index)`
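A minimal `script_score` query that reads a single-valued field through `doc`, in the way this passage describes; the index name, the `likes` field, and the `weight` parameter are assumptions.

[source,console]
--------------------------------------------------
GET my-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "params.weight * doc['likes'].value",
        "params": { "weight": 2 }
      }
    }
  }
}
--------------------------------------------------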
@@ -11,19 +11,19 @@ documents in a query.
The weight as calculated by a <<painless-weight-context,weight script>>

`query.boost` (`float`, read-only)::
The boost value if provided by the query. If this is not provided the
value is `1.0f`.

`field.docCount` (`long`, read-only)::
The number of documents that have a value for the current field.

`field.sumDocFreq` (`long`, read-only)::
The sum of all terms that exist for the current field. If this is not
available the value is `-1`.

`field.sumTotalTermFreq` (`long`, read-only)::
The sum of occurrences in the index for all the terms that exist in the
current field. If this is not available the value is `-1`.

`term.docFreq` (`long`, read-only)::
The number of documents that contain the current term in the index.
@@ -32,7 +32,7 @@ documents in a query.
The total occurrences of the current term in the index.

`doc.length` (`long`, read-only)::
The number of tokens the current document has in the current field. This
is decoded from the stored {ref}/norms.html[norms] and may be approximate for
long fields

@@ -45,7 +45,7 @@ Note that the `query`, `field`, and `term` variables are also available to the
there, as they are constant for all documents.

For queries that contain multiple terms, the script is called once for each
term with that term's calculated weight, and the results are summed. Note that some
terms might have a `doc.freq` value of `0` on a document, for example if a query
uses synonyms.
@@ -10,8 +10,8 @@ Use a Painless script to
User-defined parameters passed in as part of the query.

`doc` (`Map`, read-only)::
Contains the fields of the current document. For single-valued fields,
the value can be accessed via `doc['fieldname'].value`. For multi-valued
fields, this returns the first value; other values can be accessed
via `doc['fieldname'].get(index)`
@@ -3,7 +3,7 @@

Use a Painless script to create a
{ref}/index-modules-similarity.html[weight] for use in a
<<painless-similarity-context, similarity script>>. The weight makes up the
part of the similarity calculation that is independent of the document being
scored, and so can be built up front and cached.

@@ -12,19 +12,19 @@ Queries that contain multiple terms calculate a separate weight for each term.
*Variables*

`query.boost` (`float`, read-only)::
The boost value if provided by the query. If this is not provided the
value is `1.0f`.

`field.docCount` (`long`, read-only)::
The number of documents that have a value for the current field.

`field.sumDocFreq` (`long`, read-only)::
The sum of all terms that exist for the current field. If this is not
available the value is `-1`.

`field.sumTotalTermFreq` (`long`, read-only)::
The sum of occurrences in the index for all the terms that exist in the
current field. If this is not available the value is `-1`.

`term.docFreq` (`long`, read-only)::
The number of documents that contain the current term in the index.
@@ -4,7 +4,7 @@
A cast converts the value of an original type to the equivalent value of a
target type. An implicit cast infers the target type and automatically occurs
during certain <<painless-operators, operations>>. An explicit cast specifies
the target type and forcefully occurs as its own operation. Use the `cast
operator '()'` to specify an explicit cast.

Refer to the <<allowed-casts, cast table>> for a quick reference on all

@@ -8,7 +8,7 @@ to repeat its specific task. A parameter is a named type value available as a
function specifies zero-to-many parameters, and when a function is called a
value is specified per parameter. An argument is a value passed into a function
at the point of call. A function specifies a return type value, though if the
type is <<void-type, void>> then no value is returned. Any non-void type return
value is available for use within an <<painless-operators, operation>> or is
discarded otherwise.

@@ -11,7 +11,7 @@ Use an integer literal to specify an integer type value in decimal, octal, or
hex notation of a <<primitive-types, primitive type>> `int`, `long`, `float`,
or `double`. Use the following single letter designations to specify the
primitive type: `l` or `L` for `long`, `f` or `F` for `float`, and `d` or `D`
for `double`. If not specified, the type defaults to `int`. Use `0` as a prefix
to specify an integer literal as octal, and use `0x` or `0X` as a prefix to
specify an integer literal as hex.
@@ -86,7 +86,7 @@ EXPONENT: ( [eE] [+\-]? [0-9]+ );
Use a string literal to specify a <<string-type, `String` type>> value with
either single-quotes or double-quotes. Use a `\"` token to include a
double-quote as part of a double-quoted string literal. Use a `\'` token to
include a single-quote as part of a single-quoted string literal. Use a `\\`
token to include a backslash as part of any string literal.

*Grammar*

@@ -76,7 +76,7 @@ int z = add(1, 2); <2>
==== Cast

An explicit cast converts the value of an original type to the equivalent value
of a target type forcefully as an operation. Use the `cast operator '()'` to
specify an explicit cast. Refer to <<painless-casting, casting>> for more
information.

@@ -85,7 +85,7 @@ information.

A conditional consists of three expressions. The first expression is evaluated
with an expected boolean result type. If the first expression evaluates to true
then the second expression will be evaluated. If the first expression evaluates
to false then the third expression will be evaluated. The second and third
expressions will be <<promotion, promoted>> if the evaluated values are not the
same type. Use the `conditional operator '? :'` as a shortcut to avoid the need
@@ -254,7 +254,7 @@ V = (T)(V op expression);

The table below shows the available operators for use in a compound assignment.
Each operator follows the casting/promotion rules according to their regular
definition. For numeric operations there is an extra implicit cast when
necessary to return the promoted numeric type value to the original numeric type
value of the variable/field and can result in data loss.

@@ -668,7 +668,7 @@ def y = x/2; <2>
==== Remainder

Use the `remainder operator '%'` to calculate the REMAINDER for division
between two numeric type values. Rules for NaN values and division by zero follow the JVM
specification.

*Errors*

@@ -809,7 +809,7 @@ def y = x+2; <2>
==== Subtraction

Use the `subtraction operator '-'` to SUBTRACT a right-hand side numeric type
value from a left-hand side numeric type value. Rules for resultant overflow
and NaN values follow the JVM specification.

*Errors*

@@ -955,7 +955,7 @@ def y = x << 1; <2>

Use the `right shift operator '>>'` to SHIFT higher order bits to lower order
bits in a left-hand side integer type value by the distance specified in a
right-hand side integer type value. The highest order bit of the left-hand side
integer type value is preserved.

*Errors*
@@ -2,10 +2,10 @@
=== Operators

An operator is the most basic action that can be taken to evaluate values in a
script. An expression is one-to-many consecutive operations. Precedence is the
order in which an operator will be evaluated relative to another operator.
Associativity is the direction within an expression in which a specific operator
is evaluated. The following table lists all available operators:

[cols="<6,<3,^3,^2,^4"]
|====

@@ -259,7 +259,7 @@ during operations.
Declare a `def` type <<painless-variables, variable>> or access a `def` type
member field (from a reference type instance), and assign it any type of value
for evaluation during later operations. The default value for a newly-declared
`def` type variable is `null`. A `def` type variable or method/function
parameter can change the type it represents during the compilation and
evaluation of a script.

@@ -400,7 +400,7 @@ range `[2, d]` where `d >= 2`, each element within each dimension in the range
`[1, d-1]` is also an array type. The element type of each dimension, `n`, is an
array type with the number of dimensions equal to `d-n`. For example, consider
`int[][][]` with 3 dimensions. Each element in the 3rd dimension, `d-3`, is the
primitive type `int`. Each element in the 2nd dimension, `d-2`, is the array
type `int[]`. And each element in the 1st dimension, `d-1` is the array type
`int[][]`.
@@ -12,7 +12,7 @@ transliteration.
================================================

From time to time, the ICU library receives updates such as adding new
characters and emojis, and improving collation (sort) orders. These changes
may or may not affect search and sort orders, depending on which character
sets you are using.
@@ -38,11 +38,11 @@ The following parameters are accepted:

`method`::

Normalization method. Accepts `nfkc`, `nfc` or `nfkc_cf` (default)

`mode`::

Normalization mode. Accepts `compose` (default) or `decompose`.

[[analysis-icu-normalization-charfilter]]
==== ICU Normalization Character Filter

@@ -52,7 +52,7 @@ http://userguide.icu-project.org/transforms/normalization[here].
It registers itself as the `icu_normalizer` character filter, which is
available to all indices without any further configuration. The type of
normalization can be specified with the `name` parameter, which accepts `nfc`,
`nfkc`, and `nfkc_cf` (default). Set the `mode` parameter to `decompose` to
convert `nfc` to `nfd` or `nfkc` to `nfkd` respectively:

Which letters are normalized can be controlled by specifying the

@@ -328,7 +328,7 @@ PUT icu_sample

[WARNING]
======
This token filter has been deprecated since Lucene 5.0. Please use
<<analysis-icu-collation-keyword-field, ICU Collation Keyword Field>>.
======
@@ -404,7 +404,7 @@ The following parameters are accepted by `icu_collation_keyword` fields:
`null_value`::

Accepts a string value which is substituted for any explicit `null`
values. Defaults to `null`, which means the field is treated as missing.

{ref}/ignore-above.html[`ignore_above`]::

@@ -434,7 +434,7 @@ The strength property determines the minimum level of difference considered
significant during comparison. Possible values are: `primary`, `secondary`,
`tertiary`, `quaternary` or `identical`. See the
https://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation documentation]
for a more detailed explanation for each value. Defaults to `tertiary`
unless otherwise specified in the collation.

`decomposition`::
@@ -483,7 +483,7 @@ Single character or contraction. Controls what is variable for `alternate`.

`hiragana_quaternary_mode`::

Possible values: `true` or `false`. Distinguishes between Katakana and
Hiragana characters in `quaternary` strength.
@@ -495,7 +495,7 @@ case mapping, normalization, transliteration and bidirectional text handling.

You can define which transformation you want to apply with the `id` parameter
(defaults to `Null`), and specify text direction with the `dir` parameter
which accepts `forward` (default) for LTR and `reverse` for RTL. Custom
rulesets are not yet supported.

For example:

@@ -103,7 +103,7 @@ The `kuromoji_tokenizer` accepts the following settings:
--

The tokenization mode determines how the tokenizer handles compound and
unknown words. It can be set to:

`normal`::
@@ -403,11 +403,11 @@ form in either katakana or romaji. It accepts the following setting:

`use_romaji`::

Whether romaji reading form should be output instead of katakana. Defaults to `false`.

When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
to `true`. The default when defining a custom `kuromoji_readingform`, however,
is `false`. The only reason to use the custom form is if you need the
katakana reading form:

[source,console]
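The example block that follows in the original file is not part of this hunk; a custom `kuromoji_readingform` filter along the lines the text describes could be set up as sketched below (index, analyzer, and filter names are illustrative).

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "katakana_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "katakana_readingform" ]
          }
        },
        "filter": {
          "katakana_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": false
          }
        }
      }
    }
  }
}
--------------------------------------------------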
@@ -521,7 +521,7 @@ GET kuromoji_sample/_analyze

The `ja_stop` token filter filters out Japanese stopwords (`_japanese_`), and
any other custom stopwords specified by the user. This filter only supports
the predefined `_japanese_` stopwords list. If you want to use a different
predefined list, then use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.

@@ -16,7 +16,7 @@ The `phonetic` token filter takes the following settings:

`encoder`::

Which phonetic encoder to use. Accepts `metaphone` (default),
`double_metaphone`, `soundex`, `refined_soundex`, `caverphone1`,
`caverphone2`, `cologne`, `nysiis`, `koelnerphonetik`, `haasephonetik`,
`beider_morse`, `daitch_mokotoff`.
@@ -24,7 +24,7 @@ The `phonetic` token filter takes the following settings:
`replace`::

Whether or not the original token should be replaced by the phonetic
token. Accepts `true` (default) and `false`. Not supported by
`beider_morse` encoding.

[source,console]
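For orientation, a `phonetic` token filter combining the `encoder` and `replace` settings above might be configured like this; the index, analyzer, and filter names are placeholders.

[source,console]
--------------------------------------------------
PUT phonetic_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [ "lowercase", "my_metaphone" ]
          }
        },
        "filter": {
          "my_metaphone": {
            "type": "phonetic",
            "encoder": "metaphone",
            "replace": false
          }
        }
      }
    }
  }
}
--------------------------------------------------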
@@ -81,7 +81,7 @@ supported:

`max_code_len`::

The maximum length of the emitted metaphone token. Defaults to `4`.

[discrete]
===== Beider Morse settings

@@ -46,7 +46,7 @@ PUT /stempel_example

The `polish_stop` token filter filters out Polish stopwords (`_polish_`), and
any other custom stopwords specified by the user. This filter only supports
the predefined `_polish_` stopwords list. If you want to use a different
predefined list, then use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.

@@ -14,7 +14,7 @@ The Elasticsearch repository contains examples of:
* a https://github.com/elastic/elasticsearch/tree/master/plugins/examples/script-expert-scoring[Java plugin]
which contains a script plugin.

These examples provide the bare bones needed to get started. For more
information about how to write a plugin, we recommend looking at the plugins
listed in this documentation for inspiration.
@@ -74,7 +74,7 @@ in the presence of plugins with the incorrect `elasticsearch.version`.
=== Testing your plugin

When testing a Java plugin, it will only be auto-loaded if it is in the
`plugins/` directory. Use `bin/elasticsearch-plugin install file:///path/to/your/plugin`
to install your plugin for testing.

You may also load your plugin within the test framework for integration tests.

@@ -130,7 +130,7 @@ discovery:
We will expose here one strategy which is to hide our Elasticsearch cluster from outside.

With this strategy, only VMs behind the same virtual port can talk to each
other. That means that with this mode, you can use Elasticsearch unicast
discovery to build a cluster, using the Azure API to retrieve information
about your nodes.

@@ -416,7 +416,7 @@ gcloud config set project es-cloud
[[discovery-gce-usage-tips-permissions]]
===== Machine Permissions

If you have created a machine without the correct permissions, you will see `403 unauthorized` error messages. To change machine permission on an existing instance, first stop the instance then Edit. Scroll down to `Access Scopes` to change permission. The other way to alter these permissions is to delete the instance (NOT THE DISK). Then create another with the correct permissions.

Creating machines with gcloud::
+
@@ -293,7 +293,7 @@ The annotated highlighter is based on the `unified` highlighter and supports the
settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using
html-like markup such as `<em>cat</em>` the annotated highlighter uses the same
markdown-like syntax used for annotations and injects a key=value annotation where `_hit_term`
is the key and the matched search term is the value e.g.

The [cat](_hit_term=cat) sat on the [mat](sku3578)

@@ -231,7 +231,7 @@ user for confirmation before continuing with installation.
When running the plugin install script from another program (e.g. install
automation scripts), the plugin script should detect that it is not being
called from the console and skip the confirmation response, automatically
granting all requested permissions. If console detection fails, then batch
mode can be forced by specifying `-b` or `--batch` as follows:

[source,shell]

@@ -243,7 +243,7 @@ sudo bin/elasticsearch-plugin install --batch [pluginname]
=== Custom config directory

If your `elasticsearch.yml` config file is in a custom location, you will need
to specify the path to the config file when using the `plugin` script. You
can do this as follows:

[source,sh]
@@ -6,7 +6,7 @@ The following pages have moved or been deleted.
[role="exclude",id="discovery-multicast"]
=== Multicast Discovery Plugin

The `multicast-discovery` plugin has been removed. Instead, configure networking
using unicast (see {ref}/modules-network.html[Network settings]) or using
one of the <<discovery,cloud discovery plugins>>.

@@ -57,7 +57,7 @@ this configuration (such as Compute Engine, Kubernetes Engine or App Engine).
You have to obtain and provide https://cloud.google.com/iam/docs/overview#service_account[service account credentials]
manually.

For detailed information about generating JSON service account files, see the https://cloud.google.com/storage/docs/authentication?hl=en#service_accounts[Google Cloud documentation].
Note that the PKCS12 format is not supported by this plugin.

Here is a summary of the steps:

@@ -88,7 +88,7 @@ A JSON service account file looks like this:
----
// NOTCONSOLE

To provide this file to the plugin, it must be stored in the {ref}/secure-settings.html[Elasticsearch keystore]. You must
add a `file` setting with the name `gcs.client.NAME.credentials_file` using the `add-file` subcommand.
`NAME` is the name of the client configuration for the repository. The implicit client
name is `default`, but a different client name can be specified in the
@@ -312,7 +312,7 @@ include::repository-shared-settings.asciidoc[]
https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl[S3
canned ACLs] : `private`, `public-read`, `public-read-write`,
`authenticated-read`, `log-delivery-write`, `bucket-owner-read`,
`bucket-owner-full-control`. Defaults to `private`. You could specify a
canned ACL using the `canned_acl` setting. When the S3 repository creates
buckets and objects, it adds the canned ACL into the buckets and objects.
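A registration request that sets the `canned_acl` discussed above could look like the following sketch; the repository and bucket names are placeholders.

[source,console]
--------------------------------------------------
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-bucket",
    "canned_acl": "private"
  }
}
--------------------------------------------------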
@@ -324,8 +324,8 @@ include::repository-shared-settings.asciidoc[]
Changing this setting on an existing repository only affects the
storage class for newly created objects, resulting in a mixed usage of
storage classes. Additionally, S3 Lifecycle Policies can be used to manage
the storage class of existing objects. Due to the extra complexity with the
Glacier class lifecycle, it is not currently supported by the plugin. For
more information about the different classes, see
https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-class-intro.html[AWS
Storage Classes Guide]

@@ -335,9 +335,9 @@ documented below is considered deprecated, and will be removed in a future
version.

In addition to the above settings, you may also specify all non-secure client
settings in the repository settings. In this case, the client settings found in
the repository settings will be merged with those of the named client used by
the repository. Conflicts between client and repository settings are resolved
by the repository settings taking precedence over client settings.

For example:
@@ -9,4 +9,4 @@

`readonly`::

Makes repository read-only. Defaults to `false`.
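As an illustration of the `readonly` flag, a shared file system repository could be registered read-only as sketched here; the repository name and path are assumptions.

[source,console]
--------------------------------------------------
PUT _snapshot/my_read_only_repository
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup",
    "readonly": true
  }
}
--------------------------------------------------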
@@ -28,7 +28,7 @@ other aggregations instead of documents or fields.
=== Run an aggregation

You can run aggregations as part of a <<search-your-data,search>> by specifying the <<search-search,search API>>'s `aggs` parameter. The
following search runs a
<<search-aggregations-bucket-terms-aggregation,terms aggregation>> on
`my-field`:
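The search this passage refers to, a `terms` aggregation on `my-field`, is along these lines; the index name is illustrative.

[source,console]
--------------------------------------------------
GET /my-index/_search
{
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "my-field"
      }
    }
  }
}
--------------------------------------------------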
@@ -110,7 +110,7 @@ buckets requested.

==== Time Zone

Date-times are stored in Elasticsearch in UTC. By default, all bucketing and
rounding is also done in UTC. The `time_zone` parameter can be used to indicate
that bucketing should use a different time zone.
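As a sketch of the `time_zone` parameter in use, here is a `date_histogram` aggregation bucketing in a non-UTC zone; the index and field names are assumptions.

[source,console]
--------------------------------------------------
POST /my-index/_search?size=0
{
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "day",
        "time_zone": "-01:00"
      }
    }
  }
}
--------------------------------------------------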
@@ -291,7 +291,7 @@ GET /_search

*Time Zone*

Date-times are stored in Elasticsearch in UTC. By default, all bucketing and
rounding is also done in UTC. The `time_zone` parameter can be used to indicate
that bucketing should use a different time zone.

@@ -853,7 +853,7 @@ GET /_search

The composite agg is not currently compatible with pipeline aggregations, nor does it make sense in most cases.
E.g. due to the paging nature of composite aggs, a single logical partition (one day for example) might be spread
over multiple pages. Since pipeline aggregations are purely post-processing on the final list of buckets,
running something like a derivative on a composite page could lead to inaccurate results as it is only taking into
account a "partial" result on that page.
@@ -51,7 +51,7 @@ This behavior has been deprecated in favor of two new, explicit fields: `calenda
and `fixed_interval`.

By forcing a choice between calendar and intervals up front, the semantics of the interval
are clear to the user immediately and there is no ambiguity. The old `interval` field
will be removed in the future.
==================================
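To make the calendar-versus-fixed choice concrete, compare `"calendar_interval": "month"` (calendar-aware months) with `"fixed_interval": "30d"` (always exactly thirty days). A minimal calendar-aware request, with an assumed index and field, might be:

[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      }
    }
  }
}
--------------------------------------------------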
@@ -92,7 +92,7 @@ GET logs/_search
// TEST[continued]

The filtered buckets are returned in the same order as provided in the
request. The response for this example would be:

[source,console-result]
--------------------------------------------------
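The response block is truncated by the diff; for orientation, a request of the shape this section describes (anonymous filters supplied as an array, so the buckets come back in request order) might look like the following, with hypothetical field values.

[source,console]
--------------------------------------------------
GET logs/_search
{
  "size": 0,
  "aggs": {
    "messages": {
      "filters": {
        "filters": [
          { "match": { "body": "error" } },
          { "match": { "body": "warning" } }
        ]
      }
    }
  }
}
--------------------------------------------------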
@@ -19,7 +19,7 @@ bucket_key = Math.floor((value - offset) / interval) * interval + offset
--------------------------------------------------

For range values, a document can fall into multiple buckets. The first bucket is computed from the lower
bound of the range in the same way as a bucket for a single value is computed. The final bucket is computed in the same
way from the upper bound of the range, and the range is counted in all buckets in between and including those two.

The `interval` must be a positive decimal, while the `offset` must be a decimal in `[0, interval)`

@@ -183,7 +183,7 @@ POST /sales/_search?size=0
--------------------------------------------------
// TEST[setup:sales]

When aggregating ranges, buckets are based on the values of the returned documents. This means the response may include
buckets outside of a query's range. For example, if your query looks for values greater than 100, and you have a range
covering 50 to 150, and an interval of 50, that document will land in 3 buckets - 50, 100, and 150. In general, it's
best to think of the query and aggregation steps as independent - the query selects a set of documents, and then the

@@ -6,7 +6,7 @@
Since a range represents multiple values, running a bucket aggregation over a
range field can result in the same document landing in multiple buckets. This
can lead to surprising behavior, such as the sum of bucket counts being higher
than the number of matched documents. For example, consider the following
index:
[source, console]
--------------------------------------------------
@@ -184,7 +184,7 @@ calculated over the ranges of all matching documents.
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

Depending on the use case, a `CONTAINS` query could limit the documents to only
those that fall entirely in the queried range. In this example, the one
document would not be included and the aggregation would be empty. Filtering
the buckets after the aggregation is also an option, for use cases where the
document should be counted but the out of bounds data can be safely ignored.

@@ -5,9 +5,9 @@
++++

A multi-bucket value source based aggregation which finds "rare" terms -- terms that are at the long-tail
of the distribution and are not frequent. Conceptually, this is like a `terms` aggregation that is
sorted by `_count` ascending. As noted in the <<search-aggregations-bucket-terms-aggregation-order,terms aggregation docs>>,
actually ordering a `terms` agg by count ascending has unbounded error. Instead, you should use the `rare_terms`
aggregation

//////////////////////////
@@ -78,7 +78,7 @@ A `rare_terms` aggregation looks like this in isolation:
|Parameter Name |Description |Required |Default Value
|`field` |The field we wish to find rare terms in |Required |
|`max_doc_count` |The maximum number of documents a term should appear in. |Optional |`1`
|`precision` |The precision of the internal CuckooFilters. Smaller precision leads to
better approximation, but higher memory usage. Cannot be smaller than `0.00001` |Optional |`0.01`
|`include` |Terms that should be included in the aggregation|Optional |
|`exclude` |Terms that should be excluded from the aggregation|Optional |
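Pulling the table's parameters together, a minimal `rare_terms` request might be the following; the `genre` field is illustrative.

[source,console]
--------------------------------------------------
GET /_search
{
  "aggs": {
    "genres": {
      "rare_terms": {
        "field": "genre",
        "max_doc_count": 2
      }
    }
  }
}
--------------------------------------------------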
@@ -124,7 +124,7 @@ Response:
// TESTRESPONSE[s/\.\.\.//]

In this example, the only bucket that we see is the "swing" bucket, because it is the only term that appears in
one document. If we increase the `max_doc_count` to `2`, we'll see some more buckets:

[source,console,id=rare-terms-aggregation-max-doc-count-example]
--------------------------------------------------
@@ -169,27 +169,27 @@ This now shows the "jazz" term which has a `doc_count` of 2":
[[search-aggregations-bucket-rare-terms-aggregation-max-doc-count]]
==== Maximum document count

The `max_doc_count` parameter is used to control the upper bound of document counts that a term can have. There
is not a size limitation on the `rare_terms` agg like `terms` agg has. This means that terms
which match the `max_doc_count` criteria will be returned. The aggregation functions in this manner to avoid
the order-by-ascending issues that afflict the `terms` aggregation.

This does, however, mean that a large number of results can be returned if chosen incorrectly.
To limit the danger of this setting, the maximum `max_doc_count` is 100.

[[search-aggregations-bucket-rare-terms-aggregation-max-buckets]]
==== Max Bucket Limit

The Rare Terms aggregation is more liable to trip the `search.max_buckets` soft limit than other aggregations due
to how it works. The `max_bucket` soft-limit is evaluated on a per-shard basis while the aggregation is collecting
results. It is possible for a term to be "rare" on a shard but become "not rare" once all the shard results are
merged together. This means that individual shards tend to collect more buckets than are truly rare, because
they only have their own local view. This list is ultimately pruned to the correct, smaller list of rare
terms on the coordinating node... but a shard may have already tripped the `max_buckets` soft limit and aborted
the request.

When aggregating on fields that have potentially many "rare" terms, you may need to increase the `max_buckets` soft
limit. Alternatively, you might need to find a way to filter the results to return fewer rare values (smaller time
span, filter by category, etc), or re-evaluate your definition of "rare" (e.g. if something
appears 100,000 times, is it truly "rare"?)
@@ -197,8 +197,8 @@ appears 100,000 times, is it truly "rare"?)
==== Document counts are approximate

The naive way to determine the "rare" terms in a dataset is to place all the values in a map, incrementing counts
as each document is visited, then return the bottom `n` rows. This does not scale beyond even modestly sized data
sets. A sharded approach where only the "top n" values are retained from each shard (ala the `terms` aggregation)
fails because the long-tail nature of the problem means it is impossible to find the "top n" bottom values without
simply collecting all the values from all shards.
@@ -208,16 +208,16 @@ Instead, the Rare Terms aggregation uses a different approximate algorithm:
2. Each additional occurrence of the term increments a counter in the map
3. If the counter > the `max_doc_count` threshold, the term is removed from the map and placed in a
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf[CuckooFilter]
4. The CuckooFilter is consulted on each term. If the value is inside the filter, it is known to be above the
threshold already and skipped.

After execution, the map of values is the map of "rare" terms under the `max_doc_count` threshold. This map and CuckooFilter
are then merged with all other shards. If there are terms that are greater than the threshold (or appear in
a different shard's CuckooFilter) the term is removed from the merged list. The final map of values is returned
to the user as the "rare" terms.

CuckooFilters have the possibility of returning false positives (they can say a value exists in their collection when
it actually does not). Since the CuckooFilter is being used to see if a term is over threshold, this means a false positive
from the CuckooFilter will mistakenly say a value is common when it is not (and thus exclude it from its final list of buckets).
Practically, this means the aggregation exhibits false-negative behavior since the filter is being used "in reverse"
of how people generally think of approximate set membership sketches.
@ -230,14 +230,14 @@ Proceedings of the 10th ACM International on Conference on emerging Networking E
|
|||
==== Precision
|
||||
|
||||
Although the internal CuckooFilter is approximate in nature, the false-negative rate can be controlled with a
|
||||
`precision` parameter. This allows the user to trade more runtime memory for more accurate results.
|
||||
`precision` parameter. This allows the user to trade more runtime memory for more accurate results.
|
||||
|
||||
The default precision is `0.001`, and the smallest (e.g. most accurate and largest memory overhead) is `0.00001`.
|
||||
Below are some charts which demonstrate how the accuracy of the aggregation is affected by precision and number
|
||||
of distinct terms.
|
||||
|
||||
The X-axis shows the number of distinct values the aggregation has seen, and the Y-axis shows the percent error.
|
||||
Each line series represents one "rarity" condition (ranging from one rare item to 100,000 rare items). For example,
|
||||
Each line series represents one "rarity" condition (ranging from one rare item to 100,000 rare items). For example,
|
||||
the orange "10" line means ten of the values were "rare" (`doc_count == 1`), out of 1-20m distinct values (where the
|
||||
rest of the values had `doc_count > 1`)
|
||||
|
||||
|
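
A sketch of supplying the `precision` parameter, again using the assumed `genre` field:

[source,console]
--------------------------------------------------
GET /_search
{
  "aggs": {
    "genres": {
      "rare_terms": {
        "field": "genre",
        "max_doc_count": 1,
        "precision": 0.00001    <1>
      }
    }
  }
}
--------------------------------------------------
<1> Smaller values trade extra runtime memory for a lower false-negative rate.
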
@@ -258,14 +258,14 @@ degrades in a controlled, linear fashion as the number of distinct values increa

The default precision of `0.001` has a memory profile of `1.748⁻⁶ * n` bytes, where `n` is the number
of distinct values the aggregation has seen (it can also be roughly eyeballed, e.g. 20 million unique values is about
30mb of memory). The memory usage is linear to the number of distinct values regardless of which precision is chosen,
the precision only affects the slope of the memory profile as seen in this chart:

image:images/rare_terms/memory.png[]

For comparison, an equivalent terms aggregation at 20 million buckets would be roughly
`20m * 69b == ~1.38gb` (with 69 bytes being a very optimistic estimate of an empty bucket cost, far lower than what
the circuit breaker accounts for). So although the `rare_terms` agg is relatively heavy, it is still orders of
magnitude smaller than the equivalent terms aggregation.

==== Filtering Values

@@ -347,9 +347,9 @@ GET /_search
==== Nested, RareTerms, and scoring sub-aggregations

The RareTerms aggregation has to operate in `breadth_first` mode, since it needs to prune terms as doc count thresholds
are breached. This requirement means the RareTerms aggregation is incompatible with certain combinations of aggregations
that require `depth_first`. In particular, scoring sub-aggregations that are inside a `nested` force the entire aggregation tree to run
in `depth_first` mode. This will throw an exception since RareTerms is unable to process `depth_first`.

As a concrete example, if `rare_terms` aggregation is the child of a `nested` aggregation, and one of the child aggregations of `rare_terms`
needs document scores (like a `top_hits` aggregation), this will throw an exception.
@@ -305,7 +305,7 @@ If there is the equivalent of a `match_all` query or no query criteria providing
top-most aggregation - in this scenario the _foreground_ set is exactly the same as the _background_ set and
so there is no difference in document frequencies to observe and from which to make sensible suggestions.

Another consideration is that the significant_terms aggregation produces many candidate results at shard level
that are only later pruned on the reducing node once all statistics from all shards are merged. As a result,
it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms
aggregation that later discards many candidate terms. It is advisable in these cases to perform two searches - the first to provide a rationalized list of
@@ -374,7 +374,7 @@ Chi square behaves like mutual information and can be configured with the same p

===== Google normalized distance
Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (https://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter

[source,js]
--------------------------------------------------
@@ -448,13 +448,13 @@ size buckets was not returned).

To ensure better accuracy a multiple of the final `size` is used as the number of terms to request from each shard
(`2 * (size * 1.5 + 10)`). To take manual control of this setting the `shard_size` parameter
can be used to control the volumes of candidate terms produced by each shard.

Low-frequency terms can turn out to be the most interesting ones once all results are combined so the
significant_terms aggregation can produce higher-quality results when the `shard_size` parameter is set to
values significantly higher than the `size` setting. This ensures that a bigger volume of promising candidate terms are given
a consolidated review by the reducing node before the final selection. Obviously large candidate term lists
will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced. If `shard_size` is set to -1 (the default) then `shard_size` will be automatically estimated based on the number of shards and the `size` parameter.
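
Purely as an illustration, a request that raises `shard_size` above `size` might be sketched like this; the query and the `force` and `crime_type` fields are assumed example fields:

[source,console]
--------------------------------------------------
GET /_search
{
  "query": {
    "match": { "force": "British Transport Police" }
  },
  "aggs": {
    "significant_crime_types": {
      "significant_terms": {
        "field": "crime_type",
        "size": 10,
        "shard_size": 100    <1>
      }
    }
  }
}
--------------------------------------------------
<1> Each shard returns up to 100 candidate terms, from which the final 10 are selected on the reducing node.
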
NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, Elasticsearch will
@@ -367,13 +367,13 @@ size buckets was not returned).

To ensure better accuracy a multiple of the final `size` is used as the number of terms to request from each shard
(`2 * (size * 1.5 + 10)`). To take manual control of this setting the `shard_size` parameter
can be used to control the volumes of candidate terms produced by each shard.

Low-frequency terms can turn out to be the most interesting ones once all results are combined so the
significant_terms aggregation can produce higher-quality results when the `shard_size` parameter is set to
values significantly higher than the `size` setting. This ensures that a bigger volume of promising candidate terms are given
a consolidated review by the reducing node before the final selection. Obviously large candidate term lists
will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced. If `shard_size` is set to -1 (the default) then `shard_size` will be automatically estimated based on the number of shards and the `size` parameter.

NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, Elasticsearch will
@@ -136,7 +136,7 @@ The higher the requested `size` is, the more accurate the results will be, but a
compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data
transfers between the nodes and the client).

The `shard_size` parameter can be used to minimize the extra work that comes with bigger requested `size`. When defined,
it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the
coordinating node will then reduce them to a final result which will be based on the `size` parameter - this way,
one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to
@@ -191,7 +191,7 @@ determined and is given a value of -1 to indicate this.
==== Order

The order of the buckets can be customized by setting the `order` parameter. By default, the buckets are ordered by
their `doc_count` descending. It is possible to change this behaviour as documented below:

WARNING: Sorting by ascending `_count` or by sub aggregation is discouraged as it increases the
<<search-aggregations-bucket-terms-aggregation-approximate-counts,error>> on document counts.
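
For example, ordering buckets alphabetically by their term can be sketched as follows, with `genre` as an assumed example field:

[source,console]
--------------------------------------------------
GET /_search
{
  "aggs": {
    "genres": {
      "terms": {
        "field": "genre",
        "order": { "_key": "asc" }    <1>
      }
    }
  }
}
--------------------------------------------------
<1> Sorts the buckets by their key in ascending order instead of by `doc_count`.
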
@@ -283,7 +283,7 @@ GET /_search
=======================================

<<search-aggregations-pipeline,Pipeline aggregations>> are run during the
reduce phase after all other aggregations have already completed. For this
reason, they cannot be used for ordering.

=======================================
@@ -606,10 +606,10 @@ WARNING: Partitions cannot be used together with an `exclude` parameter.
==== Multi-field terms aggregation

The `terms` aggregation does not support collecting terms from multiple fields
in the same document. The reason is that the `terms` agg doesn't collect the
string term values themselves, but rather uses
<<search-aggregations-bucket-terms-aggregation-execution-hint,global ordinals>>
to produce a list of all of the unique values in the field. Global ordinals
results in an important performance boost which would not be possible across
multiple fields.

@@ -618,7 +618,7 @@ multiple fields:

<<search-aggregations-bucket-terms-aggregation-script,Script>>::

Use a script to retrieve terms from multiple fields. This disables the global
ordinals optimization and will be slower than collecting terms from a single
field, but it gives you the flexibility to implement this option at search
time.
@@ -627,7 +627,7 @@ time.

If you know ahead of time that you want to collect the terms from two or more
fields, then use `copy_to` in your mapping to create a new dedicated field at
index time which contains the values from both fields. You can aggregate on
this single field, which will benefit from the global ordinals optimization.

<<search-aggregations-bucket-multi-terms-aggregation, `multi_terms` aggregation>>::
@@ -68,15 +68,15 @@ The response will look like this:
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

In this case, the lower and upper whisker values are equal to the min and max. In general, these values are the 1.5 *
IQR range, which is to say the nearest values to `q1 - (1.5 * IQR)` and `q3 + (1.5 * IQR)`. Since this is an approximation, the given values
may not actually be observed values from the data, but should be within a reasonable error bound of them. While the Boxplot aggregation
doesn't directly return outlier points, you can check if `lower > min` or `upper < max` to see if outliers exist on either side, and then
query for them directly.

==== Script

The boxplot metric supports scripting. For example, if our load times
are in milliseconds but we want values calculated in seconds, we could use
a script to convert them on-the-fly:
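
A sketch of such a script, assuming the `latency` index and `load_time` field used elsewhere on this page:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_boxplot": {
      "boxplot": {
        "script": {
          "lang": "painless",
          "source": "doc['load_time'].value / params.timeUnit",    <1>
          "params": {
            "timeUnit": 1000    <2>
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> Divides each raw value on the fly instead of reading the field directly.
<2> Converts milliseconds to seconds.
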
@@ -152,8 +152,8 @@ public static void main(String[] args) {
image:images/cardinality_error.png[]

For all 3 thresholds, counts have been accurate up to the configured threshold.
Although not guaranteed, this is likely to be the case. Accuracy in practice depends
on the dataset in question. In general, most datasets show consistently good
accuracy. Also note that even with a threshold as low as 100, the error
remains very low (1-6% as seen in the above graph) even when counting millions of items.
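
For reference, the threshold is set on the request itself; here is a sketch with an assumed `type` keyword field:

[source,console]
--------------------------------------------------
GET /_search
{
  "size": 0,
  "aggs": {
    "type_count": {
      "cardinality": {
        "field": "type",
        "precision_threshold": 100    <1>
      }
    }
  }
}
--------------------------------------------------
<1> Counts are expected to be close to exact up to roughly this many distinct values.
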
@@ -63,7 +63,7 @@ The name of the aggregation (`grades_stats` above) also serves as the key by whi

==== Standard Deviation Bounds
By default, the `extended_stats` metric will return an object called `std_deviation_bounds`, which provides an interval of plus/minus two standard
deviations from the mean. This can be a useful way to visualize variance of your data. If you want a different boundary, for example
three standard deviations, you can set `sigma` in the request:

[source,console]
@@ -84,7 +84,7 @@ GET /exams/_search
// TEST[setup:exams]
<1> `sigma` controls how many standard deviations +/- from the mean should be displayed

`sigma` can be any non-negative double, meaning you can request non-integer values such as `1.5`. A value of `0` is valid, but will simply
return the average for both `upper` and `lower` bounds.

The `upper` and `lower` bounds are calculated as population metrics so they are always the same as `upper_population` and
@@ -93,8 +93,8 @@ The `upper` and `lower` bounds are calculated as population metrics so they are
.Standard Deviation and Bounds require normality
[NOTE]
=====
The standard deviation and its bounds are displayed by default, but they are not always applicable to all data-sets. Your data must
be normally distributed for the metrics to make sense. The statistics behind standard deviations assume normally distributed data, so
if your data is skewed heavily left or right, the value returned will be misleading.
=====

@@ -10,19 +10,19 @@ generated by a provided script or extracted from specific numeric or
<<histogram,histogram fields>> in the documents.

Percentiles show the point at which a certain percentage of observed values
occur. For example, the 95th percentile is the value which is greater than 95%
of the observed values.

Percentiles are often used to find outliers. In normal distributions, the
0.13th and 99.87th percentiles represent three standard deviations from the
mean. Any data which falls outside three standard deviations is often considered
an anomaly.

When a range of percentiles are retrieved, they can be used to estimate the
data distribution and determine if the data is skewed, bimodal, etc.

Assume your data consists of website load times. The average and median
load times are not overly useful to an administrator. The max may be interesting,
but it can be easily skewed by a single slow response.

Let's look at a range of percentiles representing load time:
@@ -45,7 +45,7 @@ GET latency/_search
<1> The field `load_time` must be a numeric field

By default, the `percentile` metric will generate a range of
percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this:

[source,console-result]
--------------------------------------------------
@@ -70,7 +70,7 @@ percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this:
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

As you can see, the aggregation will return a calculated value for each percentile
in the default range. If we assume response times are in milliseconds, it is
immediately obvious that the webpage normally loads in 10-725ms, but occasionally
spikes to 945-985ms.

@@ -164,7 +164,7 @@ Response:

==== Script

The percentile metric supports scripting. For example, if our load times
are in milliseconds but we want percentiles calculated in seconds, we could use
a script to convert them on-the-fly:

@@ -220,12 +220,12 @@ GET latency/_search
[[search-aggregations-metrics-percentile-aggregation-approximation]]
==== Percentiles are (usually) approximate

There are many different algorithms to calculate percentiles. The naive
implementation simply stores all the values in a sorted array. To find the 50th
percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.

Clearly, the naive implementation does not scale -- the sorted array grows
linearly with the number of values in your dataset. To calculate percentiles
across potentially billions of values in an Elasticsearch cluster, _approximate_
percentiles are calculated.
@@ -235,12 +235,12 @@ https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[C

When using this metric, there are a few guidelines to keep in mind:

- Accuracy is proportional to `q(1-q)`. This means that extreme percentiles (e.g. 99%)
are more accurate than less extreme percentiles, such as the median
- For small sets of values, percentiles are highly accurate (and potentially
100% accurate if the data is small enough).
- As the quantity of values in a bucket grows, the algorithm begins to approximate
the percentiles. It is effectively trading accuracy for memory savings. The
exact level of inaccuracy is difficult to generalize, since it depends on your
data distribution and volume of data being aggregated

@@ -291,18 +291,18 @@ GET latency/_search
// tag::t-digest[]
The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
more nodes available, the higher the accuracy (and large memory footprint) proportional
to the volume of data. The `compression` parameter limits the maximum number of
nodes to `20 * compression`.

Therefore, by increasing the compression value, you can increase the accuracy of
your percentiles at the cost of more memory. Larger compression values also
make the algorithm slower since the underlying tree data structure grows in size,
resulting in more expensive operations. The default compression value is
`100`.

A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount
of data which arrives sorted and in-order) the default settings will produce a
TDigest roughly 64KB in size. In practice data tends to be more random and
the TDigest will use less memory.
// end::t-digest[]
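
As an illustration, increasing the compression might be sketched as follows; the `latency` index and `load_time` field are assumptions reused from the earlier examples:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time",
        "tdigest": {
          "compression": 200    <1>
        }
      }
    }
  }
}
--------------------------------------------------
<1> Allows up to `20 * compression` nodes, improving accuracy at the cost of memory and speed.
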
@@ -17,10 +17,10 @@ regarding approximation and memory use of the percentile ranks aggregation
==================================================

Percentile ranks show the percentage of observed values which are below a certain
value. For example, if a value is greater than or equal to 95% of the observed values
it is said to be at the 95th percentile rank.

Assume your data consists of website load times. You may have a service agreement that
95% of page loads complete within 500ms and 99% of page loads complete within 600ms.

Let's look at a range of percentiles representing load time:

@@ -120,7 +120,7 @@ Response:

==== Script

The percentile rank metric supports scripting. For example, if our load times
are in milliseconds but we want to specify values in seconds, we could use
a script to convert them on-the-fly:

@@ -142,7 +142,7 @@ indices, the term filter on the <<mapping-index-field,`_index`>> field can be us

==== Script

The `t_test` metric supports scripting. For example, if we need to adjust our load times for the before values, we could use
a script to recalculate them on-the-fly:

[source,console]
@@ -7,8 +7,8 @@
A `single-value` metrics aggregation that computes the weighted average of numeric values that are extracted from the aggregated documents.
These values can be extracted either from specific numeric fields in the documents, or provided by a script.

When calculating a regular average, each datapoint has an equal "weight" ... it contributes equally to the final value. Weighted averages,
on the other hand, weight each datapoint differently. The amount that each datapoint contributes to the final value is extracted from the
document, or provided by a script.

As a formula, a weighted average is the `∑(value * weight) / ∑(weight)`
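
Expressed as a request, a minimal sketch looks like the following; an `exams` index with `grade` and `weight` fields is assumed for illustration:

[source,console]
--------------------------------------------------
POST /exams/_search
{
  "size": 0,
  "aggs": {
    "weighted_grade": {
      "weighted_avg": {
        "value": { "field": "grade" },     <1>
        "weight": { "field": "weight" }    <2>
      }
    }
  }
}
--------------------------------------------------
<1> The numeric field that supplies each datapoint's value.
<2> The numeric field that supplies each datapoint's weight.
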
@@ -35,7 +35,7 @@ The `value` and `weight` objects have per-field specific configuration:
|Parameter Name |Description |Required |Default Value
|`field` | The field that values should be extracted from |Required |
|`missing` | A value to use if the field is missing entirely |Optional |
|`script` | A script which provides the values for the document. This is mutually exclusive with `field` |Optional
|===

[[weight-params]]
@@ -45,7 +45,7 @@ The `value` and `weight` objects have per-field specific configuration:
|Parameter Name |Description |Required |Default Value
|`field` | The field that weights should be extracted from |Required |
|`missing` | A weight to use if the field is missing entirely |Optional |
|`script` | A script which provides the weights for the document. This is mutually exclusive with `field` |Optional
|===

@@ -91,7 +91,7 @@ Which yields a response like:
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

While multiple values-per-field are allowed, only one weight is allowed. If the aggregation encounters
a document that has more than one weight (e.g. the weight field is a multi-valued field) it will throw an exception.
If you have this situation, you will need to specify a `script` for the weight field, and use the script
to combine the multiple values into a single value to be used.
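
One hedged way to do that, assuming a multi-valued numeric `weight` field on the example `exams` index, is to average the weights in a Painless script:

[source,console]
--------------------------------------------------
POST /exams/_search
{
  "size": 0,
  "aggs": {
    "weighted_grade": {
      "weighted_avg": {
        "value": { "field": "grade" },
        "weight": {
          "script": {
            "lang": "painless",
            "source": "double total = 0; for (w in doc['weight']) { total += w; } return doc['weight'].size() == 0 ? 1 : total / doc['weight'].size();"    <1>
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> Averages the multi-valued weights into a single weight per document; the fallback of `1` for documents without weights is an arbitrary choice for this sketch.
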
@@ -147,7 +147,7 @@ The aggregation returns `2.0` as the result, which matches what we would expect

==== Script

Both the value and the weight can be derived from a script, instead of a field. As a simple example, the following
will add one to the grade and weight in the document using a script:

[source,console]
@@ -19,7 +19,7 @@ parameter to indicate the paths to the required metrics. The syntax for defining
<<buckets-path-syntax, `buckets_path` Syntax>> section below.

Pipeline aggregations cannot have sub-aggregations but, depending on the type, they can reference another pipeline in the `buckets_path`,
allowing pipeline aggregations to be chained. For example, you can chain together two derivatives to calculate the second derivative
(i.e. a derivative of a derivative).
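
A sketch of such a chain follows; the `sales` index with `date` and `price` fields is assumed, as in the other pipeline examples:

[source,console]
--------------------------------------------------
POST /_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "sales": { "sum": { "field": "price" } },
        "sales_deriv": {
          "derivative": { "buckets_path": "sales" }          <1>
        },
        "sales_2nd_deriv": {
          "derivative": { "buckets_path": "sales_deriv" }    <2>
        }
      }
    }
  }
}
--------------------------------------------------
<1> The first derivative points at the `sales` metric.
<2> The second derivative points at the first derivative, chaining the two pipelines.
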
NOTE: Because pipeline aggregations only add to the output, when chaining pipeline aggregations the output of each pipeline aggregation
@@ -29,7 +29,7 @@ will be included in the final output.
[discrete]
=== `buckets_path` Syntax

Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the `buckets_path`
parameter, which follows a specific format:

// https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form
@@ -77,7 +77,7 @@ POST /_search
<2> The `buckets_path` refers to the metric via a relative path `"the_sum"`

`buckets_path` is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets
instead of embedded "inside" them. For example, the `max_bucket` aggregation uses the `buckets_path` to specify
a metric embedded inside a sibling aggregation:

[source,console,id=buckets-path-sibling-example]
@@ -112,7 +112,7 @@ POST /_search
`sales_per_month` date histogram.

If a Sibling pipeline agg references a multi-bucket aggregation, such as a `terms` agg, it also has the option to
select specific keys from the multi-bucket. For example, a `bucket_script` could select two specific buckets (via
their bucket keys) to perform the calculation:

[source,console,id=buckets-path-specific-bucket-example]
@@ -160,8 +160,8 @@ instead of fetching all the buckets from `sale_type` aggregation
[discrete]
=== Special Paths

Instead of pathing to a metric, `buckets_path` can use a special `"_count"` path. This instructs
the pipeline aggregation to use the document count as its input. For example, a derivative can be calculated
on the document count of each bucket, instead of a specific metric:

[source,console,id=buckets-path-count-example]
@@ -246,7 +246,7 @@ may be referred to as:
[discrete]
=== Dealing with gaps in the data

Data in the real world is often noisy and sometimes contains *gaps* -- places where data simply doesn't exist. This can
occur for a variety of reasons, the most common being:

* Documents falling into a bucket do not contain a required field
@@ -256,11 +256,11 @@ Some pipeline aggregations have specific requirements that must be met (e.g. a d
first value because there is no previous value, HoltWinters moving average needs "warmup" data to begin calculating, etc)

Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing
data is encountered. All pipeline aggregations accept the `gap_policy` parameter. There are currently two gap policies
to choose from:

_skip_::
This option treats missing data as if the bucket does not exist. It will skip the bucket and continue
calculating using the next available value.

_insert_zeros_::
@@ -11,8 +11,8 @@ aggregation. The specified metric must be a cardinality aggregation and the encl
must have `min_doc_count` set to `0` (default for `histogram` aggregations).

The `cumulative_cardinality` agg is useful for finding "total new items", like the number of new visitors to your
website each day. A regular cardinality aggregation will tell you how many unique visitors came each day, but doesn't
differentiate between "new" or "repeat" visitors. The Cumulative Cardinality aggregation can be used to determine
how many of each day's unique visitors are "new".

==== Syntax
@@ -128,14 +128,14 @@ And the following may be the response:

Note how the second day, `2019-01-02`, has two distinct users but the `total_new_users` metric generated by the
cumulative pipeline agg only increments to three. This means that only one of the two users that day was
new, the other had already been seen in the previous day. This happens again on the third day, where only
one of three users is completely new.

==== Incremental cumulative cardinality

The `cumulative_cardinality` agg will show you the total, distinct count since the beginning of the time period
being queried. Sometimes, however, it is useful to see the "incremental" count. Meaning, how many new users
are added each day, rather than the total cumulative count.

This can be accomplished by adding a `derivative` aggregation to our query:
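
A sketch of the combined request, assuming a `user_hits` index with `timestamp` and `user_id` fields:

[source,console]
--------------------------------------------------
GET /user_hits/_search
{
  "size": 0,
  "aggs": {
    "users_per_day": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "distinct_users": {
          "cardinality": { "field": "user_id" }
        },
        "total_new_users": {
          "cumulative_cardinality": { "buckets_path": "distinct_users" }    <1>
        },
        "incremental_new_users": {
          "derivative": { "buckets_path": "total_new_users" }               <2>
        }
      }
    }
  }
}
--------------------------------------------------
<1> The running count of distinct users seen so far.
<2> The day-over-day difference of that running count, i.e. how many new users were added each day.
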
@@ -226,7 +226,7 @@ second derivative
==== Units

The derivative aggregation allows the units of the derivative values to be specified. This returns an extra field in the response
`normalized_value` which reports the derivative value in the desired x-axis units. In the below example we calculate the derivative
of the total sales per month but ask for the derivative of the sales as in the units of sales per day:

[source,console]
@@ -5,7 +5,7 @@
++++

Given an ordered series of data, the Moving Function aggregation will slide a window across the data and allow the user to specify a custom
script that is executed on each window of data. For convenience, a number of common functions are predefined such as min/max, moving averages,
etc.

==== Syntax
@@ -36,7 +36,7 @@ A `moving_fn` aggregation looks like this in isolation:
|`shift` |<<shift-parameter, Shift>> of window position. |Optional | 0
|===

`moving_fn` aggregations must be embedded inside of a `histogram` or `date_histogram` aggregation. They can be
embedded like any other metric aggregation:

[source,console]
@@ -69,11 +69,11 @@ POST /_search
// TEST[setup:sales]

<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals
<2> A `sum` metric is used to calculate the sum of a field. This could be any numeric metric (sum, min, max, etc)
<3> Finally, we specify a `moving_fn` aggregation which uses "the_sum" metric as its input.

Moving averages are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally
add numeric metrics, such as a `sum`, inside of that histogram. Finally, the `moving_fn` is embedded inside the histogram.
The `buckets_path` parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
<<buckets-path-syntax>> for a description of the syntax for `buckets_path`).

@@ -134,9 +134,9 @@ An example response from the above aggregation may look like:

==== Custom user scripting

The Moving Function aggregation allows the user to specify any arbitrary script to define custom logic. The script is invoked each time a
new window of data is collected. These values are provided to the script in the `values` variable. The script should then perform some
kind of calculation and emit a single `double` as the result. Emitting `null` is not permitted, although `NaN` and +/- `Inf` are allowed.

For example, this script will simply return the first value from the window, or `NaN` if no values are available:

@@ -195,7 +195,7 @@ For convenience, a number of functions have been prebuilt and are available insi
- `holt()`
- `holtWinters()`

The functions are available from the `MovingFunctions` namespace. E.g. `MovingFunctions.max()`

===== max Function

@@ -284,7 +284,7 @@ POST /_search
===== sum Function

This function accepts a collection of doubles and returns the sum of the values in that window. `null` and `NaN` values are ignored;
the sum is only calculated over the real values. If the window is empty, or all values are `null`/`NaN`, `0.0` is returned as the result.

[[sum-params]]
.`sum(double[] values)` Parameters

@@ -326,7 +326,7 @@ POST /_search
===== stdDev Function

This function accepts a collection of doubles and an average, then returns the standard deviation of the values in that window.
`null` and `NaN` values are ignored; the sum is only calculated over the real values. If the window is empty, or all values are
`null`/`NaN`, `0.0` is returned as the result.

[[stddev-params]]

@@ -368,17 +368,17 @@ POST /_search
// TEST[setup:sales]

The `avg` parameter must be provided to the standard deviation function because different styles of averages can be computed on the window
(simple, linearly weighted, etc). The various moving averages that are detailed below can be used to calculate the average for the
standard deviation function.

===== unweightedAvg Function

The `unweightedAvg` function calculates the sum of all values in the window, then divides by the size of the window. It is effectively
a simple arithmetic mean of the window. The simple moving average does not perform any time-dependent weighting, which means
the values from a `simple` moving average tend to "lag" behind the real data.

`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is count of non-`null`,non-`NaN`
values.

[[unweightedavg-params]]

@@ -421,7 +421,7 @@ POST /_search
==== linearWeightedAvg Function

The `linearWeightedAvg` function assigns a linear weighting to points in the series, such that "older" datapoints (e.g. those at
the beginning of the window) contribute linearly less to the total average. The linear weighting helps reduce
the "lag" behind the data's mean, since older points have less influence.

If the window is empty, or all values are `null`/`NaN`, `NaN` is returned as the result.

@@ -467,13 +467,13 @@ POST /_search

The `ewma` function (aka "single-exponential") is similar to the `linearMovAvg` function,
except older data-points become exponentially less important,
rather than linearly less important. The speed at which the importance decays can be controlled with an `alpha`
setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger
portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the
moving average. This tends to make the moving average track the data more closely but with less smoothing.

`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is count of non-`null`,non-`NaN`
values.

[[ewma-params]]

@@ -518,18 +518,18 @@ POST /_search
==== holt Function

The `holt` function (aka "double exponential") incorporates a second exponential term which
tracks the data's trend. Single exponential does not perform well when the data has an underlying linear trend. The
double exponential model calculates two values internally: a "level" and a "trend".

The level calculation is similar to `ewma`, and is an exponentially weighted view of the data. The difference is
that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series.
The trend calculation looks at the difference between the current and last value (e.g. the slope, or trend, of the
smoothed data). The trend value is also exponentially weighted.

Values are produced by multiplying the level and trend components.

`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is count of non-`null`,non-`NaN`
values.

[[holt-params]]

@@ -572,26 +572,26 @@ POST /_search
// TEST[setup:sales]

In practice, the `alpha` value behaves very similarly in `holt` as it does in `ewma`: small values produce more smoothing
and more lag, while larger values produce closer tracking and less lag. The value of `beta` is often difficult
to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger
values emphasize short-term trends.

==== holtWinters Function

The `holtWinters` function (aka "triple exponential") incorporates a third exponential term which
tracks the seasonal aspect of your data. This aggregation therefore smooths based on three components: "level", "trend"
and "seasonality".

The level and trend calculation is identical to `holt`. The seasonal calculation looks at the difference between
the current point, and the point one period earlier.

Holt-Winters requires a little more handholding than the other moving averages. You need to specify the "periodicity"
of your data: e.g. if your data has cyclic trends every 7 days, you would set `period = 7`. Similarly if there was
a monthly trend, you would set it to `30`. There is currently no periodicity detection, although that is planned
for future enhancements.

`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is count of non-`null`,non-`NaN`
values.

[[holtwinters-params]]

@@ -638,20 +638,20 @@ POST /_search

[WARNING]
======
Multiplicative Holt-Winters works by dividing each data point by the seasonal value. This is problematic if any of
your data is zero, or if there are gaps in the data (since this results in a divide-by-zero). To combat this, the
`mult` Holt-Winters pads all values by a very small amount (1*10^-10^) so that all values are non-zero. This affects
the result, but only minimally. If your data is non-zero, or you prefer to see `NaN` when zeros are encountered,
you can disable this behavior with `pad: false`
======

===== "Cold Start"

Unfortunately, due to the nature of Holt-Winters, it requires two periods of data to "bootstrap" the algorithm. This
means that your `window` must always be *at least* twice the size of your period. An exception will be thrown if it
isn't. It also means that Holt-Winters will not emit a value for the first `2 * period` buckets; the current algorithm
does not backcast.

You'll notice in the above example we have an `if ()` statement checking the size of values. This is checking to make sure
we have two periods worth of data (`5 * 2`, where 5 is the period specified in the `holtWintersMovAvg` function) before calling
the holt-winters function.
@@ -37,7 +37,7 @@ A `moving_percentiles` aggregation looks like this in isolation:
|`shift` |<<shift-parameter, Shift>> of window position. |Optional | 0
|===

`moving_percentiles` aggregations must be embedded inside of a `histogram` or `date_histogram` aggregation. They can be
embedded like any other metric aggregation:

[source,console]
@@ -75,8 +75,8 @@ POST /_search
<2> A `percentile` metric is used to calculate the percentiles of a field.
<3> Finally, we specify a `moving_percentiles` aggregation which uses "the_percentile" sketch as its input.

Moving percentiles are built by first specifying a `histogram` or `date_histogram` over a field. You then add
a percentile metric inside of that histogram. Finally, the `moving_percentiles` is embedded inside the histogram.
The `buckets_path` parameter is then used to "point" at the percentiles aggregation inside of the histogram (see
<<buckets-path-syntax>> for a description of the syntax for `buckets_path`).

@@ -130,5 +130,5 @@ interpolate between data points.

The percentiles are calculated exactly and are not an approximation (unlike the Percentiles Metric). This means
the implementation maintains an in-memory, sorted list of your data to compute the percentiles, before discarding the
data. You may run into memory pressure issues if you attempt to calculate percentiles over many millions of
data-points in a single `percentiles_bucket`.
@@ -13,10 +13,10 @@ next. Single periods are useful for removing constant, linear trends.
Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is
plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.

By calculating the first-difference, we de-trend the data (e.g. remove a constant, linear trend). We can see that the
data becomes a stationary series (e.g. the first difference is randomly distributed around zero, and doesn't seem to
exhibit any pattern/behavior). The transformation reveals that the dataset is following a random-walk; the value is the
previous value +/- a random amount. This insight allows selection of further tools for analysis.

[[serialdiff_dow]]
.Dow Jones plotted and made stationary with first-differencing

@@ -93,10 +93,10 @@ POST /_search
--------------------------------------------------

<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals
<2> A `sum` metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc)
<3> Finally, we specify a `serial_diff` aggregation which uses "the_sum" metric as its input.

Serial differences are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally
add normal metrics, such as a `sum`, inside of that histogram. Finally, the `serial_diff` is embedded inside the histogram.
The `buckets_path` parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
<<buckets-path-syntax>> for a description of the syntax for `buckets_path`).
@ -13,12 +13,12 @@ lowercases terms, and supports removing stop words.
<<analysis-simple-analyzer,Simple Analyzer>>::

The `simple` analyzer divides text into terms whenever it encounters a
character which is not a letter. It lowercases all terms.
character which is not a letter. It lowercases all terms.

<<analysis-whitespace-analyzer,Whitespace Analyzer>>::

The `whitespace` analyzer divides text into terms whenever it encounters any
whitespace character. It does not lowercase terms.
whitespace character. It does not lowercase terms.

<<analysis-stop-analyzer,Stop Analyzer>>::

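The differences described above are easy to check with the `_analyze` API. The request below is a hypothetical sketch with made-up sample text; swapping `simple` for `whitespace` or `stop` shows how each built-in analyzer splits, lowercases, or removes stop words from the same input.

[source,console]
----
POST /_analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----
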
@ -1,8 +1,8 @@
[[configuring-analyzers]]
=== Configuring built-in analyzers

The built-in analyzers can be used directly without any configuration. Some
of them, however, support configuration options to alter their behaviour. For
The built-in analyzers can be used directly without any configuration. Some
of them, however, support configuration options to alter their behaviour. For
instance, the <<analysis-standard-analyzer,`standard` analyzer>> can be configured
to support a list of stop words:

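The callouts in the next hunk refer to a configuration of roughly this shape. The sketch below is an assumed reconstruction for illustration, not the exact request from the underlying page: a `std_english` analyzer derived from `standard` with the `_english_` stop words, applied to a hypothetical `my_text.english` multi-field.

[source,console]
----
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "std_english"
          }
        }
      }
    }
  }
}
----
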
@ -53,10 +53,10 @@ POST my-index-000001/_analyze
<1> We define the `std_english` analyzer to be based on the `standard`
analyzer, but configured to remove the pre-defined list of English stopwords.
<2> The `my_text` field uses the `standard` analyzer directly, without
any configuration. No stop words will be removed from this field.
any configuration. No stop words will be removed from this field.
The resulting terms are: `[ the, old, brown, cow ]`
<3> The `my_text.english` field uses the `std_english` analyzer, so
English stop words will be removed. The resulting terms are:
English stop words will be removed. The resulting terms are:
`[ old, brown, cow ]`

@ -38,7 +38,7 @@ The `custom` analyzer accepts the following parameters:
When indexing an array of text values, Elasticsearch inserts a fake "gap"
between the last term of one value and the first term of the next value to
ensure that a phrase query doesn't match two terms from different array
elements. Defaults to `100`. See <<position-increment-gap>> for more.
elements. Defaults to `100`. See <<position-increment-gap>> for more.

[discrete]
=== Example configuration

@ -9,7 +9,7 @@ https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fi
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[discrete]

@ -59,17 +59,17 @@ The `fingerprint` analyzer accepts the following parameters:
[horizontal]
`separator`::

The character to use to concatenate the terms. Defaults to a space.
The character to use to concatenate the terms. Defaults to a space.

`max_output_size`::

The maximum token size to emit. Defaults to `255`. Tokens larger than
The maximum token size to emit. Defaults to `255`. Tokens larger than
this size will be discarded.

`stopwords`::

A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.

`stopwords_path`::

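A minimal sketch of wiring these parameters together. The index name, separator character, and stop word list are illustrative assumptions, not values from the underlying page.

[source,console]
----
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint": {
          "type": "fingerprint",
          "separator": "+",
          "max_output_size": 255,
          "stopwords": "_english_"
        }
      }
    }
  }
}
----
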
@ -55,7 +55,7 @@ more details.
===== Excluding words from stemming

The `stem_exclusion` parameter allows you to specify an array
of lowercase words that should not be stemmed. Internally, this
of lowercase words that should not be stemmed. Internally, this
functionality is implemented by adding the
<<analysis-keyword-marker-tokenfilter,`keyword_marker` token filter>>
with the `keywords` set to the value of the `stem_exclusion` parameter.

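For example, a language analyzer can be given a `stem_exclusion` list when it is defined in the index settings. The sketch below assumes an `english`-type analyzer and uses placeholder words and names.

[source,console]
----
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ]
        }
      }
    }
  }
}
----
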
@ -427,7 +427,7 @@ PUT /catalan_example
===== `cjk` analyzer

NOTE: You may find that `icu_analyzer` in the ICU analysis plugin works better
for CJK text than the `cjk` analyzer. Experiment with your text and queries.
for CJK text than the `cjk` analyzer. Experiment with your text and queries.

The `cjk` analyzer could be reimplemented as a `custom` analyzer as follows:

@ -159,8 +159,8 @@ The `pattern` analyzer accepts the following parameters:

`stopwords`::

A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.

`stopwords_path`::

@ -132,8 +132,8 @@ The `standard` analyzer accepts the following parameters:

`stopwords`::

A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.

`stopwords_path`::

@ -5,7 +5,7 @@
++++

The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analyzer>>
but adds support for removing stop words. It defaults to using the
but adds support for removing stop words. It defaults to using the
`_english_` stop words.

[discrete]

@ -111,8 +111,8 @@ The `stop` analyzer accepts the following parameters:
[horizontal]
`stopwords`::

A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_english_`.
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_english_`.

`stopwords_path`::

@ -14,7 +14,7 @@ combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
==== Character filters

A _character filter_ receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Hindu-Arabic numerals
(٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip HTML
elements like `<b>` from the stream.

@ -25,10 +25,10 @@ which are applied in order.
[[analyzer-anatomy-tokenizer]]
==== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual
A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of

@ -41,7 +41,7 @@ An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
==== Token filters

A _token filter_ receives the token stream and may add, remove, or change
tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
filter converts all tokens to lowercase, a
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
(_stop words_) like `the` from the token stream, and a

@ -5,7 +5,7 @@ _Character filters_ are used to preprocess the stream of characters before it
is passed to the <<analysis-tokenizers,tokenizer>>.

A character filter receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Hindu-Arabic numerals
(٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip HTML
elements like `<b>` from the stream.

@ -4,7 +4,7 @@
<titleabbrev>Mapping</titleabbrev>
++++

The `mapping` character filter accepts a map of keys and values. Whenever it
The `mapping` character filter accepts a map of keys and values. Whenever it
encounters a string of characters that is the same as a key, it replaces them
with the value associated with that key.

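A quick way to see this behaviour is to pass an inline `mapping` character filter to the `_analyze` API. The mappings and sample text below are arbitrary placeholders chosen for illustration.

[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "& => and",
        ":) => happy"
      ]
    }
  ],
  "text": "coffee & cake :)"
}
----
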
@ -5,7 +5,7 @@
++++

A token filter of type `multiplexer` will emit multiple tokens at the same position,
each version of the token having been run through a different filter. Identical
each version of the token having been run through a different filter. Identical
output tokens at the same position will be removed.

WARNING: If the incoming token stream has duplicate tokens, then these will also be

@ -14,8 +14,8 @@ removed by the multiplexer
[discrete]
=== Options
[horizontal]
filters:: a list of token filters to apply to incoming tokens. These can be any
token filters defined elsewhere in the index mappings. Filters can be chained
filters:: a list of token filters to apply to incoming tokens. These can be any
token filters defined elsewhere in the index mappings. Filters can be chained
using a comma-delimited string, so for example `"lowercase, porter_stem"` would
apply the `lowercase` filter and then the `porter_stem` filter to a single token.

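A minimal sketch of a `multiplexer` filter that emits both a lowercased token and a lowercased-and-stemmed token at the same position. The index and analyzer names are placeholders.

[source,console]
----
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "filter": {
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": [ "lowercase", "lowercase, porter_stem" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "my_multiplexer" ]
        }
      }
    }
  }
}
----
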
@ -8,14 +8,14 @@ The `synonym_graph` token filter allows to easily handle synonyms,
including multi-word synonyms correctly during the analysis process.

In order to properly handle multi-word synonyms this token filter
creates a <<token-graphs,graph token stream>> during processing. For more
creates a <<token-graphs,graph token stream>> during processing. For more
information on this topic and its various complexities, please read the
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html[Lucene's TokenStreams are actually graphs] blog post.

["NOTE",id="synonym-graph-index-note"]
===============================
This token filter is designed to be used as part of a search analyzer
only. If you want to apply synonyms during indexing please use the
only. If you want to apply synonyms during indexing please use the
standard <<analysis-synonym-tokenfilter,synonym token filter>>.
===============================

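A sketch of the recommended setup, with the `synonym_graph` filter attached only to a search-time analyzer. All names, the field mapping, and the synonym list are assumptions for illustration.

[source,console]
----
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": [ "ny, new york", "nyc => new york city" ]
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonyms" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}
----
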
@ -179,13 +179,13 @@ as well.
==== Parsing synonym files

Elasticsearch will use the token filters preceding the synonym filter
in a tokenizer chain to parse the entries in a synonym file. So, for example, if a
in a tokenizer chain to parse the entries in a synonym file. So, for example, if a
synonym filter is placed after a stemmer, then the stemmer will also be applied
to the synonym entries. Because entries in the synonym map cannot have stacked
positions, some token filters may cause issues here. Token filters that produce
to the synonym entries. Because entries in the synonym map cannot have stacked
positions, some token filters may cause issues here. Token filters that produce
multiple versions of a token may choose which version of the token to emit when
parsing synonyms, e.g. `asciifolding` will only produce the folded version of the
token. Others, e.g. `multiplexer`, `word_delimiter_graph` or `ngram` will throw an
token. Others, e.g. `multiplexer`, `word_delimiter_graph` or `ngram` will throw an
error.

If you need to build analyzers that include both multi-token filters and synonym

@ -170,13 +170,13 @@ as well.
=== Parsing synonym files

Elasticsearch will use the token filters preceding the synonym filter
in a tokenizer chain to parse the entries in a synonym file. So, for example, if a
in a tokenizer chain to parse the entries in a synonym file. So, for example, if a
synonym filter is placed after a stemmer, then the stemmer will also be applied
to the synonym entries. Because entries in the synonym map cannot have stacked
positions, some token filters may cause issues here. Token filters that produce
to the synonym entries. Because entries in the synonym map cannot have stacked
positions, some token filters may cause issues here. Token filters that produce
multiple versions of a token may choose which version of the token to emit when
parsing synonyms, e.g. `asciifolding` will only produce the folded version of the
token. Others, e.g. `multiplexer`, `word_delimiter_graph` or `ngram` will throw an
token. Others, e.g. `multiplexer`, `word_delimiter_graph` or `ngram` will throw an
error.

If you need to build analyzers that include both multi-token filters and synonym

@ -1,10 +1,10 @@
[[analysis-tokenizers]]
== Tokenizer reference

A _tokenizer_ receives a stream of characters, breaks it up into individual
A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the following:

@ -90,7 +90,7 @@ text:
<<analysis-keyword-tokenizer,Keyword Tokenizer>>::

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
is given and outputs the exact same text as a single term. It can be combined
with token filters like <<analysis-lowercase-tokenfilter,`lowercase`>> to
normalise the analysed terms.

@ -14,7 +14,7 @@ Edge N-Grams are useful for _search-as-you-type_ queries.
TIP: When you need _search-as-you-type_ for text which has a widely known
order, such as movie or song titles, the
<<completion-suggester,completion suggester>> is a much more efficient
choice than edge N-grams. Edge N-grams have the advantage when trying to
choice than edge N-grams. Edge N-grams have the advantage when trying to
autocomplete words that can appear in any order.

[discrete]

@ -67,7 +67,7 @@ The above sentence would produce the following terms:
[ Q, Qu ]
---------------------------

NOTE: These default gram lengths are almost entirely useless. You need to
NOTE: These default gram lengths are almost entirely useless. You need to
configure the `edge_ngram` before using it.

[discrete]

@ -76,19 +76,19 @@ configure the `edge_ngram` before using it.
The `edge_ngram` tokenizer accepts the following parameters:

`min_gram`::
Minimum length of characters in a gram. Defaults to `1`.
Minimum length of characters in a gram. Defaults to `1`.

`max_gram`::
+
--
Maximum length of characters in a gram. Defaults to `2`.
Maximum length of characters in a gram. Defaults to `2`.

See <<max-gram-limits>>.
--

`token_chars`::

Character classes that should be included in a token. Elasticsearch
Character classes that should be included in a token. Elasticsearch
will split on characters that don't belong to the classes specified.
Defaults to `[]` (keep all characters).
+

@ -106,7 +106,7 @@ Character classes may be any of the following:

Custom characters that should be treated as part of a token. For example,
setting this to `+-_` will make the tokenizer treat the plus, minus and
underscore sign as part of a token.
underscore sign as part of a token.

[discrete]
[[max-gram-limits]]

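Putting the parameters from the two hunks above together, a hypothetical autocomplete-style configuration might look like the following. The index, tokenizer, and analyzer names, and the gram lengths, are illustrative assumptions.

[source,console]
----
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "my_autocomplete": {
          "tokenizer": "my_edge_ngram",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
----
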
@ -4,8 +4,8 @@
<titleabbrev>Keyword</titleabbrev>
++++

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters to normalise output, e.g. lower-casing email addresses.

[discrete]

@ -104,6 +104,6 @@ The `keyword` tokenizer accepts the following parameters:
`buffer_size`::

The number of characters read into the term buffer in a single pass.
Defaults to `256`. The term buffer will grow by this size until all the
text has been consumed. It is advisable not to change this setting.
Defaults to `256`. The term buffer will grow by this size until all the
text has been consumed. It is advisable not to change this setting.

@ -7,7 +7,7 @@
The `lowercase` tokenizer, like the
<<analysis-letter-tokenizer, `letter` tokenizer>> breaks text into terms
whenever it encounters a character which is not a letter, but it also
lowercases all terms. It is functionally equivalent to the
lowercases all terms. It is functionally equivalent to the
<<analysis-letter-tokenizer, `letter` tokenizer>> combined with the
<<analysis-lowercase-tokenfilter, `lowercase` token filter>>, but is more
efficient as it performs both steps in a single pass.

@ -175,14 +175,14 @@ The `ngram` tokenizer accepts the following parameters:

[horizontal]
`min_gram`::
Minimum length of characters in a gram. Defaults to `1`.
Minimum length of characters in a gram. Defaults to `1`.

`max_gram`::
Maximum length of characters in a gram. Defaults to `2`.
Maximum length of characters in a gram. Defaults to `2`.

`token_chars`::

Character classes that should be included in a token. Elasticsearch
Character classes that should be included in a token. Elasticsearch
will split on characters that don't belong to the classes specified.
Defaults to `[]` (keep all characters).
+

@ -200,12 +200,12 @@ Character classes may be any of the following:

Custom characters that should be treated as part of a token. For example,
setting this to `+-_` will make the tokenizer treat the plus, minus and
underscore sign as part of a token.
underscore sign as part of a token.

TIP: It usually makes sense to set `min_gram` and `max_gram` to the same
value. The smaller the length, the more documents will match but the lower
the quality of the matches. The longer the length, the more specific the
matches. A tri-gram (length `3`) is a good place to start.
value. The smaller the length, the more documents will match but the lower
the quality of the matches. The longer the length, the more specific the
matches. A tri-gram (length `3`) is a good place to start.

The index level setting `index.max_ngram_diff` controls the maximum allowed
difference between `max_gram` and `min_gram`.

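Following the tip above, a tri-gram setup with `min_gram` and `max_gram` both set to `3` could be sketched as follows. The index, tokenizer, and analyzer names are placeholders.

[source,console]
----
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "my_trigram_analyzer": {
          "tokenizer": "my_trigram_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
----
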
@ -69,7 +69,7 @@ The `path_hierarchy` tokenizer accepts the following parameters:

[horizontal]
`delimiter`::
The character to use as the path separator. Defaults to `/`.
The character to use as the path separator. Defaults to `/`.

`replacement`::
An optional replacement character to use for the delimiter.

@ -77,20 +77,20 @@ The `path_hierarchy` tokenizer accepts the following parameters:

`buffer_size`::
The number of characters read into the term buffer in a single pass.
Defaults to `1024`. The term buffer will grow by this size until all the
text has been consumed. It is advisable not to change this setting.
Defaults to `1024`. The term buffer will grow by this size until all the
text has been consumed. It is advisable not to change this setting.

`reverse`::
If set to `true`, emits the tokens in reverse order. Defaults to `false`.
If set to `true`, emits the tokens in reverse order. Defaults to `false`.

`skip`::
The number of initial tokens to skip. Defaults to `0`.
The number of initial tokens to skip. Defaults to `0`.

[discrete]
=== Example configuration

In this example, we configure the `path_hierarchy` tokenizer to split on `-`
characters, and to replace them with `/`. The first two tokens are skipped:
characters, and to replace them with `/`. The first two tokens are skipped:

[source,console]
----------------------------

@ -116,7 +116,7 @@ The `pattern` tokenizer accepts the following parameters:

`group`::

Which capture group to extract as tokens. Defaults to `-1` (split).
Which capture group to extract as tokens. Defaults to `-1` (split).

[discrete]
=== Example configuration

@ -194,7 +194,7 @@ The above example produces the following terms:
---------------------------

In the next example, we configure the `pattern` tokenizer to capture values
enclosed in double quotes (ignoring embedded escaped quotes `\"`). The regex
enclosed in double quotes (ignoring embedded escaped quotes `\"`). The regex
itself looks like this:

"((?:\\"|[^"]|\\")*)"

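To illustrate the `group` parameter without the heavier escaping of the quoted-string regex above, here is a simplified, hypothetical configuration that captures text between single quotes and emits only capture group 1. The index and analyzer names and the pattern are assumptions for illustration.

[source,console]
----
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_quoted_tokenizer": {
          "type": "pattern",
          "pattern": "'([^']+)'",
          "group": 1
        }
      },
      "analyzer": {
        "my_quoted_analyzer": {
          "tokenizer": "my_quoted_tokenizer"
        }
      }
    }
  }
}
----

Analyzing the text `'quick' 'brown' 'fox'` with this analyzer should yield the terms `[ quick, brown, fox ]`.
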
@ -199,7 +199,7 @@ Statistics are returned in a format suitable for humans
The human readable values can be turned off by adding `?human=false`
to the query string. This makes sense when the stats results are
being consumed by a monitoring tool, rather than intended for human
consumption. The default for the `human` flag is
consumption. The default for the `human` flag is
`false`.

[[date-math]]

@ -499,7 +499,7 @@ of supporting the native JSON number types.
==== Time units

Whenever durations need to be specified, e.g. for a `timeout` parameter, the duration must specify
the unit, like `2d` for 2 days. The supported units are:
the unit, like `2d` for 2 days. The supported units are:

[horizontal]
`d`:: Days

@ -103,14 +103,14 @@ with `queue`.
==== Numeric formats

Many commands provide a few types of numeric output, either a byte, size
or a time value. By default, these types are human-formatted,
for example, `3.5mb` instead of `3763212`. The human values are not
or a time value. By default, these types are human-formatted,
for example, `3.5mb` instead of `3763212`. The human values are not
sortable numerically, so in order to operate on these values where
order is important, you can change it.

Say you want to find the largest index in your cluster (storage used
by all the shards, not number of documents). The `/_cat/indices` API
is ideal. You only need to add three things to the API request:
by all the shards, not number of documents). The `/_cat/indices` API
is ideal. You only need to add three things to the API request:

. The `bytes` query string parameter with a value of `b` to get byte-level resolution.
. The `s` (sort) parameter with a value of `store.size:desc` and a comma with `index:asc` to sort the output

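The list above is cut off by the hunk boundary, but a sketch of such a request looks like the following. The `v` and `h` parameters are extras added here for readability; all parameters shown (`bytes`, `s`, `v`, `h`) are standard `_cat/indices` query string options.

[source,console]
----
GET /_cat/indices?bytes=b&s=store.size:desc,index:asc&v=true&h=index,store.size
----
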
@ -25,7 +25,7 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=bytes]
include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=http-format]

`full_id`::
(Optional, Boolean) If `true`, return the full node ID. If `false`, return the
(Optional, Boolean) If `true`, return the full node ID. If `false`, return the
shortened node ID. Defaults to `false`.

include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=cat-h]

@ -22,7 +22,7 @@ Provides explanations for shard allocations in the cluster.
==== {api-description-title}

The purpose of the cluster allocation explain API is to provide
explanations for shard allocations in the cluster. For unassigned shards,
explanations for shard allocations in the cluster. For unassigned shards,
the explain API provides an explanation for why the shard is unassigned.
For assigned shards, the explain API provides an explanation for why the
shard is remaining on its current node and has not moved or rebalanced to

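A minimal sketch of a request body for this API, using placeholder index and shard values:

[source,console]
----
GET /_cluster/allocation/explain
{
  "index": "my-index-000001",
  "shard": 0,
  "primary": true
}
----
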
@ -40,7 +40,7 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=node-id]

`ignore_idle_threads`::
(Optional, Boolean) If true, known idle threads (e.g. waiting in a socket
select, or to get a task from an empty queue) are filtered out. Defaults to
select, or to get a task from an empty queue) are filtered out. Defaults to
true.

`interval`::

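For example, a hypothetical hot threads request that keeps idle threads in the output and shortens the sampling interval might look like this (the interval value is an arbitrary choice):

[source,console]
----
GET /_nodes/hot_threads?ignore_idle_threads=false&interval=500ms
----
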
@ -108,7 +108,7 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=node-id]

`total_indexing_buffer`::
Total heap allowed to be used to hold recently indexed
documents before they must be written to disk. This size is
documents before they must be written to disk. This size is
a shared pool across all shards on this node, and is
controlled by <<indexing-buffer,Indexing Buffer settings>>.

@ -1199,7 +1199,7 @@ since the {wikipedia}/Unix_time[Unix Epoch].

`open_file_descriptors`::
(integer)
Number of opened file descriptors associated with the current process, or
Number of opened file descriptors associated with the current process, or
`-1` if not supported.

`max_file_descriptors`::

@ -75,7 +75,7 @@ GET _tasks?nodes=nodeId1,nodeId2&actions=cluster:* <3>
// TEST[skip:No tasks to retrieve]

<1> Retrieves all tasks currently running on all nodes in the cluster.
<2> Retrieves all tasks running on nodes `nodeId1` and `nodeId2`. See <<cluster-nodes>> for more info about how to select individual nodes.
<2> Retrieves all tasks running on nodes `nodeId1` and `nodeId2`. See <<cluster-nodes>> for more info about how to select individual nodes.
<3> Retrieves all cluster-related tasks running on nodes `nodeId1` and `nodeId2`.

The API returns the following result:

@ -41,7 +41,7 @@ manually. It adds an entry for that node in the voting configuration exclusions
list. The cluster then tries to reconfigure the voting configuration to remove
that node and to prevent it from returning.

If the API fails, you can safely retry it. Only a successful response
If the API fails, you can safely retry it. Only a successful response
guarantees that the node has been removed from the voting configuration and will
not be reinstated.

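A sketch of adding and then clearing an exclusion. The node name is a placeholder, and the `node_names` parameter is an assumption for recent releases; older versions accept the node identifier in a different form.

[source,console]
----
POST /_cluster/voting_config_exclusions?node_names=node-1

DELETE /_cluster/voting_config_exclusions
----
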
@ -36,11 +36,11 @@ This tool has a number of modes:
prevents the cluster state from being loaded.

* `elasticsearch-node unsafe-bootstrap` can be used to perform _unsafe cluster
bootstrapping_. It forces one of the nodes to form a brand-new cluster on
bootstrapping_. It forces one of the nodes to form a brand-new cluster on
its own, using its local copy of the cluster metadata.

* `elasticsearch-node detach-cluster` enables you to move nodes from one
cluster to another. This can be used to move nodes into a new cluster
cluster to another. This can be used to move nodes into a new cluster
created with the `elasticsearch-node unsafe-bootstrap` command. If unsafe
cluster bootstrapping was not possible, it also enables you to move nodes
into a brand-new cluster.

@ -218,7 +218,7 @@ node with the same term, pick the one with the largest version.
This information identifies the node with the freshest cluster state, which minimizes the
quantity of data that might be lost. For example, if the first node reports
`(4, 12)` and a second node reports `(5, 3)`, then the second node is preferred
since its term is larger. However if the second node reports `(3, 17)` then
since its term is larger. However if the second node reports `(3, 17)` then
the first node is preferred since its term is larger. If the second node
reports `(4, 10)` then it has the same term as the first node, but has a
smaller version, so the first node is preferred.

Some files were not shown because too many files have changed in this diff.