This replaces the `script` docs for bucket aggregations with runtime
fields. We expect runtime fields to be nicer to work with because you
can also fetch them or filter on them. We expect them to be faster
because they don't need this sort of `instanceof` tree:
a92a647b9f/server/src/main/java/org/elasticsearch/search/aggregations/support/values/ScriptDoubleValues.java (L42)
Relates to #69291
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Co-authored-by: Adam Locke <adam.locke@elastic.co>
[role="xpack"]
[testenv="basic"]
[[search-aggregations-metrics-string-stats-aggregation]]
=== String stats aggregation
++++
<titleabbrev>String stats</titleabbrev>
++++

A `multi-value` metrics aggregation that computes statistics over string values extracted from the aggregated documents.
These values can be retrieved from specific `keyword` fields.

The string stats aggregation returns the following results:

* `count` - The number of non-empty fields counted.
* `min_length` - The length of the shortest term.
* `max_length` - The length of the longest term.
* `avg_length` - The average length computed over all terms.
* `entropy` - The {wikipedia}/Entropy_(information_theory)[Shannon Entropy] value computed over all terms collected by
the aggregation. Shannon entropy quantifies the amount of information contained in the field. It is useful for
measuring properties of a data set such as diversity, similarity, and randomness.

For example:

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": { "string_stats": { "field": "message.keyword" } }
  }
}
--------------------------------------------------
// TEST[setup:messages]

This aggregation computes string statistics for the `message.keyword` field in all documents. The aggregation type
is `string_stats`, and the `field` parameter specifies the document field on which the statistics are computed.
The request returns the following:

[source,console-result]
--------------------------------------------------
{
  ...

  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 24,
      "max_length": 30,
      "avg_length": 28.8,
      "entropy": 3.94617750050791
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

The name of the aggregation (`message_stats` above) also serves as the key by which the aggregation result can be retrieved from
the returned response.
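
The five statistics can be reproduced offline to see how they relate to each other. The following Python sketch is not the Elasticsearch implementation. It assumes that lengths are counted in characters, that empty strings are skipped (following the `count` description), and that entropy is computed in bits over the frequency of every character in all terms. The `string_stats` function name and the sample terms are hypothetical.

```python
import math
from collections import Counter


def string_stats(terms):
    """Compute count, length statistics, and character entropy for a list of strings."""
    # Assumption: empty strings are not counted, mirroring the `count` description.
    values = [t for t in terms if t]
    if not values:
        # Assumption: with nothing to aggregate, report no lengths and zero entropy.
        return {"count": 0, "min_length": None, "max_length": None,
                "avg_length": None, "entropy": 0.0}

    lengths = [len(v) for v in values]
    total_chars = sum(lengths)

    # Count every character across all terms, then turn counts into probabilities.
    char_counts = Counter("".join(values))
    entropy = -sum(
        (n / total_chars) * math.log2(n / total_chars)
        for n in char_counts.values()
    )

    return {
        "count": len(values),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": total_chars / len(values),
        "entropy": entropy,
    }


# Hypothetical terms; the empty string is skipped by this sketch.
stats = string_stats(["aa", "ab", ""])
print(stats)
```

In this sample, `a` accounts for three of the four characters and `b` for one, so the entropy is that of a 0.75/0.25 split, which is below one bit.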

==== Character distribution

The computation of the Shannon Entropy value is based on the probability of each character appearing in all terms collected
by the aggregation. To view the probability distribution for all characters, set the `show_distribution` parameter (default: `false`) to `true`.

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": {
      "string_stats": {
        "field": "message.keyword",
        "show_distribution": true  <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:messages]

<1> Set the `show_distribution` parameter to `true` so that the probability distribution for all characters is returned in the results.

[source,console-result]
--------------------------------------------------
{
  ...

  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 24,
      "max_length": 30,
      "avg_length": 28.8,
      "entropy": 3.94617750050791,
      "distribution": {
        " ": 0.1527777777777778,
        "e": 0.14583333333333334,
        "s": 0.09722222222222222,
        "m": 0.08333333333333333,
        "t": 0.0763888888888889,
        "h": 0.0625,
        "a": 0.041666666666666664,
        "i": 0.041666666666666664,
        "r": 0.041666666666666664,
        "g": 0.034722222222222224,
        "n": 0.034722222222222224,
        "o": 0.034722222222222224,
        "u": 0.034722222222222224,
        "b": 0.027777777777777776,
        "w": 0.027777777777777776,
        "c": 0.013888888888888888,
        "E": 0.006944444444444444,
        "l": 0.006944444444444444,
        "1": 0.006944444444444444,
        "2": 0.006944444444444444,
        "3": 0.006944444444444444,
        "4": 0.006944444444444444,
        "y": 0.006944444444444444
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

The `distribution` object shows the probability of each character appearing in all terms. The characters are sorted by descending probability.
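
As a consistency check, the `entropy` value can be recomputed from the `distribution` object. The Python sketch below copies the probabilities from the response above and assumes a base-2 logarithm. The variable names are illustrative only.

```python
import math

# Character probabilities copied from the `distribution` object in the response above.
distribution = {
    " ": 0.1527777777777778,
    "e": 0.14583333333333334,
    "s": 0.09722222222222222,
    "m": 0.08333333333333333,
    "t": 0.0763888888888889,
    "h": 0.0625,
    "a": 0.041666666666666664,
    "i": 0.041666666666666664,
    "r": 0.041666666666666664,
    "g": 0.034722222222222224,
    "n": 0.034722222222222224,
    "o": 0.034722222222222224,
    "u": 0.034722222222222224,
    "b": 0.027777777777777776,
    "w": 0.027777777777777776,
    "c": 0.013888888888888888,
    "E": 0.006944444444444444,
    "l": 0.006944444444444444,
    "1": 0.006944444444444444,
    "2": 0.006944444444444444,
    "3": 0.006944444444444444,
    "4": 0.006944444444444444,
    "y": 0.006944444444444444,
}

# The probabilities of all collected characters should sum to 1.
total = sum(distribution.values())

# Shannon entropy in bits: H = -sum(p * log2(p)).
entropy = -sum(p * math.log2(p) for p in distribution.values())

# The response lists characters by descending probability.
probabilities = list(distribution.values())
is_descending = probabilities == sorted(probabilities, reverse=True)

print(total, entropy, is_descending)
```

Every probability here is a multiple of 1/144, which matches `count` times `avg_length` (5 × 28.8 = 144 characters in total).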

==== Script

If you need to get the `string_stats` for something more complex than a single
field, run the aggregation on a <<runtime,runtime field>>.

[source,console]
----
POST /my-index-000001/_search
{
  "size": 0,
  "runtime_mappings": {
    "message_and_context": {
      "type": "keyword",
      "script": """
        emit(doc['message.keyword'].value + ' ' + doc['context.keyword'].value)
      """
    }
  },
  "aggs": {
    "message_stats": {
      "string_stats": { "field": "message_and_context" }
    }
  }
}
----
// TEST[setup:messages]
// TEST[s/_search/_search?filter_path=aggregations/]

////
[source,console-result]
----
{
  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 28,
      "max_length": 34,
      "avg_length": 32.8,
      "entropy": 3.9797778402765784
    }
  }
}
----
////
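
To make the effect of the runtime field concrete, the following Python sketch emulates the script: for each document it joins the two values with a single space, and the resulting strings are what the aggregation would measure. The documents and the `emit_message_and_context` helper are hypothetical. Skipping documents that lack either field is a simplifying assumption of this sketch, not something the script above does.

```python
def emit_message_and_context(doc):
    """Return the concatenated value for one document, or None if a field is absent."""
    message = doc.get("message")
    context = doc.get("context")
    if message is None or context is None:
        return None  # Assumption: skip documents that lack either field.
    return message + " " + context


# Hypothetical documents; not the data set used by the docs tests.
docs = [
    {"message": "disk full", "context": "db1"},
    {"message": "login ok", "context": "web"},
    {"message": "orphan message"},
]

emitted = [v for v in map(emit_message_and_context, docs) if v is not None]
lengths = [len(v) for v in emitted]

print(emitted)
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

Each emitted value is one character longer than its two source values combined, because of the separating space.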

==== Missing value

The `missing` parameter defines how documents that are missing a value should be treated.
By default, they are ignored, but you can also treat them as if they had a value.

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": {
      "string_stats": {
        "field": "message.keyword",
        "missing": "[empty message]" <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:messages]

<1> Documents without a value in the `message` field will be treated as documents that have the value `[empty message]`.
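
The effect of `missing` can be pictured as a substitution step applied before the values are aggregated. The Python below is an illustration only: the `collect_values` helper and the sample documents are hypothetical, and an absent key stands in for a document without a value.

```python
def collect_values(docs, field, missing=None):
    """Yield the string to aggregate for each document.

    Documents without a value are skipped unless `missing` is given,
    in which case `missing` is used in place of the absent value.
    """
    for doc in docs:
        value = doc.get(field)
        if value is None:
            if missing is None:
                continue  # Default behaviour: ignore the document.
            value = missing
        yield value


# Hypothetical documents: the second one has no `message` value.
docs = [{"message": "hello"}, {"user": "kim"}, {"message": "goodbye"}]

ignored = list(collect_values(docs, "message"))
substituted = list(collect_values(docs, "message", missing="[empty message]"))

print(len(ignored), max(len(v) for v in ignored))
print(len(substituted), max(len(v) for v in substituted))
```

With the substitution, the document contributes to the count and the length statistics like any other value, so a long placeholder can raise `max_length` and shift `avg_length`.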