elasticsearch/docs/reference/esql/functions/count-distinct.asciidoc
Abdon Pijpelink e87c49cb4b
[DOCS] Improve ES|QL functions reference for functions E-Z (#104623)
* Functions E-Z

* Incorporate changes from #103686

* More functions

* More functions

* Update docs/reference/esql/functions/floor.asciidoc

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

* Update docs/reference/esql/functions/left.asciidoc

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Alexander Spies <alexander.spies@elastic.co>

* Review feedback

* Fix geo_shape description

* Change 'colum'/'field' into 'expressions'

* Review feedback

* One more

---------

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
Co-authored-by: Alexander Spies <alexander.spies@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2024-01-25 16:32:24 +01:00

73 lines
2.3 KiB
Text

[discrete]
[[esql-agg-count-distinct]]
=== `COUNT_DISTINCT`
*Syntax*
[source,esql]
----
COUNT_DISTINCT(column[, precision_threshold])
----
*Parameters*
`column`::
Column for which to count the number of distinct values.
`precision_threshold`::
Precision threshold. Refer to <<esql-agg-count-distinct-approximate>>. The
maximum supported value is 40000. Thresholds above this number will have the
same effect as a threshold of 40000. The default value is 3000.
*Description*
Returns the approximate number of distinct values.
[discrete]
[[esql-agg-count-distinct-approximate]]
==== Counts are approximate
Computing exact counts requires loading values into a set and returning its
size. This doesn't scale when working on high-cardinality sets and/or large
values as the required memory usage and the need to communicate those
per-shard sets between nodes would utilize too many resources of the cluster.
This `COUNT_DISTINCT` function is based on the
https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
algorithm, which counts based on the hashes of the values with some interesting
properties:
include::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]
The `COUNT_DISTINCT` function takes an optional second parameter to configure
the precision threshold. The precision_threshold options allows to trade memory
for accuracy, and defines a unique count below which counts are expected to be
close to accurate. Above this value, counts might become a bit more fuzzy. The
maximum supported value is 40000, thresholds above this number will have the
same effect as a threshold of 40000. The default value is `3000`.
*Supported types*
Can take any field type as input.
*Examples*
[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-result]
|===
With the optional second parameter to configure the precision threshold:
[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision-result]
|===