[discrete]
[[esql-agg-count-distinct]]
=== `COUNT_DISTINCT`

*Syntax*

[source,esql]
----
COUNT_DISTINCT(expression[, precision_threshold])
----

*Parameters*

`expression`::
Expression that outputs the values on which to perform a distinct count.

`precision_threshold`::
Precision threshold. Refer to <<esql-agg-count-distinct-approximate>>. The
maximum supported value is 40000. Thresholds above this number will have the
same effect as a threshold of 40000. The default value is 3000.

*Description*

Returns the approximate number of distinct values.

*Supported types*

Can take any field type as input.

*Examples*

[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-result]
|===

With the optional second parameter to configure the precision threshold:

[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision-result]
|===

The expression can use inline functions. This example splits a string into
multiple values using the `SPLIT` function and counts the unique values:

[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=docsCountDistinctWithExpression]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=docsCountDistinctWithExpression-result]
|===

[discrete]
[[esql-agg-count-distinct-approximate]]
==== Counts are approximate

Computing exact counts requires loading values into a set and returning its
size. This doesn't scale for high-cardinality sets or large values: the memory
required, and the need to send those per-shard sets between nodes, would
consume too many cluster resources.

This `COUNT_DISTINCT` function is based on the
https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
algorithm, which counts based on the hashes of the values with some interesting
properties:

include::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]

The `COUNT_DISTINCT` function takes an optional second parameter to configure
the precision threshold. The `precision_threshold` option lets you trade memory
for accuracy. It defines a unique count below which counts are expected to be
close to accurate. Above this value, counts might become less accurate. The
maximum supported value is 40000; thresholds above this number have the
same effect as a threshold of 40000. The default value is `3000`.
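
The trade-off described above can be illustrated outside of {esql} with a
minimal Python sketch. It is not the Elasticsearch implementation: it uses plain
HyperLogLog rather than HyperLogLog++ (no sparse representation and no empirical
bias correction), and the names `exact_count`, `hll_estimate`, the register
exponent `p`, and the choice of SHA-1 as the hash are assumptions made for this
example. The exponent `p` plays a role loosely similar to `precision_threshold`:
more registers cost more memory and usually give a closer estimate.

[source,python]
----
import hashlib
import math


def exact_count(values):
    """Exact distinct count: memory grows with the number of distinct values."""
    return len(set(values))


def hll_estimate(values, p=12):
    """Estimate the distinct count with plain HyperLogLog using 2**p registers.

    Illustrative only; assumes p >= 7 so the bias constant below applies.
    """
    m = 1 << p
    registers = [0] * m
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")    # 64-bit hash of the value
        index = h >> (64 - p)                    # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)           # linear counting for small counts
    return raw


# 100,000 values, 50,000 of them distinct.
values = [f"user-{i % 50_000}" for i in range(100_000)]
print(exact_count(values))                # exact, but keeps every distinct value
print(round(hll_estimate(values, p=8)))   # 256 registers: small, coarse estimate
print(round(hll_estimate(values, p=14)))  # 16384 registers: larger, closer estimate
----

The estimator keeps a fixed number of registers regardless of how many values it
sees, which is why a sketch of this kind is cheap to merge across shards
compared with shipping full sets.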