elasticsearch/docs/reference/esql/functions/count-distinct.asciidoc

[discrete]
[[esql-agg-count-distinct]]
=== `COUNT_DISTINCT`

*Syntax*

[source,esql]
----
COUNT_DISTINCT(expression[, precision_threshold])
----

*Parameters*

`expression`::
Expression that outputs the values on which to perform a distinct count.

`precision_threshold`::
Precision threshold. Refer to <<esql-agg-count-distinct-approximate>>. The
maximum supported value is 40000. Thresholds above this number will have the
same effect as a threshold of 40000. The default value is 3000.

*Description*

Returns the approximate number of distinct values.

*Supported types*

Can take any field type as input.

*Examples*

[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-result]
|===

With the optional second parameter to configure the precision threshold:

[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision-result]
|===

The expression can use inline functions. This example splits a string into
multiple values using the `SPLIT` function and counts the unique values:

[source.merge.styled,esql]
----
include::{esql-specs}/stats_count_distinct.csv-spec[tag=docsCountDistinctWithExpression]
----
[%header.monospaced.styled,format=dsv,separator=|]
|===
include::{esql-specs}/stats_count_distinct.csv-spec[tag=docsCountDistinctWithExpression-result]
|===

[discrete]
[[esql-agg-count-distinct-approximate]]
==== Counts are approximate

Computing exact counts requires loading values into a set and returning its
size. This doesn't scale for high-cardinality sets or large values: the memory
required, and the need to send those per-shard sets between nodes, would
consume too many cluster resources.
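
The cost of exact counting can be sketched in a few lines of Python. This is
only an illustration, not how Elasticsearch is implemented: the function name
`exact_distinct_count` and the sample values are hypothetical.

[source,python]
----
def exact_distinct_count(shards):
    """Exact distinct count across shards by merging per-shard sets."""
    merged = set()
    for shard_values in shards:
        # Each shard has to build (and send) a set of every distinct value it holds.
        shard_set = set(shard_values)
        merged |= shard_set
    # Memory grows with the number and size of the distinct values.
    return len(merged)


# Hypothetical values held by two shards.
shards = [
    ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    ["10.0.0.2", "10.0.0.3"],
]
print(exact_distinct_count(shards))  # prints 3
----

Every distinct value is held in memory at least once, which is what makes this
approach impractical for high-cardinality data.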

This `COUNT_DISTINCT` function is based on the
https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
algorithm, which counts based on the hashes of the values with some interesting
properties:

include::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]
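
To give a feel for how hash-based counting keeps memory fixed, the following is
a minimal Python sketch of the plain HyperLogLog estimator. It is not the
HyperLogLog++ variant and not the code Elasticsearch runs: it omits bias
correction and the sparse representation. The function name `hll_estimate`, the
choice of SHA-1 as the hash, and the register parameter `p` are assumptions
made for this example.

[source,python]
----
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values with plain HyperLogLog."""
    m = 1 << p                       # number of registers (fixed memory)
    registers = [0] * m
    for value in values:
        # 64-bit hash of the value; equal values always produce the same hash.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - p)                  # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)           # constant valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


# The result is close to 10000 but not exact.
estimate = hll_estimate(f"user-{i}" for i in range(10000))
----

The register array has the same size no matter how many values are processed,
and feeding in duplicate values does not change the estimate.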

The `COUNT_DISTINCT` function takes an optional second parameter to configure
the precision threshold. The `precision_threshold` option lets you trade memory
for accuracy: it defines a unique count below which counts are expected to be
close to accurate. Above this value, counts might become less accurate. The
maximum supported value is 40000. Thresholds above this number have the
same effect as a threshold of 40000. The default value is `3000`.
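
The default and the upper limit described above can be summarized with a small
Python sketch. The helper `effective_precision_threshold` and the constant
names are hypothetical and are not part of any Elasticsearch API. They only
restate the rule in code.

[source,python]
----
DEFAULT_PRECISION_THRESHOLD = 3000   # used when no threshold is given
MAX_PRECISION_THRESHOLD = 40000      # larger values behave like 40000


def effective_precision_threshold(requested=None):
    """Return the threshold that takes effect for a requested value."""
    if requested is None:
        return DEFAULT_PRECISION_THRESHOLD
    # Values above the maximum have the same effect as the maximum.
    return min(requested, MAX_PRECISION_THRESHOLD)


print(effective_precision_threshold())       # prints 3000
print(effective_precision_threshold(80000))  # prints 40000
print(effective_precision_threshold(5))      # prints 5
----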