395 Commits

Author SHA1 Message Date
David Kyle b9deb660a8
Include the ml inference aggregation doc (#59219)
Add to the list of pipeline aggregations
2020-07-08 14:22:19 +01:00
Nik Everett 3b3ed4b4a7
Fix lookup support in adjacency matrix (#59099)
This request:
```
POST /_search
{
  "aggs": {
    "a": {
      "adjacency_matrix": {
        "filters": {
          "1": {
            "terms": { "t": { "index": "lookup", "id": "1", "path": "t" } }
          }
        }
      }
    }
  }
}
```

Would fail with a 500 error and a message like:
```
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_state_exception",
        "reason":"async actions are left after rewrite"
      }
    ]
  }
}
```

This fixes that by moving the query rewrite phase from a synchronous
call on the data nodes into the standard aggregation rewrite phase which
can properly handle the asynchronous actions.
2020-07-06 18:53:19 -04:00
David Kyle 7daed3b8af
Pipeline Inference Aggregation (#58193)
Adds a pipeline aggregation that loads a model and performs inference on the 
input aggregation results.
2020-07-02 14:33:02 +01:00
Nik Everett 32bdf8549b
Fail variable_width_histogram that collects from many (#58619)
Adds an explicit check to `variable_width_histogram` to stop it from
trying to collect from many buckets because it can't. I tried to make it
do so but that is more than an afternoon's project, sadly. So for now we
just disallow it.

Relates to #42035
2020-06-30 15:42:46 -04:00
Nik Everett dda78ff760
Docs: Mark variable_width_histogram experimental (#58574)
We're tracking this aggregation's experimental-progress in #58573. We'd
like a little time to be able to make backwards incompatible changes to
the aggregation because we're not 100% sure about the request and
response format yet.
2020-06-25 16:54:37 -04:00
James Dorfman e99d287fbb
Add Variable Width Histogram Aggregation (#42035)
Implements a new histogram aggregation called `variable_width_histogram` which
dynamically determines bucket intervals based on document groupings. These
groups are determined by running a one-pass clustering algorithm on each shard
and then reducing each shard's clusters using an agglomerative
clustering algorithm.

This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The
algorithm was lightly inspired by
[this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches
a small number of documents to sample the data and determine initial clusters.
Subsequent documents are then placed into one of these clusters, or a new one
if they are an outlier. This algorithm is described in more detail in the
aggregation's docs.

At reduce time, a
[hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)
algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304)
continually merges the closest buckets from all shards (based on their
centroids) until the target number of buckets is reached.
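
The reduce step described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `Bucket` class, its field names, and the count-weighted centroid update are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    # Hypothetical shape of a shard-level cluster, for illustration only.
    centroid: float
    doc_count: int
    min_value: float
    max_value: float


def merge(a: Bucket, b: Bucket) -> Bucket:
    """Combine two buckets; the new centroid is the count-weighted mean."""
    total = a.doc_count + b.doc_count
    centroid = (a.centroid * a.doc_count + b.centroid * b.doc_count) / total
    return Bucket(centroid, total,
                  min(a.min_value, b.min_value),
                  max(a.max_value, b.max_value))


def reduce_buckets(buckets: list[Bucket], target: int) -> list[Bucket]:
    """Merge the two closest centroids until `target` buckets remain."""
    merged = sorted(buckets, key=lambda b: b.centroid)
    while len(merged) > max(target, 1):
        # In one dimension the closest pair is always adjacent once sorted.
        i = min(range(len(merged) - 1),
                key=lambda k: merged[k + 1].centroid - merged[k].centroid)
        # The merged centroid lies between its two parents, so order is kept.
        merged[i:i + 2] = [merge(merged[i], merged[i + 1])]
    return merged
```

Because merging is driven by centroids only, the `min_value` and `max_value` ranges of neighbouring results can overlap.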

The final values produced by this aggregation are approximate. Each bucket's
min value is used as its key in the histogram. Furthermore, buckets are merged
based on their centroids and not their bounds. So it is possible that adjacent
buckets will overlap after reduction. Because each bucket's key is its min,
this overlap is not shown in the final histogram. However, when such overlap
occurs, we set the key of the bucket with the larger centroid to the midpoint
between its minimum and the smaller bucket’s maximum:
`min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to
increase the accuracy of the clustering.
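
A small sketch of the key adjustment, assuming buckets arrive sorted by centroid and that "overlap" means the previous bucket's max exceeds the current bucket's min. The dict layout and the `adjust_keys` name are hypothetical.

```python
def adjust_keys(buckets):
    """Return histogram keys for buckets sorted by centroid.

    Each bucket is a dict with "min" and "max". A bucket's key is its min,
    unless it overlaps the previous bucket; then the key becomes the midpoint
    between its min and the previous bucket's max.
    """
    keys = []
    for i, bucket in enumerate(buckets):
        key = bucket["min"]
        if i > 0 and buckets[i - 1]["max"] > key:
            key = (key + buckets[i - 1]["max"]) / 2
        keys.append(key)
    return keys
```

For two buckets spanning 0 to 6 and 4 to 10, the second key moves from 4 to 5.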

Nodes are unable to share centroids during the shard-level clustering phase. In
the future, resolving https://github.com/elastic/elasticsearch/issues/50863
would let us solve this issue. 

It doesn’t make sense for this aggregation to support the `min_doc_count`
parameter, since clusters are determined dynamically. The `order` parameter is
not supported here to keep this large PR from becoming too complex.
2020-06-23 09:26:54 -04:00
Cris da Rocha b5de14d3f6
Missing comma between value types (#58383)
This applies to all versions of this document (7.7, 7.8, 7.x, current and master).
2020-06-19 23:01:25 +02:00
Tal Levy c765993d82
add geo_shape documentation for supported aggregations (#58284)
This commit adds documentation for geo_shape fields in aggregations.

Closes #55495.
2020-06-18 10:17:49 -07:00
James Rodewig 7826bbee87
[DOCS] Move search API's `docvalue_fields` examples (#57760)
Changes:

* Condenses and relocates the `docvalue_fields` example to the 'Run a search' 
   page.
* Adds docs for the `docvalue_fields` request body parameter.
* Updates several related xrefs.

Co-authored-by: debadair <debadair@elastic.co>
2020-06-11 10:57:15 -04:00
andrewjohnson2 a791d6723d
Added standard deviation / variance sampling to extended stats (#49782)
Per #49554, I added standard deviation sampling and variance sampling to the extended stats interface.
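
For readers unfamiliar with the distinction, the sketch below contrasts the population and sampling variants using Python's standard library. It only illustrates the arithmetic; the `variance_variants` helper and its key names are made up for this example and are not the response fields of extended stats.

```python
import statistics


def variance_variants(values):
    """Return population and sampling variance and standard deviation."""
    return {
        # Population variants divide the squared deviations by n.
        "variance_population": statistics.pvariance(values),
        "std_deviation_population": statistics.pstdev(values),
        # Sampling variants divide by n - 1 (Bessel's correction).
        "variance_sampling": statistics.variance(values),
        "std_deviation_sampling": statistics.stdev(values),
    }


stats = variance_variants([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

With these eight values the squared deviations sum to 32, so the population variance is 32 / 8 and the sampling variance is 32 / 7.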

Closes #49554

Co-authored-by: Igor Motov <igor@motovs.org>
2020-06-10 15:00:50 -04:00
James Rodewig 51e3d5ab63
[DOCS] Fix source filtering xrefs (#57720)
2020-06-05 08:46:26 -04:00
Igor Motov 29b5643c1a
Increase search.max_buckets to 65,535 (#57042)
Increases the default search.max_buckets limit to 65,535, and only counts
buckets during reduce phase.

Closes #51731
2020-06-03 11:54:48 -04:00
Benjamin Trent 484de0cd02
Adding transform docs for geotile_grid (#57000)
Transforms and composite aggs support geotile_grid as a source. This adds documentation explaining that support.
2020-06-01 15:32:18 -04:00
Nik Everett 1e5e5e2da2
Update date_histogram docs (#56922)
* Make it more clear that you can use `month` or `1M`.
* Explain rounding rules
* Consistently use "time zone" instead of "timezone". It looks like both
  are right but I see "time zone" much more. And the parameter in
  elasticsearch is `time_zone` so we may as well line up.

Closes #56760

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-05-29 17:13:14 -04:00
Gabriel Petrovay 709ee956d7
Fixed calendar intervals documentation (#56666)
- the 1-letter intervals are not parseable (`m`, `h`, `d`, `w`,  `M`, `q`, `y`)
- fixed formatting broken by new lines
2020-05-15 16:56:27 -04:00
Gil Raphaelli f29c9ff652
[DOCS] Sort metric and pipeline agg docs (#56613)
2020-05-15 16:34:47 -04:00
Tal Levy 79367e43da
Add Normalize Pipeline Aggregation (#56399)
This aggregation will perform normalizations of metrics
for a given series of data in the form of bucket values.

The aggregation supports the following normalizations:

- rescale 0-1
- rescale 0-100
- percentage of sum
- mean normalization
- z-score normalization
- softmax normalization

To specify which normalization to use, set the normalize agg's
`normalizer` field.

For example:

```
{
  "normalize": {
    "buckets_path": <>,
    "normalizer": "percent"
  }
}
```
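
The listed normalizations can be sketched as plain functions over a series of bucket values. The formulas are the textbook definitions and are assumptions here, not taken from the PR; in particular, the use of the population standard deviation for the z-score is a guess. All function names are illustrative.

```python
import math


def rescale_0_1(values):
    # Maps min to 0 and max to 1; undefined when all values are equal.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def rescale_0_100(values):
    return [100 * v for v in rescale_0_1(values)]


def percent_of_sum(values):
    total = sum(values)
    return [v / total for v in values]


def mean_normalization(values):
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mean) / spread for v in values]


def z_score(values):
    mean = sum(values) / len(values)
    # Population standard deviation (divides by n); an assumption.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def softmax(values):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

Each function takes the whole series because every normalization needs a series-wide statistic (min, max, sum, mean, or standard deviation) before any single value can be transformed.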

Closes #51005.
2020-05-14 13:32:42 -07:00
Gabriel Petrovay 4029818c24
[Docs] Correct formatting in datehistogram-aggregation.asciidoc (#56664)
2020-05-13 12:02:36 +02:00
Ignacio Vera 4e39184c38
Add moving percentiles pipeline aggregation (#55441)
Similar to what the moving function aggregation does, except that it merges windows of percentile
sketches together instead of cumulatively merging final metrics.
2020-05-12 10:30:52 +02:00
James Rodewig af2d13144f
[DOCS] Add reference docs for `search.max_buckets` setting (#56449)
Adds reference-style setting documentation for the `search.max_buckets`
setting.

This setting was previously only documented on the [bucket
aggregations][0] page.

[0]: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket.html
2020-05-11 08:35:24 -04:00
Christos Soulios caf6c5ac19
Histogram field type support for ValueCount and Avg aggregations (#55933)
Implements value_count and avg aggregations over Histogram fields as discussed in #53285

- value_count returns the sum of the counts arrays of the histograms
- avg computes a weighted average of the values array of the histogram by multiplying each value with its associated element in the counts array
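
As a rough sketch of the two rules above, assuming each histogram field value is a dict with parallel `values` and `counts` arrays (the function names are illustrative):

```python
def value_count(histograms):
    """Total number of observations: the sum of every counts array."""
    return sum(sum(h["counts"]) for h in histograms)


def avg(histograms):
    """Weighted average: each value is multiplied by its matching count."""
    weighted_sum = sum(
        value * count
        for h in histograms
        for value, count in zip(h["values"], h["counts"])
    )
    total = value_count(histograms)
    return weighted_sum / total if total else None


docs = [
    {"values": [1.0, 2.0], "counts": [1, 3]},
    {"values": [4.0], "counts": [4]},
]
```

For `docs`, the weighted sum is 1 + 6 + 16 = 23 over 8 observations.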
2020-05-04 10:24:35 +03:00
AB Prashanth 785527bb58
[DOCS] Remove approximate document counts example from term agg docs (#55442)
Removes an example from the "Document counts are approximate" section of the
terms agg documentation.

As #52377 details, the example was no longer accurate in 7.x or 6.8. Document
counts were more precise than the example presented.

We've opened issue #56025 to discuss re-adding an example later.

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-04-30 09:49:32 -04:00
Christos Soulios cefc6af25b
Histogram field type support for Sum aggregation (#55681)
Implements Sum aggregation over Histogram fields by summing the value of each bucket multiplied by their count as requested in #53285
2020-04-29 11:09:25 +03:00
Zachary Tong 9f165bd44e
Aggs must specify a `field` or `script` (or both) (#52226)
* Aggs must specify a `field` or `script` (or both)

This adds a validation to VSParserHelper to ensure that a field or
script or both are specified by the user.  This is technically
required today already, but throws an exception much deeper
in the agg framework and has a very unintuitive error for the user
(as well as eating more resources instead of failing early)

* Fix StringStats test

* Add yaml test

* Skip test on older versions

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-04-23 14:26:38 -04:00
Igor Motov 6d28596ead
Add support for filters to T-Test aggregation (#54980)
Adds support for filters to T-Test aggregation. The filters can be used to
select populations based on some criteria and use values from the same or
different fields.

Closes #53692
2020-04-10 10:19:07 -04:00
Igor Motov 5fc9fc528d
Add Student's t-test aggregation support (#54469)
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired tests is still missing. It will
be added in a follow-up PR.
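
To make the paired/unpaired distinction concrete, here is a sketch of the two t statistics. It assumes Welch's unequal-variance form for the unpaired case, which is only one common choice, and it stops at the statistic: turning it into a p-value needs the Student's t distribution, which is omitted. The function names are illustrative.

```python
import math
import statistics


def paired_t(a, b):
    """t statistic for paired samples: test the per-pair differences."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def unpaired_t(a, b):
    """t statistic for independent samples (Welch's form, an assumption)."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


before = [1.0, 2.0, 3.0, 4.0]
after = [2.0, 4.0, 5.0, 7.0]
```

The paired form only makes sense when `before[i]` and `after[i]` describe the same subject; the unpaired form allows samples of different sizes.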

Relates to #53692
2020-04-03 11:31:13 -04:00
Gil Raphaelli 4090568797
[DOCS] Fix typos in top metrics agg docs (#54299)
2020-03-27 10:48:01 -04:00
Paweł Krześniak de1229cc2b
[DOCS] link fix (#53973)
Fix bad link in top_metrics.
2020-03-23 13:28:43 -04:00
Zachary Tong 84a59f8447
Add scripting, supported-type tests to ValueCount (#53500)
Also adds a few small notes to the documentation regarding potentially
unintuitive behavior
2020-03-16 15:15:25 -04:00
Lisa Cawley 4a5feab88d
[DOCS] Add anchors for scripted metric aggregations (#53618)
2020-03-16 12:14:01 -07:00
Nik Everett 230a9a8975
Improve top_metrics docs (#53521)
* Removes experimental.
* Replaces `"v"` (for value) with `"m"` (for metric).
* Move the note about tiebreaking into the list of limitations of the
  sort.
* Explain how you ask for `metrics`.
* Clean up some wording.
* Link to the docs from `top_metrics`.

Closes #51813
2020-03-16 13:23:22 -04:00
Nik Everett 8410356c5b
Preserve metric types in top_metrics (#53288)
This changes the `top_metrics` aggregation to return metrics in their
original type. Since it only supports numerics, that means that dates,
longs, and doubles will come back as stored, with their appropriate
formatter applied.
2020-03-11 16:44:08 -04:00
Anton Dollmaier e9c8c03fee
[DOCS] Fix parameter formatting for GeoHash grid agg docs (#53032)
Adds missing colon (`:`) to the parameter definition list.
2020-03-09 08:17:57 -04:00
Nik Everett 56058ab6af
Support multiple metrics in `top_metrics` agg (#52965)
This adds support for returning multiple metrics to the `top_metrics`
agg. It looks like:
```
POST /test/_search?filter_path=aggregations
{
  "aggs": {
    "tm": {
      "top_metrics": {
        "metrics": [
          {"field": "v"},
          {"field": "m"}
        ],
        "sort": {"s": "desc"}
      }
    }
  }
}
```
2020-03-05 06:53:37 -05:00
Nik Everett f4223b6a8f
Add size support to `top_metrics` (#52662)
This adds support for returning the top "n" metrics instead of just the
very top.

Relates to #51813
2020-02-27 11:14:57 -05:00
István Zoltán Szabó 14555ca01e
[DOCS] Links transforms in aggregation docs (#52563)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2020-02-21 08:22:04 +01:00
Nik Everett 5b2266601b
Implement top_metrics agg (#51155)
The `top_metrics` agg is kind of like `top_hits` but it only works on
doc values so it *should* be faster.

At this point it is fairly limited in that it only supports a single,
numeric sort and a single, numeric metric. And it only fetches the "very
topest" document worth of metric. We plan to support returning a
configurable number of top metrics, requesting more than one metric and
more than one sort. And, eventually, non-numeric sorts and metrics. The
trick is doing those things fairly efficiently.

Co-Authored by: Zachary Tong <zach@elastic.co>
2020-02-14 07:13:52 -05:00
Igor Motov 0898df4aac
Add histogram field type support to boxplot aggs (#52265)
Add support for the histogram field type to boxplot aggs.

Closes #52233
Relates to #33112
2020-02-13 08:59:44 -05:00
Igor Motov c50cfa0668
Add Boxplot Aggregation (#51948)
Adds a `boxplot` aggregation that calculates min, max, median and the first
and the third quartiles of the given data set.
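
The five values can be illustrated with an exact computation over a small in-memory list. This is only a sketch of what the numbers mean; the aggregation itself may approximate them, and the `q1`/`q2`/`q3` key names and the inclusive quantile method are choices made for this example.

```python
import statistics


def boxplot(values):
    """Exact five-number summary of a small data set."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "q1": q1,  # first quartile
        "q2": q2,  # median
        "q3": q3,  # third quartile
    }


summary = boxplot([1, 2, 3, 4, 5])
```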

Closes #33112
2020-02-07 18:01:20 -05:00
Mark Tozzi 928c663ce0
Fix dangling 'either' in weighted average docs (#51748)
2020-01-31 12:45:46 -05:00
Elvis Saravia 520da54e63
update pipeline.asciidoc
typo
2020-01-24 14:03:01 +01:00
Igor Motov 23be11cf6c
Fix leftover mentions of method parameter in Percentile Aggs (#51272)
The method parameter is not used in the percentile aggs, instead
the method is determined by the presence of `hdr` or `tdigest`
objects.

Relates to #8324
2020-01-22 05:02:48 -10:00
Tal Levy 6c86606d2a
Adds support for geo-bounds filtering in geogrid aggregations (#50002)
It is fairly common to filter the geo point candidates in
geohash_grid and geotile_grid aggregations according to some
viewable bounding box. This change introduces the option of
specifying this filter directly in the tiling aggregation.

This is even more relevant to `geo_shape`, where the bounds will restrict
the shape to be within the bounds.

This optional `bounds` parameter is parsed in an equivalent fashion to
the bounds specified in the geo_bounding_box query.
2020-01-14 08:29:10 -08:00
Nik Everett 326d696d9a
Support offset in composite aggs (#50609)
Adds support for the `offset` parameter to the `date_histogram` source
of composite aggs. The `offset` parameter is supported by the normal
`date_histogram` aggregation and is useful for folks that need to
measure things from, say, 6am one day to 6am the next day.

This is implemented by creating a new `Rounding` that knows how to
handle offsets and delegates to other rounding implementations. That
implementation doesn't fully implement the `Rounding` contract, namely
`nextRoundingValue`. That method isn't used by composite aggs so I can't
be sure that any implementation that I add will be correct. I propose to
leave it throwing `UnsupportedOperationException` until I need it.
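
The delegation idea can be sketched in a few lines, assuming fixed-length intervals in epoch milliseconds and ignoring time zones and calendar units. The function names are illustrative and are not the `Rounding` API.

```python
HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS


def round_down(timestamp_ms, interval_ms):
    """Plain rounding: floor the timestamp to the start of its interval."""
    return timestamp_ms - timestamp_ms % interval_ms


def round_down_with_offset(timestamp_ms, interval_ms, offset_ms):
    """Shift back by the offset, delegate to plain rounding, shift forward."""
    return round_down(timestamp_ms - offset_ms, interval_ms) + offset_ms


# 05:00 on day 1 falls in the bucket that started at 06:00 on day 0.
early = round_down_with_offset(DAY_MS + 5 * HOUR_MS, DAY_MS, 6 * HOUR_MS)
# 07:00 on day 1 falls in the bucket that started at 06:00 on day 1.
late = round_down_with_offset(DAY_MS + 7 * HOUR_MS, DAY_MS, 6 * HOUR_MS)
```

Wrapping an existing rounding this way leaves the interval logic untouched, which matches the delegation described above.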

Closes #48757
2020-01-07 14:49:09 -05:00
James Rodewig 7f35bcdfc9
[DOCS] Warn about using `geo_centroid` as sub-agg to `geohash_grid` (#50038)
If `geo_point` fields are multi-valued, using `geo_centroid` as a
sub-agg to `geohash_grid` could result in centroids outside of bucket
boundaries.

This adds a related warning to the geo_centroid agg docs.
2020-01-06 07:45:49 -06:00
Nik Everett a7cc0b0159
Docs: Refine note about `after_key` (#50475)
* Docs: Refine note about `after_key`

I was curious about composite aggregations, specifically I wanted to
know how to write a composite aggregation that had all of its buckets
filtered out so you *had* to use the `after_key`. Then I saw that we've
declared composite aggregations not to work with pipelines in #44180. So
I'm not sure you *can* do that any more. Which makes the note about
`after_key` inaccurate. This rejiggers that section of the docs a little
so it is more obvious that you send the `after_key` back to us. And so
it is more obvious that you should *only* use the `after_key` that we
give you rather than try to work it out for yourself.

* Apply suggestions from code review

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-01-02 10:02:55 -05:00
James Rodewig 3460dc9542
[DOCS] Percentile aggs are non-deterministic (#50468)
Percentile aggregations are non-deterministic. A percentile aggregation
can produce different results even when using the same data.

Based on [this discuss post][0], the non-deterministic property stems
from processes in Lucene that can affect the order in which docs are
provided to the aggregation.

This adds a warning stating that the aggregation is non-deterministic
and what that means.

[0]: https://discuss.elastic.co/t/different-results-for-same-query/111757
2019-12-23 13:11:31 -05:00
Florian Kelbert 0778c34630
[DOCS] Fix typo in bucket sum aggregation docs (#50431)
2019-12-20 08:47:24 -05:00
Lisa Cawley 6d608e6a0d
[DOCS] Move transform resource definitions into APIs (#50108)
2019-12-17 09:01:31 -08:00
Jim Ferenczi 804a5042e7
Optimize composite aggregation based on index sorting (#48399)
Co-authored-by: Daniel Huang <danielhuang@tencent.com>

This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such a case, the composite aggregation can use the index sort natural order to terminate the collection early when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than match_all is provided. However, the optimization is deactivated for sources that match the index sort in the following cases:
  * The source is multi-valued, in which case early termination is not possible.
  * `missing_bucket` is set to true.
2019-12-17 14:02:06 +01:00