Commit Graph

583 Commits

Author SHA1 Message Date
István Zoltán Szabó 60f3c77e3f
[DOCS] Adds p-value heuristic to significant terms aggregation (#75369)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2021-07-27 09:12:45 +02:00
Mark Tozzi 7af39dbc35
Remove deprecated date histo interval (#75000)
Date histogram interval parameter was deprecated in 7.2, in favor of the more specific fixed_interval and calendar_interval parameters.  The old logic used some poorly understood guessing to decide if it should operate in fixed or calendar mode.  The new logic requires a specific choice by the user, which is more explicit.  In 7.x REST compatibility mode, we will parse the interval as calendar if possible, and otherwise interpret it as fixed.
2021-07-20 13:08:45 -04:00
James Rodewig 73397f7fb2 [DOCS] Deduplicate docs for `search.max_buckets` 2021-06-29 08:42:01 -04:00
Benjamin Trent 07b336f1b0
Add support for range aggregations on histogram mapped fields (#74146)
This adds support for the range aggregation over `histogram` mapped fields.

Decisions made for implementation:

 - Sub-aggregations are not allowed. This is to simplify implementation and follows the prior art set by the `histogram` aggregation
 - Nothing fancy is done with the ranges. No filter translations as we cannot easily do a `range` filter query against histogram fields. This may be an optimization in the future.
 - Ranges check the histogram value ONLY. No interpolation of values is done. If we have better statistics around the histogram this MAY be possible.
2021-06-29 07:24:54 -04:00
Nik Everett 1338a11d1c
Document types `terms` agg can consume (#73272)
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
2021-06-17 14:58:20 -04:00
Igor Motov db36b6c89a
Add keep_values gap policy (#73297)
Adds a new keep_values gap policy that works like skip, except if the metric
calculated on an empty bucket provides a non-null non-NaN value, this value is
used for the bucket.

Fixes #27377

Co-authored-by: Mark Tozzi <mark.tozzi@gmail.com>
2021-06-08 09:47:29 -10:00
James Rodewig 0360ce48b4
[DOCS] Clarify supported fields for `top_metrics` agg (#73907)
Changes:
* Notes `metrics.field` supports `boolean` fields and runtime fields.
* Notes `metrics.field` doesn't support array values.

Closes #72889
2021-06-08 13:19:43 -04:00
James Rodewig ff0cb8ed97
[DOCS] Make doc_count error docs more searchable (#73870)
Changes:
* Combines the `Document counts are approximate` and `Calculating document count
  error` sections.
* Rewrites the section to include `sum_other_doc_count` and
  `doc_count_error_upper_bound` for easier on-page (ctrl+f) searching.

Closes #73200
2021-06-08 09:33:10 -04:00
Mark Tozzi 2d4d3d40a0
Docvalueformat errors (#73121)
Improve the error message when inconsistent mappings cause doc value formatting errors.  For example, trying to format a binary encoded IP address as a UTF8 string often fails with something unexpected, like `ArrayIndexOutOfBounds`.  This change catches that and wraps it with a message suggesting the user check their mappings.  Also gets rid of anonymous instances for doc value formatters, which made it hard to see what format was failing to be applied.
2021-06-07 15:24:27 -04:00
Benjamin Trent 30cf4dc8be
[ML] adding new KS test pipeline aggregation (#73334)
This adds a new pipeline aggregation for calculating Kolmogorov–Smirnov test for a given sample and buckets path.

For now, the buckets path resolution needs to be `_count`. But, this may be relaxed in the future. 

It accepts a parameter `fractions` that indicates the distribution of documents from some other pre-calculated sample. 

This particular version of the K-S test is Two-sample, meaning, it calculates if the `fractions` and the distribution of `_count` values in the buckets_path are taken from the same distribution.

This in combination with the hypothesis alternatives (`less`, `greater`, `two_sided`) and sampling logic (`upper_tail`, `lower_tail`, `uniform`) allow for flexibility and usefulness when comparing two samples and determining the likelihood of them being from the same overall distribution.

Usage:

```
POST correlate_latency/_search?size=0&filter_path=aggregations
{
  "aggs": {
    "buckets": {
      "terms": { <1>
        "field": "version",
        "size": 2
      },
      "aggs": {
        "latency_ranges": {
          "range": { <2>
            "field": "latency",
            "ranges": [
              { "to": 0.0 },
              { "from": 0, "to": 105 },
              { "from": 105, "to": 225 },
              { "from": 225, "to": 445 },
              { "from": 445, "to": 665 },
              { "from": 665, "to": 885 },
              { "from": 885, "to": 1115 },
              { "from": 1115, "to": 1335 },
              { "from": 1335, "to": 1555 },
              { "from": 1555, "to": 1775 },
              { "from": 1775 }
            ]
          }
        },
        "ks_test": { <3>
          "bucket_count_ks_test": {
            "buckets_path": "latency_ranges>_count",
            "alternative": ["less", "greater", "two_sided"]
          }
        }
      }
    }
  }
}
```
2021-06-04 10:04:41 -04:00
Nik Everett a43b166d11
More debugging info for significant_text (#72727)
Adds some extra debugging information to make it clear that you are
running `significant_text`. Also adds some using timing information
around the `_source` fetch and the `terms` accumulation. This lets you
calculate a third useful timing number: the analysis time. It is
`collect_ns - fetch_ns - accumulation_ns`.

This also adds a half dozen extra REST tests to get a *fairly*
comprehensive set of the operations this supports. It doesn't cover all
of the significance heuristic parsing, but its certainly much better
than what we had.
2021-05-10 12:50:46 -04:00
Benjamin Trent 8069e9b233
[ML] add new bucket_correlation aggregation with initial count_correlation function (#72133)
This commit adds a new pipeline aggregation that allows correlation within the aggregation frame work in bucketed values. 

The initial function is a `count_correlation` function. The purpose of which is to correlate the count in a consistent number of buckets with a pre calculated indicator. The indicator and the aggregated buckets should related to the same metrics with in documents. 

Example for correlating terms within a `service.version.keyword` with latency percentiles. The percentiles and provided correlation indicator both refer to the same source data where the indicator was previously calculated.:
```
GET apm-7.12.0-transaction-generated/_search
{
  "size": 0,
  "aggs": {
    "field_terms": {
      "terms": {
        "field": "service.version.keyword",
        "size": 20
      },
      "aggs": {
        "latency_range": {
          "range": {
            "field": "transaction.duration.us",
            "ranges": [<snip>],
            "keyed": true
          }
        },
        "correlation": {
          "bucket_correlation": {
            "buckets_path": "latency_range>_count",
            "count_correlation": {
              "indicator": {
                 "expectations": [<snip>],
                 "doc_count": 20000
               }
            }
          }
        }
      }
    }
  }
}
```
2021-05-10 12:46:11 -04:00
Nik Everett 5808f2febb
Update docs for `filter` agg (#72508)
The docs for the `filter` agg seemed to suggest that it was the
preferred way to filter results for aggs but its really mostly for when
you need to filter things under another bucketing agg.

Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
2021-05-06 14:51:16 -04:00
Ignacio Vera 793166fd1f
[GeoPoint] Grid aggregations with bounds should exclude touching tiles (#72493) 2021-04-30 08:43:18 +02:00
Pierre Grimaud 3c44dfec60
[DOCS] Fix typos (#72227) 2021-04-26 12:40:38 -04:00
Nik Everett 6a1220e7f3
Convert metric aggs docs runtime fields (#71260)
This replaces the `script` docs for bucket aggregations with runtime
fields. We expect runtime fields to be nicer to work with because you
can also fetch them or filter on them. We expect them to be faster
because their don't need this sort of `instanceof` tree:
a92a647b9f/server/src/main/java/org/elasticsearch/search/aggregations/support/values/ScriptDoubleValues.java (L42)

Relates to #69291

Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Co-authored-by: Adam Locke <adam.locke@elastic.co>
2021-04-05 13:08:13 -04:00
Nik Everett a9d9ee0d4b
Convert bucket aggs docs to runtime fields (#71202)
This replaces the `script` docs for bucket aggregations with runtime
fields. We expect runtime fields to be nicer to work with because you
can also fetch them or filter on them. We expect them to be faster
because their don't need this sort of `instanceof` tree:
a92a647b9f/server/src/main/java/org/elasticsearch/search/aggregations/support/values/ScriptDoubleValues.java (L42)

Relates to #69291

Co-authored-by: Adam Locke <adam.locke@elastic.co>
2021-04-02 12:12:06 -04:00
James Rodewig 693807a6d3
[DOCS] Fix double spaces (#71082) 2021-03-31 09:57:47 -04:00
Benjamin Trent c8415a7924
[ML] adding support for composite aggs in anomaly detection (#69970)
This commit allows for composite aggregations in datafeeds. 

Composite aggs provide a much better solution for having influencers, partitions, etc. on high volume data. Instead of worrying about long scrolls in the datafeed, the calculation is distributed across cluster via the aggregations. 

The restrictions for this support are as follows:

- The composite aggregation must have EXACTLY one `date_histogram` source
- The sub-aggs of the composite aggregation must have a `max` aggregation on the SAME timefield as the aforementioned `date_histogram` source
- The composite agg must be the ONLY top level agg and it cannot have a `composite` or `date_histogram` sub-agg
- If using a `date_histogram` to bucket time, it cannot have a `composite` sub-agg.
- The top-level `composite` agg cannot have a sibling pipeline agg. Pipeline aggregations are supported as a sub-agg (thus a pipeline agg INSIDE the bucket).

Some key user interaction differences:
- Speed + resources used by the cluster should be controlled by the `size` parameter in the `composite` aggregation. Previously, we said if you are using aggs, use a specific `chunking_config`. But, with composite, that is not necessary. 
- Users really shouldn't use nested `terms` aggs anylonger. While this is still a "valid" configuration and MAY be desirable for some users (only wanting the top 10 of certain terms), typically when users want influencers, partition fields, etc. they want the ENTIRE population. Previously, this really wasn't possible with aggs, with `composite` it is.
- I cannot really think of a typical usecase that SHOULD ever use a multi-bucket aggregation that is NOT supported by composite.
2021-03-30 08:25:40 -04:00
István Zoltán Szabó 9a8c6fb66f
[DOCS] Removes beta labels from DFA related docs. (#70808) 2021-03-26 09:46:41 +01:00
Nik Everett 2b9ed7d36f
Docs: Clean doc for agg parameter (#70675)
This adds a heading for `shard_min_doc_count` and merges the paragraphs
for them. I wanted to link to this section earlier today and it wasn't a
"real" section so I couldn't.

Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
2021-03-24 16:22:26 -04:00
Ignacio Vera b81bb42ed9
Increase search.max_bucket by one (#70645) 2021-03-23 08:54:48 +01:00
James Rodewig 53574d2778
[DOCS] Reformat adjacency matrix agg reference (#70034) 2021-03-08 12:33:46 -05:00
James Rodewig 67288a1e4d [DOCS] Fix gap policy xref 2021-03-03 09:31:02 -05:00
James Rodewig e21cab640f
[DOCS] Reformat avg bucket agg reference (#69751) 2021-03-02 13:44:43 -05:00
Nik Everett ea131e5f5a
Docs: Switch terms agg scripting to runtime fields (#69628)
We expect runtime fields to perform a little better than our "native"
aggregation script so we should point folks to them instead of the
"native" aggregation script.
2021-03-02 11:27:21 -05:00
RomainGeffraye fe7afb9d36
[DOCS] Update example for `serial_diff` agg (#69635) 2021-03-01 08:37:29 -05:00
Lisa Cawley efa9b095aa
[DOCS] Adds model alias to inference processor and agg (#69576) 2021-02-24 13:12:39 -08:00
Igor Motov 7ad0201b25
Clarify the intended use case for multi_terms aggs (#69397)
This PR clarifies when multi_terms aggs should be used instead of composite
aggs or nested term aggs.

Relates to #65623
2021-02-23 15:11:53 -05:00
Nik Everett 1195b20a83
Docs: Add example fetching keyword in top_metrics (#69135)
Adds an example of fetching a keyword field.
2021-02-17 12:10:34 -05:00
James Rodewig 9b88ae92e6
[DOCS] Fix typos for duplicate words (#69125) 2021-02-17 10:34:20 -05:00
Dario Gieselaar a28e45c0c5
[DOCS] Remove keyword/ip from list of unsupported fields in top_metrics agg (#69036) 2021-02-17 08:41:57 -05:00
James Rodewig ab0f4d51b2
[DOCS] Add missing newline for bulleted list in top_metrics docs (#68481) (#68550)
Co-authored-by: Nathan L Smith <nathan.smith@elastic.co>
2021-02-04 14:49:02 -05:00
Igor Motov 9e3384ebc9
Add multi_terms aggs (#67597)
Adds a multi_terms aggregation support. The multi terms aggregation works
very similarly to the terms aggregation but supports multiple terms. The goal
of this PR is to add the basic functionality so it is not optimized at the
moment. It will be done in follow up PRs.

Closes #65623
2021-02-03 13:13:33 -05:00
James Rodewig 67f113314d
[DOCS] Fix acasting for agg types (#67469) 2021-01-13 14:44:54 -05:00
Adam Locke 82bfbe1195
[DOCS] Adding headers in TOC for aggregation docs. (#66604) 2020-12-18 11:31:42 -05:00
James Rodewig 77dc63b2de
[DOCS] Fix `search.max_buckets` default (#66311) 2020-12-14 21:55:27 -05:00
Nik Everett 524f39f61e
Drop experimental from variable width histogram (#66055)
Its been several months and we haven't bumped into any good reason to
rework the variable width histogram. So let's drop experimental from it!

Closes #58573
2020-12-08 14:15:21 -05:00
Mike Barretta 12c9ee4d80
Update inference-bucket-aggregation.asciidoc
tiny change to properly align the first code example and to add a missing word
2020-12-03 11:48:45 -05:00
James Rodewig e955f7752b
[DOCS] Fix typo in histogram agg docs (#65822) 2020-12-03 09:55:47 -05:00
Igor Motov a065b6d8da
Return an error when a rate aggregation cannot calculate bucket sizes (#65429)
In some cases when the rate aggregation is not a child of a date histogram
aggregation, it is not possible to determine the actual size of the date
histogram bucket. In this case the rate aggregation now throws an exception.

Closes #63703
2020-11-25 10:05:51 -05:00
Tal Levy a6755c3be8
Add mention of geo_shape support in geotile and geohash grid agg docs (#61129)
Previously, geo_shape support was only mentioned in a dedicated x-pack
section. This may be misleading, as the introductory paragraph only
mentions geo_point.

Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
2020-11-24 13:57:42 -08:00
Tal Levy b514d9bf2e
Add geo_line aggregation (#41612)
A metric aggregation that aggregates a set of points as 
a GeoJSON LineString ordered by some sort parameter.

#### specifics

A `geo_line` aggregation request would specify a `geo_point` field, as well
as a `sort` field. `geo_point` represents the values used in the LineString, 
while the `sort` values will be used as the total ordering of the points.

the `sort` field would support any numeric field, including date.

#### sample usage

```
{
	"query": {
		"bool": {
			"must": [
				{ "term": { "person": "004" } },
				{ "term": { "trajectory": "20090131002206.plt" } }
			]
		}
	},
	"aggs": {
		"make_line": {
			"geo_line": {
				"point": {"field": "location"},
				"sort": { "field": "timestamp" },
                                "include_sort": true,
                                "sort_order": "desc",
                                "size": 15
			}
		}
	}
}
```

#### sample response

```
{
    "took": 21,
    "timed_out": false,
    "_shards": {...},
    "hits": {...},
    "aggregations": {
        "make_line": {
            "type": "LineString",
            "coordinates": [
                [
                    121.52926194481552,
                    38.92878997139633
                ],
                [
                    121.52922699227929,
                    38.92876998055726
                ],
             ]
        }
    }
}
```

#### visual response

<img width="540" alt="Screen Shot 2019-04-26 at 9 40 07 AM" src="https://user-images.githubusercontent.com/388837/56834977-cf278e00-6827-11e9-9c93-005ed48433cc.png">

#### limitations

Due to the cardinality of points, an initial max of 10k points 
will be used. This should support many use-cases.

One solution to overcome this limitation is to keep a PriorityQueue of
points, and simplifying the line once it hits this max. If simplifying
makes sense, it may be a nice option, in general. The ability to use a parameter
to specify how aggressive one wants to simplify. This parameter could be 
the number of points. Example algorithm one could use with a PriorityQueue:
https://bost.ocks.org/mike/simplify/. This would still require O(m) space, where m
is the number of points returned. And would also require heapifying triangles
sorted by their areas, which would be O(log(m)) operations. Since sorting is done, 
anyways, simplifying would still be a O(n log(m)) operation, where n is the total number 
of points to filter........... something to explore


closes #41649
2020-11-23 10:26:27 -08:00
Wylie Conlon 10ee0f2878
Clarify field data cache behavior in docs (#64375)
* Clarify that field data cache includes global ordinals
* Describe that the cache should be cleared once the limit is reached
* Clarify that the `_id` field does not supported aggregations anymore
* Fold the `fielddata` mapping parameter page into the `text field docs
* Improve cross-linking
2020-11-20 13:53:23 -08:00
Adam Locke 9fdcd79927
Explicitly defining types for sources parameter (#65006) 2020-11-12 16:09:04 -05:00
Mark Tozzi f666ccb3bc
Add supports for upper and lower values on boxplot based on the IQR value (#63617) 2020-11-04 14:39:05 -05:00
James Rodewig 8bc922512c
[DOCS] Redirect moving avg aggregation (#64435) 2020-10-30 14:12:09 -04:00
James Rodewig 2e9f95aa73
[DOCS] Change agg titles to sentence case (#64425) 2020-10-30 13:25:21 -04:00
James Rodewig 37b6adaf91
[DOCS] Rewrite aggs overview (#64318)
- Replaces more abstract docs about object structure and values source with task-based examples.
- Relocates several sections from the current `misc.asciidoc` file.
- Alphabetically sorts agg categories in the nav.
- Removes the matrix agg family. Moves the stats matrix agg under the metric agg family

Co-authored-by: debadair <debadair@elastic.co>
2020-10-30 08:39:38 -04:00
István Zoltán Szabó 6093518f4a
[DOCS] Changes experimental flag to beta in DFA related docs (#63992) 2020-10-26 17:02:46 +01:00
Hugo Chargois ff736f078b
Allow mixing set-based and regexp-based include and exclude (#63325)
* Allow mixing set-based and regexp-based include and exclude

* Coding style

* Disallow having both set and regexp include (resp. exclude)

* Test correctness of every combination of include/exclude
2020-10-21 10:26:42 -04:00
Aref Razavi 245663e5b7 Remove useless parentheses in bucket_key formula (#63868) 2020-10-19 11:54:21 +02:00
Igor Motov e6c70f6811
Add value_count mode to rate agg (#63687)
Adds a new value count mode to the rate aggregation.

Closes #63575
2020-10-15 18:00:44 -04:00
Igor Motov 34bff3f776
Add support for histogram fields to rate aggregation (#63289)
The rate aggregation now supports histogram fields. At the moment only sum
is supported. 

Closes #62939
2020-10-08 16:54:25 -04:00
Przemyslaw Gomulka b38eaae47f
[doc] Rounding range query rules (#63109)
a documentation explaining defaulting of missing fields when using date math parser.
relates #62268
2020-10-02 08:59:27 +02:00
Benjamin Trent 1084aaf18a
[ML] renames */inference* apis to */trained_models* (#63097)
This commit renames all `inference` CRUD APIs to `trained_models`.

This aligns with internal terminology, documentation, and use-cases.
2020-10-01 12:13:49 -04:00
Lisa Cawley ecf9e929ba
[DOCS] Add experimental tag to inference processor and bucket aggregation (#63023) 2020-09-30 07:20:38 -07:00
James Rodewig 277709004e
[DOCS] Fix elasticsearch-croneval chunking (#63008) 2020-09-29 09:53:20 -04:00
Christos Soulios b857768bb5
Histogram field type support for min/max aggregations (#62532)
Implement min/max aggregations for histogram fields.

Closes #60951
2020-09-19 23:34:43 +03:00
Julie Tibshirani f29c743a47
Support the 'fields' option in inner_hits and top_hits. (#62259)
This PR adds support for the 'fields' option in the following places:
* Anytime `inner_hits` is used, for both fetching nested/ child docs and field collapsing
* The `top_hits` aggregation

Addresses #61949.
2020-09-14 10:08:58 -07:00
Igor Motov f107dba741
Add rate aggregation (#61369)
Adds a new rate aggregation that can calculate a document rate for buckets
of a date_histogram.

Closes #60674
2020-08-25 11:32:20 -04:00
István Zoltán Szabó 8da6bba0fc
[DOCS] Adds example to the inference aggregation description (#61290) 2020-08-19 11:20:42 +02:00
Nik Everett cebd5d47e2
Redo experimental tag on vwh (#61065)
The docs didn't have the standard experimental text. This adds it.
2020-08-18 10:00:54 -04:00
James Rodewig 456c37b186
[DOCS] Add usage tips to `top_hits` agg (#61215) 2020-08-17 12:42:04 -04:00
Adam Locke fdc867e395
[DOCS] Update info about geo_shape bounding boxes (#61214)
* Adding information about geo_shape bounding boxes.

* Fixing cross link and incorporating review feedback.
2020-08-17 11:07:18 -04:00
James Rodewig a94e5cb7c4
[DOCS] Replace Wikipedia links with attribute (#61171) 2020-08-17 09:44:24 -04:00
Gilad Gal 8534bd5ce7
Update normalize-aggregation.asciidoc
The second method normalizes linearly between 0..100
2020-08-12 22:24:36 +03:00
James Rodewig a0f4edff66
[DOCS] Fix chunking in query docs (#61053)
Changes:
* Moves "Notes" sections for the joining queries and percolate query
  pages to the parent page
* Adds related redirects for the moved "Notes" pages
* Assigns explicit anchor IDs to other "Notes" headings. This was required for
  the redirects to work.
2020-08-12 13:45:49 -04:00
James Rodewig 6b9b8c5e31
[DOCS] Move script and stored fields content to search fields page (#60826)
Changes:

* Moves `Retrieve selected fields` to its own page and adds a title abbreviation.
* Adds existing script and stored fields content to `Retrieve selected fields`
* Adds a xref for `Retrieve selected fields` to `Search your data`
* Adds related redirects and updates existing xrefs
2020-08-06 12:45:03 -04:00
Mark Tozzi 65caee9163
Extensibility for Composite Agg (#59648)
This PR adds the ability to plug new ValuesSourceType support into Composite aggregations via the ValuesSourceRegistry. This should let plugins which define new field types wire those types into composite.  It also updates composite's use of ValueType to follow the conventions we're using in the rest of aggregations, namely splitting the user supplied value out from the default value.
2020-08-06 12:34:14 -04:00
James Rodewig 929033f9dd
[DOCS] Move named query content to bool query (#60748) 2020-08-05 13:27:10 -04:00
James Rodewig a4dc336c16
[DOCS] Replace `twitter` dataset in search/agg docs (#60667) 2020-08-04 13:31:52 -04:00
Alexander Reelsen c7ac9e7073
[DOCS] http -> https, remove outdated plugin docs (#60380)
Plugin discovery documentation contained information about installing
Elasticsearch 2.0 and installing an oracle JDK, both of which is no
longer valid.

While noticing that the instructions used cleartext HTTP to install
packages, this commit replaces HTTPs links instead of HTTP where possible.

In addition a few community links have been removed, as they do not seem
to exist anymore.
2020-07-31 15:58:38 -04:00
James Rodewig aec26b1a23
[DOCS] Move search pagination content to one page (#60515) 2020-07-31 11:43:06 -04:00
Julie Tibshirani 8a89d95372
Add search `fields` parameter to support high-level field retrieval. (#60100)
This feature adds a new `fields` parameter to the search request, which
consults both the document `_source` and the mappings to fetch fields in a
consistent way. The PR merges the `field-retrieval` feature branch.

Addresses #49028 and #55363.
2020-07-27 13:25:55 -07:00
James Rodewig 441c3a21b1
[DOCS] Update my-index examples (#60132)
Changes the following example index names to `my-index-000001` for consistency:

* `my-index`
* `my_index`
* `myindex`
2020-07-27 14:46:39 -04:00
James Rodewig 74c9e56735
[DOCS] Fix default gap policy for moving fn, moving avg aggs (#60223) (#60230) 2020-07-27 12:32:35 -04:00
James Rodewig d5b03f668b
[DOCS] Move search sort docs to separate page (#60123)
Moves the search sort docs from the deprecated 'Request Body Search'
page to a new subpage of 'Run a search'.

No substantive changes were made to the content.
2020-07-23 12:58:57 -04:00
James Rodewig 2774cd6938
[DOCS] Swap `[float]` for `[discrete]` (#60124)
Changes instances of `[float]` in our docs for `[discrete]`.

Asciidoctor prefers the `[discrete]` tag for floating headings:
https://asciidoctor.org/docs/asciidoc-asciidoctor-diffs/#blocks
2020-07-23 11:48:22 -04:00
Howard b8e3ba783a
[DOCS] Fix missing punctuation in agg docs (#59822) 2020-07-21 10:17:59 -04:00
James Rodewig 2c5d6e9c95
[DOCS] Reformat agg snippets to use two-space indents (#59912) 2020-07-20 15:08:04 -04:00
James Rodewig 8a57800f1b
[DOCS] Add performance warning for scripts (#59890) 2020-07-20 14:04:35 -04:00
Igor Motov 6bfde550f9
Add hard_bounds documentation (#59809)
Fixes #59774
2020-07-20 09:54:02 -04:00
Nik Everett 27efb5f3b8
Clean up a few of vwh's rough edges (#59341)
This cleans up a few rough edged in the `variable_width_histogram`,
mostly found by @wwang500:
1. Setting its tuning parameters in an unexpected order could cause the
   request to fail.
2. We checked that the maximum number of buckets was both less than
   50000 and MAX_BUCKETS. This drops the 50000.
3. Fixes a divide by 0 that can occur of the `shard_size` is 1.
4. Fixes a divide by 0 that can occur if the `shard_size * 3` overflows
   a signed int.
5. Requires `shard_size * 3 / 4` to be at least `buckets`. If it is less
   than `buckets` we will very consistently return fewer buckets than
   requested. For the most part we expect folks to leave it at the
   default. If they change it, we expect it to be much bigger than
   `buckets`.
6. Allocate a smaller `mergeMap` in when initially bucketing requests
   that don't use the entire `shard_size * 3 / 4`. Its just a waste.
7. Default `shard_size` to `10 * buckets` rather than `100`. It *looks*
   like that was our intention the whole time. And it feels like it'd
   keep the algorithm humming along more smoothly.
8. Default the `initial_buffer` to `min(10 * shard_size, 50000)` like
   we've documented it rather than `5000`. Like the point above, this
   feels like the right thing to do to keep the algorithm happy.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-07-17 13:39:28 -04:00
James Rodewig aa3ddfeefb
[DOCS] Move highlighting docs to separate page (#59768)
Moves the highlighting docs from the deprecated 'Request Body Search'
chapter to the new subpage of the 'Run a search chapter' section.

No substantive changes were made to the content.
2020-07-17 10:15:20 -04:00
István Zoltán Szabó edccf14478
[DOCS] Adds security privilege info to inference bucket aggregation (#59604) 2020-07-16 18:02:17 +02:00
Adam Locke 4dc5c87211
Indicating that the size parameter defaults to 10. (#59438) 2020-07-13 16:04:48 -04:00
Christos Soulios 2976ba471a
Histogram integration on Histogram field type (#58930)
Implements histogram aggregation over histogram fields as requested in #53285.
2020-07-13 17:07:16 +03:00
David Kyle b9deb660a8
Include the ml inference aggregation doc (#59219)
Add to the list of pipeline aggregations
2020-07-08 14:22:19 +01:00
Nik Everett 3b3ed4b4a7
Fix lookup support in adjacency matrix (#59099)
This request:
```
POST /_search
{
  "aggs": {
    "a": {
      "adjacency_matrix": {
        "filters": {
          "1": {
            "terms": { "t": { "index": "lookup", "id": "1", "path": "t" } }
          }
        }
      }
    }
  }
}
```

Would fail with a 500 error and a message like:
```
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_state_exception",
        "reason":"async actions are left after rewrite"
      }
    ]
  }
}
```

This fixes that by moving the query rewrite phase from a synchronous
call on the data nodes into the standard aggregation rewrite phase which
can properly handle the asynchronous actions.
2020-07-06 18:53:19 -04:00
David Kyle 7daed3b8af
Pipeline Inference Aggregation (#58193)
Adds a pipeline aggregation that loads a model and performs inference on the 
input aggregation results.
2020-07-02 14:33:02 +01:00
Nik Everett 32bdf8549b
Fail variable_width_histogram that collects from many (#58619)
Adds an explicit check to `variable_width_histogram` to stop it from
trying to collect from many buckets because it can't. I tried to make it
do so but that is more than an afternoon's project, sadly. So for now we
just disallow it.

Relates to #42035
2020-06-30 15:42:46 -04:00
Nik Everett dda78ff760
Docs: Mark variable_width_histogram experimental (#58574)
We're tracking this aggregation's experimental-progress in #58573. We'd
like a little time to be able to make backwards incompatible changes to
the aggregation because we're not 100% sure about the request and
response format yet.
2020-06-25 16:54:37 -04:00
James Dorfman e99d287fbb
Add Variable Width Histogram Aggregation (#42035)
Implements a new histogram aggregation called `variable_width_histogram` which
dynamically determines bucket intervals based on document groupings. These
groups are determined by running a one-pass clustering algorithm on each shard
and then reducing each shard's clusters using an agglomerative
clustering algorithm.

This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The
algorithm was lightly inspired by
[this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches
a small number of documents to sample the data and determine initial clusters.
Subsequent documents are then placed into one of these clusters, or a new one
if they are an outlier. This algorithm is described in more details in the
aggregation's docs.

At reduce time, a
[hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)
algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304)
continually merges the closest buckets from all shards (based on their
centroids) until the target number of buckets is reached.

The final values produced by this aggregation are approximate. Each bucket's
min value is used as its key in the histogram. Furthermore, buckets are merged
based on their centroids and not their bounds. So it is possible that adjacent
buckets will overlap after reduction. Because each bucket's key is its min,
this overlap is not shown in the final histogram. However, when such overlap
occurs, we set the key of the bucket with the larger centroid to the midpoint
between its minimum and the smaller bucket’s maximum:
`min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to
increases the accuracy of the clustering.

Nodes are unable to share centroids during the shard-level clustering phase. In
the future, resolving https://github.com/elastic/elasticsearch/issues/50863
would let us solve this issue. 

It doesn’t make sense for this aggregation to support the `min_doc_count`
parameter, since clusters are determined dynamically. The `order` parameter is
not supported here to keep this large PR from becoming too complex.
2020-06-23 09:26:54 -04:00
Cris da Rocha b5de14d3f6
Missing comma between value types (#58383)
This applies to all versions of this document (7.7, 7.8, 7.x, current and master).
2020-06-19 23:01:25 +02:00
Tal Levy c765993d82
add geo_shape documentation for supported aggregations (#58284)
This commit adds documentation for geo_shape fields in aggregations

Closes #55495.
2020-06-18 10:17:49 -07:00
James Rodewig 7826bbee87
[DOCS] Move search API's `docvalue_fields` examples (#57760)
Changes:

* Condenses and relocates the `docvalue_fields` example to the 'Run a search' 
   page.
* Adds docs for the `docvalue_fields` request body parameter.
* Updates several related xrefs.

Co-authored-by: debadair <debadair@elastic.co>
2020-06-11 10:57:15 -04:00
andrewjohnson2 a791d6723d
Added standard deviation / variance sampling to extended stats (#49782)
Per 49554 I added standard deviation sampling and variance sampling to the extended stats interface.

Closes #49554

Co-authored-by: Igor Motov <igor@motovs.org>
2020-06-10 15:00:50 -04:00
James Rodewig 51e3d5ab63
[DOCS] Fix source filtering xrefs (#57720) 2020-06-05 08:46:26 -04:00
Igor Motov 29b5643c1a
Increase search.max_buckets to 65,535 (#57042)
Increases the default search.max_buckets limit to 65,535, and only counts
buckets during reduce phase.

Closes #51731
2020-06-03 11:54:48 -04:00
Benjamin Trent 484de0cd02
Adding transform docs for geotile_grid (#57000)
transforms and composite aggs support geotile_grid as a source. This adds documentation explaining that support.
2020-06-01 15:32:18 -04:00
Nik Everett 1e5e5e2da2
Update date_histogram docs (#56922)
* Make it more clear that you can use `month` or `1M`.
* Explain rounding rules
* Consistently use "time zone" instead of "timezone". It looks like both
  are right but I see "time zone" much more. And the parameter in
  elasticsearch is `time_zone` so we may as well line up.

Closes #56760

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-05-29 17:13:14 -04:00
Gabriel Petrovay 709ee956d7 Fixed calendar intervals documentation (#56666)
- the 1-letter intervals are not parseable (`m`, `h`, `d`, `w`,  `M`, `q`, `y`)
- fixed formatting broken by new lines
2020-05-15 16:56:27 -04:00
Gil Raphaelli f29c9ff652 [DOCS] Sort metric and pipeline agg docs (#56613) 2020-05-15 16:34:47 -04:00
Tal Levy 79367e43da
Add Normalize Pipeline Aggregation (#56399)
This aggregation will perform normalizations of metrics
for a given series of data in the form of bucket values.

The aggregations supports the following normalizations

- rescale 0-1
- rescale 0-100
- percentage of sum
- mean normalization
- z-score normalization
- softmax normalization

To specify which normalization is to be used, it can be specified
in the normalize agg's `normalizer` field.

For example:

```
{
  "normalize": {
    "buckets_path": <>,
    "normalizer": "percent"
  }
}
```

Closes #51005.
2020-05-14 13:32:42 -07:00
Gabriel Petrovay 4029818c24 [Docs] Correct formatting in datehistogram-aggregation.asciidoc (#56664) 2020-05-13 12:02:36 +02:00
Ignacio Vera 4e39184c38
Add moving percentiles pipeline aggregation (#55441)
Similar to what the moving function aggregation does, except merging windows of percentiles 
sketches together instead of cumulatively merging final metrics
2020-05-12 10:30:52 +02:00
James Rodewig af2d13144f
[DOCS] Add reference docs for `search.max_buckets` setting (#56449)
Adds reference-style setting documentation for the `search.max_buckets`
setting.

This setting was previously only documented on the [bucket
aggregations][0] page.

[0]: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket.html
2020-05-11 08:35:24 -04:00
Christos Soulios caf6c5ac19
Histogram field type support for ValueCount and Avg aggregations (#55933)
Implements value_count and avg aggregations over Histogram fields as discussed in #53285

- value_count returns the sum of all counts array of the histograms
- avg computes a weighted average of the values array of the histogram by multiplying each value with its associated element in the counts array
2020-05-04 10:24:35 +03:00
AB Prashanth 785527bb58
[DOCS] Remove approximate document counts example from term agg docs (#55442)
Removes an example from the "Document counts are approximate" section of the
terms agg documentation.

As #52377 details, the example was no longer accurate in 7.x or 6.8. Document
counts were more precise than the example presented.

We've opened issue #56025 to discuss re-adding an example later.

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-04-30 09:49:32 -04:00
Christos Soulios cefc6af25b
Histogram field type support for Sum aggregation (#55681)
Implements Sum aggregation over Histogram fields by summing the value of each bucket multiplied by their count as requested in #53285
2020-04-29 11:09:25 +03:00
Zachary Tong 9f165bd44e
Aggs must specify a `field` or `script` (or both) (#52226)
* Aggs must specify a `field` or `script` (or both)

This adds a validation to VSParserHelper to ensure that a field or
script or both are specified by the user.  This is technically
required today already, but throws an exception much deeper
in the agg framework and has a very unintuitive error for the user
(as well as eating more resources instead of failing early)

* Fix StringStats test

* Add yaml test

* Skip test on older versions

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-04-23 14:26:38 -04:00
Igor Motov 6d28596ead
Add support for filters to T-Test aggregation (#54980)
Adds support for filters to T-Test aggregation. The filters can be used to
select populations based on some criteria and use values from the same or
different fields.

Closes #53692
2020-04-10 10:19:07 -04:00
Igor Motov 5fc9fc528d
Add Student's t-test aggregation support (#54469)
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.

Relates to #53692
2020-04-03 11:31:13 -04:00
Gil Raphaelli 4090568797
[DOCS] Fix typos in top metrics agg docs (#54299) 2020-03-27 10:48:01 -04:00
Paweł Krześniak de1229cc2b
[DOCS] link fix (#53973)
Fix bad link in top_metrics.
2020-03-23 13:28:43 -04:00
Zachary Tong 84a59f8447
Add scripting, supported-type tests to ValueCount (#53500)
Also adds a few small notes to the documentation regarding potentially
unintuitive behavior
2020-03-16 15:15:25 -04:00
Lisa Cawley 4a5feab88d
[DOCS] Add anchors for scripted metric aggregations (#53618) 2020-03-16 12:14:01 -07:00
Nik Everett 230a9a8975
Improve top_metrics docs (#53521)
* Removes experimental.
* Replaces `"v"` (for value) with `"m"` (for metric).
* Move the note about tiebreaking into the list of limitations of the
  sort.
* Explain how you ask for `metrics`.
* Clean up some wording.
* Link to the docs from `top_metrics`.

Closes #51813
2020-03-16 13:23:22 -04:00
Nik Everett 8410356c5b
Preserve metric types in top_metrics (#53288)
This changes the `top_metrics` aggregation to return metrics in their
original type. Since it only supports numerics, that means that dates,
longs, and doubles will come back as stored, with their appropriate
formatter applied.
2020-03-11 16:44:08 -04:00
Anton Dollmaier e9c8c03fee [DOCS] Fix parameter formatting for GeoHash grid agg docs (#53032)
Adds missing colon (`:`) to the parameter definition list.
2020-03-09 08:17:57 -04:00
Nik Everett 56058ab6af
Support multiple metrics in `top_metrics` agg (#52965)
This adds support for returning multiple metrics to the `top_metrics`
agg. It looks like:
```
POST /test/_search?filter_path=aggregations
{
  "aggs": {
    "tm": {
      "top_metrics": {
        "metrics": [
          {"field": "v"},
          {"field": "m"}
        ],
        "sort": {"s": "desc"}
      }
    }
  }
}
```
2020-03-05 06:53:37 -05:00
Nik Everett f4223b6a8f
Add size support to `top_metrics` (#52662)
This adds support for returning the top "n" metrics instead of just the
very top.

Relates to #51813
2020-02-27 11:14:57 -05:00
István Zoltán Szabó 14555ca01e
[DOCS] Links transforms in aggregation docs (#52563)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2020-02-21 08:22:04 +01:00
Nik Everett 5b2266601b
Implement top_metrics agg (#51155)
The `top_metrics` agg is kind of like `top_hits` but it only works on
doc values so it *should* be faster.

At this point it is fairly limited in that it only supports a single,
numeric sort and a single, numeric metric. And it only fetches the "very
topest" document worth of metric. We plan to support returning a
configurable number of top metrics, requesting more than one metric and
more than one sort. And, eventually, non-numeric sorts and metrics. The
trick is doing those things fairly efficiently.

Co-Authored by: Zachary Tong <zach@elastic.co>
2020-02-14 07:13:52 -05:00
Igor Motov 0898df4aac
Add histogram field type support to boxplot aggs (#52265)
Add support for the histogram field type to boxplot aggs.

Closes #52233
Relates to #33112
2020-02-13 08:59:44 -05:00
Igor Motov c50cfa0668
Add Boxplot Aggregation (#51948)
Adds a `boxplot` aggregation that calculates min, max, medium and the first
and the third quartiles of the given data set.

Closes #33112
2020-02-07 18:01:20 -05:00
Mark Tozzi 928c663ce0
Fix dangling 'either' in weighted average docs (#51748) 2020-01-31 12:45:46 -05:00
Elvis Saravia 520da54e63
update pipeline.asciidoc
typo
2020-01-24 14:03:01 +01:00
Igor Motov 23be11cf6c
Fix leftover mentions of method parameter in Percentile Aggs (#51272)
The method parameter is not used in the percentile aggs, instead
the method is determined by the presence of `hdr` or `tdigest`
objects.

Relates to #8324
2020-01-22 05:02:48 -10:00
Tal Levy 6c86606d2a
Adds support for geo-bounds filtering in geogrid aggregations (#50002)
It is fairly common to filter the geo point candidates in
geohash_grid and geotile_grid aggregations according to some
viewable bounding box. This change introduces the option of
specifying this filter directly in the tiling aggregation.

This is even more relevant to `geo_shape` where the bounds will restrict
the shape to be within the bounds

this optional `bounds` parameter is parsed in an equivalent fashion to 
the bounds specified in the geo_bounding_box query.
2020-01-14 08:29:10 -08:00
Nik Everett 326d696d9a
Support offset in composite aggs (#50609)
Adds support for the `offset` parameter to the `date_histogram` source
of composite aggs. The `offset` parameter is supported by the normal
`date_histogram` aggregation and is useful for folks that need to
measure things from, say, 6am one day to 6am the next day.

This is implemented by creating a new `Rounding` that knows how to
handle offsets and delegates to other rounding implementations. That
implementation doesn't fully implement the `Rounding` contract, namely
`nextRoundingValue`. That method isn't used by composite aggs so I can't
be sure that any implementation that I add will be correct. I propose to
leave it throwing `UnsupportedOperationException` until I need it.

Closes #48757
2020-01-07 14:49:09 -05:00
James Rodewig 7f35bcdfc9
[DOCS] Warn about using `geo_centroid` as sub-agg to `geohash_grid` (#50038)
If `geo_point fields` are multi-valued, using `geo_centroid` as a
sub-agg to `geohash_grid` could result in centroids outside of bucket
boundaries.

This adds a related warning to the geo_centroid agg docs.
2020-01-06 07:45:49 -06:00
Nik Everett a7cc0b0159
Docs: Refine note about `after_key` (#50475)
* Docs: Refine note about `after_key`

I was curious about composite aggregations, specifically I wanted to
know how to write a composite aggregation that had all of its buckets
filtered out so you *had* to use the `after_key`. Then I saw that we've
declared composite aggregations not to work with pipelines in #44180. So
I'm not sure you *can* do that any more. Which makes the note about
`after_key` inaccurate. This rejiggers that section of the docs a little
so it is more obvious that you send the `after_key` back to us. And so
it is more obvious that you should *only* use the `after_key` that we
give you rather than try to work it out for yourself.

* Apply suggestions from code review

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-01-02 10:02:55 -05:00
James Rodewig 3460dc9542
[DOCS] Percentile aggs are non-deterministic (#50468)
Percentile aggregations are non-deterministic. A percentile aggregation
can produce different results even when using the same data.

Based on [this discuss post][0], the non-deterministic property stems
from processes in Lucene that can affect the order in which docs are
provided to the aggregation.

This adds a warning stating that the aggregation is non-deterministic
and what that means.

[0]: https://discuss.elastic.co/t/different-results-for-same-query/111757
2019-12-23 13:11:31 -05:00
Florian Kelbert 0778c34630 [DOCS] Fix typo in bucket sum aggregation docs (#50431) 2019-12-20 08:47:24 -05:00
Lisa Cawley 6d608e6a0d
[DOCS] Move transform resource definitions into APIs (#50108) 2019-12-17 09:01:31 -08:00
Jim Ferenczi 804a5042e7
Optimize composite aggregation based on index sorting (#48399)
Co-authored-by: Daniel Huang <danielhuang@tencent.com>

This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases:
  * Multi-valued source, in such case early termination is not possible.
  * missing_bucket is set to true
2019-12-17 14:02:06 +01:00
James Rodewig 2d9ee5ddfe
[DOCS] Correct percentile rank agg example response (#50052)
The example snippets in the percentile rank agg docs use a test dataset
named `latency`, which is generated from docs/gradle.build.

At some point the dataset and example snippets were updated, but the
text surrounding the snippets was not. This means the text and the
example snippets shown no longer match up.

This corrects that by changing the snippets using /TESTRESPONSE magic comments.
2019-12-12 08:38:48 -05:00
Ignacio Vera eade4f03f4
New Histogram field mapper that supports percentiles aggregations. (#48580)
This commit adds  a new histogram field mapper that consists in a pre-aggregated format of numerical data to be used in percentiles aggregations.
2019-11-28 13:58:20 +01:00
Przemko Robakowski 04f6b6fdb2
[DOCS] IDs for doc snippets (#49008)
* Ids for docs snippets

* Ids for tests

* Ids for docs snippets

* ignoring build folder from idea

* Ignoring build-eclipse
2019-11-25 15:30:00 +01:00
Lisa Cawley a4efab6ab4
[DOCS] Merge rollup config details into API (#49412) 2019-11-22 08:31:30 -08:00
Christos Soulios b0e12c936b
Implement stats aggregation for string terms (#47468)
This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following:

    min_length: The length of the shortest term
    max_length: The length of the longest term
    avg_length: The average length of all terms
    distribution: The probability distribution of all characters appearing in all terms
    entropy: The total Shannon entropy value calculated for all terms

This aggregation has been implemented as an analytics plugin.
2019-11-14 16:07:54 +02:00
James Rodewig f53eba024b
[DOCS] Remove binary gendered language (#48362) 2019-10-23 09:36:31 -05:00
Ian Danforth 24cf883792 [DOCS] Fix typo in percentile rank aggregation docs (#47247) 2019-10-15 15:56:32 -04:00
Alan Woodward 566e1b7d33
Remove type field from DocWriteRequest and associated Response objects (#47671)
This commit removes the type field from index, update and delete requests, and their
associated responses.

Relates to #41059
2019-10-11 10:23:55 +01:00
Alan Woodward 7a622f024f
Remove types from BulkRequest (#46983)
This commit removes types entirely from BulkRequest, both as a global
parameter and as individual entries on update/index/delete lines.

Relates to #41059
2019-10-07 13:29:12 +01:00
Mark Tozzi c26ce1d7f5
DocValueFormat implementation for date range fields (#47472) 2019-10-04 16:01:28 -04:00
Mark Tozzi 57a679fbbb
Documentation notes for Range field histograms (#46890) 2019-10-01 10:46:04 -04:00
Alan Woodward c1f99e2d75
Remove `_type` from SearchHit (#46942)
This commit removes the `_type` field from all search hit responses.

Relates to #41059
2019-09-23 19:14:54 +01:00