Adds an explicit check to `variable_width_histogram` to stop it from
trying to collect from many buckets because it can't. I tried to make it
do so but that is more than an afternoon's project, sadly. So for now we
just disallow it.
Relates to #42035
We're tracking this aggregation's experimental-progress in #58573. We'd
like a little time to be able to make backwards incompatible changes to
the aggregation because we're not 100% sure about the request and
response format yet.
Implements a new histogram aggregation called `variable_width_histogram` which
dynamically determines bucket intervals based on document groupings. These
groups are determined by running a one-pass clustering algorithm on each shard
and then reducing each shard's clusters using an agglomerative
clustering algorithm.
This PR addresses #9572.
The shard-level clustering is done in one pass to minimize memory overhead. The
algorithm was lightly inspired by
[this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches
a small number of documents to sample the data and determine initial clusters.
Subsequent documents are then placed into one of these clusters, or a new one
if they are an outlier. This algorithm is described in more details in the
aggregation's docs.
At reduce time, a
[hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)
algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304)
continually merges the closest buckets from all shards (based on their
centroids) until the target number of buckets is reached.
The final values produced by this aggregation are approximate. Each bucket's
min value is used as its key in the histogram. Furthermore, buckets are merged
based on their centroids and not their bounds. So it is possible that adjacent
buckets will overlap after reduction. Because each bucket's key is its min,
this overlap is not shown in the final histogram. However, when such overlap
occurs, we set the key of the bucket with the larger centroid to the midpoint
between its minimum and the smaller bucket’s maximum:
`min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to
increases the accuracy of the clustering.
Nodes are unable to share centroids during the shard-level clustering phase. In
the future, resolving https://github.com/elastic/elasticsearch/issues/50863
would let us solve this issue.
It doesn’t make sense for this aggregation to support the `min_doc_count`
parameter, since clusters are determined dynamically. The `order` parameter is
not supported here to keep this large PR from becoming too complex.
Changes:
* Condenses and relocates the `docvalue_fields` example to the 'Run a search'
page.
* Adds docs for the `docvalue_fields` request body parameter.
* Updates several related xrefs.
Co-authored-by: debadair <debadair@elastic.co>
Per 49554 I added standard deviation sampling and variance sampling to the extended stats interface.
Closes#49554
Co-authored-by: Igor Motov <igor@motovs.org>
* Make it more clear that you can use `month` or `1M`.
* Explain rounding rules
* Consistently use "time zone" instead of "timezone". It looks like both
are right but I see "time zone" much more. And the parameter in
elasticsearch is `time_zone` so we may as well line up.
Closes#56760
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
This aggregation will perform normalizations of metrics
for a given series of data in the form of bucket values.
The aggregations supports the following normalizations
- rescale 0-1
- rescale 0-100
- percentage of sum
- mean normalization
- z-score normalization
- softmax normalization
To specify which normalization is to be used, it can be specified
in the normalize agg's `normalizer` field.
For example:
```
{
"normalize": {
"buckets_path": <>,
"normalizer": "percent"
}
}
```
Closes#51005.
Similar to what the moving function aggregation does, except merging windows of percentiles
sketches together instead of cumulatively merging final metrics
Implements value_count and avg aggregations over Histogram fields as discussed in #53285
- value_count returns the sum of all counts array of the histograms
- avg computes a weighted average of the values array of the histogram by multiplying each value with its associated element in the counts array
Removes an example from the "Document counts are approximate" section of the
terms agg documentation.
As #52377 details, the example was no longer accurate in 7.x or 6.8. Document
counts were more precise than the example presented.
We've opened issue #56025 to discuss re-adding an example later.
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Aggs must specify a `field` or `script` (or both)
This adds a validation to VSParserHelper to ensure that a field or
script or both are specified by the user. This is technically
required today already, but throws an exception much deeper
in the agg framework and has a very unintuitive error for the user
(as well as eating more resources instead of failing early)
* Fix StringStats test
* Add yaml test
* Skip test on older versions
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Adds support for filters to T-Test aggregation. The filters can be used to
select populations based on some criteria and use values from the same or
different fields.
Closes#53692
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.
Relates to #53692
* Removes experimental.
* Replaces `"v"` (for value) with `"m"` (for metric).
* Move the note about tiebreaking into the list of limitations of the
sort.
* Explain how you ask for `metrics`.
* Clean up some wording.
* Link to the docs from `top_metrics`.
Closes#51813
This changes the `top_metrics` aggregation to return metrics in their
original type. Since it only supports numerics, that means that dates,
longs, and doubles will come back as stored, with their appropriate
formatter applied.
The `top_metrics` agg is kind of like `top_hits` but it only works on
doc values so it *should* be faster.
At this point it is fairly limited in that it only supports a single,
numeric sort and a single, numeric metric. And it only fetches the "very
topest" document worth of metric. We plan to support returning a
configurable number of top metrics, requesting more than one metric and
more than one sort. And, eventually, non-numeric sorts and metrics. The
trick is doing those things fairly efficiently.
Co-Authored by: Zachary Tong <zach@elastic.co>
The method parameter is not used in the percentile aggs, instead
the method is determined by the presence of `hdr` or `tdigest`
objects.
Relates to #8324
It is fairly common to filter the geo point candidates in
geohash_grid and geotile_grid aggregations according to some
viewable bounding box. This change introduces the option of
specifying this filter directly in the tiling aggregation.
This is even more relevant to `geo_shape` where the bounds will restrict
the shape to be within the bounds
this optional `bounds` parameter is parsed in an equivalent fashion to
the bounds specified in the geo_bounding_box query.
Adds support for the `offset` parameter to the `date_histogram` source
of composite aggs. The `offset` parameter is supported by the normal
`date_histogram` aggregation and is useful for folks that need to
measure things from, say, 6am one day to 6am the next day.
This is implemented by creating a new `Rounding` that knows how to
handle offsets and delegates to other rounding implementations. That
implementation doesn't fully implement the `Rounding` contract, namely
`nextRoundingValue`. That method isn't used by composite aggs so I can't
be sure that any implementation that I add will be correct. I propose to
leave it throwing `UnsupportedOperationException` until I need it.
Closes#48757
If `geo_point fields` are multi-valued, using `geo_centroid` as a
sub-agg to `geohash_grid` could result in centroids outside of bucket
boundaries.
This adds a related warning to the geo_centroid agg docs.
* Docs: Refine note about `after_key`
I was curious about composite aggregations, specifically I wanted to
know how to write a composite aggregation that had all of its buckets
filtered out so you *had* to use the `after_key`. Then I saw that we've
declared composite aggregations not to work with pipelines in #44180. So
I'm not sure you *can* do that any more. Which makes the note about
`after_key` inaccurate. This rejiggers that section of the docs a little
so it is more obvious that you send the `after_key` back to us. And so
it is more obvious that you should *only* use the `after_key` that we
give you rather than try to work it out for yourself.
* Apply suggestions from code review
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Percentile aggregations are non-deterministic. A percentile aggregation
can produce different results even when using the same data.
Based on [this discuss post][0], the non-deterministic property stems
from processes in Lucene that can affect the order in which docs are
provided to the aggregation.
This adds a warning stating that the aggregation is non-deterministic
and what that means.
[0]: https://discuss.elastic.co/t/different-results-for-same-query/111757
Co-authored-by: Daniel Huang <danielhuang@tencent.com>
This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases:
* Multi-valued source, in such case early termination is not possible.
* missing_bucket is set to true
The example snippets in the percentile rank agg docs use a test dataset
named `latency`, which is generated from docs/gradle.build.
At some point the dataset and example snippets were updated, but the
text surrounding the snippets was not. This means the text and the
example snippets shown no longer match up.
This corrects that by changing the snippets using /TESTRESPONSE magic comments.
This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following:
min_length: The length of the shortest term
max_length: The length of the longest term
avg_length: The average length of all terms
distribution: The probability distribution of all characters appearing in all terms
entropy: The total Shannon entropy value calculated for all terms
This aggregation has been implemented as an analytics plugin.
This commit removes types entirely from BulkRequest, both as a global
parameter and as individual entries on update/index/delete lines.
Relates to #41059
* Minor improvement to the nested aggregation docs
* The attributes name and resellers.name were rather confusing,
especially since the first one was dynamically mapped and not shown
in the documentation (you had to read the test to see it). This
change introduces a unique name for the nested attribute and adds
the example document to the documentation.
* Change the index name from "index" to something more speaking.
* Update docs/reference/aggregations/bucket/nested-aggregation.asciidoc
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
* Update docs/reference/aggregations/bucket/nested-aggregation.asciidoc
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
* Update docs/reference/aggregations/bucket/nested-aggregation.asciidoc
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
This adds a pipeline aggregation that calculates the cumulative
cardinality of a field. It does this by iteratively merging in the
HLL sketch from consecutive buckets and emitting the cardinality up
to that point.
This is useful for things like finding the total "new" users that have
visited a website (as opposed to "repeat" visitors).
This is a Basic+ aggregation and adds a new Data Science plugin
to house it and future advanced analytics/data science aggregations.
This adjusts the `buckets_path` parser so that pipeline aggs can
select specific buckets (via their bucket keys) instead of fetching
the entire set of buckets. This is useful for bucket_script in
particular, which might want specific buckets for calculations.
It's possible to workaround this with `filter` aggs, but the workaround
is hacky and probably less performant.
- Adjusts documentation
- Adds a barebones AggregatorTestCase for bucket_script
- Tweaks AggTestCase to use getMockScriptService() for reductions and
pipelines. Previously pipelines could just pass in a script service
for testing, but this didnt work for regular aggs. The new
getMockScriptService() method fixes that issue, but needs to be used
for pipelines too. This had a knock-on effect of touching MovFn,
AvgBucket and ScriptedMetric
Introduce shift field to MovingFunction aggregation.
By default, shift = 0. Behavior, in this case, is the same as before.
Increasing shift by 1 moves starting window position by 1 to the right.
To simply include current bucket to the window, use shift = 1
For center alignment (n/2 values before and after the current bucket), use shift = window / 2
For right alignment (n values after the current bucket), use shift = window.
This adds a `rare_terms` aggregation. It is an aggregation designed
to identify the long-tail of keywords, e.g. terms that are "rare" or
have low doc counts.
This aggregation is designed to be more memory efficient than the
alternative, which is setting a terms aggregation to size: LONG_MAX
(or worse, ordering a terms agg by count ascending, which has
unbounded error).
This aggregation works by maintaining a map of terms that have
been seen. A counter associated with each value is incremented
when we see the term again. If the counter surpasses a predefined
threshold, the term is removed from the map and inserted into a cuckoo
filter. If a future term is found in the cuckoo filter we assume it
was previously removed from the map and is "common".
The map keys are the "rare" terms after collection is done.
The date_histogram accepts an interval which can be either a calendar
interval (DST-aware, leap seconds, arbitrary length of months, etc) or
fixed interval (strict multiples of SI units). Unfortunately this is inferred
by first trying to parse as a calendar interval, then falling back to fixed
if that fails.
This leads to confusing arrangement where `1d` == calendar, but
`2d` == fixed. And if you want a day of fixed time, you have to
specify `24h` (e.g. the next smallest unit). This arrangement is very
error-prone for users.
This PR adds `calendar_interval` and `fixed_interval` parameters to any
code that uses intervals (date_histogram, rollup, composite, datafeed, etc).
Calendar only accepts calendar intervals, fixed accepts any combination of
units (meaning `1d` can be used to specify `24h` in fixed time), and both
are mutually exclusive.
The old interval behavior is deprecated and will throw a deprecation warning.
It is also mutually exclusive with the two new parameters. In the future the
old dual-purpose interval will be removed.
The change applies to both REST and java clients.
Adds some validation to prevent duplicate source names from being
used in the composite agg.
Also refactored to use a ConstructingObjectParser and removed the
private ctor and setter for sources, making it mandatory.
This section should be at the same sub-level as other sections in the
auto date-histogram docs, otherwise it is rendered on to another page
and is confusing for users to understand what it's in reference to.
This helps avoid memory issues when computing deep sub-aggregations. Because it
should be rare to use sub-aggregations with significant terms, we opted to always
choose breadth first as opposed to exposing a `collect_mode` option.
Closes#28652.
Implements `geotile_grid` aggregation
This patch refactors previous implementation https://github.com/elastic/elasticsearch/pull/30240
This code uses the same base classes as `geohash_grid` agg, but uses a different hashing
algorithm to allow zoom consistency. Each grid bucket is aligned to Web Mercator tiles.
* Update the top-level 'getting started' guide.
* Remove custom types from the painless getting started documentation.
* Fix an incorrect references to '_doc' in the cardinality query docs.
* Update the _update docs to use the typeless API format.
This changes adds the support to handle `nested` fields in the `composite`
aggregation. A `nested` aggregation can be used as parent of a `composite`
aggregation in order to target `nested` fields in the `sources`.
Closes#28611
Users may require the sequence number and primary terms to perform optimistic concurrency control operations. Currently, you can get the sequence number via the `docvalues_fields` API but the primary term is not accessible because it is maintained by the `SeqNoFieldMapper` and the infrastructure can't find it.
This commit adds a dedicated sub fetch phase to return both numbers that is connected to a new `seq_no_primary_term` parameter.
The "include_type_name" parameter was temporarily introduced in #37285 to facilitate
moving the default parameter setting to "false" in many places in the documentation
code snippets. Most of the places can simply be reverted without causing errors.
In this change I looked for asciidoc files that contained the
"include_type_name=true" addition when creating new indices but didn't look
likey they made use of the "_doc" type for mappings. This is mostly the case
e.g. in the analysis docs where index creating often only contains settings. I
manually corrected the use of types in some places where the docs still used an
explicit type name and not the dummy "_doc" type.
* Default include_type_name to false for get and put mappings.
* Default include_type_name to false for get field mappings.
* Add a constant for the default include_type_name value.
* Default include_type_name to false for get and put index templates.
* Default include_type_name to false for create index.
* Update create index calls in REST documentation to use include_type_name=true.
* Some minor clean-ups around the get index API.
* In REST tests, use include_type_name=true by default for index creation.
* Make sure to use 'expression == false'.
* Clarify the different IndexTemplateMetaData toXContent methods.
* Fix FullClusterRestartIT#testSnapshotRestore.
* Fix the ml_anomalies_default_mappings test.
* Fix GetFieldMappingsResponseTests and GetIndexTemplateResponseTests.
We make sure to specify include_type_name=true during xContent parsing,
so we continue to test the legacy typed responses. XContent generation
for the typeless responses is currently only covered by REST tests,
but we will be adding unit test coverage for these as we implement
each typeless API in the Java HLRC.
This commit also refactors GetMappingsResponse to follow the same appraoch
as the other mappings-related responses, where we read include_type_name
out of the xContent params, instead of creating a second toXContent method.
This gives better consistency in the response parsing code.
* Fix more REST tests.
* Improve some wording in the create index documentation.
* Add a note about types removal in the create index docs.
* Fix SmokeTestMonitoringWithSecurityIT#testHTTPExporterWithSSL.
* Make sure to mention include_type_name in the REST docs for affected APIs.
* Make sure to use 'expression == false' in FullClusterRestartIT.
* Mention include_type_name in the REST templates docs.
Adds an example on translating geohashes returned by geohashgrid
agg as bucket keys into geo bounding box filters in elasticsearch as well
as 3rd party applications.
Closes#36413
When executing terms aggregations we set the shard_size, meaning the
number of buckets to collect on each shard, to a value that's higher than
the number of requested buckets, to guarantee some basic level of
precision. We have an optimization in place so that we leave shard_size
set to size whenever we are searching against a single shard, in which
case maximum precision is guaranteed by definition.
Such optimization requires us access to the total number of shards that
the search is executing against. In the context of cross-cluster search,
once we will introduce multiple reduction steps (one per cluster) each
cluster will only know the number of local shards, which is problematic
as we should only optimize if we are searching against a single shard in a
single cluster. It could be that we are searching against one shard per cluster
in which case the current code would optimize number of terms causing
a loss of precision.
While discussing how to address the CCS scenario, we decided that we do
not want to introduce further complexity caused by this single shard
optimization, as it benefits only a minority of cases, especially when
the benefits are not so great.
This commit removes the single shard optimization, meaning that we will
always have heuristic enabled on how many number of buckets to collect
on the shards, even when searching against a single shard.
This will cause more buckets to be collected when searching against a single
shard compared to before. If that becomes a problem for some users, they
can work around that by setting the shard_size equal to the size.
Relates to #32125
This commit changes the format of the `hits.total` in the search response to be an object with
a `value` and a `relation`. The `value` indicates the number of hits that match the query and the
`relation` indicates whether the number is accurate (in which case the relation is equals to `eq`)
or a lower bound of the total (in which case it is equals to `gte`).
This change also adds a parameter called `rest_total_hits_as_int` that can be used in the
search APIs to opt out from this change (retrieve the total hits as a number in the rest response).
Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain
`hits.total` (`track_total_hits: true`). We'll add a way to get a lower bound of the total hits in a
follow up (to allow numbers to be passed to `track_total_hits`).
Relates #33028
`ScriptDocValues#getValues` was added for backwards compatibility but no
longer needed. Scripts using the syntax `doc['foo'].values` when
`doc['foo']` is a list should be using `doc['foo']` instead.
Closes#22919
This commit adds a new single value metric aggregation that calculates
the statistic called median absolute deviation, which is a measure of
variability that works on more types of data than standard deviation
Our calculation of MAD is approximated using t-digests. In the collect
phase, we collect each value visited into a t-digest. In the reduce
phase, we merge all value t-digests, then create a t-digest of
deviations using the first t-digest's median and centroids
When combine_script and reduce_script were made into required
parameters for Scripted Metric aggregations in #33452, the docs were
not updated to reflect that. This marks those parameters as required
in the documentation.
* Replace custom type names with _doc in REST examples.
* Avoid using two mapping types in the percolator docs.
* Rename doc -> _doc in the main repository README.
* Also replace some custom type names in the HLRC docs.
We generate tests from our documentation, including assertions about the
responses returned by a particular API. But sometimes we *can't* assert
that the response is correct because of some defficiency in our tooling.
Previously we marked the response `// NOTCONSOLE` to skip it, but this
is kind of odd because `// NOTCONSOLE` is really to mark snippets that
are json but aren't requests or responses. This introduces a new
construct to skip response assertions:
```
// TESTRESPONSE[skip:reason we skipped this]
```
This commit switches the joda time backcompat in scripting to use
augmentation over ZonedDateTime. The augmentation methods provide
compatibility with the missing methods between joda's DateTime and
java's ZonedDateTime. Due to getDayOfWeek returning an enum in the java
API, ZonedDateTime is wrapped so that the method can return int like the
joda time does. The java time api version is renamed to
getDayOfWeekEnum, which will be kept through 7.x for compatibility while
users switch back to getDayOfWeek once joda compatibility is removed.
The main benefit of the upgrade for users is the search optimization for top scored documents when the total hit count is not needed. However this optimization is not activated in this change, there is another issue opened to discuss how it should be integrated smoothly.
Some comments about the change:
* Tests that can produce negative scores have been adapted but we need to forbid them completely: #33309Closes#32899
We used to set `maxScore` to `0` within `TopDocs` in situations where there is really no score as the size was set to `0` and scores were not even tracked. In such scenarios, `Float.Nan` is more appropriate, which gets converted to `max_score: null` on the REST layer. That's also more consistent with lucene which set `maxScore` to `Float.Nan` when merging empty `TopDocs` (see `TopDocs#merge`).
This commit adds a boolean system property, `es.scripting.use_java_time`,
which controls the concrete return type used by doc values within
scripts. The return type of accessing doc values for a date field is
changed to Object, essentially duck typing the type to allow
co-existence during the transition from joda time to java time.
Adds a new single-value metrics aggregation that computes the weighted
average of numeric values that are extracted from the aggregated
documents. These values can be extracted from specific numeric
fields in the documents.
When calculating a regular average, each datapoint has an equal "weight"; it
contributes equally to the final value. In contrast, weighted averages
scale each datapoint differently. The amount that each datapoint contributes
to the final value is extracted from the document, or provided by a script.
As a formula, a weighted average is the `∑(value * weight) / ∑(weight)`
A regular average can be thought of as a weighted average where every value has
an implicit weight of `1`.
Closes#15731
* Adds a new auto-interval date histogram
This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents in buckets at second interval, when it has created more than the target number of buckets it merges these buckets into minute interval bucket and continues collecting until it reaches the target number of buckets again. It will keep merging buckets when it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.
This aggregation intentionally does not support min_doc_count, offest and extended_bounds to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero meaning that we will insert empty buckets where necessary.
Closes#9572
* Adds documentation
* Added sub aggregator test
* Fixes failing docs test
* Brings branch up to date with master changes
* trying to get tests to pass again
* Fixes multiBucketConsumer accounting
* Collects more buckets than needed on shards
This gives us more options at reduce time in terms of how we do the
final merge of the buckeets to produce the final result
* Revert "Collects more buckets than needed on shards"
This reverts commit 993c782d11.
* Adds ability to merge within a rounding
* Fixes nonn-timezone doc test failure
* Fix time zone tests
* iterates on tests
* Adds test case and documentation changes
Added some notes in the documentation about the intervals that can bbe
returned.
Also added a test case that utilises the merging of conseecutive buckets
* Fixes performance bug
The bug meant that getAppropriate rounding look a huge amount of time
if the range of the data was large but also sparsely populated. In
these situations the rounding would be very low so iterating through
the rounding values from the min key to the max keey look a long time
(~120 seconds in one test).
The solution is to add a rough estimate first which chooses the
rounding based just on the long values of the min and max keeys alone
but selects the rounding one lower than the one it thinks is
appropriate so the accurate method can choose the final rounding taking
into account the fact that intervals are not always fixed length.
Thee commit also adds more tests
* Changes to only do complex reduction on final reduce
* merge latest with master
* correct tests and add a new test case for 10k buckets
* refactor to perform bucket number check in innerBuild
* correctly derive bucket setting, update tests to increase bucket threshold
* fix checkstyle
* address code review comments
* add documentation for default buckets
* fix typo
* Migrate scripted metric aggregation scripts to ScriptContext design #29328
* Rename new script context container class and add clarifying comments to remaining references to params._agg(s)
* Misc cleanup: make mock metric agg script inner classes static
* Move _score to an accessor rather than an arg for scripted metric agg scripts
This causes the score to be evaluated only when it's used.
* Documentation changes for params._agg -> agg
* Migration doc addition for scripted metric aggs _agg object change
* Rename "agg" Scripted Metric Aggregation script context variable to "state"
* Rename a private base class from ...Agg to ...State that I missed in my last commit
* Clean up imports after merge
This change adds a new option to the composite aggregation named `missing_bucket`.
This option can be set by source and dictates whether documents without a value for the
source should be ignored. When set to true, documents without a value for a field emits
an explicit `null` value which is then added in the composite bucket.
The `missing` option that allows to set an explicit value (instead of `null`) is deprecated in this change and will be removed in a follow up (only in 7.x).
This commit also changes how the big arrays are allocated, instead of reserving
the provided `size` for all sources they are created with a small intial size and they grow
depending on the number of buckets created by the aggregation:
Closes#29380
This pipeline aggregation gives the user the ability to script functions that "move" across a window
of data, instead of single data points. It is the scripted version of MovingAvg pipeline agg.
Through custom script contexts, we expose a number of convenience methods:
- MovingFunctions.max()
- MovingFunctions.min()
- MovingFunctions.sum()
- MovingFunctions.unweightedAvg()
- MovingFunctions.linearWeightedAvg()
- MovingFunctions.ewma()
- MovingFunctions.holt()
- MovingFunctions.holtWinters()
- MovingFunctions.stdDev()
The user can also define any arbitrary logic via their own scripting, or combine with the above methods.
This commit changes the default out-of-the-box configuration for the
number of shards from five to one. We think this will help address a
common problem of oversharding. For users with time-based indices that
need a different default, this can be managed with index templates. For
users with non-time-based indices that find they need to re-shard with
the split API in place they no longer need to resort only to
reindexing.
Since this has the impact of changing the default number of shards used
in REST tests, we want to ensure that we still have coverage for issues
that could arise from multiple shards. As such, we randomize (rarely)
the default number of shards in REST tests to two. This is managed via a
global index template. However, some tests check the templates that are
in the cluster state during the test. Since this template is randomly
there, we need a way for tests to skip adding the template used to set
the number of shards to two. For this we add the default_shards feature
skip. To avoid having to write our docs in a complicated way because
sometimes they might be behind one shard, and sometimes they might be
behind two shards we apply the default_shards feature skip to all docs
tests. That is, these tests will always run with the default number of
shards (one).