This PR is similar to #46586.
When waiting for no initializing shards we also have to wait for events
when we have more than one node in the cluster. When the primary is
started, there is a short period of time where neither the primary nor
any of the replicas are initializing.
Normally dimension fields are identified by means of a boolean parameter
at mapping time, time_series_dimension. Flattened fields do not have mappings,
other than identifying the top-level field as a flattened field type. Moreover, a boolean
is not enough to identify the top-level field as a dimension, since we would like
users to be able to specify a subset of the fields in the flattened field as dimensions
(not necessarily all of them). For this reason we introduce a new mapping parameter,
time_series_dimensions, which lists the fields, in any order, in the flattened field
that the user wants as dimensions. Field names must not include the root field name;
each name is the relative path from the root down to the leaf field.
We require flattened fields to be indexed and to have doc values, and we disallow usage
of the ignore_above parameter together with time_series_dimensions.
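For illustration, a minimal mapping sketch (the index name, the `labels` field, and the dimension paths are hypothetical; the flattened field keeps its defaults of being indexed with doc values, as required):
```
PUT metrics-k8s
{
  "mappings": {
    "properties": {
      "labels": {
        "type": "flattened",
        "time_series_dimensions": ["pod.name", "namespace"]
      }
    }
  }
}
```
Here `pod.name` and `namespace` are the relative paths of the leaf fields inside `labels` that should act as dimensions.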
This introduces an endpoint to reset the desired balance.
It can be used to start a new computation from the current cluster state
if the computed balance has diverged too far from the actual one.
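A sketch of how a reset might be triggered (the exact path shown here is illustrative, not the authoritative spec):
```
DELETE /_internal/desired_balance
```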
The `.watches` index is a system index, which means that its settings
cannot be modified by the user. This commit adds APIs (`PUT
/_watcher/settings` and `GET /_watcher/settings`) that allow modifying
and retrieving a subset of index settings for the `.watches` index.
The settings that are currently allowed are `index.number_of_replicas`
and `index.auto_expand_replicas`, though more may be added in the
future.
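For example, the new endpoints could be exercised roughly like this (the setting value is illustrative):
```
PUT /_watcher/settings
{
  "index.auto_expand_replicas": "0-4"
}

GET /_watcher/settings
```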
Resolves https://github.com/elastic/elasticsearch/issues/92991
This PR enables downloading packaged models from `ml-models.elastic.co`,
an endpoint provided by Elastic. Elastic-provided models begin with a
`.`, which is a private namespace that does not interfere with user
models (the `.` prefix is disallowed for them). If a user puts a packaged
model, the model gets downloaded automatically. For air-gapped
environments it is possible to load models from a file.
earlier changes: #95175, #95207
A trained model deployment can be started with an optional deployment ID.
Deployment IDs and model IDs are considered to be in the same namespace
and must be unique: a deployment ID cannot be the same as any other deployment
or model ID, unless it is the ID of the model being deployed. When
creating a new model, its ID cannot match any existing model or deployment ID.
Here we add synthetic source support for fields whose type is flattened.
Note that flattened fields and synthetic source have the following limitations,
all arising from the fact that in synthetic source we just see key/value pairs
when reconstructing the original object and have no type information in mappings:
* flattened fields use sorted set doc values of keywords, which means two things:
first we do not allow duplicate values, second we treat all values as keywords
* reconstructing array of objects results in nested objects (no array)
* reconstructing arrays with just one element results in a single-value field since we
have no way to distinguish single-valued from multi-valued fields other than by looking
at the count of values
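A hedged illustration of these limitations (field names and the exact reconstructed shape are illustrative, assuming `labels` is mapped as a flattened field in an index with synthetic source enabled): indexing a document such as
```
PUT my-index/_doc/1
{
  "labels": {
    "tags": ["b", "a", "a"],
    "entries": [ { "id": "1" }, { "id": "2" } ]
  }
}
```
could come back from synthetic source roughly as
```
{
  "labels": {
    "tags": ["a", "b"],
    "entries": {
      "id": ["1", "2"]
    }
  }
}
```
with the duplicate "a" dropped, values sorted as keywords, and the array of objects collapsed into a single nested object.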
This change sets the stability of ent-search APIs to beta and visibility to public.
It also removes the feature flag link, since enabling the module is not considered a feature flag
and the module is enabled by default.
With this PR we introduce CRUD endpoints that update/delete the data lifecycle at the data stream level. When the lifecycle is updated, it will apply at the next DLM run to all the backing indices that are managed by DLM.
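As a sketch of the new endpoints (the data stream name and retention value are illustrative, and these APIs sit behind the DLM feature flag):
```
PUT _data_stream/my-data-stream/_lifecycle
{
  "data_retention": "7d"
}

GET _data_stream/my-data-stream/_lifecycle

DELETE _data_stream/my-data-stream/_lifecycle
```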
Document parsing methods currently throw MapperParsingException. This
isn't very helpful, as it doesn't contain any information about where the parse
error happened - it is designed for parsing mappings, which are realised into
java maps before being examined. This commit introduces a new exception
specifically for document parsing that extends XContentException, so that
it reports the current position of the parser as part of its error message.
Fixes #85083
This adds a new parameter to `knn` that allows filtering nearest neighbor results that are outside a given similarity.
`num_candidates` and `k` are still required as this controls the nearest-neighbor vector search accuracy and exploration. For each shard the query will search `num_candidates` and only keep those that are within the provided `similarity` boundary, and then finally reduce to only the global top `k` as normal.
For example, when using the `l2_norm` indexed similarity value, this could be considered a `radius` post-filter on `knn`.
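A sketch of the option in use (index, field, vector and cutoff values are illustrative):
```
POST my-index/_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.3, 0.1, 1.2],
    "k": 10,
    "num_candidates": 100,
    "similarity": 36
  }
}
```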
relates to: https://github.com/elastic/elasticsearch/issues/84929 && https://github.com/elastic/elasticsearch/pull/93574
This should help us ensure that the desired balance computation is not producing too many shard movements (which could be a sign of an unusual configuration or a bug) that could eventually result in the actual cluster balance diverging far from the desired balance (a separate change is still required to warn/reset if we are in fact far off during the reconciliation step).
This change adds a new rest parameter called `rest_include_named_queries_score` that, when set, includes the score of the named queries that matched the document.
Note that with this change, the score of named queries is always returned when using the transport client. The rest level has the ability to set the format of
the matched_queries section for BWC (kept as is by default).
Closes #65563
* fix: downsampling unmapped text fields
When a field is unmapped, dynamic mapping usually maps it using
a multi-field that has the original field name as a text field
and a keyword sub-field. At downsampling time we skip text fields
and only index the corresponding keyword field in the target index.
As a result, when indexing data into the target index we need to
use the name of the parent (text) field instead of the (keyword)
sub-field in order for indexing to succeed.
Here we derive the name of the parent field by stripping away the
name of the sub-field (whatever appears after the last '.' in the name).
The name of the subfield is still available through `MappedFieldType#name`.
Added mget call to verify the documents being deleted actually got indexed.
And added an assertion to PerThreadIDVersionAndSeqNoLookup to get more information
about the reader if there is no timestamp point values field.
Relates to #93852
We deprecated the _knn_search endpoint with #88828 but we missed deprecating it in the REST spec.
Note that the REST spec parser was not aligned with its json schema in that the deprecated section caused an exception to be thrown. The parser is now updated to accept the deprecated section at the endpoint level.
For managing data streams with DLM we chose to have one cluster setting that will determine the rollover conditions for all data streams. This PR introduces this cluster setting, it exposes it via the 3 existing APIs under the flag `include_defaults` and adjusts DLM to use it. The feature remains behind a feature flag.
* Allow skip-all to work with other ranges
When muting tests that already have version ranges set in the skip
section, it is convenient to remember the previous ranges for when
we later un-mute.
* Mute test that fails 1% of the time (#94239)
* Remember previous versions in test skip.version
We have some YAML tests that would require at least one
replica (search shard) to run with Stateless; since they
wait for green, they explicitly set replicas to 0 (see e.g.
realtime_refresh). Using 2 nodes
by default makes sure we can run those tests without any
changes. IMO, they are pretty important/essential tests.
Relates #94303
Added position time_series_metric:
* start creating position time_series_metric
* Add yaml tests for queries and aggs
* Disallow multi-values for geo_point as ts-metric
* Limit running on older versions, since some parts of the time-series syntax were not supported on all versions
* ScaledFloatFieldMapper does not support POSITION, so we only test it against COUNTER and GAUGE, since it only supports those two metric types
* Expand unit tests and allow parsing of dimension. We expand the tests to cover all cases tested in DoubleFieldMapperTests which also tests the behaviour of setting the dimension to true or false, so we enable parsing that for symmetry, but reject `true` as illegal for geo_point.
* Add unit tests for position metric multi-values
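A minimal mapping sketch for the new metric type (index and field names are hypothetical):
```
PUT metrics-gps
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point",
        "time_series_metric": "position"
      }
    }
  }
}
```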
This fixes yaml tests when -Dbuild.snapshot is false.
The data lifecycle functionality is not enabled unless the feature flag
is configured.
This makes the yaml tests enable the feature flag for non-snapshot builds
(it's always enabled for snapshot builds)
* Update yamlRestTest docs skip.version
The skip.version field supports multiple versions,
and the setup/teardown areas combine with test skip versions
in undocumented ways, so we document them.
* Update rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/README.asciidoc
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
In this PR we introduce the DLM feature flag and the data lifecycle model. The model is added to the composable templates, the index templates (v2 only) and the data streams.
Co-authored-by: Andrei Dan <andrei.dan@elastic.co>
Fields that have the time_series_metric attribute set to counter in non-tsdb indices should use the number value source type instead of the counter value source type. Essentially, these fields are not handled as counters at search time.
Relates to #93539
This PR ensures most search features (scroll, async search, pit, field
caps, msearch, vector tile etc) work with the new RCS model. The main
code change is tested by adapting the common yaml CCS tests to use the
new RCS model to provide a broad test coverage. The tests ensure the new
RCS model works from search's perspective. We could still use more tests
from security's perspective, e.g. DLS/FLS, in separate PRs.
Note:
* EQL yaml test files are not located under `x-pack/plugin`, which makes them hard to reuse. It should be possible to relocate them, but I'll address that separately.
* SQL yaml requires a special transformation to work. I'll also handle that separately.
Today we report node stats by name, but the desired nodes work in terms
of node IDs. This commit adds a mapping between node name and ID to make
the output easier to interpret.
In #93386 we adjusted some YAML tests to allow their execution
on a 2-node cluster where every index has at least 1 replica, but
this caused test failures for single or multi-node clusters.
This pull request reverts the changes that were made. Those tests
will be muted for the 2-node cluster.
Closes #93572, closes #93599
Some core yaml rest tests use an explicit number of replicas when creating indices. I suspect that this is often not needed, and it prevents those tests from running in a 2-node (index & search) cluster.
Most of the impacted tests are search related so I'll use the :Search/Search label.
Relates ES-5253
Support for synthetic source is also added to `unsigned_long` field as part of this change.
This is required because `unsigned_long` field types can be used in tsdb indices, and
this change would otherwise prohibit the usage of this field type.
Closes #92319
This change introduces the configuration option `ignore_missing_component_templates` as discussed in https://github.com/elastic/elasticsearch/issues/92426 The implementation [option 6](https://github.com/elastic/elasticsearch/issues/92426#issuecomment-1372675683) was picked with a slight adjustment meaning no patterns are allowed.
## Implementation
During the creation of an index template, the list of component templates is checked to verify that all referenced component templates exist. This check is extended to skip any component templates that are listed under `ignore_missing_component_templates`. An index template that skips the check for the component template `logs-foo@custom` looks as follows:
```
PUT _index_template/logs-foo
{
  "index_patterns": ["logs-foo-*"],
  "data_stream": { },
  "composed_of": ["logs-foo@package", "logs-foo@custom"],
  "ignore_missing_component_templates": ["logs-foo@custom"],
  "priority": 500
}
```
The component template `logs-foo@package` has to exist before creation. It can be created with:
```
PUT _component_template/logs-foo@package
{
  "template": {
    "mappings": {
      "properties": {
        "host.ip": {
          "type": "ip"
        }
      }
    }
  }
}
```
## Testing
For manual testing, different scenarios can be tested. To simplify testing, the commands are provided in `.http` file format. Before each test run, a clean cluster is expected.
### New behaviour, missing component template
With the new config option, it must be possible to create an index template with a missing component template without getting an error:
```
### Add logs-foo@package component template
PUT http://localhost:9200/_component_template/logs-foo@package
Authorization: Basic elastic password
Content-Type: application/json

{
  "template": {
    "mappings": {
      "properties": {
        "host.name": {
          "type": "keyword"
        }
      }
    }
  }
}

### Add logs-foo index template
PUT http://localhost:9200/_index_template/logs-foo
Authorization: Basic elastic password
Content-Type: application/json

{
  "index_patterns": ["logs-foo-*"],
  "data_stream": { },
  "composed_of": ["logs-foo@package", "logs-foo@custom"],
  "ignore_missing_component_templates": ["logs-foo@custom"],
  "priority": 500
}

### Create data stream
PUT http://localhost:9200/_data_stream/logs-foo-bar
Authorization: Basic elastic password
Content-Type: application/json

### Check if mappings exist
GET http://localhost:9200/logs-foo-bar
Authorization: Basic elastic password
Content-Type: application/json
```
Check that all templates could be created and that the data stream mappings are correct.
### Old behaviour, with all component templates
In the following, a component template is made optional but already exists. Check that it shows up in the mappings:
```
### Add logs-foo@package component template
PUT http://localhost:9200/_component_template/logs-foo@package
Authorization: Basic elastic password
Content-Type: application/json

{
  "template": {
    "mappings": {
      "properties": {
        "host.name": {
          "type": "keyword"
        }
      }
    }
  }
}

### Add logs-foo@custom component template
PUT http://localhost:9200/_component_template/logs-foo@custom
Authorization: Basic elastic password
Content-Type: application/json

{
  "template": {
    "mappings": {
      "properties": {
        "host.ip": {
          "type": "ip"
        }
      }
    }
  }
}

### Add logs-foo index template
PUT http://localhost:9200/_index_template/logs-foo
Authorization: Basic elastic password
Content-Type: application/json

{
  "index_patterns": ["logs-foo-*"],
  "data_stream": { },
  "composed_of": ["logs-foo@package", "logs-foo@custom"],
  "ignore_missing_component_templates": ["logs-foo@custom"],
  "priority": 500
}

### Create data stream
PUT http://localhost:9200/_data_stream/logs-foo-bar
Authorization: Basic elastic password
Content-Type: application/json

### Check if mappings exist
GET http://localhost:9200/logs-foo-bar
Authorization: Basic elastic password
Content-Type: application/json
```
### Check old behaviour
Ensure that the old behaviour still applies when a component template is referenced that is not part of `ignore_missing_component_templates`:
```
### Add logs-foo index template
PUT http://localhost:9200/_index_template/logs-foo
Authorization: Basic elastic password
Content-Type: application/json

{
  "index_patterns": ["logs-foo-*"],
  "data_stream": { },
  "composed_of": ["logs-foo@package", "logs-foo@custom"],
  "ignore_missing_component_templates": ["logs-foo@custom"],
  "priority": 500
}
```
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
It makes sense to allow more than one kNN search clause per individual search request. It may be that different documents have separate vector spaces, or that a single doc is indexed with more than one vector space. In both of these scenarios, users may want to retrieve a result set that takes into account all their indexed vector spaces.
A prime example here would be searching a semantic text embedding along with searching an image embedding.
closes https://github.com/elastic/elasticsearch/issues/91187
This commit adds a new test framework for configuring and orchestrating
test clusters for both Java and YAML REST testing. This will eventually
replace the existing "test-clusters" Gradle plugin and the build-time
cluster orchestration.
This adds a `size` parameter that controls the maximum number of
returned affected resources. The parameter defaults to `1000`, must be
positive, and must be less than `10_000`.
This PR extends the basic Prevalidation API so that in case there are
red non-searchable-snapshot indices in the cluster, we reach out to
the nodes (whose removal is being prevalidated) to find out if they
have a local copy of any red indices.
Closes#87776
If the `_doc_count` field is sparse we were using Lucene incorrectly to
read its values. This fixes how we interact with the iterator to load
the values.
Closes #91731
Currently there is no way to remove user-added annotations when a job is deleted or reset.
This change adds an option - delete_user_annotations - to both the delete and reset job APIs.
The default value is false, to keep the behaviour of these calls as it is currently.
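As a sketch, the option would be passed roughly like this (the job id is hypothetical):
```
DELETE _ml/anomaly_detectors/my-job?delete_user_annotations=true

POST _ml/anomaly_detectors/my-job/_reset?delete_user_annotations=true
```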
This PR adds the first part of the Prevalidate Node Removal API. This
API allows checking whether attempting to remove some node(s) from the
cluster is likely to succeed or not. This check is useful when a node
needs to be removed from a RED cluster without risking losing the last
copy of some RED shards.
In this PR, we only check whether a RED index is a Searchable Snapshot
index or not, in which case the removal of any node is safe as the RED
index is backed by a snapshot.
Relates #87776
Loading of stored fields is currently handled directly in FetchPhase, with
some fairly complex logic examining various bits of the FetchContext to work
out what fields need to be loaded. This is further complicated by synthetic
source, which may have its own stored field requirements.
This commit tries to separate out these concerns a little by adding a new
StoredFieldsSpec record that holds information about which stored fields
need to be loaded. Each FetchSubPhaseProcessor can now report a
StoredFieldsSpec detailing what its requirements are, and these specs can
be merged together, along with requirements from a SourceLoader, to
determine up-front what fields should be loaded by the StoredFieldLoader.
The stored fields themselves are added into the SearchHit by a new
StoredFieldsPhase, which handles alias resolution and value post-processing.
The logic to determine when source should be loaded and
when not, based on the presence of script fields or stored fields, is
moved into FetchContext, which highlights some inconsistencies that
can be fixed in follow-up commits.
This renames the explain Health API parameter to verbose.
We decided to rename explain because verbose is a more established
term in the industry for "opt-in to get more information" and allows for more
flexibility to control what exactly that extra information is (explain is already
pushing the limits of what it semantically represents as it's controlling both
the diagnosis insights and the raw details information)
This PR affects requests that contain a single index name
or a single pattern (wildcard/datemath).
It aims to systematize the handling of the `allow_no_indices`
and `ignore_unavailable` indices options:
* the allow_no_indices option is to be concerned with
wildcards that expand to nothing (or the entire request
expands to nothing)
* the ignore_unavailable option is to be concerned with
explicit names only (not wildcards)
In addition, the behavior of the above options will now be
independent of the number of expressions in a request.
This adds a new parameter to the start trained model deployment API,
namely `priority`. The available settings are `normal` and `low`.
For normal priority deployments the allocations get distributed so that
node processors are never oversubscribed.
Low priority deployments allow users to test model functionality even if there
are no node processors available. They are limited to 1 allocation with a single thread.
In addition, the process is executed at low priority, which limits the amount of
CPU it can use when the CPU is under pressure. The intention of this is to
limit the impact of low priority deployments on normal priority deployments.
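A sketch of starting a low priority deployment (the model id is hypothetical, and passing `priority` as a query parameter is an assumption here):
```
POST _ml/trained_models/my-model/deployment/_start?priority=low
```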
When we rebalance model assignments we now:
1. compute a plan just for normal priority deployments
2. fix the resources used by normal deployments
3. compute a plan just for low priority deployments
4. merge the two plans
Closes #91024
This change adds an element_type as an optional mapping parameter for dense vector fields as
described in #89784. This also adds a byte element_type for dense vector fields that supports storing
dense vectors using only 8-bits per dimension. This is only supported when the mapping parameter
index is set to true.
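A mapping sketch using the new parameter (index, field name, dims and similarity are illustrative):
```
PUT my-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "element_type": "byte",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      }
    }
  }
}
```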
The code follows a similar pattern to our NumberFieldMapper, where we have an enum for
ElementType with methods that DenseVectorFieldType and DenseVectorMapper can delegate to
in order to support each available type (just float and byte for now).
We're going to move all aggregations to the module soon and this saves a
little time in the build by only running the tests one time - in the
aggregations module.
Run the aggregations tests v7 compat tests against the aggregations
module and *not* the `rest-api-spec` module. This allows us to drop
`rest-api-spec`'s dependency on the aggregations module and keep it
"just the server" which is nice.
There are a few side effects here that are ok:
1. We run all aggregations REST tests in the aggregations module.
Even the ones in `rest-api-spec`. This means we run them twice. We
plan to move all of the aggregations REST tests into the aggregations
module anyway.
2. We now bundle the REST tests in the aggregations module into the
tests that the clients run for their verification step. This should
keep our clients from losing coverage.
We fail downsampling if field level security or document level security
restrict access to fields and/or documents in the source index.
This is done mainly to prevent situations where a user who is not allowed to
read documents and/or fields in the source index is (by mistake) allowed
access to documents and/or fields in the target index that they would
normally not be allowed to access.
We also add YAML test for the following four scenarios:
1. Downsample operation executed by a non-admin user
2. Downsample operation executed by an admin with field level security
3. Downsample operation executed by an admin with document level security
4. Downsample operation executed by an admin without field or document level security
This commit adds a new field, write_load, into the shard stats. This new stat exposes the average number of write threads used while indexing documents.
Closes #90102
Adds a `{index}/_semantic_search` endpoint which first converts the query text into a dense vector
using an NLP text embedding model, then performs a knn search against an index containing
dense vectors created with the same embedding model.
This change also moves adjacency_matrix aggregation to its own package.
Note that this PR also moves test code not related to auto date
histogram. I think this is cleaner than leaving some tests in an
undesired state between PRs. Also, the test code that has been moved is
slated to be moved to the aggregations module. I suspect that
future changes, like for example moving the `terms` agg, will require other
aggregations to be moved as well (e.g. `significant_terms`), since a
lot of code is reused.
Relates to #90283
This commit adds a new API that users can call:
```
POST _ml/trained_models/{model_id}/deployment/_update
{
  "number_of_allocations": 4
}
```
This allows a user to update the number of allocations for a deployment
that is `started`.
If the allocations are increased we rebalance and let the assignment
planner find how to allocate the additional allocations.
If the allocations are decreased we cannot use the assignment planner.
Instead, we implement the reduction in a new class `AllocationReducer`
that tries to reduce the allocations so that:
1. availability zone balance is maintained
2. assignments that can be completely stopped are preferred to release memory
#90458 has been backported to all branches so bwc testing can be enabled for this.
The tests were incorrectly relying on sort order; added a sort to make it deterministic.
Closes #90668
The health API reports the affected resources in case of an unhealthy
deployment. Until now all indicators reported one type of resource per
diagnosis (index, ILM policy, snapshot repository).
With the introduction of the disk indicator we now have an indicator
that reports multiple types of resources under the same diagnosis (i.e.
nodes and indices).
This changes the structure of the `affected_resources` field to
accommodate multiple types of resources:
```
"affected_resources": {
"nodes": [
{
"id": "e1af6F5rTcmgpExkdOMzCg",
"name": "hot"
},
{
"id": "u_wBVl4ZRne4uZq_ziLsuw",
"name": "warm"
}
],
"indices": [
".geoip_databases",
"test_index"
]
}
```
When we switched to using the FieldExistsQuery (see #88312) instead of the deprecated NormsFieldExistsQuery and
DocValuesFieldExistsQuery, we ended up shortcutting the total hit count for text fields to the doc count retrieved
from the terms enum. This does not take into account empty strings, as that converts to an empty token set for text
fields. In the presence of text fields, we cannot shortcut, and this can be prevented by checking that the field has
doc_values. This was checked before indirectly by checking that the query is a DocValuesFieldExistsQuery.
Closes #89760
In #89693 the rounding logic was only applied when a field was present on a pattern. This is incorrect, as for dates like "2020" we want to default to "2020-01-01T23:59:59.999..." when rounding is enabled.
This commit always applies monthOfYear or dayOfMonth defaulting (when rounding is enabled), except when the dayOfYear is set.
Closes #90187
This commit introduces a new aggregation module
and moves the `adjacency_matrix` to this new module.
The new module name is `aggregations`.
The new module will use the `org.elasticsearch.aggregations.bucket` package for all bucket aggregations.
Relates to #90283
This PR moves the user profile feature and associated APIs from
experimental to stable since higher level features built on top of it
are going to be GA. The feature and APIs are still kept private because
they are meant to internally support higher level features and we don't
expect them to be directly used by end-users.
This PR also moves the security domain feature to GA by removing the
beta label. Security domain requires user configuration to work so it is
not something internally controlled by stack and solutions.
In 8.5 the definition of `number_of_allocations` parameter to the
start trained model deployment API was changed. This commit updates
the REST spec accordingly.
As a result of closing issue #75509, here we test date histogram
aggregations, including auto date histograms and composite aggregations,
running the aggregation on two different indices that have the same
field but with different date types.
This adds synthetic `_source` support for `ip` fields with
`ignore_malformed` set to `true`. We save the field values in a hidden
stored field, just like we do for `ignore_above` keyword fields, and then
read them back at load time.
This adds a test for the `top_hits` aggregation using synthetic
`_source`. It works but let's be a bit paranoid here because it's a
whole new fetch phase.....
This adds some tests for `_source` filtering during `GET` and
`POST _search` when the index uses synthetic `_source`. It works, but
let's be paranoid and have an explicit test just in case.
So that they are visible in NodeIndicesStats only at the node and index (but not shard) levels. Also visible in the _cat/nodes table. And make an exact count yaml REST test.
This expands on the REST layer tests for the `moving_fn` agg asserting
the results of the various moving functions, some failure cases, and
some access edge cases. These tests buy us backwards compatibility tests
and, eventually, forwards compatibility testing.
* Fix merging with empty results
If we try to merge responses with an empty response, we might take the RAW
format from an empty response over a format from a non-empty response.
Closes #84622
* Date histogram range edge case fix
This fixes an illegal argument exception when using conflicting ranges.
Sometimes if the min of the range is high enough above the max, we get an error
in TimeUnitRounding.prepareOffsetOrJavaTimeRounding because our min time is
greater than our max. This appears to have started when we began using query
bounds to bound ranges.
The serialization for segment stats was broken for tsdb because we
return a *slightly* different sort configuration. That caused
`_segments` and `_cat/segments` to break when any shard of the tsdb
index is hosted on another node.
Closes #89609
This PR renames all public APIs for downsampling so that they contain the downsample
keyword instead of the rollup that we had until now.
1. The API endpoint for the downsampling action is renamed to:
/source-index/_downsample/target-index
2. The ILM action is renamed to
```
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "warm": {
        "actions": {
          "downsample": {
            "fixed_interval": "24h"
          }
        }
      }
    }
  }
}
```
3. unsupported_aggregation_on_rollup_index was renamed to unsupported_aggregation_on_downsampled_index
4. Internal transport actions were renamed:
indices:admin/xpack/rollup -> indices:admin/xpack/downsample
indices:admin/xpack/rollup_indexer -> indices:admin/xpack/downsample_indexer
5. Renamed the following index settings:
index.rollup.source.uuid -> index.downsample.source.uuid
index.rollup.source.name -> index.downsample.source.name
index.rollup.status -> index.downsample.status
Finally, we renamed many internal variables and classes from *Rollup* to *Downsample*.
However, this effort will be completed in more than one PR so that we minimize conflicts with other in-flight PRs.
Relates to #74660
The fields loaded to support synthetic `_source` were all coming back in
the `fields` response of `GET` which was confusing. This removes them
from the results unless they are explicitly asked for.
This allows you to use `ignore_above` with `keyword` fields in synthetic
source. Ignored values are stored in a "backup" stored field and added
to the end of the list of results. This makes `ignore_above` work pretty
much the same way as it does when you don't have synthetic source. The
only difference is the order of the results. But synthetic source
changes the order of results anyway. That should be fine.
This change adds the filter query for a filtered alias to the knn query during the dfs phase on the
shard. This ensures the correct number of k results are returned instead of removing results as a post
filter.
Fixes: #89561
This adds some extra paranoid tests for the `meta` parameter in aggs,
specifically the `filters` agg. These tests are at the REST level so
they provide backwards compatibility tests as well. They make sure that
`meta: {}` and `meta: null` do what we expect - return a `meta: {}` and
return an error.
Relates to #89467
This PR updates relevant docs and yaml tests to cover the new feature
of viewing API key's limited-by role descriptors introduced in #89273
Relates: #89058
It's not obvious that a YAML test with a `catch` stanza also permits
`match` blocks to assert things about the structure of the error
response, but this structure may be an important part of the API spec.
This commit adds this info to the docs about YAML tests.
Adds REST tests for the `percentiles_bucket` pipeline bucket
aggregation. This gives us forwards and backwards compatibility tests
for these aggs as well as mixed version cluster tests for these aggs.
Relates to #26220
Adds REST tests for the `cumulative_cardinality` and `cumulative_sum`
pipeline aggregations. This gives us forwards and backwards compatibility
tests for these aggs as well as mixed version cluster tests for these
aggs.
Relates to #26220
Adds support for loading `text` and `keyword` fields that have
`store: true`. We could likely load *any* stored fields, but I
wanted to blaze the trail using something fairly useful.
I broke shard splitting when `_routing` is required and you use `nested`
docs. The mapping would look like this:
```
"mappings": {
"_routing": {
"required": true
},
"properties": {
"n": { "type": "nested" }
}
}
```
If you attempt to split an index with a mapping like this it'll blow up
with an exception like this:
```
Caused by: [idx] org.elasticsearch.action.RoutingMissingException: routing is required for [idx]/[0]
at org.elasticsearch.cluster.routing.IndexRouting$IdAndRoutingOnly.checkRoutingRequired(IndexRouting.java:181)
at org.elasticsearch.cluster.routing.IndexRouting$IdAndRoutingOnly.getShard(IndexRouting.java:175)
```
This fixes the problem by entirely avoiding the branch of code. That
branch was trying to find any top level documents that don't have a
`_routing`. But we *know* that there aren't any top level documents
without a routing in this case - the routing is "required". ES wouldn't
have let you index any top level documents without the routing.
This also adds a small pile of REST layer tests for shard splitting that
hit various branches in this area. For extra paranoia.
Closes #88109
This PR expands the existing GetProfile API to support getting multiple
profiles by IDs. As a result, the response format is also changed to
align with the latest version of the API design guideline. Concretely, this
means moving the profiles into an array inside a top level "profiles"
field so that (1) dynamic fields (uid) are not mixed with static fields
and (2) an order is enforced in the response, which is desirable for
clients.
The change also reports any errors encountered in the retrieval process in
a top level "errors" field.
Relates: #81910
This adds a new `_ml/trained_models/<model_id>/deployment/cache/_clear` API. This will clear the inference cache on every node where the model is allocated.
If a docvalues field matches multiple field patterns, then ES will
return the value of that doc-values field multiple times. Like fetching
fields from source, we should deduplicate the matching doc-values
fields.
We previously removed support for `fields` in the request body, to ensure there
was only one way to specify the parameter. We've now decided to undo the
change, since it was disruptive and the request body is actually the best place to
pass variable-length data like `fields`.
This PR restores support for `fields` in the request body. It throws an error
if the parameter is specified both in the URL and the body.
Closes #86875
To assist the user in configuring the visualizations correctly while leveraging TSDB
functionality, information about TSDB configuration should be exposed via the field
caps API per field.
Especially for metrics fields, it must be clear which fields are metrics and whether they belong
only to time-series indexes or to mixed time-series and non-time-series indexes.
Metric fields should be further distinguished when they belong to any of the following indices:
- Standard (non-time-series) indexes
- Time series indexes
- Downsampled time series indexes
This PR modifies the field caps API so that the mapping parameters time_series_dimension
and time_series_metric are presented only when they are set on fields of time-series indexes.
Those parameters are completely ignored when they are set on standard (non-time-series) indexes.
This PR revisits some of the conventions adopted by #78790
Also add support for new CATALINA/TOMCAT timestamp formats used by ECS Grok patterns
Relates #77065
Co-authored-by: David Roberts <dave.roberts@elastic.co>
This change deprecates the kNN search API in favor of the new 'knn' option
inside the search API. The 'knn' option is now the preferred way of performing
kNN search.
Relates to #87625
This formats the result of the `fields` section of the `_search` API for
runtime `geo_point` fields using the `format` parameter like we do for
non-runtime `geo_point` fields. This changes the default format for
those fields from `lat, lon` to `geojson` with the option to get `wkt`
or any other format we support.
The fix does so by preserving the `double, double` nature of the
`geo_point` rather than encoding it immediately in the script. Callers can
use the results. The field fetchers use the `double, double` natively,
preserving as much precision as possible. The queries quantize the points
exactly like Lucene indexing does, and like the script did before this PR.
Closes #85245
This change adds support for kNN vector fields to the `_disk_usage` API. The
strategy:
* Iterate the vector values (using the same strategy as for doc values) to
estimate the vector data size
* Run some random vector searches to estimate the vector index size
Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Closes #84801
Add the dry_run query parameter to support simulating updates of desired nodes. The update request will be validated, but no cluster state updates will be performed. In order to indicate that the response was the result of a dry run, we add the dry_run field to the JSON representation of the response.
See #82975
This commit removes the notion of components from the health API. They are gone from being
a top-level field in the response, and indicators is promoted into its place.
Remove help_url, rename summary->symptom and user_actions->diagnosis.
Separate the diagnosis `message` field into `cause` and `action`.
Co-authored-by: Mary Gouseti <mgouseti@gmail.com>
This PR adds a new `knn` option to the `_search` API to support ANN search.
It's powered by the same Lucene ANN capabilities as the old `_knn_search`
endpoint. The `knn` option can be combined with other search features like
queries and aggregations.
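A sketch of combining `knn` with a regular query in a single request (index, fields and values are illustrative):
```
POST my-index/_search
{
  "query": {
    "match": { "title": "mountain lake" }
  },
  "knn": {
    "field": "image_vector",
    "query_vector": [0.1, 0.4, 0.2],
    "k": 5,
    "num_candidates": 50
  }
}
```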
Addresses #87625
This adds support for the `cardinality` aggregation within a random_sampler.
This use case is helpful in determining the ratio of unique values compared to the count of total documents within the sampled set.
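A sketch of the combination (index, field and probability are illustrative):
```
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "sampled": {
      "random_sampler": { "probability": 0.1 },
      "aggs": {
        "unique_users": {
          "cardinality": { "field": "user.id" }
        }
      }
    }
  }
}
```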
Propagate alias filters to significance aggs filters
If we have an alias filter, use it as part of the background filter on a
significant terms agg. Previously, alias filters did not apply to background
filters, so this will change bg_count results for some significant terms aggs
using a background filter.
Closes #81585
With https://github.com/elastic/ml-cpp/pull/2305 we now support caching pytorch inference responses per node per model.
By default, the cache will be the same size as the model's size on disk. This is because our current best estimate for memory used (for deploying) is 2*model_size + constant_overhead.
This is due to the model having to be loaded in memory twice when serializing to the native process.
But, once the model is in memory and accepting requests, its actual memory usage is reduced vs. what we have "reserved" for it within the node.
Consequently, having a cache layer that takes advantage of that unused (but reserved) memory is effectively free. When used in production, especially in search scenarios, caching inference results is critical for decreasing latency.
Currently we have two parameters that control how the source of a document
is stored, `enabled` and `synthetic`, both booleans. However, there are only
three possible combinations of these, with `enabled:false` and `synthetic:true`
being disallowed. To make this easier to reason about, this commit replaces
the `enabled` parameter with a new `mode` parameter, which can take the values
`stored`, `synthetic` and `disabled`. The `mode` parameter cannot be set
in combination with `enabled`, and we will subsequently move towards
deprecating `enabled` entirely.
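A mapping sketch of the new parameter (the index name is hypothetical):
```
PUT my-index
{
  "mappings": {
    "_source": {
      "mode": "synthetic"
    }
  }
}
```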
The build_flavor was previously removed since it is no longer relevant;
only the default distribution now exists. However, the removal of build
flavor included removing it from the version information on the info
response for the root path. This API is supposed to be stable, so
removing that key was a compatibility break. This commit adds the
build_flavor back to that API, hardcoded to `default`. Additionally, a
test is added to ensure the key exists going forward, until it can be
properly deprecated.
closes #88318
Plumbs through a new parameter for the cardinality aggregation to allow configuring the execution mode. This can have significant impacts on speed and memory usage. This PR exposes three collection modes and two heuristics that we can tune going forward. All of these are treated as hints and can be silently ignored, e.g. if not applicable to the given field type. I've changed the default behavior to optimize for time, which potentially uses more memory. Users can override this to get the old behavior if needed.
This adds the generation and upload logic of Gradle dependency graphs to snyk
We directly implemented a REST API based snyk plugin because the existing snyk gradle plugin
delegates to the snyk command line tool, and the command line tool
uses custom gradle logic by injecting an init file that is
a) using deprecated build logic which we definitely want to avoid and
b) using gradle api we avoid, like eager task creation.
Shipping this as an internal gradle plugin gives us the most flexibility. As we only want to monitor
production code for now, we apply this plugin as part of the elasticsearch.build plugin;
that usage has been, for now, the de-facto indicator of whether a project is considered a "production" project
that ends up in our distribution or public maven repositories. This isn't yet ideal and we will revisit
the distinction between production and non-production code / projects in a separate effort.
As part of this effort we added the elasticsearch.build plugin to more projects that actually end up
in the distribution. To unblock this, we have for now disabled a few check tasks that started failing once elasticsearch.build was applied.
Addresses #87620
Adds REST layer tests for some sneaky cases in the `avg_bucket`,
`max_bucket`, `min_bucket`, and `sum_bucket` pipeline aggregations.
This gives us forwards and backwards compatibility tests for these
aggs as well as mixed version cluster tests for these aggs.
Relates to #26220
Bootstrap plugins were an internal mechanism added to allow a
filesystem provider for cloud with the quota-aware-fs plugin. Since that
was removed, bootstrap plugins no longer serve a purpose. They were
never officially documented because they were for internal use only.
This commit removes the bootstrap plugins infrastructure.
This PR moves kNN search and dense vector support out of an xpack plugin and
into server.
In #87625 we plan to integrate ANN search into the main `_search` endpoint as a
new top-level component called `knn`. So kNN will be a dedicated part of the
search request, and we'll have kNN logic within the search phases. The classes
and logic will live in server, matching the other search components like
suggesters, field collapsing, etc.
This adds the option to force synthetic source to the MGET API. See
#87068 for more discussion on why you'd want to do that - the short
version is to get an upper bound on the performance cost of using
synthetic source in MGET.
This adds tests to make sure that we use all of the normal synthetic
source machinery, even when loading from the translog. So all GETs on
synthetic source indices will require an in memory index. That'll be an
extra cost on indices that are updated very very frequently.
Adds REST layer tests for the `avg_bucket`, `max_bucket`, `min_bucket`,
and `sum_bucket` pipeline aggregations. This gives us forwards and
backwards compatibility tests for these aggs as well as mixed version
cluster tests for these aggs.
Relates to #26220
The synthetic source highlighting tests would sometimes fail in a
strange way - they expect the entire search request to fail but it
*didn't* - only a single shard would fail. This locks the tests to
always make single shard indices so the failures are consistent.
Closes #87730
Synthetic source has a habit of reordering text fields. This frustrates
highlighting because it *often* wants to use index structures to find
the offsets to values in the field. This disables the FVH highlighter
for multi-valued text fields when synthetic source is enabled and runs
the unified highlighter in "analyze" mode when synthetic source is
enabled. That's *enough* to stop them from spitting out wrong answers.
We might be leaving some performance on the table when the unified
highlighter works on a single valued text field that is indexed with
offsets or term vectors. We don't really expect that to be common at all
though because *generally* folks will enable synthetic source to save
space and adding offsets or term vectors is quite space inefficient. If
it comes up, we might be able to improve here.
Adds measures of the total size of all mappings and the total number of
fields in the cluster (both before and after deduplication).
Relates #86639
Relates #77466
This adds the option to force synthetic source to the GET API. See
#87068 for more discussion on why you'd want to do that - the short
version is to get an upper bound on the performance cost of using
synthetic source in GET.
Fixes a test for forcing synthetic source that sometimes fails if the
index has more than one shard. We're just looking for a sensible failure
message here so we can lock it to one shard.
The root cause here was that missing did not correctly delegate `supportsGlobalOrdinalsMapping` to the wrapped values source, instead falling back to the default. I've added the delegation, and made the base method abstract so this doesn't happen again.
With this change the metric field name becomes optional if the
`buckets_path` is pointing to a multi-value aggregation with a single
metric field. Normally the full path would be required, including
the aggregation name followed by the metric field.
If the metric is not specified in the path and the multi-value
aggregation computes more than one value an error is thrown.
The old notation is still supported for backward compatibility in case
the full path is specified and the target multi-value aggregation
computes a single value.
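A hedged sketch of the shorthand (index, field and aggregation names are illustrative): `percentiles` with a single percent computes one value, so the metric can be omitted from `buckets_path`, whereas previously the full path including the metric was required.
```
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": { "field": "@timestamp", "calendar_interval": "day" },
      "aggs": {
        "load_p99": {
          "percentiles": { "field": "load_time", "percents": [99] }
        }
      }
    },
    "max_daily_p99": {
      "max_bucket": { "buckets_path": "by_day>load_p99" }
    }
  }
}
```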
This adds `?force_synthetic_source` to, well, force running the fetch
phase with synthetic source. If the mapping is incompatible with
synthetic source it'll throw a 400 error.
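A sketch of the parameter in use (the index name is hypothetical):
```
POST my-index/_search?force_synthetic_source=true
{
  "query": { "match_all": {} }
}
```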
This is the first PR for the master stability check, which is part of the health API. It handles the case
when we have seen a master node recently. The more complicated case when we have not seen a
master node recently will be in subsequent PRs.
Back when we introduced the fields parameter to the search API, it could only fetch values from _source, hence
the corresponding sub-fetch phase fails early whenever _source is disabled. Today though runtime fields can
be retrieved from a separate value fetcher that reads from fielddata, and metadata fields can be retrieved
from stored fields. These two scenarios currently throw an unnecessary error whenever _source is disabled.
This commit removes the check for disabled _source, so that runtime fields and metadata fields can be retrieved even when _source is disabled. Fields that need to be loaded from _source are simply skipped whenever _source is disabled, similar to when a field is not found in _source.
Closes #87072
This adds some paranoid tests for synthetic source with disabling
subobjects, as added by #86166. It turns out that synthetic source does
exactly what you'd expect with disabling subobjects - it creates fields
with dots in their names. This adds tests for that.