Adds back the `sparse_vector` field type, as a copy of `rank_features`.
The main goal is to have the `sparse_vector` field type available so we
can switch ELSER queries to use the new type.
`dot_product` requires vectors to be unit-length. Previously, we would
check that vectors were unit-length and throw if they were not.
Instead, we will now auto-normalize vectors as they are indexed.
`cosine` will continue to behave as usual, not normalizing the vectors.
closes: https://github.com/elastic/elasticsearch/issues/98935
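To illustrate the auto-normalization described above, here is a minimal sketch in Python. It is not the Elasticsearch implementation; the helper names `normalize` and `dot` are hypothetical, and rejecting a zero-magnitude vector is an assumption of this sketch.

```python
import math


def normalize(vector):
    """Return a unit-length copy of `vector` (hypothetical helper)."""
    magnitude = math.sqrt(sum(x * x for x in vector))
    if magnitude == 0.0:
        # Assumption of this sketch: a zero vector cannot be normalized.
        raise ValueError("cannot normalize a zero-magnitude vector")
    return [x / magnitude for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# A vector that is not unit-length is scaled at "index time",
# so that dot_product can later assume unit-length inputs.
indexed = normalize([3.0, 4.0])  # approximately [0.6, 0.8]
```

With both vectors normalized this way, their dot product equals their cosine similarity, which is why `dot_product` requires unit-length input.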
* Skip segment for MatchNoDocsQuery filters.
When the query of a filter gets rewritten to MatchNoDocsQuery, segments
will not produce any results. We can therefore skip processing such
segments.
This applies to FilterByFilterAggregator and FiltersAggregator, as well
as to TermsAggregator when it uses StringTermsAggregatorFromFilters
internally; the latter is an adapter aggregator to
FilterByFilterAggregator.
Fixes #94637
* Update docs/changelog/98295.yaml
* Check all filters for `MatchNoDocsQuery`.
* Skip optimization when 'other' bucket is requested.
* Revert "Set default index mode for TimeSeries to `null` (#98586)"
This reverts commit 56abb86044.
* Revert "Rollback of #98586 (#98805)"
This reverts commit e370194ac2.
* Skip updating source when missing synthetic mode
* Update docs/changelog/98808.yaml
* Skip matching assert in MapperService too
* Refine the assert
* Extend versions before 8.6, when TS had no synthetic source
* Add source field mapping for non-synthetic TSDB
* Delete 98586.yaml
Duplicate changelog
* Add comment to TSDB_NO_SYNTHETIC mapping
* Spotless fix
* Add yaml test
* Fix version skip in yaml test
It is common in yaml tests to check whether a field exists in the response JSON
(regardless of its value). Today this is done using the `is_true` assertion, which
works only if the value is not "0"; otherwise the assertion fails and needs to be
replaced with either `gte` or `is_false`. This change introduces the `exist`
assertion, which verifies that the field exists regardless of its value.
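The difference between the two assertions can be sketched in Python as follows. This is only a model of the semantics described above, not the yaml test runner itself; the helper `lookup` and the exact set of values that `is_true` rejects are assumptions of this sketch.

```python
_MISSING = object()  # sentinel distinguishing "absent" from a present None/0 value


def lookup(response, path):
    """Walk a dotted path through nested dicts; return _MISSING if a key is absent."""
    current = response
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def is_true(response, path):
    # Assumed rule for this sketch: missing, None, False, 0, "" and "false" all fail.
    value = lookup(response, path)
    return value is not _MISSING and value not in (None, False, 0, "", "false")


def exist(response, path):
    # Passes whenever the field is present, whatever its value.
    return lookup(response, path) is not _MISSING


response = {"stats": {"count": 0, "name": "a"}}
```

Here `is_true(response, "stats.count")` fails because the value is 0, while `exist(response, "stats.count")` passes.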
Report node "roles" in the /_cluster/allocation/explain response.
Nodes with limited sets of roles may affect shard distribution in ways
users did not originally consider, so it is helpful to surface this
information along with node allocation decision explanations.
* Add 'dataset' size to cat indices and cat shards
This adds the `dataset` computed size for the `/_cat/indices` and `/_cat/shards` APIs. This new
column is reported by default.
Resolves #95092
This makes the data stream lifecycle generally available. This will allow
data streams to take advantage of a native simplified and resilient
lifecycle implementation.
* First version
* Spotless, I liked my version better
* Fix param default values
* Add a supplier for default value to ensure it's calculated correctly
* Can't improve this without breaking tests
* Added checks for not specifying a body in PUT requests
* Fix default provider for enum params
* Added yaml test
* Changed docs and fix TODO
* Removing synonyms changes
* Added separate methods for providing default value as suppliers in enums
* Fixed test
* Add a supplier for default value to ensure it's calculated correctly
* Added checks for not specifying a body in PUT requests
* Remove synonyms changes
* Remove some supplier changes
* Better call enumParam with supplier version
* Fix compiler error on supplier
* Apply validators or requires depending on index version
* Solved BWC tests that involved using validators instead of requiresParameters
* Add tests
* Spotless
* Update docs/changelog/98268.yaml
* Update changelog
* Update docs/changelog/98268.yaml
* PR comments
* PR feedback
* Serialize index only for new index versions
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This change adds the total dense vector count to the output of the indices stats.
This is useful for observability in order to track the number of indexed vectors
in a cluster.
---------
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
In this PR we enable all new data streams to be managed by the data
stream lifecycle by default. This is implemented by adding an empty
`lifecycle: {}` upon new data stream creation.
Opting out is represented by the `enabled` flag:
```
{
  "lifecycle": {
    "enabled": false
  }
}
```
This change has the following implications for when an index is managed
and by which feature:

| Parent data stream lifecycle | ILM | `prefer_ilm` | Managed by |
|------------------------------|-----|--------------|------------|
| default | yes | true | ILM |
| default | yes | false | data stream lifecycle |
| default | no | true/false | data stream lifecycle |
| opt-out or missing | yes | true/false | ILM |
| opt-out or missing | no | true/false | unmanaged |

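The table above can be read as a small decision function. The following Python sketch encodes it row by row; the function name `managed_by` and its boolean parameters are hypothetical and only model the table, not the actual Elasticsearch code.

```python
def managed_by(lifecycle_enabled, has_ilm_policy, prefer_ilm):
    """Return which feature manages an index, following the table above.

    lifecycle_enabled: True for the default lifecycle, False for opt-out or missing.
    has_ilm_policy:    True if an ILM policy is configured for the index.
    prefer_ilm:        value of the `prefer_ilm` setting.
    """
    if lifecycle_enabled:
        # ILM only wins when it is both configured and preferred.
        if has_ilm_policy and prefer_ilm:
            return "ILM"
        return "data stream lifecycle"
    # Without a lifecycle, prefer_ilm is irrelevant.
    return "ILM" if has_ilm_policy else "unmanaged"
```

For example, `managed_by(True, True, False)` yields `"data stream lifecycle"`, matching the second row of the table.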
Data streams that have been created before the data stream lifecycle is
enabled will not have the default lifecycle.
Next steps:
- We need to document this when the feature becomes GA
(https://github.com/elastic/elasticsearch/issues/97973).
This PR adapts the unified highlighter to use the Weight#matches mode by default when possible.
This has been the default mode in Lucene for some time now. For cases where the matches mode won't work (nested and parent-child queries),
the matches mode is disabled automatically.
I didn't expose an option to explicitly disable this mode because that should be seen as an internal implementation detail.
With this change, matches that span multiple terms are highlighted together (something that users have requested for years) and the clauses that don't match the document are ignored.
Here we enable aggregations previously not allowed on fields of type counter.
The decision to enable such aggregations, even if the result is "meaningless"
for counters, was taken to favour TSDB adoption.
Aggregations now allowed, other than the existing ones, include:
* avg
* box plot
* cardinality
* extended stats
* median absolute deviation
* percentile ranks
* percentiles
* stats
* sum
* value count
I included tests for the weighted average and matrix stats aggregations too.
Resolves #97882
This PR adds an API similar to #95342 for managing settings
of system indices.
Example calls:
```
GET /_security/settings
PUT /_security/settings
{
  "security": {
    "index.auto_expand_replicas": "0-all"
  },
  "security_tokens": {
    "index.auto_expand_replicas": "0-all"
  },
  "security_profile": {
    "index.auto_expand_replicas": "0-all"
  }
}
```
This adds a new ES|QL endpoint, `_query`, to replace the now deprecated
`_esql`. The latter is still kept for a while, emitting a deprecation
warning.
Fixes ESQL-1379.
We have allowed hybrid search since 8.4. This means the internal changes for sub_searches must be
able to write a compound query as early as 8.4, but currently we only do that back to 8.8. This change
fixes that issue.
Closes #97144
Previously, we tracked max_score in collapse when requested (track_scores=true)
or when there was no sort in collapse (see PR #27122). But this feature
was lost through refactoring and other changes.
This PR restores this feature.
Closes #97653
Fix for #97334, where an incorrect feature name was provided.
Correct more instances of synonyms_feature_flag_enabled for synonyms_api_feature_flag_enabled
Closes #96641, #97177
As described in the issue, the change in #96763
has made the MixedClusterClientYamlTestSuiteIT for mget fail very
often. For now, let's take the same approach that we have for get.
Closes #97236
For snapshot builds we automatically enable all feature flags,
but for release builds they need to be explicitly added to
test clusters for tests.
This PR does so for the synonyms feature.
Closes #96641, #97177
A number of aggregations don't support counter fields,
because their computation doesn't make sense on these fields.
For example, computing an average on a counter doesn't make
sense.
Relates to #93539
`GET _cat/allocation` is a useful way to get a high-level view of the
balance of a cluster, but clusters are only balanced within each data
tier and today this API does not expose node roles. This commit adds an
optional `node.role` column to this API.
Added additional fields to SearchProfileResults for XContent output: node_id, cluster, index, shard_id.
It parses the existing composite ID using the new parseProfileShardId method, which reverses
the SearchShardTarget.toString method.
No new information is added here; the change merely splits out the four pieces of information
in the profile shards "composite" id that is created by the SearchShardTarget.toString method.
Profile/shards output now has the form:
```
"profile": {
  "shards": [
    {
      "id": "[2m7SW9oIRrirdrwirM1mwQ][blogs][0]",
      "node_id": "2m7SW9oIRrirdrwirM1mwQ",
      "shard_id": "0",
      "index": "blogs",
      "cluster": "(local)",
      "searches": [ ... ]
      ...
    },
    {
      "id": "[UngEVXTBQL-7w5j_tftGAQ][remote1:blogs][2]",
      "node_id": "UngEVXTBQL-7w5j_tftGAQ",
      "shard_id": "2",
      "index": "blogs",
      "cluster": "remote1",
      "searches": [ ... ]
      ...
```
where the latter is on a remote cluster and you can see that as the prefix on the index name.
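The splitting of the composite id can be sketched in Python as below. This is an illustration of the idea only, not the Java parseProfileShardId method; the function name `parse_profile_shard_id`, the regular expression, and the assumption that a `:` in the middle section always separates a cluster alias from the index name are all choices made for this sketch.

```python
import re

# Three bracketed sections: [node_id][index or cluster:index][shard number]
_PROFILE_ID = re.compile(r"^\[([^\]]+)\]\[([^\]]+)\]\[(\d+)\]$")


def parse_profile_shard_id(composite_id):
    """Split a profile shard id into node_id, cluster, index and shard_id."""
    match = _PROFILE_ID.match(composite_id)
    if match is None:
        raise ValueError("unexpected profile shard id: " + composite_id)
    node_id, index_part, shard_id = match.groups()
    # Assumption: "cluster:index" marks a remote cluster; otherwise it is local.
    cluster, separator, index = index_part.partition(":")
    if not separator:
        cluster, index = "(local)", index_part
    return {
        "node_id": node_id,
        "shard_id": shard_id,  # kept as a string, as in the output shown above
        "index": index,
        "cluster": cluster,
    }


local = parse_profile_shard_id("[2m7SW9oIRrirdrwirM1mwQ][blogs][0]")
remote = parse_profile_shard_id("[UngEVXTBQL-7w5j_tftGAQ][remote1:blogs][2]")
```

The two calls reproduce the `node_id`, `shard_id`, `index`, and `cluster` values of the two shard entries in the example output.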
Partially addresses #25896
Added yamlRestTest for the new fields in the profile response.
This PR adds a new optional parameter "resource" for ReloadAnalyzersRequest.
If used, only analyzers that use this specific "resource" will be reloaded.
This parameter is undocumented and for internal use only.
PR #96886 introduced auto-reload of analyzers on synonyms index change. The problem
was that reloading was applied broadly for all indices that contained reloadable
analyzers. This PR improves this, so when a particular synonyms set changes,
only analyzers that use this synonyms set will be auto-reloaded. Note that shard
requests will still be sent to all index shards, as only on a shard can we
decide if analyzers need to be reloaded.
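The per-shard decision described above can be sketched in Python. This models only the selection logic; the function `analyzers_to_reload` and the mapping from analyzer name to the resources it uses are hypothetical and do not reflect the actual ReloadAnalyzersRequest handling.

```python
def analyzers_to_reload(analyzer_resources, resource=None):
    """Pick the reloadable analyzers on a shard that should be reloaded.

    analyzer_resources: dict mapping analyzer name -> set of resource names
                        (for example synonyms sets) that the analyzer uses.
    resource:           if None, reload every reloadable analyzer (the broad,
                        earlier behaviour); otherwise only the ones using it.
    """
    if resource is None:
        return sorted(analyzer_resources)
    return sorted(
        name for name, used in analyzer_resources.items() if resource in used
    )


# Hypothetical shard with three reloadable analyzers.
shard_analyzers = {
    "product_synonyms": {"products"},
    "brand_synonyms": {"brands"},
    "combined": {"products", "brands"},
}
```

With this model, `analyzers_to_reload(shard_analyzers, "brands")` selects only `brand_synonyms` and `combined`, while omitting `resource` selects all three.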
Synonym Management API project
On changes of synonyms in a synonym set, auto-reload analyzers.
Note that currently all updateable analyzers will be reloaded, even
those that are not relevant for a synonyms set being updated.
* WIP Started geo_line for TSDB work
Starting with YAML tests (which currently pass) and AggregatorTests
(currently failing, likely due to mistake in the tests)
* Update docs/changelog/94954.yaml
* WIP Refactoring to prepare for TSDB geo_line
* Created TimeSeries version of GeoLineAggregator, and wired it in so that time-series aggregations use it, but current behavior is still identical to non-time-series.
* Added both yaml and unit tests for testing that geo_line works with correct results in both time-series and non-time-series cases.
* Added additional tests to verify the grouping behaviour of time-series vs. terms aggs, and the combination of the two.
* WIP Refactoring to prepare for TSDB geo_line
* Started refactoring to re-use simplifier for all buckets
* Fixed bug with leaf collector not changing per segment
* Fixed bug with leaf collector not detecting bucket changes
The bucket id can change within a segment, so we need to detect this and save the geo_line.
* Renamed class since it no longer extends BucketedSort
The original geo_line relied on the BucketedSort for all intelligence.
The time-series geo_line uses none of that, and does its own memory management.
* Fixed bug with geo_point leaking between geo_line buckets
And enhanced unit tests to cover multiple groups
* Code review updates
* Verify that the sort field is specifically the TS timestamp
Only activate the time-series optimizations if the aggregation is both:
* Within a time-series aggregation (i.e. tsid and @timestamp ordered)
* The geo_line sort field is @timestamp
* Allow geo_point time-series to skip sort config
Also disables the new geo_line for time-series even if the correct
sort and point fields are used if the point field is not explicitly
configured to be a position metric.
* Support geo_centroid and geo_bounds on position metric
* Update yaml tests for multi-terms tests
* Changed to disallow alternative sort-fields in ts-geo_line
Since the primary criteria for switching to the new algorithm is that
geo_line is within a time-series aggregation, we now disallow any other sort field.
We test the negative case in the yaml tests, but changed the unit tests to
use TermsAggregation to mimic the time-series aggregation to get comparable
results.
* For non-time-series check missing sort field early
The old code only threw an error if there was data, because the check was done
inside the leaf collector just before actually reading the sort field.
And there were no tests for a missing sort field.
This commit adds the tests, and performs the check early, so the error is thrown even if data is missing.
* Reviewed TODOs
* Test that behaviour is identical with or without POSITION metric
* Removed fallback code in builder (was switching to old geo_line without POSITION metric)
* Removed two TODO's that are no longer valid concerns
* Add repo throttle metrics to node stats api response
* Update docs/changelog/96678.yaml
* Change x-content output structure
* Fix test after merge from main
* Follow PR comments
* minor fixes
* minor fixes 2
* Introduce new TransportVersion (V_8_500_010)
* Fix yaml test
* Follow PR comments
* Make stats datapoints human readable
* Follow common pattern for human readable output
* Bump up TransportVersion
Add a new target (`script`) to the `/_info` API. It consolidates all the script information from the cluster nodes and returns a summary at the cluster level (compared with `_nodes/stats/script` it lacks the `<node>` dimension).