New api designed for use by apps like Kibana for auto-complete use cases.
A search string is supplied which is used as prefix for matching terms found in a given field in the index.
Supported field types are keyword, constant_keyword and flattened.
A timeout can limit the amount of time spent looking for matches (default 1s) and an `index_filter` query can limit indices e.g. those in the hot or warm tier by querying the `_tier` field
Closes#59137
* [DOCS] Focus retrieving selected fields on fields parameter
* Incorporating changes from reviews
* Adding clarifications from review feedback
* Slight wording revisions.
* Clarify language around format parameter and move text out of callout.
This shrinks a runtime field definition so that it fits on the screen
without scrolling. It also converts the doc into a test so we can be
sure it continues to work.
Relates to #69291
This replaces the `script` docs for bucket aggregations with runtime
fields. We expect runtime fields to be nicer to work with because you
can also fetch them or filter on them. We expect them to be faster
because their don't need this sort of `instanceof` tree:
a92a647b9f/server/src/main/java/org/elasticsearch/search/aggregations/support/values/ScriptDoubleValues.java (L42)
Relates to #69291
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Co-authored-by: Adam Locke <adam.locke@elastic.co>
This change exposes for each field in the _field_caps response if the field is a metadata field.
This is needed for consumers of this API that want to filter these fields. Currently ML keeps a static list
and QL checks that the family type starts with `_`. In order to ease the addition of new metadata fields, this
change reworks the strategy in this solution and now only checks for the new flag.
Note that the new flag is also applied at the coordinator level in a best-effort to apply the logic on older nodes
in a mixed-version cluster.
* [DOCS] Clarify supported features for CCS.
* Clarify text and add subsection with title.
* Moving APIs to supported API section and paring down text.
If a search after request targets multiple indices and some of its sort
field has type `date` in one index but `date_nanos` in other indices,
then Elasticsearch won't interpret the search_after parameter correctly
in every target index. The sort value of a date field by default is a
long of milliseconds since the epoch while a date_nanos field is a long
of nanoseconds.
This commit introduces the `format` parameter in the sort field so a
sort value of a date or date_nanos will be formatted using a date format
in a search response.
The below example illustrates how to use this new parameter.
```js
{
"query": {
"match_all": {}
},
"sort": [
{
"timestamp": {
"order": "asc",
"format": "strict_date_optional_time_nanos"
}
}
]
}
```
```js
{
"query": {
"match_all": {}
},
"sort": [
{
"timestamp": {
"order": "asc",
"format": "strict_date_optional_time_nanos"
}
}
],
"search_after": [
"2015-01-01T12:10:30.123456789Z" // in `strict_date_optional_time_nanos` format
]
}
```
Closes#69192
This PR adds the special `_shard_doc` sort tiebreaker automatically to any
search requests that use a PIT. Adding the tiebreaker ensures that any
sorted query can be paginated consistently within a PIT.
Closes#56828
This change adds a new cluster privilege cancel_task that allows to:
Cancel running tasks (_tasks/_cancel).
Cancel and delete async searches.
Today the 'manage' cluster privilege is required to cancel tasks and
to delete async searches when security features are enabled.
This new focused privilege allows to handle tasks and searches only.
The change also adds the privilege to the internal 'kibana_system'
and '_async_search' roles. They both need to be able to cancel tasks
and delete async searches.
Relates #67965
Add a `max_analyzed_offset` query parameter to allow users
to limit the highlighting of text fields to a value less than or equal to the
`index.highlight.max_analyzed_offset`, thus avoiding an exception when
the length of the text field exceeds the limit. The highlighting still takes place,
but stops at the length defined by the new parameter.
Closes: #52155
Currently runtime fields from search requests don't appear in the output of the
field capabilities API, but some consumer of runtime fields would like to see
runtime section just like they are defined in search requests reflected and
merged into the field capabilities output.
This change adds parsing of a "runtime_mappings" section equivallent to the one
on search requests to the `_field_caps` endpoint, passes this section down to
the shard level where any runtime fields defined here overwrite the mapping of
the targetet indices.
Closes#68117
This partially reverts #64016 and and adds #67839 and adds
additional tests that would have caught issues with the changes
in #64016. It's mostly Nik's code, I am just cleaning things up
a bit.
Co-authored-by: Nik Everett <nik9000@gmail.com>
* Moving examples to the page for retrieving runtime fields.
* Adding runtime_mappings to request body of search API.
* Updating runtime_mappings properties and adding runtime fields to search your data.
* Updating examples and hopefully fixing build failure.
* Fixing snippet formatting that was causing test failure.
* Adding page in Painless guide for runtime fields.
* Fixing typo.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This commit allows returning a correct requested response content-type - it did not work for versioned media types.
It is done by adding new vendor specific instances to XContent and TextFormat enums. These instances can then "format" the response content type string when provided with parameters. This is similar to what SQL plugin does with its media types.
#51816
The option to use templates already defined in the cluster is not explicitly stated in the docs.
This PR adds and example to the `rank_eval` documentation.
* First steps in docs for runtime fields.
* Adding new page for runtime fields.
* Adding page for runtime fields.
* Adding more to the runtime fields topic.
* Adding parameters and retrieval options for runtime fields.
* Adding TESTSETUP for index creation.
* Incorporating review feedback.
* Incorporating reviewer feedback.
* Adding examples for runtime fields.
* Adding more context and simplifying the example.
* Changing timestamp to @timestamp throughout.
* Removing duplicate @timestamp field.
* Expanding example to hopefully fix CI builds.
* Adding skip test for result.
* Adding missing callout.
* Adding TESTRESPONSEs, which are currently broken.
* Fixing TESTRESPONSEs.
* Incorporating review feedback.
* Several clarifications, better test cases, and other changes.
* Adding missing callout in example.
* Adding substitutions to TESTRESPONSE for shorter results shown.
* Shuffling some information and adding link to script-fields.
* Fixing typo.
* Updates for API redesign -- will break builds.
* Updating examples and including info about overriding fields.
* Updating examples.
* Adding info for using runtime fields in the search request.
* Adding that queries against runtime fields are expensive.
* Incorporating feedback from reviewers.
* Minor changes from reviews.
* Adding alias for test case.
* Adding aliases to PUT example.
* Fixing test cases, for real this time.
* Updating use cases and introducing overlay throughout.
* Edits, adding 'shadowing', and explaining shadowing better.
* Streamlining tests and other changes.
* Fix formatting in example for test.
* Apply suggestions from code review
* Incorporating reviewer feedback 7 Dec
* Shifting structure of mapping page to fix cross links.
* Revisions for shadowing, overview, and other sections.
* Removing dot notation section and incorporating review changes.
* Adding updated example for shadowing.
* Streamlining shadowing example and TESTRESPONSEs.
Currently, the 'fields' option only supports fetching mapped fields. Since
'fields' is meant to be the central place to retrieve document content, it
should allow for loading unmapped values. This change adds implementation and
tests for this feature.
Closes#63690
- Replaces more abstract docs about object structure and values source with task-based examples.
- Relocates several sections from the current `misc.asciidoc` file.
- Alphabetically sorts agg categories in the nav.
- Removes the matrix agg family. Moves the stats matrix agg under the metric agg family
Co-authored-by: debadair <debadair@elastic.co>
After #63811 it became clear to me that `postCollect` is kind of
dangerous and not all that useful. So this removes it.
The trouble with `postCollect` is that it all happened right after we
finished calling `collect` on the `LeafBucketCollectors` but before we
built the aggregation results. But in #63811 we found out that we can't
call `postCollect` on the children of `parent` or `child` aggregators
until we know which *which* aggregation results we're building.
So this removes `postCollect` and moves all of the things we did at
post-collect phase into `buildAggregations` or into hooks called in
those methods.
We've identified two important enhancements that may affect the API. We expect
any API changes from these enhancements to be minor, but want to leave open the
possibility for small breaks. For example, we may end up returning unmapped
fields by default, or omitting nested fields from the root hit. The impact to
users should be quite small.
We're tracking the issues we need to resolve before removing the 'beta' label
here: #60985.
This PR adds deprecation warnings when accessing System Indices via the REST layer. At this time, these warnings are only enabled for Snapshot builds by default, to allow projects external to Elasticsearch additional time to adjust their access patterns.
Deprecation warnings will be triggered by all REST requests which access registered System Indices, except for purpose-specific APIs which access System Indices as an implementation detail a few specific APIs which will continue to allow access to system indices by default:
- `GET _cluster/health`
- `GET {index}/_recovery`
- `GET _cluster/allocation/explain`
- `GET _cluster/state`
- `POST _cluster/reroute`
- `GET {index}/_stats`
- `GET {index}/_segments`
- `GET {index}/_shard_stores`
- `GET _cat/[indices,aliases,health,recovery,shards,segments]`
Deprecation warnings for accessing system indices take the form:
```
this request accesses system indices: [.some_system_index], but in a future major version, direct access to system indices will be prevented by default
```
We support `"""` in `console` snippets to emulate kibana's CONSOLE.
CONSOLE also spits out `"""` when a json field contains a new line or a
double quote. This adds support for those sorts of responses to the
handling of `console-response` snippets.
If `track_total_hits=true` is used, the exact value of the number of hits is returned - i.e. the value is effectively limitless, and not the default value of 10,000
Co-authored-by: AndyHunt66 <andrew.hunt@elastic.co>
This adds two extra bits of info to the profiler:
1. Count of the number of different types of collectors. This lets us figure
out if we're using the optimization for segment ordinals. It adds a few
more similar counters just for good measure.
2. Profiles the `getLeafCollector` and `postCollection` methods. These are
non-trivial for some aggregations, like cardinality.
This PR adds support for the 'fields' option in the following places:
* Anytime `inner_hits` is used, for both fetching nested/ child docs and field collapsing
* The `top_hits` aggregation
Addresses #61949.
This commit introduces a new API that manages point-in-times in x-pack
basic. Elasticsearch pit (point in time) is a lightweight view into the
state of the data as it existed when initiated. A search request by
default executes against the most recent point in time. In some cases,
it is preferred to perform multiple search requests using the same point
in time. For example, if refreshes happen between search_after requests,
then the results of those requests might not be consistent as changes
happening between searches are only visible to the more recent point in
time.
A point in time must be opened before being used in search requests. The
`keep_alive` parameter tells Elasticsearch how long it should keep a
point in time around.
```
POST /my_index/_pit?keep_alive=1m
```
The response from the above request includes a `id`, which should be
passed to the `id` of the `pit` parameter of search requests.
```
POST /_search
{
"query": {
"match" : {
"title" : "elasticsearch"
}
},
"pit": {
"id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==",
"keep_alive": "1m"
}
}
```
Point-in-times are automatically closed when the `keep_alive` is
elapsed. However, keeping point-in-times has a cost; hence,
point-in-times should be closed as soon as they are no longer used in
search requests.
```
DELETE /_pit
{
"id" : "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWIBBXV1aWQyAAA="
}
```
#### Notable works in this change:
- Move the search state to the coordinating node: #52741
- Allow searches with a specific reader context: #53989
- Add the ability to acquire readers in IndexShard: #54966
Relates #46523
Relates #26472
Co-authored-by: Jim Ferenczi <jimczi@apache.org>
No-op changes to:
* Move `Search your data` source files into the same directory
* Rename `Search your data` source files based on page ID
* Remove unneeded includes
* Remove the `Request` dir
Changes:
* Removes narrative around URI searches. These aren't commonly used in production. The `q` param is already covered in the search API docs: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-search.html#search-api-query-params-q
* Adds a common options section that highlights narrative docs for query DSL, aggregations, multi-index search, search fields, pagination, sorting, and async search.
* Adds a `Search shard routing` page. Moves narrative docs for adaptive replica selection, preference, routing , and shard limits to that section.
* Moves search timeout and cancellation content to the `Search your data` page.
* Creates a `Search multiple data streams and indices` page. Moves related narrative docs for multi-target syntax searches and `indices_boost` to that page.
* Removes narrative examples for the `search_type` parameters. Moves documentation for this parameter to the search API docs.
Changes:
* Moves `Retrieve selected fields` to its own page and adds a title abbreviation.
* Adds existing script and stored fields content to `Retrieve selected fields`
* Adds a xref for `Retrieve selected fields` to `Search your data`
* Adds related redirects and updates existing xrefs
Plugin discovery documentation contained information about installing
Elasticsearch 2.0 and installing an oracle JDK, both of which is no
longer valid.
While noticing that the instructions used cleartext HTTP to install
packages, this commit replaces HTTPs links instead of HTTP where possible.
In addition a few community links have been removed, as they do not seem
to exist anymore.
This feature adds a new `fields` parameter to the search request, which
consults both the document `_source` and the mappings to fetch fields in a
consistent way. The PR merges the `field-retrieval` feature branch.
Addresses #49028 and #55363.
Moves the search sort docs from the deprecated 'Request Body Search'
page to a new subpage of 'Run a search'.
No substantive changes were made to the content.
Adds a new `my-index-00001` REST test for docs snippets.
This test can serve as a lightweight replacement for
our existing `twitter` REST tests.
The new dataset is:
* Based on Apache logs, which is better aligned with Elastic use cases
* Compliant with ECS
* Similar to the existing `twitter` data set, containing the same field data types
* Lightweight, which should keep existing test runtimes roughly the same
Also updates the search API reference docs to use the new test.
Moves the highlighting docs from the deprecated 'Request Body Search'
chapter to the new subpage of the 'Run a search chapter' section.
No substantive changes were made to the content.
Introduces a new method on `MappedFieldType` to return a family type name which defaults to the field type.
Changes `wildcard` and `constant_keyword` field types to return `keyword` for field capabilities.
Relates to #53175
Changes:
* Adds additional examples to the `Search a data stream` section of
`Use a data stream`
* Updates existing search docs to make them aware of data streams
* Add index filtering in field capabilities API
This change allows to use an `index_filter` in the
field capabilities API. Indices are filtered from
the response if the provided query rewrites to `match_none`
on every shard:
````
GET metrics-*
{
"index_filter": {
"bool": {
"must": [
"range": {
"@timestamp": {
"gt": "2019"
}
}
}
}
}
````
The filtering is done on a best-effort basis, it uses the can match phase
to rewrite queries to `match_none` instead of fully executing the request.
The first shard that can match the filter is used to create the field
capabilities response for the entire index.
Closes#56195
* Adding documentation for near real-time search.
* Adding link to NRT topic and clarifying some text.
* Adding diagrams and incorporating changes from David T.
Changes:
* Condenses and relocates the `docvalue_fields` example to the 'Run a search'
page.
* Adds docs for the `docvalue_fields` request body parameter.
* Updates several related xrefs.
Co-authored-by: debadair <debadair@elastic.co>
Cleans up the reference documentation for the following
search API parameters:
* `_source` query parameter
* `_source_excludes` query parameter
* `_source_includes` query parameter
* `_source` request body parameter
* `hits._source` response property
Moves the source filtering example snippets form the "Request body
search" API docs page to the "Return fields in a search" section of the
"Run a search" page.
This PR adds a section to the new 'run a search' reference that explains
the options for returning fields. Previously each option was only listed as a
separate request parameter and it was hard to know what was available.
Several APIs support options that can be specified as a query parameter or a
request body parameter.
Currently, this is documented using notes, which can get rather lengthy. This
replaces those multiple notes with a single note and a footnote.
Generally we don't advocate for using `stored_fields`, and we're interested in
eventually removing the need for this parameter. So it's best to avoid using
stored fields in our docs examples when it's not actually necessary.
Individual changes:
* Avoid using 'stored_fields' in our docs.
* When defining script fields in top-hits, de-emphasize stored fields.
Today we already disallow negative values for the "from" parameter in the search
API when it is set as a request parameter and setting it on the
SearchSourceBuilder, but it is still parsed without complaint from a search
body, leading to differing exceptions later. This PR changes this behavior to be
the same regardless of setting the value directly, as url parameter or in the
search body. While we silently accepted "-1" as meaning "unset" and used the
default value of 0 so far, any negative from-value is now disallowed.
Closes#54897
This saves memory when running numeric significant terms which are not
at the top level by merging its collection into numeric terms and relying
on the optimization that we made in #55873.
The Elasticsearch-JS client only produces documentation for major
versions (e.g., n.x, 7.x, 6.x).
However, the `{jsclient}` attribute uses the current `{branch}`, which
can result in broken links in minor version docs.
This swaps the `{jsclient}` attribute for `{jsclient-current}`,
which is less likely to break across versions.
Reworks the `from / size` content to `Paginate search results`.
Moves those docs from the request body search API page (slated for
deletion) to the `Run a search` tutorial docs.
Also adds some notes to the `from` and `size` param docs.
Co-authored-by: debadair <debadair@elastic.co>
This adds a few things to the `breakdown` of the profiler:
* `histogram` aggregations now contain `total_buckets` which is the
count of buckets that they collected. This could be useful when
debugging a histogram inside of another bucketing agg that is fairly
selective.
* All bucketing aggs that can delay their sub-aggregations will now add
a list of delayed sub-aggregations. This is useful because we
sometimes have fairly involved logic around which sub-aggregations get
delayed and this will save you from having to guess.
* Aggregtations wrapped in the `MultiBucketAggregatorWrapper` can't
accurately add anything to the breakdown. Instead they the wrapper
adds a marker entry `"multi_bucket_aggregator_wrapper": true` so we
can be quickly pick out such aggregations when debugging.
It also fixes a bug where `_count` breakdown entries were contributing
to the overall `time_in_nanos`. They didn't add a large amount of time
so it is unlikely that this caused a big problem, but I was there.
To support the arbitrary breakdown data this reworks the profiler so
that the `breakdown` can contain any data that is supported by
`StreamOutput#writeGenericValue(Object)` and
`XContentBuilder#value(Object)`.
Changes:
* Moves the document request body parameters for the search API
from the Request body search page to the Search API reference page.
* Relocates a search request body example from the Request body search
page to the Search API reference page.
* Adds a note to any duplicated query and request body parameters.
The search API and URI search pages document the same `_search` API.
This combines the documentation from each page under the search API
docs.
Changes:
* Adds an abbreviated title for the search API page.
* Removes the following invalid query parameters:
* `analyzer`
* `analyze_wildcard`
* `default_operator`
* `df`
* `lenient`
* `suggest_mode`
* `suggest_size`
* Removes the URI search docs page and adds a related redirect.
* Updates the headings of several examples
Co-authored-by: debadair <debadair@elastic.co>
* [DOCS] Create top-level "Search your data" page
**Goal**
Create a top-level search section. This will let us clean up our search
API reference docs, particularly content from [`Request body search`][0].
**Changes**
* Creates a top-level `Search your data` page. This page is designed to
house concept and tutorial docs related to search.
* Creates a `Run a search` page under `Search your data`. For now, This
contains a basic search tutorial. The goal is to add content from
[`Request body search`][0] to this in the future.
* Relocates `Long-running searches` and `Search across clusters` under
`Search your data`. Increments several headings in that content.
* Reorders the top-level TOC to move `Search your data` higher. Also
moves the `Query DSL`, `EQL`, and `SQL access` chapters immediately
after.
Relates to #48194
[0]: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-request-body.html
This has no practical impact on users since frozen indices are the only
throttled indices today. However this has an impact on upcoming features
that would use search throttling.
Filtering out throttled indices made sense a couple years ago, but as
we're now improving support for slow requests with `_async_search` and
exploring ways to reduce storage costs, this feature has most likely
become a trap, that we'd like to not have with upcoming features that
would use search throttling.
Relates #54058
The failing suggester documentation test was expecting specific scores in the
test response, which is fragile implementation details that e.g. can change with
different lucene versions and generally shouldn't be done in documentation test.
Instead we usually replace the float values in the output response by the ones
in the actual response.
Closes#54257
This commit renames wait_for_completion to wait_for_completion_timeout in submit async search and get async search.
Also it renames clean_on_completion to keep_on_completion and turns around its behaviour.
Closes#54069
Submit async search forces pre_filter_shard_size for the underlying search that it creates.
With this commit we also prevent users from overriding such default as part of request validation.
This commit changes the pre_filter_shard_size default from 128 to unspecified.
This allows to apply heuristics based on the request and the target indices when deciding
whether the can match phase should run or not. When unspecified, this pr runs the can match phase
automatically if one of these conditions is met:
* The request targets more than 128 shards.
* The request contains read-only indices.
* The primary sort of the query targets an indexed field.
Users can opt-out from this behavior by setting the `pre_filter_shard_size` to a static value.
Closes#39835
The docs snippets for submit async search have proven difficult to test as it is not possible to guarantee that you get a response that is not final, even when providing `wait_for_completion=0`. In the docs we want to show though a proper long-running query, and its first response should be partial rather than final.
With this commit we adapt the docs snippets to show a partial response, and replace under the hood all that's needed to make the snippets tests succeed when we get a final response. Also, increased the timeout so we always get a final response.
Closes#53887Closes#53891
The goal of the version field was to quickly show when you can expect to find something new in the search response, compared to when nothing has changed. This can also be done by looking at the `_shards` section and `num_reduce_phases` returned with the search response. In fact when there has been one or more additional reduction of the results, you can expect new results in the search response. Otherwise, the `_shards` section could notify of additional failures of shards that have completed the query, but that is not a guarantee that their results will be exposed (only when the following partial reduction is performed their results will be available).
That said this commit clarifies this in the docs and removes the version field from the async search response
This change adds the recall@k metric and refactors precision@k to match
the new metric.
Recall@k is an important metric to use for learning to rank (LTR)
use-cases. Candidate generation or first ranking phase ranking functions
are often optimized for high recall, in order to generate as many
relevant candidates in the top-k as possible for a second phase of
ranking. Adding this metric allows tuning that base query for LTR.
See: https://github.com/elastic/elasticsearch/issues/51676
* Adds an example request to the top of the page.
* Relocates several parameters erroneously listed under "Request body"
to the appropriate "Query parameters" section.
* Updates the "Request body" section to better document the NDJSON
structure of msearch requests.
PR #44238 changed several links related to the Elasticsearch search request body API. This updates several places still using outdated links or anchors.
This will ultimately let us remove some redirects related to those link changes.
File scripts were removed in 6.0 with #24627.
This removes an outdated file scripts reference from the conditional clauses section of the search templates docs.
This PR adds per-field metadata that can be set in the mappings and is later
returned by the field capabilities API. This metadata is completely opaque to
Elasticsearch but may be used by tools that index data in Elasticsearch to
communicate metadata about fields with tools that then search this data. A
typical example that has been requested in the past is the ability to attach
a unit to a numeric field.
In order to not bloat the cluster state, Elasticsearch requires that this
metadata be small:
- keys can't be longer than 20 chars,
- values can only be numbers or strings of no more than 50 chars - no inner
arrays or objects,
- the metadata can't have more than 5 keys in total.
Given that metadata is opaque to Elasticsearch, field capabilities don't try to
do anything smart when merging metadata about multiple indices, the union of
all field metadatas is returned.
Here is how the meta might look like in mappings:
```json
{
"properties": {
"latency": {
"type": "long",
"meta": {
"unit": "ms"
}
}
}
}
```
And then in the field capabilities response:
```json
{
"latency": {
"long": {
"searchable": true,
"aggreggatable": true,
"meta": {
"unit": [ "ms" ]
}
}
}
}
```
When there are no conflicts, values are arrays of size 1, but when there are
conflicts, Elasticsearch includes all unique values in this array, without
giving ways to know which index has which metadata value:
```json
{
"latency": {
"long": {
"searchable": true,
"aggreggatable": true,
"meta": {
"unit": [ "ms", "ns" ]
}
}
}
}
```
Closes#33267
This rewrites long sort as a `DistanceFeatureQuery`, which can
efficiently skip non-competitive blocks and segments of documents.
Depending on the dataset, the speedups can be 2 - 10 times.
The optimization can be disabled with setting the system property
`es.search.rewrite_sort` to `false`.
Optimization is skipped when an index has 50% or more data with
the same value.
Optimization is done through:
1. Rewriting sort as `DistanceFeatureQuery` which can
efficiently skip non-competitive blocks and segments of documents.
2. Sorting segments according to the primary numeric sort field(#44021)
This allows to skip non-competitive segments.
3. Using collector manager.
When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
We use collectorManager, where for every segment a dedicated collector
will be created.
4. Using Lucene's shared TopFieldCollector manager
This collector manager is able to exchange minimum competitive
score between collectors, which allows us to efficiently skip
the whole segments that don't contain competitive scores.
5. When index is force merged to a single segment, #48533 interleaving
old and new segments allows for this optimization as well,
as blocks with non-competitive docs can be skipped.
Closes#37043
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
* Optimize sort on numeric long and date fields (#39770)
Optimize sort on numeric long and date fields, when
the system property `es.search.long_sort_optimized` is true.
* Skip optimization if the index has duplicate data (#43121)
Skip sort optimization if the index has 50% or more data
with the same value.
When index has a lot of docs with the same value, sort
optimization doesn't make sense, as DistanceFeatureQuery
will produce same scores for these docs, and Lucene
will use the second sort to tie-break. This could be slower
than usual sorting.
* Sort leaves on search according to the primary numeric sort field (#44021)
This change pre-sort the index reader leaves (segment) prior to search
when the primary sort is a numeric field eligible to the distance feature
optimization. It also adds a tie breaker on `_doc` to the rewritten sort
in order to bypass the fact that leaves will be collected in a random order.
I ran this patch on the http_logs benchmark and the results are very promising:
```
| 50th percentile latency | desc_sort_timestamp | 220.706 | 136544 | 136324 | ms |
| 90th percentile latency | desc_sort_timestamp | 244.847 | 162084 | 161839 | ms |
| 99th percentile latency | desc_sort_timestamp | 316.627 | 172005 | 171688 | ms |
| 100th percentile latency | desc_sort_timestamp | 335.306 | 173325 | 172989 | ms |
| 50th percentile service time | desc_sort_timestamp | 218.369 | 1968.11 | 1749.74 | ms |
| 90th percentile service time | desc_sort_timestamp | 244.182 | 2447.2 | 2203.02 | ms |
| 99th percentile service time | desc_sort_timestamp | 313.176 | 2950.85 | 2637.67 | ms |
| 100th percentile service time | desc_sort_timestamp | 332.924 | 2959.38 | 2626.45 | ms |
| error rate | desc_sort_timestamp | 0 | 0 | 0 | % |
| Min Throughput | asc_sort_timestamp | 0.801824 | 0.800855 | -0.00097 | ops/s |
| Median Throughput | asc_sort_timestamp | 0.802595 | 0.801104 | -0.00149 | ops/s |
| Max Throughput | asc_sort_timestamp | 0.803282 | 0.801351 | -0.00193 | ops/s |
| 50th percentile latency | asc_sort_timestamp | 220.761 | 824.098 | 603.336 | ms |
| 90th percentile latency | asc_sort_timestamp | 251.741 | 853.984 | 602.243 | ms |
| 99th percentile latency | asc_sort_timestamp | 368.761 | 893.943 | 525.182 | ms |
| 100th percentile latency | asc_sort_timestamp | 431.042 | 908.85 | 477.808 | ms |
| 50th percentile service time | asc_sort_timestamp | 218.547 | 820.757 | 602.211 | ms |
| 90th percentile service time | asc_sort_timestamp | 249.578 | 849.886 | 600.308 | ms |
| 99th percentile service time | asc_sort_timestamp | 366.317 | 888.894 | 522.577 | ms |
| 100th percentile service time | asc_sort_timestamp | 430.952 | 908.401 | 477.45 | ms |
| error rate | asc_sort_timestamp | 0 | 0 | 0 | % |
```
So roughly 10x faster for the descending sort and 2-3x faster in the ascending case. Note
that I indexed the http_logs with a single client in order to simulate real time-based indices
where document are indexed in their timestamp order.
Relates #37043
* Remove nested collector in docs response
As we don't use cancellableCollector anymore, it should be removed from
the expected docs response.
* Use collector manager for search when necessary (#45829)
When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
Thus for such a case, we use collectorManager,
where for every segment a dedicated collector will be created.
* Use shared TopFieldCollector manager
Use shared TopFieldCollector manager for sort optimization.
This collector manager is able to exchange minimum competitive
score between collectors
* Correct calculation of avg value to avoid overflow
* Optimize calculating if index has duplicate data
All document scores are positive 32-bit floating point numbers. However, this
wasn't previously documented.
This can result in surprising behavior, such as precision loss, for users when
customizing scores using the function score query.
This commit updates an existing admonition in the function score query docs to
document the 32-bits precision limit. It also updates the search API reference
docs to note that `_score` is a 32-bit float.
Customers occasionally discover a known behavior in Elasticsearch's pagination that does not appear to be documented. This warning is intended to educate customers of this behavior while still highlighting alternative solutions.
* [DOCS] Add template docs to scripts. Reorder template examples.
* Adds a 'Search template' section to the 'How to use scripts' chapter.
This links to the 'Search template' chapter for detailed info and
examples.
* Reorders and retitles several examples in the 'Search template'
chapter. This is primarily to make examples for storing, deleting, and
using search templates more prominent.
* Change <templatename> to <templateid>
Several files in the REST APIs nav section are included using
:leveloffset: tags. This increments headings (h2 -> h3, h3 -> h4, etc.)
in those files and removes the :leveloffset: tags.
Other supporting changes:
* Alphabetizes top-level REST API nav items.
* Change 'indices APIs' heading to 'index APIs.'
* Changes 'Snapshot lifecycle management' heading to sentence case.
Moves the following API sections under the REST APIs navigations:
- API Conventions
- Document APIs
- Search APIs
- Index APIs (previously named Indices APIs)
- cat APIs
- Cluster APIs
Other supporting changes:
- Removes the previous index APIs page under REST APIs. Adds a redirect for the removed page.
- Removes several [partintro] macros so the docs build correctly.
- Changes anchors for pages that become sections of a parent page.
- Adds several redirects for existing pages that become sections of a parent page.
This commit re-applies changes from #44238. Changes from that PR were reverted due to broken links in several repos. This commit adds redirects for those broken links.
This commit removes the nested_path and nested_filter options deprecated in 6x.
This change also checks that the sort field has a [nested] option if it is under a nested
object and throws an exception if it's not the case.
Closes#27098
the geo-bounding-box and phrase-suggest docs were susceptible to
failing due to other indices in the cluster. This change restricts
the queries to the index that is set up for the test.
relates to #43271.
Adding notes to the existing docs about how using `preference` might increase
request cache utilization but also add warning about the downsides.
Closes#24278
This PR updates the docs for `docvalue_fields` and `stored_fields` to clarify
that nested fields must be accessed through `inner_hits`. It also tweaks the
nested fields documentation to make this point more visible.
Addresses #23766.
Today the `_field_caps` API returns the list of indices where a field
is present only if this field has different types within the requested indices.
However if the request is an index pattern (or an alias, or both...) there
is no way to infer the indices if the response contains only fields that have
the same type in all indices. This commit changes the response to always return
the list of indices in the response. It also adds a way to retrieve unmapped field
in a specific section per field called `unmapped`. This section is created for each field
that is present in some indices but not all if the parameter `include_unmapped` is set to
true in the request (defaults to false).
Currently enabling profiling disables top-hits optimizations, which is
unfortunate: it would be nice to be able to notice the difference in method
counts and timings depending on whether total hit counts are requested.
The explanation given in the completion suggester documentation why we use the
"simple" analyzer as the default is no longer valid. Since we still use "simple"
as the default, we should just delete the explanation that doesn't fit anymore.
Closes#36715
Adds the search_as_you_type field type that acts like a text field optimized
for as-you-type search completion. It creates a couple subfields that analyze
the indexed terms as shingles, against which full terms are queried, and a
prefix subfield that analyze terms as the largest shingle size used and
edge-ngrams, against which partial terms are queried
Adds a match_bool_prefix query type that creates a boolean clause of a term
query for each term except the last, for which a boolean clause with a prefix
query is created.
The match_bool_prefix query is the recommended way of querying a search as you
type field, which will boil down to term queries for each shingle of the input
text on the appropriate shingle field, and the final (possibly partial) term
as a term query on the prefix field. This field type also supports phrase and
phrase prefix queries however
This change adds an option to convert a `date` field to nanoseconds resolution
and a `date_nanos` field to millisecond resolution when sorting.
The resolution of the sort can be set using the `numeric_type` option of the
field sort builder. The conversion is done at the shard level and is restricted
to dates from 1970 to 2262 for the nanoseconds resolution in order to avoid
numeric overflow.