The failing suggester documentation test was expecting specific scores in the
test response. Scores are a fragile implementation detail that can change with
different Lucene versions, for example, and generally shouldn't be asserted in
documentation tests. Instead, we usually replace the float values in the
expected output with the ones from the actual response.
Closes #54257
This commit renames `wait_for_completion` to `wait_for_completion_timeout` in submit async search and get async search.
It also renames `clean_on_completion` to `keep_on_completion` and inverts its behaviour.
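For illustration, a minimal submit async search request using the renamed parameters might look like this (the index name and query are hypothetical):
```
POST /my-index/_async_search?wait_for_completion_timeout=2s&keep_on_completion=true
{
  "query": {
    "match_all": {}
  }
}
```
With `keep_on_completion=true`, the response is stored even if the search completes within the given `wait_for_completion_timeout`.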
Closes #54069
Submit async search forces `pre_filter_shard_size` for the underlying search that it creates.
With this commit we also prevent users from overriding this default as part of request validation.
This commit changes the `pre_filter_shard_size` default from 128 to unspecified.
This allows us to apply heuristics based on the request and the target indices when deciding
whether the can match phase should run. When unspecified, this PR runs the can match phase
automatically if one of these conditions is met:
* The request targets more than 128 shards.
* The request contains read-only indices.
* The primary sort of the query targets an indexed field.
Users can opt out of this behavior by setting `pre_filter_shard_size` to a static value, as in the sketch below.
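For example, a request that pins the previous default (index pattern, field, and query are made up):
```
GET /logs-*/_search?pre_filter_shard_size=128
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1d"
      }
    }
  }
}
```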
Closes #39835
The docs snippets for submit async search have proven difficult to test, as it is not possible to guarantee that you get a response that is not final, even when providing `wait_for_completion=0`. In the docs, though, we want to show a proper long-running query whose first response is partial rather than final.
With this commit we adapt the docs snippets to show a partial response, and replace under the hood whatever is needed to make the snippets tests succeed when we get a final response. We also increased the timeout so that we always get a final response.
Closes #53887
Closes #53891
The goal of the version field was to quickly show when you can expect to find something new in the search response, compared to when nothing has changed. The same can be determined by looking at the `_shards` section and `num_reduce_phases` returned with the search response: whenever one or more additional reductions of the results have been performed, you can expect new results in the search response. Otherwise, the `_shards` section may signal additional failures of shards that have completed the query, but that is not a guarantee that their results will be exposed (their results only become available once the following partial reduction is performed).
This commit clarifies this in the docs and removes the version field from the async search response.
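As a sketch (all field values are made up), a client polling get async search can compare `num_reduce_phases` across responses to detect new results:
```json
{
  "id": "<async_search_id>",
  "is_partial": true,
  "is_running": true,
  "response": {
    "num_reduce_phases": 2,
    "_shards": {
      "total": 10,
      "successful": 4,
      "skipped": 0,
      "failed": 0
    }
  }
}
```
If `num_reduce_phases` has increased since the last poll, at least one additional partial reduction has run and new results are available.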
This change adds the recall@k metric and refactors precision@k to match
the new metric.
Recall@k is an important metric for learning to rank (LTR)
use cases. Candidate generation or first-phase ranking functions
are often optimized for high recall, in order to get as many
relevant candidates into the top k as possible for a second phase of
ranking. Adding this metric allows tuning that base query for LTR.
See: https://github.com/elastic/elasticsearch/issues/51676
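As a sketch of how the new metric can be requested through the ranking evaluation API (index, query, and ratings are hypothetical):
```
GET /my-index/_rank_eval
{
  "requests": [
    {
      "id": "laptop_query",
      "request": {
        "query": { "match": { "title": "laptop" } }
      },
      "ratings": [
        { "_index": "my-index", "_id": "1", "rating": 1 },
        { "_index": "my-index", "_id": "2", "rating": 0 }
      ]
    }
  ],
  "metric": {
    "recall": {
      "k": 10,
      "relevant_rating_threshold": 1
    }
  }
}
```
Documents rated at or above `relevant_rating_threshold` count as relevant; recall@k is then the fraction of all relevant documents that appear in the top k hits.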
* Adds an example request to the top of the page.
* Relocates several parameters erroneously listed under "Request body"
to the appropriate "Query parameters" section.
* Updates the "Request body" section to better document the NDJSON
structure of msearch requests.
PR #44238 changed several links related to the Elasticsearch search request body API. This updates several places still using outdated links or anchors.
This will ultimately let us remove some redirects related to those link changes.
File scripts were removed in 6.0 with #24627.
This removes an outdated file scripts reference from the conditional clauses section of the search templates docs.
This PR adds per-field metadata that can be set in the mappings and is later
returned by the field capabilities API. This metadata is completely opaque to
Elasticsearch but may be used by tools that index data in Elasticsearch to
communicate metadata about fields with tools that then search this data. A
typical example that has been requested in the past is the ability to attach
a unit to a numeric field.
In order to not bloat the cluster state, Elasticsearch requires that this
metadata be small:
- keys can't be longer than 20 chars,
- values can only be numbers or strings of no more than 50 chars - no inner
arrays or objects,
- the metadata can't have more than 5 keys in total.
Given that metadata is opaque to Elasticsearch, field capabilities don't try to
do anything smart when merging metadata from multiple indices; the union of
all field metadata is returned.
Here is what the meta might look like in mappings:
```json
{
  "properties": {
    "latency": {
      "type": "long",
      "meta": {
        "unit": "ms"
      }
    }
  }
}
```
And then in the field capabilities response:
```json
{
  "latency": {
    "long": {
      "searchable": true,
      "aggregatable": true,
      "meta": {
        "unit": [ "ms" ]
      }
    }
  }
}
```
When there are no conflicts, values are arrays of size 1, but when there are
conflicts, Elasticsearch includes all unique values in this array, without
giving a way to know which index has which metadata value:
```json
{
  "latency": {
    "long": {
      "searchable": true,
      "aggregatable": true,
      "meta": {
        "unit": [ "ms", "ns" ]
      }
    }
  }
}
```
Closes #33267
This rewrites a sort on a long field as a `DistanceFeatureQuery`, which can
efficiently skip non-competitive blocks and segments of documents.
Depending on the dataset, the speedup can be 2-10x.
The optimization can be disabled by setting the system property
`es.search.rewrite_sort` to `false`.
The optimization is skipped when an index has 50% or more data with
the same value.
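For illustration, a sketch of the kind of request the rewrite targets, assuming an index whose primary sort is a numeric `@timestamp` date field:
```
GET /http_logs/_search
{
  "size": 10,
  "sort": [
    { "@timestamp": { "order": "desc" } }
  ]
}
```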
The optimization is done through:
1. Rewriting the sort as a `DistanceFeatureQuery`, which can
efficiently skip non-competitive blocks and segments of documents.
2. Sorting segments according to the primary numeric sort field (#44021).
This allows skipping non-competitive segments.
3. Using a collector manager.
When we optimize the sort, we sort segments by their min/max value.
As a collector expects to receive segments in order,
we cannot use a single collector for the sorted segments.
We use a collector manager instead, which creates a dedicated collector
for every segment.
4. Using Lucene's shared TopFieldCollector manager.
This collector manager is able to exchange the minimum competitive
score between collectors, which allows us to efficiently skip
whole segments that don't contain competitive scores.
5. When an index is force-merged to a single segment, interleaving
old and new segments (#48533) allows for this optimization as well,
as blocks with non-competitive docs can be skipped.
Closes #37043
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
* Optimize sort on numeric long and date fields (#39770)
Optimize sort on numeric long and date fields, when
the system property `es.search.long_sort_optimized` is true.
* Skip optimization if the index has duplicate data (#43121)
Skip sort optimization if the index has 50% or more data
with the same value.
When an index has a lot of docs with the same value, sort
optimization doesn't make sense, as `DistanceFeatureQuery`
will produce the same score for these docs, and Lucene
will use the second sort to tie-break. This could be slower
than regular sorting.
* Sort leaves on search according to the primary numeric sort field (#44021)
This change pre-sorts the index reader leaves (segments) prior to the search
when the primary sort is a numeric field eligible for the distance feature
optimization. It also adds a tie-breaker on `_doc` to the rewritten sort
to work around the fact that leaves will be collected in a random order.
I ran this patch on the http_logs benchmark and the results are very promising:
```
| Metric | Task | Baseline | Contender | Diff | Unit |
| 50th percentile latency | desc_sort_timestamp | 220.706 | 136544 | 136324 | ms |
| 90th percentile latency | desc_sort_timestamp | 244.847 | 162084 | 161839 | ms |
| 99th percentile latency | desc_sort_timestamp | 316.627 | 172005 | 171688 | ms |
| 100th percentile latency | desc_sort_timestamp | 335.306 | 173325 | 172989 | ms |
| 50th percentile service time | desc_sort_timestamp | 218.369 | 1968.11 | 1749.74 | ms |
| 90th percentile service time | desc_sort_timestamp | 244.182 | 2447.2 | 2203.02 | ms |
| 99th percentile service time | desc_sort_timestamp | 313.176 | 2950.85 | 2637.67 | ms |
| 100th percentile service time | desc_sort_timestamp | 332.924 | 2959.38 | 2626.45 | ms |
| error rate | desc_sort_timestamp | 0 | 0 | 0 | % |
| Min Throughput | asc_sort_timestamp | 0.801824 | 0.800855 | -0.00097 | ops/s |
| Median Throughput | asc_sort_timestamp | 0.802595 | 0.801104 | -0.00149 | ops/s |
| Max Throughput | asc_sort_timestamp | 0.803282 | 0.801351 | -0.00193 | ops/s |
| 50th percentile latency | asc_sort_timestamp | 220.761 | 824.098 | 603.336 | ms |
| 90th percentile latency | asc_sort_timestamp | 251.741 | 853.984 | 602.243 | ms |
| 99th percentile latency | asc_sort_timestamp | 368.761 | 893.943 | 525.182 | ms |
| 100th percentile latency | asc_sort_timestamp | 431.042 | 908.85 | 477.808 | ms |
| 50th percentile service time | asc_sort_timestamp | 218.547 | 820.757 | 602.211 | ms |
| 90th percentile service time | asc_sort_timestamp | 249.578 | 849.886 | 600.308 | ms |
| 99th percentile service time | asc_sort_timestamp | 366.317 | 888.894 | 522.577 | ms |
| 100th percentile service time | asc_sort_timestamp | 430.952 | 908.401 | 477.45 | ms |
| error rate | asc_sort_timestamp | 0 | 0 | 0 | % |
```
So roughly 10x faster for the descending sort and 2-3x faster in the ascending case. Note
that I indexed the http_logs with a single client in order to simulate real time-based indices
where documents are indexed in timestamp order.
Relates #37043
* Remove nested collector in docs response
As we don't use cancellableCollector anymore, it should be removed from
the expected docs response.
* Use collector manager for search when necessary (#45829)
When we optimize the sort, we sort segments by their min/max value.
As a collector expects to receive segments in order,
we cannot use a single collector for the sorted segments.
Thus, for such a case, we use a collector manager
that creates a dedicated collector for every segment.
* Use shared TopFieldCollector manager
Use shared TopFieldCollector manager for sort optimization.
This collector manager is able to exchange the minimum competitive
score between collectors.
* Correct calculation of avg value to avoid overflow
* Optimize calculating if index has duplicate data
All document scores are positive 32-bit floating point numbers. However, this
wasn't previously documented.
This can result in surprising behavior for users, such as precision loss, when
customizing scores using the function score query.
This commit updates an existing admonition in the function score query docs to
document the 32-bit precision limit. It also updates the search API reference
docs to note that `_score` is a 32-bit float.
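To illustrate the limit: a 32-bit float has a 24-bit significand, so integer scores above 2^24 = 16,777,216 can no longer all be represented exactly. A hypothetical function score query that produces such a score (index name and script are made up):
```
GET /my-index/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "script_score": {
        "script": {
          "source": "16777217.0"
        }
      }
    }
  }
}
```
The `_score` returned for each hit would be `1.6777216E7`: `16777217` falls between two representable floats and is rounded to the nearest one.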