Commit Graph

1161 Commits

Author SHA1 Message Date
Abdon Pijpelink 84326b1b82
Fixes typo in knn search page (#91898) 2022-11-24 16:35:44 +01:00
David Kyle 751dc244f1
[ML] Rename semantic search input parameter (#91787)
Formerly `query_string` now `model_text`
2022-11-23 13:03:43 +00:00
Olivier Cavadenti 7e9ce33e84
Instrumenting Weight#count in ProfileWeight (#85656)
Start instrumenting Weight#count function in ProfileWeight because we start to use it to compute total hit counts and aggregation counts.

Resolve #85203
2022-11-21 12:32:48 +00:00
Luigi Dell'Aquila 6e5c3d952c
Update docs about EQL CCS (#91542) 2022-11-15 14:07:26 +01:00
István Zoltán Szabó b7c75b9214
[DOCS] Fixes asciidoc syntax in semantic search API docs. (#91588) 2022-11-15 13:51:37 +01:00
István Zoltán Szabó e715f3c737
[DOCS] Adds KNN object sub-properties individually to common params (#91503) 2022-11-10 17:23:55 +01:00
István Zoltán Szabó ed452fb53d
[DOCS] Adds knn object to common parameters (#91464) 2022-11-10 11:21:01 +01:00
Alan Woodward 547c8327b2
Allow FetchSubPhaseProcessors to report their required stored fields (#91269)
Loading of stored fields is currently handled directly in FetchPhase, with
some fairly complex logic examining various bits of the FetchContext to work
out what fields need to be loaded. This is further complicated by synthetic
source, which may have its own stored field requirements.

This commit tries to separate out these concerns a little by adding a new
StoredFieldsSpec record that holds information about which stored fields
need to be loaded. Each FetchSubPhaseProcessor can now report a
StoredFieldsSpec detailing what its requirements are, and these specs can
be merged together, along with requirements from a SourceLoader, to
determine up-front what fields should be loaded by the StoredFieldLoader.
The stored fields themselves are added into the SearchHit by a new
StoredFieldsPhase, which handles alias resolution and value post-
processing. The logic to determine when source should be loaded and
when not, based on the presence of script fields or stored fields, is
moved into FetchContext, which highlights some inconsistencies that
can be fixed in follow-up commits.
2022-11-10 08:40:22 +00:00
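The merging idea in #91269 above can be sketched roughly in Python. The real `StoredFieldsSpec` is a Java record in Elasticsearch; the field names, the `merge` method, and the `merge_specs` helper below are assumptions made for illustration, not the actual API.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class StoredFieldsSpec:
    """Hypothetical stand-in for the record described in the commit."""

    requires_source: bool = False
    stored_fields: frozenset = frozenset()

    def merge(self, other: "StoredFieldsSpec") -> "StoredFieldsSpec":
        # A merged spec needs everything that either side needs.
        return StoredFieldsSpec(
            requires_source=self.requires_source or other.requires_source,
            stored_fields=self.stored_fields | other.stored_fields,
        )


def merge_specs(specs):
    """Fold per-processor specs into one up-front requirement."""
    return reduce(StoredFieldsSpec.merge, specs, StoredFieldsSpec())


# Two illustrative processors: one needs _source, one needs stored fields.
highlighter = StoredFieldsSpec(requires_source=True)
stored = StoredFieldsSpec(stored_fields=frozenset({"title", "_id"}))
combined = merge_specs([highlighter, stored])
```

Because every processor reports its own spec, the loader can decide once, before reading any document, which stored fields to fetch.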
debadair 3cad9f420f
[DOCS] Add 8.6 to CCS table (#91436)
* [DOCS] Add 8.6 to CCS table

* [DOCS] Fixed header
2022-11-09 14:04:18 -08:00
David Kyle b46ee9caaa
[ML] Hybrid retrieval for Semantic search. (#91348)
Adds the query option to the _semantic_search endpoint for hybrid retrieval. 
Scoring is controlled by the boost fields of the knn search and the query.
2022-11-09 13:48:48 +00:00
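A minimal sketch of how boosts could control hybrid scoring in #91348 above, assuming the final score is the boost-weighted sum of the kNN score and the query score. That assumption, the function name, and the values are illustrative; the commit message does not spell out the formula.

```python
def hybrid_score(knn_score, query_score, knn_boost=1.0, query_boost=1.0):
    """Boost-weighted sum of two scores.

    Pass None for a side that did not match the document.
    """
    total = 0.0
    if knn_score is not None:
        total += knn_boost * knn_score
    if query_score is not None:
        total += query_boost * query_score
    return total


# A document matched by both the kNN search and the query.
both = hybrid_score(0.5, 2.0, knn_boost=0.5, query_boost=2.0)
# A document matched by the query only.
query_only = hybrid_score(None, 2.0)
```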
debadair b5dd2cd406
[DOC] Update CCS version matrix (#91371)
* [DOC] Update CCS version matrix

* Add two extra columns to table definition

Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
2022-11-08 08:24:52 -08:00
Lisa Cawley 2d30bbab21
[DOCS] Semantic search endpoint (#91210) 2022-11-01 09:01:55 -07:00
Lisa Cawley f0c12cdeea
[DOCS] Fix typo in knn-search.asciidoc (#91206) 2022-10-31 10:07:53 -07:00
Julie Tibshirani 1b249639f1
Remove experimental marking from kNN search (#91065)
This commit removes the experimental tag from kNN search docs and makes some
docs improvements:
* Add a prominent warning about memory usage in the kNN search guide
* Link to the performance tuning guide from the main guide
* Clarify the memory requirements section in the tuning guide
2022-10-27 18:00:56 +02:00
Stéphane Campinas 8c44ed1442
Fix itemized list (#90855) 2022-10-24 15:14:17 -04:00
Jack Conradson f28ae4b288
Add support for indexing byte-sized knn vectors (#90774)
This change adds an element_type as an optional mapping parameter for dense vector fields as 
described in #89784. This also adds a byte element_type for dense vector fields that supports storing 
dense vectors using only 8-bits per dimension. This is only supported when the mapping parameter 
index is set to true.

The code follows a similar pattern to our NumberFieldMapper where we have an enum for 
ElementType, and it has methods that DenseVectorFieldType and DenseVectorMapper can delegate to 
to support each available type (just float and byte for now).
2022-10-20 14:45:58 -07:00
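The mapping described in #90774 above might look like the following sketch. The field name, the `dims` value, and the choice of `cosine` similarity are assumptions for illustration; the range check reflects that 8 bits per dimension holds a signed byte.

```python
def byte_vector_mapping(field, dims):
    """Mapping body for a dense_vector field that stores 8-bit elements."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "dense_vector",
                    "element_type": "byte",
                    "dims": dims,
                    # The commit says byte is only supported when index is true.
                    "index": True,
                    "similarity": "cosine",  # illustrative choice
                }
            }
        }
    }


def check_byte_vector(vector, dims):
    """Reject vectors whose length or values would not fit a signed byte."""
    if len(vector) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(vector)}")
    for value in vector:
        if not isinstance(value, int) or not -128 <= value <= 127:
            raise ValueError(f"{value!r} does not fit in a signed 8-bit integer")
    return vector


mapping = byte_vector_mapping("my_vector", 3)
```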
David Kyle 9e6a784aa5
[ML] Semantic search endpoint (#90450)
Adds a {index}_semantic_search endpoint which first converts the query text into a dense vector
using an NLP text embedding model, then performs a knn search against an index containing
dense vectors created with the same embedding model.
2022-10-13 13:17:30 +01:00
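The two steps described in #90450 above (embed the query text, then run a kNN search) can be sketched as follows. The bag-of-words `toy_embed` function is a hypothetical stand-in for a real text embedding model, and the brute-force cosine ranking stands in for the actual kNN search; neither reflects the endpoint's implementation.

```python
import math

# Hypothetical fixed vocabulary for the toy embedding.
VOCAB = ["search", "vector", "index", "shard"]


def toy_embed(text):
    """Stand-in for an embedding model: counts vocabulary terms."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_text, docs, k):
    # Step 1: convert the query text into a dense vector.
    query_vector = toy_embed(query_text)
    # Step 2: rank documents embedded with the same model.
    scored = [
        (cosine(query_vector, toy_embed(text)), doc_id)
        for doc_id, text in docs.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


docs = {"a": "vector search", "b": "shard index", "c": "search index vector"}
top = semantic_search("vector search", docs, k=2)
```

The important constraint is that the query and the indexed documents go through the same embedding model, otherwise the vectors are not comparable.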
Jack Conradson 8b0d0716d1
Add profiling and documentation for dfs phase (#90536)
Adds profiling statistics for the dfs phase, and adds documentation for both the dfs phase profiling 
and kNN profiling.

Closes #89713
2022-10-05 09:54:36 -07:00
Ievgen Degtiarenko 24cf87186d
Limit shard realocation retries (#90296)
This change ensures that Elasticsearch does not retry relocating a shard indefinitely if the operation fails.
2022-09-27 14:44:30 +02:00
Julie Tibshirani b1acb3603d
Clarify that knn does not use postfiltering (#89897)
This PR expands the approximate kNN docs to clarify the filter is applied during
the kNN search, not after. It explains the downsides of postfiltering.
2022-09-19 16:47:17 -07:00
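The difference that #89897 above documents can be illustrated with a small sketch. It uses an exact, brute-force search over one-dimensional values as a stand-in for approximate kNN, and all names and data are hypothetical.

```python
def nearest(points, query, k):
    """Exact k nearest neighbours by absolute distance."""
    return sorted(points, key=lambda p: abs(p["value"] - query))[:k]


def knn_with_postfilter(points, query, k, keep):
    # Find k neighbours first, then drop the ones failing the filter.
    # This can return fewer than k results.
    return [p for p in nearest(points, query, k) if keep(p)]


def knn_with_filter(points, query, k, keep):
    # Apply the filter during the search: the k nearest matching documents.
    return nearest([p for p in points if keep(p)], query, k)


points = [
    {"id": 1, "value": 1.0, "color": "red"},
    {"id": 2, "value": 2.0, "color": "blue"},
    {"id": 3, "value": 3.0, "color": "red"},
    {"id": 4, "value": 9.0, "color": "blue"},
]


def is_blue(p):
    return p["color"] == "blue"


post = [p["id"] for p in knn_with_postfilter(points, 1.5, 2, is_blue)]
during = [p["id"] for p in knn_with_filter(points, 1.5, 2, is_blue)]
```

With postfiltering only one of the two requested results survives, while filtering during the search still returns two matching documents.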
Adam Locke 686a3fd45d
[DOCS] Update CCS compatibility matrix for 8.3 (#88906)
* [DOCS] Update CCS compatibility matrix for 8.3

Updates the CCS compatibility table to include 8.3.

* Fixing busted table 🔨

* Update table for 8.3 -> 8.1 support

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2022-09-07 11:53:22 -04:00
Anthony McGlone 492f5b1751
[DOCS] Update search_after section with an example (#89631)
* [DOCS] Update search_after section with an example

* Update docs/reference/search/search-your-data/paginate-search-results.asciidoc

Co-authored-by: Abdon Pijpelink <abdon@abdon.nl>

* Update docs/reference/search/search-your-data/paginate-search-results.asciidoc

Co-authored-by: Abdon Pijpelink <abdon@abdon.nl>

* Update docs/reference/search/search-your-data/paginate-search-results.asciidoc

Co-authored-by: Abdon Pijpelink <abdon@abdon.nl>

* [DOCS] Update search_after section with an example

* [DOCS] Update search_after example with a response with sort values

Co-authored-by: Abdon Pijpelink <abdon@abdon.nl>
2022-09-05 15:33:03 +02:00
Jack Conradson 8c30b86fe2
Fix bug for kNN with filtered aliases (#89621)
This change adds the filter query for a filtered alias to the knn query during the dfs phase on the 
shard. This ensures the correct number of k results are returned instead of removing results as a post 
filter.

Fixes: #89561
2022-08-30 15:57:37 -07:00
Abdon Pijpelink 772784f3c9
[DOCS] Add note that terms enum API may return terms from deleted docs (#89654) 2022-08-29 15:19:04 +02:00
Abdon Pijpelink 27061a530e
Revert "[DOCS] Update search_after section with an example (#89328)" (#89411)
Reverts elastic/elasticsearch#89328
2022-08-17 18:20:15 +09:30
Anthony McGlone af8ac50788
[DOCS] Update search_after section with an example (#89328)
* [DOCS] Update search_after section with an example

* Update docs/reference/search/search-your-data/paginate-search-results.asciidoc

Co-authored-by: Abdon Pijpelink <abdon@abdon.nl>

* Update docs/reference/search/search-your-data/paginate-search-results.asciidoc

Co-authored-by: Abdon Pijpelink <abdon@abdon.nl>

* Update docs/reference/search/search-your-data/paginate-search-results.asciidoc

Co-authored-by: Abdon Pijpelink <abdon@abdon.nl>

Co-authored-by: Abdon Pijpelink <abdon@abdon.nl>
2022-08-17 09:53:14 +02:00
Julie Tibshirani acf9a67480
Document kNN with aggregations (#89359)
This commit adds a short note to the 'search your data' docs around kNN search
to explain how approximate kNN works with aggregations:
* Make section on 'hybrid retrieval' more general and include aggregations info
* Remove an example response from the previous section on filtering, since this
  page was getting long
2022-08-16 15:28:32 -07:00
Christos Soulios b81f4187ab
[TSDB] Metric fields in the field caps API (#88695)
To assist the user in configuring the visualizations correctly while leveraging TSDB
functionality, information about TSDB configuration should be exposed via the field 
caps API per field.

Especially for metrics fields, it must be clear which fields are metrics and if they belong 
to only time-series indexes or mixed time-series and non-time-series indexes.

Metric fields must be further distinguished by whether they belong to any of the following indices:

  -  Standard (non-time-series) indexes
  -  Time series indexes
  -  Downsampled time series indexes

This PR modifies the field caps API so that the mapping parameters time_series_dimension 
and time_series_metric are presented only when they are set on fields of time-series indexes.
Those parameters are completely ignored when they are set on standard (non-time-series) indexes.

This PR revisits some of the conventions adopted by #78790
2022-08-04 20:42:34 +03:00
Abdon Pijpelink b96c39e7ad
[DOCS] Move completion type asciidoc (#89086)
* [DOCS] Move completion type asciidoc

* Fix failing code snippet test
2022-08-04 10:02:28 +02:00
Julie Tibshirani 21eb984e64
Deprecate the _knn_search endpoint (#88828)
This change deprecates the kNN search API in favor of the new 'knn' option
inside the search API. The 'knn' option is now the preferred way of performing
kNN search.

Relates to #87625
2022-08-03 15:19:01 -04:00
Navanit Dubey 9afb01e14e
Update rank-eval.asciidoc (#88771) 2022-07-25 18:00:49 +02:00
Julie Tibshirani e3ede67262
Integrate ANN into _search endpoint (#88694)
This PR adds a new `knn` option to the `_search` API to support ANN search.
It's powered by the same Lucene ANN capabilities as the old `_knn_search`
endpoint. The `knn` option can be combined with other search features like
queries and aggregations.

Addresses #87625
2022-07-22 08:02:07 -07:00
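A request body using the `knn` option from #88694 above could be assembled as in this sketch. The helper name, field name, vector, and query are hypothetical; only the `knn`, `query`, and `aggs` sections of the `_search` body are taken as given.

```python
import json


def knn_search_body(field, query_vector, k, num_candidates, query=None, aggs=None):
    """Build a _search request body with a knn section.

    The optional query and aggs sections show that knn can be combined
    with other search features.
    """
    body = {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }
    if query is not None:
        body["query"] = query
    if aggs is not None:
        body["aggs"] = aggs
    return body


body = knn_search_body(
    "image_vector",  # hypothetical dense_vector field
    [0.1, 0.2, 0.3],
    k=10,
    num_candidates=100,
    query={"match": {"title": "mountain"}},
)
payload = json.dumps(body)
```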
Ignacio Vera 04bdefd58c
Remove Collector implementation from BucketCollector (#88444)
BucketCollector now has a method called #asCollector that returns the current BucketCollector wrapped as a 
Lucene Collector.
2022-07-18 08:18:13 +02:00
Nhat Nguyen 4732fc2343
Implement count for wrapped Weight in ContextIndexSearcher (#88396)
Implements Weight#count() for wrapped Weights that don't change matching documents.

Relates to #88284
2022-07-13 16:57:12 -04:00
David Kilfoyle 40e9f3097c
[DOCS] Add TSDS docs, take two (#87703)
* Revert "Revert "[DOCS] Add TSDS docs (#86905)" (#87702)"

This reverts commit 0c86d7b9b2.

* First fix to tests

* Add data_stream object to index template

* small rewording

* Add enable data stream object in gradle example setup

* Add bullet about data stream must be enabled in template
2022-06-16 12:44:10 -04:00
David Kilfoyle 0c86d7b9b2
Revert "[DOCS] Add TSDS docs (#86905)" (#87702)
Reverts elastic/elasticsearch#86905
2022-06-15 13:32:12 -04:00
David Kilfoyle d57f4ac2c6
[DOCS] Add TSDS docs (#86905)
* [DOCS] Add TSDB docs

* Update docs/build.gradle

Co-authored-by: Adam Locke <adam.locke@elastic.co>

* Address Nik's comments, part 1

* Address Nik's comments, part deux

* Reword write index

* Add feature flags

* Wrap one more section in feature flag

* Small fixes

* set index.routing_path to optional

* Update storage reduction value

* Update create index template code example

Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Co-authored-by: Adam Locke <adam.locke@elastic.co>
2022-06-15 12:22:07 -04:00
Julie Tibshirani fab547bef2
Improve kNN with filtering docs (#87538)
This change tries to make it easier to find kNN with filtering in the docs:
* Mention filtering support in the kNN API description
* In kNN tutorial, link to the kNN search API page more prominently
2022-06-09 10:42:54 -07:00
Luca Cavanna 50793a68a8
Fields API to allow fetching values when _source is disabled (#87267)
Back when we introduced the fields parameter to the search API, it could only fetch values from _source, hence
the corresponding sub-fetch phase fails early whenever _source is disabled. Today though runtime fields can
be retrieved from a separate value fetcher that reads from fielddata, and metadata fields can be retrieved
from stored fields. These two scenarios currently throw an unnecessary error whenever _source is disabled.

This commit removes the check for disabled _source, so that runtime fields and metadata fields can be retrieved even when _source is disabled. Fields that need to be loaded from _source are simply skipped whenever _source is disabled, similar to when a field is not found in _source.

Closes #87072
2022-06-02 11:28:36 +02:00
Craig Taverner 5f7ea792ac
Soft-deprecation of point/geo_point formats (#86835)
* Soft-deprecation of point/geo_point formats

Since GeoJSON and WKT are now common formats for all three types:
  geo_shape, geo_point and point
We decided to soft-deprecate the other point formats by ordering:
* GeoJSON (object with keys `type` and `coordinates`)
* WKT `POINT(x y)`
* Object with keys `lat` and `lon` (or `x` and `y` for point)
* Array [lon,lat]
* String `"lat,lon"` (or `"x,y"` in point)
* String with geohash (only in `geo_point`)

The geohash is last because it is only in one field type.
The string version is second last because it is the most controversial
being the only version to reverse the coordinate order from all other
formats (for geo_point only, since the coordinates are not reversed
in point).
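
The coordinate-order difference between these formats can be seen in a small parser sketch. It covers the `geo_point` forms listed above except geohash, returns `(lat, lon)`, and is an illustration only, not the Elasticsearch parser; the function name is hypothetical.

```python
import re

_WKT_POINT = re.compile(r"^\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*$", re.IGNORECASE)


def parse_geo_point(value):
    """Return (lat, lon) for the supported geo_point input formats."""
    if isinstance(value, dict):
        if value.get("type") == "Point":
            # GeoJSON: coordinates are [lon, lat].
            lon, lat = value["coordinates"]
            return float(lat), float(lon)
        # Object with lat and lon keys.
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, (list, tuple)):
        # Array: [lon, lat].
        lon, lat = value
        return float(lat), float(lon)
    if isinstance(value, str):
        match = _WKT_POINT.match(value)
        if match:
            # WKT: POINT(lon lat).
            return float(match.group(2)), float(match.group(1))
        # Plain string: "lat,lon", the only format with reversed order.
        lat, lon = value.split(",")
        return float(lat), float(lon)
    raise TypeError(f"unsupported geo_point value: {value!r}")


inputs = [
    {"type": "Point", "coordinates": [-71.34, 41.12]},
    "POINT(-71.34 41.12)",
    {"lat": 41.12, "lon": -71.34},
    [-71.34, 41.12],
    "41.12,-71.34",
]
parsed = [parse_geo_point(v) for v in inputs]
```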

In addition we replaced many examples in both documentation and tests
to prioritize WKT over the plain string format.

Many remaining examples of array format or object with keys still exist
and could be replaced by, for example, GeoJSON, if we feel the need.

* Incorrect quote position
2022-05-17 23:46:43 +02:00
Craig Taverner db08d61998
Support geo label position through REST vector tiles API (#86458)
Support label position in REST vector tiles

There is a need to provide sensibly calculated label positions for polygons and lines in Kibana maps. A very convenient way to satisfy this need is through a runtime field that the REST API can make use of when labels are requested. This has the advantage of providing painless access to the label position as well.

This work adds support for the REST API to provide label positions to MVT queries, both for the HITS layer and the AGGS layer. To enable this feature, set with_labels to true as a query parameter to the vector tile search query.
2022-05-17 15:33:29 +02:00
Sohail Mirza 9117f0e42a
Docs: Remove extraneous backtick (#86750) 2022-05-16 10:49:22 +02:00
Nik Everett a589456b81
Synthetic source (#85649)
This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it in the mapping:
```
{
  "mappings": {
    "_source": {
      "synthetic": true
    }
  }
}
```

And we just stop storing the `_source` field - kind of. When you go to access
the `_source` we regenerate it on the fly by loading doc values. Doc values
don't preserve the original structure of the source you sent so we have to
make some educated guesses. And we have a rule: the source we generate would
result in the same index if you sent it back to us. That way you can use it
for things like `_reindex`.

Fetching the `_source` from doc values does slow down loading somewhat. See
numbers further down.

## Supported fields
This only works for the following fields:
* `boolean`
* `byte`
* `date`
* `double`
* `float`
* `geo_point` (with precision loss)
* `half_float`
* `integer`
* `ip`
* `keyword`
* `long`
* `scaled_float`
* `short`
* `text` (when there is a `keyword` sub-field that is compatible with this feature)


## Educated guesses

The synthetic source generator makes `_source` fields that are:
* sorted alphabetically
* as "objecty" as possible
* pushes all arrays to the "leaf" fields
* sorts most array values
* removes duplicate text and keyword values

These are mostly artifacts of how doc values are stored.

### sorted alphabetically
```
{
  "b": 1,
  "c": 2,
  "a": 3
}
```
becomes
```
{
  "a": 3,
  "b": 1,
  "c": 2
}
```

### as "objecty" as possible
```
{
  "a.b": "foo"
}
```
becomes
```
{
  "a": {
    "b": "foo"
  }
}
```

### pushes all arrays to the "leaf" fields
```
{
  "a": [
    {
      "b": "foo",
      "c": "bar"
    },
    {
      "c": "bort"
    },
    {
      "b": "snort"
    }
  ]
}
```
becomes
```
{
  "a": {
    "b": ["foo", "snort"],
    "c": ["bar", "bort"]
  }
}
```

### sorts most array values
```
{
  "a": [2, 3, 1]
}
```
becomes
```
{
  "a": [1, 2, 3]
}
```

### removes duplicate text and keyword values
```
{
  "a": ["bar", "baz", "baz", "baz", "foo", "foo"]
}
```
becomes
```
{
  "a": ["bar", "baz", "foo"]
}
```
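
The rules above can be approximated in a few lines of Python. This sketch works on the JSON you would send rather than on doc values, assumes each field holds values of one comparable type, and is not the Elasticsearch implementation; the `synthesize` name is hypothetical.

```python
def synthesize(source):
    """Apply the 'educated guesses' listed above to a JSON-like dict."""
    leaves = {}

    def collect(path, value):
        if isinstance(value, dict):
            for key, child in value.items():
                # Dotted keys become nested objects ("as objecty as possible").
                collect(path + key.split("."), child)
        elif isinstance(value, list):
            # Arrays are pushed down to the leaf fields.
            for item in value:
                collect(path, item)
        else:
            leaves.setdefault(tuple(path), []).append(value)

    collect([], source)

    result = {}
    for path in sorted(leaves):  # keys come out sorted alphabetically
        values = sorted(leaves[path])  # array values are sorted
        if all(isinstance(v, str) for v in values):
            values = sorted(set(values))  # duplicate strings are removed
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = values[0] if len(values) == 1 else values
    return result


nested = synthesize(
    {"a": [{"b": "foo", "c": "bar"}, {"c": "bort"}, {"b": "snort"}]}
)
```

Running it on the examples above reproduces the "becomes" side of each pair.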
## `_recovery_source`

Elasticsearch's shard "recovery" process needs `_source` *sometimes*. So does
cross cluster replication. If you disable source or filter it somehow we store
a `_recovery_source` field for as long as the recovery process might need it.
When everything is running smoothly that's generally a few seconds or minutes.
Then the field is removed on merge. This synthetic source feature continues
to produce `_recovery_source` and relies on it for recovery. It's *possible*
to synthesize `_source` during recovery but we don't do it.

That means that synthetic source doesn't speed up writing the index. But in the
future we might be able to turn this on to trade writing less data at index
time for slower recovery and cross cluster replication. That's an area of
future improvement.

## perf numbers

I loaded the entire tsdb data set with this change and the size:

```
           standard -> synthetic
store size  31.0 GB ->  7.0 GB  (77.5% reduction)
_source  24695.7 MB -> 47.6 MB  (99.8% reduction - synthetic is in _recovery_source)
```
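
For reference, the reductions can be recomputed from the sizes in the table. Using the rounded values shown, the store size works out to roughly 77.4% rather than 77.5%; presumably the quoted figure came from unrounded sizes, but that is an assumption.

```python
def reduction_percent(before, after):
    """Percentage saved when going from `before` to `after`."""
    return (before - after) / before * 100


store = reduction_percent(31.0, 7.0)  # store size in GB
source = reduction_percent(24695.7, 47.6)  # _source size in MB
```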

A second _forcemerge a few minutes after rally finishes should remove the
remaining 47.6MB of _recovery_source.

With this fetching source for 1,000 documents seems to take about 500ms. I
spot checked a lot of different areas and haven't seen any different hit. I
*expect* this performance impact is based on the number of doc values fields
in the index and how sparse they are.
2022-05-10 07:46:58 -04:00
Julie Tibshirani 10aa947707
Remove out-of-date note about kNN with filters
We implemented this in #84734 but forgot to update these docs.
2022-04-14 10:18:07 -07:00
Yannick Welsch 78789e2b5d
Fix wildcard highlighting on match_only_text (#85500)
Fixes a bug where match_only_text fields were ignored during highlighting when a field name with wildcard was specified.

Closes #85493
2022-04-01 08:12:08 +02:00
Craig Taverner 0b84eb1a53
Added buffer to vector tile REST API docs (#85460) 2022-03-30 14:29:01 +02:00
Alan Woodward a5452603cc
Extra testing and some cleanups for filtering on field caps (#85068)
* adds a test for mixed cluster requests
* fixes a bad stream version check (above test will fail if this isn't included)
* replaces private FieldCapsFilter interface with Predicate
* renames 'allowedTypes' to 'types' to maintain consistency with external API
* adds javadoc to ResponseRewriter
* removes isRuntimeField from FieldTypeLookup

Relates to #83636
2022-03-29 11:38:52 +01:00
Ignacio Vera a780558e4c
[DOCS] Fix Vector tiles search docs for features.id (#85067)
Removes the `features.id` property from the response body. This property was actually generated by the tool used to decode the mvt file to JSON.
2022-03-17 16:06:49 -04:00
Ignacio Vera 3f6d460d01
Integrate GeoHexGridAggregation with vector tiles API (#84553)
This commit adds a new optional parameter on the vector tiles API called `grid_agg` with two
possible values, geotile (default) and geohex. This allows the aggs layer to be built using different
grid aggregations; for example, we can have a grid aggregation that is built using hexagons.
2022-03-16 11:16:30 +01:00
Julie Tibshirani 15708d5454
Integrate filtering support for ANN (#84734)
This PR integrates support for ANN with filtering added in Lucene 9.1. It adds
a new `filter` section to the `_knn_search` endpoint, which accepts a query (in
the Elasticsearch query DSL). The value can either be a single query or a list
of queries, which matches the syntax we use for defining filter clauses in a
`bool` query.

Closes #81788.
2022-03-10 15:53:51 -08:00
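The two accepted shapes of the `filter` section in #84734 above (a single query or a list of queries) can be handled uniformly, as in this sketch. The helper names and example values are hypothetical; the body layout assumes `filter` sits alongside `knn` in the `_knn_search` request.

```python
def normalize_filter(filter_value):
    """Accept a single query or a list of queries and return a list."""
    if filter_value is None:
        return []
    if isinstance(filter_value, list):
        return list(filter_value)
    return [filter_value]


def legacy_knn_search_body(knn, filter_value=None):
    """Build a _knn_search body with an optional filter section."""
    body = {"knn": dict(knn)}
    clauses = normalize_filter(filter_value)
    if clauses:
        body["filter"] = clauses
    return body


knn = {
    "field": "image_vector",  # hypothetical field name
    "query_vector": [0.3, 0.1, 1.2],
    "k": 5,
    "num_candidates": 50,
}
single = legacy_knn_search_body(knn, {"term": {"file_type": "png"}})
several = legacy_knn_search_body(
    knn,
    [{"term": {"file_type": "png"}}, {"range": {"size": {"lte": 1000}}}],
)
```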