Currently Lucene limits the max number of vector dimensions to 1024.
This commit overrides KnnFloatVectorField and KnnByteVectorField
classes to increase the limit to 2048 for indexed vectors in ES.
Here we add synthetic source support for fields whose type is flattened.
Note that flattened fields and synthetic source have the following limitations,
all arising from the fact that in synthetic source we just see key/value pairs
when reconstructing the original object and have no type information in mappings:
* flattened fields use sorted set doc values of keywords, which means two things:
first we do not allow duplicate values, second we treat all values as keywords
* reconstructing array of objects results in nested objects (no array)
* reconstructing arrays with just one element results in a single-value field since we
have no way to distinguish single-valued from multi-values fields other then looking
at the count of values
`runtime_mappings` is the name of the param in the search request. In the
document `put` statement, it's called `runtime`
Co-authored-by: Matthew Hinea <matthew.hinea@gmail.com>
This PR enables the `ignore_malformed`parameter to be accepted as an option in
boolean field mappings. Support for synthetic source is not added yet, so if
`ignore_malformed` is set to true, synthetic source isn't supported.
Closes#89542
This adds term query capabilities for rank_features fields. term queries against rank_features are not scored in the typical way as regular fields. This is because the stored feature values take advantage of the term frequency storage mechanism, and thus regular BM25 does not work.
Instead, a term query against a rank_features field is very similar to linear rank_feature query. If more complicated combinations of features and values are required, the rank_feature query should be used.
* enhancement: boolean field to support ignore_malformed
* fix: changes in current builder for BooleanFieldMappers within tests files.
* Updating documentation
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Amy Jonsson <amy.jonsson@elastic.co>
Documentation incorrectly states that all aggregations are supported by
the `aggregate_metric_double` field.
This PR rectifies this error.
Closes#92236
Docs around the `index` option were not very precise. The term "typical" was used without describing for which fields querying is still available when `index: false` is set. But more precise docs existed in the `doc_values` documentation found here for the index option: https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html This docs were mostly copied over.
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
Currently Elasticsearch always returns a shard failure once a runtime error arises from using a runtime field, the exception being script-less runtime fields. This also means that execution of the query for that shard stops, which is okay for development and exploration. In a production scenario, however, it is often desirable to ignore runtime errors and continue with the query execution.
This change adds a new a new on_script_error parameter to runtime field definitions similar to the already existing
parameter for index-time scripted fields. When `on_script_error` is set to `continue`, errors from script execution are effectively ignored. This means affected documents don't show up in query results, but also don't prevent other matches from the same shard. Runtime fields accessed through the fields API don't return values on errors, aggregations will ignore documents that throw errors.
Note that this change affects scripted runtime fields only, while leaving default behaviour untouched. Also, ignored errors are not reported back to users for now.
Relates to #72143
* The exception is inserted in a code block
* Update docs/reference/mapping/types/text.asciidoc
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
Synthetic _source's array flattening activities can remove some arrays
entirely. Specifically:
```
{
"foo": [
{
"bar": 1
},
{
"baz": 2
}
]
}
```
Turns into:
```
{
"foo": {
"bar": 1,
"baz": 2
}
}
```
See, no more array! It's because the values are flattend to the leaf
fields and didn't have multiple values. This is implied by the docs we
had, but sure wasn't obvious. So now it's documented specifically.
This change adds support fielddata and subsequently scripting for byte vectors. This is a follow up to
#90774 and completes the initial work for #89784.
Before it linked to script_score and approximate kNN separately, but now we have
a single page that describes both approaches. This change also removes a link to
the deprecated _knn_search API.
* Refine geo-point and geo-shape docs
While reviewing the docs for another issue, some deprecated
references to prefix-trees were discovered, leading to interest
in bringing the docs a little more up-to-date.
* Update docs/reference/mapping/types/geo-point.asciidoc
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
* Update docs/reference/mapping/types/geo-shape.asciidoc
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
This change adds an element_type as an optional mapping parameter for dense vector fields as
described in #89784. This also adds a byte element_type for dense vector fields that supports storing
dense vectors using only 8-bits per dimension. This is only supported when the mapping parameter
index is set to true.
The code follows a similar pattern to our NumberFieldMapper where we have an enum for
ElementType, and it has methods that DenseVectorFieldType and DenseVectorMapper can delegate to
to support each available type (just float and byte for now).
I got some new this morning that we're going to have to rework how we
handle ignore-above in synthetic _source which makes me a bit weary of
removing tech-preview in 8.5. I asked a few folks and they felt more
comfortable giving it a little longer in tech preview. I expect until
ignore-above is in.
This adds synthetic `_source` support for `ip` fields with
`ignore_malfored` set to `true`. We save the field values in hidden
stored field, just like we do for `ignore_above` keyword fields. Then we
load them at load time.
I've been hacking on synthetic source for a while now and not seen any
need to break backwards compatibility or any major bugs. I think it's
time to remove the `preview` marker from it so folks can use it without
fear.
It seems that for now we don't have a good use for the histogram and summary metric types.
They had been left as place holders for a while, but at this point there is no concrete plan forward for them.
This PR removes the histogram and summary metric types. We may add them back in the future.
Also, this PR completely removes the time_series_metric mapping parameter from the histogram field type and only allows the gauge metric type for aggregate_metric_double fields.
This adds support for synthetic _source to the `version` field type. It
works very similarly to `keyword` but with an extra decode step.
I modified the decoder to return a `BytesRef` instead of a `String`
because many of the callers seemed to be converting that string directly
into bytes again. Synthetic source would have wanted to do that. As was
the query infrastructure.
Now that we're releasing synthetic _source as a tech preview feature, we
no longer want to remove the docs from the non-release builds. And we
want to mark all of the headings describing synthetic `_source` as a
preview.
The docs for synthetic `_source` incorrectly claimed that synthetic
`_source` deduplicates numbers. It doesn't. The example below the prose
shows it *not* removing duplicates.
We have a check that enforces the total number of fields needs to be below a
certain (configurable) threshold. Before runtime fields did not contribute
to the count. This patch makes all runtime fields contribute to the
count, runtime fields:
- that were explicitly defined in mapping by a user
- as well as runtime fields that were dynamically created by dynamic
mappings
Closes#88265
* [DOCS] Warn only one date format is added to the field date formats
When using multiple options in `dynamic_date_formats`, only one of the formats of the first document having a date matching one of the date formats provided will be used.
E.g.
```
PUT my-index-000001
{
"mappings": {
"dynamic_date_formats": [ "yyyy/MM", "MM/dd/yyyy"]
}
}
PUT my-index-000001/_doc/1
{
"create_date": "09/25/2015"
}
```
The generated mappings will be:
```
"mappings": {
"dynamic_date_formats": [
"yyyy/MM",
"MM/dd/yyyy"
],
"properties": {
"create_date": {
"type": "date",
"format": "MM/dd/yyyy"
}
}
},
```
Indexing a document with `2015/12` would lead to the `format` `"yyyy/MM"` being used for the `create_date`.
This can be misleading especially if the user is using multiple date formats on the same field.
The first document will determine the format of the `date` field being detected.
Maybe we should provide an additional example, such as:
```
PUT my-index-000001
{
"mappings": {
"dynamic_date_formats": [ "yyyy/MM||MM/dd/yyyy"]
}
}
```
My wording is not great, so feel free to amend/edit.
* Update docs/reference/mapping/dynamic/field-mapping.asciidoc
Reword and add code example
* Turned discussion of the two syntaxes into an admonition
* Fix failing tests
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
This PR adds a new `knn` option to the `_search` API to support ANN search.
It's powered by the same Lucene ANN capabilities as the old `_knn_search`
endpoint. The `knn` option can be combined with other search features like
queries and aggregations.
Addresses #87625
Currently we have two parameters that control how the source of a document
is stored, `enabled` and `synthetic`, both booleans. However, there are only
three possible combinations of these, with `enabled:false` and `synthetic:true`
being disallowed. To make this easier to reason about, this commit replaces
the `enabled` parameter with a new `mode` parameter, which can take the values
`stored`, `synthetic` and `disabled`. The `mode` parameter cannot be set
in combination with `enabled`, and we will subsequently move towards
deprecating `enabled` entirely.
* Revert "Revert "[DOCS] Add TSDS docs (#86905)" (#87702)"
This reverts commit 0c86d7b9b2.
* First fix to tests
* Add data_stream object to index template
* small rewording
* Add enable data stream object in gradle example setup
* Add bullet about data stream must be enabled in template
* [DOCS] Add TSDB docs
* Update docs/build.gradle
Co-authored-by: Adam Locke <adam.locke@elastic.co>
* Address Nik's comments, part 1
* Address Nik's comments, part deux
* Reword write index
* Add feature flags
* Wrap one more section in feature flag
* Small fixes
* set index.routing_path to optional
* Update storage reduction value
* Update create index template code example
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Co-authored-by: Adam Locke <adam.locke@elastic.co>
This adds an example of the precision loss for `double`, `float`, and
`half_float` numbers that we can link folks to when explaining what
happened to their numbers. You can link directly to it with something
like:
```
/guide/number.html#floating_point
```
* Soft-deprecation of point/geo_point formats
Since GeoJSON and WKT are now common formats for all three types:
geo_shape, geo_point and point
We decided to soft-deprecate the other point formats by ordering:
* GeoJSON (object with keys `type` and `coordinates`)
* WKT `POINT(x y)`
* Object with keys `lat` and `lon` (or `x` and `y` for point)
* Array [lon,lat]
* String `"lat,lon"` (or `"x,y"` in point)
* String with geohash (only in `geo_point`)
The geohash is last because it is only in one field type.
The string version is second last because it is the most controversial
being the only version to reverse the coordinate order from all other
formats (for geo_point only, since the coordinates are not reversed
in point).
In addition we replaced many examples in both documentation and tests
to prioritize WKT over the plain string format.
Many remaining examples of array format or object with keys still exist
and could be replaced by, for example, GeoJSON, if we feel the need.
* Incorrect quote position
This PR adds support for a new mapping parameter to the configuration of the object mapper (root as well as individual fields), that makes it possible to store metrics data where it's common to have fields with dots in their names in the following format:
```
{
"metrics.time" : 10,
"metrics.time.min" : 1,
"metrics.time.max" : 500
}
```
Instead of expanding dotted paths the their corresponding object structure, objects can be configured to preserve dots in field names, in which case they can only hold leaf sub-fields and no further objects.
The mapping parameter is called subobjects and controls whether an object can hold other objects (defaults to true) or not. The following example shows how it can be configured in the mappings:
```
{
"mappings" : {
"properties" : {
"metrics" : {
"type" : "object",
"subobjects" : false
}
}
}
}
```
Closes#63530
There has been some confusion over the definition of a field type family. This
PR clarifies the definition in the docs: the two types should have the exact
same search behavior (including supporting the same queries/ aggs, and producing
the same response). It's not sufficient for them to just support the samme
search operations.
This change also fixes an inaccurate statement that there is only one field type
family so far.
This PR introduces the lookup runtime fields which are used to retrieve
data from the related indices. The below search request enriches its
search hits with the location of each IP address from the `ip_location`
index.
```
POST logs/_search
{
"runtime_mappings": {
"location": {
"type": "lookup",
"lookup_index": "ip_location",
"query_type": "term",
"query_input_field": "ip",
"query_target_field": "_id",
"fetch_fields": [
"country",
"city"
]
}
},
"fields": [
"timestamp",
"message",
"location"
]
}
```
Response:
```
{
"hits": {
"hits": [
{
"_index": "logs",
"_id": "1",
"fields": {
"location": [
{
"city": [ "Montreal" ],
"country": [ "Canada" ]
}
],
"message": [ "the first message" ]
}
}
]
}
}
```
Clarifies that the `orientation` mapping parameter only applies to WKT polygons. GeoJSON polygons use a default orientation of `RIGHT`, regardless of the mapping parameter.
Also notes that the document-level `orientation` parameter overrides the default orientation for both WKT and GeoJSON polygons.
Closes https://github.com/elastic/elasticsearch/issues/84009.
Similar to #82409, but for geo_point fields.
Allows searching on geo_point fields when those fields are not indexed (index: false) but just doc values are enabled.
Also adds distance feature query support for date fields (bringing date field to feature parity with runtime fields)
This enables searches on archive data, which has access to doc values but not index structures. When combined with
searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set
of documents.
Relates #81210 and #52728
* [DOCS] Update dynamic mapping docs to clarify supported match_mapping_type
* Add ES data type column header
* Remove sentence about always choosing the larger data type
* Clarify that JSON doesn't distinguish types
* Add frame to table
Updates and reuses a warning against creating multi-level `join` fields to make it more prominent.
The current warning is low on the page, where some users may not seeing until they've already begun mapping fields.
Closes https://github.com/elastic/elasticsearch/issues/82818.
Allows searching on ip fields when those fields are not indexed (index: false) but just doc values are enabled.
This enables searches on archive data, which has access to doc values but not index structures. When combined with
searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set
of documents.
Relates #81210 and #52728
Allows searching on boolean fields when those fields are not indexed (index: false) but just doc values are enabled.
This enables searches on archive data, which has access to doc values but not index structures. When combined with
searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set
of documents.
Relates #81210 and #52728
Allows searching on keyword fields when those fields are not indexed (index: false) but just doc values are enabled.
This enables searches on archive data, which has access to doc values but not index structures. When combined with
searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set
of documents.
Relates #81210 and #52728
Similar to #82409, but for date fields.
Allows searching on date field types (date, date_nanos) when those fields are not indexed (index: false) but just doc
values are enabled.
This enables searches on archive data, which has access to doc values but not index structures. When combined with
searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set
of documents.
Relates #81210 and #52728
Allows searching on number field types (long, short, int, float, double, byte, half_float) when those fields are not
indexed (index: false) but just doc values are enabled.
This enables searches on archive data, which has access to doc values but not index structures. When combined with
searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set
of documents.
Note to reviewers:
I have split isSearchable into two separate methods isIndexed and isSearchable on MappedFieldType. The former one is
about whether actual indexing data structures have been used (postings or points), and the latter one on whether you
can run queries on the given field (e.g. used by field caps). For number field types, queries are now allowed whenever
points are available or when doc values are available (i.e. searchability is expanded).
Relates #81210 and #52728
Cosine similarity is not defined when one of the vectors has zero magnitude.
Before, the kNN search endpoint threw a confusing exception related to top docs
collection. Now we reject vectors early with a clear error message, failing
indexing if the vector has zero magnitude.
This commit adds docs for the new `_knn_search` endpoint.
It focuses on being an API reference and is light on details in terms of how
exactly the kNN search works, and how the endpoint contrasts with
`script_score` queries. We plan to add a high-level guide on kNN search that
will explain this in depth.
Relates to #78473.
Currently, we don't support kNN search against fields in a `nested` mapping.
Before, we were checking this at search-time. This commit moves it earlier, so
you aren't even allowed to set `index: true` if the vector is in a nested
mapping. That way, users are aware of the limitation before they start to index
documents.
Relates to #78473.
This reverts the change to use segment ordinals in composite terms aggregations due to a performance degradation when the field is high cardinality.
Co-authored-by: Mark Tozzi <mark.tozzi@elastic.co>
Removes `testenv` annotations and related code. These annotations originally let you skip x-pack snippet tests in the docs. However, that's no longer possible.
Relates to #79309, #31619
This commit updates the `dense_vector` docs to include information on the new
`index`, `similarity`, and `index_options` parameters. It also tries to clarify
the difference between `similarity` and `index_options` with the existing
parameters that have the same name.
Relates to #78473.