There were some cases where synthetic source wasn't properly rounding in
round trips. `0.15527719259262085` with a scaling factor of
`2.4206374697469164E16` was round tripping to `0.15527719259262088`
which then round trips up to `0.1552771925926209`, rounding the wrong
direction! This fixes the round tripping in this case through ever more
paranoid double checking and nudging.
Closes #88854
This change adds an operation parameter to FieldDataContext that allows us to specialize the field data that is returned from fielddataBuilder in MappedFieldType. Keyword, integer, and geo point field types now support source fallback, where we build a doc values wrapper using source if doc values don't exist for the field under the SCRIPT operation. This allows us to have source fallback in scripting for the scripting fields API.
MappedFieldType#fieldDataBuilder() currently takes two parameters, a fully qualified
index name and a supplier for a SearchLookup. We expect to add more parameters here
as we add support for loading fielddata from source. Rather than telescoping the
parameter list, this commit instead introduces a new FieldDataContext carrier object
which will allow us to add context parameters more easily.
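Roughly, the carrier object described in the two changes above could look like the sketch below. The component names, the `Operation` enum, and the `Supplier<Object>` stand-in for `Supplier<SearchLookup>` are all illustrative; the real FieldDataContext in Elasticsearch is shaped differently.
```
import java.util.function.Supplier;

// Illustrative only: a carrier record avoids telescoping parameter lists on
// MappedFieldType#fielddataBuilder and leaves room for new context fields.
public record FieldDataContext(
    String fullyQualifiedIndexName,   // one of the two original parameters
    Supplier<Object> searchLookup,    // stand-in for Supplier<SearchLookup>
    Operation operation               // hypothetical hint, e.g. SCRIPT enables source fallback
) {
    public enum Operation { SEARCH, SCRIPT }
}
```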
DocValuesFieldExistsQuery, NormsFieldExistsQuery, and KnnVectorFieldExistsQuery are deprecated in Lucene in favour of FieldExistsQuery, which combines the three into a single query.
This commit updates Elasticsearch to no longer rely on these deprecated queries.
See https://issues.apache.org/jira/browse/LUCENE-10436
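The replacement is a one-for-one swap; this small sketch assumes Lucene 9.2+ where org.apache.lucene.search.FieldExistsQuery is available:
```
import org.apache.lucene.search.FieldExistsQuery;
import org.apache.lucene.search.Query;

class ExistsQueryExample {
    // Previously this needed DocValuesFieldExistsQuery, NormsFieldExistsQuery
    // or KnnVectorFieldExistsQuery depending on how the field was indexed;
    // FieldExistsQuery now covers all three cases.
    static Query existsQuery(String field) {
        return new FieldExistsQuery(field);
    }
}
```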
This speeds up synthetic source, especially when there are many fields
in the index that are declared in the mapping but don't have values.
This is fairly common with ECS, and the tsdb rally track exercises exactly
that; this change improves the fetch performance of that track:
```
| 50th percentile service time | default | 6.24029 | 4.85568 | ms | -22.19% |
| 90th percentile service time | default | 7.89923 | 6.52069 | ms | -17.45% |
| 99th percentile service time | default | 12.0306 | 16.435 | ms | +36.61% |
| 100th percentile service time | default | 14.2873 | 17.1175 | ms | +19.81% |
| 50th percentile service time | default_1k | 158.425 | 25.3236 | ms | -84.02% |
| 90th percentile service time | default_1k | 165.46 | 30.8655 | ms | -81.35% |
| 99th percentile service time | default_1k | 168.954 | 33.3342 | ms | -80.27% |
| 100th percentile service time | default_1k | 174.341 | 34.8344 | ms | -80.02% |
```
There's a slight increase in the 99th and 100th percentile service time
for fetching ten documents, which I think is unlucky jitter. Hopefully. The
average performance of fetching ten docs improves anyway so I think
we're ok. Fetching a thousand documents improves 80% across the board
which is lovely.
This works by doing three things:
1. Teach the "leaf" layer of source loader to detect when the field is
empty in that segment and remove it from the synthesis process
entirely. This brings most of the speed up in tsdb.
2. Replace `hasValue` with a callback when writing the first value (see
the sketch after this list). `hasValue` was resulting in a 2^n-like
number of calls that really showed up in the profiler.
3. Replace the `ArrayList` of leaf loaders with an array. Before fixing
the other two issues the `ArrayList`'s iterator really showed up in
the profiling. Probably much less worth it now, but it's small.
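To illustrate the second point, here's a rough sketch of the "open the parent lazily on the first child value" idea. All names are made up and the output is just a StringBuilder; the real SourceLoader machinery is more involved.
```
// Illustrative sketch only, not the actual SourceLoader API.
class ObjectWriter {
    private final String name;
    private final StringBuilder out;
    private boolean open = false;

    ObjectWriter(String name, StringBuilder out) {
        this.name = name;
        this.out = out;
    }

    // Children call this right before writing their first value. The enclosing
    // object is opened lazily, so nothing ever has to poll hasValue() on every
    // descendant (the source of the 2^n-like call pattern).
    void onFirstChildValue() {
        if (open == false) {
            open = true;
            out.append('"').append(name).append("\":{");
        }
    }

    void finish() {
        if (open) {
            out.append('}');
        }
    }
}
```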
All of this brings synthetic source much closer to the fetch performance
of standard _source:
```
| 50th percentile service time | default_1k | 11.4016 | 25.3236 | ms | +122.11% |
| 90th percentile service time | default_1k | 13.7212 | 30.8655 | ms | +124.95% |
| 99th percentile service time | default_1k | 15.8785 | 33.3342 | ms | +109.93% |
| 100th percentile service time | default_1k | 16.9715 | 34.8344 | ms | +105.25% |
```
One important thing: these perf numbers come from fetching *hot* blocks
on disk. They mostly compare CPU overhead and not disk overhead.
Speeding this up some more as it's now 50% of the bootstrap time of the many-shards benchmarks.
Iterating an array here in all cases is quite a bit faster than iterating various kinds of lists
and doesn't complicate the code. Also removes a redundant call to `getValue()` for each parameter
during serialization.
Adds support for "text" fields in archive indices, with the goal of adding simple filtering support on text fields when
querying archive indices.
There are some differences to regular text fields:
- no global statistics: queries on text fields return constant score (similar to match_only_text).
- analyzer fields can be updated
- if the defined analyzer is not available, it falls back to the default analyzer
- no guarantees that analyzers are BWC
The above limitations also give us the flexibility to eventually swap out the implementation with a "runtime-text field"
variant, and hence only provide those capabilities that can be emulated via a runtime field.
Relates #81210
In the many-shards benchmarks the singleton maps storing just a single
analyzer for each keyword field mapper cost around 5% of the total heap
usage on data nodes (700MB for ~15k indices which translate into ~16M instances
of keyword field mapper for Beats mappings).
Creating specific implementations for the zero, one or many analyzers
use cases that already have their own specialized constructors eliminates this
overhead completely.
Relates #77466
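The shape of the fix is roughly the following (hypothetical interface and names, Object used as a stand-in for NamedAnalyzer): the zero and one analyzer cases don't allocate a map at all.
```
import java.util.Map;

// Illustrative only: specialized holders for the zero/one/many analyzer cases.
interface AnalyzerLookup {
    Object get(String field); // stand-in for NamedAnalyzer

    static AnalyzerLookup none() {
        return field -> null;
    }

    static AnalyzerLookup single(String field, Object analyzer) {
        return f -> f.equals(field) ? analyzer : null;
    }

    static AnalyzerLookup many(Map<String, Object> analyzers) {
        Map<String, Object> copy = Map.copyOf(analyzers);
        return copy::get;
    }
}
```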
This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it in the mapping:
```
{
  "mappings": {
    "_source": {
      "synthetic": true
    }
  }
}
```
And we just stop storing the `_source` field - kind of. When you go to access
the `_source` we regenerate it on the fly by loading doc values. Doc values
don't preserve the original structure of the source you sent so we have to
make some educated guesses. And we have a rule: the source we generate would
result in the same index if you sent it back to us. That way you can use it
for things like `_reindex`.
Fetching the `_source` from doc values does slow down loading somewhat. See
numbers further down.
## Supported fields
This only works for the following fields:
* `boolean`
* `byte`
* `date`
* `double`
* `float`
* `geo_point` (with precision loss)
* `half_float`
* `integer`
* `ip`
* `keyword`
* `long`
* `scaled_float`
* `short`
* `text` (when there is a `keyword` sub-field that is compatible with this feature)
## Educated guesses
The synthetic source generator makes `_source` fields that are:
* sorted alphabetically
* as "objecty" as possible
* pushes all arrays to the "leaf" fields
* sorts most array values
* removes duplicate text and keyword values
These are mostly artifacts of how doc values are stored.
### sorted alphabetically
```
{
  "b": 1,
  "c": 2,
  "a": 3
}
```
becomes
```
{
  "a": 3,
  "b": 1,
  "c": 2
}
```
### as "objecty" as possible
```
{
  "a.b": "foo"
}
```
becomes
```
{
  "a": {
    "b": "foo"
  }
}
```
### pushes all arrays to the "leaf" fields
```
{
  "a": [
    {
      "b": "foo",
      "c": "bar"
    },
    {
      "c": "bort"
    },
    {
      "b": "snort"
    }
  ]
}
```
becomes
```
{
  "a": {
    "b": ["foo", "snort"],
    "c": ["bar", "bort"]
  }
}
```
### sorts most array values
```
{
  "a": [2, 3, 1]
}
```
becomes
```
{
  "a": [1, 2, 3]
}
```
### removes duplicate text and keyword values
```
{
  "a": ["bar", "baz", "baz", "baz", "foo", "foo"]
}
```
becomes
```
{
  "a": ["bar", "baz", "foo"]
}
```
## `_recovery_source`
Elasticsearch's shard "recovery" process needs `_source` *sometimes*. So does
cross cluster replication. If you disable source or filter it somehow we store
a `_recovery_source` field for as long as the recovery process might need it.
When everything is running smoothly that's generally a few seconds or minutes.
Then the field is removed on merge. This synthetic source feature continues
to produce `_recovery_source` and relies on it for recovery. It's *possible*
to synthesize `_source` during recovery but we don't do it.
That means that synthetic source doesn't speed up writing the index. But in the
future we might be able to turn this on to trade writing less data at index
time for slower recovery and cross cluster replication. That's an area of
future improvement.
## perf numbers
I loaded the entire tsdb data set with this change; here's the resulting size comparison:
```
standard -> synthetic
store size 31.0 GB -> 7.0 GB (77.5% reduction)
_source 24695.7 MB -> 47.6 MB (99.8% reduction - synthetic is in _recovery_source)
```
A second _forcemerge a few minutes after rally finishes should remove the
remaining 47.6MB of _recovery_source.
With this change, fetching source for 1,000 documents seems to take about 500ms. I
spot checked a lot of different areas and haven't seen any other performance hit. I
*expect* this performance impact depends on the number of doc values fields
in the index and how sparse they are.
The default type is incredibly common and its instances are not trivial
in size, with 16 fields. Heap dumps from larger data nodes holding many
keyword fields with the default field type can contain hundreds of MB
of heap used for these.
Same reasoning applies to the `TextSearchInfo` deduplication.
`TextSearchInfo` was turned into a record to give us an `equals` implementation.
Most classes under elasticsearch-core had been moved to the o.e.core
package. However, a couple io related classes remained in an "internal"
package. This commit moves Streams and IOUtils to the core package, as
they are no more "internal" than the rest of the classes in core.
This param was incredibly expensive to set up when parsing mappings and
is one of the big contributors to mapping parsing slowness on master.
Since all uses of this parameter type are statically known, it seems most
straightforward to simply hard code the validators statically so that we save
some allocations.
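Conceptually the change looks like this sketch (hypothetical names and values; the real FieldMapper.Parameter plumbing differs): the validator becomes a shared static constant instead of being rebuilt for every mapper instance parsed.
```
import java.util.Set;
import java.util.function.Consumer;

// Illustrative only: one shared validator for all mapper instances.
class RestrictedStringParam {
    private static final Set<String> ALLOWED = Set.of("docs", "freqs", "positions");

    private static final Consumer<String> VALIDATOR = value -> {
        if (ALLOWED.contains(value) == false) {
            throw new IllegalArgumentException("unknown value [" + value + "], allowed: " + ALLOWED);
        }
    };

    static void validate(String value) {
        VALIDATOR.accept(value);
    }
}
```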
This change adds a ScriptFieldFactory class with a toScriptField method and a DocValuesScriptFieldFactory
class with a toScriptDocValues method. These classes are intended to facilitate the separation of the
supplier of values to a script field from the field itself. The two new classes will provide a way for the
old-style doc values to be accessed directly using a supplier instead of piggybacking off the new Field
types, which makes it easier to have source values for only the Field types moving forward.
Note this change is mostly mechanical: for now, the Fields themselves are the DocValuesScriptFieldFactory
implementations. This way we can make each field have its own PR to create a supplier for that field type,
making the general change far more manageable.
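The two factory shapes described above boil down to something like this (signatures simplified, Object used as a stand-in for the real scripting field and ScriptDocValues types):
```
// Simplified sketch of the described factory split.
interface ScriptFieldFactory {
    // Produces the Field exposed through the new scripting fields API.
    Object toScriptField();
}

interface DocValuesScriptFieldFactory extends ScriptFieldFactory {
    // Produces the old-style ScriptDocValues for backwards compatibility,
    // backed by the same underlying values so field data is loaded once.
    Object toScriptDocValues();
}
```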
Lucene issues that resulted in Elasticsearch changes:
- LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it.
- LUCENE-10377: Replace 'sortPos' with 'enableSkipping' in SortField.getComparator()
- LUCENE-10301: make the test-framework a proper module by moving all test classes to org.apache.lucene.tests
- LUCENE-10300: rewrite how resources are read in ukrainian morfologik analyzer
- LUCENE-10054: Make HnswGraph hierarchical
This removes the `boost` from the `toXContent` of `rank_feature` if it
is the default. It also removes the score function if it is the default.
Relates to #76515
Follow-up from #77144 (comment) with converting id/_id to always be strings instead of integers. This makes the type value in the Elasticsearch specification be only string instead of string | number.
This change was generated using the following command on Ubuntu:
find . -type f -name "*.yml" -print0 | xargs -0 sed -i -r 's/([^a-zA-Z0-9_\.]id|[^a-zA-Z0-9_]_id):(\s*)([0-9]+)/\1:\2"\3"/g'
Supporting #81809, we changed query builders to implement 'VersionedNamedWriteable' to be able to detect
when new query builders are introduced under the search endpoint and also to force new implementations to override
'getMinimalSupportedVersion' with a current release version.
This change removes the default implementation in the QueryBuilder interface and replaces it with individual
implementations in the currently existing query builders. For builders that have been around for longer than 7.0 (the
earliest version constant we currently have around) we use Version.V_EMPTY, which always sorts before any other declared version.
In many-shards benchmarks serializing mappers and settings
becomes fairly prominent during batched index creation or
setting updates.
Both mapping and setting serialization spent most of their
time on `org.elasticsearch.xcontent.XContentBuilder#unknownValue`
figuring out which type to serialize.
This commit makes it so the mapper parameters get serialized
by typed serializers (the generic XContentBuilder::field default we used will always
link to `org.elasticsearch.xcontent.XContentBuilder#field(java.lang.String, java.lang.Object)`
which is needlessly slow here when we know the type at the call site creating the parameter
instance).
Also, for settings I added some educated guesses on the expected types that
cover most real-world scenarios (for the non-flat case, which is the case that matters,
that's probably all scenarios except for `null` setting values).
The query_string, simple_query_string, combined_fields and multi_match
queries all allow you to query a large number of fields, based on wildcard field name
matches. By default, the wildcard match is *, meaning that these queries will try
to match against every single field in your index. This can cause problems if you
have a very large number of fields defined, and your elasticsearch instance has a
fairly low maximum query clause count.
In many cases, users may have many more fields defined in their mappings than are
actually populated in their index. For example, indexes using ECS mappings may
well only use a small subset of these mapped fields for their data. In these situations,
we can put a limit on the number of fields being searched by doing a quick check of
the Lucene index metadata to see if a mapped field actually has content in the index;
if it doesn't exist, we can trivially skip it.
This commit adds a check to QueryParserHelper.resolveMappingField() that strips
out fields with no content if the field name to resolve contains a wildcard. The check
is delegated down to MappedFieldType and by default returns `true`, but the standard
indexable field types (numeric, text, keyword, range, etc.) will check their field names
against the names in the underlying Lucene FieldInfos and return `false` if they do not
appear there.
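The existence check itself is cheap; here is a sketch of the idea, assuming direct access to an IndexReader (the real code routes this through MappedFieldType and the shard's FieldInfos):
```
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexReader;

class MappedFieldExistence {
    // A mapped field that no document has ever populated has no FieldInfo,
    // so wildcard expansion can skip it without adding a query clause.
    static boolean hasIndexedContent(IndexReader reader, String field) {
        FieldInfos fieldInfos = FieldInfos.getMergedFieldInfos(reader);
        return fieldInfos.fieldInfo(field) != null;
    }
}
```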
Allows searching on number field types (long, short, int, float, double, byte, half_float) when those fields are not
indexed (index: false) but just doc values are enabled.
This enables searches on archive data, which has access to doc values but not index structures. When combined with
searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set
of documents.
Note to reviewers:
I have split isSearchable into two separate methods isIndexed and isSearchable on MappedFieldType. The former one is
about whether actual indexing data structures have been used (postings or points), and the latter one is about whether you
can run queries on the given field (e.g. used by field caps). For number field types, queries are now allowed whenever
points are available or when doc values are available (i.e. searchability is expanded).
Relates #81210 and #52728
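For a long field the distinction plays out roughly like this (illustrative helper, not the actual NumberFieldMapper code): indexed fields use points, while doc-values-only fields fall back to a slower doc values query.
```
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.Query;

class LongRangeQuerySketch {
    static Query rangeQuery(String field, long lower, long upper, boolean isIndexed, boolean hasDocValues) {
        if (isIndexed) {
            // Points are available: use the index structures.
            return LongPoint.newRangeQuery(field, lower, upper);
        }
        if (hasDocValues) {
            // Archive case: no points, but doc values can still answer the
            // query (linear scan, hence "slow").
            return SortedNumericDocValuesField.newSlowRangeQuery(field, lower, upper);
        }
        throw new IllegalArgumentException("field [" + field + "] is neither indexed nor has doc values");
    }
}
```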
It has been reported that the search as you type field accepts sub-fields as part of its mapping definition, but those are being silently ignored. With this commit we add support for multi-fields to the search as you type field.
Closes #56326
JEP 361 (https://openjdk.java.net/jeps/361) added support for switch expressions
which can be much more terse and less error-prone than switch statements.
Another useful feature of switch expressions is exhaustiveness: we can make
sure that an enum switch expression covers all the cases at compile time.
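A small, self-contained example of the exhaustiveness point (the enum here is made up): if a new constant is added and a case is missing, the switch expression no longer compiles.
```
class SwitchExpressionExample {
    enum Color { GREEN, YELLOW, RED }

    static String describe(Color color) {
        // No default branch needed: the compiler checks that every
        // enum constant is covered.
        return switch (color) {
            case GREEN -> "all good";
            case YELLOW -> "degraded";
            case RED -> "unavailable";
        };
    }
}
```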
The ES code base is quite JSON heavy. It uses a lot of multi-line JSON requests in tests which need to be escaped and concatenated which in turn makes them hard to read. Let's try to leverage Java 15 text blocks for representing them.
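For example, a multi-line JSON request in a test can be written as a Java 15 text block instead of escaped, concatenated string literals (the request body here is illustrative):
```
class TextBlockExample {
    static final String CREATE_INDEX_BODY = """
        {
          "settings": { "number_of_shards": 1 },
          "mappings": {
            "properties": {
              "field": { "type": "keyword" }
            }
          }
        }
        """;
}
```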
This change makes all ScriptDocValues purely a wrapper around a supplier. (Similar to what
FieldValues was.) However, there are some important differences:
* This is meant to be transitory. As more DocValuesFields are completed, more of the simple
suppliers (ones that aren't DocValuesFields) can be removed.
* ScriptDocValues is the wrapper rather than the supplier. DocValuesFields are eventually the target
suppliers which makes it really easy to remove the simple suppliers once they are no longer
necessary.
* ScriptDocValues can be easily deprecated and removed without having to move their code to
DocValuesFields. Once ScriptDocValues is removed we can remove the supplier code from
DocValuesFields.
* DelegateDocValuesField uses an assert statement to ensure that its ScriptDocValues are not
supplied by another DocValuesField. This helps us to identify bugs during testing.
* ScriptDocValues no longer have setNextDocId. This helps us identify bugs during compilation.
* Conversions will not share/wrap suppliers since the suppliers are transitory.
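Stripped down to its essentials, the wrapper/supplier split looks something like this (types simplified; the real ScriptDocValues and DocValuesField classes do much more):
```
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch: ScriptDocValues only wraps a supplier of values.
class LongScriptDocValuesSketch {
    private final Supplier<List<Long>> supplier; // eventually a DocValuesField

    LongScriptDocValuesSketch(Supplier<List<Long>> supplier) {
        this.supplier = supplier;
    }

    // No setNextDocId here: the supplier owns iteration, so the values are
    // shared with the fields API and field data is loaded only once.
    long get(int index) {
        return supplier.get().get(index);
    }

    int size() {
        return supplier.get().size();
    }
}
```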
This change adds a ToScriptField class with the expectation it will be subclassed based on the
needs of each mapped type to produce a DocValuesField used by the scripting fields api. This is
intended to replace the more generic return of ScriptDocValues.
The change made here only targets classes implementing the LeafNumericFieldData interface to
keep the initial change smaller, but is also an example for how this would work for other types of
LeafFieldData as well.
It starts with the fielddataBuilder method of each MappedFieldType (where the appropriate
subclass of ToScriptField is specified) then passes through the IndexFieldData.Builder to the
IndexData.load method. From here the generated LeafFieldData uses the
ToScriptField.getScriptField method to generate the appropriate type of DocValuesField as
required by the new scripting fields api.
This design seems like the best way to meet the requirements for the scripting fields api by
allowing enough information to pass all the way to the LeafFieldData, but without directly
coupling the LeafFieldData to a mapped type so that the separation remains. There is also a
precedent already set for this design in the keyword field family that uses a scriptFunction to
generate a ScriptDocValues of the appropriate type. ToScriptField would eventually replace
scriptFunction.
This change creates the classes required for the scripting fields API to provide a binary field
composed of doc values using BytesRef as the representation returned to the user as a value.
This change makes it so there is only one path to retrieve values for scripting through the newly
introduced fields API. To support backwards compatibility of ScriptDocValues, DocValuesField will
return ScriptDocValues for continued doc access where the values are shared, so there is no
double loading of field data. For now, for unsupported DocValuesFields we have a
DelegateDocValuesField that returns the ScriptDocValues for long, double, String, etc.
Since Kibana's Discover switched to retrieving values via the fields API rather than source there have been gaps in the display caused by "ignored" fields (those that fall foul of ignore_above and ignore_malformed size and formatting rules).
This PR returns ignored values from source when a user-requested field fails to be parsed for a document. In these cases the corresponding hit adds a new ignored_field_values section in the response.
Closes #74121
Fix the split package org.elasticsearch.common.xcontent, between server and the x-content lib. Move the x-content lib exported package from org.elasticsearch.common.xcontent to org.elasticsearch.xcontent (following the naming convention of similar libraries). Removing split packages is a prerequisite to modularization.
Added the time_series_metric mapping parameter to the unsigned_long and scaled_float field types
Added the time_series_dimension mapping parameter to the unsigned_long field type
Fixes #78100
Relates to #76766, #74450 and #74014
Mapper.build() currently takes a ContentPath object that it can use to generate
field type names that will include its parent names. We would like to expand field types
to include more information about their parents, and ContentPath does not hold this
information. This commit replaces the ContentPath parameter with a new
MapperBuilderContext, which currently holds only the content path information but
can be expanded in future to hold parent relationship information.
Relates to #75474
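A bare-bones sketch of such a carrier object (names and methods are hypothetical; the real MapperBuilderContext may look different as it grows):
```
// Illustrative only: carries the content-path information Mapper.build() needs
// and leaves room for parent relationship information later.
class MapperBuilderContextSketch {
    private final String path;

    MapperBuilderContextSketch(String path) {
        this.path = path;
    }

    // e.g. buildFullName("b") inside object "a" returns "a.b"
    String buildFullName(String leafName) {
        return path.isEmpty() ? leafName : path + "." + leafName;
    }

    MapperBuilderContextSketch childContext(String objectName) {
        return new MapperBuilderContextSketch(buildFullName(objectName));
    }
}
```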
Fixes the text field mapper and the analyzers class so they no longer retain parameter references, which are really heavy.
Makes `TextFieldMapper` take hundreds of bytes compared to multiple kb per instance.
Closes #73845
Just like #77131 but for the `MatchOnlyTextFieldMapper`. Also, cleaned up a few
other minor things in it to make the constructor code for this class easier to follow.