Data-stream mappings require a @timestamp field to be present and configured
as a date with a specific set of parameters. The index-wide setting of
ignore_malformed can cause problems here if it is set to true, because it needs
to be false for the @timestamp field.
This commit detects whether a set of mappings is configured for a data stream by checking
for the presence of a DataStreamTimestampFieldMapper metadata field, and passes
that information on during Mapper construction as part of the MapperBuilderContext.
DateFieldMapper.Builder now checks whether it is building the timestamp field of a data
stream and, if so, forces ignore_malformed to false.
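Roughly, the new check has this shape (a simplified sketch with assumed names; the real flag is carried on MapperBuilderContext):
```
// Simplified sketch, not the actual DateFieldMapper.Builder code. The
// isDataStreamTimestampField flag stands in for the information carried on
// the real MapperBuilderContext.
class DateFieldBuilderSketch {
    private Boolean ignoreMalformed; // null when not explicitly configured

    boolean resolveIgnoreMalformed(boolean isDataStreamTimestampField, boolean indexWideDefault) {
        if (isDataStreamTimestampField) {
            // @timestamp on a data stream must never silently drop bad dates
            return false;
        }
        // otherwise honour the explicit setting, falling back to the index-wide default
        return ignoreMalformed != null ? ignoreMalformed : indexWideDefault;
    }
}
```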
Relates to #96051
We mostly have a handful of `FieldType` values here across all mappers, and none of them contain
attributes. There are only so many combinations, so let's deduplicate them to save some heap and set up
subsequent mapper heap savings.
The copyTo builder is really hard to reason about when it comes to
mapper merging, because the `reset` method would actually mutate an
existing mapper. That seems dangerous, and the whole thing is quite
inefficient as well. This PR removes the builder and uses a copy
constructor for copy-on-write instead, avoiding instance creation on mapper
merges here and there and leaving no doubt that these objects are
immutable.
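A minimal sketch of the copy-on-write shape this leaves us with (illustrative, not the actual `CopyTo` class):
```
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative immutable CopyTo: merging produces a new instance instead of
// mutating the existing one via a reset() call.
public final class CopyTo {
    private final List<String> fields;

    public CopyTo(List<String> fields) {
        this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
    }

    // copy-on-write: returns a new instance, leaving this one untouched
    public CopyTo withAddedFields(List<String> extra) {
        if (extra.isEmpty()) {
            return this; // no new instance needed on a no-op merge
        }
        List<String> merged = new ArrayList<>(fields);
        merged.addAll(extra);
        return new CopyTo(merged);
    }

    public List<String> fields() {
        return fields;
    }
}
```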
This PR adapts the unified highlighter to use the Weight#matches mode by default when possible.
This has been the default mode in Lucene for some time now. For cases where the matches mode won't work (nested and parent-child queries),
it is disabled automatically.
I didn't expose an option to explicitly disable this mode because it should be seen as an internal implementation detail.
With this change, matches that span multiple terms are highlighted together (something users have requested for years) and clauses that don't match the document are ignored.
Document parsing methods currently throw MapperParsingException. This
isn't very helpful, as it doesn't contain any information about where the parse
error happened - it is designed for parsing mappings, which are realised into
Java maps before being examined. This commit introduces a new exception
specifically for document parsing that extends XContentException, so that
it reports the current position of the parser as part of its error message.
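In spirit, the new exception looks something like this (a heavily simplified sketch):
```
// Simplified sketch: the parser's current location is baked into the message,
// so errors point at the offending spot in the document,
// e.g. "[5:13] failed to parse field [foo] of type [long]".
public class DocumentParsingException extends RuntimeException {
    public DocumentParsingException(int line, int column, String message) {
        super("[" + line + ":" + column + "] " + message);
    }
}
```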
Fixes #85083
IndexAnalyzers is currently always a concrete class wrapping several
Maps of NamedAnalyzers. This means that whenever it is used it needs
to instantiate all of its component analyzers, making testing much heavier
than it needs to be. It also means that things like overriding analysis for
legacy indexes are pushed into mapper parameters, rather than being
handled in a single place.
This commit makes IndexAnalyzers into an interface, with an anonymous
concrete implementation that handles reloading and closing for index
shards.
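Sketched (not the exact interface), this looks something like:
```
import java.io.Closeable;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;

// Sketch: with a single abstract method, tests can stub the whole thing with
// a lambda instead of instantiating maps of concrete analyzers.
interface IndexAnalyzers extends Closeable {
    Analyzer getAnalyzer(String name);

    @Override
    default void close() throws IOException {
        // the anonymous shard-level implementation overrides this to handle
        // reloading and closing of its wrapped analyzers
    }
}
```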
This was only needed because the percolator uses a MemoryIndex which did
not support stored fields, and so when it ran a highlighting phase it needed to
force it to read from source. MemoryIndex added stored fields support in
Lucene 9.5, so we can remove this internal parameter.
The parameter remains available, but deprecated, via the REST layer, and no
longer has any effect.
The annotation highlighter can miss annotations if they overlap with another search
term. This commit re-sorts incoming passages to ensure that all terms are seen
by the highlighter.
Fixes #91944
This commit adds a new test framework for configuring and orchestrating
test clusters for both Java and YAML REST testing. This will eventually
replace the existing "test-clusters" Gradle plugin and the build-time
cluster orchestration.
This adds support for the `fields` API from Painless into synthetic
`_source` for `text` fields that have a `keyword` sub-field. This sounds kind of
esoteric, but it's the default mapping for strings you send in
JSON, so synthetic `_source` supports it. So the `fields` API should too.
This adds support for the `field` scripting API in many but not all
cases. Before this change, numbers, dates, and IPs supported the `field`
API when running with _source in synthetic mode because they always have
doc values. This change adds support for `match_only_text`, `store`d
`keyword` fields, and `store`d `text` fields. Two remaining field
configurations work with synthetic _source and do not work with `field`:
* A `text` field with a sub-`keyword` field that has `doc_values`
* A `text` field with a sub-`keyword` field that is `store`d

We currently work out whether or not a mapper should be storing additional
values for synthetic source by looking at the DocumentParserContext. However,
this value does not change for the lifetime of the mapper - it is defined by
metadata on the root mapper and is immutable - and DocumentParserContext
feels like the wrong place for this information as it holds context specific
to the document being parsed.
This commit moves synthetic source status information from DocumentParserContext
to MapperBuilderContext instead. Mappers which need this information retrieve
it at build time and hold it on final fields.
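Schematically (a sketch with assumed names):
```
// Sketch: the flag is resolved once from the root mapper's metadata, carried
// on the builder context, and captured by mappers in a final field.
class MapperBuilderContextSketch {
    private final boolean syntheticSource;

    MapperBuilderContextSketch(boolean syntheticSource) {
        this.syntheticSource = syntheticSource;
    }

    boolean isSourceSynthetic() {
        return syntheticSource;
    }
}

class SomeFieldMapper {
    private final boolean storeForSyntheticSource;

    SomeFieldMapper(MapperBuilderContextSketch context) {
        // resolved at build time rather than re-derived per parsed document
        this.storeForSyntheticSource = context.isSourceSynthetic();
    }
}
```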
This adds support for `ignore_malformed` to numeric fields other than
`scaled_float` in synthetic `_source`. Their values are saved to a
stored field and loaded to render the `_source`.
This makes sure that all field types that support `ignore_malformed` do
so in the same way.
Production changes:
* All mappers have an `ignoreMalformed` method that must return `true` if
the `ignore_malformed` mapping parameter was configured for the field.
It defaults to `false` because many fields either don't have a concept
of a "malformed" value or don't have the ability to ignore malformed
values.
* Fix the `scaled_float` field to store its field name in `_ignored` if
it ignores any malformed values. This is how all other field mappers
work.
Test changes:
* `MapperTestCase` forces subclasses to declare whether they
`supportIgnoreMalformed` or not.
* If `MapperTestCase` subclasses `supportIgnoreMalformed`, they must
define some `exampleMalformedValues` (see the sketch after this list).
* `MapperTestCase` always grows three new tests:
* One that creates a field without setting `ignore_malformed` and
verifies that all `exampleMalformedValues` throw the expected errors.
* One that explicitly configures `ignore_malformed` to false and, if
`supportIgnoreMalformed`, verifies the errors again. If not
`supportIgnoreMalformed`, it verifies that the parameter is unknown.
* One that explicitly configures `ignore_malformed` to true and, if
`supportIgnoreMalformed`, verifies that parsing doesn't produce
errors and correctly produces `_ignored`. If not
`supportIgnoreMalformed`, it verifies that the parameter is unknown.
* Moved some subclasses of `MapperTestCase` from
`internalClusterTests` to `tests`. This isn't strictly required but
that's the right place for them.
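A subclass opting in might look roughly like this (a sketch; the exact hook names and signatures in `MapperTestCase` may differ):
```
import java.util.List;

// Hypothetical sketch of the hooks named above.
abstract class NumberFieldMapperTests /* extends MapperTestCase */ {
    protected boolean supportsIgnoreMalformed() {
        return true;
    }

    // values that should throw without ignore_malformed and be listed in
    // _ignored when ignore_malformed is true
    protected List<String> exampleMalformedValues() {
        return List.of("not_a_number", "3.5 apples");
    }
}
```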
This PR reworks the testing conventions precommit plugin. This plugin now:
- is compatible with YAML and Java REST tests and internalClusterTest (i.e. different sourceSets per test type)
- enforces test base class and simple naming conventions (as it did before)
- adds one check task per test sourceSet
- uses the worker api to improve task execution parallelism and encapsulation
- is gradle configuration cache compatible
This also ports the TestingConventions integration testing to Spock and removes the build-tools-internal/test kit folder that is not required anymore. We also add some common logic for testing Java-related Gradle plugins.
We will apply further cleanup to other tests within our test suite in a dedicated follow-up.
Speeding this up some more as it's now 50% of the bootstrap time of the many-shards benchmarks.
Iterating an array here in all cases is quite a bit faster than iterating various kinds of lists
and doesn't complicate the code. Also removes a redundant call to `getValue()` for each parameter
during serialization.
Adds support for "text" fields in archive indices, with the goal of adding simple filtering support on text fields when
querying archive indices.
There are some differences to regular text fields:
- no global statistics: queries on text fields return constant score (similar to match_only_text).
- analyzer fields can be updated
- if the defined analyzer is not available, it falls back to the default analyzer
- no guarantees that analyzers are BWC
The above limitations also give us the flexibility to eventually swap out the implementation with a "runtime-text field"
variant, and hence only provide those capabilities that can be emulated via a runtime field.
Relates #81210
In the many-shards benchmarks the singleton maps storing just a single
analyzer for each keyword field mapper cost around 5% of the total heap
usage on data nodes (700MB for ~15k indices which translate into ~16M instances
of keyword field mapper for Beats mappings).
Creating specific implementations for the zero-, one-, and many-analyzers
use cases, which already have their own specialized constructors, eliminates this
overhead completely.
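The shape of the fix, sketched with illustrative names:
```
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;

// Sketch with illustrative names: instead of every keyword field mapper
// holding a singleton Map, specialized lookups cover the common cases.
interface AnalyzerLookup {
    Analyzer get(String name);
}

final class ZeroAnalyzers implements AnalyzerLookup {
    public Analyzer get(String name) {
        return null; // nothing configured
    }
}

final class OneAnalyzer implements AnalyzerLookup {
    private final String name;
    private final Analyzer analyzer;

    OneAnalyzer(String name, Analyzer analyzer) {
        this.name = name;
        this.analyzer = analyzer;
    }

    public Analyzer get(String name) {
        return this.name.equals(name) ? analyzer : null;
    }
}

final class ManyAnalyzers implements AnalyzerLookup {
    private final Map<String, Analyzer> analyzers;

    ManyAnalyzers(Map<String, Analyzer> analyzers) {
        this.analyzers = analyzers;
    }

    public Analyzer get(String name) {
        return analyzers.get(name);
    }
}
```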
relates #77466
This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it in the mapping:
```
{
  "mappings": {
    "_source": {
      "synthetic": true
    }
  }
}
```
And we just stop storing the `_source` field - kind of. When you go to access
the `_source` we regenerate it on the fly by loading doc values. Doc values
don't preserve the original structure of the source you sent so we have to
make some educated guesses. And we have a rule: the source we generate would
result in the same index if you sent it back to us. That way you can use it
for things like `_reindex`.
Fetching the `_source` from doc values does slow down loading somewhat. See
numbers further down.
## Supported fields
This only works for the following fields:
* `boolean`
* `byte`
* `date`
* `double`
* `float`
* `geo_point` (with precision loss)
* `half_float`
* `integer`
* `ip`
* `keyword`
* `long`
* `scaled_float`
* `short`
* `text` (when there is a `keyword` sub-field that is compatible with this feature)
## Educated guesses
The synthetic source generator makes `_source` fields that are:
* sorted alphabetically
* as "objecty" as possible
* pushes all arrays to the "leaf" fields
* sorts most array values
* removes duplicate text and keyword values
These are mostly artifacts of how doc values are stored.
### sorted alphabetically
```
{
  "b": 1,
  "c": 2,
  "a": 3
}
```
becomes
```
{
  "a": 3,
  "b": 1,
  "c": 2
}
```
### as "objecty" as possible
```
{
  "a.b": "foo"
}
```
becomes
```
{
  "a": {
    "b": "foo"
  }
}
```
### pushes all arrays to the "leaf" fields
```
{
  "a": [
    {
      "b": "foo",
      "c": "bar"
    },
    {
      "c": "bort"
    },
    {
      "b": "snort"
    }
  ]
}
```
becomes
```
{
  "a": {
    "b": ["foo", "snort"],
    "c": ["bar", "bort"]
  }
}
```
### sorts most array values
```
{
  "a": [2, 3, 1]
}
```
becomes
```
{
  "a": [1, 2, 3]
}
```
### removes duplicate text and keyword values
```
{
  "a": ["bar", "baz", "baz", "baz", "foo", "foo"]
}
```
becomes
```
{
  "a": ["bar", "baz", "foo"]
}
```
## `_recovery_source`
Elasticsearch's shard "recovery" process needs `_source` *sometimes*. So does
cross cluster replication. If you disable source or filter it somehow we store
a `_recovery_source` field for as long as the recovery process might need it.
When everything is running smoothly that's generally a few seconds or minutes.
Then the field is removed on merge. This synthetic source feature continues
to produce `_recovery_source` and relies on it for recovery. It's *possible*
to synthesize `_source` during recovery but we don't do it.
That means that synthetic source doesn't speed up writing the index. But in the
future we might be able to turn this on to trade writing less data at index
time for slower recovery and cross cluster replication. That's an area of
future improvement.
## perf numbers
I loaded the entire tsdb data set with this change and compared the sizes:
```
             standard    ->  synthetic
store size   31.0 GB     ->  7.0 GB    (77.5% reduction)
_source      24695.7 MB  ->  47.6 MB   (99.8% reduction - synthetic is in _recovery_source)
```
A second _forcemerge a few minutes after rally finishes should remove the
remaining 47.6MB of _recovery_source.
With this change, fetching the source for 1,000 documents seems to take about 500ms. I
spot-checked a lot of different areas and haven't seen any other performance hit. I
*expect* this performance impact depends on the number of doc values fields
in the index and how sparse they are.
The default type is incredibly common and instances are not trivial
in size with 16 fields. Heap dumps from larger data nodes holding many
keyword fields with the default field type can contain hundreds of MB
of heap used for these.
Same reasoning applies to the `TextSearchInfo` deduplication.
`TextSearchInfo` was turned into a record to give us an `equals` implementation.
There are many places in Elasticsearch which must decode some stream of
bytes into characters. Most of the time this is expected to be UTF-8
encoded data, and we hardcode that charset name. However, methods in the
JDK that take a String charset name require catching
UnsupportedEncodingException. Yet most of these APIs also have a variant
of the same method which takes a known Charset instance, for which we
can use StandardCharsets.UTF_8. This commit converts most instances of
passing string charset names to use a Charset instance.
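For example, the two variants look like this (plain JDK code):
```
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class CharsetExample {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] bytes = {104, 105};

        // Before: a String charset name forces callers to handle a checked exception
        String before = new String(bytes, "UTF-8");

        // After: a Charset instance can never be unsupported, so no exception
        String after = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(before + " " + after);
    }
}
```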
This param was incredibly expensive to set up when parsing mappings and
is one of the big contributors to mapping parsing slowness on master.
Since all uses of this parameter type are statically known, it seems most
straightforward to simply hard-code the validators statically so that we save
some allocations.
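The idea, sketched (illustrative; the real validators check mapper-parameter invariants):
```
import java.util.function.Consumer;

// Sketch: one shared validator instance per known check, instead of a fresh
// lambda allocation for every parsed parameter.
final class Validators {
    private Validators() {}

    static final Consumer<Integer> NON_NEGATIVE = value -> {
        if (value < 0) {
            throw new IllegalArgumentException("value must be >= 0 but was [" + value + "]");
        }
    };
}
```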
Lucene issues that resulted in Elasticsearch changes:
- LUCENE-9820: Separate logic for reading the BKD index from logic for intersecting it.
- LUCENE-10377: Replace 'sortPos' with 'enableSkipping' in SortField.getComparator()
- LUCENE-10301: Make the test-framework a proper module by moving all test classes to org.apache.lucene.tests
- LUCENE-10300: Rewrite how resources are read in the Ukrainian Morfologik analyzer
- LUCENE-10054: Make HnswGraph hierarchical
Follow-up from #77144 (comment), converting id/_id to always be strings instead of integers. This makes the type value in the Elasticsearch specification only string instead of string | number.
This change was generated using the following command on Ubuntu:
```
find . -type f -name "*.yml" -print0 | xargs -0 sed -i -r 's/([^a-zA-Z0-9_\.]id|[^a-zA-Z0-9_]_id):(\s*)([0-9]+)/\1:\2"\3"/g'
```
The same fix as #80945: register a settings update consumer for the end_time of
the tsdb index even when the end_time setting wasn't registered.
Pass the feature flag to reindex yaml tests.
Co-authored-by: Igor Motov <igor@motovs.org>
Fix the split package org.elasticsearch.common.xcontent, between server and the x-content lib. Move the x-content lib exported package from org.elasticsearch.common.xcontent to org.elasticsearch.xcontent (following the naming convention of similar libraries). Removing split packages is a prerequisite to modularization.
The annotated text mapper plugin reuses package names from server. This
commit moves the implementation classes into an annotated text package
specifically for the plugin.
Mapper.build() currently takes a ContentPath object that it can use to generate
field type names that will include its parent names. We would like to expand field types
to include more information about their parents, and ContentPath does not hold this
information. This commit replaces the ContentPath parameter with a new
MapperBuilderContext, which currently holds only the content path information but
can be expanded in future to hold parent relationship information.
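Conceptually (a sketch of the new context):
```
// Sketch: the new context can still answer the "full field name" question a
// ContentPath answered, with room to grow parent information later.
final class BuilderContextSketch {
    private final String parentPath; // "" at the root

    BuilderContextSketch(String parentPath) {
        this.parentPath = parentPath;
    }

    String buildFullName(String leafName) {
        return parentPath.isEmpty() ? leafName : parentPath + "." + leafName;
    }
}
```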
Relates to #75474
Fixes the text field mapper and the analyzers class so they no longer retain parameter references, which are really heavy.
This makes `TextFieldMapper` take hundreds of bytes per instance, compared to multiple kB before.
Closes #73845
This introduces a basic public yaml rest test plugin that is supposed to be used by external
Elasticsearch plugin authors. This is driven by #76215
- Rename yaml-rest-test to intern-yaml-rest-test
- Use public yaml plugin in example plugins
Co-authored-by: Mark Vieira <portugee@gmail.com>
ParseContext is used to parse documents. It was easily confused with ParserContext (now renamed to MappingParserContext) which is instead used to parse mappings.
To remove any confusion, this commit renames ParseContext to DocumentParserContext and adapts its subclasses accordingly.
Modularization of the JDK has been ongoing for several years. Recently
in Java 16 the JDK began enforcing module boundaries by default. While
Elasticsearch does not yet use the module system directly, there are
some side effects even for those projects not modularized (e.g. #73517).
Before we can even begin to think about how to modularize, we must
Prepare The Way by enforcing packages only exist in a single jar file,
since the module system does not allow packages to coexist in multiple
modules.
This commit adds a precommit check to the build which detects split
packages. The expectation is that we will add the existing split
packages to the ignore list so that any new classes will not exacerbate
the problem, and the work to cleanup these split packages can be
parallelized.
relates #73525
The majority of field mappers read a single value from their positioned
XContentParser, and do not need to call nextToken. There is a general
assumption that the same holds for any multifields defined on them, and
so the XContentParser is passed down to their multifields builder as-is.
This assumption does not hold for mappers that accept JSON objects,
and so we have a second mechanism for passing values around called
'external values', where a mapper can set a specific value on its context
and child mappers can then check for these external values before reading
from xcontent. The disadvantage of this is that every field mapper now
needs to check its context for external values. Because the values are
defined by their java class, we can also know that in the vast majority of
cases this functionality is unused. We have only two mappers that actually
make use of this, CompletionFieldMapper and GeoPointFieldMapper.
This commit removes external values entirely, and replaces them with the ability
to pass a modified XContentParser to multifields. FieldMappers can just check
the parser attached to their context for data and don't need to worry about
multiple sources.
Plugins implementing field mappers will need to take the removal of external
values into account. Implementations that are passing structured objects
as external values should instead use ParseContext.switchParser and
wrap the objects using MapXContentParser.wrapObject().
GeoPointFieldMapper passes on a fake parser that just wraps its input data
formatted as a geohash; CompletionFieldMapper has a slightly more complicated
parser that in general wraps its metadata, but if textOrNull() is called without
the parser being advanced just returns its text input.
Relates to #56063
The FieldNamesFieldMapper is a metadata mapper defining a field that
can be used for exists queries if a mapper does not use doc values or
norms. Currently, data is added to it via a special method on FieldMapper
that pulls the metadata mapper from a mapping lookup, checks to see
if it is enabled, and then adds the relevant value to a lucene document.
This is one of only two places that pulls a metadata mapper from the
MappingLookup, and it would be nice to remove this method. This commit
refactors field name handling by instead storing the names of fields to
index in the fieldnames field in a set on the ParseContext, and then
building the field itself in FieldNamesFieldMapper.postParse(). This means
that all of the responsibility for enabling indexing, etc, is handled within
the metadata mapper itself.
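In outline, the flow described above looks something like this (a sketch, not the actual classes):
```
import java.util.HashSet;
import java.util.Set;

// Sketch: during parsing, mappers record the names of fields that should be
// indexed into _field_names; the metadata mapper builds the actual index
// field in its own postParse hook.
class ParseContextSketch {
    private final Set<String> fieldNamesFieldValues = new HashSet<>();

    void addToFieldNames(String field) {
        fieldNamesFieldValues.add(field);
    }

    Set<String> getFieldNames() {
        return fieldNamesFieldValues;
    }
}

class FieldNamesFieldMapperSketch {
    private final boolean enabled;

    FieldNamesFieldMapperSketch(boolean enabled) {
        this.enabled = enabled;
    }

    void postParse(ParseContextSketch context) {
        if (enabled == false) {
            return; // the enabled check now lives inside the metadata mapper
        }
        for (String name : context.getFieldNames()) {
            // add a term for [name] to the _field_names lucene field here
        }
    }
}
```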
`MappedFieldType` only allows configuring `match` and `prefix` queries today.
This change makes it possible to configure how to create `wildcard` and `fuzzy`
queries as well.
This will allow making the upcoming `match_only_text` field fully support
intervals queries.
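Sketched, the extension point looks something like this (illustrative; the real methods take more arguments):
```
import org.apache.lucene.search.Query;

// Sketch: field types that cannot support a query type keep the throwing
// defaults; others, like match_only_text, override them.
abstract class MappedFieldTypeSketch {
    abstract String name();

    Query wildcardQuery(String pattern) {
        throw new IllegalArgumentException(
            "Field [" + name() + "] does not support wildcard queries");
    }

    Query fuzzyQuery(String term) {
        throw new IllegalArgumentException(
            "Field [" + name() + "] does not support fuzzy queries");
    }
}
```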
We've had a few bugs in the fields API where it doesn't behave like we'd
expect. Typically this happens because it isn't obvious what we expect. So
we'll try and use randomized testing to ferret out what we want. This adds
a test for most field types that asserts that `fields` works similarly
to `docvalue_fields`. We expect this to be true for most fields.
It does so by forcing all subclasses of `MapperTestCase` to define a
method that makes random values. It declares a few other hooks that
subclasses can override to further randomize the test.
We skip the test for a few field types that don't have doc values:
* `annotated_text`
* `completion`
* `search_as_you_type`
* `text`
We should come up with some way to test these without doc values, even
if it isn't as nice. But that is a problem for another time, I think.
We skip the test for a few more types just because I wanted to cut this
PR in half so we could get to reviewing it earlier. We'll get to those
in a follow up change.
I've filed a few bugs for things that are inconsistent with
`docvalue_fields`. Typically that means that we have to limit the
random values that we generate to those that *do* round trip properly.
Custom position increments are handled by wrapping analyzers
with a NamedAnalyzer and passing the custom increment through
to its constructor. However, phrase and prefix analyzers use
delegating analyzer wrappers to add extra filtering to their parent
analyzers, and we can't wrap analyzers multiple times because this
wrecks reuse strategies, so we unwrap the parent before passing
it to phrase and prefix builders. This unwrapping means that we
lose the custom position increments; in particular, it means that
we can end up with a position increment gap of -1, which is the
sentinel value for the unset parameter - and that means exceptions
at index time for backwards-moving positions on fields with multiple
values.
This commit removes the sentinel value and uses standard parameter
defaults and the isConfigured() method instead, plus it adds some
more comprehensive testing for position increments when combined
with phrase/prefix index options on text fields.
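The parameter handling change, sketched (illustrative, not the actual Parameter class):
```
// Sketch: track whether a parameter was explicitly set instead of encoding
// "unset" as a sentinel like -1, which can leak into index-time logic.
final class IntParam {
    private final int defaultValue;
    private Integer configured; // null until explicitly set

    IntParam(int defaultValue) {
        this.defaultValue = defaultValue;
    }

    void set(int value) {
        this.configured = value;
    }

    boolean isConfigured() {
        return configured != null;
    }

    int get() {
        return isConfigured() ? configured : defaultValue;
    }
}
```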
Fixes #70049
With the newly introduced `max_analyzed_offset` the analyzer of
`AnnotatedTextHighlighter` was wrapped twice with the
`LimitTokenOffsetAnalyzer` by mistake.
Follows: #67325
Add a `max_analyzed_offset` query parameter to allow users
to limit the highlighting of text fields to a value less than or equal to the
`index.highlight.max_analyzed_offset`, thus avoiding an exception when
the length of the text field exceeds the limit. The highlighting still takes place,
but stops at the length defined by the new parameter.
Closes: #52155
DocumentMapper does not need to implement ToXContent; in fact it is its inner Mapping that needs to, and it already does. Consumers can switch to calling mapping() and invoking toXContent on it.