While dot expansion is disabled when parsing percolator queries at index
time, as that would interfere with query parsing, we still use a wrapper parser
that is conservative about what methods it supports, assuming that
document parsing needs nextToken and not much more. It turns out that when
parsing queries instead, we need to support all the XContentParser
methods, including map, list, etc.
This commit adds a test for script score query parsing through document
parsing via the percolator field mapper, and removes the limitations in the
wrapper parser used when dot expansion is disabled.
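Illustratively (a sketch, not the actual ES class), the wrapper used to reject anything beyond token stepping:

```java
import java.io.IOException;
import java.util.Map;

import org.elasticsearch.xcontent.FilterXContentParserWrapper;
import org.elasticsearch.xcontent.XContentParser;

// before: the wrapper overrode methods like map() to throw, assuming document
// parsing only ever steps tokens via nextToken()
class ConservativeWrapper extends FilterXContentParserWrapper {
    ConservativeWrapper(XContentParser delegate) {
        super(delegate);
    }

    @Override
    public Map<String, Object> map() throws IOException {
        throw new UnsupportedOperationException(); // broke query parsing in the percolator
    }
}
// after: such overrides are removed, so map(), list(), etc. simply delegate to the wrapped parser
```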
Similar to the TransportVersions holder class, IndexVersions is the new
place to contain all constants for IndexVersion. This commit moves all
existing constants to the new class. It is purely mechanical.
Follow-up to #100966, introducing a new combined assertion, `assertSearchHitsWithoutFailures`,
to combine no-failure, count, and id assertions into one block.
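For illustration, a usage sketch (the exact helper signature is assumed from the description; `termQuery` is the standard `QueryBuilders` helper):

```java
assertSearchHitsWithoutFailures(
    client().prepareSearch("test").setQuery(termQuery("field", "value")),
    "doc-1", "doc-2");
// previously three separate assertions:
// assertNoFailures(response); assertHitCount(response, 2); assertSearchHits(response, "doc-1", "doc-2");
```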
We'd like to make `SearchResponse` reference counted and pooled but there are around 6k
instances of tests that create a `SearchResponse` local variable that would need to be
released manually to avoid leaks in the tests.
This does away with about 10% of these spots by adding an override for `assertHitCount`
that handles the actual execution of the search request and its release automatically,
and by making use of it in all spots where the `.get()` on the request builder could be inlined
semi-automatically, in a straightforward fashion, without other code changes.
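A minimal sketch of the overload's shape (assumed, not the exact ES code):

```java
public static void assertHitCount(SearchRequestBuilder searchRequestBuilder, long expectedHitCount) {
    var response = searchRequestBuilder.get(); // runs the formerly inlined .get() for the caller
    try {
        assertHitCount(response, expectedHitCount);
    } finally {
        response.decRef(); // the release happens here, so test code cannot leak the response
    }
}
```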
Another round of automated fixes to this, marking things that can be
made static as static. Saves some JIT cycles but also turns some lambdas
from capturing to non-capturing and makes the "utilityness" of some
classes visible.
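A self-contained illustration of the lambda point:

```java
import java.util.function.Supplier;

class StaticDemo {
    private static String staticGreeting() { return "hello"; }
    private String instanceGreeting() { return "hello"; }

    void demo() {
        Supplier<String> nonCapturing = StaticDemo::staticGreeting; // non-capturing: the JVM can reuse one instance
        Supplier<String> capturing = this::instanceGreeting;        // captures `this`: allocated each time this runs
    }
}
```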
Data-stream mappings require a @timestamp field to be present and configured
as a date with a specific set of parameters. The index-wide setting of
ignore_malformed can cause problems here if it is set to true, because it needs
to be false for the @timestamp field.
This commit detects if a set of mappings is configured for a data stream by checking
for the presence of a DataStreamTimestampFieldMapper metadata field, and passes
that information on during Mapper construction as part of the MapperBuilderContext.
DateFieldMapper.Builder now checks to see if it is specifically for a data stream timestamp
field, and if it is, sets ignore_malformed to false.
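In spirit, the resolution looks like this (a hedged sketch; the helper below is hypothetical, the real logic lives in DateFieldMapper.Builder):

```java
boolean resolveIgnoreMalformed(boolean indexWideIgnoreMalformed, boolean isDataStreamTimestampField) {
    if (isDataStreamTimestampField) {
        return false; // @timestamp in a data stream must reject malformed dates
    }
    return indexWideIgnoreMalformed; // everything else follows the index-wide setting
}
```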
Relates to #96051
The copyTo builder is really hard to reason about when it comes to
mapper merging, because the `reset` method would actually mutate an
existing mapper. That seems dangerous, and the whole thing is quite
inefficient as well. This PR removes it and uses a copy
constructor for copy-on-write, avoiding instance creation on mapper
merges here and there and leaving no doubt that these things are
immutable.
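A minimal sketch of the copy-on-write shape this moves to (illustrative names, not the actual CopyTo code):

```java
import java.util.ArrayList;
import java.util.List;

final class CopyTo {
    private final List<String> copyToFields;

    CopyTo(List<String> copyToFields) {
        this.copyToFields = List.copyOf(copyToFields); // defensively immutable
    }

    // merging produces a new instance instead of mutating this one via reset(...)
    CopyTo withAddedFields(List<String> extra) {
        if (extra.isEmpty()) {
            return this; // no allocation when nothing changes
        }
        List<String> merged = new ArrayList<>(copyToFields);
        merged.addAll(extra);
        return new CopyTo(merged);
    }
}
```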
Constants for TransportVersion currently live alongside the class
definition. This has been fine since there was only one set of
constants. However, to support serverless, some constants will need to
be defined elsewhere.
This commit moves the existing constants to a new holder class,
TransportVersions. It is almost entirely mechanical, using IntelliJ's move
members refactoring. The only non-mechanical part was slightly shifting how CURRENT
is found, by defining a LATEST in TransportVersions that is automatically
calculated (since we already have it, there is no need to define it manually).
The `StreamOutput` and `StreamInput` APIs are designed so that code
which serializes objects to the transport protocol aligns closely with
the corresponding deserialization code. However today
`StreamOutput#writeCollection` pairs up with a variety of methods on
`StreamInput`, including `readList`, `readSet`, and so on. These methods
are not obviously compatible with `writeCollection` unless you look at
the implementation, and that makes verifying transport protocol code
harder than it needs to be.
This commit renames these methods to `readCollectionAsList`,
`readCollectionAsSet`, and so on, to clarify that they are compatible
with `writeCollection`.
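After the rename, the write and read sides of a `Writeable` visibly pair up, e.g. (`MyNodes` is a hypothetical example type):

```java
import java.io.IOException;
import java.util.List;

import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;

record MyNodes(List<DiscoveryNode> nodes) implements Writeable {
    MyNodes(StreamInput in) throws IOException {
        this(in.readCollectionAsList(DiscoveryNode::new)); // obviously the counterpart of writeCollection below
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeCollection(nodes); // size header followed by each element
    }
}
```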
Relates
https://github.com/elastic/elasticsearch/pull/98971#issuecomment-1697289815
An optimization introduced in https://github.com/elastic/elasticsearch/pull/81985
changed percolator query behavior.
Users can specify a percolator query which expands fields based on a
wildcard pattern. Just one example is `simple_query_string`, which
allows field names like `"text_*"`. The user expects that this field
name will expand to relevant mapped fields (e.g. "text_foo"). However,
if there are no documents indexed in those fields at the time when the
percolator query is indexed, it doesn't expand to the relevant fields.
Additionally, at query time, we may skip expanding fields and fail to match
the relevant mapped fields if they are considered "empty" (e.g. they have no
values in the shard). We should instead allow expansion by indicating
that the field may exist in the shard.
closes: https://github.com/elastic/elasticsearch/issues/98819
When the subobjects property is set to false and we encounter an object
while parsing, we need a way to know whether its FieldMapper is able to
parse an object. If that's the case, we can provide the entire object to
the FieldMapper; otherwise its name becomes part of the dotted field
name of each internal value.
This is achieved by adding a `supportsParsingObject()` method
to the `FieldMapper` class. This method defaults to `false`, since the
majority of FieldMappers do not support parsing objects, and is
overridden to return `true` by the ones that do.
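Roughly (simplified stand-ins for the real classes):

```java
abstract class FieldMapper {
    /** Whether this mapper can consume a whole object from the parser. */
    public boolean supportsParsingObject() {
        return false; // most field mappers only parse leaf values
    }
}

class ObjectConsumingFieldMapper extends FieldMapper {
    @Override
    public boolean supportsParsingObject() {
        return true; // the entire object is handed to this mapper's parse method
    }
}
```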
This change swaps test code that directly creates IndexSearcher instances for LuceneTestCase#newSearcher calls,
which have the advantage of randomly enabling concurrency and of randomly applying assertion wrappers internally.
While this doesn't guarantee testing the concurrent code path, it should generally increase the likelihood of doing so.
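Concretely, inside a `LuceneTestCase` subclass the swap is just:

```java
IndexSearcher searcher = newSearcher(reader); // may randomly use an executor and assertion wrappers
// instead of the fixed, never-concurrent:
// IndexSearcher searcher = new IndexSearcher(reader);
```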
Lots of spots where we did weird things around streams: redundant stream creation, redundant collecting
before adding all the collected elements to another collection, redundant streams for joining strings,
use of the less efficient `Collectors.toList`, and in a few cases incorrect reliance on the result being mutable.
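A few self-contained before/after pairs of the kind described (`Node` is a stand-in type):

```java
import java.util.List;
import java.util.stream.Collectors;

class StreamCleanups {
    record Node(String name) {}

    // before: collects into a mutable list that is never mutated
    static List<String> namesBefore(List<Node> nodes) {
        return nodes.stream().map(Node::name).collect(Collectors.toList());
    }

    // after: cheaper, and explicitly unmodifiable
    static List<String> namesAfter(List<Node> nodes) {
        return nodes.stream().map(Node::name).toList();
    }

    // before: a stream just to join strings
    static String joinedBefore(List<String> items) {
        return items.stream().collect(Collectors.joining(","));
    }

    // after: no stream machinery needed
    static String joinedAfter(List<String> items) {
        return String.join(",", items);
    }
}
```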
Drying this up further and adding the same shortcut for single node
tests. Dealing with most of the spots that I could grab via automatic
refactorings.
Replacing the remaining usages that I could automatically replace
and a couple that I did by hand in this PR.
Also, added the same shortcut to the single node tests to save some
duplication there.
This commit changes access to the latest TransportVersion constant to
use a static method instead of a public static field. By encapsulating
the field we will be able to (in a followup) lazily determine what the
latest is, outside of clinit.
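A sketch of the encapsulation using the class-holder idiom (names illustrative; making it lazy is the follow-up):

```java
public final class TransportVersion {
    public static TransportVersion current() {
        return CurrentHolder.CURRENT; // callers no longer touch a public static field
    }

    // holder class: CURRENT is computed when current() is first called,
    // outside TransportVersion's own static initializer
    private static final class CurrentHolder {
        private static final TransportVersion CURRENT = computeLatest();
    }

    private static TransportVersion computeLatest() {
        return new TransportVersion(); // illustrative placeholder for the real lookup
    }
}
```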
Motivated by looking into allocations of listeners in detail for shared cache benchmarking.
Wrapping a listener and using `listener::onFailure` as the failure callback means that we
have a reference to the listener from both the failure and the response handler.
If we use the approach used by the `.delegate*` methods, we can often save allocating
a response handler lambda, or at least make the response handler cheaper.
We also save allocating the failure handler lambda.
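Roughly the difference (`client`, `request`, and `convert` are stand-ins; `delegateFailure` is the existing `ActionListener` method):

```java
// before: two lambdas, both holding a reference to `listener`
client.search(request, ActionListener.wrap(
    response -> listener.onResponse(convert(response)),
    listener::onFailure));

// after: the delegate carries the failure path; only the response handler remains
client.search(request, listener.delegateFailure(
    (delegate, response) -> delegate.onResponse(convert(response))));
```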
Add JUnit matchers allowing other matchers to be transformed using a
function. This can make checking properties of lists/arrays a lot more
fluent using nested matchers, rather than checking individual items
one by one. These replace the custom `ElasticsearchMatchers` with
more generic ones.
As a basic example, it allows you to turn this:
assertThat(list.get(0).getName(), equalTo("foo"));
assertThat(list.get(1).getName(), equalTo("bar"));
assertThat(list.get(2).getName(), equalTo("quux"));
into this:
assertThat(list, transformedItems(Item::getName, contains("foo", "bar", "quux")));
Doing this 'properly' without these helpers requires defining your own
matchers, which is very cumbersome.
I've applied the new methods to `ElasticsearchAssertions` and a few
other classes to show various use cases.
Most relevant changes:
- add api to allow concurrent query rewrite (GITHUB-11838 Add api to allow concurrent query rewrite apache/lucene#11840)
- knn query rewrite (Concurrent rewrite for KnnVectorQuery apache/lucene#12160)
- Integrate the incubating Panama Vector API (Integrate the Incubating Panama Vector API apache/lucene#12311)
As part of this commit I moved the ES codebase off of overriding or relying on the deprecated rewrite(IndexReader) method in favour of using rewrite(IndexSearcher) instead. For score functions, I opted not to break existing plugins, and instead create a new IndexSearcher whenever we rewrite a filter; otherwise we'd need to change the ScoreFunction#rewrite signature to take a searcher instead of a reader.
Co-authored-by: ChrisHegarty <christopher.hegarty@elastic.co>
This changes the serialization format for queries - when the index version is >=8.8.0, it serializes the actual transport version used into the stream. For BwC with old query formats, it uses the mapped TransportVersion for the index version.
This can be modified later if needed to re-interpret the vint used to store TransportVersion to something else, allowing the format to be further modified if necessary.
Fixes #82794. Upgrade the spotless plugin, which addresses the issue
around formatting `instanceof` expressions. Formatting of statements
including lambdas seems to have improved too.
Document parsing methods currently throw MapperParsingException. This
isn't very helpful, as it doesn't contain any information about where the parse
error happened - it is designed for parsing mappings, which are realised into
Java maps before being examined. This commit introduces a new exception
specifically for document parsing that extends XContentException, so that
it reports the current position of the parser as part of its error message.
Fixes #85083
Since #91269, fetch phase subprocessors can report any stored fields that they
need back to the top-level FetchPhase so that all stored fields can be loaded
up front. This commit switches the unified and plain highlighters to use this
functionality so that highlighting does not need to open stored field readers
twice.
The most recent Lucene update made `StringField` more efficient than `Field`
when indexing simple keywords. This PR cuts over remaining places where we use
`Field` to index keywords to `StringField` instead.
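For example (Lucene APIs):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexOptions;

class KeywordIndexing {
    static void addKeyword(Document doc) {
        // before: a generic Field with a hand-rolled keyword-like FieldType
        FieldType keywordType = new FieldType();
        keywordType.setTokenized(false);
        keywordType.setIndexOptions(IndexOptions.DOCS);
        keywordType.freeze();
        doc.add(new Field("status", "active", keywordType));

        // after: StringField, which recent Lucene indexes more efficiently
        doc.add(new StringField("status", "active", Field.Store.NO));
    }
}
```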
Follow-up to #94213 dealing with the remaining spots.
This also moves some use of deprecated transient settings to
persisted settings and deprecates the request builder methods for
transient settings to hopefully prevent more use of transients.
The percolator parses the query into a QueryBuilder, and then manually walks the query tree to
ensure that the query is supported. This requires instanceof checks that are aware of all the compound queries
and may easily become outdated.
With #90425 we can instead rely on checking the query validity directly while parsing, by providing a consumer
that gets notified for each inner query that gets parsed.
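A hedged sketch of the parse-time check (the consumer-accepting entry point's exact signature is assumed):

```java
QueryBuilder queryBuilder = parseTopLevelQuery(parser, inner -> {
    if (inner instanceof HasChildQueryBuilder) {
        throw new IllegalArgumentException("[has_child] queries are not supported in a percolator query");
    }
});
```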
This was only needed because the percolator uses a MemoryIndex which did
not support stored fields, and so when it ran a highlighting phase it needed to
force it to read from source. MemoryIndex added stored fields support in
Lucene 9.5, so we can remove this internal parameter.
The parameter remains available, but deprecated, via the rest layer, and no
longer has any effect.
Use the locale-independent `Strings.format` method instead of `String.format(Locale.ROOT, ...)`.
Inline `ESTestCase.forbidden` calls with `Strings.format` for consistency's sake.
Add a `Strings.format` alias in `common.Strings`.
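Before/after (assuming the helper lives in `org.elasticsearch.core.Strings`):

```java
import java.util.Locale;

import org.elasticsearch.core.Strings;

class FormatExamples {
    static String before(String shardId, int retries) {
        return String.format(Locale.ROOT, "shard [%s] failed after [%d] retries", shardId, retries);
    }

    static String after(String shardId, int retries) {
        return Strings.format("shard [%s] failed after [%d] retries", shardId, retries);
    }
}
```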
This commit adds a new test framework for configuring and orchestrating
test clusters for both Java and YAML REST testing. This will eventually
replace the existing "test-clusters" Gradle plugin and the build-time
cluster orchestration.
Loading of stored fields is currently handled directly in FetchPhase, with
some fairly complex logic examining various bits of the FetchContext to work
out what fields need to be loaded. This is further complicated by synthetic
source, which may have its own stored field requirements.
This commit tries to separate out these concerns a little by adding a new
StoredFieldsSpec record that holds information about which stored fields
need to be loaded. Each FetchSubPhaseProcessor can now report a
StoredFieldsSpec detailing what its requirements are, and these specs can
be merged together, along with requirements from a SourceLoader, to
determine up-front what fields should be loaded by the StoredFieldLoader.
The stored fields themselves are added into the SearchHit by a new
StoredFieldsPhase, which handles alias resolution and value post-processing.
The logic to determine when source should be loaded and
when not, based on the presence of script fields or stored fields, is
moved into FetchContext, which highlights some inconsistencies that
can be fixed in follow-up commits.
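A minimal sketch of the record's shape (field names follow the description above; the real spec carries more):

```java
import java.util.HashSet;
import java.util.Set;

record StoredFieldsSpec(boolean requiresSource, Set<String> requiredStoredFields) {
    // specs from each FetchSubPhaseProcessor (and the SourceLoader) merge into one
    StoredFieldsSpec merge(StoredFieldsSpec other) {
        Set<String> fields = new HashSet<>(requiredStoredFields);
        fields.addAll(other.requiredStoredFields);
        return new StoredFieldsSpec(requiresSource || other.requiresSource, fields);
    }
}
```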
We currently work out whether or not a mapper should be storing additional
values for synthetic source by looking at the DocumentParserContext. However,
this value does not change for the lifetime of the mapper - it is defined by
metadata on the root mapper and is immutable - and DocumentParserContext
feels like the wrong place for this information as it holds context specific
to the document being parsed.
This commit moves synthetic source status information from DocumentParserContext
to MapperBuilderContext instead. Mappers which need this information retrieve
it at build time and hold it on final fields.
#91043 surfaced some inconsistencies in field name validation between mapping parsing and document parsing. This commit centralizes the validation of field names when parsing mappings into a single place, and attempts to address some of the inconsistencies.
- field names that contain only whitespaces are no longer accepted in mappings. It was previously possible to map a field containing only whitespaces, but a document containing such a field would be rejected. We reject such mappings only for indices created from 8.6 on, since existing indices may already contain such fields. This also holds for dotted fields like `top. .foo` when subobjects are enabled.
- A clear error message is now thrown when mappings hold fields with names made of dots only; an ArrayIndexOutOfBoundsException was thrown before.
- The error thrown when a field name is empty is now unified with the one thrown when an empty field name is provided as part of a document (field name cannot be an empty string).
- When parsing documents (with subobjects set to false), distinguish between the error thrown when a field name is empty and that thrown when a field name is made of whitespaces only
- When parsing documents (with subobjects set to false), accept field names that are made of dots only (these are already accepted in mappings); this effectively reverts #90950
We have two implementations of source filtering, one based on Map filtering
and used by SourceLookup, and one based on jackson stream filtering used
in Get and SourceFieldMapper. There are cases when stream filtering could
be usefully applied to source in the fetch phase, for example if the source is
not being used as a Map by any other subphase; and correspondingly if a
source has already been parsed to a Map then map filtering will generally
be more efficient than stream filtering that ends up re-parsing the bytes.
This commit encapsulates all of this filtering logic into a single SourceFilter
class, which can be passed to the filter method on Source. Different
Source implementations can choose to use map or stream filtering
depending on whether or not they have map or bytes representations
available.
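A hypothetical usage sketch (constructor shape assumed):

```java
SourceFilter filter = new SourceFilter(
    new String[] { "user.*" },          // includes
    new String[] { "user.password" });  // excludes
Source filtered = source.filter(filter);
// a map-backed Source filters its already-parsed map; a bytes-backed Source
// streams the bytes through filtering without building a map first
```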
Archive indices < 6.1 do not work together with nested fields. While the documentation makes sure not to claim support for nested fields (see https://www.elastic.co/guide/en/elasticsearch/reference/current/archive-indices.html#archive-indices-supported-field-types), this leads to a situation where the import still works, yet none of the documents are returned by any query (not even match_all). This is because indices before 6.1 did not have primary terms, but the default search context in ES 8 adds a FieldExistsQuery(SeqNoFieldMapper.PRIMARY_TERM_NAME) filter to the query:
f56126089c/server/src/main/java/org/elasticsearch/search/DefaultSearchContext.java (L284)
This PR fixes the issue by adding basic support for nested fields, so that at least queries that do not leverage
the nested documents can work (i.e. it allows extracting the documents, e.g. to be reindexed).
Closes #90523
Rather than creating a new SourceLookup for each HitContext, and then
setting a source provider on it after the fact, we instead just take a
Source as a constructor argument.
This commit also adds three Source implementations, `fromBytes` and
`fromMap` to hold pre-loaded data, and `lazyLoading` which will load
the source only if asked for, and tidies up FetchSourcePhase to use them.
When parsing queries on the coordinating node, there is currently no way to share state between the different parsing methods (`fromXContent`). The only query that supports a parse context is bool query, which uses the context to track the nested depth of queries, added with #66204. Such a nested depth tracking mechanism is not 100% accurate, as it tracks bool queries only, while there are many more query types that can hold other queries and hence potentially cause a stack overflow when deeply nested.
This change removes the parsing context that's specific to bool query, introduced with #66204, in favour of generalizing the nested depth tracking to all query types.
The generic tracking is introduced by wrapping the parser and overriding the method that parses named objects through the xcontent registry. Another way would have been to require a context argument when parsing queries, which would mean adding a context argument to all the QueryBuilder#fromXContent static methods. That would be a breaking change for plugins that provide custom queries, hence I went for a different approach.
One aspect that this change requires and introduces is the distinction between parsing a top-level query (which will wrap the parser, or would create the context if we had one), as opposed to parsing an inner query, which goes ahead with the given parser and context. We already have this distinction in the form of two different static methods in `AbstractQueryBuilder`, but in practice only bool query makes the distinction, being the only context-aware query.
In addition to generalizing the tracking of nested depth when parsing queries, we should be able to adopt this same strategy to track query usage as part of #90176.
Given that the depth check is now more restrictive, as it counts all compound queries and not only bool, we have decided to raise the default limit to `30` to ensure that users are not going to hit the limit due to this change.
SourceLookup combines a mutable lookup object that can be advanced
to different documents with access to a document's source. This combination
can make reasoning about where a Source comes from difficult, particularly
in the FetchPhase where the source gets passed around a great deal.
This commit extracts a Source interface from SourceLookup, giving read-only
access to the source, and changes various FetchPhase interfaces to take this
read-only view instead of a full lookup. You can now tell easily if a consumer
of the source is going to try and move it to a different document. As part of this
change we add a new docId parameter to various ValueFetcher methods, as
previously this could be accessed via the SourceLookup.
We fixed a couple of issues lately around duplicate empty and unmodifiable collection wrappers
polluting heaps, so this tries to bulk-fix all obvious spots where we can
more efficiently read the immutable collection directly.
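Typical before/after:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class CopyExamples {
    // before: an extra mutable copy plus an unmodifiable wrapper on every call
    static Map<String, String> before(Map<String, String> source) {
        return Collections.unmodifiableMap(new HashMap<>(source));
    }

    // after: Map.copyOf may return `source` itself when it is already an immutable copy,
    // and empty results share a singleton
    static Map<String, String> after(Map<String, String> source) {
        return Map.copyOf(source);
    }
}
```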