This change allows the type `unsigned_long` to be used in the index
sort specification. The internal representation uses plain longs so
we can rely on the existing support for that type.
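For example, a minimal sketch of an index that sorts on an `unsigned_long` field (index and field names here are illustrative):
```
PUT my-index
{
  "settings": {
    "index.sort.field": "counter",
    "index.sort.order": "desc"
  },
  "mappings": {
    "properties": {
      "counter": { "type": "unsigned_long" }
    }
  }
}
```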
ParseContext is used to parse documents. It was easily confused with ParserContext (now renamed to MappingParserContext) which is instead used to parse mappings.
To remove any confusion, this commit renames ParseContext to DocumentParserContext and adapts its subclasses accordingly.
Change the formatter config to sort / order imports, and reformat the
codebase. We already had a config file for Eclipse users, so Spotless now
uses that.
The "Eclipse Code Formatter" plugin ought to be able to use this file as
well for import ordering, but in my experiments the results were poor.
Instead, use IntelliJ's `.editorconfig` support to configure import
ordering.
I've also added a config file for the formatter plugin.
Other changes:
* I've quietly enabled the `toggleOnOff` option for Spotless. It was
already possible to disable formatting for sections using the markers
for docs snippets, so enabling this option just accepts this reality
and makes it possible via `formatter:off` and `formatter:on` without
the restrictions around line length. It should still only be used as
a very last resort and with good reason.
* I've removed mention of the `paddedCell` option from the contributing
guide, since I haven't had to use that option for a very long time. I
moved the docs to the spotless config.
Extract usage of internal API from TestClustersPlugin and PluginBuildPlugin and related plugins and build logic
This includes a refactoring of ElasticsearchDistribution to handle types
better, so that we can differentiate between the Elasticsearch distribution
types supported in TestClustersPlugin and those only supported
in internal plugins.
It also introduces a set of internal versions of public plugins.
As part of this we also generate the plugin descriptors now.
As a follow-up we can move these publicly used classes into
an extra project (declared as an included build).
We keep LoggedExec and VersionProperties effectively public, and add a workaround for RestTestBase.
The majority of field mappers read a single value from their positioned
XContentParser, and do not need to call nextToken. There is a general
assumption that the same holds for any multifields defined on them, and
so the XContentParser is passed down to their multifields builder as-is.
This assumption does not hold for mappers that accept JSON objects,
and so we have a second mechanism for passing values around called
'external values', where a mapper can set a specific value on its context
and child mappers can then check for these external values before reading
from xcontent. The disadvantage of this is that every field mapper now
needs to check its context for external values. Because the values are
defined by their Java class, we also know that in the vast majority of
cases this functionality is unused. We have only two mappers that actually
make use of this, CompletionFieldMapper and GeoPointFieldMapper.
This commit removes external values entirely, and replaces them with the ability
to pass a modified XContentParser to multifields. FieldMappers can just check
the parser attached to their context for data and don't need to worry about
multiple sources.
Plugins implementing field mappers will need to take the removal of external
values into account. Implementations that are passing structured objects
as external values should instead use ParseContext.switchParser and
wrap the objects using MapXContentParser.wrapObject().
GeoPointFieldMapper passes on a fake parser that just wraps its input data
formatted as a geohash; CompletionFieldMapper has a slightly more complicated
parser that in general wraps its metadata, but if textOrNull() is called without
the parser being advanced just returns its text input.
Relates to #56063
The FieldNamesFieldMapper is a metadata mapper defining a field that
can be used for exists queries if a mapper does not use doc values or
norms. Currently, data is added to it via a special method on FieldMapper
that pulls the metadata mapper from a mapping lookup, checks to see
if it is enabled, and then adds the relevant value to a Lucene document.
This is one of only two places that pulls a metadata mapper from the
MappingLookup, and it would be nice to remove this method. This commit
refactors field name handling by instead storing the names of fields to
index in the fieldnames field in a set on the ParseContext, and then
building the field itself in FieldNamesFieldMapper.postParse(). This means
that all of the responsibility for enabling indexing, etc., is handled within
the metadata mapper itself.
This commit adds a script parameter to long and double fields that makes
it possible to calculate a value for these fields at index time. It uses the same
script context as the equivalent runtime fields, and allows for multiple index-time
scripted fields to cross-refer while still checking for indirection loops.
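A minimal sketch of what such a mapping might look like (field names and the script itself are illustrative, following the runtime-field style `emit` syntax the commit refers to):
```
PUT my-index
{
  "mappings": {
    "properties": {
      "voltage": { "type": "double" },
      "voltage_corrected": {
        "type": "double",
        "script": {
          "source": "emit(doc['voltage'].value * params.multiplier)",
          "params": { "multiplier": 4 }
        }
      }
    }
  }
}
```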
We've had a few bugs in the fields API where it doesn't behave like we'd
expect. Typically this happens because it isn't obvious what we expect. So
we'll try and use randomized testing to ferret out what we want. This adds
a test for most field types that asserts that `fields` works similarly
to `docvalue_fields`. We expect this to be true for most fields.
It does so by forcing all subclasses of `MapperTestCase` to define a
method that makes random values. It declares a few other hooks that
subclasses can override to further randomize the test.
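Concretely, the equivalence being asserted is roughly that a request like the following (field name illustrative) returns the same values through both retrieval paths:
```
POST my-index/_search
{
  "query": { "match_all": {} },
  "fields": [ "price" ],
  "docvalue_fields": [ "price" ]
}
```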
We skip the test for a few field types that don't have doc values:
* `annotated_text`
* `completion`
* `search_as_you_type`
* `text`
We should come up with some way to test these without doc values, even
if it isn't as nice. But that is a problem for another time, I think.
We skip the test for a few more types just because I wanted to cut this
PR in half so we could get to reviewing it earlier. We'll get to those
in a follow up change.
I've filed a few bugs for things that are inconsistent with
`docvalue_fields`. Typically that means that we have to limit the
random values that we generate to those that *do* round trip properly.
The transport action for FieldCapabilities assumes the meta field for a MappedFieldType
is traversable. This commit adds a requirement to MappedFieldType itself to ensure that
it is implemented for all subtypes.
This reduces the ceremony of declaring test artifacts for a project.
It also solves an issue with the usage of the deprecated testRuntime
configuration that testArtifacts extended from, which seems not to be required
at all and would have broken with Gradle 7.0 anyhow.
Test artifact resolution is now variant aware, which gives the consuming
projects a more adequate compile and runtime classpath.
We also introduce a convention method in the Elasticsearch build to declare
test artifact dependencies in an easy way, close to how it's done by the Gradle
build in the test fixture plugin.
Furthermore, we cleaned up some inconsistent test dependency declarations where
a project relies on another project and on its test artifacts.
Also moves it to a top-level interface in fielddata. It is not only used by
DocValueFetcher any more, and Leaf does not really describe what
it does or what it provides.
This has been deprecated in Gradle before, but we haven't been warned.
Gradle 7.0 will likely introduce a change in behaviour here, so we
should fix the usage of this configuration upfront.
See https://github.com/gradle/gradle/issues/16027 for further information
about the change in Gradle 7.0
As per the new licensing change for Elasticsearch and Kibana this commit
moves existing Apache 2.0 licensed source code to the new dual license
SSPL+Elastic license 2.0. In addition, existing x-pack code now uses
the new version 2.0 of the Elastic license. Full changes include:
- Updating LICENSE and NOTICE files throughout the code base, as well
as those packaged in our published artifacts
- Update IDE integration to now use the new license header on newly
created source files
- Remove references to the "OSS" distribution from our documentation
- Update build time verification checks to no longer allow Apache 2.0
license header in Elasticsearch source code
- Replace all existing Apache 2.0 license headers for non-xpack code
with updated header (vendored code with Apache 2.0 headers obviously
remains the same).
- Replace all Elastic license 1.0 headers with new 2.0 header in xpack.
We decided to rename `QueryShardContext` to clarify that it supports all parts
of search request execution. Before there was confusion over whether it should
only be used for building queries, or maybe only used in the query phase. This
PR also updates the javadocs.
Closes #64740.
This PR simplifies how the document source is passed to each fetch subphase. A summary of the strategy:
* For each document, we try to eagerly load the source and store it on `HitContext`. Most subphases that access source, like source filtering and highlighting, use `HitContext`. For nested hits, we filter the parent source and also store this source on `HitContext`.
* Only for non-nested documents, we also store the loaded source on `QueryShardContext#lookup`. This allows subphases that access source through `SearchLookup` to use the pre-loaded source when possible. This is now a common occurrence, since runtime fields are supported in the 'fields' option and may soon be supported in highlighting.
There is no longer a special `SearchLookup` just for the fetch phase. This was not necessary and was mostly caused by a misunderstanding of how `QueryShardContext` should be used.
Addresses #62511.
This speeds up `date_histogram` aggregations without a parent or
children. This is quite common - it's the aggregation that Kibana's Discover
uses all over the place. Also, we hope to be able to use the same
mechanism to speed aggs with children one day, but that day isn't today.
The kind of speedup we're seeing is fairly substantial in many cases:
```
| Metric | Operation | Before | After | Unit |
| 90th percentile service time | date_histogram_calendar_interval | 9266.07 | 1376.13 | ms |
| 90th percentile service time | date_histogram_calendar_interval_with_tz | 9217.21 | 1372.67 | ms |
| 90th percentile service time | date_histogram_fixed_interval | 8817.36 | 1312.67 | ms |
| 90th percentile service time | date_histogram_fixed_interval_with_tz | 8801.71 | 1311.69 | ms | <-- discover's agg
| 90th percentile service time | date_histogram_fixed_interval_with_metrics | 44660.2 | 43789.5 | ms |
```
This uses the work we did in #61467 to precompute the rounding points for
a `date_histogram`. Now, when we know the rounding points we execute the
`date_histogram` as a `range` aggregation. This is nice for a few reasons:
1. We can further rewrite the `range` aggregation (see below)
2. We don't need to allocate a hash to convert rounding points
to ordinals.
3. We can send precise cardinality estimates to sub-aggs.
Points 2 and 3 above are nice, but most of the speed difference comes from
point 1. Specifically, we now look into executing `range` aggregations as
a `filters` aggregation. Normally the `filters` aggregation is quite slow
but when it doesn't have a parent or any children then we can execute it
"filter by filter" which is significantly faster. So fast, in fact, that
it is faster than the original `date_histogram`.
The `range` aggregation is *fairly* careful in how it rewrites, giving up
on the `filters` aggregation if it won't collect "filter by filter" and
falling back to its original execution mechanism.
So an aggregation like this:
```
POST _search
{
"size": 0,
"query": {
"range": {
"dropoff_datetime": {
"gte": "2015-01-01 00:00:00",
"lt": "2016-01-01 00:00:00"
}
}
},
"aggs": {
"dropoffs_over_time": {
"date_histogram": {
"field": "dropoff_datetime",
"fixed_interval": "60d",
"time_zone": "America/New_York"
}
}
}
}
```
is executed like:
```
POST _search
{
"size": 0,
"query": {
"range": {
"dropoff_datetime": {
"gte": "2015-01-01 00:00:00",
"lt": "2016-01-01 00:00:00"
}
}
},
"aggs": {
"dropoffs_over_time": {
"range": {
"field": "dropoff_datetime",
"ranges": [
{"from": 1415250000000, "to": 1420434000000},
{"from": 1420434000000, "to": 1425618000000},
{"from": 1425618000000, "to": 1430798400000},
{"from": 1430798400000, "to": 1435982400000},
{"from": 1435982400000, "to": 1441166400000},
{"from": 1441166400000, "to": 1446350400000},
{"from": 1446350400000, "to": 1451538000000},
{"from": 1451538000000}
]
}
}
}
}
```
Which in turn is executed like this:
```
POST _search
{
"size": 0,
"query": {
"range": {
"dropoff_datetime": {
"gte": "2015-01-01 00:00:00",
"lt": "2016-01-01 00:00:00"
}
}
},
"aggs": {
"dropoffs_over_time": {
"filters": {
"filters": {
"1": {"range": {"dropoff_datetime": {"gte": "2014-12-30 00:00:00", "lt": "2015-01-05 05:00:00"}}},
"2": {"range": {"dropoff_datetime": {"gte": "2015-01-05 05:00:00", "lt": "2015-03-06 05:00:00"}}},
"3": {"range": {"dropoff_datetime": {"gte": "2015-03-06 00:00:00", "lt": "2015-05-05 00:00:00"}}},
"4": {"range": {"dropoff_datetime": {"gte": "2015-05-05 00:00:00", "lt": "2015-07-04 00:00:00"}}},
"5": {"range": {"dropoff_datetime": {"gte": "2015-07-04 00:00:00", "lt": "2015-09-02 00:00:00"}}},
"6": {"range": {"dropoff_datetime": {"gte": "2015-09-02 00:00:00", "lt": "2015-11-01 00:00:00"}}},
"7": {"range": {"dropoff_datetime": {"gte": "2015-11-01 00:00:00", "lt": "2015-12-31 00:00:00"}}},
"8": {"range": {"dropoff_datetime": {"gte": "2015-12-31 00:00:00"}}}
}
}
}
}
}
```
And *that* is faster because we can execute it "filter by filter".
Finally, notice the `range` query filtering the data. That is required for
the data set that I'm using for testing. The "filter by filter" collection
mechanism for the `filters` agg needs special case handling when the query
is a `range` query and the filter is a `range` query and they are both on
the same field. That special case handling "merges" the range query.
Without it "filter by filter" collection is substantially slower. Its still
quite a bit quicker than the standard `filter` collection, but not nearly
as fast as it could be.
Mapper.BuilderContext is a simple wrapper around two objects, some
IndexSettings and a ContentPath. The IndexSettings are the same as
those provided in the ParserContext, so we can simplify things here
by removing them and just passing ContentPath directly to
Mapper.Builder#build()
The signature of MappedFieldType#valueFetcher requires MapperService as an argument, which is unfortunate, as that is one of the reasons why FetchContext exposes the whole MapperService.
Such use of MapperService can be replaced by exposing the QueryShardContext, which encapsulates the MapperService.
Now that all our FieldMapper implementations extend ParametrizedFieldMapper,
we can collapse the two classes together, and remove a load of cruft from
FieldMapper that is unused. In particular:
* we no longer need the Lucene FieldType field on FieldMapper
* we no longer use clone() for merging, so we can remove it from all impls
* the serialization code in FieldMapper that assumes we're looking at text fields can go
We currently use TextSearchInfo to let query parsers know when a field
will support match queries. Some field types (numeric, constant, range)
can produce simple match queries that don't use the terms index, and
it is useful to distinguish between these fields on the one hand and
keyword/text-type fields on the other. In particular, the
SignificantTextAggregation only works on fields that have indexed terms,
but there is currently no efficient way to see this at search time and so
the factory falls back on checking to see if an index analyzer has been
defined, with the result that some nonsensical field types are permitted.
This commit adds a new static TextSearchInfo implementation called
SIMPLE_MATCH_WITHOUT_TERMS that can be returned by field
types with no corresponding terms index. It changes significant text
to check for this rather than for the presence of an index analyzer.
This is a breaking change, in that the significant text agg will now throw
an error up-front if you try and apply it to a numeric field, whereas before
you would get an empty result.
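For reference, a sketch of the kind of request this affects (index and field names illustrative); pointing `significant_text` at a numeric field in a request like this now fails up-front instead of returning empty buckets:
```
POST my-index/_search
{
  "size": 0,
  "query": { "match": { "message": "error" } },
  "aggs": {
    "significant_words": {
      "significant_text": { "field": "message" }
    }
  }
}
```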
Max and min aggs were producing wrong results for an unsigned_long field
if the field was indexed. If the field is indexed, max/min aggs use
values from the indexed points instead of field data; these values
are derived using the method pointReaderIfPossible. Previously,
UnsignedLongFieldType#pointReaderIfPossible was producing incorrect
values, as it failed to shift them back to the original
values.
This patch fixes pointReaderIfPossible to produce
the correct original values.
Relates to #60050
UnsignedLongTests for the range agg was using very specific intervals
that the double type cannot distinguish due to lack of precision:
9.223372036854776000E18 == 9.223372036854775807E18 returns true.
If we add the corresponding range query test, it will return a different
number of hits than the range agg, as the range query, unlike the range agg,
doesn't convert values to the double type and is hence more precise.
This patch makes the ranges for the range agg test broader (so values
converted to doubles don't lose precision), and hence the corresponding
range query will return the same number of hits.
Relates to #60050
Collapse was not working on the unsigned_long field,
as collapsing was enabled only on KeywordFieldType and NumberFieldType.
This introduces a new method `collapseType` on MappedFieldType,
which is checked to decide if collapsing should be enabled.
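A sketch of the kind of request that now works, assuming `account_id` is mapped as `unsigned_long` (index and field names illustrative):
```
POST my-index/_search
{
  "query": { "match_all": {} },
  "collapse": { "field": "account_id" }
}
```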
Relates to #60050
Some supported field types don't support term queries, and throw an exception in their termQuery method. That exception is either an IllegalArgumentException or a QueryShardException. There is logic in MatchQuery that decides whether to skip the field depending on the exception that is thrown.
Also, such field types should hold TextSearchInfo.NONE, but that is not always the case.
With this commit we make the following changes:
- streamline using TextSearchInfo.NONE in all field types that don't support text queries
- standardize the exception thrown when a field type does not support term queries to be IllegalArgumentException. Note that this is not a breaking change as both exceptions previously translated to a 400 status code.
- adapt the MatchQuery logic to skip fields that don't support term queries. There is no need to call termQuery passing an empty string and catch exceptions potentially thrown. We can rather check the TextSearchInfo, which already tells whether the field supports text queries or not.
- add a test method to MapperTestCase that verifies the consistency of a field type by verifying that it is not searchable whenever it uses TextSearchInfo.NONE, and that it is searchable otherwise. This is what triggered all of the above changes.
When constructing a value fetcher, the 'parsesArrayValue' flag must match
`FieldMapper#parsesArrayValue`. However there is nothing in code or tests to
help enforce this.
This PR reworks the value fetcher constructors so that `parsesArrayValue` is
'false' by default. Just as for `FieldMapper#parsesArrayValue`, field types must
explicitly set it to true and ensure the behavior is covered by tests.
Follow-up to #62974.
This fixes fields retrieval on the unsigned_long field:
1) For docvalue_fields, a custom UnsignedLongLeafFieldData::getLeafValueFetcher
is implemented that correctly retrieves doc values.
2) For stored fields, an error was fixed in UnsignedLongFieldMapper
in how stored values were stored. Previously they were incorrectly
stored in the shifted format; now they are stored as the original
values in String format.
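A sketch of the retrieval paths this fixes, assuming a stored `unsigned_long` field named `counter` (index and field names illustrative):
```
PUT my-index
{
  "mappings": {
    "properties": {
      "counter": { "type": "unsigned_long", "store": true }
    }
  }
}

POST my-index/_search
{
  "docvalue_fields": [ "counter" ],
  "stored_fields": [ "counter" ]
}
```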
Relates to #60050
MapperService carries a lot of weight and is only used to determine if loading of field data for the id field is enabled, which can be done in a different way. There was another usage that recently went away with the removal of `TypeFieldMapper`.
For runtime fields, we will want to do all search-time interaction with
a field definition via a MappedFieldType, rather than a FieldMapper, to
avoid interfering with the logic of document parsing. Currently, fetching
values for runtime scripts and for building top hits responses need to
call a method on FieldMapper. This commit moves this method to
MappedFieldType, incidentally simplifying the current call sites and freeing
us up to implement runtime fields as pure MappedFieldType objects.
- Modify Lucene::readSortValue to read BigInteger as a sortValue.
In #60050 writeSortValue was modified, but readSortValue was forgotten.
- Adjust yml tests to v7.10 as unsigned_long has been backported
- Change MapperTests to use proper document IDs
Relates to #60050
Test testSortDifferentFormatsShouldFail was occasionally failing for
2 reasons:
1) Documents on "idx2" were not available for search before a
search request started
2) Running a test multiple times was causing
an occasional ResourceAlreadyExistsException for idx2,
as idx2 was not deleted between tests.
This patch makes the following fixes:
1) Sets up an immediate refresh policy for docs in the index "idx2"
2) Creates the index idx2 only once per cluster
Closes: #62997
This commit adds a mechanism to MapperTestCase that allows implementing
test classes to check that their parameters can be updated, or throw conflict
errors as advertised. Child classes override the registerParameters method
and tell the passed-in UpdateChecker class about their parameters. Simple
conflicts can be checked, using the existing minimal mappings as a base to
compare against, or alternatively a particular initial mapping can be provided
to check edge cases (eg, norms can be updated from true to false, but not
vice versa). Updates are registered with a predicate that checks that the update
has in fact been applied to the resulting FieldMapper.
Fixes #61631
This field type supports:
- indexing of integer values in the range [0, 18446744073709551615]
- precise queries (term, range)
- precise sort and terms aggregations
- other aggregations, which are based on converting long values
to double and can be imprecise for large values
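A minimal sketch of the new field type in use (index, field name and values are illustrative):
```
PUT my-index
{
  "mappings": {
    "properties": {
      "counter": { "type": "unsigned_long" }
    }
  }
}

POST my-index/_doc
{ "counter": 18446744073709551615 }

POST my-index/_search
{
  "query": {
    "range": { "counter": { "gte": "9223372036854775808" } }
  },
  "sort": [ { "counter": "desc" } ]
}
```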
Closes #32434