We would like to remove the use of 'external values' in document parsing.
This commit simplifies two of the four places it is currently used, by adding
direct indexValue methods to BinaryFieldMapper and ParentIdFieldMapper.
Relates to #56063
This commit adds the ability to define an index-time geo_point field
with a script parameter, allowing you to calculate points from other
values within the indexed document.
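A minimal sketch of what such a mapping might look like (index and field names here are hypothetical; the script emits a point from two numeric fields):
```
PUT my-index
{
  "mappings": {
    "properties": {
      "lat": { "type": "double" },
      "lon": { "type": "double" },
      "location": {
        "type": "geo_point",
        "script": {
          "source": "emit(doc['lat'].value, doc['lon'].value)"
        }
      }
    }
  }
}
```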
This change disallows stored scripts from using a runtime field context. It adds a constructor parameter to
ScriptContext indicating whether or not a context is allowed to be used in a stored script.
This change exposes the newly introduced parameter `dynamic_templates`
in ingest. This parameter can be set by a set processor or a script processor.
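As a hedged sketch (pipeline, field and template names are hypothetical, and this assumes the parameter is surfaced to processors as the `_dynamic_templates` metadata field), a set processor might populate it like this:
```
PUT _ingest/pipeline/my-pipeline
{
  "processors": [
    {
      "set": {
        "field": "_dynamic_templates",
        "value": { "my_new_field": "my_dynamic_template_name" }
      }
    }
  ]
}
```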
Relates #69948
There are a few places where we need to access all of the supported runtime fields script contexts. Up until now we have listed them in all those places, but a better way is to have them listed in one place and access that same list from all consumers. This is what this commit introduces.
Along with the introduction of runtime fields contexts in ScriptModule, we rename the whitelist files so that they contain their corresponding context name to simplify looking them up.
This change removes an assertion from DatabaseRegistry - we can easily lose the .geoip_databases index while a persistent task state is still in the cluster state. This is not an assertion-worthy condition; it is an ordinary failure and should be signalled as one.
This also tries to fix packaging tests by avoiding duplicates in elasticsearch.yml.
Closes #71762
This PR adds documentation for GeoIPv2 auto-update feature.
It also changes the related settings names from geoip.downloader.* to ingest.geoip.downloader.* to follow the same convention as existing settings.
Relates to #68920
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
This change enables the GeoIP downloader by default.
It removes the feature flag but adds a flag that is used by tests to disable it again (as we don't want to hammer the GeoIP database service with every test cluster we spin up).
Relates to #68920
This change adds support for the 7 different runtime fields contexts to the Painless Execute API. Each
context can accept the standard script input (source and params) along with a user-defined document
and an index name to pull mappings from. The results depend on the output of the runtime field
type.
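For example, a sketch of exercising the long-field context (index, field and values are illustrative):
```
POST /_scripts/painless/_execute
{
  "script": {
    "source": "emit(doc['duration_ms'].value / 1000)"
  },
  "context": "long_field",
  "context_setup": {
    "index": "my-index",
    "document": {
      "duration_ms": 3500
    }
  }
}
```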
Closes #70467
This commit allows you to set 'script' and 'on_script_error' parameters
on date field mappers, meaning that runtime date fields can be made indexed
simply by moving their definitions from the runtime section of the mappings
to the properties section.
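A minimal sketch (index and field names hypothetical); the definition under `properties` is the same one that would previously have lived under `runtime`:
```
PUT my-index
{
  "mappings": {
    "properties": {
      "timestamp_millis": { "type": "long" },
      "date": {
        "type": "date",
        "script": {
          "source": "emit(doc['timestamp_millis'].value)"
        },
        "on_script_error": "fail"
      }
    }
  }
}
```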
* Warn users if security is implicitly disabled
Elasticsearch has security features implicitly disabled by default for
Basic and Trial licenses, unless explicitly set in the configuration
file.
This may be good for onboarding, but it can also lead to unintended insecure
clusters.
This change introduces clear warnings when security features are
implicitly disabled:
- a warning header in each REST response if security is implicitly
disabled;
- a log message during cluster boot.
This change fixes a number of problems in the GeoIPv2 code:
- closes streams from Files.list in GeoIpCli, which should fix tests on Windows
- makes sure that the total download time in GeoIP stats is non-negative (we serialize it as a vInt, which can cause problems with negative numbers; this can happen when the clock is changed during the operation)
- fixes handling of failed/simultaneous downloads; #69951 was meant as a way to prevent 2 persistent tasks from indexing chunks, but it would prevent any update if a single download failed mid-indexing. This change uses a timestamp (lastUpdate) as a sort of UUID. This should still prevent 2 tasks from stepping on each other's toes (overwriting chunks), but in the end only a single task should be able to update the task state (this is handled by the persistent tasks framework)
Closes #71145
This commit allows you to set 'script' and 'on_script_error' parameters
on IP field mappers, meaning that runtime IP fields can be made indexed
simply by moving their definitions from the runtime section of the mappings
to the properties section.
Ensure that the index request is routed to the ingest node,
so that the lazy loading of the geoip database occurs
on the ingest node (which is what is asserted later on).
Otherwise the database is lazily loaded on a different node.
(Without this fix, this test fails reproducibly with
`-Dtests.seed=2E234CC71CE96F4F`.)
Closes #71251
This change removes the loop counter for for-each loops, which is a regression from 7.9.3
documented in #71584. Not counting loops for for-each loops should be relatively safe as they have a
natural ending point. We can consider adding the check back in once we have configurable loop
counters.
This change adds an additional test to GeoIpDownloaderIT which tests that artifacts produced by the GeoIP CLI tool can be consumed by the cluster the same way as those from our original service.
It does so by running the tool from a fixture which then simply serves the generated files (this is exactly the way users are supposed to use the tool as well).
Relates to #68920
Adds the ability to serialize the user tree to an
XContent builder, handling all user tree decorations.
To avoid creating a tight dependency, the `ToXContent`
implementation is kept outside the relevant nodes.
Uses a wrapper around the `XContentBuilder` to change
checked `IOExceptions` into runtime exceptions to conform
to UserTreeVisitor API contract.
Adds new debugger method, `Debugger.phases` which
allows the caller to attach visitors at the three main phases
of Painless compilation.
In some scenarios where the read timeout is too tight it's possible
that the http request times out before the response headers have
been received; in that case a URLHttpClientIOException is thrown.
This commit adds that exception type to the expected set of read timeout
exceptions.
Closes #70931
Today by default the `MANAGEMENT` threadpool always permits 5 threads
even if the node has a single CPU, which unfairly prioritises management
activities on small nodes. With this commit we limit the size of this
threadpool to the number of processors if less than 5.
Relates #70435
Currently we don't report any exceptions occurring during field_caps requests back to the user.
This PR adds a new failure section to the response which contains exceptions per index.
In addition the response contains another field, `failed_indices`, with the number of indices that threw
an exception. If all of the requested indices fail, the whole request fails, otherwise the request succeeds
and it is up to the caller to check for potential errors in the response body.
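A sketch of the resulting response shape (index names and the exact failure layout are illustrative):
```
{
  "indices": ["index-1", "index-2", "index-3"],
  "failed_indices": 1,
  "failures": [
    {
      "indices": ["index-3"],
      "failure": {
        "error": { "type": "exception", "reason": "..." }
      }
    }
  ],
  "fields": { ... }
}
```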
Closes #68994
Today when creating an internal test cluster, we allow the test to
supply the node settings that are applied. The extension point to
provide these settings has a single integer parameter, indicating the
index (zero-based) of the node being constructed. This allows the test
to make some decisions about the settings to return, but it is too
simplistic. For example, imagine a test that wants to provide a setting,
but some values for that setting are not valid on non-data nodes. Since
the only information the test has about the node being constructed is
its index, it does not have sufficient information to determine if the
node being constructed is a non-data node or not, since this is done by
the test framework externally by overriding the final settings with
specific settings that dictate the roles of the node. This commit changes
the test framework so that the test has information about what settings
are going to be overridden by the test framework after the test provides
its test-specific settings. This allows the test to make informed
decisions about what values it can return to the test framework.
Air-gapped environments can't simply use the GeoIP database service provided by Infra, so they have to either use a proxy or recreate a similar service themselves.
This PR adds a tool to make this process easier. The basic workflow is:
1) download databases from the MaxMind site to a single directory (either .mmdb files or gzipped tarballs with .tgz suffix)
2) run the tool with `$ES_PATH/bin/elasticsearch-geoip -s directory/to/use [-t target/directory]`
3) serve static files from that directory (for example with `docker run -v directory/to/use:/usr/share/nginx/html:ro nginx`)
4) use the server above as the endpoint for GeoIpDownloader (the `geoip.downloader.endpoint` setting)
5) to update databases, simply put new files in the directory and run the tool again
This change also adds support for relative paths in the overview json, because the cli tool doesn't know the address it will be served under.
Relates to #68920
This commit adds a script parameter to long and double fields that makes
it possible to calculate a value for these fields at index time. It uses the same
script context as the equivalent runtime fields, and allows for multiple index-time
scripted fields to cross-refer while still checking for indirection loops.
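For example, a sketch of two scripted fields where one refers to the other (names hypothetical):
```
PUT my-index
{
  "mappings": {
    "properties": {
      "length_m": { "type": "double" },
      "length_cm": {
        "type": "double",
        "script": { "source": "emit(doc['length_m'].value * 100)" }
      },
      "length_mm": {
        "type": "double",
        "script": { "source": "emit(doc['length_cm'].value * 10)" }
      }
    }
  }
}
```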
In DatabaseRegistry we tried to replace a file that was still open. This is not a problem on Linux and MacOS but Windows doesn't like it.
It was caught by our CI with reproducible failures when WindowsFS was set up by Lucene.
Now we skip one temp file and use GzipInputStream directly, which fixes this problem.
Marking as a non-issue since the code has not been released yet.
Closes #70977
Closes #71006
Runtime fields currently live in their own java package. This is really
a leftover from when they were in their own module; now that they are
in core they should instead live in the common packages for classes of
their kind.
This commit makes the following moves:
org.elasticsearch.runtimefields.mapper => org.elasticsearch.index.mapper
org.elasticsearch.runtimefields.fielddata => org.elasticsearch.index.fielddata
org.elasticsearch.runtimefields.query => org.elasticsearch.search.runtime
The XFieldScript fields are moved out of the `mapper` package into
org.elasticsearch.scripts, and the `PARSE_FROM_SOURCE` default scripts
are moved from these Script classes directly into the field type classes that
use them.
We have to ship COPYRIGHT.txt and LICENSE.txt files alongside the .mmdb files for legal compliance. Infra will pack these in a single .tgz (gzipped tar) archive provided by the GeoIP databases service.
This change adds support for that format to GeoIpDownloader and DatabaseRegistry.
Previously docvalue_fields for binary values with padding did not
output the padding. We consider it to be a bug because: 1) ES would
not be able to parse these values; 2) the output from source filtering
and the fields API is different and does output the padding.
This patch fixes this by outputting the padding for binary
docvalue_fields where it is present.
This is the only remaining setter on MappedFieldType, and removing
it makes the base class entirely final. We now only override the
eagerGlobalOrdinals method on types that actually support it.
We've had a few bugs in the fields API where it doesn't behave like we'd
expect. Typically this happens because it isn't obvious what we expect. So
we'll try to use randomized testing to ferret out what we want. This adds
a test for most field types that asserts that `fields` works similarly
to `docvalue_fields`. We expect this to be true for most fields.
It does so by forcing all subclasses of `MapperTestCase` to define a
method that makes random values. It declares a few other hooks that
subclasses can override to further randomize the test.
We skip the test for a few field types that don't have doc values:
* `annotated_text`
* `completion`
* `search_as_you_type`
* `text`
We should come up with some way to test these without doc values, even
if it isn't as nice. But that is a problem for another time, I think.
We skip the test for a few more types just because I wanted to cut this
PR in half so we could get to reviewing it earlier. We'll get to those
in a follow up change.
I've filed a few bugs for things that are inconsistent with
`docvalue_fields`. Typically that means that we have to limit the
random values that we generate to those that *do* round trip properly.
A node in the GeoIpStats response can have no databases field if no databases have yet been downloaded to that node. We have to check if the key is there before processing it to avoid an NPE.
Fixes #70789
This change adds a _geoip/stats endpoint that can be used to collect basic data about the geoip downloader (successful, failed and skipped downloads, current db count and total time spent downloading).
It also fixes missing/wrong origins for clients that would break if used with security.
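A sketch of the call:
```
GET /_geoip/stats
```
which might return a response of roughly this shape (the field names are illustrative of the stats listed above):
```
{
  "stats": {
    "successful_downloads": 5,
    "failed_downloads": 1,
    "skipped_updates": 2,
    "databases_count": 3,
    "total_download_time": 697
  }
}
```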
Relates to #68920
So far the runtime section supports only leaf field types, hence the internal representation is based on `RuntimeFieldType`, which directly extends `MappedFieldType`. This is straightforward but limiting, e.g. for an alias field that points to another field, or for object fields that are not queryable directly, hence should not be a MappedFieldType, yet their subfields should be.
This commit makes `RuntimeFieldType` an interface, effectively splitting the definition of a runtime fields as defined and returned in the mappings, from its internal representation in terms of `MappedFieldType`.
The existing runtime script field types still extend `MappedFieldType` and now also implement the new interface, which makes the change rather simple.
This adds a new lexer for Painless suggestions that differs from the standard Painless lexer in that it supports types within the lexer from a PainlessLookup. This is possible because a PainlessLookup is always required to generate appropriate suggestions. This also makes the required state machines coming up much simpler to implement.
A script that creates a new object but doesn't use it,
such as by assigning it to a variable, would fail to compile
with an `ArrayIndexOutOfBoundsException`.
`new ArrayList(); return 1;` is an example that
currently fails.
This is because we do not `dup` the result of
`new` if the new object is unread.
This is a problem because we always `pop` the
operand stack at the end of a statement (see the ASM
phase's `visitStatement`). The number of `pop`s
depends on the type of expression in the statement.
A new object needs to be `pop`ed once.
However, the operand stack is empty if the
new object is not read in the statement.
This change always `dup`s the result of `new`
so `visitStatement` has something to `pop`.
Fixes: #70478
This change adjusts where the geoip tmp directory is created
to avoid issues when running multiple nodes on the same machine.
In the java tmp dir, a 'geoip-databases' directory is created, and
directly under this directory a directory with the node id as its name is created.
This allows safely running multiple nodes on the same machine (this
happens mainly during tests).
Closes #69972
Relates to #68920
A follow up to #69037, which added back the size field for the reindex API.
The original PR #43373 also removed the size field from the update by query and delete by query APIs.
This commit allows using the size field with the Compatible API for the update_by_query and delete_by_query APIs.
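As a sketch (index name and body placement are illustrative), a 7-compatible request could keep sending `size` by using the compatibility media type:
```
POST /my-index/_update_by_query
Accept: application/vnd.elasticsearch+json;compatible-with=7
Content-Type: application/vnd.elasticsearch+json;compatible-with=7

{
  "size": 10,
  "query": { "match_all": {} }
}
```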
This change modifies GeoIpDownloaderIT to wait in assertBusy despite errors (by wrapping the whole body in a try-catch) and adds an additional assertion to debug failures tracked in #69594
The Checkstyle rule that bans unary negation in favour of an explicit
`== false` has a `maximumDepth` of 2 configured, which meant that it
didn't catch all violations. The `maximumDepth` isn't required (actually
it has a really high default), so this change removes the limit and
fixes the resulting violations.
The test failure looks legit, because there is a possibility that the same database
was downloaded twice. See the added comment in the DatabaseRegistry class.
Relates to #69972
This test predefined the expected md5 hashes in constants, as expected with Java 15.
However Java 16 creates different md5 hashes, and so the expected md5 hashes don't match
the actual md5 hashes, which caused tests in this test suite to fail (when running
with Java 16 only).
The test now generates the expected md5 hash during the test instead of using predefined constants.
Closes #69986
This commit removes the metadata field _parent_join
that was needed to ensure that only one join field is used in a mapping.
It is replaced with a validation at the field level.
This change also fixes a [bug](https://github.com/elastic/kibana/issues/92960) in the handling of parent join fields in _field_caps.
This metadata field throws an unexpected exception in [7.11](https://github.com/elastic/elasticsearch/pull/63878)
when checking if the field is aggregatable.
That's now fixed since this unused field has been removed.
Wait for ingest threads to stop using the DatabaseReaderLazyLoader, so that during the next run the db update thread doesn't try to remove the db again (because the file hasn't yet been deleted).
Also delete the tmp dirs this test creates at the end of the test, so that when repeating this test many times, it doesn't accumulate many directories with database files.
Closes #69980
The transport action for FieldCapabilities assumes the meta field for a MappedFieldType
is traversable. This commit adds a requirement to MappedFieldType itself to ensure that
it is implemented for all subtypes.
This change switches clean up in DatabaseRegistry.initialize from using Files.walk and stream operations to Files.walkFileTree, which can be made more robust in case of errors.
When the geoip.downloader.enabled setting changes we should try to start/stop the geoip task from a single node only - other requests will definitely fail.
This change also extends the timeout in GeoIpDownloaderIT as the current short one sometimes fails in CI.
Today in tests we often use a utility method that creates and starts a
transport service, and then we start it again in the tests anyway. This
commit removes this unnecessary code and asserts that we only ever call
`TransportService#acceptIncomingRequests` once.
This component is responsible for making the databases maintained by GeoIpDownloader
available for ingest processors.
Also provided a lookup mechanism for geoip processors with fallback to {@link LocalDatabases}.
All databases are downloaded into a geoip tmp directory, which is created at node startup.
The following high level steps are executed after each cluster state update:
1) Check which databases are available in {@link GeoIpTaskState},
which is part of the geoip downloader persistent task.
2) For each database, check whether the database has changed,
by comparing the local and remote md5 hashes, or whether it is locally missing.
3) For each database identified in step 2, start downloading the database
chunks. Each chunk is appended to a tmp file (inside the geoip tmp dir) and
after all chunks have been downloaded, the database is uncompressed and
renamed to the final filename. After this the database is loaded and
if there is an old instance of this database then that is closed.
4) Cleanup locally loaded databases that are no longer mentioned in {@link GeoIpTaskState}.
Relates to #68920
* Script: no compile rate limit for ingest templates
Remove the compilation rate limit for ingest templates.
Creates a new context, `ingest_template`, with an unlimited
compilation rate limit.
The `template` context is used in many places so it cannot be
exempted from rate limits.
Fixes: #64595
When indexing new chunks in GeoIpDownloader we should never have to overwrite old chunks; if we try to, that means that there are 2 simultaneous executions. This change forces one of them to throw an error in such a case.
This allows many of the optimizations added in #63643 and #68871 to run
on aggregations with sub-aggregations. This should:
* Speed up `terms` aggregations on fields with less than 1000 values that
also have sub-aggregations. Locally I see 2 second searches run in 1.2
seconds.
* Applies that same speedup to `range` and `date_histogram` aggregations but
it feels less impressive because the point range queries are a little
slower to get up and go.
* Massively speed up `filters` aggregations with sub-aggregations that
don't have a `parent` aggregation or collect "other" buckets. Also
save a ton of memory while collecting them.
This change updates GeoIP database service URL to the new https://geoip.elastic.co/v1/database and removes (now optional) key/UUID parameter.
It also fixes geoip-fixture to provide 3 different test databases (City, Country and ASN).
It also unmutes GeoIpDownloaderIT.testGeoIpDatabasesDownload with additional logging and increased timeouts, which tries to address #69594
The runtime fields yaml tests are part of the x-pack yaml test suite, but the runtime fields code has been moved to server. This commit moves the yaml tests to the runtime-fields-common module. The reason why all the tests are moved to the module (rather than only the grok and dissect one which the module provides) is that painless is not available in the ordinary yaml tests and these tests rely on painless heavily.
The only tests that are left behind are the telemetry ones, as telemetry is the last bit that needs to be migrated.
As a follow-up of moving runtime fields to server, we'd like to remove the xpack plugin portions that are left. One part of this is the grok and dissect implementation which depends on grok and dissect libraries that painless does not have available. A new runtime-fields-common module is created to hold their implementations and plug in the necessary whitelists.
This commit introduces system index types that will be used to
differentiate behavior. Previously system indices were all treated the
same regardless of whether they belonged to Elasticsearch, a stack
component, or one of our solutions. Upon further discussion and
analysis this decision was not in the best interest of the various
teams and instead a new type of system index was needed. These system
indices will be referred to as external system indices. Within external
system indices, an option exists for these indices to be managed by
Elasticsearch or to be managed by the external product.
In order to represent this within Elasticsearch, each system index will
have a type and this type will be used to control behavior.
Closes #67383
We should not be ignoring and suppressing exceptions on releasing
network resources quietly in these spots.
Co-authored-by: David Turner <david.turner@elastic.co>
Currently we don't allow `keyword_marker` filter file resources to be reloaded via the
`_reload_search_analyzers` API. It would make sense to allow reloading this when the
file content has changed to allow e.g. for updating stemmer exception rules at search time
without having to close and re-open the index in question. This change adds the updateable
flag to this token filter in the same way it is used for synonym filters. Analyzers containing
updateable keyword_marker filters would not be allowed to be used at index time but at
search time only, similar to what we allow for synonym filters.
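A sketch of such a filter (file path and names hypothetical); an analyzer using it could only be assigned as a `search_analyzer`:
```
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_keyword_marker": {
          "type": "keyword_marker",
          "keywords_path": "analysis/protected_words.txt",
          "updateable": true
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_keyword_marker", "porter_stem"]
        }
      }
    }
  }
}
```
after which a change to the keywords file could be picked up with:
```
POST /my-index/_reload_search_analyzers
```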
Closes #65355
Forward port of #69525 to master branch.
When running with a Java 8 runtime, when overwriting a db file with the same content in a short time window,
the change isn't detected and no reload happens, which makes the test fail.
This change overwrites files using different source files.
Closes #69475
This change adds a query parameter confirming that we accept the ToS of the GeoIP database service provided by Infra.
It also changes the integration test to use a lower timeout when using the local fixture.
Relates to #68920
This change disables the GeoIP downloader in GeoIpProcessorNonIngestNodeIT, which should fix the failure from #69496.
It also adds clean up in GeoIpDownloaderIT which disables the downloader after each test, which should prevent similar failures in that class.
Closes #69496
This change adds a component that will download new GeoIP databases from the infra service.
New databases are downloaded in chunks and stored in the .geoip_databases index.
Downloads are verified against the MD5 checksum provided by the server.
The current state of all stored databases is kept in the cluster state in the persistent task state.
Relates to #68920
Custom geoip databases can be provided via the config/ingest-geoip directory,
and are loaded at node startup time. This change adds the functionality
that reloads custom databases at runtime.
Added reference counting when getting a DatabaseReaderLazyLoader instance,
to avoid closing a maxmind db reader while it is still being used.
There is a small window of time where this might happen during a database update.
A DatabaseReaderLazyLoader instance (which wraps a Maxmind db reader) from the config database directory:
* Is closed immediately if there are no usages of it by any geoip processor instance as part of the database reload.
* When there are usages, then it is not closed immediately after a database reload. It is closed by the caller that did the last geoip lookup using this DatabaseReaderLazyLoader instance.
This adds best-effort `toString` rendering to the various wrapping
action listeners to make `TRACE` logging more useful (e.g. when
logging rejected executions): currently such logging only prints the
top-level listener's `toString`, which isn't helpful for finding the
origin of a listener in the case of wrapped listeners.
This change also makes the `delegateX` methods less verbose to use and makes use of them
in a few spots where they weren't yet used.
The extension point that used to allow plugging in the dynamic:runtime behaviour has been removed. The corresponding abstraction (interface + x-pack impl) can now be removed.
This commit makes a start on moving runtime fields to server.
The runtime field types are now built-in. The dynamic fields builder extension (needed to make dynamic:runtime work) is removed: further simplifications are needed but are left for a follow-up. It is still possible to plug in custom runtime field types through the same extension point that the xpack plugin used to use.
The runtime fields script contexts are now also moved to server, and the corresponding whitelists are part of painless-lang with the exception of grok and dissect which will require additional work.
The runtime fields xpack plugin still exists to hold the grok and dissect extensions, which depend on the grok and dissect libs and which we will address as a follow-up. Also, the qa tests and yaml tests are yet to be moved.
Despite the need for additional work, runtime fields are fully functional and users should not be impacted by this change.
This change adds a checkNull method that will check any def reference for null when it is accessed. If the
def reference is null, it throws a more descriptive error message with the name of the method/field
that is accessed.
For the megamorphic cache, we use the checkNull as the filter instead of just getClass when looking
up if we need to compute a new MethodHandle for a changed type.
Fixes: #53129
This commit introduces a dedicated environment variable ES_JAVA_HOME to
determine the JDK used to start (if not using the bundled JDK). This
environment variable will replace JAVA_HOME. The reason that we are
making this change is because JAVA_HOME is a common environment variable
and sometimes users have it set in their environment from other JDK
applications that they have installed on their system. In this case,
they would accidentally end up not using the bundled JDK despite their
intentions. By using a dedicated environment variable specific to
Elasticsearch, we avoid this potential for conflict. With this commit,
we introduce the new environment variable, and deprecate the use of
JAVA_HOME. We will remove support for JAVA_HOME in a future commit.
Accept system property `packageSources` with mapping of
package names to source roots.
Format `<package0>:<path0>;<package1>:<path1>`.
Checks ESv2 License in addition to GPLv2.
For re-index requests, the outermost "size" field was deprecated and
renamed to "max_docs" in 7.x (#43373). With this commit, the "size" field will
remain available (but still deprecated) throughout 8.x when REST API
compatibility is requested.
relates #51816
Use the 'declaring' class's version of a method's javadoc if
the current class does not have javadoc for the method.
So, if HashMap.java's toString() has no javadoc, fall back to Object.java.
Currently, the value fetcher framework handles ignored fields by reading
the stored values of the _ignored metadata field, and passing these through
on calls to fetchValues(). However, this means that if a document has multiple
values indexed for a field, and one malformed value, then the fields API will
ignore everything, including the valid values, and return an empty list for this
document.
If a document source contains a malformed value, then it must have been
ignored at index time. Therefore, we can safely assume that if we get an
exception parsing values from source at fetch time, they were also ignored
at index time and they can be skipped. This commit moves this exception
handling directly into SourceValueFetcher and ArraySourceValueFetcher,
removing the need to inspect the _ignored metadata and fixing the case
of mixed valid and invalid values.
The currently allowed delta in this test is too strict to account for slight floating point
differences across different platforms, e.g. ARM.
Closes #68936
Also moves it to a top-level interface in fielddata. It is not only used by
DocValueFetcher any more, and Leaf does not really describe what
it does or what it provides.
This PR expands the meaning of `include_global_state` for snapshots to include system indices. If `include_global_state` is `true` on creation, system indices will be included in the snapshot regardless of the contents of the `indices` field. If `include_global_state` is `true` on restoration, system indices will be restored (if included in the snapshot), regardless of the contents of the `indices` field. Index renaming is not applied to system indices, as system indices rely on their names matching certain patterns. If restored system indices are already present, they are automatically deleted prior to restoration from the snapshot to avoid conflicts.
This behavior can be overridden to an extent by including a new field in the snapshot creation or restoration call, `feature_states`, which contains an array of strings indicating the "feature" for which system indices should be snapshotted or restored. For example, this call will only restore the `watcher` and `security` system indices (in addition to `index_1`):
```
POST /_snapshot/my_repository/snapshot_2/_restore
{
"indices": "index_1",
"include_global_state": true,
"feature_states": ["watcher", "security"]
}
```
If `feature_states` is present, the system indices associated with those features will be snapshotted or restored regardless of the value of `include_global_state`. All system indices can be omitted by providing a special value of `none` (`"feature_states": ["none"]`), or included by omitting the field or explicitly providing an empty array (`"feature_states": []`), similar to the `indices` field.
The list of currently available features can be retrieved via a new "Get Snapshottable Features" API:
```
GET /_snapshottable_features
```
which returns a response of the form:
```
{
"features": [
{
"name": "tasks",
"description": "Manages task results"
},
{
"name": "kibana",
"description": "Manages Kibana configuration and reports"
}
]
}
```
Features currently map one-to-one with `SystemIndexPlugin`s, but this should be considered an implementation detail. The Get Snapshottable Features API and snapshot creation rely upon all relevant plugins being installed on the master node.
Further, the list of feature states included in a given snapshot is exposed by the Get Snapshot API, which now includes a new field, `feature_states`, which contains a list of the feature states and their associated system indices which are included in the snapshot. All system indices in feature states are also included in the `indices` array for backwards compatibility, although explicitly requesting system indices included in a feature state is deprecated. For example, an excerpt from the Get Snapshot API showing `feature_states`:
```
"feature_states": [
{
"feature_name": "tasks",
"indices": [
".tasks"
]
}
],
"indices": [
".tasks",
"test1",
"test2"
]
```
Co-authored-by: William Brafford <william.brafford@elastic.co>
This change helps facilitate allowing maxmind databases to be updated at runtime.
This will make it easier to purge the cache if a database changes.
Made the following changes:
* Changed how geoip processor integrates with the cache. The cache is moved from the geoip processor to DatabaseReaderLazyLoader class.
* Changed the cache key from ip + response class to ip + database_path.
* Moved GeoIpCache from IngestGeoIpPlugin class to be a top level class.
This partially reverts #64016, adds #67839, and adds
additional tests that would have caught issues with the changes
in #64016. It's mostly Nik's code, I am just cleaning things up
a bit.
Co-authored-by: Nik Everett <nik9000@gmail.com>
Clean javadoc tags and strip html.
Methods and constructors have an optional `javadoc` field. All fields under
`javadoc` are optional but at least one will be present.
Fields also have optional `javadoc` field which, if present, is a string.
```
"javadoc": {
"description": "...",
// from @param <param name> <param description>
"parameters": {
"p1": "<p1 description>",
"p2": "<p2 description>"
},
// from @return
"return": "...",
// from @throws <type> <description>
"throws": [
[
"IndexOutOfBoundsException",
"<description>"
],
[
"IOException",
"<description>"
]
]
}
```
This commit renames `MatchQuery` to make it clear it's not a query. Its purpose
is actually to produce Lucene queries through its `parse` method.
It also renames `MultiMatchQuery` -> `MultiMatchQueryParser`.
When parsing Java standard library javadocs, we need to ensure
that our use will comply with the license. Our use complies
with GPLv2 licenses but may not comply with proprietary licenses.
Reject .java files that have non-GPL licenses when parsing them
for parameter names and javadoc comments.
Part 9.
We have an in-house rule to compare explicitly against `false` instead
of using the logical not operator (`!`). However, this hasn't
historically been enforced, meaning that there are many violations in
the source at present.
We now have a Checkstyle rule that can detect these cases, but before we
can turn it on, we need to fix the existing violations. This is being
done over a series of PRs, since there are a lot to fix.
Types are no longer allowed in requests in 8.0, so we can remove support for
using the `_type` field within a search request.
Relates to #41059.
Closes #68311.
At the moment, the ‘fields’ API handles nested fields the same way it handles non-nested object arrays: it just returns them in a flat list. However, the relationship between nested fields is something we should try to preserve, since this is the main purpose of mapping something as “nested” instead of just using an object.
This PR changes this by returning grouped field values that are inside a nested object according to the nested object they initially appear in. Any further object structures inside a nested object are again returned as a flattened list. Fields inside nested fields don’t appear in the flattened response outside of the nested path any more. The grouping of fields inside nested objects is applied recursively if nested mappings are defined inside another nested mapping.
Closes#63709
As of #68088 painless can have methods where all parameters must be
constant. This improves the error message when the parameter isn't. It's
still not super great, but it's better and it's what we can easily give
at that point in the compiler.
Adds extra information about the actual error in the pattern to the
error painless returns when you specify a bad pattern. This information
was hiding in the exception that `Pattern.compile` throws but isn't
included in its message so we were never showing it to anyone. These
error messages include such gems as:
* named capturing group <name> does not exist
* Look-behind group does not have an obvious maximum length
* Unclosed counted closure
Now you'll get to know what you need to change about your pattern and
not just where it went wrong!
Replaces the double `Pattern.compile` invocations in painless scripts
with the fancy constant injection we added in #68088. This caused one of
the tests to fail. It turns out that we weren't fully iterating the IR
tree during the constant folding phases. I started experimenting and
added a ton of tests that failed. Then I fixed them by changing the IR
tree walking code.
As per the new licensing change for Elasticsearch and Kibana this commit
moves existing Apache 2.0 licensed source code to the new dual license
SSPL+Elastic license 2.0. In addition, existing x-pack code now uses
the new version 2.0 of the Elastic license. Full changes include:
- Updating LICENSE and NOTICE files throughout the code base, as well
as those packaged in our published artifacts
- Update IDE integration to now use the new license header on newly
created source files
- Remove references to the "OSS" distribution from our documentation
- Update build time verification checks to no longer allow Apache 2.0
license header in Elasticsearch source code
- Replace all existing Apache 2.0 license headers for non-xpack code
with updated header (vendored code with Apache 2.0 headers obviously
remains the same).
- Replace all Elastic license 1.0 headers with new 2.0 header in xpack.
If there's no java stdlib path, `StdlibJavadocExtractor` is unnecessary.
This creates a separate code path for that case, which removes a
bunch of checking that `StdlibJavadocExtractor` is `null`.
Part 5.
We have an in-house rule to compare explicitly against `false` instead
of using the logical not operator (`!`). However, this hasn't
historically been enforced, meaning that there are many violations in
the source at present.
We now have a Checkstyle rule that can detect these cases, but before we
can turn it on, we need to fix the existing violations. This is being
done over a series of PRs, since there are a lot to fix.
Instead of fixing up array names post-hoc when generating the api spec
and painless context docs, fix them up in the `_scripts/painless/_context`
API call.
This adds a `grok` and a `dissect` method to runtime fields which
returns a `Matcher` style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with `grok` would look like this:
```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
emit(verb);
}
```
And `dissect` would look like:
```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
emit(verb);
}
```
We'll work later to get it down to a clean looking one liner, but for
now, this'll do.
The `grok` and `dissect` methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.
Closes #67825
Painless now has a `doc` source which has its own dependencies. That's
lovely for everything but Eclipse doesn't understand that sort of thing.
This adds the docs dependencies to the regular build path when building
with Eclipse.
SearchLookup has two intermediate classes, DocMap and StoredFieldsLookup, that
are simple factories for their Leaf implementations. They are never accessed
outside SearchLookup, with the exception of two calls on DocMap that can be
easily refactored. This commit removes them, making SearchLookup.getLeafSearchLookup
directly responsible for creating the leaf lookups.
* Scripting: Parse stdlib files for parameter names
* Task `generateContextApiSpec` takes optional system parameter
`jdksrc` with path to extracted java standard library source
files.
* stdlib source files are the source of `parameter_names` list
and the `javadoc` value.
* javadoc values may contain newlines and markup such as
`{@code XX}`, `<p>`, `@throws`
Example method:
```
{
"declaring": "Appendable",
"name": "append",
"return": "Appendable",
"javadoc": "Appends a subsequence of the specified character sequence to this...",
"parameters": ["CharSequence", "int", "int" ],
"parameter_names": ["csq", "start", "end"]
}
```
This PR does three things:
1) Tweaks existing reindex infrastructure so that different clients can be used for the "search" part and the "index" part of a reindex operation, and
2) Modifies Enrich to take advantage of this to perform the "search" part in the security context of the current user (so that DLS/FLS etc. are properly applied) while performing the "index" part in the security context of the Enrich plugin (so that access to system indices, and `.enrich-*` in particular, is allowed regardless of the permissions of the current user).
3) Adds integration tests for the above, to verify that Enrich does not leak info protected by DLS and/or FLS.
Co-authored-by: Jay Modi <jay.modi@elastic.co>
SourceLookup has two different ways of getting its internal source as an object
that will persist once the lookup has moved to a separate object. The first,
source(), can return null, and so only works after the source map has been
set explicitly, or one of the other map functions has been called. The second,
loadSourceIfNeeded() will never return null, and can be used to lazily load
values from stored fields.
In the past, the distinction between these two methods was important because
you could use a null check on source() to see if the source field was enabled.
We can now do this by checking the isSourceEnabled() method on
SearchExecutionContext.
This commit merges both methods into a single source() method, with the semantics
of the old loadSourceIfNeeded() method.
We have an in-house rule to compare explicitly against `false` instead
of using the logical not operator (`!`). However, this hasn't
historically been enforced, meaning that there are many violations in
the source at present.
We now have a Checkstyle rule that can detect these cases, but before we
can turn it on, we need to fix the existing violations. This is being
done over a series of PRs, since there are a lot to fix.
Stored scripts can have the content_type option set; however, when empty they default to XContentType.JSON#mediaType(). Commit 5e74f79 changed this method in master (ES8) to return application/json;charset=utf-8 (previously application/json; charset=UTF-8).
This means that when upgrading ES from version 7 to 8, a stored script will fail when being used, as the encoder is matched with string equality (map key).
This commit addresses this by adding back (in addition) the old application/json; charset=UTF-8 into the encoders map.
Closes #66986
This change upgrades to the latest Lucene 8.8.0 snapshot.
It also restores the compression on binary doc values that was lost in the last snapshot upgrade.
The compression is now configurable on binary doc values but we don't expose this functionality yet so this commit ensures that we pick the same compression mode as previous releases (BEST_COMPRESSION).
Similar to #62444 but for the outbound path.
This does not detect slowness in individual transport handler logic,
this is done via the inbound handler logging already, but instead
warns if it takes a long time to hand off the message to the relevant
transport thread and then transfer the message over the wire.
This gives some visibility into the stability of the network
connection itself and into the reasons for slow network
responses (if they are the result of slow networking on the sender).
Closes #64824. Introduce the concept of categories to deprecation
logging. Every location where we log a deprecation message must now
include a deprecation category.
- Port UrlFixture to test fixture plugin
- Avoid exposing PID and PORT for http fixture when not required
- Make AbstractHttpFixture work inside and outside docker
- Check directories when running UrlFixture
`Augmentation.java` had a zero width space [1] in two method
definitions:
```
public static String[] split(Pattern receiver, int limitFactor, CharSequence input, int limit) {
^------- Right before the ( character
public static Stream<String> splitAsStream(Pattern receiver, int limitFactor, CharSequence input) {
^ Right before the ( here too
```
Sadly, Eclipse and javac treat this character differently. Eclipse seems
to include it in the method name and javac seems to treat it as regular
space. This caused all the unit tests for painless to fail to load
because they couldn't find the `split` and `splitAsStream`
augmentations. But if you listed all of the methods they looked like
they were there. If you crack open the line in a hex editor you can see
it.
Eclipse is tracking [2] similar issues.
[1]: https://en.wikipedia.org/wiki/Zero-width_space
[2]: https://bugs.eclipse.org/bugs/show_bug.cgi?id=547601
We decided to rename `QueryShardContext` to clarify that it supports all parts
of search request execution. Before there was confusion over whether it should
only be used for building queries, or maybe only used in the query phase. This
PR also updates the javadocs.
Closes #64740.
As part of #66295 we made QueryShardContext perform mapping lookups through MappingLookup rather than MapperService. That helps as MapperService relies on DocumentMapper, which may change throughout the execution of the search request. At search time, the percolate query also needs to parse documents, which made us add a parse method to MappingLookup. Such parse method currently relies on calling DocumentMapper#parseDocument through a function, but we would like to make this easier to follow. (see https://github.com/elastic/elasticsearch/pull/66295/files#r544639868)
We recently removed the need to provide the entire DocumentMapper to DocumentParser#parse, opening the possibility for using DocumentParser directly when needing to parse a document at query time. This commit adds everything that is needed (namely Mapping, IndexSettings and IndexAnalyzers) to MappingLookup so that it can parse a document through DocumentParser without relying on DocumentMapper.
As a bonus, given that MappingLookup holds a reference to these three additional objects, we can make DocumentMapper rely on MappingLookup to retrieve those and not hold its own same references to them.
Along the same lines, given that MappingLookup holds all that's necessary to parse a document, the signature of DocumentParser#parse can be simplified by replacing most of its arguments with MappingLookup and retrieving what is needed from it.
This changes the expected error message (on FIPS) so that the
order of the templates (and their associated patterns) matches
the (newly updated) order generated by the server.
Relates: #67066
Resolves: #66820
This change fixes a problem when using a space or tab as the separator in the CSV processor - we now check if the current character is the separator before we check if it is whitespace.
This also improves tests to always check all combinations of separators and quotes.
Closes #67013
When removing the "lexer hack" to remove type context from the lexer, static inner class resolution
wasn't properly accounted for. This change adds code to handle static inner class resolution.
This commit reworks the InternalClusterInfoService to run
asynchronously, using timeouts on the stats requests instead of
implementing its own blocking timeouts. It also improves the logging of
failures by identifying the nodes that failed or timed out. Finally it
ensures that only a single refresh is running at once, enqueueing later
refresh requests to run immediately after the current refresh is
finished rather than racing them against each other.
This commit allows returning a correct requested response content-type - it did not work for versioned media types.
It is done by adding new vendor specific instances to XContent and TextFormat enums. These instances can then "format" the response content type string when provided with parameters. This is similar to what SQL plugin does with its media types.
#51816
This leniency was originally for lambda and method reference conversions, but they are both special
cased now. This change removes the unnecessary leniency of a cast from a def type to a void
type. This also fixes #66175.
Currently, an incoming document is parsed through `DocumentMapper#parse`, which in turn calls `DocumentParser#parseDocument` providing `this` among other arguments. As part of the effort to reduce usages of `DocumentMapper` where possible, as it represents the mutable side of mappings (through mappings updates) and involves complexity, we can carry around only the needed components. This does add some required arguments to `DocumentParser#parseDocument`, though it makes dependencies clearer. This change does not affect end consumers as they all go through DocumentMapper anyway, but by not needing to provide DocumentMapper to parseDocument, we may be able to unblock further improvements down the line.
Relates to #66295
This finishes porting all tasks created in gradle build scripts and plugins to use
the task avoidance api (see #56610)
* Port all task definitions to task avoidance api
* Fix last task created during configuration
* Fix test setup in :modules:reindex
* Declare proper task inputs
The "Netty loaded" YAML test asserts that the configured transport is
"netty4", however when in FIPS mode, the tests enable security and the
configured transport is "security4".
This change skips the netty4 yaml test when running in FIPS mode.
Resolves: #66818
If year, year of era, or week-based year is not specified, the ingest Java
date processor defaults the year to the current year.
However the current implementation mistook the weekBasedYear field for
weekOfWeekBasedYear. This led to incorrect defaulting.
relates #63458
We were depending on the BouncyCastle FIPS own mechanics to set
itself in approved only mode since we run with the Security
Manager enabled. The check during startup seems to happen before we
set our restrictive SecurityManager in
org.elasticsearch.bootstrap.Elasticsearch though, and this means that
BCFIPS would not be in approved only mode, unless explicitly
configured so.
This commit sets the appropriate JVM property to explicitly set
BCFIPS in approved only mode in CI and adds tests to ensure that we
will be running with BCFIPS in approved only mode when we expect to.
It also sets xpack.security.fips_mode.enabled to true for all test clusters
used in fips mode and sets the distribution to the default one. It adds a
password to the elasticsearch keystore for all test clusters that run in fips
mode.
Moreover, it changes a few unit tests where we would use bcrypt even in
FIPS 140 mode. These would still pass since we are bundling our own
bcrypt implementation, but are now changed to use FIPS 140 approved
algorithms instead for better coverage.
It also addresses a number of tests that would fail in approved only mode
Mainly:
Tests that use PBKDF2 with a password less than 112 bits (14 chars). We
elected to change the passwords used everywhere to be at least 14
characters long instead of mandating
the use of pbkdf2_stretch because both pbkdf2 and
pbkdf2_stretch are supported and allowed in fips mode and it makes sense
to test with both. We could possibly figure out the password algorithm used
for each test and adjust password length accordingly only for pbkdf2 but
there is little value in that. It's good practice to use strong passwords so if
our docs and tests use longer passwords, then it's for the best. The approach
is brittle as there is no guarantee that the next test that will be added won't
use a short password, so we add some testing documentation too.
This leaves us with a possible coverage gap, since we do support passwords
as short as 6 characters but we only test with > 14 chars; moreover, the
validation itself was not tested even before. Tests can be added in a followup,
outside of fips related context.
Tests that use a PKCS12 keystore and were not already muted.
Tests that depend on running test clusters with a basic license or
using the OSS distribution, as FIPS 140 support is not available in
either of these.
Finally, it adds some information around FIPS 140 testing in our testing
documentation reference so that developers can hopefully keep in
mind fips 140 related intricacies when writing/changing docs.
This commit introduces a new sort field called `_shard_doc` that
can be used in conjunction with a PIT to consistently tiebreak
identical sort values. The sort value is a numeric long that is
composed of the ordinal of the shard (assigned by the coordinating node)
and the internal Lucene document ID. These two values are consistent within
a PIT so this sort criteria can be used as the tiebreaker of any search
requests.
Since this sort criteria is stable we'd like to add it automatically to any
sorted search requests that use a PIT but we also need to expose it explicitly
in order to be able to:
* Reverse the order of the tiebreaking, useful to search "before" `search_after`.
* Force the primary sort to use it in order to benefit from the `search_after` optimization when sorting by index order (to be released in Lucene 8.8).
I plan to add the documentation and the automatic configuration for PIT in a follow up since this change is already big.
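For example, a sketch of a search that uses a PIT and tiebreaks explicitly on `_shard_doc` (the PIT id is elided, and the primary sort field is hypothetical):
```
GET /_search
{
  "size": 100,
  "pit": { "id": "...", "keep_alive": "1m" },
  "sort": [
    { "@timestamp": "asc" },
    { "_shard_doc": "desc" }
  ]
}
```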
Relates #56828
Except when writing actual segment files to the blob store
we always write `BytesReference` instead of a stream.
Only having the stream API available forces needless copies
on us. I fixed the straight-forward needless copying for
HDFS and FS repos in this PR, we could do similar fixes for
GCS and Azure as well and thus significantly reduce the peak
memory use of these writes on master nodes in particular.
This PR removes outdated overrides in some tests that prevent them from testing
older index versions. Also removes an old comment + logic from
AggregatorFactoriesTests.
Drops `AggregatorTestCase#mapperServiceMock` because it is getting in
the way of other work I'm doing for runtime fields. It was only
overridden to test the `parent` and `child` aggregation to add the
`MappedFieldType`s for join fields in the backdoor. Those aggregations
can just as easily add those fields in the normal method calls.
This commit fixes a bug in the search_as_you_type field that was introduced during
the refactoring of the field mapper. The prefix field that is used internally
by the search_as_you_type mapper doesn't need term vectors even if they are activated
on the main field. So this commit ensures that we don't copy the options from the main
field when we create the prefix sub-field.
Closes #66407
Adds `generateContextApiSpec` gradle task that generates whitelist api
specs under `modules/lang-painless/src/main/generated/whitelist-json`.
The common classes are in `painless-common.json`, the specialized classes
per context are in `painless-$context.json`.
eg. `painless-aggs.json` has the specialization for the aggs contexts
Refs: #49879
The dynamic:runtime setting is similar to dynamic:true in that it dynamically defines fields based on values parsed from incoming documents. Though instead of defining leaf fields under properties, it defines them as runtime fields under the runtime section. This is useful in scenarios where search speed can be traded for storage costs, given that runtime fields are loaded at runtime rather than indexed.
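A minimal sketch of enabling it (index name hypothetical):
```
PUT my-index
{
  "mappings": {
    "dynamic": "runtime"
  }
}
```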
modules/mappers-extra should be an internal cluster test, not
a javaRestTest. This test will work correctly until you
try to modify the javaRestTest test cluster. Then
it will treat the javaRestTest as an external cluster
to its own test cluster, potentially causing issues with
tests.
This tweaks the AntFixture handling to make it compliant with the task avoidance api.
Tasks of type StandaloneRestTestTask are now generally finalised by using the typed ant stop task
which allows us to remove error-prone dependsOn overrides in StandaloneRestTestTask. As a result
we also ported more task definitions in the build to task avoidance api.
The next work item regarding AntFixture handling is porting AntFixture to a plain Gradle task and removing
Groovy AntBuilder, which will allow us to port more build logic from Groovy to Java, but that is out of the scope of
this PR.
This change replaces all the member data in the ir nodes with decorations instead. This completes the
transition to a decoration system in the ir tree. This change allows for maximum flexibility when
modifying existing phases or adding additional phases.
(list -> copy -> add one -> wrap immutable) is a pretty common pattern in CS
updates and tests => added a shortcut for it here and used it in easily identifiable
spots.
This replaces the `SearchContext` passed to the ctor of `Aggregation`s
with `AggregationContext`. It ends up adding a fairly large number of
methods to `AggregationContext` but in exchange it shows a path to
removing a few methods from `SearchContext`. That seems nice!
It also gives us an accurate inventory of "all of the stuff" that
aggregations use to build and run.
This commit adds the ability to request highlights for runtime fields.
There are two changes included here:
* You can ask QueryShardContext for the index analyzer for a specific
field, and provide an optional lambda to be executed if the field is
unmapped. This allows us to return the standard Keyword analyzer by
default for fields that have not been configured in the mappings.
* HighlighterUtil.loadValues() now correctly handles values fetched
from a DocValueFetcher as well as a source fetcher, by setting the
LeafReaderContext from the passed-in HitContext.
The first change has a number of knock-on effects, notably that
MoreLikeThisQuery has to stop using its configured analyzer directly
in equals() and hashcode() checks as the anonymous analyzer
returned from QueryShardContext#getIndexAnalyzer() uses object
identity for comparisons.
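A hedged sketch of the kind of request this enables (assuming `message_words` is defined as a runtime field in the index mappings; all names hypothetical):
```
GET /my-index/_search
{
  "query": { "match": { "message_words": "error" } },
  "highlight": {
    "fields": { "message_words": {} }
  }
}
```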
In the refactoring of TextFieldMapper, we lost the ability to define
a default search or search_quote analyzer in index settings. This
commit restores that ability, and adds some more comprehensive
testing.
Fixes #65434