There are no remote invocations of any actions derived from
`TransportNodesAction` so there is no need to register the top-level
action with the `TransportService`, and that means that all the code
related to de/serialization of the top-level request and response is
unused and can be removed.
Relates #100111
Relates #100878
Each per-index process during snapshot deletion takes some nonzero
amount of working memory to hold the relevant snapshot IDs, metadata
generations and so on, which we can keep under tighter limits and
release sooner if we limit the number of per-index processes running
concurrently. That's what this commit does.
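A minimal sketch of the idea, assuming a plain `Semaphore` to bound how many
per-index cleanup tasks run at once; the limit, class names and executor here
are illustrative rather than the actual repository code:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

class PerIndexDeletionThrottler {
    // Hypothetical cap: at most 5 per-index processes hold their working
    // memory (snapshot IDs, metadata generations, ...) at any one time.
    private final Semaphore slots = new Semaphore(5);
    private final ExecutorService executor = Executors.newCachedThreadPool();

    void deleteIndexMetadata(List<String> indices) {
        for (String index : indices) {
            executor.submit(() -> {
                try {
                    slots.acquire();              // wait for a free slot
                    try {
                        cleanUpIndex(index);      // working memory is only held here
                    } finally {
                        slots.release();          // free the slot (and the memory) promptly
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    private void cleanUpIndex(String index) {
        // ... resolve and delete the per-index snapshot metadata ...
    }
}
```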
Emitted metrics could not be indexed because the
`elasticsearch.metrics.s3.exceptions` field was both a long counter and a
parent object for a histogram, and the same path cannot be a numeric leaf
field and an object at the same time. This change renames the histogram to
avoid the conflict.
There's no need for the `fslike` repository: the thread-name check it
exists to suppress already permits execution on test threads, so it does
not need suppressing. This commit replaces it with a regular `fs`
repository and cleans up a couple of other nits.
Data nodes can fold a plan (typically for missing fields) to an empty,
local relation as a logical optimization. However, the context, such as
whether the output is an aggregation or not, gets lost, which is
problematic during physical execution since the upstream aggregation
expects the intermediate states while the folded local relation returns
the final ones.
Consider the query:

```
from index | where field is not null | stats c = count()
```
On shards where the field in the filter does not exist, the filter gets
nullified, which folds the whole _local_ plan to a LocalRelation
returning c as 0. However, the data node should return the intermediate
aggregation states (count and seen) - otherwise the query fails with an
internal error (NPE) since the channel expected by the exchange is not
found.
Fixes #100807
The build version is made up of a few parts in non-release builds. Both
the snapshot and pre-release qualifiers are appended to it. These
qualifiers used to be part of Version, but in 7.0 they were moved so that
they are found only in the build info. The Build class retains these
qualifiers through the compile-time ES version extracted from the server
jar at runtime.
Build.qualifiedVersion() is supposed to provide the fully qualified
version, including snapshot and pre-release qualifiers. Yet
Build.version() also includes this information; there has been no
distinction since the qualifier was moved to live only in the build info.
This commit separates the pre-release qualifier from the version. It
maintains bwc in talking to older nodes, passing the fully qualified
version there, but in current nodes splits out the pre-release qualifier
into a new member of Build.
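As a rough illustration of the distinction between the plain version and the
fully qualified one (a hypothetical sketch; the example version string and
class are made up, not the actual Build implementation):

```java
// Splits a fully qualified build version such as "8.12.0-alpha1-SNAPSHOT"
// into the plain version, the pre-release qualifier and the snapshot flag.
class QualifiedVersion {
    final String version;      // e.g. "8.12.0"
    final String qualifier;    // pre-release qualifier, e.g. "alpha1", or null
    final boolean snapshot;    // true for snapshot builds

    QualifiedVersion(String qualifiedVersion) {
        String v = qualifiedVersion;
        this.snapshot = v.endsWith("-SNAPSHOT");
        if (snapshot) {
            v = v.substring(0, v.length() - "-SNAPSHOT".length());
        }
        int dash = v.indexOf('-');
        this.qualifier = dash >= 0 ? v.substring(dash + 1) : null;
        this.version = dash >= 0 ? v.substring(0, dash) : v;
    }

    /** Re-assembles the fully qualified form that older nodes still expect. */
    String qualifiedVersion() {
        StringBuilder sb = new StringBuilder(version);
        if (qualifier != null) {
            sb.append('-').append(qualifier);
        }
        if (snapshot) {
            sb.append("-SNAPSHOT");
        }
        return sb.toString();
    }
}
```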
Direct access to the .enrich-* indices, which are restricted system
indices, should not be granted to users. Instead, ESQL enrich lookup
should access these indices using the enrich_origin on behalf of the
user. With this change, the enrich lookup checks for the monitor_enrich
cluster privilege before performing the actual lookup with the
enrich_origin.
Spin-off from #100724
In certain scenarios, a field can be mapped both as a primitive and as an
object, causing it to be marked as unsupported and losing any potential
subfields that might have been discovered before.
This commit preserves them so that subfields are not incorrectly reported
as missing.
Fixes #100869
The test fails due to out-of-order documents in the enrich index. This
can occur when replicas are initializing during indexing. To avoid this,
we just need to ensure there are no initializing shards before starting
indexing and disable shard relocations.
Closes #99807
It appears that some freshly generated tokens fail authn under
concurrency. This change increases the verbosity of the TokenService
logging in order to track down how exactly the token is not valid for
authn.
Related: https://github.com/elastic/elasticsearch/issues/85697
If ML serverless autoscaling fails to return a response within
the configured timeout period then the control plane autoscaler
will log an error. Too many of these errors will raise an alert,
therefore as much as possible should be done on the ML side to
_not_ time out.
Previously there were two possible causes of timeouts:
1. If a request for node stats from all ML nodes timed out
2. If a request to refresh the ML memory tracker timed out
The first case can happen if a node leaves the cluster at a bad
time and the message sent to it gets lost. The second case can
happen if searching the ML results indices for model size stats
documents is slow.
We can avoid timeouts in these two situations as follows:
1. There is no need to use the API to get the only value from
the node stats that the autoscaler needs to know - the total
amount of memory on each ML node is stored in a node attribute
on startup, so it is already available in cluster state
2. When we refresh the ML memory tracker we can just return stats
that instruct the autoscaler to do nothing until the refresh
is complete - this is functionally the same as timing out each
request, but without generating error messages (see the sketch
below)
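The second point could look roughly like the following sketch; every name here
is made up for illustration and not taken from the ML code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

class MlMemoryStatsProvider {

    /** Stats that tell the autoscaler to neither scale up nor scale down. */
    record AutoscalingStats(boolean wantsScaleUp, boolean wantsScaleDown) {
        static final AutoscalingStats DO_NOTHING = new AutoscalingStats(false, false);
    }

    private final AtomicBoolean refreshInProgress = new AtomicBoolean();

    AutoscalingStats currentStats() {
        if (refreshInProgress.get()) {
            // Functionally the same as timing the request out, but the control
            // plane sees a normal "do nothing" response instead of an error.
            return AutoscalingStats.DO_NOTHING;
        }
        return statsFromRefreshedMemoryTracker();
    }

    private AutoscalingStats statsFromRefreshedMemoryTracker() {
        // ... derive real stats from the refreshed ML memory tracker ...
        return new AutoscalingStats(true, false);
    }
}
```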
Yet another test affected by the fix for showing the synthetic source,
#98808. This can trigger an assert in older versions as the mapping they
produce (without synthetic source) doesn't match the one they may get
from the master, if the latter is in version 8.10+.
Fixes #100913
```
"node_failures": [
  {
    "type": "failed_node_exception",
    "reason": "Failed node [qpdSPb3yQkuDlsI9TH7a2g]",
    "node_id": "qpdSPb3yQkuDlsI9TH7a2g",
    "caused_by": {
      "type": "transport_serialization_exception",
      "reason": "Failed to deserialize response from handler",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Unknown NamedWriteable [org.elasticsearch.compute.operator.Operator$Status][topn]"
      }
    }
  }
]
```
I hit this error when trying to retrieve ESQL tasks. The issue is that we
forgot to register the NamedWriteable for the status of TopN.
We mostly run our tests with less than 1G of heap per JVM. This means that
we will use the unpooled Netty allocator in most tests, losing us a lot of
leak coverage in internal cluster tests (mostly for inbound buffers).
Unless otherwise specified by tests, we should force the use of our standard
allocator by default to get a higher chance of catching leaks in internalClusterTests
in particular.
If the cluster state is changing quickly while searches are starting
then these captured cluster states can consume substantial memory, and
we are only interested in two values here. This commit extracts the
two relevant values in the constructor, removing the cluster state
references entirely.
Closes #100120
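As a rough illustration of the fix above (the class and the two extracted
values are placeholders, not the real search code):

```java
class SearchShardContext {
    // Before: `private final ClusterState state;` kept the whole (potentially
    // large) cluster state alive for as long as this object existed.
    // After: copy out the two values we actually need and drop the reference.
    private final long clusterStateVersion;
    private final String localNodeId;

    SearchShardContext(ClusterStateView state) {
        this.clusterStateVersion = state.version();
        this.localNodeId = state.localNodeId();
        // `state` is not stored, so it can be garbage collected even while
        // the search is still running.
    }

    long clusterStateVersion() { return clusterStateVersion; }
    String localNodeId() { return localNodeId; }

    /** Stand-in for the large cluster state object. */
    interface ClusterStateView {
        long version();
        String localNodeId();
    }
}
```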
This commit adds the possibility to create runtime fields of type geo-shape. In order to create them, users can
define an emit function that takes either a GeoJSON object or a WKT string and internally creates a geometry object.
SourceConfirmedTextQuery uses a QueryVisitor to collect terms from
its inner query to build its internal SimScorer. It is important to hold
these terms in a consistent order so that, when the scores for each term
are summed, the order of summation is the same as it would be for the
inner query. This commit changes the call to `visit` to use a
LinkedHashSet to ensure that terms are iterated in the order in which
they are collected.
Fixes #98712
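The relevant pattern looks roughly like this, using Lucene's term-collecting
visitor (the surrounding SourceConfirmedTextQuery code is omitted):

```java
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

class TermCollection {
    /**
     * Collects the terms of a query into an insertion-ordered set, so that
     * per-term scores are later summed in the same order as for the inner query.
     */
    static Set<Term> collectTermsInOrder(Query query) {
        Set<Term> terms = new LinkedHashSet<>();       // insertion-ordered, unlike a plain HashSet
        query.visit(QueryVisitor.termCollector(terms));
        return terms;
    }
}
```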
Currently, before performing operations that require the ML internal
indices be available we check whether their primary shards are active.
In stateless Elasticsearch we need to separately check whether the
indices are searchable, as search and indexing shards are separate.
Currently, before performing operations that require the transform
internal index be available we check whether its primary shard is
active.
In stateless Elasticsearch we need to separately check whether the
index is searchable, as search and indexing shards are separate.
This PR builds on top of #100464 to publish the s3 request count via the metrics API.
The metric is named `repositories.requests.count` and has
attributes/dimensions of
`{"repo_type": "s3", "repo_name": "xxx", "operation": "xxx", "purpose": "xxx"}`.
Closes: ES-6801
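Expressed with the OpenTelemetry metrics API (Elasticsearch has its own
metering abstraction, so treat this purely as a sketch of the metric's shape,
not the actual code), recording one request could look like:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

class S3RequestMetrics {
    private final Meter meter = GlobalOpenTelemetry.getMeter("es.repositories");
    private final LongCounter requestCount =
        meter.counterBuilder("repositories.requests.count").build();

    void onRequest(String repoName, String operation, String purpose) {
        // Attribute keys mirror the dimensions listed above.
        requestCount.add(1, Attributes.builder()
            .put("repo_type", "s3")
            .put("repo_name", repoName)
            .put("operation", operation)
            .put("purpose", purpose)
            .build());
    }
}
```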
We can't assert no leaked blobs here because today the first cleanup
leaves the original `RepositoryData` in place so the second cleanup is
not a no-op.
Relates #100718
Refactor testRerouteRecovery, pulling out testing of shard recovery
throttling into separate targeted tests. Now there are two additional
tests, one testing source node throttling, and another testing target
node throttling. Throttling both nodes at once leads to primarily the
source node registering throttling, while the target node mostly has
no cause to instigate throttling.
manage_enrich is a cluster privilege, not a built-in role, and it is
already documented as a cluster privilege.
This commit removes manage_enrich from the role documentation.
It also mentions the monitor_enrich privilege introduced in #99646.
Related: #85877
This commit fixes two things:
1) RotatableSecret#matches could throw a NullPointerException when the current secret is null but the prior secret is not.
2) RotatableSecret#checkExpired would not expire a prior secret when checking the same millisecond the prior secret was due to expire.
Both of these would cause intermittent test failures, the first based on randomization, the second based on timing.
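A rough sketch of the two fixes; the real RotatableSecret works with
SecureString and more state, so the names and types here are simplified:

```java
import java.time.Instant;

class RotatableSecretSketch {
    private String current;       // may be null if the secret has been cleared
    private String prior;         // previous secret, valid until priorExpiry
    private Instant priorExpiry;  // instant at which the prior secret stops matching

    /** Fix 1: do not dereference `current` when it is null but `prior` is set. */
    boolean matches(String candidate) {
        checkExpired(Instant.now());
        if (current != null && current.equals(candidate)) {
            return true;
        }
        return prior != null && prior.equals(candidate);
    }

    /** Fix 2: treat the expiry instant itself as expired, not only instants after it. */
    void checkExpired(Instant now) {
        if (prior != null && now.compareTo(priorExpiry) >= 0) { // a strict isAfter() check misses the expiry millisecond
            prior = null;
            priorExpiry = null;
        }
    }
}
```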
A task cancelled exception has REST status 400, which makes it
irrecoverable as far as transforms are concerned. This means that
a transform that suffers such an exception will fail without
doing any retries. This is bad, because a search can fail with
a task cancelled exception if one of its lower-level phases
suffers a circuit breaker exception. We want transforms to retry
in the event that there is temporarily not enough memory
for a search.