When shutting down a node, auto-followers will keep trying to run. This
is happening even as transport services and other components are being
closed. In some cases, this can lead to a stack overflow as we rapidly
try to check the license state of the remote cluster, can not because
the transport service is shutdown, and then immeidately retry
again. This can happen faster than the shutdown, and we die with stack
overflow. This commit adds a stop command to auto-followers so that this
retry loop occurs at most once on shutdown.
Today a coordinating node forces a final reduction of sibling pipeline aggregators whenever reducing aggs, unless it is reducing aggs incrementally. This works well for incremental reduction of aggs, but breaks CCS when minimizing roundtrips as each cluster ends up reducing its own pipeline aggregators locally while that should only be done by the CCS coordinating node later. This causes issues as after their reduction, pipeline aggs cannot be further reduced, which is what happens with CCS causing errors like "java.lang.UnsupportedOperationException: Not supported" being returned.
Each coordinating node should rather honour the reduce context flag that
indicates whether we are executing a final reduce or not. If not, it should leave the sibling pipeline aggregations alone.
Note that his bug affects only pipeline aggs that don't have a parent in
the aggs tree, while all the others work well.
Relates to #40059 but does not fix it yet, as the CCS coordinating node also needs to be adapted to recreate sibling pipeline aggregators from the request.
When minimizing round-trips, each cluster returns its own independent
search response. In case sort by field and/or field collapsing were
requested, when one cluster has no results to return, the information
about the field that sorting was based on (SortField array) as well as
the field (and the values) that collapsing was performed on are missing
in the search response. That causes problems as we can't build the
proper `TopDocs` instance which would need to be either `TopFieldDocs`
or `CollapseTopFieldDocs`. The merge routine expects that all the top
docs are of the same exact type which can't be guaranteed. Given that
the problematic results are empty, hence have no impact on the final
results, we can simply skip them.
Relates to #32125Closes#40067
This commit clarifies how the gateway selection works when configuring
remote clusters for CCR or CCS. Specifically, it clarifies compatibility
between different versions which is a very common question.
The `GET /_cluster/state` API returns an internal representation of the cluster
state that does change from version to version. It's useful for debugging, but
it is not intended for regular use by clients.
This change adjusts the documentation of `GET /_cluster/state` to clarify that
this API yields an internal representation that should not be expected to
remain stable between versions.
Relates #40061, #40016
This change adapts the bwc version check for the new numeric_type option of the FieldSortBuilder.
Since this change needs to be backported to 7x, all bwc tests are temporary disabled
until https://github.com/elastic/elasticsearch/pull/40084 is merged.
Relates #38095
This commit enables full-cluster-restart and rolling-upgrade tests
to run with nodes using a JVM in fips approved only node by using
PEM key material instead of a JKS for the transport layer in that
case.
With SUN security provider, a CertificateException is thrown when
attempting to parse a Certificate from a PEM file on disk with
`sun.security.provider.X509Provider#parseX509orPKCS7Cert`
When using the BouncyCastle Security provider (as we do in fips
tests) the parsing happens in
CertificateFactory#engineGenerateCertificates which doesn't throw
an exception but returns an empty list.
In order to have a consistent behavior, this change makes it so
that we throw a CertificateException when attempting to read
a PEM file from disk and failing to do so in either Security
Provider
Resolves: #39580
When the method ensureGreen in QA tests is timed out, it does not
provide enough info for us to investigate why the testing index is
not green yet. With this change, we will dump the cluster state if
ensureGreen timed out.
Relates #32027
`SecurityIndexManager` is hardcoded to handle only the `.security`-`.security-7` alias-index pair.
This commit removes the hardcoded bits, so that the `SecurityIndexManager` can be reused
for other indices, such as the planned security tokens index (`.security-tokens-7`).
This change adjusts the LDAP connection timeout for retrieving
attributes while performing the SAML IT to 5 seconds, from 5 ms
that it previously was.
Resolves: #40025
In 6.7.0 we introduced a system property to return the old behavior of
computing and sending the compressed size of the cluster state on
cluster state endpoints. In 7.0.0, we removed this behavior but hard
failed usage of this system property (to clearly inform users this
behavior is gone). In this commit, intending to target 8.0.0 only, we
remove this detection as it should no longer be needed; that is, we now
treat this as dead code.
When an auto-follower coordinator times out waiting for the remote
cluster state, we do not log any indication of this. While this is
expected behavior in quiet deployments, it is still useful to see this
information for tracing the behavior of the auto-follow
coordinator. This commit adds a trace log message indicating that the
timeout.
The Migration Assistance API has been functionally replaced by the
Deprecation Info API, and the Migration Upgrade API is not used for the
transition from ES 6.x to 7.x, and does not need to be kept around to
repair indices that were not properly upgraded before upgrading the
cluster, as was the case in 6.
Currently, we maintain a transport name ("mock-nio", "nio", "netty")
that is passed to a `TcpTransportChannel` when a request is received.
The value of this name is to associate with the task when we register a
task with the task manager. However, it is only possible to run ES with
one transport, so having an implementation specific name is unnecessary.
This commit removes the name and replaces it with the generic
"transport".
Currently if a field alias is updated, any percolator queries that contain the
alias will still refer to its old target. This PR documents the issue while we
look into addressing it.
Relates to #37212.
Since other classes besides intervals can be serialized as part of
the Cursor, the getNamedWritables method should be moved from Intervals
to a more generic class Literals.
Relates to #39973
This commit removes the cluster state size field from the cluster state
response, and drops the backwards compatibility layer added in 6.7.0 to
continue to support this field. As calculation of this field was
expensive and had dubious value, we have elected to remove this field.
The `sampler` agg creates a BestDocsDeferringCollector, which internally
initializes a priority queue of size `shardSize`. This queue is
populated with empty `Object` sentinels, which is roughly 16b per
object.
Similarly, the Diversified samplers create a DiversifiedTopDocsCollectors
which internally track PQ slots with ScoreDocKeys, weighing in around
28kb
If the user sets a very abusive `shard_size`, this could easily OOM
a node or cluster since these PQ are allocated up-front without
any checks.
This commit makes sure that when we create the collector, it
cannot be larger than the maxDoc so that we don't accidentally blow
up the node. We ensure the size is not greater than the overall
index maxDoc. A similar treatment is done for `maxDocsPerValue`
parameter of the diversified samplers
For good measure, this also adds in some CB accounting to try and track
memory usage.
Finally, a redundant array creation is removed to reduce a bit of
temporary memory.
* Fix IndexSearcherWrapper visibility
This change adds a wrapper for IndexSearcher that makes IndexSearcher#search(List, Weight, Collector) visible by
sub-classes. The wrapper is used by the ContextIndexSearcher to call this protected method on a searcher created by a plugin.
This ensures that an override of the protected method in an IndexSearcherWrapper plugin is called when a search is executed.
Closes#30758
Changed default of compress setting from false to true for blob store
repositories. This aligns the code with existing documentation and also
seems like the better default.
If multiple jobs are created together and the anomaly
results index does not exist then some of the jobs could
fail to update the mappings of the results index. This
lead them to fail to write their results correctly later.
Although this scenario sounds rare, it is exactly what
happens if the user creates their first jobs using the
Nginx module in the ML UI.
This change fixes the problem by updating the mappings
of the results index if it is found to exist during a
creation attempt.
Fixes#38785
This change adds an option to the `FieldSortBuilder` that allows to transform the type
of a numeric field into another. Possible values for this option are `long` that transforms
the source field into an integer and `double` that transforms the source field into a floating point.
This new option is useful for cross-index search when the sort field is mapped differently on some
indices. For instance if a field is mapped as a floating point in one index and as an integer in another
it is possible to align the type for both indices using the `numeric_type` option:
```
{
"sort": {
"field": "my_field",
"numeric_type": "double" <1>
}
}
```
<1> Ensure that values for this field are transformed to a floating point if needed.
We call `ensureConnections()` to undo the effects of a disruption. However, it
is possible that one or more targets are currently CONNECTING and have been
since the disruption was active, and that the connection attempt was thwarted
by a concurrent disruption to the connection. If so, we cannot simply add our
listener to the queue because it will be notified when this CONNECTING activity
completes even though it was disrupted. We must therefore wait for all the
current activity to finish and then go through and reconnect to any missing
nodes.
Closes#40030.
This commit adds a variant for every official distribution that omits
the bundled jdk. The "no-jdk" naming is conveyed through the package
classifier, alongside the platform. Package tests are also added for
each new distribution.
* Handle zero byte from stdin in keystore command
This change ensures that we do not make assumptions about the length
of the input that we can read from the stdin. It still consumes only
one line, as the previous implementation
* address feedback
* Allow eventual IOException to be thrown
Today we load the shard history retention leases from disk whenever opening the
engine, and treat a missing file as an empty set of leases. However in some
cases this is inappropriate: we might be restoring from a snapshot (if the
target index already exists then there may be leases on disk) or
force-allocating a stale primary, and in neither case does it make sense to
restore the retention leases from disk.
With this change we write an empty retention leases file during recovery,
except for the following cases:
- During peer recovery the on-disk leases may be accurate and could be needed
if the recovery target is made into a primary.
- During recovery from an existing store, as long as we are not
force-allocating a stale primary.
Relates #37165
This is an adjustment to the logging line that was added in
8e5ba9a1e4. This is necessary to
troubleshoot intermittent errors in SsmlAuthenticationIT in
CI that are not reproducible locally.