This PR makes the PUT shutdown API idempotent and allows switching
between shutdown types.
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
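As a rough illustration of the idempotent PUT behaviour described above, the sketch below uses a hypothetical in-memory registry. The class name, the `Type` enum values, and the boolean return are assumptions made for the example and are not the real Elasticsearch classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the shutdown metadata; not the actual implementation.
final class ShutdownRegistrySketch {
    // Assumed shutdown types, named after the restart and removal cases.
    enum Type { RESTART, REMOVE }

    private final Map<String, Type> shutdowns = new HashMap<>();

    /**
     * Registers or updates a shutdown for a node.
     * Repeating the same request is a no-op; a different type replaces the old one.
     * Returns true only if the stored state changed.
     */
    boolean put(String nodeId, Type type) {
        Type previous = shutdowns.put(nodeId, type);
        return previous != type;
    }

    Type get(String nodeId) {
        return shutdowns.get(nodeId);
    }
}
```

Calling `put` twice with the same node and type leaves the registry unchanged, while a second call with a different type switches the registered shutdown.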
This commit cleans up a few issues with the Node Shutdown API, specifically:
- GET'ing the status of a node that's registered for shutdown, but not in the cluster, now produces a useful response instead of an NPE.
- The explanation of the `STALLED` status now calls out that the Allocation Explain API should be used on the given shard in particular. We may handle this differently soon, but this improves things for now.
- DELETE'ing a non-existent shutdown request now returns HTTP code 404 instead of 400.
This converts the system property feature flag 'es.shutdown_feature_flag_enabled' to a regular
non-dynamic node setting. This setting can only be set to 'true' on a snapshot build of
Elasticsearch (not a release build).
Relates to #70338
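The snapshot-only restriction on the feature flag setting can be pictured with a small validation sketch. The method name and both parameters are hypothetical; the real check lives in the settings infrastructure and may look quite different.

```java
// Hypothetical validation helper; names and signature are assumptions for illustration.
final class ShutdownFlagSketch {
    /**
     * Returns the effective flag value, rejecting 'true' on a release build.
     *
     * @param requested       the value configured on the node
     * @param isSnapshotBuild whether this is a snapshot (non-release) build
     */
    static boolean resolve(boolean requested, boolean isSnapshotBuild) {
        if (requested && !isSnapshotBuild) {
            throw new IllegalArgumentException(
                "the shutdown feature flag can only be enabled on a snapshot build");
        }
        return requested;
    }
}
```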
This commit modifies the Node Shutdown Status API to include the number of shards left to migrate off of the node in question, and to report whether shard migration is stalled.
"Stalled" is defined as no shards currently relocating off the node, and at least one shard on the node which cannot move per current allocation deciders.
Relates #70338
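The "stalled" definition above can be sketched as a simple function of three counts. The class, enum, and parameter names are assumptions for the example; the real status computation consults the allocation deciders rather than precomputed counters.

```java
// Minimal sketch of the migration status rule; not the actual implementation.
final class MigrationStatusSketch {
    enum Status { COMPLETE, IN_PROGRESS, STALLED }

    /**
     * @param shardsRemaining      shards still on the node
     * @param shardsRelocatingAway shards currently relocating off the node
     * @param shardsUnableToMove   shards on the node that the deciders will not let move
     */
    static Status of(int shardsRemaining, int shardsRelocatingAway, int shardsUnableToMove) {
        if (shardsRemaining == 0) {
            return Status.COMPLETE;
        }
        // Stalled: nothing is relocating and at least one shard cannot move.
        if (shardsRelocatingAway == 0 && shardsUnableToMove > 0) {
            return Status.STALLED;
        }
        return Status.IN_PROGRESS;
    }
}
```

Note that a node with an immovable shard is still reported as in progress while other shards are relocating, matching the definition that both conditions must hold.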
Change the formatter config to sort / order imports, and reformat the
codebase. We already had a config file for Eclipse users, so Spotless now
uses that.
The "Eclipse Code Formatter" plugin ought to be able to use this file as
well for import ordering, but in my experiments the results were poor.
Instead, use IntelliJ's `.editorconfig` support to configure import
ordering.
I've also added a config file for the formatter plugin.
Other changes:
* I've quietly enabled the `toggleOnOff` option for Spotless. It was
already possible to disable formatting for sections using the markers
for docs snippets, so enabling this option just accepts this reality
and makes it possible via `formatter:off` and `formatter:on` without
the restrictions around line length. It should still only be used as
a very last resort and with good reason.
* I've removed mention of the `paddedCell` option from the contributing
guide, since I haven't had to use that option for a very long time. I
moved the docs to the spotless config.
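For reference, the `formatter:off` and `formatter:on` markers mentioned above might be used as in the sketch below. It assumes the markers are written as line comments with the Eclipse-style `@` prefix; the class and its contents are purely illustrative.

```java
// Illustrative class; only the marker comments matter here.
final class FormatterToggleSketch {
    // @formatter:off
    // Hand-aligned layout that the formatter would otherwise reflow.
    static final int[][] IDENTITY = {
        { 1, 0, 0 },
        { 0, 1, 0 },
        { 0, 0, 1 },
    };
    // @formatter:on

    /** Sums the diagonal of the matrix above. */
    static int trace() {
        int sum = 0;
        for (int i = 0; i < IDENTITY.length; i++) {
            sum += IDENTITY[i][i];
        }
        return sum;
    }
}
```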
ParseField is part of the x-content lib, yet it doesn't exist under the
same root package as the rest of the lib. This commit moves the class to
the appropriate package.
relates #73784
This commit makes ILM aware of different parts of the node shutdown lifecycle. It consists of two
main parts: reacting to the state during execution, and signaling the status of shutdown from ILM.
Reacting to shutdown state
ILM now considers nodes that are going to be shut down when deciding which node to assign for the
shrink action. It uses the `NodeShutdownAllocationDecider` within the `SetSingleNodeAllocateStep` to
not assign shards to a node that will be removed. If an index is already past this step and waiting
for allocation, this commit adds an `isCompletable` method to the
`ClusterStateWaitUntilThresholdStep` so that an allocation that cannot happen can be rewound and
retried on another (non-shutdown) node.
Signaling shutdown status
This commit introduces the `PluginShutdownService` which deals with `ShutdownAwarePlugin` classes.
This class is used to signal shutdowns to plugins, and also to gather the status of a shutdown from
these plugins. ILM implements this `ShutdownAwarePlugin` to signal if an index is in a step that is
unsafe, such as the actual shrink step, so that shutdown will wait until after the allocation rules
have been removed by ILM.
This commit also hooks up the get shutdown API response to consider the statuses of its parts (see
`SingleNodeShutdownMetadata.Status#combine`) when creating a response.
Relates to #70338
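One plausible way to combine the statuses of the individual parts into an overall status is a "worst status wins" rule, sketched below. The precedence order (stalled, then in progress, then not started, then complete) is an assumption for illustration and is not taken from this commit message; see `SingleNodeShutdownMetadata.Status#combine` for the actual logic.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of status combination; the precedence order is assumed.
final class CombineStatusSketch {
    enum Status { NOT_STARTED, IN_PROGRESS, STALLED, COMPLETE }

    /** Returns the least-advanced or most problematic status among the parts. */
    static Status combine(Status... parts) {
        List<Status> list = Arrays.asList(parts);
        if (list.contains(Status.STALLED)) {
            return Status.STALLED;
        }
        if (list.contains(Status.IN_PROGRESS)) {
            return Status.IN_PROGRESS;
        }
        if (list.contains(Status.NOT_STARTED)) {
            return Status.NOT_STARTED;
        }
        // Only reached when every part is COMPLETE (or there are no parts).
        return Status.COMPLETE;
    }
}
```

Under this rule, a single plugin that reports an unsafe step keeps the overall shutdown from being reported as complete.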
The cluster will not automatically react to the node shutdown being
registered unless we notify it that something has changed that may
require a change in shard allocation. This commit modifies the Put
and Delete Shutdown actions to invoke a reroute after the cluster
state has been updated, and adds an integration test to verify that
shards quickly move away from nodes which are shutting down for
removal.
Extract usage of internal API from TestClustersPlugin and PluginBuildPlugin and related plugins and build logic
This includes a refactoring of ElasticsearchDistribution to handle types
better, so that we can differentiate between Elasticsearch distribution
types supported in TestClustersPlugin and types only supported in
internal plugins.
It also introduces a set of internal versions of public plugins.
As part of this we also generate the plugin descriptors now.
As a follow-up, we can move these publicly used classes into
an extra project (declared as an included build).
We keep LoggedExec and VersionProperties effectively public, and add a workaround for RestTestBase.
This commit ensures that node shutdown metadata is cleaned up between
tests, as it causes unrelated tests to fail if a test leaves node
shutdown metadata in place.
Originally these were stored in the cluster state using a single class; however, they will need to
be different objects without common parts, and they will be calculated on the fly rather than
persisted into cluster state.
This removes the NodeShutdownComponentStatus class, as it's no longer needed.
Relates to #70338
This PR adds an allocation decider which uses the metadata managed by the Node Shutdown API to prevent shards from being allocated to nodes which are preparing to be removed from the cluster.
Additionally, shards will not be auto-expanded to nodes which are preparing to restart, instead waiting until after the restart is complete to expand the shard replication.
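The two decider rules above can be sketched as follows. This is a simplified stand-in and not the real `AllocationDecider` API: the class, the enums, and the map of node IDs to shutdown types are all assumptions made for the example.

```java
import java.util.Map;

// Simplified stand-in for the shutdown-aware allocation decider; not the real API.
final class ShutdownDeciderSketch {
    enum Type { RESTART, REMOVE }
    enum Decision { YES, NO }

    /** Shards must not be allocated to a node that is preparing to be removed. */
    static Decision canAllocate(String nodeId, Map<String, Type> shutdowns) {
        return shutdowns.get(nodeId) == Type.REMOVE ? Decision.NO : Decision.YES;
    }

    /**
     * Replicas are not auto-expanded to a node that is preparing to restart.
     * Assumption: a node being removed is excluded as well, since it cannot
     * receive shards anyway.
     */
    static Decision shouldAutoExpandTo(String nodeId, Map<String, Type> shutdowns) {
        return shutdowns.containsKey(nodeId) ? Decision.NO : Decision.YES;
    }
}
```

In this sketch a restarting node can still be allocated to in general, but auto-expansion waits until its shutdown record is gone.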
This commit changes the `PersistentTasksClusterService` to limit nodes for a task to a subset of
nodes (candidates) that are not currently shutting down.
It does not yet cancel tasks that may already be running on the nodes that are shut down; that will
be added in a subsequent request.
Relates to #70338
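The candidate-node filtering described above amounts to excluding nodes with a registered shutdown before a task is assigned. A minimal sketch, using plain node ID strings and hypothetical names rather than the real `PersistentTasksClusterService` types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the real service works on DiscoveryNode objects and cluster state.
final class CandidateNodesSketch {
    /** Returns the nodes that are not shutting down, preserving input order. */
    static List<String> candidates(List<String> allNodes, Set<String> shuttingDown) {
        List<String> result = new ArrayList<>();
        for (String node : allNodes) {
            if (!shuttingDown.contains(node)) {
                result.add(node);
            }
        }
        return result;
    }
}
```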
* Warn users if security is implicitly disabled
Elasticsearch has security features implicitly disabled by default for
Basic and Trial licenses, unless explicitly set in the configuration
file.
This may be good for onboarding, but it can also lead to unintentionally
insecure clusters.
This change introduces clear warnings when security features are
implicitly disabled.
- a warning header in each REST response if security is implicitly
disabled;
- a log message during cluster boot.
This commit hooks up the Node Shutdown API to the Node Shutdown cluster
metadata, so using the API will result in the appropriate writes to the
cluster state.
This commit adds the REST endpoints for the node shutdown API. These APIs are behind the
`es.shutdown_feature_flag_enabled` feature flag for now, as development is ongoing.
Currently these APIs do not do anything, returning immediately. We plan to implement them for real
in subsequent work.
Relates to #70338