keepalives tell any intermediate devices that the connection remains alive, which helps with overzealous firewalls that are
killing idle connections. keepalives are enabled by default in Elasticsearch, but use system defaults for their
configuration, which often times do not have reasonable defaults (e.g. 7200s for TCP_KEEP_IDLE) in the context of
distributed systems such as Elasticsearch.
This PR sets the socket-level keep_alive options for network.tcp.{keep_idle,keep_interval} to 5 minutes on configurations
that support it (>= Java 11 & (MacOS || Linux)) and where the system defaults are set to something higher than 5
minutes. This helps keep the connections alive while not interfering with system defaults or user-specified settings
unless they are deemed to be set too high by providing better out-of-the-box defaults.
Today, we send operations in phase2 of peer recoveries batch by batch
sequentially. Normally that's okay as we should have a fairly small of
operations in phase 2 due to the file-based threshold. However, if
phase1 takes a lot of time and we are actively indexing, then phase2 can
have a lot of operations to replay.
With this change, we will send multiple batches concurrently (defaults
to 1) to reduce the recovery time.
Part of #48366. Add documentation for the dangling indices
API added in #58176.
Co-authored-by: David Turner <david.turner@elastic.co>
Co-authored-by: Adam Locke <adam.locke@elastic.co>
* Adding ESS icons to supported ES settings.
* Adding new file for supported ESS settings.
* Adding supported ESS settings for HTTP and disk-based shard allocation.
* Adding more supported settings for ESS.
* Adding descriptions for each Cloud section, plus additional settings.
* Adding new warehouse file for Cloud, plus additional settings.
* Adding node settings for Cloud.
* Adding audit settings for Cloud.
* Resolving merge conflict.
* Adding SAML settings (part 1).
* Adding SAML realm encryption and signing settings.
* Adding SAML SSL settings.
* Adding Kerberos realm settings.
* Adding OpenID Connect Realm settings.
* Adding OpenID Connect SSL settings.
* Resolving leftover Git merge markers.
* Removing Cloud settings page and link to it.
* Add link to mapping source
* Update docs/reference/docs/reindex.asciidoc
* Incorporate edit of HTTP settings
* Remove "cloud" from tag and ID
* Remove "cloud" from tag and update description
* Remove "cloud" from tag and ID
* Change "whitelists" to "specifies"
* Remove "cloud" from end tag
* Removing cloud from IDs and tags.
* Changing link reference to fix build issue.
* Adding index management page for missing settings.
* Removing warehouse file for Cloud and moving settings elsewhere.
* Clarifying true/false usage of http.detailed_errors.enabled.
* Changing underscore to dash in link to fix ci build.
* Forbid read-only-allow-delete block in blocks API
The read-only-allow-delete block is not really under the user's control
since Elasticsearch adds/removes it automatically. This commit removes
support for it from the new API for adding blocks to indices that was
introduced in #58094.
* Missing xref
* Reword paragraph on read-only-allow-delete block
Today the disk-based shard allocator accounts for incoming shards by
subtracting the estimated size of the incoming shard from the free space on the
node. This is an overly conservative estimate if the incoming shard has almost
finished its recovery since in that case it is already consuming most of the
disk space it needs.
This change adds to the shard stats a measure of how much larger each store is
expected to grow, computed from the ongoing recovery, and uses this to account
for the disk usage of incoming shards more accurately.
Restoring from a snapshot (which is a particular form of recovery) does not currently take recovery throttling into account
(i.e. the `indices.recovery.max_bytes_per_sec` setting). While restores are subject to their own throttling (repository
setting `max_restore_bytes_per_sec`), this repository setting does not allow for values to be configured differently on a
per-node basis. As restores are very similar in nature to peer recoveries (streaming bytes to the node), it makes sense to
configure throttling in a single place.
The `max_restore_bytes_per_sec` setting is also changed to default to unlimited now, whereas previously it was set to
`40mb`, which is the current default of `indices.recovery.max_bytes_per_sec`). This means that no behavioral change
will be observed by clusters where the recovery and restore settings were not adapted.
Relates https://github.com/elastic/elasticsearch/issues/57023
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Today we have individual settings for configuring node roles such as
node.data and node.master. Additionally, roles are pluggable and we have
used this to introduce roles such as node.ml and node.voting_only. As
the number of roles is growing, managing these becomes harder for the
user. For example, to create a master-only node, today a user has to
configure:
- node.data: false
- node.ingest: false
- node.remote_cluster_client: false
- node.ml: false
at a minimum if they are relying on defaults, but also add:
- node.master: true
- node.transform: false
- node.voting_only: false
If they want to be explicit. This is also challenging in cases where a
user wants to have configure a coordinating-only node which requires
disabling all roles, a list which we are adding to, requiring the user
to keep checking whether a node has acquired any of these roles.
This commit addresses this by adding a list setting node.roles for which
a user has explicit control over the list of roles that a node has. If
the setting is configured, the node has exactly the roles in the list,
and not any additional roles. This means to configure a master-only
node, the setting is merely 'node.roles: [master]', and to configure a
coordinating-only node, the setting is merely: 'node.roles: []'.
With this change we deprecate the existing 'node.*' settings such as
'node.data'.
* Changes for issue #36114.
* Adding stronger wording to the new note.
* Removing statement about typically not needting to set the read-only allow delete block.
* Replacing Elasticsearch with {es} variable.
Elasticsearch enables HTTP compression by default. However, to mitigate
potential security risks like the BREACH attack, compression is disabled by
default if HTTPS is enabled.
This updates the `http.compression` setting definition accordingly and adds
additional context.
Co-authored-by: Leaf-Lin <39002973+Leaf-Lin@users.noreply.github.com>
* Moves `Discovery and cluster formation` content from `Modules` to
`Set up Elasticsearch`.
* Combines `Adding and removing nodes` with `Adding nodes to your
cluster`. Adds related redirect.
* Removes and redirects the `Modules` page.
* Rewrites parts of `Discovery and cluster formation` to remove `module`
references and meta references to the section.
* [DOCS] Create top-level "Search your data" page
**Goal**
Create a top-level search section. This will let us clean up our search
API reference docs, particularly content from [`Request body search`][0].
**Changes**
* Creates a top-level `Search your data` page. This page is designed to
house concept and tutorial docs related to search.
* Creates a `Run a search` page under `Search your data`. For now, This
contains a basic search tutorial. The goal is to add content from
[`Request body search`][0] to this in the future.
* Relocates `Long-running searches` and `Search across clusters` under
`Search your data`. Increments several headings in that content.
* Reorders the top-level TOC to move `Search your data` higher. Also
moves the `Query DSL`, `EQL`, and `SQL access` chapters immediately
after.
Relates to #48194
[0]: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-request-body.html
The following settings are now no-ops:
* xpack.flattened.enabled
* xpack.logstash.enabled
* xpack.rollup.enabled
* xpack.slm.enabled
* xpack.sql.enabled
* xpack.transform.enabled
* xpack.vectors.enabled
Since these settings no longer need to be checked, we can remove settings
parameters from a number of constructors and methods, and do so in this
commit.
We also update documentation to remove references to these settings.
The disk decider had special handling for the single data node case,
allowing any allocation (skipping watermark checks) for such clusters.
This special handling can now be avoided via a setting.
* Scripting: Deprecate general cache settings
* Add script.disable_max_compilations_rate setting
* Move construction to ScriptCache
* Use ScriptService to do updates of CacheHolder
* Remove fallbacks
* Add SCRIPT_DISABLE_MAX_COMPILATIONS_RATE_SETTING to ClusterSettings
* Node scope
* Use back compat
* 8.0 for bwc
* script.max_compilations_rate=2048/1m -> script.disable_max_compilations_rate=true in docker compose
* do not guard in esnode
* Doc update
* isSnapshotBuild() -> systemProperty 'es.script.disable_max_compilations_rate', 'true'
* Do not use snapshot in gradle to set max_compilations_rate
* Expose cacheHolder as package private
* monospace 75/5m in cbreaker docs, single space in using
* More detail in general compilation rate error
* Test: don't modify defaultConfig on upgrade
We believe there's no longer a need to be able to disable basic-license
features completely using the "xpack.*.enabled" settings. If users don't
want to use those features, they simply don't need to use them. Having
such features always available lets us build more complex features that
assume basic-license features are present.
This commit deprecates settings of the form "xpack.*.enabled" for
basic-license features, excluding "security", which is a special case.
It also removes deprecated settings from integration tests and unit
tests where they're not directly relevant; e.g. monitoring and ILM are
no longer disabled in many integration tests.
Relocates the "Remote Clusters" documentation from the "Modules" section to the "Set up Elasticsearch" section.
Supporting changes:
* Reorders the "Bootstrap checks for X-Pack" section to immediately follow the "Bootstrap checks"chapter.
* Removes an outdated X-Pack `idef` from the "Remote Clusters" intro.
Some of these characters are special to Asciidoctor and they ruin the
rendering on this page. Instead, we use a macro to passthrough these
characters without Asciidoctor applying any subtitutions to them. This
commit then addresses some rendering issues in the thread pool docs.
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
The use of available processors, the terminology, and the settings
around it have evolved over time. This commit cleans up some places in
the codes and in the docs to adjust to the current terminology.
The `indices.recovery.max_bytes_per_sec` recovery bandwidth limit can differ
between nodes if it is not set dynamically, but today this is not obvious. This
commit adds a paragraph to its documentation clarifying how to set different
bandwidth limits on each node.
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
It is useful to be able to delay state recovery until enough data nodes have
joined the cluster, since this gives the shard allocator a decent opportunity
to re-use as much existing data as possible. However we also have the option to
delay state recovery until a certain number of master-eligible nodes have
joined, and this is unnecessary: we require a majority of master-eligible nodes
for state recovery, and there is no advantage in waiting for more.
This commit deprecates the unnecessary settings in preparation for their
removal.
Relates #51806
The introductory sections of the reference manual contains some simplified
instructions for adding a node to the cluster. Unfortunately they are a little
too simplified and only really work for clusters running on `localhost`. If you
try and follow these instructions for a distributed cluster then the new node
will, confusingly, auto-bootstrap itself into a distinct one-node cluster.
Multiple nodes running on localhost is a valid config, of course, but we should
spell out that these instructions are really only for experimentation and that
it takes a bit more work to add nodes to a distributed cluster. This commit
does so.
Also, the "important config" instructions for discovery say that you MUST set
`discovery.seed_hosts` whereas in fact it is fine to ignore this setting and
use a dynamic discovery mechanism instead. This commit weakens this statement
and links to the docs for dynamic discovery mechanisms.
Finally, this section is also overloaded with some technical details that are
not important for this context and are adequately covered elsewhere, and
completely fails to note that the default discovery port is 9300. This commit
addresses this.
Explicitly notes the Elasticsearch API endpoints that support CCS.
This should deter users from attempting to use CCS with other API
endpoints, such as `GET <index>/_doc/<_id>`.
Updates the cross-cluster search (CCS) documentation to note how
cluster-level settings are applied.
When `ccs_minimize_roundtrips` is `true`, each cluster applies its own
cluster-level settings to the request.
When `ccs_minimize_roundtrips` is `false`, cluster-level settings for
the local cluster is used. This includes shard limit settings, such as
`action.search.shard_count.limit`, `pre_filter_shard_size`, and
`max_concurrent_shard_requests`. If these limits are set too low, the
request could be rejected.
Today we use `cluster.join.timeout` to prevent nodes from waiting indefinitely
if joining a faulty master that is too slow to respond, and
`cluster.publish.timeout` to allow a faulty master to detect that it is unable
to publish its cluster state updates in a timely fashion. If these timeouts
occur then the node restarts the discovery process in an attempt to find a
healthier master.
In the special case of `discovery.type: single-node` there is no point in
looking for another healthier master since the single node in the cluster is
all we've got. This commit suppresses these timeouts and instead lets the node
wait for joins and publications to succeed no matter how long this might take.
Synced flush was a brilliant idea. It supports instant recoveries with a
quite small implementation. However, with the presence of sequence
numbers and retention leases, it is no longer needed. This change
removes it from 8.0.
Relates #5077
* Creates a prerequisites section in the cross-cluster replication (CCR)
overview.
* Adds concise definitions for local and remote cluster in a CCR context.
* Documents that the ES version of the local cluster must be the same
or a newer compatible version as the remote cluster.
The "Restore any snapshots as required" step is a trap: it's somewhere between
tricky and impossible to restore multiple clusters into a single one.
Also add a note about configuring discovery during a rolling upgrade to
proscribe any rare cases where you might accidentally autobootstrap during the
upgrade.
The docs specify that cluster.remote.connect disables cross-cluster
search. This is correct, but not fully accurate as it disables any
functionality that relies on remote cluster connections: cross-cluster
search, remote data feeds, and cross-cluster replication. This commit
updates the docs to reflect this.
Today the docs say that the low watermark has no effect on any shards that have
never been allocated, but this is confusing. Here "shard" means "replication
group" not "shard copy" but this conflicts with the "never been allocated"
qualifier since one allocates shard copies and not replication groups.
This commit removes the misleading words. A newly-created replication group
remains newly-created until one of its copies is assigned, which might be quite
some time later, but it seems better to leave this implicit.
Clarifies not to set `cluster.initial_master_nodes` on nodes that are joining
an existing cluster.
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
Setting `cluster.routing.allocation.disk.include_relocations` to `false` is a
bad idea since it will lead to the kinds of overshoot that were otherwise fixed
in #46079. This setting was deprecated in #47443. This commit removes it.
Setting `cluster.routing.allocation.disk.include_relocations` to `false` is a
bad idea since it will lead to the kinds of overshoot that were otherwise fixed
in #46079. This commit deprecates this setting so it can be removed in the next
major release.
We added docs for proxy mode in #40281 but on reflection we should not be
documenting this setting since it does not play well with all proxies and we
can't recommend its use. This commit removes those docs and expands its Javadoc
instead.
The changes in #32006 mean that the discovery process can no longer use
master-ineligible nodes as a stepping-stone between master-eligible nodes.
This was normally an indication of a strange and possibly-fragile configuration
and was not recommended. This commit clarifies that only master-eligible nodes
are now involved with discovery.
With this commit, Elasticsearch will no longer prefer using shards in the same location
(with the same awareness attribute values) to process `_search` and `_get` requests.
Instead, adaptive replica selection (the default since 7.0) should route requests more efficiently
using the service time of prior inter-node communications. Clusters with big latencies between
nodes should switch to cross cluster replication to isolate nodes within the same zone.
Note that this change only targets 8.0 since it is considered as breaking. However a follow up
pr should add an option to activate this behavior in 7.x in order to allow users to opt-in early.
Closes#43453
* Snapshot cleanup functionality via transport/REST endpoint.
* Added all the infrastructure for this with the HLRC and node client
* Made use of it in tests and resolved relevant TODO
* Added new `Custom` CS element that tracks the cleanup logic.
Kept it similar to the delete and in progress classes and gave it
some (for now) redundant way of handling multiple cleanups but only allow one
* Use the exact same mechanism used by deletes to have the combination
of CS entry and increment in repository state ID provide some
concurrency safety (the initial approach of just an entry in the CS
was not enough, we must increment the repository state ID to be safe
against concurrent modifications, otherwise we run the risk of "cleaning up"
blobs that just got created without noticing)
* Isolated the logic to the transport action class as much as I could.
It's not ideal, but we don't need to keep any state and do the same
for other repository operations
(like getting the detailed snapshot shard status)
Removes support for using a system property to disable the automatic release of
the write block applied when a node exceeds the flood-stage watermark.
Relates #42559
If a node exceeds the flood-stage disk watermark then we add a block to all of
its indices to prevent further writes as a last-ditch attempt to prevent the
node completely exhausting its disk space. However today this block remains in
place until manually removed, and this block is a source of confusion for users
who current have ample disk space and did not even realise they nearly ran out
at some point in the past.
This commit changes our behaviour to automatically remove this block when a
node drops below the high watermark again. The expectation is that the high
watermark is some distance below the flood-stage watermark and therefore the
disk space problem is truly resolved.
Fixes#39334
Uses JDK 11's per-socket configuration of TCP keepalive (supported on Linux and Mac), see
https://bugs.openjdk.java.net/browse/JDK-8194298, and exposes these as transport settings.
By default, these options are disabled for now (i.e. fall-back to OS behavior), but we would like
to explore whether we can enable them by default, in particular to force keepalive configurations
that are better tuned for running ES.
Today the lag detector may remove nodes from the cluster if they fail to apply
a cluster state within a reasonable timeframe, but it is rather unclear from
the default logging that this has occurred and there is very little extra
information beyond the fact that the removed node was lagging. Moreover the
only forewarning that the lag detector might be invoked is a message indicating
that cluster state publication took unreasonably long, which does not contain
enough information to investigate the problem further.
This commit adds a good deal more detail to make the issues of slow nodes more
prominent:
- after 10 seconds (by default) we log an INFO message indicating that a
publication is still waiting for responses from some nodes, including the
identities of the problematic nodes.
- when the publication times out after 30 seconds (by default) we log a WARN
message identifying the nodes that are still pending.
- the lag detector logs a more detailed warning when a fatally-lagging node is
detected.
- if applying a cluster state takes too long then the cluster applier service
logs a breakdown of all the tasks it ran as part of that process.
Most of the circuit breaker settings are dynamically configurable.
However, `indices.breaker.total.use_real_memory` is not. With this
commit we add a clarifying note that this specific setting is static.
Closes#44974
elastic/elasticsearch#41281 added custom metadata parameter to
snapshots. During review, the parameter name was changed from '_meta' to
'metadata,' but the documentation wasn't updated. This corrects the
documentation to use the 'metadata' name.
* Add SnapshotLifecycleService and related CRUD APIs
This commit adds `SnapshotLifecycleService` as a new service under the ilm
plugin. This service handles snapshot lifecycle policies by scheduling based on
the policies defined schedule.
This also includes the get, put, and delete APIs for these policies
Relates to #38461
* Make scheduledJobIds return an immutable set
* Use Object.equals for SnapshotLifecyclePolicy
* Remove unneeded TODO
* Implement ToXContentFragment on SnapshotLifecyclePolicyItem
* Copy contents of the scheduledJobIds
* Handle snapshot lifecycle policy updates and deletions (#40062)
(Note this is a PR against the `snapshot-lifecycle-management` feature branch)
This adds logic to `SnapshotLifecycleService` to handle updates and deletes for
snapshot policies. Policies with incremented versions have the old policy
cancelled and the new one scheduled. Deleted policies have their schedules
cancelled when they are no longer present in the cluster state metadata.
Relates to #38461
* Take a snapshot for the policy when the SLM policy is triggered (#40383)
(This is a PR for the `snapshot-lifecycle-management` branch)
This commit fills in `SnapshotLifecycleTask` to actually perform the
snapshotting when the policy is triggered. Currently there is no handling of the
results (other than logging) as that will be added in subsequent work.
This also adds unit tests and an integration test that schedules a policy and
ensures that a snapshot is correctly taken.
Relates to #38461
* Record most recent snapshot policy success/failure (#40619)
Keeping a record of the results of the successes and failures will aid
troubleshooting of policies and make users more confident that their
snapshots are being taken as expected.
This is the first step toward writing history in a more permanent
fashion.
* Validate snapshot lifecycle policies (#40654)
(This is a PR against the `snapshot-lifecycle-management` branch)
With the commit, we now validate the content of snapshot lifecycle policies when
the policy is being created or updated. This checks for the validity of the id,
name, schedule, and repository. Additionally, cluster state is checked to ensure
that the repository exists prior to the lifecycle being added to the cluster
state.
Part of #38461
* Hook SLM into ILM's start and stop APIs (#40871)
(This pull request is for the `snapshot-lifecycle-management` branch)
This change allows the existing `/_ilm/stop` and `/_ilm/start` APIs to also
manage snapshot lifecycle scheduling. When ILM is stopped all scheduled jobs are
cancelled.
Relates to #38461
* Add tests for SnapshotLifecyclePolicyItem (#40912)
Adds serialization tests for SnapshotLifecyclePolicyItem.
* Fix improper import in build.gradle after master merge
* Add human readable version of modified date for snapshot lifecycle policy (#41035)
* Add human readable version of modified date for snapshot lifecycle policy
This small change changes it from:
```
...
"modified_date": 1554843903242,
...
```
To
```
...
"modified_date" : "2019-04-09T21:05:03.242Z",
"modified_date_millis" : 1554843903242,
...
```
Including the `"modified_date"` field when the `?human` field is used.
Relates to #38461
* Fix test
* Add API to execute SLM policy on demand (#41038)
This commit adds the ability to perform a snapshot on demand for a policy. This
can be useful to take a snapshot immediately prior to performing some sort of
maintenance.
```json
PUT /_ilm/snapshot/<policy>/_execute
```
And it returns the response with the generated snapshot name:
```json
{
"snapshot_name" : "production-snap-2019.04.09-rfyv3j9qreixkdbnfuw0ug"
}
```
Note that this does not allow waiting for the snapshot, and the snapshot could
still fail. It *does* record this information into the cluster state similar to
a regularly trigged SLM job.
Relates to #38461
* Add next_execution to SLM policy metadata (#41221)
* Add next_execution to SLM policy metadata
This adds the next time a snapshot lifecycle policy will be executed when
retriving a policy's metadata, for example:
```json
GET /_ilm/snapshot?human
{
"production" : {
"version" : 1,
"modified_date" : "2019-04-15T21:16:21.865Z",
"modified_date_millis" : 1555362981865,
"policy" : {
"name" : "<production-snap-{now/d}>",
"schedule" : "*/30 * * * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"foo-*",
"important"
],
"ignore_unavailable" : true,
"include_global_state" : false
}
},
"next_execution" : "2019-04-15T21:16:30.000Z",
"next_execution_millis" : 1555362990000
},
"other" : {
"version" : 1,
"modified_date" : "2019-04-15T21:12:19.959Z",
"modified_date_millis" : 1555362739959,
"policy" : {
"name" : "<other-snap-{now/d}>",
"schedule" : "0 30 2 * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"other"
],
"ignore_unavailable" : false,
"include_global_state" : true
}
},
"next_execution" : "2019-04-16T02:30:00.000Z",
"next_execution_millis" : 1555381800000
}
}
```
Relates to #38461
* Fix and enhance tests
* Figured out how to Cron
* Change SLM endpoint from /_ilm/* to /_slm/* (#41320)
This commit changes the endpoint for snapshot lifecycle management from:
```
GET /_ilm/snapshot/<policy>
```
to:
```
GET /_slm/policy/<policy>
```
It mimics the ILM path only using `slm` instead of `ilm`.
Relates to #38461
* Add initial documentation for SLM (#41510)
* Add initial documentation for SLM
This adds the initial documentation for snapshot lifecycle management.
It also includes the REST spec API json files since they're sort of
documentation.
Relates to #38461
* Add `manage_slm` and `read_slm` roles (#41607)
* Add `manage_slm` and `read_slm` roles
This adds two more built in roles -
`manage_slm` which has permission to perform any of the SLM actions, as well as
stopping, starting, and retrieving the operation status of ILM.
`read_slm` which has permission to retrieve snapshot lifecycle policies as well
as retrieving the operation status of ILM.
Relates to #38461
* Add execute to the test
* Fix ilm -> slm typo in test
* Record SLM history into an index (#41707)
It is useful to have a record of the actions that Snapshot Lifecycle
Management takes, especially for the purposes of alerting when a
snapshot fails or has not been taken successfully for a certain amount of
time.
This adds the infrastructure to record SLM actions into an index that
can be queried at leisure, along with a lifecycle policy so that this
history does not grow without bound.
Additionally,
SLM automatically setting up an index + lifecycle policy leads to
`index_lifecycle` custom metadata in the cluster state, which some of
the ML tests don't know how to deal with due to setting up custom
`NamedXContentRegistry`s. Watcher would cause the same problem, but it
is already disabled (for the same reason).
* High Level Rest Client support for SLM (#41767)
* High Level Rest Client support for SLM
This commit add HLRC support for SLM.
Relates to #38461
* Fill out documentation tests with tags
* Add more callouts and asciidoc for HLRC
* Update javadoc links to real locations
* Add security test testing SLM cluster privileges (#42678)
* Add security test testing SLM cluster privileges
This adds a test to `PermissionsIT` that uses the `manage_slm` and `read_slm`
cluster privileges.
Relates to #38461
* Don't redefine vars
* Add Getting Started Guide for SLM (#42878)
This commit adds a basic Getting Started Guide for SLM.
* Include SLM policy name in Snapshot metadata (#43132)
Keep track of which SLM policy in the metadata field of the Snapshots
taken by SLM. This allows users to more easily understand where the
snapshot came from, and will enable future SLM features such as
retention policies.
* Fix compilation after master merge
* [TEST] Move exception wrapping for devious exception throwing
Fixes an issue where an exception was created from one line and thrown in another.
* Fix SLM for the change to AcknowledgedResponse
* Add Snapshot Lifecycle Management Package Docs (#43535)
* Fix compilation for transport actions now that task is required
* Add a note mentioning the privileges needed for SLM (#43708)
* Add a note mentioning the privileges needed for SLM
This adds a note to the top of the "getting started with SLM"
documentation mentioning that there are two built-in privileges to
assist with creating roles for SLM users and administrators.
Relates to #38461
* Mention that you can create snapshots for indices you can't read
* Fix REST tests for new number of cluster privileges
* Mute testThatNonExistingTemplatesAreAddedImmediately (#43951)
* Fix SnapshotHistoryStoreTests after merge
* Remove overridden newResponse functions that have been removed
This commit documents the backup and restore of a cluster's
security configuration.
It is not possible to only backup (or only restore) security
configuration, independent to the rest of the cluster's conf,
so this describes how a full configuration backup&restore
will include security as well. Moreover, it explains how part
of the security conf data resides on the special .security
index and how to backup that using regular data snapshot API.
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
Co-Authored-By: Tim Vernum <tim@adjective.org>
Clarifies the roles of a dedicated voting-only master-eligible node.
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
Co-Authored-By: David Turner <david.turner@elastic.co>
A voting-only master-eligible node is a node that can participate in master elections but will not act
as a master in the cluster. In particular, a voting-only node can help elect another master-eligible
node as master, and can serve as a tiebreaker in elections. High availability (HA) clusters require at
least three master-eligible nodes, so that if one of the three nodes is down, then the remaining two
can still elect a master amongst them-selves. This only requires one of the two remaining nodes to
have the capability to act as master, but both need to have voting powers. This means that one of
the three master-eligible nodes can be made as voting-only. If this voting-only node is a dedicated
master, a less powerful machine or a smaller heap-size can be chosen for this node. Alternatively, a
voting-only non-dedicated master node can play the role of the third master-eligible node, which
allows running an HA cluster with only two dedicated master nodes.
Closes#14340
Co-authored-by: David Turner <david.turner@elastic.co>
This commit adds multiple repositories support to get snapshots
request.
If some repository throws an exception this method does not fail fast
instead, it returns results for all repositories.
This PR is opened in favour of #41799, because we decided to change
the response format in a non-BwC manner. It makes sense to read a
discussion of the aforementioned PR.
This is the continuation of work done here #15151.
Adds a metadata field to snapshots which can be used to store arbitrary
key-value information. This may be useful for attaching a description of
why a snapshot was taken, tagging snapshots to make categorization
easier, or identifying the source of automatically-created snapshots.
This commit addresses a few more frequently-asked questions:
* clarifies that bootstrapping doesn't happen even after a full cluster
restart.
* removes the example that uses IP addresses, to try and further encourage the
use of node names for bootstrapping.
* clarifies that auto-bootstrapping might form different clusters on different
hosts, and gives a process for starting again if this wasn't what you wanted.
* adds the "do not stop half-or-more of the master-eligible nodes" slogan that
was notably absent.
* reformats one of the console examples to a narrower width
This commit removes docs for alternate transport implementations which
were removed years ago. These were missed because they have redirects
masking their existsence.
This setting, which prior to Elasticsearch 5 was enabled by default and caused all kinds of
confusion, has since been disabled by default and is not recommended for production use. The
preferred way going forward is for users to explicitly specify separate data folders for each started
node to ensure that each node is consistently assigned to the same data path.
Relates to #42426
The `bulk` threadpool is now called `write`, but `bulk` is still
used in some examples. This commit fixes that.
Also, the only way `threadpool.bulk.write: 30` is a valid increase in the size
of this threadpool is if you have 29 processors, which is an odd number of
processors to have. This commit removes the "more threads" bit.
In cases where node names and transport addresses can be muddled, it is unclear
that `cluster.initial_master_nodes: master-a:9300` means to look for a node
called `master-a:9300` rather than a node called `master-a` with transport port
`9300`. This commit adds docs to that effect.
* Clarify that peer recovery settings apply to shard relocation
* Fix awkward wording of 1st sentence
* [DOCS] Remove snapshot recovery reference.
Call out link to [[cat-recovery]].
Separate expert settings.
The example to delete a remote cluster is missing the `skip_unavailable` setting which results in an error:
```
"type": "illegal_argument_exception",
"reason": "missing required setting [cluster.remote.tiny-test.seeds] for setting [cluster.remote.tiny-test.skip_unavailable]"
```
The following phrase causes confusion:
> Alternatively the IP addresses or hostnames (if node name defaults to the
> host name) can be used.
This change clarifies the conditions under which you can use a hostname, and
adds an anchor to the note introduced in (#41137) so we can link directly to it
in conversations with users.
Added documentation for node repurpose tool and included documentation on how to repurpose nodes safely. Adjusted order of tools in `elasticsearch-node` tool since the repurpose tool is most likely to be used.
Co-Authored-By: David Turner <david.turner@elastic.co>
In #33062 we introduced the `cluster.remote.*.proxy` setting for proxied
connections to remote clusters, but left it deliberately undocumented since it
needed followup work so that it could work with SNI. However, since #32517 is
now closed we can add this documentation and remove the comment about its lack
of documentation.
This commit clarifies how the gateway selection works when configuring
remote clusters for CCR or CCS. Specifically, it clarifies compatibility
between different versions which is a very common question.
Removes all traces of Zen1 from the code base. Some of these commits will also be backported to
7.0/7.x (#39470) as the cluster.coordination package was making use of some things in
discovery.zen and we want to keep 7.x as close as possible to master.
Currently remote compression and ping schedule settings are dynamic.
However, we do not listen for changes. This commit adds listeners for
changes to those two settings. Additionally, when those settings change
we now close existing connections and open new ones with the settings
applied.
Fixes#37201.
`SearchShardIterator` inherits its `compareTo` implementation from `PlainShardIterator`. That is good in most of the cases, as such comparisons are based on the shard id which is unique, even when searching against indices with same names across multiple clusters (thanks to the index uuid being different). In case though the same cluster is registered multiple times with different aliases, the shard id is exactly the same, hence remote results will be returned before local ones with same shard id objects. That is because remote iterators are added before local ones, and we use a stable sorting method in `GroupShardIterators` constructor.
This PR enhances `compareTo` for `SearchShardIterator` to tie break on cluster alias and introduces consistent `equals` and `hashcode` methods. This allows to remove a TODO in `SearchResponseMerger` which otherwise has to handle this special case specifically. Also, while at it I added missing tests around equals/hashcode and compareTo and expanded existing ones.
In #38333 and #38350 we moved away from the `discovery.zen` settings namespace
since these settings have an effect even though Zen Discovery itself is being
phased out. This change aligns the documentation and the names of related
classes and methods with the newly-introduced naming conventions.
Renames the following settings to remove the mention of `zen` in their names:
- `discovery.zen.hosts_provider` -> `discovery.seed_providers`
- `discovery.zen.ping.unicast.concurrent_connects` -> `discovery.seed_resolver.max_concurrent_resolvers`
- `discovery.zen.ping.unicast.hosts.resolve_timeout` -> `discovery.seed_resolver.timeout`
- `discovery.zen.ping.unicast.hosts` -> `discovery.seed_addresses`