[[repository-s3]]
=== S3 repository
You can use AWS S3 as a repository for {ref}/snapshot-restore.html[Snapshot/Restore].
*If you are looking for a hosted solution of Elasticsearch on AWS, please visit
https://www.elastic.co/cloud/.*
[[repository-s3-usage]]
==== Getting started
To register an S3 repository, specify the type as `s3` when creating
the repository. The repository defaults to using
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html[ECS
IAM Role] credentials for authentication. You can also authenticate using
<<iam-kubernetes-service-accounts,Kubernetes service accounts>>.
The only mandatory setting is the bucket name:
[source,console]
----
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-bucket"
  }
}
----
// TEST[skip:we don't have s3 setup while testing this]
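After registering the repository, you can retrieve its definition to check
that it was created as expected:
[source,console]
----
GET _snapshot/my_s3_repository
----
// TEST[skip:we don't have s3 setup while testing this]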
[[repository-s3-client]]
==== Client settings
The client that you use to connect to S3 has a number of settings available.
The settings have the form `s3.client.CLIENT_NAME.SETTING_NAME`. By default,
`s3` repositories use a client named `default`, but this can be modified using
the <<repository-s3-repository,repository setting>> `client`. For example:
[source,console]
----
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-bucket",
    "client": "my-alternate-client"
  }
}
----
// TEST[skip:we don't have S3 setup while testing this]
Most client settings can be added to the `elasticsearch.yml` configuration file
with the exception of the secure settings, which you add to the {es} keystore.
For more information about creating and updating the {es} keystore, see
{ref}/secure-settings.html[Secure settings].
For example, if you want to use specific credentials to access S3 then run the
following commands to add these credentials to the keystore:
[source,sh]
----
bin/elasticsearch-keystore add s3.client.default.access_key
bin/elasticsearch-keystore add s3.client.default.secret_key
# a session token is optional so the following command may not be needed
bin/elasticsearch-keystore add s3.client.default.session_token
----
If instead you want to use the instance role or container role to access S3
then you should leave these settings unset. You can switch from using specific
credentials back to the default of using the instance role or container role by
removing these settings from the keystore as follows:
[source,sh]
----
bin/elasticsearch-keystore remove s3.client.default.access_key
bin/elasticsearch-keystore remove s3.client.default.secret_key
# a session token is optional so the following command may not be needed
bin/elasticsearch-keystore remove s3.client.default.session_token
----
*All* client secure settings of this repository type are
{ref}/secure-settings.html#reloadable-secure-settings[reloadable]. After you
reload the settings, the internal `s3` clients, used to transfer the snapshot
contents, will utilize the latest settings from the keystore. Any existing `s3`
repositories, as well as any newly created ones, will pick up the new values
stored in the keystore.
NOTE: In-progress snapshot/restore tasks will not be preempted by a *reload* of
the client's secure settings. The task will complete using the client as it was
built when the operation started.
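For example, after adding updated credentials to the keystore on each node,
you can make the `s3` clients pick them up with the
{ref}/secure-settings.html#reloadable-secure-settings[reload secure settings]
API:
[source,console]
----
POST _nodes/reload_secure_settings
----
// TEST[skip:no need to reload settings while testing this]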
The following list contains the available client settings. Those that must be
stored in the keystore are marked as "secure" and are *reloadable*; the other
settings belong in the `elasticsearch.yml` file. An example `elasticsearch.yml`
snippet follows the list.
`access_key` ({ref}/secure-settings.html[Secure], {ref}/secure-settings.html#reloadable-secure-settings[reloadable])::
An S3 access key. If set, the `secret_key` setting must also be specified.
If unset, the client will use the instance or container role instead.
`secret_key` ({ref}/secure-settings.html[Secure], {ref}/secure-settings.html#reloadable-secure-settings[reloadable])::
An S3 secret key. If set, the `access_key` setting must also be specified.
`session_token` ({ref}/secure-settings.html[Secure], {ref}/secure-settings.html#reloadable-secure-settings[reloadable])::
An S3 session token. If set, the `access_key` and `secret_key` settings
must also be specified.
`endpoint`::
The S3 service endpoint to connect to. This defaults to `s3.amazonaws.com`
but the
https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region[AWS
documentation] lists alternative S3 endpoints. If you are using an
<<repository-s3-compatible-services,S3-compatible service>> then you should
set this to the service's endpoint.
`protocol`::
The protocol to use to connect to S3. Valid values are either `http` or
`https`. Defaults to `https`. When using HTTPS, this repository type validates the
repository's certificate chain using the JVM-wide truststore. Ensure that
the root certificate authority is in this truststore using the JVM's
`keytool` tool.
`proxy.host`::
The host name of a proxy to connect to S3 through.
`proxy.port`::
The port of a proxy to connect to S3 through.
`proxy.username` ({ref}/secure-settings.html[Secure], {ref}/secure-settings.html#reloadable-secure-settings[reloadable])::
The username to connect to the `proxy.host` with.
`proxy.password` ({ref}/secure-settings.html[Secure], {ref}/secure-settings.html#reloadable-secure-settings[reloadable])::
The password to connect to the `proxy.host` with.
`read_timeout`::
(<<time-units,time value>>) The maximum time {es} will wait to receive the next byte
of data over an established, open connection to the repository before it closes the
connection. The default value is 50 seconds.
`max_retries`::
The number of retries to use when an S3 request fails. The default value is
`3`.
`use_throttle_retries`::
Whether retries should be throttled (i.e. should back off). Must be `true`
or `false`. Defaults to `true`.
`path_style_access`::
Whether to force the use of the path style access pattern. If `true`, the
path style access pattern will be used. If `false`, the access pattern will
be automatically determined by the AWS Java SDK (See
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#setPathStyleAccessEnabled-java.lang.Boolean-[AWS
documentation] for details). Defaults to `false`.
[[repository-s3-path-style-deprecation]]
NOTE: In versions `7.0`, `7.1`, `7.2` and `7.3` all bucket operations used the
https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/[now-deprecated]
path style access pattern. If your deployment requires the path style access
pattern then you should set this setting to `true` when upgrading.
`disable_chunked_encoding`::
Whether chunked encoding should be disabled or not. If `false`, chunked
encoding is enabled and will be used where appropriate. If `true`, chunked
encoding is disabled and will not be used, which may mean that snapshot
operations consume more resources and take longer to complete. It should
only be set to `true` if you are using a storage service that does not
support chunked encoding. See the
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#disableChunkedEncoding--[AWS
Java SDK documentation] for details. Defaults to `false`.
`region`::
Allows specifying the signing region to use. Specifying this setting manually should not be necessary for most use cases. Generally,
the SDK will correctly guess the signing region to use. It should be considered an expert-level setting to support S3-compatible APIs
that require https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html[v4 signatures] and use a region other than the
default `us-east-1`. Defaults to an empty string, which means that the SDK will try to automatically determine the correct signing region.
`signer_override`::
Allows specifying the name of the signature algorithm to use for signing requests by the S3 client. Specifying this setting should not
be necessary for most use cases. It should be considered an expert-level setting to support S3-compatible APIs that do not support the
signing algorithm that the SDK automatically determines for them.
See the
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/ClientConfiguration.html#setSignerOverride-java.lang.String-[AWS
Java SDK documentation] for details. Defaults to an empty string, which means that no signing algorithm override will be used.
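For example, a minimal `elasticsearch.yml` snippet configuring some of the
non-secure settings for the `default` client might look like this; the
endpoint, retry, and timeout values shown are illustrative:
[source,yaml]
----
s3.client.default.endpoint: "s3.eu-central-1.amazonaws.com"
s3.client.default.protocol: https
s3.client.default.max_retries: 5
s3.client.default.read_timeout: 60s
----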
[discrete]
[[repository-s3-compatible-services]]
===== S3-compatible services
There are a number of storage systems that provide an S3-compatible API, and
the `repository-s3` type allows you to use these systems in place of AWS S3.
To do so, you should set the `s3.client.CLIENT_NAME.endpoint` setting to the
system's endpoint. This setting accepts IP addresses and hostnames and may
include a port. For example, the endpoint may be `172.17.0.2` or
`172.17.0.2:9000`.
By default {es} communicates with your storage system using HTTPS, and
validates the repository's certificate chain using the JVM-wide truststore.
Ensure that the JVM-wide truststore includes an entry for your repository. If
you wish to use unsecured HTTP communication instead of HTTPS, set
`s3.client.CLIENT_NAME.protocol` to `http`.
https://minio.io[MinIO] is an example of a storage system that provides an
S3-compatible API. The `repository-s3` type allows {es} to work with
MinIO-backed repositories as well as repositories stored on AWS S3. Other
S3-compatible storage systems may also work with {es}, but these are not
covered by the {es} test suite.
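For example, a client named `minio` (the client name and address here are
illustrative) pointing at a MinIO server could be configured in
`elasticsearch.yml` as follows, and then selected by a repository using the
`client` repository setting:
[source,yaml]
----
s3.client.minio.endpoint: "172.17.0.2:9000"
s3.client.minio.protocol: http
----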
Note that some storage systems claim to be S3-compatible but do not faithfully
emulate S3's behaviour in full. The `repository-s3` type requires full
compatibility with S3. In particular it must support the same set of API
endpoints, return the same errors in case of failures, and offer consistency and
performance at least as good as S3 even when accessed concurrently by multiple
nodes. You will need to work with the supplier of your storage system to address
any incompatibilities you encounter. Please do not report {es} issues involving
storage systems which claim to be S3-compatible unless you can demonstrate that
the same issue exists when using a genuine AWS S3 repository.
You can perform some basic checks of the suitability of your storage system
using the {ref}/repo-analysis-api.html[repository analysis API]. If this API
does not complete successfully, or indicates poor performance, then your
storage system is not fully compatible with AWS S3 and therefore unsuitable for
use as a snapshot repository. However, these checks do not guarantee full
compatibility.
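For example, the following request runs a small analysis of the repository.
The parameter values shown are illustrative; larger values give a more
realistic picture of the repository's behaviour:
[source,console]
----
POST /_snapshot/my_s3_repository/_analyze?blob_count=10&max_blob_size=1mb&timeout=120s
----
// TEST[skip:we don't have s3 setup while testing this]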
Most storage systems can be configured to log the details of their interaction
with {es}. If you are investigating a suspected incompatibility with AWS S3, it
is usually simplest to collect these logs and provide them to the supplier of
your storage system for further analysis. If the incompatibility is not clear
from the logs emitted by the storage system, configure {es} to log every
request it makes to the S3 API by <<configuring-logging-levels,setting the
logging level>> of the `com.amazonaws.request` logger to `DEBUG`:
[source,console]
----
PUT /_cluster/settings
{
  "persistent": {
    "logger.com.amazonaws.request": "DEBUG"
  }
}
----
// TEST[skip:we don't really want to change this logger]
The supplier of your storage system will be able to analyse these logs to determine the problem. See
the https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-logging.html[AWS Java SDK]
documentation for further information.
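When your investigation is complete, reset the logger to its default level by
setting it to `null`:
[source,console]
----
PUT /_cluster/settings
{
  "persistent": {
    "logger.com.amazonaws.request": null
  }
}
----
// TEST[skip:we don't really want to change this logger]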
[[repository-s3-repository]]
==== Repository settings
The `s3` repository type supports a number of settings to customize how data is
stored in S3. These can be specified when creating the repository. For example:
[source,console]
----
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-bucket",
    "another_setting": "setting-value"
  }
}
----
// TEST[skip:we don't have S3 set up while testing this]
The following settings are supported:
`bucket`::
(Required)
Name of the S3 bucket to use for snapshots.
+
The bucket name must adhere to Amazon's
https://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html#bucketnamingrules[S3
bucket naming rules].
`client`::
The name of the <<repository-s3-client,S3 client>> to use to connect to S3.
Defaults to `default`.
`base_path`::
Specifies the path to the repository data within its bucket. Defaults to an
empty string, meaning that the repository is at the root of the bucket. The
value of this setting should not start or end with a `/`.
+
NOTE: Don't set `base_path` when configuring a snapshot repository for {ECE}.
{ECE} automatically generates the `base_path` for each deployment so that
multiple deployments may share the same bucket.
`chunk_size`::
(<<byte-units,byte value>>) Big files can be broken down into chunks during snapshotting if needed.
Specify the chunk size as a value and unit, for example:
`1TB`, `1GB`, `10MB`. Defaults to the maximum size of a blob in S3, which is `5TB`.
`compress`::
When set to `true` metadata files are stored in compressed format. This
setting doesn't affect index files that are already compressed by default.
Defaults to `true`.
include::repository-shared-settings.asciidoc[]
`server_side_encryption`::
When set to `true`, files are encrypted on the server side using the AES256
algorithm. Defaults to `false`.
`buffer_size`::
(<<byte-units,byte value>>) Minimum threshold below which the chunk is
uploaded using a single request.
Beyond this threshold, the S3 repository will use the
https://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html[AWS
Multipart Upload API] to split the chunk into several parts, each of
`buffer_size` length, and to upload each part in its own request. Note that
setting a buffer size lower than `5mb` is not allowed since it will prevent
the use of the Multipart API and may result in upload errors. It is also not
possible to set a buffer size greater than `5gb` as it is the maximum upload
size allowed by S3. Defaults to `100mb` or `5%` of JVM heap, whichever is
smaller.
`canned_acl`::
The S3 repository supports all
https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl[S3
canned ACLs]: `private`, `public-read`, `public-read-write`,
`authenticated-read`, `log-delivery-write`, `bucket-owner-read`,
`bucket-owner-full-control`. Defaults to `private`. When the S3 repository
creates buckets and objects, it adds the specified canned ACL to them.
`storage_class`::
Sets the S3 storage class for objects stored in the snapshot repository.
Values may be `standard`, `reduced_redundancy`, `standard_ia`, `onezone_ia`
and `intelligent_tiering`. Defaults to `standard`. Changing this setting on
an existing repository only affects the storage class for newly created
objects, resulting in a mixed usage of storage classes. You may use an S3
Lifecycle Policy to adjust the storage class of existing objects in your
repository, but you must not transition objects to Glacier classes and you
must not expire objects. If you use Glacier storage classes or object
expiry then you may permanently lose access to your repository contents.
For more information about S3 storage classes, see the
https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-class-intro.html[AWS
Storage Classes Guide].
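For example, a repository definition combining several of these settings might
look like the following; all values are illustrative:
[source,console]
----
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-bucket",
    "base_path": "snapshots/production",
    "chunk_size": "1gb",
    "server_side_encryption": true,
    "storage_class": "standard_ia",
    "canned_acl": "private"
  }
}
----
// TEST[skip:we don't have S3 set up while testing this]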
NOTE: The option of defining client settings in the repository settings as
documented below is considered deprecated, and will be removed in a future
version.
In addition to the above settings, you may also specify all non-secure client
settings in the repository settings. In this case, the client settings found in
the repository settings will be merged with those of the named client used by
the repository. Conflicts between client and repository settings are resolved
by the repository settings taking precedence over client settings.
For example:
[source,console]
----
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "client": "my-client",
    "bucket": "my-bucket",
    "endpoint": "my.s3.endpoint"
  }
}
----
// TEST[skip:we don't have s3 set up while testing this]
This sets up a repository that uses all client settings from the client
`my-client` except for the `endpoint`, which is overridden to
`my.s3.endpoint` by the repository settings.
[[repository-s3-permissions]]
===== Recommended S3 permissions
In order to restrict the Elasticsearch snapshot process to the minimum required
resources, we recommend using Amazon IAM in conjunction with pre-existing S3
buckets. Here is an example policy which grants the snapshot process access to
an S3 bucket named `snaps.example.com`. You can configure this through the AWS
IAM console by creating a Custom Policy with a Policy Document similar to the
following (change `snaps.example.com` to your bucket name):
[source,js]
----
{
  "Statement": [
    {
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::snaps.example.com"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::snaps.example.com/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
----
// NOTCONSOLE
You may further restrict the permissions by specifying a prefix within the
bucket; in this example, the prefix is `foo`.
[source,js]
----
{
  "Statement": [
    {
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "foo/*"
          ]
        }
      },
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::snaps.example.com"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::snaps.example.com/foo/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
----
// NOTCONSOLE
The bucket needs to exist before you can register a repository for snapshots.
If you did not create the bucket, the repository registration will fail.
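If you need to create the bucket first, you can do so with the AWS CLI; the
bucket name and region here are illustrative:
[source,sh]
----
aws s3 mb s3://snaps.example.com --region us-east-1
----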
===== Cleaning up multi-part uploads
{es} uses S3's multi-part upload process to upload larger blobs to the
repository. The multi-part upload process works by dividing each blob into
smaller parts, uploading each part independently, and then completing the
upload in a separate step. This reduces the amount of data that {es} must
re-send if an upload fails: {es} only needs to re-send the part that failed
rather than starting from the beginning of the whole blob. The storage for each
part is charged independently starting from the time at which the part was
uploaded.
If a multi-part upload cannot be completed then it must be aborted in order to
delete any parts that were successfully uploaded, preventing further storage
charges from accumulating. {es} will automatically abort a multi-part upload on
failure, but sometimes the abort request itself fails. For example, if the
repository becomes inaccessible or the instance on which {es} is running is
terminated abruptly then {es} cannot complete or abort any ongoing uploads.
You must make sure that failed uploads are eventually aborted to avoid
unnecessary storage costs. You can use the
https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListMultipartUploads.html[List
multipart uploads API] to list the ongoing uploads and look for any which are
unusually long-running, or you can
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpu-abort-incomplete-mpu-lifecycle-config.html[configure
a bucket lifecycle policy] to automatically abort incomplete uploads once they
reach a certain age.
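For example, you could use the AWS CLI to list any in-progress multipart
uploads in the bucket (the bucket name is illustrative):
[source,sh]
----
aws s3api list-multipart-uploads --bucket snaps.example.com
----
A lifecycle configuration that automatically aborts incomplete uploads once
they are, say, seven days old might look like this; apply it with
`aws s3api put-bucket-lifecycle-configuration`:
[source,js]
----
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    }
  ]
}
----
// NOTCONSOLE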
[[repository-s3-aws-vpc]]
[discrete]
==== AWS VPC bandwidth settings
AWS instances resolve S3 endpoints to a public IP. If the Elasticsearch
instances reside in a private subnet in an AWS VPC then all traffic to S3 will
go through the VPC's NAT instance. If your VPC's NAT instance is a smaller
instance size (e.g. a `t2.micro`) or is handling a high volume of network
traffic, your bandwidth to S3 may be limited by that NAT instance's networking
bandwidth. Instead, we recommend creating a https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html[VPC endpoint],
which enables instances that reside in a private subnet in an AWS VPC to
connect to S3 directly. This eliminates any limitations imposed by the network
bandwidth of your VPC's NAT instance.
Instances residing in a public subnet in an AWS VPC will connect to S3 via the
VPC's internet gateway and not be bandwidth limited by the VPC's NAT instance.
[[iam-kubernetes-service-accounts]]
[discrete]
==== Using IAM roles for Kubernetes service accounts for authentication
If you want to use https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/[Kubernetes service accounts]
for authentication, you need to add a symlink to the token file referenced by the `$AWS_WEB_IDENTITY_TOKEN_FILE`
environment variable (which should be set automatically by a Kubernetes pod) in the S3 repository config directory,
so that the repository can read the service account token (a repository can't read any files outside its config directory).
For example:
[source,bash]
----
mkdir -p "${ES_PATH_CONF}/repository-s3"
ln -s "$AWS_WEB_IDENTITY_TOKEN_FILE" "${ES_PATH_CONF}/repository-s3/aws-web-identity-token-file"
----
IMPORTANT: The symlink must be created on all data and master eligible nodes and be readable
by the `elasticsearch` user. By default, {es} runs as user `elasticsearch` using uid:gid `1000:0`.
If the symlink exists, it will be used by default by all S3 repositories that don't have explicit `client` credentials.