Adds a test framework that validates that instruments are registered before they are called and are not double-registered.
Also records all invocations of instruments and allows test authors to add validation to instruments.
The anti-contention delay in the S3 repository's compare-and-exchange
operation is hard-coded at 1 second today, but sometimes we encounter a
repository that needs much longer to perform a compare-and-exchange
operation when under contention. With this commit we make the
anti-contention delay configurable.
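A minimal sketch of the idea, in plain Java with hypothetical names (the real change uses Elasticsearch's repository `Setting` infrastructure, not this stand-in parser):

```java
import java.time.Duration;

// Hypothetical sketch: a configurable anti-contention delay, keeping the
// previously hard-coded value (1 second) as the default.
public class CompareAndExchangeConfig {
    // Previously hard-coded; now only the default.
    static final Duration DEFAULT_ANTI_CONTENTION_DELAY = Duration.ofSeconds(1);

    static Duration antiContentionDelay(String settingValue) {
        // Fall back to the old behaviour when the setting is absent.
        if (settingValue == null || settingValue.isEmpty()) {
            return DEFAULT_ANTI_CONTENTION_DELAY;
        }
        // Accept simple "<n>s" values, e.g. "5s" (parsing is illustrative only).
        return Duration.ofSeconds(Long.parseLong(settingValue.replace("s", "")));
    }
}
```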
This splits out the registry and the service, which makes testing easier and removes much of the delegation from the old `APMMeter` to `Instruments` (now renamed `APMMeterRegistry`).
APMMeterService takes care of the lifecycle and APMMeterRegistry holds the instruments.
Similar to the TransportVersions holder class, IndexVersions is the new
place to contain all constants for IndexVersion. This commit moves all
existing constants to the new class. It is purely mechanical.
We'd like to make `SearchResponse` reference counted and pooled but there are around 6k
instances of tests that create a `SearchResponse` local variable that would need to be
released manually to avoid leaks in the tests.
This does away with about 10% of these spots by adding an override for `assertHitCount`
that handles the actual execution of the search request and its release automatically,
and by making use of it in all spots where the `.get()` on the request builder could be inlined
semi-automatically, in a straightforward fashion, without other code changes.
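The shape of such an override can be sketched as follows; the types here are simplified stand-ins, not the real Elasticsearch classes:

```java
// Illustrative sketch only: shows how an assertHitCount overload can own the
// lifecycle of a pooled response so call sites cannot leak it.
public class HitCountHelper {
    // Stand-in for a pooled, releasable SearchResponse.
    static class SearchResponse implements AutoCloseable {
        final long totalHits;
        boolean released;
        SearchResponse(long totalHits) { this.totalHits = totalHits; }
        @Override public void close() { released = true; }
    }

    // Stand-in for a request builder whose get() returns a response that the
    // caller would otherwise have to release manually.
    interface SearchRequestBuilder { SearchResponse get(); }

    // The overload: executes the request, asserts, and always releases the
    // response, so test call sites no longer need explicit cleanup.
    static void assertHitCount(SearchRequestBuilder builder, long expected) {
        try (SearchResponse response = builder.get()) {
            if (response.totalHits != expected) {
                throw new AssertionError("expected " + expected + " hits, got " + response.totalHits);
            }
        }
    }
}
```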
Emitted metrics could not be indexed because the elasticsearch.metrics.s3.exceptions field
is both a long counter and the parent object of a histogram. This change renames the
histogram to avoid the conflict.
This PR builds on top of #100464 to publish s3 request count via the metrics API.
The metric takes the name of `repositories.requests.count` with
attributes/dimensions of
`{"repo_type": "s3", "repo_name": "xxx", "operation": "xxx", "purpose": "xxx"}`.
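The metric name and attributes described above could be assembled roughly like this (the helper itself is hypothetical; keys and values follow the description):

```java
import java.util.Map;

// Sketch of the metric attributes for the s3 request count; the constant and
// helper are illustrative, not the real Elasticsearch API.
public class S3RequestMetrics {
    static final String METRIC_NAME = "repositories.requests.count";

    static Map<String, Object> attributes(String repoName, String operation, String purpose) {
        return Map.of(
            "repo_type", "s3",
            "repo_name", repoName,
            "operation", operation,
            "purpose", purpose
        );
    }
}
```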
Closes: ES-6801
Today we rely on the caller computing the appropriate repository format
version based on the nodes in the cluster and the snapshots in (some
recent copy of) the `RepositoryData`. This commit moves that computation
into `createSnapshotsDeletion` so that (a) we can be sure to use the
same `RepositoryData` used for the rest of the process, and (b) we avoid
dispatching work to the SNAPSHOT pool twice.
Relates a comment on #100657
Reorders the methods involved in snapshot deletion to be closer together
and better match the flow of execution, and harmonises the names of many
parameters and local variables to make it easier to follow them through
the process.
This PR wires the new Meter interface into S3BlobStore. The new meter
field remains unused in this PR. Actual metric collection will be
addressed in follow-ups.
Relates: ES-6801
A new no-op OperationPurpose parameter was added in #99615 to all blob
store/container operation methods. This PR updates the s3 stats
collection code to actually use this parameter for finer-grained stats
collection and reporting. The differentiation between purposes is kept
internal for now: the stats are still aggregated over operations for
existing stats reporting, so responses from both the
GetRepositoriesMetering API and the GetBlobStoreStats API are unchanged.
We will have follow-ups to expose the finer-grained stats separately.
Relates: #99615
Relates: ES-6800
Another round of automated fixes to this, marking things that can be
made static as static. Saves some JIT cycles but also turns some lambdas
from capturing to non-capturing and makes the "utilityness" of some
classes visible.
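The lambda effect mentioned above can be illustrated like this (class and methods are illustrative):

```java
import java.util.function.Supplier;

// Why marking things static helps lambdas: a lambda that refers to instance
// state captures `this`, so a fresh lambda object may be allocated per call.
// Once the referenced state is static, the lambda captures nothing and the
// runtime can cache a single instance.
public class StaticLambdas {
    private final String field = "instance";

    // Capturing: refers to this.field, so the lambda holds a reference to `this`.
    Supplier<String> capturing() {
        return () -> field;
    }

    // Non-capturing: refers only to a constant; eligible for instance reuse.
    static Supplier<String> nonCapturing() {
        return () -> "static";
    }
}
```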
Today blobstore stats are collected against each HTTP operation, e.g.
Get, List. This is not granular enough because the same HTTP operation
can be performed for different purposes, e.g. cluster state, indices or
translog. This PR adds a new Purpose enum to provide further breakdown
for the same HTTP operation.
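The breakdown can be sketched as follows; the real enum is the OperationPurpose type in the blob store code, and its constants may differ from these illustrative ones:

```java
// Sketch: stats keyed by (purpose, operation) instead of operation alone.
public class BlobStoreStatsSketch {
    enum Purpose { CLUSTER_STATE, INDICES, TRANSLOG, SNAPSHOT }
    enum HttpOperation { GET, LIST, PUT }

    // Each (purpose, operation) pair gets its own stats bucket.
    static String key(Purpose purpose, HttpOperation op) {
        return purpose.name() + "_" + op.name();
    }
}
```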
Relates: ES-6800
In order to avoid adding yet another parameter to createComponents,
the Tracer interface is replaced with TelemetryProvider.
This allows retrieving both the Tracer and (in the future) Metric interfaces.
This commit renames tracing to telemetry.tracing in both xpack/APM and elasticsearch's org.elasticsearch.tracing.Tracer (the api)
the xpack/APM is renamed as follows:
org.elasticsearch.telemetry.apm - the only exported package
org.elasticsearch.telemetry.apm.settings - APMSettings
org.elasticsearch.telemetry.apm.tracing - APMTracer
org.elasticsearch.tracing.Tracer is moved to org.elasticsearch.telemetry.tracing.Tracer (responsible for the majority of the changes in this PR)
Compare-and-swap operations on a S3 repository are implemented using
multipart uploads. Today, to try to avoid collisions, we refuse to
perform a compare-and-swap if there are other concurrent uploads in
progress. However, this means that a node which crashes partway through a
compare-and-swap will block all future register operations.
With this commit we introduce a time-to-live on S3 multipart uploads,
such that uploads older than the TTL now do not block future
compare-and-swap attempts.
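The rule described above reduces to a simple age check; this is illustrative only, with hypothetical names:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: a concurrent multipart upload only blocks compare-and-swap while it
// is younger than the TTL; an upload left behind by a crashed node eventually
// ages out and stops blocking register operations.
public class MultipartUploadTtl {
    static boolean blocksCompareAndSwap(Instant uploadStarted, Instant now, Duration ttl) {
        return Duration.between(uploadStarted, now).compareTo(ttl) < 0;
    }
}
```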
Drying this up further and adding the same short-cut for single node
tests. Dealing with most of the spots that I could grab via automatic
refactorings.
Currently we have a number of input streams that specifically override
the skip() method, disabling the ability to skip bytes. In each case
skipping would work, as we have properly implemented the read(byte[])
methods used to discard bytes. However, we appear to have disabled it
as it would be possible to retry from the end of a skip if there is a
failure in the middle. At this time, that optimization is not really
necessary; however, we sporadically use skip, so it would be nice for
these input streams to support the method. This commit enables
super.skip() and adds a comment about future optimizations.
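A simplified sketch of the change (the wrapper class here is illustrative, not one of the actual streams):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: the stream used to override skip() to disable it; now it defers to
// super.skip(). For plain InputStream subclasses, the default skip
// implementation discards bytes via the same read(byte[]) path these streams
// already implement correctly.
public class SkippableStream extends FilterInputStream {
    public SkippableStream(InputStream in) {
        super(in);
    }

    @Override
    public long skip(long n) throws IOException {
        // Previously: throw new UnsupportedOperationException("skip disabled");
        // Future optimization: resume from the end of a partial skip on
        // failure instead of restarting the whole read.
        return super.skip(n);
    }

    static int readAfterSkip(byte[] data, long toSkip) {
        try (InputStream s = new SkippableStream(new ByteArrayInputStream(data))) {
            s.skip(toSkip);
            return s.read();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```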
Today the blob store register supports recording only a `long`,
represented as an 8-byte blob. We need to store a little more data in
the register, so this commit generalises things to work with a
`BytesReference` directly.
Further work towards the S3 compare-and-exchange implementation showed
that we would like this API to permit async operations. This commit
moves to an async API.
Also, this change made it fairly awkward to use an exception to deliver
to the caller the indication that the current value could not be read,
so this commit adjusts things to use `OptionalLong` throughout as
suggested in the discussion on #93955.
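The resulting API shape can be sketched like this, with simplified stand-in types (the real interface differs):

```java
import java.util.OptionalLong;
import java.util.concurrent.CompletableFuture;

// Sketch of the async register read described above: an unreadable current
// value is reported as OptionalLong.empty() rather than via an exception.
public class AsyncRegisterSketch {
    private volatile Long value; // null models an unreadable/contended register

    AsyncRegisterSketch(Long initial) {
        this.value = initial;
    }

    CompletableFuture<OptionalLong> getRegister() {
        Long v = value;
        return CompletableFuture.completedFuture(
            v == null ? OptionalLong.empty() : OptionalLong.of(v));
    }
}
```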
Only for testing purposes through the `FsRepository` for now and rather simple,
but should get the job done and technically be correct for a compliant NFS implementation.
Co-authored-by: David Turner <david.turner@elastic.co>
Some optimisations that I found when reusing searchable snapshot code elsewhere:
* Add an efficient input stream -> byte buffer path that avoids allocations + copies for heap buffers, this is non-trivial in its effects IMO
* Also at least avoid allocations and use existing thread-local buffer when doing input stream -> direct bb
* move `readFully` to lower level streams class to enable this
* Use same thread local direct byte buffer for frozen and caching index input instead of constantly allocating new heap buffers and writing those to disk inefficiently
Use the locale-independent `Strings.format` method instead of `String.format(Locale.ROOT, ...)`.
Inline `ESTestCase.forbidden` calls with `Strings.format` for consistency's sake.
Add a `Strings.format` alias in `common.Strings`.
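The alias amounts to a thin wrapper along these lines (the real method lives in `org.elasticsearch.common.Strings`; this stand-in just shows the idea):

```java
import java.util.Locale;

// Sketch of a locale-independent format helper, so call sites don't have to
// pass Locale.ROOT themselves.
public class StringsSketch {
    public static String format(String format, Object... args) {
        // Always Locale.ROOT: output must not depend on the JVM's default
        // locale (e.g. decimal separators, digit shaping).
        return String.format(Locale.ROOT, format, args);
    }
}
```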
This commit adds a new test framework for configuring and orchestrating
test clusters for both Java and YAML REST testing. This will eventually
replace the existing "test-clusters" Gradle plugin and the build-time
cluster orchestration.
This comes out of a user heap dump investigation. In some snapshot
corner cases we ran into about 100M of duplicate 0b instances.
-> even though it's a little heavy-handed, let's make it so the common
constants that we already have are used whenever possible.
As discussed, we can be up to twice as fast without increasing CPU use
much on high-latency blob stores, so this increases the pool size to 10
to better utilize larger data nodes.
Currently, we only verify that the local environment for web identity tokens is correctly set up, but we don't verify whether it's
possible to exchange the token for credentials from the STS. If we can't get credentials from the STS, we silently fall back
to the EC2 credentials provider. Let's log the web identity token auth errors, so that users get a clear message in the logs in case the STS is unavailable to the ES server.
With this change we add the allocation deciders
in createComponents, so that we can simplify their use in the
Autoscaling plugin and implement a reserved state handler
in the future.