This commit enables concurrent search execution in the DFS phase, which improves resource usage as well as the performance of kNN queries, which benefit from both concurrent rewrite and concurrent collection.
We will enable concurrent execution for the query phase in a subsequent commit. While this commit does not introduce parallelism for the query phase, it does introduce offloading of sequential computation to the newly introduced executor. This applies both when a single slice needs to be searched and when a specific request does not support concurrency (currently only the DFS phase supports concurrency, regardless of the request). The only case where sequential collection is not offloaded is when the request includes aggregations that don't support offloading: composite, nested and cardinality, as their post-collection method must be executed in the same thread as the collection, or we would trip a Lucene assertion that verifies that doc_values are pulled and consumed from the same thread.
## Technical details
This commit introduces a secondary executor, used exclusively to execute the concurrent bits of search. The search threads still coordinate the search (the search call originates from them), but the actual work is offloaded to the newly introduced executor.
We offload not only parallel execution but also sequential execution, to make the workload more predictable: it would be surprising to have bits of search executed in either of the two thread pools depending on the request. Mixing the two would also make it possible to suddenly run a higher number of heavy operations overall (some on the caller threads and some on the separate threads), which could overload the system as well as make sizing the thread pools more difficult.
Note that fetch, together with other actions, is still executed in the search thread pool. This commit does not make the search thread pool a merely coordinating thread pool overall; that is true only for the IndexSearcher#search operation itself, which is nevertheless a big portion of the different phases of search API execution.
Given that the searcher blocks waiting for all tasks to complete, we take a simple approach: we introduce a thread pool executor that has the same size as the existing search thread pool but relies on an unbounded queue. This simplifies the handling of the thread pool queue and of rejections. In fact, we want to guarantee that the secondary thread pool never rejects, and delegate queuing entirely to the search thread pool, which is the entry point for every search operation anyway. The principle behind this is that if you got a slot in the search thread pool, you should be able to complete your search, and rather quickly.
As part of this commit we are also introducing the ability to cancel tasks that have not started yet, so that if any task throws an exception, other tasks are prevented from starting needless computation.
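A minimal sketch of both ideas, using hypothetical names (`searchSlice` and the slice count are stand-ins): a fixed-size executor backed by an unbounded queue, so it never rejects, plus a shared flag that lets queued tasks skip their work once any task has failed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class ConcurrentSearchSketch {
    public static void main(String[] args) throws InterruptedException {
        // Secondary executor with the same size as the search thread pool, but
        // an unbounded queue: it never rejects, queuing stays in the search pool.
        int searchPoolSize = Runtime.getRuntime().availableProcessors();
        ExecutorService workers = new ThreadPoolExecutor(
            searchPoolSize, searchPoolSize, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>());

        // If any task fails, tasks that have not started yet do no needless work.
        AtomicBoolean failed = new AtomicBoolean();
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int slice = 0; slice < 8; slice++) {
            int s = slice;
            tasks.add(() -> {
                if (failed.get()) {
                    return null; // "cancelled" before it ever started
                }
                try {
                    searchSlice(s); // hypothetical per-slice search work
                    return null;
                } catch (Exception e) {
                    failed.set(true);
                    throw e;
                }
            });
        }
        workers.invokeAll(tasks); // the caller search thread blocks until done
        workers.shutdown();
    }

    static void searchSlice(int slice) {
        // stand-in for the IndexSearcher#search work on one slice
    }
}
```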
Relates to #80693
Relates to #90700
Today by default the `SEARCH_COORDINATION` pool is sized at half the
allocated processors, or five if there are more than ten CPUs. Yet, if
we scale up a node to have more than ten CPUs, we probably want to scale
up the number of search coordination threads to match. This commit
removes the limit of five threads.
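A sketch of the change under these assumptions (method names are hypothetical, and the exact rounding of the halving is an assumption):

```java
class SearchCoordinationSizing {
    // Before: half the allocated processors, capped at five.
    static int before(int allocatedProcessors) {
        return Math.min(5, (allocatedProcessors + 1) / 2);
    }

    // After: half the allocated processors, with the cap removed, so a node
    // with 32 CPUs gets 16 search coordination threads rather than 5.
    static int after(int allocatedProcessors) {
        return (allocatedProcessors + 1) / 2;
    }
}
```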
Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this
commit duplicates the troubleshooting info across both places.
A completely idle `transport_worker` thread is reported as `0.0%` idle,
which is confusing. Moreover the docs on the network threading model do
not reflect the changes made in #90482. This commit fixes both of those
things.
Suggests calling `jstack` every 15s to ensure that at least one capture shows a stuck thread. Also adds a link to this guide to the list on the troubleshooting overview page.
Explains why you should remove `cluster.initial_master_nodes`, and
rewords some of the other sections a little for (subjectively) improved
readability.
* Fixes CORS headers needed by Elastic clients
Updates the default value for the `http.cors.allow-headers`
setting to include headers used by Elastic client libraries.
Also adds the `access-control-expose-headers` header to responses to
CORS requests so that clients can successfully perform their product
check.
In #92309 we aligned the sizes of the `search` and the `get` thread pools, but the docs still contain the prior `get` thread pool size. With this commit we align the docs too.
Relates #92309
We sometimes see a `ShardLockObtainFailedException` when a shard failed
to shut down as fast as we expected, often because a node left and
rejoined the cluster. Sometimes this is because it was held open by
ongoing scrolls or PITs, but other times it may be because the shutdown
process itself is too slow. With this commit we add the ability to
capture and log a thread dump at the time of the failure to give us more
information about where the shutdown process might be running slowly.
Relates #93226
If debug logging is enabled then the lag detector will capture and
report the hot threads of a lagging node. In some cases the resulting
log message can be very large, exceeding 10kiB, which means it is
truncated in most logging setups. The relevant thread(s) may be waiting
on I/O, which is not considered "hot" and therefore may not appear in
the first 10kiB.
This commit adjusts this logging mechanism to split the message into
chunks of size at most 2kiB (after compression and base64-encoding) to
ensure that the entire hot threads output can be faithfully
reconstructed from these logs.
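A sketch of the chunking step; gzip is an assumed stand-in for whatever compression is actually applied:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.zip.GZIPOutputStream;

class HotThreadsChunking {
    static final int CHUNK_SIZE = 2 * 1024; // at most 2kiB per log message

    static List<String> toChunks(String hotThreads) throws IOException {
        // Compress and base64-encode the full hot threads output...
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(hotThreads.getBytes(StandardCharsets.UTF_8));
        }
        String encoded = Base64.getEncoder().encodeToString(out.toByteArray());
        // ...then split it so no single message exceeds the logger's limits.
        // The reader concatenates the chunks, decodes and decompresses them to
        // faithfully reconstruct the original output.
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < encoded.length(); i += CHUNK_SIZE) {
            chunks.add(encoded.substring(i, Math.min(encoded.length(), i + CHUNK_SIZE)));
        }
        return chunks;
    }
}
```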
Closes #88126
Today to troubleshoot an unstable cluster we ask the users to parse the
rather complex `node-join` and `node-left` messages emitted by the
`MasterService`. These messages may refer to many nodes, may be
truncated, and are generally pretty hard to work with.
With this commit we start to emit a simplified log message about each
node added and removed. It also renames the respective executor classes:
- `JoinTaskExecutor` -> `NodeJoinExecutor`
- `NodeRemovalClusterStateTaskExecutor` -> `NodeLeftExecutor`
This brings their names in line with each other, and the messages that
they emit, whilst preserving the older `node-join` and `node-left`
terminology as reported by the `MasterService`.
Finally, it updates the troubleshooting docs to reflect these new and simplified log messages.
Relates #92741
* Update search-settings documentation to reflect the fact that the indices.query.bool.max_clause_count setting has been deprecated
* Fix indentation
* Replace Elasticsearch with {es}
* Add deprecation entry to release notes
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
Currently the documentation on network threading suggests that we still
use a model where we have individual workers dedicated to server
sockets. That is no longer true and server sockets are assigned to
normal workers. This commit updates the documentation.
* Update threadpool.asciidoc
Starting from 8.0, the value of the `node.processors` setting is bounded by the number of available processors (https://github.com/elastic/elasticsearch/pull/44894).
* Update docs/reference/modules/threadpool.asciidoc
Co-authored-by: Adam Locke <adam.locke@elastic.co>
When parsing queries on the coordinating node, there is currently no way to share state between the different parsing methods (`fromXContent`). The only query that supports a parse context is the bool query, which uses the context to track the nested depth of queries, added with #66204. This nested depth tracking mechanism is not 100% accurate, as it tracks bool queries only, while there are many more query types that can hold other queries and hence potentially cause a stack overflow when deeply nested.
This change removes the parsing context that's specific to bool query, introduced with #66204, in favour of generalizing the nested depth tracking to all query types.
The generic tracking is introduced by wrapping the parser and overriding the method that parses named objects through the xcontent registry. Another way would have been to require a context argument when parsing queries, which would mean adding a context argument to all the QueryBuilder#fromXContent static methods. That would be a breaking change for plugins that provide custom queries, hence I opted for a different approach.
One aspect that this change requires and introduces is the distinction between parsing a top level query (which wraps the parser, or would create the context if we had one) and parsing an inner query, which proceeds with the given parser and context. We already have this distinction in the form of two different static methods in `AbstractQueryBuilder`, but in practice only the bool query made use of it, being the only context-aware query.
In addition to generalizing the tracking of nested depth when parsing queries, we should be able to adopt this same strategy to track query usage as part of #90176.
Given that the depth check is now more restrictive, as it counts all compound queries and not only bool, we have decided to raise the default limit to `30` to ensure that users are not going to hit the limit due to this change.
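A minimal sketch of the generalized tracking; the class and method names are hypothetical stand-ins for the wrapped parser and the overridden named-object parsing method:

```java
// A delegating parser increments a counter whenever a named query object is
// parsed through the registry, and fails once the maximum depth is exceeded.
class DepthTrackingParser {
    static final int MAX_NESTED_DEPTH = 30; // the new, raised default limit
    private int depth;

    <T> T parseNamedQuery(String queryName, QueryParser<T> parser) {
        if (++depth > MAX_NESTED_DEPTH) {
            throw new IllegalArgumentException("query [" + queryName
                + "] exceeded the maximum nested depth of " + MAX_NESTED_DEPTH);
        }
        try {
            return parser.parse(this); // inner queries re-enter through here
        } finally {
            depth--; // unwind on the way out of each nesting level
        }
    }

    interface QueryParser<T> {
        T parse(DepthTrackingParser parser);
    }
}
```

Because every query type is parsed through the same wrapped method, the depth count covers all compound queries, not just `bool`.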
Adds to the docs a note that the `100mb` default for
`http.max_content_length` is the recommended maximum, along with
suggestions for what to do when hitting this limit.
Introduce max headroom settings for the low, high, and flood disk watermark stages, similar to the existing max headroom setting for the flood stage of the frozen tier. Introduce the new max headrooms in HealthMetadata and in ReactiveStorageDeciderService. Add multiple tests in DiskThresholdDeciderUnitTests, DiskThresholdDeciderTests and DiskThresholdMonitorTests. Moreover, add addition and subtraction operations for ByteSizeValue, as well as min.
This commit adds support for a floating point node.processors setting. This is useful when the nodes run in an environment where the CPU time assigned to the ES node process is limited (e.g. using cgroups). With this change, the system can size the thread pools accordingly, by rounding up the provided setting to the nearest integer.
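A sketch of the rounding, assuming thread pool sizes are derived from a whole number of allocated processors:

```java
class ProcessorsSketch {
    // node.processors may now be fractional, e.g. 1.5 CPUs under cgroups;
    // thread pools are sized from the value rounded up to a whole number.
    static int allocatedProcessors(double nodeProcessors) {
        return (int) Math.ceil(nodeProcessors); // 1.5 -> 2
    }
}
```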
The docs for `transport.ping_schedule` note that the transport client
defaults to a 5s ping schedule, but this is no longer relevant. This
commit drops this from the docs, and also moves the docs for this
setting further down the page to reflect its relative unimportance.
Today we say that voting-only nodes require a "low-latency" network.
This term has a specific meaning in some operating environments which is
different from our intended meaning. To avoid this confusion this commit
removes the absolute term "low-latency" in favour of describing the
requirements relative to the user's own performance goals.
Clean up network setting docs
- Add types for all params
- Remove mention of JDKs before 11
- Clarify some wording
Co-authored-by: Stef Nestor <steffanie.nestor@gmail.com>
This change ensures that existing read_only_allow_delete blocks that are placed on indices when the flood_stage watermark threshold is exceeded are removed when disk threshold monitoring is disabled.
This is done by changing how InternalClusterInfoService behaves when
disabled. With this change, it will keep calling the registered
listeners periodically, but with an empty ClusterInfo.
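A sketch of the new behaviour with hypothetical names; the point is that listeners keep firing while the service is disabled:

```java
import java.util.List;
import java.util.function.Consumer;

class ClusterInfoServiceSketch {
    record ClusterInfo(boolean empty) {
        static final ClusterInfo EMPTY = new ClusterInfo(true);
    }

    private volatile boolean enabled = true;
    private final List<Consumer<ClusterInfo>> listeners;

    ClusterInfoServiceSketch(List<Consumer<ClusterInfo>> listeners) {
        this.listeners = listeners;
    }

    // Called periodically. Previously a disabled service skipped the refresh
    // entirely, so stale read_only_allow_delete blocks were never revisited.
    // Now the registered listeners still run, just with an empty ClusterInfo.
    void refresh() {
        ClusterInfo info = enabled ? gatherClusterInfo() : ClusterInfo.EMPTY;
        for (Consumer<ClusterInfo> listener : listeners) {
            listener.accept(info);
        }
    }

    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    ClusterInfo gatherClusterInfo() {
        return new ClusterInfo(false); // stand-in for real disk usage stats
    }
}
```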
Closes #86383
Our current default for the http.max_header_size setting is 8kb. This
is lower than the current default for Kibana (16kb in 8.x), and the ESS
proxy (1mb based on the Go http library default). To align with the
current convention of other Elastic components, this PR increases the
ES header size setting default to 16kb.
Closes #88501
* Convert disk watermarks to RelativeByteSizeValues
Similar to the existing watermark setting for the frozen tier.
Pre-requisite for PR 88639 that plans to introduce max headroom
settings for the disk watermarks, similar to the frozen tier max
headroom setting.
* Add changelog
* Revert 20gb to 20GB
* Make formatNoTrailingZerosPercent non static
* ByteSizeValue.MINUS_ONE
* Remove getMinimumTotalSizeForBelowWatermark
* Remove comment
* Fix minor stuff
* Make parsing of RelativeByteSizeValue faster
Mimics the older definitelyNotPercentage function
* Remove Locale from Strings.format
* More MINUS_ONE
* Adding discovery troubleshooting link
* Add tags to pull in discovery troubleshooting content
* Move discovery troubleshooting to separate page and add redirects
Co-authored-by: Adam Locke <adam.locke@elastic.co>
In #85074 we added docs on discovery troubleshooting that really only
talked about troubleshooting master elections. There's also the case
where the master is elected fine but some other node can't join it. This
commit adds troubleshooting docs about that too.
Co-authored-by: Adam Locke <adam.locke@elastic.co>
Fixes a few scalability issues around join validation:
- compresses the cluster state sent over the wire
- shares the serialized cluster state across multiple nodes
- forks the decompression/deserialization work off the transport thread
Relates #77466
Closes #83204
Ensures that every page of the docs that mentions `cluster.initial_master_nodes` also mentions that this setting must be removed after bootstrapping completes.
Today it's no longer true that by default nodes will auto-discover other
nodes on the same host and bootstrap them all into a cluster. This
commit fixes the docs on auto-bootstrapping to recognise this.
Today we don't really say anything about the requirements for the data
path in terms of correctness, and we specifically say to avoid NFS for
performance reasons. This isn't wholly accurate: some NFS
implementations work just fine. This commit documents a more balanced
position on local vs remote storage.
This moves the bulk of the upgrade information into the consolidated upgrade guide, but leaves the primary upgrade topic in place as a cross reference.
Relates to: https://github.com/elastic/stack-docs/pull/1970
Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
(cherry picked from commit f6473d71f9)
Co-authored-by: debadair <debadair@elastic.co>
This commit updates the Operator-only functionality doc to
mention the operator only settings introduced in #82819.
It also adds an integration test for those operator only
settings that would have caught #83359.
As of 8.0, the compatibility window for cross-cluster search (CCS) to an earlier release will be one minor release. This updates the CCS docs and adds a related 8.0 breaking change.
Closes https://github.com/elastic/elasticsearch/issues/80782
* Adds a prerequisites section covering remote cluster config, node roles, and security.
* Moves existing content about remote cluster config to the prereqs.
* Updates the remote cluster docs to include information about eligible gateway nodes and tagging for gateway nodes.
Closes https://github.com/elastic/elasticsearch/issues/72001
Updates the remote clusters version compatibility table to include 7.17 and 8.x versions.
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Today the same-shard allocation decider falls back to checking the
hostname if the node has no host address. In practice nodes will always
have an address so the fallback is dead code. This commit removes that
dead code.
Relates #80702 which will add the ability to distinguish nodes by
hostname regardless of whether they have an address or not, and #80767
which optimizes this area of code - this refactoring should make the
optimization simpler.
Today we increase the verbosity of discovery failures after 5 minutes
without a master. Unfortunately 5 minutes is a common orchestration
timeout, so if discovery is broken then we see nodes being shut down
just before they start to emit useful logs. This commit reduces the
default timeout to 3 minutes to address that.
We have a few leftover mentions of `zen` discovery, mostly for
historical/BwC reasons, which this commit removes.
Prior to this commit the default value for `discovery.type` was `zen`
but this was not written down anywhere or officially supported: the two
options were to set it to `single-node` or to omit it entirely. This
commit changes the default to `multi-node` and documents this.
Co-authored-by: Adam Locke <adam.locke@elastic.co>
Today we have a short note in one place in the docs saying not to touch
the contents of the data path. This commit expands the warning to
describe more precisely what is forbidden, and to give some more detail
of the consequences, and also duplicates the warning to the other
location that documents the `path.data` setting.
Deprecate the script context cache in favor of the general cache.
Users should use the following settings:
`script.max_compilations_rate` to set the max compilation rate
for user scripts such as filter scripts. Certain script contexts
that submit scripts outside of the control of the user are
exempted from this rate limit. Examples include runtime fields,
ingest and watcher.
`script.cache.max_size` to set the max size of the cache.
`script.cache.expire` to set the expiration time for entries in
the cache.
What's deprecated?
`script.max_compilations_rate: use-context`. This special
setting value was used to turn on the script context-specific caches.
`script.context.$CONTEXT.cache_max_size`, use `script.cache.max_size`
instead.
`script.context.$CONTEXT.cache_expire`, use `script.cache.expire`
instead.
`script.context.$CONTEXT.max_compilations_rate`, use
`script.max_compilations_rate` instead.
The default cache size was increased from `100` to `3000`, which
was approximately the max cache size when using context-specific caches.
The default compilation rate limit was increased from `75/5m` to
`150/5m` to account for increasing uses of scripts.
System script contexts can now opt-out of compilation rate limiting
using a flag rather than a sentinel rate limit value.
7.16: Script: Deprecate script context cache #79508
Refs: #62899
7.16: Script: Opt-out system contexts from script compilation rate limit #79459
Refs: #62899
Today we limit the max number of concurrent snapshot file restores
per recovery. This works well when the default
node_concurrent_recoveries is used (which is 2). When this limit is
increased, it is possible to exhaust the underlying repository
connection pool, affecting other workloads.
This commit adds a new setting
`indices.recovery.max_concurrent_snapshot_file_downloads_per_node` that
allows limiting the max number of snapshot file downloads per node
during recoveries. When a recovery starts on the target node, it tries
to acquire a permit that, when granted, allows it to download snapshot
files. This is communicated to the source node in the
StartRecoveryRequest. This is a rather conservative approach, since it
is possible that a recovery that gets a permit to use snapshot files
doesn't recover any snapshot file, while a concurrent recovery that
doesn't get a permit could have taken advantage of recovering from a
snapshot.
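A sketch of the permit scheme using a plain semaphore; the class name is hypothetical, and in the real change the outcome travels to the source node in the StartRecoveryRequest:

```java
import java.util.concurrent.Semaphore;

class SnapshotDownloadPermits {
    // Node-wide cap on concurrent snapshot-file downloads across recoveries.
    private final Semaphore permits;

    SnapshotDownloadPermits(int maxConcurrentSnapshotFileDownloadsPerNode) {
        this.permits = new Semaphore(maxConcurrentSnapshotFileDownloadsPerNode);
    }

    // Called when a recovery starts on the target node: if no permit is
    // available, the recovery falls back to copying from the primary rather
    // than risking exhaustion of the repository's connection pool.
    boolean tryAcquireSnapshotFileDownloads() {
        return permits.tryAcquire();
    }

    void release() {
        permits.release();
    }
}
```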
Closes #79044
Changes can-match from a shard-level to a node-level action. This helps avoid an explosion of shard-level can-match subrequests in clusters with many shards, which can cause stability issues. Also introduces a new search_coordination thread pool to send and handle the node-level can-match requests.
This PR changes uses of transient cluster settings to persistent cluster settings. It also deprecates the use of transient settings.
Relates to #49540
* A typo error
a space between 'E' and 'cluster...'
* Update example, fix headings, change notes
Co-authored-by: Adam Locke <adam.locke@elastic.co>
Co-authored-by: Marwane Chahoud <marwane.chahoud@gmail.com>
* [DOCS] Fix default value for closed indices
#57953 introduced changes that added ESS icons to many Elasticsearch settings. As part of those changes, the default value for `cluster.indices.close.enable` was indicated as `false`, when it should be `true`. This PR updates the default value to `true`.
Closes #78877
* Update description
* Update note to remove outdated claims
The documentation indicates that `stack.templates.enabled` can be used in Elasticsearch Service, but it is not part of the settings allowlist in ESS. This PR makes the documentation match the state of the allowlist.
* Improve docs for pre-release version compatibility
Follow-up to #78317 clarifying a couple of points:
- a pre-release build can restore snapshots from released builds
- compatibility applies if at least one of the local or remote cluster
is a released build
* Remote cluster build date nit
The reference manual includes docs on version compatibility in various
places, but it's not clear that these docs only apply to released
versions and that the rules for pre-release versions are stricter than
folks expect. This commit adds some words to the docs for unreleased
versions which explains this subtlety.
* [DOCS] Update remote cluster docs
* Add files, rename files, write new stuff
* Plethora of changes
* Add test and update snippets
* Redirects, moved files, and test updates
* Moved file to x-pack for tests
* Remove older CCS page and add redirects
* Cleanup, link updates, and some rewrites
* Update image
* Incorporating user feedback and rewriting much of the remote clusters page
* More changes from review feedback
* Numerous updates, including request examples for CCS and Kibana
* More changes from review feedback
* Minor clarifications on security for remote clusters
* Incorporate review feedback
Co-authored-by: Yang Wang <ywangd@gmail.com>
* Some review feedback and some editorial changes
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Yang Wang <ywangd@gmail.com>
We currently use the plaintext body of a shard request as the key to the
request cache. This has the disadvantage that very large requests can
quickly fill up the cache due to the size of their keys. With this commit,
we instead use a SHA-256 hash of the shard request as the cache key,
which uses a constant (and much smaller) number of bytes.
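A sketch of the key construction (the helper name is hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class RequestCacheKeySketch {
    // Hash the serialized shard request instead of storing it verbatim: every
    // key is a fixed 32 bytes, however large the request body happens to be.
    static byte[] cacheKey(byte[] serializedShardRequest) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(serializedShardRequest);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-256 is always available", e);
        }
    }

    public static void main(String[] args) {
        byte[] key = cacheKey("{\"query\":{\"match_all\":{}}}".getBytes(StandardCharsets.UTF_8));
        System.out.println(key.length); // 32
    }
}
```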
Today we expire the client stats for HTTP channels 5 minutes after they
close. It's possible to open a very large number of HTTP channels in 5
minutes, possibly inadvertently, and the stats for those channels can be
overwhelming.
This commit introduces a limit on the number of channels tracked by each
node which applies in addition to the age limit, and makes these limits
configurable via static settings. It drops the pruning of old stats when
starting to track a new channel and instead uses a queue to expire the
oldest stats when each channel closes if necessary to respect the count
limit; it only performs age-based expiry when retrieving the stats,
since the count limit now bounds the memory needed. Finally, it
fixes some missing synchronization and makes sure that we expose only immutable objects to the stats subsystem.
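A sketch of the count-based expiry with hypothetical types; the age-based expiry performed when retrieving the stats is omitted:

```java
import java.util.ArrayDeque;
import java.util.Queue;

class ClosedChannelStatsSketch {
    record HttpChannelStats(long channelId, long closedTimeMillis) {}

    private final int maxClosedChannels; // static setting, applied per node
    private final Queue<HttpChannelStats> closedChannels = new ArrayDeque<>();

    ClosedChannelStatsSketch(int maxClosedChannels) {
        this.maxClosedChannels = maxClosedChannels;
    }

    // On channel close: enqueue the stats and, if over the count limit, drop
    // the oldest entry. Nothing is pruned when channels open, and the count
    // limit alone bounds the memory needed between stats retrievals.
    synchronized void onChannelClosed(HttpChannelStats stats) {
        closedChannels.add(stats);
        if (closedChannels.size() > maxClosedChannels) {
            closedChannels.poll();
        }
    }
}
```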
This is related to #73497. Currently, we only use the configured
transport.compression_scheme setting when compressing a request or a
response. Additionally, the cluster.remote.*.compression_scheme
setting is ignored. This commit fixes this behavior by respecting the
per-cluster setting. Additionally, it resolves confusion around inbound
and outbound connections by always responding with the same scheme that
was received. This allows remote connections to have different schemes
than local connections.
This commit adds peer recoveries from snapshots. It allows establishing a replica by downloading file data from a snapshot rather than transferring the data from the primary.
Enabling this feature is done on the repository definition. Repositories having the setting `use_for_peer_recovery=true` will be consulted to find a good snapshot when recovering a shard.
Relates #73496
In 7.15, we intend for the indexing_data compression level and the
compression scheme lz4 to no longer be experimental. This commit
updates the documentation to reflect this. Additionally, it adds
missing docs for the cluster.remote.*.transport.compression_scheme
setting.
Relates to #73497.
The special values `_global_`, `_site_`, `0.0.0.0` and so on may resolve
to multiple addresses, of which one is chosen to be the publish address.
This commit generalises the warning about reachability as applied to
DNS-resolved hostnames to also apply to these special values.
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
This commit adds a new set of classes that compute a peer recovery plan
based on source files + target files + available snapshots. When
possible, it tries to maximize the number of files reused from a
snapshot. It uses repositories with the `use_for_peer_recovery` setting
set to true.
It adds a new recovery setting `indices.recovery.use_snapshots`
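A sketch of the plan computation with hypothetical types; file identity here is reduced to name plus checksum:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RecoveryPlanSketch {
    record FileInfo(String name, String checksum) {}
    record Plan(List<FileInfo> fromSnapshot, List<FileInfo> fromSource) {}

    // For every file missing on the target, prefer an identical file from the
    // snapshot; fall back to transferring it from the primary (source node).
    static Plan computePlan(List<FileInfo> missingOnTarget,
                            Map<String, FileInfo> snapshotFiles) {
        List<FileInfo> fromSnapshot = new ArrayList<>();
        List<FileInfo> fromSource = new ArrayList<>();
        for (FileInfo file : missingOnTarget) {
            FileInfo inSnapshot = snapshotFiles.get(file.name());
            if (inSnapshot != null && inSnapshot.checksum().equals(file.checksum())) {
                fromSnapshot.add(file); // maximize files reused from the snapshot
            } else {
                fromSource.add(file);
            }
        }
        return new Plan(fromSnapshot, fromSource);
    }
}
```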
Relates #73496
In the upcoming Lucene 9 release, `indices.query.bool.max_clause_count` is
going to apply to the entire query tree rather than per `bool` query. In order
to avoid breaks, the limit has been bumped from 1024 to 4096.
The semantics will effectively change when we upgrade to Lucene 9; this PR is only about agreeing on a migration strategy and documenting this change.
To avoid further breaks, I am leaning towards keeping the current setting name
even though it contains `bool`. I believe that it still makes sense given that
`bool` queries are typically the main contributors to high numbers of clauses.
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>