# Distributed Area Team Internals
(Summary, brief discussion of our features)
# Networking
### ThreadPool
(We have many thread pools, what and why)
### ActionListener
See the [Javadocs for `ActionListener`](https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/action/ActionListener.java).
(TODO: add useful starter references and explanations for a range of Listener classes. Reference the Netty section.)
### REST Layer
The REST and Transport layers are bound together through the `ActionModule`. `ActionModule#initRestHandlers` registers all the
REST actions with a `RestController` that matches incoming requests to particular REST actions. `RestController#registerHandler`
uses each `Rest*Action`'s `#routes()` implementation to match HTTP requests to that particular `Rest*Action`. REST actions
typically follow the class naming convention `Rest*Action`, which makes them easier to find, but not always; the `#routes()`
definition can also be helpful in finding a REST action. `RestController#dispatchRequest` eventually calls `#handleRequest` on a
`RestHandler` implementation. `RestHandler` is the base interface for `BaseRestHandler`, which most `Rest*Action` instances extend to
implement a particular REST action.

`BaseRestHandler#handleRequest` calls into `BaseRestHandler#prepareRequest`, which child `Rest*Action` classes override to
define the behavior for a particular action. `RestController#dispatchRequest` passes a `RestChannel` to the `Rest*Action` via
`RestHandler#handleRequest`: `Rest*Action#prepareRequest` implementations return a `RestChannelConsumer` defining how to execute
the action and reply on the channel (usually by completing an `ActionListener` wrapper). `Rest*Action#prepareRequest`
implementations are responsible for parsing the incoming request and verifying that its structure is valid.
`BaseRestHandler#handleRequest` will then check that all the request parameters have been consumed: unexpected request parameters
result in an error.
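The flow above can be sketched with plain Java, using hypothetical simplified stand-ins for `RestController`, `BaseRestHandler`, and `RestChannelConsumer` (the real classes carry much more machinery, this only mirrors the route matching, the `prepareRequest` step, and the unconsumed-parameter check):

```java
import java.util.*;
import java.util.function.*;

// Simplified sketch (hypothetical types): a controller maps (method, path)
// routes to handlers; a handler parses its request in prepareRequest, then
// handleRequest rejects any parameters that were not consumed.
public class RestDispatchSketch {

    record Route(String method, String path) {}

    interface Handler { String handleRequest(Map<String, String> params); }

    static class Controller {
        private final Map<Route, Handler> handlers = new HashMap<>();

        // mirrors RestController#registerHandler consuming a handler's routes()
        void register(Route route, Handler handler) { handlers.put(route, handler); }

        // mirrors RestController#dispatchRequest matching a request to a handler
        String dispatch(String method, String path, Map<String, String> params) {
            Handler h = handlers.get(new Route(method, path));
            if (h == null) return "404";
            return h.handleRequest(params);
        }
    }

    // mirrors BaseRestHandler: prepareRequest records what it consumed,
    // handleRequest errors out on anything left over, then runs the consumer.
    static abstract class BaseHandler implements Handler {
        public final String handleRequest(Map<String, String> params) {
            Set<String> consumed = new HashSet<>();
            Supplier<String> consumer = prepareRequest(params, consumed); // "RestChannelConsumer"
            if (!consumed.containsAll(params.keySet())) {
                Set<String> unexpected = new HashSet<>(params.keySet());
                unexpected.removeAll(consumed);
                return "400: unrecognized parameters " + unexpected;
            }
            return consumer.get(); // execute the action and reply on the channel
        }
        abstract Supplier<String> prepareRequest(Map<String, String> params, Set<String> consumed);
    }

    public static void main(String[] args) {
        Controller controller = new Controller();
        controller.register(new Route("GET", "/_cat/health"), new BaseHandler() {
            Supplier<String> prepareRequest(Map<String, String> params, Set<String> consumed) {
                consumed.add("format");
                String format = params.getOrDefault("format", "text");
                return () -> "health as " + format;
            }
        });
        System.out.println(controller.dispatch("GET", "/_cat/health", Map.of("format", "json")));
        System.out.println(controller.dispatch("GET", "/_cat/health", Map.of("bogus", "1")));
    }
}
```

Note how the unconsumed-parameter check happens after `prepareRequest` returns but before the consumer runs, so a request with a typo in a parameter name fails fast instead of silently ignoring the parameter.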
### How REST Actions Connect to Transport Actions
The REST layer uses an implementation of `AbstractClient`. `BaseRestHandler#prepareRequest` takes a `NodeClient`: this client
knows how to connect to a specified TransportAction. A `Rest*Action` implementation will return a `RestChannelConsumer` that
most often invokes a method on the `NodeClient` to pass through to the TransportAction. Along the way from
`BaseRestHandler#prepareRequest` through the `AbstractClient` and `NodeClient` code, `NodeClient#executeLocally` is called: this
method calls into `TaskManager#registerAndExecute`, registering the operation with the `TaskManager` so it can be found in Task
API requests, before moving on to execute the specified TransportAction.

`NodeClient` has a `NodeClient#actions` map from `ActionType` to `TransportAction`. `ActionModule#setupActions` registers all the
core TransportActions, as well as those defined in any plugins that are being used: plugins can override `Plugin#getActions()` to
define additional TransportActions. Note that not all TransportActions map back to a REST action: many TransportActions are
used only for internode operations/communications.
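A minimal sketch of that lookup, using hypothetical simplified types (the real `NodeClient`, `TaskManager`, and `TransportAction` have far richer signatures, this only mirrors the `ActionType` → `TransportAction` map and the register-then-execute ordering):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Simplified sketch (hypothetical types): an ActionType key maps to the
// TransportAction that executes it, and execution registers a task with the
// task manager first, mirroring NodeClient#executeLocally calling
// TaskManager#registerAndExecute before running the action.
public class NodeClientSketch {

    record ActionType(String name) {}

    interface TransportAction<Req, Resp> { Resp execute(Req request); }

    static class TaskManager {
        private long nextTaskId = 0;
        long register(String action) { return nextTaskId++; } // now visible to Task API requests
    }

    static class NodeClient {
        private final Map<ActionType, TransportAction<?, ?>> actions = new HashMap<>();
        private final TaskManager taskManager = new TaskManager();

        // ActionModule#setupActions populates this map; plugins add entries via Plugin#getActions()
        <Req, Resp> void register(ActionType type, TransportAction<Req, Resp> action) {
            actions.put(type, action);
        }

        @SuppressWarnings("unchecked")
        <Req, Resp> Resp executeLocally(ActionType type, Req request, Consumer<Long> onTaskRegistered) {
            long taskId = taskManager.register(type.name()); // register before executing
            onTaskRegistered.accept(taskId);
            TransportAction<Req, Resp> action = (TransportAction<Req, Resp>) actions.get(type);
            return action.execute(request);
        }
    }

    public static void main(String[] args) {
        NodeClient client = new NodeClient();
        client.register(new ActionType("indices:data/write/index"),
                (TransportAction<String, String>) doc -> "indexed:" + doc);
        String resp = client.executeLocally(new ActionType("indices:data/write/index"),
                "{\"field\":1}", taskId -> System.out.println("registered task " + taskId));
        System.out.println(resp);
    }
}
```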
### Transport Layer
(Managed by the TransportService, TransportActions must be registered there, too)
(Executing a TransportAction (either locally via NodeClient or remotely via TransportService) is where most of the authorization & other security logic runs)
(What actions, and why, are registered in TransportService but not NodeClient?)
### Direct Node to Node Transport Layer
(TransportService maps incoming requests to TransportActions)
### Chunk Encoding
#### XContent
### Performance
### Netty
(Long-running actions should be forked off the Netty thread. Keep short operations on it to avoid forking costs.)
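The note above can be sketched with plain `java.util.concurrent` primitives (a hypothetical simplification, not the actual Netty event loops or Elasticsearch thread pools): a single-threaded "event loop" runs short operations inline and hands long-running work to a separate executor so the loop stays free to serve other requests.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical simplification of the rule of thumb above: short work runs
// inline on the single event-loop thread; long work is forked to a generic
// pool so the event loop is never blocked on it.
public class ForkingSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor(r -> new Thread(r, "event-loop"));
        ExecutorService generic = Executors.newFixedThreadPool(2, r -> new Thread(r, "generic"));

        eventLoop.submit(() -> {
            // short operation: cheaper to run inline than to pay the forking cost
            System.out.println("short op on " + Thread.currentThread().getName());

            // long-running operation: fork it off the event-loop thread
            generic.submit(() -> {
                try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                System.out.println("long op on " + Thread.currentThread().getName());
            });
        });

        eventLoop.shutdown();
        generic.shutdown();
        generic.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```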
### Work Queues
### RestClient
The `RestClient` is primarily used in testing, to send requests against cluster nodes in the same format a user would. There
are some uses of `RestClient`, via `RestClientBuilder`, in the production code. For example, remote reindex uses the
`RestClient` internally as the REST client to the remote Elasticsearch cluster, taking advantage of the compatibility of
`RestClient` requests with much older Elasticsearch versions. The `RestClient` is also used externally by the `Java API Client`
to communicate with Elasticsearch.
# Cluster Coordination
(Sketch of important classes? Might inform more sections to add for details.)
(A node, say NodeB, can coordinate a search across several other nodes even when NodeB itself does not hold the data, and then return a result to the caller. Explain this coordinating role.)
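The coordinating role described above follows a scatter/gather shape, which can be sketched with hypothetical simplified types (the real search coordination involves shard routing, per-shard phases, and partial-failure handling that this deliberately omits):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Generic sketch (hypothetical types) of the coordinating role: the
// coordinator holds none of the data itself; it fans the query out to the
// nodes that do, merges the partial results, and returns one response.
public class CoordinationSketch {

    interface DataNode { List<String> queryShard(String query); }

    static List<String> coordinateSearch(String query, Map<String, DataNode> dataNodes) {
        List<String> merged = new ArrayList<>();
        for (DataNode node : dataNodes.values()) {  // scatter
            merged.addAll(node.queryShard(query));  // gather
        }
        merged.sort(String::compareTo);             // reduce to one ordered result
        return merged;
    }

    public static void main(String[] args) {
        Map<String, DataNode> nodes = Map.of(
                "nodeA", q -> List.of("doc2", "doc1"),
                "nodeC", q -> List.of("doc3"));
        // "NodeB" plays coordinator here: it owns none of the data above
        System.out.println(coordinateSearch("match_all", nodes));
    }
}
```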
### Node Roles
### Master Nodes
### Master Elections
(Quorum, terms, any eligibility limitations)
### Cluster Formation / Membership
(Explain joining, and how it happens every time a new master is elected)
#### Discovery
### Master Transport Actions
### Cluster State
#### Master Service
#### Cluster State Publication
(Majority consensus to apply; what happens if a master-eligible node falls behind / is incommunicado.)
#### Cluster State Application
(Go over the two kinds of listeners -- ClusterStateApplier and ClusterStateListener?)
#### Persistence
(Sketch ephemeral vs persisted cluster state.)
(what's the format for persisted metadata)
# Replication
(More Topics: ReplicationTracker concepts / highlights.)
### What is a Shard
### Primary Shard Selection
(How a primary shard is chosen)
#### Versioning
(terms and such)
### How Data Replicates
(How an index write replicates across shards -- TransportReplicationAction?)
### Consistency Guarantees
(What guarantees do we give the user about persistence and readability?)
# Locking
(rarely use locks)
### ShardLock
### Translog / Engine Locking
### Lucene Locking
# Engine
(What does Engine mean in the distrib layer? Distinguish Engine vs Directory vs Lucene)
(High level explanation of how translog ties in with Lucene)
(contrast Lucene vs ES flush / refresh / fsync)
### Refresh for Read
(internal vs external reader manager refreshes? flush vs refresh)
### Reference Counting
### Store
(Data lives beyond a high-level IndexShard instance. It continues to exist until all references to the Store go away; then the Lucene data is removed.)
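That lifetime rule is a reference-counting pattern, sketched below with hypothetical simplified types (the real Store uses Elasticsearch's ref-counting infrastructure and actually deletes Lucene files; this only mirrors the "last reference out deletes the data" behavior):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Generic reference-counting sketch (hypothetical types): the on-disk data
// is removed only when the last reference to the Store is released, even if
// the owning IndexShard has already been closed.
public class StoreRefCountSketch {

    static class Store {
        private final AtomicInteger refCount = new AtomicInteger(1); // owner's initial reference
        private boolean dataDeleted = false;

        void incRef() {
            if (refCount.getAndIncrement() <= 0) throw new IllegalStateException("store already closed");
        }

        void decRef() {
            if (refCount.decrementAndGet() == 0) {
                dataDeleted = true; // here the real Store would remove the Lucene files
                System.out.println("last reference released: Lucene data removed");
            }
        }

        boolean isDataDeleted() { return dataDeleted; }
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.incRef();                 // e.g. an in-flight recovery holds a reference
        store.decRef();                 // the IndexShard closes, releasing its reference...
        System.out.println("after shard close, data deleted? " + store.isDataDeleted());
        store.decRef();                 // ...and later the last holder releases its reference
        System.out.println("after last release, data deleted? " + store.isDataDeleted());
    }
}
```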
### Translog
(Explain checkpointing and generations, when happens on Lucene flush / fsync)
(Concurrency control for flushing)
(VersionMap)
#### Translog Truncation
#### Direct Translog Read
### Index Version
### Lucene
(copy a sketch of the files Lucene can have here and explain)
(Explain about SearchIndexInput -- IndexWriter, IndexReader -- and the shared blob cache)
(Lucene uses a Directory abstraction; ES extends/overrides the Directory class to implement different forms of file storage.
Lucene keeps a map of where all the data is located in files and offsets, and fetches it from the various files.
ES doesn't just treat Lucene as a storage engine at the bottom (the end) of the stack. Rather, ES has other information that
works in parallel with the storage engine.)
#### Segment Merges
# Recovery
(All shards go through a 'recovery' process. Describe high level. createShard goes through this code.)
(How is the translog involved in recovery?)
### Create a Shard
### Local Recovery
### Peer Recovery
### Snapshot Recovery
### Recovery Across Server Restart
(Do partial shard recoveries survive server restart? `reestablishRecovery`? How does that work?)
### How a Recovery Method is Chosen
# Data Tiers
(Frozen, warm, hot, etc.)
# Allocation
(AllocationService runs on the master node)
(Discuss different deciders that limit allocation. Sketch / list the different deciders that we have.)
### APIs for Balancing Operations
(Significant internal APIs for balancing a cluster)
### Heuristics for Allocation
### Cluster Reroute Command
(How does this command behave with the desired auto balancer.)
# Autoscaling
(Reactive and proactive autoscaling. Explain that we surface recommendations, and how the control plane uses them.)
(Sketch / list the different deciders that we have, and then also how we use information from each to make a recommendation.)
# Snapshot / Restore
(We've got some good package level documentation that should be linked here in the intro)
(copy a sketch of the file system here, with explanation -- good reference)
### Snapshot Repository
### Creation of a Snapshot
(Include an overview of the coordination between data and master nodes, which writes what and when)
(Concurrency control: generation numbers, pending generation number, etc.)
(partial snapshots)
### Deletion of a Snapshot
### Restoring a Snapshot
### Detecting Multiple Writers to a Single Repository
# Task Management / Tracking
(How we identify operations/tasks in the system and report upon them. How we group operations via parent task ID.)
### What Tasks Are Tracked
### Tracking A Task Across Threads
### Tracking A Task Across Nodes
### Kill / Cancel A Task
### Persistent Tasks
# Cross Cluster Replication (CCR)
(Brief explanation of the use case for CCR)
(Explain how this works at a high level, and details of any significant components / ideas.)
### Cross Cluster Search
# Indexing / CRUD
(Explain that the Distributed team is responsible for the write path, while the Search team owns the read path.)
(Generating document IDs. Same across shard replicas, \_id field)
(Sequence number: different than ID)
### Reindex
### Locking
(what limits write concurrency, and how do we minimize)
### Soft Deletes
### Refresh
(explain visibility of writes, and reference the Lucene section for more details (whatever makes more sense explained there))
# Server Startup
# Server Shutdown
### Closing a Shard
(this can also happen during shard reallocation, right? This might be a standalone topic, or need another section about it in allocation?...)