281 Commits

Author SHA1 Message Date
Alok Nikhil 27a998e8a0
KAFKA-12237; Support lazy initialization of quorum voter addresses (#9985)
With KIP-595, we previously expected `RaftConfig` to specify the quorum voter endpoints upfront on startup. In the general case, this works fine. However, for testing where the bound port is not known ahead of time, we need a lazier approach that discovers the other voters in the quorum after startup.

In this patch, we take the voter endpoint initialization out of `KafkaRaftClient.initialize` and move it to `RaftManager`. We use a special address to indicate that the voter addresses will be provided later. This approach also lends itself well to future use cases where we might discover voter addresses through, for example, an external service.
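The placeholder-address idea can be sketched as follows. This is a minimal illustration, not the code from the patch: the class name `VoterAddressesSketch`, the `id@host:port` entry format, and the choice of `0.0.0.0:0` as the placeholder are all assumptions made for the example.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; class and method names are hypothetical.
final class VoterAddressesSketch {
    // Assumed placeholder meaning "the real address will be supplied later".
    static final InetSocketAddress UNKNOWN =
        InetSocketAddress.createUnresolved("0.0.0.0", 0);

    private final Map<Integer, InetSocketAddress> endpoints = new HashMap<>();

    // Parses entries of the assumed form "id@host:port", separated by commas.
    static VoterAddressesSketch parse(String voters) {
        VoterAddressesSketch result = new VoterAddressesSketch();
        for (String entry : voters.split(",")) {
            String[] idAndAddress = entry.trim().split("@", 2);
            String address = idAndAddress[1];
            int colon = address.lastIndexOf(':');
            result.endpoints.put(
                Integer.parseInt(idAndAddress[0]),
                InetSocketAddress.createUnresolved(
                    address.substring(0, colon),
                    Integer.parseInt(address.substring(colon + 1))));
        }
        return result;
    }

    // True once a usable endpoint is known for the voter.
    boolean isKnown(int voterId) {
        InetSocketAddress address = endpoints.get(voterId);
        return address != null && !address.equals(UNKNOWN);
    }

    // Supplies the real endpoint, for example after a test has bound its port.
    void update(int voterId, InetSocketAddress address) {
        endpoints.put(voterId, address);
    }

    public static void main(String[] args) {
        VoterAddressesSketch voters = parse("1@localhost:9092,2@0.0.0.0:0");
        System.out.println(voters.isKnown(1) + " " + voters.isKnown(2));
    }
}
```

A test harness could start every node with the placeholder and call `update` once each socket is bound.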

Reviewers: Jason Gustafson <jason@confluent.io>
2021-01-28 17:14:56 -08:00
dengziming a26db2a1ec
KAFKA-10694; Implement zero copy for FetchSnapshot (#9819)
This patch adds zero-copy support for the `FetchSnapshot` API. Unlike the normal `Fetch` API, records are not assumed to be offset-aligned in `FetchSnapshot` responses. Hence this patch introduces a new `UnalignedRecords` type which allows us to use most of the existing logic to support zero-copy while preserving type safety in the snapshot APIs.

Reviewers: José Armando García Sancio <jsancio@gmail.com>, Jason Gustafson <jason@confluent.io>
2021-01-26 10:33:36 -08:00
Ismael Juma 6f8ca66127
MINOR: Tag `RaftEventSimulationTest` as `integration` and tweak it (#9925)
The test takes over 1 minute to run, so it should not be considered a
unit test.

Also:
* Replace `test` prefix with `check` prefix for helper methods. A common
mistake is to forget to add the @Test annotation, so it's good to use a
different naming convention for methods that should have the annotation
versus methods that should not.
* Replace `Action` functional interface with built-in `Runnable`.
* Remove unnecessary `assumeTrue`.
* Remove `@FunctionalInterface` from `Invariant` since it's not used
in that way.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2021-01-23 15:57:25 -08:00
Jason Gustafson 9689a313f5
MINOR: Drop enable.metadata.quorum config (#9934)
The primary purpose of this patch is to remove the internal `enable.metadata.quorum` configuration. Instead, we rely on `process.roles` to determine if the self-managed quorum has been enabled. As a part of this, I've done the following:

1. Replace the notion of "disabled" APIs with "controller-only" APIs. We previously marked some APIs which were intended only for the KIP-500 controller as "disabled" so that they would not be unintentionally exposed. For example, the Raft quorum APIs were disabled. Marking them as "controller-only" has the same effect, but makes clearer the intent that they should only be exposed by the KIP-500 controller.
2. Make `ForwardingManager` optional in `KafkaServer` and `KafkaApis`. Previously we used `null` if forwarding was not enabled and relied on the metadata quorum check.
3. Make `zookeeper.connect` an optional configuration if `process.roles` is defined.
4. Update raft README to remove the reference to `zookeeper.connect`.

Reviewers: Colin Patrick McCabe <cmccabe@confluent.io>, Boyang Chen <boyang@confluent.io>
2021-01-21 15:16:15 -08:00
Alok Nikhil fea2f65929
MINOR: Import RaftConfig config definitions into KafkaConfig (#9916)
This patch moves Raft config definitions from `RaftConfig` to `KafkaConfig`, where they are re-defined as internal configs until we are ready to expose them. It also adds the missing "controller" prefix that was added by KIP-631.

Reviewers: Jason Gustafson <jason@confluent.io>
2021-01-21 10:26:23 -08:00
Jason Gustafson 7ac06065f1
KAFKA-12161; Support raft observers with optional id (#9871)
We would like to be able to use `KafkaRaftClient` for tooling/debugging use cases. For this, we need the localId to be optional so that the client can be used more like a consumer. This is already supported in the `Fetch` protocol by setting `replicaId=-1`, which the Raft implementation checks for. We just need to alter `QuorumState` so that the `localId` is optional. The main benefit of doing this is that it saves tools the need to generate an arbitrary id (which might cause conflicts given limited Int32 space) and it lets the leader avoid any local state for these observers (such as `ReplicaState` inside `LeaderState`).
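A minimal sketch of what an optional local id could look like. The class `LocalReplicaSketch` and its methods are hypothetical; only the `replicaId=-1` convention comes from the description above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.OptionalInt;
import java.util.Set;

// Illustrative sketch only; this is not the actual `QuorumState` API.
final class LocalReplicaSketch {
    private final OptionalInt localId;

    LocalReplicaSketch(OptionalInt localId) {
        this.localId = localId;
    }

    // Replica id to send in Fetch requests; -1 identifies an observer with no id.
    int fetchReplicaId() {
        return localId.orElse(-1);
    }

    // Only a node whose id appears in the voter set can take part in elections.
    boolean isVoter(Set<Integer> voters) {
        return localId.isPresent() && voters.contains(localId.getAsInt());
    }

    public static void main(String[] args) {
        Set<Integer> voters = new HashSet<>(Arrays.asList(1, 2, 3));
        LocalReplicaSketch tool = new LocalReplicaSketch(OptionalInt.empty());
        System.out.println(tool.fetchReplicaId() + " " + tool.isVoter(voters));
    }
}
```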

Reviewers: Ismael Juma <ismael@juma.me.uk>, Boyang Chen <boyang@confluent.io>
2021-01-15 14:10:17 -08:00
Alok Nikhil c49f660c62
MINOR: Initialize QuorumState lazily in RaftClient.initialize() (#9881)
It is helpful to delay initialization of the `RaftClient` configuration including the voter string until after construction. This helps in integration test cases where the voter ports may not be known until sockets are bound.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2021-01-14 17:31:54 -08:00
CHUN-HAO TANG 2996642566
MINOR: Fix error message in SnapshotWriter.java (#9862)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2021-01-13 21:49:11 +08:00
Jason Gustafson f62c2b26cc
MINOR: Factor `RaftManager` out of `TestRaftServer` (#9839)
This patch factors out a `RaftManager` class from `TestRaftServer` which will be needed when we integrate this layer into the server. This class encapsulates the logic to build `KafkaRaftClient` as well as its IO thread. 

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2021-01-11 09:28:12 -08:00
José Armando García Sancio 2023aed59d
KAFKA-10427: Fetch snapshot API (#9553)
Implements the code necessary for the leader to respond to fetch snapshot requests and for the follower to fetch snapshots. This API is described in more detail in KIP-630: https://cwiki.apache.org/confluence/display/KAFKA/KIP-630%3A+Kafka+Raft+Snapshot. More specifically, this patch includes the following changes:

Leader Changes:
1. The Raft leader responds to FetchSnapshot requests by reading the local snapshot and sending the requested bytes in the response. This implementation currently copies the bytes to memory. This will be fixed in a future PR.

Follower Changes:
1. Raft followers will start fetching a snapshot if the leader sends a Fetch response that includes a SnapshotId.

2. Raft followers send FetchSnapshot requests if there is a pending download. The same timer is used for both Fetch and FetchSnapshot requests.

3. Raft followers handle FetchSnapshot responses by copying the bytes to the pending SnapshotWriter. This implementation doesn't fix the replicated log after the snapshot has been downloaded. This will be implemented in a future PR.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-12-28 18:37:08 -08:00
vamossagar12 d5151f6f09
KAFKA-10828; Replacing endorsing with acknowledging for voters (#9737)
This PR replaces the term "endorsing" with "acknowledging" for voters which have recognised the current leader.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-12-22 10:05:07 -08:00
Jason Gustafson eb9fe411bb
KAFKA-10842; Use `InterBrokerSendThread` for raft's outbound network channel (#9732)
This patch contains the following improvements:

- Separate inbound/outbound request flows so that we can open the door for concurrent inbound request handling
- Rewrite `KafkaNetworkChannel` to use `InterBrokerSendThread` which fixes a number of bugs/shortcomings
- Get rid of a lot of boilerplate conversions in `KafkaNetworkChannel` 
- Improve validation of inbound responses in `KafkaRaftClient` by checking correlationId. This fixes a bug which could cause an out of order Fetch to be applied incorrectly.
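The correlationId check in the last item can be illustrated with a small sketch. The class `InflightRequestsSketch` and its bookkeeping are assumptions made for the example, not the structure used in `KafkaRaftClient`.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and structure are hypothetical.
final class InflightRequestsSketch {
    // Destination node id -> correlationId of the latest request sent to it.
    private final Map<Integer, Integer> latestByNode = new HashMap<>();
    private int nextCorrelationId = 0;

    // Records a new outbound request and returns its correlationId.
    int send(int destinationId) {
        int correlationId = nextCorrelationId++;
        latestByNode.put(destinationId, correlationId);
        return correlationId;
    }

    // Accepts a response only if it answers the latest request to that node,
    // so a stale or out-of-order response is dropped instead of applied.
    boolean accept(int sourceId, int correlationId) {
        Integer expected = latestByNode.get(sourceId);
        if (expected == null || expected != correlationId) {
            return false;
        }
        latestByNode.remove(sourceId);
        return true;
    }

    public static void main(String[] args) {
        InflightRequestsSketch inflight = new InflightRequestsSketch();
        int first = inflight.send(1);
        int second = inflight.send(1);
        System.out.println(inflight.accept(1, first) + " " + inflight.accept(1, second));
    }
}
```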

Reviewers: David Arthur <mumrah@gmail.com>
2020-12-21 18:15:15 -08:00
dengziming 125d5ea0fb
KAFKA-10677; Complete fetches in purgatory immediately after resigning (#9639)
This patch adds logic to complete fetches immediately after resigning by returning the BROKER_NOT_AVAILABLE error. This ensures that the new election cannot be delayed by fetches which are stuck in purgatory. 
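As a rough sketch of the behavior, parked fetches can be modeled as futures that are all completed with the error when the leader resigns. The class `FetchPurgatorySketch` and its enum are hypothetical and much simpler than the real purgatory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only; not the actual purgatory implementation.
final class FetchPurgatorySketch {
    enum FetchError { NONE, BROKER_NOT_AVAILABLE }

    private final List<CompletableFuture<FetchError>> waiting = new ArrayList<>();

    // Parks a fetch until new data arrives or the leader resigns.
    CompletableFuture<FetchError> park() {
        CompletableFuture<FetchError> fetch = new CompletableFuture<>();
        waiting.add(fetch);
        return fetch;
    }

    // On resignation, fail every parked fetch right away so that followers
    // are not left waiting while a new election should be starting.
    void onResign() {
        for (CompletableFuture<FetchError> fetch : waiting) {
            fetch.complete(FetchError.BROKER_NOT_AVAILABLE);
        }
        waiting.clear();
    }

    public static void main(String[] args) {
        FetchPurgatorySketch purgatory = new FetchPurgatorySketch();
        CompletableFuture<FetchError> fetch = purgatory.park();
        purgatory.onResign();
        System.out.println(fetch.getNow(FetchError.NONE));
    }
}
```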

Reviewers: Jason Gustafson <jason@confluent.io>
2020-12-10 09:25:04 -08:00
Jason Gustafson a8b668b37c
KAFKA-10826; Ensure raft io thread respects linger timeout (#9716)
When there are no pending operations, the raft IO thread can block indefinitely waiting for a network event. We rely on asynchronous wakeups to break the blocking wait so that the thread can respond to a scheduled append. The current logic already does this, but only when the linger time has already elapsed during the call to `scheduleAppend`. It is also possible that, after making one call to `scheduleAppend` to start the linger timer, the application does not do any additional appends. In this case, we still need the IO thread to wake up when the linger timer expires. This patch fixes the problem by ensuring that the IO thread gets woken up after the first append, which begins the linger timer.
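The wake-up rule described above can be sketched with a small timer model. `LingerTimerSketch` and its methods are hypothetical names; the sketch only shows when a wake-up would be requested, not how the IO thread is actually signalled.

```java
// Illustrative sketch only; names are hypothetical.
final class LingerTimerSketch {
    private final long lingerMs;
    private long deadlineMs = -1; // -1 means no linger timer is running

    LingerTimerSketch(long lingerMs) {
        this.lingerMs = lingerMs;
    }

    // Returns true when the IO thread should be woken up: either this append
    // started the linger timer (so the thread can recompute its poll timeout)
    // or the linger time has already elapsed.
    boolean append(long nowMs) {
        if (deadlineMs < 0) {
            deadlineMs = nowMs + lingerMs;
            return true;
        }
        return nowMs >= deadlineMs;
    }

    // How long the IO thread may block before the pending batch must be drained.
    long pollTimeoutMs(long nowMs) {
        return deadlineMs < 0 ? Long.MAX_VALUE : Math.max(0, deadlineMs - nowMs);
    }

    // Called after the batch is written; stops the timer.
    void drain() {
        deadlineMs = -1;
    }

    public static void main(String[] args) {
        LingerTimerSketch timer = new LingerTimerSketch(10);
        System.out.println(timer.append(100) + " " + timer.pollTimeoutMs(105));
    }
}
```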

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-12-09 13:33:28 -08:00
vamossagar12 99b5e4f4ab
KAFKA-10634; Adding LeaderId to voters list in LeaderChangeMessage along with granting voters (#9539)
This patch ensures that the leader is included among the voters in the `LeaderChangeMessage`. It also adds an additional field for the set of granting voters, which was originally specified in KIP-595.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2020-12-08 17:37:48 -08:00
dengziming 3e5a22cefa
KAFKA-10756; Add missing unit test for `UnattachedState` (#9635)
This patch adds a unit test for `UnattachedState`, similar to `ResignedStateTest` and `VotedStateTest`.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-12-08 10:27:11 -08:00
José Armando García Sancio ab0807dd85
KAFKA-10394: Add classes to read and write snapshot for KIP-630 (#9512)
This PR adds support for generating snapshots for KIP-630.

1. Adds the interfaces `RawSnapshotWriter` and `RawSnapshotReader` and the implementations `FileRawSnapshotWriter` and `FileRawSnapshotReader` respectively. These interfaces and implementations are low-level APIs for writing and reading snapshots. They are internal to the Raft implementation and are not exposed to the users of `RaftClient`. They operate at the `Record` level. These types are exposed to the `RaftClient` through the `ReplicatedLog` interface.

2. Adds a buffered snapshot writer: `SnapshotWriter<T>`. This type is a higher-level type and it is exposed through the `RaftClient` interface. A future PR will add the related `SnapshotReader<T>`, which will be used by the state machine to load a snapshot.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-12-07 14:06:25 -08:00
Jason Gustafson f49c6c203f
KAFKA-10661; Add new resigned state for graceful shutdown/initialization (#9531)
When initializing the raft state machine after shutting down as a leader, we were previously entering the "unattached" state, which means we have no leader and no voted candidate. This was a bug because it allowed a reinitialized leader to cast a vote for a candidate in the same epoch that it was already the leader of. This patch fixes the problem by introducing a new "resigned" state which allows us to retain the leader state so that we cannot change our vote and we will not accept additional appends.
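The voting rule this fix protects can be sketched as below. This is a heavily simplified model with hypothetical names (`ElectionStateSketch`, `canGrantVote`); the real state machine has more states and persists more information.

```java
// Illustrative sketch only; a two-state simplification of the real state machine.
final class ElectionStateSketch {
    enum Role { UNATTACHED, RESIGNED }

    final Role role;
    final int epoch;

    private ElectionStateSketch(Role role, int epoch) {
        this.role = role;
        this.epoch = epoch;
    }

    // State to enter on startup, given the persisted epoch and whether this
    // node was the leader of that epoch when it shut down.
    static ElectionStateSketch initialize(int persistedEpoch, boolean wasLeader) {
        Role role = wasLeader ? Role.RESIGNED : Role.UNATTACHED;
        return new ElectionStateSketch(role, persistedEpoch);
    }

    // A resigned leader must not vote for another candidate in the epoch it
    // led; it may only vote once a candidate proposes a higher epoch.
    boolean canGrantVote(int candidateEpoch) {
        if (candidateEpoch < epoch) {
            return false;
        }
        if (candidateEpoch == epoch) {
            return role == Role.UNATTACHED;
        }
        return true;
    }

    public static void main(String[] args) {
        ElectionStateSketch restartedLeader = initialize(5, true);
        System.out.println(restartedLeader.role + " " + restartedLeader.canGrantVote(5));
    }
}
```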

This patch also revamps the shutdown logic to make use of the new "resigned" state. Previously we had a separate path in `KafkaRaftClient.poll` for the shutdown logic which resulted in some duplication. Instead now we incorporate shutdown behavior into each state's respective logic.

Finally, this patch changes the shutdown logic so that `EndQuorumEpoch` is only sent by resigning leaders. Previously we allowed this request to be sent by candidates as well.

Reviewers: dengziming <dengziming1993@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
2020-11-09 12:52:28 -08:00
Jason Gustafson 21a65e1043
KAFKA-10632; Raft client should push all committed data to state machines (#9482)
In #9418, we add a listener to the `RaftClient` interface. In that patch, we used it only to send commit notifications for writes from the leader. In this PR, we extend the `handleCommit` API to accept all committed data and we remove the pull-based `read` API. Additionally, we add two new callbacks to the listener interface in order to notify the state machine when the raft client has claimed or resigned leadership.

Finally, this patch allows the `RaftClient` to support multiple listeners. This is necessary for KIP-500 because we will have one listener for the controller role and one for the broker role.
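A push-based listener with multiple registrations might look roughly like this. Only `handleCommit` is named in the description; `handleClaim`, `handleResign`, and the `Notifier` class are hypothetical names chosen for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the actual `RaftClient.Listener` definition.
final class ListenerSketch {
    interface Listener<T> {
        // Receives every committed batch, not just writes made by the leader.
        void handleCommit(List<T> records);

        // Hypothetical callback names for leadership changes.
        default void handleClaim(int epoch) { }

        default void handleResign() { }
    }

    static final class Notifier<T> {
        private final List<Listener<T>> listeners = new ArrayList<>();

        void register(Listener<T> listener) {
            listeners.add(listener);
        }

        // Pushes committed records to every registered listener in order.
        void fireCommit(List<T> records) {
            for (Listener<T> listener : listeners) {
                listener.handleCommit(records);
            }
        }
    }

    public static void main(String[] args) {
        Notifier<String> notifier = new Notifier<>();
        notifier.register(records -> System.out.println("controller saw " + records));
        notifier.register(records -> System.out.println("broker saw " + records));
        notifier.fireCommit(Arrays.asList("a", "b"));
    }
}
```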

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Boyang Chen <boyang@confluent.io>
2020-11-02 15:06:58 -08:00
dengziming b4100d4b28
KAFKA-10644; Fix VotedToUnattached test error (#9503)
This patch fixes a test case in `QuorumStateTest`. The method name is "testVotedToUnattachedHigherEpoch," but the code initialized in the unattached state instead of the voted state.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-10-27 16:41:00 -07:00
Jason Gustafson 927edfece3
KAFKA-10601; Add support for append linger to Raft implementation (#9418)
The patch adds `quorum.append.linger.ms` behavior to the raft implementation. This gives users a powerful knob to tune the impact of fsync.  When an append is accepted from the state machine, it is held in an accumulator (similar to the producer) until the configured linger time is exceeded. This allows the implementation to amortize fsync overhead at the expense of some write latency.

The patch also improves our methodology for testing performance. Up to now, we have relied on the producer performance test, but it is difficult to simulate expected controller loads because producer performance is limited by other factors such as the number of producer clients and head-of-line blocking. Instead, this patch adds a workload generator which runs on the leader after election.

Finally, this patch brings us nearer to the write semantics expected by the KIP-500 controller. It makes the following changes:

- Introduce `RecordSerde<T>` interface which abstracts the underlying log implementation from `RaftClient`. The generic type is carried over to `RaftClient<T>` and is exposed through the read/write APIs.
- `RaftClient.append` is changed to `RaftClient.scheduleAppend` and returns the last offset of the expected log append.
- `RaftClient.scheduleAppend` accepts a list of records and ensures that the full set are included in a single batch.
- Introduce `RaftClient.Listener` with a single `handleCommit` API which will eventually replace `RaftClient.read` in order to surface committed data to the controller state machine. Currently `handleCommit` is only used for records appended by the leader.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Guozhang Wang <wangguoz@gmail.com>
2020-10-27 12:10:13 -07:00
José Armando García Sancio 94820ca652
MINOR: Refactor RaftClientTest to be used by other tests (#9476)
There is a lot of functionality in KafkaRaftClientTest that is useful for writing other tests. Refactor that functionality into another class that can be reused in other tests.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-10-22 18:14:27 -07:00
Ismael Juma 8acbd85e1d
MINOR: Update raft/README.md and minor RaftConfig tweaks (#9484)
* Replace quorum.bootstrap.servers and quorum.bootstrap.voters with
quorum.voters.
* Remove seemingly unused `verbose` config.
* Use constant to avoid unnecessary repeated concatenation.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-10-22 17:51:05 -07:00
dengziming 42ce00fdd6
MINOR: refactor CandidateState.unrecordedVoters (#9442)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2020-10-22 16:36:41 +08:00
Jason Gustafson a72f0c1eac
KAFKA-10533; KafkaRaftClient should flush log after appends (#9352)
This patch adds missing flush logic to `KafkaRaftClient`. The initial flushing behavior is simplistic. We guarantee that the leader will not replicate above the last flushed offset and we guarantee that the follower will not fetch data above its own flush point. More sophisticated flush behavior is proposed in KAFKA-10526.

We have also extended the simulation test so that it covers flush behavior. When a node is shut down, all unflushed data is lost. We were able to confirm that the monotonic high watermark invariant fails without the added `flush` calls.
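The flush guarantee and the simulated data loss can be modeled with two offsets. `FlushedLogSketch` is a hypothetical toy log, not the `ReplicatedLog` interface.

```java
// Illustrative sketch only; a toy model of a log with a flush point.
final class FlushedLogSketch {
    private long endOffset = 0;
    private long flushedOffset = 0;

    // Appends records to the in-memory end of the log.
    void append(int recordCount) {
        endOffset += recordCount;
    }

    // Makes everything appended so far durable.
    void flush() {
        flushedOffset = endOffset;
    }

    // Upper bound on what may be replicated or fetched from: never above the
    // last flushed offset.
    long replicableEndOffset() {
        return flushedOffset;
    }

    // Simulates a shutdown in which all unflushed data is lost.
    void crash() {
        endOffset = flushedOffset;
    }

    long endOffset() {
        return endOffset;
    }

    public static void main(String[] args) {
        FlushedLogSketch log = new FlushedLogSketch();
        log.append(3);
        System.out.println(log.endOffset() + " " + log.replicableEndOffset());
    }
}
```

Because nothing above `replicableEndOffset()` is ever exposed, data lost in `crash()` can never have been counted toward the high watermark.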

This patch also piggybacks a fix to the `TestRaftServer` implementation. The initial check-in contained a bug which caused `RequestChannel` to fail sending responses because the disabled APIs did not have metrics registered. As a result, it was impossible to elect leaders.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-10-13 08:59:02 -07:00
Jason Gustafson 05f9803d72
KAFKA-10527; Voters should not reinitialize as leader in same epoch (#9348)
One of the invariants that the raft replication protocol relies on is that each record is uniquely identified by leader epoch and offset. This can be violated if a leader remains elected with the same epoch between restarts since unflushed data could be lost.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-10-06 10:11:56 -07:00
Guozhang Wang 5146c5a6cb
MINOR: Update doc for raft state metrics (#9342)
Reviewers:  Jason Gustafson <jason@confluent.io>
2020-10-05 14:52:34 -07:00
Jason Gustafson dbe3e4a4cc
KAFKA-10511; Ensure monotonic start epoch/offset updates in `MockLog` (#9332)
There is a minor difference between the epoch caching logic in `MockLog` and the behavior in `LeaderEpochFileCache`. The latter ensures that every new epoch/start offset entry added to the cache increases monotonically over the previous entries. This patch brings the behavior of `MockLog` in line.

It also simplifies the `assignEpochStartOffset` api in `ReplicatedLog`. We always intend to use the log end offset, so this patch removes the start offset parameter.
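The monotonic rule for epoch/start offset entries can be sketched as follows. This is an assumption-laden illustration: the class `EpochStartOffsetsSketch` is hypothetical, it assumes strictly increasing epochs with non-decreasing start offsets, and it simply rejects a violating entry, which may differ from how `LeaderEpochFileCache` or `MockLog` resolve conflicts.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and the rejection policy are assumptions.
final class EpochStartOffsetsSketch {
    private static final class Entry {
        final int epoch;
        final long startOffset;

        Entry(int epoch, long startOffset) {
            this.epoch = epoch;
            this.startOffset = startOffset;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Adds the entry only if the epoch is higher than the last one and the
    // start offset does not move backwards. Returns whether it was added.
    boolean assign(int epoch, long startOffset) {
        if (!entries.isEmpty()) {
            Entry last = entries.get(entries.size() - 1);
            if (epoch <= last.epoch || startOffset < last.startOffset) {
                return false;
            }
        }
        entries.add(new Entry(epoch, startOffset));
        return true;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        EpochStartOffsetsSketch cache = new EpochStartOffsetsSketch();
        System.out.println(cache.assign(1, 0) + " " + cache.assign(1, 5));
    }
}
```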

Reviewers: Boyang Chen <boyang@confluent.io>
2020-09-28 17:16:55 -07:00
Jason Gustafson ac8acec653
KAFKA-10519; Add missing unit test for `VotedState` (#9337)
Add a simple unit test for `VotedState`. 

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-09-25 09:12:56 -07:00
Ismael Juma 51957de806
MINOR: Use JUnit 5 in raft module (#9331)
I also removed a test class with no tests currently (Jason filed KAFKA-10519 for
filling the test gap).

Reviewers: Jason Gustafson <jason@confluent.io>
2020-09-24 02:37:17 -07:00
Jason Gustafson b7c8490cf4
KAFKA-10492; Core Kafka Raft Implementation (KIP-595) (#9130)
This is the core Raft implementation specified by KIP-595: https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum. We have created a separate "raft" module where most of the logic resides. The new APIs introduced in this patch in order to support Raft election and such are disabled in the server until the integration with the controller is complete. Until then, there is a standalone server which can be used for testing the performance of the Raft implementation. See `raft/README.md` for details.

Reviewers: Guozhang Wang <wangguoz@gmail.com>, Boyang Chen <boyang@confluent.io>

Co-authored-by: Boyang Chen <boyang@confluent.io>
Co-authored-by: Guozhang Wang <wangguoz@gmail.com>
2020-09-22 11:32:44 -07:00