Add support for KIP-953 KRaft Quorum reconfiguration in the DescribeQuorum request and response.
Also add support to AdminClient.describeQuorum, so that users will be able to find the current set of
quorum nodes, as well as their directories, via these RPCs.
Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Andrew Schofield <aschofield@confluent.io>
1. Changed a log message from error to info - we may expect the HW calculation to give us a smaller result than the current HW in the case of quorum reconfiguration. We still do not allow the HW to actually decrease.
2. The logic for finding the updated LeaderEndOffset for updateReplicaState has changed as well. We no longer assume the leader is in the voter set and check the observer states as well.
3. updateLocalState now accepts an additional "lastVoterSet" param which allows us to update the leader state with the last known voters. Any nodes in this set but not in voterStates will be added to voterStates and removed from observerStates; any nodes not in this set but in voterStates will be removed from voterStates and added to observerStates (see the sketch below).
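A minimal sketch of that reconciliation, using hypothetical state maps keyed by replica id (the real LeaderState tracks richer per-replica state than the end offsets used here):

```java
import java.util.*;

// Hypothetical sketch: reconcile voterStates/observerStates against lastVoterSet.
final class VoterReconciliation {
    static void reconcile(Set<Integer> lastVoterSet,
                          Map<Integer, Long> voterStates,     // replica id -> end offset
                          Map<Integer, Long> observerStates) {
        // Nodes in lastVoterSet but missing from voterStates: promote from observers.
        for (Integer id : lastVoterSet) {
            if (!voterStates.containsKey(id)) {
                voterStates.put(id, observerStates.getOrDefault(id, -1L));
                observerStates.remove(id);
            }
        }
        // Nodes in voterStates but not in lastVoterSet: demote to observers.
        Iterator<Map.Entry<Integer, Long>> it = voterStates.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> entry = it.next();
            if (!lastVoterSet.contains(entry.getKey())) {
                observerStates.put(entry.getKey(), entry.getValue());
                it.remove();
            }
        }
    }
}
```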
Reviewers: Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@apache.org>
Allow KRaft replicas to send requests to any node (Node), not just the nodes configured in the
controller.quorum.voters property. This flexibility is needed so KRaft can implement the
controller.quorum.bootstrap.servers configuration, send requests to the dynamically changing set of
voters, and send requests to the leader endpoint (Node) discovered through the KRaft RPCs
(specifically the BeginQuorumEpoch request and Fetch response).
This was achieved by changing the RequestManager API to accept Node instead of just the replica ID.
Internally, the request manager tracks connection state using the Node.idString method to match the
connection management used by NetworkClient.
The API for RequestManager is also changed so that the ConnectState class is not exposed in the
API. This allows the request manager to reclaim heap memory for any connection that is ready.
The NetworkChannel was updated to receive the endpoint information (Node) through the outbound raft
request (RaftRequest.Outbound). This makes the network channel more flexible as it doesn't need to
be configured with the list of all possible endpoints. RaftRequest.Outbound and
RaftResponse.Inbound were updated to include the remote node instead of just the remote id.
The follower state tracked by KRaft replicas was updated to include both the leader id and the
leader's endpoint (Node). In this comment the node value is computed from the set of voters. In
future commit this will be updated so that it is sent through KRaft RPCs. For example
BeginQuorumEpoch request and Fetch response.
Support for configuring controller.quorum.bootstrap.servers was added. This includes changes to
KafkaConfig, QuorumConfig, etc. All of the tests using QuorumTestHarness were changed to use the
controller.quorum.bootstrap.servers instead of the controller.quorum.voters for the broker
configuration. Finally, the node ids for the bootstrap servers will be decreasing negative numbers
starting at -2 (see the sketch below).
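As a rough illustration of that id scheme (the helper and its inputs here are hypothetical, assuming a resolved list of bootstrap addresses):

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.Node;

// Hypothetical sketch: assign decreasing negative ids (-2, -3, ...) to the
// bootstrap servers so they never collide with real replica ids (>= 0).
final class BootstrapNodes {
    static List<Node> fromAddresses(List<InetSocketAddress> bootstrapServers) {
        List<Node> nodes = new ArrayList<>();
        int id = -2;
        for (InetSocketAddress address : bootstrapServers) {
            nodes.add(new Node(id--, address.getHostString(), address.getPort()));
        }
        return nodes;
    }
}
```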
Reviewers: Jason Gustafson <jason@confluent.io>, Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
Fix the code in the RaftControllerNodeProvider to query RaftManager to find Node information,
rather than consulting a static map. Add a RaftManager.voterNode function to supply this
information. In KRaftClusterTest, add testControllerFailover to get more coverage of controller
failovers.
Reviewers: José Armando García Sancio <jsancio@apache.org>
Allow KRaft replicas to read and write versions 0 and 1 of the quorum-state file. Which version is written is controlled by kraft.version: with kraft.version 0, version 0 of the quorum-state file is written; with kraft.version 1, version 1 of the quorum-state file is written. Version 1 of the quorum-state file adds the VotedDirectoryId field and removes the CurrentVoters field. The other fields removed in version 1 are not important as they were not overwritten or used by KRaft.
In kraft.version 1 the set of voters will be stored in the kraft partition log segments and snapshots.
To implement this feature the following changes were made to KRaft.
FileBasedStateStore was renamed to FileQuorumStateStore to better match the name of the implemented interface QuorumStateStore.
The QuorumStateStore::writeElectionState was extended to include the kraft.version. This version is used to determine which version of QuorumStateData to store. When writing version 0 the VotedDirectoryId is not persisted but the latest value is kept in-memory. This allows replicas to vote consistently while they stay online. If a replica restarts in the middle of an election it will forget the VotedDirectoryId if the kraft.version is 0. This should be rare in practice and should only happen if there is an election and failure while the system is upgrading to kraft.version 1.
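A hedged sketch of that version selection (the generated QuorumStateData accessors and the writer shape here are assumptions, not the exact implementation):

```java
import org.apache.kafka.common.Uuid;
import org.apache.kafka.raft.generated.QuorumStateData;

// Hypothetical sketch: pick the quorum-state schema version from kraft.version
// and drop VotedDirectoryId when persisting version 0 (it stays in memory only).
final class QuorumStateWriter {
    static short schemaVersionFor(short kraftVersion) {
        // kraft.version 0 -> quorum-state file version 0 (no VotedDirectoryId),
        // kraft.version 1 -> quorum-state file version 1.
        return kraftVersion >= 1 ? (short) 1 : (short) 0;
    }

    void write(QuorumStateData data, short kraftVersion) {
        short dataVersion = schemaVersionFor(kraftVersion);
        if (dataVersion == 0) {
            // VotedDirectoryId is not persisted in version 0.
            data.setVotedDirectoryId(Uuid.ZERO_UUID);
        }
        writeToFile(data, dataVersion);
    }

    private void writeToFile(QuorumStateData data, short version) {
        /* serialize at the given version, write temp file, fsync, atomic move */
    }
}
```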
The type ElectionState, the interface EpochState and all of the implementations of EpochState (VotedState, UnattachedState, FollowerState, ResignedState, CandidateState and LeaderState) are extended to support the new voted directory id.
The type QuorumState is changed so that the local directory id is used. The type is also changed so that the latest value for the set of voters and the kraft version is queried from the KRaftControlRecordStateMachine.
The replica directory id is read from the meta.properties and passed to the KafkaRaftClient. The replica directory id is guaranteed to be set in the local replica.
Adds a new metric for current-vote-directory-id which exposes the latest in-memory value of the voted directory id.
Renames VoterSet.VoterKey to ReplicaKey.
It is important to note that after this change, kraft controllers and brokers will not yet write version 1 of the quorum-state file. This change adds support for reading and writing version 1 of the file in preparation for future changes.
Reviewers: Jun Rao <junrao@apache.org>
Validate that a control batch in the batch accumulator has at least one control record.
Reviewers: Jun Rao <junrao@apache.org>, Chia-Ping Tsai <chia7712@apache.org>
Adds support for the KafkaRaftClient to read the control records KRaftVersionRecord and VotersRecord in the snapshot and log. As the control records in the KRaft partition are read, the replica's known set of voters are updated. This change also contains the necessary changes to include the control records when a snapshot is generated by the KRaft state machine.
It is important to note that this commit changes the code and the in-memory state to track the sets of voters but it doesn't change any data that is externally exposed. It doesn't change the RPCs, data stored on disk or configuration.
When the KRaft replica starts, the PartitionListener reads the latest snapshot and then the log segments up to the LEO, updating the in-memory state as it reads KRaftVersionRecord and VotersRecord. When the replica (leader and follower) appends to the log, the PartitionListener catches up to the new LEO. When the replica truncates the log because of a diverging epoch, the PartitionListener also truncates the in-memory state to the new LEO. When the state machine generates a new snapshot the PartitionListener trims any prefix entries that are no longer needed. This is all done to minimize the amount of data tracked in-memory and to make sure that it matches the state on disk.
To implement the functionality described above this commit also makes the following changes:
Adds control records for KRaftVersionRecord and VotersRecord. KRaftVersionRecord describes the finalized kraft.version supported by all of the replicas. VotersRecord describes the set of voters at a specific offset.
Changes Kafka's feature version to support 0 as the smallest valid value. This is needed because the default value for kraft.version is 0.
Refactors FileRawSnapshotWriter so that it doesn't directly call the onSnapshotFrozen callback. It adds NotifyingRawSnapshotWriter for calling such callbacks. This reorganization is needed because in this change both the KafkaMetadataLog and the KafkaRaftClient need to react to snapshots getting frozen.
Cleans up KafkaRaftClient's initialization. Removes initialize from RaftClient - this is an implementation detail that doesn't need to be exposed in the interface. Removes RaftConfig.AddressSpec and simplifies the bootstrapping of the static voter's address. The bootstrapping of the address is delayed because of tests. We should be able to simplify this further in future commits.
Update the DumpLogSegment CLI to support the new control records KRaftVersionRecord and VotersRecord.
Fix the RecordsSnapshotReader implementations so that the iterator includes control records. RecordsIterator is extended to support reading the new control records.
Improve the BatchAccumulator implementation to allow multiple control records in one control batch. This is needed so that KRaft can make sure that VotersRecord is included in the same batch as the control record (KRaftVersionRecord) that upgrades the kraft.version to 1.
Add a History interface and default implementation TreeMapHistory. This is used to track all of the sets of voters between the latest snapshot and the LEO. This is needed so that KafkaRaftClient can query for the latest set of voters and so that KafkaRaftClient can include the correct set of voters when the state machine generates a new snapshot at a given offset.
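A hedged sketch of such a history type, assuming a TreeMap-backed implementation (the method names here are illustrative, not the exact interface):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Optional;
import java.util.TreeMap;

// Illustrative History: tracks a value (e.g. the set of voters) at each offset
// between the latest snapshot and the LEO.
interface History<T> {
    void addAt(long offset, T value);
    Optional<T> valueAtOrBefore(long offset); // latest value up to an offset
    void truncateNewEntries(long endOffset);  // undo entries on log truncation
    void trimPrefixTo(long startOffset);      // drop entries made stale by a snapshot
}

final class TreeMapHistory<T> implements History<T> {
    private final NavigableMap<Long, T> history = new TreeMap<>();

    @Override public void addAt(long offset, T value) {
        history.put(offset, value);
    }
    @Override public Optional<T> valueAtOrBefore(long offset) {
        return Optional.ofNullable(history.floorEntry(offset)).map(Map.Entry::getValue);
    }
    @Override public void truncateNewEntries(long endOffset) {
        history.tailMap(endOffset, true).clear(); // remove entries >= endOffset
    }
    @Override public void trimPrefixTo(long startOffset) {
        // Keep the last entry at or before startOffset; it is still current there.
        Long floor = history.floorKey(startOffset);
        if (floor != null) history.headMap(floor, false).clear();
    }
}
```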
Add a builder pattern for RecordsSnapshotWriter. The new builder pattern also implements including the KRaftVersionRecord and VotersRecord control records in the snapshot as necessary. A KRaftVersionRecord should be appended if the kraft.version is greater than 0 at the snapshot's offset. Similarly, a VotersRecord should be appended to the snapshot with the latest value up to the snapshot's offset.
Reviewers: Jason Gustafson <jason@confluent.io>
KRaft was only notifying listeners of the latest leader and epoch when the replica transitioned to a new state. This could result in the listener never getting notified if the registration happened after the replica had become a follower.
This problem doesn't exist for the active leader because the KRaft implementation attempts to notify the listener of the latest leader and epoch when the replica is the active leader.
This issue is fixed by notifying the listeners of the latest leader and epoch after processing the listener registration request.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
When there is only one voter, there will be no fetch requests from other voters. In this case, we should still not expire the checkQuorum timer, because the single voter on its own satisfies the quorum.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Federico Valeri <fedevaleri@gmail.com>, José Armando García Sancio <jsancio@apache.org>
This patch moves the `RaftIOThread` implementation into Java. I changed the name to `KafkaRaftClientDriver` since the main thing it does is drive the calls to `poll()`. There shouldn't be any changes to the logic.
Reviewers: José Armando García Sancio <jsancio@apache.org>
* MINOR: Clean up KRaft code
Clean up minor issues with KRaft code.
Add batch info, excluding the records, to ease troubleshooting in RecordsSnapshotReader#lastContainedLogTimestamp.
Reviewers: Luke Chen <showuon@gmail.com>
Signed-off-by: Josep Prat <josep.prat@aiven.io>
In KIP-595, we expect to piggy-back on the `quorum.fetch.timeout.ms` config: if the leader did not receive Fetch requests from a majority of the quorum for that amount of time, it would begin a new election, to resolve network partitions in the quorum. But this implementation was missing from current KRaft. This PR fixes that.
The commit include:
1. Added a timer with a timeout configuration in `LeaderState`, and check whether it has expired each time the leader is polled. If expired, the leader resigns and starts a new election.
2. Added `fetchedVoters` to `LeaderState`, updated each time a FETCH or FETCH_SNAPSHOT request is received; it is cleared and the timer reset once the majority - 1 of the remote voters have sent such requests (see the sketch below).
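A rough sketch of that bookkeeping, with names and majority arithmetic assumed from the description above:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the check-quorum tracking in LeaderState.
final class CheckQuorumTracker {
    private final Set<Integer> fetchedVoters = new HashSet<>();
    private long checkQuorumDeadlineMs;

    CheckQuorumTracker(long nowMs, long checkQuorumTimeoutMs) {
        this.checkQuorumDeadlineMs = nowMs + checkQuorumTimeoutMs;
    }

    // Called when a FETCH or FETCH_SNAPSHOT request is received from a voter.
    void recordFetch(int voterId, int voterCount, long nowMs, long checkQuorumTimeoutMs) {
        fetchedVoters.add(voterId);
        // The leader counts toward the majority, so (majority - 1) remote voters
        // suffice; with a single voter this is 0, so the timer never expires.
        int majorityMinusOne = voterCount / 2;
        if (fetchedVoters.size() >= majorityMinusOne) {
            fetchedVoters.clear();
            checkQuorumDeadlineMs = nowMs + checkQuorumTimeoutMs;
        }
    }

    // Checked on every poll; if true, the leader should resign and re-elect.
    boolean expired(long nowMs) {
        return nowMs >= checkQuorumDeadlineMs;
    }
}
```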
Reviewers: José Armando García Sancio <jsancio@apache.org>
This is now possible since `InterBrokerSend` was moved from `core` to `server-common`.
Also rewrite/move `KafkaNetworkChannelTest`.
The scala version of `KafkaNetworkChannelTest` passed with the changes here (before I
deleted it).
Reviewers: Justine Olshan <jolshan@confluent.io>, José Armando García Sancio <jsancio@users.noreply.github.com>
Spotbugs was temporarily disabled as part of KAFKA-15485 to support the Kafka build with JDK 21. This PR upgrades the spotbugs version to 4.8.0, which adds support for JDK 21, and enables its usage in the build again.
Reviewers: Divij Vaidya <diviv@amazon.com>
This is one of the steps required for kafka to compile with Java 21.
For each case, one of the following fixes was applied:
1. Suppress the warning if fixing it would potentially result in an incompatible change (for public classes)
2. Add final to one or more methods so that the escape is not possible
3. Replace method calls with direct field access.
In addition, we also fix a couple of compiler warnings related to deprecated references in the `core` module.
See the following for more details regarding the new lint warning:
https://www.oracle.com/java/technologies/javase/21-relnote-issues.html#JDK-8015831
Reviewers: Divij Vaidya <diviv@amazon.com>, Satish Duggana <satishd@apache.org>, Chris Egerton <chrise@aiven.io>
On the Windows OS, atomic moves are not allowed if the file has another open handle. E.g.
__cluster_metadata-0\quorum-state: The process cannot access the file because it is being used by another process
at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92)
at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:403)
at java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:293)
at java.base/java.nio.file.Files.move(Files.java:1430)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:949)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:932)
at org.apache.kafka.raft.FileBasedStateStore.writeElectionStateToFile(FileBasedStateStore.java:152)
This is fixed by first closing the temporary quorum-state file before attempting to move it.
Reviewers: Colin Patrick McCabe <cmccabe@apache.org>
Co-Authored-By: Renaldo Baur Filho <renaldobf@gmail.com>
Reading an unknown version of the quorum-state file should trigger an error. Currently the only known version is 0, so reading any other version should cause an error.
Reviewers: Justine Olshan <jolshan@confluent.io>, Luke Chen <showuon@gmail.com>
In a non-empty log the KRaft leader only notifies the listener of leadership when it has read to the leader's epoch start offset. This guarantees that the leader epoch has been committed and that the listener has read all committed offsets/records.
Unfortunately, the KRaft leader doesn't do this when the log is empty. When the log is empty the listener is notified immediately when it has become leader. This makes the API inconsistent and harder to program against.
This change fixes that by having the KRaft leader wait for the listener's nextOffset to be greater than the leader's epochStartOffset before calling handleLeaderChange.
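A hedged sketch of that gate (the types and names here are assumed for illustration):

```java
// Hypothetical sketch: only notify the listener of leadership once it has
// caught up past the epoch start offset, for empty and non-empty logs alike.
interface ListenerContext {
    long nextOffset();
    void fireHandleLeaderChange(int leaderId, int epoch);
}

final class LeaderChangeNotifier {
    private final int leaderId;
    private final int epoch;
    private final long epochStartOffset;

    LeaderChangeNotifier(int leaderId, int epoch, long epochStartOffset) {
        this.leaderId = leaderId;
        this.epoch = epoch;
        this.epochStartOffset = epochStartOffset;
    }

    void maybeFireLeaderChange(ListenerContext listener) {
        // Waiting for nextOffset > epochStartOffset guarantees the listener
        // has read every record committed in previous epochs.
        if (listener.nextOffset() > epochStartOffset) {
            listener.fireHandleLeaderChange(leaderId, epoch);
        }
    }
}
```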
The RecordsBatchReader implementation is also changed to include control records. This makes it possible for the state machine to learn about committed control records. This additional information can be used to compute the committed offset or for counting those bytes when determining when to snapshot the partition.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Jason Gustafson <jason@confluent.io>
The KRaft client uses an expiration service to complete FETCH requests that have timed out. This expiration service uses a different thread from the KRaft polling thread. This means that it is unsafe for the expiration service thread to call tryCompleteFetchRequest, which reads and updates a lot of state that is assumed to only be read and updated from the polling thread.
The KRaft client now does not call tryCompleteFetchRequest when the FETCH request has expired. It instead will send the FETCH response that was computed when the FETCH request was first handled.
This change also fixes a bug where the KRaft client was not sending the FETCH response immediately, if the response contained a diverging epoch or snapshot id.
Reviewers: Jason Gustafson <jason@confluent.io>
On ext4 file systems we have seen snapshots with zero-length files. This is possible if
the file is closed and moved before forcing the channel to write to disk.
Reviewers: Ron Dagostino <rndgstn@gmail.com>, Alok Thatikunta <athatikunta@confluent.io>
Provide the exact record offset to QuorumController.replay() in all cases. There are several situations
where this is useful, such as logging, implementing metadata transactions, or handling broker
registration records.
In the case where the QC is inactive, and simply replaying records, it is easy to compute the exact
record offset from the batch base offset and the record index.
The active QC case is more difficult. Technically, when we submit records to the Raft layer, it can
choose a batch base offset later than the one we expect, if someone else is also adding records.
While the QC is the only entity submitting data records, control records may be added at any time.
In the current implementation, these are really only used for leadership elections. However, this
could change with the addition of quorum reconfiguration or similar features.
Therefore, this PR allows the QC to tell the Raft layer that a record append should fail if it
would have resulted in a batch base offset other than what was expected. This in turn will trigger a
controller failover. In the future, if automatically added control records become more common, we
may wish to have a more sophisticated system than this simple optimistic concurrency mechanism. But
for now, this will allow us to rely on the offset as correct.
In order that the active QC can learn what offset to start writing at, the PR also adds a new
RaftClient#endOffset function.
At the Raft level, this PR adds a new exception, UnexpectedBaseOffsetException. This gets thrown
when we request a base offset that doesn't match the one the Raft layer would have given us.
Although this exception should cause a failover, it should not be considered a fault. This
complicated the exception handling a bit and motivated splitting more of it out into the new
EventHandlerExceptionInfo class. This will also let us unit test things like slf4j log messages a
bit better.
Reviewers: David Arthur <mumrah@gmail.com>, José Armando García Sancio <jsancio@apache.org>
Implement some of the metrics from KIP-938: Add more metrics for
measuring KRaft performance.
Add these metrics to QuorumControllerMetrics:
kafka.controller:type=KafkaController,name=TimedOutBrokerHeartbeatCount
kafka.controller:type=KafkaController,name=EventQueueOperationsStartedCount
kafka.controller:type=KafkaController,name=EventQueueOperationsTimedOutCount
kafka.controller:type=KafkaController,name=NewActiveControllersCount
Create LoaderMetrics with these new metrics:
kafka.server:type=MetadataLoader,name=CurrentMetadataVersion
kafka.server:type=MetadataLoader,name=HandleLoadSnapshotCount
Create SnapshotEmitterMetrics with these new metrics:
kafka.server:type=SnapshotEmitter,name=LatestSnapshotGeneratedBytes
kafka.server:type=SnapshotEmitter,name=LatestSnapshotGeneratedAgeMs
Reviewers: Ron Dagostino <rndgstn@gmail.com>
If the follower has an empty log, fetches with offset 0, it is more
efficient for the leader to reply with a snapshot id (redirect to
FETCH_SNAPSHOT) than for the follower to continue fetching from the log
segments.
Reviewers: David Arthur <mumrah@gmail.com>, dengziming <dengziming1993@gmail.com>
If the KRaft listener is at offset 0, the start of the log, and KRaft has generated a snapshot, it should prefer the latest snapshot instead of having the listener read from the start of the log.
This is implemented by having KafkaRaftClient send a Listener.handleLoadSnapshot event, if the Listener is at offset 0 and the KRaft partition has generated a snapshot.
Reviewers: Jason Gustafson <jason@confluent.io>, David Arthur <mumrah@gmail.com>
This change refactors lastContainedLogTimestamp into the Snapshots class for reusability. It introduces IdentitySerde, based on ByteBuffer, required for using RecordsSnapshotReader. This change also removes the "recordSerde: RecordSerde[_]" argument from the KafkaMetadataLog constructor.
Reviewers: José Armando García Sancio <jsancio@apache.org>
After this change,
For broker-side decompression: the JMH benchmark RecordBatchIterationBenchmark demonstrates a 20-70% improvement in throughput (see results for RecordBatchIterationBenchmark.measureSkipIteratorForVariableBatchSize).
For consumer-side decompression: the JMH benchmark RecordBatchIterationBenchmark shows a mixed bag, from single-digit regressions for some compression types to 10-50% improvements for Zstd (see results for RecordBatchIterationBenchmark.measureStreamingIteratorForVariableBatchSize).
Reviewers: Luke Chen <showuon@gmail.com>, Manyanda Chitimbo <manyanda.chitimbo@gmail.com>, Ismael Juma <mail@ismaeljuma.com>
Rename handleSnapshot to handleLoadSnapshot to make it explicit that it is handling snapshot load,
not generation.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Jason Gustafson <jason@confluent.io>
Currently, the current-state KRaft metric reports a broker as being in the follower state, while technically it should be reported as an observer, as the kafka-metadata-quorum tool does.
Reviewers: Luke Chen <showuon@gmail.com>, dengziming <dengziming1993@gmail.com>
The SnapshotReader exposes the "last contained log time". This is mainly used during snapshot cleanup. The previous implementation used the append time of the snapshot record. This is not accurate as this is the time when the snapshot was created and not the log append time of the last record included in the snapshot.
The log append time of the last record included in the snapshot is stored in the header control record of the snapshot. The header control record is the first record of the snapshot.
To be able to read this record, this change extends the RecordsIterator to decode and expose the control records in the Records type.
Reviewers: Colin Patrick McCabe <cmccabe@apache.org>
To help debug KRaft's behavior this change increases the log level of
some rare messages to INFO level.
Reviewers: Jason Gustafson <jason@confluent.io>
This patch is the first part of KIP-903. It updates the FetchRequest to include the new tagged ReplicaState field which replaces the now deprecated ReplicaId field. The FetchRequest version is bumped to version 15 and the MetadataVersion to 3.5-IV1.
Reviewers: David Jacot <djacot@confluent.io>
The leader only requires that voters have flushed their log up to the fetch offset before sending a fetch request.
This change only flushes the log when handling the fetch response, if the follower is a voter. This should improve the disk performance on observers (brokers).
Reviewers: Jason Gustafson <jason@confluent.io>
Remove clusterId field from the KRaft controller's quorum-state file $LOG_DIR/__cluster_metadata-0/quorum-state
Reviewers: Luke Chen <showuon@gmail.com>, dengziming <dengziming1993@gmail.com>, Christo Lolov <christololov@gmail.com>
The raft idle ratio is currently computed as the average of all recorded poll durations. This tends to underestimate the actual idle ratio since it treats all measurements equally, regardless of how much time was spent. For example, say we poll twice with the following durations:
Poll 1: 2s
Poll 2: 0s
Assume that the busy time is negligible, so 2s passes overall.
In the first measurement, 2s is spent waiting, so we compute and record a ratio of 1.0. In the second measurement, no time passes, and we record 0.0. The idle ratio is then computed as the average of these two values ((1.0 + 0.0) / 2 = 0.5), which suggests that the process was busy for 1s, overestimating the true busy time.
In this patch, we create a new `TimeRatio` class which tracks the total duration of a periodic event over a full interval of time measurement.
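A hedged sketch of the idea behind `TimeRatio` (the real class's API may differ): accumulate the event's duration and divide by the elapsed time of the whole interval, so the 2s/0s example above correctly yields an idle ratio of 1.0 rather than 0.5.

```java
// Illustrative TimeRatio: measures what fraction of a measurement interval
// was spent inside a periodic event (e.g. idle time inside poll()).
final class TimeRatio {
    private long intervalStartMs = -1;
    private long accumulatedDurationMs = 0;

    // Record one occurrence of the event, e.g. one poll's idle duration.
    void record(long durationMs, long nowMs) {
        if (intervalStartMs < 0) intervalStartMs = nowMs;
        accumulatedDurationMs += durationMs;
    }

    // Finish the interval and return duration/elapsed; resets for the next one.
    double measure(long nowMs) {
        long elapsedMs = Math.max(1, nowMs - intervalStartMs);
        double ratio = Math.min(1.0, (double) accumulatedDurationMs / elapsedMs);
        intervalStartMs = nowMs;
        accumulatedDurationMs = 0;
        return ratio;
    }
}
```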
Reviewers: José Armando García Sancio <jsancio@apache.org>
Make LeaderState's grantingVoters field explicitly immutable. The set of voters that granted their vote to the current leader was already immutable. This change makes that explicit.
Reviewers: Jason Gustafson <jason@confluent.io>, Mathew Hogan <mathewdhogan@users.noreply.github.com>
The KRaft client expects the offset of the snapshot id to be an end offset. End offsets are
exclusive. The MetadataProvenance type was creating a snapshot id using the last contained offset,
which is inclusive. This change fixes that and renames some of the fields to make this difference
more obvious.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Let `RaftClient.createSnapshot` take the snapshotId directly instead of the committed offset/epoch (which may not exist).
Reviewers: José Armando García Sancio <jsancio@apache.org>
While debugging KRaft and the metadata state machines it is helpful to always log the first time the replica discovers the high watermark. All other updates to the high watermark are logged at trace because they are more frequent and less useful.
Reviewers: Luke Chen <showuon@gmail.com>
Extract jointly owned parts of BrokerServer and ControllerServer into SharedServer. Shut down
SharedServer when the last component using it shuts down. But make sure to stop the raft manager
before closing the ControllerServer's sockets.
This PR also fixes a memory leak where ReplicaManager was not removing some topic metric callbacks
during shutdown. Finally, we now release memory from the BatchMemoryPool in KafkaRaftClient#close.
These changes should reduce memory consumption while running junit tests.
Reviewers: Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Implement time based snapshot for the controller. The general strategy for this feature is that the controller will use the record-batch's append time to determine if a snapshot should be generated. If the oldest record that has been committed but is not included in the latest snapshot is older than `metadata.log.max.snapshot.interval.ms`, the controller will trigger a snapshot immediately. This is useful in case the controller was offline for more than `metadata.log.max.snapshot.interval.ms` milliseconds.
If the oldest record that has been committed but is not included in the latest snapshot is NOT older than `metadata.log.max.snapshot.interval.ms`, the controller will schedule a `maybeGenerateSnapshot` deferred task.
It is possible that when the controller wants to generate a new snapshot, either because of time or number of bytes, the controller is currently generating a snapshot. In this case the `SnapshotGeneratorManager` was changed so that it checks and potentially triggers another snapshot when the currently in-progress snapshot finishes.
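A simplified sketch of that time-based trigger (the names and method shape are assumptions, not the controller's actual API):

```java
import java.util.OptionalLong;

// Hypothetical sketch of the time-based snapshot decision described above.
final class SnapshotScheduler {
    private final long maxSnapshotIntervalMs;

    SnapshotScheduler(long maxSnapshotIntervalMs) {
        this.maxSnapshotIntervalMs = maxSnapshotIntervalMs;
    }

    // oldestNonSnapshottedTimeMs: append time of the oldest committed record
    // not yet included in the latest snapshot (empty if there is none).
    // Returns 0 to snapshot now, otherwise the delay for a deferred check.
    OptionalLong maybeSnapshotDelayMs(OptionalLong oldestNonSnapshottedTimeMs, long nowMs) {
        if (!oldestNonSnapshottedTimeMs.isPresent()) {
            return OptionalLong.empty(); // nothing new to snapshot
        }
        long ageMs = nowMs - oldestNonSnapshottedTimeMs.getAsLong();
        if (ageMs >= maxSnapshotIntervalMs) {
            return OptionalLong.of(0); // snapshot immediately
        }
        return OptionalLong.of(maxSnapshotIntervalMs - ageMs); // defer the check
    }
}
```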
To better support this feature the following additional changes were made:
1. The configuration `metadata.log.max.snapshot.interval.ms` was added to `KafkaConfig` with a default value of one hour.
2. `RaftClient` was extended to return the latest snapshot id. This snapshot id is used to determine if a given record is included in a snapshot.
3. Improve the `SnapshotReason` type to support the inclusion of values in the message.
Reviewers: Jason Gustafson <jason@confluent.io>, Niket Goel <niket-goel@users.noreply.github.com>
`RecordsIteratorTest` takes the longest times in recent builds (even including integration tests). The default of 1000 tries from jqwik is probably overkill and causes the test to take 10 minutes locally. Decreasing to 50 tries reduces that to less than 30s.
Reviewers: David Jacot <djacot@confluent.io>
This commit adds KRaft monitoring related metrics to the Kafka docs (docs/ops.html).
Reviewers: Jason Gustafson <jason@confluent.io>, Luke Chen <showuon@gmail.com>
What changes in this PR:
1. Use addExact to avoid overflow in BatchAccumulator#bytesNeeded. We used addExact in the bytesNeededForRecords method, but forgot to use it when returning the result (illustrated below).
2. Javadoc improvements
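For illustration, the difference between plain `+` and `Math.addExact` on overflow:

```java
// Illustrative only: int addition wraps silently, addExact fails loudly.
public class AddExactDemo {
    public static void main(String[] args) {
        int bytes = Integer.MAX_VALUE;
        System.out.println(bytes + 1);               // -2147483648: silent overflow
        System.out.println(Math.addExact(bytes, 1)); // throws ArithmeticException
    }
}
```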
Reviewers: Jason Gustafson <jason@confluent.io>
When deserializing KRPC (which is used for RPCs sent to Kafka, Kafka Metadata records, and some
other things), check that we have at least N bytes remaining before allocating an array of size N.
Remove DataInputStreamReadable since it was hard to make this class aware of how many bytes were
remaining. Instead, when reading an individual record in the Raft layer, simply create a
ByteBufferAccessor with a ByteBuffer containing just the bytes we're interested in.
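A minimal sketch of the guard this describes (the method and error message are illustrative):

```java
import java.nio.ByteBuffer;

// Illustrative: refuse to allocate an N-byte array unless N bytes are
// actually available, so a corrupt length field can't exhaust the heap.
final class SafeArrayReader {
    static byte[] readArray(ByteBuffer buf, int size) {
        int remaining = buf.remaining();
        if (size > remaining) {
            throw new RuntimeException("Error reading byte array of " + size
                + " byte(s): only " + remaining + " byte(s) available");
        }
        byte[] array = new byte[size];
        buf.get(array);
        return array;
    }
}
```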
Add SimpleArraysMessageTest and ByteBufferAccessorTest. Also add some additional tests in
RequestResponseTest.
Reviewers: Tom Bentley <tbentley@redhat.com>, Mickael Maison <mickael.maison@gmail.com>, Colin McCabe <colin@cmccabe.xyz>
Co-authored-by: Colin McCabe <colin@cmccabe.xyz>
Co-authored-by: Manikumar Reddy <manikumar.reddy@gmail.com>
Co-authored-by: Mickael Maison <mickael.maison@gmail.com>
We should prevent the metadata log from initializing in a known bad state. If the log start offset of the first segment is greater than 0, then there must be a snapshot at an offset greater than or equal to it, in order to ensure that the initialized state is complete.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
A snapshot is taken for either of the following reasons:
- Max bytes were applied
- Metadata version was changed
Once the snapshot process is started, it will log the reason that initiated the process.
Updated existing tests to include the code changes required to log the reason. I was not able to check the logs when running tests - could someone guide me on how to enable logs when running a specific test case?
Reviewers: dengziming <dengziming1993@gmail.com>, José Armando García Sancio <jsancio@apache.org>
Because the snapshot writer sets a linger ms of Integer.MAX_VALUE, it is
possible for the memory pool to run out of memory if the snapshot is
greater than 5 * 8MB.
This change allows the BatchMemoryPool to always allocate a buffer when
requested. The memory pool frees the extra allocated buffer when released if
the number of pooled buffers is greater than the configured maximum
batches.
Reviewers: Jason Gustafson <jason@confluent.io>
The bootstrap.checkpoint files should include a control record batch for
the SnapshotHeaderRecord at the start of the file. It should also
include a control record batch for the SnapshotFooterRecord at the end
of the file.
The snapshot header record is important because it versions the rest of
the bootstrap file.
Reviewers: David Arthur <mumrah@gmail.com>
A few small cleanups in the `DescribeQuorum` API and handling logic:
- Change field types in `QuorumInfo`:
- `leaderId`: `Integer` -> `int`
- `leaderEpoch`: `Integer` -> `long` (to allow for type expansion in the future)
- `highWatermark`: `Long` -> `long`
- Use field names `lastFetchTimestamp` and `lastCaughtUpTimestamp` consistently
- Move construction of `DescribeQuorumResponseData.PartitionData` into `LeaderState`
- Consolidate fetch time/offset update logic into `LeaderState.ReplicaState.updateFollowerState`
Reviewers: Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>
Currently the server will return `INVALID_REQUEST` if a `DescribeQuorum` request is sent to a node that is not the current leader. In addition to being inconsistent with all of the other leader APIs in the raft layer, this error is treated as fatal by both the forwarding manager and the admin client. Instead, we should return `NOT_LEADER_OR_FOLLOWER` as we do with the other APIs. This error is retriable and we can rely on the admin client to retry it after seeing this error.
Reviewers: David Jacot <djacot@confluent.io>
The cause of KAFKA-13959 is a little complex; the two keys to this problem are:
KafkaRaftClient.MAX_FETCH_WAIT_MS == MetadataMaxIdleIntervalMs == 500ms. We rely on the fetchPurgatory to complete a FetchRequest; in detail, if FetchRequest.fetchOffset >= log.endOffset, we will wait up to 500ms before sending a FetchResponse. The follower needs to send one more FetchRequest to get the HW.
Here is the event sequence:
1. When starting the leader (active controller), LEO=m+1 (m is the offset of the last record), leader HW=m (because we need more than half of the voters to reach m+1)
2. The follower (standby controller) and observer (broker) send FetchRequest(fetchOffset=m)
2.1. The leader receives the FetchRequest, sets leader HW=m and waits 500ms before sending a FetchResponse
2.2. The leader sends FetchResponse(HW=m)
2.3. The broker receives FetchResponse(HW=m) and sets metadataOffset=m
3. The leader appends NoOpRecord: LEO=m+2, leader HW=m
4. Steps 2-3 repeat in a loop
If we change MAX_FETCH_WAIT_MS to 200ms (less than half of MetadataMaxIdleIntervalMs), this problem can be solved temporarily.
We plan to improve this problem in two ways: firstly, in this PR, we change the controller to unfence a broker when the broker's high-watermark has reached the broker registration record for that broker; secondly, we will propagate the HWM to the replicas as quickly as possible in KAFKA-14145.
Reviewers: Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>
This commit adds a check to ensure the RecordBatch CRC is valid when
iterating over a Batch of Records using the RecordsIterator. The
RecordsIterator is used by both Snapshot reads and Log Records reads in
KRaft. The check can be turned off by a class parameter and is on by default.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
Fixes two issues in the implementation of `LocalLogManager`:
- As per the interface contract for `RaftClient.scheduleAtomicAppend()`, it should throw a `NotLeaderException` when the provided current leader epoch does not match the current epoch. However, the current `LocalLogManager` implementation of the API returns `Long.MAX_VALUE` instead of throwing an exception. This change fixes the behaviour and makes it consistent with the interface contract.
- As per the interface contract for `RaftClient.resign(epoch)`, if the parameter epoch does not match the current epoch, this call will be ignored. But in the current `LocalLogManager` implementation the leader epoch might change while the thread is waiting to acquire a lock on `shared.tryAppend()` (note that tryAppend() is a synchronized method). In such a case, if a `NotLeaderException` is thrown (as per the code change above), then resign should be ignored.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Tom Bentley <tbentley@redhat.com>, Jason Gustafson <jason@confluent.io>
The kafka-dump-log command should accept files with a suffix of ".checkpoint". It should also decode and print using JSON the snapshot header and footer control records.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
Implement NoOpRecord as described in KIP-835. This is controlled by the new
metadata.max.idle.interval.ms configuration.
The KRaft controller schedules an event to write NoOpRecord to the metadata log if the metadata
version supports this feature. This event is scheduled at the interval defined in
metadata.max.idle.interval.ms. Brokers and controllers were improved to ignore the NoOpRecord when
replaying the metadata log.
This PR also adds four new metrics to the KafkaController metric group, as described in KIP-835.
Finally, there are some small fixes to leader recovery. This PR fixes a bug where metadata version
3.3-IV1 was not marked as changing the metadata. It also changes the ReplicationControlManager to
accept a metadata version supplier to determine if the leader recovery state is supported.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Since the StandardAuthorizer relies on the metadata log to store its ACLs, we need to be sure that
we have the latest metadata before allowing the authorizer to be used. However, if the authorizer
is not usable for controllers in the cluster, the latest metadata cannot be fetched, because
inter-node communication cannot occur. In the initial commit which introduced StandardAuthorizer,
we punted on the loading issue by allowing the authorizer to be used immediately. This commit fixes
that by implementing early.start.listeners as specified in KIP-801. This will allow superusers in
immediately, but throw the new AuthorizerNotReadyException if non-superusers try to use the
authorizer before StandardAuthorizer#completeInitialLoad is called.
For the broker, we call StandardAuthorizer#completeInitialLoad immediately after metadata catch-up
is complete, right before unfencing. For the controller, we call
StandardAuthorizer#completeInitialLoad when the node has caught up to the high water mark of the
cluster metadata partition.
This PR refactors the SocketServer so that it creates the configured acceptors and processors in
its constructor, rather than requiring a call to SocketServer#startup. A new function,
SocketServer#enableRequestProcessing, then starts the threads and begins listening on the
configured ports. enableRequestProcessing uses an async model: we will start the acceptor and
processors associated with an endpoint as soon as that endpoint's authorizer future is completed.
Also fix a bug where the controller and listener were sharing an Authorizer when in co-located
mode, which was not intended.
Reviewers: Jason Gustafson <jason@confluent.io>
This PR includes the changes to feature flags that were outlined in KIP-778. Specifically, it
changes UpdateFeatures and FeatureLevelRecord to remove the maximum version level. It also adds
dry-run to the RPC so the controller can actually attempt the upgrade (rather than the client). It
introduces an upgrade type enum, which supersedes the allowDowngrade boolean. Because
FeatureLevelRecord was unused previously, we do not need to introduce a new version.
The kafka-features.sh tool was overhauled in KIP-778 and now includes the describe, upgrade,
downgrade, and disable sub-commands. Refer to
[KIP-778](https://cwiki.apache.org/confluence/display/KAFKA/KIP-778%3A+KRaft+Upgrades) for more
details on the new command structure.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, dengziming <dengziming1993@gmail.com>
Within a LogSegment, the TimeIndex and OffsetIndex are lazy indices that don't get created on disk until they are accessed for the first time. However, Log recovery logic expects the presence of an offset index file on disk for each segment, otherwise, the segment is considered corrupted.
This PR introduces a forceFlushActiveSegment boolean for the log.flush function to allow the shutdown process to flush the empty active segment, which makes sure the offset index file exists.
Co-authored-by: Kowshik Prakasam <kowshik@gmail.com>
Reviewers: Jason Gustafson <jason@confluent.io>, Jun Rao <junrao@gmail.com>
Make sure that the compression type is passed along to the `RecordsSnapshotWriter` constructor when creating the snapshot writer using the static `createWithHeader` method.
Reviewers: Jason Gustafson <jason@confluent.io>
Change the snapshot API so that SnapshotWriter and SnapshotReader are interfaces. Change the existing types SnapshotWriter and SnapshotReader to use a different name and to implement the interfaces introduced by this commit.
Co-authored-by: loboxu <loboxu@tencent.com>
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
This patch adds additional test cases covering the validations done when snapshots are created by the state machine.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
This PR aims to remove tombstones that persist indefinitely due to low throughput. Previously, deleteHorizon was calculated from the segment's last modified time.
In this PR, the deleteHorizon will now be tracked in the baseTimestamp of RecordBatches. After the first cleaning pass that finds a record batch with tombstones, the record batch is recopied with deleteHorizon flag and a new baseTimestamp that is the deleteHorizonMs. The records in the batch are rebuilt with relative timestamps based on the deleteHorizonMs that is recorded. Later cleaning passes will be able to remove tombstones more accurately on their deleteHorizon due to the individual time tracking on record batches.
KIP 534: https://cwiki.apache.org/confluence/display/KAFKA/KIP-534%3A+Retain+tombstones+and+transaction+markers+for+approximately+delete.retention.ms+milliseconds
Co-authored-by: Ted Yu <yuzhihong@gmail.com>
Co-authored-by: Richard Yu <yohan.richard.yu@gmail.com>
Java 17 is at release candidate stage and it will be a LTS release once
it's out (previous LTS release was Java 11).
Details:
* Replace Java 16 with Java 17 in Jenkins and Readme.
* Replace `--illegal-access=permit` (which was removed from Java 17)
with `--add-opens` for the packages we require internal access to.
Filed KAFKA-13275 for updating the tests not to require `--add-opens`
(where possible).
* Update `release.py` to use JDK 8 and JDK 17 (instead of JDK 8 and JDK 15).
* Removed all but one Streams test from `testsToExclude`. The
Connect test exclusion list remains the same.
* Add notable change to upgrade.html
* Upgrade to Gradle 7.2 as it's required for proper Java 17 support.
* Upgrade mockito to 3.12.4 for better Java 17 support.
* Adjusted `KafkaRaftClientTest` and `QuorumStateTest` not to require
private access to `jdk.internal.util.random`.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
This patch improves the return type for `scheduleAppend` and `scheduleAtomicAppend`. Previously we were using a `Long` value and using both `null` and `Long.MaxValue` to distinguish between different error cases. In this PR, we change the return type to `long` and only return a value if the append was accepted. For the error cases, we instead throw an exception. For this purpose, the patch introduces a couple new exception types: `BufferAllocationException` and `NotLeaderException`.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
Instead of waiting for a high-watermark of 20 after the partition, the
test should wait for the high-watermark to reach an offset greater than
the largest log end offset at the time of the partition. Only that offset
is guaranteed to be reached as the high-watermark by the new majority.
Reviewers: Jason Gustafson <jason@confluent.io>
This patch adds support for unregistering listeners to `RaftClient`.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Jason Gustafson <jason@confluent.io>
When the active controller encounters an event exception it attempts to renounce leadership.
Unfortunately, this doesn't tell the RaftClient that it should attempt to give up leadership. This
will result in inconsistent state with the RaftClient as leader but with the controller as
inactive. This PR changes the implementation so that the active controller asks the RaftClient
to resign.
Reviewers: Jose Sancio <jsancio@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
The leader assumes that there is always an in-memory snapshot at the last
committed offset. This means that the controller needs to generate an in-memory
snapshot when getting promoted from inactive to active. This PR adds that
code. This fixes a bug where sometimes we would try to look for that in-memory
snapshot and not find it.
The controller always starts inactive, and there is no requirement that there
exists an in-memory snapshot at the last committed offset when the controller
is inactive. Therefore we can remove the initial snapshot at offset -1.
We should also optimize when a snapshot is cancelled or completes, by deleting
all in-memory snapshots less than the last committed offset.
SnapshotRegistry's createSnapshot should allow creating a snapshot if
the last snapshot's offset is the given offset. This allows for simpler client
code. Finally, this PR renames createSnapshot to getOrCreateSnapshot.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Check and verify generated snapshots for the controllers and the
brokers. Assert reader state when reading last log append time.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Fix a simulation test failure by:
1. Relaxing the validation of the snapshot id against the log start
offset when the state machine attempts to create a new snapshot. It
is safe to just ignore the request instead of throwing an exception
when the snapshot id is less than the log start offset.
2. Fixing the MockLog implementation so that it uses startOffset both
externally and internally.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Updated FetchRequest and FetchResponse to use topic IDs rather than topic names.
Some of the complicated code is found in FetchSession and FetchSessionHandler.
We need to be able to store topic IDs and maintain a cache on the broker for IDs that may not have been resolved. On incremental fetch requests, we will try to resolve them or remove them if in toForget.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Chia-Ping Tsai <chia7712@gmail.com>, Jun Rao <junrao@gmail.com>
This PR includes changes to KafkaRaftClient and KafkaMetadataLog to support periodic
cleaning of old log segments and snapshots.
Four new public config keys are introduced: metadata.log.segment.bytes,
metadata.log.segment.ms, metadata.max.retention.bytes, and
metadata.max.retention.ms.
These are used to configure the log layer as well as the snapshot cleaning logic. Snapshot
and log cleaning is performed based on two criteria: total metadata log + snapshot size
(metadata.max.retention.bytes), and max age of a snapshot (metadata.max.retention.ms).
Since we have a requirement that the log start offset must always align with a snapshot,
we perform the cleaning on snapshots first and then clean what logs we can.
The cleaning algorithm follows:
1. Delete the oldest snapshot.
2. Advance the log start offset to the new oldest snapshot.
3. Request that the log layer clean any segments prior to the new log start offset.
4. Repeat this until the retention size or time is no longer violated, or only a single
snapshot remains.
The cleaning process is triggered every 60 seconds from the KafkaRaftClient polling
thread.
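A hypothetical sketch of that cleaning loop (the helpers are stubs standing in for the real log and snapshot APIs):

```java
// Hedged sketch of the cleaning algorithm described above (names assumed).
final class SnapshotAndLogCleaner {
    private final java.util.NavigableSet<Long> snapshotEndOffsets = new java.util.TreeSet<>();

    void clean() {
        // Keep at least one snapshot; the log start offset must stay aligned
        // with a snapshot at all times.
        while (snapshotEndOffsets.size() > 1 && retentionViolated()) {
            snapshotEndOffsets.pollFirst();        // 1. delete the oldest snapshot
            long newStartOffset = snapshotEndOffsets.first();
            advanceLogStartOffset(newStartOffset); // 2. advance the log start offset
            deleteSegmentsBefore(newStartOffset);  // 3. let the log layer clean segments
        }                                          // 4. repeat while retention is violated
    }

    private boolean retentionViolated() { /* total size or snapshot age exceeded */ return false; }
    private void advanceLogStartOffset(long offset) { /* log layer */ }
    private void deleteSegmentsBefore(long offset) { /* log layer */ }
}
```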
Reviewers: José Armando García Sancio <jsancio@gmail.com>, dengziming <dengziming1993@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
Track handleSnapshot calls and make sure it is never triggered on the leader node.
Reviewers: Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>, Boyang Chen <bchen11@outlook.com>
Add the record append time to Batch. Change SnapshotReader to set this time to the
time of the last log in the last batch. Fix the QuorumController to remember the last
committed batch append time and to store it in the generated snapshot.
Reviewers: David Arthur <mumrah@gmail.com>, Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
Add the ability for KRaft controllers to generate snapshots based on the number of new record bytes that have
been applied since the last snapshot. Add a new configuration key to control this parameter. For now, it
defaults to being off, although we will change that in a follow-on PR. Also, fix LocalLogManager so that
snapshot loading is only triggered when the listener is not the leader.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Add header and footer records for raft snapshots. This helps identify when the snapshot
starts and ends. The header also contains a time. The time field is currently set to 0.
KAFKA-12997 will add in the necessary wiring to use the correct timestamp.
Reviewers: Jose Sancio <jsancio@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
This patch adds an implementation of the `resign()` API which allows the controller to proactively resign leadership in case it encounters an unrecoverable situation. There was not a lot to do here because we already supported a `Resigned` state to facilitate graceful shutdown.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, David Arthur <mumrah@gmail.com>
We should process the entire batch in `BrokerMetadataListener` and make sure that `hasNext` is called before calling `next` on the iterator. The previous code worked because the raft client kept track of the position in the iterator, but it caused NoSuchElementException to be raised when the reader was empty (as might be the case with control records).
Reviewers: Jason Gustafson <jason@confluent.io>
This patch fixes a few minor javadoc issues in the `RaftClient` interface.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, David Jacot <djacot@confluent.io>
Directly use `RaftClient.Listener`, `SnapshotWriter` and `SnapshotReader` in the quorum controller.
1. Allow `RaftClient` users to create snapshots by specifying the last committed offset and last committed epoch. These values are validated against the log and leader epoch cache.
2. Remove duplicate classes in the metadata module for writing and reading snapshots.
3. Changed the logic for comparing snapshots. The old logic was assuming a certain batch grouping. This didn't match the implementation of the snapshot writer. The snapshot writer is free to merge batches before writing them.
4. Improve `LocalLogManager` to keep track of multiple snapshots.
5. Improve the documentation and API for the snapshot classes to highlight the distinction between the offset of batches in the snapshot vs the offset of batches in the log. These two offsets are independent of one another. `SnapshotWriter` and `SnapshotReader` expose a method called `lastOffsetFromLog` which represents the last inclusive offset from the log that is represented in the snapshot.
Reviewers: dengziming <swzmdeng@163.com>, Jason Gustafson <jason@confluent.io>
The raft module may not be fully consistent on this, but in general in that module we have decided not to throw the checked IOException. We have been avoiding checked IOException by wrapping it in RuntimeException. The raft module should instead wrap IOException in UncheckedIOException.
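For example, something along these lines (the surrounding method is illustrative):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;

// Illustrative: surface IOException as UncheckedIOException rather than a
// bare RuntimeException, preserving the IO-failure type for callers.
final class FlushHelper {
    static void flush(FileChannel channel) {
        try {
            channel.force(true);
        } catch (IOException e) {
            throw new UncheckedIOException("Error flushing channel", e);
        }
    }
}
```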
Reviewers: Luke Chen <showuon@gmail.com>, David Arthur <mumrah@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
The command `./gradlew raft:integrationTest` can't run any integration test since `org.junit.jupiter.api.Tag` does not work for jqwik engine (see https://github.com/jlink/jqwik/issues/36#issuecomment-436535760).
Reviewers: Ismael Juma <ismael@juma.me.uk>
This patch removes the temporary shim layer we added to bridge the interface
differences between MetaLogManager and RaftClient. Instead, we now use the
RaftClient directly from the metadata module. This also means that the
metadata gradle module now depends on raft, rather than the other way around.
Finally, this PR also consolidates the handleResign and handleNewLeader APIs
into a single handleLeaderChange API.
Co-authored-by: Jason Gustafson <jason@confluent.io>
Kafka's networking layer doesn't close FileRecords and assumes that they are already open when sending them over a channel. To support this pattern, this commit changes the ownership model for FileRawSnapshotReader so that they are owned by KafkaMetadataLog.
Reviewers: dengziming <swzmdeng@163.com>, David Arthur <mumrah@gmail.com>, Jun Rao <junrao@gmail.com>
Added server-common module to have server side common classes. Moved ApiMessageAndVersion, RecordSerde, AbstractApiMessageSerde, and BytesApiMessageSerde to server-common module.
Reviewers: Kowshik Prakasam <kprakasam@confluent.io>, Jun Rao <junrao@gmail.com>
KAFKA-12429: Added serdes for the default implementation of RLMM based on an internal topic as storage. This topic will receive events of RemoteLogSegmentMetadata, RemoteLogSegmentUpdate, and RemotePartitionDeleteMetadata. These events are serialized into Kafka protocol message format.
Added tests for all the event types for that topic.
This is part of the tiered storage implementation KIP-405.
Reviewers: Kowshik Prakasam <kprakasam@confluent.io>, Jun Rao <junrao@gmail.com>
Implement Raft Snapshot loading API.
1. Adds a new method `handleSnapshot` to `raft.Listener` which is called whenever the `RaftClient` determines that the `Listener` needs to load a new snapshot before reading the log. This happens when the `Listener`'s next offset is less than the log start offset also known as the earliest snapshot.
2. Adds a new type `SnapshotReader<T>` which provides a `Iterator<Batch<T>>` interface and de-serializes records in the `RawSnapshotReader` into `T`s
3. Adds a new type `RecordsIterator<T>` that implements an `Iterator<Batch<T>>` by scanning a `Records` object and deserializes the batches and records into `Batch<T>`. This type is used by both `SnapshotReader<T>` and `RecordsBatchReader<T>` internally to implement the `Iterator` interface that they expose.
4. Changes the `MockLog` implementation to read one or two batches at a time. The previous implementation always read from the given offset to the high-watermark. This made it impossible to test interesting snapshot loading scenarios.
5. Removed `throws IOException` from some methods. Some of the types were inconsistently throwing `IOException` in some cases and throwing `RuntimeException(..., new IOException(...))` in others. This PR improves the consistency by wrapping `IOException` in `RuntimeException` in a few more places and replacing `Closeable` with `AutoCloseable`.
6. Updated the Kafka Raft simulation test to take into account snapshot. `ReplicatedCounter` was updated to generate snapshot after 10 records get committed. This means that the `ConsistentCommittedData` validation was extended to take snapshots into account. Also added a new invariant to ensure that the log start offset is consistently set with the earliest snapshot.
Reviewers: dengziming <swzmdeng@163.com>, David Arthur <mumrah@gmail.com>, Jason Gustafson <jason@confluent.io>
The KafkaRaftClient has a field for the BatchAccumulator that is only used and set when it is the leader. In other cases, leader-specific information was stored in LeaderState. In a recent change, EpochState, which LeaderState implements, was changed to be a Closeable. QuorumState makes sure to always close the previous state before transitioning to the next state. This redesign was used to move the BatchAccumulator to the LeaderState and simplify some of the handling in KafkaRaftClient.
Reviewers: José Armando García Sancio <jsancio@gmail.com>, Jason Gustafson <jason@confluent.io>
Adding a property to the `raft/config/kraft.properties` for running the raft
test server in development.
For testing I ran `./bin/test-kraft-server-start.sh --config config/kraft.properties`
and validated the test server started running with a throughput test.
Reviewers: Ismael Juma <ismael@juma.me.uk>
KIP-595 describes an extra condition on commitment here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Fetch. In order to ensure that a newly elected leader's committed entries cannot get lost, it must commit one record from its own epoch. This guarantees that its latest entry is larger (in terms of epoch/offset) than any previously written record which ensures that any future leader must also include it. This is the purpose of the `LeaderChange` record which is written to the log as soon as the leader gets elected.
Although we had this check implemented, it was off by one. We only ensured that replication reached the epoch start offset, which does not reflect the appended `LeaderChange` record. This patch fixes the check and clarifies the point of the check. The rest of the patch is just fixing up test cases.
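In sketch form, the corrected condition (field names assumed for illustration):

```java
// Hypothetical sketch: the LeaderChange record sits at the epoch start offset,
// so commitment requires the high watermark to move strictly past that offset.
final class EpochStartCheck {
    static boolean epochStartCommitted(long highWatermark, long epochStartOffset) {
        // '>' rather than '>=': merely reaching epochStartOffset does not cover
        // the LeaderChange record appended in the leader's own epoch.
        return highWatermark > epochStartOffset;
    }
}
```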
Reviewers: dengziming <swzmdeng@163.com>, Guozhang Wang <wangguoz@gmail.com>
KIP-516 introduces topic IDs to topics, but there is a small issue with how the KIP-500 metadata topic will interact with topic IDs.
For example, https://github.com/apache/kafka/pull/9944 aims to replace topic names in the Fetch request with topic IDs. In order to get these IDs, brokers must fetch from the metadata topic. This leads to a sort of "chicken and the egg" problem concerning how we find out the metadata topic's topic ID.
This PR adds a special sentinel topic ID for the metadata topic, which gets around this problem.
More information can be found in the [JIRA](https://issues.apache.org/jira/browse/KAFKA-12457) and in [KIP-516](https://cwiki.apache.org/confluence/display/KAFKA/KIP-516%3A+Topic+Identifiers).
Reviewers: Jason Gustafson <jason@confluent.io>
1. Add `canGrantVote` to `EpochState`
2. Move the if-else in `KafkaRaftClient.handleVoteRequest` to `EpochState`
3. Add unit tests for `canGrantVote` (see the sketch below)
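A sketch of the resulting shape, with signatures assumed rather than copied from the patch:
```java
// Sketch only: each epoch state answers the vote question itself, replacing the
// if-else chain that used to live in KafkaRaftClient.handleVoteRequest.
interface EpochState extends AutoCloseable {
    boolean canGrantVote(int candidateId, boolean isLogUpToDate);
}

final class FollowerState implements EpochState {
    @Override
    public boolean canGrantVote(int candidateId, boolean isLogUpToDate) {
        // already following an elected leader in this epoch, so reject the vote
        return false;
    }

    @Override
    public void close() {}
}
```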
Reviewers: Jason Gustafson <jason@confluent.io>
Kafka does not call fsync() on the directory when a new log segment is created and flushed to disk.
The problem is that the following sequence of calls doesn't guarantee file durability:
```
fd = open("log", O_RDWR | O_CREAT); // suppose open creates "log"
write(fd);
fsync(fd);
```
If the system crashes after fsync() but before the parent directory has been flushed to disk, the log file can disappear.
This PR flushes the directory when flush() is called for the first time.
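On Linux, flushing the parent directory can be done by opening the directory read-only and forcing it like a regular file; a minimal sketch:
```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only: fsync a directory so that a newly created file's directory entry
// survives a crash. Opening a directory for read works on Linux; other platforms
// may reject it.
final class DirFlush {
    static void flushDir(Path dir) throws IOException {
        try (FileChannel channel = FileChannel.open(dir, StandardOpenOption.READ)) {
            channel.force(true);
        }
    }
}
```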
Reviewers: Jun Rao <junrao@gmail.com>
When a `@Property` test fails, jqwik helpfully reports the initial seed that resulted in the failure. For example, if we are executing a test scenario 100 times and it fails on the 51st run, then we will get the initial seed that generated the failure. But if you specify that seed in the `@Property` annotation as the previous comment suggested, the test still needs to run 50 times before we get to the 51st case, which makes debugging very difficult given the complex nature of the simulation tests. Jqwik also gives us the specific argument list that failed, but that is not very helpful at the moment since `Random` does not have a useful `toString` which indicates the initial seed.
To address these problems, I've changed the `@Property` methods to take the random seed as an argument directly so that it is displayed clearly in the output of a failure. I've also updated the documentation to clarify how to reproduce failures.
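A sketch of the new shape (the test name here is made up): the seed is an ordinary generated argument, so jqwik prints it verbatim in the failure report.
```java
import java.util.Random;

import net.jqwik.api.ForAll;
import net.jqwik.api.Property;

class RaftSimulationSketch {
    // Sketch only: the failing seed appears directly in jqwik's report as the
    // argument value, so the failing case can be replayed immediately.
    @Property
    void canElectNewLeader(@ForAll int seed) {
        Random random = new Random(seed);
        // ... drive the simulation using `random` ...
    }
}
```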
Reviewers: David Jacot <djacot@confluent.io>
`Self-managed` is also used in the context of cloud vs. on-prem deployments, so it can
be confusing.
`KRaft` is a cute combination of `Kafka Raft` and it's pronounced like `craft`
(as in `craftsmanship`).
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Jose Sancio <jsancio@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>, Ron Dagostino <rdagostino@confluent.io>
Currently the Raft leader raises an exception if there is a non-monotonic update to the fetch offset of a replica. In a situation where the replica had lost its disk state, this would prevent the replica from being able to recover. In this patch, we relax the validation to address this problem. It is worth pointing out that this validation cannot be relied on to protect against data loss after a voter has lost committed state.
Reviewers: José Armando García Sancio <jsancio@gmail.com>, Boyang Chen <boyang@confluent.io>
Introduce "testkit" package which includes KafkaClusterTestKit class for enabling integration tests of self-managed clusters. Also make use of this new integration test harness in the ClusterTestExtentions JUnit extension.
Adds RaftClusterTest for basic self-managed integration test.
Reviewers: Jason Gustafson <jason@confluent.io>, Colin P. McCabe <cmccabe@apache.org>
Co-authored-by: Colin P. McCabe <cmccabe@apache.org>
Previously we implemented ClusterId validation for the Fetch API in the Raft implementation. This patch adds ClusterId validation to the remaining Raft RPCs.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
Improves test coverage of `validateOffsetAndEpoch`.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
Replace the use of the methods `last` and `first` in `ConcurrentSkipListSet` with the descending and ascending iterators, respectively. The methods `last` and `first` throw an exception when the set is empty, which causes poor `KafkaRaftClient` performance when there aren't any snapshots.
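A sketch of the replacement pattern (the real code operates on snapshot IDs):
```java
import java.util.Iterator;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListSet;

// Sketch only: the descending iterator never throws on an empty set, whereas
// last() raises NoSuchElementException, which was expensive on this hot path.
final class SnapshotLookup {
    static <T> Optional<T> latest(ConcurrentSkipListSet<T> snapshotIds) {
        Iterator<T> it = snapshotIds.descendingIterator();
        return it.hasNext() ? Optional.of(it.next()) : Optional.empty();
    }
}
```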
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
This patch changes the raft simulation tests to use jqwik, which is a property testing library. This provides two main benefits:
- It simplifies the randomization of test parameters. Currently the tests use a fixed set of `Random` seeds, which means that most builds are doing redundant work. We get a bigger benefit from allowing each build to test different parameterizations.
- It makes it easier to reproduce failures. Whenever a test fails, jqwik will report the random seed that failed. A developer can then modify the `@Property` annotation to use that specific seed in order to reproduce the failure (see the sketch below).
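For example, pinning a reported seed looks roughly like this (the seed value here is made up):
```java
import net.jqwik.api.Property;

class ReproduceFailureSketch {
    // Sketch only: jqwik replays the exact failing run when the reported seed
    // is pinned in the annotation.
    @Property(seed = "8232096356251963162")
    void replaysTheFailingScenario() {
        // ... same body as the failing property ...
    }
}
```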
This patch also includes an optimization for `MockLog.earliestSnapshotId` which reduces the time to run the simulation tests dramatically.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>, José Armando García Sancio <jsancio@gmail.com>, David Jacot <djacot@confluent.io>
Initially we want to be strict about the loss of committed data for the `@metadata` topic. This patch ensures that truncation below the high watermark is not allowed. Note that `MockLog` already had the logic to do so, so the patch adds a similar check to `KafkaMetadataLog`.
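The added guard amounts to something like the following (names assumed, not taken from the patch):
```java
// Sketch only: refuse to discard committed data from the metadata log.
final class TruncationGuard {
    private final long highWatermark;

    TruncationGuard(long highWatermark) {
        this.highWatermark = highWatermark;
    }

    void validateTruncation(long offset) {
        if (offset < highWatermark) {
            throw new IllegalArgumentException("Attempt to truncate to " + offset
                + ", which is below the high watermark " + highWatermark);
        }
    }
}
```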
Reviewers: David Jacot <djacot@confluent.io>, Boyang Chen <boyang@confluent.io>
This patch adds logic to delete old snapshots. There are three cases we handle (the shared cleanup rule is sketched after the list):
1. Remove old snapshots after a follower completes fetching a snapshot and truncates the log to the latest snapshot
2. Remove old snapshots after a new snapshot is created.
3. Remove old snapshots during recovery after the node is restarted.
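In all three cases the cleanup rule is the same; a simplified sketch (snapshot layout assumed):
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.TreeMap;

// Sketch only: keep the latest snapshot and delete every snapshot older than it.
final class SnapshotCleanup {
    static void deleteOldSnapshots(TreeMap<Long, Path> snapshotsByEndOffset) throws IOException {
        while (snapshotsByEndOffset.size() > 1) {
            Files.deleteIfExists(snapshotsByEndOffset.pollFirstEntry().getValue());
        }
    }
}
```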
Reviewers: Cao Manh Dat<caomanhdat317@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
This patch ensures that the constant max batch size defined in `KafkaRaftClient` is propagated to the constructed log configuration in `KafkaMetadataLog`. We also ensure that the fetch max size is set consistently with appropriate testing.
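Roughly speaking (property and constant names below are assumptions, not quoted from the patch):
```java
import java.util.Properties;

// Sketch only: derive the log's limit from the raft client's constant rather than
// duplicating a magic number, so a raft batch can never exceed what the log accepts.
final class MetadataLogConfig {
    static Properties forMaxBatchSize(int maxBatchSizeBytes) {
        Properties props = new Properties();
        props.setProperty("max.message.bytes", String.valueOf(maxBatchSizeBytes));
        return props;
    }
}
```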
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>
1. rename INVALID_HIGHWATERMARK to INVALID_HIGH_WATERMARK
2. replace FetchResponse.AbortedTransaction by FetchResponseData.AbortedTransaction
3. remove redundant constructors from FetchResponse.PartitionData
4. rename recordSet to records
5. add helpers "recordsOrFail" and "recordsSize" to FetchResponse to handle record casting
Reviewers: Ismael Juma <ismael@juma.me.uk>
Since we expect KIP-631 controller fail-overs to be fairly cheap, tune
the default raft configuration parameters so that we detect node
failures more quickly.
Reduce the broker session timeout as well so that broker failures are
detected more quickly.
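For illustration, the knobs involved look like this in a server config; the exact values below are assumptions, not quoted from the patch:
```
controller.quorum.election.timeout.ms=1000
controller.quorum.fetch.timeout.ms=2000
broker.session.timeout.ms=9000
```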
Reviewers: Jason Gustafson <jason@confluent.io>, Alok Nikhil <anikhil@confluent.io>
This patch fixes a small shutdown bug. The current logic closes the log twice: once in `KafkaRaftClient`, and once in `RaftManager`. This can lead to errors like the following:
```
[2021-02-18 18:35:12,643] WARN (kafka.utils.CoreUtils$)
java.nio.channels.ClosedChannelException
at java.base/sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:150)
at java.base/sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:452)
at org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:197)
at org.apache.kafka.common.record.FileRecords.close(FileRecords.java:204)
at kafka.log.LogSegment.$anonfun$close$4(LogSegment.scala:592)
at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:68)
at kafka.log.LogSegment.close(LogSegment.scala:592)
at kafka.log.Log.$anonfun$close$4(Log.scala:1038)
at kafka.log.Log.$anonfun$close$4$adapted(Log.scala:1038)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
at scala.collection.AbstractIterable.foreach(Iterable.scala:919)
at kafka.log.Log.$anonfun$close$3(Log.scala:1038)
at kafka.log.Log.close(Log.scala:2433)
at kafka.raft.KafkaMetadataLog.close(KafkaMetadataLog.scala:295)
at kafka.raft.KafkaRaftManager.shutdown(RaftManager.scala:150)
```
I have tended to view `RaftManager` as owning the lifecycle of the log, so I removed the extra call to close in `KafkaRaftClient`.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Ismael Juma <ismael@juma.me.uk>