Commit Graph

281 Commits

Author SHA1 Message Date
Orsák Maroš a0e37b79aa
MINOR: Add test cases to the Raft module (#12692)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
, Divij Vaidya <diviv@amazon.com>, Ismael Juma <mlists@juma.me.uk>
2022-10-28 17:54:34 +02:00
José Armando García Sancio d0ff869718
MINOR; Add accessor methods to OffsetAndEpoch (#12770)
Accessor are preferred over fields because they compose better with Java's
lambda syntax.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-10-19 12:07:07 -07:00
Niket eb8f0bd5e4
MINOR: Adding KRaft Monitoring Related Metrics to docs/ops.html (#12679)
This commit adds KRaft monitoring related metrics to the Kafka docs (docs/ops.html).

Reviewers: Jason Gustafson <jason@confluent.io>, Luke Chen <showuon@gmail.com>
2022-09-26 14:25:36 +08:00
Luke Chen bf7ddf73af
MINOR: use addExact to avoid overflow and some cleanup (#12660)
What changes in this PR:
1. Use addExact to avoid overflow in BatchAccumulator#bytesNeeded. We did use addExact in bytesNeededForRecords method, but forgot that when returning the result.
2. javadoc improvement

Reviewers: Jason Gustafson <jason@confluent.io>
2022-09-22 09:22:58 +08:00
Colin Patrick McCabe b401fdaefb MINOR: Add more validation during KRPC deserialization
When deserializing KRPC (which is used for RPCs sent to Kafka, Kafka Metadata records, and some
    other things), check that we have at least N bytes remaining before allocating an array of size N.

    Remove DataInputStreamReadable since it was hard to make this class aware of how many bytes were
    remaining. Instead, when reading an individual record in the Raft layer, simply create a
    ByteBufferAccessor with a ByteBuffer containing just the bytes we're interested in.

    Add SimpleArraysMessageTest and ByteBufferAccessorTest. Also add some additional tests in
    RequestResponseTest.

    Reviewers: Tom Bentley <tbentley@redhat.com>, Mickael Maison <mickael.maison@gmail.com>, Colin McCabe <colin@cmccabe.xyz>

    Co-authored-by: Colin McCabe <colin@cmccabe.xyz>
    Co-authored-by: Manikumar Reddy <manikumar.reddy@gmail.com>
    Co-authored-by: Mickael Maison <mickael.maison@gmail.com>
2022-09-21 20:58:23 +05:30
Jason Gustafson 8c8b5366a6
KAFKA-14240; Validate kraft snapshot state on startup (#12653)
We should prevent the metadata log from initializing in a known bad state. If the log start offset of the first segment is greater than 0, then must be a snapshot an offset greater than or equal to it order to ensure that the initialized state is complete.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
2022-09-19 11:52:48 -07:00
Ashmeet Lamba 86645cb40a
KAFKA-14073; Log the reason for snapshot (#12414)
When a snapshot is taken it is due to either of the following reasons -

    Max bytes were applied
    Metadata version was changed

Once the snapshot process is started, it will log the reason that initiated the process.

Updated existing tests to include code changes required to log the reason. I was not able to check the logs when running tests - could someone guide me on how to enable logs when running a specific test case.

Reviewers: dengziming <dengziming1993@gmail.com>, José Armando García Sancio <jsancio@apache.org>
2022-09-13 10:03:47 -07:00
José Armando García Sancio c5954175a4
KAFKA-14222; KRaft's memory pool should always allocate a buffer (#12625)
Because the snapshot writer sets a linger ms of Integer.MAX_VALUE it is
possible for the memory pool to run out of memory if the snapshot is
greater than 5 * 8MB.

This change allows the BatchMemoryPool to always allocate a buffer when
requested. The memory pool frees the extra allocated buffer when released if
the number of pooled buffers is greater than the configured maximum
batches.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-09-13 08:04:40 -07:00
José Armando García Sancio f83c6f2da4
KAFKA-14183; Cluster metadata bootstrap file should use header/footer (#12565)
The boostrap.checkpoint files should include a control record batch for
the SnapshotHeaderRecord at the start of the file. It should also
include a control record batch for the SnapshotFooterRecord at the end
of the file.

The snapshot header record is important because it versions the rest of
the bootstrap file.

Reviewers: David Arthur <mumrah@gmail.com>
2022-08-27 19:11:06 -07:00
Jason Gustafson 5c52c61a46
MINOR: A few cleanups for DescribeQuorum APIs (#12548)
A few small cleanups in the `DescribeQuorum` API and handling logic:

- Change field types in `QuorumInfo`:
  - `leaderId`: `Integer` -> `int`
  - `leaderEpoch`: `Integer` -> `long` (to allow for type expansion in the future)
  - `highWatermark`: `Long` -> `long`
- Use field names `lastFetchTimestamp` and `lastCaughtUpTimestamp` consistently
- Move construction of `DescribeQuorumResponseData.PartitionData` into `LeaderState`
- Consolidate fetch time/offset update logic into `LeaderState.ReplicaState.updateFollowerState`

Reviewers: Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>
2022-08-24 13:12:14 -07:00
Niket c7f051914e
KAFKA-13888; Implement `LastFetchTimestamp` and in `LastCaughtUpTimestamp` for DescribeQuorumResponse [KIP-836] (#12508)
This commit implements the newly added fields `LastFetchTimestamp` and `LastCaughtUpTimestamp` for KIP-836: https://cwiki.apache.org/confluence/display/KAFKA/KIP-836:+Addition+of+Information+in+DescribeQuorumResponse+about+Voter+Lag.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-08-19 15:09:09 -07:00
Jason Gustafson e5b865d6bf
KAFKA-13940; Return NOT_LEADER_OR_FOLLOWER if DescribeQuorum sent to non-leader (#12517)
Currently the server will return `INVALID_REQUEST` if a `DescribeQuorum` request is sent to a node that is not the current leader. In addition to being inconsistent with all of the other leader APIs in the raft layer, this error is treated as fatal by both the forwarding manager and the admin client. Instead, we should return `NOT_LEADER_OR_FOLLOWER` as we do with the other APIs. This error is retriable and we can rely on the admin client to retry it after seeing this error.

Reviewers: David Jacot <djacot@confluent.io>
2022-08-17 15:48:32 -07:00
dengziming 50e5b32a6d
KAFKA-13959: Controller should unfence Broker with busy metadata log (#12274)
The reason for KAFKA-13959 is a little complex, the two keys to this problem are:

KafkaRaftClient.MAX_FETCH_WAIT_MS==MetadataMaxIdleIntervalMs == 500ms. We rely on fetchPurgatory to complete a FetchRequest, in details, if FetchRequest.fetchOffset >= log.endOffset, we will wait for 500ms to send a FetchResponse. The follower needs to send one more FetchRequest to get the HW.

Here are the event sequences:

1. When starting the leader(active controller) LEO=m+1(m is the offset of the last record), leader HW=m(because we need more than half of the voters to reach m+1)
2. Follower (standby controller) and observer (broker) send FetchRequest(fetchOffset=m)
    2.1. leader receives FetchRequest, set leader HW=m and waits 500ms before send FetchResponse
    2.2. leader send FetchResponse(HW=m)
    3.3 broker receive FetchResponse(HW=m), set metadataOffset=m.
3. Leader append NoOpRecord, LEO=m+2. leader HW=m
4. Looping 1-4

If we change MAX_FETCH_WAIT_MS=200 (less than half of MetadataMaxIdleIntervalMs), this problem can be solved temporarily.

We plan to improve this problem in 2 ways, firstly, in this PR, we change the controller to unfence a broker when the broker's high-watermark has reached the broker registration record for that broker. Secondly, we will propagate the HWM to the replicas as quickly as possible in KAFKA-14145.

Reviewers: Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>
2022-08-12 09:06:24 -07:00
Niket 48caba9340
KAFKA-14104; Add CRC validation when iterating over Metadata Log Records (#12457)
This commit adds a check to ensure the RecordBatch CRC is valid when
iterating over a Batch of Records using the RecordsIterator. The
RecordsIterator is used by both Snapshot reads and Log Records reads in
Kraft. The check can be turned off by a class parameter and is on by default.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
2022-08-08 15:03:04 -07:00
Divij Vaidya 5e4c8f704c
KAFKA-13943; Make `LocalLogManager` implementation consistent with the `RaftClient` contract (#12224)
Fixes two issues in the implementation of `LocalLogManager`:

- As per the interface contract for `RaftClient.scheduleAtomicAppend()`, it should throw a `NotLeaderException` exception when the provided current leader epoch does not match the current epoch. However, the current `LocalLogManager`'s implementation of the API returns a LONG_MAX instead of throwing an exception. This change fixes the behaviour and makes it consistent with the interface contract.
-  As per the interface contract for `RaftClient.resign(epoch)`if the parameter epoch does not match the current epoch, this call will be ignored. But in the current `LocalLogManager` implementation the leader epoch might change when the thread is waiting to acquire a lock on `shared.tryAppend()` (note that tryAppend() is a synchronized method). In such a case, if a NotALeaderException is thrown (as per code change in above), then resign should be ignored.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Tom Bentley <tbentley@redhat.com>, Jason Gustafson <jason@confluent.io>
2022-07-05 20:08:28 -07:00
Christo Lolov 6c90f3335e
KAFKA-13947: Use %d formatting for integers rather than %s (#12267)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Divij Vaidya <diviv@amazon.com>, Kvicii <kvicii.yu@gmail.com>
2022-06-10 13:55:52 +02:00
dengziming 1d6e3d6cb3
KAFKA-13845: Add support for reading KRaft snapshots in kafka-dump-log (#12084)
The kafka-dump-log command should accept files with a suffix of ".checkpoint". It should also decode and print using JSON the snapshot header and footer control records.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
2022-06-01 14:49:00 -07:00
José Armando García Sancio 7d1b0926fa
KAFKA-13883: Implement NoOpRecord and metadata metrics (#12183)
Implement NoOpRecord as described in KIP-835. This is controlled by the new
metadata.max.idle.interval.ms configuration.

The KRaft controller schedules an event to write NoOpRecord to the metadata log if the metadata
version supports this feature. This event is scheduled at the interval defined in
metadata.max.idle.interval.ms. Brokers and controllers were improved to ignore the NoOpRecord when
replaying the metadata log.

This PR also addsffour new metrics to the KafkaController metric group, as described KIP-835.

Finally, there are some small fixes to leader recovery. This PR fixes a bug where metadata version
3.3-IV1 was not marked as changing the metadata. It also changes the ReplicaControlManager to
accept a metadata version supplier to determine if the leader recovery state is supported.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2022-06-01 10:48:24 -07:00
Colin Patrick McCabe fa59be4e77
KAFKA-13649: Implement early.start.listeners and fix StandardAuthorizer loading (#11969)
Since the StandardAuthorizer relies on the metadata log to store its ACLs, we need to be sure that
we have the latest metadata before allowing the authorizer to be used. However, if the authorizer
is not usable for controllers in the cluster, the latest metadata cannot be fetched, because
inter-node communication cannot occur. In the initial commit which introduced StandardAuthorizer,
we punted on the loading issue by allowing the authorizer to be used immediately. This commit fixes
that by implementing early.start.listeners as specified in KIP-801. This will allow in superusers
immediately, but throw the new AuthorizerNotReadyException if non-superusers try to use the
authorizer before StandardAuthorizer#completeInitialLoad is called.

For the broker, we call StandardAuthorizer#completeInitialLoad immediately after metadata catch-up
is complete, right before unfencing. For the controller, we call
StandardAuthorizer#completeInitialLoad when the node has caught up to the high water mark of the
cluster metadata partition.

This PR refactors the SocketServer so that it creates the configured acceptors and processors in
its constructor, rather than requiring a call to SocketServer#startup A new function,
SocketServer#enableRequestProcessing, then starts the threads and begins listening on the
configured ports. enableRequestProcessing uses an async model: we will start the acceptor and
processors associated with an endpoint as soon as that endpoint's authorizer future is completed.

Also fix a bug where the controller and listener were sharing an Authorizer when in co-located
mode, which was not intended.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-05-12 14:48:33 -07:00
Jacklee 2a656aea27
MINOR: fix typo for QUORUM_FETCH_TIMEOUT_MS_DOC (#12132)
Reviewers: Luke Chen <showuon@gmail.com>
2022-05-10 10:47:23 +08:00
RivenSun 51fb42bdfd
MINOR: Correct spelling errors in KafkaRaftClient (#12061)
Correct spelling errors in KafkaRaftClient

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
2022-04-18 10:58:57 -07:00
David Arthur 55ff5d3603
KAFKA-13823 Feature flag changes from KIP-778 (#12036)
This PR includes the changes to feature flags that were outlined in KIP-778.  Specifically, it
changes UpdateFeatures and FeatureLevelRecord to remove the maximum version level. It also adds
dry-run to the RPC so the controller can actually attempt the upgrade (rather than the client). It
introduces an upgrade type enum, which supersedes the allowDowngrade boolean. Because
FeatureLevelRecord was unused previously, we do not need to introduce a new version.

The kafka-features.sh tool was overhauled in KIP-778 and now includes the describe, upgrade,
downgrade, and disable sub-commands.  Refer to
[KIP-778](https://cwiki.apache.org/confluence/display/KAFKA/KIP-778%3A+KRaft+Upgrades) for more
details on the new command structure.

Reviewers: Colin P. McCabe <cmccabe@apache.org>, dengziming <dengziming1993@gmail.com>
2022-04-14 10:04:32 -07:00
liym 620f1d88d8
Polish Javadoc for EpochState (#11897)
Polish Javadoc for EpochState

Reviewers: Bill Bejeck <bbejeck@apache.org>
2022-03-15 19:58:47 -04:00
Cong Ding a21aec8d62
KAFKA-13603: Allow the empty active segment to have missing offset index during recovery (#11345)
Within a LogSegment, the TimeIndex and OffsetIndex are lazy indices that don't get created on disk until they are accessed for the first time. However, Log recovery logic expects the presence of an offset index file on disk for each segment, otherwise, the segment is considered corrupted.

This PR introduces a forceFlushActiveSegment boolean for the log.flush function to allow the shutdown process to flush the empty active segment, which makes sure the offset index file exists.

Co-Author: Kowshik Prakasam kowshik@gmail.com
Reviewers: Jason Gustafson <jason@confluent.io>, Jun Rao <junrao@gmail.com>
2022-01-27 14:59:21 -08:00
Kvicii dd58f81b25
KAFKA-13618: Fix typo in BatchAccumulator (#11715)
Co-authored-by: Kvicii <Karonazaba@gmail.com>
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2022-01-26 18:30:56 +01:00
José Armando García Sancio 1bf418beaf
MINOR: Pass along compression type to snapshot writer (#11556)
Make sure that the compression type is passed along to the `RecordsSnapshotWriter` constructor when creating the snapshot writer using the static `createWithHeader` method.

Reviewers: Jason Gustafson <jason@confluent.io>
2021-11-30 17:29:36 -08:00
loboya~ 42306ba267
KAFKA-12932: Interfaces for SnapshotReader and SnapshotWriter (#11529)
Change the snapshot API so that SnapshotWriter and SnapshotReader are interfaces. Change the existing types SnapshotWriter and SnapshotReader to use a different name and to implement the interfaces introduced by this commit.

Co-authored-by: loboxu <loboxu@tencent.com>
Reviews: José Armando García Sancio <jsancio@users.noreply.github.com>
2021-11-30 11:44:39 -07:00
loboya~ 6c59d2d685
MINOR: Update javadoc of `SnapshotWriter.createWithHeader` (#11530)
Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
2021-11-29 08:49:36 +01:00
Lee Dongjin 051efc7b1a
MINOR: Remove unused parameters, exceptions, comments, etc. (#11472)
* Remove redundant toString call & unused value in LogCleanerParameterizedIntegrationTest
* Remove unthrown exceptions in FileRawSnapshotTest
* Remove unused parameters in DumpLogSegmentsTest.scala
* Remove redundant parameter to FetchDataInfo()
* Remove redundant toString call in EndToEndLatency
* Remove unused parameters in DumpLogSegments
* Remove unused toString call in AbstractLogCleanerIntegrationTest
* Remove unused parameter in LogCleanerTest#appendTransactionalAsLeader
* Remove redundant 'val's from ClientQuotaManagerTest.UserClient.
* Remove redundant parameters in EdgeCaseRequestTest
* Remove redundant Int.MaxValue from DumpLogSegments.dumpTimeIndex parameters.
* Remove '// 9) static client-id quota' from DefaultQuotaCallback#quotaMetricTags; static client-id quota was removed in 3.0.0.
* Remove redundant parameters to DumpLogSegments#[dumpLog, dumpTimeIndex].

Reviewers: Mickael Maison <mickael.maison@gmail.com>
2021-11-18 11:52:10 +01:00
Jorge Esteban Quilcate Otoya 214b59b3ec
KAFKA-13429: ignore bin on new modules (#11415)
Reviewers: John Roesler <vvcephei@apache.org>
2021-11-10 14:36:24 -06:00
Niket feee65a2cf
MINOR: Adding a constant to denote UNKNOWN leader in LeaderAndEpoch (#11477)
Reviewers: José Armando García Sancio <jsancio@gmail.com>, Jason Gustafson <jason@confluent.io>
2021-11-09 09:07:05 -08:00
Lee Dongjin 22d056c9b7
TRIVIAL: Fix type inconsistencies, unthrown exceptions, etc (#10678)
Reviewers: Ismael Juma <ismael@juma.me.uk>, Bruno Cadonna <cadonna@apache.org>
2021-11-03 14:58:42 +01:00
feyman2016 82d5e1cf14
KAFKA-10800; Enhance the test for validation when the state machine creates a snapshot (#10593)
This patch adds additional test cases covering the validations done when snapshots are created by the state machine.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2021-10-26 14:03:46 -07:00
Satish Duggana 6288b5370d
MINOR: Renamed a few record definition files with the existing convention. (#11414)
Reviewers: Jun Rao <junrao@gmail.com>
2021-10-21 13:44:08 -07:00
ik 74f25bf3ad
MINOR: Fix redundant static modifier for enum (#11282)
Reviewers: Mickael Maison <mickael.maison@gmail.com>

Co-authored-by: ik.lim <iksh192@gmail.com>
2021-10-18 10:40:48 +02:00
Matthew Wong 6c80643009
[KAFKA-8522] Streamline tombstone and transaction marker removal (#10914)
This PR aims to remove tombstones that persist indefinitely due to low throughput. Previously, deleteHorizon was calculated from the segment's last modified time.

In this PR, the deleteHorizon will now be tracked in the baseTimestamp of RecordBatches. After the first cleaning pass that finds a record batch with tombstones, the record batch is recopied with deleteHorizon flag and a new baseTimestamp that is the deleteHorizonMs. The records in the batch are rebuilt with relative timestamps based on the deleteHorizonMs that is recorded. Later cleaning passes will be able to remove tombstones more accurately on their deleteHorizon due to the individual time tracking on record batches.

KIP 534: https://cwiki.apache.org/confluence/display/KAFKA/KIP-534%3A+Retain+tombstones+and+transaction+markers+for+approximately+delete.retention.ms+milliseconds

Co-authored-by: Ted Yu <yuzhihong@gmail.com>
Co-authored-by: Richard Yu <yohan.richard.yu@gmail.com>
2021-09-16 09:17:15 -07:00
Ismael Juma 0118330103
KAFKA-13273: Add support for Java 17 (#11296)
Java 17 is at release candidate stage and it will be a LTS release once
it's out (previous LTS release was Java 11).

Details:
* Replace Java 16 with Java 17 in Jenkins and Readme.
* Replace `--illegal-access=permit` (which was removed from Java 17)
   with  `--add-opens` for the packages we require internal access to.
   Filed KAFKA-13275 for updating the tests not to require `--add-opens`
   (where possible).
* Update `release.py` to use JDK8. and JDK 17 (instead of JDK 8 and JDK 15).
* Removed all but one Streams test from `testsToExclude`. The
   Connect test exclusion list remains the same.
* Add notable change to upgrade.html
* Upgrade to Gradle 7.2 as it's required for proper Java 17 support.
* Upgrade mockito to 3.12.4 for better Java 17 support.
* Adjusted `KafkaRaftClientTest` and `QuorumStateTest` not to require
   private access to `jdk.internal.util.random`.

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
2021-09-06 08:55:52 -07:00
dengziming b980ca8709
KAFKA-12158; Better return type of RaftClient.scheduleAppend (#10909)
This patch improves the return type for `scheduleAppend` and `scheduleAtomicAppend`. Previously we were using a `Long` value and using both `null` and `Long.MaxValue` to distinguish between different error cases. In this PR, we change the return type to `long` and only return a value if the append was accepted. For the error cases, we instead throw an exception. For this purpose, the patch introduces a couple new exception types: `BufferAllocationException` and `NotLeaderException`.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2021-08-02 14:47:03 -07:00
José Armando García Sancio fd36e5a8b6
KAFKA-12851: Fix Raft partition simulation (#11134)
Instead of waiting for a high-watermark of 20 after the partition, the
test should wait for the high-watermark to reach an offset greater than
the largest log end offset at the time of the partition. Only that offset
is guarantee to be reached as the high-watermark by the new majority.

Reviewers: Jason Gustafson <jason@confluent.io>
2021-07-28 09:28:56 -07:00
José Armando García Sancio 55d9acad65
KAFKA-13113; Support unregistering Raft listeners (#11109)
This patch adds support for unregistering listeners to `RaftClient`. 

Reviewers: Colin P. McCabe <cmccabe@apache.org>, Jason Gustafson <jason@confluent.io>
2021-07-23 21:54:44 -07:00
Niket 57866bd588
MINOR: Rename the @metadata topic to __cluster_metadata #11102
Reviewers: Colin P. McCabe <cmccabe@apache.org>
2021-07-21 17:30:35 -07:00
Ryan Dielhenn 56ef910358
KAFKA-13104: Controller should notify raft client when it resigns #11082
When the active controller encounters an event exception it attempts to renounce leadership.
Unfortunately, this doesn't tell the RaftClient that it should attempt to give up leadership. This
will result in inconsistent state with the RaftClient as leader but with the controller as
inactive.  This PR changes the implementation so that the active controller asks the RaftClient
to resign.

Reviewers: Jose Sancio <jsancio@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
2021-07-20 16:41:20 -07:00
José Armando García Sancio 69a4661d7a
KAFKA-13100: Create KRaft controller snapshot during promotion (#11084)
The leader assumes that there is always an in-memory snapshot at the last
committed offset. This means that the controller needs to generate an in-memory
snapshot when getting promoted from inactive to active.  This PR adds that
code. This fixes a bug where sometimes we would try to look for that in-memory
snapshot and not find it.

The controller always starts inactive, and there is no requirement that there
exists an in-memory snapshot at the last committed offset when the controller
is inactive. Therefore we can remove the initial snapshot at offset -1.

We should also optimize when a snapshot is cancelled or completes, by deleting
all in-memory snapshots less that the last committed offset.

SnapshotRegistry's createSnapshot should allow the creating of a snapshot if
the last snapshot's offset is the given offset. This allows for simpler client
code. Finally, this PR renames createSnapshot to getOrCreateSnapshot.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2021-07-20 10:13:01 -07:00
José Armando García Sancio b5cb02b288 KAFKA-13090: Improve kraft snapshot integration test
Check and verify generated snapshots for the controllers and the
brokers. Assert reader state when reading last log append time.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2021-07-16 14:10:52 -07:00
José Armando García Sancio 8134adcf91
KAFKA-13073: Fix MockLog snapshot implementation (#11032)
Fix a simulation test failure by:

1. Relaxing the valiation of the snapshot id against the log start
offset when the state machine attempts to create new snapshot. It
is safe to just ignore the request instead of throwing an exception
when the snapshot id is less that the log start offset.

2. Fixing the MockLog implementation so that it uses startOffset both
externally and internally.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2021-07-13 17:06:18 -07:00
José Armando García Sancio 0f00f3677f
KAFKA-13078: Fix a bug where we were closing the RawSnapshotWriter to early (#11040)
Reviewers: David Arthur <mumrah@gmail.com>
2021-07-13 16:20:03 -07:00
Justine Olshan 2b8aff58b5
KAFKA-10580: Add topic ID support to Fetch request (#9944)
Updated FetchRequest and FetchResponse to use topic IDs rather than topic names.
Some of the complicated code is found in FetchSession and FetchSessionHandler.
We need to be able to store topic IDs and maintain a cache on the broker for IDs that may not have been resolved. On incremental fetch requests, we will try to resolve them or remove them if in toForget.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Chia-Ping Tsai <chia7712@gmail.com>, Jun Rao <junrao@gmail.com>
2021-07-07 16:02:37 -07:00
David Arthur 284ec262c6 KAFKA-12155: Metadata log and snapshot cleaning #10864
This PR includes changes to KafkaRaftClient and KafkaMetadataLog to support periodic
cleaning of old log segments and snapshots.

Four new public config keys are introduced: metadata.log.segment.bytes,
metadata.log.segment.ms, metadata.max.retention.bytes, and
metadata.max.retention.ms.

These are used to configure the log layer as well as the snapshot cleaning logic. Snapshot
and log cleaning is performed based on two criteria: total metadata log + snapshot size
(metadata.max.retention.bytes), and max age of a snapshot (metadata.max.retention.ms).
Since we have a requirement that the log start offset must always align with a snapshot,
we perform the cleaning on snapshots first and then clean what logs we can.

The cleaning algorithm follows:
1. Delete the oldest snapshot.
2. Advance the log start offset to the new oldest snapshot.
3. Request that the log layer clean any segments prior to the new log start offset
4. Repeat this until the retention size or time is no longer violated, or only a single
snapshot remains.

The cleaning process is triggered every 60 seconds from the KafkaRaftClient polling
thread.

Reviewers: José Armando García Sancio <jsancio@gmail.com>, dengziming <dengziming1993@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
2021-07-06 14:19:44 -07:00
zhaohaidao 10b1f73cd4
KAFKA-12958: add an invariant that notified leaders are never asked to load snapshot (#10932)
Track handleSnapshot calls and make sure it is never triggered on the leader node.

Reviewers: Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>, Boyang Chen <bchen11@outlook.com>
2021-07-04 08:32:12 -07:00
José Armando García Sancio 9f01909dc3
KAFKA-12997: Expose the append time for batches from raft (#10946)
Add the record append time to Batch. Change SnapshotReader to set this time to the
time of the last log in the last batch. Fix the QuorumController to remember the last
committed batch append time and to store it in the generated snapshot.

Reviewers: David Arthur <mumrah@gmail.com>, Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
2021-07-01 16:38:59 -07:00
José Armando García Sancio 1b7ab8eb9f
KAFKA-12863: Configure controller snapshot generation (#10812)
Add the ability for KRaft controllers to generate snapshots based on the number of new record bytes that have 
been applied since the last snapshot. Add a new configuration key to control this parameter. For now, it
defaults to being off, although we will change that in a follow-on PR. Also, fix LocalLogManager so that
snapshot loading is only triggered when the listener is not the leader.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2021-06-30 18:13:53 -07:00
Niket d3ec9f940c
KAFKA-12952 Add header and footer records for raft snapshots (#10899)
Add header and footer records for raft snapshots. This helps identify when the snapshot
starts and ends. The header also contains a time.  The time field is currently set to 0.
KAFKA-12997 will add in the necessary wiring to use the correct timestamp.

Reviewers: Jose Sancio <jsancio@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
2021-06-29 09:37:20 -07:00
Jason Gustafson f86cb1d1da
KAFKA-12631; Implement `resign` API in `KafkaRaftClient` (#10913)
This patch adds an implementation of the `resign()` API which allows the controller to proactively resign leadership in case it encounters an unrecoverable situation. There was not a lot to do here because we already supported a `Resigned` state to facilitate graceful shutdown.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, David Arthur <mumrah@gmail.com>
2021-06-28 18:00:19 -07:00
José Armando García Sancio 4ddf1a5f94
KAFKA-12837; Process entire batch reader in the `BrokerMetadataListener` commit handler (#10902)
We should process the entire batch in `BrokerMetadataListener` and make sure that `hasNext` is called before calling `next` on the iterator. The previous code worked because the raft client kept track of the position in the iterator, but it caused NoSuchElementException to be raised when the reader was empty (as might be the case with control records).

Reviewers: Jason Gustafson <jason@confluent.io>
2021-06-18 13:20:43 -07:00
Jason Gustafson 6c260a5e1d
MINOR: Fix javadoc errors in `RaftClient` (#10901)
This patch fixes a few minor javadoc issues in the `RaftClient` interface.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, David Jacot <djacot@confluent.io>
2021-06-17 23:33:02 -07:00
José Armando García Sancio b67a77d5b9
KAFKA-12787; Integrate controller snapshoting with raft client (#10786)
Directly use `RaftClient.Listener`, `SnapshotWriter` and `SnapshotReader` in the quorum controller.

1. Allow `RaftClient` users to create snapshots by specifying the last committed offset and last committed epoch. These values are validated against the log and leader epoch cache.
2. Remove duplicate classes in the metadata module for writing and reading snapshots.
3. Changed the logic for comparing snapshots. The old logic was assuming a certain batch grouping. This didn't match the implementation of the snapshot writer. The snapshot writer is free to merge batches before writing them.
4. Improve `LocalLogManager` to keep track of multiple snapshots.
5. Improve the documentation and API for the snapshot classes to highlight the distinction between the offset of batches in the snapshot vs the offset of batches in the log. These two offsets are independent of one another. `SnapshotWriter` and `SnapshotReader` expose a method called `lastOffsetFromLog` which represents the last inclusive offset from the log that is represented in the snapshot.

Reviewers: dengziming <swzmdeng@163.com>, Jason Gustafson <jason@confluent.io>
2021-06-15 10:32:01 -07:00
loboya~ 4b7ad7b14d
KAFKA-12773; Use UncheckedIOException when wrapping IOException (#10749)
The raft module may not be fully consistent on this but in general in that module we have decided to not throw the checked IOException. We have been avoiding checked IOException exceptions by wrapping them in RuntimeException. The raft module should instead wrap IOException in UncheckedIOException. 

Reviewers: Luke Chen <showuon@gmail.com>, David Arthur <mumrah@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2021-06-15 10:22:48 -07:00
Satish Duggana a9fe2bd935
MINOR Removed unused ConfigProvider from raft resources module. (#10829)
Reviewers: Jun Rao <junrao@gmail.com>
2021-06-09 14:22:18 -07:00
Boyang Chen 0358c21ae4
minor stylish fixes to raft client (#10809)
Style fixes to KafkaRaftClient

Reviewers: Luke Chen <showuon@gmail.com>
2021-06-03 18:51:03 -07:00
Chia-Ping Tsai f426a72f03
MINOR: replace by org.junit.jupiter.api.Tag by net.jqwik.api.Tag for raft module (#10791)
The command `./gradlew raft:integrationTest`  can't run any integration test since `org.junit.jupiter.api.Tag` does not work for jqwik engine (see https://github.com/jlink/jqwik/issues/36#issuecomment-436535760). 

Reviewers: Ismael Juma <ismael@juma.me.uk>
2021-06-01 13:03:25 +08:00
José Armando García Sancio f50f13d781
KAFKA-12342: Remove MetaLogShim and use RaftClient directly (#10705)
This patch removes the temporary shim layer we added to bridge the interface
differences between MetaLogManager and RaftClient. Instead, we now use the
RaftClient directly from the metadata module.  This also means that the
metadata gradle module now depends on raft, rather than the other way around.
Finally, this PR also consolidates the handleResign and handleNewLeader APIs
into a single handleLeaderChange API.

Co-authored-by: Jason Gustafson <jason@confluent.io>
2021-05-20 15:39:46 -07:00
David Arthur 937e28db5d
Fix compile errors from KAFKA-12543 (#10719)
Reviewers: Colin P. Mccabe <cmccabe@confluent.io>, Jun Rao <junrao@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>
2021-05-18 17:09:51 -04:00
José Armando García Sancio 924c870fb1
KAFKA-12543: Change RawSnapshotReader ownership model (#10431)
Kafka networking layer doesn't close FileRecords and assumes that they are already open when sending them over a channel. To support this pattern this commit changes the ownership model for FileRawSnapshotReader so that they are owned by KafkaMetadataLog.

Reviewers: dengziming <swzmdeng@163.com>, David Arthur <mumrah@gmail.com>, Jun Rao <junrao@gmail.com>
2021-05-18 14:14:17 -04:00
Satish Duggana 7ef3879429
KAFKA-12758 Added `server-common` module to have server side common classes. (#10638)
Added server-common module to have server side common classes. Moved ApiMessageAndVersion, RecordSerde, AbstractApiMessageSerde, and BytesApiMessageSerde to server-common module.

Reivewers:  Kowshik Prakasam <kprakasam@confluent.io>, Jun Rao <junrao@gmail.com>
2021-05-11 09:58:28 -07:00
Satish Duggana a1367f57f5
KAFKA-12429: Added serdes for the default implementation of RLMM based on an internal topic as storage. (#10271)
KAFKA-12429: Added serdes for the default implementation of RLMM based on an internal topic as storage. This topic will receive events of RemoteLogSegmentMetadata, RemoteLogSegmentUpdate, and RemotePartitionDeleteMetadata. These events are serialized into Kafka protocol message format.
Added tests for all the event types for that topic.

This is part of the tiered storaqe implementation KIP-405.

Reivewers:  Kowshik Prakasam <kprakasam@confluent.io>, Jun Rao <junrao@gmail.com>
2021-05-05 07:48:52 -07:00
José Armando García Sancio 6203bf8b94
KAFKA-12154; Raft Snapshot Loading API (#10085)
Implement Raft Snapshot loading API.

1. Adds a new method `handleSnapshot` to `raft.Listener` which is called whenever the `RaftClient` determines that the `Listener` needs to load a new snapshot before reading the log. This happens when the `Listener`'s next offset is less than the log start offset also known as the earliest snapshot.

2.  Adds a new type `SnapshotReader<T>` which provides a `Iterator<Batch<T>>` interface and de-serializes records in the `RawSnapshotReader` into `T`s

3.  Adds a new type `RecordsIterator<T>` that implements an `Iterator<Batch<T>>` by scanning a `Records` object and deserializes the batches and records into `Batch<T>`. This type is used by both `SnapshotReader<T>` and `RecordsBatchReader<T>` internally to implement the `Iterator` interface that they expose. 

4. Changes the `MockLog` implementation to read one or two batches at a time. The previous implementation always read from the given offset to the high-watermark. This made it impossible to test interesting snapshot loading scenarios.

5. Removed `throws IOException` from some methods. Some of types were inconsistently throwing `IOException` in some cases and throwing `RuntimeException(..., new IOException(...))` in others. This PR improves the consistent by wrapping `IOException` in `RuntimeException` in a few more places and replacing `Closeable` with `AutoCloseable`.

6. Updated the Kafka Raft simulation test to take into account snapshot. `ReplicatedCounter` was updated to generate snapshot after 10 records get committed. This means that the `ConsistentCommittedData` validation was extended to take snapshots into account. Also added a new invariant to ensure that the log start offset is consistently set with the earliest snapshot.

Reviewers: dengziming <swzmdeng@163.com>, David Arthur <mumrah@gmail.com>, Jason Gustafson <jason@confluent.io>
2021-05-01 10:05:45 -07:00
Ryan a855f6ac37
KAFKA-12265; Move the BatchAccumulator in KafkaRaftClient to LeaderState (#10480)
The KafkaRaftClient has a field for the BatchAccumulator that is only used and set when it is the leader. In other cases, leader specific information was stored in LeaderState. In a recent change EpochState, which LeaderState implements, was changed to be a Closable. QuorumState makes sure to always close the previous state before transitioning to the next state. This redesign was used to move the BatchAccumulator to the LeaderState and simplify some of the handling in KafkaRaftClient.

Reviewers: José Armando García Sancio <jsancio@gmail.com>, Jason Gustafson <jason@confluent.io>
2021-04-29 09:25:21 -07:00
Bill Bejeck 637c44c976
KAFKA-12672: Added config for raft testing server (#10545)
Adding a property to the `raft/config/kraft.properties` for running the raft
test server in development.

For testing I ran `./bin/test-kraft-server-start.sh --config config/kraft.properties`
and validated the test server started running with a throughput test.

Reviewers: Ismael Juma <ismael@juma.me.uk>
2021-04-15 20:48:53 -07:00
dengziming db688b1a5e
KAFKA-12607; Test case for resigned state vote granting (#10510)
This patch adds unit tests to verify vote behavior when in the "resigned" state.

Reviewers: Jason Gustafson <jason@confluent.io>
2021-04-09 19:15:11 -07:00
Jason Gustafson d2c06c9c3c
KAFKA-12619; Raft leader should expose hw only after committing LeaderChange (#10481)
KIP-595 describes an extra condition on commitment here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Fetch. In order to ensure that a newly elected leader's committed entries cannot get lost, it must commit one record from its own epoch. This guarantees that its latest entry is larger (in terms of epoch/offset) than any previously written record which ensures that any future leader must also include it. This is the purpose of the `LeaderChange` record which is written to the log as soon as the leader gets elected.

Although we had this check implemented, it was off by one. We only ensured that replication reached the epoch start offset, which does not reflect the appended `LeaderChange` record. This patch fixes the check and clarifies the point of the check. The rest of the patch is just fixing up test cases.

Reviewers: dengziming <swzmdeng@163.com>, Guozhang Wang <wangguoz@gmail.com>
2021-04-08 10:42:30 -07:00
Justine Olshan c2ea0c2e1d
KAFKA-12457; Add sentinel ID to metadata topic (#10492)
KIP-516 introduces topic IDs to topics, but there is a small issue with how the KIP-500 metadata topic will interact with topic IDs. 

For example, https://github.com/apache/kafka/pull/9944 aims to replace topic names in the Fetch request with topic IDs. In order to get these IDs, brokers must fetch from the metadata topic. This leads to a sort of "chicken and the egg" problem concerning how we find out the metadata topic's topic ID. 

This PR adds the a special sentinel topic ID for the metadata topic, which gets around this problem.
More information can be found in the [JIRA](https://issues.apache.org/jira/browse/KAFKA-12457) and in [KIP-516](https://cwiki.apache.org/confluence/display/KAFKA/KIP-516%3A+Topic+Identifiers).

Reviewers: Jason Gustafson <jason@confluent.io>
2021-04-08 10:24:23 -07:00
dengziming 4f47a565e2
KAFKA-12539; Refactor KafkaRaftCllient handleVoteRequest to reduce cyclomatic complexity (#10393)
1. Add `canGrantVote` to `EpochState`
2. Move the if-else in `KafkaRaftCllient.handleVoteRequest` to `EpochState`
3. Add unit tests for `canGrantVote`

Reviewers: Jason Gustafson <jason@confluent.io>
2021-04-05 09:27:50 -07:00
Cong Ding 66b0c5c64f
KAFKA-3968: fsync the parent directory of a segment file when the file is created (#10405)
Kafka does not call fsync() on the directory when a new log segment is created and flushed to disk.

The problem is that following sequence of calls doesn't guarantee file durability:

fd = open("log", O_RDWR | O_CREATE); // suppose open creates "log"
write(fd);
fsync(fd);

If the system crashes after fsync() but before the parent directory has been flushed to disk, the log file can disappear.

This PR is to flush the directory when flush() is called for the first time.

Reviewers: Jun Rao <junrao@gmail.com>
2021-04-02 17:31:56 -07:00
Jason Gustafson 03b52dbe31
MINOR: Improve reproducability of raft simulation tests (#10422)
When a `@Property` tests fail, jqwik helpfully reports the initial seed that resulted in the failure. For example, if we are executing a test scenario 100 times and it fails on the 51st run, then we will get the initial seed that generated . But if you specify the seed in the `@Property` annotation as the previous comment suggested, then the test still needs to run 50 times before we get to the 51st case, which makes debugging very difficult given the complex nature of the simulation tests. Jqwik also gives us the specific argument list that failed, but that is not very helpful at the moment since `Random` does not have a useful `toString` which indicates the initial seed. 

To address these problems, I've changed the `@Property` methods to take the random seed as an argument directly so that it is displayed clearly in the output of a failure. I've also updated the documentation to clarify how to reproduce failures.

Reviewers: David Jacot <djacot@confluent.io>
2021-04-01 10:41:30 -07:00
Ismael Juma 16b2d4f3a7
MINOR: Self-managed -> KRaft (Kafka Raft) (#10414)
`Self-managed` is also used in the context of Cloud vs on-prem and it can
be confusing.

`KRaft` is a cute combination of `Kafka Raft` and it's pronounced like `craft`
(as in `craftsmanship`).

Reviewers: Colin P. McCabe <cmccabe@apache.org>, Jose Sancio <jsancio@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>, Ron Dagostino <rdagostino@confluent.io>
2021-03-29 15:39:10 -07:00
wenbingshen e0cbd0fa66
MINOR: Remove duplicate definition about 'the' from kafka project (#10370)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2021-03-23 10:44:55 +08:00
Jason Gustafson f5f66b982d
KAFKA-12181; Loosen raft fetch offset validation of remote replicas (#10309)
Currently the Raft leader raises an exception if there is a non-monotonic update to the fetch offset of a replica. In a situation where the replica had lost it disk state, this would prevent the replica from being able to recover. In this patch, we relax the validation to address this problem. It is worth pointing out that this validation could not be relied on to protect from data loss after a voter has lost committed state.

Reviewers: José Armando García Sancio <jsancio@gmail.com>, Boyang Chen <boyang@confluent.io>
2021-03-22 16:05:07 -07:00
David Arthur e820eb42b2
KAFKA-12383: Get RaftClusterTest.java and other KIP-500 junit tests working (#10220)
Introduce "testkit" package which includes KafkaClusterTestKit class for enabling integration tests of self-managed clusters. Also make use of this new integration test harness in the ClusterTestExtentions JUnit extension. 

Adds RaftClusterTest for basic self-managed integration test. 

Reviewers: Jason Gustafson <jason@confluent.io>, Colin P. McCabe <cmccabe@apache.org>

Co-authored-by: Colin P. McCabe <cmccabe@apache.org>
2021-03-22 11:45:56 -04:00
dengziming 69eebbf968
KAFKA-12440; ClusterId validation for Vote, BeginQuorum, EndQuorum and FetchSnapshot (#10289)
Previously we implemented ClusterId validation for the Fetch API in the Raft implementation. This patch adds ClusterId validation to the remaining Raft RPCs. 

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2021-03-19 10:27:47 -07:00
Rohit Deshpande a19806f262
KAFKA-12253: Add tests that cover all of the cases for ReplicatedLog's validateOffsetAndEpoch (#10276)
Improves test coverage of `validateOffsetAndEpoch`. 

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2021-03-19 10:09:38 -07:00
José Armando García Sancio 6190fb32ce
MINOR: Remove use of `NoSuchElementException` in `KafkaMetadataLog` (#10344)
Replace the use of the method `last` and `first` in `ConcurrentSkipListSet` with the descending and ascending iterator respectively. The methods `last` and `first` throw an exception when the set is empty this causes poor `KafkaRaftClient` performance when there aren't any snapshots.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
2021-03-18 11:03:21 -07:00
Jason Gustafson 8ef1619f3e
KAFKA-12459; Use property testing library for raft event simulation tests (#10323)
This patch changes the raft simulation tests to use jqwik, which is a property testing library. This provides two main benefits:

- It simplifies the randomization of test parameters. Currently the tests use a fixed set of `Random` seeds, which means that most builds are doing redundant work. We get a bigger benefit from allowing each build to test different parameterizations.
- It makes it easier to reproduce failures. Whenever a test fails, jqwik will report the random seed that failed. A developer can then modify the `@Property` annotation to use that specific seed in order to reproduce the failure.

This patch also includes an optimization for `MockLog.earliestSnapshotId` which reduces the time to run the simulation tests dramatically.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>, José Armando García Sancio <jsancio@gmail.com>, David Jacot <djacot@confluent.io>
2021-03-17 19:20:07 -07:00
Jason Gustafson c6a0f76073
KAFKA-12460; Do not allow raft truncation below high watermark (#10310)
Initially we want to be strict about the loss of committed data for the `@metadata` topic. This patch ensures that truncation below the high watermark is not allowed. Note that `MockLog` already had the logic to do so, so the patch adds a similar check to `KafkaMetadataLog`. 

Reviewers: David Jacot <djacot@confluent.io>, Boyang Chen <boyang@confluent.io>
2021-03-12 16:31:51 -08:00
dengziming 0e5591beda
KAFKA-12205; Delete snapshots less than the snapshot at the log start (#10021)
This patch adds logic to delete old snapshots. There are three cases we handle:

1. Remove old snapshots after a follower completes fetching a snapshot and truncates the log to the latest snapshot
2. Remove old snapshots after a new snapshot is created.
3. Remove old snapshots during recovery after the node is restarted.

Reviewers: Cao Manh Dat<caomanhdat317@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2021-03-11 10:10:27 -08:00
Jason Gustafson 0685b9dcd5
MINOR: Raft max batch size needs to propagate to log config (#10256)
This patch ensures that the constant max batch size defined in `KafkaRaftClient` is propagated to the constructed log configuration in `KafkaMetadataLog`. We also ensure that the fetch max size is set consistently with appropriate testing. 

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>
2021-03-04 14:40:31 -08:00
José Armando García Sancio 96a2b7aac4
KAFKA-12376: Apply atomic append to the log (#10253) 2021-03-04 13:55:43 -05:00
Chia-Ping Tsai 8205051e90
MINOR: remove FetchResponse.AbortedTransaction and redundant construc… (#9758)
1. rename INVALID_HIGHWATERMARK to INVALID_HIGH_WATERMARK
2. replace FetchResponse.AbortedTransaction by FetchResponseData.AbortedTransaction
3. remove redundant constructors from FetchResponse.PartitionData
4. rename recordSet to records
5. add helpers "recordsOrFail" and "recordsSize" to FetchResponse to process record casting

Reviewers: Ismael Juma <ismael@juma.me.uk>
2021-03-04 18:06:50 +08:00
Colin Patrick McCabe 1657deec37
MINOR: tune KIP-631 configurations (#10179)
Since we expect KIP-631 controller fail-overs to be fairly cheap, tune
the default raft configuration parameters so that we detect node
failures more quickly.

Reduce the broker session timeout as well so that broker failures are
detected more quickly.

Reviewers: Jason Gustafson <jason@confluent.io>, Alok Nikhil <anikhil@confluent.io>
2021-02-25 17:16:37 -08:00
Jason Gustafson 1a09bac030
MINOR: Remove redundant log close in `KafkaRaftClient` (#10168)
This patch fixes a small shutdown bug. Current logic closes the log twice: once in `KafkaRaftClient`, and once in `RaftManager`. This can lead to errors like the following:

```
[2021-02-18 18:35:12,643] WARN  (kafka.utils.CoreUtils$)
java.nio.channels.ClosedChannelException
        at java.base/sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:150)
        at java.base/sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:452)
        at org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:197)
        at org.apache.kafka.common.record.FileRecords.close(FileRecords.java:204)
        at kafka.log.LogSegment.$anonfun$close$4(LogSegment.scala:592)
        at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:68)
        at kafka.log.LogSegment.close(LogSegment.scala:592)
        at kafka.log.Log.$anonfun$close$4(Log.scala:1038)
        at kafka.log.Log.$anonfun$close$4$adapted(Log.scala:1038)
        at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
        at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:919)
        at kafka.log.Log.$anonfun$close$3(Log.scala:1038)
        at kafka.log.Log.close(Log.scala:2433)
        at kafka.raft.KafkaMetadataLog.close(KafkaMetadataLog.scala:295)
        at kafka.raft.KafkaRaftManager.shutdown(RaftManager.scala:150)
```

I have tended to view `RaftManager` as owning the lifecycle of the log, so I removed the extra call to close in `KafkaRaftClient`.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Ismael Juma <ismael@juma.me.uk>
2021-02-20 12:38:09 -08:00
David Jacot bbf145b1b1
KAFKA-10817; Add clusterId validation to raft Fetch handling (#10129)
This patch adds clusterId validation in the `Fetch` API as documented in KIP-595. A new error code `INCONSISTENT_CLUSTER_ID` is returned if the request clusterId does not match the value on the server. If no clusterId is provided, the request is treated as valid.

Reviewers: Jason Gustafson <jason@confluent.io>
2021-02-19 14:43:14 -08:00
José Armando García Sancio 9243c10161
KAFKA-12258; Add support for splitting appending records (#10063)
1. Type `BatchAccumulator`. Add support for appending records into one or more batches.
2. Type `RaftClient`. Rename `scheduleAppend` to `scheduleAtomicAppend`.
3. Type `RaftClient`. Add a new method `scheduleAppend` which appends records to the log using as many batches as necessary.
4. Increase the batch size from 1MB to 8MB.

Reviewers: David Arthur <mumrah@gmail.com>, Jason Gustafson <jason@confluent.io>
2021-02-18 19:46:23 -08:00
José Armando García Sancio e29f7a36db
KAFKA-12331: Use LEO for the base offset of LeaderChangeMessage batch (#10138)
The `KafkaMetadataLog` implementation of `ReplicatedLog` validates that batches appended using `appendAsLeader` and `appendAsFollower` have an offset that matches the LEO. This is enforced by `KafkaRaftClient` and `BatchAccumulator`. When creating control batches for the `LeaderChangeMessage` the default base offset of `0` was being used instead of using the LEO. This is fixed by:

1. Changing the implementation for `MockLog` to validate against this and throw an `RuntimeException` if this invariant is violated.
2. Always create a batch for `LeaderChangeMessage` with an offset equal to the LEO.

Reviewers: Jason Gustafson <jason@confluent.io>
2021-02-18 16:44:40 -08:00
Ron Dagostino a30f92bf59
MINOR: Add KIP-500 BrokerServer and ControllerServer (#10113)
This PR adds the KIP-500 BrokerServer and ControllerServer classes and 
makes some related changes to get them working.  Note that the ControllerServer 
does not instantiate a QuorumController object yet, since that will be added in
PR #10070.

* Add BrokerServer and ControllerServer

* Change ApiVersions#computeMaxUsableProduceMagic so that it can handle
endpoints which do not support PRODUCE (such as KIP-500 controller nodes)

* KafkaAdminClientTest: fix some lingering references to decommissionBroker
that should be references to unregisterBroker.

* Make some changes to allow SocketServer to be used by ControllerServer as
we as by the broker.

* We now return a random active Broker ID as the Controller ID in
MetadataResponse for the Raft-based case as per KIP-590.

* Add the RaftControllerNodeProvider

* Add EnvelopeUtils

* Add MetaLogRaftShim

* In ducktape, in config_property.py: use a KIP-500 compatible cluster ID.

Reviewers: Colin P. McCabe <cmccabe@apache.org>, David Arthur <mumrah@gmail.com>
2021-02-17 21:35:13 -08:00
Justine Olshan fb7da1a245
Fixed README and added clearer error message. (#10133)
The script `test-raft-server-start.sh` requires the config to be specified with `--config`. I've included this in the README and added an error message for this specific case.

Reviewers: Jason Gustafson <jason@confluent.io>
2021-02-16 15:12:29 -08:00
Ismael Juma 744d05b128
KAFKA-12327: Remove MethodHandle usage in CompressionType (#10123)
We don't really need it and it causes problems in older Android versions
and GraalVM native image usage (there are workarounds for the latter).

Move the logic to separate classes that are only invoked when the
relevant compression library is actually used. Place such classes
in their own package and enforce via checkstyle that only these
classes refer to compression library packages.

To avoid cyclic dependencies, moved `BufferSupplier` to the `utils`
package.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2021-02-14 08:12:25 -08:00
dengziming 2c590de54e
MINOR: Add FetchSnapshot API doc in KafkaRaftClient (#10097) 2021-02-12 10:38:10 -05:00
Jason Gustafson f58c2acf26
KAFKA-12250; Add metadata record serde for KIP-631 (#9998)
This patch adds a `RecordSerde` implementation for the metadata record format expected by KIP-631. 

Reviewers: Colin McCabe <cmccabe@apache.org>, Ismael Juma <mlists@juma.me.uk>
2021-02-03 16:16:35 -08:00
feyman2016 db73d86ea6
KAFKA-10636; Bypass log validation and offset assignment for writes from the raft leader (#9739)
Since the Raft leader is already doing the work of assigning offsets and the leader epoch, we can skip the same logic in `Log.appendAsLeader`. This lets us avoid an unnecessary round of decompression.

Reviewers: dengziming <dengziming1993@gmail.com>, Jason Gustafson <jason@confluent.io>
2021-02-01 10:05:47 -08:00
Jason Gustafson 7205cd36e4
KAFKA-12236; New meta.properties logic for KIP-500 (#9967)
This patch contains the new handling of `meta.properties` required by the KIP-500 server as specified in KIP-631. When using the self-managed quorum, the `meta.properties` file is required in each log directory with the new `version` property set to 1. It must include the `cluster.id` property and it must have a `node.id` matching that in the configuration.

The behavior of `meta.properties` for the Zookeeper-based `KafkaServer` does not change. We treat `meta.properties` as optional and as if it were `version=0`. We continue to generate the clusterId and/or the brokerId through Zookeeper as needed.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
2021-01-30 17:05:31 -08:00
José Armando García Sancio 5b3351e10b
KAFKA-10761; Kafka Raft update log start offset (#9816)
Adds support for nonzero log start offsets.

Changes to `Log`:
1. Add a new "reason" for increasing the log start offset. This is used by `KafkaMetadataLog` when a snapshot is generated.
2. `LogAppendInfo` should return if it was rolled because of an records append. A log is rolled when a new segment is created. This is used by `KafkaMetadataLog` to in some cases delete the created segment based on the log start offset.

Changes to `KafkaMetadataLog`:
1. Update both append functions to delete old segments based on the log start offset whenever the log is rolled.
2. Update `lastFetchedEpoch` to return the epoch of the latest snapshot whenever the log is empty.
3. Add a function that empties the log whenever the latest snapshot is greater than the replicated log. This is used when first loading the `KafkaMetadataLog` and whenever the `KafkaRaftClient` downloads a snapshot from the leader.

Changes to `KafkaRaftClient`:
1. Improve `validateFetchOffsetAndEpoch` so that it can handle fetch offset and last fetched epoch that are smaller than the log start offset. This is in addition to the existing code that check for a diverging log. This is used by the raft client to determine if the Fetch response should include a diverging epoch or a snapshot id. 
2. When a follower finishes fetching a snapshot from the leader fully truncate the local log.
3. When polling the current state the raft client should check if the state machine has generated a new snapshot and update the log start offset accordingly.

Reviewers: Jason Gustafson <jason@confluent.io>
2021-01-29 14:06:01 -08:00
Alok Nikhil 27a998e8a0
KAFKA-12237; Support lazy initialization of quorum voter addresses (#9985)
With KIP-595, we previously expect `RaftConfig` to specify the quorum voter endpoints upfront on startup. In the general case, this works fine. However, for testing where the bound port is not known ahead of time, we need a lazier approach that discovers the other voters in the quorum after startup. 

In this patch, we take the voter endpoint initialization out of `KafkaRaftClient.initialize` and move it to `RaftManager`. We use a special address to indicate that the voter addresses will be provided later This approach also lends itself well to future use cases where we might discover voter addresses through an external service (for example).

Reviewers: Jason Gustafson <jason@confluent.io>
2021-01-28 17:14:56 -08:00
dengziming a26db2a1ec
KAFKA-10694; Implement zero copy for FetchSnapshot (#9819)
This patch adds zero-copy support for the `FetchSnapshot` API. Unlike the normal `Fetch` API, records are not assumed to be offset-aligned in `FetchSnapshot` responses. Hence this patch introduces a new `UnalignedRecords` type which allows us to use most of the existing logic to support zero-copy while preserving type safety in the snapshot APIs.

Reviewers: José Armando García Sancio <jsancio@gmail.com>, Jason Gustafson <jason@confluent.io>
2021-01-26 10:33:36 -08:00
Ismael Juma 6f8ca66127
MINOR: Tag `RaftEventSimulationTest` as `integration` and tweak it (#9925)
The test takes over 1 minute to run, so it should not be considered a
unit test.

Also:
* Replace `test` prefix with `check` prefix for helper methods. A common
mistake is to forget to add the @Test annotation, so it's good to use a
different naming convention for methods that should have the annotation
versus methods that should not.
* Replace `Action` functional interface with built-in `Runnable`.
* Remove unnecessary `assumeTrue`.
* Remove `@FunctionalInterface` from `Invariant` since it's not used
in that way.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2021-01-23 15:57:25 -08:00
Jason Gustafson 9689a313f5
MINOR: Drop enable.metadata.quorum config (#9934)
The primary purpose of this patch is to remove the internal `enable.metadata.quorum` configuration. Instead, we rely on `process.roles` to determine if the self-managed quorum has been enabled. As a part of this, I've done the following:

1. Replace the notion of "disabled" APIs with "controller-only" APIs. We previously marked some APIs which were intended only for the KIP-500 as "disabled" so that they would not be unintentionally exposed. For example, the Raft quorum APIs were disabled. Marking them as "controller-only" carries the same effect, but makes the intent that they should be only exposed by the KIP-500 controller clearer.
2. Make `ForwardingManager` optional in `KafkaServer` and `KafkaApis`. Previously we used `null` if forwarding was enabled and relied on the metadata quorum check.
3. Make `zookeeper.connect` an optional configuration if `process.roles` is defined.
4. Update raft README to remove reference to `zookeeper.conntect`

Reviewers: Colin Patrick McCabe <cmccabe@confluent.io>, Boyang Chen <boyang@confluent.io>
2021-01-21 15:16:15 -08:00
Alok Nikhil fea2f65929
MINOR: Import RaftConfig config definitions into KafkaConfig (#9916)
This patch moves Raft config definitions from `RaftConfig` to `KafkaConfig`, where they are re-defined as internal configs until we are ready to expose them. It also adds the missing "controller" prefix that was added by KIP-631.

Reviewers: Jason Gustafson <jason@confluent.io>
2021-01-21 10:26:23 -08:00
Jason Gustafson 7ac06065f1
KAFKA-12161; Support raft observers with optional id (#9871)
We would like to be able to use `KafkaRaftClient` for tooling/debugging use cases. For this, we need the localId to be optional so that the client can be used more like a consumer. This is already supported in the `Fetch` protocol by setting `replicaId=-1`, which the Raft implementation checks for. We just need to alter `QuorumState` so that the `localId` is optional. The main benefit of doing this is that it saves tools the need to generate an arbitrary id (which might cause conflicts given limited Int32 space) and it lets the leader avoid any local state for these observers (such as `ReplicaState` inside `LeaderState`).

Reviewers: Ismael Juma <ismael@juma.me.uk>, Boyang Chen <boyang@confluent.io>
2021-01-15 14:10:17 -08:00
Alok Nikhil c49f660c62
MINOR: Initialize QuorumState lazily in RaftClient.initialize() (#9881)
It is helpful to delay initialization of the `RaftClient` configuration including the voter string until after construction. This helps in integration test cases where the voter ports may not be known until sockets are bound.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2021-01-14 17:31:54 -08:00
CHUN-HAO TANG 2996642566
MINOR: Fix error message in SnapshotWriter.java (#9862)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2021-01-13 21:49:11 +08:00
Jason Gustafson f62c2b26cc
MINOR: Factor `RaftManager` out of `TestRaftServer` (#9839)
This patch factors out a `RaftManager` class from `TestRaftServer` which will be needed when we integrate this layer into the server. This class encapsulates the logic to build `KafkaRaftClient` as well as its IO thread. 

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2021-01-11 09:28:12 -08:00
José Armando García Sancio 2023aed59d
KAFKA-10427: Fetch snapshot API (#9553)
Implements the code necessary for the leader to response to fetch snapshot requests and for the follower to fetch snapshots. This API is described in more detail in KIP-630: https://cwiki.apache.org/confluence/display/KAFKA/KIP-630%3A+Kafka+Raft+Snapshot.  More specifically, this patch includes the following changes:

Leader Changes:
1. Raft leader response to FetchSnapshot request by reading the local snapshot and sending the requested bytes in the response. This implementation currently copies the bytes to memory. This will be fixed in a future PR.

Follower Changes:
1. Raft followers will start fetching snapshot if the leader sends a Fetch response that includes a SnapshotId.

2. Raft followers send FetchSnapshot requests if there is a pending download. The same timer is used for both Fetch and FetchSnapshot requests.

3. Raft follower handle FetchSnapshot responses by comping the bytes to the pending SnapshotWriter. This implementation doesn't fix the replicated log after the snapshot has been downloaded. This will be implemented in a future PR.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-12-28 18:37:08 -08:00
vamossagar12 d5151f6f09
KAFKA-10828; Replacing endorsing with acknowledging for voters (#9737)
This PR replaces the terms endorsing with acknowledging for voters which have recognised the current leader.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-12-22 10:05:07 -08:00
Jason Gustafson eb9fe411bb
KAFKA-10842; Use `InterBrokerSendThread` for raft's outbound network channel (#9732)
This patch contains the following improvements:

- Separate inbound/outbound request flows so that we can open the door for concurrent inbound request handling
- Rewrite `KafkaNetworkChannel` to use `InterBrokerSendThread` which fixes a number of bugs/shortcomings
- Get rid of a lot of boilerplate conversions in `KafkaNetworkChannel` 
- Improve validation of inbound responses in `KafkaRaftClient` by checking correlationId. This fixes a bug which could cause an out of order Fetch to be applied incorrectly.

Reviewers: David Arthur <mumrah@gmail.com>
2020-12-21 18:15:15 -08:00
dengziming 125d5ea0fb
KAFKA-10677; Complete fetches in purgatory immediately after resigning (#9639)
This patch adds logic to complete fetches immediately after resigning by returning the BROKER_NOT_AVAILABLE error. This ensures that the new election cannot be delayed by fetches which are stuck in purgatory. 

Reviewers: Jason Gustafson <jason@confluent.io>
2020-12-10 09:25:04 -08:00
Jason Gustafson a8b668b37c
KAFKA-10826; Ensure raft io thread respects linger timeout (#9716)
When there are no pending operations, the raft IO thread can block indefinitely waiting for a network event. We rely on asynchronous wakeups in order to break the blocking wait in order to respond to a scheduled append. The current logic already does this, but only for the case when the linger time has been completed during the call to `scheduleAppend`. It is possible instead that after making one call to `scheduleAppend` to start the linger timer, the application does not do any additional appends. In this case, we still need the IO thread to wakeup when the linger timer expires. This patch fixes the problem by ensuring that the IO thread gets woken up after the first append which begins the linger timer.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-12-09 13:33:28 -08:00
vamossagar12 99b5e4f4ab
KAFKA-10634; Adding LeaderId to voters list in LeaderChangeMessage along with granting voters (#9539)
This patch ensures that the leader is included among the voters in the `LeaderChangeMessage`. It also adds an additional field for the set of granting voters, which was originally specified in KIP-595.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2020-12-08 17:37:48 -08:00
dengziming 3e5a22cefa
KAFKA-10756; Add missing unit test for `UnattachedState` (#9635)
This patch adds a unit test for `UnattachedState`, similar to `ResignedStateTest` and `VotedStateTest`.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-12-08 10:27:11 -08:00
José Armando García Sancio ab0807dd85
KAFKA-10394: Add classes to read and write snapshot for KIP-630 (#9512)
This PR adds support for generating snapshot for KIP-630.

1. Adds the interfaces `RawSnapshotWriter` and `RawSnapshotReader` and the implementations `FileRawSnapshotWriter` and `FileRawSnapshotReader` respectively. These interfaces and implementations are low level API for writing and reading snapshots. They are internal to the Raft implementation and are not exposed to the users of `RaftClient`. They operation at the `Record` level. These types are exposed to the `RaftClient` through the `ReplicatedLog` interface.

2. Adds a buffered snapshot writer: `SnapshotWriter<T>`. This type is a higher-level type and it is exposed through the `RaftClient` interface. A future PR will add the related `SnapshotReader<T>`, which will be used by the state machine to load a snapshot.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-12-07 14:06:25 -08:00
Jason Gustafson f49c6c203f
KAFKA-10661; Add new resigned state for graceful shutdown/initialization (#9531)
When initializing the raft state machine after shutting down as a leader, we were previously entering the "unattached" state, which means we have no leader and no voted candidate. This was a bug because it allowed a reinitialized leader to cast a vote for a candidate in the same epoch that it was already the leader of. This patch fixes the problem by introducing a new "resigned" state which allows us to retain the leader state so that we cannot change our vote and we will not accept additional appends.

This patch also revamps the shutdown logic to make use of the new "resigned" state. Previously we had a separate path in `KafkaRaftClient.poll` for the shutdown logic which resulted in some duplication. Instead now we incorporate shutdown behavior into each state's respective logic.

Finally, this patch changes the shutdown logic so that `EndQuorumEpoch` is only sent by resigning leaders. Previously we allowed this request to be sent by candidates as well.

Reviewers: dengziming <dengziming1993@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
2020-11-09 12:52:28 -08:00
Jason Gustafson 21a65e1043
KAFKA-10632; Raft client should push all committed data to state machines (#9482)
In #9418, we add a listener to the `RaftClient` interface. In that patch, we used it only to send commit notifications for writes from the leader. In this PR, we extend the `handleCommit` API to accept all committed data and we remove the pull-based `read` API. Additionally, we add two new callbacks to the listener interface in order to notify the state machine when the raft client has claimed or resigned leadership.

Finally, this patch allows the `RaftClient` to support multiple listeners. This is necessary for KIP-500 because we will have one listener for the controller role and one for the broker role.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Boyang Chen <boyang@confluent.io>
2020-11-02 15:06:58 -08:00
dengziming b4100d4b28
KAFKA-10644; Fix VotedToUnattached test error (#9503)
This patch fixes a test a test case in `QuorumStateTest`. The method name is "testVotedToUnattachedHigherEpoch," but the code initialized in the unattached state instead of the voted state.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-10-27 16:41:00 -07:00
Jason Gustafson 927edfece3
KAFKA-10601; Add support for append linger to Raft implementation (#9418)
The patch adds `quorum.append.linger.ms` behavior to the raft implementation. This gives users a powerful knob to tune the impact of fsync.  When an append is accepted from the state machine, it is held in an accumulator (similar to the producer) until the configured linger time is exceeded. This allows the implementation to amortize fsync overhead at the expense of some write latency.

The patch also improves our methodology for testing performance. Up to now, we have relied on the producer performance test, but it is difficult to simulate expected controller loads because producer performance is limited by other factors such as the number of producer clients and head-of-line blocking. Instead, this patch adds a workload generator which runs on the leader after election.

Finally, this patch brings us nearer to the write semantics expected by the KIP-500 controller. It makes the following changes:

- Introduce `RecordSerde<T>` interface which abstracts the underlying log implementation from `RaftClient`. The generic type is carried over to `RaftClient<T>` and is exposed through the read/write APIs.
- `RaftClient.append` is changed to `RaftClient.scheduleAppend` and returns the last offset of the expected log append.
- `RaftClient.scheduleAppend` accepts a list of records and ensures that the full set are included in a single batch.
- Introduce `RaftClient.Listener` with a single `handleCommit` API which will eventually replace `RaftClient.read` in order to surface committed data to the controller state machine. Currently `handleCommit` is only used for records appended by the leader.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Guozhang Wang <wangguoz@gmail.com>
2020-10-27 12:10:13 -07:00
José Armando García Sancio 94820ca652
MINOR: Refactor RaftClientTest to be used by other tests (#9476)
There is a lot of functionality in KafkaRaftClientTest that is useful for writing other tests. Refactor that functionality into another class that can be reused in other tests.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-10-22 18:14:27 -07:00
Ismael Juma 8acbd85e1d
MINOR: Update raft/README.md and minor RaftConfig tweaks (#9484)
* Replace quorum.bootstrap.servers and quorum.bootstrap.voters with
quorum.voters.
* Remove seemingly unused `verbose` config.
* Use constant to avoid unnecessary repeated concatenation.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-10-22 17:51:05 -07:00
dengziming 42ce00fdd6
MINOR: refactor CandidateState.unrecordedVoters (#9442)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2020-10-22 16:36:41 +08:00
Jason Gustafson a72f0c1eac
KAFKA-10533; KafkaRaftClient should flush log after appends (#9352)
This patch adds missing flush logic to `KafkaRaftClient`. The initial flushing behavior is simplistic. We guarantee that the leader will not replicate above the last flushed offset and we guarantee that the follower will not fetch data above its own flush point. More sophisticated flush behavior is proposed in KAFKA-10526.

We have also extended the simulation test so that it covers flush behavior. When a node is shutdown, all unflushed data is lost. We were able to confirm that the monotonic high watermark invariant fails without the added `flush` calls.

This patch also piggybacks a fix to the `TestRaftServer` implementation. The initial check-in contained a bug which caused `RequestChannel` to fail sending responses because the disabled APIs did not have metrics registered. As a result of this, it is impossible to elect leaders.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-10-13 08:59:02 -07:00
Jason Gustafson 05f9803d72
KAFKA-10527; Voters should not reinitialize as leader in same epoch (#9348)
One of the invariants that the raft replication protocol relies on is that each record is uniquely identified by leader epoch and offset. This can be violated if a leader remains elected with the same epoch between restarts since unflushed data could be lost.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-10-06 10:11:56 -07:00
Guozhang Wang 5146c5a6cb
MINOR: Update doc for raft state metrics (#9342)
Reviewers:  Jason Gustafson <jason@confluent.io>
2020-10-05 14:52:34 -07:00
Jason Gustafson dbe3e4a4cc
KAFKA-10511; Ensure monotonic start epoch/offset updates in `MockLog` (#9332)
There is a minor difference in behavior between the epoch caching logic in `MockLog` from the behavior in `LeaderEpochFileCache`. The latter ensures that every new epoch/start offset entry added to the cache increases monotonically over the previous entries. This patch brings the behavior of `MockLog` in line. 

It also simplifies the `assignEpochStartOffset` api in `ReplicatedLog`. We always intend to use the log end offset, so this patch removes the start offset parameter.

Reviewers: Boyang Chen <boyang@confluent.io>
2020-09-28 17:16:55 -07:00
Jason Gustafson ac8acec653
KAFKA-10519; Add missing unit test for `VotedState` (#9337)
Add a simple unit test for `VotedState`. 

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-09-25 09:12:56 -07:00
Ismael Juma 51957de806
MINOR: Use JUnit 5 in raft module (#9331)
I also removed a test class with no tests currently (Jason filed KAFKA-10519 for
filling the test gap).

Reviewers: Jason Gustafson <jason@confluent.io>
2020-09-24 02:37:17 -07:00
Jason Gustafson b7c8490cf4
KAFKA-10492; Core Kafka Raft Implementation (KIP-595) (#9130)
This is the core Raft implementation specified by KIP-595: https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum. We have created a separate "raft" module where most of the logic resides. The new APIs introduced in this patch in order to support Raft election and such are disabled in the server until the integration with the controller is complete. Until then, there is a standalone server which can be used for testing the performance of the Raft implementation. See `raft/README.md` for details.

Reviewers: Guozhang Wang <wangguoz@gmail.com>, Boyang Chen <boyang@confluent.io>

Co-authored-by: Boyang Chen <boyang@confluent.io>
Co-authored-by: Guozhang Wang <wangguoz@gmail.com>
2020-09-22 11:32:44 -07:00