As part of KIP-1022, I have created an interface for all the new features to be used when parsing the command line arguments, doing validations, getting default versions, etc.
I've also added the --feature flag to the storage tool to show how it will be used.
Created a TestFeatureVersion to show an implementation of the interface (besides MetadataVersion, which is unique) and added tests using this new test feature.
I will add the unstable config and tests in a followup.
Reviewers: David Mao <dmao@confluent.io>, David Jacot <djacot@confluent.io>, Artem Livshits <alivshits@confluent.io>, Jun Rao <junrao@apache.org>
Rate reports values in the form sumOrCount/monitoredWindowSize. There is a bug in the monitoredWindowSize calculation, which leads to spikes in the reported values.
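For illustration, a quick numeric sketch of why an understated window produces spikes; this is only the reported formula, not the actual Rate implementation, and the numbers are made up:

    // The reported rate is sumOrCount / monitoredWindowSize, so a window size
    // computed too small turns the same sum into a much larger rate.
    public final class RateSpikeIllustration {
        public static void main(String[] args) {
            double sum = 1_000.0;             // events recorded in the sampled windows
            double fullWindowMs = 30_000.0;   // intended monitored window (e.g. one 30s sample)
            double shrunkWindowMs = 500.0;    // window size from a buggy calculation

            System.out.println("expected rate: " + sum / fullWindowMs);   // ~0.033 per ms
            System.out.println("spiky rate:    " + sum / shrunkWindowMs); // 2.0 per ms
        }
    }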
Reviewers: Jun Rao <junrao@gmail.com>
Implements KIP-1036.
Add raw ConsumerRecord data to RecordDeserialisationException to make DLQ implementation easier.
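A hedged sketch of the DLQ pattern this enables; the keyBuffer()/valueBuffer()/headers() accessors follow KIP-1036 but should be treated as illustrative here, and the topic name is hypothetical:

    import java.nio.ByteBuffer;
    import java.time.Duration;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.RecordDeserializationException;

    final class DlqPollLoopSketch {
        static void pollOnce(Consumer<String, String> consumer,
                             Producer<byte[], byte[]> dlqProducer,
                             String dlqTopic) {
            try {
                consumer.poll(Duration.ofMillis(100));
            } catch (RecordDeserializationException e) {
                // keyBuffer()/valueBuffer()/headers() are assumed KIP-1036 accessors.
                dlqProducer.send(new ProducerRecord<>(dlqTopic, null,
                        toBytes(e.keyBuffer()), toBytes(e.valueBuffer()), e.headers()));
                // Skip past the record that failed to deserialize and keep consuming.
                consumer.seek(e.topicPartition(), e.offset() + 1);
            }
        }

        private static byte[] toBytes(ByteBuffer buffer) {
            if (buffer == null) return null;
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        }
    }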
Reviewers: Kirk True <ktrue@confluent.io>, Andrew Schofield <aschofield@confluent.io>, Matthias J. Sax <matthias@confluent.io>
Handle FencedInstanceIdException that a consumer may receive in the heartbeat response. This will be the case when a static consumer is removed from the group by an admin client, and another member joins with the same group.instance.id (and is allowed in). The first member will receive a FencedInstanceId on its next heartbeat. The expectation is that this should be handled as a fatal error.
There are no actual changes in logic with this PR: without being handled explicitly, the FencedInstanceId was treated as an "unexpected error", and all unexpected errors are treated as fatal, so the outcome remains the same. We're introducing this small change for accuracy in the logic and the logs: FencedInstanceId is expected during heartbeat, so a log line now describes the situation and why it happened (and it's treated as a fatal error, just as it was before this PR).
This PR also improves the test to ensure that the error propagated to the app thread matches the one received in the HB.
Reviewers: Andrew Schofield <aschofield@confluent.io>, David Jacot <djacot@confluent.io>
The intention of the CompletableApplicationEvent is for a Consumer to enqueue the event and then block, waiting for it to complete. The application thread will block for up to the amount of the timeout. This change introduces a consistent manner in which events are expired by checking their timeout values.
The CompletableEventReaper is a new class that tracks CompletableEvents that are enqueued. Both the application thread and the network I/O thread maintain their own reaper instances. The application thread will track any CompletableBackgroundEvents that it receives and the network I/O thread will do the same with any CompletableApplicationEvents it receives. The application and network I/O threads will check their tracked events, and if any are expired, the reaper will invoke each event's CompletableFuture.completeExceptionally() method with a TimeoutException.
On closing the AsyncKafkaConsumer, both threads will invoke their respective reapers to cancel any unprocessed events in their queues. In this case, the reaper will invoke each event's CompletableFuture.completeExceptionally() method with a CancellationException instead of a TimeoutException to differentiate the two cases.
The overall design for the expiration mechanism is captured on the Apache wiki and the original issue (KAFKA-15848) has more background on the cause.
Note: this change only handles the event expiration and does not cover the network request expiration. That is handled in a follow-up Jira (KAFKA-16200) that builds atop this change.
This change also includes some minor refactoring of the EventProcessor and its implementations. This allows the event processor logic to focus on processing individual events rather than also the handling of batches of events.
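For illustration, a minimal sketch of the reaping idea; this is not the actual CompletableEventReaper, only the mechanism described above in simplified form:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.concurrent.CancellationException;
    import java.util.concurrent.CompletableFuture;
    import org.apache.kafka.common.errors.TimeoutException;

    // Each thread tracks the completable events it owns, fails them with a
    // TimeoutException once their deadline passes, and cancels the rest on close.
    final class EventReaperSketch {
        private static final class Tracked {
            final CompletableFuture<?> future;
            final long deadlineMs;
            Tracked(CompletableFuture<?> future, long deadlineMs) {
                this.future = future;
                this.deadlineMs = deadlineMs;
            }
        }

        private final List<Tracked> tracked = new ArrayList<>();

        void add(CompletableFuture<?> future, long deadlineMs) {
            tracked.add(new Tracked(future, deadlineMs));
        }

        // Called periodically from the owning thread's run loop.
        void reapExpired(long nowMs) {
            Iterator<Tracked> it = tracked.iterator();
            while (it.hasNext()) {
                Tracked t = it.next();
                if (t.future.isDone()) {
                    it.remove();                   // completed normally; stop tracking
                } else if (nowMs >= t.deadlineMs) {
                    t.future.completeExceptionally(new TimeoutException("Event expired"));
                    it.remove();
                }
            }
        }

        // Called when the consumer closes: cancel anything still outstanding.
        void reapAll() {
            tracked.forEach(t -> t.future.completeExceptionally(new CancellationException()));
            tracked.clear();
        }
    }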
Reviewers: Lianet Magrans <lianetmr@gmail.com>, Philip Nee <pnee@confluent.io>, Bruno Cadonna <cadonna@apache.org>
When client telemetry is configured in a cluster, Kafka producers and consumers push metrics to the brokers periodically. There is a special push of metrics that occurs when the client is terminating. A state machine in the client telemetry reporter controls its behaviour in different states.
Sometimes, when a client was terminating, it attempted an invalid state transition from TERMINATING_PUSH_IN_PROGRESS to PUSH_NEEDED when it received a response to a PushTelemetry RPC. This was essentially harmless because the state transition did not occur, but it did cause unsightly log lines to be generated. This PR adds a check for the terminating states when receiving the response and simply remains in the current state.
I added a test to validate the state management in this case. Actually, the test passes before the code change in the PR, but with unsightly log lines.
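A hedged sketch of the guard described above; the TERMINATING_PUSH_IN_PROGRESS and PUSH_NEEDED names come from this description, everything else is illustrative rather than the real reporter state machine:

    final class TelemetryStateSketch {
        enum State { PUSH_NEEDED, PUSH_IN_PROGRESS, TERMINATING_PUSH_IN_PROGRESS, TERMINATED }

        static State onPushTelemetryResponse(State current) {
            if (current == State.TERMINATING_PUSH_IN_PROGRESS || current == State.TERMINATED) {
                return current;            // terminating: stay put instead of attempting PUSH_NEEDED
            }
            return State.PUSH_NEEDED;      // normal case: schedule the next periodic push
        }
    }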
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Apoorv Mittal <amittal@confluent.io>
The KafkaAsyncConsumer would occasionally fail to stop when wakeup() was invoked. It turns out that there's a race condition between the thread that invokes wakeup() and the thread that is performing an action on the consumer. If the operation's Future is already completed by thread A when thread B invokes completeExceptionally() inside wakeup(), the WakeupException will be ignored. We should use the return value from completeExceptionally() to determine whether that call actually triggered completion of the Future. If that method returns false, it signals that the Future was already completed and the exception we passed to completeExceptionally() was ignored. Therefore, we then need to return a new WakeupFuture instead of null so that the next call to setActiveTask() will throw the WakeupException.
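A minimal sketch of the idea, simplified from the real WakeupTrigger logic (a boolean stands in for the WakeupFuture): rely on the boolean returned by completeExceptionally() to detect whether the wakeup actually landed.

    import java.util.concurrent.CompletableFuture;
    import org.apache.kafka.common.errors.WakeupException;

    final class WakeupSketch {
        private CompletableFuture<?> activeTask;    // the operation currently in progress, if any
        private boolean pendingWakeup;              // a wakeup that has not yet been delivered

        synchronized void wakeup() {
            if (activeTask == null) {
                pendingWakeup = true;
            } else if (!activeTask.completeExceptionally(new WakeupException())) {
                // The task had already completed, so the exception was ignored;
                // remember the wakeup for the next operation instead of dropping it.
                pendingWakeup = true;
            }
        }

        synchronized void setActiveTask(CompletableFuture<?> task) {
            if (pendingWakeup) {
                pendingWakeup = false;
                throw new WakeupException();
            }
            activeTask = task;
        }
    }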
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
This PR fixes two kinds of bugs in the new(ish) rack-aware part of the sticky assignment algorithm:
First, when reassigning "owned partitions" to their previous owners, we now have to take the rack placement into account and might not immediately assign a previously-owned partition to its old consumer during this phase. There is a small chance this partition will be assigned to its previous owner during a later stage of the assignment, but if it's not then by definition it has been "revoked" and must be removed from the assignment during the adjustment phase of the CooperativeStickyAssignor according to the cooperative protocol. We need to make sure any partitions removed in this way end up in the "partitionsTransferringOwnership".
Second, the sticky algorithm works in part by keeping track of how many consumers are still "unfilled", meaning they are at the "minQuota" and may need to be assigned one more partition to reach the expected number of consumers at the "maxQuota". During the rack-aware round-robin assignment phase, we were not properly clearing the set of unfilled minQuota consumers once we reached the expected number of "maxQuota" consumers (by definition, no more minQuota consumers need to, or can, be given more partitions at that point, since doing so would bump them up to maxQuota and exceed the expected count). This bug would cause the entire assignment to fail due to a correctness check at the end, which verifies that the "unfilled members" set is empty before returning the assignment. An IllegalStateException would be thrown, failing the rebalance and sending the group into an endless rebalancing loop until/unless it was lucky enough to produce a new assignment that didn't hit this bug.
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
Validate that a control batch in the batch accumulator has at least one control record.
Reviewers: Jun Rao <junrao@apache.org>, Chia-Ping Tsai <chia7712@apache.org>
Improve the consumer log for an expired poll timer by showing by how much max.poll.interval.ms was exceeded. This should be helpful in guiding the user to tune that config in the common case of long-running processing causing the consumer to leave the group. Inspired by other clients that log such information in the same situation.
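A hedged illustration of the kind of log line described above; the message text, class, and method are illustrative, not the exact consumer code:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    final class PollTimerLogSketch {
        private static final Logger log = LoggerFactory.getLogger(PollTimerLogSketch.class);

        static void logExpiredPollTimer(long timeSinceLastPollMs, long maxPollIntervalMs) {
            // Report both the elapsed time and how far past the limit it went.
            log.warn("Consumer poll timeout has expired. The last poll() was {} ms ago, " +
                     "exceeding max.poll.interval.ms by {} ms. Consider increasing " +
                     "max.poll.interval.ms or reducing max.poll.records.",
                     timeSinceLastPollMs, timeSinceLastPollMs - maxPollIntervalMs);
        }
    }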
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>, Matthias Sax <mjsax@apache.org>,
Andrew Schofield <andrew_schofield@live.com>, Kirk True <kirk@kirktrue.pro>
Fix to allow initializing positions for newly assigned partitions while the onPartitionsAssigned callback is running, even though the partitions remain non-fetchable until the callback completes.
Before this PR, we were not allowing initialization or fetching while the callback was running. The fix only allows initializing the newly assigned partition's position, and keeps the existing logic that ensures the partition remains non-fetchable until the callback completes.
The need for this fix came out of one of the Connect system tests, which attempts to retrieve a newly assigned partition's position with a call to consumer.position from within the onPartitionsAssigned callback (WorkerSinkTask). With this PR, we allow such calls (test added), which is the behaviour of the legacy consumer.
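A hedged sketch of the usage pattern the fix re-enables, mirroring what WorkerSinkTask does; the listener class here is illustrative:

    import java.util.Collection;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.common.TopicPartition;

    final class PositionOnAssignListener implements ConsumerRebalanceListener {
        private final Consumer<?, ?> consumer;

        PositionOnAssignListener(Consumer<?, ?> consumer) {
            this.consumer = consumer;
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            for (TopicPartition tp : partitions) {
                // With this fix the position can be initialized here, even though the
                // partition stays non-fetchable until the callback completes.
                long position = consumer.position(tp);
                System.out.println("Assigned " + tp + " at position " + position);
            }
        }

        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
    }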
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
The contract of KafkaConsumer.poll(Duration) says that it throws InterruptException "if the calling thread is interrupted before or while this function is called". The new KafkaConsumer implementation was not doing this if the thread was interrupted before the poll was called, specifically with a very short timeout. If it ever waited for records, it did check the thread state. If it did not wait for records because of a short timeout, it did not.
Some of the log messages in the code erroneously mentioned timeouts, when they really meant interruption.
Also adds a test for this specific scenario.
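A hedged sketch of the scenario the new test covers; the helper class is illustrative:

    import java.time.Duration;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.common.errors.InterruptException;

    final class InterruptBeforePollSketch {
        static void pollWhileInterrupted(Consumer<String, String> consumer) {
            Thread.currentThread().interrupt();     // interrupt before poll is called
            try {
                consumer.poll(Duration.ZERO);       // must throw even though it will not wait for records
                throw new AssertionError("Expected InterruptException");
            } catch (InterruptException expected) {
                // expected: the consumer converts the pending interrupt into InterruptException
            }
        }
    }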
Reviewers: Lianet Magrans <lianetmr@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
Adds support for the KafkaRaftClient to read the control records KRaftVersionRecord and VotersRecord in the snapshot and log. As the control records in the KRaft partition are read, the replica's known set of voters is updated. This change also contains the necessary changes to include the control records when a snapshot is generated by the KRaft state machine.
It is important to note that this commit changes the code and the in-memory state to track the sets of voters but it doesn't change any data that is externally exposed. It doesn't change the RPCs, data stored on disk or configuration.
When the KRaft replica starts, the PartitionListener reads the latest snapshot and then the log segments up to the LEO, updating the in-memory state as it reads KRaftVersionRecord and VotersRecord. When the replica (leader and follower) appends to the log, the PartitionListener catches up to the new LEO. When the replica truncates the log because of a diverging epoch, the PartitionListener also truncates the in-memory state to the new LEO. When the state machine generates a new snapshot, the PartitionListener trims any prefix entries that are no longer needed. This is all done to minimize the amount of data tracked in memory and to make sure that it matches the state on disk.
To implement the functionality described above this commit also makes the following changes:
Adds control records for KRaftVersionRecord and VotersRecord. KRaftVersionRecord describes the finalized kraft.version supported by all of the replicas. VotersRecord describes the set of voters at a specific offset.
Changes Kafka's feature version to support 0 as the smallest valid value. This is needed because the default value for kraft.version is 0.
Refactors FileRawSnapshotWriter so that it doesn't directly call the onSnapshotFrozen callback. It adds NotifyingRawSnapshotWriter for calling such callbacks. This reorganization is needed because in this change both the KafkaMetadataLog and the KafkaRaftClient need to react to snapshots getting frozen.
Cleans up KafkaRaftClient's initialization. Removes initialize from RaftClient - this is an implementation detail that doesn't need to be exposed in the interface. Removes RaftConfig.AddressSpec and simplifies the bootstrapping of the static voter's address. The bootstrapping of the address is delayed because of tests. We should be able to simplify this further in future commits.
Updates the DumpLogSegment CLI to support the new control records KRaftVersionRecord and VotersRecord.
Fixes the RecordsSnapshotReader implementations so that the iterator includes control records. RecordsIterator is extended to support reading the new control records.
Improves the BatchAccumulator implementation to allow multiple control records in one control batch. This is needed so that KRaft can make sure that VotersRecord is included in the same batch as the control record (KRaftVersionRecord) that upgrades the kraft.version to 1.
Adds a History interface and a default implementation, TreeMapHistory. This is used to track all of the sets of voters between the latest snapshot and the LEO. This is needed so that KafkaRaftClient can query for the latest set of voters and can include the correct set of voters when the state machine generates a new snapshot at a given offset. A sketch of the idea follows this list.
Adds a builder pattern for RecordsSnapshotWriter. The new builder also handles including the KRaftVersionRecord and VotersRecord control records in the snapshot as necessary: a KRaftVersionRecord should be appended if the kraft.version is greater than 0 at the snapshot's offset, and similarly, a VotersRecord should be appended to the snapshot with the latest value up to the snapshot's offset.
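A minimal sketch of the History idea, simplified and not the actual TreeMapHistory: track values by offset, answer "what was the latest value at or before offset X", and trim prefixes once a snapshot makes them unnecessary.

    import java.util.Map;
    import java.util.Optional;
    import java.util.TreeMap;

    final class OffsetHistorySketch<T> {
        private final TreeMap<Long, T> history = new TreeMap<>();

        void addAt(long offset, T value) {
            history.put(offset, value);
        }

        // Latest value recorded at or before the given offset, if any.
        Optional<T> valueAtOrBefore(long offset) {
            Map.Entry<Long, T> entry = history.floorEntry(offset);
            return entry == null ? Optional.empty() : Optional.of(entry.getValue());
        }

        // Drop entries strictly before startOffset, keeping the one in effect at it.
        void truncateOldEntries(long startOffset) {
            Long keep = history.floorKey(startOffset);
            if (keep != null)
                history.headMap(keep, false).clear();
        }
    }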
Reviewers: Jason Gustafson <jason@confluent.io>
The AsyncKafkaConsumer implementation of position(TopicPartition, Duration) was not updating its internal Timer, causing it to execute the loop forever. Adding a call to update the Timer at the bottom of the loop fixes the issue.
An integration test was added to catch this case; it fails without the newly added call to Timer.update(long).
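A hedged sketch of the shape of the fix, not the exact position() implementation: the Timer must be updated on every pass so the loop can actually expire.

    import java.time.Duration;
    import org.apache.kafka.common.utils.Time;
    import org.apache.kafka.common.utils.Timer;

    final class TimedLoopSketch {
        static void runUntilExpired(Time time, Duration timeout, Runnable attempt) {
            Timer timer = time.timer(timeout);
            do {
                attempt.run();                      // one attempt to refresh the position
                timer.update(time.milliseconds());  // without this the loop never exits
            } while (timer.notExpired());
        }
    }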
Reviewers: Lianet Magrans <lianetmr@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
In some cases, the network layer is very fast and can process a response and send out a follow-up request within the same millisecond timestamp. This is causing problems due to the way we determine if we already have an inflight request.
The previous logic for tracking inflight status used timestamps: if the timestamp from the last received response was less than the timestamp from the last sent request, we'd interpret that as having an inflight request. However, this approach would incorrectly return false from RequestState.requestInFlight() if the two timestamps were equal.
One result of this faulty logic is that in such cases, the consumer would accidentally send multiple heartbeat requests to the consumer group coordinator. The consumer group coordinator would interpret these requests as 'join group' requests and create members for each request. Therefore, the coordinator was under the false understanding that there were more members in the group than there really were. Consequently, if your luck was really bad, the coordinator might assign partitions to one of the duplicate members. Those partitions would be assigned to a phantom consumer that was not reading any data, and this led to flaky tests.
This change introduces a simple boolean flag in RequestState that is set in onSendAttempt and cleared in onSuccessfulAttempt, onFailedAttempt, and reset. A new unit test has been added, the change has been verified against all of the consumer unit and integration tests, and it removes all known occurrences of phantom consumer group members in the system tests.
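A minimal sketch of the flag-based approach, simplified from the real RequestState: track in-flight status explicitly rather than comparing millisecond timestamps.

    final class RequestStateSketch {
        private boolean requestInFlight;

        void onSendAttempt() {
            requestInFlight = true;       // a request has been handed to the network layer
        }

        void onSuccessfulAttempt() {
            requestInFlight = false;      // response received
        }

        void onFailedAttempt() {
            requestInFlight = false;      // request failed or timed out
        }

        void reset() {
            requestInFlight = false;
        }

        boolean requestInFlight() {
            return requestInFlight;       // no longer ambiguous when send and receive share a timestamp
        }
    }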
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Philip Nee <pnee@confluent.io>
Change the documentation of the Brokers field to make it clear that it doesn't always have all the brokers that are listed as replicas.
Reviewer: Colin P. McCabe <cmccabe@apache.org>
When user-defined rebalance listeners fail with an exception, the expectation is that the error should be propagated to the user as a KafkaException and break the poll loop (the behaviour of the legacy coordinator). The new consumer executes callbacks in the application thread and sends an event to the background thread with the callback result and error, if any, but it was not propagating the exception to the user.
Reviewers: Lianet Magrans <lianetmr@gmail.com>, Kirk True <ktrue@confluent.io>, Bruno Cadonna <cadonna@apache.org>
Fix for resetting the HB timer when the request is sent, rather than when a response is received. This ensures more accurate timing of the HB, so that a member always sends the HB on the interval (not the interval plus any delay in receiving the response).
This change, along with the logic already in place for checking in-flight requests, ensures that if the interval expires while a HB is in flight, the next HB is only sent after the response for the in-flight request is received, without waiting for another full interval. This is consistent with the timer reset and in-flight behaviour for the auto-commit interval.
Reviewers: Kirk True <ktrue@confluent.io>, Bruno Cadonna <cadonna@apache.org>
The implementation of KIP-773 deprecated the iotime-total and io-waittime-total metrics. io-ratio and io-wait-ratio were not meant to be marked as deprecated, yet they now have *Deprecated* in their descriptions. Here is the reason:
register io-ratio (desc: *Deprecated* The fraction of time ...) -> registered
register iotime-total (desc: *Deprecated* The total time ...) -> registered
register io-ratio (desc: The fraction of time ...) -> skipped, the same name already exists in registry
register io-time-ns-total (desc: The total time ...) -> registered
As a result, io-ratio has an incorrect description, and the same is true for io-wait-ratio. This PR fixes these descriptions.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
The javadoc for KafkaConsumer.commitSync says:
Note that asynchronous offset commits sent previously with the {@link #commitAsync(OffsetCommitCallback)}
(or similar) are guaranteed to have their callbacks invoked prior to completion of this method.
This is not always true in the async consumer, where there is no code at all to make sure that the callback is executed before commitSync returns.
Similarly, the async consumer is also missing logic to await callback execution in close. While the javadoc doesn't explicitly promise callback execution, it promises "completing commits", which one would reasonably expect to include callback execution. Also, the legacy consumer contains some code to execute callbacks before closing.
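For illustration, a hedged sketch of the guarantee quoted above (not a test from this PR): after this change the async consumer, like the classic one, invokes the commitAsync callback before commitSync returns.

    import java.time.Duration;
    import java.util.concurrent.atomic.AtomicBoolean;
    import org.apache.kafka.clients.consumer.Consumer;

    final class CommitOrderingSketch {
        static void commitWithCallbackOrdering(Consumer<String, String> consumer) {
            AtomicBoolean callbackInvoked = new AtomicBoolean(false);
            consumer.commitAsync((offsets, exception) -> callbackInvoked.set(true));
            consumer.commitSync(Duration.ofSeconds(30));
            if (!callbackInvoked.get())
                throw new AssertionError("commitAsync callback should run before commitSync returns");
        }
    }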
This change proposes a number of fixes to clean up the callback execution guarantees in the async consumer:
We keep track of the incomplete async commit futures and wait for them to complete before returning from commitSync or close (if there is time).
Since we need to block to make sure that our previous commits are completed, we allow the consumer to wake up.
Some similar gaps are addressed in the legacy consumer, see #15693
Testing: two new integration tests and a couple of unit tests.
Reviewers: Bruno Cadonna <cadonna@apache.org>, Kirk True <ktrue@confluent.io>, Lianet Magrans <lianetmr@gmail.com>
The PR leverages the changes defined in KIP-1019. It cleans up the reflection-based access to the KafkaMetric field and uses the method exposed by KIP-1019 to check metric measurability.
Reviewers: Andrew Schofield <aschofield@confluent.io>, Matthias J. Sax <matthias@confluent.io>
Implements KIP-1019, which exposes a method to check whether a metric is of type Measurable.
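A hedged usage sketch, assuming the accessor added by KIP-1019 is KafkaMetric#isMeasurable(): measurability can now be checked up front instead of reflecting on KafkaMetric's internals.

    import org.apache.kafka.common.metrics.KafkaMetric;
    import org.apache.kafka.common.metrics.Measurable;

    final class MeasurabilitySketch {
        // Returns the backing Measurable if the metric has one, otherwise null.
        static Measurable measurableOrNull(KafkaMetric metric) {
            return metric.isMeasurable() ? metric.measurable() : null;
        }
    }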
Reviewers: Andrew Schofield <aschofield@confluent.io>, Matthias J. Sax <matthias@confluent.io>
Add support for the DescribeTopicPartitions API to AdminClient. For this initial implementation, we simply load all of the results into memory on the client side.
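A hedged usage sketch: from the caller's perspective this surfaces through the existing Admin#describeTopics API; whether the new DescribeTopicPartitions RPC is used underneath depends on the client and broker versions.

    import java.util.Collection;
    import java.util.Map;
    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.TopicDescription;

    final class DescribeTopicsSketch {
        static void describe(Admin admin, Collection<String> topics)
                throws InterruptedException, ExecutionException {
            Map<String, TopicDescription> descriptions =
                admin.describeTopics(topics).allTopicNames().get();
            descriptions.forEach((name, description) ->
                System.out.println(name + ": " + description.partitions().size() + " partitions"));
        }
    }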
Reviewers: Andrew Schofield <aschofield@confluent.io>, Kirk True <ktrue@confluent.io>, David Jacot <djacot@confluent.io>, Artem Livshits <alivshits@confluent.io>, David Arthur <mumrah@gmail.com>
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Luke Chen <showuon@gmail.com>, Igor Soarez<soarez@apple.com>
Co-authored-by: lixinyang <nickxyli@tencent.com>
Partitions that are marked as pendingOnAssignedCallback should not be reset in resetInitializingPositions(). Pending partitions are omitted from the set returned by initializingPartitions(). As a result, the Consumer does not include them in the set of partitions for which it attempts to load committed offsets. The code used by the Consumer to reset positions (resetInitializingPositions()) should likewise skip partitions marked as pending.
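A hedged sketch of the shape of the fix, not the actual SubscriptionState code: skip partitions that are still waiting on the onPartitionsAssigned callback when resetting initializing positions, mirroring what initializingPartitions() does.

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.kafka.common.TopicPartition;

    final class ResetInitializingPositionsSketch {
        static Set<TopicPartition> partitionsToReset(Set<TopicPartition> initializing,
                                                     Set<TopicPartition> pendingOnAssignedCallback) {
            Set<TopicPartition> toReset = new HashSet<>(initializing);
            toReset.removeAll(pendingOnAssignedCallback);   // pending partitions keep their positions
            return toReset;
        }
    }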
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>