Migrates existing connect tests that were using Zookeeper to use KRaft
instead, and cleans up some dead ZK code. For broker compatibility tests,
tests for versions 2.1-2.3 still need to use ZK.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
When loading transaction metadata from a transaction log partition, if the partition contains a segment ending with an empty batch, "currOffset" update logic at will be skipped for the last batch. Since "currOffset" is not advanced to next offset of last batch properly, TransactionStateManager.loadTransactionMetadata method will be stuck in the "while" loop.
This change fixes the issue by updating "currOffset" after processing each batch, whether the batch is empty or not.
Reviewers: Justine Olshan <jolshan@confluent.io>, Jun Rao <junrao@gmail.com>
Implementation of KIP-1076 to allow for adding client application metrics to the KIP-714 framework
Reviewers: Apoorv Mittal <amittal@confluent.io>, Andrew Schofield <aschofield@confluent.io>, Matthias Sax <mjsax@apache.org>
This is part one of a multi-pr effort to convert Kafka Streams system tests to KRaft. I decided to break down the changes into multiple PRs to reduce the review load
Reviewers: Matthias Sax <mjsax@apache.org>
We should only call once `maybeSendResponseCallback` for each marker during the WriteTxnMarkersRequest handling.
Consider the following 2 cases:
First
We have 2 markers to append, one for producer-0, one for producer-1
When we first process producer-0, it appends a marker to the __consumer_offset.
The __consumer_offset append finishes very fast because the group coordinator is no longer the leader. So the coordinator directly returns NOT_LEADER_OR_FOLLOWER. In its callback, it calls the maybeComplete() for the first time, and because there is only one partition to append, it is able to go further to call maybeSendResponseCallback() and decrement numAppends.
Then it calls the replica manager append for nothing, in the callback, it calls the maybeComplete() for the second time. This time, it also decrements numAppends.
Second
We have 2 markers to append, one for producer-0, one for producer-1
When we first process producer-0, it appends a marker to the __consumer_offset and a data topic foo.
The 2 appends will be handled by group coordinator and replica manager asynchronously.
It can be a race that, both appends finishes together, then they can fill the `markerResults` at the same time, then call the `maybeComplete`. Because the `partitionsWithCompatibleMessageFormat.size == markerResults.size` condition is satisfied, both `maybeComplete` calls can go through to decrement the `numAppends` and cause a premature response.
Note: the problem only happens with KIP-848 coordinator enabled.
Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>, David Jacot <djacot@confluent.io>
Currently in ShareConsumeRequestManager, after we receive a ShareFetchResponse, if the subscription changes before we acknowledge(via ShareFetch), then we do not acknowledge the records which are not part of the updated subscription. Instead we must acknowledge all the records that we had received irrespective of the current subscription.
This bug is only when we are acknowledging via ShareFetch where we use SubscriptionState::fetchablePartitions to obtain the partitions to fetch. In ShareAcknowledge, as we are getting the partitions from the active share sessions, even if the subscription changed, the session would remain active.
Reviewers: Andrew Schofield <aschofield@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
This PR removes ZK test parameterizations from ducktape by:
- Removing zk from quorum.all_non_upgrade
- Removing quorum.zk from @matrix and @parametrize annotations
- Changing usages of quorum.all to quorum.all_kraft
- Deleting message_format_change_test.py
The default metadata_quorum value still needs to be changed to KRaft rather than ZK, but this will be done in a follow-up PR.
Reviewers: Kirk True <kirk@kirktrue.pro>, Colin P. McCabe <cmccabe@apache.org>
This patch adds a data structure to ConsumerGroup to track the number of members subscribed to each regular expressions in the group. This will be useful to know whether a regex is new in the group or whether a regex must be removed from the group.
Reviewers: Jeff Kim <jeff.kim@confluent.io>, Lianet Magrans <lmagrans@confluent.io>
This patch does two things:
1) Change the validation of the ConsumerGroupHeartbeat request to accept subscribed topic names and/or subscribed topic regex. At least of them must be set in the first request with epoch 0.
2) Validate the provided regular expression by compiling it.
Co-authored-by: Lianet Magrans <lmagrans@confluent.io>
Reviewers: Jeff Kim <jeff.kim@confluent.io>, Lianet Magrans <lmagrans@confluent.io>
Updates KafkaStreams.clientInstanceIds method to correctly populate the client-id -> clientInstanceId map that was altered in a previous refactoring.
Added a test that confirms ClientInstanceIds is correctly storing consumer and producer instance ids
Reviewers: Matthias Sax <mjsax@apache.org>
This patch includes changes to the clients end transaction response handling when transaction version 2 is enabled.
Version 5+ of the End Txn Response includes the producer Id and the producer epoch fields.
Upon receiving the request, the client updates its producer Id and epoch according to the response.
On receiving an EndTxnRequest the server would've either:
Bumped the epoch for the given producer ID.
On epoch overflow, sent a new producer Id with epoch 0.
This patch also includes changes to the endTxnRequest to send the right request version based on whether txnV2 is enabled.
There was a test failure in the integration tests that allowed us to catch a bug in the PrepareComplete method where we update the transit metadata incorrectly. Added the bug fix in this patch where the lastProducerEpoch is updated correctly.
Reviewers: Artem Livshits <alivshits@confluent.io>, Justine Olshan <jolshan@confluent.io>
Add a new "load-catalog" job to the workflow. This job will checkout the test-catalog branch at 7 days prior and generate a text file of all the tests that were known at that time. This file is then passed down to the two parallel "test" jobs to be used as a source of data for the quarantined test behavior.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
We fail the entire CreateTopicsRequest action if there are more than 10k total
partitions being created in this topic for this specific request. The usual pattern for
this API to try and succeed with some topics. Since the 10k limit applies to all topics
then no topic should be created if they all exceede it.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
The PR integrates leader epoch for partition while invoking Persister APIs. The write RPC is retried once on leader epoch failure.
Reviewers: Abhinav Dixit <adixit@confluent.io>, Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>, David Arthur <mumrah@gmail.com>
Using the last 7 days of data on Oct 30 2024, this patch marks all flaky tests with more than 10% flakiness on trunk.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
This patch adds a CI job to store our test catalog in an orphaned branch named "test-catalog" within this repo.
This data will be used to help determine which tests should be quarantined.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>