- Remove some unused Zookeeper code
- Migrate group mode transactions, security rolling upgrade, and throttling tests to using KRaft
- Add KRaft downgrade tests to kraft_upgrade_test.py
Reviewers: Colin P. McCabe <cmccabe@apache.org>
SnapshotRegistry needs to have a reference to all snapshot data structures. However, this should
not be a strong reference, but a weak reference, so that these data structures can be garbage
collected as needed. This PR also adds a scrub mechanism so that we can eventually reclaim the
slots used by GC'ed Revertable objects in the SnapshotRegistry.revertables array.
Reviewers: David Jacot <david.jacot@gmail.com>
With KAFKA-17872, we changed some internals that effects the conditions
of this test, introducing a race condition when the expected log
messages are printed.
This PR adds additional wait-conditions to the test to close the race
condition.
Reviewers: Bill Bejeck <bill@confluent.io>
This PR updates the key for storing the KIP-714 client instance id for the global consumer to follow a more consistent pattern of the other embedded Kafka Streams consumer clients.
Reviewers: Matthias Sax <mjsax@apache.org>
The Stale PRs workflow is only able to act on a relatively small number of PRs due to the API operations limit. This patch increases the limit from 100 to 500.
Reviewers: Josep Prat <josep.prat@aiven.io>, Chia-Ping Tsai <chia7712@gmail.com>
Fixes another issue introduced in #17725 where the streaming XML parser would skip over tests that followed a SKIPPED test. This caused a large number of tests to be removed from the test catalog e4a5eb8
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
During conversion from classic to consumer group, if a member has empty assignment (e.g., the member just joined and has never synced), the conversion will fail because of the buffer underflow error when deserializing the member assignment. This patch allows empty assignment while deserializing the member assignment.
Reviewers: Jeff Kim <jeff.kim@confluent.io>, David Jacot <djacot@confluent.io>
This PR introduces the unified GroupState enum for all group types from KIP-1043. This PR also removes ShareGroupState and begins the work to replace Admin.listShareGroups with Admin.listGroups. That will complete in a future PR.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
This patch introduces the `CoordinatorExecutor` construct into the `CoordinatorRuntime`. It allows scheduling asynchronous tasks from within a `CoordinatorShard` while respecting the runtime semantic. It will be used to asynchronously resolve regular expressions.
The `GroupCoordinatorService` uses a default `ExecutorService` with a single thread to back it at the moment. It seems that it should be sufficient. In the future, we could consider making the number of threads configurable.
Reviewers: Jeff Kim <jeff.kim@confluent.io>, Lianet Magrans <lmagrans@confluent.io>
Following the KIP-1033 a FailedProcessingException is passed to the Streams-specific uncaught exception handler.
The goal of the PR is to unwrap a FailedProcessingException into a StreamsException when an exception occurs during the flushing or closing of a store
Reviewer: Bruno Cadonna <cadonna@apache.org>
The thread that evaluates the gauge for the oldest-iterator-open-since-ms runs concurrently
with threads that open/close iterators (stream threads and interactive query threads). This PR
fixed a race condition between `openIterators.isEmpty()` and `openIterators.first()`, by catching
a potential exception. Because we except the race condition to be rare, we rather catch the
exception in favor of introducing a guard via locking.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
BrokerHeartbeatManager.java: fix an outdated comment.
Move an inefficient test method that is O(num_brokers) from ClusterControlManager.java into ReplicationControlManagerTest.java, so that it doesn't accidentally get used in production code.
Remove QuorumController.ImbalanceSchedule, etc. since it is no longer used.
Move the initialization of OffsetControlManager later in the QuorumController constructor and add a comment explaining why it should come last. This doesn't fix any bugs currently, but it's a good practice for the future.
Reviewers: Mickael Maison <mickael.maison@gmail.com>
What
There was a bug in handling piggyback acknowledgements in ShareConsumeRequestManager, where the fetchAcknowledgementsMap could be updated when the request was in flight and when the ShareFetch response is received, we were removing any acknowledgements(without actually sending them) which came when the request was in flight.
Fix
Now we are maintaining 2 separate maps(one which has the acknowledgements to send and one which keeps track of the acknowledgements in flight).
Reviewers: Andrew Schofield <aschofield@confluent.io>, Apoorv Mittal <apoorvmittal10@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
Admin API operations have two phases: lookup and fulfilment. The lookup phase involves a METADATA request whose details depend upon the operation being performed.
For some operations, the METADATA request can be quite expensive to serve. For example, if the user calls Admin.listOffsets for 1000 topics, the METADATA request will include all 1000 topics and the response will contain the leader information for all of these topics. And then the actual fulfilment phase does the real work of the operation.
In cases where a long-running application is performing repeated admin operations which need the same metadata information about partition leadership, it is not necessary to send the METADATA request for every single admin operation.
This PR adds a cache of the mapping from topic-partition to leader id to the admin client. The cache doesn't need to be very sophisticated because the admin client will retry if the information becomes stale, and the cache can be updated as a result of the retry.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Several of the ShareConsumerTest integration tests are a bit flaky. This PR tightens up the logic with the aim of eliminating the flakes. Annoyingly the tests seem rock solid locally so this might take some experimentation.
Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>, Divij Vaidya <diviv@amazon.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
This patch contains changes to the handling of the INVALID_PRODUCER_ID_MAPPING error.
Quoted from KIP-890
Since we bump epoch on abort, we no longer need to call InitProducerId to fence requests. InitProducerId will only be called when the producer starts up to fence a previous instance.
With this change, some other calls to InitProducerId were inspected including the call after receiving an InvalidPidMappingException. This exception was changed to abortable as part of KIP-360: Improve reliability of idempotent/transactional producer. However, this change means that we can violate EOS guarantees. As an example:
Consider an application that is copying data from one partition to another
Application instance A processes to offset 4
Application instance B comes up and fences application instance A
Application instance B processes to offset 5
Application instances A and B are idle for transaction.id.expiration.ms, transaction id expires on server
Application instance A attempts to process offset 5 (since in its view, that is next) -- if we recover from invalid pid mapping, we can duplicate this processing
Thus, INVALID_PID_MAPPING should be fatal to the producer.
This is consistent with KIP-1050: Consistent error handling for Transactions where errors that are fatal to the producer are in the "application recoverable" category. This is a grouping that indicates to the client that the producer needs to restart and recovery on the application side is necessary. KIP-1050 is approved so we are consistent with that decision.
This PR also fixes the flakiness of TransactionsExpirationTest.
Reviewers: Artem Livshits <alivshits@confluent.io>, Justine Olshan <jolshan@confluent.io>, Calvin Liu <caliu@confluent.io>
Adds support for UpdateRaftVoterRequest in KafkaNetworkChannel. This addresses the following scenario:
* Bootstrap a KRaft Controller quorum in dynamic mode
* Start additional controllers (as observers)
* Update kraft.version feature from 0 to 1
* Use kafka-metadata-quorum add-controller to promote an observer controller to a follower
Reviewers: Colin Patrick McCabe <cmccabe@apache.org>, Alyssa Huang <ahuang@confluent.io>