kafka

Commit Graph

Author	SHA1	Message	Date
David Mao	d0f845a5e1	KAFKA-16120: Fix partition reassignment during ZK migration When we are migrating from ZK mode to KRaft mode, the brokers pass through a phase where they are running in ZK mode, but the controller is in KRaft mode (aka a kcontroller). This is called "hybrid mode." In hybrid mode, the KRaft controllers send old-style controller RPCs to the remaining ZK mode brokers. (StopReplicaRequest, LeaderAndIsrRequest, UpdateMetadataRequest, etc.) To complete partition reassignment, the kcontroller must send a StopReplicaRequest to any brokers that no longer host the partition in question. Previously, it was sending this StopReplicaRequest with delete = false. This led to stray partitions, because the partition data was never removed as it should have been. This PR fixes it to set delete = true. This fixes KAFKA-16120. There is one additional problem with partition reassignment in hybrid mode, tracked as KAFKA-16121. The issue is that in ZK mode, brokers ignore any LeaderAndIsr request where the partition leader epoch is less than or equal to the current partition leader epoch. However, when in hybrid mode, just as in KRaft mode, we do not bump the leader epoch when starting a new reassignment, see: `triggerLeaderEpochBumpIfNeeded`. This PR resolves this problem by adding a special case on the broker side when isKRaftController = true. Reviewers: Akhilesh Chaganti <akhileshchg@users.noreply.github.com>, Colin P. McCabe <cmccabe@apache.org>	2024-01-14 20:32:58 -08:00
Arpit Goyal	ef92deee9d	KAFKA-15388: Handling remote segment read in case of log compaction (#15060 ) Fetching from remote log segment implementation does not handle the topics that had retention policy as compact earlier and changed to delete. It always assumes record batch will exist in the required segment for the requested offset. But there is a possibility where the requested offset is the last offset of the segment and has been removed due to log compaction. Then it requires iterating over the next higher segment for further data as it has been done for local segment fetch request. This change partially addresses the above problem by iterating through the remote log segments to find the respective segment for the target offset. Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Divij Vaidya <diviv@amazon.com>, Christo Lolov <lolovc@amazon.com>	2024-01-15 05:15:58 +05:30
Kamal Chandraprakash	378a01999e	MINOR: Add isRemoteLogEnabled parameter to the Log Loader Javadoc. (#15179 ) Add isRemoteLogEnabled parameter to the Log Loader Javadoc Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>	2024-01-13 14:52:11 +08:00
Greg Harris	21227bda61	KAFKA-15816: Fix leaked sockets in core tests (#14754 ) Signed-off-by: Greg Harris <greg.harris@aiven.io> Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-12 13:18:03 -08:00
Omnia Ibrahim	e9f2218d94	KAFKA-15853: Move ReplicationQuotaManagerConfig to server module (#15160 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Nikolay <nizhikov@apache.org>	2024-01-12 10:47:26 +01:00
谭九鼎	cf447ea4b5	MINOR: doc fix: use <code> instead of backticks (#15169 ) use <code> instead of backticks Reviewers: Luke Chen <showuon@gmail.com>	2024-01-12 16:48:47 +08:00
Abhinav Dixit	8cdf1abb0b	KAFKA-15738: Adding KRaft support in ConsumerWithLegacyMessageFormatIntegrationTest (#15171 ) Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>	2024-01-12 13:41:04 +05:30
dengziming	da6f05258f	MINOR: Enable kraft test in kafka.api (#14595 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-12 11:50:12 +08:00
Divij Vaidya	65424ab484	MINOR: New year code cleanup - include final keyword (#15072 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Sagar Rao <sagarmeansocean@gmail.com>	2024-01-11 17:53:35 +01:00
David Jacot	a8203f9c7a	KAFKA-14505; [4/N] Wire transaction verification (#15142 ) This patch wires the transaction verification in the new group coordinator. It basically calls the verification path before scheduling the write operation. If the verification fails, the error is returned to the caller. Note that the patch uses `appendForGroup`. I suppose that we will move away from using it when https://github.com/apache/kafka/pull/15087 is merged. Reviewers: Justine Olshan <jolshan@confluent.io>	2024-01-11 04:58:57 -08:00
Omnia Ibrahim	dba789dc93	KAFKA-15853: Move OffsetConfig to group-coordinator module (#15161 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, David Jacot <djacot@confluent.io>, Nikolay <nizhikov@apache.org>	2024-01-11 10:19:42 +01:00
Omnia Ibrahim	13a83d58f8	KAFKA-15853: Move ProcessRole to server module (#15166 ) Prepare to move KafkaConfig (#15103). Reviewers: Ismael Juma <ismael@juma.me.uk>	2024-01-10 15:13:06 -08:00
TapDang	a63f76970a	KAFKA-15747: Add KRaft support in DynamicConnectionQuotaTest (#15028 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-10 17:47:01 +01:00
Luke Chen	177c941982	KAFKA-16074: close leaking threads in replica manager tests (#15077 ) Following @dajac 's finding in #15063, I found we also create new RemoteLogManager in ReplicaManagerTest, but didn't close them. While investigating ReplicaManagerTest, I also found there are other threads leaking: 1. remote fetch reaper thread. It's because we create a reaper thread in test, which is not expected. We should create a mocked one like other purgatory instance. 2. Throttle threads. We created a quotaManager to feed into the replicaManager, but didn't close it. Actually, we have created a global quotaManager instance and will close it on AfterEach. We should re-use it. 3. replicaManager and logManager didn't invoke close after test. Reviewers: Divij Vaidya <divijvaidya13@gmail.com>, Satish Duggana <satishd@apache.org>, Justine Olshan <jolshan@confluent.io>	2024-01-10 19:54:50 +08:00
Sanskar Jhajharia	3d1d060d87	KAFKA-15735: KRaft support in SaslMultiMechanismConsumerTest (#15156 ) Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>	2024-01-10 12:47:37 +05:30
Zihao Lin	bdad163182	KAFKA-15741: KRaft support in DescribeConsumerGroupTest (#14668 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-09 15:28:49 +01:00
Dmitry Werner	30d9678b3b	KAFKA-15721: KRaft support in DeleteTopicsRequestWithDeletionDisabledTest (#15124 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-09 11:53:29 +01:00
Zihao Lin	b2bfd5d110	KAFKA-15719: Add KRaft support in OffsetsForLeaderEpochRequestTest (#15049 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-08 17:20:58 +01:00
Vedarth Sharma	116762fdce	KAFKA-16016: Add docker wrapper in core and remove docker utility script (#15048 ) Migrates functionality provided by utility to Kafka core. This wrapper will be used to generate property files and format storage when invoked from docker container. Reviewers: Mickael Maison <mickael.maison@gmail.com>, Viktor Somogyi-Vass <viktorsomogyi@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>	2024-01-08 18:07:38 +05:30
Nikolay	da2aa68269	KAFKA-14588: Move ConfigEntityName to server-common (#14868 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>	2024-01-08 12:41:43 +01:00
Luke Chen	70c8b8d0af	KAFKA-16059: close more kafkaApis instances (#15132 ) Reviewers: Divij Vaidya <diviv@amazon.com>, Justine Olshan <jolshan@confluent.io>	2024-01-06 15:00:20 +01:00
Jason Gustafson	599e22b842	MINOR: Move Raft io thread implementation to Java (#15119 ) This patch moves the `RaftIOThread` implementation into Java. I changed the name to `KafkaRaftClientDriver` since the main thing it does is drive the calls to `poll()`. There shouldn't be any changes to the logic. Reviewers: José Armando García Sancio <jsancio@apache.org>	2024-01-05 09:27:36 -08:00
Luke Chen	c8d61a5cbe	KAFKA-16079: fix thread leaks in LocalLeaderEndPointTest and other tests (#15122 ) Fix thread leaks in LocalLeaderEndPointTest/FinalizedFeatureChangeListenerTest/KafkaApisTest/ReplicaManagerConcurrencyTest Reviewers: Divij Vaidya <diviv@amazon.com>, Christo Lolov <christololov@gmail.com>	2024-01-05 09:43:03 +08:00
Michael Edgar	105db82956	KAFKA-15373: fix exception thrown in Admin#describeTopics for unknown ID (#14599 ) Throw UnknownTopicIdException instead of InvalidTopicException when no name is found for the topic ID. Similar to #6124 for describeTopics using a topic name. MockAdminClient already makes use of UnknownTopicIdException for this case. Reviewers: Justine Olshan <jolshan@confluent.io>, Ashwin Pankaj <apankaj@confluent.io>	2024-01-03 17:56:17 -08:00
Dmitry Werner	d4aeec3d3f	KAFKA-15742: KRaft support in GroupCoordinatorIntegrationTest (#15086 ) updated GroupCoordinatorIntegrationTest.testGroupCoordinatorPropagatesOffsetsTopicCompressionCodec to support KRaft Reviewers: Justine Olshan <jolshan@confluent.io>	2024-01-03 08:46:12 -08:00
DL1231	60c445bdd5	MINOR: Improve code style (#15107 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2024-01-03 11:56:20 +01:00
Arpit Goyal	86a387c3c8	KAFKA-16063: Disable shutdownhook in MiniKdc (used for testing) (#15104 ) This stops a memory leak in the tests caused by ApplicationShutdownHooks. Reviewers: Divij Vaidya <diviv@amazon.com>	2024-01-02 21:11:18 +01:00
Divij Vaidya	65b1558532	KAFKA-16059: Fix thread leak KafkaAPIsTest (#15093 ) Reviewers: Luke Chen <showuon@gmail.com>	2024-01-02 15:58:20 +01:00
Divij Vaidya	bd6cb4db22	KAFKA-16052: Save heap in AbstractCoordinatorConcurrencyTest by creating real ReplicaManager (#15094 ) Mockito will keep the invocation history in the test suite and cause the huge heap usage. Since the mock replicaManager is only used to bypass the replicaManager constructor without verifying/mocking anything, we create a real dummy replicaManager to avoid the mockito invocation history in memory. Reviewers: Luke Chen <showuon@gmail.com>, Justine Olshan <jolshan@confluent.io> Co-authored-by: Luke Chen <showuon@gmail.com>	2023-12-31 12:25:16 +01:00
wernerdv	b3664119fd	KAFKA-16064: Improve ControllerApiTest (#15091 ) This commit refactors ControllerApiTest to close an instance of ControllerApis in a tearDown method. Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-30 21:49:25 +01:00
Luke Chen	0600ac00e9	KAFKA-16065: close DelayedFuturePurgatory in DelayedOperationTest (#15090 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-29 18:27:45 +01:00
Afshin Moazami	627aaef47e	MINOR: Duplicate method; The QuotaUtils one is used. (#15066 ) It seems like this PR (https://github.com/apache/kafka/pull/8768) duplicated the implementation to QuotaUtils, but didn't remove this implementation and private methods that is using Reviewers: Justine Olshan <jolshan@confluent.io>	2023-12-28 16:01:30 -08:00
Luke Chen	a465fb124f	KAFKA-16058: close controllerApi instance to avoid thread leaks (#15084 ) The controllerApi will create some resources, including the reaper threads. In ControllerApisTest, we created it on many test cases, but didn't close it. This commit doesn't change anything in the business logic of the test, it just adds try/finally to close the controllerApi instance. Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-28 16:38:20 +01:00
Divij Vaidya	a56b63e226	KAFKA-16053: Fix memory leaks due to KDC server in tests (#15079 ) This commit closes the KDC server properly in `CustomQuotaCallbackTest` and `AclAuthorizerWithZkSaslTest`. Reviewers: Justine Olshan <jolshan@confluent.io>	2023-12-28 10:55:14 +01:00
DL1231	f80686b4ac	MINOR: Improve code style (#15074 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-26 12:21:12 +01:00
IBeyondy	89f32ca6a1	MINOR: Fix NullPointerException in ReplicaFetcherThreadTest.testTruncateOnFetchDoesNotUpdateHighWatermark Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-25 18:11:41 +01:00
Nikolay	417338ad77	KAFKA-16048: Fix ConfigCommandTest.shouldNotSupportAlterClientMetricsWithZookeeper (#15068 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-25 14:38:15 +01:00
Nikolay	45bd19f2ef	KAFKA-14588: Move ConfigType to server-common (#14867 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2023-12-22 18:35:27 +01:00
Rittika Adhikari	0bc736f3c4	MINOR: Refactor to only require one stopPartitions helper (#14662 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-22 17:13:22 +01:00
Philip Nee	c963a71be0	KAFKA-16026: Send Poll event to the background thread (#15035 ) Related to KAFKA-15818. This is a bug in the AsyncKafkaConsumer poll loop: it does not send an event to the network thread to acknowledge the user poll. This causes a few issues: (1) autocommit won't work without the user setting the timer; (2) the member will just leave the group after the rebalance timeout and never be able to rejoin. In this PR, a few subtle changes are made to address this issue: (1) Hook up the poll event to AsyncKafkaConsumer#poll; it is only fired once per invocation. (2) Upon entering the stale state, we need to reset HeartbeatState, otherwise we will get an invalid request. (3) We will clear the current assignment and remove all assigned partitions once the heartbeat is sent; see changes in onHeartbeatRequestSent. Reviewers: David Jacot <djacot@confluent.io>, Bruno Cadonna <cadonna@apache.org>, Andrew Schofield <aschofield@confluent.io>	2023-12-22 15:21:39 +01:00
David Jacot	f7ccd082f1	MINOR: Exit catcher should be reset after the cluster is shutdown (#15062 ) I was investigating a build which failed with "exit 1". In the logs of the broker, I saw that the first call to exit was caught. However, a second one was not. See the logs below. The issue seems to be that we must first shut down the cluster before resetting the exit catcher. Otherwise, there is still a chance for the broker to call exit. ``` [2023-12-21 13:52:59,310] ERROR Shutdown broker because all log dirs in /tmp/kafka-2594137463116889965 have failed (kafka.log.LogManager:143) [2023-12-21 13:52:59,312] ERROR test error (kafka.server.epoch.EpochDrivenReplicationProtocolAcceptanceWithIbp26Test:76) java.lang.RuntimeException: halt(1, null) called! at kafka.server.QuorumTestHarness.$anonfun$setUp$4(QuorumTestHarness.scala:273) at org.apache.kafka.common.utils.Exit.halt(Exit.java:63) at kafka.utils.Exit$.halt(Exit.scala:33) at kafka.log.LogManager.handleLogDirFailure(LogManager.scala:224) at kafka.server.ReplicaManager.handleLogDirFailure(ReplicaManager.scala:2600) at kafka.server.ReplicaManager$LogDirFailureHandler.doWork(ReplicaManager.scala:324) at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) ``` ``` [2023-12-21 13:53:05,797] ERROR Shutdown broker because all log dirs in /tmp/kafka-7355495604650755405 have failed (kafka.log.LogManager:143) ``` Reviewers: Luke Chen <showuon@gmail.com>	2023-12-22 05:58:34 -08:00
David Jacot	654ac2528b	MINOR: Close RemoteLogManager in RemoteLogManagerTest (#15063 ) This patch ensures that the RemoteLogManager is closed in RemoteLogManagerTest. Reviewers: Divij Vaidya <diviv@amazon.com>, Lucas Brutschy <lbrutschy@confluent.io>	2023-12-22 05:54:48 -08:00
Luke Chen	82808873cb	KAFKA-16035: add tests for remoteLogSizeComputationTime/remoteFetchExpiresPerSec metrics (#15056 ) These tests are removed in this commit because they are flaky. After investigation, the causes are: 1. remoteLogSizeComputationTime: It failed with "Expected to find 1000 for RemoteLogSizeComputationTime metric value, but found 0". The reason is that if the verification thread is too slow and the 2nd run of RLMTask has started, the value is reset back to 0. Fix it by adding a latch to wait for verification. 2. remoteFetchExpiresPerSec: It failed with "The ExpiresPerSec value is not incremented. Current value is: 0". The reason is that the remoteFetchExpiresPerSec metric is a static metric, and we remove all metrics after each test completes in the tearDown method. So once remoteFetchExpiresPerSec is removed, it won't be created again like other metrics. That is why it sometimes failed in Jenkins: if a previous test had an expired remote fetch, this metric was created and then removed forever. Fix it by only removing it in afterAll. Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>, Christo Lolov <lolovc@amazon.com>	2023-12-22 15:02:55 +08:00
Christo Lolov	d4f3bf93d3	KAFKA-16014: Implement RemoteLogSizeBytes (#15050 ) This pull request aims to implement RemoteLogSizeBytes from KIP-963. Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>	2023-12-22 15:00:44 +08:00
David Jacot	98aca56ee5	KAFKA-16040; Rename `Generic` to `Classic` (#15059 ) People have raised concerns about using `Generic` as a name to designate the old rebalance protocol. We considered using `Legacy` but discarded it because there are still applications, such as Connect, using the old protocol. We settled on using `Classic` for the `Classic Rebalance Protocol`. The changes in this patch are extremely mechanical. It basically replaces the occurrences of `generic` by `classic`. Reviewers: Divij Vaidya <diviv@amazon.com>, Lucas Brutschy <lbrutschy@confluent.io>	2023-12-21 13:39:17 -08:00
David Jacot	79757b3081	KAFKA-14505; [3/N] Wire WriteTxnMarkers API (#14985 ) This patch wires the handling of markers written by the transaction coordinator via the WriteTxnMarkers API. In the old group coordinator, the markers are written to the logs and the group coordinator is informed to materialize the changes as a second step if the writes were successful. This approach does not really work with the new group coordinator for mainly two reasons: 1) The second step would actually fail while the coordinator is loading and there is no guarantee that the loading has picked up the write or not; 2) It does not fit well with the new memory model where the state is snapshotted by offset. In both cases, it seems that having a single writer to the `__consumer_offsets` partitions is more robust and preferable. Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>	2023-12-21 10:59:41 -08:00
Jeff Kim	4613286076	KAFKA-16030: new group coordinator should check if partition goes offline during load (#15043 ) The new coordinator stops loading if the partition goes offline during load. However, the partition is still considered active. Instead, we should return NOT_LEADER_OR_FOLLOWER exception during load. Another change is that we only want to invoke CoordinatorPlayback#updateLastCommittedOffset if the current offset (last written offset) is greater than or equal to the current high watermark. This is to ensure that in the case the high watermark is ahead of the current offset, we don't clear snapshots prematurely. Reviewers: David Jacot <djacot@confluent.io>	2023-12-21 06:17:35 -08:00
Divij Vaidya	6250049e10	KAFKA-13950: Fix resource leak in error scenarios (#12228 ) We are not properly closing Closeable resources in the code base at multiple places especially when we have an exception. This code change fixes multiple of these leaks. Reviewers: Ismael Juma <ismael@juma.me.uk>, Luke Chen <showuon@gmail.com>, Mickael Maison <mickael.maison@gmail.com>	2023-12-21 13:47:22 +01:00
David Jacot	75dcc8dadf	KAFKA-16036; Add `group.coordinator.rebalance.protocols` and publish all new configs (#15053 ) This patch adds the group.coordinator.rebalance.protocols configuration which accepts a list of protocols to enable. At the moment, only generic and consumer are supported and it is not possible to disable generic yet. When consumer is enabled, the new consumer rebalance protocol (KIP-848) is enabled alongside the new group coordinator. This patch also publishes all the new configurations introduced by KIP-848. Reviewers: Jeff Kim <jeff.kim@confluent.io>, Stanislav Kozlovski <stanislav@confluent.io>	2023-12-21 04:43:57 -08:00
Luke Chen	d59d613258	KAFKA-16013: Throw an exception in DelayedRemoteFetch for follower fetch replicas. (#15015 ) Follow-up for KAFKA-16013: Add metric for expiration rate of delayed remote fetch Reviewers: Nikhil Ramakrishnan <ramakrishnan.nikhil@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>	2023-12-21 15:45:24 +08:00
Christo Lolov	1a97de2fe6	KAFKA-16002: Implement RemoteCopyLagSegments, RemoteDeleteLagBytes and RemoteDeleteLagSegments (#15005 ) This pull request aims to implement RemoteCopyLagSegments, RemoteDeleteLagBytes and RemoteDeleteLagSegments from KIP-963. Reviewers: Luke Chen <showuon@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>	2023-12-21 14:27:12 +08:00
Ismael Juma	919b585da0	KAFKA-15874: Add metric and request log attribute for deprecated request api versions (KIP-896) (#15032 ) Breakdown of this PR: * Extend the generator to support deprecated api versions * Set deprecated api versions via the request json files * Expose the information via metrics and the request log The relevant section of the KIP: > * Introduce metric `kafka.network:type=RequestMetrics,name=DeprecatedRequestsPerSec,request=(api-name),version=(api-version),clientSoftwareName=(client-software-name),clientSoftwareVersion=(client-software-version)` > * Add boolean field `requestApiVersionDeprecated` to the request header section of the request log (alongside `requestApiKey` , `requestApiVersion`, `requestApiKeyName` , etc.). Unit tests were added to verify the new generator functionality, the new metric and the new request log attribute. Reviewers: Jason Gustafson <jason@confluent.io>	2023-12-20 05:13:36 -08:00
Luke Chen	4e11de00a7	KAFKA-16014: Add RemoteLogMetadataCount metric (#15026 ) Reviewers: Christo Lolov <lolovc@amazon.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>	2023-12-20 14:21:30 +05:30
Viktor Somogyi-Vass	0e0282395d	KAFKA-15366: Modify LogDirFailureTest for KRaft (#14977 ) Reviewers: Omnia G.H Ibrahim <o.g.h.ibrahim@gmail.com>, Ron Dagostino <rdagostino@confluent.io>, Igor Soarez <soarez@apple.com>	2023-12-19 21:02:49 -05:00
Philip Nee	5e37ec80f8	KAFKA-15696: Refactor closing consumer (#14937 ) We drive the consumer closing via events, and rely on the still-alive network thread to complete these operations. This ticket encompasses several different tickets: KAFKA-15696/KAFKA-15548. When closing the consumer, we need to perform a few tasks. Here is the top-level overview: We want to keep the network thread alive until we are ready to shut down, i.e., no more requests need to be sent out. To achieve this, I implemented a method, signalClose(), to signal the managers to prepare for shutdown. Once we signal the network thread to close, the manager will prepare for the request to be sent out on the next event loop. The network thread can then be closed after issuing these events. The application thread's task is pretty straightforward: 1. Tell the background thread to perform n events and 2. Block on certain events until they succeed or the timer runs out. Once all requests are sent out, we close the network thread and other components as usual. Here I outline the changes in detail. AsyncKafkaConsumer: Shutdown procedures, and several utility functions to ensure proper exceptions are thrown during shutdown. AsyncKafkaConsumerTest: I examined each individual test and fixed the ones that were blocking for too long or logging errors. CommitRequestManager: signalClose(). FetchRequestManagerTest: changes due to the change in pollOnClose(). ApplicationEventProcessor: handle CommitOnClose and LeaveGroupOnClose. The latter triggers leaveGroup(), which should be completed on the next heartbeat (or we time out on the application thread). Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>	2023-12-19 13:20:33 +01:00
David Jacot	35e2d3c196	MINOR: Fix thread leak in AuthorizerIntegrationTest (#15006 ) Producers and consumers could be leaked in the AuthorizerIntegrationTest. In the teardown logic, `removeAllClientAcls()` is called before calling the super teardown method. If `removeAllClientAcls()` fails, the super method does not have a chance to close the producers and consumers. Example of such failure [here](https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14925/11/tests/). As a new cluster is created for each test anyway, calling `removeAllClientAcls()` does not seem necessary. This patch removes it. Reviewers: Jason Gustafson <jason@confluent.io>	2023-12-18 23:48:10 -08:00
Gantigmaa Selenge	7b21da9712	KAFKA-15158: Add metrics for RemoteDelete and BuildRemoteLogAuxState (#14375 ) This PR implements part of KIP-963, specifically for adding new metrics. The metrics added in this PR are: RemoteDeleteRequestsPerSec (emitted when expired log segments on remote storage being deleted) RemoteDeleteErrorsPerSec (emitted when failed to delete expired log segments on remote storage) BuildRemoteLogAuxStateRequestsPerSec (emitted when building remote log aux state for replica fetchers) BuildRemoteLogAuxStateErrorsPerSec (emitted when failed to build remote log aux state for replica fetchers) Reviewers: Luke Chen <showuon@gmail.com>, Nikhil Ramakrishnan <ramakrishnan.nikhil@gmail.com>, Christo Lolov <lolovc@amazon.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Divij Vaidya <diviv@amazon.com>, Satish Duggana <satishd@apache.org>	2023-12-19 15:02:45 +08:00
Luke Chen	c240993be2	KAFKA-16014: Add RemoteLogSizeComputationTime metric (#15021 ) Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Christo Lolov <lolovc@amazon.com>	2023-12-18 21:39:43 +05:30
Lucas Brutschy	7aade70cc6	Revert "KAFKA-15764: Missing Tests for Transactions (#14702 )" (#15029 ) This reverts commit `ed7ad6d`. We have been seeing a lot of failures of TransactionsWithTieredStoreTest.testTransactionsWithCompression on trunk, and it seems to start with this PR. I see how this PR can influence the test via the change in TestUtils. The bad part is that sometimes seems to kill the Gradle Executors completely. So I'd suggest reverting the change before investigating further to stabilize CI. Reviewers: Bruno Cadonna <cadonna@apache.org>	2023-12-18 10:12:05 +01:00
Philip Nee	a6076c71f6	KAFKA-16023: Disable flaky tests in PlaintextConsumerTest (#15025 ) I observed several failed tests in PR builds. Let's first disable them and try to find a different way to test the async consumer with these tests. Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2023-12-17 10:43:45 +01:00
Justine Olshan	ed7ad6d9d3	KAFKA-15764: Missing Tests for Transactions (#14702 ) I ran this test 40 times without KAFKA-15653 with and without compression enabled. With compression it failed 39/40 times and without it passed 40/40 times. With the KAFKA-15653 and compression it passed 40/40 times locally Reviewers: Jason Gustafson <jason@confluent.io>	2023-12-15 09:41:20 -08:00
Andrew Schofield	a23dae4e9a	KAFKA-15971: Re-enable consumer integration tests for new consumer (#14925 ) The consumer integration tests were experimentally disabled for the new `AsyncKafkaConsumer` variant with the aim of improving build stability. Several improvements have been made to the consumer code and other tests which seem to have made a difference. This patch re-enables the tests. Reviewers: David Jacot <djacot@confluent.io>	2023-12-15 05:16:54 -08:00
Nikhil Ramakrishnan	52496dcd38	KAFKA-16013: Add metric for expiration rate of delayed remote fetch (#15014 ) Add metric for the number of expired remote fetches per second, and corresponding unit test to verify that the metric is marked on expiration. kafka.server:type=DelayedRemoteFetchMetrics,name=ExpiresPerSec Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Christo Lolov <lolovc@amazon.com>	2023-12-15 19:21:39 +08:00
Kirk True	9dc9040f33	KAFKA-15276: Implement event plumbing for ConsumerRebalanceListener callbacks (#14640 ) This patch adds the logic for coordinating the invocation of the `ConsumerRebalanceListener` callback invocations between the background thread (in `MembershipManagerImpl`) and the application thread (`AsyncKafkaConsumer`) and back again. It allowed us to enable more tests from `PlaintextConsumerTest` to exercise the code herein. Reviewers: David Jacot <djacot@confluent.io>	2023-12-15 00:42:31 -08:00
Proven Provenzano	b0e99b5593	KAFKA-15922: Bump MetadataVersion to support JBOD with KRaft (#14984 ) Moves ELR from MetadataVersion IBP_3_7_IV3 into the new IBP_3_8_IV0 because the ELR feature was not completed before 3.7 reached feature freeze. Leaves IBP_3_7_IV3 empty -- it is a no-op and is not reused for anything. Adds the new MetadataVersion IBP_3_7_IV4 for the FETCH request changes from KIP-951, which were mistakenly never associated with a MetadataVersion. Updates the LATEST_PRODUCTION MetadataVersion to IBP_3_7_IV4 to declare both KRaft JBOD and the KIP-951 changes ready for production use. Reviewers: Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Ron Dagostino <rdagostino@confluent.io>, Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@apache.org>, Justine Olshan <jolshan@confluent.io>	2023-12-14 10:08:54 -05:00
Justine Olshan	e4249b69bd	KAFKA-15784: Ensure atomicity of in memory update and write when transactionally committing offsets (#14774 ) Rewrote the verification flow to pass a callback to execute after verification completes. For the TxnOffsetCommit, we will call doTxnCommitOffsets. This allows us to do offset validations post verification. I've reorganized the verification code and group coordinator code to make these code paths clearer. The followup refactor (https://issues.apache.org/jira/browse/KAFKA-15987) will further clean up the produce verification code. Reviewers: Artem Livshits <alivshits@confluent.io>, Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>, Jun Rao <junrao@gmail.com>	2023-12-13 17:45:09 -08:00
Christo Lolov	a87e86e015	KAFKA-15883: Implement RemoteCopyLagBytes (#14832 ) This pull request implements the first in the list of metrics in KIP-963: Additional metrics in Tiered Storage. Since each partition of a topic will be serviced by its own RLMTask we need an aggregator object for a topic. The aggregator object in this pull request is BrokerTopicAggregatedMetric. Since the RemoteCopyLagBytes is a gauge I have introduced a new GaugeWrapper. The GaugeWrapper is used by the metrics collection system to interact with the BrokerTopicAggregatedMetric. The RemoteLogManager interacts with the BrokerTopicAggregatedMetric directly. Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>	2023-12-14 09:21:37 +08:00
vamossagar12	a1e985d22f	KAFKA-15237: Implement write operation timeout (#14981 ) This patch ensures that `offset.commit.timeout.ms` is enforced. It does so by adding a timeout to the CoordinatorWriteEvent. Reviewers: David Jacot <djacot@confluent.io>	2023-12-13 11:30:53 -08:00
Andrew Schofield	b08fb14bed	KAFKA-15775: New consumer listTopics and partitionsFor (#14962 ) Implement Consumer.listTopics and Consumer.partitionsFor in the new consumer. The topic metadata request manager already existed so this PR adds expiration to requests, removes some redundant state checking and adds tests. Reviewers: Lucas Brutschy <lucasbru@apache.org>	2023-12-13 08:47:25 +01:00
Nikhil Ramakrishnan	be531c681c	KAFKA-15695: Update the local log start offset of a log after rebuilding the auxiliary state (#14649 ) Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Alexandre Dupriez <alexandre.dupriez@gmail.com>	2023-12-12 21:43:42 +05:30
Philip Nee	5b478aebfd	KAFKA-15818: ensure leave group on max poll interval (#14873 ) Currently, poll interval is not being respected during consumer#poll. When the user stops polling the consumer, we should assume either the consumer is too slow to respond or is already dead. In either case, we should let the group coordinator kick the member out of the group and reassign its partition after the rebalance timeout expires. If the consumer comes back alive, we should send a heartbeat and the member will be fenced and rejoin. (and the partitions will be revoked). This is the same behavior as the current implementation. Reviewers: Lucas Brutschy <lucasbru@apache.org>, Bruno Cadonna <cadonna@apache.org>, Lianet Magrans <lianetmr@gmail.com>	2023-12-12 10:06:34 +01:00
Omnia Ibrahim	07490b929b	KAFKA-15365: Broker-side replica management changes (#14881 ) Reviewers: Igor Soarez <soarez@apple.com>, Ron Dagostino <rndgstn@gmail.com>, Proven Provenzano <pprovenzano@confluent.io>	2023-12-11 09:34:22 -05:00
Lucas Brutschy	134eabee16	MINOR: fix leak in `GroupEndToEndAuthorizationTest` (#14975 ) Session expiration in ZkClient can lead to a thread leak, and does fail CI on master. This is happening in testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl, and possibly other tests. Use try-with-resources to close ZkClient if this happens. This does not fix the underlying session expiration in ZK. Reviewers: David Jacot <djacot@confluent.io>	2023-12-11 09:05:03 +01:00
Andrew Schofield	f80f991c79	KAFKA-15978: Update member information on HB response (#14945 ) In the new consumer, the commit request manager and the membership manager are separate components. The commit request manager is initialised with group information that it uses to construct `OffsetCommit` requests. However, the initial value of the member ID is `""` in some cases. When the consumer joins the group, it receives a `ConsumerGroupHeartbeat` response which tells it the member ID. The member ID was not being passed to the commit request manager, so it sent invalid `OffsetCommit` requests that failed with `UNKNOWN_MEMBER_ID`. Reviewers: Bruno Cadonna <cadonna@apache.org>, David Jacot <djacot@confluent.io>	2023-12-10 23:56:54 -08:00
David Jacot	131581a2b4	MINOR: Remove `SubscribedTopicRegex` field from `ConsumerGroupHeartbeatRequest` (#14956 ) The support for regular expressions has not been implemented yet in the new consumer group protocol. This patch removes the `SubscribedTopicRegex` from the `ConsumerGroupHeartbeatRequest` in preparation for 3.7. It seems better to bump the version and add it back when we implement the feature, as part of https://issues.apache.org/jira/browse/KAFKA-14517, instead of having an unused field in the request. Reviewers: Sagar Rao <sagarmeansocean@gmail.com>, Justine Olshan <jolshan@confluent.io>	2023-12-10 23:53:08 -08:00
TapDang	cbc882ba07	KAFKA-15714: KRaft support in DynamicNumNetworkThreadsTest (#14970 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2023-12-10 13:33:01 +01:00
Igor Soarez	8c184b4743	MINOR: Fix some AssignmentsManager bugs (#14954 ) - Add proper start & stop for AssignmentsManager's event loop - Dedupe queued duplicate assignments - Fix bug where directory ID is resolved too late Co-authored-by: Gaurav Narula <gaurav_narula2@apple.com> Reviewers: Colin P. McCabe <cmccabe@apache.org>	2023-12-08 15:37:23 -08:00
Proven Provenzano	02d9f46f3a	MINOR: allow JBOD during ZK migration (#14968 ) Allow using JBOD during ZK migration if MetadataVersion is at or above 3.7-IV2. Reviewers: Ron Dagostino <rndgstn@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>	2023-12-08 14:38:57 -08:00
Igor Soarez	9de72daa50	KAFKA-15361: Migrating brokers must register with directory list (#14976 ) KAFKA-15361 (#14838) introduced a check for a non-empty directory list on broker registration requests from MetadataVersion.IBP_3_7_IV2 or later, which enables directory assignment. However, ZK brokers weren't yet registering with a directory list. This patch addresses that. We also make the directory list non-optional in BrokerLifecycleManager. Reviewers: Ron Dagostino <rndgstn@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>	2023-12-08 10:16:48 -08:00
vamossagar12	e6e7d8c09f	KAFKA-14516: [3/3] Integration Test - Static Member Removed After Session Timeout (#14911 ) This new integration test verifies that a static member who temporarily left the group is removed after the session timeout expires. It also verifies that a new static member with the same instance id can't join the group until the previous static member is expired. Reviewers: David Jacot <djacot@confluent.io>	2023-12-08 04:59:10 -08:00
David Jacot	0ad059d101	MINOR: Fix leak thread in DeleteTopicTest.testIncreasePartitionCountDuringDeleteTopic (#14960 ) Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2023-12-08 04:34:26 -08:00
David Jacot	38c873b80f	MINOR: Avoid leaking threads in DelegationTokenEndToEndAuthorizationWithOwnerTest.testDescribeTokenForOtherUserFails (#14959 ) Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2023-12-07 23:23:08 -08:00
Omnia Ibrahim	ec92410e59	KAFKA-15363: Broker log directory failure changes (#14790 ) Part of JBOD KIP-858, https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft Reviewers: Igor Soarez <i@soarez.me>, Colin P. McCabe <cmccabe@apache.org>, Ron Dagostino <rdagostino@confluent.io>	2023-12-07 20:44:56 -05:00
Lucas Brutschy	02915a2c5e	KAFKA-15977: Fix leak in DelegationTokenEndToEndAuthorizationWithOwnerTest (#14939 ) DelegationTokenEndToEndAuthorizationWithOwnerTest can leak a thread, causing problems with many tests. This is due to an admin client that isn't being closed when a (flaky) test fails. Use the Scala util `Using` to close the auto-closeable admin client in case the validation fails. Reviewers: David Jacot <djacot@confluent.io>, Bruno Cadonna <cadonna@apache.org>	2023-12-07 21:37:23 +01:00
Colin P. McCabe	c062e5a1f9	HOTFIX: fix scala 2.12 build again	2023-12-07 12:03:02 -08:00
Igor Soarez	c515bf51f8	KAFKA-15426: Process and persist directory assignments Handle AssignReplicasToDirs requests, persist metadata changes with new directory assignments and possible leader elections. Reviewers: Proven Provenzano <pprovenzano@confluent.io>, Ron Dagostino <rndgstn@gmail.com>, Colin P. McCabe <cmccabe@apache.org>	2023-12-07 11:44:45 -08:00
Colin Patrick McCabe	969bc7749c	KAFKA-15980: Add the CurrentControllerId metric (#14749 ) Add the CurrentControllerId metric as described in KIP-1001. This gives us an easy way to identify the current controller by looking at the metrics of any Kafka node (broker or controller). Reviewers: David Arthur <mumrah@gmail.com>	2023-12-06 21:03:33 -08:00
Apoorv Mittal	dc09d7a4e0	KAFKA-15684: Support to describe all client metrics resources (KIP-714) (#14933 ) Improvement for KIP-1000 to list client metrics resources in KafkaApis.scala. Using functionality exposed by KIP-1000 to support describe all metrics operations for KIP-714. Reviewers: Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>	2023-12-06 11:09:42 -08:00
Andrew Schofield	8ed53a15ee	KAFKA-15932: Wait for responses in consumer operations (#14912 ) The Kafka consumer makes a variety of requests to brokers such as fetching committed offsets and updating metadata. In the LegacyKafkaConsumer, the approach is typically to prepare RPC requests and then poll the network to wait for responses. In the AsyncKafkaConsumer, the approach is to enqueue an ApplicationEvent for processing by one of the request managers on the background thread. However, it is still important to wait for responses rather than spinning enqueuing events for the request managers before they have had a chance to respond. In general, the behaviour will not be changed by this code. The PlaintextConsumerTest.testSeek test was flaky because operations such as KafkaConsumer.position were not properly waiting for a response which meant that subsequent operations were being attempted in the wrong state. This test is no longer flaky. Reviewers: Kirk True <ktrue@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Bruno Cadonna <cadonna@apache.org>	2023-12-06 18:47:26 +01:00
Jeff Kim	b888fa1ec9	KAFKA-15910: New group coordinator needs to generate snapshots while loading (#14849 ) After the new coordinator loads a __consumer_offsets partition, it logs the following exception when making a read operation (fetch/list groups, etc): ``` java.lang.RuntimeException: No in-memory snapshot for epoch 740745. Snapshot epochs are: at org.apache.kafka.timeline.SnapshotRegistry.getSnapshot(SnapshotRegistry.java:178) at org.apache.kafka.timeline.SnapshottableHashTable.snapshottableIterator(SnapshottableHashTable.java:407) at org.apache.kafka.timeline.TimelineHashMap$ValueIterator.<init>(TimelineHashMap.java:283) at org.apache.kafka.timeline.TimelineHashMap$Values.iterator(TimelineHashMap.java:271) ``` This happens because we don't have a snapshot at the last updated high watermark after loading. We cannot generate a snapshot at the high watermark after loading all batches because it may contain records that have not yet been committed. We also don't know where the high watermark will advance up to so we need to generate a snapshot for each offset the loader observes to be greater than the current high watermark. Then once we add the high watermark listener and update the high watermark we can delete all of the older snapshots. Reviewers: David Jacot <djacot@confluent.io>	2023-12-06 08:38:05 -08:00
Lucas Brutschy	c575ba238d	KAFKA-15280: Implement client support for KIP-848 server-side assignors (#14878 ) * Validate the client’s configuration for server-side assignor selection defined in config group.remote.assignor * Include the assignor taken from config in the ConsumerGroupHeartbeat request, in the ServerAssignor field * Properly handle UNSUPPORTED_ASSIGNOR errors that may be returned in the HB response if the server does not support the assignor defined by the consumer. Includes simple integration tests for sending an invalid assignor to the broker, and for using the range assignor with a single consumer. Reviewers: David Jacot <djacot@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Bruno Cadonna <cadonna@apache.org>	2023-12-06 15:22:11 +01:00
Kamal Chandraprakash	f05b342b39	MINOR: Allow local-log segment deletion when log-start-offset incremented. (#14905 ) DELETE_RECORDS API can move the log-start-offset beyond the highest-copied-remote-offset. In such cases, we should allow deletion of local-log segments since they won't be eligible for upload to remote storage. Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>	2023-12-06 16:59:16 +05:30
Andrew Schofield	587f50d48f	KAFKA-15831: KIP-1000 protocol and admin client (#14811 ) This adds the new ListClientMetricsResources RPC to the Kafka protocol and puts support into the Kafka admin client. The broker-side implementation in this PR is just to return an empty list. A future PR will obtain the list from the config store. Includes a few unit tests for what is a very simple RPC. There are additional tests already written and waiting for the PR that delivers the kafka-client-metrics.sh tool which builds on this PR. Reviewers: Jun Rao <junrao@gmail.com>	2023-12-05 07:14:06 -08:00
vamossagar12	0f56eeb046	KAFKA-14516: [2/N] Integration Test - Static Member Gets Assignment Back (#14882 ) This patch adds an integration test which verifies that a static member gets its previous assignment back when rejoining. Reviewers: David Jacot <djacot@confluent.io>	2023-12-05 04:36:15 -08:00
Nikolay	783698c525	KAFKA-15645: Move ReplicationQuotasTestRig to tools module (#14588 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Justine Olshan <jolshan@confluent.io>, Taras Ledkov <tledkov@apache.org>	2023-12-05 10:03:33 +01:00
David Jacot	34e1dbbaba	MINOR: Add Uniform assignor to the default config (#14826 ) This patch adds the `Uniform` assignor to the default list of supported assignors. It also makes small changes in the code. Reviewers: Justine Olshan <jolshan@confluent.io>	2023-12-05 00:32:50 -08:00
David Jacot	26274afd05	MINOR: Ensure that DisplayName is set in all parameterized tests (#14850 ) This is a follow-up to https://github.com/apache/kafka/pull/14687 as we found out that some parameterized tests do not include the test method name in their name. For context, the JUnit XML report does not include the name of the method by default but only relies on the display name provided. Reviewers: David Arthur <mumrah@gmail.com>	2023-12-04 23:58:48 -08:00
David Jacot	b46505c8de	KAFKA-15061; CoordinatorPartitionWriter should reuse buffer (#14885 ) This patch adds a ThreadLocal with a GrowableBufferSupplier so that each writing thread can reuse the same buffer instead of allocating a new one for each write. The patch relies on existing tests. Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>	2023-12-04 23:56:52 -08:00
David Jacot	b335ed954e	MINOR: Add @Timeout annotation to consumer integration tests (#14896 ) In this [build](https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14826/11/pipeline/12/), the following test hangs forever. ``` Gradle Test Run :core:test > Gradle Test Executor 93 > PlaintextConsumerTest > testSeek(String, String) > testSeek(String, String).quorum=kraft+kip848.groupProtocol=consumer STARTED ``` As the new consumer is not extremely stable yet, we should add a Timeout to all those integration tests to ensure that builds are not blocked unnecessarily. Reviewers: Andrew Schofield <aschofield@confluent.io>, Justine Olshan <jolshan@confluent.io>	2023-12-04 23:55:39 -08:00
Colin Patrick McCabe	ebae7b26b5	MINOR: fix bug where we weren't registering SnapshotEmitterMetrics (#14918 ) Fix a bug where we weren't properly exposing SnapshotEmitterMetrics. Add a test. Reviewers: David Arthur <mumrah@gmail.com>	2023-12-04 21:32:12 -08:00

4601 Commits