Commit Graph

4601 Commits

Author SHA1 Message Date
David Mao d0f845a5e1
KAFKA-16120: Fix partition reassignment during ZK migration
When we are migrating from ZK mode to KRaft mode, the brokers pass through a phase where they are
running in ZK mode, but the controller is in KRaft mode (aka a kcontroller). This is called "hybrid
mode." In hybrid mode, the KRaft controllers send old-style controller RPCs to the remaining ZK
mode brokers. (StopReplicaRequest, LeaderAndIsrRequest, UpdateMetadataRequest, etc.)

To complete partition reassignment, the kcontroller must send a StopReplicaRequest to any brokers
that no longer host the partition in question. Previously, it was sending this StopReplicaRequest
with delete = false. This led to stray partitions, because the partition data was never removed as
it should have been. This PR fixes it to set delete = true. This fixes KAFKA-16120.

There is one additional problem with partition reassignment in hybrid mode, tracked as KAFKA-16121.
The issue is that in ZK mode, brokers ignore any LeaderAndIsr request where the partition leader
epoch is less than or equal to the current partition leader epoch. However, when in hybrid mode,
just as in KRaft mode, we do not bump the leader epoch when starting a new reassignment, see:
`triggerLeaderEpochBumpIfNeeded`. This PR resolves this problem by adding a special case on the
broker side when isKRaftController = true.
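
The epoch rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual broker code: the class and method names are hypothetical, and the exact shape of the special case (accepting an equal leader epoch when the request comes from a KRaft controller) is an assumption based on the description.

```java
/** Hypothetical sketch of the broker-side leader epoch check; not Kafka's actual code. */
final class LeaderEpochCheck {
    private LeaderEpochCheck() {}

    /**
     * Returns true if a LeaderAndIsr request for a partition should be applied.
     * ZK-mode rule: the request epoch must be strictly greater than the current epoch.
     * Assumed hybrid-mode special case: an equal epoch is also accepted when the
     * request comes from a KRaft controller, because the leader epoch is not bumped
     * when a new reassignment starts.
     */
    static boolean shouldApply(int requestLeaderEpoch, int currentLeaderEpoch, boolean isKRaftController) {
        if (requestLeaderEpoch > currentLeaderEpoch) {
            return true;
        }
        return isKRaftController && requestLeaderEpoch == currentLeaderEpoch;
    }
}
```

With this rule, a request carrying the current epoch is still ignored when it comes from a ZK controller, so the pre-existing ZK-mode behavior is unchanged.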

Reviewers: Akhilesh Chaganti <akhileshchg@users.noreply.github.com>, Colin P. McCabe <cmccabe@apache.org>
2024-01-14 20:32:58 -08:00
Arpit Goyal ef92deee9d
KAFKA-15388: Handling remote segment read in case of log compaction (#15060)
The remote log segment fetch implementation does not handle topics whose retention policy was previously compact and was later changed to delete. It always assumes that a record batch exists in the required segment for the requested offset. However, the requested offset may be the last offset of the segment and may have been removed by log compaction. In that case, the fetch must iterate over the next higher segment for further data, as is already done for local segment fetch requests.

This change partially addresses the above problem by iterating through the remote log segments to find the respective segment for the target offset.
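
A rough sketch of the forward search described above, assuming a simplified in-memory model. The `Segment` class, its fields, and `firstOffsetAtOrAfter` are hypothetical names for illustration only; the real implementation reads record batches from remote storage.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: find the first surviving offset at or after a target across segments. */
final class RemoteSegmentLookup {
    private RemoteSegmentLookup() {}

    /** Simplified stand-in for a remote log segment after compaction. */
    static final class Segment {
        final long baseOffset;
        final long endOffset;
        final List<Long> batchOffsets; // offsets that survived compaction, ascending

        Segment(long baseOffset, long endOffset, List<Long> batchOffsets) {
            this.baseOffset = baseOffset;
            this.endOffset = endOffset;
            this.batchOffsets = batchOffsets;
        }
    }

    /** Segments must be sorted by base offset. */
    static Optional<Long> firstOffsetAtOrAfter(List<Segment> segments, long targetOffset) {
        for (Segment segment : segments) {
            if (segment.endOffset < targetOffset) {
                continue; // segment lies entirely before the target
            }
            for (long offset : segment.batchOffsets) {
                if (offset >= targetOffset) {
                    return Optional.of(offset);
                }
            }
            // The target was compacted away in this segment: try the next higher one.
        }
        return Optional.empty();
    }
}
```

If the target offset was the last offset of a segment and has been compacted away, the loop falls through to the next segment instead of failing.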

Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Divij Vaidya <diviv@amazon.com>, Christo Lolov <lolovc@amazon.com>
2024-01-15 05:15:58 +05:30
Kamal Chandraprakash 378a01999e
MINOR: Add isRemoteLogEnabled parameter to the Log Loader Javadoc. (#15179)
Add isRemoteLogEnabled parameter to the Log Loader Javadoc

Reviewers: Luke Chen <showuon@gmail.com>,  Satish Duggana <satishd@apache.org>
2024-01-13 14:52:11 +08:00
Greg Harris 21227bda61
KAFKA-15816: Fix leaked sockets in core tests (#14754)
Signed-off-by: Greg Harris <greg.harris@aiven.io>
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2024-01-12 13:18:03 -08:00
Omnia Ibrahim e9f2218d94
KAFKA-15853: Move ReplicationQuotaManagerConfig to server module (#15160)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Nikolay <nizhikov@apache.org>
2024-01-12 10:47:26 +01:00
谭九鼎 cf447ea4b5
MINOR: doc fix: use <code> instead of backticks (#15169)
use <code> instead of backticks 

Reviewers: Luke Chen <showuon@gmail.com>
2024-01-12 16:48:47 +08:00
Abhinav Dixit 8cdf1abb0b
KAFKA-15738: Adding KRaft support in ConsumerWithLegacyMessageFormatIntegrationTest (#15171)
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
2024-01-12 13:41:04 +05:30
dengziming da6f05258f
MINOR: Enable kraft test in kafka.api (#14595)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2024-01-12 11:50:12 +08:00
Divij Vaidya 65424ab484
MINOR: New year code cleanup - include final keyword (#15072)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Sagar Rao <sagarmeansocean@gmail.com>
2024-01-11 17:53:35 +01:00
David Jacot a8203f9c7a
KAFKA-14505; [4/N] Wire transaction verification (#15142)
This patch wires the transaction verification in the new group coordinator. It basically calls the verification path before scheduling the write operation. If the verification fails, the error is returned to the caller.

Note that the patch uses `appendForGroup`. I suppose that we will move away from using it when https://github.com/apache/kafka/pull/15087 is merged.

Reviewers: Justine Olshan <jolshan@confluent.io>
2024-01-11 04:58:57 -08:00
Omnia Ibrahim dba789dc93
KAFKA-15853: Move OffsetConfig to group-coordinator module (#15161)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, David Jacot <djacot@confluent.io>, Nikolay <nizhikov@apache.org>
2024-01-11 10:19:42 +01:00
Omnia Ibrahim 13a83d58f8
KAFKA-15853: Move ProcessRole to server module (#15166)
Prepare to move KafkaConfig (#15103).

Reviewers: Ismael Juma <ismael@juma.me.uk>
2024-01-10 15:13:06 -08:00
TapDang a63f76970a
KAFKA-15747: Add KRaft support in DynamicConnectionQuotaTest (#15028)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2024-01-10 17:47:01 +01:00
Luke Chen 177c941982
KAFKA-16074: close leaking threads in replica manager tests (#15077)
Following @dajac 's finding in #15063, I found we also create new RemoteLogManager in ReplicaManagerTest, but didn't close them.

While investigating ReplicaManagerTest, I also found there are other threads leaking:

   1. Remote fetch reaper thread. The test creates a real reaper thread, which is not expected. We should create a mocked one, as we do for the other purgatory instances.
   2. Throttle threads. We created a quotaManager to feed into the replicaManager, but didn't close it. We already create a global quotaManager instance and close it in AfterEach, so we should reuse it.
   3. replicaManager and logManager didn't invoke close after the test.

Reviewers: Divij Vaidya <divijvaidya13@gmail.com>, Satish Duggana <satishd@apache.org>, Justine Olshan <jolshan@confluent.io>
2024-01-10 19:54:50 +08:00
Sanskar Jhajharia 3d1d060d87
KAFKA-15735: KRaft support in SaslMultiMechanismConsumerTest (#15156)
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
2024-01-10 12:47:37 +05:30
Zihao Lin bdad163182
KAFKA-15741: KRaft support in DescribeConsumerGroupTest (#14668)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2024-01-09 15:28:49 +01:00
Dmitry Werner 30d9678b3b
KAFKA-15721: KRaft support in DeleteTopicsRequestWithDeletionDisabledTest (#15124)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2024-01-09 11:53:29 +01:00
Zihao Lin b2bfd5d110
KAFKA-15719: Add KRaft support in OffsetsForLeaderEpochRequestTest (#15049)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2024-01-08 17:20:58 +01:00
Vedarth Sharma 116762fdce
KAFKA-16016: Add docker wrapper in core and remove docker utility script (#15048)
Migrates functionality provided by utility to Kafka core. This wrapper will be used to generate property files and format storage when invoked from docker container.

Reviewers: Mickael Maison <mickael.maison@gmail.com>, Viktor Somogyi-Vass <viktorsomogyi@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
2024-01-08 18:07:38 +05:30
Nikolay da2aa68269
KAFKA-14588: Move ConfigEntityName to server-common (#14868)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
2024-01-08 12:41:43 +01:00
Luke Chen 70c8b8d0af
KAFKA-16059: close more kafkaApis instances (#15132)
Reviewers: Divij Vaidya <diviv@amazon.com>, Justine Olshan <jolshan@confluent.io>
2024-01-06 15:00:20 +01:00
Jason Gustafson 599e22b842
MINOR: Move Raft io thread implementation to Java (#15119)
This patch moves the `RaftIOThread` implementation into Java. I changed the name to `KafkaRaftClientDriver` since the main thing it does is drive the calls to `poll()`. There shouldn't be any changes to the logic.

Reviewers: José Armando García Sancio <jsancio@apache.org>
2024-01-05 09:27:36 -08:00
Luke Chen c8d61a5cbe
KAFKA-16079: fix threads leak threads in LocalLeaderEndPointTest and other tests (#15122)
Fix threads leak in LocalLeaderEndPointTest/FinalizedFeatureChangeListenerTest/KafkaApisTest/ReplicaManagerConcurrencyTest

Reviewers: Divij Vaidya <diviv@amazon.com>, Christo Lolov <christololov@gmail.com>
2024-01-05 09:43:03 +08:00
Michael Edgar 105db82956
KAFKA-15373: fix exception thrown in Admin#describeTopics for unknown ID (#14599)
Throw UnknownTopicIdException instead of InvalidTopicException when no name is found for the topic ID.

Similar to #6124 for describeTopics using a topic name. MockAdminClient already makes use of UnknownTopicIdException for this case.

Reviewers: Justine Olshan <jolshan@confluent.io>, Ashwin Pankaj <apankaj@confluent.io>
2024-01-03 17:56:17 -08:00
Dmitry Werner d4aeec3d3f
KAFKA-15742: KRaft support in GroupCoordinatorIntegrationTest (#15086)
updated GroupCoordinatorIntegrationTest.testGroupCoordinatorPropagatesOffsetsTopicCompressionCodec to support KRaft

Reviewers: Justine Olshan <jolshan@confluent.io>
2024-01-03 08:46:12 -08:00
DL1231 60c445bdd5
MINOR: Improve code style (#15107)
Reviewers: Divij Vaidya <diviv@amazon.com>
2024-01-03 11:56:20 +01:00
Arpit Goyal 86a387c3c8
KAFKA-16063: Disable shutdownhook in MiniKdc (used for testing) (#15104)
This stops a memory leak in the tests caused by ApplicationShutdownHooks.

Reviewers: Divij Vaidya <diviv@amazon.com>
2024-01-02 21:11:18 +01:00
Divij Vaidya 65b1558532
KAFKA-16059: Fix thread leak KafkaAPIsTest (#15093)
Reviewers: Luke Chen <showuon@gmail.com>
2024-01-02 15:58:20 +01:00
Divij Vaidya bd6cb4db22
KAFKA-16052: Save heap in AbstractCoordinatorConcurrencyTest by creating real ReplicaManager (#15094)
Mockito keeps the invocation history in the test suite, which causes huge heap usage. Since the mock replicaManager is only used to bypass the replicaManager constructor without verifying or mocking anything, we create a real dummy replicaManager to avoid keeping the Mockito invocation history in memory.

Reviewers: Luke Chen <showuon@gmail.com>, Justine Olshan <jolshan@confluent.io>

Co-authored-by: Luke Chen <showuon@gmail.com>
2023-12-31 12:25:16 +01:00
wernerdv b3664119fd
KAFKA-16064: Improve ControllerApiTest (#15091)
This commit refactors ControllerApiTest to close an instance of ControllerApis in a tearDown method.

Reviewers: Divij Vaidya <diviv@amazon.com>
2023-12-30 21:49:25 +01:00
Luke Chen 0600ac00e9
KAFKA-16065: close DelayedFuturePurgatory in DelayedOperationTest (#15090)
Reviewers: Divij Vaidya <diviv@amazon.com>
2023-12-29 18:27:45 +01:00
Afshin Moazami 627aaef47e
MINOR: Duplicate method; The QuotaUtils one is used. (#15066)
It seems like this PR (https://github.com/apache/kafka/pull/8768) duplicated the implementation to QuotaUtils, but didn't remove this implementation and the private methods that it uses.

Reviewers: Justine Olshan <jolshan@confluent.io>
2023-12-28 16:01:30 -08:00
Luke Chen a465fb124f
KAFKA-16058: close controllerApi instance to avoid thread leaks (#15084)
The controllerApi will create some resources, including the reaper threads. In ControllerApisTest, we created it on many test cases, but didn't close it. This commit doesn't change anything in the business logic of the test, it just adds try/finally to close the controllerApi instance.
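
The try/finally pattern mentioned above looks roughly like this. It is a generic sketch: `Resource` is a hypothetical stand-in for a component that owns background threads, not the real ControllerApis class.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of closing a resource in a finally block inside a test. */
final class CloseInFinally {
    private CloseInFinally() {}

    /** Stand-in for a component that starts threads and must be closed. */
    static final class Resource implements AutoCloseable {
        final AtomicBoolean closed = new AtomicBoolean(false);

        @Override
        public void close() {
            closed.set(true);
        }
    }

    /** Runs the test body and always closes the resource, even if the body throws. */
    static void runAndClose(Resource resource, Runnable body) {
        try {
            body.run();
        } finally {
            resource.close();
        }
    }
}
```

The resource is closed whether the body passes or throws, so a failing assertion no longer leaves threads behind.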

Reviewers: Divij Vaidya <diviv@amazon.com>
2023-12-28 16:38:20 +01:00
Divij Vaidya a56b63e226
KAFKA-16053: Fix memory leaks due to KDC server in tests (#15079)
This commit closes the KDC server properly in `CustomQuotaCallbackTest` and `AclAuthorizerWithZkSaslTest`.

Reviewers: Justine Olshan <jolshan@confluent.io>
2023-12-28 10:55:14 +01:00
DL1231 f80686b4ac
MINOR: Improve code style (#15074)
Reviewers: Divij Vaidya <diviv@amazon.com>
2023-12-26 12:21:12 +01:00
IBeyondy 89f32ca6a1
MINOR: Fix NullPointerException in ReplicaFetcherThreadTest.testTruncateOnFetchDoesNotUpdateHighWatermark
Reviewers: Divij Vaidya <diviv@amazon.com>
2023-12-25 18:11:41 +01:00
Nikolay 417338ad77
KAFKA-16048: Fix ConfigCommandTest.shouldNotSupportAlterClientMetricsWithZookeeper (#15068)
Reviewers: Divij Vaidya <diviv@amazon.com>
2023-12-25 14:38:15 +01:00
Nikolay 45bd19f2ef
KAFKA-14588: Move ConfigType to server-common (#14867)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2023-12-22 18:35:27 +01:00
Rittika Adhikari 0bc736f3c4
MINOR: Refactor to only require one stopPartitions helper (#14662)
Reviewers: Divij Vaidya <diviv@amazon.com>
2023-12-22 17:13:22 +01:00
Philip Nee c963a71be0
KAFKA-16026: Send Poll event to the background thread (#15035)
Related to KAFKA-15818.

This is a bug in the AsyncKafkaConsumer poll loop: it does not send an event to the network thread to acknowledge the user's poll. This causes a few issues:

* Autocommit won't work without the user setting the timer.
* The member will just leave the group after the rebalance timeout and will never be able to rejoin.

In this PR, a few subtle changes are made to address this issue:

* Hook up the poll event to AsyncKafkaConsumer#poll. It is only fired once per invocation.
* Upon entering the stale state, we need to reset HeartbeatState; otherwise we will get an invalid request.
* We clear the current assignment and remove all assigned partitions once the heartbeat is sent. See the changes in onHeartbeatRequestSent.

Reviewers: David Jacot <djacot@confluent.io>, Bruno Cadonna <cadonna@apache.org>, Andrew Schofield <aschofield@confluent.io>
2023-12-22 15:21:39 +01:00
David Jacot f7ccd082f1
MINOR: Exit catcher should be reset after the cluster is shutdown (#15062)
I was investigating a build which failed with "exit 1". In the broker logs, I saw that the first call to exit was caught. However, a second one was not. See the logs below. The issue seems to be that we must first shut down the cluster before resetting the exit catcher. Otherwise, there is still a chance for the broker to call exit.

```
[2023-12-21 13:52:59,310] ERROR Shutdown broker because all log dirs in /tmp/kafka-2594137463116889965 have failed (kafka.log.LogManager:143)
[2023-12-21 13:52:59,312] ERROR test error (kafka.server.epoch.EpochDrivenReplicationProtocolAcceptanceWithIbp26Test:76)
java.lang.RuntimeException: halt(1, null) called!
	at kafka.server.QuorumTestHarness.$anonfun$setUp$4(QuorumTestHarness.scala:273)
	at org.apache.kafka.common.utils.Exit.halt(Exit.java:63)
	at kafka.utils.Exit$.halt(Exit.scala:33)
	at kafka.log.LogManager.handleLogDirFailure(LogManager.scala:224)
	at kafka.server.ReplicaManager.handleLogDirFailure(ReplicaManager.scala:2600)
	at kafka.server.ReplicaManager$LogDirFailureHandler.doWork(ReplicaManager.scala:324)
	at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131)
```

```
[2023-12-21 13:53:05,797] ERROR Shutdown broker because all log dirs in /tmp/kafka-7355495604650755405 have failed (kafka.log.LogManager:143)
```

Reviewers: Luke Chen <showuon@gmail.com>
2023-12-22 05:58:34 -08:00
David Jacot 654ac2528b
MINOR: Close RemoteLogManager in RemoteLogManagerTest (#15063)
This patch ensures that the RemoteLogManager is closed in RemoteLogManagerTest.

Reviewers: Divij Vaidya <diviv@amazon.com>, Lucas Brutschy <lbrutschy@confluent.io>
2023-12-22 05:54:48 -08:00
Luke Chen 82808873cb
KAFKA-16035: add tests for remoteLogSizeComputationTime/remoteFetchExpiresPerSec metrics (#15056)
These tests are removed in this commit because they are flaky.

After investigation, the causes are:
   1. remoteLogSizeComputationTime: It failed with "Expected to find 1000 for RemoteLogSizeComputationTime metric value, but found 0". If the verification thread is too slow and the second run of RLMTask has already started, the value is reset to 0. Fix it by adding a latch to wait for verification.
   2. remoteFetchExpiresPerSec: It failed with "The ExpiresPerSec value is not incremented. Current value is: 0". The remoteFetchExpiresPerSec metric is a static metric, and we remove all metrics in the tearDown method after each test completes. Once remoteFetchExpiresPerSec is removed, it is not created again like the other metrics. That is why it failed only sometimes in Jenkins: if a previous test had an expired remote fetch, the metric was created and then removed for good. Fix it by only removing it in afterAll.
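
The latch idea from the first item can be sketched as below. This is a simplified illustration with hypothetical names: a plain thread stands in for the periodic RLMTask, and an `AtomicLong` stands in for the metric.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: hold back the second run until the verifier has read the metric. */
final class LatchedMetricCheck {
    private LatchedMetricCheck() {}

    /** Returns the metric value seen by the verifier. */
    static long observeBeforeReset() {
        final AtomicLong metric = new AtomicLong(0);
        final CountDownLatch firstRunDone = new CountDownLatch(1);
        final CountDownLatch verified = new CountDownLatch(1);

        Thread task = new Thread(() -> {
            metric.set(1000);           // first run records the value
            firstRunDone.countDown();
            try {
                verified.await();       // wait until verification has happened
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            metric.set(0);              // second run resets the value
        });
        task.start();

        try {
            firstRunDone.await();
            long observed = metric.get(); // verification reads before the reset
            verified.countDown();
            task.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Without the `verified` latch, the reset could race with the read and the verifier could see 0.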

Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>,  Satish Duggana <satishd@apache.org>, Christo Lolov <lolovc@amazon.com>
2023-12-22 15:02:55 +08:00
Christo Lolov d4f3bf93d3
KAFKA-16014: Implement RemoteLogSizeBytes (#15050)
This pull request aims to implement RemoteLogSizeBytes from KIP-963.

Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>,  Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>
2023-12-22 15:00:44 +08:00
David Jacot 98aca56ee5
KAFKA-16040; Rename `Generic` to `Classic` (#15059)
People have raised concerns about using `Generic` as a name to designate the old rebalance protocol. We considered using `Legacy` but discarded it because there are still applications, such as Connect, using the old protocol. We settled on using `Classic` for the `Classic Rebalance Protocol`.

The changes in this patch are extremely mechanical. It basically replaces the occurrences of `generic` by `classic`.

Reviewers: Divij Vaidya <diviv@amazon.com>, Lucas Brutschy <lbrutschy@confluent.io>
2023-12-21 13:39:17 -08:00
David Jacot 79757b3081
KAFKA-14505; [3/N] Wire WriteTxnMarkers API (#14985)
This patch wires the handling of markers written by the transaction coordinator via the WriteTxnMarkers API. In the old group coordinator, the markers are written to the logs and the group coordinator is informed to materialize the changes as a second step if the writes were successful. This approach does not really work with the new group coordinator, mainly for two reasons: 1) The second step would actually fail while the coordinator is loading and there is no guarantee that the loading has picked up the write or not; 2) It does not fit well with the new memory model where the state is snapshotted by offset. In both cases, it seems that having a single writer to the `__consumer_offsets` partitions is more robust and preferable.

Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>
2023-12-21 10:59:41 -08:00
Jeff Kim 4613286076
KAFKA-16030: new group coordinator should check if partition goes offline during load (#15043)
The new coordinator stops loading if the partition goes offline during load. However, the partition is still considered active. Instead, we should return NOT_LEADER_OR_FOLLOWER exception during load.

Another change is that we only want to invoke CoordinatorPlayback#updateLastCommittedOffset if the current offset (last written offset) is greater than or equal to the current high watermark. This is to ensure that in the case the high watermark is ahead of the current offset, we don't clear snapshots prematurely.
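
The guard described above can be illustrated with a small sketch. `LoadProgress` and its method names are hypothetical and only model the condition; they are not the CoordinatorPlayback API, and advancing to the high watermark is an assumption.

```java
/** Hypothetical sketch of the "only advance once caught up" guard during loading. */
final class LoadProgress {
    private long lastCommittedOffset = -1L;

    /**
     * Advances the last committed offset only when replay has reached the high
     * watermark. Returns true if the offset was updated.
     */
    boolean maybeUpdateLastCommittedOffset(long currentOffset, long highWatermark) {
        if (currentOffset >= highWatermark) {
            lastCommittedOffset = highWatermark;
            return true;
        }
        // High watermark is ahead of what was replayed: keep snapshots for now.
        return false;
    }

    long lastCommittedOffset() {
        return lastCommittedOffset;
    }
}
```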

Reviewers: David Jacot <djacot@confluent.io>
2023-12-21 06:17:35 -08:00
Divij Vaidya 6250049e10
KAFKA-13950: Fix resource leak in error scenarios (#12228)
In multiple places in the code base, we do not properly close Closeable resources, especially when an exception is thrown. This change fixes several of these leaks.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Luke Chen <showuon@gmail.com>, Mickael Maison <mickael.maison@gmail.com>
2023-12-21 13:47:22 +01:00
David Jacot 75dcc8dadf
KAFKA-16036; Add `group.coordinator.rebalance.protocols` and publish all new configs (#15053)
This patch adds the group.coordinator.rebalance.protocols configuration which accepts a list of protocols to enable. At the moment, only generic and consumer are supported and it is not possible to disable generic yet. When consumer is enabled, the new consumer rebalance protocol (KIP-848) is enabled alongside the new group coordinator. This patch also publishes all the new configurations introduced by KIP-848.

Reviewers: Jeff Kim <jeff.kim@confluent.io>, Stanislav Kozlovski <stanislav@confluent.io>
2023-12-21 04:43:57 -08:00
Luke Chen d59d613258
KAFKA-16013: Throw an exception in DelayedRemoteFetch for follower fetch replicas. (#15015)
Follow-up for KAFKA-16013: Add metric for expiration rate of delayed remote fetch 

Reviewers: Nikhil Ramakrishnan <ramakrishnan.nikhil@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>,  Satish Duggana <satishd@apache.org>
2023-12-21 15:45:24 +08:00
Christo Lolov 1a97de2fe6
KAFKA-16002: Implement RemoteCopyLagSegments, RemoteDeleteLagBytes and RemoteDeleteLagSegments (#15005)
This pull request aims to implement RemoteCopyLagSegments, RemoteDeleteLagBytes and RemoteDeleteLagSegments from KIP-963.

Reviewers: Luke Chen <showuon@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
2023-12-21 14:27:12 +08:00
Ismael Juma 919b585da0
KAFKA-15874: Add metric and request log attribute for deprecated request api versions (KIP-896) (#15032)
Breakdown of this PR:
* Extend the generator to support deprecated api versions
* Set deprecated api versions via the request json files
* Expose the information via metrics and the request log

The relevant section of the KIP:

> * Introduce metric `kafka.network:type=RequestMetrics,name=DeprecatedRequestsPerSec,request=(api-name),version=(api-version),clientSoftwareName=(client-software-name),clientSoftwareVersion=(client-software-version)`
> * Add boolean field `requestApiVersionDeprecated`  to the request
header section of the request log (alongside `requestApiKey` ,
`requestApiVersion`, `requestApiKeyName` , etc.).

Unit tests were added to verify the new generator functionality,
the new metric and the new request log attribute.

Reviewers: Jason Gustafson <jason@confluent.io>
2023-12-20 05:13:36 -08:00
Luke Chen 4e11de00a7
KAFKA-16014: Add RemoteLogMetadataCount metric (#15026)
Reviewers: Christo Lolov <lolovc@amazon.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>
2023-12-20 14:21:30 +05:30
Viktor Somogyi-Vass 0e0282395d
KAFKA-15366: Modify LogDirFailureTest for KRaft (#14977)
Reviewers: Omnia G.H Ibrahim <o.g.h.ibrahim@gmail.com>, Ron Dagostino <rdagostino@confluent.io>, Igor Soarez <soarez@apple.com>
2023-12-19 21:02:49 -05:00
Philip Nee 5e37ec80f8
KAFKA-15696: Refactor closing consumer (#14937)
We drive the consumer closing via events, and rely on the still-live network thread to complete these operations.

This ticket encompasses several different tickets:
KAFKA-15696/KAFKA-15548

When closing the consumer, we need to perform a few tasks. Here is the top-level overview:
We want to keep the network thread alive until we are ready to shut down, i.e., until no more requests need to be sent out. To achieve this, I implemented a method, signalClose(), to signal the managers to prepare for shutdown. Once we signal the network thread to close, the manager will prepare the request to be sent out on the next event loop. The network thread can then be closed after issuing these events. The application thread's task is straightforward: 1. tell the background thread to perform n events, and 2. block on certain events until they succeed or the timer runs out. Once all requests are sent out, we close the network thread and other components as usual.

Here I outline the changes in detail:

* AsyncKafkaConsumer: shutdown procedures, and several utility functions to ensure proper exceptions are thrown during shutdown
* AsyncKafkaConsumerTest: I examined each individual test and fixed the ones that were blocking for too long or logging errors
* CommitRequestManager: signalClose()
* FetchRequestManagerTest: changes due to the change in pollOnClose()
* ApplicationEventProcessor: handle CommitOnClose and LeaveGroupOnClose. The latter triggers leaveGroup(), which should be completed on the next heartbeat (or we time out on the application thread)

Reviewers:  Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>
2023-12-19 13:20:33 +01:00
David Jacot 35e2d3c196
MINOR: Fix thread leak in AuthorizerIntegrationTest (#15006)
Producers and consumers could be leaked in the AuthorizerIntegrationTest. In the teardown logic, `removeAllClientAcls()` is called before calling the super teardown method. If `removeAllClientAcls()` fails, the super method does not have a chance to close the producers and consumers. Example of such a failure [here](https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14925/11/tests/).

As a new cluster is created for each test anyway, calling `removeAllClientAcls()` does not seem necessary. This patch removes it.

Reviewers: Jason Gustafson <jason@confluent.io>
2023-12-18 23:48:10 -08:00
Gantigmaa Selenge 7b21da9712
KAFKA-15158: Add metrics for RemoteDelete and BuildRemoteLogAuxState (#14375)
This PR implements part of KIP-963, specifically for adding new metrics.
The metrics added in this PR are:
    RemoteDeleteRequestsPerSec (emitted when expired log segments on remote storage are being deleted)
    RemoteDeleteErrorsPerSec (emitted when deleting expired log segments on remote storage fails)
    BuildRemoteLogAuxStateRequestsPerSec (emitted when building remote log aux state for replica fetchers)
    BuildRemoteLogAuxStateErrorsPerSec (emitted when building remote log aux state for replica fetchers fails)

Reviewers: Luke Chen <showuon@gmail.com>, Nikhil Ramakrishnan <ramakrishnan.nikhil@gmail.com>, Christo Lolov <lolovc@amazon.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Divij Vaidya <diviv@amazon.com>, Satish Duggana <satishd@apache.org>
2023-12-19 15:02:45 +08:00
Luke Chen c240993be2
KAFKA-16014: Add RemoteLogSizeComputationTime metric (#15021)
Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Christo Lolov <lolovc@amazon.com>
2023-12-18 21:39:43 +05:30
Lucas Brutschy 7aade70cc6
Revert "KAFKA-15764: Missing Tests for Transactions (#14702)" (#15029)
This reverts commit ed7ad6d.

We have been seeing a lot of failures of TransactionsWithTieredStoreTest.testTransactionsWithCompression on trunk, and they seem to have started with this PR. I see how this PR can influence the test via the change in TestUtils. The bad part is that it sometimes seems to kill the Gradle Executors completely. So I'd suggest reverting the change before investigating further to stabilize CI.

Reviewers: Bruno Cadonna <cadonna@apache.org>
2023-12-18 10:12:05 +01:00
Philip Nee a6076c71f6
KAFKA-16023: Disable flaky tests in PlaintextConsumerTest (#15025)
I observed several failed tests in PR builds. Let's first disable them and try to find a different way to test the async consumer with these tests.

Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
2023-12-17 10:43:45 +01:00
Justine Olshan ed7ad6d9d3
KAFKA-15764: Missing Tests for Transactions (#14702)
I ran this test 40 times without KAFKA-15653, both with and without compression enabled.
With compression it failed 39/40 times, and without compression it passed 40/40 times.

With KAFKA-15653 and compression, it passed 40/40 times locally.

Reviewers: Jason Gustafson <jason@confluent.io>
2023-12-15 09:41:20 -08:00
Andrew Schofield a23dae4e9a
KAFKA-15971: Re-enable consumer integration tests for new consumer (#14925)
The consumer integration tests were experimentally disabled for the new `AsyncKafkaConsumer` variant with the aim of improving build stability. Several improvements have been made to the consumer code and other tests which seem to have made a difference. This patch re-enables the tests.

Reviewers: David Jacot <djacot@confluent.io>
2023-12-15 05:16:54 -08:00
Nikhil Ramakrishnan 52496dcd38
KAFKA-16013: Add metric for expiration rate of delayed remote fetch (#15014)
Add metric for the number of expired remote fetches per second, and corresponding unit test to verify that the metric is marked on expiration.

kafka.server:type=DelayedRemoteFetchMetrics,name=ExpiresPerSec

Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Christo Lolov <lolovc@amazon.com>
2023-12-15 19:21:39 +08:00
Kirk True 9dc9040f33
KAFKA-15276: Implement event plumbing for ConsumerRebalanceListener callbacks (#14640)
This patch adds the logic for coordinating the `ConsumerRebalanceListener` callback invocations between the background thread (in `MembershipManagerImpl`) and the application thread (`AsyncKafkaConsumer`) and back again. It allowed us to enable more tests from `PlaintextConsumerTest` to exercise the code herein.

Reviewers: David Jacot <djacot@confluent.io>
2023-12-15 00:42:31 -08:00
Proven Provenzano b0e99b5593
KAFKA-15922: Bump MetadataVersion to support JBOD with KRaft (#14984)
Moves ELR from MetadataVersion IBP_3_7_IV3 into the new IBP_3_8_IV0 because the ELR feature was not completed before 3.7 reached feature freeze.  Leaves IBP_3_7_IV3 empty -- it is a no-op and is not reused for anything.  Adds the new MetadataVersion IBP_3_7_IV4 for the FETCH request changes from KIP-951, which were mistakenly never associated with a MetadataVersion.  Updates the LATEST_PRODUCTION MetadataVersion to IBP_3_7_IV4 to declare both KRaft JBOD and the KIP-951 changes ready for production use.

Reviewers: Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Ron Dagostino <rdagostino@confluent.io>, Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@apache.org>, Justine Olshan <jolshan@confluent.io>
2023-12-14 10:08:54 -05:00
Justine Olshan e4249b69bd
KAFKA-15784: Ensure atomicity of in memory update and write when transactionally committing offsets (#14774)
Rewrote the verification flow to pass a callback to execute after verification completes.
For the TxnOffsetCommit, we will call doTxnCommitOffsets. This allows us to do offset validations post verification.

I've reorganized the verification code and group coordinator code to make these code paths clearer. The followup refactor (https://issues.apache.org/jira/browse/KAFKA-15987) will further clean up the produce verification code.

Reviewers: Artem Livshits <alivshits@confluent.io>, Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>, Jun Rao <junrao@gmail.com>
2023-12-13 17:45:09 -08:00
Christo Lolov a87e86e015
KAFKA-15883: Implement RemoteCopyLagBytes (#14832)
This pull request implements the first in the list of metrics in KIP-963: Additional metrics in Tiered Storage.

Since each partition of a topic will be serviced by its own RLMTask we need an aggregator object for a topic. The aggregator object in this pull request is BrokerTopicAggregatedMetric. Since the RemoteCopyLagBytes is a gauge I have introduced a new GaugeWrapper. The GaugeWrapper is used by the metrics collection system to interact with the BrokerTopicAggregatedMetric. The RemoteLogManager interacts with the BrokerTopicAggregatedMetric directly.
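
A minimal sketch of the aggregation idea, assuming each partition's task reports its own lag and the gauge reads the sum. `AggregatedGaugeSketch` and its methods are hypothetical names; they do not reproduce BrokerTopicAggregatedMetric or GaugeWrapper.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch: per-partition values aggregated into one topic-level gauge value. */
final class AggregatedGaugeSketch {
    private final ConcurrentMap<Integer, Long> valueByPartition = new ConcurrentHashMap<>();

    /** Called by the task that services a partition. */
    void setValue(int partition, long value) {
        valueByPartition.put(partition, value);
    }

    /** Called when a partition is no longer serviced by this broker. */
    void removePartition(int partition) {
        valueByPartition.remove(partition);
    }

    /** Called by the metrics system when the gauge is read. */
    long value() {
        long sum = 0L;
        for (long v : valueByPartition.values()) {
            sum += v;
        }
        return sum;
    }
}
```

Keeping one entry per partition lets each task overwrite its own value without coordinating with the others.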

Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>
2023-12-14 09:21:37 +08:00
vamossagar12 a1e985d22f
KAFKA-15237: Implement write operation timeout (#14981)
This patch ensures that `offset.commit.timeout.ms` is enforced. It does so by adding a timeout to the CoordinatorWriteEvent.
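
As a generic illustration of putting a deadline on a pending write, the sketch below uses `CompletableFuture#orTimeout` (Java 9+). The `WriteWithTimeout` helper is hypothetical and does not show how CoordinatorWriteEvent implements its timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: fail a pending write if it does not complete in time. */
final class WriteWithTimeout {
    private WriteWithTimeout() {}

    /**
     * Returns the same future, which completes exceptionally with a
     * java.util.concurrent.TimeoutException if it is still pending after timeoutMs.
     */
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> write, long timeoutMs) {
        return write.orTimeout(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

A write that has already completed is unaffected; only writes still pending at the deadline fail.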

Reviewers: David Jacot <djacot@confluent.io>
2023-12-13 11:30:53 -08:00
Andrew Schofield b08fb14bed
KAFKA-15775: New consumer listTopics and partitionsFor (#14962)
Implement Consumer.listTopics and Consumer.partitionsFor in the new consumer. The topic metadata request manager already existed so this PR adds expiration to requests, removes some redundant state checking and adds tests.

Reviewers: Lucas Brutschy <lucasbru@apache.org>
2023-12-13 08:47:25 +01:00
Nikhil Ramakrishnan be531c681c
KAFKA-15695: Update the local log start offset of a log after rebuilding the auxiliary state (#14649)
Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>,  Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Alexandre Dupriez <alexandre.dupriez@gmail.com>
2023-12-12 21:43:42 +05:30
Philip Nee 5b478aebfd
KAFKA-15818: ensure leave group on max poll interval (#14873)
Currently, the max poll interval is not respected during consumer#poll. When the user stops polling the consumer, we should assume either the consumer is too slow to respond or is already dead. In either case, we should let the group coordinator kick the member out of the group and reassign its partitions after the rebalance timeout expires.

If the consumer comes back alive, we should send a heartbeat; the member will then be fenced and rejoin, and its partitions will be revoked.

This is the same behavior as the current implementation.

Reviewers: Lucas Brutschy <lucasbru@apache.org>, Bruno Cadonna <cadonna@apache.org>, Lianet Magrans <lianetmr@gmail.com>
2023-12-12 10:06:34 +01:00
Omnia Ibrahim 07490b929b
KAFKA-15365: Broker-side replica management changes (#14881)
Reviewers: Igor Soarez <soarez@apple.com>, Ron Dagostino <rndgstn@gmail.com>, Proven Provenzano <pprovenzano@confluent.io>
2023-12-11 09:34:22 -05:00
Lucas Brutschy 134eabee16
MINOR: fix leak in `GroupEndToEndAuthorizationTest` (#14975)
Session expiration in ZkClient can lead to a thread leak, and does fail CI on master.

This is happening in testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl, and possibly other tests.

Use try-with-resources to close ZkClient if this happens.

This does not fix the underlying session expiration in ZK.
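
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of try-with-resources closing a client even when the body fails. `FakeClient` is a hypothetical stand-in for any `AutoCloseable` such as a ZK client; it is not the test's actual code.

```java
class TryWithResourcesSketch {
    // Hypothetical client standing in for any AutoCloseable resource.
    static class FakeClient implements AutoCloseable {
        boolean closed = false;

        void doWork(boolean fail) {
            if (fail) {
                throw new IllegalStateException("session expired");
            }
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Runs the work and returns the client so the caller can inspect whether it was closed.
    static FakeClient run(boolean fail) {
        FakeClient ref = new FakeClient();
        try (FakeClient client = ref) {
            client.doWork(fail);
        } catch (IllegalStateException e) {
            // close() has already run by the time this handler executes.
        }
        return ref;
    }
}
```

Whether `doWork` succeeds or throws, the client is closed before control leaves the `try` statement, so no background thread owned by the client can outlive the test.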

Reviewers: David Jacot <djacot@confluent.io>
2023-12-11 09:05:03 +01:00
Andrew Schofield f80f991c79
KAFKA-15978: Update member information on HB response (#14945)
In the new consumer, the commit request manager and the membership manager are separate components. The commit request manager is initialised with group information that it uses to construct `OffsetCommit` requests. However, the initial value of the member ID is `""` in some cases. When the consumer joins the group, it receives a `ConsumerGroupHeartbeat` response which tells it the member ID. The member ID was not being passed to the commit request manager, so it sent invalid `OffsetCommit` requests that failed with `UNKNOWN_MEMBER_ID`.

Reviewers: Bruno Cadonna <cadonna@apache.org>, David Jacot <djacot@confluent.io>
2023-12-10 23:56:54 -08:00
David Jacot 131581a2b4
MINOR: Remove `SubscribedTopicRegex` field from `ConsumerGroupHeartbeatRequest` (#14956)
The support for regular expressions has not been implemented yet in the new consumer group protocol. This patch removes the `SubscribedTopicRegex` from the `ConsumerGroupHeartbeatRequest` in preparation for 3.7. It seems better to bump the version and add it back when we implement the feature, as part of https://issues.apache.org/jira/browse/KAFKA-14517, instead of having an unused field in the request.

Reviewers: Sagar Rao <sagarmeansocean@gmail.com>, Justine Olshan <jolshan@confluent.io>
2023-12-10 23:53:08 -08:00
TapDang cbc882ba07
KAFKA-15714: KRaft support in DynamicNumNetworkThreadsTest (#14970)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2023-12-10 13:33:01 +01:00
Igor Soarez 8c184b4743
MINOR: Fix some AssignmentsManager bugs (#14954)
- Add proper start & stop for AssignmentsManager's event loop
- Dedupe queued duplicate assignments
- Fix bug where directory ID is resolved too late

Co-authored-by: Gaurav Narula <gaurav_narula2@apple.com>
Reviewers: Colin P. McCabe <cmccabe@apache.org>
2023-12-08 15:37:23 -08:00
Proven Provenzano 02d9f46f3a
MINOR: allow JBOD during ZK migration (#14968)
Allow using JBOD during ZK migration if MetadataVersion is at or above 3.7-IV2.

Reviewers: Ron Dagostino <rndgstn@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>
2023-12-08 14:38:57 -08:00
Igor Soarez 9de72daa50
KAFKA-15361: Migrating brokers must register with directory list (#14976)
KAFKA-15361 (#14838) introduced a check for a non-empty directory list on broker registration requests
from MetadataVersion.IBP_3_7_IV2 or later, which enables directory assignment. However, ZK brokers
weren't yet registering with a directory list. This patch addresses that. We also make the
directory list non-optional in BrokerLifecycleManager.

Reviewers: Ron Dagostino <rndgstn@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>
2023-12-08 10:16:48 -08:00
vamossagar12 e6e7d8c09f
KAFKA-14516: [3/3] Integration Test - Static Member Removed After Session Timeout (#14911)
This new integration test verifies that a static member who temporarily left the group is removed after the session timeout expires. It also verifies that a new static member with the same instance id can't join the group until the previous static member has expired.

Reviewers: David Jacot <djacot@confluent.io>
2023-12-08 04:59:10 -08:00
David Jacot 0ad059d101
MINOR: Fix leak thread in DeleteTopicTest.testIncreasePartitionCountDuringDeleteTopic (#14960)
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
2023-12-08 04:34:26 -08:00
David Jacot 38c873b80f
MINOR: Avoid leaking threads in DelegationTokenEndToEndAuthorizationWithOwnerTest.testDescribeTokenForOtherUserFails (#14959)
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
2023-12-07 23:23:08 -08:00
Omnia Ibrahim ec92410e59
KAFKA-15363: Broker log directory failure changes (#14790)
Part of JBOD KIP-858, https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft

Reviewers: Igor Soarez <i@soarez.me>, Colin P. McCabe <cmccabe@apache.org>, Ron Dagostino <rdagostino@confluent.io>
2023-12-07 20:44:56 -05:00
Lucas Brutschy 02915a2c5e
KAFKA-15977: Fix leak in DelegationTokenEndToEndAuthorizationWithOwnerTest (#14939)
DelegationTokenEndToEndAuthorizationWithOwnerTest can leak a thread, causing problems with many tests.

This is due to an admin client that isn't being closed when a (flaky) test fails. Use the Scala utility `Using` to close the auto-closeable admin client in case the validation fails.

Reviewers: David Jacot <djacot@confluent.io>, Bruno Cadonna <cadonna@apache.org>
2023-12-07 21:37:23 +01:00
Colin P. McCabe c062e5a1f9 HOTFIX: fix scala 2.12 build again 2023-12-07 12:03:02 -08:00
Igor Soarez c515bf51f8 KAFKA-15426: Process and persist directory assignments
Handle AssignReplicasToDirs requests, persist metadata changes
with new directory assignments and possible leader elections.

Reviewers: Proven Provenzano <pprovenzano@confluent.io>, Ron Dagostino <rndgstn@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
2023-12-07 11:44:45 -08:00
Colin Patrick McCabe 969bc7749c
KAFKA-15980: Add the CurrentControllerId metric (#14749)
Add the CurrentControllerId metric as described in KIP-1001. This gives us an easy way to identify the current controller by looking at the metrics of any Kafka node (broker or controller).

Reviewers: David Arthur <mumrah@gmail.com>
2023-12-06 21:03:33 -08:00
Apoorv Mittal dc09d7a4e0
KAFKA-15684: Support to describe all client metrics resources (KIP-714) (#14933)
Improves the KIP-1000 support for listing client metrics resources in KafkaApis.scala, and uses the functionality exposed by KIP-1000 to support the describe-all client metrics resources operation for KIP-714.

Reviewers: Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>
2023-12-06 11:09:42 -08:00
Andrew Schofield 8ed53a15ee
KAFKA-15932: Wait for responses in consumer operations (#14912)
The Kafka consumer makes a variety of requests to brokers such as fetching committed offsets and updating metadata. In the LegacyKafkaConsumer, the approach is typically to prepare RPC requests and then poll the network to wait for responses. In the AsyncKafkaConsumer, the approach is to enqueue an ApplicationEvent for processing by one of the request managers on the background thread. However, it is still important to wait for responses rather than spinning and enqueuing more events for the request managers before they have had a chance to respond.

In general, the behaviour will not be changed by this code. The PlaintextConsumerTest.testSeek test was flaky because operations such as KafkaConsumer.position were not properly waiting for a response which meant that subsequent operations were being attempted in the wrong state. This test is no longer flaky.

Reviewers: Kirk True <ktrue@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Bruno Cadonna <cadonna@apache.org>
2023-12-06 18:47:26 +01:00
Jeff Kim b888fa1ec9
KAFKA-15910: New group coordinator needs to generate snapshots while loading (#14849)
After the new coordinator loads a __consumer_offsets partition, it logs the following exception when performing a read operation (fetch/list groups, etc.):

 ```
java.lang.RuntimeException: No in-memory snapshot for epoch 740745. Snapshot epochs are:
at org.apache.kafka.timeline.SnapshotRegistry.getSnapshot(SnapshotRegistry.java:178)
at org.apache.kafka.timeline.SnapshottableHashTable.snapshottableIterator(SnapshottableHashTable.java:407)
at org.apache.kafka.timeline.TimelineHashMap$ValueIterator.<init>(TimelineHashMap.java:283)
at org.apache.kafka.timeline.TimelineHashMap$Values.iterator(TimelineHashMap.java:271)
```
 
This happens because we don't have a snapshot at the last updated high watermark after loading. We cannot generate a snapshot at the high watermark after loading all batches because it may contain records that have not yet been committed. We also don't know where the high watermark will advance up to so we need to generate a snapshot for each offset the loader observes to be greater than the current high watermark. Then once we add the high watermark listener and update the high watermark we can delete all of the older snapshots. 
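
The bookkeeping described above can be sketched as follows. This is a simplified illustration with hypothetical names (`LoaderSnapshotSketch`, string placeholders instead of real snapshots) and does not use the actual SnapshotRegistry API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of snapshot bookkeeping while loading a partition.
class LoaderSnapshotSketch {
    // offset -> placeholder for the state snapshot taken at that offset
    private final TreeMap<Long, String> snapshots = new TreeMap<>();
    private long highWatermark;

    LoaderSnapshotSketch(long initialHighWatermark) {
        this.highWatermark = initialHighWatermark;
    }

    // Called for each offset the loader observes; snapshots are only
    // needed for offsets above the current high watermark.
    void onOffsetLoaded(long offset) {
        if (offset > highWatermark) {
            snapshots.put(offset, "state@" + offset);
        }
    }

    // Called once the high watermark listener reports a new value:
    // every snapshot strictly older than the new high watermark is dropped.
    void onHighWatermarkUpdated(long newHighWatermark) {
        highWatermark = newHighWatermark;
        snapshots.headMap(newHighWatermark, false).clear();
    }

    List<Long> snapshotOffsets() {
        return new ArrayList<>(snapshots.keySet());
    }
}
```

With an initial high watermark of 10 and loaded offsets 8, 11, 12 and 13, the sketch keeps snapshots for 11, 12 and 13; advancing the high watermark to 12 then leaves only 12 and 13, so a read at the high watermark always finds a snapshot.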

Reviewers: David Jacot <djacot@confluent.io>
2023-12-06 08:38:05 -08:00
Lucas Brutschy c575ba238d
KAFKA-15280: Implement client support for KIP-848 server-side assignors (#14878)
* Validate the client’s configuration for server-side assignor selection defined in config group.remote.assignor
* Include the assignor taken from config in the ConsumerGroupHeartbeat request, in the ServerAssignor field 
* Properly handle UNSUPPORTED_ASSIGNOR errors that may be returned in the HB response if the server does not support the assignor defined by the consumer.
Includes simple integration tests for sending an invalid assignor to the broker, and for using the range assignor with a single consumer.

Reviewers: David Jacot <djacot@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Bruno Cadonna <cadonna@apache.org>
2023-12-06 15:22:11 +01:00
Kamal Chandraprakash f05b342b39
MINOR: Allow local-log segment deletion when log-start-offset incremented. (#14905)
DELETE_RECORDS API can move the log-start-offset beyond the highest-copied-remote-offset. In such cases, we should allow deletion of local-log segments since they won't be eligible for upload to remote storage.

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>
2023-12-06 16:59:16 +05:30
Andrew Schofield 587f50d48f
KAFKA-15831: KIP-1000 protocol and admin client (#14811)
This adds the new ListClientMetricsResources RPC to the Kafka protocol and puts support
into the Kafka admin client. The broker-side implementation in this PR is just to return an empty
list. A future PR will obtain the list from the config store.

Includes a few unit tests for what is a very simple RPC. There are additional tests already written and
waiting for the PR that delivers the kafka-client-metrics.sh tool which builds on this PR.

Reviewers: Jun Rao <junrao@gmail.com>
2023-12-05 07:14:06 -08:00
vamossagar12 0f56eeb046
KAFKA-14516: [2/N] Integration Test - Static Member Gets Assignment Back (#14882)
This patch adds an integration test which verifies that a static member gets its previous assignment back when rejoining.

Reviewers: David Jacot <djacot@confluent.io>
2023-12-05 04:36:15 -08:00
Nikolay 783698c525
KAFKA-15645: Move ReplicationQuotasTestRig to tools module (#14588)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Justine Olshan <jolshan@confluent.io>, Taras Ledkov <tledkov@apache.org>
2023-12-05 10:03:33 +01:00
David Jacot 34e1dbbaba
MINOR: Add Uniform assignor to the default config (#14826)
This patch adds the `Uniform` assignor to the default list of supported assignors. It also makes small changes in the code.

Reviewers: Justine Olshan <jolshan@confluent.io>
2023-12-05 00:32:50 -08:00
David Jacot 26274afd05
MINOR: Ensure that DisplayName is set in all parameterized tests (#14850)
This is a follow-up to https://github.com/apache/kafka/pull/14687 as we found out that some parameterized tests do not include the test method name in their name. For context, the JUnit XML report does not include the name of the method by default but only relies on the display name provided.

Reviewers: David Arthur <mumrah@gmail.com>
2023-12-04 23:58:48 -08:00
David Jacot b46505c8de
KAFKA-15061; CoordinatorPartitionWriter should reuse buffer (#14885)
This patch adds a ThreadLocal with a GrowableBufferSupplier so that each writing thread can reuse the same buffer instead of allocating a new one for each write. The patch relies on existing tests.
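
The buffer-reuse idea can be illustrated with plain JDK types. The sketch below is an assumption-laden stand-in: it uses a `ThreadLocal<ByteBuffer>` that grows on demand rather than Kafka's GrowableBufferSupplier, and the class name and initial capacity are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical per-thread reusable buffer; not the CoordinatorPartitionWriter code.
class ReusableBufferSketch {
    // One buffer per writing thread, allocated lazily on first use.
    private static final ThreadLocal<ByteBuffer> BUFFER =
        ThreadLocal.withInitial(() -> ByteBuffer.allocate(1024));

    // Returns this thread's buffer, growing it only when the request does not fit.
    static ByteBuffer get(int minCapacity) {
        ByteBuffer buf = BUFFER.get();
        if (buf.capacity() < minCapacity) {
            buf = ByteBuffer.allocate(minCapacity);
            BUFFER.set(buf);
        }
        buf.clear(); // reset position and limit so the buffer can be refilled
        return buf;
    }
}
```

Repeated writes on the same thread get the same `ByteBuffer` instance back, so the steady state performs no allocation per write; a larger request replaces the thread's buffer once and is reused afterwards.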

Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>
2023-12-04 23:56:52 -08:00
David Jacot b335ed954e
MINOR: Add @Timeout annotation to consumer integration tests (#14896)
In this [build](https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14826/11/pipeline/12/), the following test hung forever.

```
Gradle Test Run :core:test > Gradle Test Executor 93 > PlaintextConsumerTest > testSeek(String, String) > testSeek(String, String).quorum=kraft+kip848.groupProtocol=consumer STARTED
```

As the new consumer is not extremely stable yet, we should add a Timeout to all those integration tests to ensure that builds are not blocked unnecessarily.

Reviewers: Andrew Schofield <aschofield@confluent.io>, Justine Olshan <jolshan@confluent.io>
2023-12-04 23:55:39 -08:00
Colin Patrick McCabe ebae7b26b5
MINOR: fix bug where we weren't registering SnapshotEmitterMetrics (#14918)
Fix a bug where we weren't properly exposing SnapshotEmitterMetrics. Add a test.

Reviewers: David Arthur <mumrah@gmail.com>
2023-12-04 21:32:12 -08:00