kafka

Commit Graph

Author	SHA1	Message	Date
Lucas Brutschy	7ac50a8611	KAFKA-16152: Fix PlaintextConsumerTest.testStaticConsumerDetectsNewPartitionCreatedAfterRestart (#15419 ) The group coordinator expects the instance ID to always be sent when leaving the group in a static membership configuration, see `ea94507679/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java (L814)` The failure was silent, because the group coordinator does not log failed requests and the consumer doesn't wait for the heartbeat response during close. Reviewers: Matthias J. Sax <matthias@confluent.io>, Kirk True <ktrue@confluent.io>, Bruno Cadonna <cadonna@apache.org>	2024-02-23 16:43:50 +01:00
Omnia Ibrahim	ead2431c37	MINOR: Remove unwanted debug line in LogDirFailureTest (#15371 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Justine Olshan <jolshan@confluent.io>, Igor Soarez <soarez@apple.com>	2024-02-20 11:25:17 +01:00
Lucas Brutschy	5854139cd8	KAFKA-16243: Make sure that we do not exceed max poll interval inside poll (#15372 ) The consumer keeps a poll timer, which is used to ensure liveness of the application thread. In the legacy consumer, the poll timer is updated automatically while the Consumer.poll(Duration) method is blocked, whereas the newer consumer only updates the poll timer when a new call to Consumer.poll(Duration) is issued. This means that the kafka-console-consumer.sh tool, which uses a very long timeout by default, works differently with the new consumer, with the consumer proactively rejoining the group during long poll timeouts. This change solves the problem by (a) repeatedly sending PollApplicationEvents to the background thread, not just on the first call of poll and (b) making sure that the application thread doesn't block for so long that it runs out of max.poll.interval. An integration test is added to make sure that we do not rejoin the group when a long poll timeout is used with a low max.poll.interval. Reviewers: Lianet Magrans <lianetmr@gmail.com>, Andrew Schofield <aschofield@confluent.io>, Bruno Cadonna <cadonna@apache.org>	2024-02-20 10:48:36 +01:00
runom	a26a1d847f	MINOR: fix MetricsTest.testBrokerTopicMetricsBytesInOut (#14744 ) The assertion to check BytesOut doesn't include replication was performed before replication occurred. This PR fixed the position of the assertion. Reviewers: Luke Chen <showuon@gmail.com>	2024-02-20 11:23:06 +08:00
Josep Prat	b71999be95	MINOR: Clean up core modules (#15279 ) This PR cleans up: metrics, migration, network, raft, security, serializer, tools, utils, and zookeeper package classes Mark methods and fields private where possible Annotate public methods and fields Remove unused classes and methods Make sure Arrays are not printed with .toString Optimize minor warnings Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-02-19 16:54:50 +01:00
Lucas Brutschy	1442862bbd	KAFKA-16009: Fix PlaintextConsumerTest.testMaxPollIntervalMsDelayInRevocation (#15383 ) The wake-up mechanism in the new consumer prevents committing within a rebalance listener callback. The reason is that we are trying to register two wake-uppable actions at the same time. The fix is to register the wake-uppable action closer to where we actually block on it, so that the action is not registered when we execute rebalance listeners and callback listeners. Reviewers: Bruno Cadonna <cadonna@apache.org>	2024-02-19 15:33:37 +01:00
Kirk True	051d4274da	KAFKA-16167: re-enable PlaintextConsumerTest.testAutoCommitOnCloseAfterWakeup (#15358 ) This integration test is now passing, presumably based on recent related changes. Re-enabling to ensure it is included in the test suite to catch any regressions. Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2024-02-16 09:05:02 +01:00
Lucas Brutschy	e8c70fce26	KAFKA-16155: Re-enable testAutoCommitIntercept (#15334 ) The main bug causing this test to fail as described in the ticket was already fixed. The test is still flaky if unchanged, because in the new consumer, the assignment can change in between two polls. Interceptors are only executed inside poll (and have to be, since they must run as part of the application thread), so we need to modify the integration test to call poll once after observing that the assignment changed. Reviewers: Bruno Cadonna <bruno@confluent.io>	2024-02-14 16:09:48 +01:00
Omnia Ibrahim	be6653c8bc	KAFKA-16225 [1/N]: Set metadata.log.dir to broker in KRAFT mode in integration test Fix the flakiness of LogDirFailureTest by setting a separate metadata.log.dir for brokers in KRAFT mode. The test was flaky because calling causeLogDirFailure sometimes impacts the first log.dir, which is also KafkaConfig.metadataLogDir when metadata.log.dir is not set. To fix the flakiness, we need to explicitly set metadata.log.dir to a different log dir than the ones the tests could potentially fail. This is part 1 of the fixes. Delivering them separately as the other issues were not as clear cut. Reviewers: Gaurav Narula <gaurav_narula2@apple.com>, Justine Olshan <jolshan@confluent.io>, Greg Harris <greg.harris@aiven.io>	2024-02-13 14:13:53 -08:00
Mickael Maison	0bf830fc9c	KAFKA-14576: Move ConsoleConsumer to tools (#15274 ) Reviewers: Josep Prat <josep.prat@aiven.io>, Omnia Ibrahim <o.g.h.ibrahim@gmail.com>	2024-02-13 19:24:07 +01:00
Gantigmaa Selenge	fed3c3da84	KAFKA-14822: Allow restricting File and Directory ConfigProviders to specific paths (#14995 ) Reviewers: Greg Harris <gharris1727@gmail.com>, Mickael Maison <mickael.maison@gmail.com>	2024-02-13 18:28:28 +01:00
Divij Vaidya	011d238268	MINOR: Fix package name for FetchFromFollowerIntegrationTest (#15353 ) Reviewers: Omnia Ibrahim <o.g.h.ibrahim@gmail.com>, Josep Prat <josep.prat@aiven.io>	2024-02-13 12:10:49 +01:00
Nikolay	88c5543ccf	KAFKA-14589: [1/3] Tests of ConsoleGroupCommand rewritten in java (#15256 ) This PR is part of #14471. It contains some of the ConsoleGroupCommand tests rewritten in Java. The intention of a separate PR is to reduce changes and simplify review. Reviewers: Luke Chen <showuon@gmail.com>	2024-02-13 11:02:36 +08:00
ghostspiders	5cfcc52fb3	KAFKA-16239: Clean up references to non-existent IntegrationTestHelper (#15352 ) Co-authored-by: ghostspiders <yufeng.gao@seres.cn> Reviewers: Divij Vaidya <diviv@amazon.com>	2024-02-12 13:27:47 +01:00
Gyeongwon, Do	489a7dd71e	MINOR: Improve Code Style (#15319 ) - Removing ! and Unused Imports - Put a space after the control structure's defining keyword. - remove unnecessary whitespace a space after the method name in higher-order function invocations. Reviewers: Divij Vaidya <diviv@amazon.com>	2024-02-09 12:07:20 +01:00
David Arthur	116bc000c8	MINOR: fix scala compile issue (#15343 ) Reviewers: David Jacot <djacot@confluent.io>	2024-02-08 15:44:42 -08:00
David Arthur	c000b1fae2	MINOR: Fix some MetadataDelta handling issues during ZK migration (#15327 ) Reviewers: Colin P. McCabe <cmccabe@apache.org>	2024-02-07 12:54:59 -08:00
Gyeongwon, Do	a63131aab8	KAFKA-15717: Added KRaft support in LeaderEpochIntegrationTest (#15225 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-02-05 16:57:10 +01:00
Ritika Reddy	68745ef21a	KAFKA-15460: Add group type filter to List Groups API (#15152 ) This patch adds the support for filtering groups by types (Classic or Consumer) to both the old and the new group coordinators. Reviewers: David Jacot <djacot@confluent.io>	2024-02-05 00:56:39 -08:00
Gaurav Narula	3db14ec62a	KAFKA-16157: fix topic recreation handling with offline disks (#15263 ) In Kraft mode, the broker fails to handle topic recreation correctly with broken disks. This is because ReplicaManager tracks HostedPartitions which are on an offline disk but it doesn't associate TopicId information with them. This change updates HostedPartition.Offline to associate topic id information. We also update the log creation logic in Partition::createLogInAssignedDirectoryId to not just rely on targetLogDirectoryId == DirectoryId.UNASSIGNED to determine if the log to be created is "new". Please refer to the comments in https://issues.apache.org/jira/browse/KAFKA-16157 for more information. Reviewers: Luke Chen <showuon@gmail.com>, Omnia Ibrahim <o.g.h.ibrahim@gmail.com>, Gaurav Narula <gaurav_narula2@apple.com>	2024-02-03 14:40:40 +08:00
Colin Patrick McCabe	4169ac9f5d	KAFKA-16180: Fix UMR and LAIR handling during ZK migration (#15293 ) While migrating from ZK mode to KRaft mode, the broker passes through a "hybrid" phase, in which it receives LeaderAndIsrRequest and UpdateMetadataRequest RPCs from the KRaft controller. For the most part, these RPCs can be handled just like their traditional equivalents from a ZK-based controller. However, there is one thing that is different: the way topic deletions are handled. In ZK mode, there is a "deleting" state which topics enter prior to being completely removed. Partitions stay in this state until they are removed from the disks of all replicas. And partitions associated with these deleting topics show up in the UMR and LAIR as having a leader of -2 (which is not a valid broker ID, of course, because it's negative). When brokers receive these RPCs, they know to remove the associated partitions from their metadata caches and disks. When a full UMR or LAIR is sent, deleting partitions are included as well. In hybrid mode, in contrast, there is no "deleting" state. Topic deletion happens immediately. We can do this because we know that we have topic IDs that are never reused. This means that we can always tell the difference between a broker that had an old version of some topic, and a broker that has a new version that was re-created with the same name. To make this work, when handling a full UMR or LAIR, hybrid brokers must compare the full state that was sent over the wire to their own local state, and adjust accordingly. Prior to this PR, the code for handling those adjustments had several major flaws. The biggest flaw is that it did not correctly handle the "re-creation" case where a topic named FOO appears in the RPC, but with a different ID than the broker's local FOO. Another flaw is that a problem with a single partition would prevent handling the whole request. 
In ZkMetadataCache.scala, we handle full UMR requests from KRaft controllers by rewriting the UMR so that it contains the implied deletions. I fixed this code so that deletions always appear at the start of the list of topic states. This is important for the re-creation case since it means that a single request can both delete the old FOO and add a new FOO to the cache. Also, rather than modifying the request in-place, as the previous code did, I build a whole new request with the desired list of topic states. This is much safer because it avoids unforeseen interactions with other parts of the code that deal with requests (like request logging). While this new copy may sound expensive, it should actually not be. We are doing a "shallow copy" which references the topic state entries of the previous list. I also reworked ZkMetadataCache.updateMetadata so that if a partition is re-created, it does not appear in the returned set of deleted TopicPartitions. Since this set is used only by the group manager, this seemed appropriate. (If I was in the consumer group for the previous iteration of FOO, I should still be in the consumer group for the new iteration.) On the ReplicaManager.scala side, we handle full LAIR requests by treating anything which does not appear in them as a "stray replica." (But we do not rewrite the request objects as we do with UMR.) I moved the logic for finding stray replicas from ReplicaManager into LogManager. It makes more sense there, since the information about what is on-disk is managed in LogManager. Also, the stray replica detection logic for KRaft mode is there, so it makes sense to put the stray replica detection logic for hybrid mode there as well. Since the stray replica detection is now in LogManager, I moved the unit tests there as well. Previously some of those tests had been in BrokerMetadataPublisherTest for historical reasons. The main advantage of the new LAIR logic is that it takes topic ID into account. 
A replica can be a stray even if the LAIR contains a topic of the given name, but a different ID. I also moved the stray replica handling earlier in the becomeLeaderOrFollower function, so that we could correctly handle the "delete and re-create FOO" case. Reviewers: David Arthur <mumrah@gmail.com>	2024-02-02 15:49:10 -08:00
Gaurav Narula	3d95a69a28	KAFKA-16195: ignore metadata.log.dir failure in ZK mode (#15262 ) In KRaft mode, or on ZK brokers that are migrating to KRaft, we have a local __cluster_metadata log. This log is stored in a single log directory which is configured via metadata.log.dir. If there is no metadata.log.dir given, it defaults to the first entry in log.dirs. In the future we may support multiple metadata log directories, but we don't yet. For now, we must abort the process when this log directory fails. In ZK mode, it is not necessary to abort the process when this directory fails, since there is no __cluster_metadata log there. This PR changes the logic so that we check for whether we're in ZK mode and do not abort in that scenario (unless we lost the final remaining log directory, of course). Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Proven Provenzano <pprovenzano@confluent.io>	2024-02-02 09:47:14 -08:00
Zihao Lin	dfb903fb8d	KAFKA-15728: KRaft support in DescribeUserScramCredentialsRequestNotAuthorizedTest (#14736 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-02-02 15:20:14 +01:00
David Arthur	12ce9c7f98	KAFKA-16216: Reduce batch size for initial metadata load during ZK migration During migration from ZK mode to KRaft mode, there is a step where the kcontrollers load all of the data from ZK into the metadata log. Previously, we were using a batch size of 1000 for this, but 200 seems better. This PR also adds an internal configuration to control this batch size, for testing purposes. Reviewers: Colin P. McCabe <cmccabe@apache.org>	2024-02-01 15:48:52 -08:00
David Jacot	6c09cc9586	KAFKA-16189; Extend admin to support ConsumerGroupDescribe API (#15253 ) This patch extends the Admin client to support describing new consumer groups with the ConsumerGroupDescribe API introduced in KIP-848. Users will continue to use the `Admin#describeConsumerGroups` API. The admin client does all the magic. Basically, the admin client always tries to describe the requested groups with the ConsumerGroupDescribe API to start with. If all the groups are there, great, the job is done. If there are unresolved groups due to an UNSUPPORTED_VERSION or GROUP_ID_NOT_FOUND error, the admin client retries them with the DescribeGroups API. The patch also adds fields to the data structure returned by `Admin#describeConsumerGroups` as stated in the KIP. Reviewers: Andrew Schofield <aschofield@confluent.io>, Bruno Cadonna <bruno@confluent.io>	2024-02-01 00:30:56 -08:00
Omnia Ibrahim	127fe7d276	KAFKA-15853: Move AuthorizerUtils and its dependencies to server module (#15167 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-31 15:38:14 +01:00
Josep Prat	cfc8257479	MINOR: Clean up core server classes (#15272 ) * MINOR: Clean up core server classes Mark methods and fields private where possible Annotate public methods and fields Remove unused classes and methods Make sure Arrays are not printed with .toString Optimize minor warnings Remove unused apply method Signed-off-by: Josep Prat <josep.prat@aiven.io> Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-31 13:52:22 +01:00
Gantigmaa Selenge	cdd9c62c55	KAFKA-15711: KRaft support in LogRecoveryTest (#14693 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Zihao Lin	2024-01-31 11:34:42 +01:00
David Jacot	6dd517daac	KAFKA-14505; [6/N] Avoid rescheduling callback in request thread (#15176 ) This patch removes the extra hop via the request thread when the new group coordinator verifies a transaction. Prior to it, the ReplicaManager would automatically re-schedule the callback to a request thread. However, the new group coordinator does not need this as it already schedules the write into its own thread. With this patch, the decision to re-schedule on a request thread or not is left to the caller. Reviewers: Artem Livshits <alivshits@confluent.io>, Justine Olshan <jolshan@confluent.io>	2024-01-30 23:27:11 -08:00
Apoorv Mittal	016bd682fe	KAFKA-16186: Broker metrics for client telemetry (KIP-714) (#15251 ) Add the broker metrics defined in KIP-714. Reviewers: Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>	2024-01-30 15:03:09 -08:00
Mickael Maison	3e9ef70853	KAFKA-15853: Move PasswordEncoder to server-common (#15246 ) Reviewers: Luke Chen <showuon@gmail.com>, Omnia Ibrahim <o.g.h.ibrahim@gmail.com>	2024-01-30 19:08:50 +01:00
Gaurav Narula	4c6f975ab3	KAFKA-16162: resend broker registration on metadata update to IBP 3.7-IV2 We update metadata update handler to resend broker registration when metadata has been updated to >= 3.7IV2 so that the controller becomes aware of the log directories in the broker. We also update DirectoryId::isOnline to return true on an empty list of log directories while the controller awaits broker registration. Co-authored-by: Proven Provenzano <pprovenzano@confluent.io> Reviewers: Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>	2024-01-30 10:00:07 -08:00
Gaurav Narula	9e4a4a2821	KAFKA-16204: Create partition dir for mockLog (#15288 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Omnia Ibrahim <o.g.h.ibrahim@gmail.com>	2024-01-30 15:53:24 +01:00
Josep Prat	50940fa537	MINOR: Fixes broken build (#15290 ) Because of lack of implicit conversions, boolean properties need to be passed as Strings This is done in other parts of the code already Signed-off-by: Josep Prat <josep.prat@aiven.io>	2024-01-30 13:29:19 +01:00
Colin P. McCabe	f7feb43af3	KAFKA-14616: Fix stray replica of recreated topics in KRaft mode When a broker is down, and a topic is deleted, this will result in that broker seeing "stray replicas" the next time it starts up. These replicas contain data that used to be important, but which now needs to be deleted. Stray replica deletion is handled during the initial metadata publishing step on the broker. Previously, we deleted these stray replicas after starting up BOTH LogManager and ReplicaManager. However, this wasn't quite correct. The presence of the stray replicas confused ReplicaManager. Instead, we should delete the stray replicas BEFORE starting ReplicaManager. This bug triggered when a topic was deleted and re-created while a broker was down, and some of the replicas of the re-created topic landed on that broker. The impact was that the stray replicas were deleted, but the new replicas for the next iteration of the topic never got created. This, in turn, led to persistent under-replication until the next time the broker was restarted. Reviewers: Luke Chen <showuon@gmail.com>, Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Gaurav Narula <gaurav_narula2@apple.com>	2024-01-29 22:36:09 -08:00
Gantigmaa Selenge	5ad1191466	KAFKA-15720: KRaft support in DeleteTopicTest (#14846 ) Reviewers: Ziming Deng <dengziming1993@gmail.com>	2024-01-30 11:34:15 +08:00
David Arthur	16ed7357b1	KAFKA-16171: Fix ZK migration controller race #15238 This patch causes the active KRaftMigrationDriver to reload the /migration ZK state after electing itself as the leader in ZK. This closes a race condition where the previous active controller could make an update to /migration after the new leader was elected. The update race was not actually a problem regarding the data since both controllers would be syncing the same state from KRaft to ZK, but the change to the znode causes the new controller to fail on the zk version check on /migration. This patch also fixes an as-yet-unseen bug where an active controller that failed to elect itself via claimControllerLeadership would not retry. Reviewers: Colin P. McCabe <cmccabe@apache.org>	2024-01-29 13:51:45 -08:00
Lucas Brutschy	1a54c25fdf	KAFKA-15942: Implement ConsumerInterceptors in the async consumer (#15000 ) We need to make sure to call the consumer interceptor and test its integration. This is adding the required call in commitSync and commitAsync. The calls in commitAsync are executed using the same mechanism as commit callbacks, to ensure that we are calling the interceptors from a single thread, as is intended in the original KIP. The interceptors also need to be invoked on auto-commits which are executed in the commit request manager. For this purpose, we share the OffsetCommitCallbackInvoker class with the background thread (it is already accessed implicitly from the background thread through a future lambda). This is done analogous to the RebalanceListenerInvoker. Co-authored-by: John Doe zh2725284321@gmail.com Reviewers: Bruno Cadonna <bruno@confluent.io>, Andrew Schofield <aschofield@confluent.io>, Philip Nee <pnee@confluent.io>	2024-01-29 21:26:44 +01:00
DL1231	82920ffad0	KAFKA-16095: Update list group state type filter to include the states for the new consumer group type (#15211 ) While using --list --state, the currently accepted values correspond to the classic group type states. This patch adds the new states introduced by KIP-848. It also makes the matching on the server case insensitive. Co-authored-by: d00791190 <dinglan6@huawei.com> Reviewers: Ritika Reddy <rreddy@confluent.io>, David Jacot <djacot@confluent.io>	2024-01-29 07:19:05 -08:00
Luke Chen	70b8c5ae8e	KAFKA-16085: Add metric value consolidated for topics on a broker for tiered storage. (#15133 ) In the BrokerTopicMetrics group, we provide not only the per-topic metric but also the metric value aggregated across all topics. The bean names look like this: kafka.server:type=BrokerTopicMetrics,name=RemoteCopyLagSegments kafka.server:type=BrokerTopicMetrics,name=RemoteCopyLagSegments,topic=Leader This PR adds the missing all-topic aggregated metric value for tiered storage, specifically for gauge type metrics. Reviewers: Divij Vaidya <divijvaidya13@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Christo Lolov <lolovc@amazon.com>	2024-01-27 19:13:24 +08:00
Crispin Bernier	0d4e35514f	Minor update to KafkaApisTest (#15257 ) I was using the ZERO_UUID topicId instead of the actual topicId in the testFetchResponseContainsNewLeaderOnNotLeaderOrFollower introduced in #14444, updating as the actual topicId is more correct. Reviewers: Justine Olshan <jolshan@confluent.io>	2024-01-26 13:17:25 -08:00
Nikolay	13c0c5ee97	KAFKA-14589 ConsumerGroupServiceTest rewritten in java (#15248 ) This PR is part of #14471. It contains a single test rewritten in Java. The intention of a separate PR is to reduce changes and simplify review. Reviewers: Justine Olshan <jolshan@confluent.io>	2024-01-26 10:32:48 -08:00
Justine Olshan	5eb82010ef	KAFKA-15987: Refactor ReplicaManager code for transaction verification (#15087 ) I originally did some refactors in #14774, but we decided to keep the changes minimal since the ticket was a blocker. Here are those refactors: * Removed separate append paths so that produce, group coordinator, and other append paths all call appendRecords * AppendRecords has been simplified * Removed unneeded error conversions in verification code since group coordinator and produce path convert errors differently, removed test for that * Fixed incorrect capital param name in KafkaRequestHandler * Updated ReplicaManager test to handle produce appends separately when transactions are used. Reviewers: David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>	2024-01-26 10:01:03 -08:00
Josep Prat	2a6e420dfb	MINOR: cleanup core modules part 1 (#15252 ) * MINOR: Clean up core api, cluster, common, log, admin, controller and coordinator classes Mark methods and fields private where possible Annotate public methods and fields Remove unused classes and methods Signed-off-by: Josep Prat <josep.prat@aiven.io> Reviewers: Luke Chen <showuon@gmail.com>	2024-01-26 16:35:10 +01:00
Drawxy	706c11e3ee	MINOR: Remove unreachable if-else block in ReplicaManager.scala (#15220 ) After PR #13107, an if-else block became unreachable. We need to remove it to keep the code clean. Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>	2024-01-26 14:46:39 +08:00
Mickael Maison	80a1bf8f56	KAFKA-16003: Always create the /config/topics ZNode even for topics without configs (#15022 ) Reviewers: Luke Chen <showuon@gmail.com>	2024-01-25 15:46:24 +01:00
Mickael Maison	c843912d40	KAFKA-7957: Enable testMetricsReporterUpdate (#15147 ) Reviewers: Luke Chen <showuon@gmail.com>	2024-01-25 10:12:01 +01:00
Calvin Liu	7e5ef9b509	KAFKA-15585: Implement DescribeTopicPartitions RPC on broker (#14612 ) This patch implements the new DescribeTopicPartitions RPC as defined in KIP-966 (ELR). Additionally, this patch adds a broker config "max.request.partition.size.limit" which limits the number of partitions returned by the new RPC. Reviewers: Artem Livshits <alivshits@confluent.io>, Jason Gustafson <jason@confluent.io>, David Arthur <mumrah@gmail.com>	2024-01-24 15:16:09 -05:00
Lianet Magrans	839cd1438b	KAFKA-16107: Stop fetching while onPartitionsAssign completes (#15215 ) This ensures that no records are fetched, or positions initialized, while the onPartitionsAssigned callback completes in the new async consumer Application thread. This is achieved using an internal mark in the subscription state, so that the partitions are not considered fetchable or requiring initializing positions until the callback completes. Reviewers: David Jacot <djacot@confluent.io>	2024-01-24 04:34:35 -08:00
Ismael Juma	70e0dbd795	Delete unused classes (#14797 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-23 22:04:44 -08:00
Apoorv Mittal	208f9e7765	KAFKA-15813: Evict client instances from cache (KIP-714) (#15234 ) KIP-714 requires a client instance cache in the broker, which should also have a time-based eviction policy under which client instances that are not actively sending metrics are evicted. The KIP states: this client-instance-specific state is maintained in broker memory up to MAX(60*1000, PushIntervalMs * 3) milliseconds. Reviewers: Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>	2024-01-23 15:06:02 -08:00
Justine Olshan	e00d36b9c0	KAFKA-15468 [1/2]: Prevent transaction coordinator reloads on already loaded leaders (#15139 ) This originally was #14489 which covered 2 aspects -- reloading on partition epoch changes where leader epoch did not change and reloading when leader epoch changed but we were already the leader. I've cut out the second part of the change since the first part is much simpler. Redefining the TopicDelta fields to better distinguish when a leader is elected (leader epoch bump) vs when a leader has isr/replica changes (partition epoch bump). There are some cases where we bump the partition epoch but not the leader epoch. We do not need to do operations that only care about the leader epoch bump. (ie -- onElect callbacks) Reviewers: Artem Livshits <alivshits@confluent.io>, José Armando García Sancio <jsancio@apache.org>	2024-01-23 14:58:53 -08:00
David Jacot	4d6a422e86	KAFKA-14505; [7/N] Always materialize the most recent committed offset (#15183 ) When transactional offset commits are eventually committed, we must always keep the most recent committed when we have a mix of transactional and regular offset commits. We achieve this by storing the offset of the offset commit record along side the committed offset in memory. Without preserving information of the commit record offset, compaction of the __consumer_offsets topic itself may result in the wrong offset commit being materialized. Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>	2024-01-22 23:26:40 -08:00
Omnia Ibrahim	62ce551826	KAFKA-15853: Move KafkaConfig.Defaults to server module (#15158 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Ismael Juma <ismael@juma.me.uk> , David Jacot <djacot@confluent.io>, Nikolay <NIzhikov@gmail.com>	2024-01-22 15:29:11 +01:00
Apoorv Mittal	556dc2a93f	KAFKA-15811: Enhance request context with client socket port information (KIP-714) (#15190 ) PR adds support to capture client socket port information in Request Context. The port from request context is used as matching criteria in filtering clients and shall be used by metrics plugin to fetch port from request context. Reviewers: Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>	2024-01-19 10:28:07 -08:00
Kirk True	08c437d25e	KAFKA-16104: Enable additional PlaintextConsumerTest tests for new consumer (#15206 ) We reevaluated the integration tests that were disabled for the new consumer group protocol which should be supported. The evaluation was to run the PlaintextConsumerTest suite ten times and see which tests passed and which failed. Based on that evaluation, the following test can now be enabled: testAutoCommitOnClose testAutoCommitOnRebalance testExpandingTopicSubscriptions testMultiConsumerSessionTimeoutOnClose testMultiConsumerSessionTimeoutOnStopPolling testShrinkingTopicSubscriptions There are three tests which consistently failed. For each, a dedicated Jira was created to track and fix. Those that failed: testAutoCommitOnCloseAfterWakeup (KAFKA-16167) testPerPartitionLagMetricsCleanUpWithSubscribe (local failure rate 100%, KAFKA-16150) testPerPartitionLeadMetricsCleanUpWithSubscribe (local failure rate: 70%, KAFKA-16151) testStaticConsumerDetectsNewPartitionCreatedAfterRestart (local failure rate: 100%, KAFKA-16152) See KAFKA-16104 for more details. Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2024-01-19 10:23:15 +01:00
Omnia Ibrahim	2f2a0d799a	KAFKA-15853: Move ClientQuotaManagerConfig outside of core (#15159 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Nikolay <NIzhikov@gmail.com>	2024-01-18 16:15:13 +01:00
Jeff Kim	96f852f9e7	MINOR: log new coordinator partition load schedule time (#15017 ) The current load summary exposes the time from when the partition load operation is scheduled to when the load completes. We are missing the information of how long the scheduled operation stays in the scheduler. Log that information. Reviewers: David Jacot <djacot@confluent.io>	2024-01-18 02:28:17 -08:00
Mickael Maison	acab4657dd	MINOR: Fix compilation issue in ReplicaManagerTest (#15222 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2024-01-18 11:18:28 +01:00
Nikhil Ramakrishnan	46b05b3ac7	MINOR: Add test case for follower fetch (#14212 ) This PR adds a test case for follower fetch when segments are archived and expired from remote storage. This test case verifies the following scenario (from comment): 1. Leader is archiving to tiered storage and has a follower. 2. Follower has caught up to offset X (exclusive). 3. While follower is offline, leader moves X to tiered storage and expires data locally till Y, such that, Y = leaderLocalLogStartOffset and leaderLocalLogStartOffset > X. Meanwhile, X has been expired from tiered storage as well. Hence, X < globalLogStartOffset. 4. Follower comes online and tries to fetch X from leader. Reviewers: Luke Chen <showuon@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>	2024-01-18 15:03:12 +08:00
David Arthur	7bf7fd99a5	KAFKA-16078: Be more consistent about getting the latest MetadataVersion This PR creates MetadataVersion.latestTesting to represent the highest metadata version (which may be unstable) and MetadataVersion.latestProduction to represent the latest version that should be used in production. It fixes a few cases where the broker was advertising that it supported the testing versions even when unstable metadata versions had not been configured. Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>	2024-01-17 14:59:22 -08:00
Proven Provenzano	6c14d77998	KAFKA-16131: Only update directoryIds if the metadata version supports DirectoryAssignment (#15197 ) We only want to send directory assignments if the metadata version supports it. Reviewers: Colin P. McCabe <cmccabe@apache.org>	2024-01-17 14:06:38 -08:00
Colin P. McCabe	0015d0f01b	KAFKA-16126: Kcontroller dynamic configurations may fail to apply at startup Some kcontroller dynamic configurations may fail to apply at startup. This happens because there is a race between registering the reconfigurables to the DynamicBrokerConfig class, and receiving the first update from the metadata publisher. We can fix this by registering the reconfigurables first. This seems to have been introduced by the "MINOR: Install ControllerServer metadata publishers sooner" change. Reviewers: Ron Dagostino <rdagostino@confluent.io>	2024-01-16 16:03:17 -08:00
Dmitry Werner	dd0916ef9a	KAFKA-15743: KRaft support in ReplicationQuotasTest (#15191 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-16 16:01:02 +01:00
Lianet Magrans	b16df3b103	KAFKA-16133 - Reconciliation auto-commit fix (#15194 ) This fixes an issue with the time boundaries used for the auto-commit performed when partitions are revoked. Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2024-01-15 21:50:09 +01:00
Zihao Lin	3041151cf2	KAFKA-15750: KRaft support in KafkaMetricReporterExceptionHandlingTest (#14707 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Qichao Chu <qichao@uber.com>, Gantigmaa Selenge <gselenge@redhat.com>	2024-01-15 12:09:31 +01:00
Zihao Lin	8d89e637c3	KAFKA-15740: KRaft support in DeleteOffsetsConsumerGroupCommandIntegrationTest (#14669 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Qichao Chu <qichao@uber.com>	2024-01-15 11:55:31 +01:00
David Mao	d0f845a5e1	KAFKA-16120: Fix partition reassignment during ZK migration When we are migrating from ZK mode to KRaft mode, the brokers pass through a phase where they are running in ZK mode, but the controller is in KRaft mode (aka a kcontroller). This is called "hybrid mode." In hybrid mode, the KRaft controllers send old-style controller RPCs to the remaining ZK mode brokers. (StopReplicaRequest, LeaderAndIsrRequest, UpdateMetadataRequest, etc.) To complete partition reassignment, the kcontroller must send a StopReplicaRequest to any brokers that no longer host the partition in question. Previously, it was sending this StopReplicaRequest with delete = false. This led to stray partitions, because the partition data was never removed as it should have been. This PR fixes it to set delete = true. This fixes KAFKA-16120. There is one additional problem with partition reassignment in hybrid mode, tracked as KAFKA-16121. The issue is that in ZK mode, brokers ignore any LeaderAndIsr request where the partition leader epoch is less than or equal to the current partition leader epoch. However, when in hybrid mode, just as in KRaft mode, we do not bump the leader epoch when starting a new reassignment, see: `triggerLeaderEpochBumpIfNeeded`. This PR resolves this problem by adding a special case on the broker side when isKRaftController = true. Reviewers: Akhilesh Chaganti <akhileshchg@users.noreply.github.com>, Colin P. McCabe <cmccabe@apache.org>	2024-01-14 20:32:58 -08:00
Arpit Goyal	ef92deee9d	KAFKA-15388: Handling remote segment read in case of log compaction (#15060 ) The implementation of fetching from remote log segments does not handle topics whose retention policy was previously compact and was later changed to delete. It always assumes that a record batch exists in the required segment for the requested offset. However, the requested offset may be the last offset of the segment and may have been removed by log compaction. In that case, the fetch must iterate over the next higher segment for further data, as is done for local segment fetch requests. This change partially addresses the above problem by iterating through the remote log segments to find the respective segment for the target offset. Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Divij Vaidya <diviv@amazon.com>, Christo Lolov <lolovc@amazon.com>	2024-01-15 05:15:58 +05:30
Kamal Chandraprakash	378a01999e	MINOR: Add isRemoteLogEnabled parameter to the Log Loader Javadoc. (#15179 ) Add isRemoteLogEnabled parameter to the Log Loader Javadoc Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>	2024-01-13 14:52:11 +08:00
Greg Harris	21227bda61	KAFKA-15816: Fix leaked sockets in core tests (#14754 ) Signed-off-by: Greg Harris <greg.harris@aiven.io> Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-12 13:18:03 -08:00
Omnia Ibrahim	e9f2218d94	KAFKA-15853: Move ReplicationQuotaManagerConfig to server module (#15160 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Nikolay <nizhikov@apache.org>	2024-01-12 10:47:26 +01:00
谭九鼎	cf447ea4b5	MINOR: doc fix: use <code> instead of backticks (#15169 ) use <code> instead of backticks Reviewers: Luke Chen <showuon@gmail.com>	2024-01-12 16:48:47 +08:00
Abhinav Dixit	8cdf1abb0b	KAFKA-15738: Adding KRaft support in ConsumerWithLegacyMessageFormatIntegrationTest (#15171 ) Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>	2024-01-12 13:41:04 +05:30
dengziming	da6f05258f	MINOR: Enable kraft test in kafka.api (#14595 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-12 11:50:12 +08:00
Divij Vaidya	65424ab484	MINOR: New year code cleanup - include final keyword (#15072 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Sagar Rao <sagarmeansocean@gmail.com>	2024-01-11 17:53:35 +01:00
David Jacot	a8203f9c7a	KAFKA-14505; [4/N] Wire transaction verification (#15142 ) This patch wires the transaction verification in the new group coordinator. It basically calls the verification path before scheduling the write operation. If the verification fails, the error is returned to the caller. Note that the patch uses `appendForGroup`. I suppose that we will move away from using it when https://github.com/apache/kafka/pull/15087 is merged. Reviewers: Justine Olshan <jolshan@confluent.io>	2024-01-11 04:58:57 -08:00
Omnia Ibrahim	dba789dc93	KAFKA-15853: Move OffsetConfig to group-coordinator module (#15161 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, David Jacot <djacot@confluent.io>, Nikolay <nizhikov@apache.org>	2024-01-11 10:19:42 +01:00
Omnia Ibrahim	13a83d58f8	KAFKA-15853: Move ProcessRole to server module (#15166 ) Prepare to move KafkaConfig (#15103). Reviewers: Ismael Juma <ismael@juma.me.uk>	2024-01-10 15:13:06 -08:00
TapDang	a63f76970a	KAFKA-15747: Add KRaft support in DynamicConnectionQuotaTest (#15028 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-10 17:47:01 +01:00
Luke Chen	177c941982	KAFKA-16074: close leaking threads in replica manager tests (#15077 ) Following @dajac's finding in #15063, I found that we also create new RemoteLogManager instances in ReplicaManagerTest but don't close them. While investigating ReplicaManagerTest, I also found other leaking threads: 1. The remote fetch reaper thread. The test creates a real reaper thread, which is not expected; we should create a mocked one, as for the other purgatory instances. 2. Throttle threads. We created a quotaManager to feed into the replicaManager but didn't close it. We already create a global quotaManager instance and close it in AfterEach, so we should reuse it. 3. replicaManager and logManager were not closed after each test. Reviewers: Divij Vaidya <divijvaidya13@gmail.com>, Satish Duggana <satishd@apache.org>, Justine Olshan <jolshan@confluent.io>	2024-01-10 19:54:50 +08:00
Sanskar Jhajharia	3d1d060d87	KAFKA-15735: KRaft support in SaslMultiMechanismConsumerTest (#15156 ) Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>	2024-01-10 12:47:37 +05:30
Zihao Lin	bdad163182	KAFKA-15741: KRaft support in DescribeConsumerGroupTest (#14668 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-09 15:28:49 +01:00
Dmitry Werner	30d9678b3b	KAFKA-15721: KRaft support in DeleteTopicsRequestWithDeletionDisabledTest (#15124 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-09 11:53:29 +01:00
Zihao Lin	b2bfd5d110	KAFKA-15719: Add KRaft support in OffsetsForLeaderEpochRequestTest (#15049 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2024-01-08 17:20:58 +01:00
Vedarth Sharma	116762fdce	KAFKA-16016: Add docker wrapper in core and remove docker utility script (#15048 ) Migrates functionality provided by utility to Kafka core. This wrapper will be used to generate property files and format storage when invoked from docker container. Reviewers: Mickael Maison <mickael.maison@gmail.com>, Viktor Somogyi-Vass <viktorsomogyi@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>	2024-01-08 18:07:38 +05:30
Nikolay	da2aa68269	KAFKA-14588: Move ConfigEntityName to server-common (#14868 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>	2024-01-08 12:41:43 +01:00
Luke Chen	70c8b8d0af	KAFKA-16059: close more kafkaApis instances (#15132 ) Reviewers: Divij Vaidya <diviv@amazon.com>, Justine Olshan <jolshan@confluent.io>	2024-01-06 15:00:20 +01:00
Jason Gustafson	599e22b842	MINOR: Move Raft io thread implementation to Java (#15119 ) This patch moves the `RaftIOThread` implementation into Java. I changed the name to `KafkaRaftClientDriver` since the main thing it does is drive the calls to `poll()`. There shouldn't be any changes to the logic. Reviewers: José Armando García Sancio <jsancio@apache.org>	2024-01-05 09:27:36 -08:00
Luke Chen	c8d61a5cbe	KAFKA-16079: fix thread leaks in LocalLeaderEndPointTest and other tests (#15122 ) Fix thread leaks in LocalLeaderEndPointTest/FinalizedFeatureChangeListenerTest/KafkaApisTest/ReplicaManagerConcurrencyTest Reviewers: Divij Vaidya <diviv@amazon.com>, Christo Lolov <christololov@gmail.com>	2024-01-05 09:43:03 +08:00
Michael Edgar	105db82956	KAFKA-15373: fix exception thrown in Admin#describeTopics for unknown ID (#14599 ) Throw UnknownTopicIdException instead of InvalidTopicException when no name is found for the topic ID. Similar to #6124 for describeTopics using a topic name. MockAdminClient already makes use of UnknownTopicIdException for this case. Reviewers: Justine Olshan <jolshan@confluent.io>, Ashwin Pankaj <apankaj@confluent.io>	2024-01-03 17:56:17 -08:00
Dmitry Werner	d4aeec3d3f	KAFKA-15742: KRaft support in GroupCoordinatorIntegrationTest (#15086 ) updated GroupCoordinatorIntegrationTest.testGroupCoordinatorPropagatesOffsetsTopicCompressionCodec to support KRaft Reviewers: Justine Olshan <jolshan@confluent.io>	2024-01-03 08:46:12 -08:00
DL1231	60c445bdd5	MINOR: Improve code style (#15107 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2024-01-03 11:56:20 +01:00
Arpit Goyal	86a387c3c8	KAFKA-16063: Disable shutdownhook in MiniKdc (used for testing) (#15104 ) This stops a memory leak in the tests caused by ApplicationShutdownHooks. Reviewers: Divij Vaidya <diviv@amazon.com>	2024-01-02 21:11:18 +01:00
Divij Vaidya	65b1558532	KAFKA-16059: Fix thread leak KafkaAPIsTest (#15093 ) Reviewers: Luke Chen <showuon@gmail.com>	2024-01-02 15:58:20 +01:00
Divij Vaidya	bd6cb4db22	KAFKA-16052: Save heap in AbstractCoordinatorConcurrencyTest by creating real ReplicaManager (#15094 ) Mockito keeps the invocation history in the test suite, which causes huge heap usage. Since the mock replicaManager is only used to bypass the replicaManager constructor without verifying/mocking anything, we create a real dummy replicaManager to avoid keeping the Mockito invocation history in memory. Reviewers: Luke Chen <showuon@gmail.com>, Justine Olshan <jolshan@confluent.io> Co-authored-by: Luke Chen <showuon@gmail.com>	2023-12-31 12:25:16 +01:00
wernerdv	b3664119fd	KAFKA-16064: Improve ControllerApiTest (#15091 ) This commit refactors ControllerApiTest to close an instance of ControllerApis in a tearDown method. Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-30 21:49:25 +01:00
Luke Chen	0600ac00e9	KAFKA-16065: close DelayedFuturePurgatory in DelayedOperationTest (#15090 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-29 18:27:45 +01:00
Afshin Moazami	627aaef47e	MINOR: Duplicate method; The QuotaUtils one is used. (#15066 ) It seems that this PR (https://github.com/apache/kafka/pull/8768) duplicated the implementation into QuotaUtils but didn't remove this implementation and the private methods it uses. Reviewers: Justine Olshan <jolshan@confluent.io>	2023-12-28 16:01:30 -08:00
Luke Chen	a465fb124f	KAFKA-16058: close controllerApi instance to avoid thread leaks (#15084 ) The controllerApi will create some resources, including the reaper threads. In ControllerApisTest, we created it on many test cases, but didn't close it. This commit doesn't change anything in the business logic of the test, it just adds try/finally to close the controllerApi instance. Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-28 16:38:20 +01:00
Divij Vaidya	a56b63e226	KAFKA-16053: Fix memory leaks due to KDC server in tests (#15079 ) This commit closes the KDC server properly in `CustomQuotaCallbackTest` and `AclAuthorizerWithZkSaslTest`. Reviewers: Justine Olshan <jolshan@confluent.io>	2023-12-28 10:55:14 +01:00
DL1231	f80686b4ac	MINOR: Improve code style (#15074 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-26 12:21:12 +01:00
IBeyondy	89f32ca6a1	MINOR: Fix NullPointerException in ReplicaFetcherThreadTest.testTruncateOnFetchDoesNotUpdateHighWatermark Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-25 18:11:41 +01:00
Nikolay	417338ad77	KAFKA-16048: Fix ConfigCommandTest.shouldNotSupportAlterClientMetricsWithZookeeper (#15068 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-25 14:38:15 +01:00
Nikolay	45bd19f2ef	KAFKA-14588: Move ConfigType to server-common (#14867 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2023-12-22 18:35:27 +01:00
Rittika Adhikari	0bc736f3c4	MINOR: Refactor to only require one stopPartitions helper (#14662 ) Reviewers: Divij Vaidya <diviv@amazon.com>	2023-12-22 17:13:22 +01:00
Philip Nee	c963a71be0	KAFKA-16026: Send Poll event to the background thread (#15035 ) Related to KAFKA-15818. This fixes a bug in the AsyncKafkaConsumer poll loop: it does not send an event to the network thread to acknowledge the user's poll. This causes a few issues: 1. Autocommit won't work without the user setting the timer. 2. The member will just leave the group after the rebalance timeout and never be able to rejoin. This PR makes a few subtle changes to address the issue: 1. Hook up the poll event to AsyncKafkaConsumer#poll. It is only fired once per invocation. 2. Upon entering the stale state, we need to reset HeartbeatState; otherwise we will get an invalid request. 3. We clear the current assignment and remove all assigned partitions once the heartbeat is sent. See the changes in onHeartbeatRequestSent. Reviewers: David Jacot <djacot@confluent.io>, Bruno Cadonna <cadonna@apache.org>, Andrew Schofield <aschofield@confluent.io>	2023-12-22 15:21:39 +01:00
David Jacot	f7ccd082f1	MINOR: Exit catcher should be reset after the cluster is shutdown (#15062 ) I was investigating a build which failed with "exit 1". In the broker logs, I saw that the first call to exit was caught. However, a second one was not. See the logs below. The issue seems to be that we must first shut down the cluster before resetting the exit catcher. Otherwise, there is still a chance for the broker to call exit. ``` [2023-12-21 13:52:59,310] ERROR Shutdown broker because all log dirs in /tmp/kafka-2594137463116889965 have failed (kafka.log.LogManager:143) [2023-12-21 13:52:59,312] ERROR test error (kafka.server.epoch.EpochDrivenReplicationProtocolAcceptanceWithIbp26Test:76) java.lang.RuntimeException: halt(1, null) called! at kafka.server.QuorumTestHarness.$anonfun$setUp$4(QuorumTestHarness.scala:273) at org.apache.kafka.common.utils.Exit.halt(Exit.java:63) at kafka.utils.Exit$.halt(Exit.scala:33) at kafka.log.LogManager.handleLogDirFailure(LogManager.scala:224) at kafka.server.ReplicaManager.handleLogDirFailure(ReplicaManager.scala:2600) at kafka.server.ReplicaManager$LogDirFailureHandler.doWork(ReplicaManager.scala:324) at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) ``` ``` [2023-12-21 13:53:05,797] ERROR Shutdown broker because all log dirs in /tmp/kafka-7355495604650755405 have failed (kafka.log.LogManager:143) ``` Reviewers: Luke Chen <showuon@gmail.com>	2023-12-22 05:58:34 -08:00
David Jacot	654ac2528b	MINOR: Close RemoteLogManager in RemoteLogManagerTest (#15063 ) This patch ensures that the RemoteLogManager is closed in RemoteLogManagerTest. Reviewers: Divij Vaidya <diviv@amazon.com>, Lucas Brutschy <lbrutschy@confluent.io>	2023-12-22 05:54:48 -08:00
Luke Chen	82808873cb	KAFKA-16035: add tests for remoteLogSizeComputationTime/remoteFetchExpiresPerSec metrics (#15056 ) These tests were previously removed because they were flaky. After investigation, the causes are: 1. remoteLogSizeComputationTime: It failed with "Expected to find 1000 for RemoteLogSizeComputationTime metric value, but found 0". If the verification thread is too slow and the second run of RLMTask has started, the task resets the value back to 0. Fix it by adding a latch to wait for verification. 2. remoteFetchExpiresPerSec: It failed with "The ExpiresPerSec value is not incremented. Current value is: 0". The remoteFetchExpiresPerSec metric is a static metric, and we remove all metrics in the tearDown method after each test completes. So once remoteFetchExpiresPerSec is removed, it is not created again like other metrics. That is why it failed only sometimes in Jenkins: if a previous test has an expired remote fetch, the metric is created and then removed for good. Fix it by only removing it in afterAll. Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>, Christo Lolov <lolovc@amazon.com>	2023-12-22 15:02:55 +08:00
Christo Lolov	d4f3bf93d3	KAFKA-16014: Implement RemoteLogSizeBytes (#15050 ) This pull request aims to implement RemoteLogSizeBytes from KIP-963. Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>	2023-12-22 15:00:44 +08:00
David Jacot	98aca56ee5	KAFKA-16040; Rename `Generic` to `Classic` (#15059 ) People have raised concerns about using `Generic` as a name to designate the old rebalance protocol. We considered using `Legacy` but discarded it because there are still applications, such as Connect, using the old protocol. We settled on using `Classic` for the `Classic Rebalance Protocol`. The changes in this patch are extremely mechanical. It basically replaces the occurrences of `generic` with `classic`. Reviewers: Divij Vaidya <diviv@amazon.com>, Lucas Brutschy <lbrutschy@confluent.io>	2023-12-21 13:39:17 -08:00
David Jacot	79757b3081	KAFKA-14505; [3/N] Wire WriteTxnMarkers API (#14985 ) This patch wires the handling of markers written by the transaction coordinator via the WriteTxnMarkers API. In the old group coordinator, the markers are written to the logs and the group coordinator is informed to materialize the changes as a second step if the writes were successful. This approach does not really work with the new group coordinator for two main reasons: 1) The second step would actually fail while the coordinator is loading and there is no guarantee that the loading has picked up the write or not; 2) It does not fit well with the new memory model where the state is snapshotted by offset. In both cases, it seems that having a single writer to the `__consumer_offsets` partitions is more robust and preferable. Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>	2023-12-21 10:59:41 -08:00
Jeff Kim	4613286076	KAFKA-16030: new group coordinator should check if partition goes offline during load (#15043 ) The new coordinator stops loading if the partition goes offline during load. However, the partition is still considered active. Instead, we should return NOT_LEADER_OR_FOLLOWER exception during load. Another change is that we only want to invoke CoordinatorPlayback#updateLastCommittedOffset if the current offset (last written offset) is greater than or equal to the current high watermark. This is to ensure that in the case the high watermark is ahead of the current offset, we don't clear snapshots prematurely. Reviewers: David Jacot <djacot@confluent.io>	2023-12-21 06:17:35 -08:00
Divij Vaidya	6250049e10	KAFKA-13950: Fix resource leak in error scenarios (#12228 ) We are not properly closing Closeable resources in the code base at multiple places especially when we have an exception. This code change fixes multiple of these leaks. Reviewers: Ismael Juma <ismael@juma.me.uk>, Luke Chen <showuon@gmail.com>, Mickael Maison <mickael.maison@gmail.com>	2023-12-21 13:47:22 +01:00
David Jacot	75dcc8dadf	KAFKA-16036; Add `group.coordinator.rebalance.protocols` and publish all new configs (#15053 ) This patch adds the group.coordinator.rebalance.protocols configuration which accepts a list of protocols to enable. At the moment, only generic and consumer are supported and it is not possible to disable generic yet. When consumer is enabled, the new consumer rebalance protocol (KIP-848) is enabled alongside the new group coordinator. This patch also publishes all the new configurations introduced by KIP-848. Reviewers: Jeff Kim <jeff.kim@confluent.io>, Stanislav Kozlovski <stanislav@confluent.io>	2023-12-21 04:43:57 -08:00
Luke Chen	d59d613258	KAFKA-16013: Throw an exception in DelayedRemoteFetch for follower fetch replicas. (#15015 ) Follow-up for KAFKA-16013: Add metric for expiration rate of delayed remote fetch Reviewers: Nikhil Ramakrishnan <ramakrishnan.nikhil@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>	2023-12-21 15:45:24 +08:00
Christo Lolov	1a97de2fe6	KAFKA-16002: Implement RemoteCopyLagSegments, RemoteDeleteLagBytes and RemoteDeleteLagSegments (#15005 ) This pull request aims to implement RemoteCopyLagSegments, RemoteDeleteLagBytes and RemoteDeleteLagSegments from KIP-963. Reviewers: Luke Chen <showuon@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>	2023-12-21 14:27:12 +08:00
Ismael Juma	919b585da0	KAFKA-15874: Add metric and request log attribute for deprecated request api versions (KIP-896) (#15032 ) Breakdown of this PR: * Extend the generator to support deprecated api versions * Set deprecated api versions via the request json files * Expose the information via metrics and the request log The relevant section of the KIP: > * Introduce metric `kafka.network:type=RequestMetrics,name=DeprecatedRequestsPerSec,request=(api-name),version=(api-version),clientSoftwareName=(client-software-name),clientSoftwareVersion=(client-software-version)` > * Add boolean field `requestApiVersionDeprecated` to the request header section of the request log (alongside `requestApiKey` , `requestApiVersion`, `requestApiKeyName` , etc.). Unit tests were added to verify the new generator functionality, the new metric and the new request log attribute. Reviewers: Jason Gustafson <jason@confluent.io>	2023-12-20 05:13:36 -08:00
Luke Chen	4e11de00a7	KAFKA-16014: Add RemoteLogMetadataCount metric (#15026 ) Reviewers: Christo Lolov <lolovc@amazon.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>	2023-12-20 14:21:30 +05:30
Viktor Somogyi-Vass	0e0282395d	KAFKA-15366: Modify LogDirFailureTest for KRaft (#14977 ) Reviewers: Omnia G.H Ibrahim <o.g.h.ibrahim@gmail.com>, Ron Dagostino <rdagostino@confluent.io>, Igor Soarez <soarez@apple.com>	2023-12-19 21:02:49 -05:00
Philip Nee	5e37ec80f8	KAFKA-15696: Refactor closing consumer (#14937 ) We drive the consumer closing via events and rely on the still-live network thread to complete these operations. This change covers several tickets: KAFKA-15696 and KAFKA-15548. When closing the consumer, we need to perform a few tasks. Here is the top-level overview: We want to keep the network thread alive until we are ready to shut down, i.e., until no more requests need to be sent out. To achieve this, I implemented a method, signalClose(), to signal the managers to prepare for shutdown. Once we signal the network thread to close, the manager will prepare for the request to be sent out on the next event loop. The network thread can then be closed after issuing these events. The application thread's task is straightforward: 1. Tell the background thread to perform n events. 2. Block on certain events until they succeed or the timer runs out. Once all requests are sent out, we close the network thread and other components as usual. The changes in detail: AsyncKafkaConsumer: shutdown procedures and several utility functions to ensure proper exceptions are thrown during shutdown. AsyncKafkaConsumerTest: I examined each individual test and fixed the ones that were blocking for too long or logging errors. CommitRequestManager: signalClose(). FetchRequestManagerTest: changes due to the change in pollOnClose(). ApplicationEventProcessor: handle CommitOnClose and LeaveGroupOnClose; the latter triggers leaveGroup(), which should be completed on the next heartbeat (or we time out on the application thread). Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>	2023-12-19 13:20:33 +01:00
David Jacot	35e2d3c196	MINOR: Fix thread leak in AuthorizerIntegrationTest (#15006 ) Producers and consumers could be leaked in the AuthorizerIntegrationTest. In the teardown logic, `removeAllClientAcls()` is called before calling the super teardown method. If `removeAllClientAcls()` fails, the super method does not have a chance to close the producers and consumers. Example of such failure [here](https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14925/11/tests/). As a new cluster is created for each test anyway, calling `removeAllClientAcls()` does not seem necessary. This patch removes it. Reviewers: Jason Gustafson <jason@confluent.io>	2023-12-18 23:48:10 -08:00
Gantigmaa Selenge	7b21da9712	KAFKA-15158: Add metrics for RemoteDelete and BuildRemoteLogAuxState (#14375 ) This PR implements part of KIP-963, specifically for adding new metrics. The metrics added in this PR are: RemoteDeleteRequestsPerSec (emitted when expired log segments on remote storage being deleted) RemoteDeleteErrorsPerSec (emitted when failed to delete expired log segments on remote storage) BuildRemoteLogAuxStateRequestsPerSec (emitted when building remote log aux state for replica fetchers) BuildRemoteLogAuxStateErrorsPerSec (emitted when failed to build remote log aux state for replica fetchers) Reviewers: Luke Chen <showuon@gmail.com>, Nikhil Ramakrishnan <ramakrishnan.nikhil@gmail.com>, Christo Lolov <lolovc@amazon.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Divij Vaidya <diviv@amazon.com>, Satish Duggana <satishd@apache.org>	2023-12-19 15:02:45 +08:00
Luke Chen	c240993be2	KAFKA-16014: Add RemoteLogSizeComputationTime metric (#15021 ) Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Christo Lolov <lolovc@amazon.com>	2023-12-18 21:39:43 +05:30
Lucas Brutschy	7aade70cc6	Revert "KAFKA-15764: Missing Tests for Transactions (#14702 )" (#15029 ) This reverts commit `ed7ad6d`. We have been seeing a lot of failures of TransactionsWithTieredStoreTest.testTransactionsWithCompression on trunk, and they seem to have started with this PR. I see how this PR can influence the test via the change in TestUtils. The bad part is that it sometimes seems to kill the Gradle Executors completely. So I'd suggest reverting the change before investigating further to stabilize CI. Reviewers: Bruno Cadonna <cadonna@apache.org>	2023-12-18 10:12:05 +01:00
Philip Nee	a6076c71f6	KAFKA-16023: Disable flaky tests in PlaintextConsumerTest (#15025 ) I observed several failed tests in PR builds. Let's first disable them and try to find a different way to test the async consumer with these tests. Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2023-12-17 10:43:45 +01:00
Justine Olshan	ed7ad6d9d3	KAFKA-15764: Missing Tests for Transactions (#14702 ) I ran this test 40 times without KAFKA-15653 with and without compression enabled. With compression it failed 39/40 times and without it passed 40/40 times. With the KAFKA-15653 and compression it passed 40/40 times locally Reviewers: Jason Gustafson <jason@confluent.io>	2023-12-15 09:41:20 -08:00
Andrew Schofield	a23dae4e9a	KAFKA-15971: Re-enable consumer integration tests for new consumer (#14925 ) The consumer integration tests were experimentally disabled for the new `AsyncKafkaConsumer` variant with the aim of improving build stability. Several improvements have been made to the consumer code and other tests which seem to have made a difference. This patch re-enables the tests. Reviewers: David Jacot <djacot@confluent.io>	2023-12-15 05:16:54 -08:00
Nikhil Ramakrishnan	52496dcd38	KAFKA-16013: Add metric for expiration rate of delayed remote fetch (#15014 ) Add metric for the number of expired remote fetches per second, and corresponding unit test to verify that the metric is marked on expiration. kafka.server:type=DelayedRemoteFetchMetrics,name=ExpiresPerSec Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Christo Lolov <lolovc@amazon.com>	2023-12-15 19:21:39 +08:00
Kirk True	9dc9040f33	KAFKA-15276: Implement event plumbing for ConsumerRebalanceListener callbacks (#14640 ) This patch adds the logic for coordinating the invocation of the `ConsumerRebalanceListener` callback invocations between the background thread (in `MembershipManagerImpl`) and the application thread (`AsyncKafkaConsumer`) and back again. It allowed us to enable more tests from `PlaintextConsumerTest` to exercise the code herein. Reviewers: David Jacot <djacot@confluent.io>	2023-12-15 00:42:31 -08:00
Proven Provenzano	b0e99b5593	KAFKA-15922: Bump MetadataVersion to support JBOD with KRaft (#14984 ) Moves ELR from MetadataVersion IBP_3_7_IV3 into the new IBP_3_8_IV0 because the ELR feature was not completed before 3.7 reached feature freeze. Leaves IBP_3_7_IV3 empty -- it is a no-op and is not reused for anything. Adds the new MetadataVersion IBP_3_7_IV4 for the FETCH request changes from KIP-951, which were mistakenly never associated with a MetadataVersion. Updates the LATEST_PRODUCTION MetadataVersion to IBP_3_7_IV4 to declare both KRaft JBOD and the KIP-951 changes ready for production use. Reviewers: Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>, Ron Dagostino <rdagostino@confluent.io>, Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@apache.org>, Justine Olshan <jolshan@confluent.io>	2023-12-14 10:08:54 -05:00
Justine Olshan	e4249b69bd	KAFKA-15784: Ensure atomicity of in memory update and write when transactionally committing offsets (#14774 ) Rewrote the verification flow to pass a callback to execute after verification completes. For the TxnOffsetCommit, we will call doTxnCommitOffsets. This allows us to do offset validations post verification. I've reorganized the verification code and group coordinator code to make these code paths clearer. The followup refactor (https://issues.apache.org/jira/browse/KAFKA-15987) will further clean up the produce verification code. Reviewers: Artem Livshits <alivshits@confluent.io>, Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>, Jun Rao <junrao@gmail.com>	2023-12-13 17:45:09 -08:00
Christo Lolov	a87e86e015	KAFKA-15883: Implement RemoteCopyLagBytes (#14832 ) This pull request implements the first in the list of metrics in KIP-963: Additional metrics in Tiered Storage. Since each partition of a topic will be serviced by its own RLMTask we need an aggregator object for a topic. The aggregator object in this pull request is BrokerTopicAggregatedMetric. Since the RemoteCopyLagBytes is a gauge I have introduced a new GaugeWrapper. The GaugeWrapper is used by the metrics collection system to interact with the BrokerTopicAggregatedMetric. The RemoteLogManager interacts with the BrokerTopicAggregatedMetric directly. Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>	2023-12-14 09:21:37 +08:00
vamossagar12	a1e985d22f	KAFKA-15237: Implement write operation timeout (#14981 ) This patch ensures that `offset.commit.timeout.ms` is enforced. It does so by adding a timeout to the CoordinatorWriteEvent. Reviewers: David Jacot <djacot@confluent.io>	2023-12-13 11:30:53 -08:00
Andrew Schofield	b08fb14bed	KAFKA-15775: New consumer listTopics and partitionsFor (#14962 ) Implement Consumer.listTopics and Consumer.partitionsFor in the new consumer. The topic metadata request manager already existed so this PR adds expiration to requests, removes some redundant state checking and adds tests. Reviewers: Lucas Brutschy <lucasbru@apache.org>	2023-12-13 08:47:25 +01:00
Nikhil Ramakrishnan	be531c681c	KAFKA-15695: Update the local log start offset of a log after rebuilding the auxiliary state (#14649 ) Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Alexandre Dupriez <alexandre.dupriez@gmail.com>	2023-12-12 21:43:42 +05:30
Philip Nee	5b478aebfd	KAFKA-15818: ensure leave group on max poll interval (#14873 ) Currently, poll interval is not being respected during consumer#poll. When the user stops polling the consumer, we should assume either the consumer is too slow to respond or is already dead. In either case, we should let the group coordinator kick the member out of the group and reassign its partition after the rebalance timeout expires. If the consumer comes back alive, we should send a heartbeat and the member will be fenced and rejoin. (and the partitions will be revoked). This is the same behavior as the current implementation. Reviewers: Lucas Brutschy <lucasbru@apache.org>, Bruno Cadonna <cadonna@apache.org>, Lianet Magrans <lianetmr@gmail.com>	2023-12-12 10:06:34 +01:00
Omnia Ibrahim	07490b929b	KAFKA-15365: Broker-side replica management changes (#14881 ) Reviewers: Igor Soarez <soarez@apple.com>, Ron Dagostino <rndgstn@gmail.com>, Proven Provenzano <pprovenzano@confluent.io>	2023-12-11 09:34:22 -05:00
Lucas Brutschy	134eabee16	MINOR: fix leak in `GroupEndToEndAuthorizationTest` (#14975 ) Session expiration in ZkClient can lead to a thread leak, and does fail CI on master. This is happening in testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl, and possibly other tests. Use try-with-resources to close ZkClient if this happens. This does not fix the underlying session expiration in ZK. Reviewers: David Jacot <djacot@confluent.io>	2023-12-11 09:05:03 +01:00
Andrew Schofield	f80f991c79	KAFKA-15978: Update member information on HB response (#14945 ) In the new consumer, the commit request manager and the membership manager are separate components. The commit request manager is initialised with group information that it uses to construct `OffsetCommit` requests. However, the initial value of the member ID is `""` in some cases. When the consumer joins the group, it receives a `ConsumerGroupHeartbeat` response which tells it the member ID. The member ID was not being passed to the commit request manager, so it sent invalid `OffsetCommit` requests that failed with `UNKNOWN_MEMBER_ID`. Reviewers: Bruno Cadonna <cadonna@apache.org>, David Jacot <djacot@confluent.io>	2023-12-10 23:56:54 -08:00
David Jacot	131581a2b4	MINOR: Remove `SubscribedTopicRegex` field from `ConsumerGroupHeartbeatRequest` (#14956 ) The support for regular expressions has not been implemented yet in the new consumer group protocol. This patch removes the `SubscribedTopicRegex` from the `ConsumerGroupHeartbeatRequest` in preparation for 3.7. It seems better to bump the version and add it back when we implement the feature, as part of https://issues.apache.org/jira/browse/KAFKA-14517, instead of having an unused field in the request. Reviewers: Sagar Rao <sagarmeansocean@gmail.com>, Justine Olshan <jolshan@confluent.io>	2023-12-10 23:53:08 -08:00
TapDang	cbc882ba07	KAFKA-15714: KRaft support in DynamicNumNetworkThreadsTest (#14970 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2023-12-10 13:33:01 +01:00
Igor Soarez	8c184b4743	MINOR: Fix some AssignmentsManager bugs (#14954 ) - Add proper start & stop for AssignmentsManager's event loop - Dedupe queued duplicate assignments - Fix bug where directory ID is resolved too late Co-authored-by: Gaurav Narula <gaurav_narula2@apple.com> Reviewers: Colin P. McCabe <cmccabe@apache.org>	2023-12-08 15:37:23 -08:00
Proven Provenzano	02d9f46f3a	MINOR: allow JBOD during ZK migration (#14968 ) Allow using JBOD during ZK migration if MetadataVersion is at or above 3.7-IV2. Reviewers: Ron Dagostino <rndgstn@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>	2023-12-08 14:38:57 -08:00
Igor Soarez	9de72daa50	KAFKA-15361: Migrating brokers must register with directory list (#14976 ) KAFKA-15361 (#14838) introduced a check for a non-empty directory list on broker registration requests from MetadataVersion.IBP_3_7_IV2 or later, which enables directory assignment. However, ZK brokers weren't yet registering with a directory list. This patch addresses that. We also make the directory list non-optional in BrokerLifecycleManager. Reviewers: Ron Dagostino <rndgstn@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>	2023-12-08 10:16:48 -08:00
vamossagar12	e6e7d8c09f	KAFKA-14516: [3/3] Integration Test - Static Member Removed After Session Timeout (#14911 ) This new integration test verifies that a static member who temporarily left the group is removed after the session timeout expires. It also verifies that a new static member with the same instance id can't join the group until the previous static member is expired. Reviewers: David Jacot <djacot@confluent.io>	2023-12-08 04:59:10 -08:00
David Jacot	0ad059d101	MINOR: Fix leak thread in DeleteTopicTest.testIncreasePartitionCountDuringDeleteTopic (#14960 ) Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2023-12-08 04:34:26 -08:00
David Jacot	38c873b80f	MINOR: Avoid leaking threads in DelegationTokenEndToEndAuthorizationWithOwnerTest.testDescribeTokenForOtherUserFails (#14959 ) Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2023-12-07 23:23:08 -08:00
Omnia Ibrahim	ec92410e59	KAFKA-15363: Broker log directory failure changes (#14790 ) Part of JBOD KIP-858, https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft Reviewers: Igor Soarez <i@soarez.me>, Colin P. McCabe <cmccabe@apache.org>, Ron Dagostino <rdagostino@confluent.io>	2023-12-07 20:44:56 -05:00
Lucas Brutschy	02915a2c5e	KAFKA-15977: Fix leak in DelegationTokenEndToEndAuthorizationWithOwnerTest (#14939 ) DelegationTokenEndToEndAuthorizationWithOwnerTest can leak a thread, causing problems with many tests. This is due to an admin client that isn't being closed when a (flaky) test fails. This patch uses the Scala utility `Using` to close the auto-closeable admin client in case the validation fails. Reviewers: David Jacot <djacot@confluent.io>, Bruno Cadonna <cadonna@apache.org>	2023-12-07 21:37:23 +01:00
Colin P. McCabe	c062e5a1f9	HOTFIX: fix scala 2.12 build again	2023-12-07 12:03:02 -08:00
Igor Soarez	c515bf51f8	KAFKA-15426: Process and persist directory assignments Handle AssignReplicasToDirs requests, persist metadata changes with new directory assignments and possible leader elections. Reviewers: Proven Provenzano <pprovenzano@confluent.io>, Ron Dagostino <rndgstn@gmail.com>, Colin P. McCabe <cmccabe@apache.org>	2023-12-07 11:44:45 -08:00
Colin Patrick McCabe	969bc7749c	KAFKA-15980: Add the CurrentControllerId metric (#14749 ) Add the CurrentControllerId metric as described in KIP-1001. This gives us an easy way to identify the current controller by looking at the metrics of any Kafka node (broker or controller). Reviewers: David Arthur <mumrah@gmail.com>	2023-12-06 21:03:33 -08:00
Apoorv Mittal	dc09d7a4e0	KAFKA-15684: Support to describe all client metrics resources (KIP-714) (#14933 ) Improvement for KIP-1000 to list client metrics resources in KafkaApis.scala. Using functionality exposed by KIP-1000 to support describe all metrics operations for KIP-714. Reviewers: Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>	2023-12-06 11:09:42 -08:00
Andrew Schofield	8ed53a15ee	KAFKA-15932: Wait for responses in consumer operations (#14912 ) The Kafka consumer makes a variety of requests to brokers such as fetching committed offsets and updating metadata. In the LegacyKafkaConsumer, the approach is typically to prepare RPC requests and then poll the network to wait for responses. In the AsyncKafkaConsumer, the approach is to enqueue an ApplicationEvent for processing by one of the request managers on the background thread. However, it is still important to wait for responses rather than spinning enqueuing events for the request managers before they have had a chance to respond. In general, the behaviour will not be changed by this code. The PlaintextConsumerTest.testSeek test was flaky because operations such as KafkaConsumer.position were not properly waiting for a response which meant that subsequent operations were being attempted in the wrong state. This test is no longer flaky. Reviewers: Kirk True <ktrue@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Bruno Cadonna <cadonna@apache.org>	2023-12-06 18:47:26 +01:00
Jeff Kim	b888fa1ec9	KAFKA-15910: New group coordinator needs to generate snapshots while loading (#14849 ) After the new coordinator loads a __consumer_offsets partition, it logs the following exception when making a read operation (fetch/list groups, etc): ``` java.lang.RuntimeException: No in-memory snapshot for epoch 740745. Snapshot epochs are: at org.apache.kafka.timeline.SnapshotRegistry.getSnapshot(SnapshotRegistry.java:178) at org.apache.kafka.timeline.SnapshottableHashTable.snapshottableIterator(SnapshottableHashTable.java:407) at org.apache.kafka.timeline.TimelineHashMap$ValueIterator.<init>(TimelineHashMap.java:283) at org.apache.kafka.timeline.TimelineHashMap$Values.iterator(TimelineHashMap.java:271) ``` This happens because we don't have a snapshot at the last updated high watermark after loading. We cannot generate a snapshot at the high watermark after loading all batches because it may contain records that have not yet been committed. We also don't know where the high watermark will advance up to so we need to generate a snapshot for each offset the loader observes to be greater than the current high watermark. Then once we add the high watermark listener and update the high watermark we can delete all of the older snapshots. Reviewers: David Jacot <djacot@confluent.io>	2023-12-06 08:38:05 -08:00
Lucas Brutschy	c575ba238d	KAFKA-15280: Implement client support for KIP-848 server-side assignors (#14878 ) * Validate the client’s configuration for server-side assignor selection defined in config group.remote.assignor * Include the assignor taken from config in the ConsumerGroupHeartbeat request, in the ServerAssignor field * Properly handle UNSUPPORTED_ASSIGNOR errors that may be returned in the HB response if the server does not support the assignor defined by the consumer. Includes simple integration tests for sending an invalid assignor to the broker, and for using the range assignor with a single consumer. Reviewers: David Jacot <djacot@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Bruno Cadonna <cadonna@apache.org>	2023-12-06 15:22:11 +01:00
Kamal Chandraprakash	f05b342b39	MINOR: Allow local-log segment deletion when log-start-offset incremented. (#14905 ) DELETE_RECORDS API can move the log-start-offset beyond the highest-copied-remote-offset. In such cases, we should allow deletion of local-log segments since they won't be eligible for upload to remote storage. Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>	2023-12-06 16:59:16 +05:30
Andrew Schofield	587f50d48f	KAFKA-15831: KIP-1000 protocol and admin client (#14811 ) This adds the new ListClientMetricsResources RPC to the Kafka protocol and puts support into the Kafka admin client. The broker-side implementation in this PR is just to return an empty list. A future PR will obtain the list from the config store. Includes a few unit tests for what is a very simple RPC. There are additional tests already written and waiting for the PR that delivers the kafka-client-metrics.sh tool which builds on this PR. Reviewers: Jun Rao <junrao@gmail.com>	2023-12-05 07:14:06 -08:00
vamossagar12	0f56eeb046	KAFKA-14516: [2/N] Integration Test - Static Member Gets Assignment Back (#14882 ) This patch adds an integration test which verifies that a static member gets its previous assignment back when rejoining. Reviewers: David Jacot <djacot@confluent.io>	2023-12-05 04:36:15 -08:00
Nikolay	783698c525	KAFKA-15645: Move ReplicationQuotasTestRig to tools module (#14588 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Justine Olshan <jolshan@confluent.io>, Taras Ledkov <tledkov@apache.org>	2023-12-05 10:03:33 +01:00
David Jacot	34e1dbbaba	MINOR: Add Uniform assignor to the default config (#14826 ) This patch adds the `Uniform` assignor to the default list of supported assignors. It also makes small changes in the code. Reviewers: Justine Olshan <jolshan@confluent.io>	2023-12-05 00:32:50 -08:00
David Jacot	26274afd05	MINOR: Ensure that DisplayName is set in all parameterized tests (#14850 ) This is a follow-up to https://github.com/apache/kafka/pull/14687 as we found out that some parameterized tests do not include the test method name in their name. For context, the JUnit XML report does not include the name of the method by default but relies only on the display name provided. Reviewers: David Arthur <mumrah@gmail.com>	2023-12-04 23:58:48 -08:00
David Jacot	b46505c8de	KAFKA-15061; CoordinatorPartitionWriter should reuse buffer (#14885 ) This patch adds a ThreadLocal with a GrowableBufferSupplier so that each writing thread can reuse the same buffer instead of allocating a new one for each write. The patch relies on existing tests. Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>	2023-12-04 23:56:52 -08:00
David Jacot	b335ed954e	MINOR: Add @Timeout annotation to consumer integration tests (#14896 ) In this [build](https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14826/11/pipeline/12/), the following test hangs forever. ``` Gradle Test Run :core:test > Gradle Test Executor 93 > PlaintextConsumerTest > testSeek(String, String) > testSeek(String, String).quorum=kraft+kip848.groupProtocol=consumer STARTED ``` As the new consumer is not extremely stable yet, we should add a Timeout to all those integration tests to ensure that builds are not blocked unnecessarily. Reviewers: Andrew Schofield <aschofield@confluent.io>, Justine Olshan <jolshan@confluent.io>	2023-12-04 23:55:39 -08:00
Colin Patrick McCabe	ebae7b26b5	MINOR: fix bug where we weren't registering SnapshotEmitterMetrics (#14918 ) Fix a bug where we weren't properly exposing SnapshotEmitterMetrics. Add a test. Reviewers: David Arthur <mumrah@gmail.com>	2023-12-04 21:32:12 -08:00
Apoorv Mittal	463ed09f4e	KAFKA-15830: Add telemetry API handling (KIP-714) (#14767 ) The PR adds handling of telemetry APIs in KafkaApis.scala, which calls the respective manager to handle the API calls. Also, the telemetry plugin, if registered in configs, gets registered for exporting client metrics. Reviewers: Jun Rao <junrao@gmail.com>	2023-12-04 16:00:35 -08:00
Max Riedel	b7c99e22a7	KAFKA-14509: [2/N] Implement server side logic for ConsumerGroupDescribe API (#14544 ) This patch implements the ConsumerGroupDescribe API. Reviewers: David Jacot <djacot@confluent.io>	2023-12-04 07:19:28 -08:00
Andrew Schofield	b6571a5f44	MINOR: Experimentally turn off consumer integration tests using new consumer (#14904 ) This is part of the investigation into recent build instability. It simply turns off the consumer integration tests that use the new AsyncKafkaConsumer to see whether the build runs smoothly. Reviewers: David Jacot <djacot@confluent.io>	2023-12-04 01:18:29 -08:00
Bruno Cadonna	0cf227dd4f	KAFKA-14438: Throw if async consumer configured with invalid group ID (#14872 ) Verifies that the group ID passed into the async consumer is valid. That is, if the group ID is not null, it must be neither empty nor consist only of whitespace. This change stores the group ID in the group metadata because KAFKA-15281 about the group metadata API will build on that. Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>	2023-12-03 23:11:41 +01:00
Jason Gustafson	a701c0e04f	MINOR: Fix flaky `DescribeClusterRequestTest.testDescribeClusterRequestIncludingClusterAuthorizedOperations` (#14890 ) Test startup does not assure that all brokers are registered. In flaky failures, the `DescribeCluster` API does not return a complete list of brokers. To fix the issue, we add a call to `ensureConsistentKRaftMetadata()` to ensure that all brokers are registered and have caught up to current metadata. Reviewers: David Jacot <djacot@confluent.io>	2023-12-01 09:33:17 -08:00
Andrew Schofield	1750d735cd	KAFKA-15842: Correct handling of KafkaConsumer.committed for new consumer (#14859 ) This PR fixes some details of the interface to KafkaConsumer.committed which were different between the existing consumer and the new consumer. Adds a unit test that validates the behaviour is the same for both consumer implementations. Reviewers: Kirk True <ktrue@confluent.io>, Bruno Cadonna <cadonna@apache.org>	2023-12-01 14:37:21 +01:00
David Jacot	5fdfb3afaf	MINOR: Disable FetchFromFollowerIntegrationTest.testRackAwareRangeAssignor (#14876 ) `FetchFromFollowerIntegrationTest.testRackAwareRangeAssignor` is extremely flaky and we have never been able to fix it. This patch disables it until we find a solution to make it reliable with https://issues.apache.org/jira/browse/KAFKA-15020. Reviewers: Stanislav Kozlovski <stanislav@confluent.io>	2023-12-01 00:05:46 -08:00
Igor Soarez	6b87c85291	KAFKA-15886: Always specify directories for new partition registrations When creating partition registrations directories must always be defined. If creating a partition from a PartitionRecord or PartitionChangeRecord from an older version that does not support directory assignments, then DirectoryId.MIGRATING is assumed. If creating a new partition, or triggering a change in assignment, DirectoryId.UNASSIGNED should be specified, unless the target broker has a single online directory registered, in which case the replica should be assigned directly to that single directory. Reviewers: Colin P. McCabe <cmccabe@apache.org>	2023-11-30 14:10:47 -08:00
Colin Patrick McCabe	a94bc8d6d5	KAFKA-15922: Add a MetadataVersion for JBOD (#14860 ) Assign MetadataVersion.IBP_3_7_IV2 to JBOD. Move KIP-966 support to MetadataVersion.IBP_3_7_IV3. Create MetadataVersion.LATEST_PRODUCTION as the latest metadata version that can be used when formatting a new cluster, or upgrading a cluster using kafka-features.sh. This will allow us to clearly distinguish between stable and unstable metadata versions for the first time. Reviewers: Igor Soarez <soarez@apple.com>, Ron Dagostino <rndgstn@gmail.com>, Calvin Liu <caliu@confluent.io>, Proven Provenzano <pprovenzano@confluent.io>	2023-11-30 10:35:13 -08:00
Jason Gustafson	085f1d340b	MINOR: No need for response callback when applying controller mutation throttle (#14861 ) With `AbstractResponse.maybeSetThrottleTimeMs`, we don't need to use a callback to build the response with the respective throttle. Reviewers: David Jacot <djacot@confluent.io>	2023-11-29 16:33:05 -08:00
Okada Haruki	d71d0639d9	KAFKA-15046: Get rid of unnecessary fsyncs inside UnifiedLog.lock to stabilize performance (#14242 ) Any blocking operation performed while holding UnifiedLog.lock can lead to serious performance (even availability) issues, yet there are currently several paths that call fsync(2) inside the lock. While the lock is held, all subsequent produce requests against the partition may block. This easily causes all request handlers to be busy when disk performance is bad. Even worse, when a disk experiences a glitch of tens of seconds (not rare in spinning drives), the broker becomes unable to process any requests while remaining unfenced from the cluster (i.e. a "zombie"-like status). This PR gets rid of 4 cases of essentially unnecessary fsync(2) calls performed under the lock: (1) ProducerStateManager.takeSnapshot at UnifiedLog.roll: I moved the fsync(2) call to the scheduler thread as part of the existing "flush-log" job (before incrementing the recovery point). Since it is still ensured that the snapshot is flushed before incrementing the recovery point, this change shouldn't cause any problem. (2) ProducerStateManager.removeAndMarkSnapshotForDeletion as part of log segment deletion: This method calls Utils.atomicMoveWithFallback with needFlushParentDir = true internally, which calls fsync. I changed it to call Utils.atomicMoveWithFallback with needFlushParentDir = false (which is consistent with index file deletion, which also doesn't flush the parent dir). This change shouldn't cause problems either. (3) LeaderEpochFileCache.truncateFromStart when incrementing log-start-offset: This path is called from deleteRecords on request-handler threads. Here, we don't actually need fsync(2) either. On unclean shutdown, a few leader epochs might remain in the file, but they will be handled by LogLoader on start-up, so this is not a problem. (4) LeaderEpochFileCache.truncateFromEnd as part of log truncation: Likewise, we don't need fsync(2) here, since any epochs which are untruncated on unclean shutdown will be handled by the log loading procedure. Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Justine Olshan <jolshan@confluent.io>, Jun Rao <junrao@gmail.com>	2023-11-29 09:43:44 -08:00
Apoorv Mittal	f1819f4480	KAFKA-15778 & KAFKA-15779: Implement metrics manager (KIP-714) (#14699 ) The PR provides an implementation of the client metrics manager along with other classes. The manager is responsible for supporting 3 operations: (1) UpdateSubscription, from kafka-configs.sh and reload from metadata cache; (2) Process Get Telemetry Request, from KafkaApis.scala; (3) Process Push Telemetry Request, from KafkaApis.scala. The manager maintains an in-memory cache to keep track of client instances against their instance id. Reviewers: Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>	2023-11-29 09:20:07 -08:00
David Jacot	5ae0b49839	KAFKA-14505; [1/N] Add support for transactional writes to CoordinatorRuntime (#14844 ) This patch adds support for transactional writes to the CoordinatorRuntime framework. This mainly consists in adding CoordinatorRuntime#scheduleTransactionalWriteOperation and in adding the producerId and producerEpoch to various interfaces. The patch also extends the CoordinatorLoaderImpl and the CoordinatorPartitionWriter accordingly. Reviewers: Justine Olshan <jolshan@confluent.io>	2023-11-29 08:54:23 -08:00
Proven Provenzano	14571054aa	KAFKA-15904: Only add directory.id to meta.properties when migrating or in kraft mode Only add directory.id to meta.properties when migrating to kraft mode, or already in kraft mode. This prevents incompatibilities with older Kafka releases, which checked that each directory in a JBOD ensemble had the same meta.properties values. Reviewers: Colin P. McCabe <cmccabe@apache.org>	2023-11-28 23:14:10 -08:00
Andrew Schofield	161b94d196	KAFKA-15544: Enable integration tests for new consumer (#14758 ) This commit parameterizes the consumer integration tests so they can be run against the existing "generic" group protocol and the new "consumer" group protocol introduced in KIP-848. The KIP-848 client code is under construction so some of the tests do not run on both variants to start with, but the idea is that the tests can be enabled as the gaps in functionality are closed. Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>	2023-11-28 21:26:59 +01:00
Lucas Brutschy	f3e776fd34	MINOR: time-out hanging ZooKeeperClientTest (#14855 ) As described in KAFKA-9470, testBlockOnRequestCompletionFromStateChangeHandler will block for hours occasionally. If it passes, it takes 0.5 seconds, so a minute timeout should be safe. This is not a fix for KAFKA-9470, it's just aiming to make the CI more stable. Reviewers: David Jacot <djacot@confluent.io>, Matthias J. Sax <matthias@confluent.io>	2023-11-28 12:04:53 -08:00
Apoorv Mittal	38f2faf83f	KAFKA-15681: Add support of client-metrics in kafka-configs.sh (KIP-714) (#14632 ) The PR adds support of alter/describe configs for client-metrics as defined in KIP-714 Reviewers: Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>	2023-11-28 09:24:25 -08:00
Kamal Chandraprakash	fade3d10ea	KAFKA-15047: Roll active segment when it breaches the retention policy (#14766 ) Roll the active segment and offload it to remote storage once it breaches the retention time policy. A segment is eligible for deletion once it gets uploaded to the remote storage. We have checks to allow only the passive segments to be uploaded, so the active segment never gets removed at all even if it breaches the retention time. For low-throughput/stale topics, the active segment can hold the data beyond the retention time configured by the user. Reviewers: Satish Duggana <satishd@apache.org>, Christo Lolov <lolovc@amazon.com>	2023-11-28 09:38:11 +05:30
Igor Soarez	b34d75c313	MINOR: Fix flaky BrokerLifecycleManagerTest (#14836 ) Fix some flakiness introduced by "MINOR: Always send cumulative failed dirs in HB request" Reviewers: Colin P. McCabe <cmccabe@apache.org>	2023-11-27 14:09:45 -08:00
Greg Harris	fb16a2d628	KAFKA-15819: KafkaServer.shutdown should free KafkaRaftManager (#14751 ) The other call sites for KafkaRaftManager (SharedServer, TestRaftServer, MetadataShell) appear to shut down the KafkaRaftManager when shutting down themselves. The call site in ZK-mode KafkaServer should behave the same way. Reviewers: Colin P. McCabe <cmccabe@apache.org>	2023-11-27 13:39:28 -08:00
Kamal Chandraprakash	a42c846336	KAFKA-15241: Compute tiered copied offset by keeping the respective epochs in scope (#14787 ) "findHighestRemoteOffset" does not take into account the leader-epoch end offset. This can cause log divergence between the local and remote log segments when there is unclean leader election. To handle it correctly, the logic to find the highest remote offset can be updated to: find-highest-remote-offset = min(end-offset-for-epoch-in-the-checkpoint, highest-remote-offset-for-epoch) Discussion thread: https://github.com/apache/kafka/pull/14004#discussion_r1266864272 Reviewers: Satish Duggana <satishd@apache.org>, Christo Lolov <lolovc@amazon.com>	2023-11-27 09:10:46 +05:30
Jakub Scholz	95f41d59b3	KIP-978: Allow dynamic reloading of certificates with different DN / SANs (#14756 ) This PR implements KIP-978: Allow dynamic reloading of certificates with different DN / SANs. It adds two new options ssl.allow.dn.changes and ssl.allow.san.changes that can be used to enable dynamic reloading of certificates even if their DN / SANs change. They both default to false to maintain the current behavior by default. Reviewers: Mickael Maison <mimaison@apache.org>, Viktor Somogyi-Vass <viktorsomogyi@gmail.com>	2023-11-24 16:41:16 +01:00
Dongnuo Lyu	d5a8b892ae	KAFKA-15856: Add integration tests for JoinGroup API and SyncGroup API (#14800 ) This patch adds integration tests for JoinGroup API and SyncGroup API. Reviewers: David Jacot <djacot@confluent.io>	2023-11-23 02:30:48 -08:00
Dongnuo Lyu	891dd2a58a	KAFKA-15756: [1/2] Migrate existing integration tests to run old protocol in new coordinator (#14781 ) This patch updates the testing framework to support running tests with kraft and the new group coordinator introduced in the context of KIP-848. This can be done by using `kraft+kip-848` as a quorum. Note that this is temporary until we make it the default and only option in 4.0. To verify this, this patch also enables kraft and kraft+kip-848 in PlaintextConsumerTest and its parent classes. Reviewers: David Jacot <djacot@confluent.io>	2023-11-23 02:05:54 -08:00
Igor Soarez	e90692246a	KAFKA-15362: Resolve offline replicas in metadata cache (#14737 ) The metadata cache now considers registered log directories and directory assignments when determining offline replicas. Reviewers: Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>	2023-11-21 09:40:04 -08:00
David Jacot	7826d5fc8a	MINOR: Mark ConsumerGroupHeartbeat API (v1), OffsetCommit API (v9) and OffsetFetch API (v9) as stable (KIP-848) (#14801 ) We plan to ship an early access of KIP-848 in AK 3.7. Therefore, we need to mark the ConsumerGroupHeartbeat API (v1), OffsetCommit API (v9) and OffsetFetch API (v9) as stable. Reviewers: Andrew Schofield <aschofield@confluent.io>, Justine Olshan <jolshan@confluent.io>	2023-11-21 02:51:59 -08:00
Dongnuo Lyu	4ad1d7842e	KAFKA-15705: Add integration tests for Heartbeat API and GroupLeave API (#14656 ) Reviewers: David Jacot <djacot@confluent.io>	2023-11-21 00:37:39 -08:00
Jeff Kim	07fee62afe	KAFKA-14519; [2/N] New coordinator metrics (#14387 ) This patch copies over existing metrics and adds new consumer group metrics to the new GroupCoordinatorService. Now that each coordinator is responsible for a topic partition, this patch introduces a GroupCoordinatorMetrics that records gauges for global metrics such as the number of generic groups in PreparingRebalance state, etc. For GroupCoordinatorShard specific metrics, GroupCoordinatorMetrics will activate new GroupCoordinatorMetricsShards that will be responsible for incrementing/decrementing TimelineLong objects and then aggregate the total amount across all shards. As the CoordinatorRuntime/CoordinatorShard does not care about group metadata, we have introduced a CoordinatorMetrics.java/CoordinatorMetricsShard.java so that in the future transaction coordinator metrics can also be onboarded in a similar fashion. Main files to look at: GroupCoordinatorMetrics.java GroupCoordinatorMetricsShard.java CoordinatorMetrics.java CoordinatorMetricsShard.java CoordinatorRuntime.java Metrics to add after #14408 is merged: offset deletions sensor (OffsetDeletions); Meter(offset-deletion-rate, offset-deletion-count) Metrics to add after https://issues.apache.org/jira/browse/KAFKA-14987 is merged: offset expired sensor (OffsetExpired); Meter(offset-expiration-rate, offset-expiration-count) Reviewers: Justine Olshan <jolshan@confluent.io>	2023-11-20 21:38:50 -08:00
Igor Soarez	c7c82baf87	MINOR: Always send cumulative failed dirs in HB request (#14770 ) Instead of only sending failed log directory UUIDs in the heartbeat request until a successful response is received, the broker sends the full cumulative set of failed directories since startup time. This aims to simplify the handling of log directory failure in the controller side, considering overload mode handling of heartbeat requests, which returns an undifferentiated reply. Reviewers: Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>	2023-11-20 16:18:38 -08:00
Jason Gustafson	e905ef1edf	MINOR: Small LogValidator clean ups (#14697 ) This patch contains a few small clean-ups in LogValidator and associated classes: 1. Set shallowOffsetOfMaxTimestamp consistently as the last offset in the batch for v2 compressed and non-compressed data. 2. Rename `RecordConversionStats` to `RecordValidationStats` since one of its fields `temporaryMemoryBytes` does not depend on conversion. 3. Rename `batchIndex` to `recordIndex` in loops over the records in each batch inside `LogValidator`. Reviewers: Qichao Chu <5326144+ex172000@users.noreply.github.com>, Jun Rao <junrao@gmail.com>	2023-11-20 10:40:45 -08:00
Philip Nee	e63f23718f	KAFKA-15174: Ensure CommitAsync propagate the exception to the user (#14680 ) The commit covers a few important points: - Exception handling: We should throw RetriableCommitException when the commit exception is retriable. We should throw FencedIdException on commit and poll similar to the current implementation. Other errors should be thrown as is. - Callback invocation: The callbacks need to be invoked on the main/application thread; however, the future is completed in the background thread. To achieve this, I created an Invoker class with a queue, so that this callback can be invoked during the consumer.poll() Note: One change I made is to remove the DefaultOffsetCommit callback. Since the callback is purely for logging, I think it is reasonable for us to move the logging to the background thread instead of relying on the application thread to trigger the logging. Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2023-11-20 09:15:48 +01:00
Ismael Juma	df78204e05	KAFKA-15854: Move Java classes from `kafka.server` to the `server` module (#14796 ) We only move Java classes that have minimal or no dependencies on Scala classes in this PR. Details: * Configured `server` module in build files. * Changed `ControllerRequestCompletionHandler` to be an interface since it has no implementations. * Cleaned up various import control files. * Minor build clean-ups for `server-common`. * Disabled `testAssignmentAggregation` when executed with Java 8, this is an existing issue (see #14794). For broader context on this change, please check: * KAFKA-15852: Move server code from `core` to `server` module Reviewers: Divij Vaidya <diviv@amazon.com>	2023-11-19 22:09:19 -08:00
David Jacot	fe7a373baa	HOTFIX: Fix compilation error in ReplicaManagerConcurrencyTest for Scala 2.12 (#14786 ) https://github.com/apache/kafka/pull/14369 introduced a compilation error in ReplicaManagerConcurrencyTest for Scala 2.12. Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2023-11-17 00:29:26 -08:00
Igor Soarez	a03a71d7b5	KAFKA-15357: Aggregate and propagate assignments A new AssignmentsManager accumulates, batches, and sends KIP-858 assignment events to the Controller. Assignments are sent via AssignReplicasToDirs requests. Move QuorumTestHarness.formatDirectories into TestUtils so it can be used in other test contexts. Fix a bug in ControllerRegistration.java where the wrong version of the record was being generated in ControllerRegistration.toRecord. Reviewers: Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>, Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>	2023-11-16 16:19:49 -08:00
Crispin Bernier	b1d83e2b04	Revert "Revert "KAFKA-15661: KIP-951: Server side changes (#14444 )" (#14738 )" (#14747 ) This KIP-951 commit was reverted to investigate the org.apache.kafka.tiered.storage.integration.ReassignReplicaShrinkTest test failure (#14738). A fix for that was merged in #14757, hence unreverting this change. This reverts commit `a98bd7d`. Reviewers: Justine Olshan <jolshan@confluent.io>, Mayank Shekhar Narula <mayanks.narula@gmail.com>	2023-11-16 15:42:34 -08:00
David Arthur	a8622faf47	KAFKA-15799 Handle full metadata updates on ZK brokers (#14719 ) This patch adds the concept of a "Full" UpdateMetadataRequest, similar to what is used in LeaderAndIsr. A new tagged field is added to UpdateMetadataRequest at version 8 which allows the KRaft controller to indicate if a UMR contains all the metadata or not. Since UMR is implicitly treated as incremental by the ZK broker, we needed a way to detect topic deletions when the KRaft broker sends a metadata snapshot to the ZK broker. By sending a "Full" flag, the broker can now compare existing topic IDs to incoming topic IDs and calculate which topics should be removed from the MetadataCache. This patch only removes deleted topics from the MetadataCache. Partition/log management was implemented in KAFKA-15605. Reviewers: Colin P. McCabe <cmccabe@apache.org>	2023-11-16 14:38:44 -08:00
Jorge Esteban Quilcate Otoya	875e610a2b	KAFKA-15802: Validate remote segment state before fetching index (#14727 ) (#14759 ) Reviewers: Satish Duggana <satishd@apache.org>, Divij Vaidya <diviv@amazon.com>, Christo Lolov <lolovc@amazon.com>, Luke Chen <showuon@gmail.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>	2023-11-16 08:52:01 +05:30
Kirk True	22f7ffe5e1	KAFKA-15277: Design & implement support for internal Consumer delegates (#14670 ) The consumer refactoring project introduced another `Consumer` implementation, creating two different, coexisting implementations of the `Consumer` interface: * `KafkaConsumer` (AKA "existing", "legacy" consumer) * `PrototypeAsyncConsumer` (AKA "new", "refactored" consumer) The goal of this task is to refactor the code via the delegation pattern so that we can keep a top-level `KafkaConsumer` but then delegate to another implementation under the covers. There will be two delegates at first: * `LegacyKafkaConsumer` * `AsyncKafkaConsumer` `LegacyKafkaConsumer` is essentially a renamed `KafkaConsumer`. That implementation handles the existing group protocol. `AsyncKafkaConsumer` is renamed from `PrototypeAsyncConsumer` and will implement the new consumer group protocol from KIP-848. Both of those implementations will live in the `internals` sub-package to discourage their use. This task is part of the work to implement support for the new KIP-848 consumer group protocol. Reviewers: Philip Nee <pnee@confluent.io>, Andrew Schofield <aschofield@confluent.io>, David Jacot <djacot@confluent.io>	2023-11-15 05:00:40 -08:00
Justine Olshan	83b7c9a053	MINOR Re-add action queue parameter removed from appendRecords (#14753 ) In `91fa196`, I accidentally removed the action queue parameter that was added in `7d147cf`. I also renamed the actionQueue so as not to confuse this in the future. I don't think this broke anything since we don't use verification for group coordinator commits, but I should fix it to be as it was before. Reviewers: Artem Livshits <alivshits@confluent.io>, Jason Gustafson <jason@confluent.io>	2023-11-14 15:56:42 -08:00
David Jacot	a98bd7d65f	Revert "KAFKA-15661: KIP-951: Server side changes (#14444 )" (#14738 ) This reverts commit `f38b0d8`. Trying to find the root cause of org.apache.kafka.tiered.storage.integration.ReassignReplicaShrinkTest failing in CI. Reviewers: Justine Olshan <jolshan@confluent.io>	2023-11-11 18:12:17 -08:00
David Jacot	fcfd378129	HOTFIX: Fix compilation error in BrokerLifecycleManager (#14732 ) This patch fixes a compilation error introduced by https://github.com/apache/kafka/pull/14392 for Scala 2.12. ``` > Task :core:compileScala [Error] /home/jenkins/workspace/Kafka_kafka-pr_PR-14392/core/src/main/scala/kafka/server/BrokerLifecycleManager.scala:305:49: value incl is not a member of scala.collection.immutable.Set[org.apache.kafka.common.Uuid] ``` Reviewers: Luke Chen <showuon@gmail.com>	2023-11-10 01:27:35 -08:00
Crispin Bernier	f38b0d886c	KAFKA-15661: KIP-951: Server side changes (#14444 ) This is the server side changes to populate the fields in KIP-951. On NOT_LEADER_OR_FOLLOWER errors in both FETCH and PRODUCE the new leader ID and epoch are retrieved from the local cache through ReplicaManager and included in the response, falling back to the metadata cache if they are unavailable there. The endpoint for the new leader is retrieved from the metadata cache. The new fields are all optional (tagged) and an IBP bump was required. https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client https://issues.apache.org/jira/browse/KAFKA-15661 Protocol changes: #14627 Testing Benchmarking described here https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client#KIP951:Leaderdiscoveryoptimisationsfortheclient-BenchmarkResults ./gradlew core:test --tests kafka.server.KafkaApisTest Reviewers: Justine Olshan <jolshan@confluent.io>, David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>, Fred Zheng <zhengyd2014@gmail.com>, Mayank Shekhar Narula <mayanks.narula@gmail.com>, Yang Yang <yayang@uber.com>, David Mao <dmao@confluent.io>, Kirk True <ktrue@confluent.io>	2023-11-09 21:07:21 -08:00
Igor Soarez	eaa6b8abdd	KAFKA-15360: Include dirs in BrokerRegistration #14392 BrokerLifecycleManager should send the offline log directories in the BrokerHeartbeatRequests it sends. Also, when handling BrokerHeartbeatResponses, do so by enqueuing a BrokerLifecycleManager event, rather than trying to do the handling directly in the callback. Reviewers: Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>	2023-11-09 11:01:01 -08:00
Colin Patrick McCabe	7060c08d6f	MINOR: Rewrite the meta.properties handling code in Java and fix some issues #14628 (#14628 ) meta.properties files are used by Kafka to identify log directories within the filesystem. Previously, the code for handling them was in BrokerMetadataCheckpoint.scala. This PR rewrites the code for handling them as Java and moves it to the apache.kafka.metadata.properties namespace. It also gets rid of the separate types for v0 and v1 meta.properties objects. Having separate types wasn't so bad back when we had a strict rule that zk clusters used v0 and kraft clusters used v1. But ZK migration has blurred the lines. Now, a zk cluster may have either v0 or v1, if it is migrating, and a kraft cluster may have either v0 or v1, at any time. The new code distinguishes between an individual meta.properties file, which is represented by MetaProperties, and a collection of meta.properties files, which is represented by MetaPropertiesEnsemble. It is useful to have this distinction, because in JBOD mode, even if some log directories are inaccessible, we can still use the ensemble to extract needed information like the cluster ID. (Of course, even when not in JBOD mode, KRaft servers have always been able to configure a metadata log directory separate from the main log directory.) Since we recently added a unique directory.id to each meta.properties file, the previous convention of passing a "canonical" MetaProperties object for the cluster around to various places in the code needs to be revisited. After all, we can no longer assume all of the meta.properties files are the same. This PR fixes these parts of the code. For example, it fixes the constructors of ControllerApis and RaftManager to just take a cluster ID, rather than a MetaProperties object. It fixes some other parts of the code, like the constructor of SharedServer, to take a MetaPropertiesEnsemble object. Another goal of this PR was to centralize meta.properties validation a bit more and make it unit-testable. For this purpose, the PR adds MetaPropertiesEnsemble.verify, and a few other verification methods. These enforce invariants like "the metadata directory must be readable," and so on. Reviewers: Igor Soarez <soarez@apple.com>, David Arthur <mumrah@gmail.com>, Divij Vaidya <diviv@amazon.com>, Proven Provenzano <pprovenzano@confluent.io>	2023-11-09 09:32:35 -08:00
Justine Olshan	91fa196930	KAFKA-15653: Pass requestLocal as argument to callback so we use the correct one for the thread (#14629 ) With the new callback mechanism we were accidentally passing context with the wrong request local. Now include a RequestLocal as an explicit argument to the callback. Also make the arguments passed through the callback clearer by separating the method out. Added a test to ensure we use the request handler's request local and not the one passed in when the callback is executed via the request handler. Reviewers: Ismael Juma <ismael@juma.me.uk>, Divij Vaidya <diviv@amazon.com>, David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>, Artem Livshits <alivshits@confluent.io>, Jun Rao <junrao@gmail.com>,	2023-11-07 15:14:17 -08:00
Calvin Liu	edc7e10a74	KAFKA-15583: Enforce HWM advance only if partition is not under-min-ISR (#14594 ) Only advance the HWM for a partition if the ISR set is equal to or above the min ISR config. This patch also sets an upper bound on the min ISR config so it cannot exceed the number of replicas. Reviewers: David Arthur <mumrah@gmail.com>	2023-11-07 10:24:40 -05:00
David Mao	c6ea0a84ab	KAFKA-15780: Wait for consistent KRaft metadata when creating or deleting topics (#14695 ) TestUtils.createTopicWithAdmin calls waitForAllPartitionsMetadata which waits for partition(s) to be present in each brokers' metadata cache. This is a sufficient check in ZK mode because the controller sends an LISR request before sending an UpdateMetadataRequest which means that the partition in the ReplicaManager will be updated before the metadata cache. In KRaft mode, the metadata cache is updated first, so the check may return before partitions and other metadata listeners are fully initialized. Testing: Insert a Thread.sleep(100) in BrokerMetadataPublisher.onMetadataUpdate after // Publish the new metadata image to the metadata cache. metadataCache.setImage(newImage) and run EdgeCaseRequestTest.testProduceRequestWithNullClientId and the test will fail locally nearly deterministically. After the change(s), the test no longer fails. Reviewers: Justine Olshan <jolshan@confluent.io>	2023-11-06 17:07:56 -08:00
Qichao Chu	173c9c9dfc	MINOR: Fix flaky ProducerIdManagerTest.testUnrecoverableErrors (#14688 ) We add a sleep until RetryBackoffMs to ensure that next call to generateProducerId() is triggered. Reviewers: Divij Vaidya <diviv@amazon.com>	2023-11-06 12:07:48 +01:00
Apoorv Mittal	a53147e7d9	KAFKA-15673: Adding client metrics resource types (KIP-714) (#14621 ) This PR adds resources to store and handle client metrics needed for KIP-714. Changes include: Adding CLIENT_METRICS to resource type Corresponding DYNAMIC client configurations in resources. Changes to support dynamic loading of configuration on changes. Changes to support API calls to fetch data stored against the new resource. Test cases for the changes. Reviewers: Andrew Schofield <andrew_schofield@uk.ibm.com>, Philip Nee <pnee@confluent.io>, Jun Rao <junrao@gmail.com>	2023-11-03 14:51:50 -07:00
Colin Patrick McCabe	4d8efa94cb	MINOR: MetaProperties refactor, part 1 (#14678 ) Since we have added directory.id to MetaProperties, it is no longer safe to assume that all directories on a node contain the same MetaProperties. Therefore, we should get rid of places where we are using a single MetaProperties object to represent the settings of an entire cluster. This PR removes a few such cases. In each case, it is sufficient just to pass cluster ID. The second part of this change refactors KafkaClusterTestKit so that we convert paths to absolute before creating BrokerNode and ControllerNode objects, rather than after. This prepares the way for storing an ensemble of MetaProperties objects in BrokerNode and ControllerNode, which we will do in a follow-up change. Reviewers: Ron Dagostino <rndgstn@gmail.com>	2023-11-02 10:26:52 -07:00
Igor Soarez	0390d5b1a2	KAFKA-15355: Message schema changes (#14290 ) Reviewers: Christo Lolov <lolovc@amazon.com>, Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>, Ron Dagostino <rdagostino@confluent.io>	2023-11-02 09:46:05 -04:00
hudeqi	9911fab1a1	KAFKA-15432: RLM Stop partitions should not be invoked for non-tiered storage topics (#14667 ) Reviewers: Christo Lolov <lolovc@amazon.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>	2023-11-02 10:00:15 +01:00
Dongnuo Lyu	5b37ec5e57	KAFKA-15598 & KAFKA-15461: Add integration tests for DescribeGroups API, DeleteGroups API, OffsetDelete API and ListGroups API (#14537 ) This patch adds integration tests for four group coordinator APIs with new group coordinator and new protocol, new group coordinator and old protocol, and old group coordinator and old protocols for the following APIs: - DescribeGroups - DeleteGroups - OffsetDelete - ListGroups Reviewers: Ritika Reddy <rreddy@confluent.io>, David Jacot <djacot@confluent.io>	2023-11-02 01:47:43 -07:00
Alok Thatikunta	eca8502990	KAFKA-14484: [1/N] Move PartitionMetadataFile to storage module (#14607 ) This PR moves PartitionMetadataFile to the storage module. Existing unit tests in UnifiedLogTest like testLogFlushesPartitionMetadataOnAppend should suffice. Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>	2023-11-01 09:40:45 -07:00
Ismael Juma	7ef5e8b022	MINOR: Avoid a couple of map copies in `KRaftMetadataCache.getPartitionReplicaEndpoints` (#14660 ) Neither the `toMap` nor the `filter` seems to be necessary. Reviewers: Ziming Deng <dengziming1993@gmail.com>	2023-11-01 10:34:30 +08:00
Nikolay	76b1b50b64	KAFKA-14595 Move ReassignPartitionsCommand to java (#13247 ) This PR contains changes required to move PartitionReassignmentState class to java code. Reviewers: Mickael Maison <mickael.maison@gmail.com>, Justine Olshan <jolshan@confluent.io>, Federico Valeri <fedevaleri@gmail.com>, Taras Ledkov <tledkov@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>	2023-10-31 17:29:05 -07:00
Dongnuo Lyu	7bdd1a015e	KAFKA-15647: Fix the different behavior in error handling between the old and new group coordinator (#14589 ) In `KafkaApis.scala`, we build the API response differently if exceptions are thrown during the API execution. Since the new group coordinator only populates the response with error code instead of throwing an exception when an error occurs, there may be different behavior between the existing group coordinator and the new one. This patch: - Fixes the response building in `KafkaApis.scala` for the two APIs affected by such difference -- OffsetFetch and OffsetDelete. - In `GroupCoordinatorService.java`, returns a response with error code instead of a failed future when the coordinator is not active. Reviewers: David Jacot <djacot@confluent.io>	2023-10-31 03:11:52 -07:00
Calvin Liu	8f8ad6db38	KAFKA-15582: Move the clean shutdown file to the storage package (#14603 ) A follow-up change to move the clean shutdown file to the storage package. Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>	2023-10-30 16:27:40 -07:00
Kirk True	2e2f32c050	KAFKA-15628: Refactor ConsumerRebalanceListener invocation for reuse (#14638 ) Straightforward refactoring to extract an inner class and methods related to `ConsumerRebalanceListener` for reuse in the KIP-848 implementation of the consumer group protocol. Also using `Optional` to explicitly mark when a `ConsumerRebalanceListener` is in use or not, allowing us to make some (forthcoming) optimizations when there is no listener to invoke. Reviewers: David Jacot <djacot@confluent.io>	2023-10-30 11:51:30 -07:00
Igor Soarez	9dbee599f1	MINOR: Rename log dir UUIDs (#14517 ) After a late discussion in the voting thread for KIP-858 we decided to improve the names for the designated reserved log directory UUID values. Reviewers: Christo Lolov <lolovc@amazon.com>, Ismael Juma <ismael@juma.me.uk>, Ziming Deng <dengziming1993@gmail.com>.	2023-10-30 19:10:57 +08:00
David Arthur	37715862d7	KAFKA-15704: Set missing ZkMigrationReady field on ControllerRegistrationRequest This field was missed by the initial KIP-919 PR(s). The result is that migrations can't begin since the controllers will never become ready. This patch fixes that as well as pulls over some fixes from the 3.6 branch. Reviewers: Colin P. McCabe <cmccabe@apache.org>	2023-10-27 14:16:24 -07:00
David Arthur	339d2556c6	KAFKA-15605: Fix topic deletion handling during ZK migration (#14545 ) This patch adds reconciliation logic to migrating ZK brokers to deal with pending topic deletions as well as missed StopReplicas. During the hybrid mode of the ZK migration, the KRaft controller is asynchronously sending UMR and LISR to the ZK brokers to propagate metadata. Since this process is essentially "best effort" it is possible for a broker to miss a StopReplicas. The new logic lets the ZK broker examine its local logs compared with the full set of replicas in a "Full" LISR. Any local logs which are not present in the set of replicas in the request are removed from ReplicaManager and marked as "stray". To avoid inadvertent data loss with this new behavior, the brokers do not delete the "stray" partitions. They will rename the directories and log warning messages during log recovery. It will be up to the operator to manually delete the stray partitions. We can possibly enhance this in the future to clean up old stray logs. This patch makes use of the previously unused Type field on LeaderAndIsrRequest. This was added as part of KIP-516 but never implemented. Since its introduction, an implicit 0 was sent in all LISR. The KRaft controller will now send a value of 2 to indicate a full LISR (as specified by the KIP). The presence of this value acts as a trigger for the ZK broker to perform the log reconciliation. Reviewers: Colin P. McCabe <cmccabe@apache.org>	2023-10-26 18:13:52 -04:00
hudeqi	b559942c17	KAFKA-15671: Fix flaky test RemoteIndexCacheTest.testClearCacheAndIndexFilesWhenResizeCache (#14622 ) Reviewers: Divij Vaidya <diviv@amazon.com> --------- Co-authored-by: Deqi Hu <deqi.hu@shopee.com>	2023-10-25 11:18:55 +02:00
dengziming	03ea24aa1d	MINOR: Fix flaky testFollowerCompleteDelayedFetchesOnReplication (#14616 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>	2023-10-25 09:39:29 +08:00
Nikolay	e0121a38b1	MINOR: Deduplicating ConsumerGroupCommand print formatting (#14610 ) ConsumerGroupCommand contains duplicated code for formatting table rows. This PR reduces the duplication and makes the code clearer and easier to understand. Reviewers: Luke Chen <showuon@gmail.com>, hudeqi <1217150961@qq.com>	2023-10-24 15:16:32 +08:00
Jotaniya Jeel	4612fe42af	KAFKA-15481: Fix concurrency bug in RemoteIndexCache (#14483 ) RemoteIndexCache has a concurrency bug which leads to IOException while fetching data from remote tier. The bug could be reproduced as per the following order of events:- Thread 1 (cache thread): invalidates the entry, removalListener is invoked async, so the files have not been renamed to "deleted" suffix yet. Thread 2: (fetch thread): tries to find entry in cache, doesn't find it because it has been removed by 1, fetches the entry from S3, writes it to existing file (using replace existing) Thread 1: async removalListener is invoked, acquires a lock on old entry (which has been removed from cache), it renames the file to "deleted" and starts deleting it Thread 2: Tries to create in-memory/mmapped index, but doesn't find the file and hence, creates a new file of size 2GB in AbstractIndex constructor. JVM returns an error as it won't allow creation of 2GB random access file. This commit fixes the bug by using EvictionListener instead of RemovalListener to perform the eviction atomically with the file rename. It handles the manual removal (not handled by EvictionListener) by using computeIfAbsent() and enforcing atomic cache removal & file rename. Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Arpit Goyal <goyal.arpit.91@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>	2023-10-23 14:50:46 +02:00
Gantigmaa Selenge	84a58d75bb	KAFKA-15566: Fix test FetchRequestTest.testLastFetchedEpochValidation for KRaft mode (#14563 ) The test fails due to an unexpected error (OFFSET_OUT_OF_RANGE) when enabled with KRaft mode. It takes longer to set the leader epoch in KRaft mode because the topic partitions are created differently than in Zookeeper mode. In Zookeeper mode, we create the topic partitions directly with Zookeeper, which seems to take less time to create the logs and set the leader epoch on the broker. In KRaft mode, we use the Admin client to create topic partitions. Even though the test waits for topic partitions to get created and appear in the metadata cache, that doesn’t seem to be sufficient time for the leader epoch to get set on the brokers. Reviewers: Luke Chen <showuon@gmail.com>, dengziming <dengziming1993@gmail.com>	2023-10-23 11:05:57 +08:00
Justine Olshan	e8c8969330	KAFKA-15626: Replace verification guard object with a specific type (#14568 ) I've added a new class with an incrementing atomic long to represent the verification guard. Upon creation of verification guard, we will increment this value and assign it to the guard. The expected behavior is the same as the object guard, but with better debuggability with the string value and type safety (I found a type safety issue in the current code when implementing this) Reviewers: Ismael Juma <ismael@juma.me.uk>, Artem Livshits <alivshits@confluent.io>	2023-10-20 14:26:20 -07:00
hudeqi	21ebbe6b28	MINOR:Remove unused method parameter in ConsumerGroupCommand (#14585 ) In ConsumerGroupCommand, there are two methods: getLogEndOffsets and getLogStartOffsets, the first parameter groupId is not used, so remove it. Reviewers: Luke Chen <showuon@gmail.com>	2023-10-20 10:05:47 +08:00
Gantigmaa Selenge	486d5f6c64	KAFKA-15566: Fix flaky tests in FetchRequestTest.scala in KRaft mode (#14573 ) Fixed some of the failing tests in FetchRequestTest. testFetchWithPartitionsWithIdError and testCreateIncrementalFetchWithPartitionsInErrorV12 fail with the following error when enabled with KRaft mode. These tests only fail sometimes when running locally but consistently failed when running in the Jenkins Pipeline. Tests will call the utility function TestUtils.waitUntilLeaderIsKnown after creating the topic partitions so that they wait for the logs to be created on the leader before sending fetch requests. Enabled all tests except checkLastFetchedEpochValidation with KRaft mode. Looking at the build history in Jenkins, all the other tests except these 2 tests and checkLastFetchedEpochValidation were passing when they were enabled with KRaft mode. Therefore enabled them with KRaft mode again but left checkLastFetchedEpochValidation to be investigated further. Reviewers: Luke Chen <showuon@gmail.com>, dengziming <dengziming1993@gmail.com>	2023-10-20 09:59:21 +08:00
Calvin Liu	af747fbfed	KAFKA-15581: Introduce ELR (#14312 ) This patch introduces preliminary changes for Eligible Leader Replicas (KIP-966) * New MetadataVersion 16 (3.7-IV1) * New record versions for PartitionRecord and PartitionChangeRecord * New tagged fields on PartitionRecord and PartitionChangeRecord * New static config "eligible.leader.replicas.enable" to gate the whole feature Reviewers: Artem Livshits <alivshits@confluent.io>, David Arthur <mumrah@gmail.com>, Colin P. McCabe <cmccabe@apache.org>	2023-10-19 14:05:15 -04:00
Calvin Liu	14029e2ddd	KAFKA-15582: Identify clean shutdown broker (#14465 ) The PR includes: * Added a new class of CleanShutdownFile which helps write and read from a clean shutdown file. * Updated the BrokerRegistration API. * Client side handling for the broker epoch. * Minimum work on the controller side. Reviewers: Jun Rao <junrao@gmail.com>	2023-10-19 10:25:23 -07:00
Apoorv Mittal	36abc8dcea	KAFKA-15604: Telemetry API request and response schemas and classes (KIP-714) (#14554 ) Initial PR for [KIP-714](https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability) - [KAFKA-15601](https://issues.apache.org/jira/browse/KAFKA-15601). This PR defines json request and response schemas for the new Telemetry APIs and implements the corresponding java classes. Reviewers: Andrew Schofield <andrew_schofield@uk.ibm.com>, Kirk True <ktrue@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Walker Carlson <wcarlson@apache.org>	2023-10-19 10:55:21 -05:00
Mickael Maison	8aee297669	MINOR: Various Java cleanups in core (#14561 ) Reviewers: Josep Prat <josep.prat@aiven.io>	2023-10-18 11:49:25 +02:00
Matthias J. Sax	9b468fb278	MINOR: Do not end Javadoc comments with `**/` (#14540 ) Reviewers: Bruno Cadonna <bruno@confluent.io>, Bill Bejeck <bill@confluent.io>, Hao Li <hli@confluent.io>, Josep Prat <josep.prat@aiven.io>	2023-10-17 21:11:04 -07:00
Jeff Kim	abee8f711c	KAFKA-14519; [1/N] Implement coordinator runtime metrics (#14417 ) Implements the following metrics: kafka.server:type=group-coordinator-metrics,name=num-partitions,state=loading kafka.server:type=group-coordinator-metrics,name=num-partitions,state=active kafka.server:type=group-coordinator-metrics,name=num-partitions,state=failed kafka.server:type=group-coordinator-metrics,name=event-queue-size kafka.server:type=group-coordinator-metrics,name=partition-load-time-max kafka.server:type=group-coordinator-metrics,name=partition-load-time-avg kafka.server:type=group-coordinator-metrics,name=thread-idle-ratio-min kafka.server:type=group-coordinator-metrics,name=thread-idle-ratio-avg The PR makes these metrics generic so that in the future the transaction coordinator runtime can implement the same metrics in a similar fashion. Also, CoordinatorLoaderImpl#load will now return LoadSummary which encapsulates the start time, end time, number of records/bytes. Co-authored-by: David Jacot <djacot@confluent.io> Reviewers: Ritika Reddy <rreddy@confluent.io>, Calvin Liu <caliu@confluent.io>, David Jacot <djacot@confluent.io>, Justine Olshan <jolshan@confluent.io>	2023-10-17 16:06:23 -07:00
Mickael Maison	9d04c7a045	MINOR: Various Scala cleanups in core (#14558 ) Reviewers: Ismael Juma <ismael@juma.me.uk>	2023-10-17 12:04:14 +02:00
Omnia G.H Ibrahim	9af1e74b5e	KAFKA-14596: Move TopicCommand to tools (#13201 ) Reviewers: Mickael Maison <mickael.maison@gmail.com>, Federico Valeri <fedevaleri@gmail.com>	2023-10-17 11:40:15 +02:00
Ismael Juma	69e591db3a	MINOR: Rewrite/Move KafkaNetworkChannel to the `raft` module (#14559 ) This is now possible since `InterBrokerSend` was moved from `core` to `server-common`. Also rewrite/move `KafkaNetworkChannelTest`. The scala version of `KafkaNetworkChannelTest` passed with the changes here (before I deleted it). Reviewers: Justine Olshan <jolshan@confluent.io>, José Armando García Sancio <jsancio@users.noreply.github.com>	2023-10-16 20:10:31 -07:00
dengziming	5c9db5e735	KAFKA-15390: Do not return fenced broker in FetchResponse.preferredReplica (#14272 ) Do not return fenced brokers from metadataCache.getPartitionReplicaEndpoints, since that could lead to them getting used as preferred read replicas. Reviewers: Colin P. McCabe <cmccabe@apache.org>	2023-10-16 15:08:40 -07:00
Ismael Juma	1073d434ec	KAFKA-14481: Move LogSegment/LogSegments to storage module (#14529 ) A few notes: * Delete a few methods from `UnifiedLog` that were simply invoking the related method in `LogFileUtils` * Fix `CoreUtils.swallow` to use the passed in `logging` * Fix `LogCleanerParameterizedIntegrationTest` to close `log` before reopening * Minor tweaks in `LogSegment` for readability For broader context on this change, please check: * KAFKA-14470: Move log layer to storage module Reviewers: Divij Vaidya <diviv@amazon.com>, Satish Duggana <satishd@apache.org>	2023-10-16 06:37:30 -07:00
hudeqi	b0b8693c72	KAFKA-15536: Dynamically resize remoteIndexCache (#14511 ) Dynamically resize remoteIndexCache Reviewers: Christo Lolov <lolovc@amazon.com>, Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>	2023-10-16 15:24:36 +08:00
Ismael Juma	4cf86c5d2f	KAFKA-15492: Upgrade and enable spotbugs when building with Java 21 (#14533 ) Spotbugs was temporarily disabled as part of KAFKA-15485 to support Kafka build with JDK 21. This PR upgrades the spotbugs version to 4.8.0 which adds support for JDK 21 and enables its usage on build again. Reviewers: Divij Vaidya <diviv@amazon.com>	2023-10-12 14:09:10 +02:00

4818 Commits