kafka

Commit Graph

Author	SHA1	Message	Date
Christo	c5169ca805	Bump version to 4.0.1	2025-08-18 14:01:19 +01:00
Shashank	de857226fe	KAFKA-15307: Kafka Streams configuration docs outdated (#20329 ) Updated Kafka Streams configuration documentation to stay latest with version 4.0.0. Reviewers: Matthias J. Sax <matthias@confluent.io>	2025-08-17 13:19:05 -07:00
Clemens Hutter	7b507bd37b	MINOR: Remove SPAM URL in Streams Documentation (#20321 ) The previous URL http://lambda-architecture.net/ seems to now be controlled by spammers Co-authored-by: Shashank <hsshashank.grad@gmail.com> Reviewers: Mickael Maison <mickael.maison@gmail.com>	2025-08-13 12:09:05 -07:00
Matthias J. Sax	704a7a4013	MINOR: add missing section to TOC (#20305 ) Add new group coordinator metrics section to TOC. Reviewers: Chia-Ping Tsai <chia7712@gmail.com>	2025-08-05 14:30:18 -07:00
Jared Harley	0f9b312703	KAFKA-19576 Fix typo in state-change log filename after rotate (#20269 ) The `state-change.log` file is being incorrectly rotated to `stage-change.log.[date]`. This change fixes the typo to have the log file correctly rotated to `state-change.log.[date]` _No functional changes._ Reviewers: Mickael Maison <mickael.maison@gmail.com>, Christo Lolov <lolovc@amazon.com>, Luke Chen <showuon@gmail.com>, Ken Huang <s7133700@gmail.com>, TengYao Chi <kitingiao@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>	2025-08-05 12:49:20 +08:00
Lucas Brutschy	deb58910c8	KAFKA-19529: State updater sensor names should be unique (#20262 ) (#20273 ) CI / build (push) Has been cancelled Details All state updater threads use the same metrics instance, but do not use unique names for their sensors. This can have the following symptoms: 1) Data inserted into one sensor by one thread can affect the metrics of all state updater threads. 2) If one state updater thread is shutdown, the metrics associated to all state updater threads are removed. 3) If one state updater thread is started, while another one is removed, it can happen that a metric is registered with the `Metrics` instance, but not associated to any `Sensor` (because it is concurrently removed), which means that the metric will not be removed upon shutdown. If a thread with the same name later tries to register the same metric, we may run into a `java.lang.IllegalArgumentException: A metric named ... already exists`, as described in the ticket. This change fixes the bug giving unique names to the sensors. A test is added that there is no interference of the removal of sensors and metrics during shutdown. Reviewers: Matthias J. Sax <matthias@confluent.io>	2025-08-01 14:59:15 +02:00
Christo	2a45b4f662	Merge tag '4.0.1-rc0' into 4.0 CI / build (push) Has been cancelled Details 4.0.1-rc0	2025-07-24 14:21:04 +01:00
Christo	b81230cfff	Bump version to 4.0.1	2025-07-24 14:21:04 +01:00
Christo	9c3fffb89c	Merge tag '4.0.1-rc0' into 4.0 4.0.1-rc0	2025-07-24 13:57:57 +01:00
Christo	aaf1864d0c	Bump version to 4.0.1	2025-07-24 13:57:57 +01:00
Christo	72fbb371e3	Merge tag '4.0.1-rc0' into 4.0 4.0.1-rc0	2025-07-24 13:20:49 +01:00
Christo	4432d6a558	Bump version to 4.0.1	2025-07-24 13:20:49 +01:00
Christo	d183e520a5	Merge tag '4.0.1-rc0' into 4.0 4.0.1-rc0	2025-07-24 12:39:43 +01:00
Christo	01d6edc510	Bump version to 4.0.1	2025-07-24 12:39:43 +01:00
Christo	b3aeb69cb4	Bump version to 4.0.1	2025-07-24 12:34:54 +01:00
Tsung-Han Ho (Miles Ho)	74d93adadb	KAFKA-19501 Update OpenJDK base image from buster to bullseye (#20165 ) CI / build (push) Has been cancelled Details The changes update the OpenJDK base image from 17-buster to 17-bullseye: - Updates tests/docker/Dockerfile to use openjdk:17-bullseye instead of openjdk:17-buster - Updates tests/docker/ducker-ak script to use the new default image - Updates documentation in tests/README.md with the new image name examples Reviewers: Federico Valeri <fedevaleri@gmail.com>, TengYao Chi <kitingiao@gmail.com>, Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>	2025-07-22 22:39:25 +08:00
Dmitry Werner	12e695e298	KAFKA-19520 Bump Commons-Lang for CVE-2025-48924 (#20196 ) CI / build (push) Has been cancelled Details Bump Commons-Lang for CVE-2025-48924. Reviewers: Luke Chen <showuon@gmail.com>, Federico Valeri <fedevaleri@gmail.com>	2025-07-19 15:07:48 +08:00
Kaushik Raina	70c51641fb	Cherrypick "MINOR : Handle error for client telemetry push (#19881 )" (#20176 ) CI / build (push) Has been cancelled Details Update catch to handle compression errors Before : ![image](https://github.com/user-attachments/assets/c5ca121e-ba0c-4664-91f1-20b54abf67cc) After ``` Sent message: KR Message 376 [kafka-producer-network-thread \| kr-kafka-producer] INFO org.apache.kafka.common.telemetry.internals.ClientTelemetryReporter - KR: Failed to compress telemetry payload for compression: zstd, sending uncompressed data Sent message: KR Message 377 ``` Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>, Bill Bejeck <bbejeck@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>	2025-07-16 17:38:02 +01:00
Ming-Yen Chung	eefee6d58d	KAFKA-19427 Allow the coordinator to grow its buffer dynamically (#20040 ) * Coordinator starts with a smaller buffer, which can grow as needed. * In freeCurrentBatch, release the appropriate buffer: * The Coordinator recycles the expanded buffer (`currentBatch.builder.buffer()`), not `currentBatch.buffer`, because `MemoryBuilder` may allocate a new `ByteBuffer` if the existing one isn't large enough. * There are two cases that buffer may exceeds `maxMessageSize` 1. If there's a single record whose size exceeds `maxMessageSize` (which, so far, is derived from `max.message.bytes`) and the write is in `non-atomic` mode, it's still possible for the buffer to grow beyond `maxMessageSize`. In this case, the Coordinator should revert to using a smaller buffer afterward. 2. Coordinator do not recycles the buffer that larger than `maxMessageSize`. If the user dynamically reduces `maxMessageSize` to a value even smaller than `INITIAL_BUFFER_SIZE`, the Coordinator should avoid recycling any buffer larger than `maxMessageSize` so that Coordinator can allocate the smaller buffer in the next round. * Add tests to verify the above scenarios. Reviewers: David Jacot <djacot@confluent.io>, Sean Quah <squah@confluent.io>, Ken Huang <s7133700@gmail.com>, PoAn Yang <payang@apache.org>, TaiJuWu <tjwu1217@gmail.com>, Jhen-Yung Hsu <jhenyunghsu@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>	2025-07-16 22:18:38 +08:00
Bill Bejeck	d95857a155	KAFKA-19504: Remove unused metrics reporter initialization in KafkaAdminClient (#20166 ) CI / build (push) Has been cancelled Details The `AdminClient` adds a telemetry reporter to the metrics reporters list in the constructor. The problem is that the reporter was already added in the `createInternal` method. In the `createInternal` method call, the `clientTelemetryReporter` is added to a `List<MetricReporters>` which is passed to the `Metrics` object, will get closed when `Metrics.close()` is called. But adding a reporter to the reporters list in the constructor is not used by the `Metrics` object and hence doesn't get closed, causing a memory leak. All related tests pass after this change. Reviewers: Apoorv Mittal <apoorvmittal10@apache.org>, Matthias J. Sax <matthias@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Jhen-Yung Hsu <jhenyunghsu@gmail.com>	2025-07-14 20:28:50 -04:00
Ismael Juma	80b9abebad	KAFKA-19444: Add back JoinGroup v0 & v1 (#20116 ) CI / build (push) Has been cancelled Details This fixes librdkafka older than the recently released 2.11.0 with Kerberos authentication and Apache Kafka 4.x. Even though this is a bug in librdkafka, a key goal of KIP-896 is not to break the popular client libraries listed in it. Adding back JoinGroup v0 & v1 is a very small change and worth it from that perspective. Reviewers: Chia-Ping Tsai <chia7712@gmail.com>	2025-07-08 16:01:26 -07:00
Gaurav Narula	8433ac4d31	KAFKA-19221 Propagate IOException on LogSegment#close (#20072 ) CI / build (push) Has been cancelled Details Log segment closure results in right sizing the segment on disk along with the associated index files. This is specially important for TimeIndexes where a failure to right size may eventually cause log roll failures leading to under replication and log cleaner failures. This change uses `Utils.closeAll` which propagates exceptions, resulting in an "unclean" shutdown. That would then cause the broker to attempt to recover the log segment and the index on next startup, thereby avoiding the failures described above. Reviewers: Omnia Ibrahim <o.g.h.ibrahim@gmail.com>, Jun Rao <junrao@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>	2025-07-06 03:05:34 +08:00
Gaurav Narula	45327fd597	KAFKA-18035: Backport TransactionsTest testBumpTransactionalEpochWithTV2Disabled failed on trunk (#20102 ) CI / build (push) Waiting to run Details Backports the flakyness fix in #18451 to 4.0 branch > Sometimes we didn't get into abortable state before aborting, so the epoch didn't get bumped. Now we force abortable state with an attempt to send before aborting so the epoch bump occurs as expected. > > Reviewers: Jeff Kim <jeff.kim@confluent.io> Reviewers: Chia-Ping Tsai <chia7712@gmail.com> Co-authored-by: Justine Olshan <jolshan@confluent.io>	2025-07-05 05:00:10 +08:00
Matthias J. Sax	4ce6f5cb92	MINOR: Improve ProcessorContext JavaDocs (#20042 ) CI / build (push) Has been cancelled Details Clarify that state stores are sharded, and shards cannot be shared across Processors. Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2025-06-26 10:08:51 -07:00
Rajini Sivaram	6351bc05aa	MINOR: Fix response for consumer group describe with empty group id (#20030 ) (#20036 ) CI / build (push) Waiting to run Details ConsumerGroupDescribe with an empty group id returns a response containing `null` groupId in a non-nullable field. Since the response cannot be serialized, this results in UNKNOWN_SERVER_ERROR being returned to the client. This PR sets the group id in the response to an empty string instead and adds request tests for empty group id. Reviewers: David Jacot <djacot@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>	2025-06-25 17:56:58 +01:00
Logan Zhu	15ec053665	KAFKA-18656 Backport KAFKA-18597 to 4.0 (#20026 ) CI / build (push) Waiting to run Details Backport of [KAFKA-18597](https://github.com/apache/kafka/pull/18627) to the 4.0 branch. Reviewers: Chia-Ping Tsai <chia7712@gmail.com>	2025-06-25 19:19:30 +08:00
Calvin Liu	46e843da9f	KAFKA-19383: Handle the deleted topics when applying ClearElrRecord (#20034 ) CI / build (push) Waiting to run Details https://issues.apache.org/jira/browse/KAFKA-19383 When applying the ClearElrRecord, it may pick up the topicId in the image without checking if the topic has been deleted. This can cause the creation of a new TopicRecord with an old topic ID. Reviewers: Alyssa Huang <ahuang@confluent.io>, Artem Livshits <alivshits@confluent.io>, Colin P. McCabe <cmccabe@apache.org> No conflicts.	2025-06-24 19:59:50 -07:00
Colin Patrick McCabe	7e51a2a43b	KAFKA-19294: Fix BrokerLifecycleManager RPC timeouts (#19745 ) Previously, we could wait for up to half of the broker session timeout for an RPC to complete, and then delay by up to half of the broker session timeout. When taken together, these two delays could lead to brokers erroneously missing heartbeats. This change removes exponential backoff for heartbeats sent from the broker to the controller. The load caused by heartbeats is not heavy, and controllers can easily time out heartbeats when the queue length is too long. Additionally, we now set the maximum RPC time to the length of the broker period. This minimizes the impact of heavy load. Reviewers: José Armando García Sancio <jsancio@apache.org>, David Arthur <mumrah@gmail.com>	2025-06-24 16:39:07 -07:00
Alyssa Huang	52a5b88512	KAFKA-19411: Fix deleteAcls bug which allows more deletions than max records per user op (#19974 ) CI / build (push) Waiting to run Details If there are more deletion filters after we initially hit the `MAX_RECORDS_PER_USER_OP` bound, we will add an additional deletion record ontop of that for each additional filter. The current error message returned to the client is not useful either, adding logic so client doesn't just get `UNKNOWN_SERVER_EXCEPTION` with no details returned.	2025-06-24 16:00:20 -07:00
Lan Ding	d426d65041	MINOR: fix reassign command bug (#20003 ) see `9570c67b8c/core/src/main/scala/kafka/admin/ReassignPartitionsCommand.scala (L1208)` During the rewrite for [KAFKA-14595](https://github.com/apache/kafka/pull/13247), the relevant condition was omitted. Reviewers: Chia-Ping Tsai <chia7712@gmail.com>	2025-06-25 02:38:41 +08:00
Okada Haruki	9fcfe546d1	KAFKA-19407 Fix potential IllegalStateException when appending to timeIndex (#19972 ) ## Summary - Fix potential race condition in LogSegment#readMaxTimestampAndOffsetSoFar(), which may result in non-monotonic offsets and causes replication to stop. - See https://issues.apache.org/jira/browse/KAFKA-19407 for the details how it happen. Reviewers: Vincent PÉRICART <mauhiz@gmail.com>, Jun Rao <junrao@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>	2025-06-25 00:37:15 +08:00
Ritika Reddy	c8b8adf3c1	KAFKA-19367: Follow up bug fix (#19991 ) CI / build (push) Waiting to run Details This is a follow up to [https://github.com/apache/kafka/pull/19910](https://github.com/apache/kafka/pull/url) The coordinator failed to write an epoch fence transition for producer jt142 to the transaction log with error COORDINATOR_NOT_AVAILABLE. The epoch was increased to 2 but not returned to the client (kafka.coordinator.transaction.TransactionCoordinator) -- as we don't bump the epoch with this change, we should also update the message to not say "increased" and remove the epochAndMetadata.transactionMetadata.hasFailedEpochFence = true line In the test, the expected behavior is: First append transaction to the log fails with COORDINATOR_NOT_AVAILABLE (epoch 1) We try init_pid again, this time the SINGLE epoch bump succeeds, and the following things happen simultaneously (epoch 2) -> Transition to COMPLETE_ABORT -> Return CONCURRENT_TRANSACTION error to the client The client retries, and there is another epoch bump; state transitions to EMPTY (epoch 3) Reviewers: Justine Olshan <jolshan@confluent.io>	2025-06-23 15:15:36 -07:00
Apoorv Mittal	26e5a53906	HOTFIX: Correcting build after cherry-pick (#19969 ) CI / build (push) Has been cancelled Details Fixing build after cherrypicking: ``` commit `254c1fa519` Author: Apoorv Mittal <apoorvmittal10@gmail.com> Date: Thu Jun 12 22:52:50 2025 +0100 MINOR: Fixing client telemetry validate request (#19959) Minor fix to correct the validate condition for GetTelemetryRequests. Added respective tests as well. Reviewers: Andrew Schofield <aschofield@confluent.io> ``` Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>	2025-06-16 10:11:18 +01:00
Ritika Reddy	c6b44b5d66	Cherry Pick KAFKA-19367 to 4.0 (#19958 ) CI / build (push) Has been cancelled Details [`0b2e410d61`](url) Bug fix in 4.0 Conflicts: - The Transaction Coordinator had some conflicts, mainly with the transaction states. Ex: ongoing in 4.0 is TransactionState.ONGOING in 4.1. - The TransactionCoordinatorTest file had conflicts w.r.t the 2PC changes from KIP-939 in 4.1 and the above mentioned state changes Reviewers: Justine Olshan <jolshan@confluent.io>, Artem Livshits <alivshits@confluent.io>	2025-06-14 11:40:00 -07:00
Apoorv Mittal	254c1fa519	MINOR: Fixing client telemetry validate request (#19959 ) CI / build (push) Has been cancelled Details Minor fix to correct the validate condition for GetTelemetryRequests. Added respective tests as well. Reviewers: Andrew Schofield <aschofield@confluent.io>	2025-06-12 22:56:21 +01:00
Luke Chen	00a1b1e8ce	Bump the commons-beanutils for CVE-2025-48734. Since `commons-validator` CI / build (push) Has been cancelled Details hasn't had new release with newer `commons-beanutils` versions, we manually bump it in kafka. Reviewers: Mickael Maison <mickael.maison@gmail.com>	2025-06-11 15:27:22 +08:00
Okada Haruki	1cc14f6343	KAFKA-19334 MetadataShell execution unintentionally deletes lock file (#19817 ) CI / build (push) Has been cancelled Details ## Summary - MetadataShell may deletes lock file unintentionally when it exists or fails to acquire lock. If there's running server, this causes unexpected result as below: * MetadataShell succeeds on 2nd run unexpectedly * Even worse, LogManager/RaftManager's lock also no longer work from concurrent Kafka process startup Reviewers: TengYao Chi <frankvicky@apache.org> # Conflicts: # shell/src/test/java/org/apache/kafka/shell/MetadataShellIntegrationTest.java	2025-06-09 12:48:19 +08:00
Alyssa Huang	ded7653066	MINOR: Fix some Request toString methods (#19655 ) (#19689 ) CI / build (push) Has been cancelled Details Reviewers: Colin P. McCabe <cmccabe@apache.org> ``` Conflicts: clients/src/main/java/org/apache/kafka/common/requests/AlterUserScramCredentialsRequest.java - import statement clients/src/main/java/org/apache/kafka/common/requests/IncrementalAlterConfigsRequest.java - import statement core/src/test/scala/unit/kafka/server/KafkaApisTest.scala - different logging and metadatacache instantiation ``` Cherry-Picked-From: `042be5b9ac` Cherry-Picked-By: Alyssa Huang <ahuang@confluent.io> Cherry-Picked-At: Mon May 12 11:01:47 2025 -0700	2025-05-27 15:37:29 -07:00
Dongnuo Lyu	e9c5069be7	KAFKA-18687: Setting the subscriptionMetadata during conversion to consumer group (#19790 ) CI / build (push) Waiting to run Details When a consumer protocol static member replaces an existing member in a classic group, it's not necessary to recompute the assignment. However, it happens anyway. In [ConsumerGroup.fromClassicGroup](`0ff4dafb7d/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/modern/consumer/ConsumerGroup.java (L1140)`), we don't set the group's subscriptionMetadata. Later in the consumer group heartbeat, we [call updateSubscriptionMetadata](`0ff4dafb7d/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java (L1748)`), which [notices that the group's subscriptionMetadata needs an update](`0ff4dafb7d/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java (L2757)`) and bumps the epoch. Since the epoch is bumped, we [recompute the assignment](`0ff4dafb7d/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java (L1766)`). As a fix, this patch sets the subscriptionMetadata in ConsumerGroup.fromClassicGroup. Reviewers: Sean Quah <squah@confluent.io>, David Jacot <djacot@confluent.io>	2025-05-27 11:31:13 +02:00
Alyssa Huang	3170e1130c	KAFKA-18345; Prevent livelocked elections (#19658 ) CI / build (push) Has been cancelled Details At the retry limit binaryExponentialElectionBackoffMs it becomes statistically likely that the exponential backoff returned electionBackoffMaxMs. This is an issue as multiple replicas can get stuck starting elections at the same cadence. This change fixes that by added a random jitter to the max election backoff. Reviewers: José Armando García Sancio <jsancio@apache.org>, TaiJuWu <tjwu1217@gmail.com>, Yung <yungyung7654321@gmail.com>	2025-05-20 16:10:23 -04:00
Andy Li	14fd498ed0	MINOR: API Responses missing latest version in Kafka protocol guide (#19769 ) ### Issue: API Responses missing latest version in [Kafka protocol guide](https://kafka.apache.org/protocol.html) #### For example: These are missing: - ApiVersions Response (Version: 4) — Only versions 0–3 are documented, though version 4 of the request is included. - DescribeTopicPartitions Response — Not listed at all. - Fetch Response (Version: 17) — Only versions 4–16 are documented, though version 17 of the request is included. #### After the fix: docs/generated/protocol_messages.html <img width="1045" alt="image" src="https://github.com/user-attachments/assets/5ea79ced-aab5-4c47-8e09-9956047c9bf1" /> Reviewers: dengziming <dengziming1993@gmail.com>, Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>	2025-05-21 00:27:32 +08:00
Matthias J. Sax	923086dba2	KAFKA-19171: Kafka Streams crashes with UnsupportedOperationException (#19507 ) CI / build (push) Has been cancelled Details This PR fixes a regression bug introduced with KAFKA-17203. We need to pass in mutable collections into `closeTaskClean(...)`. Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Bruno Cadonna <bruno@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>	2025-05-15 21:40:08 -07:00
Matthias J. Sax	62c6697ac9	KAFKA-19208: KStream-GlobalKTable join should not drop left-null-key record (#19580 ) CI / build (push) Waiting to run Details Reviewers: Lucas Brutschy <lbrutschy@confluent.io>	2025-05-15 18:37:39 -07:00
David Jacot	d7d7876989	KAFKA-19274; Group Coordinator Shards are not unloaded when `__consumer_offsets` topic is deleted (#19713 ) CI / build (push) Waiting to run Details Group Coordinator Shards are not unloaded when `__consumer_offsets` topic is deleted. The unloading is scheduled but it is ignored because the epoch is equal to the current epoch: ``` [2025-05-13 08:46:00,883] INFO [GroupCoordinator id=1] Scheduling unloading of metadata for __consumer_offsets-0 with epoch OptionalInt[0] (org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime) [2025-05-13 08:46:00,883] INFO [GroupCoordinator id=1] Scheduling unloading of metadata for __consumer_offsets-1 with epoch OptionalInt[0] (org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime) [2025-05-13 08:46:00,883] INFO [GroupCoordinator id=1] Ignored unloading metadata for __consumer_offsets-0 in epoch OptionalInt[0] since current epoch is 0. (org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime) [2025-05-13 08:46:00,883] INFO [GroupCoordinator id=1] Ignored unloading metadata for __consumer_offsets-1 in epoch OptionalInt[0] since current epoch is 0. (org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime) ``` This patch fixes the issue by not setting the leader epoch in this case. The coordinator expects the leader epoch to be incremented when the resignation code is called. When the topic is deleted, the epoch is not incremented. Therefore, we must not use it. Note that this is aligned with deleted partitions are handled too. Reviewers: Dongnuo Lyu <dlyu@confluent.io>, José Armando García Sancio <jsancio@apache.org>	2025-05-15 19:12:42 +02:00
Kuan-Po Tseng	f99db0804e	KAFKA-19275 client-state and thread-state metrics are always "Unavailable" (#19712 ) CI / build (push) Has been cancelled Details Fix the issue where JMC is unable to correctly display client-state and thread-state metrics. The root cause is that these two metrics directly return the `State` class to JMX. If the user has not set up the RMI server, JMC or other monitoring tools will be unable to interpret the `State` class. To resolve this, we should return a string representation of the state instead of the State class in these two metrics. Reviewers: Luke Chen <showuon@gmail.com>, Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>	2025-05-14 14:08:01 +08:00
Sean Quah	d64a97099d	KAFKA-18688: Fix uniform homogeneous assignor stability (#19677 ) CI / build (push) Waiting to run Details When the number of partitions is not divisible by the number of members, some members will end up with one more partition than others. Previously, we required these to be the members at the start of the iteration order, which meant that partitions could be reassigned even when the previous assignment was already balanced. Allow any member to have the extra partition, so that we do not move partitions around when the previous assignment is already balanced. Before the PR ``` Benchmark (assignmentType) (assignorType) (isRackAware) (memberCount) (partitionsToMemberRatio) (subscriptionType) (topicCount) Mode Cnt Score Error Units ServerSideAssignorBenchmark.doAssignment FULL RANGE false 10000 50 HOMOGENEOUS 1000 avgt 2 26.175 ms/op ServerSideAssignorBenchmark.doAssignment FULL RANGE false 10000 50 HETEROGENEOUS 1000 avgt 2 123.955 ms/op ServerSideAssignorBenchmark.doAssignment INCREMENTAL RANGE false 10000 50 HOMOGENEOUS 1000 avgt 2 24.408 ms/op ServerSideAssignorBenchmark.doAssignment INCREMENTAL RANGE false 10000 50 HETEROGENEOUS 1000 avgt 2 114.873 ms/op ``` After the PR ``` Benchmark (assignmentType) (assignorType) (isRackAware) (memberCount) (partitionsToMemberRatio) (subscriptionType) (topicCount) Mode Cnt Score Error Units ServerSideAssignorBenchmark.doAssignment FULL RANGE false 10000 50 HOMOGENEOUS 1000 avgt 2 24.259 ms/op ServerSideAssignorBenchmark.doAssignment FULL RANGE false 10000 50 HETEROGENEOUS 1000 avgt 2 118.513 ms/op ServerSideAssignorBenchmark.doAssignment INCREMENTAL RANGE false 10000 50 HOMOGENEOUS 1000 avgt 2 24.636 ms/op ServerSideAssignorBenchmark.doAssignment INCREMENTAL RANGE false 10000 50 HETEROGENEOUS 1000 avgt 2 115.503 ms/op ``` Reviewers: David Jacot <djacot@confluent.io>	2025-05-13 17:03:19 +02:00
Sean Quah	48f75616d7	KAFKA-19163: Avoid deleting groups with pending transactional offsets (#19496 ) When a group has pending transactional offsets but no committed offsets, we can accidentally delete it while cleaning up expired offsets. Add a check to avoid this case. Reviewers: David Jacot <djacot@confluent.io>	2025-05-13 14:10:55 +02:00
David Jacot	9aad1849e2	MINOR: Fix version in 4.0 branch (#19686 ) CI / build (push) Waiting to run Details This patch fixes the version used in the `4.0` branch. It should be `4.0.1` instead of `4.1.0`. Reviewers: Chia-Ping Tsai <chia7712@gmail.com>	2025-05-12 20:17:30 +02:00
ChickenchickenLove	5015183c1c	KAFKA-19242: Fix commit bugs caused by race condition during rebalancing. (#19631 ) ### Motivation While investigating “events skipped in group rebalancing” ([spring‑projects/spring‑kafka#3703](https://github.com/spring-projects/spring-kafka/issues/3703)) I discovered a race condition between - the main poll/commit thread, and - the consumer‑coordinator heartbeat thread. If the main thread enters `ConsumerCoordinator.sendOffsetCommitRequest()` while the heartbeat thread is finishing a rebalance (`SyncGroupResponseHandler.handle()`), the group state transitions in the following order: ``` COMPLETING_REBALANCE → (race window) → STABLE ``` Because we read the state twice without a lock: 1. `generationIfStable()` returns `null` (state still `COMPLETING_REBALANCE`), 2. the heartbeat thread flips the state to `STABLE`, 3. the main thread re‑checks with `rebalanceInProgress()` and wrongly decides that a rebalance is still active, 4. a spurious `CommitFailedException` is returned even though the commit could succeed. For more details, please refer to sequence diagram below. <img width="1494" alt="image" src="https://github.com/user-attachments/assets/90f19af5-5e2d-4566-aece-ef764df2d89c" /> ### Impact - The exception is semantically wrong: the consumer is in a stable group, but reports failure. - Frameworks and applications that rely on the semantics of `CommitFailedException` and `RetryableCommitException` (for example `Spring Kafka`) take the wrong code path, which can ultimately skip the events and break “at‑most‑once” guarantees. ### Fix We enlarge the synchronized block in `ConsumerCoordinator.sendOffsetCommitRequest()` so that the consumer group state is examined atomically with respect to the heartbeat thread: ### Jira https://issues.apache.org/jira/browse/KAFKA-19242 https: //github.com/spring-projects/spring-kafka/issues/3703 Signed-off-by: chickenchickenlove <ojt90902@naver.com> Reviewers: David Jacot <david.jacot@gmail.com>	2025-05-12 17:01:54 +02:00
Sean Quah	cf3c177936	KAFKA-19160;KAFKA-19164; Improve performance of fetching stable offsets (#19497 ) CI / build (push) Waiting to run Details When fetching stable offsets in the group coordinator, we iterate over all requested partitions. For each partition, we iterate over the group's ongoing transactions to check if there is a pending transactional offset commit for that partition. This can get slow when there are a large number of partitions and a large number of pending transactions. Instead, maintain a list of pending transactions per partition to speed up lookups. Reviewers: Shaan, Dongnuo Lyu <dlyu@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, David Jaco <djacot@confluent.io>	2025-05-12 09:34:43 +02:00

1 2 3 4 5 ...

14996 Commits All Branches Search

14996 Commits

All Branches