Commit Graph

16148 Commits

Author SHA1 Message Date
PoAn Yang ea771563e0
KAFKA-14604: SASL session expiration time will be overflowed when calculation (#18526)
CI / build (push) Has been cancelled Details
Fixup PR Labels / fixup-pr-labels (needs-attention) (push) Has been cancelled Details
Fixup PR Labels / fixup-pr-labels (triage) (push) Has been cancelled Details
Docker Image CVE Scanner / scan_jvm (3.7.2) (push) Has been cancelled Details
Docker Image CVE Scanner / scan_jvm (3.8.1) (push) Has been cancelled Details
Docker Image CVE Scanner / scan_jvm (3.9.1) (push) Has been cancelled Details
Docker Image CVE Scanner / scan_jvm (4.0.0) (push) Has been cancelled Details
Docker Image CVE Scanner / scan_jvm (latest) (push) Has been cancelled Details
Flaky Test Report / Flaky Test Report (push) Has been cancelled Details
Fixup PR Labels / needs-attention (push) Has been cancelled Details
The timeout value may be overflowed if users set a large expiration
time.

```
sessionExpirationTimeNanos = authenticationEndNanos + 1000 * 1000 *
sessionLifetimeMs;
```

Fixed it by throwing exception if the value is overflowed.

Reviewers: TaiJuWu <tjwu1217@gmail.com>, Luke Chen <showuon@gmail.com>,
 TengYao Chi <frankvicky@apache.org>

Signed-off-by: PoAn Yang <payang@apache.org>
2025-08-03 19:12:04 +08:00
Tsung-Han Ho (Miles Ho) 3f1d830174
MINOR: Remove duplicate renewTimePeriodOpt in DelegationTokenCommand validation (#20177)
The bug was a duplicate parameter validation in the
`DelegationTokenCommand` class.  The `checkInvalidArgs` method for the
`describeOpt` was incorrectly including `renewTimePeriodOpt` twice in
the set of invalid arguments.

This bug caused unexpected command errors during E2E testing.

### Before the fix:
The following command would fail due to the duplicate validation logic:
```
TC_PATHS="tests/kafkatest/tests/core/delegation_token_test.py::DelegationTokenTest"
/bin/bash tests/docker/run_tests.sh
```

### Error output:
```
ducktape.cluster.remoteaccount.RemoteCommandError: ducker@ducker03:
Command
'KAFKA_OPTS="-Djava.security.auth.login.config=/mnt/security/jaas.conf
-Djava.security.krb5.conf=/mnt/security/krb5.conf"
/opt/kafka-dev/bin/kafka-delegation-tokens.sh --bootstrap-server
ducker03:9094  --create  --max-life-time-period -1  --command-config
/mnt/kafka/client.properties > /mnt/kafka/delegation_token.out' returned
non-zero exit status 1. Remote error message: b'duplicate element:
[renew-time-period]\njava.lang.IllegalArgumentException: duplicate
element: [renew-time-period]\n\tat
java.base/java.util.ImmutableCollections$SetN.<init>(ImmutableCollections.java:918)\n\tat
java.base/java.util.Set.of(Set.java:544)\n\tat
org.apache.kafka.tools.DelegationTokenCommand$DelegationTokenCommandOptions.checkArgs(DelegationTokenCommand.java:304)\n\tat
org.apache.kafka.tools.DelegationTokenCommand.execute(DelegationTokenCommand.java:79)\n\tat
org.apache.kafka.tools.DelegationTokenCommand.mainNoExit(DelegationTokenCommand.java:57)\n\tat
org.apache.kafka.tools.DelegationTokenCommand.main(DelegationTokenCommand.java:52)\n\n'

[INFO:2025-07-31 11:27:25,531]: RunnerClient:
kafkatest.tests.core.delegation_token_test.DelegationTokenTest.test_delegation_token_lifecycle.metadata_quorum=ISOLATED_KRAFT:
Data: None
================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.12.0
session_id:       2025-07-31--002
run time:         33.213 seconds
tests run:        1
passed:           0
flaky:            0
failed:           1
ignored:          0
================================================================================
test_id:
kafkatest.tests.core.delegation_token_test.DelegationTokenTest.test_delegation_token_lifecycle.metadata_quorum=ISOLATED_KRAFT
status:     FAIL
run time:   33.090 seconds
```

### After the fix:
The same command now executes successfully:
```
TC_PATHS="tests/kafkatest/tests/core/delegation_token_test.py::DelegationTokenTest"
/bin/bash tests/docker/run_tests.sh
```

### Success output:
```
================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.12.0
session_id:       2025-07-31--001
run time:         35.488 seconds
tests run:        1
passed:           1
flaky:            0
failed:           0
ignored:          0
================================================================================
test_id:
kafkatest.tests.core.delegation_token_test.DelegationTokenTest.test_delegation_token_lifecycle.metadata_quorum=ISOLATED_KRAFT
status:     PASS
run time:   35.363 seconds
--------------------------------------------------------------------------------
```

Reviewers: Jhen-Yung Hsu <jhenyunghsu@gmail.com>, TengYao Chi
 <frankvicky@apache.org>, Ken Huang <s7133700@gmail.com>, PoAn Yang
 <payang@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>
2025-08-03 16:40:18 +08:00
Apoorv Mittal 05d71ad1a8
KAFKA-19476: Concurrent execution fixes for lock timeout and lso movement (#20286)
CI / build (push) Has been cancelled Details
The PR fixes following:

1. In case share partition arrive at a state which should be treated as
final state
of that batch/offset (example - LSO movement which causes offset/batch
to be ARCHIVED permanently), the result of pending write state RPCs for
that offset/batch override the ARCHIVED state. Hence track such updates
and apply when transition is completed.

2. If an acquisition lock timeout occurs while an offset/batch is
undergoing transition followed by write state RPC failure, then
respective batch/offset can
land in a scenario where the offset stays in ACQUIRED state with no
acquisition lock timeout task.

3. If a timer task is cancelled, but due to concurrent execution of
timer task and acknowledgement, there can be a scenario when timer task
has processed post cancellation. Hence it can mark the offset/batch
re-avaialble despite already acknowledged.

Reviewers: Andrew Schofield <aschofield@confluent.io>, Abhinav Dixit
 <adixit@confluent.io>
2025-08-01 23:20:25 +01:00
Andrew Schofield b909544e99
MINOR: Improve consistency of acknowledge type terminology (#20282)
The code had a mixture of "acknowledgement type" and "acknowledge type".
The latter is preferred.

Reviewers: TengYao Chi <frankvicky@apache.org>, Lan Ding
 <isDing_L@163.com>
2025-08-01 21:17:22 +01:00
Shivsundar R e1f45218c9
KAFKA-19485 (II) : Complete any pending acknowledgements in ShareFetch on an error response. (#20247)
CI / build (push) Waiting to run Details
*What*
Currently when we received a top level error response in ShareFetch, we
would log the error, update the share session epoch and proceed to the
next request.
But these acknowledgements(if any) are not completed and the callback
would not have been processed.

PR aims to address this by completing these acknowledgements with the
error code from the response in this case.

Reviewers: Andrew Schofield <aschofield@confluent.io>
2025-07-31 22:07:44 +01:00
Andrew Schofield f38359300b
MINOR: Fix javadoc mistake (#20281)
Fixes a couple of tiny mistakes in the javadoc for KafkaShareConsumer.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2025-07-31 19:25:04 +01:00
majialong 6b96735872
MINOR: Fix typo in GetOffsetShell (#20277)
Fix typo in GetOffsetShell : `visible for tseting` -> `Visible for
testing`

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2025-08-01 00:48:31 +08:00
Manikumar Reddy f73e3bcd6a
KAFKA-19561: Set OP_WRITE interest after SASL reauthentication to resume pending writes (#20258)
https://issues.apache.org/jira/browse/KAFKA-19561

Addresses a race condition during SASL reauthentication where the
server-side `KafkaChannel.send()` queues a response, but OP_WRITE is
removed before the channel becomes writable — resulting in stuck
responses and client  timeouts.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
2025-07-31 21:59:21 +05:30
TaiJuWu dfaf9f9cf7
MINOR: add test for `kafka-consumer-groups.sh` should not fail when partition offline (#20235)
CI / build (push) Waiting to run Details
See: https://github.com/apache/kafka/pull/20168#discussion_r2227310093

add follow test case:

Given a topic with three partitions, where partition `t-2` is offline,
if partitionsToReset contains only `t-1`, the method
filterNoneLeaderPartitions incorrectly includes `t-2` in the result,
leading to a failure in the tool.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Jhen-Yung Hsu
 <jhenyunghsu@gmail.com>, Ken Huang <s7133700@gmail.com>, Andrew
 Schofield <aschofield@confluent.io>
2025-07-31 15:54:27 +01:00
Lan Ding d0a9a04a02
MINOR: Cleanups in storage module (#20270)
Cleanups including:

- Rewrite `FetchCountAndOp`  as a record class
- Replace `Tuple` by `Map.Entry`

Reviewers: TengYao Chi <frankvicky@apache.org>, Chia-Ping Tsai
 <chia7712@gmail.com>
2025-07-31 20:55:49 +08:00
Sanskar Jhajharia c7caf912aa
KAFKA-19524 Fix UnsupportedOperationException in connect-plugin-path (#20222)
### Problem
The connect-plugin-path tool crashes with
`UnsupportedOperationException` when processing plugins that have
loadable classes but no ServiceLoader manifest files.

### Root Cause
Line 326 attempts to remove from an immutable `Collections.emptySet()`:
```java
nonLoadableManifests.getOrDefault(pluginDesc.className(),
Collections.emptySet()).remove(pluginDesc.type());
```

### Solution
Replace `Collections.emptySet()` with `new HashSet<>()` to provide a
mutable set.

Reviewers: Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai
 <chia7712@gmail.com>
2025-07-31 20:37:35 +08:00
Mickael Maison fd9b5514ad
MINOR: Cleanups in coordinator-common/group-coordinator (#20097)
- Use `record` where possible
- Use enhanced `switch`
- Tweak a bunch of assertions

Reviewers: Yung <yungyung7654321@gmail.com>, TengYao Chi
 <frankvicky@apache.org>, Ken Huang <s7133700@gmail.com>, Dongnuo Lyu
 <dlyu@confluent.io>, PoAn Yang <payang@apache.org>
2025-07-31 16:34:08 +08:00
Lucas Brutschy 44c6e956ed
KAFKA-19529: State updater sensor names should be unique (#20262)
CI / build (push) Waiting to run Details
All state updater threads use the same metrics instance, but do not use
unique names for their sensors. This can have the following symptoms:

1) Data inserted into one sensor by one thread can affect the metrics of
all state updater threads.
2) If one state updater thread is shutdown, the metrics associated to
all state updater threads are removed.
3) If one state updater thread is started, while another one is removed,
it can happen that a metric is registered with the `Metrics` instance,
but not associated to any `Sensor` (because it is concurrently removed),
which means that the metric will not be removed upon shutdown. If a
thread with the same name later tries to register the same metric, we
may run into a `java.lang.IllegalArgumentException: A metric named ...
already exists`, as described in the ticket.

This change fixes the bug giving unique names to the sensors. A test is
added that there is no interference of the removal of sensors and
metrics during shutdown.

Reviewers: Matthias J. Sax <matthias@confluent.io>
2025-07-31 08:58:52 +02:00
Luke Chen a058123cd8
KAFKA-19563: improve the add controller doc (#20261)
JIRA: [KAFKA-19563](https://issues.apache.org/jira/browse/KAFKA-19563)
Improve the add-controller doc to notify users they have to add the
configs to be passed to admin client into the controller.properties
before the improvement is done.
```
--command-config COMMAND_CONFIG
      Property file containing configs to be passed to Admin Client.
For  add-controller,  the file is used to specify the controller
properties as well.
```

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2025-07-31 14:09:47 +08:00
Now dda1b5a4e8
MINOR: Fix duplicate 'to' in ExactlyOnceMessageProcessor javadoc (#20228)
CI / build (push) Waiting to run Details
Fixed a simple typo in javadoc comment where "to to" appeared instead of
"to".

_No functional changes_

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2025-07-30 23:59:49 +08:00
Kevin Wu 1bcaa19c46
KAFKA-19489; Extra validation when formatting a node (#20136)
CI / build (push) Waiting to run Details
This PR adds a check to the storage tool's format command which throws a
TerseFailure when the controller.quorum.voters config is defined and the
node is formatted with the --standalone flag or the
--initial-controllers flag.

Without this check, it is possible to have two voter sets. For example,
in a three node setup, the two nodes that formatted with
--no-initial-controllers could form quorum with each other since they
have the static voter set, and the --standalone node would ignore the
config and read the voter set of itself from its log, forming its own
quorum of 1.

Reviewers: José Armando García Sancio <jsancio@apache.org>, TaiJuWu
 <tjwu1217@gmail.com>, Alyssa Huang <ahuang@confluent.io>
2025-07-30 10:58:08 -04:00
Jhen-Yung Hsu 4c1b79851f
MINOR: fix e2e docs for test that cannot be run directly (#20145)
`tests/kafkatest/services/console_consumer.py` cannot be run directly.
The correct way is to run
`tests/kafkatest/sanity_checks/test_console_consumer.py`.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2025-07-30 20:57:57 +08:00
Hong-Yi Chen a5e3bfa232
KAFKA-19527 improve the docs of LogDirDescription for remote storage (#20211)
Clarifies that the fields `LogDirDescription#totalBytes`,
`LogDirDescription#usableBytes`, and `ReplicaInfo#size` do not include
the size of remote storage by updating their corresponding docs.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2025-07-30 20:41:45 +08:00
Mickael Maison 6973deab03
MINOR: Cleanups in storage module (#20087)
Cleanups including:
- Java 17 syntax, record and switch
- assertEquals() order
- javadoc

Reviewers: Andrew Schofield <aschofield@confluent.io>, Jhen-Yung Hsu
 <jhenyunghsu@gmail.com>, Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai
 <chia7712@gmail.com>
2025-07-30 16:02:01 +08:00
Jinhe Zhang 81bdf0b889
MINOR: Remove duplicated code in PgeViewDemo (#20252)
CI / build (push) Waiting to run Details
Remove the default implementation

Reviewers: Lucas Brutschy <lucasbru@apache.org>
2025-07-29 18:18:38 +02:00
Ken Huang 96c8e86cdf
KAFKA-19530 RemoteLogManager should record lag stats when remote storage is offline (#20218)
CI / build (push) Waiting to run Details
When remote storage is offline, then the segmentLag and bytesLag metrics
are not recorded.   These metrics are useful to know the pending data to
upload when remote storage is down.

Reviewers: TaiJuWu <tjwu1217@gmail.com>, Kamal Chandraprakash
 <kamal.chandraprakash@gmail.com>
2025-07-29 20:08:06 +05:30
George Wu 7dba91d025
KAFKA-19484: Fix bug with tiered storage throttle metrics (#20129)
Fixes a bug with tiered storage quota metrics introduced in [KIP-956](https://cwiki.apache.org/confluence/display/KAFKA/KIP-956+Tiered+Storage+Quotas).
The metrics tracking how much time have been spent in a throttled state
can stop reporting if a cluster stops stops doing remote copy/fetch and
the sensors go inactive.

This change delegates the job of refreshing inactive sensors to
SensorAccess. There's pretty similar logic in RLMQuotaManager which is
actually responsible for tracking and enforcing quotas and also uses a
Sensor object.

```
remote-fetch-throttle-time-avg
remote-copy-throttle-time-avg
remote-fetch-throttle-time-max
remote-copy-throttle-time-max
```
Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
2025-07-29 19:37:41 +05:30
lucliu1108 4ff851a562
MINOR: Delete the redundant feature upgrade instruction for running KIP-1071 EA (#20250)
Follow up on https://github.com/apache/kafka/pull/20241.

Delete the instruction that manually set `streams.version=1` for running
Kafka 4.1 since it is already achieved in previous setup steps.

Reviewers: Lucas Brutschy <lucasbru@apache.org>
2025-07-29 14:28:30 +02:00
jimmy dd784e7d7a
KAFKA-16717 [3/N]: Add AdminClient.alterShareGroupOffsets (#19820)
[KAFKA-16717](https://issues.apache.org/jira/browse/KAFKA-16717) aims to
finish the AlterShareGroupOffsets for ShareGroupCommand part.

Reviewers: Lan Ding <isDing_L@163.com>, Chia-Ping Tsai
 <chia7712@gmail.com>, TaiJuWu <tjwu1217@gmail.com>, Andrew Schofield
 <aschofield@confluent.io>
2025-07-29 11:47:24 +01:00
Shivsundar R 40b4fdb0d8
KAFKA-19559: Remove ShareFetchMetricsManager sensors on consumer.close() (#20255)
*What*
https://issues.apache.org/jira/browse/KAFKA-19559

- There are a few sensors(created when a `ShareConsumer` initializes)
which are not removed when the `ShareConsumer` closes.
- To maintain consistency and prevent any leaks, we have to remove all
the sensors when the consumer is closed.

This is similar to this issue for regular consumers -
https://issues.apache.org/jira/browse/KAFKA-19542

Reviewers: Andrew Schofield <aschofield@confluent.io>, Apoorv Mittal
<apoorvmittal10@gmail.com>
2025-07-29 11:10:48 +01:00
Apoorv Mittal 875537f54b
KAFKA-19555: Restrict records acquisition post max in-flight limit (#20253)
The PR restricts the records being acquired post max-inflight limit.
Previously the max in-flight limit was only enforced while considering
the share partition for further fetches i.e. once the limit was reached
the share partition was not considered for further fetches. However,
when the records are actively released then there might be some records
being acquired post max-inflight limit. This is evident with higher
number of consumers reading from same share partition and releasing the
records.

Reviewers: Andrew Schofield <aschofield@confluent.io>, Lan Ding
<isDing_L@163.com>
2025-07-29 10:40:06 +01:00
Uladzislau Blok f9ccf83a7f
KAFKA-18066: Fix mismatched StreamThread ID in log messages (#19517)
CI / build (push) Waiting to run Details
This PR fixes an issue where the thread name shown in log messages did
not match the actual execution context. Previously, log entries
displayed the context of the newly created thread, while the logger
reflected the current executing thread. This mismatch led to confusion
and made log tracing more difficult.

Changes:
 - Use logger without context to not have context
 - Updated log messages to explicitly describe the thread being created
- Fixed instances where the log context reflected the current thread
instead of the newly created one
 
 Reviewers: Matthias J. Sax <matthias@confluent.io>
2025-07-28 17:00:50 -07:00
Rajani K de2adb69de
KAFKA-12281: Deprecate BrokerNotFoundException (#20192)
CI / build (push) Waiting to run Details
Implements KIP-1195.

BrokerNotFoundException exception is unused since 2.8  Marking it
deprecated so that it can be removed in next major release.

Reviewers: Matthias J. Sax <matthias@confluent.io>
2025-07-28 15:18:34 -07:00
Alyssa Huang 7339d65b27
KAFKA-19354: KRaft observer should send fetch to best node (#19854)
Observers may get stuck fetching from bootstrap servers even on
discovery of a leader from a fetch response.

Observers currently fetch from bootstrap servers once their fetch
timeout expires, and even on rediscovery of the leader (e.g. observer
receives fetch response indicating location of leader), observers will
not  attempt to resume fetching from the leader.

This change simplifies observer polling logic to just rely on
maybeSendFetchToBestNode which takes care of the logic of selecting
either the leader or bootstrap servers  as the destination based on
timeouts and backoffs.

Reviewers: José Armando García Sancio <jsancio@apache.org>
2025-07-28 13:03:14 -04:00
Lan Ding dfced692d2
KAFKA-19551: Remove the handling of FatalExitError in RemoteStorageThreadPool (#20245)
CI / build (push) Waiting to run Details
FatalExitError is not thrown after
[KAFKA-19425](https://issues.apache.org/jira/browse/KAFKA-19425). Clean
up the handling of FatalExitError in `RemoteStorageThreadPool`.
2025-07-28 17:49:45 +05:30
stroller e61c297b73
KAFKA-19425: Stop the server when fail to initialize to avoid local segment never got deleted. (#20007)
We found that one broker's local segment on disk never get removed
forever no matter how long it stored. The disk always keep increasing.


![image](https://github.com/user-attachments/assets/42129bb6-7d07-481b-923f-971da3ab12da)
note: Partition 2's node is the exception node.

After we trouble shooting. we find if one broker is very slow to startup
it will cause the
TopicBasedRemoteLogMetadataManager#initializeResources's fail sometime
(it meet expectation due to the server is not ready as fast). Thus it
won't stop the server so that the server still run just with some
exception log but not shutdown.  It won't upload to remote for the local
so that the local segment never to deleted.

So propose the change to shutdown the broker to avoid the silence
critical error which caused the disk keep increasing forever.

Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Luke
 Chen <showuon@gmail.com>
2025-07-28 17:47:09 +05:30
Lan Ding abbb6b3c13
KAFKA-19471: Enable acknowledgement for a record which could not be deserialized (#20148)
CI / build (push) Waiting to run Details
This patch mainly includes two improvements:

1. Update currentFetch when `pollForFetches()` throws an exception.
2. Add an override `KafkaShareConsumer.acknowledge(String topic, int
partition, long offset, AcknowledgeType type)` .

Reviewers: Andrew Schofield <aschofield@confluent.io>
2025-07-27 22:35:04 +01:00
Yeikel Santana 1a176beff1
MINOR: Reword warning message when internal deprecated properties are used (#20224)
This is a small typo I noticed while working on Connect

Original warning message

> level=WARN connector_context=The worker has been configured with one
or more internal converter properties ([internal.key.converter,
schemas.enable, internal.value.converter, schemas.enable]). Support for
these properties was deprecated in version 2.0 and removed in version
3.0, and specifying them will have no effect. Instead, an instance of
the JsonConverter with schemas.enable set to false will be used. For
more information, please visit
https://kafka.apache.org/documentation/#upgrade and consult the upgrade
notesfor the 3.0 release.
class=org.apache.kafka.connect.runtime.WorkerConfig line=310

Reviewers: Ken Huang <s7133700@gmail.com>, Andrew Schofield
 <aschofield@confluent.io>
2025-07-27 21:12:49 +01:00
Jinhe Zhang 73f195f062
MINOR: Re-add pageview demo to :streams:examples and remove dependency on :connect:json (#20239)
CI / build (push) Has been cancelled Details
With 4.0 release, we remove pageview demo because it depends on
`:connect:json` which requires JDK 17.  This PR removes the connect
dependency and adds a customized serializer and deserializer,  to make
pageview demo works with JDK 11.

Reviewers: Matthias J. Sax <matthias@confluent.io>
2025-07-25 11:06:12 -07:00
lucliu1108 46cb6cbcff
MINOR: Improve Kafka Streams Protocol Upgrade Doc (#20241)
As a follow-up minor addition to PR for
[KSTREAMS-7735](https://confluentinc.atlassian.net/browse/KSTREAMS-7735
) (https://github.com/apache/kafka/pull/20029) , add the instructions
for upgrading `streams.version` parameter for KIP-1071 EA.

Reviewers: Andrew Schofield <aschofield@confluent.io>
2025-07-25 18:22:09 +01:00
majialong a27d6e32b0
MINOR: Optimize RemoteLogManager#buildFilteredLeaderEpochMap (#20205)
Optimize `RemoteLogManager#buildFilteredLeaderEpochMap` .

Add a temporary unit test `testBuildFilteredLeaderEpochMapModify` in
`RemoteLogManagerTest` to verify the output consistency of the method
before and after optimization.

Randomly generate leaderEpochs and iterate 100000 times for
verification.

```
    @Test
    public void testBuildFilteredLeaderEpochMapModify() {
        int testIterations = 100000;

        for (int i = 0; i < testIterations; i++) { TreeMap<Integer,
Long> leaderEpochToStartOffset =
generateRandomLeaderEpochAndStartOffset();

            // before optimize              NavigableMap<Integer, Long>
optimizeBefore =
RemoteLogManager.buildFilteredLeaderEpochMap(leaderEpochToStartOffset);

            // after optimize              NavigableMap<Integer, Long>
optimizeAfter =
RemoteLogManager.buildFilteredLeaderEpochMap2(leaderEpochToStartOffset);

            assertEquals(optimizeBefore, optimizeAfter);          } }

    private static TreeMap<Integer, Long>
generateRandomLeaderEpochAndStartOffset() {          TreeMap<Integer,
Long> map = new TreeMap<>();          Random random = new Random();
int numEntries = random.nextInt(100000);          long lastStartOffset =
0;

        for (int i = 0; i < numEntries; i++) {              // generate
a leader epoch              int leaderEpoch = random.nextInt(100000);
long startOffset;

            // generate a random start offset , or use the last start
offset              if (i > 0 && random.nextDouble() < 0.2) {
startOffset = lastStartOffset;              } else { startOffset =
Math.abs(random.nextLong()) % 100000;              } lastStartOffset =
startOffset;

            map.put(leaderEpoch, startOffset);
        }
        return map;
    }
```

Command:
``` ./gradlew storage:test --tests RemoteLogManagerTest```

Result:  All unit tests passed.

<img width="1258" height="424" alt="image"
src="https://github.com/user-attachments/assets/7d9fc3b5-3bbc-440f-b1cf-3a2a5f97557a"
/>  <img width="411" height="66" alt="image"
src="https://github.com/user-attachments/assets/22a0b443-88e8-43d2-a3f2-51266935ed34"
/>

Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>,
 Chia-Ping Tsai <chia7712@gmail.com>
2025-07-24 23:16:27 +08:00
Kevin Wu 93447d5b88
KAFKA-19400: Update AddRaftVoterRequest RPC to version 1 (#19982)
Add the ackWhenCommitted boolean field to the AddRaftVoter request, and
bump the RPC's version to 1.

The default value of the ackWhenCommitted field is true, and in this
case the leader will return a response after committing the VotersRecord
generated by the RPC. If ackWhenCommitted is  false, the leader will
return a response after generating the new VotersRecord and adding it to
the batch accumulator.

Reviewers: Alyssa Huang <ahuang@confluent.io>, José Armando García
 Sancio <jsancio@apache.org>
2025-07-24 10:38:31 -04:00
Shang-Hao Yang 5cc585669f
KAFKA-19545 Remove MetadataVersionValidator (#20233)
The `MetadataVersionValidator` class became unused after commit
b6e6a3c68a (KAFKA-18360: Remove zookeeper configurations) removed the
`inter.broker.protocol.version` configuration, which was the only
component using this validator.

This commit removes both the validator class and its associated test as
they serve no purpose in the current implementation.

Reviewers: Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai
 <chia7712@gmail.com>
2025-07-24 22:23:40 +08:00
Sushant Mahajan 6fac2d0700
MINOR: Prevent error message swallow in persister. (#20238)
* The persister receives the response from the share coordinator and
performs various actions based on the response error code.
* There was an issue in handling the error code and error message. We
were creating an Error object from the error code in the response but
not looking at the error message in the response.
* In some cases where a standard code has a custom associated message -
this will not provide complete information in the logs.
* In this PR, we rectify the situation by first checking the error
message in response and if empty, getting the standard message from
Errors object.

Reviewers: Andrew Schofield <aschofield@confluent.io>, Apoorv Mittal
<apoorvmittal10@gmail.com>
2025-07-24 14:59:43 +01:00
Chirag Wadhwa 1ae4173601
KAFKA-19499: Corrected the logger name in PersisterStateManagerHandler (#20175)
This PR corrects the name of logger in the inner class
PersisterStateManagerHandler. After this change it will be possible to
change the log level dynamically in the kafka brokers. This PR also
includes a small change in a log line in GroupCoordinatorService, making
it clearer.

Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>, Sushant Mahajan
<smahajan@confluent.io>, Lan Ding <isDing_L@163.com>, TengYao Chi
<frankvicky@apache.org>
2025-07-24 12:28:58 +01:00
Maros Orsak 8614e15a28
MINOR: typo in javadoc (#20113)
CI / build (push) Waiting to run Details
Fixup PR Labels / fixup-pr-labels (needs-attention) (push) Has been cancelled Details
Fixup PR Labels / fixup-pr-labels (triage) (push) Has been cancelled Details
Fixup PR Labels / needs-attention (push) Has been cancelled Details
This PR fixes a typo in the Javadoc.

---------

Signed-off-by: see-quick <maros.orsak159@gmail.com>

Reviewers: Luke Chen <showuon@gmail.com>
2025-07-24 19:05:07 +08:00
Matt Welch fba01c42c8
KAFKA-17645 Enable warmup in producer performance test (KIP-1052) (#17340)
CI / build (push) Waiting to run Details
In order to better analyze steady-state performance of Kafka, this PR
enables a warmup in the Producer Performance test. The warmup duration
is specified as a number of records that are a subset of the total
numRecords. If warmup records is greater than 0, the warmup is
represented by a second Stats object which holds warmup results. Once
warmup records have been exhausted, the test switches to using the
existing Stats object. At end of test, if warmup was enabled, the
summary of the whole test (warump + steady state) is printed followed by
the summary of the steady-state portion of the test.  If no warmup is
used, summary prints don't change from existing behavior. This
contribution is an original work and is licensed to the Kafka project
under the Apache license

Testing strategy comprises new Java unit tests added to
ProducerPerformanceTests.java.

Reviewers: Kirk True <kirk@kirktrue.pro>, Federico Valeri
 <fedevaleri@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
2025-07-24 13:07:26 +08:00
Apoorv Mittal d350f603a4
KAFKA-18265: Move inflight batch and state classes from SharePartition (2/N) (#20230)
CI / build (push) Waiting to run Details
Another refactor PR to move in-flight batch and state out of
SharePartition. This PR concludes the refactoring and subsequent PRs for
this ticket will involve code cleanups and better lock handling. However
the intent is to keep PRs small so they can be reviewed easily.

Reviewers: Andrew Schofield <aschofield@confluent.io>
2025-07-23 23:01:23 +01:00
Apoorv Mittal a663ce3f45
KAFKA-18265: Move acquisition lock classes from share partition (1/N) (#20227)
While working on KAFKA-19476, I realized that we need to refactor
SharePartition for read/write lock handling. I have started some work in
the area. For the initial PR, I have moved AcquisitionLockTimeout class
outside of SharePartition.

Reviewers: Andrew Schofield <aschofield@confluent.io>
2025-07-23 20:21:42 +01:00
Kamal Chandraprakash 93adaea599
KAFKA-19523: Gracefully handle error while building remoteLogAuxState (#20201)
CI / build (push) Waiting to run Details
Improve the error handling while building the remote-log-auxiliary state
when a follower node with an empty disk begin to synchronise with the
leader. If the topic has remote storage enabled, then the
ReplicaFetcherThread attempt to build the remote-log-auxiliary state.
Note that the remote-log-auxiliary state gets invoked only when the
leader-log-start-offset is non-zero and leader-log-start-offset is not
equal to leader-local-log-start-offset.

When the LeaderAndISR request is received, then the
ReplicaManager#becomeLeaderOrFollower invokes 'makeFollowers' initially,
followed by the RemoteLogManager#onLeadershipChange call. As a result,
when ReplicaFetcherThread initiates the
RemoteLogManager#fetchRemoteLogSegmentMetadata, the partition may not
have been initialized at that time and throws retriable exception.

Introduced RetriableRemoteStorageException to gracefully handle the
error.

After the patch:
```
[2025-07-19 19:28:20,934] INFO [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Could not build remote log auxiliary state for orange-1 due
to error: RemoteLogManager is not ready for partition: orange-1
(kafka.server.ReplicaFetcherThread)
[2025-07-19 19:28:20,934] INFO [ReplicaFetcher replicaId=3, leaderId=2,
fetcherId=0] Could not build remote log auxiliary state for orange-0 due
to error: RemoteLogManager is not ready for partition: orange-0
(kafka.server.ReplicaFetcherThread)
```

Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>
2025-07-23 19:29:31 +05:30
Chang-Chi Hsu 0086f24101
MINOR: Declare the inner RocksDBDualCFRangeIterator class as static (#20220)
Make inner classes static.

from: https://github.com/apache/kafka/pull/20182#issuecomment-3102893453

Reviewers: Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai
<chia7712@gmail.com>
2025-07-23 21:37:48 +08:00
Kamal Chandraprakash 16c079ed23
KAFKA-19525: Refactor TopicBasedRLMM implementation to remove unused code (#20204)
CI / build (push) Waiting to run Details
- startConsumerThread is always true so removed the variable.
- Replaced the repetitive lock handling logic with
`withReadLockAndEnsureInitialized` to reduce duplication and improve
readability.
- Consolidated the logic in `initializeResources` and. simplified method
arguments to better encapsulate configuration.
- Extracted common code and reduced the usage of global variables.
- Named the variables properly.

Tests:
- Existing UTs since this patch refactored the code.

Reviewers: PoAn Yang <payang@apache.org>
2025-07-23 12:19:13 +05:30
Sanskar Jhajharia f1e9aa1c65
MINOR: Fix flaky tests in Tools modules (#20225)
### Problem
The
`ShareGroupCommandTest.testDeleteShareGroupOffsetsArgsWithoutTopic()`,
`ShareGroupCommandTest.testDeleteShareGroupOffsetsArgsWithoutGroup()`,
`ResetStreamsGroupOffsetTest.testResetOffsetsWithoutGroupOption()`,
`DeleteStreamsGroupTest.testDeleteWithoutGroupOption()`,
`DescribeStreamsGroupTest.testDescribeWithoutGroupOption()` tests were
flaky due to a dependency on Set iteration order in error message
generation.

### Root Cause
The cleanup [commit](https://github.com/apache/kafka/pull/20091) that
replaced `new HashSet<>(Arrays.asList(...))` with `Set.of(...)` in
ShareGroupCommandOptions and StreamsGroupCommandOptions changed the
iteration characteristics of collections used for error message
generation:

This produces different orders like `[topic], [group]` vs `[group],
[topic]`, but the tests expected a specific order, causing intermittent
failures.

### Solution
Fix the root cause by ensuring deterministic error message generation
through alphabetical sorting of option names.

Reviewers: ShivsundarR <shr@confluent.io>, Ken Huang
 <s7133700@gmail.com>, TengYao Chi <frankvicky@apache.org>
2025-07-23 14:40:18 +08:00
Ritika Reddy c7e4ff01cd
KAFKA-19272: Add initPid Response handling when keepPrepared is set to true (KIP-939) (#20039)
CI / build (push) Waiting to run Details
When initPid(keepPrepared = true) is called after a client crashes,
several situations should be considered.

When there's an ongoing transaction, we can transition it to the newly
added PREPARED_TRANSACTION state. However, what if there's no ongoing
transaction?

Another scenario could be:

- Issued a commit, to commit prepared
- The commit succeeded on the TC, but the client crashed
- Client restarted with keepPreparedTxn=true (because it doesn't know if
the commit succeeded or not and needs to keep retrying the commit until
it's successful)
- Issued a commit, but the transaction is not ongoing, because it's
committed

**Solution:**
This is a perfectly valid scenario as the external transaction
coordinator for the 2PC transaction will keep committing participants,
and the participants need to eventually return success (that's a
guarantee for a prepared transaction).
_Rejected Alt 1_ -> Return an InvalidTxnStateException : Returning an
error would break the above scenario.
_Rejected Alt 2_ -> Then the next thought is that we should somehow
validate if the state is expected, but we don't have data to validate
the result against.

**Final Solution:**  Just returning the success and transitioning to
READY is the proper handling of this condition.

Reviewers: Justine Olshan <jolshan@confluent.io>, Artem Livshits
 <alivshits@confluent.io>
2025-07-22 15:03:49 -07:00
Evanston Zhou 1cef5325ad
KAFKA-19213 Client ignores default properties (#20134)
https://issues.apache.org/jira/browse/KAFKA-19213

Fixes a bug where creating a producer/consumer using a `Properties`
object created using the `Properties(Properties defaults)` constructor
will ignore the default properties.

Reviewers: Kirk True <kirk@kirktrue.pro>, TaiJuWu <tjwu1217@gmail.com>,
 Jhen-Yung Hsu <jhenyunghsu@gmail.com>, Chia-Ping Tsai
 <chia7712@gmail.com>
2025-07-23 02:16:58 +08:00