Remove `isEligibleLeaderReplicasV1Enabled` so that ELR can be enabled
whenever MV is at least 4.1IV1. Also bump the latest production MV to
4.1IV1.
Reviewers: Paolo Patierno <ppatierno@live.com>, Jun Rao <junrao@gmail.com>
The e2e tests currently cover version 2.1.0 and above. Thus, we can
remove `force_use_zk_connection` in
`kafka_acls_cmd_with_optional_security_settings`.
In contrast, the `force_use_zk_connection` in
`kafka_topics_cmd_with_optional_security_settings` and
`kafka_configs_cmd_with_optional_security_settings` still needs to be
kept, as `kafka-topics.sh` does not support `--bootstrap-server` in 2.1
and 2.2.
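As a rough illustration of the version gate described above, the sketch below decides when a command helper still has to fall back to a ZooKeeper connection. The names `parse_version` and `needs_zk_connection` are hypothetical and are not taken from the test code; the only threshold encoded is the one stated here, that `kafka-topics.sh` lacks `--bootstrap-server` in 2.1 and 2.2.

```python
def parse_version(version):
    """Turn a dotted version string such as "2.2.1" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def needs_zk_connection(node_version):
    """Hypothetical helper: True when kafka-topics.sh on this node version
    cannot take --bootstrap-server (2.1 and 2.2, per the note above).

    Versions below 2.1 also return True, but they are outside the range
    the e2e tests cover.
    """
    major_minor = parse_version(node_version)[:2]
    return major_minor <= (2, 2)


for v in ("2.1.0", "2.2.2", "2.3.0", "4.0.0"):
    print(v, "zk" if needs_zk_connection(v) else "bootstrap-server")
```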
e2e test result:
```
===========================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.12.0
session_id: 2025-07-02--001
run time: 200 minutes 28.399 seconds
tests run: 90
passed: 90
flaky: 0
failed: 0
ignored: 0
===========================================
```
Reviewers: Ken Huang <s7133700@gmail.com>, TengYao Chi
<kitingiao@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
In this upgrade test, applications sometimes crash before the upgrade,
so it's actually triggering a bug in several older versions (2.x and
possibly others). It seems to be a rare race condition that has been
happening since 2022. Since we are not going to roll out a patch release
for Kafka Streams 2.x, we should just allow applications to crash before
the upgrade.
Reviewers: Matthias J. Sax <matthias@confluent.io>
`streams_broker_down_resilience_test` produces messages with a `null`
key to a topic with three partitions and expects each partition to be
non-empty afterward. But I don't think this is a correct assumption, as
a producer may try to be sticky and produce to only two partitions.
This causes occasional flakiness in the test.
The fix is to produce records with keys.
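The difference between the two cases can be sketched with a toy model. This is not the Kafka producer: `sticky_partitions` and `keyed_partitions` are hypothetical names, and `zlib.crc32` is only a stand-in for the real partitioner hash. The point is that a sticky choice per batch bounds the number of touched partitions by the number of batches, while key hashing does not depend on batching at all.

```python
import zlib

NUM_PARTITIONS = 3


def sticky_partitions(num_batches, choose):
    """Null-key case: one partition is picked per batch and reused for
    every record in that batch, so at most num_batches partitions fill."""
    return {choose(batch) % NUM_PARTITIONS for batch in range(num_batches)}


def keyed_partitions(keys):
    """Keyed case: the partition is a pure function of the key.
    crc32 is a stand-in hash here, not Kafka's actual partitioner hash."""
    return {zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS for key in keys}


# Two batches can touch at most two of the three partitions.
print(sticky_partitions(2, choose=lambda batch: batch))
# Many distinct keys spread over every partition.
print(keyed_partitions(str(i) for i in range(100)))
```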
Reviewers: Matthias J. Sax <matthias@confluent.io>, PoAn Yang
<payang@apache.org>
The test is resizing the `__consumer_offsets` topic after broker start.
This seems to be completely unsupported. The group coordinator fetches
the number of partitions for the consumer offsets topic once and never
updates it. So we can be in a state where two brokers have a different
understanding of how `__consumer_offsets` is partitioned.
The result in this test can be that two group coordinators both think
they own a certain group. The test is resizing `__consumer_offsets`
right after start-up from 3 to 50. Before the broker bounce, the GC
operates on only three partitions (0-2). During the bounce, we get new
brokers that operate on (0-49). This means that two brokers can both
think, at the same time, that they own a group.
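To see why two partition counts lead to two owners, here is a minimal sketch that assumes the coordinator maps a group to an offsets partition as abs(hashCode(groupId)) % partitionCount. That mapping is my understanding of the scheme rather than something stated in this message, and the helper names are hypothetical.

```python
def java_string_hash(s):
    """Java's String.hashCode() for ASCII input, as a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def partition_for(group_id, num_offsets_partitions):
    """Assumed mapping: abs(hashCode) % partition count, with the minimum
    int value treated as 0 so the result is never negative."""
    h = java_string_hash(group_id)
    magnitude = 0 if h == -(1 << 31) else abs(h)
    return magnitude % num_offsets_partitions


group = "a"  # hashCode("a") == 97
print(partition_for(group, 3))   # 1: view of a broker that saw 3 partitions
print(partition_for(group, 50))  # 47: view of a broker that saw 50 partitions
```

Under this assumption, the leader of partition 1 and the leader of partition 47 would each consider themselves the coordinator for the same group.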
Reviewers: Matthias J. Sax <matthias@confluent.io>
Fix flakiness in the verifiable producer system test. The test lists
running processes and greps for the VerifiableProducer one, but it did
not provide a specific pattern to grep, so it was flaky whenever more
than one process matched the default grep pattern "kafka".
The fix passes a "proc_grep_string" to filter when looking for the
VerifiableProducer process.
All tests pass successfully after the change.
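The effect of the narrower pattern can be shown with a small stand-alone sketch. The function `find_pids` and the process listing are made up for illustration and do not reproduce the ducktape service code.

```python
def find_pids(ps_output, proc_grep_string):
    """Return PIDs of `ps` lines containing proc_grep_string.

    Assumes each line starts with the PID followed by the command line.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and proc_grep_string in fields[1]:
            pids.append(int(fields[0]))
    return pids


# Made-up process listing for illustration only.
SAMPLE_PS = """\
101 java -cp /opt/kafka/libs/* kafka.Kafka server.properties
202 java -cp /opt/kafka/libs/* org.apache.kafka.tools.VerifiableProducer --topic t
"""

print(find_pids(SAMPLE_PS, "kafka"))               # matches both processes
print(find_pids(SAMPLE_PS, "VerifiableProducer"))  # matches only the producer
```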
Reviewers: PoAn Yang <payang@apache.org>, Andrew Schofield
<aschofield@confluent.io>
According to the current code in AK, the offset reset strategy for share
groups was set using the flag `--offset-reset-strategy` in the
share_consumer_test.py tests, but that meant the admin client call was
sent out by every member of the share group. This PR changes that by
introducing a `set_group_offset_reset_strategy` method in kafka.py,
which runs the kafka-configs.sh script in one of the existing docker
containers, thereby changing the config only once.
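A minimal sketch of what such a helper might assemble is shown below. The function name, the script path and the group name are hypothetical, and both the config key `share.auto.offset.reset` and the kafka-configs.sh flags are my assumptions about the CLI, so check them against your Kafka version before relying on them.

```python
def group_offset_reset_cmd(configs_script, bootstrap_servers, group, strategy):
    """Build a kafka-configs.sh invocation that alters one group config.

    The config key share.auto.offset.reset and the flags used here are
    assumptions about the CLI; check them against your Kafka version.
    """
    return " ".join([
        configs_script,
        "--bootstrap-server", bootstrap_servers,
        "--alter",
        "--entity-type", "groups",
        "--entity-name", group,
        "--add-config", "share.auto.offset.reset=%s" % strategy,
    ])


# Run once from a single node instead of once per share group member.
print(group_offset_reset_cmd(
    "/opt/kafka/bin/kafka-configs.sh", "localhost:9092", "share-group", "earliest"))
```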
Reviewers: Andrew Schofield <aschofield@confluent.io>
This PR adds system tests in share_consume_bench_test.py for testing the
trogdor agent for Share Consumers.
Reviewers: Lan Ding <53332773+DL1231@users.noreply.github.com>, Andrew
Schofield <aschofield@confluent.io>
This PR includes some performance system tests utilizing the
kafka-share-consumer-perf.sh tool for share groups.
Reviewers: Andrew Schofield <aschofield@confluent.io>
Currently some tests in StreamsBrokersBounceTest fail with the error
`The cluster does not support the STREAMS group protocol or does not
support the versions of the STREAMS group protocol used by this client
(used versions: 0 to 0).`
The reason is that in isolated KRaft mode, we failed to set both
`unstable.api.versions.enable` and `unstable.feature.versions.enable` to
true on all controllers. This causes `streams.version` to fall back to 0
on the broker side, and the above error is raised when a
StreamsGroupRequestHeartbeat reaches the broker.
This patch adds the missing configs to the controller properties if the
streams group protocol is used.
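The conditional described above could look roughly like this. The function name, the `use_streams_groups` flag and the base properties are hypothetical; only the two config keys come from this message.

```python
UNSTABLE_CONFIGS = {
    "unstable.api.versions.enable": "true",
    "unstable.feature.versions.enable": "true",
}


def controller_properties(base, use_streams_groups):
    """Return controller properties, adding the unstable-version switches
    only when the streams group protocol is in use."""
    props = dict(base)  # copy so the caller's dict is left untouched
    if use_streams_groups:
        props.update(UNSTABLE_CONFIGS)
    return props


base = {"process.roles": "controller"}
for key, value in sorted(controller_properties(base, True).items()):
    print("%s=%s" % (key, value))
```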
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
This PR includes system tests in the file share_group_command_test.py.
These tests exercise the functionality of the kafka-share-groups.sh tool.
Reviewers: Sushant Mahajan <smahajan@confluent.io>, Andrew Schofield
<aschofield@confluent.io>
Enable next system test with KIP-1071.
Some of the validation inside the test did not make sense for KIP-1071.
This is because in KIP-1071, if a member leaves or joins the group, not
all members may enter a REBALANCING state. We use the wrapper introduced
in [KAFKA-19271](https://issues.apache.org/jira/browse/KAFKA-19271)
to print a log line whenever the member epoch is bumped, which is the
only way a member can "indirectly" observe that other members are
rebalancing.
Reviewers: Bill Bejeck <bill@confluent.io>
This PR includes the system test file test_console_share_consumer.py,
which tests the functioning of ConsoleShareConsumer.
Reviewers: Andrew Schofield <aschofield@confluent.io>
Some tests in share_consumer_test had quorum.zk as the default value of
metadata_quorum. This PR changes that default to
quorum.isolated_kraft.
Reviewers: Andrew Schofield <aschofield@confluent.io>
As a result of KAFKA-18905 the reassign test will often have test
failures which are unrelated to the actual reassignment of partitions.
This failure is mentioned in KAFKA-9199.
Quote from KAFKA-9199: "This issue popped up in the reassignment system
test. It ultimately caused the test to fail because the producer was
stuck retrying the duplicate batch repeatedly until ultimately giving
up."
Disabling the idempotent producer circumvents this issue and allows the
reassignment system tests to succeed reliably. The reassignment test
still checks that produced batches were not lost.
Reviewers: José Armando García Sancio <jsancio@apache.org>
New system test for KIP-1071.
Standby replicas need to be enabled via `kafka-configs.sh`.
Reviewers: Bill Bejeck <bill@confluent.io>, Matthias J. Sax
<matthias@confluent.io>
* Currently in the share group heartbeat flow, if we see a TP subscribed
for the first time, we move that TP to the initializing state in the GC
and let the GC send a persister request to initialize the share group
state for that TP.
* However, if the coordinator runtime request for the share group
heartbeat times out (for example, due to a restarting or bad broker),
the future completes exceptionally and the persister request is never
sent.
* We are then in a bad state, since the TP is in the initializing state
in the GC but is not initialized in the persister. Future heartbeats for
the same share partitions do not help either, since we do not allow
retrying the persister request for initializing TPs.
* This PR remedies the situation by allowing the persister request to be
retried for TPs in the initializing state.
* A temporary fix that increased offset commit timeouts in system tests
was added earlier to work around the issue. This PR reverts that change
as well.
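The stuck state in the bullets above can be modelled with a few sets. This is a deliberately simplified sketch and not the coordinator code: the function name, its parameters and the `allow_retry` switch are all hypothetical.

```python
def partitions_to_initialize(subscribed, initialized, initializing, allow_retry):
    """Pick the topic-partitions (TPs) that need a persister initialize request.

    subscribed:   TPs the share group currently subscribes to
    initialized:  TPs already initialized in the persister
    initializing: TPs the GC marked as initializing on an earlier heartbeat
    allow_retry:  False models the old behaviour, True the behaviour after this PR
    """
    pending = set(subscribed) - set(initialized)
    if allow_retry:
        return pending
    # Old behaviour: TPs already in the initializing state are skipped,
    # so a lost persister request is never re-sent.
    return pending - set(initializing)


stuck = {("topic", 0)}
print(partitions_to_initialize(stuck, set(), stuck, allow_retry=False))  # set()
print(partitions_to_initialize(stuck, set(), stuck, allow_retry=True))
```

With retries disallowed, a TP whose first request was lost never shows up again; with retries allowed, the next heartbeat picks it up.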
Reviewers: Andrew Schofield <aschofield@confluent.io>
Enable next system test with KIP-1071.
Also fixes the other KIP-1071 system tests, which now require enabling
the unstable `streams.version` feature.
Reviewers: Bill Bejeck <bbejeck@apache.org>
This PR uses the v1 of the ShareVersion feature to enable share groups
for KIP-932.
Previously, there were two potential configs which could be used -
`group.share.enable=true` and including "share" in
`group.coordinator.rebalance.protocols`. After this PR, the first of
these is retained, but the second is not. Instead, the preferred switch
is the ShareVersion feature.
The `group.share.enable` config is temporarily retained for testing and
situations in which it is inconvenient to set the feature, but it should
really not be necessary, especially when we get to AK 4.2. The aim is to
remove this internal config at that point.
No tests should be setting `group.share.enable` any more, because they
can use the feature (which is enabled in test environments by default
because that's how features work). For tests which need to disable share
groups, they now set the share feature to v0. The majority of the code
changes were related to correct initialisation of the metadata cache in
tests now that a feature is used.
Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>
While investigating the failure of the system test
test_broker_failure, we found situations where writing records to the
`__consumer_offsets` topic took longer than 5 seconds (the default value
of offsets.commit.timeout.ms). Since the persister requests for share
partition initialization depend on the completion of the record commit,
the timeout meant that no persister requests were actually sent. This PR
increases the timeout for this config to 20 seconds as a temporary
solution. The fix is being tracked in JIRA:
https://issues.apache.org/jira/browse/KAFKA-19204
Reviewers: Andrew Schofield <aschofield@confluent.io>
Enables KIP-1071 (`group.protocol=streams`) in the first streams system
test `streams_smoke_test.py`.
All tests using KIP-1071 cannot use `KafkaTest` anymore, since we need
to customize the broker configuration. The corresponding functionality
is added to `BaseStreamsTest`, which all streams tests will have to
extend from now on.
There are some left-overs from ZK in the tests that I copied from
`KafkaTest`. They need to be cleaned up, but this should be done in a
separate PR.
The system test `ShareConsumerTest.test_share_multiple_partitions`
started failing because of the recent change in the SimpleAssignor
algorithm. The test assumed that if a share group is subscribed to a
topic, then every share consumer in the group is assigned all
partitions of the topic. That no longer holds: in certain cases the
partitions are split between the share consumers, so that some
partitions are assigned to only a subset of them. This change removes
that assumption.
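One plausible way to express the relaxed expectation is to check coverage across the group rather than per member. The helper names and the sample assignment below are hypothetical, and the actual test change may check something different.

```python
def all_partitions_covered(assignments, partitions):
    """True when every partition is assigned to at least one share consumer."""
    assigned = set()
    for member_partitions in assignments.values():
        assigned.update(member_partitions)
    return assigned == set(partitions)


def every_member_has_all(assignments, partitions):
    """The old, stricter assumption: each consumer holds every partition."""
    return all(set(p) == set(partitions) for p in assignments.values())


# Hypothetical split of three partitions between two share consumers.
assignments = {"consumer-1": [0, 1], "consumer-2": [1, 2]}
print(every_member_has_all(assignments, [0, 1, 2]))    # False
print(all_partitions_covered(assignments, [0, 1, 2]))  # True
```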
Reviewers: PoAn Yang <payang@apache.org>, Andrew Schofield <aschofield@confluent.io>
This PR removes the unstable API flag for the KIP-932 RPCs.
The 4 RPCs which were exposed for the early access release in AK 4.0 are
stabilised at v1. This is because the RPCs have evolved over time and AK
4.0 clients are not compatible with AK 4.1 brokers. By stabilising at
v1, the API version checks prevent incompatible communication and
server-side exceptions when trying to parse the requests from the older
clients.
Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>
Move LogCleaner and related classes to storage module and rewrite in
Java.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Jun Rao <junrao@gmail.com>
It is currently impossible to set both number of retries and idempotency
in the DucktapeVerifiable producer. This change allows that to occur.
Reviewers: PoAn Yang <payang@apache.org>, Manikumar Reddy <manikumar.reddy@gmail.com>
In an earlier PR, the flag --consumer.config in
VerifiableShareConsumer.java was changed to --command-config. This PR
makes the same change when VerifiableShareConsumer is used in
verifiable_share_consumer.py.
Reviewers: Andrew Schofield <aschofield@confluent.io>
This patch is the first of a series of patches to remove the old group
coordinator. With the release of Apache Kafka 4.0, the so-called new
group coordinator is the default and only option available now.
This patch updates the system tests to not run with the old group
coordinator. It also removes the ability to use the old group
coordinator.
Reviewers: Lianet Magrans <lmagrans@confluent.io>
This patch updates all the core system tests to include 4.0.0.
Reviewers: Andrew Schofield <aschofield@confluent.io>, Lianet Magrans
<lmagrans@confluent.io>
The upgrade test in question is not supported for AK 3.3.2 due to a
[known issue](https://issues.apache.org/jira/browse/KAFKA-18442).
A previous attempt at solving this left the `metadata.log.dir` empty,
which leads to the following crash log:
```
ERROR Exiting Kafka due to fatal exception (kafka.Kafka$)
org.apache.kafka.common.KafkaException: No `meta.properties` found in (have you run `kafka-storage.sh` to format the directory?)
at kafka.server.BrokerMetadataCheckpoint$.$anonfun$getBrokerMetadataAndOfflineDirs$2(BrokerMetadataCheckpoint.scala:172)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at kafka.server.BrokerMetadataCheckpoint$.getBrokerMetadataAndOfflineDirs(BrokerMetadataCheckpoint.scala:161)
at kafka.server.KafkaRaftServer$.initializeLogDirs(KafkaRaftServer.scala:184)
at kafka.server.KafkaRaftServer.<init>(KafkaRaftServer.scala:61)
at kafka.Kafka$.buildServer(Kafka.scala:79)
at kafka.Kafka$.main(Kafka.scala:87)
at kafka.Kafka.main(Kafka.scala)
```
In 3.1 we deprecated the eager rebalancing protocol and marked it for
removal in a later release. We aim to officially drop support and remove
the protocol from Streams in 4.0.
The effect of this PR is that it will no longer be possible to perform a
live upgrade of Kafka Streams directly to 4.0 from version 2.3 or below.
Users will have to go through a bridge release between 2.4 and 3.9
instead.
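The upgrade rule stated above reduces to a simple version comparison. The sketch below uses hypothetical helper names and only encodes the thresholds given in this message.

```python
def parse_version(version):
    """Reduce a version string such as "2.3.1" to a (major, minor) tuple."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def needs_bridge_release(from_version):
    """True when a live upgrade to 4.0 must first pass through 2.4 - 3.9."""
    return parse_version(from_version) <= (2, 3)


for v in ("2.3.1", "2.4.0", "3.9.0"):
    print(v, needs_bridge_release(v))
```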
Reviewers: Matthias J. Sax <matthias@confluent.io>
Reduce the minISR to 1 for the truncation test in order to skip the protection from KIP-966.
Reviewers: David Jacot <djacot@confluent.io>, Colin P. McCabe <cmccabe@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>
The root cause is
3dba3125e9,
which removed the metadata versions older than 3.3. As a result, this
test fails when it uses metadata version 3.2 or 3.1.
Reviewers: David Jacot <djacot@confluent.io>
The root cause is
3dba3125e9,
which removed the metadata versions older than 3.3. As a result, this
test fails when it uses metadata version 3.2 or 3.1.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Jacot <djacot@confluent.io>
This patch adds a test case to replication_test.py test_replication_with_broker_failure which validates the scenario when we have failures of a combined mode broker/controller.
Reviewers: David Arthur <mumrah@gmail.com>
This patch marks IBP_4_0_IV3 as production ready for the Apache Kafka 4.0 release. It also introduces IBP_4_1_IV0 as the next development version.
Reviewers: Justine Olshan <jolshan@confluent.io>
This patch cleans up the places that should not use MV to determine whether ELR is enabled, and marks 4.0IV1 as stable.
Reviewers: Alyssa Huang <ahuang@confluent.io>, Colin P. McCabe <cmccabe@apache.org>
The tests which set reassign_from_offset_zero=False have a setup phase which produces records with old timestamps to the topic and waits until they are cleaned by retention, so that the main phase of the test runs on non-zero offsets. The setup phase did not wait long enough for the cleaning task to kick in, mainly because the scheduled task had not started yet: log.initial.task.delay.ms defaults to 30s. Reducing it to 5s helps to stabilize the test. The patch also changes the sleep to 12s in order to have a bit more headroom.
```
================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.12.0
session_id: 2025-02-11--016
run time: 26 minutes 9.451 seconds
tests run: 12
passed: 12
flaky: 0
failed: 0
ignored: 0
================================================================================
```
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>