* Currently, in the share group heartbeat flow, when a topic partition
(TP) is subscribed for the first time, we move that TP to the
initializing state in the group coordinator (GC) and let the GC send a
persister request to initialize the share group state for that TP.
* However, if the coordinator runtime request for the share group
heartbeat times out (for example, due to a restarting or bad broker),
the future completes exceptionally and the persister request is never
sent.
* We are then in a bad state, since the TP is in the initializing state
in the GC but is not initialized in the persister. Future heartbeats for
the same share partitions do not help either, since we do not allow
retrying the persister request for TPs that are already initializing.
* This PR remedies the situation by allowing the persister request to be
retried for TPs in the initializing state.
* A temporary fix that increased the offset commit timeout in system
tests was added earlier to work around the issue. This PR reverts that
change as well.
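The retry behaviour described in the list above can be sketched roughly as follows. This is a simplified illustration, not the group coordinator's code: `ShareGroupState`, `heartbeat`, and the `send_persister_request` callback (which returns `True` when the request was sent and acknowledged) are hypothetical names introduced for this example.

```python
from typing import Callable, Iterable, Set, Tuple

TopicPartition = Tuple[str, int]


class ShareGroupState:
    """Hypothetical, simplified model of share partition initialization."""

    def __init__(self) -> None:
        self.initializing: Set[TopicPartition] = set()
        self.initialized: Set[TopicPartition] = set()

    def heartbeat(
        self,
        subscribed: Iterable[TopicPartition],
        send_persister_request: Callable[[TopicPartition], bool],
    ) -> None:
        for tp in subscribed:
            if tp in self.initialized:
                continue
            # First sighting moves the TP to initializing. A TP that is
            # already initializing is retried instead of being skipped.
            self.initializing.add(tp)
            if send_persister_request(tp):
                self.initializing.discard(tp)
                self.initialized.add(tp)
```

With a callback that fails on the first heartbeat and succeeds on the second, the TP ends up initialized rather than stuck in the initializing state.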
Reviewers: Andrew Schofield <aschofield@confluent.io>
Enable next system test with KIP-1071.
Also fixes the other KIP-1071 system tests, which now require enabling
the unstable `streams.version` feature.
Reviewers: Bill Bejeck <bbejeck@apache.org>
This PR uses v1 of the ShareVersion feature to enable share groups for
KIP-932.
Previously, there were two potential configs which could be used -
`group.share.enable=true` and including "share" in
`group.coordinator.rebalance.protocols`. After this PR, the first of
these is retained, but the second is not. Instead, the preferred switch
is the ShareVersion feature.
The `group.share.enable` config is temporarily retained for testing and
situations in which it is inconvenient to set the feature, but it should
really not be necessary, especially when we get to AK 4.2. The aim is to
remove this internal config at that point.
No tests should be setting `group.share.enable` any more, because they
can use the feature (which is enabled in test environments by default
because that's how features work). For tests which need to disable share
groups, they now set the share feature to v0. The majority of the code
changes were related to correct initialisation of the metadata cache in
tests now that a feature is used.
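As a rough sketch of the resulting switch logic: the function below assumes that share groups are on when either the ShareVersion feature level is at least 1 or the internal `group.share.enable` config is set. The function name and the "either one enables" combination are assumptions made for illustration and are not taken from the broker code.

```python
def share_groups_enabled(share_version: int, group_share_enable: bool = False) -> bool:
    """Hypothetical helper: decide whether share groups are enabled.

    share_version      -- finalized level of the ShareVersion feature (0 = off)
    group_share_enable -- value of the internal group.share.enable config
    """
    # Assumption: the feature is the preferred switch, and the retained
    # internal config acts as an override for tests and special cases.
    return share_version >= 1 or group_share_enable
```

A test that needs share groups disabled would correspond to `share_groups_enabled(0)` in this model.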
Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>
While investigating the failure of the system test
`test_broker_failure`, we found situations where writing records to the
`__consumer_offsets` topic took longer than 5 seconds (the default value
of `offsets.commit.timeout.ms`). Since the persister requests for share
partition initialization depend on the record commit completing, the
timeout meant that no persister requests were actually sent. This PR
increases the timeout for this config to 20 seconds as a temporary
solution. The proper fix is being tracked in the JIRA -
https://issues.apache.org/jira/browse/KAFKA-19204
Reviewers: Andrew Schofield <aschofield@confluent.io>
Enables KIP-1071 (`group.protocol=streams`) in the first streams system
test `streams_smoke_test.py`.
All tests using KIP-1071 cannot use `KafkaTest` anymore, since we need
to customize the broker configuration. The corresponding functionality
is added to `BaseStreamsTest`, which all streams tests will have to
extend from now on.
There are some left-overs from ZK in the tests that I copied from
`KafkaTest`. They need to be cleaned up, but this should be done in a
separate PR.
The system test `ShareConsumerTest.test_share_multiple_partitions`
started failing because of the recent change in the SimpleAssignor
algorithm. The test assumed that if a share group is subscribed to a
topic, then every share consumer in the group is assigned all
partitions of the topic. That no longer holds: in certain cases the
partitions are split between the share consumers, so some partitions
are assigned to only a subset of the share consumers. This change
removes that assumption.
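The difference between the old and the relaxed expectation can be sketched like this. Both helper names and the shape of the `assignments` mapping (consumer id to assigned partition numbers) are hypothetical and are not taken from the test itself.

```python
from typing import Dict, Set


def every_consumer_has_all(assignments: Dict[str, Set[int]], num_partitions: int) -> bool:
    """Old assumption: each share consumer is assigned every partition."""
    expected = set(range(num_partitions))
    return all(parts == expected for parts in assignments.values())


def partitions_are_covered(assignments: Dict[str, Set[int]], num_partitions: int) -> bool:
    """Relaxed check: every partition is assigned to at least one consumer."""
    covered: Set[int] = set()
    for parts in assignments.values():
        covered |= parts
    return covered == set(range(num_partitions))
```

For an assignment such as `{"c1": {0, 1}, "c2": {1, 2}}` over three partitions, the old check fails while the relaxed one passes.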
Reviewers: PoAn Yang <payang@apache.org>, Andrew Schofield <aschofield@confluent.io>
This PR removes the unstable API flag for the KIP-932 RPCs.
The 4 RPCs which were exposed for the early access release in AK 4.0 are
stabilised at v1. This is because the RPCs have evolved over time and AK
4.0 clients are not compatible with AK 4.1 brokers. By stabilising at
v1, the API version checks prevent incompatible communication and
server-side exceptions when trying to parse the requests from the older
clients.
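To illustrate why stabilising at v1 blocks incompatible clients, here is a minimal sketch of a version-range check. The function name and the concrete ranges used in the usage note (an early access client speaking only v0, a broker accepting only v1) are assumptions for illustration, not values read from the RPC definitions.

```python
from typing import Optional, Tuple

VersionRange = Tuple[int, int]  # (min_version, max_version), inclusive


def highest_common_version(client: VersionRange, broker: VersionRange) -> Optional[int]:
    """Return the highest version both sides support, or None if none exists."""
    lowest = max(client[0], broker[0])
    highest = min(client[1], broker[1])
    return highest if lowest <= highest else None
```

Under these assumptions, `highest_common_version((0, 0), (1, 1))` yields `None`, so the request is rejected by the version check instead of failing while the broker parses it.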
Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>
Move LogCleaner and related classes to storage module and rewrite in
Java.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Jun Rao <junrao@gmail.com>
It is currently impossible to set both number of retries and idempotency
in the DucktapeVerifiable producer. This change allows that to occur.
Reviewers: PoAn Yang <payang@apache.org>, Manikumar Reddy <manikumar.reddy@gmail.com>
In an earlier PR, the flag --consumer.config in
VerifiableShareConsumer.java was changed to --command-config. This PR
makes the same change when VerifiableShareConsumer is used in
verifiable_share_consumer.py.
Reviewers: Andrew Schofield <aschofield@confluent.io>
This patch is the first of a series of patches to remove the old group
coordinator. With the release of Apache Kafka 4.0, the so-called new
group coordinator is the default and only option available now.
This patch updates the system tests to not run with the old group
coordinator. It also removes the ability to use the old group
coordinator.
Reviewers: Lianet Magrans <lmagrans@confluent.io>
This patch updates all the core system tests to include 4.0.0.
Reviewers: Andrew Schofield <aschofield@confluent.io>, Lianet Magrans <lmagrans@confluent.io>
The upgrade test in question is not supported for AK 3.3.2 due to a
[known issue](https://issues.apache.org/jira/browse/KAFKA-18442).
A previous attempt at solving this left the `metadata.log.dir` empty,
which leads to the following crash log:
```
ERROR Exiting Kafka due to fatal exception (kafka.Kafka$)
org.apache.kafka.common.KafkaException: No `meta.properties` found in (have you run `kafka-storage.sh` to format the directory?)
at kafka.server.BrokerMetadataCheckpoint$.$anonfun$getBrokerMetadataAndOfflineDirs$2(BrokerMetadataCheckpoint.scala:172)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at kafka.server.BrokerMetadataCheckpoint$.getBrokerMetadataAndOfflineDirs(BrokerMetadataCheckpoint.scala:161)
at kafka.server.KafkaRaftServer$.initializeLogDirs(KafkaRaftServer.scala:184)
at kafka.server.KafkaRaftServer.<init>(KafkaRaftServer.scala:61)
at kafka.Kafka$.buildServer(Kafka.scala:79)
at kafka.Kafka$.main(Kafka.scala:87)
at kafka.Kafka.main(Kafka.scala)
```
In 3.1 we deprecated the eager rebalancing protocol and marked it for
removal in a later release. We aim to officially drop support and remove
the protocol from Streams in 4.0.
The effect of this PR is that it will no longer be possible to perform a
live upgrade of Kafka Streams directly to 4.0 from version 2.3 or below.
Users will have to go through a bridge release between 2.4 and 3.9
instead.
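The upgrade rule above can be expressed as a small check. The helper name is hypothetical, and it only looks at the major and minor components of the version string.

```python
def needs_bridge_release(from_version: str) -> bool:
    """Return True if a live upgrade to 4.0 must go through a 2.4 - 3.9 bridge.

    Hypothetical helper: versions 2.3 and below cannot upgrade directly.
    """
    major, minor = (int(part) for part in from_version.split(".")[:2])
    return (major, minor) <= (2, 3)
```

For example, `needs_bridge_release("2.3.1")` is true, while `needs_bridge_release("2.4.0")` is false.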
Reviewers: Matthias J. Sax <matthias@confluent.io>
Reduce the minISR to be 1 for the truncation test in order to skip the protection from KIP-966
Reviewers: David Jacot <djacot@confluent.io>, Colin P. McCabe <cmccabe@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>
The root cause is 3dba3125e9, which removed the metadata versions older
than 3.3, so this test fails when it uses metadata version 3.2 or 3.1.
Reviewers: David Jacot <djacot@confluent.io>
The root cause is 3dba3125e9, which removed the metadata versions older
than 3.3, so this test fails when it uses metadata version 3.2 or 3.1.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Jacot <djacot@confluent.io>
This patch adds a test case to replication_test.py test_replication_with_broker_failure which validates the scenario when we have failures of a combined mode broker/controller.
Reviewers: David Arthur <mumrah@gmail.com>
This patch marks IBP_4_0_IV3 as production ready for the Apache Kafka 4.0 release. It also introduces IBP_4_1_IV0 as the next development version.
Reviewers: Justine Olshan <jolshan@confluent.io>
This patch cleans up the places that should not use the MV to determine whether ELR is enabled, and marks 4.0IV1 stable.
Reviewers: Alyssa Huang <ahuang@confluent.io>, Colin P. McCabe <cmccabe@apache.org>
The tests which set reassign_from_offset_zero=False have a setup phase which produces records with old timestamps to the topic and waits until they are cleaned by retention, so that the main phase of the test runs on non-zero offsets. The setup phase did not wait long enough for the cleaning task to kick in, mainly because the scheduled task had not started yet due to log.initial.task.delay.ms being set to 30s by default. Reducing it to 5s helps to stabilize the test. The patch also changes the sleep to 12s in order to have a bit more headroom.
```
================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.12.0
session_id: 2025-02-11--016
run time: 26 minutes 9.451 seconds
tests run: 12
passed: 12
flaky: 0
failed: 0
ignored: 0
================================================================================
```
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
This patch renames kraft_upgrade_test.py to upgrade_test.py. This is enough to cover the old upgrade/downgrade tests.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Once a StreamThread receives its assignment, it closes the startup tasks. During the closing process, however, the StandbyTask.closeClean() method eventually calls the StateManagerUtil.closeStateManager method, which needs to lock the state directory, and locking requires the calling thread to be the current owner. Since the main thread grabs the lock on startup but moves on without releasing it, we need to update ownership explicitly here so that the stream thread can close the startup task and begin processing.
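The ownership problem can be modelled with a toy owner-checked lock. This sketch is not the Streams state directory implementation: the `DirectoryLock` class, its `transfer` method, and the string thread names are all hypothetical.

```python
from typing import Optional


class DirectoryLock:
    """Toy model of a lock that only its current owner may re-acquire."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def lock(self, caller: str) -> bool:
        # An unowned lock is granted to the caller; an owned lock is only
        # granted again to the thread that already owns it.
        if self.owner is None:
            self.owner = caller
        return self.owner == caller

    def transfer(self, new_owner: str) -> None:
        # Explicit hand-over, standing in for the ownership update that
        # lets the stream thread take over a lock the main thread holds.
        if self.owner is not None:
            self.owner = new_owner
```

In this model, a stream thread calling `lock` fails while the main thread still owns the lock, and succeeds once ownership has been transferred to it.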
Reviewers: Matthias Sax <mjsax@apache.org>, Nick Telford
A prior commit introduced a check of a node's version as part of the move to log4j2, but it caused the error
AttributeError("'ClusterNode' object has no attribute 'version'"). This PR uses the get_version method from version.py, which checks whether the node has a version attribute, preventing the error.
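The defensive lookup can be sketched as below. This is a hypothetical re-creation of the idea, not the code in version.py: the fallback value `DEFAULT_VERSION` and the two stand-in node classes are assumptions made for this example.

```python
DEFAULT_VERSION = "dev"  # assumed fallback, for illustration only


def get_version(node, default=DEFAULT_VERSION):
    """Return node.version if the attribute exists, else a default."""
    # getattr with a default avoids the AttributeError raised by
    # accessing node.version directly on a node that lacks it.
    return getattr(node, "version", default)


class PlainNode:
    """Stand-in for a node object without a version attribute."""


class VersionedNode:
    """Stand-in for a node object that carries a version attribute."""

    version = "4.0.0"
```

Calling `get_version(PlainNode())` returns the fallback rather than raising, while `get_version(VersionedNode())` returns the node's own version.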
Reviewers: Matthias Sax <mjsax@apache.org>
Due to an issue with handling folders in Kafka version 3.3.2 (see https://github.com/apache/kafka/pull/13130), this end-to-end test requires using a single folder for upgrade/downgrade scenarios involving 3.3.2.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>