We weren't enabling discoverBrokerVersions to check the supported versions in the AddPartitionsToTxnManager. This means that any verification request (or any AddPartitionsToTxnRequest version) from a newer broker would fail when sending to an older broker.
The bulk of this change is adding additional transactions system tests for old versions.
One test upgrades the cluster completely. This didn't catch the issue but could be useful.
The other test forces a new broker to send a verification request to an older one. Without the discoverBrokerVersions change, all tests between mixed brokers failed. (We introduced a new request version in 3.8, which is separate from the version that caused the bug for 3.5 -> 3.6.) With the change, the tests all passed.
I also manually ran a test for 3.5 -> 3.6 since the issue there was slightly different and was caused by the unstableLatestVersion flag being enabled. This change should fix this as well. 👍
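To illustrate why version discovery matters here, the following is a minimal Python sketch, not the Kafka client code. The function name `negotiate_version` and the version ranges are hypothetical; the sketch only assumes that each side supports a contiguous range of request versions.

```python
from typing import Optional, Tuple


def negotiate_version(client_range: Tuple[int, int],
                      broker_range: Optional[Tuple[int, int]]) -> int:
    """Pick the highest request version that both sides support.

    broker_range is None when the sender never discovered the receiver's
    supported versions. In that case the sender falls back to its own latest
    version, which is the behaviour that breaks against an older broker.
    """
    client_min, client_max = client_range
    if broker_range is None:
        # No discovery: blindly use the sender's newest version.
        return client_max
    broker_min, broker_max = broker_range
    usable = min(client_max, broker_max)
    if usable < max(client_min, broker_min):
        raise ValueError("no request version supported by both sides")
    return usable
```

With discovery, a sender that supports versions 0 to 5 (hypothetical numbers) talking to a receiver that supports 0 to 4 settles on a version the receiver understands; without discovery it sends its latest version and the request fails.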
Reviewers: David Jacot <djacot@confluent.io>
Enables the log directory failure system test for all KRaft modes in addition to ZK mode.
Reviewers: Luke Chen <showuon@gmail.com>, Igor Soarez <soarez@apple.com>, Proven Provenzano <pprovenzano@confluent.io>
https://issues.apache.org/jira/browse/KAFKA-14505 is not done yet, so we need to disable the system test. Added a comment in the Jira to re-enable the test once it's implemented.
Reviewers: Justine Olshan <jolshan@confluent.io>
This patch converts a few more system tests to use the new group coordinator. This is only applied to KRaft clusters.
Reviewers: David Jacot <djacot@confluent.io>
The latest metadata version is now 3.7. Fix the KRaft upgrade
test to upgrade to that version instead of 3.6.
Change the vagrant setup and gradle dependencies to use 3.3.2 instead of 3.3.1.
Reviewers: David Arthur <mumrah@gmail.com>
This field was missed by the initial KIP-919 PR(s). The result is that migrations can't begin since
the controllers will never become ready. This patch fixes that and also pulls over some fixes
from the 3.6 branch.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch adds configs to facilitate testing with the new group coordinator (KIP-848) in KRaft mode. Only one test file is converted at the moment. The others will follow.
Reviewers: Ian McDonald <imcdonald@confluent.io>, David Jacot <djacot@confluent.io>
Fixing bad test setup. We tried to fix an upgrade bug for FK-joins in the 3.1 release, but it later turned out that the PR was not sufficient to fix it. We finally fixed it in the 3.4 release.
This PR updates the system test matrix to only test working versions with FK-joins, limited to available test versions.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Hao Li <hli@confluent.io>, Mickael Maison <mickael.maison@gmail.com>
This PR contains three main changes:
- Support for transactions in MetadataLoader
- Abort in-progress transaction during controller failover
- Utilize transactions for ZK to KRaft migration
A new MetadataBatchLoader class is added to decouple the loading of record batches from the
publishing of metadata in MetadataLoader. Since a transaction can span across multiple batches (or
multiple transactions could exist within one batch), some buffering of metadata updates was needed
before publishing out to the MetadataPublishers. MetadataBatchLoader accumulates changes into a
MetadataDelta, and uses a callback to publish to the publishers when needed.
One small oddity with this approach is that since we can "split" batches in some cases, the
number of bytes returned in the LogDeltaManifest has new semantics. The number of bytes included in
a batch is now only included in the last metadata update that is published as a result of a batch.
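The buffering described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the MetadataBatchLoader implementation: records are plain strings, the marker names `BEGIN`, `END`, and `ABORT` and the class `BatchLoaderSketch` are hypothetical, and carrying forward the bytes of batches that publish nothing is an assumption of the sketch.

```python
from typing import Callable, List

# Hypothetical transaction marker records.
BEGIN, END, ABORT = "BEGIN_TXN", "END_TXN", "ABORT_TXN"


class BatchLoaderSketch:
    """Buffers records and publishes only at transaction boundaries."""

    def __init__(self, publish: Callable[[List[str], int], None]):
        self._publish = publish
        self._pending: List[str] = []
        self._in_txn = False
        self._carried_bytes = 0  # assumption: bytes of batches not yet published

    def load_batch(self, records: List[str], num_bytes: int) -> None:
        updates: List[List[str]] = []
        for record in records:
            if record == BEGIN:
                # Flush non-transactional records seen before the transaction.
                if self._pending and not self._in_txn:
                    updates.append(self._pending)
                    self._pending = []
                self._in_txn = True
            elif record == END:
                self._in_txn = False
                updates.append(self._pending)
                self._pending = []
            elif record == ABORT:
                # Drop everything buffered for the aborted transaction.
                self._in_txn = False
                self._pending = []
            else:
                self._pending.append(record)
        # Outside a transaction, the batch boundary is also a publish point.
        if not self._in_txn and self._pending:
            updates.append(self._pending)
            self._pending = []
        self._carried_bytes += num_bytes
        for i, update in enumerate(updates):
            is_last = i == len(updates) - 1
            # Only the last update produced by this batch reports the bytes.
            self._publish(update, self._carried_bytes if is_last else 0)
        if updates:
            self._carried_bytes = 0
```

A batch that ends one transaction and also contains trailing non-transactional records yields two published updates; only the second one reports a non-zero byte count.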
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Kafka system tests with Java version 17 are failing on this issue:
```python
TimeoutError("MiniKdc didn't finish startup",)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/ducktape/tests/runner_client.py", line 186, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.6/site-packages/ducktape/tests/runner_client.py", line 246, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.6/site-packages/ducktape/mark/_mark.py", line 433, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/opt/kafka-dev/tests/kafkatest/sanity_checks/test_verifiable_producer.py", line 74, in test_simple_run
    self.kafka.start()
  File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 635, in start
    self.start_minikdc_if_necessary(add_principals)
  File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 596, in start_minikdc_if_necessary
    self.minikdc.start()
  File "/usr/local/lib/python3.6/site-packages/ducktape/services/service.py", line 265, in start
    self.start_node(node, **kwargs)
  File "/opt/kafka-dev/tests/kafkatest/services/security/minikdc.py", line 114, in start_node
    monitor.wait_until("MiniKdc Running", timeout_sec=60, backoff_sec=1, err_msg="MiniKdc didn't finish startup")
  File "/usr/local/lib/python3.6/site-packages/ducktape/cluster/remoteaccount.py", line 754, in wait_until
    allow_fail=True) == 0, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: MiniKdc didn't finish startup
```
Specifically, running the test cases and inspecting the MiniKdc logs shows:
```java
Exception in thread "main" java.lang.IllegalAccessException: class kafka.security.minikdc.MiniKdc cannot access class sun.security.krb5.Config (in module java.security.jgss) because module java.security.jgss does not export sun.security.krb5 to unnamed module @24959ca4
    at java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
    at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
    at java.base/java.lang.reflect.Method.invoke(Method.java:560)
    at kafka.security.minikdc.MiniKdc.refreshJvmKerberosConfig(MiniKdc.scala:268)
    at kafka.security.minikdc.MiniKdc.initJvmKerberosConfig(MiniKdc.scala:245)
    at kafka.security.minikdc.MiniKdc.start(MiniKdc.scala:123)
    at kafka.security.minikdc.MiniKdc$.start(MiniKdc.scala:375)
    at kafka.security.minikdc.MiniKdc$.main(MiniKdc.scala:366)
    at kafka.security.minikdc.MiniKdc.main(MiniKdc.scala)
```
This error occurs because the internal sun.security.krb5 package is strongly encapsulated by default in Java 16 and higher, so code in the unnamed module can no longer access it reflectively. See [1].
There are two ways to solve it, and this PR implements one of them. The other way is to export the ENV variable when deploying the containers with Ducktape in [2].
[1] - https://openjdk.org/jeps/396
[2] - https://github.com/apache/kafka/blob/trunk/tests/docker/ducker-ak#L308
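For illustration only, one common way to work around this kind of encapsulation error is the standard JVM `--add-exports` option. The Python sketch below builds that option conditionally; the helper name `minikdc_jvm_flags` is hypothetical, and this message does not state that the fix is implemented this way.

```python
from typing import List


def minikdc_jvm_flags(java_major_version: int) -> List[str]:
    """Return extra JVM flags needed to start MiniKdc (hypothetical helper).

    From Java 16 on, sun.security.krb5 is not accessible to the unnamed
    module unless the package is exported explicitly.
    """
    if java_major_version >= 16:
        return ["--add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED"]
    # Older JVMs permit the reflective access without extra flags.
    return []
```

A caller would append the returned flags to the command line used to launch the MiniKdc JVM.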
Reviewers: Ismael Juma <ismael@juma.me.uk>, Luke Chen <showuon@gmail.com>
This patch adds snapshot reconciliation during ZK to KRaft migration. This reconciliation happens whenever a snapshot is loaded by KRaft, or during a controller failover. Prior to this patch, it was possible to miss metadata updates coming from KRaft when dual-writing to ZK.
Internally this adds a new state SYNC_KRAFT_TO_ZK to the KRaftMigrationDriver state machine. The controller passes through this state after the initial ZK migration and each time a controller becomes active.
Logging during dual-write was enhanced to include a count of write operations happening.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch adds support for handling metadata snapshots while in dual-write mode. Prior to this change, if the active
controller loaded a snapshot, it would get out of sync with the ZK state.
In order to reconcile the snapshot state with ZK, several methods were added to scan through the metadata in ZK to
compute differences with the MetadataImage. Since this introduced a lot of code, I opted to split out a lot of methods
from ZkMigrationClient into their own client interfaces, such as TopicMigrationClient, ConfigMigrationClient, and
AclMigrationClient. Each of these has an iterator method that lets the caller examine the ZK state in a single pass
without using too much memory.
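The single-pass comparison can be sketched in Python as below. This is a simplified model, not the migration client code: the ZK state is an iterable of hypothetical `(name, value)` pairs, the KRaft image is a plain dict, and the function name `reconcile` is made up for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple


def reconcile(zk_entries: Iterable[Tuple[str, str]],
              image: Dict[str, str]
              ) -> Tuple[Dict[str, str], Dict[str, str], List[str]]:
    """Compute the writes needed to make ZK match the KRaft image.

    zk_entries is consumed exactly once, so it can be a lazy iterator and
    only the names seen so far are held in memory.
    """
    seen: Set[str] = set()
    to_update: Dict[str, str] = {}
    to_delete: List[str] = []
    for name, zk_value in zk_entries:
        seen.add(name)
        if name not in image:
            # Present in ZK but absent from the image: remove from ZK.
            to_delete.append(name)
        elif image[name] != zk_value:
            # Present in both but different: the image wins.
            to_update[name] = image[name]
    # Anything in the image that ZK never yielded must be created.
    to_create = {n: v for n, v in image.items() if n not in seen}
    return to_create, to_update, to_delete
```

The image is treated as the source of truth, which matches the direction of the reconciliation described above (KRaft state written back to ZK).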
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Luke Chen <showuon@gmail.com>
This patch adds the concept of pre-migration mode to the KRaft controller. While in this mode,
the controller will only allow certain write operations. The purpose of this is to disallow metadata
changes when the controller is waiting for the ZK migration records to be committed.
The following ControllerWriteEvent operations are permitted in pre-migration mode:
* completeActivation
* maybeFenceReplicas
* writeNoOpRecord
* processBrokerHeartbeat
* registerBroker (only for migrating ZK brokers)
* unregisterBroker
Raft events and other controller events do not follow the same code path as ControllerWriteEvent,
so they are not affected by this new behavior.
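The gating described above amounts to an allowlist check. The Python sketch below models it with the operation names from the list; the function `is_write_permitted` and its parameters are hypothetical and do not mirror the controller's actual API.

```python
# Operations always allowed while the controller is in pre-migration mode.
PRE_MIGRATION_ALLOWED = frozenset({
    "completeActivation",
    "maybeFenceReplicas",
    "writeNoOpRecord",
    "processBrokerHeartbeat",
    "unregisterBroker",
})


def is_write_permitted(op_name: str,
                       in_pre_migration: bool,
                       is_migrating_zk_broker: bool = False) -> bool:
    """Decide whether a write operation may run (illustrative only)."""
    if not in_pre_migration:
        # Outside pre-migration mode nothing is restricted by this check.
        return True
    if op_name == "registerBroker":
        # Registration is allowed only for migrating ZK brokers.
        return is_migrating_zk_broker
    return op_name in PRE_MIGRATION_ALLOWED
```

Any operation name not in the allowlist, such as a hypothetical `createTopics`, would be rejected while the flag is set.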
This patch also adds a new metric as defined in KIP-868: kafka.controller:type=KafkaController,name=ZkMigrationState
In order to support upgrades from 3.4.0, this patch also redefines enum value 1 to mean
MIGRATION rather than PRE_MIGRATION.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Colin P. McCabe <cmccabe@apache.org>