Kafka system tests with Java version 17 are failing on this issue:
```python
TimeoutError("MiniKdc didn't finish startup",)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/ducktape/tests/runner_client.py", line 186, in _do_run
data = self.run_test()
File "/usr/local/lib/python3.6/site-packages/ducktape/tests/runner_client.py", line 246, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.6/site-packages/ducktape/mark/_mark.py", line 433, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/opt/kafka-dev/tests/kafkatest/sanity_checks/test_verifiable_producer.py", line 74, in test_simple_run
self.kafka.start()
File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 635, in start
self.start_minikdc_if_necessary(add_principals)
File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 596, in start_minikdc_if_necessary
self.minikdc.start()
File "/usr/local/lib/python3.6/site-packages/ducktape/services/service.py", line 265, in start
self.start_node(node, **kwargs)
File "/opt/kafka-dev/tests/kafkatest/services/security/minikdc.py", line 114, in start_node
monitor.wait_until("MiniKdc Running", timeout_sec=60, backoff_sec=1, err_msg="MiniKdc didn't finish startup")
File "/usr/local/lib/python3.6/site-packages/ducktape/cluster/remoteaccount.py", line 754, in wait_until
allow_fail=True) == 0, **kwargs)
File "/usr/local/lib/python3.6/site-packages/ducktape/utils/util.py", line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: MiniKdc didn't finish startup
```
Specifically, when one runs the test cases and looks at the logs of the MiniKdc:
```java
Exception in thread "main" java.lang.IllegalAccessException: class kafka.security.minikdc.MiniKdc cannot access class sun.security.krb5.Config (in module java.security.jgss) because module java.security.jgss does not export sun.security.krb5 to unnamed module @24959ca4
at java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
at java.base/java.lang.reflect.Method.invoke(Method.java:560)
at kafka.security.minikdc.MiniKdc.refreshJvmKerberosConfig(MiniKdc.scala:268)
at kafka.security.minikdc.MiniKdc.initJvmKerberosConfig(MiniKdc.scala:245)
at kafka.security.minikdc.MiniKdc.start(MiniKdc.scala:123)
at kafka.security.minikdc.MiniKdc$.start(MiniKdc.scala:375)
at kafka.security.minikdc.MiniKdc$.main(MiniKdc.scala:366)
at kafka.security.minikdc.MiniKdc.main(MiniKdc.scala)
```
This error is caused by the fact that sun.security module is no longer supported in Java 16 and higher. Related to the [1].
There are two ways how to solve it, and I present one of them. The second way is to export the ENV variable during the deployment of the containers using Ducktape in [2].
[1] - https://openjdk.org/jeps/396
[2] - https://github.com/apache/kafka/blob/trunk/tests/docker/ducker-ak#L308
Reviewers: Ismael Juma <ismael@juma.me.uk>, Luke Chen <showuon@gmail.com>
Update "requests" lib used in system tests to version "2.31.0" to fix CVE-2023-32681: Unintended leak of Proxy-Authorization header in requests
Reviewers: Divij Vaidya <diviv@amazon.com>
This patch adds snapshot reconciliation during ZK to KRaft migration. This reconciliation happens whenever a snapshot is loaded by KRaft, or during a controller failover. Prior to this patch, it was possible to miss metadata updates coming from KRaft when dual-writing to ZK.
Internally this adds a new state SYNC_KRAFT_TO_ZK to the KRaftMigrationDriver state machine. The controller passes through this state after the initial ZK migration and each time a controller becomes active.
Logging during dual-write was enhanced to include a count of write operations happening.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch adds support for handling metadata snapshots while in dual-write mode. Prior to this change, if the active
controller loaded a snapshot, it would get out of sync with the ZK state.
In order to reconcile the snapshot state with ZK, several methods were added to scan through the metadata in ZK to
compute differences with the MetadataImage. Since this introduced a lot of code, I opted to split out a lot of methods
from ZkMigrationClient into their own client interfaces, such as TopicMigrationClient, ConfigMigrationClient, and
AclMigrationClient. Each of these has some iterator method that lets the caller examine the ZK state in a single pass
and without using too much memory.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Luke Chen <showuon@gmail.com>
This patch adds the concept of pre-migration mode to the KRaft controller. While in this mode,
the controller will only allow certain write operations. The purpose of this is to disallow metadata
changes when the controller is waiting for the ZK migration records to be committed.
The following ControllerWriteEvent operations are permitted in pre-migration mode
* completeActivation
* maybeFenceReplicas
* writeNoOpRecord
* processBrokerHeartbeat
* registerBroker (only for migrating ZK brokers)
* unregisterBroker
Raft events and other controller events do not follow the same code path as ControllerWriteEvent,
so they are not affected by this new behavior.
This patch also add a new metric as defined in KIP-868: kafka.controller:type=KafkaController,name=ZkMigrationState
In order to support upgrades from 3.4.0, this patch also redefines the enum value of value 1 to mean
MIGRATION rather than PRE_MIGRATION.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
Some system tests from kafkatest.tests.core.network_degrade_test are failing due to missing utility iputils-ping.
[DEBUG - 2023-02-04 01:34:56,322 - network_degrade_test - test_latency -
lineno:67]: Ping output: bash: line 1: ping: command not found
Reviewers: Luke Chen <showuon@gmail.com>
Similar to https://github.com/apache/kafka/pull/13439:
ddd652c standardized on "isolated" as the name for all the isolated
modes, and renamed remote_controller_quorum to
kafkatest.services.kafka.quorum.remote_kraft to
isolated_controller_quorum. This broke
SecurityTest.test_quorum_ssl_endpoint_validation_failure, which should
be fixed by this simple rename.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
ddd652c6 standardized on "isolated" as the name for all the isolated
modes, and renamed kafkatest.services.kafka.quorum.remote_kraft to
isolated_kraft. However, the tests using remote_kraft weren't
updated, and are broken as a result. This is a simple search and
replace to fix those.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Standardize KRaft thread names.
- Always use kebab case. That is, "my-thread-name".
- Thread prefixes are just strings, not Option[String] or Optional<String>.
If you don't want a prefix, use the empty string.
- Thread prefixes end in a dash (except the empty prefix). Then you can
calculate thread names as $prefix + "my-thread-name"
- Broker-only components get "broker-$id-" as a thread name prefix. For example, "broker-1-"
- Controller-only components get "controller-$id-" as a thread name prefix. For example, "controller-1-"
- Shared components get "kafka-$id-" as a thread name prefix. For example, "kafka-0-"
- Always pass a prefix to KafkaEventQueue, so that threads have names like
"broker-0-metadata-loader-event-handler" rather than "event-handler". Prior to this PR, we had
several threads just named "EventHandler" which was not helpful for debugging.
- QuorumController thread name is "quorum-controller-123-event-handler"
- Don't set a thread prefix for replication threads started by ReplicaManager. They run only on the
broker, and already include the broker ID.
Standardize KRaft slf4j log prefixes.
- Names should be of the form "[ComponentName id=$id] ". So for a ControllerServer with ID 123, we
will have "[ControllerServer id=123] "
- For the QuorumController class, use the prefix "[QuorumController id=$id] " rather than
"[Controller <nodeId] ", to make it clearer that this is a KRaft controller.
- In BrokerLifecycleManager, add isZkBroker=true to the log prefix for the migration case.
Standardize KRaft terminology.
- All synonyms of combined mode (colocated, coresident, etc.) should be replaced by "combined"
- All synonyms of isolated mode (remote, non-colocated, distributed, etc.) should be replaced by
"isolated".
Implement KIP-599 controller mutation quotas for the KRaft controller. These quotas apply to create
topics, create partitions, and delete topic operations. They are specified in terms of number of
partitions.
The approach taken here is to reuse the ControllerMutationQuotaManager that is also used in ZK
mode. The quotas are implemented as Sensor objects and Sensor.checkQuotas enforces the quota,
whereas Sensor.record notes that new partitions have been modified. While ControllerApis handles
fetching the Sensor objects, we must make the final callback to check the quotas from within
QuorumController. The reason is because only QuorumController knows the final number of partitions
that must be modified. (As one example, up-to-date information about the number of partitions that
will be deleted when a topic is deleted is really only available in QuorumController.)
For quota enforcement, the logic is already in place. The KRaft controller is expected to set the
throttle time in the response that is embedded in EnvelopeResponse, but it does not actually apply
the throttle because there is no client connection to throttle. Instead, the broker that forwarded
the request is expected to return the throttle value from the controller and to throttle the client
connection. It also applies its own request quota, so the enforced/returned quota is the maximum of
the two.
This PR also installs a DynamicConfigPublisher in ControllerServer. This allows dynamic
configurations to be published on the controller. Previously, they could be set, but they were not
applied. Note that we still don't have a good way to set node-level configurations for isolatied
controllers. However, this will allow us to set cluster configs (aka default node configs) and have
them take effect on the controllers.
In a similar vein, this PR separates out the dynamic client quota publisher logic used on the
broker into DynamicClientQuotaPublisher. We can now install this on both BrokerServer and
ControllerServer. This makes dynamically configuring quotas (such as controller mutation quotas)
possible.
Also add a ducktape test, controller_mutation_quota_test.py.
Reviewers: David Jacot <djacot@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Colin P. McCabe <cmccabe@apache.org>
This patch introduces a preliminary state machine that can be used by KRaft
controller to drive online migration from Zk to KRaft.
MigrationState -- Defines the states we can have while migration from Zk to
KRaft.
KRaftMigrationDriver -- Defines the state transitions, and events to handle
actions like controller change, metadata change, broker change and have
interfaces through which it claims Zk controllership, performs zk writes and
sends RPCs to ZkBrokers.
MigrationClient -- Interface that defines the functions used to claim and
relinquish Zk controllership, read to and write from Zk.
Co-authored-by: David Arthur <mumrah@gmail.com>
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Co-authored-by: Lucas Brutschy <lucasbru@users.noreply.github.com>
Reviewers: Suhas Satish <ssatish@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>, John Roesler <john@confluent.io>
A test that verifies the upgrade from a version of Streams with
state updater disabled to a version with state updater enabled
and vice versa, so that we can offer a save upgrade path.
- upgrade test from a version of Streams with state updater
disabled to a version with state updater enabled
- downgrade test from a version of Streams with state updater
enabled to a version with state updater disabled
Reviewer: Bruno Cadonna <cadonna@apache.org>
This PR introduces the new metadata loader and snapshot generator. For the time being, they are
only used by the controller, but a PR for the broker will come soon.
The new metadata loader supports adding and removing publishers dynamically. (In contrast, the old
loader only supported adding a single publisher.) It also passes along more information about each
new image that is published. This information can be found in the LogDeltaManifest and
SnapshotManifest classes.
The new snapshot generator replaces the previous logic for generating snapshots in
QuorumController.java and associated classes. The new generator is intended to be shared between
the broker and the controller, so it is decoupled from both.
There are a few small changes to the old snapshot generator in this PR. Specifically, we move the
batch processing time and batch size metrics out of BrokerMetadataListener.scala and into
BrokerServerMetrics.scala.
Finally, fix a case where we are using 'is' rather than '==' for a numeric comparison in
snapshot_test.py.
Reviewers: David Arthur <mumrah@gmail.com>
PR #12684 introduced a better format for timestamps in log
messages. Unfortunately, we missed that one of the modified
log messages is used by a system test for validation.
This PR adapts the system test to look for the modified
log message.
Reviewers: Divij Vaidya <diviv@amazon.com>, Matthias J. Sax <mjsax@apache.org>
The IBM Semeru JDK use the OpenJDK security providers instead of the IBM security providers so test for the OpenJDK classes first where possible and test for Semeru in the java.runtime.name system property otherwise.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Bruno Cadonna <cadonna@apache.org>
The streams upgrade system inserted FK join code for every version of the
the StreamsUpgradeTest except for the latest. Also, the original code
never switched on the `test.run_fk_join` flag for the target version of
the upgrade.
The effect was that FK join upgrades were not tested at all, since
no FK join code was executed after the bounce in the system test.
We introduce `extra_properties` in the system tests, that can be used
to pass any property to the upgrade driver, which is supposed to be
reused by system tests for switching on and off flags (e.g. for the
state restoration code).
Reviewers: Alex Sorokoumov <asorokoumov@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
For tests which use the console consumer service, we are currently enabling TRACE logging by default. I have seen some system tests where this produces GBs of logging. A better default is probably DEBUG.
Reviewers: José Armando García Sancio <jsancio@apache.org>
Adds the `metadata_quorum` parameter to the `@matrix(...)` annotation to many existing tests, so that they are run with both zookeeper and remote_kraft nodes.
Reviewers: Randall Hauch <rhauch@gmail.com>, Greg Harris <gharris1727@gmail.com>
The following files are available in https://s3-us-west-2.amazonaws.com/kafka-packages/:
kafka-streams-3.3.1-test.jar
kafka_2.12-3.3.1.tgz
kafka_2.13-3.3.1.tgz
Reviewers: Colin P. McCabe <cmccabe@apache.org>
KIP-373 added a "token requester" field to the output of kafka-delegation-tokens.sh. The system test was failing since it was not expecting this new field. This patch adds support for this field and improves the error output if we can't parse.
Reviewers: José Armando García Sancio <jsancio@apache.org>, Manikumar Reddy <manikumar.reddy@gmail.com>
This patch removes test_kafka_version.py, which contains two tests at the moment. The first test verifies we can start a 0.8.2 cluster. The second verifies we can start a cluster with one node on 0.8.2 and another on the latest. These test are covered in greater depth by upgrade_test.py and downgrade_test.py.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
Also includes a minor quality-of-life improvement to clarify why some internal REST requests to workers may fail while that worker is still starting up.
Reviewers: Tom Bentley <tbentley@redhat.com>, Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@gmail.com>, Mickael Maison <mickael.maison@gmail.com>