Why doesn't df04887ba5 fix it?
The fix in df04887ba5 is to NOT collect the log from path `/mnt/kafka/kafka-operational-logs/debug/xxxx.log` if the task is successful. It does not change the log level. See ducktape b2ad7693f2/ducktape/tests/test.py (L181)
Why doesn't df04887ba5 see the "sort" error?
df04887ba5 does NOT show the error since there is only one feature (only metadata.version). Hence, the bug is not triggered, as there is nothing to "sort". Now we have two features, metadata.version and kraft.version, so the sort is executed and the bug shows up.
Why should we change kafka.log_level to INFO?
The log4j.properties template is controlled by `log_level` (https://github.com/apache/kafka/blob/trunk/tests/kafkatest/services/kafka/templates/log4j.properties#L16), and the bug happens when writing a debug message (e4ca066680/core/src/main/scala/kafka/server/metadata/BrokerMetadataListener.scala (L274)). Hence, changing the log level to INFO avoids triggering the bug.
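For illustration, a minimal sketch of the system-test side, assuming kafkatest's `KafkaService` and its `log_level` attribute (which feeds the Jinja-rendered log4j.properties template linked above); the test wiring is illustrative, not the actual test:

```python
from ducktape.tests.test import Test
from kafkatest.services.kafka import KafkaService

class ExampleUpgradeTest(Test):  # illustrative ducktape test skeleton
    def start_kafka(self):
        self.kafka = KafkaService(self.test_context, num_nodes=3, zk=None)
        # Render log4j.properties at INFO so the broker never executes the
        # DEBUG-only logging path in BrokerMetadataListener that triggers
        # the bug (the template's default level is DEBUG).
        self.kafka.log_level = "INFO"
        self.kafka.start()
```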
Reviewers: Justine Olshan <jolshan@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
Because of KIP-902 (Upgrade Zookeeper version to 3.8.2), it is not possible to upgrade from a Kafka version
earlier than 2.4 to a version later than 2.4. Therefore, we should not test these upgrade scenarios
in upgrade_test.py. They do happen to work sometimes, but only in the trivial case where we don't
create topics or make changes during the upgrade (which would reveal the ZK incompatibility).
Instead, we should test only supported scenarios.
Reviewers: José Armando García Sancio <jsancio@gmail.com>
This patch re-introduces the `group.version` feature flag and gates the new consumer rebalance protocol with it. The `group.version` feature flag is attached to the metadata version `4.0-IV0` and it is marked as production ready. This allows system tests to pick it up directly by default without having to set `unstable.feature.versions.enable` in all of them. This is fine because we don't plan to make any incompatible changes before 4.0.
Reviewers: Justine Olshan <jolshan@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
This patch makes the new group coordinator, introduced as part of KIP-848, the default. This means that any KRaft cluster created from trunk defaults to using the new group coordinator. This includes all the integration tests which do not specify it. This patch also changes the default in system tests.
Reviewers: Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
- Mark 3.9-IV0 as stable. Metadata version 3.9-IV0 should return Fetch version 17.
- Move ELR to 4.0-IV0. Remove 3.9-IV1 since it's no longer needed.
- Create a new 4.0-IV1 MV for KIP-848.
Reviewers: Jun Rao <junrao@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>, Justine Olshan <jolshan@confluent.io>
7496e62434 fixed an error that caused an exception to be thrown on broker startup when debug logs were on. This made it to every version except 3.2.
The KRaft upgrade tests turn debug logs off entirely, but I think we only need to turn them off for the broken version.
Note: this bug is also present in 3.1, but there is no logging on startup like in subsequent versions.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Jacot <david.jacot@gmail.com>
Fix an issue that caused system tests to fail when using AsyncKafkaConsumer.
A configuration option, group.coordinator.rebalance.protocols, was introduced to specify the rebalance protocols used by the group coordinator. By default, the rebalance protocol is set to classic. When the new group coordinator is enabled, the rebalance protocols are set to classic,consumer.
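As a hedged illustration of how a system test might wire these properties (the `server_prop_overrides` hook and the `group.coordinator.new.enable` name are assumptions; the protocols config is the one described above):

```python
from kafkatest.services.kafka import KafkaService

# Sketch: with the new group coordinator enabled, the coordinator supports
# both rebalance protocols; without it, the default stays "classic".
# test_context comes from the surrounding ducktape test.
kafka = KafkaService(
    test_context, num_nodes=3, zk=None,
    server_prop_overrides=[
        ["group.coordinator.new.enable", "true"],
        ["group.coordinator.rebalance.protocols", "classic,consumer"],
    ])
```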
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Jacot <djacot@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Kirk True <kirk@kirktrue.pro>, Justine Olshan <jolshan@confluent.io>
When becoming the active KRaftMigrationDriver, there is another race condition similar to KAFKA-16171. This time, the race is due to a stale read from ZK. After writing to /controller and /controller_epoch, it is possible that a read on /migration is not linearized with the writes that were just made. In other words, we get a stale read on /migration. This leads to an inability to sync metadata to ZK due to incorrect zkVersion on the migration ZNode.
The non-linearizability of reads is in fact documented behavior for ZK, so we need to handle it.
To fix the stale read, this patch adds a write to /migration after updating /controller and /controller_epoch. This allows us to learn the correct zkVersion for the migration ZNode before leaving the BECOME_CONTROLLER state.
This patch also adds a check on the current leader epoch when running certain events in KRaftMigrationDriver. Historically, we did not include this check because it is not necessary for correctness. Writes to ZK are gated on the /controller_epoch zkVersion, and RPCs sent to brokers are gated on the controller epoch. However, during a time of rapid failover, there is a lot of processing happening on the controller (i.e., full metadata sync to ZK and full UMRs sent to brokers), so it is best to avoid running events we know will fail.
There is also a small fix in here to improve the logging of ZK operations. The log messages are changed to past tense to reflect the fact that they have already happened by the time the log message is created.
Reviewers: Igor Soarez <soarez@apple.com>
LATEST_PRODUCTION version in MetadataVersion.java was updated in
both #16347 and #16400, but it was left unchanged in the system
tests.
Reviewers: Josep Prat <josep.prat@aiven.io>
This patch partially reverts `group.version` in trunk. I kept the `GroupVersion` class but removed it from `Features` so it is not advertised. I also kept all the changes in the test framework. I removed the logic to require `group.version=1` to enable the new consumer rebalance protocol. The new protocol is enabled based on the static configuration.
For context, I prefer to revert it in trunk now so we don't forget to revert it in the 3.9 release. I will bring it back for the 4.0 release.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
This reverts commit e95e91a.
With the change to include the group.version flag, these tests fail due to trying to set the feature for the old version.
It is unclear whether these tests originally worked as intended, and given that the upgrade is not expected for 3.8, we will just revert from 3.8.
Reviewers: David Jacot <djacot@confluent.io>
Zookeeper migration system tests currently override the config to
use only one log directory.
This PR removes the override so that the system tests run with 2 log
directories following the work done as part of KIP-858.
Reviewers: Igor Soarez <soarez@apple.com>, Proven Provenzano <pprovenzano@confluent.io>
This PR contains the following documentation changes for the native docker image:
- docker/README.md: how to build, release and promote the native docker image.
- tests/README.md: how to run system tests by bringing up Kafka in native mode.
- Added docker/native/README.md.
- Added HTML changes for the kafka-site.
- Added native docker image support in the docker compose example files.
Testing:
- Tested all the docker compose files with both docker images, jvm and native.
- Tested the HTML changes locally with the kafka-site.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Vedarth Sharma <vesharma@confluent.io>
To avoid confusion in 3.8 (until we fully remove all the old task assignors and internal config), we should rename the old internal assignor classes like the StickyTaskAssignor so that they won't be mixed up with the new version of the assignor (which is also named StickyTaskAssignor).
Reviewers: Bruno Cadonna <cadonna@apache.org>, Josep Prat <josep.prat@aiven.io>
This PR does the following:
- Brings up the Kafka broker in native mode in system tests.
- Runs system tests against the Kafka broker in native mode.
- Extracts the native build command so that it can be reused.
- Allows system tests to run on a native Kafka broker using the Docker mechanism.
To run system tests by bringing up Kafka in native mode, pass kafka_mode as native in the ducktape globals: --globals '{\"kafka_mode\":\"native\"}'
To run system tests by bringing up Kafka in native mode via the Docker mechanism:
_DUCKTAPE_OPTIONS="--globals '{\"kafka_mode\":\"native\"}'" TC_PATHS="tests/kafkatest/tests/" bash tests/docker/run_tests.sh
To bring up only the ducker nodes that cater to a native Kafka:
bash tests/docker/ducker-ak up -m native
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
The system test was broken in ccf4bd5
which failed to import the matrix symbol. The test was failing silently, not
discovering any tests.
Reviewers: Bruno Cadonna <cadonna@apache.org>
We weren't enabling discoverBrokerVersions to check the supported versions in the AddPartitionsToTxnManager. This means that any verification request (or any AddPartitionsToTxnRequest version) from a newer broker would fail when sending to an older broker.
The bulk of this change is adding additional transactions system tests for old versions.
One test upgrades the cluster completely. This didn't catch the issue but could be useful.
The other test forces a new broker to send a verification request to an older one. Without the discoverBrokerVersions change, all tests between mixed brokers failed. (We introduced a new request version in 3.8 -- which is a separate version from the one that caused the bug for 3.5 -> 3.6) With the addition, the tests all passed.
I also manually ran a test for 3.5 -> 3.6 since the issue there was slightly different and was caused by the unstableLatestVersion flag being enabled. This change should fix this as well. 👍
Reviewers: David Jacot <djacot@confluent.io>
Minor change to how the describe-topic output is parsed in system tests, to ensure that the output is preserved even if only some fields are relevant to the test for now (which is what the test used to do before recent changes).
Initial problem: system tests were parsing the describe-topic output in kafka.py assuming all fields would include a value. The describe API was recently changed, breaking this logic, because it included new fields for which there may not be values (e.g., LastKnownElr).
Initial fix: the initial fix was to drop all fields from the output except for the ones currently used in the test, when in reality only the fields without values are the problematic ones.
Proposed improvement: a more extensible approach is to drop only the fields that have no values and preserve the full output, which is what the test did before the initial fix mentioned above. This makes it easy to extend the test to include more fields as the describe API and tests evolve (it will only require adding the fields to the returned value when needed, without having to change how the fields object is stripped); see the sketch below.
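A minimal sketch of that improvement (the tab-separated `Name: value` line format is an assumption based on kafka-topics.sh --describe output):

```python
def parse_describe_topic_fields(line):
    """Keep every 'Name: value' field from a describe-topic line, dropping
    only fields that have no value (e.g. LastKnownElr on some versions),
    so newly added fields are preserved without further parser changes."""
    fields = {}
    for field in line.split("\t"):
        name, _, value = field.partition(":")
        if value.strip():  # drop valueless fields instead of whitelisting
            fields[name.strip()] = value.strip()
    return fields

# parse_describe_topic_fields("Topic: t\tPartition: 0\tLeader: 1\tLastKnownElr: ")
# -> {'Topic': 't', 'Partition': '0', 'Leader': '1'}
```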
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
In two tests, we are using the current snapshot version as a test parameter
`to_version`, but as the only option. We can hardcode it. This
simplifies testing downstream, since the test parameters do not change
with every version. In particular, some tests downstream are blacklisted
because they do not work with ARM. These lists need to be updated every
time `DEV_VERSION` is bumped.
Reviewers: Matthias J. Sax <matthias@confluent.io>
This fixes a consumer system test that was failing for the new protocol. The failure was because the test was expecting the eager behaviour of partitions being revoked on every rebalance, and it was wrongfully applying it to the runs with the new protocol too.
This same situation was previously identified and fixed in other parts of the sys test with #15661.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
Checking that the TopicPartition is in the assignment before attempting to remove it.
Also added some logging and refactoring.
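A sketch of the guard being added (names are illustrative, not the actual consumer code):

```python
import logging

log = logging.getLogger(__name__)

def remove_partition(assignment, topic_partition):
    """Only drop a TopicPartition we actually own; a revocation for a
    partition not in the current assignment is logged and ignored
    instead of raising."""
    if topic_partition in assignment:
        assignment.remove(topic_partition)
    else:
        log.debug("Ignoring removal of %s: not in current assignment",
                  topic_partition)
```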
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Lianet Magrans <lianetmr@gmail.com>
The format of the 'describe topic' output was changed as part of KAFKA-15585 which required an update in the parsing logic used by system tests.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
Added the check before the reassignment occurs and we start bouncing brokers.
Reviewers: David Mao <dmao@confluent.io>, David Jacot <djacot@confluent.io>
Summary of the changes:
- Parameterizes the tests to use the new coordinator and pass in the consumer group protocol. This is applicable to sink connectors only.
- Enhances the sink connector creation code in system tests to accept a new optional parameter for the consumer group protocol to be used.
- Sets the consumer group protocol via a `consumer.override.` config when the new group coordinator is enabled.
Note about testing: there are 288 tests that need to be run, and running them locally takes a lot of time. I will try to post the test results once I have a full run.
Reviewers: Kirk True <ktrue@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>, Philip Nee <pnee@confluent.io>
Consumer Rolling Upgrade is meant to test the protocol upgrade for the old protocol. Therefore, I am removing old changes.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
Updating a consumer system test that was failing with the new protocol, related to static membership behaviour. The behaviour regarding static consumers that join with conflicting group instance ids is slightly different between the classic and new consumer protocols, so the expectations in the tests needed to be updated.
If static members join with the same instance id:
- Classic protocol: all members join the group with the same group instance id, and then the first one eventually fails (it receives a heartbeat error with FencedInstanceIdException).
- Consumer protocol: a new member with an instance id already in use is not able to join, and the first member remains active (the new member with the same instance id receives an UnreleasedInstanceIdException in the response to the heartbeat to join the group). See the sketch after this list.
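A sketch of how a test can encode the protocol-dependent expectation (the exception names are the ones described above; variable names are illustrative):

```python
# Which member survives a group.instance.id conflict, and which error the
# loser sees, depends on the protocol under test.
def expected_conflict_outcome(group_protocol):
    if group_protocol == "classic":
        # All members join; the first one is eventually fenced via the HB.
        return {"survivor": "new_member",
                "error": "FencedInstanceIdException"}
    else:  # "consumer"
        # The joiner is rejected; the original member keeps its assignment.
        return {"survivor": "original_member",
                "error": "UnreleasedInstanceIdException"}
```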
This PR is keeping the single parametrized test that existed before, given that what's being tested and part of the test itself apply to all protocols. This is just updating the expectations that are different, based on the protocol parameter.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>
The current AssignmentValidationTest only tests the EAGER assignment protocol and does not support incremental assignment as used by CooperativeStickyAssignor and the consumer protocol. Therefore, in the ConsumerEventHandler, I subclassed the existing handler and overrode the assigned and revoked event handling methods to permit incremental changes to the current assignment.
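A hedged sketch of that subclass (the base class and method/attribute names follow kafkatest's verifiable consumer handler, but treat them as assumptions):

```python
from kafkatest.services.verifiable_consumer import ConsumerEventHandler

class IncrementalAssignmentEventHandler(ConsumerEventHandler):
    """Apply assigned/revoked events incrementally instead of replacing
    the whole assignment (the EAGER-only behaviour), so CooperativeSticky
    and the consumer protocol validate correctly."""

    def handle_partitions_assigned(self, event):
        for tp in event["partitions"]:
            if tp not in self.assignment:
                self.assignment.append(tp)  # add only the newly assigned

    def handle_partitions_revoked(self, event):
        for tp in event["partitions"]:
            if tp in self.assignment:
                self.assignment.remove(tp)  # remove only the revoked
```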
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>
Enables the log directory failure system test for all KRaft modes in addition to ZK mode.
Reviewers: Luke Chen <showuon@gmail.com>, Igor Soarez <soarez@apple.com>, Proven Provenzano <pprovenzano@confluent.io>
Added a new optional group_protocol parameter to the test methods, then passed that down to the setup_consumer method.
Unfortunately, because the new consumer can only be used with the new coordinator, this required a new @matrix block instead of adding the group_protocol=["classic", "consumer"] to the existing blocks 😢
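Roughly the shape this produces (parameter and module names follow kafkatest conventions but should be treated as assumptions):

```python
from ducktape.mark import matrix
from kafkatest.services.kafka import quorum

# (method of a ducktape test class; surrounding class elided)
# Existing matrix: ZK and KRaft without the new coordinator.
@matrix(metadata_quorum=[quorum.zk, quorum.isolated_kraft],
        use_new_coordinator=[False])
# New matrix: group_protocol only makes sense with the new coordinator,
# hence a separate block rather than one extra axis on the matrix above.
@matrix(metadata_quorum=[quorum.isolated_kraft],
        use_new_coordinator=[True],
        group_protocol=["classic", "consumer"])
def test_consume(self, metadata_quorum, use_new_coordinator=False,
                 group_protocol=None):
    consumer = self.setup_consumer(self.TOPIC, group_protocol=group_protocol)
    ...
```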
Reviewers: Walker Carlson <wcarlson@apache.org>
Added a new optional group_protocol parameter to the test methods, then passed that down to the setup_consumer method.
Unfortunately, because the new consumer can only be used with the new coordinator, this required a new @matrix block instead of adding the group_protocol=["classic", "consumer"] to the existing blocks 😢
Reviewers: Walker Carlson <wcarlson@apache.org>
Migrated the following tests for the new consumer:
- test_fencing_static_consumer
- test_static_consumer_bounce
- test_static_consumer_persisted_after_rejoin
Reviewers: Walker Carlson <wcarlson@apache.org>
Added a new optional group_protocol parameter to the test methods, then passed that down to the methods involved.
Unfortunately, because the new consumer can only be used with the new coordinator, this required a new @matrix block instead of adding the group_protocol=["classic", "consumer"] to the existing blocks 😢
Reviewers: Walker Carlson <wcarlson@apache.org>
Update connect_distributed_test.py to support KIP-848’s group protocol config. Not all tests are updated because only a subset of them use the consumer directly.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>
Added a new optional group_protocol parameter to the test methods, then passed that down to the methods involved.
Unfortunately, because the new consumer can only be used with the new coordinator, this required a new @matrix block instead of adding the group_protocol=["classic", "consumer"] to the existing blocks 😢
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
Added a new optional group_protocol parameter to the test methods, then passed that down to the methods involved.
Unfortunately, because the new consumer can only be used with the new coordinator, this required a new @matrix block instead of adding the group_protocol=["classic", "consumer"] to the existing blocks 😢
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
Upgrading the test to use the consumer group protocol. The two tests were failing due to a mismatched assignment.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
This should help us avoid testing MVs before they are usable (stable).
We revert back from testing 3.8 in this case since 3.7 is the current stable version.
Reviewers: Proven Provenzano <pprovenzano@confluent.io>, Justine Olshan <jolshan@confluent.io>
Adding this as part of the greater effort to modify the system tests to incorporate the use of the consumer group protocol from KIP-848. The test results follow; the tests using protocol=consumer are expected to fail:
================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.11.4
session_id: 2024-03-16--002
run time: 76 minutes 36.150 seconds
tests run: 28
passed: 25
flaky: 0
failed: 3
ignored: 0
================================================================================
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>
Added a new optional `group_protocol` parameter to the test methods, then passed that down to the `setup_consumer` method.
Unfortunately, because the new consumer can only be used with the new coordinator, this required a new `@matrix` block instead of adding the `group_protocol=["classic", "consumer"]` to the existing blocks 😢
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
Added a new optional `group_protocol` parameter to the test methods, then passed that down to the `setup_consumer` method.
Unfortunately, because the new consumer can only be used with the new coordinator, this required a new `@matrix` block instead of adding the `group_protocol=["classic", "consumer"]` to the existing blocks 😢
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
Added a new optional `group_protocol` parameter to the test methods, then passed that down to the `setup_consumer` method.
Unfortunately, because the new consumer can only be used with the new coordinator, this required a new `@matrix` block instead of adding the `group_protocol=["classic", "consumer"]` to the existing blocks 😢
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
* KAFKA-16267: Update consumer_group_command_test.py to support KIP-848’s group protocol config
Added a new optional group_protocol parameter to the test methods, then passed that down to the setup_consumer method.
Unfortunately, because the new consumer can only be used with the new coordinator, this required a new @matrix block instead of adding the group_protocol=["classic", "consumer"] to the existing blocks 😢
Note: this requires #15330.
* Update consumer_group_command_test.py
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
Added a new optional group_protocol parameter to the test methods, then passed that down to the setup_consumer method.
Unfortunately, because the new consumer can only be used with the new coordinator, this required a new @matrix block instead of adding the group_protocol=["classic", "consumer"] to the existing blocks 😢
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
With this modification, if ./gradlew systemTestLibs fails, the script outputs an error message and terminates execution using the die function. This ensures that the script fails fast and prompts the user to address the error before continuing.
Reviewers: Luke Chen <showuon@gmail.com>
The Python VerifiableConsumer service now passes the --group-protocol and --group-remote-assignor command line arguments to VerifiableConsumer if the node is running 3.7.0+ and using the new consumer group.protocol.
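A sketch of the version gate (the helper and constant names are assumptions modeled on kafkatest's version utilities):

```python
from kafkatest.version import V_3_7_0, get_version

def add_group_protocol_args(cmd, node, group_protocol, remote_assignor=None):
    """Append the new flags only for nodes that understand them: 3.7.0+
    running with the new consumer group.protocol."""
    if get_version(node) >= V_3_7_0 and group_protocol == "consumer":
        cmd += " --group-protocol %s" % group_protocol
        if remote_assignor is not None:
            cmd += " --group-remote-assignor %s" % remote_assignor
    return cmd
```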
Reviewers: Andrew Schofield <aschofield@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>
This patch wires the transaction verification in the new group coordinator. It basically calls the verification path before scheduling the write operation. If the verification fails, the error is returned to the caller.
Note that the patch uses `appendForGroup`. I suppose that we will move away from using it when https://github.com/apache/kafka/pull/15087 is merged.
Reviewers: Justine Olshan <jolshan@confluent.io>
This patch bumps the next release version to 3.8.0-SNAPSHOT.
Following the Release Process, I created the 3.7 branch and am following the steps to bump these versions:
Modify the version in trunk to bump to the next one (e.g., "0.10.1.0-SNAPSHOT") in the following files:
docs/js/templateData.js
gradle.properties
kafka-merge-pr.py
streams/quickstart/java/pom.xml
streams/quickstart/java/src/main/resources/archetype-resources/pom.xml
streams/quickstart/pom.xml
tests/kafkatest/__init__.py
https://issues.apache.org/jira/browse/KAFKA-14505 is not done yet so we need to disable the system test. Added a comment in the jira to re-enable once it's implemented.
Reviewers: Justine Olshan <jolshan@confluent.io>
This patch converts a few more system tests to using the new group coordinator. This is only applied to KRaft clusters.
Reviewers: David Jacot <djacot@confluent.io>
The latest metadata version is now 3.7. Fix the KRaft upgrade
test to upgrade to that version instead of 3.6.
Change the vagrant setup and gradle dependencies to use 3.3.2 instead of 3.3.1.
Reviewers: David Arthur <mumrah@gmail.com>
This field was missed by the initial KIP-919 PR(s). The result is that migrations can't begin since
the controllers will never become ready. This patch fixes that as well as pulls over some fixes
from the 3.6 branch.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch adds configs to facilitate testing with the new group coordinator (KIP-848) in KRaft mode. Only one test file is converted at the moment. The others will follow.
Reviewers: Ian McDonald <imcdonald@confluent.io>, David Jacot <djacot@confluent.io>
Fixing bad test setup. We tried to fix an upgrade bug for FK-joins in the 3.1 release, but it later turned out that the PR was not sufficient to fix it. We finally fixed it in the 3.4 release.
This PR updates the system test matrix to only test working versions with FK-joins, limited to available test versions.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Hao Li <hli@confluent.io>, Mickael Maison <mickael.maison@gmail.com>
This PR contains three main changes:
- Support for transactions in MetadataLoader
- Abort in-progress transaction during controller failover
- Utilize transactions for ZK to KRaft migration
A new MetadataBatchLoader class is added to decouple the loading of record batches from the
publishing of metadata in MetadataLoader. Since a transaction can span across multiple batches (or
multiple transactions could exist within one batch), some buffering of metadata updates was needed
before publishing out to the MetadataPublishers. MetadataBatchLoader accumulates changes into a
MetadataDelta, and uses a callback to publish to the publishers when needed.
One small oddity with this approach is that, since we can "split" batches in some cases, the number of bytes returned in the LogDeltaManifest has new semantics: the number of bytes in a batch is now only included in the last metadata update that is published as a result of that batch.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Kafka system tests with Java version 17 are failing on this issue:
```python
TimeoutError("MiniKdc didn't finish startup",)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/ducktape/tests/runner_client.py", line 186, in _do_run
data = self.run_test()
File "/usr/local/lib/python3.6/site-packages/ducktape/tests/runner_client.py", line 246, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.6/site-packages/ducktape/mark/_mark.py", line 433, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/opt/kafka-dev/tests/kafkatest/sanity_checks/test_verifiable_producer.py", line 74, in test_simple_run
self.kafka.start()
File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 635, in start
self.start_minikdc_if_necessary(add_principals)
File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 596, in start_minikdc_if_necessary
self.minikdc.start()
File "/usr/local/lib/python3.6/site-packages/ducktape/services/service.py", line 265, in start
self.start_node(node, **kwargs)
File "/opt/kafka-dev/tests/kafkatest/services/security/minikdc.py", line 114, in start_node
monitor.wait_until("MiniKdc Running", timeout_sec=60, backoff_sec=1, err_msg="MiniKdc didn't finish startup")
File "/usr/local/lib/python3.6/site-packages/ducktape/cluster/remoteaccount.py", line 754, in wait_until
allow_fail=True) == 0, **kwargs)
File "/usr/local/lib/python3.6/site-packages/ducktape/utils/util.py", line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: MiniKdc didn't finish startup
```
Specifically, when one runs the test cases and looks at the logs of the MiniKdc:
```java
Exception in thread "main" java.lang.IllegalAccessException: class kafka.security.minikdc.MiniKdc cannot access class sun.security.krb5.Config (in module java.security.jgss) because module java.security.jgss does not export sun.security.krb5 to unnamed module @24959ca4
at java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
at java.base/java.lang.reflect.Method.invoke(Method.java:560)
at kafka.security.minikdc.MiniKdc.refreshJvmKerberosConfig(MiniKdc.scala:268)
at kafka.security.minikdc.MiniKdc.initJvmKerberosConfig(MiniKdc.scala:245)
at kafka.security.minikdc.MiniKdc.start(MiniKdc.scala:123)
at kafka.security.minikdc.MiniKdc$.start(MiniKdc.scala:375)
at kafka.security.minikdc.MiniKdc$.main(MiniKdc.scala:366)
at kafka.security.minikdc.MiniKdc.main(MiniKdc.scala)
```
This error is caused by the fact that the sun.security.krb5 internal package is strongly encapsulated and no longer accessible by default in Java 16 and higher; see [1]. There are two ways to solve it, and I present one of them here (opening the package to the unnamed module, e.g. via a JVM flag like --add-exports java.security.jgss/sun.security.krb5=ALL-UNNAMED). The second way is to export the ENV variable during the deployment of the containers using Ducktape in [2].
[1] - https://openjdk.org/jeps/396
[2] - https://github.com/apache/kafka/blob/trunk/tests/docker/ducker-ak#L308
Reviewers: Ismael Juma <ismael@juma.me.uk>, Luke Chen <showuon@gmail.com>
Update "requests" lib used in system tests to version "2.31.0" to fix CVE-2023-32681: Unintended leak of Proxy-Authorization header in requests
Reviewers: Divij Vaidya <diviv@amazon.com>
This patch adds snapshot reconciliation during ZK to KRaft migration. This reconciliation happens whenever a snapshot is loaded by KRaft, or during a controller failover. Prior to this patch, it was possible to miss metadata updates coming from KRaft when dual-writing to ZK.
Internally this adds a new state SYNC_KRAFT_TO_ZK to the KRaftMigrationDriver state machine. The controller passes through this state after the initial ZK migration and each time a controller becomes active.
Logging during dual-write was enhanced to include a count of write operations happening.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch adds support for handling metadata snapshots while in dual-write mode. Prior to this change, if the active
controller loaded a snapshot, it would get out of sync with the ZK state.
In order to reconcile the snapshot state with ZK, several methods were added to scan through the metadata in ZK to
compute differences with the MetadataImage. Since this introduced a lot of code, I opted to split out a lot of methods
from ZkMigrationClient into their own client interfaces, such as TopicMigrationClient, ConfigMigrationClient, and
AclMigrationClient. Each of these has some iterator method that lets the caller examine the ZK state in a single pass
and without using too much memory.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Luke Chen <showuon@gmail.com>
This patch adds the concept of pre-migration mode to the KRaft controller. While in this mode,
the controller will only allow certain write operations. The purpose of this is to disallow metadata
changes when the controller is waiting for the ZK migration records to be committed.
The following ControllerWriteEvent operations are permitted in pre-migration mode
* completeActivation
* maybeFenceReplicas
* writeNoOpRecord
* processBrokerHeartbeat
* registerBroker (only for migrating ZK brokers)
* unregisterBroker
Raft events and other controller events do not follow the same code path as ControllerWriteEvent,
so they are not affected by this new behavior.
This patch also adds a new metric as defined in KIP-868: kafka.controller:type=KafkaController,name=ZkMigrationState
In order to support upgrades from 3.4.0, this patch also redefines the enum value of value 1 to mean
MIGRATION rather than PRE_MIGRATION.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
Some system tests from kafkatest.tests.core.network_degrade_test are failing due to the missing iputils-ping utility:
[DEBUG - 2023-02-04 01:34:56,322 - network_degrade_test - test_latency - lineno:67]: Ping output: bash: line 1: ping: command not found
Reviewers: Luke Chen <showuon@gmail.com>
Similar to https://github.com/apache/kafka/pull/13439:
ddd652c standardized on "isolated" as the name for all the isolated
modes, and renamed remote_controller_quorum to
isolated_controller_quorum. This broke
SecurityTest.test_quorum_ssl_endpoint_validation_failure, which should
be fixed by this simple rename.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
ddd652c6 standardized on "isolated" as the name for all the isolated
modes, and renamed kafkatest.services.kafka.quorum.remote_kraft to
isolated_kraft. However, the tests using remote_kraft weren't
updated, and are broken as a result. This is a simple search and
replace to fix those.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Standardize KRaft thread names.
- Always use kebab case. That is, "my-thread-name".
- Thread prefixes are just strings, not Option[String] or Optional<String>.
If you don't want a prefix, use the empty string.
- Thread prefixes end in a dash (except the empty prefix). Then you can
calculate thread names as $prefix + "my-thread-name"
- Broker-only components get "broker-$id-" as a thread name prefix. For example, "broker-1-"
- Controller-only components get "controller-$id-" as a thread name prefix. For example, "controller-1-"
- Shared components get "kafka-$id-" as a thread name prefix. For example, "kafka-0-"
- Always pass a prefix to KafkaEventQueue, so that threads have names like
"broker-0-metadata-loader-event-handler" rather than "event-handler". Prior to this PR, we had
several threads just named "EventHandler" which was not helpful for debugging.
- QuorumController thread name is "quorum-controller-123-event-handler"
- Don't set a thread prefix for replication threads started by ReplicaManager. They run only on the
broker, and already include the broker ID.
Standardize KRaft slf4j log prefixes.
- Names should be of the form "[ComponentName id=$id] ". So for a ControllerServer with ID 123, we
will have "[ControllerServer id=123] "
- For the QuorumController class, use the prefix "[QuorumController id=$id] " rather than
"[Controller <nodeId] ", to make it clearer that this is a KRaft controller.
- In BrokerLifecycleManager, add isZkBroker=true to the log prefix for the migration case.
Standardize KRaft terminology.
- All synonyms of combined mode (colocated, coresident, etc.) should be replaced by "combined"
- All synonyms of isolated mode (remote, non-colocated, distributed, etc.) should be replaced by
"isolated".
Implement KIP-599 controller mutation quotas for the KRaft controller. These quotas apply to create
topics, create partitions, and delete topic operations. They are specified in terms of number of
partitions.
The approach taken here is to reuse the ControllerMutationQuotaManager that is also used in ZK
mode. The quotas are implemented as Sensor objects and Sensor.checkQuotas enforces the quota,
whereas Sensor.record notes that new partitions have been modified. While ControllerApis handles
fetching the Sensor objects, we must make the final callback to check the quotas from within
QuorumController. The reason is because only QuorumController knows the final number of partitions
that must be modified. (As one example, up-to-date information about the number of partitions that
will be deleted when a topic is deleted is really only available in QuorumController.)
For quota enforcement, the logic is already in place. The KRaft controller is expected to set the
throttle time in the response that is embedded in EnvelopeResponse, but it does not actually apply
the throttle because there is no client connection to throttle. Instead, the broker that forwarded
the request is expected to return the throttle value from the controller and to throttle the client
connection. It also applies its own request quota, so the enforced/returned quota is the maximum of
the two.
This PR also installs a DynamicConfigPublisher in ControllerServer. This allows dynamic
configurations to be published on the controller. Previously, they could be set, but they were not
applied. Note that we still don't have a good way to set node-level configurations for isolated
controllers. However, this will allow us to set cluster configs (aka default node configs) and have
them take effect on the controllers.
In a similar vein, this PR separates out the dynamic client quota publisher logic used on the
broker into DynamicClientQuotaPublisher. We can now install this on both BrokerServer and
ControllerServer. This makes dynamically configuring quotas (such as controller mutation quotas)
possible.
Also add a ducktape test, controller_mutation_quota_test.py.
Reviewers: David Jacot <djacot@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Colin P. McCabe <cmccabe@apache.org>
This patch introduces a preliminary state machine that can be used by KRaft
controller to drive online migration from Zk to KRaft.
MigrationState -- Defines the states we can have while migration from Zk to
KRaft.
KRaftMigrationDriver -- Defines the state transitions, and events to handle
actions like controller change, metadata change, broker change and have
interfaces through which it claims Zk controllership, performs zk writes and
sends RPCs to ZkBrokers.
MigrationClient -- Interface that defines the functions used to claim and
relinquish Zk controllership, and to read from and write to Zk.
Co-authored-by: David Arthur <mumrah@gmail.com>
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Co-authored-by: Lucas Brutschy <lucasbru@users.noreply.github.com>
Reviewers: Suhas Satish <ssatish@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>, John Roesler <john@confluent.io>
A test that verifies the upgrade from a version of Streams with the
state updater disabled to a version with the state updater enabled,
and vice versa, so that we can offer a safe upgrade path.
- upgrade test from a version of Streams with state updater
disabled to a version with state updater enabled
- downgrade test from a version of Streams with state updater
enabled to a version with state updater disabled
Reviewer: Bruno Cadonna <cadonna@apache.org>
This PR introduces the new metadata loader and snapshot generator. For the time being, they are
only used by the controller, but a PR for the broker will come soon.
The new metadata loader supports adding and removing publishers dynamically. (In contrast, the old
loader only supported adding a single publisher.) It also passes along more information about each
new image that is published. This information can be found in the LogDeltaManifest and
SnapshotManifest classes.
The new snapshot generator replaces the previous logic for generating snapshots in
QuorumController.java and associated classes. The new generator is intended to be shared between
the broker and the controller, so it is decoupled from both.
There are a few small changes to the old snapshot generator in this PR. Specifically, we move the
batch processing time and batch size metrics out of BrokerMetadataListener.scala and into
BrokerServerMetrics.scala.
Finally, fix a case where we are using 'is' rather than '==' for a numeric comparison in
snapshot_test.py.
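For reference, the pitfall being fixed there: in CPython, `is` compares object identity rather than numeric value, and only small integers (-5 to 256) are cached, so an identity check between equal larger numbers can be False:

```python
a = 1000
b = int("1000")   # equal value, but a distinct int object built at runtime
print(a == b)     # True: value equality, the intended comparison
print(a is b)     # False in CPython: different objects, identity differs
```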
Reviewers: David Arthur <mumrah@gmail.com>
PR #12684 introduced a better format for timestamps in log
messages. Unfortunately, we missed that one of the modified
log messages is used by a system test for validation.
This PR adapts the system test to look for the modified
log message.
Reviewers: Divij Vaidya <diviv@amazon.com>, Matthias J. Sax <mjsax@apache.org>
The IBM Semeru JDK uses the OpenJDK security providers instead of the IBM security providers, so test for the OpenJDK classes first where possible and test for Semeru in the java.runtime.name system property otherwise.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Bruno Cadonna <cadonna@apache.org>
The streams upgrade system test inserted FK join code for every version of
the StreamsUpgradeTest except for the latest. Also, the original code
never switched on the `test.run_fk_join` flag for the target version of
the upgrade.
The effect was that FK join upgrades were not tested at all, since
no FK join code was executed after the bounce in the system test.
We introduce `extra_properties` in the system tests, which can be used
to pass any property to the upgrade driver and is intended to be
reused by system tests for switching flags on and off (e.g. for the
state restoration code); see the sketch below.
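A hedged sketch of how a test might switch the flag on via the new hook (the service name and constructor signature are illustrative, not the actual kafkatest API):

```python
from kafkatest.services.streams import StreamsUpgradeTestJobRunnerService

# (inside a ducktape test method) Pass the FK-join flag through
# extra_properties so the target version of the upgrade also runs the
# FK-join code after the bounce.
extra_properties = {"test.run_fk_join": "true"}
processor = StreamsUpgradeTestJobRunnerService(
    self.test_context, self.kafka, extra_properties=extra_properties)
processor.start()
```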
Reviewers: Alex Sorokoumov <asorokoumov@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
For tests which use the console consumer service, we are currently enabling TRACE logging by default. I have seen some system tests where this produces GBs of logging. A better default is probably DEBUG.
Reviewers: José Armando García Sancio <jsancio@apache.org>
Adds the `metadata_quorum` parameter to the `@matrix(...)` annotation to many existing tests, so that they are run with both zookeeper and remote_kraft nodes.
Reviewers: Randall Hauch <rhauch@gmail.com>, Greg Harris <gharris1727@gmail.com>
The following files are available in https://s3-us-west-2.amazonaws.com/kafka-packages/:
kafka-streams-3.3.1-test.jar
kafka_2.12-3.3.1.tgz
kafka_2.13-3.3.1.tgz
Reviewers: Colin P. McCabe <cmccabe@apache.org>
KIP-373 added a "token requester" field to the output of kafka-delegation-tokens.sh. The system test was failing since it was not expecting this new field. This patch adds support for this field and improves the error output if we can't parse.
Reviewers: José Armando García Sancio <jsancio@apache.org>, Manikumar Reddy <manikumar.reddy@gmail.com>
This patch removes test_kafka_version.py, which contains two tests at the moment. The first test verifies we can start a 0.8.2 cluster. The second verifies we can start a cluster with one node on 0.8.2 and another on the latest. These tests are covered in greater depth by upgrade_test.py and downgrade_test.py.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
Also includes a minor quality-of-life improvement to clarify why some internal REST requests to workers may fail while that worker is still starting up.
Reviewers: Tom Bentley <tbentley@redhat.com>, Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@gmail.com>, Mickael Maison <mickael.maison@gmail.com>
This PR adds support to kafka-features.sh for the --metadata flag, as specified in KIP-778. This
flag makes it possible to upgrade to a new metadata version without consulting a table mapping
version names to short integers. Change --feature to use a key=value format.
FeatureCommandTest.scala: make most tests here true unit tests (that don't start brokers) in order
to improve test run time, and allow us to test more cases. For the integration test part, test both
KRaft and ZK-based clusters. Add support for mocking feature operations in MockAdminClient.java.
upgrade.html: add a section describing how the metadata.version should be upgraded in KRaft
clusters.
Add kraft_upgrade_test.py to test upgrades between KRaft versions.
Reviewers: David Arthur <mumrah@gmail.com>, dengziming <dengziming1993@gmail.com>, José Armando García Sancio <jsancio@gmail.com>
Migrates Streams system tests to either use KRaft brokers or to use both KRaft and ZK in a testing matrix.
This skips tests which use various forms of Kafka versioning, since those seem to have issues with KRaft at the moment. Running these tests with KRaft will require a follow-up PR.
Reviewers: Guozhang Wang <guozhang@apache.org>, John Roesler <vvcephei@apache.org>