Fixing bad test setup. We tried to fix an upgrade bug for FK-joins in 3.1 release, but it later turned out that the PR was not sufficient to fix it. We finally fixed in 3.4 release.
This PR updates the system test matrix to only test working versions with FK-joins, limited to available test versions.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Hao Li <hli@confluent.io>, Mickael Maison <mickael.maison@gmail.com>
This patch introduces a preliminary state machine that can be used by KRaft
controller to drive online migration from Zk to KRaft.
MigrationState -- Defines the states we can have while migration from Zk to
KRaft.
KRaftMigrationDriver -- Defines the state transitions, and events to handle
actions like controller change, metadata change, broker change and have
interfaces through which it claims Zk controllership, performs zk writes and
sends RPCs to ZkBrokers.
MigrationClient -- Interface that defines the functions used to claim and
relinquish Zk controllership, read to and write from Zk.
Co-authored-by: David Arthur <mumrah@gmail.com>
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Co-authored-by: Lucas Brutschy <lucasbru@users.noreply.github.com>
Reviewers: Suhas Satish <ssatish@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>, John Roesler <john@confluent.io>
This PR introduces the new metadata loader and snapshot generator. For the time being, they are
only used by the controller, but a PR for the broker will come soon.
The new metadata loader supports adding and removing publishers dynamically. (In contrast, the old
loader only supported adding a single publisher.) It also passes along more information about each
new image that is published. This information can be found in the LogDeltaManifest and
SnapshotManifest classes.
The new snapshot generator replaces the previous logic for generating snapshots in
QuorumController.java and associated classes. The new generator is intended to be shared between
the broker and the controller, so it is decoupled from both.
There are a few small changes to the old snapshot generator in this PR. Specifically, we move the
batch processing time and batch size metrics out of BrokerMetadataListener.scala and into
BrokerServerMetrics.scala.
Finally, fix a case where we are using 'is' rather than '==' for a numeric comparison in
snapshot_test.py.
Reviewers: David Arthur <mumrah@gmail.com>
PR #12684 introduced a better format for timestamps in log
messages. Unfortunately, we missed that one of the modified
log messages is used by a system test for validation.
This PR adapts the system test to look for the modified
log message.
Reviewers: Divij Vaidya <diviv@amazon.com>, Matthias J. Sax <mjsax@apache.org>
The IBM Semeru JDK use the OpenJDK security providers instead of the IBM security providers so test for the OpenJDK classes first where possible and test for Semeru in the java.runtime.name system property otherwise.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Bruno Cadonna <cadonna@apache.org>
The streams upgrade system inserted FK join code for every version of the
the StreamsUpgradeTest except for the latest. Also, the original code
never switched on the `test.run_fk_join` flag for the target version of
the upgrade.
The effect was that FK join upgrades were not tested at all, since
no FK join code was executed after the bounce in the system test.
We introduce `extra_properties` in the system tests, that can be used
to pass any property to the upgrade driver, which is supposed to be
reused by system tests for switching on and off flags (e.g. for the
state restoration code).
Reviewers: Alex Sorokoumov <asorokoumov@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
For tests which use the console consumer service, we are currently enabling TRACE logging by default. I have seen some system tests where this produces GBs of logging. A better default is probably DEBUG.
Reviewers: José Armando García Sancio <jsancio@apache.org>
Adds the `metadata_quorum` parameter to the `@matrix(...)` annotation to many existing tests, so that they are run with both zookeeper and remote_kraft nodes.
Reviewers: Randall Hauch <rhauch@gmail.com>, Greg Harris <gharris1727@gmail.com>
The following files are available in https://s3-us-west-2.amazonaws.com/kafka-packages/:
kafka-streams-3.3.1-test.jar
kafka_2.12-3.3.1.tgz
kafka_2.13-3.3.1.tgz
Reviewers: Colin P. McCabe <cmccabe@apache.org>
KIP-373 added a "token requester" field to the output of kafka-delegation-tokens.sh. The system test was failing since it was not expecting this new field. This patch adds support for this field and improves the error output if we can't parse.
Reviewers: José Armando García Sancio <jsancio@apache.org>, Manikumar Reddy <manikumar.reddy@gmail.com>
This patch removes test_kafka_version.py, which contains two tests at the moment. The first test verifies we can start a 0.8.2 cluster. The second verifies we can start a cluster with one node on 0.8.2 and another on the latest. These test are covered in greater depth by upgrade_test.py and downgrade_test.py.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
Also includes a minor quality-of-life improvement to clarify why some internal REST requests to workers may fail while that worker is still starting up.
Reviewers: Tom Bentley <tbentley@redhat.com>, Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@gmail.com>, Mickael Maison <mickael.maison@gmail.com>
This PR adds support to kafka-features.sh for the --metadata flag, as specified in KIP-778. This
flag makes it possible to upgrade to a new metadata version without consulting a table mapping
version names to short integers. Change --feature to use a key=value format.
FeatureCommandTest.scala: make most tests here true unit tests (that don't start brokers) in order
to improve test run time, and allow us to test more cases. For the integration test part, test both
KRaft and ZK-based clusters. Add support for mocking feature operations in MockAdminClient.java.
upgrade.html: add a section describing how the metadata.version should be upgraded in KRaft
clusters.
Add kraft_upgrade_test.py to test upgrades between KRaft versions.
Reviewers: David Arthur <mumrah@gmail.com>, dengziming <dengziming1993@gmail.com>, José Armando García Sancio <jsancio@gmail.com>
Migrates Streams sustem tests to either use kraft brokers or to use both kraft and zk in a testing matrix.
This skips tests which use various forms of Kafka versioning since those seem to have issues with KRaft at the moment. Running these tests with KRaft will require a followup PR.
Reviewers: Guozhang Wang <guozhang@apache.org>, John Roesler <vvcephei@apache.org>
* Description
In this test, when third proc join, sometimes there are other rebalance scenarios such as followup joingroup request happens before syncgroup response was received by one of the proc and the previously assigned tasks for that proc is then lost during new joingroup request. This can result in standby tasks assigned as 3, 1, 2. This PR relax the expected assignment of 2, 2, 2 to a range of [1-3].
* Some backgroud from Guozhang:
I talked to @hao Li offline and also inspected the code a bit, and tl;dr is that I think the code logic is correct (i.e. we do not really have a bug), but we need to relax the test verification a little bit. The general idea behind the subscription info is that:
When a client joins the group, its subscription will try to encode all its current assigned active and standby tasks, which would be used as prev active and standby tasks by the assignor in order to achieve some stickiness.
When a client drops all its active/standby tasks due to errors, it does not actually report all empty from its subscription, instead it tries to check its local state directory (you can see that from TaskManager#getTaskOffsetSums which populates the taskOffsetSum. For active task, its offset would be “-2” a.k.a. LATEST_OFFSET, for standby task, its offset is an actual numerical number.
So in this case, the proc2 which drops all its active and standby tasks, would still report all tasks that have some local state still, and since it was previously owning all six tasks (three as active, and three as standby), it would report all six as standbys, and when that happens the resulted assignment as @hao Li verified, is indeed the un-even one.
So I think the actual “issue“ happens here, is when proc2 is a bit late sending the sync-group request, when the previous rebalance has already completed, and a follow-up rebalance has already triggered, in that case, the resulted un-even assignment is indeed expected. Such a scenario, though not common, is still legitimate since in practice all kinds of timing skewness across instances can happen. So I think we should just relax our verification here, i.e. just making sure that each instance has at least one standby replica at the end, not exactly evenly as “2, 2, 2”.
Reviewers: Suhas Satish <ssatish@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
KRaft mode will not support writing messages with an older message format (2.8) since the min supported IBP is 3.0 for KRaft. Testing support for reading older message formats will be covered by https://issues.apache.org/jira/browse/KAFKA-14056.
Reviewers: David Jacot <djacot@confluent.io>
* KAFKA-13930: Add 3.2.0 Streams upgrade system tests
Apache Kafka 3.2.0 was recently released. Now we need
to test upgrades from 3.2 to trunk in our system tests.
Reviewer: Bill Bejeck <bbejeck@apache.org>
I noticed that a system test using a KRaft cluster with 3 brokers but only 1 co-located controller did not force-kill the second and third broker after shutting down the first broker (the one with the controller). The issue was a floating point rounding error. This patch adjusts for the rounding error and also makes the logic work for an even number of controllers. A local run of `tests/kafkatest/sanity_checks/test_bounce.py` succeeded (and I manually increased the cluster size for the 1 co-located controller case and observed the correct kill behavior: the second and third brokers were force-killed as expected).
Reviewers: Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>, David Jacot <djacot@confluent.io>
Replace ACL_AUTHORIZER attribute with ZK_ACL_AUTHORIZER in system tests. Required after the changes merged with https://github.com/apache/kafka/pull/12190.
Reviewers: David Jacot <djacot@confluent.io>
Apache Kafka 3.2.0 was recently released. Now we need
to test upgrades and compatibility with 3.2 in core system tests.
Reviewer: Jason Gustafson <jason@confluent.io>
Currently the only place we see controller/raft logging in system tests is `server-start-stdout-stderr.log` where they are mixed with all other logs. It is more convenient to send them to `controller.log` as we do for zk tests.
Reviewers: Kvicii <42023367+Kvicii@users.noreply.github.com>, David Jacot <djacot@confluent.io>
It is useful to collect the directory for `__cluster_metadata` in system tests. We use a separate directory from user partitions, so it must be configured separately.
Reviewers: David Arthur <mumrah@gmail.com>
When running our Connect system tests with JDK 10+, we hit the error
AttributeError: 'ClusterNode' object has no attribute 'version'
because util.py attempts to check the version variable for non-Kafka service objects.
Reviewers: Konstantine Karantasis <k.karantasis@gmail.com>
Change `ZookeeperAuthorizerTest` to `AuthorizerTest` and add support for KRaft's `StandardAuthorizer` implementation.
Reviewers: David Jacot <djacot@confluent.io>
This patch fixes some strangeness and inconsistency in the messages written by `TransactionalMessageCopier` to stdout. Here is a sample of two messages.
Progress message:
```
{"consumed":33000,"stage":"ProcessLoop","totalProcessed":33000,"progress":"copier-0","time":"2022/04/24 05:40:31:649","remaining":333}
```
The `transactionalId` is set to the value of the `progress` key.
And a shutdown message:
```
{"consumed":33333,"shutdown_complete":"copier-0","totalProcessed":33333,"time":"2022/04/24 05:40:31:937","remaining":0}
```
The `transactionalId` this time is set to the `shutdown_complete` key and there is no `stage` key.
In this patch, we change the following:
1. Use a separate key for the `transactionalId`.
2. Drop the `progress` and `shutdown_complete` keys.
3. Use `stage=ShutdownComplete` in the shutdown message.
4. Modify `transactional_message_copier.py` system test service accordingly.
Reviewers: David Arthur <mumrah@gmail.com>
The second validation does not verify the second bounce because the verified producer and the verified consumer are stopped in `self.run_validation`. This means that the second `run_validation` just spit out the same information as the first one. Instead, we should just run the validation at the end.
Reviewers: Jason Gustafson <jason@confluent.io>
With this change we stop including the non-production grade connectors that are meant to be used for demos and quick starts by default in the CLASSPATH and plugin.path of Connect deployments. The package of these connector will still be shipped with the Apache Kafka distribution and will be available for explicit inclusion.
The changes have been tested through the system tests and the existing unit and integration tests.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Randall Hauch <rhauch@gmail.com>