This PR introduces the new metadata loader and snapshot generator. For the time being, they are
only used by the controller, but a PR for the broker will come soon.
The new metadata loader supports adding and removing publishers dynamically. (In contrast, the old
loader only supported adding a single publisher.) It also passes along more information about each
new image that is published. This information can be found in the LogDeltaManifest and
SnapshotManifest classes.
The new snapshot generator replaces the previous logic for generating snapshots in
QuorumController.java and associated classes. The new generator is intended to be shared between
the broker and the controller, so it is decoupled from both.
There are a few small changes to the old snapshot generator in this PR. Specifically, we move the
batch processing time and batch size metrics out of BrokerMetadataListener.scala and into
BrokerServerMetrics.scala.
Finally, fix a case where we are using 'is' rather than '==' for a numeric comparison in
snapshot_test.py.
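As a reminder of why that matters, here is a minimal illustration of the pitfall (not the actual test code): `is` compares object identity, while `==` compares values, and for integers the two can disagree.
```
# Illustrative only -- not the code in snapshot_test.py.
a = int("1000")
b = int("1000")
print(a is b)   # False: two distinct int objects, even though the values match
print(a == b)   # True: the numeric comparison the test actually needs
```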
Reviewers: David Arthur <mumrah@gmail.com>
PR #12684 introduced a better format for timestamps in log
messages. Unfortunately, we missed that one of the modified
log messages is used by a system test for validation.
This PR adapts the system test to look for the modified
log message.
Reviewers: Divij Vaidya <diviv@amazon.com>, Matthias J. Sax <mjsax@apache.org>
The IBM Semeru JDK uses the OpenJDK security providers instead of the IBM security providers, so test for the OpenJDK classes first where possible, and otherwise test for Semeru in the java.runtime.name system property.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Bruno Cadonna <cadonna@apache.org>
The Streams upgrade system test inserted FK join code for every version of
the StreamsUpgradeTest except for the latest. Also, the original code
never switched on the `test.run_fk_join` flag for the target version of
the upgrade.
The effect was that FK join upgrades were not tested at all, since
no FK join code was executed after the bounce in the system test.
We introduce `extra_properties` in the system tests, which can be used
to pass any property to the upgrade driver and is intended to be reused
by system tests for switching flags on and off (e.g. for the state
restoration code); see the sketch below.
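As a rough sketch of the idea (the helper and base property names below, apart from `test.run_fk_join`, are hypothetical rather than the actual StreamsUpgradeTest code):
```
# Hypothetical sketch: forward arbitrary key=value pairs from the system test
# down to the upgrade driver, which reads them from a plain properties file,
# so future flags can be toggled without further plumbing changes.
extra_properties = {"test.run_fk_join": "true"}

def render_driver_properties(base_properties, extra_properties):
    props = dict(base_properties)
    props.update(extra_properties)
    return "\n".join(f"{k}={v}" for k, v in sorted(props.items()))

print(render_driver_properties({"upgrade.from": "2.8"}, extra_properties))
```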
Reviewers: Alex Sorokoumov <asorokoumov@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
For tests which use the console consumer service, we are currently enabling TRACE logging by default. I have seen some system tests where this produces GBs of logging. A better default is probably DEBUG.
Reviewers: José Armando García Sancio <jsancio@apache.org>
Adds the `metadata_quorum` parameter to the `@matrix(...)` annotation of many existing tests, so that they are run with both zookeeper and remote_kraft quorums.
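A parametrized test then looks roughly like this (the class and test names are placeholders; `quorum.zk` and `quorum.remote_kraft` are the kafkatest identifiers):
```
# Rough sketch of the annotation pattern; the test body is a placeholder.
from ducktape.mark import matrix
from ducktape.tests.test import Test
from kafkatest.services.kafka import quorum


class SomeExistingTest(Test):

    @matrix(metadata_quorum=[quorum.zk, quorum.remote_kraft])
    def test_something(self, metadata_quorum=quorum.zk):
        # ducktape runs the test once per listed value, so the same scenario
        # is exercised against both ZooKeeper and a remote KRaft quorum.
        pass
```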
Reviewers: Randall Hauch <rhauch@gmail.com>, Greg Harris <gharris1727@gmail.com>
The following files are available in https://s3-us-west-2.amazonaws.com/kafka-packages/:
kafka-streams-3.3.1-test.jar
kafka_2.12-3.3.1.tgz
kafka_2.13-3.3.1.tgz
Reviewers: Colin P. McCabe <cmccabe@apache.org>
KIP-373 added a "token requester" field to the output of kafka-delegation-tokens.sh. The system test was failing since it was not expecting this new field. This patch adds support for this field and improves the error output when the tool's output can't be parsed.
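A hedged sketch of the parsing approach (the column names and whitespace-separated layout below are assumptions, not the tool's exact output format): locate fields by header name so a newly added column does not break positional parsing, and fail with a descriptive error otherwise.
```
# Illustrative only: header-driven parsing that tolerates new columns such as
# the KIP-373 token requester field. Column names here are assumptions.
def parse_tokens(output, required=("TOKENID", "OWNER", "REQUESTER")):
    lines = [l for l in output.splitlines() if l.strip()]
    if not lines:
        raise ValueError("Cannot parse delegation token output: no lines found")
    header = lines[0].split()
    missing = [c for c in required if c not in header]
    if missing:
        raise ValueError(f"Cannot parse delegation token output, missing columns "
                         f"{missing}; header was: {lines[0]!r}")
    idx = {c: header.index(c) for c in required}
    # Assumes simple whitespace-separated columns; good enough for a sketch.
    return [{c: row.split()[idx[c]] for c in required} for row in lines[1:]]
```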
Reviewers: José Armando García Sancio <jsancio@apache.org>, Manikumar Reddy <manikumar.reddy@gmail.com>
This patch removes test_kafka_version.py, which contains two tests at the moment. The first test verifies we can start a 0.8.2 cluster. The second verifies we can start a cluster with one node on 0.8.2 and another on the latest. These tests are covered in greater depth by upgrade_test.py and downgrade_test.py.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
Also includes a minor quality-of-life improvement to clarify why some internal REST requests to a worker may fail while that worker is still starting up.
Reviewers: Tom Bentley <tbentley@redhat.com>, Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@gmail.com>, Mickael Maison <mickael.maison@gmail.com>
This PR adds support to kafka-features.sh for the --metadata flag, as specified in KIP-778. This
flag makes it possible to upgrade to a new metadata version without consulting a table mapping
version names to short integers. Change --feature to use a key=value format.
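With the key=value form, each --feature argument pairs a feature name with a desired level; a minimal sketch of parsing that form (illustrative only, not the actual FeatureCommand code, and the level shown is just an example):
```
# Illustrative only: split a --feature argument of the form name=level.
def parse_feature(arg):
    name, sep, level = arg.partition("=")
    if not sep or not level.isdigit():
        raise ValueError(f"Expected name=level, got {arg!r}")
    return name, int(level)

print(parse_feature("metadata.version=6"))   # ('metadata.version', 6)
```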
FeatureCommandTest.scala: make most tests here true unit tests (that don't start brokers) in order
to improve test run time, and allow us to test more cases. For the integration test part, test both
KRaft and ZK-based clusters. Add support for mocking feature operations in MockAdminClient.java.
upgrade.html: add a section describing how the metadata.version should be upgraded in KRaft
clusters.
Add kraft_upgrade_test.py to test upgrades between KRaft versions.
Reviewers: David Arthur <mumrah@gmail.com>, dengziming <dengziming1993@gmail.com>, José Armando García Sancio <jsancio@gmail.com>
Migrates Streams system tests to either use KRaft brokers or to use both KRaft and ZK in a testing matrix.
This skips tests which use various forms of Kafka versioning, since those seem to have issues with KRaft at the moment. Running these tests with KRaft will require a follow-up PR.
Reviewers: Guozhang Wang <guozhang@apache.org>, John Roesler <vvcephei@apache.org>
* Description
In this test, when the third proc joins, other rebalance scenarios can sometimes occur, such as a follow-up JoinGroup request arriving before the SyncGroup response is received by one of the procs, in which case the tasks previously assigned to that proc are lost during the new JoinGroup request. This can result in standby tasks being assigned as 3, 1, 2. This PR relaxes the expected assignment of 2, 2, 2 to a range of [1-3].
* Some background from Guozhang:
I talked to @hao Li offline and also inspected the code a bit, and the tl;dr is that I think the code logic is correct (i.e. we do not really have a bug), but we need to relax the test verification a little bit. The general idea behind the subscription info is that:
When a client joins the group, its subscription will try to encode all its current assigned active and standby tasks, which would be used as prev active and standby tasks by the assignor in order to achieve some stickiness.
When a client drops all its active/standby tasks due to errors, it does not actually report all-empty in its subscription; instead it checks its local state directory (you can see that from TaskManager#getTaskOffsetSums, which populates the taskOffsetSum). For an active task, its offset would be "-2", a.k.a. LATEST_OFFSET; for a standby task, its offset is an actual numerical value.
So in this case, proc2, which drops all its active and standby tasks, would still report all tasks that still have some local state, and since it previously owned all six tasks (three as active and three as standby), it would report all six as standbys. When that happens, the resulting assignment, as @hao Li verified, is indeed the uneven one.
So I think the actual "issue" here happens when proc2 is a bit late sending the sync-group request, the previous rebalance has already completed, and a follow-up rebalance has already been triggered; in that case, the resulting uneven assignment is indeed expected. Such a scenario, though not common, is still legitimate, since in practice all kinds of timing skew across instances can happen. So I think we should just relax our verification here, i.e. just make sure that each instance has at least one standby replica at the end, not exactly evenly as "2, 2, 2".
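A sketch of the relaxed verification described above (variable names and values are illustrative, not the exact test code):
```
# Illustrative only: instead of requiring an exact 2, 2, 2 split, accept any
# distribution where each instance holds between 1 and 3 standby tasks and
# the total number of standbys is unchanged.
standby_counts = {"proc1": 3, "proc2": 1, "proc3": 2}   # e.g. the uneven case

assert sum(standby_counts.values()) == 6, "total standby tasks should still be 6"
for proc, count in standby_counts.items():
    assert 1 <= count <= 3, f"{proc} has {count} standby tasks, expected 1-3"
```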
Reviewers: Suhas Satish <ssatish@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
KRaft mode will not support writing messages with an older message format (2.8) since the min supported IBP is 3.0 for KRaft. Testing support for reading older message formats will be covered by https://issues.apache.org/jira/browse/KAFKA-14056.
Reviewers: David Jacot <djacot@confluent.io>
* KAFKA-13930: Add 3.2.0 Streams upgrade system tests
Apache Kafka 3.2.0 was recently released. Now we need
to test upgrades from 3.2 to trunk in our system tests.
Reviewer: Bill Bejeck <bbejeck@apache.org>
I noticed that a system test using a KRaft cluster with 3 brokers but only 1 co-located controller did not force-kill the second and third broker after shutting down the first broker (the one with the controller). The issue was a floating point rounding error. This patch adjusts for the rounding error and also makes the logic work for an even number of controllers. A local run of `tests/kafkatest/sanity_checks/test_bounce.py` succeeded (and I manually increased the cluster size for the 1 co-located controller case and observed the correct kill behavior: the second and third brokers were force-killed as expected).
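For context, the majority check is robust when done with integer arithmetic; a sketch of the intended logic (illustrative only, not the exact test code):
```
# Illustrative only: decide whether force-kill is needed, i.e. whether the
# stopped controllers constitute a majority of the quorum. Integer arithmetic
# avoids rounding surprises and also handles even controller counts.
def majority_of_controllers_stopped(stopped, total):
    return 2 * stopped > total   # strict majority, no floating point involved

print(majority_of_controllers_stopped(1, 1))   # True: the single controller is down
print(majority_of_controllers_stopped(2, 4))   # False: half is not a majority
print(majority_of_controllers_stopped(3, 4))   # True
```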
Reviewers: Luke Chen <showuon@gmail.com>, José Armando García Sancio <jsancio@users.noreply.github.com>, David Jacot <djacot@confluent.io>
Replace ACL_AUTHORIZER attribute with ZK_ACL_AUTHORIZER in system tests. Required after the changes merged with https://github.com/apache/kafka/pull/12190.
Reviewers: David Jacot <djacot@confluent.io>
Apache Kafka 3.2.0 was recently released. Now we need
to test upgrades and compatibility with 3.2 in core system tests.
Reviewer: Jason Gustafson <jason@confluent.io>
Currently the only place we see controller/raft logging in system tests is `server-start-stdout-stderr.log` where they are mixed with all other logs. It is more convenient to send them to `controller.log` as we do for zk tests.
Reviewers: Kvicii <42023367+Kvicii@users.noreply.github.com>, David Jacot <djacot@confluent.io>
It is useful to collect the directory for `__cluster_metadata` in system tests. We use a separate directory from user partitions, so it must be configured separately.
Reviewers: David Arthur <mumrah@gmail.com>
When running our Connect system tests with JDK 10+, we hit the error
AttributeError: 'ClusterNode' object has no attribute 'version'
because util.py attempts to check the version variable for non-Kafka service objects.
Reviewers: Konstantine Karantasis <k.karantasis@gmail.com>
Change `ZookeeperAuthorizerTest` to `AuthorizerTest` and add support for KRaft's `StandardAuthorizer` implementation.
Reviewers: David Jacot <djacot@confluent.io>
This patch fixes some strangeness and inconsistency in the messages written by `TransactionalMessageCopier` to stdout. Here are samples of the two messages.
Progress message:
```
{"consumed":33000,"stage":"ProcessLoop","totalProcessed":33000,"progress":"copier-0","time":"2022/04/24 05:40:31:649","remaining":333}
```
The `transactionalId` appears as the value of the `progress` key.
And a shutdown message:
```
{"consumed":33333,"shutdown_complete":"copier-0","totalProcessed":33333,"time":"2022/04/24 05:40:31:937","remaining":0}
```
This time the `transactionalId` appears as the value of the `shutdown_complete` key, and there is no `stage` key.
In this patch, we change the following:
1. Use a separate key for the `transactionalId`.
2. Drop the `progress` and `shutdown_complete` keys.
3. Use `stage=ShutdownComplete` in the shutdown message.
4. Modify `transactional_message_copier.py` system test service accordingly.
Reviewers: David Arthur <mumrah@gmail.com>
The second validation does not verify the second bounce because the verifiable producer and the verifiable consumer are stopped in `self.run_validation`. This means that the second `run_validation` just spits out the same information as the first one. Instead, we should just run the validation at the end.
Reviewers: Jason Gustafson <jason@confluent.io>
With this change we stop including, by default, the non-production-grade connectors that are meant to be used for demos and quick starts in the CLASSPATH and plugin.path of Connect deployments. The packages of these connectors will still be shipped with the Apache Kafka distribution and will be available for explicit inclusion.
The changes have been tested through the system tests and the existing unit and integration tests.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Randall Hauch <rhauch@gmail.com>
KRaft brokers always use the first controller listener, so if there is not also a co-located KRaft controller on the node, be sure to publish only one controller listener in `controller.listener.names`, even when the inter-controller listener name differs. System tests were failing because a broker-only config unnecessarily published a second entry in `controller.listener.names` without also publishing a mapping for it in `listener.security.protocol.map`. Removing the unnecessary entry in `controller.listener.names` solves the problem.
Reviewers: David Jacot <djacot@confluent.io>
#11490 changed log messages in AssignorConfiguration that are also used for
verification in the system test
StreamsCooperativeRebalanceUpgradeTest.test_upgrade_to_cooperative_rebalance.
This commit fixes the test and adds comments to the log messages
that point to the test that needs to be updated in case of
changes to the log messages.
Reviewers: John Roesler <vvcephei@apache.org>, Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
This patch adds a new system test which exercises the shrink/expansion process performed by the partition leader. It does so by introducing a network partition which isolates a broker from the other brokers in the cluster but not from the KRaft controller/ZK.
Reviewers: Jason Gustafson <jason@confluent.io>
Replace deprecated exactly_once_beta with exactly_once_v2 in system tests.
Follow-up for #10870: there were still some system tests using the deprecated exactly_once_beta, so this PR updates them.
Reviewers: Bruno Cadonna <cadonna@apache.org>
Clearing under-replicated partitions helps ensure that partitions do not remain unavailable longer than necessary as brokers are rolled. This prevents flakiness due to transaction timeouts.
Reviewers: Luke Chen <showuon@gmail.com>, Ismael Juma <ismael@juma.me.uk>
This patch ensures that the transaction message copier is fully started in `start_node`. Without this, it is possible that `stop_node` is called before the process is started which results in not stopping it at all.
Reviewers: Jason Gustafson <jason@confluent.io>
We increased the default session timeout to 30s in KIP-735:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-735%3A+Increase+default+consumer+session+timeout
Since then, we are observing sporadic system test failures
due to rebalances taking longer than the test timeout.
Rather than increase the test wait times, we can just override
the session timeout to a value more appropriate in the testing
domain.
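Concretely, the override amounts to supplying a smaller `session.timeout.ms` (plus a proportional heartbeat interval) in the consumer configuration used by the tests; a sketch with illustrative values, not the exact ones applied:
```
# Illustrative only: shrink the consumer session timeout for the test domain
# instead of stretching every wait in the test to cover the 30s default.
test_consumer_overrides = {
    "session.timeout.ms": 10000,     # well below the 30s KIP-735 default
    "heartbeat.interval.ms": 3000,   # keep the usual ~1/3 ratio
}

properties_file = "\n".join(f"{k}={v}" for k, v in test_consumer_overrides.items())
print(properties_file)
```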
Reviewers: A. Sophie Blee-Goldman <ableegoldman@apache.org>
The Streams static membership test has failed several times due to hitting the Kafka shutdown timeout, but the logs showed that the shutdown did actually succeed after the 60-second timeout.
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>, Walker Carlson <wcarlson@confluent.io>
Also adjusted the acceptable recovery lag to stabilize Streams tests.
Reviewers: Justine Olshan <jolshan@confluent.io>, Matthias J. Sax <mjsax@apache.org>, John Roesler <vvcephei@apache.org>
As of version 2.2.1, Kafka Streams uses message headers and
thus requires broker version 0.11.0 or newer.
Reviewers: John Roesler <john@confluent.io>, Ismael Juma <ismael@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>
This PR implements system tests in ducktape to test the ability of brokers and controllers to generate
and consume snapshots and catch up with the metadata log.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, José Armando García Sancio <jsancio@gmail.com>
These failures were caused by a46b82bea9. Details for each test:
* message_format_change_test: use IBP 2.8 so that we can write in older message
formats.
* compatibility_test_new_broker_test_failures: fix down-conversion path to handle
empty record batches correctly. The record scan in the old code ensured that
empty record batches were never down-converted, which hid this bug.
* upgrade_test: set the IBP 2.8 when message format is < 0.11 to ensure we are
actually writing with the old message format even though the test was passing
without the change.
Verified with ducker that some variants of these tests failed without these changes
and passed with them. Also added a unit test for the down-conversion bug fix.
Reviewers: Jason Gustafson <jason@confluent.io>
Replace the unsupported "describe topic via ZooKeeper" operation with "describe users" to fix the system tests.
For the upgrade_test case where TLS support is not required, use list_acls instead.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Currently, we verify the startup of a Streams client by checking the transition
from REBALANCING to RUNNING and whether the client processed some records
in the EOS system test. However, if the Streams client only
has standby tasks assigned, as can happen when the client is catching
up using warm-up replicas, the client will never process
records within the timeout of the startup verification. Hence, the test
will fail although everything is fine. This commit fixes this by reducing
the time to the next probing rebalance and by increasing the number of
max warm-up replicas. That way, the client's catch-up and the
subsequent processing of records should still complete within the startup
verification timeout of the client.
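The relevant Streams configs are `probing.rebalance.interval.ms` and `max.warmup.replicas`; a sketch of the kind of override involved, with illustrative values rather than the exact ones used:
```
# Illustrative only: make warm-up catch-up finish well within the startup
# verification timeout by probing more often and warming up more tasks at once.
streams_test_overrides = {
    "probing.rebalance.interval.ms": 60000,  # probe sooner than the 10 min default
    "max.warmup.replicas": 6,                # allow more tasks to warm up in parallel
}

for key, value in streams_test_overrides.items():
    print(f"{key}={value}")
```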
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
This patch fixes the ZooKeeperAuthorizerTest for KRaft. The system test was not
configuring/reconfiguring/restarting the remote controller quorum with the correct security settings.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch adds a sanity-check bounce system test for the case where we have 3
co-located KRaft controllers and fixes the system test code so that this case
will pass by starting brokers in parallel by default instead of serially. We
now also send SIGKILL to any running KRaft broker or controller nodes for the
co-located case when a majority of co-located controllers have been stopped --
otherwise they do not shut down, and we spin for the 60-second timeout. Finally,
this patch adds the ability to specify that certain brokers should not be
started when starting the cluster, and then we can start those nodes at a later
time via the add_broker() method call; this is going to be helpful for KRaft
snapshot system testing.
We were not testing the 3 co-located KRaft controller case previously, and it
would not pass because the first Kafka node would never be considered started.
We were starting the Kafka nodes serially, and we decide that a node has
successfully started when it logs a particular message. This message is not
logged until the broker has identified the controller (i.e. the leader of the
KRaft quorum). There cannot be a leader until a majority of the KRaft quorum
has started, so with 3 co-located controllers the first node could never be
considered "started" by the system test.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Implements KIP-745 https://cwiki.apache.org/confluence/display/KAFKA/KIP-745%3A+Connect+API+to+restart+connector+and+tasks to change the Connect REST API so that it restarts a connector and its tasks as a whole.
Testing strategy
- [x] Unit tests added for all possible combinations of onlyFailed and includeTasks
- [x] Integration tests added for all possible combinations of onlyFailed and includeTasks
- [x] System tests for happy path
Reviewers: Randall Hauch <rhauch@gmail.com>, Diego Erdody <erdody@gmail.com>, Konstantine Karantasis <k.karantasis@gmail.com>
Update the ZooKeeper version to v3.6.3. This requires adding dropwizard
as a new dependency.
Also, add Kafka v2.8.0 to the ducktape system test image.
Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>
The TestSecurityRollingUpgrade.test_disable_separate_interbroker_listener() system test had a design flaw: it was migrating inter-broker communication from a SASL_SSL listener to an SSL listener in one roll while immediately removing the SASL_SSL listener in that same roll. This requires two rolls because the existing SASL_SSL listener must remain available throughout the first roll so that unrolled brokers can continue to communicate with rolled brokers. This patch adds the second roll to this test and removes the original SASL_SSL listener on that second roll instead of the first one. The test was not failing all the time -- it was flaky.
The TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two() system test was not explicitly identifying the SASL mechanism to enable on a third port when that port was using SASL but the client security protocol was not SASL-based. This resulted in an empty sasl.enabled.mechanisms config for that third port, so when the cluster was rolled to take advantage of the third port for inter-broker communication, rolled brokers could potentially be unable to communicate with other, unrolled brokers (similar to above, this resulted in a flaky test).
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
This IT has been failing on trunk recently. Enabling EOS during the integration test
makes it easier to be sure that the test's assumptions are really true during verification
and should make the test more reliable.
I also noticed that in the actual system test file, we are using the deprecated property
name "beta" instead of "v2".
Reviewers: Boyang Chen <boyang@apache.org>
This PR includes adding the NamedTopology to the Subscription/AssignmentInfo, and to the StateDirectory so it can place NamedTopology tasks within the hierarchical structure with task directories under the NamedTopology parent dir.
Reviewers: Walker Carlson <wcarlson@confluent.io>, Guozhang Wang <guozhang@confluent.io>
This patch adds support for running the ZooKeeper-based
kafka.security.authorizer.AclAuthorizer with KRaft clusters. Set the
authorizer.class.name config as well as the zookeeper.connect config while also
setting the typical KRaft configs (node.id, process.roles, etc.), and the
cluster will use KRaft for metadata and ZooKeeper for ACL storage. A system
test that exercises the authorizer is included.
This patch also changes "Raft" to "KRaft" in several system test files. It also
fixes a bug where system test admin clients were unable to connect to a cluster
with broker credentials via the SSL security protocol when the broker was using
that for inter-broker communication and SASL for client communication.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>
Implement a striped replica placement algorithm for KRaft. This also
means implementing rack awareness. Previously, KRaft just chose
replicas randomly in a non-rack-aware fashion. Also, allow replicas to
be placed on fenced brokers if there are no other choices. This was
specified in KIP-631 but previously not implemented.
Reviewers: Jun Rao <junrao@gmail.com>
The StreamsNamedRepartitionTopicTest system test did not have the @cluster annotation and was therefore taking up the entire cluster. For example, we see this in the log output:
kafkatest.tests.streams.streams_named_repartition_topic_test.StreamsNamedRepartitionTopicTest.test_upgrade_topology_with_named_repartition_topic is using entire cluster. It's possible this test has no associated cluster metadata.
This PR adds the missing annotation.
Reviewers: Bill Bejeck <bbejeck@apache.org>
Ensure security protocol and sasl mechanism are updated in the cached SecurityConfig during rolling system tests. Also explicitly indicate which SASL mechanisms we wish to expose during the tests.
Reviewers: David Arthur <mumrah@gmail.com>
These were deprecated in Apache Kafka 2.4 (released in December 2019) to be replaced
by `org.apache.kafka.server.authorizer.Authorizer` and `AclAuthorizer`.
As part of KIP-500, we will implement a new `Authorizer` implementation that relies
on a topic (potentially a KRaft topic) instead of `ZooKeeper`, so we should take the chance
to remove related tech debt in 3.0.
Details on the issues affecting the old Authorizer interface can be found in the KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-504+-+Add+new+Java+Authorizer+Interface
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Ron Dagostino <rdagostino@confluent.io>
* Standardize license headers in scala, python, and gradle files.
* Relocate copyright attribution to the NOTICE.
* Add a license header check to `spotless` for scala files.
Reviewers: Ewen Cheslack-Postava <ewencp@apache.org>, Matthias J. Sax <mjsax@apache.org>, A. Sophie Blee-Goldman <ableegoldman@apache.org>
`Self-managed` is also used in the context of Cloud vs on-prem and it can
be confusing.
`KRaft` is a cute combination of `Kafka Raft` and it's pronounced like `craft`
(as in `craftsmanship`).
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Jose Sancio <jsancio@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>, Ron Dagostino <rdagostino@confluent.io>
KIP-500 is not particularly descriptive. I also tweaked the readme text a bit.
Tested that the readme for self-managed still works after these changes.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ron Dagostino <rdagostino@confluent.io>, Jason Gustafson <jason@confluent.io>
Change the ducktape system tests to support both ZK and raft topic IDs. Clarifies that
the IBP check applies to the ZK code path.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ron Dagostino <rdagostino@confluent.io>
ZooKeeper-related system tests in zookeeper_security_upgrade_test.py and
zookeeper_tls_test.py broke due to #10199. That patch changed the logic of
SecurityConfig.enabled_sasl_mechanisms() to only add the inter-broker SASL
mechanism when the inter-broker protocol was SASL_{PLAINTEXT,SSL}. The
inter-broker protocol is left to default to PLAINTEXT for the SecurityConfig
instance associated with Zookeeper since that value doesn't apply to ZooKeeper,
so the default inter-broker SASL mechanism of GSSAPI was not being added into
the set returned by enabled_sasl_mechanisms(). This is actually correct --
GSSAPI shouldn't be added since inter-broker communication is a Kafka concept
and doesn't apply to ZooKeeper. GSSAPI should be added when ZooKeeper uses it,
though -- which is the case in these tests. So the prior patch referred to
above uncovered a bug: we were relying on the default inter-broker SASL
mechanism to signal that Kerberos was being used by ZooKeeper even though the
inter-broker protocol has nothing to do with that determination in such cases.
This patch explicitly includes GSSAPI in the list of enabled SASL mechanisms
when SASL is enabled for use by ZooKeeper.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This test was failing when used with a Raft-based metadata quorum but succeeding with a
ZooKeeper-based quorum. This patch increases the consumers' session timeouts to 30 seconds,
which fixes the Raft case and also eliminates flakiness that has historically existed in the
Zookeeper case.
This patch also fixes a minor logging bug in RaftReplicaManager.endMetadataChangeDeferral() that
was discovered during the debugging of this issue, and it adds an extra logging statement in RaftReplicaManager.handleMetadataRecords() when a single metadata batch is applied to mirror
the same logging statement that occurs when deferred metadata changes are applied.
In the Raft system test case the consumer was sometimes receiving a METADATA response with just
1 alive broker, and then when that broker rolled the consumer wouldn't know about any alive nodes.
It would have to wait until the broker returned before it could reconnect, and by that time the group
coordinator on the second broker would have timed out the client and initiated a group rebalance. The
test explicitly checks that no rebalances occur, so the test would fail. It turns out that the reason why
the ZooKeeper configuration wasn't seeing rebalances was just plain luck. The brokers' metadata
caches in the ZooKeeper configuration show 1 alive broker even more frequently than the Raft
configuration does. If we tweak the metadata.max.age.ms value on the consumers we can easily
get the ZooKeeper test to fail, and in fact this system test has historically been flaky for the
ZooKeeper configuration. We can get the test to pass by setting session.timeout.ms=30000 (which
is longer than the roll time of any broker), or we can increase the broker count so that the client
never sees a METADATA response with just a single alive broker and therefore never loses contact
with the cluster for an extended period of time. We have plenty of system tests with 3+ brokers, so
we choose to keep this test with 2 brokers and increase the session timeout.
Reviewers: Ismael Juma <ismael@juma.me.uk>
The KIP-500 early access release will not support creating a partition with a manual
partition assignment that includes a broker that is not currently online. This patch disables
system tests for Raft-based metadata quorums where the test depends on this functionality
to pass.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Removed broker number checks for invalid replication factor when doing the forwarding, in order to reduce false alarms for clients.
Reviewers: Jason Gustafson <jason@confluent.io>
Fix some cases where we were erroneously using the configuration of the inter-broker
listener instead of the controller listener. Add the sasl.mechanism.controller.protocol
configuration key specified by KIP-631. Add some ducktape tests.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, David Arthur <mumrah@gmail.com>, Boyang Chen <boyang@confluent.io>
This patch updates request `listeners` tags to be in line with what the KIP-500 broker/controller support today. We will re-enable these APIs as needed once we have added the support.
I have also updated `ControllerApis` to use `ApiVersionManager` and simplified the envelope handling logic.
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Colin P. McCabe <cmccabe@apache.org>