The TestSecurityRollingUpgrade.test_disable_separate_interbroker_listener() system test had a design flaw: it migrated inter-broker communication from a SASL_SSL listener to an SSL listener and removed the SASL_SSL listener in the same roll. This migration requires two rolls, because the existing SASL_SSL listener must remain available throughout the first roll so that unrolled brokers can keep communicating with rolled brokers. This patch adds the second roll to the test and removes the original SASL_SSL listener on that second roll instead of the first. The test was not failing every time -- it was flaky.
The TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two() system test did not explicitly identify the SASL mechanism to enable on a third port when that port used SASL but the client security protocol was not SASL-based. This resulted in an empty sasl.enabled.mechanisms config applying to that third port, so when the cluster was rolled to use this third port for inter-broker communication, rolled brokers could be unable to communicate with other, unrolled brokers (as above, this made the test flaky).
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
This integration test has been failing on trunk recently. Enabling EOS in the test
makes it easier to be sure that the test's assumptions actually hold during verification,
and should make the test more reliable.
I also noticed that in the actual system test file, we are using the deprecated property
name "beta" instead of "v2".
Reviewers: Boyang Chen <boyang@apache.org>
This PR adds the NamedTopology to the Subscription/AssignmentInfo, and to the StateDirectory so that it can place NamedTopology tasks within the hierarchical structure, with task directories under the NamedTopology parent dir.
Reviewers: Walker Carlson <wcarlson@confluent.io>, Guozhang Wang <guozhang@confluent.io>
The following error happens on my Mac M1 when building the Docker image for system tests:
```
Collecting pynacl
Using cached PyNaCl-1.4.0.tar.gz (3.4 MB)
Installing build dependencies ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3 /usr/local/lib/python3.8/dist-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-k867aac0/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'setuptools>=40.8.0' wheel 'cffi>=1.4.1; python_implementation != '"'"'PyPy'"'"''
cwd: None
Complete output (14 lines):
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/pip/__main__.py", line 23, in <module>
from pip._internal.cli.main import main as _main # isort:skip # noqa
File "/usr/local/lib/python3.8/dist-packages/pip/_internal/cli/main.py", line 5, in <module>
import locale
File "/usr/lib/python3.8/locale.py", line 16, in <module>
import re
File "/usr/lib/python3.8/re.py", line 145, in <module>
class RegexFlag(enum.IntFlag):
AttributeError: module 'enum' has no attribute 'IntFlag'
----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 /usr/local/lib/python3.8/dist-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-k867aac0/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'setuptools>=40.8.0' wheel 'cffi>=1.4.1; python_implementation != '"'"'PyPy'"'"'' Check the logs for full command output.
```
There was a related issue, pypa/pip#9689, and it has already been fixed upstream (the fix is included in pip 21.1.1). I tested pip 21.1.1 and it works well on Mac M1.
Reviewers: Ismael Juma <ismael@juma.me.uk>
This patch adds support for running the ZooKeeper-based
kafka.security.authorizer.AclAuthorizer with KRaft clusters. Set the
authorizer.class.name config as well as the zookeeper.connect config while also
setting the typical KRaft configs (node.id, process.roles, etc.), and the
cluster will use KRaft for metadata and ZooKeeper for ACL storage. A system
test that exercises the authorizer is included.
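For illustration, the combined configuration might look roughly like the following (the key names come from the description above; the values and the dict representation are only a sketch, not taken from the patch):
```python
# Sketch of broker settings for KRaft metadata + ZooKeeper-backed ACL storage.
kraft_with_zk_acls = {
    "process.roles": "broker",
    "node.id": "1",
    "controller.quorum.voters": "1000@controller-0:9093",  # typical KRaft wiring (illustrative)
    "authorizer.class.name": "kafka.security.authorizer.AclAuthorizer",
    "zookeeper.connect": "zk-0:2181",  # used only for ACL storage in this setup
}

for key, value in sorted(kraft_with_zk_acls.items()):
    print("%s=%s" % (key, value))
```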
This patch also changes "Raft" to "KRaft" in several system test files. It also
fixes a bug where system test admin clients were unable to connect to a cluster
with broker credentials via the SSL security protocol when the broker was using
that for inter-broker communication and SASL for client communication.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>
Implement a striped replica placement algorithm for KRaft. This also
means implementing rack awareness. Previously, KRaft just chose
replicas randomly in a non-rack-aware fashion. Also, allow replicas to
be placed on fenced brokers if there are no other choices. This was
specified in KIP-631 but previously not implemented.
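To make the placement idea concrete, here is a small illustrative sketch of a rack-aware striped assignment; it is not the actual KRaft placer, just a toy version of the idea:
```python
import itertools

def striped_assignment(partition_count, replication_factor, brokers_by_rack):
    """Interleave brokers across racks, then give each partition a consecutive
    window of that striped list so its replicas land on different racks."""
    striped = [b for group in itertools.zip_longest(*brokers_by_rack.values())
               for b in group if b is not None]
    return {p: [striped[(p + i) % len(striped)] for i in range(replication_factor)]
            for p in range(partition_count)}

# Example: 3 racks with 2 brokers each; every partition gets one replica per rack.
print(striped_assignment(3, 3, {"r1": [0, 3], "r2": [1, 4], "r3": [2, 5]}))
```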
Reviewers: Jun Rao <junrao@gmail.com>
The StreamsNamedRepartitionTopicTest system test did not have the @cluster annotation and was therefore taking up the entire cluster. For example, we see this in the log output:
kafkatest.tests.streams.streams_named_repartition_topic_test.StreamsNamedRepartitionTopicTest.test_upgrade_topology_with_named_repartition_topic is using entire cluster. It's possible this test has no associated cluster metadata.
This PR adds the missing annotation.
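For reference, the ducktape annotation looks like the following (the node count here is illustrative, not necessarily the value used in the PR):
```python
from ducktape.mark.resource import cluster
from ducktape.tests.test import Test

class StreamsNamedRepartitionTopicTest(Test):
    @cluster(num_nodes=8)  # without this annotation the test grabs the entire cluster
    def test_upgrade_topology_with_named_repartition_topic(self):
        pass
```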
Reviewers: Bill Bejeck <bbejeck@apache.org>
Ensure security protocol and sasl mechanism are updated in the cached SecurityConfig during rolling system tests. Also explicitly indicate which SASL mechanisms we wish to expose during the tests.
Reviewers: David Arthur <mumrah@gmail.com>
These were deprecated in Apache Kafka 2.4 (released in December 2019) to be replaced
by `org.apache.kafka.server.authorizer.Authorizer` and `AclAuthorizer`.
As part of KIP-500, we will implement a new `Authorizer` implementation that relies
on a topic (potentially a KRaft topic) instead of `ZooKeeper`, so we should take the chance
to remove related tech debt in 3.0.
Details on the issues affecting the old Authorizer interface can be found in the KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-504+-+Add+new+Java+Authorizer+Interface
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Ron Dagostino <rdagostino@confluent.io>
* Standardize license headers in scala, python, and gradle files.
* Relocate copyright attribution to the NOTICE.
* Add a license header check to `spotless` for scala files.
Reviewers: Ewen Cheslack-Postava <ewencp@apache.org>, Matthias J. Sax <mjsax@apache.org>, A. Sophie Blee-Goldman <ableegoldman@apache.org>
`Self-managed` is also used in the context of Cloud vs on-prem and it can
be confusing.
`KRaft` is a cute combination of `Kafka Raft` and it's pronounced like `craft`
(as in `craftsmanship`).
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Jose Sancio <jsancio@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>, Ron Dagostino <rdagostino@confluent.io>
KIP-500 is not particularly descriptive. I also tweaked the readme text a bit.
Tested that the readme for self-managed still works after these changes.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ron Dagostino <rdagostino@confluent.io>, Jason Gustafson <jason@confluent.io>
Change the ducktape system tests to support both ZK and raft topic IDs. Clarifies that
the IBP check applies to the ZK code path.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ron Dagostino <rdagostino@confluent.io>
ZooKeeper-related system tests in zookeeper_security_upgrade_test.py and
zookeeper_tls_test.py broke due to #10199. That patch changed the logic of
SecurityConfig.enabled_sasl_mechanisms() to only add the inter-broker SASL
mechanism when the inter-broker protocol was SASL_{PLAINTEXT,SSL}. The
inter-broker protocol is left to default to PLAINTEXT for the SecurityConfig
instance associated with ZooKeeper since that value doesn't apply to ZooKeeper,
so the default inter-broker SASL mechanism of GSSAPI was not being added into
the set returned by enabled_sasl_mechanisms(). This is actually correct --
GSSAPI shouldn't be added since inter-broker communication is a Kafka concept
and doesn't apply to ZooKeeper. GSSAPI should be added when ZooKeeper uses it,
though -- which is the case in these tests. So the prior patch referred to
above uncovered a bug: we were relying on the default inter-broker SASL
mechanism to signal that Kerberos was being used by ZooKeeper even though the
inter-broker protocol has nothing to do with that determination in such cases.
This patch explicitly includes GSSAPI in the list of enabled SASL mechanisms
when SASL is enabled for use by ZooKeeper.
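A rough sketch of the intended behavior (the function signature and parameter names here are illustrative, not the actual SecurityConfig code):
```python
def enabled_sasl_mechanisms(client_protocol, client_mechanism,
                            interbroker_protocol, interbroker_mechanism,
                            zk_sasl_enabled):
    sasl_protocols = ("SASL_PLAINTEXT", "SASL_SSL")
    mechanisms = set()
    if client_protocol in sasl_protocols:
        mechanisms.add(client_mechanism)
    if interbroker_protocol in sasl_protocols:
        mechanisms.add(interbroker_mechanism)
    if zk_sasl_enabled:
        mechanisms.add("GSSAPI")  # ZooKeeper's Kerberos usage is independent of inter-broker settings
    return mechanisms

# The ZooKeeper case from these tests: inter-broker protocol defaults to PLAINTEXT,
# yet GSSAPI must still be enabled because ZooKeeper itself uses SASL/Kerberos.
print(enabled_sasl_mechanisms("PLAINTEXT", None, "PLAINTEXT", "GSSAPI", True))
```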
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This test was failing when used with a Raft-based metadata quorum but succeeding with a
ZooKeeper-based quorum. This patch increases the consumers' session timeouts to 30 seconds,
which fixes the Raft case and also eliminates flakiness that has historically existed in the
ZooKeeper case.
This patch also fixes a minor logging bug in RaftReplicaManager.endMetadataChangeDeferral() that
was discovered during the debugging of this issue, and it adds an extra logging statement in RaftReplicaManager.handleMetadataRecords() when a single metadata batch is applied to mirror
the same logging statement that occurs when deferred metadata changes are applied.
In the Raft system test case the consumer was sometimes receiving a METADATA response with just
1 alive broker, and then when that broker rolled the consumer wouldn't know about any alive nodes.
It would have to wait until the broker returned before it could reconnect, and by that time the group
coordinator on the second broker would have timed out the client and initiated a group rebalance. The
test explicitly checks that no rebalances occur, so the test would fail. It turns out that the reason why
the ZooKeeper configuration wasn't seeing rebalances was just plain luck. The brokers' metadata
caches in the ZooKeeper configuration show 1 alive broker even more frequently than the Raft
configuration does. If we tweak the metadata.max.age.ms value on the consumers we can easily
get the ZooKeeper test to fail, and in fact this system test has historically been flaky for the
ZooKeeper configuration. We can get the test to pass by setting session.timeout.ms=30000 (which
is longer than the roll time of any broker), or we can increase the broker count so that the client
never sees a METADATA response with just a single alive broker and therefore never loses contact
with the cluster for an extended period of time. We have plenty of system tests with 3+ brokers, so
we choose to keep this test with 2 brokers and increase the session timeout.
Reviewers: Ismael Juma <ismael@juma.me.uk>
The KIP-500 early access release will not support creating a partition with a manual
partition assignment that includes a broker that is not currently online. This patch disables
system tests for Raft-based metadata quorums where the test depends on this functionality
to pass.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Removed broker number checks for invalid replication factor when doing the forwarding, in order to reduce false alarms for clients.
Reviewers: Jason Gustafson <jason@confluent.io>
Fix some cases where we were erroneously using the configuration of the inter broker
listener instead of the controller listener. Add the sasl.mechanism.controller.protocol
configuration key specified by KIP-631. Add some ducktape tests.
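For illustration, the new key would sit alongside the other controller security settings, roughly like this (the values and surrounding keys are examples, not taken from the patch):
```python
# sasl.mechanism.controller.protocol is the key added here (per KIP-631);
# the other entries are illustrative context only.
controller_security_overrides = {
    "controller.listener.names": "CONTROLLER",
    "listener.security.protocol.map": "CONTROLLER:SASL_PLAINTEXT",
    "sasl.mechanism.controller.protocol": "PLAIN",
}
print(controller_security_overrides)
```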
Reviewers: Colin P. McCabe <cmccabe@apache.org>, David Arthur <mumrah@gmail.com>, Boyang Chen <boyang@confluent.io>
This patch updates request `listeners` tags to be in line with what the KIP-500 broker/controller support today. We will re-enable these APIs as needed once we have added the support.
I have also updated `ControllerApis` to use `ApiVersionManager` and simplified the envelope handling logic.
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Colin P. McCabe <cmccabe@apache.org>
Add the necessary test annotations to test the new KIP-500 quorum broker mode
in many of our ducktape tests. This mode is tested in addition to the classic
Apache ZooKeeper mode.
This PR also adds a new sanity_checks/bounce_test.py system test that runs
through a simple produce/bounce/produce series of events.
Finally, this PR adds @cluster annotations to dozens of system tests that were
missing them. The lack of this annotation was causing these tests to grab the
entire cluster of nodes. Adding the @cluster annotation dramatically reduced
the time needed to run these tests.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>
This PR adds the KIP-500 BrokerServer and ControllerServer classes and
makes some related changes to get them working. Note that the ControllerServer
does not instantiate a QuorumController object yet, since that will be added in
PR #10070.
* Add BrokerServer and ControllerServer
* Change ApiVersions#computeMaxUsableProduceMagic so that it can handle
endpoints which do not support PRODUCE (such as KIP-500 controller nodes)
* KafkaAdminClientTest: fix some lingering references to decommissionBroker
that should be references to unregisterBroker.
* Make some changes to allow SocketServer to be used by ControllerServer as well
as by the broker.
* We now return a random active Broker ID as the Controller ID in
MetadataResponse for the Raft-based case as per KIP-590.
* Add the RaftControllerNodeProvider
* Add EnvelopeUtils
* Add MetaLogRaftShim
* In ducktape, in config_property.py: use a KIP-500 compatible cluster ID.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, David Arthur <mumrah@gmail.com>
We need to be able to run system tests with Raft-based metadata quorums -- both
co-located brokers and controllers as well as remote controllers -- in addition to the
ZooKeeper-based mode we run today. This PR adds this capability to KafkaService in a
backwards-compatible manner as follows.
If no changes are made to existing system tests then they function as they always do --
they instantiate ZooKeeper, and Kafka will use ZooKeeper. On the other hand, if we want
to use a Raft-based metadata quorum we can do so by introducing a metadata_quorum
argument to the test method and using @matrix to set it to the quorums we want to use for
the various runs of the test. We then also have to skip creating a ZooKeeperService when
the quorum is Raft-based.
This PR does not update any tests -- those will come later after all the KIP-500 code is
merged.
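For example, an opted-in test could look roughly like this (the test class and the quorum identifiers below are placeholders, not actual test code):
```python
from ducktape.mark import matrix
from ducktape.tests.test import Test

class ExampleQuorumTest(Test):                       # hypothetical test
    @matrix(metadata_quorum=["ZK", "REMOTE_RAFT"])   # placeholder quorum names
    def test_produce_consume(self, metadata_quorum):
        if metadata_quorum == "ZK":
            pass  # start a ZooKeeperService as before
        else:
            pass  # skip ZooKeeper; KafkaService brings up the Raft-based quorum
```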
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch implements KIP-635 which mainly adds support for querying offsets of multiple topics/partitions.
Reviewers: David Jacot <djacot@confluent.io>
ducktape 0.8.1 was updated to include the following changes/fixes from the 0.7.x branch:
* Junit reporting support
* Fix for an issue where unicode characters in an exception message would cause the test runner to hang on py27.
Reviewers: Konstantine Karantasis <k.karantasis@gmail.com>
Topics processed by the controller and newly created topics are only given topic IDs if the inter-broker protocol version on the controller is at least 2.8. This PR also adds a Kafka config to specify whether the IBP is greater than or equal to 2.8. System tests have been modified to include topic ID checks for upgrade/downgrade tests. This PR also adds a new integration test file for requests/responses that are not gated by the IBP (e.g., metadata).
Reviewers: dengziming <dengziming1993@gmail.com>, Lucas Bradstreet <lucas@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
We have seen recent system test timeouts associated with this test.
Analysis revealed an excessive amount of time spent searching
for test conditions in the logs.
This change addresses the issue by dropping some unnecessary
checks and using a more efficient log search mechanism.
Reviewers: Bill Bejeck <bbejeck@apache.org>, Guozhang Wang <guozhang@apache.org>
In Python 3, `filter` returns an iterator rather than a `list`, so it can be traversed only once. Hence, in the following loop only the first task "sees" the messages; subsequent tasks see empty results and validation fails.
```python
src_messages = self.source.committed_messages()  # returns an iterator in Python 3
sink_messages = self.sink.flushed_messages()     # returns an iterator in Python 3
for task in range(num_tasks):
    # only the first task can "see" the result; the following tasks see an empty result
    src_seqnos = [msg['seqno'] for msg in src_messages if msg['task'] == task]
```
Reference: https://portingguide.readthedocs.io/en/latest/iterators.html#new-behavior-of-map-and-filter.
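Presumably the fix is simply to materialize the iterators before the per-task loop, along these lines:
```python
src_messages = list(self.source.committed_messages())  # materialize so every task can iterate
sink_messages = list(self.sink.flushed_messages())
```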
Reviewers: Jason Gustafson <jason@confluent.io>
SSH outputs in system tests originating from paramiko are bytes. However, the logger in the system tests does not accept bytes and instead throws an exception. That means the bytes returned as SSH output from paramiko need to be converted to a type that the logger (or other objects) can process.
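A minimal sketch of such a conversion (the helper name is hypothetical):
```python
def to_text(raw):
    """Convert paramiko byte output into str so the logger can handle it."""
    return raw.decode("utf-8") if isinstance(raw, bytes) else raw

print(to_text(b"worker1 | starting broker\n"))
```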
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This PR introduces a Streams-specific uncaught exception handler that currently has the option to close the client or the application. If both the new handler and the old handler (the Java thread handler) are set, the old handler will be ignored and an error will be logged.
The application shutdown is achieved through the rebalance protocol.
Reviewers: Bruno Cadonna <cadonna@confluent.io>, Leah Thomas <lthomas@confluent.io>, John Roesler <john@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
This newly added system test verifies that, with the fix in #9270, the member.id update caused by a static member rejoin is persisted correctly.
Reviewers: Boyang Chen <boyang@confluent.io>
Increase the amount of time available to the `test_verifiable_producer` test to login and get the process name for the verifiable producer from 5 seconds to 10 seconds.
We were seeing some test failures due to the assertion failing because the verifiable producer would complete before we could login, list the processes, and parse out the producer version. Previously, we were giving this operation 5 seconds to run; this PR bumps it up to 10 seconds.
I verified locally that this does not flake, but even at 5 seconds I wasn't seeing any flakes. Ultimately we should find a better strategy than racing to query the producer process (as outlined in the existing comments).
Reviewers: Jason Gustafson <jason@confluent.io>
KIP-431 (#9099) changed the format of console consumer output to `Partition:$PARTITION\t$VALUE` whereas previously the output format was `$VALUE\t$PARTITION`. This PR updates the message verifier to accommodate the updated console consumer output format.
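A sketch of handling both output formats described above (the function name and flag are illustrative):
```python
def parse_line(line, new_format=True):
    if new_format:      # "Partition:$PARTITION\t$VALUE"
        partition_field, value = line.split("\t", 1)
        partition = int(partition_field.split(":", 1)[1])
    else:               # "$VALUE\t$PARTITION"
        value, partition_field = line.rsplit("\t", 1)
        partition = int(partition_field)
    return partition, value

print(parse_line("Partition:0\t12345"))            # (0, '12345')
print(parse_line("12345\t0", new_format=False))    # (0, '12345')
```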
The system test StreamsUpgradeTest.test_version_probing_upgrade tries to verify the wrong version for version probing.
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
quota_test.py tests are failing with the error below.
```
23:24:42 [INFO:2020-10-24 17:54:42,366]: RunnerClient: kafkatest.tests.client.quota_test.QuotaTest.test_quota.quota_type=user.override_quota=False: FAIL: not enough arguments for format string
23:24:42 Traceback (most recent call last):
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/tests/runner_client.py", line 134, in run
23:24:42 data = self.run_test()
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/tests/runner_client.py", line 192, in run_test
23:24:42 return self.test_context.function(self.test)
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/mark/_mark.py", line 429, in wrapper
23:24:42 return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/client/quota_test.py", line 141, in test_quota
23:24:42 self.quota_config = QuotaConfig(quota_type, override_quota, self.kafka)
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/client/quota_test.py", line 60, in __init__
23:24:42 self.configure_quota(kafka, self.producer_quota, self.consumer_quota, ['users', None])
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/client/quota_test.py", line 83, in configure_quota
23:24:42 (kafka.kafka_configs_cmd_with_optional_security_settings(node, force_use_zk_conection), producer_byte_rate, consumer_byte_rate)
23:24:42 TypeError: not enough arguments for format string
23:24:42
```
Ran the tests locally.
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: David Jacot <djacot@confluent.io>, Ron Dagostino <rndgstn@gmail.com>
Closes #9496 from omkreddy/quota-tests
Fix Vagrant for system tests with Python 3.
Author: Nikolay Izhikov <nizhikov@apache.org>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes #9480 from nizhikov/KAFKA-10592
This PR adds missing broker ACLs required to create topics and SCRAM credentials when ACLs are enabled for a system test. It also adds support for using PLAINTEXT as the inter-broker security protocol when using SCRAM from the client in a system test with a secured cluster -- without this it would always be necessary to set both the inter-broker and client mechanisms to a SCRAM mechanism. Also contains some refactoring to make assumptions clearer.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Currently, Kafka system tests use Python 2, which is outdated and no longer supported.
This PR upgrades the system tests to Python 3.
Reviewers: Ivan Daschinskiy, Mickael Maison <mickael.maison@gmail.com>, Magnus Edenhill <magnus@edenhill.se>, Guozhang Wang <wangguoz@gmail.com>
The test StreamsBrokerBounceTest.test_all_brokers_bounce() fails on
2.5 because in the last stage of the test there is only one broker
left and the offset commit cannot succeed because the
min.insync.replicas of __consumer_offsets is set to 2 and acks is
set to all. This causes a timeout and extends the closing of the
Kafka Streams client to beyond the duration passed to the close
method of the client.
This affects especially the 2.5 branch since there Kafka Streams
commits offsets for each task, i.e., close() needs to wait for the
timeout for each task. In 2.6 and trunk the offset commit is done
per thread, so close() only needs to wait for one timeout per
stream thread.
I opened this PR on trunk, since the test could also become
flaky on trunk and we want to avoid diverging system tests across
branches.
A more complete solution would be to improve the test by defining
a better success criteria.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
`openjdk:8` includes `git` by default, but `openjdk:11` does not. Install `git` explicitly to make it easier to
test with newer openjdk versions.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Implement the KIP-554 API to create, describe, and alter SCRAM user configurations via the AdminClient. Add ducktape tests, and modify JUnit tests to test and use the new API where appropriate.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Rajini Sivaram <rajinisivaram@googlemail.com>
ducktape diff: https://github.com/confluentinc/ducktape/compare/v0.7.8...v0.7.9
- bcrypt (a dependency of ducktape) dropped Python2.7 support.
ducktape-0.7.9 now pins bcrypt to a Python2.7-supported version.
Author: Andrew Egelhofer <aegelhofer@confluent.io>
Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes #9192 from andrewegel/trunk
A system test failed with the following error: global name 'self' is not defined
The reason was that `self` was accessed to log a message in a static method. This commit makes the method an instance method.
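A minimal illustration of the bug and the fix (the class and method names are hypothetical):
```python
import logging

class Processor:
    logger = logging.getLogger("processor")

    @staticmethod
    def log_broken(msg):
        self.logger.info(msg)   # NameError: 'self' is not defined in a static method

    def log_fixed(self, msg):   # as an instance method, self is available
        self.logger.info(msg)

Processor().log_fixed("started")
# Processor.log_broken("started")  # would raise the error quoted above
```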
Reviewer: Matthias J. Sax <matthias@confluent.io>
KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at least as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a Produce response.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Boyang Chen <boyang@confluent.io>, Jun Rao <junrao@gmail.com>
- After #8312, older brokers return empty configs when the latest `adminClient.describeConfigs` is used. Old brokers receive empty configNames in the `AdminManager.describeConfigs()` method, and since older brokers do not handle empty configKeys, they filter out all the configs.
- Update ClientCompatibilityTest to verify describe configs
- Add a test case to test describe configs with empty configuration keys
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes #9046 from omkreddy/KAFKA-9432
Currently, the system tests `connect_distributed_test` and `connect_rest_test` only wait for the REST api to come up.
The startup of the worker includes an asynchronous process for joining the worker group and syncing with other workers.
There are some situations in which this sync takes an unusually long time, and the test continues without all workers up.
This leads to flaky test failures, as worker joins are not given sufficient time to time out and retry unless we wait explicitly.
This changes the `ConnectDistributedTest` to wait for the "Joined group" message to be printed to the logs before continuing with tests. I've activated this behavior by default, as it's a superset of the checks that were performed by default before.
This log message is present in every version of DistributedHerder that I could find, in slightly different forms, but always with `Joined group` at the beginning of the log message. This change should be safe to backport to any branch.
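A rough sketch of that wait (the log path and helper name are illustrative, not the exact test code):
```python
from ducktape.utils.util import wait_until

def wait_for_join(worker_node, log_file="/mnt/connect/connect.log", timeout_sec=60):
    wait_until(
        lambda: list(worker_node.account.ssh_capture(
            "grep -i 'Joined group' %s" % log_file, allow_fail=True)),
        timeout_sec=timeout_sec,
        err_msg="Worker did not join the Connect group in time")
```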
Signed-off-by: Greg Harris <gregh@confluent.io>
Author: Greg Harris <gregh@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>
The test case `OffsetValidationTest.test_fencing_static_consumer` fails periodically due to this error:
```
Traceback (most recent call last):
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 134, in run
data = self.run_test()
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 192, in run_test
return self.test_context.function(self.test)
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/mark/_mark.py", line 429, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/tests/kafkatest/tests/client/consumer_test.py", line 257, in test_fencing_static_consumer
assert len(consumer.dead_nodes()) == num_conflict_consumers
AssertionError
```
When a consumer stops, there is some latency between when the shutdown is observed by the service and when the node is added to the dead nodes. This patch fixes the problem by giving some time for the assertion to be satisfied.
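The approach is roughly the following, wrapping the assertion from the traceback in a retriable wait (the timeout value here is illustrative, and `consumer`/`num_conflict_consumers` are the existing test variables):
```python
from ducktape.utils.util import wait_until

wait_until(lambda: len(consumer.dead_nodes()) == num_conflict_consumers,
           timeout_sec=10,
           err_msg="Timed out waiting for the conflicting consumers to shut down")
```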
Reviewers: Boyang Chen <boyang@confluent.io>
security_rolling_upgrade_test may change the security listener and then restart Kafka servers. has_sasl and has_ssl become out of date due to the cached _security_config. This PR offers a simple fix: always check for changes to the port mapping and then update the sasl/ssl flags accordingly.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
Reduce the transaction timeout to clean up unstable offsets more quickly. In hard_bounce mode, the transactional client is killed ungracefully and hence produces unstable offsets, which prevent the TransactionalMessageCopier from receiving the group's position.
Reviewers: Jun Rao <junrao@gmail.com>
Call KafkaStreams#cleanUp to reset local state before starting the application for the second run.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Boyang Chen <boyang@confluent.io>, John Roesler <john@confluent.io>
Most of the values in the metadata upgrade test matrix are just testing
the upgrade/downgrade path between two previous releases. This is
unnecessary. We run the tests for all supported branches, so what we
should test is the up-/down-gradability of released versions with respect
to the current branch.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
There are two new configs introduced by 371f14c3c1 and 1c4eb1a575 so we have to update the expected configs in the connect_rest_test.py system test too.
Reviewer: Konstantine Karantasis <konstantine@confluent.io>
Replaces the previous upgrade test's trivial Streams app
with the commonly used SmokeTest, exercising many more
features. Also adjust the test matrix to test upgrading
from each released version since 2.2 to the current branch.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
After 3661f981fff2653aaf1d5ee0b6dde3410b5498db, security_config is cached. Hence, later changes to the security flags can't affect the security_config used by subsequent tests.
issue: https://issues.apache.org/jira/browse/KAFKA-10214
Author: Chia-Ping Tsai <chia7712@gmail.com>
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes #8949 from chia7712/KAFKA-10214
During Streams' system tests, the PIDs of the Streams
clients are collected. The method that collects the PIDs
swallows any exception that might be thrown by the
ssh_capture() function. Swallowing exceptions
might make the investigation of failures harder,
because no information about what happened is recorded.
Reviewers: John Roesler <vvcephei@apache.org>
1. Enable `TLSv1.3` by default with Java 11 or newer.
2. Add unit tests that cover the various TLSv1.2 and TLSv1.3 combinations.
3. Extend `benchmark_test.py` and `replication_test.py` to run with 'TLSv1.2'
or 'TLSv1.3'.
Reviewers: Ismael Juma <ismael@juma.me.uk>
We have been seeing increased flakiness in transaction system tests. I believe the cause might be due to KIP-537, which increased the default zk session timeout from 6s to 18s and the default replica lag timeout from 10s to 30s. In the system test, we use the default transaction timeout of 10s. However, since the system test involves hard failures, the Produce request could be blocking for as long as the max of these two in order to wait for an ISR shrink. Hence this patch increases the timeout to 30s.
Note this patch also includes a minor logging fix in `Partition`. Previously we would see messages like the following:
```
[Broker id=3] Leader output-topic-0 starts at leader epoch 0 from offset 0 with high watermark 0 ISR 3,2,1 addingReplicas removingReplicas .Previous leader epoch was -1.
```
This patch fixes the log to print as the following:
```
[Broker id=3] Leader output-topic-0 starts at leader epoch 0 from offset 0 with high watermark 0 ISR [3,2,1] addingReplicas [] removingReplicas []. Previous leader epoch was -1.
```
Reviewers: Bob Barrett <bob.barrett@confluent.io>, Ismael Juma <github@juma.me.uk>
kafka_log4j_appender.py was broken on JDK11 by befd80b38.
`fix_opts_for_new_jvm` requires `node.version` to be set, so we
add the relevant code to the test.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Also fixes a system test by configuring the HighAvailabilityTaskAssignor (HATA) to perform a one-shot balanced assignment
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bruno Cadonna <bruno@confluent.io>
Prior to this fix, a plugged-in verifiable client, such as
confluent-kafka-python, would be deployed on the node in the background
worker thread as the client was started. Since this could be time consuming
(e.g., 10+ seconds) and since the main test thread would continue to
operate, it was common for the current test to time out waiting
for e.g. the verifiable producer to produce messages while it was in fact
still deploying.
The fix here is to deploy the verifiable client on the node when
the verifiable client is instantiated, which is thus a blocking
operation on the main test thread, avoiding any test-based timeouts.
Reviewers: Jason Gustafson <jason@confluent.io>
Generalize the verification in the upgrade test so that it
does not rely on the task assignor's behavior.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, John Roesler <vvcephei@apache.org>
The downgrade test does not currently support 2.4 and 2.5. When you enable them, it fails as a result of consumer group static membership. This PR makes the downgrade test work with all of our released versions again.
Author: Lucas Bradstreet <lucas@confluent.io>
Reviewers: Boyang Chen, Gwen Shapira
Closes #8518 from lbradstreet/downgrade-test-2.4-2.5
* add a config to set the TaskAssignor
* set the default assignor to HighAvailabilityTaskAssignor
* fix broken tests (with some TODOs in the system tests)
Implements: KIP-441
Reviewers: Bruno Cadonna <bruno@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>
I added a change to the upgrade test a while back that would make it wait for
ISR rejoin before rolls. This prevents incompatible brokers from charging through a
bad roll and disguising a downgrade problem.
We now also check for protocol errors in the broker logs.
Reviewers: Boyang Chen <boyang@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This fixes a version pinning issue where a transitive dependency had a
major version upgrade that a dependency did not account for, breaking
the build.
Reviewers: Andrew Egelhofer <aegelhofer@confluent.io>, Matthias J. Sax <matthias@confluent.io>
Revert the decision to have the sendOffsetsToTransaction(groupMetadata) API fail with old broker versions, for the sake of making applications easier to adapt between versions. This PR silently downgrades the TxnOffsetCommit API when the built request version is smaller than 3.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Startup scripts for older versions of Kafka contain JVM options that have since been removed, like `-XX:+PrintGCDateStamps` or `-XX:+UseParNewGC`.
When system tests run on a JVM that doesn't support these options, we should set up
environment variables with the correct options.
Reviewers: Guozhang Wang <guozhang@confluent.io>, Ron Dagostino <rdagostino@confluent.io>, Ismael Juma <ismael@juma.me.uk>
* Replace Prev/Standby task lists with a representation of the current position
of all tasks, where each task is encoded as the sum of the positions of all the
changelogs in that task.
* Only the protocol change is implemented, not actual positions, and the
assignor is updated to translate the new protocol back to lists of Prev/Standby
tasks so that the current assignment protocol still functions without modification.
Implements: KIP-441
Reviewers: John Roesler <vvcephei@apache.org>, Bruno Cadonna <bruno@confluent.io>
As described in KIP-568.
Waiting on acceptance of the KIP to write the tests, on the off chance something changes. But rest assured unit tests are coming ⚡️
Will also kick off existing Streams system tests which leverage this new API (eg version probing, sometimes broker bounce)
Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
What/why:
The throttling_test was broken by PR #7785, since it depends on the consumer having partitions assigned before starting the producer.
This PR provides the ability to wait for partitions to be assigned in the console consumer before considering it started.
Caveat:
This does not support starting the JmxTool inside the console consumer for custom metrics while using this wait_until_partitions_assigned flag, since the code assumes one JmxTool running per node.
I think a proper fix for this would be to make JmxTool its own standalone single-node service.
Alternatives:
We could use the EndToEnd test suite, which uses the verifiable producer/consumer under the hood, but I found that more changes were necessary to get this working, unfortunately (specifically, this test suite doesn't seem to play nicely with the ProducerPerformanceService).
Reviewers: Mathew Wong <mwong@confluent.io>, Bill Bejeck <bbejeck.com>
The test_throttled_reassignment test fails because the consumer used to validate the reassignment does not start in time to consume all messages. This does not seem to be an issue with the throttling of the reassignment, since increasing the timeout allowed the test to pass multiple consecutive runs locally.
This test seemed to rely on the default JmxTool for the console consumer that was removed in this commit: 179d0d7
The console consumer would check to see if it had partitions assigned to it before beginning to consume. Although the test occasionally failed with the JmxTool, it began to fail much more after the removal.
Error messages of the failures followed the format below, with varying numbers of missed messages; the missing messages are the first ones sent by the producer.
535 acked message did not make it to the Consumer. They are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19...plus 515 more. Total Acked: 192792, Total Consumed: 192259. We validated that the first 535 of these missing messages correctly made it into Kafka's data files. This suggests they were lost on their way to the consumer.
In the scope of the test, this error suggests that the test is falling into the race condition described in produce_consume_validate.py, which has the timeout to prevent the consumer from missing initial messages.
This can serve as a temporary fix until the logic of consumer startup is addressed further.
Reviewers: Jason Gustafson <jason@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
Newer versions of Java have added checks to ensure that trust anchors are CA certificates and contain proper extensions. This PR adds Basic Constraints extension with the CA field set to true for system tests.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
This change has two main components:
1. Extend the existing transactions_test.py to also exercise the new sendTxnOffsets(groupMetadata) API, to make sure we are not introducing any regression or compatibility issue.
a. We shrink the txn timeout scheduler's window on the broker to 10 seconds so that we can trigger expiration sooner rather than later.
2. Create a completely new system test class called group_mode_transactions_test, which is more complicated than the existing system test, as it takes rebalances into consideration and uses multiple partitions instead of one. In more detail:
a. The message count is done at the partition level instead of globally, as we need to verify the per-partition order throughout the test. To that end, we extend ConsoleConsumer to also print the partition of each record so that the message copier can interpret the per-partition data.
b. The progress count includes the time for completing the pending txn offset expiration.
c. More visibility and feature improvements to TransactionMessageCopier so that it works better in either standalone or group mode.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
This PR is a collaboration between Guozhang Wang and John Roesler. It is a significant tech-debt cleanup of task management and state management, broken down into the following sub-tasks:
1. Extract the embedded clients (producer and consumer) out of StreamTask into RecordCollector (guozhangwang#2, guozhangwang#5).
2. Consolidate the standby updating and active restoring logic into ChangelogReader and extract it out of StreamThread (guozhangwang#3, guozhangwang#4).
3. Introduce a Task state life cycle (created, restoring, running, suspended, closing), and refactor the task operations based on the current state (guozhangwang#6, guozhangwang#7).
4. Consolidate AssignedTasks into TaskManager and simplify the logic of changelog management and task management, since they were already moved in steps 2) and 3) (guozhangwang#8, guozhangwang#9).
5. Simplify the StreamThread logic a bit, as the embedded clients / changelog restoration logic has been moved out in steps 1) and 2) (guozhangwang#10).
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bruno Cadonna <bruno@confluent.io>, Boyang Chen <boyang@confluent.io>
Deprecate existing metadata query APIs in favor of new
ones that include standby hosts as well as partition
information.
Closes: #7960
Implements: KIP-535
Co-authored-by: Navinder Pal Singh Brar <navinder_brar@yahoo.com>
Reviewed-by: John Roesler <vvcephei@apache.org>
Do not initialize `JmxTool` by default when running console consumer. In order to support this, we remove `has_partitions_assigned` and its only usage in an assertion inside `ProduceConsumeValidateTest`, which did not seem to contribute much to the validation.
Reviewers: David Arthur <mumrah@gmail.com>, Jason Gustafson <jason@confluent.io>
In some system tests a Streams app is started and then prints a message to stdout, which the system test waits for to confirm the node has successfully been brought up. It then greps for certain log messages in a retriable loop.
But waiting on the Streams app to start and print to stdout does not mean the log file has been created yet, so the grep may return an error. Although this occurs in a retriable loop, it is assumed that grep will not fail; the result is piped to wc and then blindly converted to an int in the Python function, which fails since the error message is a string (throws ValueError).
We should catch the ValueError and return 0 so the loop can try again rather than crash immediately.
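A sketch of the more defensive counting (the function and argument names are hypothetical):
```python
def count_log_matches(node, pattern, log_file):
    """Count occurrences of pattern in the log, tolerating grep failing before the log exists."""
    cmd = "grep -c '%s' %s" % (pattern, log_file)
    output = "".join(node.account.ssh_capture(cmd, allow_fail=True))
    try:
        return int(output)
    except ValueError:  # grep printed an error message, not a number
        return 0        # let the retriable loop try again
```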
Reviewers: Bill Bejeck <bbejeck@gmail.com>, John Roesler <vvcephei@users.noreply.github.com>, Guozhang Wang <wangguoz@gmail.com>
The upgrade system test correctly rolls by upgrading the brokers while leaving the IBP unchanged, and
then rolling again with the latest IBP version.
Unfortunately, this is not sufficient to pick up many problems in our IBP
gating as we charge through the rolls and after the second roll all of
the brokers will rejoin the ISR and the test will be treated as a
success.
This test adds two new checks:
1. We wait for the ISR to stabilize for all partitions. This is best
practice during rolls, and is enough to tell us if a broker hasn't
rejoined after each roll.
2. We check the broker logs for some common protocol errors. This is a
fail safe as it's possible for the test to be successful even if some
protocols are incompatible and the ISR is rejoined.
Reviewers: Nikhil Bhatia <nikhil@confluent.io>, Jason Gustafson <jason@confluent.io>
The --enable-autocommit argument is a flag. It does not take a parameter. This was broken in #7724.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Manikumar Reddy <manikumar.reddy@gmail.com>
Two tests using 50k replicas on 8 brokers:
* Do a rolling restart with clean shutdown, delete topics
* Run produce bench and consumer bench on a subset of topics
Reviewed-By: David Jacot <djacot@confluent.io>, Vikas Singh <vikas@confluent.io>, Jason Gustafson <jason@confluent.io>
This patch adds a basic downgrade system test. It verifies that producing and consuming continues to work before and after the downgrade.
Reviewers: Ismael Juma <ismael@juma.me.uk>, David Arthur <mumrah@gmail.com>
Recently, the system tests test_rebalance_[simple|complex] failed
repeatedly with a verification error. The cause was most probably
the missing clean-up of the state directory of one of the processors.
A node is cleaned up when a service on that node is started and when
a test is torn down.
If the clean-up flag clean_node_enabled of an EOS Streams service is
unset, the clean-up of the node is skipped.
The clean-up flag of processor1 in the EOS tests should stay set before
its first start, so that the node is cleaned before the service is started.
Afterwards, for the multiple restarts of processor1, the clean-up flag should
be unset to re-use the local state.
After the multiple restarts are done, the clean-up flag of processor1 should
again be set to trigger node clean-up during the test teardown.
A dirty node can lead to test failures when tests from Streams EOS tests are
scheduled on the same node, because the state store would not start empty
since it reads the local state that was not cleaned up.
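In pseudo-Python, the intended sequence for processor1 is roughly the following (only the clean_node_enabled flag comes from the description above; the rest is illustrative):
```python
processor1.clean_node_enabled = True    # first start: the node is cleaned before starting
processor1.start()

processor1.clean_node_enabled = False   # restarts re-use the existing local state
for _ in range(2):
    processor1.stop()
    processor1.start()

processor1.clean_node_enabled = True    # so test teardown cleans the node again
```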
Reviewers: Matthias J. Sax <mjsax@apache.org>, Andrew Choi <andchoi@linkedin.com>, Bill Bejeck <bbejeck@gmail.com>