SSH outputs in system tests originating from paramiko are bytes. However, the logger in the system tests does not accept bytes and instead throws an exception. That means, the bytes returned as SSH output from paramiko need to converted to a type that the logger (or other objects) can process.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This PR introduces a streams specific uncaught exception handler that currently has the option to close the client or the application. If the new handler is set as well as the old handler (java thread handler) will be ignored and an error will be logged.
The application shutdown is achieved through the rebalance protocol.
Reviewers: Bruno Cadonna <cadonna@confluent.io>, Leah Thomas <lthomas@confluent.io>, John Roesler <john@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
This newly added system test is to verify that with the fix in #9270 , the member.id update caused by static member rejoin would be persisted correctly.
Reviewers: Boyang Chen <boyang@confluent.io>
Increase the amount of time available to the `test_verifiable_producer` test to login and get the process name for the verifiable producer from 5 seconds to 10 seconds.
We were seeing some test failures due to the assertion failing because the verifiable producer would complete before we could login, list the processes, and parse out the producer version. Previously, we were giving this operation 5 seconds to run, this PR bumps it up to 10 seconds.
I verified locally that this does not flake, but even at 5 seconds I wasn't seeing any flakes. Ultimately we should find a better strategy than racing to query the producer process (as outlined in the existing comments).
Reviewers: Jason Gustafson <jason@confluent.io>
KIP-431 (#9099) changed the format of console consumer output to `Partition:$PARTITION\t$VALUE` whereas previously the output format was `$VALUE\t$PARTITION`. This PR updates the message verifier to accommodate the updated console consumer output format.
The system test StreamsUpgradeTest.test_version_probing_upgrade tries to verify the wrong version for version probing.
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
quota_test.py tests are failing with below error.
```
23:24:42 [INFO:2020-10-24 17:54:42,366]: RunnerClient: kafkatest.tests.client.quota_test.QuotaTest.test_quota.quota_type=user.override_quota=False: FAIL: not enough arguments for format string
23:24:42 Traceback (most recent call last):
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/tests/runner_client.py", line 134, in run
23:24:42 data = self.run_test()
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/tests/runner_client.py", line 192, in run_test
23:24:42 return self.test_context.function(self.test)
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/mark/_mark.py", line 429, in wrapper
23:24:42 return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/client/quota_test.py", line 141, in test_quota
23:24:42 self.quota_config = QuotaConfig(quota_type, override_quota, self.kafka)
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/client/quota_test.py", line 60, in __init__
23:24:42 self.configure_quota(kafka, self.producer_quota, self.consumer_quota, ['users', None])
23:24:42 File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/client/quota_test.py", line 83, in configure_quota
23:24:42 (kafka.kafka_configs_cmd_with_optional_security_settings(node, force_use_zk_conection), producer_byte_rate, consumer_byte_rate)
23:24:42 TypeError: not enough arguments for format string
23:24:42
```
ran thee tests locally.
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: David Jacot <djacot@confluent.io>, Ron Dagostino <rndgstn@gmail.com>
Closes#9496 from omkreddy/quota-tests
Fix vagrant for a system tests with a python3.
Author: Nikolay Izhikov <nizhikov@apache.org>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#9480 from nizhikov/KAFKA-10592
This PR adds missing broker ACLs required to create topics and SCRAM credentials when ACLs are enabled for a system test. This PR also adds support for using PLAINTEXT as the inter broker security protocol when using SCRAM from the client in a system test with a secured cluster-- without this it would always be necessary to set both the inter-broker and client mechanisms to a SCRAM mechanism. Also contains some refactoring to make assumptions clearer.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
For now, Kafka system tests use python2 which is outdated and not supported.
This PR upgrades python to the third version.
Reviewers: Ivan Daschinskiy, Mickael Maison <mickael.maison@gmail.com>, Magnus Edenhill <magnus@edenhill.se>, Guozhang Wang <wangguoz@gmail.com>
The test StreamsBrokerBounceTest.test_all_brokers_bounce() fails on
2.5 because in the last stage of the test there is only one broker
left and the offset commit cannot succeed because the
min.insync.replicas of __consumer_offsets is set to 2 and acks is
set to all. This causes a time out and extends the closing of the
Kafka Streams client to beyond the duration passed to the close
method of the client.
This affects especially the 2.5 branch since there Kafka Streams
commits offsets for each task, i.e., close() needs to wait for the
timeout for each task. In 2.6 and trunk the offset commit is done
per thread, so close() does only need to wait for one time out per
stream thread.
I opened this PR on trunk, since the test could also become
flaky on trunk and we want to avoid diverging system tests across
branches.
A more complete solution would be to improve the test by defining
a better success criteria.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
`openjdk:8` includes `git` by default, but `openjdk:11` does not. Install `git` explicitly to make it easier to
test with newer openjdk versions.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Implement the KIP-554 API to create, describe, and alter SCRAM user configurations via the AdminClient. Add ducktape tests, and modify JUnit tests to test and use the new API where appropriate.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Rajini Sivaram <rajinisivaram@googlemail.com>
ducktape diff: https://github.com/confluentinc/ducktape/compare/v0.7.8...v0.7.9
- bcrypt (a dependency of ducktape) dropped Python2.7 support.
ducktape-0.7.9 now pins bcrypt to a Python2.7-supported version.
Author: Andrew Egelhofer <aegelhofer@confluent.io>
Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#9192 from andrewegel/trunk
A system test failed with the following error: global name 'self' is not defined
The reason was that `self` was accessed to log a message in a static method. This commit makes the method an instance method.
Reviewer: Matthias J. Sax <matthias@confluent.io>
KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at last as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a Produce response.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Boyang Chen <boyang@confluent.io>, Jun Rao <junrao@gmail.com>
- After #8312, older brokers are returning empty configs, with latest `adminClient.describeConfigs`. Old brokers are receiving empty configNames in `AdminManageer.describeConfigs()` method. Older brokers does not handle empty configKeys. Due to this old brokers are filtering all the configs.
- Update ClientCompatibilityTest to verify describe configs
- Add test case to test describe configs with empty configuration Keys
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#9046 from omkreddy/KAFKA-9432
Currently, the system tests `connect_distributed_test` and `connect_rest_test` only wait for the REST api to come up.
The startup of the worker includes an asynchronous process for joining the worker group and syncing with other workers.
There are some situations in which this sync takes an unusually long time, and the test continues without all workers up.
This leads to flakey test failures, as worker joins are not given sufficient time to timeout and retry without waiting explicitly.
This changes the `ConnectDistributedTest` to wait for the Joined group message to be printed to the logs before continuing with tests. I've activated this behavior by default, as it's a superset of the checks that were performed by default before.
This log message is present in every version of DistributedHerder that I could find, in slightly different forms, but always with `Joined group` at the beginning of the log message. This change should be safe to backport to any branch.
Signed-off-by: Greg Harris <gregh@confluent.io>
Author: Greg Harris <gregh@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>
The test case `OffsetValidationTest.test_fencing_static_consumer` fails periodically due to this error:
```
Traceback (most recent call last):
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 134, in run
data = self.run_test()
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 192, in run_test
return self.test_context.function(self.test)
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/mark/_mark.py", line 429, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/tests/kafkatest/tests/client/consumer_test.py", line 257, in test_fencing_static_consumer
assert len(consumer.dead_nodes()) == num_conflict_consumers
AssertionError
```
When a consumer stops, there is some latency between when the shutdown is observed by the service and when the node is added to the dead nodes. This patch fixes the problem by giving some time for the assertion to be satisfied.
Reviewers: Boyang Chen <boyang@confluent.io>
security_rolling_upgrade_test may change the security listener and then restart Kafka servers. has_sasl and has_ssl get out-of-date due to cached _security_config. This PR offers a simple fix that we always check the changes of port mapping and then update the sasl/ssl flag.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
Reducing timeout of transaction to clean up the unstable offsets quicker. IN hard_bounce mode, transactional client is killed ungracefully. Hence, it produces unstable offsets which obstructs TransactionalMessageCopier from receiving position of group.
Reviewers: Jun Rao <junrao@gmail.com>
Call KafkaStreams#cleanUp to reset local state before starting application up the second run.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Boyang Chen <boyang@confluent.io>, John Roesler <john@confluent.io>
Most of the values in the metadata upgrade test matrix are just testing
the upgrade/downgrade path between two previous releases. This is
unnecessary. We run the tests for all supported branches, so what we
should test is the up-/down-gradability of released versions with respect
to the current branch.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
There are two new configs introduced by 371f14c3c1 and 1c4eb1a575 so we have to update the expected configs in the connect_rest_test.py system test too.
Reviewer: Konstantine Karantasis <konstantine@confluent.io>
Replaces the previous upgrade test's trivial Streams app
with the commonly used SmokeTest, exercising many more
features. Also adjust the test matrix to test upgrading
from each released version since 2.2 to the current branch.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
After 3661f981fff2653aaf1d5ee0b6dde3410b5498db security_config is cached. Hence, the later changes to security flag can't impact the security_config used by later tests.
issue: https://issues.apache.org/jira/browse/KAFKA-10214
Author: Chia-Ping Tsai <chia7712@gmail.com>
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#8949 from chia7712/KAFKA-10214
During Streams' system tests the PIDs of the Streams
clients are collected. The method the collects the PIDs
swallows any exception that might be thrown by the
ssh_capture() function. Swallowing any exceptions
might make the investigation of failures harder,
because no information about what happened are recorded.
Reviewers: John Roesler <vvcephei@apache.org>
1. Enables `TLSv1.3` by default with Java 11 or newer.
2. Add unit tests that cover the various TLSv1.2 and TLSv1.3 combinations.
3. Extend `benchmark_test.py` and `replication_test.py` to run with 'TLSv1.2'
or 'TLSv1.3'.
Reviewers: Ismael Juma <ismael@juma.me.uk>
We have been seeing increased flakiness in transaction system tests. I believe the cause might be due to KIP-537, which increased the default zk session timeout from 6s to 18s and the default replica lag timeout from 10s to 30s. In the system test, we use the default transaction timeout of 10s. However, since the system test involves hard failures, the Produce request could be blocking for as long as the max of these two in order to wait for an ISR shrink. Hence this patch increases the timeout to 30s.
Note this patch also includes a minor logging fix in `Partition`. Previously we would see messages like the following:
```
[Broker id=3] Leader output-topic-0 starts at leader epoch 0 from offset 0 with high watermark 0 ISR 3,2,1 addingReplicas removingReplicas .Previous leader epoch was -1.
```
This patch fixes the log to print as the following:
```
[Broker id=3] Leader output-topic-0 starts at leader epoch 0 from offset 0 with high watermark 0 ISR [3,2,1] addingReplicas [] removingReplicas []. Previous leader epoch was -1.
```
Reviewers: Bob Barrett <bob.barrett@confluent.io>, Ismael Juma <github@juma.me.uk>
kafka_log4j_appender.py was broken on JDK11 by befd80b38.
`fix_opts_for_new_jvm` requires `node.version` to be set, we
add the relevant code to the test.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Also fixes a system test by configuring the HATA to perform a one-shot balanced assignment
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bruno Cadonna <bruno@confluent.io>
Previous to this fix a plugged-in verifiable client, such as
confluent-kafka-python, would be deployed on the node in the background
worker thread as the client was started. Since this could be time consuming
(e.g., 10+ seconds) and since the main test thread would continue to
operate, it was common for the current test to time out waiting
for e.g. the verifiable producer to produce messages while it was in fact
still deploying.
The fix here is to deploy the verifiable client on the node when
the verifiable client is instantiated, which is thus a blocking
operation on the main test thread, avoiding any test-based timeouts.
Reviewers: Jason Gustafson <jason@confluent.io>
Generalize the verification in the upgrade test so that it
does not rely on the task assignor's behavior.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, John Roesler <vvcephei@apache.org>
The downgrade test does not currently support 2.4 and 2.5. When you enable them, it fails as a result of consumer group static membership. This PR makes the downgrade test work with all of our released versions again.
Author: Lucas Bradstreet <lucas@confluent.io>
Reviewers: Boyang Chen, Gwen Shapira
Closes#8518 from lbradstreet/downgrade-test-2.4-2.5
* add a config to set the TaskAssignor
* set the default assignor to HighAvailabilityTaskAssignor
* fix broken tests (with some TODOs in the system tests)
Implements: KIP-441
Reviewers: Bruno Cadonna <bruno@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>
I added a change to the upgrade test a while back that would make it wait for
ISR rejoin before rolls. This prevents incompatible brokers charging through a
bad roll and disguising a downgrade problem.
We now also check for protocol errors in the broker logs.
Reviewers: Boyang Chen <boyang@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This fixes a version pinning issue where a transitive dependency had a
major version upgrade that a dependency did not account for, breaking
the build.
Reviewers: Andrew Egelhofer <aegelhofer@confluent.io>, Matthias J. Sax <matthias@confluent.io>
Revert the decision for the sendOffsetsToTransaction(groupMetadata) API to fail with old version of brokers for the sake of making the application easier to adapt between versions. This PR silently downgrade the TxnOffsetCommit API when the build version is small than 3.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Startup scripts for the early version of Kafka contain removed JVM options like `-XX:+PrintGCDateStamps` or `-XX:UseParNewGC`.
When system tests run on JVM that doesn't support these options we should set up
environment variables with correct options.
Reviewers: Guozhang Wang <guozhang@confluent.io>, Ron Dagostino <rdagostino@confluent.io>, Ismael Juma <ismael@juma.me.uk
* Replace Prev/Standby task lists with a representation of the current poasition
of all tasks, where each task is encoded as the sum of the positions of all the
changelogs in that task.
* Only the protocol change is implemented, not actual positions, and the
assignor is updated to translate the new protocol back to lists of Prev/Standby
tasks so that the current assignment protocol still functions without modification.
Implements: KIP-441
Reviewers: John Roesler <vvcephei@apache.org>, Bruno Cadonna <bruno@confluent.io>
As described in KIP-568.
Waiting on acceptance of the KIP to write the tests, on the off chance something changes. But rest assured unit tests are coming ⚡️
Will also kick off existing Streams system tests which leverage this new API (eg version probing, sometimes broker bounce)
Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
what/why
the throttling_test was broken by this PR (#7785) since it depends on the consumer having partitions-assigned before starting the producer
this PR provides the ability to wait for partitions to be assigned in the console consumer before considering it started.
caveat
this does not support starting up the JmxTool inside the console-consumer for custom metrics while using this wait_until_partitions_assigned flag since the code assumes one JmxTool running per node.
I think a proper fix for this would be to make JmxTool its own standalone single-node service
alternatives
we could use the EndToEnd test suite which uses the verifiable producer/consumer under the hood but I found that there were more changes necessary to get this working unfortunately (specifically doesn't seem like this test suite plays nicely with the ProducerPerformanceService)
Reviewers: Mathew Wong <mwong@confluent.io>, Bill Bejeck <bbejeck.com>
The test_throttled_reassignment test fails because the consumer that is used to validate reassignment does not start on time to consume all messages. This does not seem like an issue with the throttling of the reassignment, since increasing the timeout allowed the test to pass multiple consecutive runs locally.
This test seemed to rely on the default JmxTool for the console consumer that was removed in this commit: 179d0d7
The console consumer would check to see if it had partitions assigned to it before beginning to consume. Although the test occasionally failed with the JmxTool, it began to fail much more after the removal.
Error messages of failures followed the below format with varying numbers of missed messages. They are the first messages by the producer.
535 acked message did not make it to the Consumer. They are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19...plus 515 more. Total Acked: 192792, Total Consumed: 192259. We validated that the first 535 of these missing messages correctly made it into Kafka's data files. This suggests they were lost on their way to the consumer.
In the scope of the test, this error suggests that the test is falling into the race condition described in produce_consume_validate.py, which has the timeout to prevent the consumer from missing initial messages.
This can serve as a temporary fix until the logic of consumer startup is addressed further.
Reviewers: Jason Gustafson <jason@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
Newer versions of Java have added checks to ensure that trust anchors are CA certificates and contain proper extensions. This PR adds Basic Constraints extension with the CA field set to true for system tests.
Reviewers: ajini Sivaram <rajinisivaram@googlemail.com>
This change mainly have 2 components:
1. extend the existing transactions_test.py to also try out new sendTxnOffsets(groupMetadata) API to make sure we are not introducing any regression or compatibility issue
a. We shrink the time window to 10 seconds for the txn timeout scheduler on broker so that we could trigger expiration earlier than later
2. create a completely new system test class called group_mode_transactions_test which is more complicated than the existing system test, as we are taking rebalance into consideration and using multiple partitions instead of one. For further breakdown:
a. The message count was done on partition level, instead of global as we need to visualize
the per partition order throughout the test. For this sake, we extend ConsoleConsumer to print out the data partition as well to help message copier interpret the per partition data.
b. The progress count includes the time for completing the pending txn offset expiration
c. More visibility and feature improvements on TransactionMessageCopier to better work under either standalone or group mode.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
This PR is collaborated by Guozhang Wang and John Roesler. It is a significant tech debt cleanup on task management and state management, and is broken down by several sub-tasks listed below:
Extract embedded clients (producer and consumer) into RecordCollector from StreamTask.
guozhangwang#2
guozhangwang#5
Consolidate the standby updating and active restoring logic into ChangelogReader and extract out of StreamThread.
guozhangwang#3
guozhangwang#4
Introduce Task state life cycle (created, restoring, running, suspended, closing), and refactor the task operations based on the current state.
guozhangwang#6
guozhangwang#7
Consolidate AssignedTasks into TaskManager and simplify the logic of changelog management and task management (since they are already moved in step 2) and 3)).
guozhangwang#8
guozhangwang#9
Also simplified the StreamThread logic a bit as the embedded clients / changelog restoration logic has been moved into step 1) and 2).
guozhangwang#10
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bruno Cadonna <bruno@confluent.io>, Boyang Chen <boyang@confluent.io>
Deprecate existing metadata query APIs in favor of new
ones that include standby hosts as well as partition
information.
Closes: #7960
Implements: KIP-535
Co-authored-by: Navinder Pal Singh Brar <navinder_brar@yahoo.com>
Reviewed-by: John Roesler <vvcephei@apache.org>
Do not initialize `JmxTool` by default when running console consumer. In order to support this, we remove `has_partitions_assigned` and its only usage in an assertion inside `ProduceConsumeValidateTest`, which did not seem to contribute much to the validation.
Reviewers: David Arthur <mumrah@gmail.com>, Jason Gustafson <jason@confluent.io>
In some system tests a Streams app is started and then prints a message to stdout, which the system test waits for to confirm the node has successfully been brought up. It then greps for certain log messages in a retriable loop.
But waiting on the Streams app to start/print to stdout does not mean the log file has been created yet, so the grep may return an error. Although this occurs in a retriable loop it is assumed that grep will not fail, and the result is piped to wc and then blindly converted to an int in the python function, which fails since the error message is a string (throws ValueError)
We should catch the ValueError and return a 0 so it can try again rather than immediately crash
Reviewers: Bill Bejeck <bbejeck@gmail.com>, John Roesler <vvcephei@users.noreply.github.com>, Guozhang Wang <wangguoz@gmail.com>
The upgrade system test correctly rolls by upgrading the broker and
leaving the IBP, and then rolling again with the latest IBP version.
Unfortunately, this is not sufficient to pick up many problems in our IBP
gating as we charge through the rolls and after the second roll all of
the brokers will rejoin the ISR and the test will be treated as a
success.
This test adds two new checks:
1. We wait for the ISR to stabilize for all partitions. This is best
practice during rolls, and is enough to tell us if a broker hasn't
rejoined after each roll.
2. We check the broker logs for some common protocol errors. This is a
fail safe as it's possible for the test to be successful even if some
protocols are incompatible and the ISR is rejoined.
Reviewers: Nikhil Bhatia <nikhil@confluent.io>, Jason Gustafson <jason@confluent.io>
The --enable-autocommit argument is a flag. It does not take a parameter. This was broken in #7724.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Manikumar Reddy <manikumar.reddy@gmail.com>
Two tests using 50k replicas on 8 brokers:
* Do a rolling restart with clean shutdown, delete topics
* Run produce bench and consumer bench on a subset of topics
Reviewed-By: David Jacot <djacot@confluent.io>, Vikas Singh <vikas@confluent.io>, Jason Gustafson <jason@confluent.io>
This patch adds a basic downgrade system test. It verifies that producing and consuming continues to work before and after the downgrade.
Reviewers: Ismael Juma <ismael@juma.me.uk>, David Arthur <mumrah@gmail.com>
Recently, system tests test_rebalance_[simple|complex] failed
repeatedly with a verfication error. The cause was most probably
the missing clean-up of a state directory of one of the processors.
A node is cleaned up when a service on that node is started and when
a test is torn down.
If the clean-up flag clean_node_enabled of a EOS Streams service is
unset, the clean-up of the node is skipped.
The clean-up flag of processor1 in the EOS tests should stay set before
its first start, so that the node is cleaned before the service is started.
Afterwards for the multiple restarts of processor1 the cleans-up flag should
be unset to re-use the local state.
After the multiple restarts are done, the clean-up flag of processor1 should
again be set to trigger node clean-up during the test teardown.
A dirty node can lead to test failures when tests from Streams EOS tests are
scheduled on the same node, because the state store would not start empty
since it reads the local state that was not cleaned up.
Reviewers: Matthias J. Sax <mjsax@apache.org>, Andrew Choi <andchoi@linkedin.com>, Bill Bejeck <bbejeck@gmail.com>
* Add rate limiting to tc
* Feedback from PR
* Add a sanity test for tc
* Add iperf to vagrant scripts
* Dynamically determine the network interface
* Add some temp code for testing on AWS
* Temp: use hostname instead of external IP
* Temp: more AWS debugging
* More AWS WIP
* More AWS temp
* Lower latency some
* AWS wip
* Trying this again now that ping should work
* Add cluster decorator to tests
* Fix broken import
* Fix device name
* Fix decorator arg
* Remove errant import
* Increase timeouts
* Fix tbf command, relax assertion on latency test
* Fix log line
* Final bit of cleanup
* Newline
* Revert Trogdor retry count
* PR feedback
* More PR feedback
* Feedback from PR
* Remove unused argument
When Trogdor wants to clear all the faults injected to Kibosh, it sends the empty JSON object {}. However, Kibosh expects {"faults":[]} instead. Kibosh should handle the empty JSON object, since that's consistent with how Trogdor handles empty JSON fields in general (if they're empty, they can be omitted). We should also have a test for this.
Reviewers: David Arthur <mumrah@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
Instead of caching the checkpoint map during StandbyTask
initialization, use the latest checkpoints (which would have
been updated during suspend).
Reviewers: Bill Bejeck <bill@confluent.io>
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Vikas Singh <vikas@confluent.io>, Jason Gustafson <jason@confluent.io>
Closes#7477 from cmccabe/KAFKA-8984
The consumer's `committed` API does not return an entry in the response map for a requested partition if there is no committed offset. The transactional message copier, which is used in the transaction system test, did not account for this. If the first transaction attempted by the copier was randomly aborted, then we would not seek to the beginning as expected, which means we would fail to copy some of the records.
This patch fixes the problem by iterating over the assignment rather than the result of `committed` when resetting offsets. It also adds enables additional logging in the transaction message copier service to make finding problems easier in the future.
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7653 from hachikuji/fix-transaction-system-test
With KIP-444 the metrics definitions are refactored. Thus, Streams' SimpleBenchmark needs to be updated to correctly access the refactored metrics.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, Bill Bejeck <bbejeck@gmail.com>
Inside onLeavePrepare we would look into the assignment and try to revoke the owned tasks and notify users via RebalanceListener#onPartitionsRevoked, and then clear the assignment.
However, the subscription's assignment is already cleared in this.subscriptions.unsubscribe(); which means user's rebalance listener would never be triggered. In other words, from consumer client's pov nothing is owned after unsubscribe, but from the user caller's pov the partitions are not revoked yet. For callers like Kafka Streams which rely on the rebalance listener to maintain their internal state, this leads to inconsistent state management and failure cases.
Before KIP-429 this issue is hidden away since every time the consumer re-joins the group later, it would still revoke everything anyways regardless of the passed-in parameters of the rebalance listener; with KIP-429 this is easier to reproduce now.
Our fixes are following:
• Inside unsubscribe, first do onLeavePrepare / maybeLeaveGroup and then subscription.unsubscribe. This we we are guaranteed that the streams' tasks are all closed as revoked by then.
• [Optimization] If the generation is reset due to fatal error from join / hb response etc, then we know that all partitions are lost, and we should not trigger onPartitionRevoked, but instead just onPartitionsLost inside onLeavePrepare. This is because we don't want to commit for lost tracks during rebalance which is doomed to fail as we don't have any generation info.
Reviewers: Matthias J. Sax <matthias@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>
MM2 added a few connector classes in Connect's classpath and given that the assertion in the Connect REST system tests need to be adjusted to account for these additions.
This fix makes sure that the loaded Connect plugins are a superset of the expected by the test connectors.
Testing: The change is straightforward. The fix was tested with local system test runs.
Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7578 from kkonstantine/minor-fix-connect-test-after-mm2-classes
This system test was marked @Ignore around a year and a half ago pending the version probing work, but never turned on again.
These days, it is made redundant by the suite of system tests in streams_upgrade_test, which cover rolling upgrades (including version probing and metadata change).
Reviewers: Boyang Chen <boyang@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
All four flavors of the repartition/optimization tests have been reported as flaky and failed in one place or another:
* RepartitionOptimizingIntegrationTest.shouldSendCorrectRecords_OPTIMIZED
* RepartitionOptimizingIntegrationTest.shouldSendCorrectRecords_NO_OPTIMIZATION
* RepartitionWithMergeOptimizingIntegrationTest.shouldSendCorrectRecords_OPTIMIZED
* RepartitionWithMergeOptimizingIntegrationTest.shouldSendCorrectRecords_NO_OPTIMIZATION
They're pretty similar so it makes sense to knock them all out at once. This PR does three things:
* Switch to in-memory stores wherever possible
* Name all operators and update the Topology accordingly (not really a flaky test fix, but had to update the topology names anyway because of the IM stores so figured might as well)
* Port to TopologyTestDriver -- this is the "real" fix, should make a big difference as these repartition tests required multiple roundtrips with the Kafka cluster (while using only the default timeout)
Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Key improvements with this PR:
* tasks will remain available for IQ during a rebalance (but not during restore)
* continue restoring and processing standby tasks during a rebalance
* continue processing active tasks during rebalance until the RecordQueue is empty*
* only revoked tasks must suspended/closed
* StreamsPartitionAssignor tries to return tasks to their previous consumers within a client
* but do not try to commit, for now (pending KAFKA-7312)
Reviewers: John Roesler <john@confluent.io>, Boyang Chen <boyang@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Small follow-up to trunk PR #7423
While debugging the 2.3 VP PR we realized we should remove the leader-tracking from the VP system test altogether. We'd already merged the corresponding trunk PR so I made a quick new PR for trunk (also fixes a missed version bump in one of the log messages)
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Implemented KIP-507 to secure the internal Connect REST endpoints that are only for intra-cluster communication. A new V2 of the Connect subprotocol enables this feature, where the leader generates a new session key, shares it with the other workers via the configuration topic, and workers send and validate requests to these internal endpoints using the shared key.
Currently the internal `POST /connectors/<connector>/tasks` endpoint is the only one that is secured.
This change adds unit tests and makes some small alterations to system tests to target the new `sessioned` Connect subprotocol. A new integration test ensures that the endpoint is actually secured (i.e., requests with missing/invalid signatures are rejected with a 400 BAD RESPONSE status).
Author: Chris Egerton <chrise@confluent.io>
Reviewed: Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com>
Instead of sending the leader's version and having older members try to blindly upgrade.
The only other real change here is that we will also set the VERSION_PROBING error code and return early from onAssignment when we are upgrading our used subscription version (not just downgrading it) since this implies the whole group has finished the rolling upgrade and all members should rejoin with the new subscription version.
Also piggy-backing on a fix for a potentially dangerous edge case, where every thread of an instance is assigned the same set of active tasks.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This change adds a command line option to the `ducker-ak up' command to enable exposing ports from docker containers. The exposed ports will be mapped to the ephemeral ports on the host. The option is called `expose-ports' and can take either a single value (like 5005) or a range (like 5005-5009). This port will then exposed from each docker container that ducker-ak sets up.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, José Armando García Sancio <jsancio@users.noreply.github.com>
* Leader instance uses dictionary encoding on the wire to send topic partitions
* Topic names (most expensive component) are mapped to an integer using the dictionary
* Follower instances receive the dictionary, decode topic names back
* Purely an on-the-wire optimization, no in-memory structures changed
* Test case added for version 5 AssignmentInfo
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Fix a bug where ClientCompatibilityFeaturesTest fails when running multiple iterations.
Also, fix a typo in tests/docker/Dockerfile.
Reviewers: Ismael Juma <ismael@juma.me.uk>
As part of commit 4d1ee26a13 streams
version 2.3.0 test jar was added, but there was a simple typo in the
path that specified the version.
`ducker-ak up` was failing because of that. Fixed that.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This adds a basic system test that enables rack-aware brokers with the rack-aware replica selector for fetch from followers (KIP-392). The test asserts that the follower was read from at least once and that all the messages that were produced were successfully consumed.
Reviewers: Jason Gustafson <jason@confluent.io>
Corrected the AbstractHerder to correctly identify task configs that contain variables for externalized secrets. The original method incorrectly used `matcher.matches()` instead of `matcher.find()`. The former method expects the entire string to match the regex, whereas the second one can find a pattern anywhere within the input string (which fits this use case more correctly).
Added unit tests to cover various cases of a config with externalized secrets, and updated system tests to cover case where config value contains additional characters besides secret that requires regex pattern to be found anywhere in the string (as opposed to complete match).
Author: Arjun Satish <arjun@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>
To avoid transient system test failures, tolerate a small amount of data loss due to truncation in upgrade system tests using older message format prior to KIP-101, where data loss was possible.
Reviewers: Ismael Juma <ismael@juma.me.uk>
ZooKeeper 3.5.5 is the first stable release in the 3.5.x series. The key new feature
in is TLS support, but there are a few more noteworthy features:
* Dynamic reconfiguration
* Local sessions
* New node types: Container, TTL
* Ability to remove watchers
* Multi-threaded commit processor
* Upgraded to Netty 4.1
See the release notes for more detail:
https://zookeeper.apache.org/doc/r3.5.5/releasenotes.html
In addition to the version bump, we:
* Add `commons-cli` dependency as it's required by `ZooKeeperMain`, but specified as
`provided` in their pom.
* Remove unnecessary `ZooKeeperMainWrapper`, the bug it worked around was fixed
upstream a long time ago.
* Ignore non zero exit in one system test invocation of `ZooKeeperMain`.
`ZooKeeperMainWrapper` always returned `0` and `ZooKeeperService.query` relies
on that for correct behavior.
Reviewers: Jason Gustafson <jason@confluent.io>
ZkUtils was removed so we don't need this anymore.
Also:
* Fix ZkSecurityMigrator and ReplicaManagerTest not to
reference ZkClient classes.
* Remove references to zkclient in various `log4j.properties`
and `import-control.xml`.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
Follow on to #6731, this PR adds broker-side support for [KIP-392](https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica) (fetch from followers).
Changes:
* All brokers will handle FetchRequest regardless of leadership
* Leaders can compute a preferred replica to return to the client
* New ReplicaSelector interface for determining the preferred replica
* Incremental fetches will include partitions with no records if the preferred replica has been computed
* Adds new JMX to expose the current preferred read replica of a partition in the consumer
Two new conditions were added for completing a delayed fetch. They both relate to communicating the high watermark to followers without waiting for a timeout:
* For regular fetches, if the high watermark changes within a single fetch request
* For incremental fetch sessions, if the follower's high watermark is lower than the leader
A new JMX attribute `preferred-read-replica` was added to the `kafka.consumer:type=consumer-fetch-manager-metrics,client-id=some-consumer,topic=my-topic,partition=0` object. This was added to support the new system test which verifies that the fetch from follower behavior works end-to-end. This attribute could also be useful in the future when debugging problems with the consumer.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>
Connect tests were using String version for KafkaService instead of the expected KafkaVersion object. This broke due to recent changes to KafkaVersion. It turns out that the tests with String version were running compatibility tests against `dev` brokers rather than the older broker versions they were expecting to run against. When version was fixed, tests using 0.9.0.1 brokers started failing since new clients are not compatible with 0.9.0.1 brokers. So this PR fixes version parameter and removes the two tests against 0.9.0.1 brokers.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Rajini Sivaram <rajinisivaram@googlemail.com>
This adds a new Trogdor fault spec for inducing network latency on a network device for system testing. It operates very similarly to the existing network partition spec by executing the `tc` linux utility.
This PR fixes a bug in static group membership. Previously we limit the `member.id` replacement in JoinGroup to only cases when the group is in Stable. This is error-prone and could potentially allow duplicate consumers reading from the same topic. For example, imagine a case where two unknown members join in the `PrepareRebalance` stage at the same time.
The PR fixes the following things:
1. Replace `member.id` at any time we see a known static member rejoins group with unknown member.id
2. Immediately fence any ongoing join/sync group callback to early terminate the duplicate member.
3. Clearly handle Dead/Empty cases as exceptional.
4. Return old leader id upon static member leader rejoin to avoid trivial member assignment being triggered.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
We've seen `ReplicaVerificationToolTest.test_replica_lags` fail occasionally due to errors such as the following:
```
RemoteCommandError: ubuntuworker7: Command 'kill -15 2896' returned non-zero exit status 1. Remote error message: bash: line 0: kill: (2896) - No such process
```
The problem seems to be a shutdown race condition when using `max_messages` with the producer. The process may already be gone which will cause the signal to fail.
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Gwen Shapira
Closes#6906 from hachikuji/fix-failing-replicat-verification-test
We see the upgrade test failing from time to time. I looked into it and found that the root cause is basically that the test throughput can be too high for the 0.9 producer to make progress. Eventually it reaches a point where it has a huge backlog of timed out requests in the accumulator which all have to be expired. We see a long run of messages like this in the output:
```
{"exception":"class org.apache.kafka.common.errors.TimeoutException","time_ms":1559907386132,"name":"producer_send_error","topic":"test_topic","message":"Batch Expired","class":"class org.apache.kafka.tools.VerifiableProducer","value":"335160","key":null}
{"exception":"class org.apache.kafka.common.errors.TimeoutException","time_ms":1559907386132,"name":"producer_send_error","topic":"test_topic","message":"Batch Expired","class":"class org.apache.kafka.tools.VerifiableProducer","value":"335163","key":null}
{"exception":"class org.apache.kafka.common.errors.TimeoutException","time_ms":1559907386133,"name":"producer_send_error","topic":"test_topic","message":"Batch Expired","class":"class org.apache.kafka.tools.VerifiableProducer","value":"335166","key":null}
{"exception":"class org.apache.kafka.common.errors.TimeoutException","time_ms":1559907386133,"name":"producer_send_error","topic":"test_topic","message":"Batch Expired","class":"class org.apache.kafka.tools.VerifiableProducer","value":"335169","key":null}
```
This can continue for a long time (I have observed up to 1 min) and prevents the producer from successfully writing any new data. While it is busy expiring the batches, no data is getting delivered to the consumer, which causes it to eventually raise a timeout.
```
kafka.consumer.ConsumerTimeoutException
at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:50)
at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:109)
at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:69)
at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:47)
at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
```
The fix here is to reduce the throughput, which seems reasonable since the purpose of the test is to verify the upgrade, which does not demand heavy load. Note that I investigated several failing instances of this test going back to 1.0 and saw a similar pattern, so there does not appear to be a regression.
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Gwen Shapira
Closes#6907 from hachikuji/lower-throughput-for-upgrade-test
As title suggested, we boost 3 stream instances stream job with one minute session timeout, and once the group is stable, doing couple of rolling bounces for the entire cluster. Every rejoin based on restart should have no generation bump on the client side.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Bill Bejeck <bbejeck@gmail.com>
Reviewers: Bill Bejeck <bill@confluent.io>, Boyang Chen <boyang@confluent.io>, Bruno Cadonna <bruno@confluent.io>, Guozhang Wang <guozhang@confuent.io>
These are important to ensure we don't break compatibility.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Gwen Shapira
Closes#6794 from ijuma/update-version-compat-tests
For static members join/rejoin, we encode the current timestamp in the new member.id. The format looks like group.instance.id-timestamp.
During consumer/broker interaction logic (Join, Sync, Heartbeat, Commit), we shall check the whether group.instance.id is known on group. If yes, we shall match the member.id stored on static membership map with the request member.id. If mismatching, this indicates a conflict consumer has used same group.instance.id, and it will receive a fatal exception to shut down.
Right now the only missing part is the system test. Will work on it offline while getting the major logic changes reviewed.
Reviewers: Ryanne Dolan <ryannedolan@gmail.com>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Corrects the system tests to check for either a 404 or a 409 error and sleeping until the Connect REST API becomes available. This corrects a previous change to how REST extensions are initialized (#6651), which added the ability of Connect throwing a 404 if the resources are not yet started. The integration tests were already looking for 409.
Author: Magesh Nandakumar <magesh.n.kumar@gmail.com>
Reviewer: Randall Hauch <rhauch@gmail.com>
This is the first diff for the implementation of JoinGroup logic for static membership. The goal of this diff contains:
* Add group.instance.id to be unique identifier for consumer instances, provided by end user;
Modify group coordinator to accept JoinGroupRequest with/without static membership, refactor the logic for readability and code reusability.
* Add client side support for incorporating static membership changes, including new config for group.instance.id, apply stream thread client id by default, and new join group exception handling.
* Increase max session timeout to 30 min for more user flexibility if they are inclined to tolerate partial unavailability than burdening rebalance.
* Unit tests for each module changes, especially on the group coordinator logic. Crossing the possibilities like:
6.1 Dynamic/Static member
6.2 Known/Unknown member id
6.3 Group stable/unstable
6.4 Leader/Follower
The rest of the 345 change will be broken down to 4 separate diffs:
* Avoid kicking out members through rebalance.timeout, only do the kick out through session timeout.
* Changes around LeaveGroup logic, including version bumping, broker logic, client logic, etc.
* Admin client changes to add ability to batch remove static members
* Deprecate group.initial.rebalance.delay
Reviewers: Liquan Pei <liquanpei@gmail.com>, Stanislav Kozlovski <familyguyuser192@windowslive.com>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
As titled, this PR changed the default reset policy to latest accidentally for system tests, which in fact was earliest.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Verified that the https links work.
I didn't update the license header in this PR since that touches
so many files. Will file a separate one for that.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
This patch adds a TimeIntervalTransactionsGenerator class which enables the Trogdor ProduceBench worker to commit transactions based on a configurable millisecond time interval.
Also, we now handle 409 create task responses in the coordinator command-line client by printing a more informative message
Reviewers: Colin P. McCabe <cmccabe@apache.org>
`kafka.list_topics(...)` should not require a topic parameter
Author: Brian Bushree <bbushree@confluent.io>
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>
Closes#6367 from brianbushree/list-topics-no-topic
* add a normal windowed suppress with short windows and a short grace
period
* improve the smoke test so that it actually verifies the intended
conditions
See https://issues.apache.org/jira/browse/KAFKA-7944
Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>
For consumers using SSL, this timeout includes the time to create and copy keystores and truststores and sometime it takes longer than 10s to complete the security setup before starting the consumer process.
Reviewers: Ismael Juma <ismael@juma.me.uk>
The StreamsBrokerBounceTest.test_broker_type_bounce experienced what looked like a transient failure. After looking over this test and failure, it seems like it is vulnerable to timing error that streams will start before the kafka service creates all topics.
Reviewers: Matthias J. Sax <mjsax@apache.org>, John Roesler <john@confluent.io>
Allow the Trogdor agent to execute external commands. The agent communicates with the external commands via stdin, stdout, and stderr.
Based on a patch by Xi Yang <xi@confluent.io>
Reviewers: David Arthur <mumrah@gmail.com>
* Enable heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file.bin> in the major services in system tests
* Collect the heap dump from the predefined location as part of the result logs for each service
* Change Connect service to delete the whole root directory instead of individual expected files
* Tested by running the full suite of system tests
Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>
Closes#6158 from kkonstantine/KAFKA-7834
* Allow the Trogdor agent to be started in "exec mode", where it simply
runs a single task and exits after it is complete.
* For AgentClient and CoordinatorClient, allow the user to pass the path
to a file containing JSON, instead of specifying the JSON object in the
command-line text itself. This means that we can get rid of the bash
scripts whose only function was to load task specs into a bash string
and run a Trogdor command.
* Print dates and times in a human-readable way, rather than as numbers
of milliseconds.
* When listing tasks or workers, output human-readable tables of
information.
* Allow the user to filter on task ID name, task ID pattern, or task
state.
* Support a --json flag to provide raw JSON output if desired.
Reviewed-by: David Arthur <mumrah@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
When using ./ducker-ak test on Jenkins, the script complains that there is no TTY. To fix this, we should skip passing -t to docker exec. We do not need a pseudo-TTY to run the tests. Similarly, we should skip passing -i, since we do not need to keep stdin open.
The down command should have a force option, specified as -f or --force.
Reviewed-by: Colin P. McCabe <cmccabe@apache.org>
[KIP-297](https://cwiki.apache.org/confluence/display/KAFKA/KIP-297%3A+Externalizing+Secrets+for+Connect+Configurations#KIP-297:ExternalizingSecretsforConnectConfigurations-PublicInterfaces) introduced the `ConfigProvider` mechanism, which was primarily intended for externalizing secrets provided in connector configurations. However, when querying the Connect REST API for the configuration of a connector or its tasks, those secrets are still exposed. The changes here prevent the Connect REST API from ever exposing resolved configurations in order to address that. rhauch has given a more thorough writeup of the thinking behind this in [KAFKA-5117](https://issues.apache.org/jira/browse/KAFKA-5117)
Tested and verified manually. If these changes are approved unit tests can be added to prevent a regression.
Author: Chris Egerton <chrise@confluent.io>
Reviewers: Robert Yokota <rayokota@gmail.com>, Randall Hauch <rhauch@gmail.com, Ewen Cheslack-Postava <ewen@confluent.io>
Closes#6129 from C0urante/hide-provided-connect-configs
+ Add a parameter to the ducktap-ak to control the OpenJDK base image.
+ Fix a few issues of using OpenJDK:11 as the base image.
*More detailed description of your change,
if necessary. The PR title and PR message become
the squashed commit message, so use a separate
comment to ping reviewers.*
*Summary of testing strategy (including rationale)
for the feature or bug fix. Unit and/or integration
tests are expected for any behaviour change and
system tests should be considered for larger changes.*
Author: Xi Yang <xi@confluent.io>
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>
Closes#6071 from yangxi/ducktape-jdk
When I originally refactored the streams_upgrade_test#upgrade_downgrade_brokers test I removed the wait call which would wait for up 24 minutes for the SmokeTestDriver class to publish and verify all of its records.
Since most of the tests run in two minutes or less I set the monitor_log timeout to three minutes. However, the SmokeTestDriver#verify method allows up to six minutes to consume all records before verifying the monitor_log timeout needs to be greater than 6 minutes. I've set the timeout to 8 minutes.
Additionally, the steps needed to update the streams_upgrade_test should be documented as there are several components that need to get updated. So I've documented those steps here on the test as a giant comment.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This commit creates an EndToEndTest base class which relies on the verifiable consumer. This will ultimately replace ProduceConsumeValidateTest which depends on the console consumer. The advantage is that the verifiable consumer exposes more information to use for validation. It also allows for a nicer shutdown pattern. Rather than relying on the console consumer idle timeout, which requires a minimum wait time, we can halt consumption after we have reached the last acked offsets. This should be more reliable and faster. The downside is that the verifiable consumer only works with the new consumer, so we cannot yet convert the upgrade tests. This commit converts only the replication tests and a flaky security test to use EndToEndTest.
The StreamsUpgradeTest::test_upgrade_downgrade_brokers used sleep calls in the test which led to flaky test performance and as a result, we placed an @ignore annotation on the test. This PR uses log events instead of the sleep calls hence we can now remove the @ignore setting.
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Some Streams system tests have failed during the setup phase
due to the producer having retries disabled and getting some
transient error from the broker.
This patch adds a retries parameter to the VerifiableProducer
(default unchanged), and sets retries to 10 for Streams tests.
It also sets acks equal to the number of brokers for Streams tests.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Previous PR #6043 reduced throughput for VerifiableProducer in base class, but the streams_standby_replica_test needs higher throughput for consumer to complete verification in 60 seconds
Update system test and kicked off branch builder with 25 repeats https://jenkins.confluent.io/job/system-test-kafka-branch-builder/2201/
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This PR addresses a few issues with this system test flakiness. This PR is a cherry-picked duplicate of #6041 but for the trunk branch, hence I won't repeat the inline comments here.
1. Need to grab the monitor before a given operation to observe logs for signal
2. Relied too much on a timely rebalance and only sent a handful of messages.
I've updated the test and ran it here https://jenkins.confluent.io/job/system-test-kafka-branch-builder/2143/ parameterized for 15 repeats all passed.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This is the error message we're after:
"You may not specify an explicit partition assignment when using multiple consumers in the same group."
We apparently changed it midway through #5810 and forgot to update the test.
Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This is a system test for doing a rolling upgrade of a topology with a named repartition topic.
1. An initial Kafka Streams application is started on 3 nodes. The topology has one operation forcing a repartition and the repartition topic is explicitly named.
2. Each node is started and processing of data is validated
3. Then one node is stopped (full stop is verified)
4. A property is set signaling the node to add operations to the topology before the repartition node which forces a renumbering of all operators (except repartition node)
5. Restart the node and confirm processing records
6. Repeat the steps for the other 2 nodes completing the rolling upgrade
I ran two runs of the system test with 25 repeats in each run for a total of 50 test runs.
All test runs passed
Reviewers: John Roesler <john@confluent.io>, Matthias J. Sax <matthias@confluent.io>
This change adds some basic system tests for delegation token based authentication:
- basic delegation token creation
- producing with a delegation token
- consuming with a delegation token
- expiring a delegation token
- producing with an expired delegation token
New files:
- delegation_tokens.py: a wrapper around kafka-delegation-tokens.sh - executed in container where a secure Broker is running (taking advantage of automatic cleanup)
- delegation_tokens_test.py: basic test to validate the lifecycle of a delegation token
Changes were made in the following file to extend their functionality:
- config_property was updated to be able to configure Kafka brokers with delegation token related settings
- jaas.conf template because a broker needs to support multiple login modules when delegation tokens are used
- consule-consumer and verifiable_producer to override KAFKA_OPTS (to specify custom jaas.conf) and the client properties (to authenticate with delegation token).
Author: Attila Sasvari <asasvari@apache.org>
Reviewers: Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Andras Katona <41361962+akatona84@users.noreply.github.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#5660 from asasvari/KAFKA-4544
This is a new system test testing for optimizing an existing topology. This test takes the following steps
1. Start a Kafka Streams application that uses a selectKey then performs 3 groupByKey() operations and 1 join creating four repartition topics
2. Verify all instances start and process data
3. Stop all instances and verify stopped
4. For each stopped instance update the config for TOPOLOGY_OPTIMIZATION to all then restart the instance and verify the instance has started successfully also verifying Kafka Streams reduced the number of repartition topics from 4 to 1
5. Verify that each instance is processing data from the aggregation, reduce, and join operation
Stop all instances and verify the shut down is complete.
6. For testing I ran two passes of the system test with 25 repeats for a total of 50 test runs.
All test runs passed
Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>