Commit Graph

9811 Commits

Author SHA1 Message Date
Stanislav Vodetskyi 7e683852b4
MINOR: unpin ducktape dependency to always use the newest version (py3 edition) (#11884)
Ensures we always have the latest published ducktape version.
This way whenever we release a new one, we won't have to cherry pick a bunch of commits across a bunch of branches.
2022-03-11 17:48:19 +05:30
Levani Kokhreidze 87eb0cf03c
KAFKA-6718: Update SubscriptionInfoData with clientTags (#10802)
adds ClientTags to SubscriptionInfoData

Reviewer: Luke Chen <showuon@gmail.com>, Bruno Cadonna <cadonna@apache.org>
2022-03-11 16:29:05 +08:00
xuexiaoyue f025a93c7c
MINOR: Fix comments in TransactionsTest (#11880)
Reviewer: Luke Chen <showuon@gmail.com>
2022-03-11 15:42:44 +08:00
Lucas Bradstreet dc36dedd28
MINOR: jmh.sh swallows compile errors (#11870)
jmh.sh runs tasks in quiet mode which swallows compiler errors. This is a pain and I frequently have to edit the shell script to see the error.

Reviewers:  Ismael Juma <ismael@confluent.io>, Bill Bejeck <bbejeck@apache.org>
2022-03-10 18:18:41 -05:00
Walker Carlson 4d5a28973f
Revert "KAFKA-13542: add rebalance reason in Kafka Streams (#11804)" (#11873)
This reverts commit 2ccc834faa.

We were seeing serious regressions in our state-heavy benchmarks, which run rolling bounces with 10 nodes.

We regularly saw this exception: java.lang.OutOfMemoryError: Java heap space

I ran a git bisect and found this commit. We verified that the commit right before it did not have the same issues. I then reverted the problematic commit, ran the benchmarks again, and did not see any more issues. We are still looking into the root cause, but since this isn't a critical improvement we can remove it temporarily.

Reviewers: Bruno Cadonna <cadonna@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>, David Jacot <djacot@confluent.io>, Ismael Juma <ismael@confluent.io>
2022-03-10 13:52:05 -08:00
A. Sophie Blee-Goldman 113595cf5c
KAFKA-12648: fix flaky #shouldAddToEmptyInitialTopologyRemoveResetOffsetsThenAddSameNamedTopologyWithRepartitioning (#11868)
This test has started to become flaky at a relatively low, but consistently reproducible, rate. Upon inspection, we find this is due to IOExceptions during the #cleanUpNamedTopology call -- specifically, most often a DirectoryNotEmptyException with an occasional FileNotFoundException.

Basically, signs pointed to having returned from/completed the #removeNamedTopology future prematurely, and moving on to try and clear out the topology's state directory while there was a streamthread somewhere that was continuing to process/close its tasks.

I believe this is due to updating the thread's topology version before we perform the actual topology update, in this case specifically the act of e.g. clearing out a directory. If one thread updates its version and then goes to perform the topology removal/cleanup while a second thread finishes its own topology removal, the second thread will check whether all threads are on the latest version and complete any waiting futures if so -- which means it can complete the future before the first thread has actually completed the corresponding action.
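
The ordering problem described above can be sketched as follows. This is a minimal illustration and not the Streams implementation: `VersionTracker`, `applyUpdate`, and `allThreadsAt` are hypothetical names. The only point is that a thread publishes its new version after its cleanup action has finished, so a version check made by another thread cannot succeed early.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for per-thread topology version bookkeeping.
class VersionTracker {
    private final Map<String, Long> threadVersions = new ConcurrentHashMap<>();

    void register(String thread) {
        threadVersions.put(thread, 0L);
    }

    // Run the update action first, then publish the new version.
    // Publishing first would let other threads believe the action was already done.
    void applyUpdate(String thread, long version, Runnable action) {
        action.run();
        threadVersions.put(thread, version);
    }

    // True only when every registered thread has published at least `version`.
    boolean allThreadsAt(long version) {
        return threadVersions.values().stream().allMatch(v -> v >= version);
    }
}
```

With this ordering, a check made while one thread is still inside its action reports that not all threads are on the new version yet.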

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-10 12:02:07 -08:00
aSemy 38e3787d76
Minor typo: "result _is_ a" > "result _in_ a" (#11876)
Reviewers: Bill Bejeck <bbejeck@apache.org>
2022-03-10 14:03:12 -05:00
RivenSun 84b41b9d3a
KAFKA-13689: Revert AbstractConfig code changes (#11863)
Reviewer: Luke Chen <showuon@gmail.com>
2022-03-10 10:54:10 +08:00
Vincent Jiang 798275f254
KAFKA-13717: skip coordinator lookup in commitOffsetsAsync if offsets is empty (#11864)
Reviewer: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
2022-03-10 10:52:05 +08:00
A. Sophie Blee-Goldman 9c7d857713
KAFKA-12648: fix #getMinThreadVersion and include IOException + topologyName in StreamsException when topology dir cleanup fails (#11867)
Quick fix to make sure we log the actual source of the failure both in the actual log message as well as the StreamsException that we bubble up to the user's exception handler, and also to report the offending topology by filling in the StreamsException's taskId field.

Also prevents a NoSuchElementException from being thrown when trying to compute the minimum topology version across all threads when the last thread is being unregistered during shutdown.
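
A minimal sketch of the empty-collection case, assuming the minimum is taken over a collection of per-thread versions. `MinVersion`, `minThreadVersion`, and the `fallback` parameter are hypothetical and not taken from the Streams code. `Collections.min` throws NoSuchElementException on an empty collection, whereas a stream minimum returns an empty optional that can be given a default.

```java
import java.util.Collection;
import java.util.OptionalLong;

class MinVersion {
    // Returns the smallest version, or `fallback` when no threads remain registered.
    static long minThreadVersion(Collection<Long> versions, long fallback) {
        OptionalLong min = versions.stream().mapToLong(Long::longValue).min();
        return min.orElse(fallback);
    }
}
```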

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-09 16:30:42 -08:00
Randall Hauch d2d49f6421
KAFKA-12879: Remove extra sleep (#11872)
2022-03-09 15:11:46 -06:00
Philip Nee ddcee81043
KAFKA-12879: Addendum to reduce flakiness of tests (#11871)
This is an addendum to KAFKA-12879 (#11797) to fix some tests that are somewhat flaky when a build machine is heavily loaded (when the timeouts are too small).

- Add an if check to avoid sleep(0)
- Increase timeout in the tests
2022-03-09 14:37:48 -06:00
Philip Nee 28393be6d7
KAFKA-12879: Revert changes from KAFKA-12339 and instead add retry capability to KafkaBasedLog (#11797)
Fixes the compatibility issue regarding KAFKA-12879 by reverting the changes to the admin client from KAFKA-12339 (#10152) that retry admin client operations, and instead perform the retries within Connect's `KafkaBasedLog` during startup via a new `TopicAdmin.retryEndOffsets(..)` method. This method delegates to the existing `TopicAdmin.endOffsets(...)` method, but will retry on `RetriableException` until the retry timeout elapses.

This change should be backward compatible with KAFKA-12339: when Connect's `KafkaBasedLog` starts up, it will retry attempts to read the end offsets for the log's topic. The existing `KafkaBasedLog` thread already has its own retry logic, which is not changed.

Added more unit tests, and thoroughly tested the new `RetryUtil` used to encapsulate the parameterized retry logic around any supplied function.
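
The retry-until-timeout behavior described above can be sketched roughly as below. This is not the `RetryUtil` from the commit: `RetrySketch`, `callWithRetries`, and `TransientException` (a stand-in for Kafka's `RetriableException`) are hypothetical names, and the signature is an assumption chosen to keep the example self-contained.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a retriable failure.
class TransientException extends RuntimeException {
    TransientException(String message) {
        super(message);
    }
}

class RetrySketch {
    // Calls `action` until it succeeds or `timeoutMs` elapses,
    // pausing `backoffMs` between attempts. Other exceptions propagate immediately.
    static <T> T callWithRetries(Supplier<T> action, long timeoutMs, long backoffMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            try {
                return action.get();
            } catch (TransientException e) {
                if (System.currentTimeMillis() + backoffMs >= deadline) {
                    throw e; // out of time: surface the last failure
                }
                if (backoffMs > 0) { // skip the pointless sleep(0)
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }
    }
}
```

A caller would wrap the end-offset lookup in the supplier, so that only failures marked as transient are retried.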
2022-03-09 12:39:28 -06:00
Jason Koch 2367c8994b
KAFKA-13630: Reduce amount of time that producer network thread holds batch queue lock (#11722)
Hold the `deque` lock for only as long as is required to collect and make a decision in
`ready()` and `drain()` loops. Once this is done, remaining work can be done without lock,
so release it. This allows producers to continue appending.

For an application with a single producer thread and a high send() rate, this change
reduces spinlock CPU cycles from 14.6% to 2.5% of the send() path, or more
clearly a 12.1% improvement in efficiency for the send() path by reducing the duration of
contention events with the network thread. Note that this application was executed with
Java 8, which has a slower crc32c implementation.
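
The lock-narrowing idea can be sketched as follows, assuming a simple queue of string batches. `BatchQueue` and its methods are hypothetical and far simpler than the producer's accumulator. The lock is held only while batches are removed from the deque, and the per-batch work happens after it is released.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class BatchQueue {
    private final Deque<String> deque = new ArrayDeque<>();

    // Producer side: append under the lock.
    void append(String batch) {
        synchronized (deque) {
            deque.addLast(batch);
        }
    }

    // Network side: take batches under the lock, then do the slower work without it.
    List<String> drain(int max) {
        List<String> taken = new ArrayList<>();
        synchronized (deque) {
            while (taken.size() < max && !deque.isEmpty()) {
                taken.add(deque.pollFirst());
            }
        }
        List<String> prepared = new ArrayList<>();
        for (String batch : taken) {
            prepared.add("sent:" + batch); // placeholder for work that does not need the lock
        }
        return prepared;
    }
}
```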

Reviewers: Luke Chen <showuon@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Artem Livshits <84364232+artemlivshits@users.noreply.github.com>
2022-03-09 05:41:06 -08:00
Adam Kotwasinski add11eed75
MINOR: Correct logging and Javadoc in FetchSessionHandler (#11843)
Reviewers: Justine Olshan <jolshan@confluent.io>, David Jacot <david.jacot@gmail.com>, Luke Chen <showuon@gmail.com>
2022-03-09 16:51:26 +08:00
David Jacot 69926b5193
MINOR: Clean up AlterIsrManager code (#11832)
Reviewers: Justine Olshan <jolshan@confluent.io>, Jason Gustafson <jason@confluent.io>
2022-03-09 07:31:07 +01:00
John Roesler 717f9e2149
MINOR: Restructure ConsistencyVectorIntegrationTest (#11848)
Reviewers: YEONCHEOL JANG <@YeonCheolGit>, Matthias J. Sax <mjsax@apache.org>
2022-03-08 13:59:58 -06:00
Vincent Jiang b27000ec6a
MINOR: Fix flaky test cases SocketServerTest.remoteCloseWithoutBufferedReceives and SocketServerTest.remoteCloseWithIncompleteBufferedReceive (#11861)
When a socket is closed, the corresponding channel should be retained only if there are complete buffered requests.

Reviewers: David Jacot <djacot@confluent.io>
2022-03-08 19:03:11 +01:00
John Roesler 10f34ce6b3
MINOR: Clarify acceptable recovery lag config doc (#11411)
Reviewers: A. Sophie Blee-Goldman <ableegoldman@apache.org>, Andrew Eugene Choi < @andrewchoi5 >
2022-03-08 10:42:36 -06:00
A. Sophie Blee-Goldman fc7133d52d
KAFKA-12648: fix bug where thread is re-added to TopologyMetadata when shutting down (#11857)
We used to call TopologyMetadata#maybeNotifyTopologyVersionWaitersAndUpdateThreadsTopologyVersion when a thread was being unregistered/shutting down, to check if any of the futures listening for topology updates had been waiting on this thread and could be completed. Prior to invoking this we make sure to remove the current thread from the TopologyMetadata's threadVersions map, but this thread is actually then re-added in the #maybeNotifyTopologyVersionWaitersAndUpdateThreadsTopologyVersion call.

To fix this, we should split this method into separate calls for its two distinct functions: updating the version and checking for topology update completion. When unregistering a thread, we should invoke only the latter.
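
A rough sketch of the split, under the assumption of a single pending waiter. `TopologyVersions` and all of its method names are hypothetical, and `CompletableFuture` stands in for the future type used in Streams. Unregistering removes the thread and then runs only the completion check, so the thread is never re-added.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

class TopologyVersions {
    private final Map<String, Long> threadVersions = new HashMap<>();
    private long latestVersion = 0L;
    private CompletableFuture<Void> waiter = CompletableFuture.completedFuture(null);

    // A topology change bumps the version and returns a future for its completion.
    synchronized CompletableFuture<Void> requestUpdate() {
        latestVersion++;
        waiter = new CompletableFuture<>();
        return waiter;
    }

    // Function 1: record that this thread has caught up (this adds the thread to the map).
    synchronized void updateThreadVersion(String thread) {
        threadVersions.put(thread, latestVersion);
    }

    // Function 2: complete the waiter once every remaining thread has caught up.
    synchronized void maybeNotifyWaiters() {
        boolean allCurrent = threadVersions.values().stream().allMatch(v -> v >= latestVersion);
        if (allCurrent) {
            waiter.complete(null);
        }
    }

    // Shutdown path: remove the thread, then only check for completion.
    synchronized void unregisterThread(String thread) {
        threadVersions.remove(thread);
        maybeNotifyWaiters();
    }

    synchronized boolean isRegistered(String thread) {
        return threadVersions.containsKey(thread);
    }
}
```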

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-07 23:59:43 -08:00
Luke Chen 1848f049e1
KAFKA-13710: bring the InvalidTimestampException back for record error (#11853)
Reviewers: Guozhang Wang <guozhang@confluent.io>, Ricardo Brasil <anribrasil@gmail.com>
2022-03-08 14:28:16 +08:00
A. Sophie Blee-Goldman 539f006e65
KAFKA-12648: fix NPE due to race condition between resetting offsets and removing a topology (#11847)
While debugging the flaky NamedTopologyIntegrationTest.shouldRemoveOneNamedTopologyWhileAnotherContinuesProcessing test, I did discover one real bug. The problem was that we update the TopologyMetadata's builders map (with the known topologies) inside the #removeNamedTopology call directly, whereas the StreamThread may not yet have reached the poll() in the loop, and in case of an offset reset we get an NPE.
I changed the NPE to just log a warning for now; going forward I think we should try to tackle some tech debt by keeping the processing tasks and the TopologyMetadata in sync.

Also includes a quick fix on the side where we were re-adding the topology waiter/KafkaFuture for a thread being shut down

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-07 11:09:18 -08:00
Mickael Maison bbb2dc54a0
KAFKA-13671: Add ppc64le build stage (#11833)
Reviewers: David Arthur <mumrah@gmail.com>
2022-03-07 10:18:54 +01:00
Tim Patterson e3ef29ea03
KAFKA-12959: Distribute standby and active tasks across threads to better balance load between threads (#11493)
Balance standby and active stateful tasks evenly across threads

Reviewer: Luke Chen <showuon@gmail.com>
2022-03-05 16:11:42 +08:00
RivenSun 0dac4b4267
KAFKA-13689: printing unused and unknown logs separately (#11800)
Differentiate between unused and unknown configs during log output.

Reviewer: Luke Chen <showuon@gmail.com>
2022-03-05 16:08:14 +08:00
RivenSun 3be978464c
KAFKA-13694: Log more specific information when the verification record fails on brokers. (#11830)
Reviewers: Guozhang Wang <wangguoz@gmail.com>
2022-03-04 10:45:44 -08:00
A. Sophie Blee-Goldman 11143d4883
MINOR: fix flaky shouldRemoveOneNamedTopologyWhileAnotherContinuesProcessing (#11827)
This test has been failing somewhat regularly due to going into the ERROR state before reaching RUNNING during the startup phase. The problem is that we are reusing the DELAYED_INPUT_STREAM topics, which had previously been assumed to be uniquely owned by a particular test. We should make sure to delete and re-create these topics for any test that uses them.
2022-03-04 10:31:37 -08:00
A. Sophie Blee-Goldman 6f54faed2d
KAFKA-12648: fix #add/removeNamedTopology blocking behavior when app is in CREATED (#11813)
Currently the #add/removeNamedTopology APIs behave oddly when the application is still in CREATED. Since adding and removing topologies runs some validation steps, there is a valid reason to want to add or remove a topology on a dummy app that you don't plan to start, or a real app that you haven't started yet. But to actually check the results of the validation you need to call get() on the future, so we need to make sure that get() won't block forever in the case of no failure -- as is currently the case.
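
A minimal sketch of the intended behavior, assuming a simplified two-state application. `TopologyChanges`, its `State` enum, and `addTopology` are hypothetical, and `CompletableFuture` stands in for the future returned by the real API. Validation failures complete the future exceptionally, and in CREATED a valid change completes immediately because there are no threads to wait for.

```java
import java.util.concurrent.CompletableFuture;

class TopologyChanges {
    enum State { CREATED, RUNNING }

    private final State state;

    TopologyChanges(State state) {
        this.state = state;
    }

    // Validates the change, then returns a future the caller can get() on.
    CompletableFuture<Void> addTopology(String name) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        if (name == null || name.isEmpty()) {
            future.completeExceptionally(new IllegalArgumentException("topology name must be non-empty"));
        } else if (state == State.CREATED) {
            future.complete(null); // nothing is running yet, so get() must not block
        }
        // In RUNNING the future stays pending here; the threads would complete it (omitted).
        return future;
    }
}
```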

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-04 09:58:56 -08:00
Vincent Jiang 95dbba9fe5
KAFKA-13706: Remove closed connections from MockSelector.ready (#11839)
Reviewers: David Jacot <djacot@confluent.io>
2022-03-04 09:51:53 +01:00
wangyap ae76b9d45a
KAFKA-13466: delete unused config batch.size in kafka-console-producer.sh (#11517)
delete unused config batch.size in kafka-console-producer.sh

Reviewer: Andrew Eugene Choi <andrew.choi@uwaterloo.ca>, Luke Chen <showuon@gmail.com>
2022-03-04 09:47:23 +08:00
Justin Lee f5d8fb2b0b
(docs) Add JavaDocs for org.apache.kafka.common.security.oauthbearer.secured (#11811)
Reviewers:  Luke Chen <showuon@confluent.io>, Jun Rao <junrao@gmail.com>
2022-03-03 10:13:01 -08:00
Luke Chen 7c280c1d5f
KAFKA-13673: disable idempotence when config conflicts (#11788)
Disable idempotence when conflicting config values for acks, retries
and max.in.flight.requests.per.connection are set by the user. For the
former two configs, we log at info level when we disable idempotence
due to conflicting configs. For the latter, we log at warn level since
it's due to an implementation detail that is likely to be surprising.

This mitigates compatibility impact of enabling idempotence by default.

Added unit tests to verify the change in behavior.
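
The conflict check can be sketched as below. This is an illustration and not the producer's config code: `IdempotenceCheck` and `idempotenceEnabled` are hypothetical, and the rules encoded here (acks must be all, retries must be greater than 0, and at most 5 in-flight requests per connection) are assumed from the documented requirements for idempotence. The sketch covers only the case where the user has not set enable.idempotence explicitly.

```java
class IdempotenceCheck {
    // Returns whether idempotence stays enabled given the user's other settings.
    // Assumed requirements: acks=all (or -1), retries > 0, max in-flight <= 5.
    static boolean idempotenceEnabled(String acks, int retries, int maxInFlight) {
        boolean acksAll = "all".equals(acks) || "-1".equals(acks);
        boolean conflicts = !acksAll || retries == 0 || maxInFlight > 5;
        return !conflicts;
    }
}
```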

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>, Mickael Maison <mickael.maison@gmail.com>
2022-03-03 05:40:41 -08:00
Mickael Maison 029a14b530
KAFKA-13510: Connect APIs to list all connector plugins and retrieve their configs (#11572)
Implements KIP-769: https://cwiki.apache.org/confluence/display/KAFKA/KIP-769%3A+Connect+APIs+to+list+all+connector+plugins+and+retrieve+their+configuration+definitions

Reviewers: Tom Bentley <tbentley@redhat.com>, Chris Egerton <fearthecellos@gmail.com>
2022-03-03 14:28:50 +01:00
Chris Egerton 066cdc8c62
KAFKA-10000: Add producer fencing API to admin client (KIP-618) (#11777)
Reviewers: Luke Chen <showuon@gmail.com>, Tom Bentley <tbentley@redhat.com>
2022-03-03 10:27:17 +00:00
Levani Kokhreidze 62e646619b
KAFKA-6718 / Rack aware standby task assignor (#10851)
This PR is part of KIP-708 and adds rack aware standby task assignment logic.

Reviewer: Bruno Cadonna <cadonna@apache.org>, Luke Chen <showuon@gmail.com>, Vladimir Sitnikov <vladimirsitnikov.apache.org>
2022-03-03 11:37:26 +08:00
Colin Patrick McCabe 07553d13f7
MINOR: create KafkaConfigSchema and TimelineObject (#11809)
Create KafkaConfigSchema to encapsulate the concept of determining the types of configuration keys.
This is useful in the controller because we can't import KafkaConfig, which is part of core. Also
introduce the TimelineObject class, which is a more generic version of TimelineInteger /
TimelineLong.

Reviewers: David Arthur <mumrah@gmail.com>
2022-03-02 14:26:31 -08:00
A. Sophie Blee-Goldman f089bea7ed
MINOR: set log4j.logger.kafka and all Config logger levels to ERROR for Streams tests (#11823)
Pretty much any time we have an integration test failure that's flaky or only exposed when running on Jenkins through the PR builds, it's impossible to debug if it cannot be reproduced locally as the logs attached to the test results have truncated the entire useful part of the logs. This is due to the logs being flooded at the beginning of the test when the Kafka cluster is coming up, eating up all of the allotted characters before we even get to the actual Streams test. Setting log4j.logger.kafka to ERROR greatly improves the situation and cuts down on most of the excessive logging in my local runs. To improve things even more and have some hope of getting the part of the logs we actually need, I also set the loggers for all of the Config objects to ERROR, as these print out the value of every single config (of which there are a lot) and are not useful as we can easily figure out what the configs were if necessary by just inspecting the test locally.

Reviewers:  Luke Chen <showuon@confluent.io>,  Guozhang Wang <guozhang@confluent.io>
2022-03-01 21:58:10 -08:00
John Roesler 7172f35807
MINOR: Improve test assertions for IQv2 (#11828)
Reviewer: Bill Bejeck <bbejeck@apache.org>
2022-03-01 20:30:29 -06:00
A. Sophie Blee-Goldman 84f8c90b13
KAFKA-12648: standardize startup timeout to fix some flaky NamedTopologyIntegrationTest tests (#11824)
Seen a few of the newly added tests fail on PR builds lately with:

"java.lang.AssertionError: Expected all streams instances in [org.apache.kafka.streams.processor.internals.namedtopology.KafkaStreamsNamedTopologyWrapper@7fb3e6b0] to be RUNNING within 30000 ms"

We already had some tests using the 30s timeout while others were bumped all the way up to 60s. I figured we should try out a default timeout of 45s, and if we still see failures in specific tests we can go from there.
2022-03-01 13:15:53 -08:00
A. Sophie Blee-Goldman 6eb57f6df1
KAFKA-12738: address minor followup and consolidate integration tests of PR #11787 (#11812)
This PR addresses the remaining nits from the final review of #11787

It also deletes two integration test classes which had only one test in them, and moves the tests to another test class file to save on the time to bring up an entire embedded kafka cluster just for a single run

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-01 12:59:18 -08:00
Kowshik Prakasam 67e99a4236
MINOR: Ensure LocalLog.flush is thread safe to recoveryPoint changes (#11814)
Issue:
Imagine a scenario where two threads T1 and T2 are inside UnifiedLog.flush() concurrently:

KafkaScheduler thread T1 -> The periodic work calls LogManager.flushDirtyLogs(), which in turn calls UnifiedLog.flush(). For example, this can happen due to log.flush.scheduler.interval.ms.
KafkaScheduler thread T2 -> A UnifiedLog.flush() call is triggered asynchronously during segment roll.
If thread T1 advances the recovery point beyond the flush offset of thread T2, this could trip the check within LogSegments.values() for thread T2 when it is called from LocalLog.flush(). The exception causes the KafkaScheduler thread to die, which is not desirable.

Fix:
We fix this by ensuring that LocalLog.flush() is immune to the case where the recoveryPoint advances beyond the flush offset.
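
A minimal sketch of such a guard, assuming a single numeric recovery point. `LocalLogSketch`, `rangeSize`, and `advanceRecoveryPoint` are hypothetical and do not mirror the real classes. The recovery point is read once into a local, and the range lookup (which rejects an inverted range) is skipped when another thread has already flushed past the requested offset.

```java
class LocalLogSketch {
    private volatile long recoveryPoint = 0L;

    // Mirrors a range lookup that rejects an inverted range.
    private static long rangeSize(long from, long to) {
        if (from > to) {
            throw new IllegalArgumentException("invalid range: " + from + " > " + to);
        }
        return to - from;
    }

    // The recovery point only ever moves forward.
    synchronized void advanceRecoveryPoint(long offset) {
        if (offset > recoveryPoint) {
            recoveryPoint = offset;
        }
    }

    // Returns true if any flushing work was attempted.
    boolean flush(long offset) {
        long current = recoveryPoint; // read once so the check and the range agree
        if (current > offset) {
            return false; // another thread already flushed past this offset
        }
        rangeSize(current, offset); // placeholder for collecting and flushing segments
        advanceRecoveryPoint(offset);
        return true;
    }

    long recoveryPoint() {
        return recoveryPoint;
    }
}
```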

Reviewers: Jun Rao <junrao@gmail.com>
2022-03-01 10:55:17 -08:00
Marc Löhe 14faea4aab
KAFKA-8659: fix SetSchemaMetadata failing on null value and schema (#7082)
Make SetSchemaMetadata SMT ignore records with null value and valueSchema or key and keySchema.

The transform has been unit tested for handling null values gracefully while still providing the necessary validation for non-null values.

Reviewers: Konstantine Karantasis <konstantine@confluent.io>, Bill Bejeck <bbejeck@apache.org>
2022-03-01 10:10:43 -05:00
Hao Li 2ccc834faa
KAFKA-13542: add rebalance reason in Kafka Streams (#11804)
Add rebalance reason in Kafka Streams.

Reviewers: Luke Chen <showuon@gmail.com>, Bruno Cadonna <cadonna@apache.org>
2022-02-28 18:26:46 +01:00
Jason Gustafson 5f91aa7b4c
KAFKA-13698; KRaft authorizer should use host address instead of name (#11807)
Use `InetAddress.getHostAddress` in `StandardAuthorizer` instead of `InetAddress.getHostName`.
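
The difference between the two calls can be seen in a small standalone example. `HostAddressSketch` and its helper are hypothetical, and the host name and address are documentation placeholders. `getHostAddress` returns the IP literal, while `getHostName` returns a name (and may perform a reverse lookup when none is attached to the address).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class HostAddressSketch {
    // Builds an address with an attached name, without contacting any name service.
    static InetAddress literal(String host, int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(host, new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e); // only thrown for an illegal address length
        }
    }

    public static void main(String[] args) {
        InetAddress client = literal("client.example", 192, 0, 2, 10);
        System.out.println(client.getHostName());    // the attached name
        System.out.println(client.getHostAddress()); // the textual IP address
    }
}
```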

Reviewers: Colin Patrick McCabe <cmccabe@confluent.io>
2022-02-26 10:52:34 -08:00
Walker Carlson abb74d406a
KAFKA-13281: allow #removeNamedTopology while in the CREATED state (#11810)
We should be able to change the topologies while still in the CREATED state. We already allow adding them, but this should include removing them as well

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
2022-02-25 19:11:06 -08:00
Walker Carlson 29317e6953
KAFKA-13281: add API to expose current NamedTopology set (#11808)
List all the named topologies that have been added to this client

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
2022-02-25 19:04:07 -08:00
Jason Gustafson 2c90447a59
KAFKA-13697; KRaft authorizer should support AclOperation.ALL (#11806)
KRaft authorizer should support AclOperation.ALL correctly.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2022-02-25 15:43:21 -08:00
A. Sophie Blee-Goldman c2ee1411c8
KAFKA-12738: send LeaveGroup request when thread dies to optimize replacement time (#11801)
Quick followup to #11787 to optimize the impact of the task backoff by reducing the time to replace a thread. When a thread is going through a dirty close, i.e. shutting down from an uncaught exception, we should send a LeaveGroup request so that the broker acknowledges the thread has died and won't wait up to the `session.timeout` for it to join the group if the user opts to `REPLACE_THREAD` in the handler.

Reviewers: Walker Carlson <wcarlson@confluent.io>, John Roesler <vvcephei@apache.org>
2022-02-24 16:18:13 -08:00
Zhang Hongyi 15ebad54b4
MINOR: Skip fsync on parent directory to start Kafka on ZOS (#11793)
Reviewers: Cong Ding <cong@ccding.com>, Jun Rao <junrao@gmail.com>
2022-02-24 13:26:23 -08:00
A. Sophie Blee-Goldman cd4a1cb410
KAFKA-12738: track processing errors and implement constant-time task backoff (#11787)
Part 1 in the initial series of error handling for named topologies.

Part 1: Track tasks with errors within a named topology and implement constant-time task backoff
Part 2: Implement exponential task backoff to account for recurring errors
Part 3: Pause/backoff all tasks within a named topology in case of a long backoff/frequent errors for any individual task
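
The difference between the constant backoff of Part 1 and the exponential backoff planned for Part 2 can be sketched as follows. `TaskBackoff`, both method names, and the cap parameter are hypothetical and are not taken from the Streams code.

```java
class TaskBackoff {
    // Constant backoff: a task may be processed again once a fixed wait has passed
    // since its last error.
    static boolean canProcessConstant(long nowMs, long lastErrorMs, long backoffMs) {
        return nowMs - lastErrorMs >= backoffMs;
    }

    // Exponential backoff: the wait doubles with each consecutive error, up to a cap.
    // The shift is bounded to limit overflow for reasonable base values.
    static long exponentialBackoffMs(long baseMs, int consecutiveErrors, long maxBackoffMs) {
        int shift = Math.min(Math.max(consecutiveErrors - 1, 0), 30);
        return Math.min(baseMs << shift, maxBackoffMs);
    }
}
```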

Reviewers:  Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-02-24 12:10:31 -08:00