Commit Graph

9952 Commits

Author SHA1 Message Date
A. Sophie Blee-Goldman 9c7d857713
KAFKA-12648: fix #getMinThreadVersion and include IOException + topologyName in StreamsException when topology dir cleanup fails (#11867)
Quick fix to make sure we log the actual source of the failure both in the actual log message as well as the StreamsException that we bubble up to the user's exception handler, and also to report the offending topology by filling in the StreamsException's taskId field.

Also prevents a NoSuchElementException from being thrown when trying to compute the minimum topology version across all threads when the last thread is being unregistered during shutdown.

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-09 16:30:42 -08:00
Randall Hauch d2d49f6421
KAFKA-12879: Remove extra sleep (#11872) 2022-03-09 15:11:46 -06:00
Philip Nee ddcee81043
KAFKA-12879: Addendum to reduce flakiness of tests (#11871)
This is an addendum to the KAFKA-12879 (#11797) to fix some tests that are somewhat flaky when a build machine is heavily loaded (when the timeouts are too small).

- Add an if check to void sleep(0)
- Increase timeout in the tests
2022-03-09 14:37:48 -06:00
Philip Nee 28393be6d7
KAFKA-12879: Revert changes from KAFKA-12339 and instead add retry capability to KafkaBasedLog (#11797)
Fixes the compatibility issue regarding KAFKA-12879 by reverting the changes to the admin client from KAFKA-12339 (#10152) that retry admin client operations, and instead perform the retries within Connect's `KafkaBasedLog` during startup via a new `TopicAdmin.retryEndOffsets(..)` method. This method delegates to the existing `TopicAdmin.endOffsets(...)` method, but will retry on `RetriableException` until the retry timeout elapses.

This change should be backward compatible to the KAFKA-12339 so that when Connect's `KafkaBasedLog` starts up it will retry attempts to read the end offsets for the log's topic. The `KafkaBasedLog` existing thread already has its own retry logic, and this is not changed.

Added more unit tests, and thoroughly tested the new `RetryUtil` used to encapsulate the parameterized retry logic around any supplied function.
2022-03-09 12:39:28 -06:00
Jason Koch 2367c8994b
KAFKA-13630: Reduce amount of time that producer network thread holds batch queue lock (#11722)
Hold the `deque` lock for only as long as is required to collect and make a decision in
`ready()` and `drain()` loops. Once this is done, remaining work can be done without lock,
so release it. This allows producers to continue appending.

For an application with with a single producer thread and a high send() rate, this change
reduces spinlock CPU cycles from 14.6% to 2.5% of the send() path, or more
clearly a 12.1% improvement in efficiency for the send() path by reducing the duration of
contention events with the network thread. Note that this application was executed with
Java 8, which has a slower crc32c implementation.

Reviewers: Luke Chen <showuon@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Artem Livshits <84364232+artemlivshits@users.noreply.github.com>
2022-03-09 05:41:06 -08:00
Adam Kotwasinski add11eed75
MINOR: Correct logging and Javadoc in FetchSessionHandler (#11843)
Reviewers: Justine Olshan <jolshan@confluent.io>, David Jacot <david.jacot@gmail.com>, Luke Chen <showuon@gmail.com>
2022-03-09 16:51:26 +08:00
David Jacot 69926b5193
MINOR: Clean up AlterIsrManager code (#11832)
Reviewers: Justine Olshan <jolshan@confluent.io>, Jason Gustafson <jason@confluent.io>
2022-03-09 07:31:07 +01:00
John Roesler 717f9e2149
MINOR: Restructure ConsistencyVectorIntegrationTest (#11848)
Reviewers: YEONCHEOL JANG <@YeonCheolGit>, Matthias J. Sax <mjsax@apache.org>
2022-03-08 13:59:58 -06:00
Vincent Jiang b27000ec6a
MINOR: Fix flaky test cases SocketServerTest.remoteCloseWithoutBufferedReceives and SocketServerTest.remoteCloseWithIncompleteBufferedReceive (#11861)
When a socket is closed, corresponding channel should be retained only if there is complete buffered requests.

Reviewers: David Jacot <djacot@confluent.io>
2022-03-08 19:03:11 +01:00
John Roesler 10f34ce6b3
MINOR: Clarify acceptable recovery lag config doc (#11411)
Reviewers: A. Sophie Blee-Goldman <ableegoldman@apache.org>, Andrew Eugene Choi < @andrewchoi5 >
2022-03-08 10:42:36 -06:00
A. Sophie Blee-Goldman fc7133d52d
KAFKA-12648: fix bug where thread is re-added to TopologyMetadata when shutting down (#11857)
We used to call TopologyMetadata#maybeNotifyTopologyVersionWaitersAndUpdateThreadsTopologyVersion when a thread was being unregistered/shutting down, to check if any of the futures listening for topology updates had been waiting on this thread and could be completed. Prior to invoking this we make sure to remove the current thread from the TopologyMetadata's threadVersions map, but this thread is actually then re-added in the #maybeNotifyTopologyVersionWaitersAndUpdateThreadsTopologyVersion call.

To fix this, we should break up this method into separate calls for each of its two distinct functions, updating the version and checking for topology update completion. When unregistering a thread, we should only invoke the latter method

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-07 23:59:43 -08:00
Luke Chen 1848f049e1
KAFKA-13710: bring the InvalidTimestampException back for record error (#11853)
Reviewers: Guozhang Wang <guozhang@confluent.io>, Ricardo Brasil <anribrasil@gmail.com>
2022-03-08 14:28:16 +08:00
A. Sophie Blee-Goldman 539f006e65
KAFKA-12648: fix NPE due to race condtion between resetting offsets and removing a topology (#11847)
While debugging the flaky NamedTopologyIntegrationTest. shouldRemoveOneNamedTopologyWhileAnotherContinuesProcessing test, I did discover one real bug. The problem was that we update the TopologyMetadata's builders map (with the known topologies) inside the #removeNamedTopology call directly, whereas the StreamThread may not yet have reached the poll() in the loop and in case of an offset reset, we get an NP.e
I changed the NPE to just log a warning for now, going forward I think we should try to tackle some tech debt by keeping the processing tasks and the TopologyMetadata in sync

Also includes a quick fix on the side where we were re-adding the topology waiter/KafkaFuture for a thread being shut down

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-07 11:09:18 -08:00
Mickael Maison bbb2dc54a0
KAFKA-13671: Add ppc64le build stage (#11833)
Reviewers: David Arthur <mumrah@gmail.com>
2022-03-07 10:18:54 +01:00
Tim Patterson e3ef29ea03
KAFKA-12959: Distribute standby and active tasks across threads to better balance load between threads (#11493)
Balance standby and active stateful tasks evenly across threads

Reviewer: Luke Chen <showuon@gmail.com>
2022-03-05 16:11:42 +08:00
RivenSun 0dac4b4267
KAFKA-13689: printing unused and unknown logs separately (#11800)
Differentiate between unused and unknown configs during log output.

Reviewer: Luke Chen <showuon@gmail.com>
2022-03-05 16:08:14 +08:00
RivenSun 3be978464c
KAFKA-13694: Log more specific information when the verification record fails on brokers. (#11830)
Reviewers: Guozhang Wang <wangguoz@gmail.com>
2022-03-04 10:45:44 -08:00
A. Sophie Blee-Goldman 11143d4883
MINOR: fix flaky shouldRemoveOneNamedTopologyWhileAnotherContinuesProcessing (#11827)
This test has been failing somewhat regularly due to going into the ERROR state before reaching RUNNING during the startup phase. The problem is that we are reusing the DELAYED_INPUT_STREAM topics, which had previously been assumed to be uniquely owned by a particular test. We should make sure to delete and re-create these topics for any test that uses them.
2022-03-04 10:31:37 -08:00
A. Sophie Blee-Goldman 6f54faed2d
KAFKA-12648: fix #add/removeNamedTopology blocking behavior when app is in CREATED (#11813)
Currently the #add/removeNamedTopology APIs behave a little wonky when the application is still in CREATED. Since adding and removing topologies runs some validation steps there is valid reason to want to add or remove a topology on a dummy app that you don't plan to start, or a real app that you haven't started yet. But to actually check the results of the validation you need to call get() on the future, so we need to make sure that get() won't block forever in the case of no failure -- as is currently the case

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-04 09:58:56 -08:00
Vincent Jiang 95dbba9fe5
KAFKA-13706: Remove closed connections from MockSelector.ready (#11839)
Reviewers: David Jacot <djacot@confluent.io>
2022-03-04 09:51:53 +01:00
wangyap ae76b9d45a
KAFKA-13466: delete unused config batch.size in kafka-console-producer.sh (#11517)
delete unused config batch.size in kafka-console-producer.sh

Reviewer: Andrew Eugene Choi <andrew.choi@uwaterloo.ca>, Luke Chen <showuon@gmail.com>,
2022-03-04 09:47:23 +08:00
Justin Lee f5d8fb2b0b
(docs) Add JavaDocs for org.apache.kafka.common.security.oauthbearer.secured (#11811)
Reviewers:  Luke Chen <showuon@confluent.io>, Jun Rao <junrao@gmail.com>
2022-03-03 10:13:01 -08:00
Luke Chen 7c280c1d5f
KAFKA-13673: disable idempotence when config conflicts (#11788)
Disable idempotence when conflicting config values for acks, retries
and max.in.flight.requests.per.connection are set by the user. For the
former two configs, we log at info level when we disable idempotence
due to conflicting configs. For the latter, we log at warn level since
it's due to an implementation detail that is likely to be surprising.

This mitigates compatibility impact of enabling idempotence by default.

Added unit tests to verify the change in behavior.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>, Mickael Maison <mickael.maison@gmail.com>
2022-03-03 05:40:41 -08:00
Mickael Maison 029a14b530
KAFKA-13510: Connect APIs to list all connector plugins and retrieve their configs (#11572)
Implements KIP-769: https://cwiki.apache.org/confluence/display/KAFKA/KIP-769%3A+Connect+APIs+to+list+all+connector+plugins+and+retrieve+their+configuration+definitions

Reviewers: Tom Bentley <tbentley@redhat.com>, Chris Egerton <fearthecellos@gmail.com>
2022-03-03 14:28:50 +01:00
Chris Egerton 066cdc8c62
KAFKA-10000: Add producer fencing API to admin client (KIP-618) (#11777)
* KAFKA-10000: Add producer fencing API to admin client

Reviewers: Luke Chen <showuon@gmail.com>, Tom Bentley <tbentley@redhat.com>
2022-03-03 10:27:17 +00:00
Levani Kokhreidze 62e646619b
KAFKA-6718 / Rack aware standby task assignor (#10851)
This PR is part of KIP-708 and adds rack aware standby task assignment logic.

Reviewer: Bruno Cadonna <cadonna@apache.org>, Luke Chen <showuon@gmail.com>, Vladimir Sitnikov <vladimirsitnikov.apache.org>
2022-03-03 11:37:26 +08:00
Colin Patrick McCabe 07553d13f7
MINOR: create KafkaConfigSchema and TimelineObject (#11809)
Create KafkaConfigSchema to encapsulate the concept of determining the types of configuration keys.
This is useful in the controller because we can't import KafkaConfig, which is part of core. Also
introduce the TimelineObject class, which is a more generic version of TimelineInteger /
TimelineLong.

Reviewers: David Arthur <mumrah@gmail.com>
2022-03-02 14:26:31 -08:00
A. Sophie Blee-Goldman f089bea7ed
MINOR: set log4j.logger.kafka and all Config logger levels to ERROR for Streams tests (#11823)
Pretty much any time we have an integration test failure that's flaky or only exposed when running on Jenkins through the PR builds, it's impossible to debug if it cannot be reproduced locally as the logs attached to the test results have truncated the entire useful part of the logs. This is due to the logs being flooded at the beginning of the test when the Kafka cluster is coming up, eating up all of the allotted characters before we even get to the actual Streams test. Setting log4j.logger.kafka to ERROR greatly improves the situation and cuts down on most of the excessive logging in my local runs. To improve things even more and have some hope of getting the part of the logs we actually need, I also set the loggers for all of the Config objects to ERROR, as these print out the value of every single config (of which there are a lot) and are not useful as we can easily figure out what the configs were if necessary by just inspecting the test locally.

Reviewers:  Luke Chen <showuon@confluent.io>,  Guozhang Wang <guozhang@confluent.io>
2022-03-01 21:58:10 -08:00
John Roesler 7172f35807
MINOR: Improve test assertions for IQv2 (#11828)
Reviewer: Bill Bejeck <bbejeck@apache.org>
2022-03-01 20:30:29 -06:00
A. Sophie Blee-Goldman 84f8c90b13
KAFKA-12648: standardize startup timeout to fix some flaky NamedTopologyIntegrationTest tests (#11824)
Seen a few of the new tests added fail on PR builds lately with 

"java.lang.AssertionError: Expected all streams instances in [org.apache.kafka.streams.processor.internals.namedtopology.KafkaStreamsNamedTopologyWrapper@7fb3e6b0] to be RUNNING within 30000 ms"

We already had some tests using the 30s timeout while others were bumped all the way up to 60s, I figured we should try out a default timeout of 45s and if we still see failures in specific tests we can go from there
2022-03-01 13:15:53 -08:00
A. Sophie Blee-Goldman 6eb57f6df1
KAFKA-12738: address minor followup and consolidate integration tests of PR #11787 (#11812)
This PR addresses the remaining nits from the final review of #11787

It also deletes two integration test classes which had only one test in them, and moves the tests to another test class file to save on the time to bring up an entire embedded kafka cluster just for a single run

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-03-01 12:59:18 -08:00
Kowshik Prakasam 67e99a4236
MINOR: Ensure LocalLog.flush is thread safe to recoveryPoint changes (#11814)
Issue:
Imagine a scenario where two threads T1 and T2 are inside UnifiedLog.flush() concurrently:

KafkaScheduler thread T1 -> The periodic work calls LogManager.flushDirtyLogs() which in turn calls UnifiedLog.flush(). For example, this can happen due to log.flush.scheduler.interval.ms here.
KafkaScheduler thread T2 -> A UnifiedLog.flush() call is triggered asynchronously during segment roll here.
Supposing if thread T1 advances the recovery point beyond the flush offset of thread T2, then this could trip the check within LogSegments.values() here for thread T2, when it is called from LocalLog.flush() here. The exception causes the KafkaScheduler thread to die, which is not desirable.

Fix:
We fix this by ensuring that LocalLog.flush() is immune to the case where the recoveryPoint advances beyond the flush offset.

Reviewers: Jun Rao <junrao@gmail.com>
2022-03-01 10:55:17 -08:00
Marc Löhe 14faea4aab
KAFKA-8659: fix SetSchemaMetadata failing on null value and schema (#7082)
Make SetSchemaMetadata SMT ignore records with null value and valueSchema or key and keySchema.

The transform has been unit tested for handling null values gracefully while still providing the necessary validation for non-null values.

Reviewers: Konstantine Karantasis<konstantine@confluent.io>, Bill Bejeck <bbejeck@apache.org>
2022-03-01 10:10:43 -05:00
Hao Li 2ccc834faa
KAFKA-13542: add rebalance reason in Kafka Streams (#11804)
Add rebalance reason in Kafka Streams.

Reviewers: Luke Chen <showuon@gmail.com>, Bruno Cadonna <cadonna@apache.org>
2022-02-28 18:26:46 +01:00
Jason Gustafson 5f91aa7b4c
KAFKA-13698; KRaft authorizer should use host address instead of name (#11807)
Use `InetAddress.getHostAddress` in `StandardAuthorizer` instead of `InetAddress.getHostName`.

Reviewers: Colin Patrick McCabe <cmccabe@confluent.io>
2022-02-26 10:52:34 -08:00
Walker Carlson abb74d406a
KAFKA-13281: allow #removeNamedTopology while in the CREATED state (#11810)
We should be able to change the topologies while still in the CREATED state. We already allow adding them, but this should include removing them as well

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
2022-02-25 19:11:06 -08:00
Walker Carlson 29317e6953
KAFKA-13281: add API to expose current NamedTopology set (#11808)
List all the named topologies that have been added to this client

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
2022-02-25 19:04:07 -08:00
Jason Gustafson 2c90447a59
KAFKA-13697; KRaft authorizer should support AclOperation.ALL (#11806)
KRaft authorizer should support AclOperation.ALL correctly.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2022-02-25 15:43:21 -08:00
A. Sophie Blee-Goldman c2ee1411c8
KAFKA-12738: send LeaveGroup request when thread dies to optimize replacement time (#11801)
Quick followup to #11787 to optimize the impact of the task backoff by reducing the time to replace a thread. When a thread is going through a dirty close, ie shutting down from an uncaught exception, we should be sending a LeaveGroup request to make sure the broker acknowledges the thread has died and won't wait up to the `session.timeout` for it to join the group if the user opts to `REPLACE_THREAD` in the handler

Reviewers: Walker Carlson <wcarlson@confluent.io>, John Roesler <vvcephei@apache.org>
2022-02-24 16:18:13 -08:00
Zhang Hongyi 15ebad54b4
MINOR: Skip fsync on parent directory to start Kafka on ZOS (#11793)
Reviewers: Cong Ding <cong@ccding.com>, Jun Rao <junrao@gmail.com>
2022-02-24 13:26:23 -08:00
A. Sophie Blee-Goldman cd4a1cb410
KAFKA-12738: track processing errors and implement constant-time task backoff (#11787)
Part 1 in the initial series of error handling for named topologies.

*Part 1: Track tasks with errors within a named topology & implement constant-time based task backoff
Part 2: Implement exponential task backoff to account for recurring errors
Part 3: Pause/backoff all tasks within a named topology in case of a long backoff/frequent errors for any individual task

Reviewers:  Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2022-02-24 12:10:31 -08:00
Jason Gustafson 711b603ddc
MINOR: Cleanup admin creation logic in integration tests (#11790)
There seemed to be a little sloppiness in the integration tests in regard to admin client creation. Not only was there duplicated logic, but it wasn't always clear which listener the admin client was targeting. This made it difficult to tell in the context of authorization tests whether we were indeed testing with the right principal. As an example, we had a method in TestUtils which was using the inter-broker listener implicitly. This meant that the test was using the broker principal which had super user privilege. This was intentional, but I think it would be clearer to make the dependence on this listener explicit. This patch attempts to clean this up a bit by consolidating some of the admin creation logic and making the reliance on the listener clearer.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
2022-02-24 07:37:28 -08:00
Bruno Cadonna 8d88b20b27
KAFKA-10199: Add interface for state updater (#11499)
Reviewers: Andrew Eugene Choi <andrew.choi@uwaterloo.ca>, Guozhang Wang <wangguoz@gmail.com>
2022-02-23 10:13:08 -08:00
Chris Egerton 6f09c3f88b
KAFKA-10000: Utils methods for overriding user-supplied properties and dealing with Enum types (#11774)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2022-02-23 14:49:30 +01:00
Chris Egerton 6bef673197
KAFKA-10000: Add new metrics for source task transactions (#11772)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2022-02-23 14:48:43 +01:00
Walker Carlson d8cf47bf28
KAFKA-13676: Commit successfully processed tasks on error (#11791)
When we hit an exception when processing tasks we should save the work we have done so far.
This will only be relevant with ALOS and EOS-v1, not EOS-v2. It will actually reduce the number of duplicated record in ALOS because we will not be successfully processing tasks successfully more than once in many cases.

This is currently enabled only for named topologies.

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>, Guozhang Wang <guozhang@confluent.io>
2022-02-22 23:10:05 -08:00
Julien Chanaud a5bb45c11a
KAFKA-13511: Add support for different unix precisions in TimestampConverter SMT (#11575)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Tom Bentley <tbentley@redhat.com>
2022-02-22 17:17:16 +01:00
Mickael Maison 576496a1ca
MINOR: Improve Connect docs (#11642)
- Fix indendation of code blocks
- Add links to all SMTs and Predicates

Reviewers: Luke Chen <showuon@gmail.com>, Tom Bentley <tbentley@redhat.com>
2022-02-22 20:49:53 +08:00
Mickael Maison 2dc4d5d95e
MINOR: Add links to connector configs in TOC (#11794)
Reviewers: Luke Chen <showuon@gmail.com>, Andrew Eugene Choi <andrew.choi@uwaterloo.ca>
2022-02-22 10:54:53 +08:00
Chris Egerton 4d036ee871
MINOR: Clarify logging behavior with errors.log.include.messages property (#11758)
The docs are a little misleading and some users can be confused about the exact behavior of this property.
2022-02-21 07:55:04 -06:00