Commit Graph

9909 Commits

Author SHA1 Message Date
David Arthur a7369bd52f KAFKA-14136 Generate ConfigRecord for brokers even if the value is unchanged (#12483) 2022-08-04 15:19:49 -04:00
David Arthur 4e049c706f KAFKA-14111 Fix sensitive dynamic broker configs in KRaft (#12455)
Enable some of the dynamic broker reconfiguration tests in KRaft mode
2022-08-04 15:19:38 -04:00
David Arthur 89b2bf257b MINOR: Update 3.2 branch to 3.2.2-SNAPSHOT 2022-07-28 16:42:46 -04:00
David Arthur b172a0a94f Bump version to 3.2.1 2022-07-21 20:33:07 -04:00
Viktor Somogyi-Vass 8464e36682
KAFKA-13917: Avoid calling lookupCoordinator() in tight loop (#12417)
Reviewers: Luke Chen <showuon@gmail.com>
2022-07-21 20:04:39 -04:00
David Arthur cb14b100ad
Add 3.2.1 upgrade docs (#12424)
Reviewers: Randall Hauch <rhauch@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
2022-07-20 15:26:26 -04:00
Shawn d8541b20a1 KAFKA-14024: Consumer keeps Commit offset in onJoinPrepare in Cooperative rebalance (#12349)
In KAFKA-13310, we tried to fix an issue where consumer#poll(duration) returns later than the provided duration. This happens because, if a rebalance is needed, we first try to commit the current offsets synchronously before rebalancing. If the offset commit takes too long, consumer#poll spends more time than the provided duration. To fix that, we changed the synchronous commit to an asynchronous commit before the rebalance (i.e. in onJoinPrepare).

However, in this ticket, we found that the async commit keeps sending a new commit request during each Consumer#poll, because the offset commit never completes in time. The impact is that the existing consumer is kicked out of the group after the rebalance timeout without rejoining the group. That is, suppose we have consumer A in group G and consumer B now joins the group: after the rebalance, only consumer B is in the group.

Besides, another bug was found while fixing this one. Before KAFKA-13310, we committed offsets synchronously with rebalanceTimeout, which retries on retriable errors until the timeout. After KAFKA-13310, we thought we had retries, but the retry happens after the partitions are revoked. That is, even if the retried offset commit succeeds, some partition offsets are still left uncommitted, and after the rebalance other consumers will consume overlapping records.

Reviewers: RivenSun <riven.sun@zoom.us>, Luke Chen <showuon@gmail.com>
2022-07-20 10:05:23 +08:00
Walker Carlson 56a136d820 Revert "KAFKA-12887 Skip some RuntimeExceptions from exception handler (#11228)" (#12421)
This reverts commit 4835c64f

Reviewers: Matthias J. Sax <matthias@confluent.io>
2022-07-19 09:36:44 -07:00
Christopher L. Shannon ca674d9e17
KAFKA-14079 - Ack failed records in WorkerSourceTask when error tolerance is ALL (#12412)
Make sure to ack all records where produce failed, when a connector's `errors.tolerance` config property is set to `all`. Acking is essential so that the task will continue to commit future record offsets properly and remove the records from internal tracking, preventing a memory leak.

Reviewers: Chris Egerton <fearthecellos@gmail.com>, Randall Hauch <rhauch@gmail.com>
2022-07-18 17:06:45 -05:00
Okada Haruki b072b3739b KAFKA-13572 Fix negative preferred replica imbalanced count metric (#12405)
Currently, the preferredReplicaImbalanceCount calculation has a race that can make the metric negative when topic deletion is initiated simultaneously. This PR addresses the problem by ensuring that cleanPreferredReplicaImbalanceMetric is called only once per topic-deletion procedure.

Reviewers: Luke Chen <showuon@gmail.com>
2022-07-18 14:20:42 +08:00
Kirk True c873d9d7ae KAFKA-14062: OAuth client token refresh fails with SASL extensions (#12398)
- Different objects should be considered unique even with same content to support logout
- Added comments for SaslExtension re: removal of equals and hashCode
- Also swapped out the use of mocks in exchange for *real* SaslExtensions so that we exercise the use of default equals() and hashCode() methods.
- Updates to implement equals and hashCode and add tests in SaslExtensionsTest to confirm
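
The first bullet above relies on identity semantics: two objects with the same content must still count as distinct. A minimal sketch of that behavior, assuming a hypothetical `Extensions` stand-in class (not Kafka's `SaslExtensions`) that deliberately keeps the default `equals()` and `hashCode()`:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class IdentityEqualitySketch {
    // Hypothetical stand-in class. It does NOT override equals/hashCode,
    // so two instances are equal only if they are the same object.
    static final class Extensions {
        private final Map<String, String> values;

        Extensions(Map<String, String> values) {
            this.values = Collections.unmodifiableMap(new HashMap<>(values));
        }

        Map<String, String> map() {
            return values;
        }
    }

    // Adds two instances with identical content to a set, removes one,
    // and reports how many remain.
    static int remainingAfterRemovingOne() {
        Map<String, String> content = Map.of("key", "value");
        Extensions loginA = new Extensions(content);
        Extensions loginB = new Extensions(content);

        Set<Extensions> credentials = new HashSet<>();
        credentials.add(loginA);
        credentials.add(loginB); // kept as a separate entry: identity equality

        credentials.remove(loginA); // removes only loginA, not loginB
        return credentials.size();
    }
}
```

With content-based equality the second `add` would have been a no-op and removing one login would have removed the only entry.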

Co-authored-by: Purshotam Chauhan <pchauhan@confluent.io>

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
2022-07-12 14:30:50 +05:30
Eugene Tolbakov 74e24deedb KAFKA-14013: Limit the length of the `reason` field sent on the wire (#12388)
KIP-800 added the `reason` field to the JoinGroupRequest and the LeaveGroupRequest as a means to provide more information to the group coordinator. In https://issues.apache.org/jira/browse/KAFKA-13998, we discovered that the size of the field is limited to 32767 chars by our serialisation mechanism. At the moment, the field, whether provided directly by the user or constructed internally, is set as-is regardless of its length.

This patch sends only the first 255 chars of the user provided or internally generated reason on the wire. Given the purpose of this field, that seems acceptable and should still provide enough information for operators to understand the cause of a rebalance.
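
The truncation described above can be sketched as follows. The class, method, and constant names are hypothetical and chosen for illustration; only the 255-char limit comes from the commit message.

```java
final class ReasonTruncationSketch {
    // Limit taken from the commit message; the constant name is an assumption.
    static final int MAX_REASON_LENGTH = 255;

    // Returns the reason unchanged if it fits, otherwise only its first 255 chars.
    static String truncate(String reason) {
        if (reason == null || reason.length() <= MAX_REASON_LENGTH) {
            return reason;
        }
        return reason.substring(0, MAX_REASON_LENGTH);
    }
}
```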

Reviewers: David Jacot <djacot@confluent.io>
2022-07-12 09:38:06 +02:00
Jason Gustafson b29cb162e3 KAFKA-14055; Txn markers should not be removed by matching records in the offset map (#12390)
When cleaning a topic with transactional data, if the keys used in the user data happen to conflict with the keys in the transaction markers, it is possible for the markers to get removed before the corresponding data from the transaction is removed. This results in a hanging transaction or the loss of the transaction's atomicity since it would effectively get bundled into the next transaction in the log. Currently control records are excluded when building the offset map, but not when doing the cleaning. This patch fixes the problem by checking for control batches in the `shouldRetainRecord` callback.
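
A rough sketch of the idea, not the actual cleaner code: the retention decision skips the key-based offset map lookup for control batches. The method signature and the `Map`-based offset map are assumptions made for illustration, and the eventual removal of markers is assumed to be handled by separate logic.

```java
import java.util.Map;

final class CleanerRetentionSketch {
    // offsetMap holds, per key, the latest offset seen for that key (hypothetical shape).
    static boolean shouldRetainRecord(boolean isControlBatch,
                                      String key,
                                      long offset,
                                      Map<String, Long> offsetMap) {
        if (isControlBatch) {
            // Transaction markers are never dropped because a user key happens to match.
            return true;
        }
        Long latestOffset = offsetMap.get(key);
        // Keep the record if its key is unknown or it is the latest record for that key.
        return latestOffset == null || offset >= latestOffset;
    }
}
```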

Reviewers: Jun Rao <junrao@gmail.com>
2022-07-10 10:20:09 -07:00
Divij Vaidya 1180f5e10c KAFKA-13474: Allow reconfiguration of SSL certs for broker to controller connection (#12381)
What:
When a certificate is rotated on a broker via dynamic configuration and the previous certificate expires, the broker to controller connection starts failing with SSL Handshake failed.

Why:
A similar fix was made earlier in #6721, but when BrokerToControllerChannelManager was introduced in v2.7, we didn't enable dynamic reconfiguration for its channel.

Summary of testing strategy (including rationale)
Add a test which fails prior to the fix done in the PR and succeeds afterwards. The bug wasn't caught earlier because there was no test coverage to validate the scenario.

Reviewers: Luke Chen <showuon@gmail.com>
2022-07-09 18:09:44 +08:00
Niket d4a01afe5e KAFKA-14035; Fix NPE in `SnapshottableHashTable::mergeFrom()` (#12371)
The NPE causes the KRaft controller to be in an inconsistent state.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-06-30 21:05:00 -07:00
Luke Chen f1d4e6c726
KAFKA-14010: AlterPartition request won't retry when receiving retriable error (#12362)
Reviewers: David Jacot <djacot@confluent.io>
2022-07-01 10:51:57 +08:00
James Hughes 8fb8b496cc KAFKA-13998: JoinGroupRequestData 'reason' can be too large (#12298)
The `reason` field cannot contain more than 32767 chars. We did not expect to ever reach this, but it turns out to be possible if the message provided in the `Throwable` somehow contains the entire stack trace. This patch ensures that the reason crafted from exceptions remains small.

Co-authored-by: David Jacot <djacot@confluent.io>

Reviewers:  Bruno Cadonna <cadonna@apache.org>, A. Sophie Blee-Goldman <ableegoldman@apache.org>, David Jacot <djacot@confluent.io>
2022-06-20 16:48:06 +02:00
Guozhang Wang b61edf2037 HOTFIX: add space to avoid checkstyle failure 2022-06-06 11:34:13 -07:00
Guozhang Wang 173b8fd26d HOTFIX: only try to clear discover-coordinator future upon commit (#12244)
This is another way of fixing KAFKA-13563 other than #11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (either by the consumer itself or triggered by the caller); this is the case we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (either by the consumer itself based on the reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that in manual assign mode, if none of the above three is triggered, we never discover the coordinator. The original fix in #11631 would let the consumer discover the coordinator even if none of the above operations were required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
2022-06-06 11:07:11 -07:00
Luke Chen 90db4f47d6 KAFKA-13773: catch kafkaStorageException to avoid broker shutdown directly (#12136)
When the logManager starts up and runs loadLogs, we expect to catch any IOException (e.g. an out-of-space error) and mark the log dir as offline. Later, we handle the offline logDir in ReplicaManager, so that the cleanShutdown file won't be created when all logDirs are offline. The reason the broker shuts down with a cleanShutdown file after a full disk is that, during loadLogs and log recovery, we write the leader-epoch-checkpoint file. If any IOException is thrown, we wrap it in a KafkaStorageException and rethrow. Since we don't catch KafkaStorageException, the exception is caught elsewhere and goes down the clean shutdown path.

This PR fixes the issue by catching KafkaStorageException with an IOException cause during loadLogs and marking the logDir as offline, letting the ReplicaManager handle the offline logDirs.

Reviewers: Jun Rao <jun@confluent.io>, Alok Thatikunta <alok123thatikunta@gmail.com>
2022-06-04 16:20:43 +08:00
nicolasguyomar 04e6b6e40e MINOR: Replace left single quote with single quote in Connect worker's log message (#12201)
Minor change to use ' and not LEFT SINGLE QUOTATION MARK in this log message, as it's the only place we are using such a quote and it can break ingestion pipelines

Reviewers: Kvicii <Karonazaba@gmail.com>, Divij Vaidya <diviv@amazon.com>, Konstantine Karantasis <k.karantasis@gmail.com>
2022-05-25 10:42:27 -07:00
Lucas Bradstreet f463b6c8a4 MINOR: fix Connect system test runs with JDK 10+ (#12202)
When running our Connect system tests with JDK 10+, we hit the error 
    AttributeError: 'ClusterNode' object has no attribute 'version'
because util.py attempts to check the version variable for non-Kafka service objects.

Reviewers: Konstantine Karantasis <k.karantasis@gmail.com>
2022-05-25 10:26:15 -07:00
Akhilesh Chaganti 49226721c0 KAFKA-13861; Fix the validateOnly behavior for CreatePartitions requests in KRaft mode (#12106)
The KRaft implementation of `CreatePartitions` ignores the `validateOnly` flag in the
request and creates the partitions if the validations are successful. Fixed the behavior
so that partitions are not created upon validation if the `validateOnly` flag is true.
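
The intended semantics can be sketched as "always validate, apply only when `validateOnly` is false". The class below is a hypothetical in-memory model and not the KRaft controller code; the validation rule (topic exists, count must grow) is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;

final class CreatePartitionsSketch {
    private final Map<String, Integer> partitionCounts = new HashMap<>();

    CreatePartitionsSketch(Map<String, Integer> initialCounts) {
        partitionCounts.putAll(initialCounts);
    }

    // Returns whether the request is valid. State changes only if it is valid
    // AND validateOnly is false.
    boolean createPartitions(String topic, int newCount, boolean validateOnly) {
        Integer current = partitionCounts.get(topic);
        boolean valid = current != null && newCount > current;
        if (valid && !validateOnly) {
            partitionCounts.put(topic, newCount);
        }
        return valid;
    }

    int partitionCount(String topic) {
        return partitionCounts.getOrDefault(topic, 0);
    }
}
```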

Reviewers: Divij Vaidya <divijvaidya13@gmail.com>, dengziming <dengziming1993@gmail.com>, Jason Gustafson <jason@confluent.io>
2022-05-11 11:14:45 -07:00
Ismael Juma abcaa109e0 MINOR: Remove kraft authorizer from list of missing features (#12146)
Also tweak the list of missing features a bit

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
2022-05-11 09:46:57 -07:00
Ismael Juma c761ee0348 MINOR: reload4j build dependency fixes (#12144)
* Replace `log4j` with `reload4j` in `copyDependantLibs`. Since we have
  some projects that have an explicit `reload4j` dependency, it
  was included in the final release tar - i.e. it was effectively
  a workaround for this bug.
* Exclude `log4j` and `slf4j-log4j12` transitive dependencies for
  `streams:upgrade-system-tests`. Versions 0100 and 0101
  had a transitive dependency to `log4j` and `slf4j-log4j12` via
  `zkclient` and `zookeeper`. This avoids classpath conflicts that lead
  to [NoSuchFieldError](https://github.com/qos-ch/reload4j/issues/41) in
  system tests.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-05-10 20:27:23 -07:00
Bruno Cadonna 8c8d71c024 MINOR: Update 3.2 branch version to 3.2.1-SNAPSHOT 2022-05-09 21:15:53 +02:00
Bruno Cadonna 1d9a09c586 Merge tag '3.2.0-rc1' into 3.2
3.2.0-rc1
2022-05-09 21:03:06 +02:00
Bruno Cadonna e566e53204 MINOR: Fix link to old doc in quickstart (#12129)
In Kafka's quickstart a link points to the 2.5 Kafka Streams demo.
This PR fixes this link.
2022-05-06 11:44:37 +02:00
Bruno Cadonna 38103ffaa9 Bump version to 3.2.0 2022-05-03 14:52:45 +02:00
Bruno Cadonna aaf56bea53 MINOR: Note that slf4j-log4j in version 1.7.35+ should be used (#12114)
Adds a note to the upgrade notes to use slf4j-log4j version
1.7.35+ [1] or slf4j-reload4j to avoid possible compatibility issues
originating from the logging framework [2].

[1] https://www.slf4j.org/manual.html#swapping
[2] https://www.slf4j.org/codes.html#no_tlm

Reviewer: Ismael Juma <ismael@juma.me.uk>
2022-05-03 12:25:45 +02:00
Ismael Juma 50d88ab8d8 KAFKA-13794: Fix comparator of inflightBatchesBySequence in TransactionsManager (round 3) (#12096)
Conceptually, the ordering is defined by the producer id, producer epoch
and the sequence number. This set should generally only have entries
for the same producer id and epoch, but there is one case where
we can have conflicting `remove` calls and hence we add this as
a temporary safe fix.

We'll follow-up with a fix that ensures the original intended invariant.

Reviewers: Jason Gustafson <jason@confluent.io>, David Jacot
<djacot@confluent.io>, Luke Chen <showuon@gmail.com>
2022-04-28 06:22:08 -07:00
Philip Nee e51c43c049 KAFKA-12841: Remove an additional call of onAcknowledgement (#12064)
A bug introduced in #11689 caused an additional onAcknowledgement call to be made via the InterceptorCallback class. This is undesirable since onSendError will attempt to call onAcknowledgement once more.

Reviewers: Jun Rao <junrao@gmail.com>
2022-04-27 10:18:54 +02:00
Bruno Cadonna 5991baecdc MINOR: Make link in quickstart dynamic (#12057)
Reviewer: Matthias J. Sax <mjsax@apache.org>
2022-04-15 16:23:02 +02:00
Bruno Cadonna b87c12fc1b [MINOR] Update upgrade documentation for 3.2 (#12055)
Reviewer: Bruno Cadonna <cadonna@apache.org>
2022-04-15 10:24:19 +02:00
Bruno Cadonna 5f1e8e4d49 MINOR: Update LICENSE-binary (#12051)
Updates the license file.

Reviewer: Bill Bejeck <bbejeck@apache.org>
2022-04-14 22:25:02 +02:00
Hao Li 75b4d06043 KAFKA-13542: Add rebalance reason in Kafka Streams (#12018)
Reviewers: Bruno Cadonna <bruno@confluent.io>, David Jacot <djacot@confluent.io>
2022-04-13 13:49:56 +02:00
David Jacot 4e7cc335ee KAFKA-13828; Ensure reasons sent by the consumer are small (#12043)
This PR reworks the reasons used in the ConsumerCoordinator to ensure that they remain reasonably short.

Reviewers: Bruno Cadonna <bruno@confluent.io>
2022-04-13 13:42:54 +02:00
Luke Chen 821275e6b3 KAFKA-10405: Set purge interval explicitly in PurgeRepartitionTopicIntegrationTest (#11948)
In KIP-811, we added a new config repartition.purge.interval.ms to set the repartition purge interval. In this flaky test, we expected the purge interval to be the same as the commit interval, which is no longer correct (the default is 30 sec). Set the purge interval explicitly to fix this issue.
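
Setting both intervals explicitly might look like the sketch below. It uses plain `java.util.Properties` with string keys so that it runs without Kafka on the classpath; the helper class and method names are hypothetical, and the values a test would pick are not stated in the commit message.

```java
import java.util.Properties;

final class PurgeIntervalConfigSketch {
    // Builds Streams-style properties with the purge interval set explicitly
    // instead of relying on it matching the commit interval.
    static Properties streamsProps(long commitIntervalMs, long purgeIntervalMs) {
        Properties props = new Properties();
        props.setProperty("commit.interval.ms", Long.toString(commitIntervalMs));
        props.setProperty("repartition.purge.interval.ms", Long.toString(purgeIntervalMs));
        return props;
    }
}
```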

Reviewers: Bruno Cadonna <cadonna@apache.org>, Guozhang Wang <wangguoz@gmail.com>
2022-04-13 10:32:00 +02:00
Yang Yu 1181825b79 KAFKA-13761: KafkaLog4jAppender deadlocks when idempotence is enabled (#11939)
When a log entry is appended to a Kafka topic using KafkaLog4jAppender, the producer.send operation
may hit a deadlock if the producer network thread also tries to append a log at the same log level.
This issue is triggered when idempotence is enabled for the KafkaLog4jAppender and the producer
tries to acquire the TransactionManager lock.

This is a temporary workaround to avoid deadlocks by disabling idempotence explicitly in
KafkaLog4jAppender.

Reviewers: Luke Chen <showuon@gmail.com>, Ismael Juma <ismael@juma.me.uk>
2022-04-11 20:30:12 +02:00
Xiaoyue Xue 1483a86591 KAFKA-13794: Follow up to fix producer batch comparator (#12006)
In a comparator, objects that are not equal need to have a stable order; otherwise, binary search may not find them. Improve the producer batch comparator.

Reviewers: Luke Chen <showuon@gmail.com>
2022-04-09 11:02:01 +08:00
Alok Nikhil 424fbc9381 MINOR: Fix DescribeLogDirs API error handling for older API versions (#12017)
With KAFKA-13527 / KIP-784 we introduced a new top-level error code for
the DescribeLogDirs API for versions 3 and above. However, the change
regressed the error handling for versions less than 3 since the response
converter fails to write the non-zero error code out (rightly) for
versions lower than 3 and drops the response to the client which
eventually times out instead of receiving an empty log dirs response and
processing that as a Cluster Auth failure.

With this change, the API conditionally propagates the error code out to
the client if the request API version is 3 and above. This keeps the
semantics of the error handling the same for all versions and restores
the behavior for older versions.

See current behavior in the broker log:
```bash
[ERROR] 2022-04-08 01:22:56,406 [data-plane-kafka-request-handler-10] kafka.server.KafkaApis - [KafkaApi-0] Unexpected error handling request RequestHeader(apiKey=DESCRIBE_LOG_DIRS, apiVersion=0, clientId=sarama, correlationId=1) -- DescribeLogDirsRequestData(topics=null)
org.apache.kafka.common.errors.UnsupportedVersionException: Attempted to write a non-default errorCode at version 0
[ERROR] 2022-04-08 01:22:56,407 [data-plane-kafka-request-handler-10] kafka.server.KafkaRequestHandler - [Kafka Request Handler 10 on Broker 0], Exception when handling request
org.apache.kafka.common.errors.UnsupportedVersionException: Attempted to write a non-default errorCode at version 0
```

Reviewers: Ismael Juma <ismael@juma.me.uk>
2022-04-08 12:54:47 -07:00
Ismael Juma f059c82e6e MINOR: Fix support for custom commit ids in the build (#12014)
This regressed in ca375d8004 due to a typo. We need tests
for our builds. :)

I verified that passing the commitId via `-PcommitId=123`
works correctly.

Reviewers: Ismael Juma <ismael@juma.me.uk>
2022-04-08 08:56:57 -07:00
Tom Bentley 210b98ad45 MINOR: Mention KAFKA-13748 in release notes (#11994)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Bruno Cadonna <bruno@confluent.io>
2022-04-06 10:30:25 +01:00
Jason Gustafson df0385435e KAFKA-13782; Ensure correct partition added to txn after abort on full batch (#11995)
Fixes a regression introduced in https://github.com/apache/kafka/pull/11452. Following [KIP-480](https://cwiki.apache.org/confluence/display/KAFKA/KIP-480%3A+Sticky+Partitioner), the `Partitioner` will receive a callback when a batch has been completed so that it can choose another partition. Because of this, we have to wait until the batch has been successfully appended to the accumulator before adding the partition in `TransactionManager.maybeAddPartition`. This is still safe because the `Sender` cannot dequeue a batch from the accumulator until it has been added to the transaction successfully.

Reviewers: Artem Livshits <84364232+artemlivshits@users.noreply.github.com>, David Jacot <djacot@confluent.io>,  Tom Bentley <tbentley@redhat.com>
2022-04-05 10:23:51 -07:00
Xiaoyue Xue e17e2045f4 KAFKA-13794; Fix comparator of `inflightBatchesBySequence` in `TransactionManager` (#11991)
Fixes a bug in the comparator used to sort producer inflight batches for a topic partition. This can cause batches in the map `inflightBatchesBySequence` to be removed incorrectly: i.e. one batch may be removed by another batch with the same sequence number. This leads to an `IllegalStateException` when the inflight request finally returns. This patch fixes the comparator to check equality of the `ProducerBatch` instances if the base sequences match.
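
The failure mode can be reproduced with a small sketch. `Batch` is a hypothetical stand-in for `ProducerBatch`, and the tie-break on a unique `id` is an assumption chosen to give a deterministic total order; it is not necessarily how the patch distinguishes instances.

```java
import java.util.Comparator;
import java.util.TreeSet;

final class BatchComparatorSketch {
    // Hypothetical stand-in for a producer batch.
    static final class Batch {
        final long id;          // hypothetical unique id used only for tie-breaking
        final int baseSequence;

        Batch(long id, int baseSequence) {
            this.id = id;
            this.baseSequence = baseSequence;
        }
    }

    // Buggy ordering: distinct batches with the same base sequence compare as equal.
    static final Comparator<Batch> SEQUENCE_ONLY =
        (a, b) -> Integer.compare(a.baseSequence, b.baseSequence);

    // Ordering with a tie-break: only the same batch compares as equal.
    static final Comparator<Batch> SEQUENCE_THEN_ID = (a, b) -> {
        int bySequence = Integer.compare(a.baseSequence, b.baseSequence);
        return bySequence != 0 ? bySequence : Long.compare(a.id, b.id);
    };

    // Tracks `first`, then removes a different batch with the same base sequence.
    // Returns whether `first` is still tracked afterwards.
    static boolean firstSurvivesRemovalOfSecond(Comparator<Batch> comparator) {
        Batch first = new Batch(1L, 0);
        Batch second = new Batch(2L, 0); // same base sequence, different batch
        TreeSet<Batch> inflight = new TreeSet<>(comparator);
        inflight.add(first);
        inflight.remove(second); // second was never added
        return inflight.contains(first);
    }
}
```

With `SEQUENCE_ONLY`, removing `second` evicts `first`, which mirrors one batch being removed by another with the same sequence number.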

Reviewers: Jason Gustafson <jason@confluent.io>
2022-04-05 10:13:47 -07:00
Bounkong Khamphousone 90bd03a0a2 fix: make sliding window works without grace period (#kafka-13739) (#11928)
Fix upperbound for sliding window, making it compatible with no grace period (kafka-13739)

Added unit test for early sliding window and "normal" sliding window for both events within one time difference (small input) and above window time difference (large input).

Fixing this window interval may slightly change stream behavior, but the probability of this happening is extremely low and it is unlikely to have a significant impact on the result.

Reviewers: Leah Thomas <lthomas@confluent.io>, Bill Bejeck <bbejeck@apache.org>
2022-03-31 10:09:25 -04:00
Yu 0f25205ab4 KAFKA-13772: Partitions are not correctly re-partitioned when the fetcher thread pool is resized (#11953)
Partitions are assigned to fetcher threads based on their hash modulo the number of fetcher threads. When we resize the fetcher thread pool, we basically re-distribute all the partitions based on the new fetcher thread pool size. The issue is that the logic that resizes the fetcher thread pool updates the `fetcherThreadMap` while iterating over it. The `Map` does not give any guarantee in this case - especially when the underlying map is re-hashed - and that led to not iterating over all the fetcher threads during the process and thus in leaving some partitions in the wrong fetcher threads.
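
One way to avoid mutating a map while iterating over it is to snapshot all partitions first and then build a fresh assignment. The sketch below illustrates that approach with hypothetical names and plain strings for partitions; it is not the actual fetcher manager code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FetcherResizeSketch {
    // Partition-to-fetcher mapping: hash modulo the number of fetcher threads.
    static int fetcherIndex(String partition, int numFetchers) {
        return Math.floorMod(partition.hashCode(), numFetchers);
    }

    // Redistributes every partition for the new pool size. The current map is
    // only read; the result is built in a separate map.
    static Map<Integer, List<String>> resize(Map<Integer, List<String>> current,
                                             int newNumFetchers) {
        List<String> allPartitions = new ArrayList<>();
        for (List<String> owned : current.values()) {
            allPartitions.addAll(owned);
        }
        Map<Integer, List<String>> resized = new HashMap<>();
        for (String partition : allPartitions) {
            resized.computeIfAbsent(fetcherIndex(partition, newNumFetchers),
                                    k -> new ArrayList<>()).add(partition);
        }
        return resized;
    }
}
```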

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
2022-03-31 14:53:52 +02:00
David Jacot ce86a54bdc KAFKA-13783; Remove reason prefixing in JoinGroupRequest and LeaveGroupRequest (#11971)
KIP-800 introduced a mechanism to pass a reason in the join group request and in the leave group request. A default reason is used unless one is provided by the user. In this case, the custom reason is prefixed by the default one.

When we tried to use this in Kafka Streams, we noted a significant degradation in performance, see https://github.com/apache/kafka/pull/11873. It is not clear whether the prefixing is the root cause of the issue or not. To be on the safe side, I think that we should remove the prefixing. It does not bring much anyway, as we are still able to distinguish a custom reason from the default one on the broker side.

This patch removes the prefixing of user provided reasons. If the user provides a reason, the reason is used directly. If the reason is empty or null, the default reason is used.
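
The selection rule described above amounts to a simple fallback. A minimal sketch with hypothetical class and method names:

```java
final class JoinReasonSketch {
    // Uses the custom reason as-is when present; falls back to the default
    // when the custom reason is null or empty. No prefixing is applied.
    static String effectiveReason(String userReason, String defaultReason) {
        if (userReason == null || userReason.isEmpty()) {
            return defaultReason;
        }
        return userReason;
    }
}
```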

Reviewers: Luke Chen <showuon@gmail.com>, <jeff.kim@confluent.io>, Hao Li <hli@confluent.io>
2022-03-31 14:32:03 +02:00
dengziming 72809cce96 MINOR: Fix an uncompatible bug in GetOffsetShell (#11936)
In KIP-815 we replaced KafkaConsumer with AdminClient in GetOffsetShell. In the previous implementation, partitions were simply ignored if there was no offset for them; now, however, -1 is printed instead. This PR fixes this inconsistency.

Reviewers: David Jacot <djacot@confluent.io>, Luke Chen <showuon@gmail.com>
2022-03-31 10:35:56 +08:00
Konstantine Karantasis 42c804005e KAFKA-13748: Do not include file stream connectors in Connect's CLASSPATH and plugin.path by default (#11908)
With this change we stop including the non-production grade connectors that are meant to be used for demos and quick starts by default in the CLASSPATH and plugin.path of Connect deployments. The package of these connectors will still be shipped with the Apache Kafka distribution and will be available for explicit inclusion.

The changes have been tested through the system tests and the existing unit and integration tests. 

Reviewers: Mickael Maison <mickael.maison@gmail.com>, Randall Hauch <rhauch@gmail.com>
2022-03-30 13:17:32 -07:00