Commit Graph

10779 Commits

Author SHA1 Message Date
Viktor Somogyi-Vass 16d08e9e63
KAFKA-14978 ExactlyOnceWorkerSourceTask should remove parent metrics (#13690)
Reviewers: Chris Egerton <chrise@aiven.io>, Viktor Somogyi-Vass <viktorsomogyi@gmail.com>

Co-authored-by: Dániel Urbán <48119872+urbandan@users.noreply.github.com>
2023-05-11 13:03:04 +02:00
Yash Mayya 721a917b44 KAFKA-14974: Restore backward compatibility in KafkaBasedLog (#13688)
`KafkaBasedLog` is a widely used utility class that provides a generic implementation of a shared, compacted log of records in a Kafka topic. It isn't in Connect's public API, but has been used outside of Connect and we try to preserve backward compatibility whenever possible. KAFKA-14455 modified the two overloaded void `KafkaBasedLog::send` methods to return a `Future`. While this change is source compatible, it isn't binary compatible. We can restore backward compatibility simply by renaming the new `Future`-returning send methods, and reinstating the older send methods to delegate to the newer methods.

This refactoring changes no functionality other than restoring the older methods.
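
The pattern can be sketched as follows. This is a minimal illustration, not the actual `KafkaBasedLog` code: the class `CompatLogSketch` and the method name `sendWithReceipt` are assumptions made for the example. The change breaks binary compatibility because a JVM method descriptor includes the return type, so callers compiled against `void send(...)` cannot link against a `send` that returns `Future`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

// Hypothetical stand-in for a shared-log utility; this is not the real KafkaBasedLog.
class CompatLogSketch {
    private final List<String> records = new ArrayList<>();

    // Reinstated older signature: returns void, so callers compiled against
    // the old class still find a method with the descriptor they expect.
    public void send(String key, String value) {
        sendWithReceipt(key, value); // delegate and discard the Future
    }

    // Newer behavior under a new name (the name is an assumption for this sketch).
    public Future<Integer> sendWithReceipt(String key, String value) {
        records.add(key + "=" + value);
        return CompletableFuture.completedFuture(records.size() - 1);
    }

    public int size() {
        return records.size();
    }

    public static void main(String[] args) {
        CompatLogSketch log = new CompatLogSketch();
        log.send("k1", "v1");
        Future<Integer> receipt = log.sendWithReceipt("k2", "v2");
        System.out.println(log.size() + " records, receipt done: " + receipt.isDone());
    }
}
```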

Reviewers: Randall Hauch <rhauch@gmail.com>
2023-05-09 08:14:25 -05:00
Luke Chen 52fce15ca5
MINOR: fix compilation failure (#13684)
Reviewers: Divij Vaidya <diviv@amazon.com>, Mickael Maison <mickael.maison@gmail.com>
2023-05-09 10:23:38 +08:00
Luke Chen 92ebf0d43d fix compilation failure 2023-05-08 15:04:40 +08:00
Jason Gustafson c81795692f KAFKA-14644: Process should crash after failure in Raft IO thread (#13140)
Unexpected errors caught in the Raft IO thread should cause the process to stop. This is similar to the handling of exceptions in the controller.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2023-05-04 15:22:32 +08:00
Jason Gustafson 69fbf4c46a MINOR: Allow tagged fields with version subset of flexible version range (#13551)
The generated message types are missing a range check for the case when the tagged version range is a subset of
the flexible version range. This causes the tagged field count, which is computed correctly, to conflict with the
number of tags serialized.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2023-05-03 15:26:56 -07:00
José Armando García Sancio b7996d1152
KAFKA-14963; Do not use equals with Uuid (#13668)
`Uuid` is an object type, so instances must be compared with the `equals` method and not the `==` operator.
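
A small illustration of the difference, using `java.util.UUID` as a stand-in for Kafka's `Uuid` (an assumption made so the example runs without Kafka on the classpath):

```java
import java.util.UUID;

class UuidCompareSketch {
    // Value comparison: true when both ids carry the same bits.
    static boolean sameByEquals(UUID a, UUID b) {
        return a.equals(b);
    }

    // Reference comparison: true only when both variables point at one object.
    static boolean sameByReference(UUID a, UUID b) {
        return a == b;
    }

    public static void main(String[] args) {
        UUID a = new UUID(1L, 2L);
        UUID b = new UUID(1L, 2L); // same value, different object
        System.out.println("== says:     " + sameByReference(a, b));
        System.out.println("equals says: " + sameByEquals(a, b));
    }
}
```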
2023-05-03 10:58:30 -07:00
Luke Chen 09d794852a KAFKA-14946: fix NPE when merging the deltatable (#13653)
Fix an NPE when merging the deltatable. Because hashTier can be non-null
while deltatable is null (e.g. when removing data), we need a null check
on deltatable during the merge, as other code paths already have. Also
added tests that fail without this change.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2023-05-03 10:10:42 -07:00
Philip Nee ae72257028
KAFKA-14639: A single partition may be revoked and assigned during a single round of rebalance (#13550) (#13652)
The background is long, but the incident started in KAFKA-13419, when we observed a member reporting a topic partition it owned in the previous generation after missing a rebalance cycle due to REBALANCE_IN_PROGRESS.

This patch changes the AbstractStickyAssignor.AllSubscriptionsEqual method. In short, it should no longer check and validate only the highest generation. Instead, we consider 3 cases:
1. A member continues to hold on to its partition if there are no other owners.
2. If more than one member claims the same partition, the one with the highest generation wins.
3. If two members of the same generation claim the same partition, we log an error and remove both from the assignment (same as the current logic).
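
The three cases can be sketched as below. This is a simplified model, not the `AbstractStickyAssignor` implementation: the `Claim` type and the `resolve` method are hypothetical, and the sketch assumes each member reports a given partition at most once.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OwnerResolutionSketch {
    // Hypothetical claim: `member` reports owning `partition` as of `generation`.
    static final class Claim {
        final String member;
        final String partition;
        final int generation;

        Claim(String member, String partition, int generation) {
            this.member = member;
            this.partition = partition;
            this.generation = generation;
        }
    }

    // Returns partition -> winning member; contested partitions are left out.
    static Map<String, String> resolve(List<Claim> claims) {
        Map<String, Claim> best = new HashMap<>();
        Set<String> contested = new HashSet<>();
        for (Claim c : claims) {
            Claim current = best.get(c.partition);
            if (current == null || c.generation > current.generation) {
                best.put(c.partition, c);        // cases 1 and 2: sole or newest owner
                contested.remove(c.partition);   // an older tie no longer matters
            } else if (c.generation == current.generation) {
                contested.add(c.partition);      // case 3: tie in the same generation
            }
        }
        Map<String, String> owners = new HashMap<>();
        for (Claim c : best.values()) {
            if (!contested.contains(c.partition)) {
                owners.put(c.partition, c.member);
            }
        }
        return owners;
    }

    public static void main(String[] args) {
        List<Claim> claims = Arrays.asList(
            new Claim("A", "t-0", 3), new Claim("B", "t-0", 5),
            new Claim("A", "t-1", 4), new Claim("B", "t-1", 4));
        System.out.println(resolve(claims));
    }
}
```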

Here are some important notes that lead to the patch:
- If a member is kicked out of the group, `UNKNOWN_MEMBER_ID` will be thrown.
- It seems to be common for members to be late to joinGroup and therefore get a `REBALANCE_IN_PROGRESS` error. This is why we don't want to reset the generation: doing so might cause many revocations and be disruptive.

To summarize the current behavior of different errors:
`REBALANCE_IN_PROGRESS`
- heartbeat: requestRejoin if member state is stable
- joinGroup: rejoin immediately
- syncGroup: rejoin immediately
- commit: requestRejoin and fail the commit. This exception is raised if the generation is stale, i.e. another rebalance is already in progress.

`UNKNOWN_MEMBER_ID`
- heartbeat: resetStateAndRejoin if generation hasn't changed, otherwise ignore
- joinGroup: resetStateAndRejoin if generation unchanged, otherwise rejoin immediately
- syncGroup: resetStateAndRejoin if generation unchanged, otherwise rejoin immediately

`ILLEGAL_GENERATION`
- heartbeat: resetStateAndRejoin if generation hasn't changed, otherwise ignore
- syncGroup: the exception is raised if the generation has been reset or the member hasn't completed rebalancing; then resetStateAndRejoin if generation unchanged, otherwise rejoin immediately

Reviewers: David Jacot <djacot@confluent.io>
2023-05-02 13:47:18 +02:00
Greg Harris 402cce0796
KAFKA-14666: Add MM2 in-memory offset translation index for offsets behind replication (#13429)
Reviewers: Daniel Urban <durban@cloudera.com>, Chris Egerton <chrise@aiven.io>
2023-05-01 12:40:29 -04:00
Greg Harris c7347c266f
MINOR: Refactor Mirror integration tests to reduce duplication (#13428)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2023-05-01 11:56:34 -04:00
hudeqi 721d160175
KAFKA-14837/14842: Avoid the rebalance caused by the addition and deletion of irrelevant groups for MirrorCheckPointConnector (#13446)
Reviewers: Chris Egerton <chrise@aiven.io>
2023-05-01 11:50:27 -04:00
Victoria Xia cb7871ca6e
MINOR: update docs note about spurious stream-stream join results (#13642)
Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-04-25 19:41:23 -07:00
Matthias J. Sax 754365a032 KAFKA-14862: Outer stream-stream join does not output all results with multiple input partitions (#13592)
The stream-stream outer join uses a "shared time tracker" to track stream-time progress for left and right input in a single place. This time tracker is incorrectly shared across tasks.

This PR introduces a supplier to create a "shared time tracker" object per task, to be shared between the left and right join processors.
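
A minimal sketch of the idea, with hypothetical class names (`TimeTracker`, `JoinProcessor`, `Task`) that do not correspond to the Kafka Streams internals: each task asks the supplier for its own tracker and hands that one instance to both join sides.

```java
import java.util.function.Supplier;

class TimeTrackerSketch {
    // Hypothetical tracker: remembers the highest timestamp seen so far.
    static final class TimeTracker {
        long streamTime = Long.MIN_VALUE;

        void advance(long timestamp) {
            streamTime = Math.max(streamTime, timestamp);
        }
    }

    // Hypothetical join side; left and right sides of one task share a tracker.
    static final class JoinProcessor {
        final TimeTracker tracker;

        JoinProcessor(TimeTracker tracker) {
            this.tracker = tracker;
        }

        void process(long timestamp) {
            tracker.advance(timestamp);
        }
    }

    static final class Task {
        final JoinProcessor left;
        final JoinProcessor right;

        Task(Supplier<TimeTracker> trackerSupplier) {
            TimeTracker perTask = trackerSupplier.get(); // one tracker per task
            left = new JoinProcessor(perTask);
            right = new JoinProcessor(perTask);
        }
    }

    public static void main(String[] args) {
        Supplier<TimeTracker> supplier = TimeTracker::new;
        Task task0 = new Task(supplier);
        Task task1 = new Task(supplier);
        task0.left.process(100L);
        System.out.println("task0 right sees " + task0.right.tracker.streamTime);
        System.out.println("task1 left sees  " + task1.left.tracker.streamTime);
    }
}
```

Passing a supplier rather than a tracker instance is what keeps the sharing scoped to a task: the topology can be built once, while each task still gets a fresh object.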

Reviewers: Victoria Xia <victoria.xia@confluent.io>, Bruno Cadonna <bruno@confluent.io>, Walker Carlson <wcarlson@confluent.io>
2023-04-24 13:03:14 -07:00
Greg Harris 41037bf78d
KAFKA-14905: Reduce flakiness in MM2 ForwardingAdmin test due to admin timeouts (#13575)
Reduce flakiness of `MirrorConnectorsWithCustomForwardingAdminIntegrationTest`

Reviewers: Josep Prat <jlprat@apache.org>
2023-04-21 22:17:05 +02:00
Jeff Kim ec804a5b4e
KAFKA-14869: Bump coordinator value records to flexible versions (KIP-915, Part-2) (#13604)
This patch implemented the second part of KIP-915. It bumps the versions of the value records used by the group coordinator and the transaction coordinator to make them flexible versions. The new versions are not used when writing to the partitions but only when reading from the partitions. This allows downgrades from future versions that will include tagged fields.

Reviewers: David Jacot <djacot@confluent.io>
2023-04-21 13:54:37 +02:00
Ron Dagostino 03b41b54c9
KAFKA-14887: FinalizedFeatureChangeListener should not shut down when ZK session expires
FinalizedFeatureChangeListener shuts the broker down when it encounters an issue trying to process feature change
events. However, it does not distinguish between issues related to feature changes actually failing and other
exceptions like ZooKeeper session expiration. This introduces the possibility that ZooKeeper session expiration
could cause the broker to shut down, which is not intended. This patch updates the code to distinguish between
these two types of exceptions. In the case of something like a ZK session expiration it logs a warning and continues.
We shut down the broker only for FeatureCacheUpdateException.
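
The distinction can be sketched as below. This is a simplified model, not the listener's code: the nested `FeatureCacheUpdateException` is a local stand-in named after the exception in the text, and `Outcome` and `processUpdate` are hypothetical.

```java
class FeatureListenerSketch {
    // Local stand-in named after the exception in the text; not Kafka's class.
    static class FeatureCacheUpdateException extends RuntimeException {
        FeatureCacheUpdateException(String message) {
            super(message);
        }
    }

    enum Outcome { UPDATED, WARNED_AND_CONTINUED, SHUT_DOWN }

    // Runs one feature-change update and reports how a failure was handled.
    static Outcome processUpdate(Runnable update) {
        try {
            update.run();
            return Outcome.UPDATED;
        } catch (FeatureCacheUpdateException e) {
            // The feature change itself failed: the only case that stops the broker.
            return Outcome.SHUT_DOWN;
        } catch (RuntimeException e) {
            // Anything else (for example a session expiration): warn and keep going.
            return Outcome.WARNED_AND_CONTINUED;
        }
    }

    public static void main(String[] args) {
        System.out.println(processUpdate(() -> {
            throw new IllegalStateException("session expired");
        }));
    }
}
```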

Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Christo Lolov <christololov@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
2023-04-20 20:17:40 -04:00
Victoria Xia 960110dd85
MINOR: update comment for FK join processor renames (#13610)
Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-04-19 16:29:45 -07:00
Matthias J. Sax 4f9ef72a63 MINOR: rename internal FK-join processor classes 2023-04-18 12:18:52 -07:00
David Jacot 6a2331a60a HOTFIX: KAFKA-14869: Updated location of auto-generated records
While cherry-picking 5115906515 to 3.4, auto-generated classes were still on my disk, so the issue was not caught. This patch fixes the fully qualified names to match the location of the auto-generated records in 3.4.
2023-04-18 11:07:21 +02:00
Jeff Kim 5115906515 KAFKA-14869: Ignore unknown record types for coordinators (KIP-915, Part-1) (#13511)
This patch implemented the first part of KIP-915. It updates the group coordinator and the transaction coordinator to ignore unknown record types while loading their respective state from the partitions. This allows downgrades from future versions that will include new record types.
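
As a rough illustration of tolerant loading (not the coordinator code): the record type ids, the `Rec` class, and the `load` method below are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RecordLoaderSketch {
    // Hypothetical record type ids known to this version of the loader.
    static final short GROUP_METADATA = 0;
    static final short OFFSET_COMMIT = 1;

    static final class Rec {
        final short type;
        final String payload;

        Rec(short type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    // Rebuilds state from a log, skipping record types this version does not know.
    static List<String> load(List<Rec> log) {
        List<String> state = new ArrayList<>();
        for (Rec r : log) {
            switch (r.type) {
                case GROUP_METADATA:
                    state.add("group:" + r.payload);
                    break;
                case OFFSET_COMMIT:
                    state.add("offset:" + r.payload);
                    break;
                default:
                    // Written by a newer version: ignore instead of failing the load.
                    break;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<Rec> log = Arrays.asList(
            new Rec(GROUP_METADATA, "g1"), new Rec((short) 7, "future"), new Rec(OFFSET_COMMIT, "42"));
        System.out.println(load(log));
    }
}
```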

Reviewers: Alexandre Dupriez <alexandre.dupriez@gmail.com>, David Jacot <djacot@confluent.io>
2023-04-18 10:47:04 +02:00
Matthias J. Sax eb616d3ffc KAFKA-14054: Handle TimeoutException gracefully (#13534)
We incorrectly assumed that `consumer.position()` would always be
served by the consumer's locally set position.

However, within `commitNeeded()` we first check `if(commitNeeded)`
and thus go into the else branch only if we have not processed data
(otherwise, `commitNeeded` would be true). For this reason, we actually
don't know if the consumer has a valid position or not.

We should just swallow a timeout if the consumer cannot get the position
from the broker, and try the next partition. If any position advances, we
can return true, and if we time out for all partitions we can return
false.
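
The loop described above can be sketched as follows. This is an illustration only: `PositionTimeoutException` is a local stand-in for a client timeout, and `anyPositionAdvanced` with its map and lookup parameters is hypothetical rather than the Streams task code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

class CommitNeededSketch {
    // Local stand-in for a timeout raised while asking the broker for a position.
    static class PositionTimeoutException extends RuntimeException {
    }

    // Returns true as soon as one partition's position is ahead of its committed
    // offset; a timeout on one partition is swallowed and the next one is tried.
    static boolean anyPositionAdvanced(Map<String, Long> committedOffsets,
                                       ToLongFunction<String> positionLookup) {
        for (Map.Entry<String, Long> entry : committedOffsets.entrySet()) {
            try {
                long position = positionLookup.applyAsLong(entry.getKey());
                if (position > entry.getValue()) {
                    return true;
                }
            } catch (PositionTimeoutException e) {
                // Position unknown for now; move on to the next partition.
            }
        }
        return false; // nothing advanced, or every lookup timed out
    }

    public static void main(String[] args) {
        Map<String, Long> committed = new LinkedHashMap<>();
        committed.put("p0", 5L);
        committed.put("p1", 5L);
        boolean needed = anyPositionAdvanced(committed, p -> {
            if (p.equals("p0")) {
                throw new PositionTimeoutException();
            }
            return 6L;
        });
        System.out.println("commit needed: " + needed);
    }
}
```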

Reviewers: Michal Cabak (@miccab), John Roesler <john@confluent.io>, Guozhang Wang <guozhand@confluent.io>
2023-04-14 11:38:57 -07:00
Colin Patrick McCabe 6c89a3f365 KAFKA-14894: MetadataLoader must call finishSnapshot after loading a snapshot (#13541)
The MetadataLoader must call finishSnapshot after loading a snapshot. This function removes
whatever was in the old snapshot that is not in the new snapshot that was just loaded. While this
is not significant when the old snapshot was the empty snapshot, it is important to do when we are
loading a snapshot on top of an existing non-empty image.
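
A toy model of why the call matters, assuming a key-value image and hypothetical `beginSnapshot`/`replay`/`finishSnapshot` methods (this is not how `MetadataLoader` is structured): without the final step, keys that exist only in the old image survive the load.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SnapshotLoadSketch {
    private final Map<String, String> image = new HashMap<>();
    private Set<String> seenInSnapshot; // non-null only while a snapshot is loading

    void beginSnapshot() {
        seenInSnapshot = new HashSet<>();
    }

    void replay(String key, String value) {
        image.put(key, value);
        if (seenInSnapshot != null) {
            seenInSnapshot.add(key);
        }
    }

    // Drops everything from the old image that the new snapshot did not contain.
    void finishSnapshot() {
        if (seenInSnapshot == null) {
            throw new IllegalStateException("no snapshot in progress");
        }
        image.keySet().retainAll(seenInSnapshot);
        seenInSnapshot = null;
    }

    Map<String, String> image() {
        return image;
    }

    public static void main(String[] args) {
        SnapshotLoadSketch loader = new SnapshotLoadSketch();
        loader.replay("a", "1");
        loader.replay("b", "1");
        loader.beginSnapshot();
        loader.replay("b", "2");
        loader.replay("c", "2");
        System.out.println("before finishSnapshot: " + loader.image().keySet());
        loader.finishSnapshot();
        System.out.println("after finishSnapshot:  " + loader.image().keySet());
    }
}
```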

In initializeNewPublishers, the newly installed publishers should be given a MetadataDelta based on
MetadataImage.EMPTY, reflecting the fact that they are seeing everything for the first time.

Reviewers: David Arthur <mumrah@gmail.com>
2023-04-12 15:44:41 -07:00
Colin P. McCabe 8f94b627ae KAFKA-14857: Fix some MetadataLoader bugs (#13462)
The MetadataLoader is not supposed to publish metadata updates until we have loaded up to the high
water mark. Previously, this logic was broken, and we published updates immediately. This PR fixes
that and adds a junit test.

Another issue is that the MetadataLoader previously assumed that we would periodically get
callbacks from the Raft layer even if nothing had happened. We relied on this to install new
publishers in a timely fashion, for example. However, in older MetadataVersions that don't include
NoOpRecord, this is not a safe assumption.

Aside from the above changes, also fix a deadlock in SnapshotGeneratorTest, fix the log prefix for
BrokerLifecycleManager, and remove metadata publishers on brokerserver shutdown (like we do for
controllers).

Reviewers: David Arthur <mumrah@gmail.com>, dengziming <dengziming1993@gmail.com>

Conflicts: This patch was cut down to make it easier to cherry-pick to this older branch.
Specifically, I removed the BrokerLifecycleManager.scala logging change and the BrokerServer
installPublishers and removeAndClosePublisher changes.
2023-04-12 15:43:33 -07:00
David Jacot 1666e8e8af KAFKA-14880; TransactionMetadata with producer epoch -1 should be expirable (#13499)
We have seen the following error in logs:

```
"Mar 22, 2019 @ 21:57:56.655",Error,"kafka-0-0","transaction-log-manager-0","Uncaught exception in scheduled task 'transactionalId-expiration'","java.lang.IllegalArgumentException: Illegal new producer epoch -1
```

Investigations showed that it is actually possible for a transaction metadata object to still have -1 as producer epoch when it transitions to Dead.

When a transaction metadata is created for the first time (in handleInitProducerId), it has -1 as its producer epoch. Then a producer epoch is assigned and the transaction coordinator tries to persist the change. If the write fails, for instance because the partition is under min ISR, the transaction metadata keeps -1 as its epoch forever or until the init producer id is retried.

This means that it is possible for transaction metadata to remain with -1 as producer epoch until it gets expired. At the moment, this is not allowed because we enforce a producer epoch greater than or equal to 0 in prepareTransitionTo.
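
One way to picture the relaxed check is sketched below. It is an assumption about the shape of the fix, not the actual patch: the `State` enum, `NO_PRODUCER_EPOCH`, and `validateTransition` are hypothetical names.

```java
class EpochCheckSketch {
    enum State { EMPTY, ONGOING, DEAD }

    // Sentinel used by this sketch for "no epoch assigned yet".
    static final short NO_PRODUCER_EPOCH = -1;

    // Rejects negative epochs, except when an entry that never received an
    // epoch is being expired (transition to DEAD).
    static void validateTransition(State target, short newEpoch) {
        boolean expiringUninitialized = target == State.DEAD && newEpoch == NO_PRODUCER_EPOCH;
        if (newEpoch < 0 && !expiringUninitialized) {
            throw new IllegalArgumentException("Illegal new producer epoch " + newEpoch);
        }
    }

    public static void main(String[] args) {
        validateTransition(State.DEAD, NO_PRODUCER_EPOCH); // allowed: expiration
        validateTransition(State.ONGOING, (short) 3);      // allowed: valid epoch
        System.out.println("both transitions accepted");
    }
}
```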

Reviewers: Luke Chen <showuon@gmail.com>, Justine Olshan <jolshan@confluent.io>
2023-04-06 08:51:05 +02:00
Guozhang Wang 2dd3713b2a KAFKA-14172: Should clear cache when active recycled from standby (#13369)
This fix is inspired by #12540.

1. Added a clearCache function for CachedStateStore, which would be triggered upon recycling a state manager.
2. Added the integration test inherited from #12540 .
3. Improved some log4j entries.
4. Found and fixed a minor issue with log4j prefix.

Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Matthias J. Sax <matthias@confluent.io>
2023-04-05 17:02:56 -07:00
Chris Egerton 699c3511a7 MINOR: Fix base ConfigDef in AbstractHerder::connectorPluginConfig (#13466)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Greg Harris <gharris1727@gmail.com>
2023-04-04 11:59:24 +02:00
Victoria Xia ac72ba571f KAFKA-14864: Close iterator in KStream windowed aggregation emit on window close (#13470)
Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-04-03 21:32:18 -07:00
Chia-Ping Tsai 255bf5a1d3
KAFKA-14774 the removed listeners should not be reconfigurable (#13472)
Reviewers: Luke Chen <showuon@gmail.com>
2023-03-29 22:08:00 +08:00
Jorge Esteban Quilcate Otoya e62556be2b
KAFKA-14843: Include Connect framework properties when retrieving connector config definitions (#13445)
Reviewers: Yash Mayya <yash.mayya@gmail.com>, Greg Harris <greg.harris@aiven.io>, Chris Egerton <chrise@aiven.io>
2023-03-28 11:29:53 -04:00
Chris Egerton 0e95f730bd
KAFKA-14645: Use plugin classloader when retrieving connector plugin config definitions (#13148)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Greg Harris <gharris1727@gmail.com>
2023-03-28 11:29:37 -04:00
Greg Harris 7814593243
KAFKA-14797: Emit offset sync when offset translation lag would exceed max.offset.lag (#13367)
Reviewers: Chris Egerton <chrise@aiven.io>
2023-03-21 09:42:00 -04:00
Chris Egerton aea37adf3f KAFKA-14816: Only load SSL properties when issuing cross-worker requests to HTTPS URLs (#13415)
This fixes a regression introduced in #12828, which caused workers to start unconditionally loading (and therefore validating) SSL-related properties when issuing REST requests to other workers. That was fine for the most part, but caused unnecessary failures when workers were configured with invalid SSL-related properties and their REST API used HTTP instead of HTTPS.

Reviewers: Ian McDonald <imcdonald@confluent.io>, Greg Harris <greg.harris@aiven.io>, Yash Mayya <yash.mayya@gmail.com>, Justine Olshan <jolshan@confluent.io>
2023-03-20 09:36:28 -07:00
Hector Geraldino a99c7fa44d
KAFKA-14809 Fix logging conditional on WorkerSourceTask (#13386)
Reviewers: Chris Egerton <chrise@aiven.io>
2023-03-16 08:39:50 -04:00
Chris Egerton 9b4397abee
KAFKA-14799: Ignore source task requests to abort empty transactions (#13379)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2023-03-14 16:12:36 -04:00
José Armando García Sancio 0a53cfb3a2 MINOR; Fix command name kafka-metadata-quorum (#13381)
The name of the command should be kafka-metadata-quorum not
kafka-metatada-quorum.

Reviewers: Ron Dagostino <rdagostino@confluent.io>, Divij Vaidya <diviv@amazon.com>
2023-03-14 09:23:04 -07:00
Yash Mayya 1c5ba8b1d0
MINOR: Fix error check in Connect Worker zombie fencing (#13392) 2023-03-14 12:10:59 -04:00
Eric Haag 55d7519cf7 MINOR: Remove unnecessary call to asCollection causing eager dependency resolution (#13149)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Nelson Osacky
2023-03-10 18:28:32 +01:00
Chris Egerton f7381e134e KAFKA-14781: Downgrade MM2 log message severity when no ACL authorizer is configured on source broker (#13351)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2023-03-08 10:34:01 -05:00
Greg Harris 22ab975fe2 KAFKA-14649: Isolate failures during plugin path scanning to single plugin classes (#13182)
Reviewers: Christo Lolov <christo_lolov@yahoo.com>, Chris Egerton <chrise@aiven.io>
2023-03-02 15:11:02 -05:00
Hector Geraldino da4e1bf283 KAFKA-14659 source-record-write-[rate|total] metrics should exclude filtered records (#13193)
Reviewers: Christo Lolov <christololov@gmail.com>, Chris Egerton <chrise@aiven.io>
2023-02-28 09:40:57 -05:00
Chia-Ping Tsai 98f770f468 KAFKA-14295 FetchMessageConversionsPerSec meter not recorded (#13279)
Reviewers: Luke Chen <showuon@gmail.com>
2023-02-27 13:14:07 +08:00
Luke Chen fa88333039 KAFKA-14743: update request metrics after callback (#13297)
Currently, `kafka.network:type=RequestMetrics,name=MessageConversionsTimeMs,request=Fetch` is never updated, because the request metrics are recorded BEFORE the messageConversions metric value is updated. That means that even when the messageConversions value is updated, the request metrics never reflect it. This patch fixes that by updating the request metrics after the callback completes, so that the messageConversions metric value is recorded correctly.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Divij Vaidya <diviv@amazon.com>
2023-02-26 15:22:45 +08:00
Chia-Ping Tsai 7220e3ecc5 MINOR: enable DynamicBrokerReconfigurationTest.testUncleanLeaderElectionEnable (#13296)
Reviewers: Luke Chen <showuon@gmail.com>
2023-02-26 15:20:14 +08:00
Greg Harris 175a342580 KAFKA-12468, KAFKA-13659, KAFKA-12566: Fix MM2 causing negative downstream lag, syncing stale offsets, and flaky integration tests (#13178)
KAFKA-12468: Fix negative lag on down consumer groups synced by MirrorMaker 2

KAFKA-13659: Stop syncing consumer groups with stale offsets in MirrorMaker 2

KAFKA-12566: Fix flaky MirrorMaker 2 integration tests

Reviewers: Chris Egerton <chrise@aiven.io>
2023-02-23 08:18:31 -05:00
csolidum 300779dee4 KAFKA-14545: Make MirrorCheckpointTask.checkpoint handle null OffsetAndMetadata gracefully (#13052)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Greg Harris <gharris1727@gmail.com>
2023-02-23 07:54:01 -05:00
Chris Egerton a4b33bd0a5 KAFKA-14610: Publish Mirror Maker 2 offset syncs in task commit() method (#13181)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Greg Harris <gharris1727@gmail.com>
2023-02-23 07:49:02 -05:00
emilnkrastev 55e69a0db8 KAFKA-12558: Do not prematurely mutate internal partition state in Mirror Maker 2 (#11818)
Reviewers: Greg Harris <greg.harris@aiven.io>, Chris Egerton <chrise@aiven.io>
2023-02-23 07:48:55 -05:00
Lucia Cerchie 96e1e41f93 KAFKA-14128: Kafka Streams does not handle TimeoutException (#13161)
Kafka Streams is supposed to handle TimeoutException during internal topic creation gracefully. This PR fixes the exception handling code to avoid crashing on a TimeoutException returned by the admin client.

Reviewer: Matthias J. Sax <matthias@confluent.io>, Colin Patrick McCabe <cmccabe@apache.org>, Alexandre Dupriez (@Hangleton)
2023-02-22 23:01:29 -08:00
Purshotam Chauhan b7a8fd7bfe KAFKA-14733: Added a few missing checks for KRaft Authorizer and updated AclAuthorizerTest to run tests for both zk and kraft (#13282)
Added the following checks:
* In StandardAuthorizerData.authorize(), fail if a `patternType` other than `LITERAL` is passed.
* In AclControlManager.addAcl(), fail if the resource name is null or empty.

Also, `AclAuthorizerTest` includes a lot of tests covering various scenarios that are missing in `StandardAuthorizerTest`. This PR changes `AclAuthorizerTest` to run its tests in both `zk` and `kraft` modes:
* Rename AclAuthorizerTest -> AuthorizerTest
* Parameterize relevant tests to run for both modes

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
2023-02-21 19:22:16 +05:30