Commit Graph

13792 Commits

Author SHA1 Message Date
Sushant Mahajan 821c10157d
KAFKA-17367: Introduce share coordinator [2/N] (#17011)
Introduces the share coordinator. This coordinator is built on the new coordinator runtime framework. It 
is responsible for persistence of share-group state in a new internal topic named "__share_group_state".
The responsibility for being a share coordinator is distributed across the brokers in a cluster. 

Reviewers: David Arthur <mumrah@gmail.com>, Andrew Schofield <aschofield@confluent.io>, Apoorv Mittal <apoorvmittal10@gmail.com>
2024-09-09 20:01:24 -04:00
TengYao Chi 92672d1df8
KAFKA-17470: CommitRequestManager should record failed request only once even if multiple errors in response (#17109)
Reviewers: Lianet Magrans <lmagrans@confluent.io>
2024-09-09 21:52:32 +02:00
David Arthur 4d182d12f6
MINOR Add status check for gradle scan (#17140)
Add a commit status check so PRs can easily access the build scan.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-09 15:15:47 -04:00
David Arthur b317624baa
MINOR Enable the GitHub build by default (#17105)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-09 13:58:16 -04:00
Apoorv Mittal 4ac1dd4fc7
KAFKA-17482: Make share partition initialization async (KIP-932) (#17097)
The PR introduces states for share partition which are used to make share partition initilzation async.

Reviewers:  Andrew Schofield <aschofield@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
2024-09-09 19:15:08 +05:30
David Arthur 049b7cde4c
MINOR Allow PRs to publish build scan (#17123)
Publish Gradle build scans produced by PRs. This is done by using a `workflow_run` action that is triggered when the "CI" workflow completes. It downloads the build scan files from the PR workflow and publishes to ge.apache.org.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-08 17:24:35 -04:00
TaiJuWu 04dee3b2f2
KAFKA-17477 Migrate TopicCommand test to new test infra (#16127)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-09 02:21:23 +08:00
xijiu 64b909b598
KAFKA-13588 We should consolidate `changelogFor` methods to simplify the generation of internal topic names (#17125)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-09 01:48:24 +08:00
PoAn Yang 6354f49645
KAFKA-17137 Ensure Admin APIs are properly tested (topic-related) (#16676)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-09 01:20:44 +08:00
xijiu f325bf53e8
MINOR: Simplify the test for `ListOffsetsRequestTest#testListOffsetsRequestOldestVersion` (#17100)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-09 00:49:49 +08:00
Ivan Yurchenko a9f3ea2dd3
KAFKA-17323 Document UINT16 and COMPACT_RECORDS in Protocol Guide (#16868)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-08 13:15:23 +08:00
David Jacot 2ff81f087a
MINOR: Clean up system tests based on new defaults (#17113)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-08 12:28:05 +08:00
Murali Basani 72e16cb9e1
KAFKA-16863 : Deprecate default exception handlers (#17005)
Implements KIP-1056:
 - deprecates default.deserialization.exception.handler in favor of deserialization.exception.handler
 - deprecates default.production.exception.handler in favor of production.exception.handler

Reviewers: Matthias J. Sax <matthias@confluent.io>
2024-09-07 20:14:46 -07:00
Ayoub Omari be3ab8bdd5
KAFKA-14460: Skip removed entries from in-memory KeyValueIterator (#16505)
As described in KAFKA-14460, one of the functional requirements of KeyValueStore is that "The returned iterator must not return null values" on methods which return iterator.

This is not completely the case today for InMemoryKeyValueStore. To iterate over the store, we copy the keySet in order not to block access for other threads. However, entries that are removed from the store after initializing the iterator will be returned with null values by the iterator.

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
2024-09-07 17:06:55 -07:00
Bill Bejeck 981133d350
KAFKA-17486: Flaky test RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener (#17104)
This test has a tricky race condition. We want the restoration to go slow enough so that when a second Kafka Streams instance starts, the restoration of a given TopicPartition pauses due to task re-assignment. But after that point, we'd like the test to proceed faster to avoid any timeout assertions. To that end, here are the changes in this PR:

Increase the restore pause to 2 seconds; this should slow the restoration enough so that the process is still in progress once the second instance starts. But once tasks are re-assigned and onRestorePause is called, the restore pause is decremented to zero, allowing the test to proceed faster.
Increase the restore batch to its original value of 5 - otherwise, the test moved too slowly.
Decrease the number of test records produced to the original value of 100. By increasing the time of restoring each batch until Kafka Streams calls onRestorePause and removing the intentional restoration slowness, 100 records proved good enough in local testing.

Reviewers: Matthias J. Sax <mjsax@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>,
   Yu-LinChen <kh87313@gmail.com>
2024-09-07 18:47:02 -04:00
David Arthur 040ae26472
KAFKA-17479 Fail the whole pipeline if junit step times out [4/n] (#17121)
Fixes an issue where the CI workflow could appear to be successful in the event of a timeout and no failing tests. Instead of using Github Action's timeout, this patch makes use of the linux `timeout` command. This lets us capture the exit code and handle timeouts separately from a failed execution.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-07 15:13:20 -04:00
Chris Egerton 50e7022a1b
MINOR: Improve error message when Connect's EmbeddedKafkaCluster::verifyClusterReadiness method fails (#16918)
Reviewers: Greg Harris <greg.harris@aiven.io>
2024-09-07 10:23:59 -04:00
David Arthur e4d108d4df
KAFKA-15793 Fix ZkMigrationIntegrationTest#testMigrateTopicDeletions (#17004)
Reviewers: Igor Soarez <soarez@apple.com>, Ajit Singh <>
2024-09-06 21:57:13 +01:00
Alyssa Huang a9a4a52c9d
KAFKA-16963: Ducktape test for KIP-853 (#17081)
Add a ducktape system test for KIP-853 quorum reconfiguration, including adding and removing voters.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2024-09-06 13:44:09 -07:00
David Arthur 62379d7d53
KAFKA-17479 Fix ignoreFailures logic in CI workflow [3/n] (#17106)
The ignoreFailures property was removed in #17066 to prevent test failures from being cached. However, this breaks the JUnit report and makes the github workflow less user friendly.

The problem is that we are copying the junit test report files into a new directory (added in #17098) in a Gradle doLast closure. If we don't run with ignoreFailures=true, then this closure will not run and the test failures won't be processed by junit.py.

This patch adds logic to ensure the doLast closure of :test is always run. The user provided -PignoreFailures is still honored for the test tasks so local developer workflows should not be disturbed.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-06 14:47:42 -04:00
David Arthur 1fd1646eb9
KAFKA-15648 Update leader volatile before handleLeaderChange in LocalLogManager (#17118)
Update the leader before calling handleLeaderChange and use the given epoch in LocalLogManager#prepareAppend. This should hopefully fix several flaky QuorumControllerTest tests.

Reviewers: José Armando García Sancio <jsancio@apache.org>
2024-09-06 13:54:03 -04:00
Apoorv Mittal eec9eccacb
KAFKA-17483: Complete pending share fetch requests on broker close (#17096)
The PR adds capability to complete pending fetch requests on broker shutdown.

Reviewers: Andrew Schofield <aschofield@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
2024-09-06 14:03:51 +05:30
Scott Hendricks 50ca2c8c73
MINOR: Fix Trogdor Off-By-One Errors. (#17095)
Upon further inspection of the ConfigurableProducerWorker, I noticed that there is an off-by-one error that can cause us to greatly exceed the target messages per second.

I created a test harness so I could quickly evaluate with and without this change.

With this change, the test harness outputs sane values:

GaussianThroughputGenerator throttle = new GaussianThroughputGenerator(350, 35, 100, 100);
... Output:
Changing Throttle: 323 ==> 318
Rate: 3513 Messages / Second
Changing Throttle: 318 ==> 318
Rate: 3510 Messages / Second
Changing Throttle: 318 ==> 333
Rate: 3506 Messages / Second
Changing Throttle: 333 ==> 352
Rate: 3505 Messages / Second
Changing Throttle: 352 ==> 356
Rate: 3505 Messages / Second
Changing Throttle: 356 ==> 302
Rate: 3505 Messages / Second
Changing Throttle: 302 ==> 347
Rate: 3501 Messages / Second
Changing Throttle: 347 ==> 397
Rate: 3501 Messages / Second
Without this change, the throttle thrashes, the values can skyrocket, and unintentional code paths can be called.

GaussianThroughputGenerator throttle = new GaussianThroughputGenerator(350, 35, 100, 100);
... Output:
Changing Throttle: 374 ==> 314
Changing Throttle: 314 ==> 346
Changing Throttle: 346 ==> 340
Changing Throttle: 340 ==> 382
Changing Throttle: 382 ==> 377
Changing Throttle: 377 ==> 352
Changing Throttle: 352 ==> 397
Changing Throttle: 397 ==> 335
Rate: 4468 Messages / Second
Changing Throttle: 335 ==> 398
Changing Throttle: 398 ==> 345
Changing Throttle: 345 ==> 381
Changing Throttle: 381 ==> 334
Changing Throttle: 334 ==> 303
Changing Throttle: 303 ==> 359
Changing Throttle: 359 ==> 353
Changing Throttle: 353 ==> 422
Changing Throttle: 422 ==> 274
Changing Throttle: 274 ==> 317
Rate: 4733 Messages / Second
Changing Throttle: 317 ==> 316
Changing Throttle: 316 ==> 392
Changing Throttle: 392 ==> 342
Changing Throttle: 342 ==> 429
Changing Throttle: 429 ==> 305
Changing Throttle: 305 ==> 389

Reviewers: Justine Olshan <jolshan@confluent.io>
2024-09-05 13:39:26 -07:00
David Arthur 84aa5d7a63
KAFKA-17479 Relocate junit XML files [2/n] (#17098)
Recently, we fixed caching for ":jar" and ":test" tasks. A side effect of this is that the test results will be restored as part of the Gradle cache resolution. This means test tasks which are skipped (as a result of FROM-CACHE) will still have test results in their build directory. To avoid incorrectly reporting these results in the job summary, this patch uses a doLast task handler to relocate JUnit XML files into a new directory.

This patch also removes the "continue-on-error" from the JUnit test step which caused timed-out builds to appear successful.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-05 13:50:33 -04:00
Sebastien Viale b4f47aeff5
KAFKA-16448: Add timestamp to error handler context (#17054)
Part of KIP-1033.

Co-authored-by: Dabz <d.gasparina@gmail.com>
Co-authored-by: loicgreffier <loic.greffier@michelin.com>

Reviewers: Matthias J. Sax <matthias@confluent.io>
2024-09-05 08:42:52 -07:00
Kuan-Po Tseng 190df07ace
KAFKA-17265 Fix flaky MemoryRecordsBuilderTest#testBuffersDereferencedOnClose (#17092)
Reviewers: TaiJuWu <tjwu1217@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
2024-09-05 19:47:16 +08:00
Luke Chen 887b947d8a
MINOR: Update doc for tiered storage GA (#17088)
Reviewers: Satish Duggana <satishd@apache.org>
2024-09-05 15:56:23 +05:30
Mickael Maison af0604e6ef
KAFKA-16188: Delete kafka.common.MessageReader (#17090)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-05 09:41:59 +02:00
TaiJuWu 235a5b6cf9
MINOR: `ClusterInstance#waitForTopic` gets hanging when the broker is shutdown (#17085)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-05 14:07:41 +08:00
David Jacot 9abb8d3b3c
MINOR: Set `group.coordinator.rebalance.protocols` to `classic,consumer` by default (#17057)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-05 13:50:20 +08:00
Sasaki Toru 748d20200f
MINOR: Fix broken output layout of kafka-consumer-groups.sh (#17058)
Reviewers: Andrew Schofield <aschofield@confluent.io>, Lianet Magrans <lmagrans@confluent.io>
2024-09-04 16:31:59 -04:00
David Arthur 0294b1402d
KAFKA-17479 Allow ":jar" tasks to be cached [1/n] (#17066)
For several modules, we include a kafka-version.properties in the Jar file. This file includes the Git SHA of the project at the time of the build. This means that even if no source files change, the :jar task will never be UP-TO-DATE between two git commits. Ultimately, this breaks Gradle caching.

This patch marks all of the createVersionFile tasks as cacheable and also changes our Gradle invocation to override the commit ID to a dummy static value. This will allow the :jar task to be cacheable and reusable between builds.

This patch also configures the trunk build to only write to the build cache and not read from it. This will prevent any cache pollution/corruption from propagating from build to build.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-04 15:06:11 -04:00
David Arthur 59f5d91d8f
MINOR increase develocity expiry to 4 hours (#17077)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-04 09:17:00 -04:00
Bill Bejeck 661825ad28
KAFKA-16993: Flaky test shouldInvokeUserDefinedGlobalStateRestoreListener (#16970)
Increased the number of records while decreasing the restore batch size to ensure the restoration does not complete before the second Kafka Streams instance starts up.

Reviewers: Matthias J. Sax <mjsax@apache.org>
2024-09-04 08:50:27 -04:00
Dmitry Werner 005155ba5c
KAFKA-17406: Move ClientIdAndBroker to server-common module (#16967)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, David Arthur <mumrah@gmail.com>
2024-09-04 11:04:40 +02:00
Dmitry Werner 5fd7ce2ace
KAFKA-17414 Move RequestLocal to server-common module (#16986)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-04 16:12:20 +08:00
TengYao Chi f0f37409be
KAFKA-17454 Fix failed transactions_mixed_versions_test.py when running with 3.2 (#17067)
why df04887ba5 does not fix it?

The fix of df04887ba5 is to NOT collect the log from path `/mnt/kafka/kafka-operational-logs/debug/xxxx.log`if the task is successful. It does not change the log level. see ducktape b2ad7693f2/ducktape/tests/test.py (L181)

why df04887ba5 does not see the error of "sort"

df04887ba5 does NOT show the error since the number of features is only "one" (only metadata.version). Hence, the bug is not triggered as it does not need to "sort". Now, we have two features - metadata.version and krafe.version - so the sort is executed and then we see the "hello bug"

why we should change the kafka.log_level to INFO?

the template of log4j.properties is controlled by `log_level` (https://github.com/apache/kafka/blob/trunk/tests/kafkatest/services/kafka/templates/log4j.properties#L16), and the bug happens in writing debug message (e4ca066680/core/src/main/scala/kafka/server/metadata/BrokerMetadataListener.scala (L274)). Hence, changing the log level to DEBUG can avoid triggering the bug.

Reviewers: Justine Olshan <jolshan@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
2024-09-04 15:09:03 +08:00
Matthias J. Sax 32f03a06be
KAFKA-17474 fix state transition in GlobalStreamThread (#17078)
KAFKA-17100 changed the behavior of GlobalStreamThread introducing a race condition for state changes, that was exposed by failing (flaky) tests in GlobalStreamThreadTest.

This PR moves the state transition to fix the race condition.

Reviewers: Bill Bejeck <bbejeck@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
2024-09-04 14:48:51 +08:00
Kuan-Po Tseng 19e8c447a6
KAFKA-17137 Ensure Admin APIs are properly tested (consumer group and quota) (#16717)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-04 12:08:38 +08:00
Luke Chen eb9cfb06c0
KAFKA-17412: add doc for `unclean.leader.election.enable` in KRaft (#17051)
Reviewers: Colin P. McCabe <cmccabe@apache.org>
2024-09-03 16:11:46 -07:00
Ritika Reddy edac19ba50
KAFKA-17277: [1/2] Add version mapping command to the storage tool and feature command tool (#16973)
As a part of KIP-1022 the following has been implemented in this patch:

A version-mapping command to to look up the corresponding features for a given metadata version. Using the command with no --release-version argument will return the mapping for the latest stable metadata version.
This command has been added to the FeatureCommand Tool and the Storage Tool.
The storage tools parsing method has been made more modular similar to the feature command tool

Reviewers: Justine Olshan <jolshan@confluent.io>
2024-09-03 15:48:36 -07:00
Ken Huang 4d23fe92f1
KAFKA-16928: Test all of the request and response methods in RaftUtil (#16517)
Reviewers: Colin P. McCabe <cmccabe@apache.org>
2024-09-03 13:13:33 -07:00
Andrew Schofield b0d0956b20
KAFKA-17425: Improve coexistence of consumer and share groups (#17039)
This PR ensures that using the various group RPCs work properly when issued against the wrong type of group, such as DescribeConsumerGroups for a share group, or ConsumerGroupHeartbeat for a share group. There are no changes to the RPC error codes required.

The significant code changes are:

Making sure that the group coordinator does not assume that only classic and consumer groups exist. This was the cause of a ClassCastException when ConsumerGroupHeartbeat was being used against a share group.
Making sure that committing offsets to a share group fails with GroupIdNotFoundException rather than java.lang.UnsupportedOperation. This was the cause of a name collision between a share group and a consumer group when using kafka-consumer-groups.sh --reset-offsets which inadvertently created a consumer group of the same name.

 Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
2024-09-04 00:16:15 +05:30
Sean Quah b8ea409132
MINOR: Log when a consumer group is created by the admin client (#17073)
Reviewers: David Jacot <djacot@confluent.io>, Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
2024-09-04 02:37:54 +08:00
Mickael Maison 839431e591
KAFKA-17468 Move kafka/log/remote/quota classes to storage module (#17074)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-04 02:18:47 +08:00
Kamal Chandraprakash 88b9ff30ad
KAFKA-15859 Introduce remote.list.offsets.request.timeout.ms dynamic config (#17045)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-03 23:06:01 +08:00
TengYao Chi 2f9b236259
KAFKA-17294 Handle retriable errors when fetching offsets in new consumer (#16833)
The original behavior was implemented to maintain the behavior of the Classic consumer, where the ConsumerCoordinator would do the same when handling the OffsetFetchResponse. This behavior is being updated for the legacy coordinator as part of KAFKA-17279, to retry on all retriable errors.

We should review and update the CommitRequestManager to align with this, and retry on all retriable errors, which seems sensible when fetching offsets.

The corresponding PR for classic consumer is #16826

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-03 19:05:29 +08:00
Omnia Ibrahim f59d829381
KAFKA-15853 Move TransactionLogConfig and TransactionStateManagerConfig getters out of KafkaConfig (#16665)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-03 18:24:12 +08:00
Mickael Maison c30615e6d7
KAFKA-17430: Move RequestChannel.Metrics/RequestMetrics to server module (#17015)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
2024-09-03 10:11:47 +02:00
ShivsundarR 743e185c8b
KAFKA-17450 Reduced the handlers for handling ShareAcknowledgeResponse (#17061)
Currently there are 4 handler functions present for handling ShareAcknowledge responses. ShareConsumeRequestManager had an interface and the respective handlers would implement it. Instead of having 4 different handlers for this, now using AcknowledgeRequestType, we could merge the code and have only 2 handler functions, one for ShareAcknowledge success and one for ShareAcknowledge failure, eliminating the need for the interface.

This PR also fixes a bug - We were not using the time at which the response was received while handling the ShareAcknowledge response, we were using an outdated time. Now the bug is fixed.

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
2024-09-03 11:31:49 +08:00