11999 Commits

Author SHA1 Message Date
Mickael Maison 74be72a559
MINOR: Various fixes in the docs (#14914)
- Only use https links
- Fix broken HTML tags
- Replace usage of <tt> which is deprecated with <code>
- Replace hardcoded version numbers

Reviewers: Chris Egerton <fearthecellos@gmail.com>, Greg Harris <gharris1727@gmail.com>
2023-12-04 22:06:49 +01:00
Apoorv Mittal 7a6d2664cd
KAFKA-15663, KAFKA-15794: Telemetry reporter and request handling (KIP-714) (#14909)
Part of KIP-714.

Implements ClientTelemetryReporter, which manages the lifecycle of client metrics collection. The reporter also defines TelemetrySender, which will be used by network clients to send API calls to the broker.

Reviewers: Andrew Schofield <aschofield@confluent.io>, Philip Nee <pnee@confluent.io>, Matthias J. Sax <matthias@confluent.io>
2023-12-04 11:44:56 -08:00
David Jacot ddf99880d7
MINOR: Fix ConsumerNetworkThread shutdown (#14913)
This patch fixes a race condition in the shutdown logic of the `ConsumerNetworkThread`. The `running` variable could be set to `true` after `closeInternal` was called.
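
A minimal sketch of this kind of shutdown race and one way to guard against it. The `NetworkThread` class and its method names are hypothetical and are not Kafka's actual `ConsumerNetworkThread`; the sketch only assumes that the loop and the close path agree on a single lock, so that `running` can never become true again once close has been requested.

```python
import threading


class NetworkThread:
    """Hypothetical stand-in for a background network thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._closed = False
        self.running = False
        self.iterations = 0

    def run_once(self):
        """Do one loop iteration; return False when the loop should stop."""
        with self._lock:
            # Never flip `running` back to True once close() has been called.
            if self._closed:
                self.running = False
                return False
            self.running = True
        self.iterations += 1
        return True

    def close(self):
        """Request shutdown; later run_once() calls become no-ops."""
        with self._lock:
            self._closed = True
            self.running = False
```

Checking the closed flag and setting `running` under the same lock is the design choice that removes the window in which a late loop start could undo the shutdown.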

Reviewers: Andrew Schofield <aschofield@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>
2023-12-04 11:01:59 -08:00
David Mao bbe87322e6
MINOR: Fix flaky test RefreshingHttpsJwksTest.testBasicScheduleRefresh (#14888)
This test is flaky because maybeExpediteRefresh schedules a refresh in a background thread. Instead pass through a mock executor service so that the refresh is executed directly.
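
The idea of passing in an executor that runs work directly can be sketched as follows. `DirectExecutor` and `Refresher` are hypothetical names used only for illustration; the real test uses a mock executor service in Java, which this sketch does not reproduce.

```python
from concurrent.futures import Future


class DirectExecutor:
    """Hypothetical executor that runs each task synchronously in submit()."""

    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:  # surface task failures through the future
            future.set_exception(exc)
        return future


class Refresher:
    """Hypothetical component that schedules refreshes on an executor."""

    def __init__(self, executor):
        self._executor = executor
        self.refresh_count = 0

    def _refresh(self):
        self.refresh_count += 1

    def maybe_expedite_refresh(self):
        return self._executor.submit(self._refresh)
```

With `Refresher(DirectExecutor())`, the refresh has already happened by the time `maybe_expedite_refresh()` returns, so a test can assert on the result without waiting on a background thread.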

---------

Co-authored-by: ashwinpankaj <appankaj@amazon.com>

Reviewers: Ashwin Pankaj <apankaj@confluent.io>, Kirk True <ktrue@confluent.io>, Justine Olshan <jolshan@confluent.io>
2023-12-04 09:52:38 -08:00
Christo Lolov d4c95cfc2a
KAFKA-14133: Migrate ProcessorStateManagerTest and StreamThreadTest to Mockito (#13932)
This pull request is an attempt to get what has started in #12524 to completion as part of the Streams project migration to Mockito.

Reviewers: Divij Vaidya <diviv@amazon.com>, Bruno Cadonna <cadonna@apache.org>
2023-12-04 18:37:57 +01:00
Colin Patrick McCabe 397582678b
MINOR: fix BrokerRegistrationRequest broken by KAFKA-15922 (#14887)
Reviewers: David Arthur <mumrah@gmail.com>, Justine Olshan <jolshan@confluent.io>
2023-12-04 09:22:35 -08:00
Max Riedel b7c99e22a7
KAFKA-14509: [2/N] Implement server side logic for ConsumerGroupDescribe API (#14544)
This patch implements the ConsumerGroupDescribe API.

Reviewers: David Jacot <djacot@confluent.io>
2023-12-04 07:19:28 -08:00
Andras Katona 270be2dea5
MINOR: Upgrade jetty to 9.4.53.v20231009 (#14877)
2023-12-04 10:54:27 +01:00
Andrew Schofield b6571a5f44
MINOR: Experimentally turn off consumer integration tests using new consumer (#14904)
This is part of the investigation into recent build instability. It simply turns off the consumer integration tests that use the new AsyncKafkaConsumer to see whether the build runs smoothly.

Reviewers: David Jacot <djacot@confluent.io>
2023-12-04 01:18:29 -08:00
Bruno Cadonna 0cf227dd4f
KAFKA-14438: Throw if async consumer configured with invalid group ID (#14872)
Verifies that the group ID passed into the async consumer is valid. That is, if the group ID is not null, it must not be empty or consist only of whitespace.
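
A minimal sketch of such a check, assuming a hypothetical `validate_group_id` helper and a `ValueError` as the failure signal (the actual exception type thrown by the consumer is not stated here):

```python
from typing import Optional


def validate_group_id(group_id: Optional[str]) -> Optional[str]:
    """Accept None or a non-blank string; reject empty or whitespace-only IDs."""
    if group_id is not None and group_id.strip() == "":
        raise ValueError("group ID must not be empty or whitespace-only")
    return group_id
```

Under these assumptions, `validate_group_id(None)` and `validate_group_id("g1")` pass through unchanged, while `""` and `"   "` are rejected.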

This change stores the group ID in the group metadata because KAFKA-15281 about the group metadata API will build on that.

Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>
2023-12-03 23:11:41 +01:00
Andrew Schofield bce2d4a8b6
KAFKA-15953: Refactor polling delays (#14897)
Caches the maximum time to wait in the consumer network thread so the application thread is better isolated from the request managers.

Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
2023-12-03 23:09:12 +01:00
Lucas Brutschy 59ac9be21c
HOTFIX: fix ConsistencyVectorIntegrationTest failure (#14895)
#14570 changed the result for KeyQuery from ValueAndTimestamp<V> to
V, but forgot to update ConsistencyVectorIntegrationTest accordingly.
2023-12-03 23:06:41 +01:00
Matthias J. Sax 1a2f74be67
MINOR: fix typo
2023-12-01 15:39:32 -08:00
Matthias J. Sax b22bbd656c
MINOR: cleanup internal Iterator impl (#14889)
makeNext() is internal and visibility should not be extended to `public`

Reviewers: Walker Carlson <wcarlson@confluent.io>
2023-12-01 11:53:07 -08:00
Lucas Brutschy bfee3b3c6b
KAFKA-15690: Fix restoring tasks on partition loss, flaky EosIntegrationTest (#14869)
The following race can happen in the state updater code path:

1. Task is restoring, owned by state updater
2. We fall out of the consumer group, lose all partitions
3. We therefore register a "TaskManager.pendingUpdateAction", to CLOSE_DIRTY
4. We also register a "StateUpdater.taskAndAction" to remove the task
5. We get the same task reassigned. Since it's still owned by the state updater, we don't do much
6. The task completes restoration
7. The "StateUpdater.taskAndAction" to remove will be ignored, since it's already restored
8. Inside "handleRestoredTasksFromStateUpdater", we close the task dirty because of the pending update action
9. We now have the task assigned, but it's closed.

To fix this particular race, we cancel the "close" pending update action. Furthermore, since we may have made progress in other threads during the missed rebalance, we need to add the task back to the state updater, to at least check if we are still at the end of the changelog. Finally, it seems we do not need to close dirty here, it's enough to close clean when we lose the task, related to KAFKA-10532.

This should fix the flaky EosIntegrationTest.

Reviewers: Bruno Cadonna <cadonna@apache.org>
2023-12-01 18:57:27 +01:00
Jason Gustafson a701c0e04f
MINOR: Fix flaky `DescribeClusterRequestTest.testDescribeClusterRequestIncludingClusterAuthorizedOperations` (#14890)
Test startup does not assure that all brokers are registered. In flaky failures,
the `DescribeCluster` API does not return a complete list of brokers. To fix
the issue, we add a call to `ensureConsistentKRaftMetadata()` to ensure that all
brokers are registered and have caught up to current metadata.

Reviewers: David Jacot <djacot@confluent.io>
2023-12-01 09:33:17 -08:00
Jeff Kim ba49006561
MINOR: disable test_transactions with new group coordinator
https://issues.apache.org/jira/browse/KAFKA-14505 is not done yet so we need to disable the system test. Added a comment in the jira to re-enable once it's implemented.

Reviewers: Justine Olshan <jolshan@confluent.io>
2023-12-01 08:47:12 -08:00
Andrew Schofield 21edb70788
KAFKA-15890: Consumer.poll with long timeout unaware of assigned partitions (#14835)
In the new consumer, Consumer.poll(Duration timeout) blocks for the entire duration. If the consumer is joining a group and has not yet received its assignment, the poll begins before any assignment has been received. Because the poll is blocked, it does not notice when partitions are assigned, and it subsequently does not return any records. The old consumer only blocks for the duration of the heartbeat interval and loops until the poll timeout has passed, and is thus able to check for assignments received.

When this problem has been fixed, there remains another which prevents the group becoming stable. Because the consumer repeatedly sends the list of topic-partitions that it has been assigned to the group coordinator, the coordinator responds with the list of topic-partitions, which causes the consumer to remain reconciling indefinitely. By making the building of ConsumerGroupHeartbeatRequest stateful, the loop is ended and the group becomes stable as expected.
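
The slice-and-loop polling described in the first paragraph can be sketched as below. This is an illustration under stated assumptions, not the consumer's implementation: `poll`, `fetch`, `has_assignment`, `now_ms`, and `wait_ms` are hypothetical names, and the clock and wait function are injected so the behavior is deterministic.

```python
def poll(fetch, has_assignment, timeout_ms, slice_ms, now_ms, wait_ms):
    """Poll in short slices so a newly received assignment is noticed promptly.

    fetch() returns a (possibly empty) list of records; now_ms() returns the
    current time; wait_ms(n) blocks for n milliseconds.
    """
    deadline = now_ms() + timeout_ms
    while True:
        if has_assignment():
            records = fetch()
            if records:
                return records
        remaining = deadline - now_ms()
        if remaining <= 0:
            return []
        # Block for at most one slice, then re-check the assignment.
        wait_ms(min(remaining, slice_ms))
```

Blocking for at most `slice_ms` at a time bounds how long an assignment can go unnoticed, at the cost of waking up more often than a single long wait would.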

Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>, Lianet Magrans <lianetmr@gmail.com>
2023-12-01 15:41:30 +01:00
Andrew Schofield 1750d735cd
KAFKA-15842: Correct handling of KafkaConsumer.committed for new consumer (#14859)
This PR fixes some details of the interface to KafkaConsumer.committed which were different between the existing consumer and the new consumer.

Adds a unit test that validates the behaviour is the same for both consumer implementations.

Reviewers: Kirk True <ktrue@confluent.io>, Bruno Cadonna <cadonna@apache.org>
2023-12-01 14:37:21 +01:00
David Jacot 5fdfb3afaf
MINOR: Disable FetchFromFollowerIntegrationTest.testRackAwareRangeAssignor (#14876)
`FetchFromFollowerIntegrationTest.testRackAwareRangeAssignor` is extremely flaky and we have never been able to fix it. This patch disables it until we find a solution to make it reliable with https://issues.apache.org/jira/browse/KAFKA-15020.

Reviewers: Stanislav Kozlovski <stanislav@confluent.io>
2023-12-01 00:05:46 -08:00
Ismael Juma db308a9fe5
MINOR: Upgrade to gradle 8.5 (#14883)
Reviewers: Satish Duggana <satishd@apache.org>
2023-12-01 09:35:45 +05:30
Igor Soarez 6b87c85291
KAFKA-15886: Always specify directories for new partition registrations
When creating partition registrations directories must always be defined.

If creating a partition from a PartitionRecord or PartitionChangeRecord from an older version that
does not support directory assignments, then DirectoryId.MIGRATING is assumed.

If creating a new partition, or triggering a change in assignment, DirectoryId.UNASSIGNED should be
specified, unless the target broker has a single online directory registered, in which case the
replica should be assigned directly to that single directory.
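
The rules above can be sketched as two small functions. The string constants are placeholders standing in for DirectoryId.MIGRATING and DirectoryId.UNASSIGNED, and both function names are hypothetical; the real logic operates on partition records and broker registrations.

```python
MIGRATING = "MIGRATING"    # placeholder for DirectoryId.MIGRATING
UNASSIGNED = "UNASSIGNED"  # placeholder for DirectoryId.UNASSIGNED


def directory_for_new_replica(online_dirs):
    """Pick the directory for a replica on a new or changed assignment."""
    if len(online_dirs) == 1:
        # A single online directory leaves no choice, so assign it directly.
        return online_dirs[0]
    return UNASSIGNED


def directories_from_record(replicas, record_dirs):
    """Directories for a partition loaded from a record.

    Records written by older versions carry no directory assignments
    (modeled here as None or an empty list), so MIGRATING is assumed.
    """
    if not record_dirs:
        return [MIGRATING] * len(replicas)
    return list(record_dirs)
```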

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2023-11-30 14:10:47 -08:00
Hanyu Zheng f1cd11dcc5
KAFKA-15629: Proposal to introduce IQv2 Query Types: TimestampedKeyQuery and TimestampedRangeQuery (#14570)
Implements KIP-992.

Adds TimestampedKeyQuery and TimestampedRangeQuery (IQv2) for ts-kv-store, and changes the semantics of the existing KeyQuery and RangeQuery when issued against a ts-kv-store: they now unwrap value-and-timestamp and return only the plain value.

Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-11-30 12:14:23 -08:00
Luke Chen 37416e1aeb
KAFKA-15489: resign leadership when no fetch or fetch snapshot from majority voters (#14428)
In KIP-595, we expect to piggy-back on the `quorum.fetch.timeout.ms` config: if the leader did not receive Fetch requests from a majority of the quorum for that amount of time, it would begin a new election to resolve the network partition in the quorum. However, the current KRaft implementation is missing this behavior. This PR adds it.

This commit includes:
1. Adds a timer with a configurable timeout to `LeaderState` and checks whether it has expired each time the leader is polled. If it has expired, the leader resigns and a new election starts.

2. Adds `fetchedVoters` to `LeaderState` and updates it each time a FETCH or FETCH_SNAPSHOT request is received. Once majority - 1 of the remote voters have sent such requests, it is cleared and the timer is reset.
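
A simplified model of the two mechanisms above, as a sketch only. This `LeaderState` is a hypothetical Python class and not the KRaft one; time is passed in explicitly, and the single-voter case (where no remote fetches exist) is deliberately left out.

```python
class LeaderState:
    """Hypothetical, simplified model of a quorum leader's fetch-timeout check."""

    def __init__(self, local_id, voters, fetch_timeout_ms, now_ms):
        self.local_id = local_id
        self.voters = set(voters)
        self.fetch_timeout_ms = fetch_timeout_ms
        self.deadline_ms = now_ms + fetch_timeout_ms
        self.fetched_voters = set()

    def record_fetch(self, voter_id, now_ms):
        """Note a FETCH or FETCH_SNAPSHOT request from a remote voter."""
        if voter_id == self.local_id or voter_id not in self.voters:
            return
        self.fetched_voters.add(voter_id)
        # The leader itself counts toward the majority, so hearing from
        # majority - 1 remote voters is enough to reset the timer.
        majority = len(self.voters) // 2 + 1
        if len(self.fetched_voters) >= majority - 1:
            self.fetched_voters.clear()
            self.deadline_ms = now_ms + self.fetch_timeout_ms

    def should_resign(self, now_ms):
        """Checked on each poll; True once the fetch timer has expired."""
        return now_ms >= self.deadline_ms
```

With five voters, for example, the majority is three, so fetches from two distinct remote voters within the timeout keep the leader in place.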

Reviewers: José Armando García Sancio <jsancio@apache.org>
2023-11-30 11:34:44 -08:00
Colin Patrick McCabe a94bc8d6d5
KAFKA-15922: Add a MetadataVersion for JBOD (#14860)
Assign MetadataVersion.IBP_3_7_IV2 to JBOD.

Move KIP-966 support to MetadataVersion.IBP_3_7_IV3.

Create MetadataVersion.LATEST_PRODUCTION as the latest metadata version that can be used when formatting a
new cluster, or upgrading a cluster using kafka-features.sh. This will allow us to clearly distinguish between stable
and unstable metadata versions for the first time.

Reviewers: Igor Soarez <soarez@apple.com>, Ron Dagostino <rndgstn@gmail.com>, Calvin Liu <caliu@confluent.io>, Proven Provenzano <pprovenzano@confluent.io>
2023-11-30 10:35:13 -08:00
Jason Gustafson a35e021925
MINOR: Fix flaky `MetadataLoaderTest.testNoPublishEmptyImage` (#14875)
There is a race in the assertion on `capturedImages`. Since the future is signaled first, it is still possible to see an empty list. By adding to the collection first, we can ensure the assertion will succeed.
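
The ordering fix can be illustrated with a small sketch. `CapturingPublisher` is a hypothetical class, and a `threading.Event` stands in for the future used by the real test.

```python
import threading


class CapturingPublisher:
    """Hypothetical publisher that records images and signals the first one."""

    def __init__(self):
        self.captured_images = []
        self.first_publish = threading.Event()

    def on_publish(self, image):
        # Record the image before signaling, so a waiter that wakes up
        # on the event always finds the image in the list.
        self.captured_images.append(image)
        self.first_publish.set()
```

If the two statements in `on_publish` were swapped, a test thread could wake up between them and observe an empty list, which is the failure mode described above.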

Reviewers: David Jacot <djacot@confluent.io>
2023-11-30 09:50:19 -08:00
Nick Telford 96b43bf16f
KAFKA-14412: Add ProcessingThread tag interface (#14839)
This interface provides a common supertype for `StreamThread` and
`DefaultTaskExecutor.TaskExecutorThread`, which will be used by KIP-892
to differentiate between "processing" threads and interactive query
threads.

This is needed because `DefaultTaskExecutor.TaskExecutorThread` is
`private`, so cannot be seen directly from `RocksDBStore`.

Reviewer: Bruno Cadonna <cadonna@apache.org>
2023-11-30 09:44:02 +01:00
Jason Gustafson 085f1d340b
MINOR: No need for response callback when applying controller mutation throttle (#14861)
With `AbstractResponse.maybeSetThrottleTimeMs`, we don't need to use a callback to build the response with the respective throttle.

Reviewers: David Jacot <djacot@confluent.io>
2023-11-29 16:33:05 -08:00
Colin Patrick McCabe bd18551b32
MINOR: DirectoryId.MIGRATING should be all zeros (#14858)
DirectoryId.MIGRATING should be all zeros. All zeros is the default Uuid value in KPRC, and
MIGRATING is the default directory ID value.

Reviewers: Ron Dagostino <rdagostino@confluent.io>
2023-11-29 13:12:33 -08:00
Greg Harris 9f896ed6c9
KAFKA-15816: Fix leaked sockets in streams tests (#14769)
Signed-off-by: Greg Harris <greg.harris@aiven.io>
Reviewers: Matthias J. Sax <mjsax@apache.org>
2023-11-29 11:53:34 -08:00
Hao Li e7b9bd5a26
KAFKA-15022: add config for balance subtopology in rack aware task assignment (#14711)
Part of KIP-925.

Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-11-29 11:33:52 -08:00
Lucas Brutschy c0ec8131d8
KAFKA-15865: Remove autocommit completion event (#14831)
There is no callback associated with autocommit, so I do not think we need this event. This closes KAFKA-15865.

Reviewers: Bruno Cadonna <cadonna@apache.org>
2023-11-29 19:02:08 +01:00
Okada Haruki d71d0639d9
KAFKA-15046: Get rid of unnecessary fsyncs inside UnifiedLog.lock to stabilize performance (#14242)
Any blocking operation performed while holding UnifiedLog.lock can lead to serious performance (even availability) issues, yet several code paths currently call fsync(2) inside the lock.
While the lock is held, all subsequent produce requests against the partition may block.
On a poorly performing disk, this easily causes all request handlers to become busy.
Even worse, when a disk glitches for tens of seconds (not rare with spinning drives), the broker becomes unable to process any requests while remaining unfenced from the cluster (i.e. a "zombie"-like state).
This PR gets rid of 4 cases of essentially unnecessary fsync(2) calls performed under the lock:
(1) ProducerStateManager.takeSnapshot at UnifiedLog.roll
I moved the fsync(2) call to the scheduler thread as part of the existing "flush-log" job (before incrementing the recovery point).
Since the snapshot is still guaranteed to be flushed before the recovery point is incremented, this change shouldn't cause any problems.
(2) ProducerStateManager.removeAndMarkSnapshotForDeletion as part of log segment deletion
This method calls Utils.atomicMoveWithFallback with needFlushParentDir = true internally, which calls fsync.
I changed it to call Utils.atomicMoveWithFallback with needFlushParentDir = false (which is consistent with index file deletion, which also doesn't flush the parent dir).
This change shouldn't cause problems either.
(3) LeaderEpochFileCache.truncateFromStart when incrementing log-start-offset
This path is called from deleteRecords on request-handler threads.
We don't actually need fsync(2) here either.
On unclean shutdown, a few leader epochs might remain in the file, but LogLoader handles them on start-up, so this is not a problem.
(4) LeaderEpochFileCache.truncateFromEnd as part of log truncation
Likewise, we don't need fsync(2) here, since any epochs left untruncated by an unclean shutdown will be handled by the log loading procedure.

Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Justine Olshan <jolshan@confluent.io>, Jun Rao <junrao@gmail.com>
2023-11-29 09:43:44 -08:00
Apoorv Mittal f1819f4480
KAFKA-15778 & KAFKA-15779: Implement metrics manager (KIP-714) (#14699)
The PR provides an implementation of the client metrics manager along with other classes. The manager is responsible for supporting 3 operations:

- UpdateSubscription - From kafka-configs.sh and reload from metadata cache.
- Process Get Telemetry Request - From KafkaApis.scala
- Process Push Telemetry Request - From KafkaApis.scala

The manager maintains an in-memory cache to keep track of client instances by their instance ID.

Reviewers: Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>
2023-11-29 09:20:07 -08:00
David Jacot 5ae0b49839
KAFKA-14505; [1/N] Add support for transactional writes to CoordinatorRuntime (#14844)
This patch adds support for transactional writes to the CoordinatorRuntime framework. This mainly consists in adding CoordinatorRuntime#scheduleTransactionalWriteOperation and in adding the producerId and producerEpoch to various interfaces. The patch also extends the CoordinatorLoaderImpl and the CoordinatorPartitionWriter accordingly.

Reviewers: Justine Olshan <jolshan@confluent.io>
2023-11-29 08:54:23 -08:00
Josep Prat 68f4c7e22e
Update NOTICE-binary with latest additions (#14865)
Signed-off-by: Josep Prat <josep.prat@aiven.io>

Reviewers: Mickael Maison <mickael.maison@gmail.com>
2023-11-29 11:20:21 +01:00
Philip Nee 7999fd35d7
KAFKA-15887: Ensure FindCoordinatorRequest is sent before closing (#14842)
A few bugs were introduced by the previous issues. These are:

* During testing or in some edge cases, the coordinator request manager might hold on to an inflight request forever. Therefore, when invoking coordinatorRequestManager.poll(), nothing would be returned. Here we explicitly create a FindCoordinatorRequest regardless of the current request state because we want to actively search for a coordinator
* ensureCoordinatorReady() might be stuck in an infinite loop if the client fails to find a coordinator. Even though the consumer would eventually be able to shut down, this is undesirable.
* The current asyncConsumerTest mixes background/network thread shutdown with the consumer shutdown. As the goal of the module is unit testing, we should try to test the shutdown procedure separately. Therefore, this PR adds a Mockito.doAnswer call to the applicationEventHandler.close(). Tests that are testing shutdown are calling shutdown() explicitly.

Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
2023-11-29 11:16:43 +01:00
Mickael Maison a8d5007bfa
MINOR: Update LICENSE-binary for 3.7.0 (#14833)
Reviewers: Josep Prat <josep.prat@aiven.io>
2023-11-29 11:00:22 +01:00
Proven Provenzano 14571054aa
KAFKA-15904: Only add directory.id to meta.properties when migrating or in kraft mode
Only add directory.id to meta.properties when migrating to kraft mode, or already in
kraft mode. This prevents incompatibilities with older Kafka releases, which checked
that each directory in a JBOD ensemble had the same meta.properties values.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2023-11-28 23:14:10 -08:00
Apoorv Mittal 009b57d870
KAFKA-15618: Kafka metrics collector and supporting classes (KIP-714) (#14620)
The PR outlines the classes used to collect client metrics through the KafkaMetricsCollector implementation. The MetricsCollector defines a mechanism to collect client metrics in sum and gauge metric formats. This requires defining cumulative and delta telemetry metrics while collecting raw metrics.

The single point metric class helps create an OTLP-format Metric object, wrapped by the single point metric class itself.

Reviewers: Andrew Schofield <aschofield@confluent.io>, Xavier Léauté <xavier@confluent.io>, Philip Nee <pnee@confluent.io>, Matthias J. Sax <matthias@confluent.io>
2023-11-28 22:07:22 -08:00
Hao Li 10555ec6de
KAFKA-15022: Only relax edge when path exist (#14198)
If there is no path from u to v, we should not represent it as Integer.MAX_VALUE but as null instead.
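
A minimal sketch of relaxing an edge only when the source is reachable, using `None` where the commit uses null. The `relax` function and the dictionary-based distance table are assumptions made for illustration. The motivation is that a maximum-integer sentinel can be mistaken for a real distance once an edge weight is added to it, whereas a missing value cannot.

```python
from typing import Dict, Hashable, Optional


def relax(dist: Dict[Hashable, Optional[int]], u, v, weight: int) -> bool:
    """Relax edge u -> v; None means 'no known path'.

    Returns True if dist[v] was improved.
    """
    du = dist.get(u)
    if du is None:
        return False  # u is unreachable so far; nothing to propagate
    candidate = du + weight
    dv = dist.get(v)
    if dv is None or candidate < dv:
        dist[v] = candidate
        return True
    return False
```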

Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-11-28 20:44:12 -08:00
Kamal Chandraprakash 20b0bf063b
MINOR: Fix the flaky TBRLMM `testInternalTopicExists` test (#14840)
The internal topic creation is asynchronous, which makes the test flaky. The test is meant to assert that doesTopicExist returns true when a topic exists, so it now creates a dummy internal topic instead.

Reviewers: Luke Chen <showuon@gmail.com>, Jun Rao <jun@confluent.io>, Satish Duggana <satishd@apache.org>
2023-11-29 10:50:22 +08:00
Colin Patrick McCabe 4874bf818a
KAFKA-15311: Fix docs about reverting to ZooKeeper mode during KRaft migration (#14160)
- Remove the outdated statement that delegation tokens aren't supported by KRaft.

- Add an invitation to report migration bugs on JIRA.

- Define terminology such as "zk migration phases".

- Mention MV can't be changed during migration.

- Explain how to revert to ZK mode.

Reviewers: Ron Dagostino <rndgstn@gmail.com>, David Arthur <mumrah@gmail.com>
2023-11-28 14:03:59 -08:00
Andrew Schofield 161b94d196
KAFKA-15544: Enable integration tests for new consumer (#14758)
This commit parameterizes the consumer integration tests so they can be run against
the existing "generic" group protocol and the new "consumer" group protocol
introduced in KIP-848.

The KIP-848 client code is under construction so some of the tests do not run on
both variants to start with, but the idea is that the tests can be enabled as the gaps
in functionality are closed.

Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Kirk True <ktrue@confluent.io>
2023-11-28 21:26:59 +01:00
Lucas Brutschy f3e776fd34
MINOR: time-out hanging ZooKeeperClientTest (#14855)
As described in KAFKA-9470, testBlockOnRequestCompletionFromStateChangeHandler
will block for hours occasionally.

If it passes, it takes 0.5 seconds, so a minute timeout should be safe.

This is not a fix for KAFKA-9470, it's just aiming to make the CI more stable.

Reviewers: David Jacot <djacot@confluent.io>, Matthias J. Sax <matthias@confluent.io>
2023-11-28 12:04:53 -08:00
vamossagar12 bb1c4465c9
KAFKA-14516: [1/N] Static Member leave, join, re-join request using ConsumerGroupHeartbeats (#14432)
This patch adds support for static membership to the new consumer group protocol. With it, a static member can join, re-join, temporarily leave, and leave. When a member leaves with the expectation of rejoining, it must rejoin within the session timeout. Otherwise, it is kicked out of the consumer group.

Reviewers: David Jacot <djacot@confluent.io>
2023-11-28 10:08:16 -08:00
Apoorv Mittal 38f2faf83f
KAFKA-15681: Add support of client-metrics in kafka-configs.sh (KIP-714) (#14632)
The PR adds support of alter/describe configs for client-metrics as defined in KIP-714

Reviewers: Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>
2023-11-28 09:24:25 -08:00
Calvin Liu db626a4804
KAFKA-15582 Unset the previous broker epoch if version < 2 (#14784)
When using older versions of the broker registration RPC, make sure that the new PreviousBrokerEpoch field is set to the default value when building the request object.
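
A sketch of version-gating a field when building a request. The dictionary layout, the `build_registration_request` function, and the default value of -1 are assumptions for illustration and are not taken from the Kafka request schema.

```python
DEFAULT_PREVIOUS_BROKER_EPOCH = -1  # assumed default, for illustration only


def build_registration_request(version, broker_id, previous_broker_epoch):
    """Build a request dict, keeping the epoch field at its default before v2."""
    request = {
        "brokerId": broker_id,
        "previousBrokerEpoch": DEFAULT_PREVIOUS_BROKER_EPOCH,
    }
    if version >= 2:
        # Only versions that know the field carry a non-default value.
        request["previousBrokerEpoch"] = previous_broker_epoch
    return request
```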

Reviewers: David Arthur <mumrah@gmail.com>
2023-11-28 10:36:59 -05:00
Mickael Maison 3c0840d28e
MINOR: Fix typo in 3.2.0 upgrade notes (#14851)
Reviewers: Josep Prat <josep.prat@aiven.io>
2023-11-28 11:32:46 +01:00
Hao Li bbd75b80ce
KAFKA-15022: Detect negative cycle from one source (#14696)
Introduce a dummy node connected to every other node and run Bellman-Ford from the dummy node once instead of from every node in the graph.
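
The single-source approach can be sketched as follows. This is a generic Bellman-Ford illustration under stated assumptions, not the Streams implementation: `has_negative_cycle` is a hypothetical function, nodes are numbered 0..num_nodes-1, and the dummy node gets a zero-weight edge to every other node so that any negative cycle is reachable from it.

```python
def has_negative_cycle(num_nodes, edges):
    """Detect a negative cycle with one Bellman-Ford run from a dummy source.

    edges is a list of (u, v, weight) tuples.
    """
    dummy = num_nodes
    all_edges = list(edges) + [(dummy, v, 0) for v in range(num_nodes)]
    dist = [None] * (num_nodes + 1)  # None means 'not reached yet'
    dist[dummy] = 0

    def improves(u, v, w):
        return dist[u] is not None and (dist[v] is None or dist[u] + w < dist[v])

    # With V = num_nodes + 1 vertices, V - 1 passes settle all shortest paths
    # unless a negative cycle exists.
    for _ in range(num_nodes):
        changed = False
        for u, v, w in all_edges:
            if improves(u, v, w):
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False
    # One more pass: any further improvement implies a negative cycle.
    return any(improves(u, v, w) for u, v, w in all_edges)
```

Running once from the dummy node costs a single O(V·E) pass structure, rather than repeating the algorithm from every node in the graph.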

Reviewers: Qichao Chu (@ex172000), Matthias J. Sax <matthias@confluent.io>
2023-11-28 00:29:00 -08:00