Looking at the [history](https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.timeZoneId=Europe%2FZurich&tests.container=kafka.api.PlaintextAdminIntegrationTest&tests.test=testConsumerGroups(String%2C%20String)%5B2%5D), I found that one source of flakiness is commitSync failing with CommitFailedException. We can ignore the exception and retry on the next iteration.
```
[2025-01-07 10:17:00,783] ERROR [Consumer instanceId=test_instance_id_1, clientId=test_client_id, groupId=test_group_id] OffsetCommit failed for member VfImExrxT3-w_HNJcTkqnw with stale member epoch error. Last epoch sent: 2 (org.apache.kafka.clients.consumer.internals.CommitRequestManager:773)
Exception in thread "Thread-6" org.apache.kafka.clients.consumer.CommitFailedException: OffsetCommit failed with stale member epoch.The member epoch is stale. The member must retry after receiving its updated member epoch via the ConsumerGroupHeartbeat API.
at org.apache.kafka.clients.consumer.internals.CommitRequestManager.commitSyncExceptionForError(CommitRequestManager.java:481)
at org.apache.kafka.clients.consumer.internals.CommitRequestManager.lambda$commitSyncWithRetries$7(CommitRequestManager.java:472)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
at org.apache.kafka.clients.consumer.internals.CommitRequestManager$OffsetCommitRequestState.onResponse(CommitRequestManager.java:776)
at org.apache.kafka.clients.consumer.internals.CommitRequestManager$RetriableRequestState.handleClientResponse(CommitRequestManager.java:893)
at org.apache.kafka.clients.consumer.internals.CommitRequestManager$RetriableRequestState.lambda$buildRequestWithResponseHandling$0(CommitRequestManager.java:883)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
at org.apache.kafka.clients.consumer.internals.NetworkClientDelegate$FutureCompletionHandler.onComplete(NetworkClientDelegate.java:433)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:669)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:661)
at org.apache.kafka.clients.consumer.internals.NetworkClientDelegate.poll(NetworkClientDelegate.java:153)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.runOnce(ConsumerNetworkThread.java:160)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.run(ConsumerNetworkThread.java:106)
```
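The ignore-and-retry approach can be sketched roughly as follows. This is a minimal illustration that does not use the Kafka client: the nested `CommitFailedException` is a local stand-in for `org.apache.kafka.clients.consumer.CommitFailedException`, and `commitWithRetry` and the `Runnable` commit action are hypothetical names, not code from the test.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CommitRetrySketch {
    // Local stand-in for org.apache.kafka.clients.consumer.CommitFailedException,
    // so that the sketch compiles without the Kafka client on the classpath.
    static class CommitFailedException extends RuntimeException {
        CommitFailedException(String message) {
            super(message);
        }
    }

    // Runs the commit action once per iteration until it succeeds or the
    // iteration budget is exhausted. A CommitFailedException is ignored and
    // the commit is attempted again on the next iteration.
    // Returns the number of attempts made, or -1 if every attempt failed.
    static int commitWithRetry(Runnable commit, int maxIterations) {
        for (int attempt = 1; attempt <= maxIterations; attempt++) {
            try {
                commit.run();
                return attempt;
            } catch (CommitFailedException e) {
                // For example a stale member epoch: retry on the next iteration.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Hypothetical commit action that fails twice before succeeding.
        Runnable flakyCommit = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new CommitFailedException("stale member epoch");
            }
        };
        System.out.println("attempts: " + commitWithRetry(flakyCommit, 10));
    }
}
```

Any other exception type still propagates, so unrelated failures would continue to fail the test.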
Reviewers: Lianet Magrans <lmagrans@confluent.io>
Remove RaftManager.maybeDeleteMetadataLogDir since it was only used during ZK migration, and that code has been removed.
Similarly, remove RaftManagerTest.testKRaftBrokerDoesNotDeleteMetadataLog which tested that function.
Remove AutoTopicCreationManagerTest since it tests the ZK-mode-only AutoTopicCreationManager.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
30 sec is more in line with the default request.timeout.ms and is still very acceptable for a CLI interaction in my opinion.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Following https://github.com/apache/kafka/pull/18261, this patch updates the Share Coordinator to use the new record format.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Andrew Schofield <aschofield@confluent.io>
The consumer/producer JavaDocs still contain instructions for naively
computing the offset to be committed.
This PR updates the JavaDocs with regard to the improvements of KIP-1094.
Reviewers: Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Lianet Magrans <lmagrans@confluent.io>
There are times when the controller has a high event processing time, such as during startup, or when creating a topic with many partitions. We can see these processing times in the p99 metric (kafka.controller:type=ControllerEventManager,name=EventQueueProcessingTimeMs); however, it is difficult to see exactly which event is causing the high processing time.
With DEBUG logs, we see every event along with its processing time. Even with this, it's a bit tedious to find the event with a high processing time.
This PR logs all events which take longer than 2 seconds at ERROR level. This will help identify events that are taking far too long, and which could be disruptive to the operation of the controller. The slow event logging looks like this:
```
[2024-12-20 15:03:39,754] ERROR [QuorumController id=1] Exceptionally slow controller event createTopics took 5240 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
```
Also, every 60 seconds, it logs some event time statistics, including average time, maximum time, and the name of the event which took the longest. This periodic message looks like this:
```
[2024-12-20 15:35:04,798] INFO [QuorumController id=1] In the last 60000 ms period, 333 events were completed, which took an average of 12.34 ms each. The slowest event was handleCommit[baseOffset=0], which took 41.90 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
```
An operator can disable these logs by adding the following to their log4j config:
```
org.apache.kafka.controller.EventPerformanceMonitor=OFF
```
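The two kinds of log message described above can be illustrated with the following sketch. It is not the actual `org.apache.kafka.controller.EventPerformanceMonitor`: the class name, method names, and the wording used when no events completed are assumptions, and the methods return the message text instead of calling a logger so that the sketch stays self-contained.

```java
import java.util.Locale;

// Hypothetical sketch; not the real EventPerformanceMonitor implementation.
class SlowEventMonitorSketch {
    private final long slowThresholdMs;
    private int count = 0;
    private double totalMs = 0;
    private String slowestName = null;
    private double slowestMs = 0;

    SlowEventMonitorSketch(long slowThresholdMs) {
        this.slowThresholdMs = slowThresholdMs;
    }

    // Records one completed event. Returns the ERROR-level message if the
    // event exceeded the threshold, or null if nothing should be logged.
    String observe(String name, double durationMs) {
        count++;
        totalMs += durationMs;
        if (slowestName == null || durationMs > slowestMs) {
            slowestName = name;
            slowestMs = durationMs;
        }
        if (durationMs > slowThresholdMs) {
            return String.format(Locale.ROOT,
                "Exceptionally slow controller event %s took %d ms.",
                name, (long) durationMs);
        }
        return null;
    }

    // Builds the periodic INFO-level summary and resets the statistics.
    String summarizeAndReset(long periodMs) {
        String message;
        if (count == 0) {
            // Assumed wording for an empty period.
            message = "In the last " + periodMs + " ms period, no events were completed.";
        } else {
            message = String.format(Locale.ROOT,
                "In the last %d ms period, %d events were completed, which took an "
                    + "average of %.2f ms each. The slowest event was %s, which took %.2f ms.",
                periodMs, count, totalMs / count, slowestName, slowestMs);
        }
        count = 0;
        totalMs = 0;
        slowestName = null;
        slowestMs = 0;
        return message;
    }

    public static void main(String[] args) {
        SlowEventMonitorSketch monitor = new SlowEventMonitorSketch(2000);
        monitor.observe("handleCommit[baseOffset=0]", 41.9);
        System.out.println(monitor.observe("createTopics", 5240));
        System.out.println(monitor.summarizeAndReset(60000));
    }
}
```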
Reviewers: Colin P. McCabe <cmccabe@apache.org>
A simplified port of "ChangelogTopics" from the client side to the group coordinator.
Compared to the client-side version, the implementation uses immutable data structures, and returns the computed number of partitions instead of modifying mutable data structures and calling the admin client.
Reviewers: Bruno Cadonna <cadonna@apache.org>
This patch ensures we only analyze trunk builds for the Develocity report. Some other minor fixes are also included.
Reviewers: David Arthur <mumrah@gmail.com>
Clients in the Streams Rebalance Protocol send an "unconfigured" representation of the topology to the broker. That is, the number of input topics and (some) internal topics is not fixed, and regular expressions are not resolved. The broker takes this description of the topology and, together with the current state of the topics on the broker, derives a ConfiguredTopology. The configured topology is what StreamsGroupDescribe returns; it has all partition counts defined and all regular expressions resolved. The configured topology also contains missing internal topics that need to be created, and potentially configuration errors, such as missing source topics.
In this change, we add the internal data structures for representing the configured topology. They differ in some details from the data structures used in the RPCs. Most importantly, they can be evolved independently of the public interface.
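To make the idea of "configuring" a topology concrete, here is a rough sketch of resolving source topics and a regular expression against the topics known to the broker. All names (`ConfiguredTopologySketch`, `configure`, the field names) are hypothetical and much simpler than the data structures added in this change; internal topics are left out entirely.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for a configured topology.
class ConfiguredTopologySketch {
    // Resolved source topics with their partition counts.
    final Map<String, Integer> sourceTopicPartitions;
    // Explicitly named source topics that do not exist on the broker.
    final Set<String> missingSourceTopics;

    private ConfiguredTopologySketch(Map<String, Integer> partitions, Set<String> missing) {
        this.sourceTopicPartitions = Collections.unmodifiableMap(partitions);
        this.missingSourceTopics = Collections.unmodifiableSet(missing);
    }

    // Derives the configured view from the unconfigured description
    // (topic names plus an optional regex) and the broker's topic state.
    static ConfiguredTopologySketch configure(Set<String> sourceTopics,
                                              String sourceTopicRegex,
                                              Map<String, Integer> brokerTopics) {
        Map<String, Integer> resolved = new TreeMap<>();
        Set<String> missing = new TreeSet<>();
        for (String topic : sourceTopics) {
            Integer partitions = brokerTopics.get(topic);
            if (partitions == null) {
                missing.add(topic);
            } else {
                resolved.put(topic, partitions);
            }
        }
        if (sourceTopicRegex != null) {
            Pattern pattern = Pattern.compile(sourceTopicRegex);
            for (Map.Entry<String, Integer> entry : brokerTopics.entrySet()) {
                if (pattern.matcher(entry.getKey()).matches()) {
                    resolved.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return new ConfiguredTopologySketch(resolved, missing);
    }

    public static void main(String[] args) {
        ConfiguredTopologySketch configured = configure(
            Set.of("orders", "absent"),
            "clicks-.*",
            Map.of("orders", 3, "clicks-eu", 2, "clicks-us", 4, "other", 1));
        System.out.println("resolved: " + configured.sourceTopicPartitions);
        System.out.println("missing:  " + configured.missingSourceTopics);
    }
}
```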
Reviewers: Bruno Cadonna <cadonna@apache.org>
Currently, the log of LeaderState#timeUntilCheckQuorumExpires uses streams without a terminal operator, resulting in output like java.util.stream.ReferencePipeline$3@39660237.
This PR aims to fix this issue and improve the log message.
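The class of bug described here can be reproduced with a small sketch. The message text and method names below are hypothetical and are not taken from `LeaderState`; the point is only that concatenating a `Stream` into a string uses its default `toString()`, whereas a terminal operation such as `collect` produces the intended contents.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamLoggingSketch {
    // Buggy: the Stream object itself is concatenated, so the message
    // contains something like java.util.stream.ReferencePipeline$3@<hash>.
    static String withoutTerminalOperator(List<Integer> voterIds) {
        Stream<String> ids = voterIds.stream().map(String::valueOf);
        return "Unreachable voters: " + ids;
    }

    // Fixed: a terminal operation materializes the elements first.
    static String withTerminalOperator(List<Integer> voterIds) {
        String ids = voterIds.stream()
            .map(String::valueOf)
            .collect(Collectors.joining(", ", "[", "]"));
        return "Unreachable voters: " + ids;
    }

    public static void main(String[] args) {
        List<Integer> voters = List.of(1, 2, 3);
        System.out.println(withoutTerminalOperator(voters));
        System.out.println(withTerminalOperator(voters));
    }
}
```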
Reviewers: Luke Chen <showuon@gmail.com>
Relevant methods:
1. `List.of`, `Set.of`, `Map.of` and similar (introduced in Java 9)
2. Optional: `isEmpty` (introduced in Java 11), `stream` (introduced in Java 9).
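For reference, the methods listed above can be used as in the following short sketch. The class and helper names are hypothetical; only the JDK calls themselves come from the list.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

class ModernJavaApisSketch {
    // Optional.stream (Java 9) lets flatMap drop empty values directly.
    static List<String> presentValues(List<Optional<String>> values) {
        return values.stream()
            .flatMap(Optional::stream)
            .collect(Collectors.toList());
    }

    // Optional.isEmpty (Java 11) replaces the negated isPresent check.
    static String describe(Optional<String> value) {
        return value.isEmpty() ? "none" : value.get();
    }

    public static void main(String[] args) {
        // Immutable collection factories (Java 9).
        List<String> topics = List.of("a", "b");
        Set<Integer> brokers = Set.of(1, 2, 3);
        Map<String, Integer> partitions = Map.of("a", 1, "b", 2);
        System.out.println(topics.size() + " topics, " + brokers.size()
            + " brokers, " + partitions.size() + " entries");

        List<Optional<String>> values =
            List.of(Optional.of("x"), Optional.<String>empty(), Optional.of("y"));
        System.out.println(presentValues(values));
        System.out.println(describe(Optional.empty()));
    }
}
```

Note that the collections returned by these factories are unmodifiable, so calls such as `add` or `put` throw `UnsupportedOperationException`.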
Reviewers: Mickael Maison <mimaison@users.noreply.github.com>
Remove the flaky annotation from the following tests
* RemoteLogManagerTest#testFetchOffsetByTimestampWithTieredStorageDoesNotFetchIndexWhenExistsLocally
* All the children of BaseConsumerTest#testCoordinatorFailover
* TransactionsTest#testFailureToFenceEpoch
* TransactionsTest#testReadCommittedConsumerShouldNotSeeUndecidedData
* MetricsDuringTopicCreationDeletionTest#testMetricsDuringTopicCreateDelete
* ProduceRequestTest#testProduceWithInvalidTimestamp
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Introduces interfaces for defining task assignors. Task assignors are pure functions, mapping the state of the group and a topology to a target assignment. We include a mock assignor, which we will be able to use when testing / benchmarking without the complexities of the sticky task assignor and the high-availability task assignor. We may remove the mock assignor before the streams rebalance protocol goes GA.
The consumer groups introduce these interfaces to establish a clear separation between the group coordinator code and the pluggable assignors, which may live outside the group coordinator code. We have removed pluggable assignors in KIP-1071, but I think it still makes sense to keep these interfaces so that people have a clean interface to code against. This will pay off if we plan on making the task assignors pluggable later.
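The "pure function" shape of an assignor can be sketched as below. This is not the interface introduced by the patch: `TaskAssignor`, `MockAssignor`, and the reduction of the group state to a list of member IDs and of the topology to a task count are all simplifying assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TaskAssignorSketch {
    // Hypothetical interface: maps group state (member IDs) and a topology
    // (reduced here to a number of tasks) to a target assignment.
    interface TaskAssignor {
        Map<String, List<Integer>> assign(List<String> memberIds, int numTasks);
    }

    // Hypothetical mock assignor: deterministic round-robin over sorted
    // members, with no stickiness and no standby or warm-up tasks.
    static class MockAssignor implements TaskAssignor {
        @Override
        public Map<String, List<Integer>> assign(List<String> memberIds, int numTasks) {
            List<String> sorted = new ArrayList<>(memberIds);
            Collections.sort(sorted);
            Map<String, List<Integer>> assignment = new TreeMap<>();
            for (String member : sorted) {
                assignment.put(member, new ArrayList<>());
            }
            if (sorted.isEmpty()) {
                return assignment;
            }
            for (int task = 0; task < numTasks; task++) {
                assignment.get(sorted.get(task % sorted.size())).add(task);
            }
            return assignment;
        }
    }

    public static void main(String[] args) {
        TaskAssignor assignor = new MockAssignor();
        System.out.println(assignor.assign(List.of("member-b", "member-a"), 5));
    }
}
```

Because the result depends only on the inputs, the same call always yields the same assignment, which is what makes such an assignor convenient for tests and benchmarks.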
Reviewers: Bill Bejeck <bbejeck@gmail.com>, David Jacot <djacot@confluent.io>
In the documentation, shell commands are run from the Kafka base directory, so all commands should be written as bin/<tools>.sh.
The Kafka repo does not contain a controller_static.properties file, which could mislead users, so this change uses controller.properties instead.
Reviewers: Luke Chen <showuon@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
This is no longer required since we dropped support for Java 8. Also update `NOTICE*` and
`spotbugs-exclude.xml` files.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>
Because we have to wrap StoreFactory and StoreBuilder layers on top of each other for various parts of the topology building process, we need to make sure that both of them can be configured, and that each checks for an underlying layer and delegates to it if one exists.
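The check-and-delegate pattern can be sketched as follows. The interfaces and classes here (`StoreLayer`, `ConfigurableLayer`, `LeafLayer`, `WrappingLayer`) and the `cache.enabled` key are hypothetical and are not the Kafka Streams `StoreFactory` or `StoreBuilder` types; the sketch only shows a wrapper forwarding configuration to an inner layer when that layer supports it.

```java
import java.util.Map;

class LayeredConfigSketch {
    // Hypothetical base type for a layer in the wrapping chain.
    interface StoreLayer {
        String describe();
    }

    // Hypothetical marker for layers that accept configuration.
    interface ConfigurableLayer extends StoreLayer {
        void configure(Map<String, String> config);
    }

    // Innermost layer: keeps the configuration it receives.
    static class LeafLayer implements ConfigurableLayer {
        private Map<String, String> config = Map.of();

        @Override
        public void configure(Map<String, String> config) {
            this.config = Map.copyOf(config);
        }

        @Override
        public String describe() {
            return "leaf" + config;
        }

        Map<String, String> config() {
            return config;
        }
    }

    // Wrapper: configures itself, then checks for a configurable
    // underlying layer and delegates to it if one exists.
    static class WrappingLayer implements ConfigurableLayer {
        private final StoreLayer inner;
        private boolean configured = false;

        WrappingLayer(StoreLayer inner) {
            this.inner = inner;
        }

        @Override
        public void configure(Map<String, String> config) {
            configured = true;
            if (inner instanceof ConfigurableLayer) {
                ((ConfigurableLayer) inner).configure(config);
            }
        }

        @Override
        public String describe() {
            return "wrapper(" + inner.describe() + ")";
        }

        boolean isConfigured() {
            return configured;
        }
    }

    public static void main(String[] args) {
        LeafLayer leaf = new LeafLayer();
        WrappingLayer outer = new WrappingLayer(new WrappingLayer(leaf));
        outer.configure(Map.of("cache.enabled", "true"));
        System.out.println(outer.describe());
    }
}
```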
Reviewers: Guozhang Wang <guozhang.wang.us@gmail.com>
The PR implements the behaviour defined in KIP-1109. It corrects the consumer topic and topic-partition metrics, while deprecating the incorrect ones.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Jun Rao <junrao@gmail.com>, Andrew Schofield <aschofield@confluent.io>