Looking at the [history](https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.timeZoneId=Europe%2FZurich&tests.container=kafka.api.PlaintextAdminIntegrationTest&tests.test=testConsumerGroups(String%2C%20String)%5B2%5D), I found that one source of flakiness is commitSync failing with CommitFailedException. We can ignore the exception and retry on the next iteration.
```
[2025-01-07 10:17:00,783] ERROR [Consumer instanceId=test_instance_id_1, clientId=test_client_id, groupId=test_group_id] OffsetCommit failed for member VfImExrxT3-w_HNJcTkqnw with stale member epoch error. Last epoch sent: 2 (org.apache.kafka.clients.consumer.internals.CommitRequestManager:773)
Exception in thread "Thread-6" org.apache.kafka.clients.consumer.CommitFailedException: OffsetCommit failed with stale member epoch.The member epoch is stale. The member must retry after receiving its updated member epoch via the ConsumerGroupHeartbeat API.
at org.apache.kafka.clients.consumer.internals.CommitRequestManager.commitSyncExceptionForError(CommitRequestManager.java:481)
at org.apache.kafka.clients.consumer.internals.CommitRequestManager.lambda$commitSyncWithRetries$7(CommitRequestManager.java:472)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
at org.apache.kafka.clients.consumer.internals.CommitRequestManager$OffsetCommitRequestState.onResponse(CommitRequestManager.java:776)
at org.apache.kafka.clients.consumer.internals.CommitRequestManager$RetriableRequestState.handleClientResponse(CommitRequestManager.java:893)
at org.apache.kafka.clients.consumer.internals.CommitRequestManager$RetriableRequestState.lambda$buildRequestWithResponseHandling$0(CommitRequestManager.java:883)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
at org.apache.kafka.clients.consumer.internals.NetworkClientDelegate$FutureCompletionHandler.onComplete(NetworkClientDelegate.java:433)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:669)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:661)
at org.apache.kafka.clients.consumer.internals.NetworkClientDelegate.poll(NetworkClientDelegate.java:153)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.runOnce(ConsumerNetworkThread.java:160)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.run(ConsumerNetworkThread.java:106)
```
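The "ignore and retry on the next iteration" approach can be sketched roughly as below. This is a self-contained illustration, not the actual test code: `StaleEpochException` is a hypothetical stand-in for `CommitFailedException`, and `CommitRetry.commitWithRetry` is an assumed helper name.

```java
// Hypothetical stand-in for org.apache.kafka.clients.consumer.CommitFailedException.
class StaleEpochException extends RuntimeException {
    StaleEpochException(String message) {
        super(message);
    }
}

final class CommitRetry {
    /**
     * Runs the commit action up to maxIterations times, ignoring
     * StaleEpochException so that the next iteration can retry.
     * Returns the number of attempts used, or -1 if no attempt succeeded.
     */
    static int commitWithRetry(Runnable commit, int maxIterations) {
        for (int attempt = 1; attempt <= maxIterations; attempt++) {
            try {
                commit.run();
                return attempt;
            } catch (StaleEpochException e) {
                // The member epoch was stale; fall through and retry.
            }
        }
        return -1;
    }
}
```

Any other exception type still propagates, so unrelated failures are not hidden by the retry.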
Reviewers: Lianet Magrans <lmagrans@confluent.io>
Remove RaftManager.maybeDeleteMetadataLogDir since it was only used during ZK migration, and that code has been removed.
Similarly, remove RaftManagerTest.testKRaftBrokerDoesNotDeleteMetadataLog which tested that function.
Remove AutoTopicCreationManagerTest since it tests the ZK-mode-only AutoTopicCreationManager.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
Following https://github.com/apache/kafka/pull/18261, this patch updates the Share Coordinator to use the new record format.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Andrew Schofield <aschofield@confluent.io>
There are times when the controller has a high event processing time, such as during startup, or when creating a topic with many partitions. We can see these processing times in the p99 metric (kafka.controller:type=ControllerEventManager,name=EventQueueProcessingTimeMs); however, it's difficult to see exactly which event is causing the high processing time.
With DEBUG logs, we see every event along with its processing time. Even with this, it's a bit tedious to find the event with a high processing time.
This PR logs all events which take longer than 2 seconds at ERROR level. This will help identify events that are taking far too long, and which could be disruptive to the operation of the controller. The slow event logging looks like this:
```
[2024-12-20 15:03:39,754] ERROR [QuorumController id=1] Exceptionally slow controller event createTopics took 5240 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
```
Also, every 60 seconds, it logs some event time statistics, including average time, maximum time, and the name of the event which took the longest. This periodic message looks like this:
```
[2024-12-20 15:35:04,798] INFO [QuorumController id=1] In the last 60000 ms period, 333 events were completed, which took an average of 12.34 ms each. The slowest event was handleCommit[baseOffset=0], which took 41.90 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
```
An operator can disable these logs by adding the following to their log4j config:
```
org.apache.kafka.controller.EventPerformanceMonitor=OFF
```
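The two kinds of log message described above can be sketched as follows. This is a simplified, self-contained illustration and not the real `EventPerformanceMonitor`: the class name `SlowEventTracker`, its methods, and the wording of the empty-period message are assumptions, and it returns the messages instead of writing them to a logger.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch; not thread-safe, intended for a single event-processing thread.
final class SlowEventTracker {
    private final long slowThresholdNs;
    private int count;
    private long totalNs;
    private long maxNs;
    private String slowestName;

    SlowEventTracker(long slowThresholdMs) {
        this.slowThresholdNs = slowThresholdMs * 1_000_000L;
    }

    /** Records one event; returns a message if it exceeded the slow threshold. */
    Optional<String> observe(String name, long durationNs) {
        count++;
        totalNs += durationNs;
        if (slowestName == null || durationNs > maxNs) {
            maxNs = durationNs;
            slowestName = name;
        }
        if (durationNs > slowThresholdNs) {
            return Optional.of(String.format(Locale.ROOT,
                "Exceptionally slow controller event %s took %d ms.",
                name, durationNs / 1_000_000L));
        }
        return Optional.empty();
    }

    /** Builds the periodic summary and resets the statistics for the next period. */
    String summarizeAndReset(long periodMs) {
        String summary;
        if (count == 0) {
            summary = String.format(Locale.ROOT,
                "In the last %d ms period, no events were completed.", periodMs);
        } else {
            summary = String.format(Locale.ROOT,
                "In the last %d ms period, %d events were completed, which took an "
                    + "average of %.2f ms each. The slowest event was %s, which took %.2f ms.",
                periodMs, count, totalNs / 1e6 / count, slowestName, maxNs / 1e6);
        }
        count = 0;
        totalNs = 0;
        maxNs = 0;
        slowestName = null;
        return summary;
    }
}
```

In this sketch the caller would log the first message at ERROR level and the second at INFO level every 60 seconds.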
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Remove the flaky annotation from the following tests:
* RemoteLogManagerTest#testFetchOffsetByTimestampWithTieredStorageDoesNotFetchIndexWhenExistsLocally
* All the children of BaseConsumerTest#testCoordinatorFailover
* TransactionsTest#testFailureToFenceEpoch
* TransactionsTest#testReadCommittedConsumerShouldNotSeeUndecidedData
* MetricsDuringTopicCreationDeletionTest#testMetricsDuringTopicCreateDelete
* ProduceRequestTest#testProduceWithInvalidTimestamp
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
A minor refactoring just before merging #18295 introduced a regression and no test failed. Throw the correct exception and add a test to verify it. Also refactor the code slightly to make that possible.
Thanks to Chia-Ping for catching the issue.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Clients that support SASL but don't implement KIP-43 (e.g. Kafka producer/consumer 0.9.0.x) will
fail to connect after this change.
Added unit tests and also manually tested with the console producer 0.9.0.
While testing, I noticed that the logged message when a 0.9.0 Java client is used without SASL is
slightly misleading - fixed that too.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
This makes it possible to enable request logs for deprecated protocol api versions without enabling it for the rest. Combined with the ability to enable/disable dynamically, it makes it a bit easier to collect the information about deprecated clients that is not available via metrics.
This isn't particularly useful in trunk/4.0 since there are no deprecated api versions in these versions, but it will be useful for older branches. I intend to backport to those branches and add a release note in the backport regarding the change in behavior.
I manually verified that:
1. If the request logger is configured at `INFO` level, only deprecated protocol api versions are logged and they are logged at `INFO` level.
2. If the request logger is configured at `DEBUG` level, all requests are logged but the log level is `INFO` for deprecated protocol api versions and `DEBUG` for the rest.
3. If the request logger is configured at `WARN` level (the default), no requests are logged.
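The three cases above follow from a single rule: deprecated api versions are logged at `INFO`, everything else at `DEBUG`, and a message is emitted only if the logger's configured level allows it. A minimal sketch of that rule, using hypothetical names (`RequestLogPolicy`, `levelFor`) rather than the broker's actual code:

```java
import java.util.Optional;

final class RequestLogPolicy {
    // Ordered from least to most severe, so compareTo reflects severity.
    enum Level { DEBUG, INFO, WARN }

    /**
     * Returns the level at which a request is logged, or empty if it is not
     * logged at all under the configured request logger level.
     */
    static Optional<Level> levelFor(Level configured, boolean deprecatedApiVersion) {
        Level messageLevel = deprecatedApiVersion ? Level.INFO : Level.DEBUG;
        return messageLevel.compareTo(configured) >= 0
            ? Optional.of(messageLevel)
            : Optional.empty();
    }
}
```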
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Added transaction version 2 to some of the system tests. Also marking TV2 as production ready.
Also fixes the defaultVersion test.
Reviewers: Jun Rao <jun@confluent.io>
Included in this change:
1. Remove deprecated protocol api versions from json files.
2. Remove fields that are no longer used from json files (affects ListOffsets, OffsetCommit, DescribeConfigs).
3. Remove record down-conversion support from KafkaApis.
4. No longer return `Errors.UNSUPPORTED_COMPRESSION_TYPE` on the fetch path[1].
5. Deprecate `TopicConfig.MESSAGE_DOWNCONVERSION_ENABLE_CONFIG` and make the relevant
configs (`message.downconversion.enable` and `log.message.downconversion.enable`) no-ops since
down-conversion is no longer supported. It was an oversight not to deprecate this via KIP-724.
6. Fix `shouldRetainsBufferReference` to handle null request schemas for a given version.
7. Simplify producer logic since it only supports the v2 record format now.
8. Fix tests so they don't exercise protocol api versions that have been removed.
9. Add upgrade note.
Testing:
1. System tests have a lot of failures, but those tests fail for trunk too and I didn't see any issues specific to this change - it's hard to be sure given the number of failing tests, but let's not block on that given the other testing that has been done (see below).
2. Java producers and consumers with version 0.9-0.10.1 don't have api versions support and hence they fail in an ungraceful manner: the broker disconnects and the clients reconnect until the relevant timeout is triggered.
3. The same thing seems to happen for the console producer 0.10.2, although it's unclear why since api versions should be supported. I will look into this separately; it's unlikely to be related to this PR.
4. Console consumer 0.10.2 fails with the expected error and a reasonable message[2].
5. Console producer and consumer 0.11.0 work fine; newer versions should naturally also work fine.
6. kcat 1.5.0 (based on librdkafka 1.1.0) produce and consume fail with a reasonable message[3][4].
7. kcat 1.6.0-1.7.0 (based on librdkafka 1.5.0 and 1.7.0 respectively) consume fails with a reasonable message[5].
8. kcat 1.6.0-1.7.0 produce works fine.
9. kcat 1.7.1 (based on librdkafka 1.8.2) works fine for consume and produce.
10. confluent-go-client (librdkafka based) 1.8.2 works fine for consume and produce.
11. I will test more clients, but I don't think we need to block the PR on that.
Note that this also completes part of KIP-724: produce v2 and lower as well as fetch v3 and lower are no longer supported.
Future PRs will remove conditional code that is no longer needed (some of that has been done in KafkaApis,
but only what was required due to the schema changes). We can probably do that in master only as it does
not change behavior.
Note that I did not touch `ignorable` fields even though some of them could have been
changed. The reasoning is that this could result in incompatible changes for clients
that use new protocol versions without setting such fields _if_ we don't manually
validate their presence. I will file a JIRA ticket to look into this carefully for each
case (i.e. if we do validate their presence for the appropriate versions, we can
set them to ignorable=false in the json file).
[1] We would return this error if a fetch < v10 was used and the compression topic config was set
to zstd, but we would not do the same for the case where zstd was compressed at the producer
level (the most common case). Since there is no efficient way to do the check for the common
case, I made it consistent for both by having no checks.
[2] `org.apache.kafka.common.errors.UnsupportedVersionException: The broker is too new to support JOIN_GROUP version 1`
[3] `METADATA|rdkafka#producer-1| [thrd:main]: localhost:9092/bootstrap: Metadata request failed: connected: Local: Required feature not supported by broker (0ms): Permanent`
[4] `METADATA|rdkafka#consumer-1| [thrd:main]: localhost:9092/bootstrap: Metadata request failed: connected: Local: Required feature not supported by broker (0ms): Permanent`
[5] `ERROR: Topic test-topic [0] error: Failed to query logical offset END: Local: Required feature not supported by broker`
Reviewers: David Arthur <mumrah@gmail.com>
Librdkafka totally breaks if produce v3 is removed - it starts sending records with record format v0.
These api versions have to be undeprecated - KIP-896 has been updated.
Reviewers: David Arthur <mumrah@gmail.com>
This is just a mechanical change to make prepareTransitionTo method use named parameters instead of positional parameters.
Reviewers: Justine Olshan <jolshan@confluent.io>, Ritika Reddy <rreddy@confluent.io>
This patch is the first one in a series to improve how coordinator records are managed. It focuses on making coordinator records first-class citizens in the generator.
* Introduce `coordinator-key` and `coordinator-value` in the schema.
* Introduce `apiKey` for those. This is done to avoid relying on the version to determine the type.
* It also allows the generator to enforce some rules: the key cannot use flexible versions, the key must have a single version `0`, there must be a key and a value for a given api key, etc.
* It generates an enum with all the coordinator record types. This is pretty handy in the code.
The patch also updates the group coordinators to use those.
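The rules listed above (keys without flexible versions, keys with the single version `0`, and a key plus a value for every api key) could be checked along these lines. This is only a sketch: `CoordinatorSpec` and `CoordinatorSpecRules` are hypothetical names and a heavily simplified model, not the generator's real classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified view of one schema file.
final class CoordinatorSpec {
    final String name;
    final String type;             // "coordinator-key" or "coordinator-value"
    final int apiKey;
    final String validVersions;    // e.g. "0" or "0-1"
    final String flexibleVersions; // "none" or e.g. "0+"

    CoordinatorSpec(String name, String type, int apiKey,
                    String validVersions, String flexibleVersions) {
        this.name = name;
        this.type = type;
        this.apiKey = apiKey;
        this.validVersions = validVersions;
        this.flexibleVersions = flexibleVersions;
    }
}

final class CoordinatorSpecRules {
    /** Returns one message per violated rule; an empty list means the specs are valid. */
    static List<String> validate(List<CoordinatorSpec> specs) {
        List<String> errors = new ArrayList<>();
        Set<Integer> keys = new TreeSet<>();
        Set<Integer> values = new TreeSet<>();
        for (CoordinatorSpec spec : specs) {
            if (spec.type.equals("coordinator-key")) {
                keys.add(spec.apiKey);
                if (!spec.validVersions.equals("0")) {
                    errors.add(spec.name + ": key must have the single version 0");
                }
                if (!spec.flexibleVersions.equals("none")) {
                    errors.add(spec.name + ": key cannot use flexible versions");
                }
            } else if (spec.type.equals("coordinator-value")) {
                values.add(spec.apiKey);
            }
        }
        // Every api key must have both a key and a value definition.
        Set<Integer> all = new TreeSet<>(keys);
        all.addAll(values);
        for (int apiKey : all) {
            if (!keys.contains(apiKey)) errors.add("apiKey " + apiKey + ": missing key");
            if (!values.contains(apiKey)) errors.add("apiKey " + apiKey + ": missing value");
        }
        return errors;
    }
}
```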
Reviewers: Jeff Kim <jeff.kim@confluent.io>, Andrew Schofield <aschofield@confluent.io>
A lot of these tests assumed that the commit/abort happened immediately. Spoiler alert -- it does not.
For some I ensure that the first send of the next transaction is successful before grabbing the epoch. I also loosened some checks since we don't need to guarantee the exact epoch.
I went back and forth with completely deleting testBumpTransactionalEpochWithTV2Enabled since we don't have client side epoch bumps with V2 (which is what the test was originally testing), but I opted to keep it to just confirm the epoch on each transaction -- even in the timeout scenario.
Reviewers: Calvin Liu <caliu@confluent.io>, Artem Livshits <alivshits@confluent.io>, Jeff Kim <jeff.kim@confluent.io>, David Mao <dmao@confluent.io>
When inter.broker.listener.name is explicitly set, validate that it is not in the set of controller.listener.names.
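A minimal sketch of such a check, assuming a `null` name means the inter-broker listener was not explicitly set. The class and method names are hypothetical, and the real validation lives in the broker's config code and may report the error differently.

```java
import java.util.Set;

final class ListenerValidation {
    /**
     * Throws if an explicitly configured inter-broker listener name is also
     * one of the controller listener names. A null name is treated as "not set".
     */
    static void validate(String interBrokerListenerName, Set<String> controllerListenerNames) {
        if (interBrokerListenerName != null
                && controllerListenerNames.contains(interBrokerListenerName)) {
            throw new IllegalArgumentException(
                "The inter-broker listener " + interBrokerListenerName
                    + " must not be one of the controller listener names "
                    + controllerListenerNames);
        }
    }
}
```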
Reviewers: Colin P. McCabe <cmccabe@apache.org>, David Arthur <mumrah@gmail.com>
Adds a new RPC StreamsGroupDescribe that returns, given the group ID, all metadata related to the streams group, such as
- The topology metadata of the group.
- The topology epoch of the group.
- The latest member metadata that each member provided through the StreamsGroupHeartbeat API.
- The current target assignment generated by the assignor.
This just adds the JSON as defined in KIP-1071, together with some plumbing.
Reviewers: Bill Bejeck <bbejeck@gmail.com>
The StreamsGroupHeartbeat API is the new core API used by Streams applications to form a group. The API allows members to initialize a topology and to advertise their state and their owned tasks. The group coordinator uses it to assign/revoke tasks to/from members. This API is also used as a liveness check.
This change adds the JSON definition of the RPC, as defined in KIP-1071.
Reviewers: Bruno Cadonna <cadonna@apache.org>
The issue has been fixed by https://issues.apache.org/jira/browse/KAFKA-18188. We can re-enable the test with the CONSUMER protocol.
Reviewers: Lianet Magrans <lmagrans@confluent.io>, Andrew Schofield <aschofield@confluent.io>
When a static member rejoins the group, the group state is rewritten to the partition in order to persist the change. If the write fails, the change is reverted. However, this is done without acquiring the group lock.
This is only true in the old group coordinator. The new one does not suffer from this issue.
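The general shape of the problem, reverting in-memory state from a write callback while holding the group lock, can be sketched as below. This is a hypothetical Java illustration only: `StaticMemberGroup` and its methods are invented for the example and do not mirror the old coordinator's actual code.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of a group holding a single static member's ID.
final class StaticMemberGroup {
    private final ReentrantLock lock = new ReentrantLock();
    private String memberId;

    StaticMemberGroup(String memberId) {
        this.memberId = memberId;
    }

    /** Applies the new member ID under the lock and returns the previous one. */
    String replaceMember(String newMemberId) {
        lock.lock();
        try {
            String previous = memberId;
            memberId = newMemberId;
            return previous;
        } finally {
            lock.unlock();
        }
    }

    /** Write callback: on failure, revert the change while holding the lock. */
    void onWriteComplete(boolean succeeded, String previousMemberId) {
        if (succeeded) {
            return;
        }
        lock.lock();
        try {
            memberId = previousMemberId;
        } finally {
            lock.unlock();
        }
    }

    String memberId() {
        lock.lock();
        try {
            return memberId;
        } finally {
            lock.unlock();
        }
    }
}
```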
Reviewers: Jeff Kim <jeff.kim@confluent.io>