The `AdminClient` adds a telemetry reporter to the metrics reporters
list in the constructor. The problem is that the reporter was already
added in the `createInternal` method. In the `createInternal` method
call, the `clientTelemetryReporter` is added to a
`List<MetricReporters>` which is passed to the `Metrics` object, will
get closed when `Metrics.close()` is called. But adding a reporter to
the reporters list in the constructor is not used by the `Metrics`
object and hence doesn't get closed, causing a memory leak.
All related tests pass after this change.
Reviewers: Apoorv Mittal <apoorvmittal10@apache.org>, Matthias J. Sax
<matthias@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>,
Jhen-Yung Hsu <jhenyunghsu@gmail.com>
This fixes librdkafka older than the recently released 2.11.0 with
Kerberos authentication and Apache Kafka 4.x.
Even though this is a bug in librdkafka, a key goal of KIP-896 is not to
break the popular client libraries listed in it. Adding back JoinGroup
v0 & v1 is a very small change and worth it from that perspective.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
- Metadata doesn't have the full view of topicNames to ids during
rebootstrap of client or when topic has been deleted/recreated. The
solution is to pass down topic id and stop trying to figure it out later
in the logic.
---------
Co-authored-by: Kirk True <kirk@kirktrue.pro>
Use Java to rewrite ProducerSendWhileDeletionTest by new test infra and
move it to client-integration-tests module.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
The OffsetFetch API does not support top level errors in version 1.
Hence, the top level error must be returned at the partition level.
Side note: It is a tad annoying that we create error response in
multiple places (e.g. KafkaApis, Group CoordinatorService). There were a
reason for this but I cannot remember.
Reviewers: Dongnuo Lyu <dlyu@confluent.io>, Sean Quah <squah@confluent.io>, Ken Huang <s7133700@gmail.com>, TengYao Chi <frankvicky@apache.org>
Fix to ensure assigned partitions whose topics are not in the consumer
explicit subscription are considered not fetchable (so that no records
are returned on poll for them)
This scenario could happen in the new async consumer (using the Consumer
rebalance protocol) when the subscription changes, because the consumer
will keep its assignment until the coordinator sends a new one (broker
drives assignments).
This does not happen in the classic consumer because the assignment
logic lives on the client-side, so the consumer pro-actively updates
assignment as needed.
This PR validates assignment vs subscription on fetch for explicit
subscription only. Regular expressions, shared subscription remain
unchanged (regex case still under discussion, will be handled separately
if needed)
Reviewers: Andrew Schofield <aschofield@confluent.io>, TengYao Chi
<frankvicky@apache.org>, Kirk True <ktrue@confluent.io>, Jhen-Yung Hsu
<jhenyunghsu@gmail.com>
Profiling has shown that using the Collections Streams API approach adds
unnecessary overhead compared to a traditional for loop. Minor revisions
to the code have been made to use simpler constructs to improve
performance.
Reviewers: Lianet Magrans <lmagrans@confluent.io>, Andrew Schofield
<aschofield@confluent.io>
These can be used to implement transformations on top of the RPC
definitions. Group IDs were already marked. This PR additionally adds
the entityType for all topic names.
Reviewers: Matthias J. Sax <matthias@confluent.io>
This change compares the remote replica's HWM with the leader's HWM and
completes the FETCH request if the remote HWM is less than the leader's
HWM. When the leader's HWM is updated any pending FETCH RPC is
completed.
Reviewers: Alyssa Huang <ahuang@confluent.io>, David Arthur
<mumrah@gmail.com>, Andrew Schofield <aschofield@confluent.io>
(cherry picked from commit 742b327025)
Clarify the interaction of `min.insync.replicas` and `ack=all`
configuration. Prior to this change, the doc for `min.insync.replicas`
could have been interpreted as being used to short-circuit in the
`acks=all` case as if it would be enough if `min.inscyn.replicas`
number of brokers replicated a message before it can be acknowledged
back to the producer.
Reviewers: Matthias J. Sax <matthias@confluent.io>
---------
Co-authored-by: Matthias J. Sax <mjsax@apache.org>
Adds a test dependency on
[mock-oauth2-server](https://github.com/navikt/mock-oauth2-server/) for
integration tests for OAuth layer. Also includes fixes for some
regressions that were caught by the integration tests.
Reviewers: Manikumar Reddy <manikumar@confluent.io>, Lianet Magrans
<lmagrans@confluent.io>
This PR uses topic IDs received in assignment (under new protocol) to
ensure that only these assigned topics are included in the consumer
metadata requests performed when the user subscribes to broker-side
regex (RE2J).
For handling the edge case of consumer needing metadata for topics IDs
(from RE2J) and topic names (from transient topics), the approach is to
send a request for the transient topics needed temporarily, and once
those resolved, the request for the topic IDs needed for RE2J will
follow. (this is because the broker doesn't accept requests for names
and IDs at the same time)
With the changes we also end up fixing another issue (KAFKA-18729) aimed
at avoiding iterating the full set of assigned partitions when checking
if a topic should be retained from the metadata response when using
RE2J.
Reviewers: David Jacot <djacot@confluent.io>
Log segment closure results in right sizing the segment on disk along
with the associated index files.
This is specially important for TimeIndexes where a failure to right
size may eventually cause log roll failures leading to under replication
and log cleaner failures.
This change uses `Utils.closeAll` which propagates exceptions, resulting
in an "unclean" shutdown. That would then cause the broker to attempt to
recover the log segment and the index on next startup, thereby avoiding
the failures described above.
Reviewers: Omnia Ibrahim <o.g.h.ibrahim@gmail.com>, Jun Rao
<junrao@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
We can use `pollUntilTrue` instead of `waitForCondition`, thus do a
little refactor to reduce the duplicate code
Reviewers: TengYao Chi <frankvicky@apache.org>, Lan Ding
<isDing_L@163.com>, TaiJuWu <tjwu1217@gmail.com>
Use Java to rewrite PlaintextConsumerSubscriptionTest by new test infra
and move it to client-integration-tests module.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Remove the event IDs from the ApplicationEvent and BackgroundEvent as it
serves no functional purpose other than uniquely identifying events in
the logs.
Reviewers: Andrew Schofield <aschofield@confluent.io>
While reading through the code, I found the method name to be somewhat
ambiguous and not fully descriptive of its purpose.
So I renamed the method to make its purpose clearer and more
self-explanatory. If there was another reason for the original naming,
I’d be happy to hear about it.
Reviewers: Lianet Magrans <lmagrans@confluent.io>
* Add `group.share.assignors` config to `GroupCoordinatorConfig`.
* Send `rackId` in share group heartbeat request if it's not null.
* Add integration test `testShareConsumerWithRackAwareAssignor`.
Reviewers: Lan Ding <53332773+DL1231@users.noreply.github.com>, Andrew
Schofield <aschofield@confluent.io>
---------
Signed-off-by: PoAn Yang <payang@apache.org>
The mapKey optimisation can be used in some KIP-932 RPC schemas to
improve efficiency of some key-based accesses.
* AlterShareGroupOffsetsResponse
* ShareFetchRequest
* ShareFetchResponse
* ShareAcknowledgeRequest
* ShareAcknowledgeResponse
Reviewers: Andrew Schofield <aschofield@confluent.io>
---------
Signed-off-by: PoAn Yang <payang@apache.org>
The PR do following:
1. rewrite to new test infra
2. rewrite to java
3. move to clients-integration-tests
Reviewers: Ken Huang <s7133700@gmail.com>, Kuan-Po Tseng
<brandboat@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
- Moving off deprecated methods
- Fixing argument order for assertEquals(...)
- Few other minor cleanups
Reviewers: PoAn Yang <payang@apache.org>, Lianet Magrans
<lmagrans@confluent.io>, Ken Huang <s7133700@gmail.com>
* Add `RackAwareAssignor`. It uses `racksForPartition` to check the rack
id of a partition and assign it to a member which has the same rack id.
* Add `ConsumerIntegrationTest#testRackAwareAssignment` to check
`racksForPartition` works correctly.
Reviewers: David Jacot <djacot@confluent.io>
---------
Signed-off-by: PoAn Yang <payang@apache.org>
Adding support for the `urn:ietf:params:oauth:grant-type:jwt-bearer`
grant type (AKA `jwt-bearer`). Includes further refactoring of the
existing OAuth layer and addition of generic JWT assertion layer that
can be leveraged in the future.
This constitutes the main piece of the JWT Bearer grant type support.
Forthcoming commits/PRs will include improvements for both the
`client_credentials` and `jwt-bearer` grant types in the following
areas:
* Integration test coverage (KAFKA-19153)
* Unit test coverage (KAFKA-19308)
* Top-level documentation (KAFKA-19152)
* Improvements to and documentation for `OAuthCompatibilityTool`
(KAFKA-19307)
Reviewers: Manikumar Reddy <manikumar@confluent.io>, Lianet Magrans
<lmagrans@confluent.io>
---------
Co-authored-by: Zachary Hamilton <77027819+zacharydhamilton@users.noreply.github.com>
Co-authored-by: Lianet Magrans <98415067+lianetm@users.noreply.github.com>
The `String.split` method never returns an array containing null
elements.
Reviewers: TengYao Chi <frankvicky@apache.org>, Ken Huang
<s7133700@gmail.com>, Lan Ding <isDing_L@163.com>
## Problem
When an `txnProducer.abortTransaction()`operation encounters a
`TRANSACTION_ABORTABLE` error, it currently tries to transition to
`ABORTABLE_ERROR` state. This can create an infinite retry loop since:
1. The abort operation fails with `TRANSACTION_ABORTABLE`
2. We transition to `ABORTABLE_ERROR` state
3. The application recieves instance of TransactionAbortableException
and it retries the abort
4. The cycle repeats
## Solution
For `txnProducer.abortTransaction()`API, convert
`TRANSACTION_ABORTABLE` errors to fatal errors (`KafkaException`) during
abort operations to ensure clean transaction termination. This prevents
retry loops by:
1. Treating abort failures as fatal errors at application layer
2. Ensuring the transaction can be cleanly terminated
3. Providing clear error messages to the application
## Changes
- Modified `EndTxnHandler.handleResponse()` to convert
`TRANSACTION_ABORTABLE` errors to `KafkaException` during abort
operations
- Set TransactionManager state to FATAL
- Updated test `testAbortableErrorIsConvertedToFatalErrorDuringAbort` to
verify this behavior
## Testing
- Added test case verifying that abort operations convert
`TRANSACTION_ABORTABLE` errors to `KafkaException`
- Verified that Commit API with TRANSACTION_ABORTABLE error should
set TM to Abortable state
- Verified that Abort API with TRANSACTION_ABORTABLE error should
convert to Fatal error i.e. KafkaException
## Impact
At application layer, this change improves transaction reliability by
preventing infinite retry loops during abort operations.
Reviewers: Justine Olshan <jolshan@confluent.io>
### Problem
- Currently, when a transactional producer encounters retriable errors
(like `COORDINATOR_LOAD_IN_PROGRESS`) and exhausts all retries, finally
returns retriable error to Application Layer.
- Application reties can cause duplicate records. As a fix we are
transitioning all retriable errors as Abortable Error in transaction
producer path.
- Additionally added InvalidTxnStateException as part of
https://issues.apache.org/jira/browse/KAFKA-19177
### Solution
- Modified the TransactionManager to automatically transition retriable
errors to abortable errors after all retries are exhausted. This ensures
that applications can abort transaction when they encounter
`TransactionAbortableException`
- `RefreshRetriableException` like `CoordinatorNotAvailableException`
will be refreshed internally
[[code](6c26595ce3/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java (L1702-L1705))]
till reties are expired, then it will be treated as retriable errors and
translated to `TransactionAbortableException`
- Similarly for InvalidTxnStateException
### Testing
Added test `testSenderShouldTransitionToAbortableAfterRetriesExhausted`
to verify in sender thread:
- Retriable errors are properly converted to abortable state after
retries
- Transaction state transitions correctly and subsequent operations fail
appropriately with TransactionAbortableException
Reviewers: Justine Olshan <jolshan@confluent.io>
Update catch to handle compression errors
Before :

After
```
Sent message: KR Message 376
[kafka-producer-network-thread | kr-kafka-producer] INFO
org.apache.kafka.common.telemetry.internals.ClientTelemetryReporter -
KR: Failed to compress telemetry payload for compression: zstd, sending
uncompressed data
Sent message: KR Message 377
```
Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>, Bill Bejeck <bbejeck@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
- Currently, read and write share state requests were allowed on
uninitialized share partitions (share partitions on which
initializeState has NOT been called). This should not be the case.
- This PR addresses the concern by adding error checks on read and
write. Other requests are allowed (initialize, readSummary, alter).
- Refactored `ShareCoordinatorShardTest` to reduce redundancy and added
some new tests.
- Some request/response classes have also been reformatted.
Reviewers: Andrew Schofield <aschofield@confluent.io>
The flakiness occurs when the offsets topic does not yet exist. Hence,
the issue is mitigated by creating the offsets topic in `setup()`. This
serves as a workaround. The root cause is tracked in
[KAFKA-19357](https://issues.apache.org/jira/browse/KAFKA-19357).
I ran the test 100 times on my Mac and all of them passed.
Reviewers: Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai
<chia7712@gmail.com>
Removed the unused FetchResponse#of that is not used in production. The
test cases that originally invoked this method have been updated to call
the other
[FetchResponse#of](6af849f864/clients/src/main/java/org/apache/kafka/common/requests/FetchResponse.java (L232)),
which is currently used by ```KafkaApis```, to maintain the integrity of
the tests.
Reviewers: Jun Rao <junrao@gmail.com>, PoAn Yang <payang@apache.org>,
Chia-Ping Tsai <chia7712@gmail.com>
Adds missing documentation to the `partitionsToOffsetAndMetadata`
methods in both `ListStreamsGroupOffsetsResult` and
`ListShareGroupOffsetsResult` classes to clarify the behavior when a
group does not have a committed offset for a specific partition.
As document in ListConsumerGroupOffsetsResult: > If the group doesn’t
have a committed offset for a specific partition, the corresponding
value in the returned map will be null.
This important detail was previously missing in the JavaDoc of the
stream and share group variants.
Reviewers: Nick Guo <lansg0504@gmail.com>, Chia-Ping Tsai
<chia7712@gmail.com>
#15613 ensures that all `commitAsync` callbacks are triggered before
`commitSync` completes for `AsyncKafkaConsumer`. However, the related
changes to `ClassicKafkaConsumer`, #15693, were not merged. I assume
this might be because we intend to gradually move toward using AsyncConsumer
instead.
In short, this behavioral difference should be documented.
Reviewers: Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai
<chia7712@gmail.com>
Use Java to rewrite BaseConsumerTest, SaslPlainPlaintextConsumerTest by
new test infra and move it to client-integration-tests module, the
BaseConsumerTest test is still used, thus we should not remove now.
Reviewers: PoAn Yang <payang@apache.org>, TengYao Chi
<frankvicky@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>
This PR rewrites `ConsumerWithLegacyMessageFormatIntegrationTest.scala`
in Java and moves it to the `clients-integration-tests module`.
Reviewers: PoAn Yang <payang@apache.org>, Ken Huang
<s7133700@gmail.com>, TengYao Chi <kitingiao@gmail.com>, Chia-Ping Tsai
<chia7712@gmail.com>
* Return resource doesn't exist message when users try to describe a
non-existent resource in kafka-configs.sh and kafka-client-metrics.sh.
* For groups type, the command checks both existent groups and
non-existent groups but having dynamic config. If it cannot find a group
in both conditions, return resource doesn't exist message.
Reviewers: Lan Ding <53332773+DL1231@users.noreply.github.com>, Andrew
Schofield <aschofield@confluent.io>
---------
Signed-off-by: PoAn Yang <payang@apache.org>
public void completeTransaction(PreparedTxnState preparedTxnState)
The method compares the currently prepared transaction state and the
state passed in the argument.
1. Commit if the state matches
2. Abort the transaction otherwise.
If the producer is not in a prepared state (i.e., neither
prepareTransaction was called nor initTransaction(true) was called), we
return an INVALID_TXN_STATE error.
Reviewers: Justine Olshan <jolshan@confluent.io>, Artem Livshits
<alivshits@confluent.io>
There is a sequence of interactions with the membership managers of
KIP-848, KIP-932, KIP-1071 that can put the membership manager into
JOINING state, but where member epoch is set to -1. This can result in
an invalid request being sent, since joining heartbeats should not have
member epoch -1. This may lead to the member failing to join. In the
case of streams, the group coordinator will return INVALID_REQUEST.
This is the sequence triggering the bug, which seems to relatively
likely, caused by two heartbeat responses being received after the next
one has been sent.
```
membershipManager.leaveGroup();
-> transitions to LEAVING
membershipManager.onHeartbeatRequestGenerated();
-> transitions to UNSUBSCRIBED
membershipManager.onHeartbeatSuccess(... with member epoch > 0);
-> unblocks the consumer
membershipManager.onSubscriptionUpdated();
membershipManager.onConsumerPoll();
-> transitions to JOINING
membershipManager.onHeartbeatSuccess(... with member epoch < 0);
-> updates the epoch to a negative value
```
Now we are in state `JOINING` with `memberEpoch=-1`, and the next
heartbeat we send will be malformed, triggering `INVALID_REQUEST`
The bug may also be triggered if the `unsubscribe` times out, but this
seems more of a corner case.
To prevent the bug, we are taking two measures: The likely path to
triggering the bug can be prevented by not unblocking an `unsubscribe`
call in the consumer when a non-leave-heartbeat epoch is received. Once
we have sent out leave group heartbeat, we will ignore all heartbeats,
except for those containing `memberEpoch < 0`.
For extra measure, we also prevent the second case (`unsubscribe` timing
out). In this case, the consumer gets unblocked before we have received
the leave group heartbeat response, and may resubscribe to the group. In
this case, we shall just ignore the heartbeat response that contains a
member epoch < 0, once it arrives, and we have already left the
`UNSUBSCRIBED` state.
Reviewers: Lianet Magrans <lmagrans@confluent.io>, Andrew Schofield
<aschofield@confluent.io>, Shivsundar R <shr@confluent.io>
This PR attempts to removed the flakiness in
`testAlterReadCommittedToReadUncommittedIsolationLevelWithReleaseAck`
and `testAlterReadCommittedToReadUncommittedIsolationLevelWithRejectAck`.
This flakiness could potentially be caused because we were not ensuring
that the aborted transaction record produce happened. In this PR, I have
added a blocking call to make sure the produce future completes before
we abort the transaction.
Reviewers: Andrew Schofield <aschofield@confluent.io>
While rewriting `EndToEndClusterIdTest` in Java (#19741 ), I found that
the test uses `MockInterceptor` and `MockSerializer` together. However,
`MockSerializer` was using a `byte[]` serializer, while
`MockInterceptor` expected a `String` serializer, leading to a
`ClassCastException`.
I chose to update `MockSerializer` to use `String`, as it is used less
frequently than the `MockInterceptor`. Using `String` also simplifies
the code by avoiding the need to write expressions like
`"value".getBytes`.
Reviewers: Ken Huang <s7133700@gmail.com>, TengYao Chi
<frankvicky@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>
I verified the behavior by rewriting the
`GroupAuthorizerIntegrationTest` in Java in this PR:
https://github.com/apache/kafka/pull/19685 The state is now correct.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
[KAFKA-16720](https://issues.apache.org/jira/browse/KAFKA-16720) aims to
finish the AlterShareGroupOffsets RPC.
Reviewers: Andrew Schofield <aschofield@confluent.io>
---------
Co-authored-by: jimmy <wangzhiwang@qq.com>