Create 3 new metadata versions:
- 3.8-IV0, for the upcoming 3.8 release.
- 3.9-IV0, to add support for KIP-1005.
- 3.9-IV1, as the new release vehicle for KIP-966.
Create ListOffsetRequest v9, which will be used in 3.9-IV0 to support KIP-1005. v9 is currently an unstable API version.
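As a rough sketch of how such a gate typically looks (the helper below is illustrative, not the actual broker code; the enum name follows MetadataVersion's naming convention):

```java
import org.apache.kafka.server.common.MetadataVersion;

// Hypothetical helper: only offer ListOffsets v9 once the cluster's
// metadata.version has reached 3.9-IV0.
static short latestListOffsetsVersion(MetadataVersion metadataVersion) {
    return metadataVersion.isAtLeast(MetadataVersion.IBP_3_9_IV0) ? (short) 9 : (short) 8;
}
```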
Reviewers: Jun Rao <junrao@gmail.com>, Justine Olshan <jolshan@confluent.io>
Implement request handling for the updated versions of the KRaft RPCs (Fetch, FetchSnapshot, Vote,
BeginQuorumEpoch and EndQuorumEpoch). This doesn't add support for KRaft replicas to send the new
version of the KRaft RPCs. That will be implemented in KAFKA-16529.
All of the RPC responses were extended to include the leader's endpoint for the listener of the
channel used in the request. EpochState was extended to include the leader's endpoint information,
but only FollowerState and LeaderState know the leader id and its endpoint(s).
For the Fetch request, the replica directory id was added. The leader now tracks the follower's log
end offset using both the replica id and replica directory id.
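A minimal sketch of that tracking, with simplified types (not the actual raft code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// A follower is identified by both its replica id and its directory id, so a
// replica that comes back with a fresh log directory is tracked as a distinct
// fetcher rather than inheriting the old directory's offset.
record ReplicaKey(int replicaId, UUID directoryId) {}

final class FollowerFetchTracking {
    private final Map<ReplicaKey, Long> logEndOffsets = new HashMap<>();

    void onFetch(int replicaId, UUID directoryId, long fetchOffset) {
        logEndOffsets.put(new ReplicaKey(replicaId, directoryId), fetchOffset);
    }
}
```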
For the FetchSnapshot request, the replica directory id was added. It is not used by the KRaft
leader; it is included for consistency with Fetch and to help with debugging.
For the Vote request, replica keys for both the voter (destination) and the candidate (source)
were added. The voter key is checked for consistency. The candidate key is persisted when the vote
is granted.
For the BeginQuorumEpoch request, all of the leader's endpoints are included. This is needed so
that the voters can return the leader's endpoint for all of the supported listeners.
For the EndQuorumEpoch request, all of the leader's endpoints are included. This is needed so that
the voters can return the leader's endpoint for all of the supported listeners. The successor list
has been extended to include the directory id, so receiving voters can use the entire replica key
when searching for their position in the successor list.
Updated the existing tests in KafkaRaftClientTest and KafkaRaftClientSnapshotTest to execute using
both the old and new versions of the RPCs.
Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
KIP-932 introduces FindCoordinator v6 for finding share coordinators. The initial implementation:
- Checks that share coordinators are only requested with v6 or above.
- Authorizes share coordinator requests as cluster actions (they are for inter-broker use only).
- Responds with COORDINATOR_NOT_AVAILABLE because share coordinators are not yet available.
When the share coordinator code is delivered, request handling will be gated by the configurations which enable share groups and the share coordinator specifically. If these are not enabled, COORDINATOR_NOT_AVAILABLE is the response.
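A minimal sketch of these checks over simplified types (the handler shape is an assumption; the real broker code differs):

```java
enum LookupError { INVALID_REQUEST, CLUSTER_AUTHORIZATION_FAILED, COORDINATOR_NOT_AVAILABLE }

static LookupError checkShareCoordinatorLookup(short requestVersion, boolean isClusterAction) {
    if (requestVersion < 6)
        return LookupError.INVALID_REQUEST;               // share coordinators need v6+
    if (!isClusterAction)
        return LookupError.CLUSTER_AUTHORIZATION_FAILED;  // inter-broker use only
    return LookupError.COORDINATOR_NOT_AVAILABLE;         // coordinator code not yet delivered
}
```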
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Add the cause of TimeoutException for Producer send() errors.
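For example, a send callback can now surface the underlying cause (this snippet assumes a configured producer; the topic and logging are placeholders):

```java
producer.send(new ProducerRecord<>("my-topic", "value"), (metadata, exception) -> {
    if (exception instanceof org.apache.kafka.common.errors.TimeoutException) {
        // Previously getCause() was null here; it now carries the underlying reason.
        System.err.println("send timed out: " + exception.getCause());
    }
});
```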
Reviewers: Matthias J. Sax <matthias@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Artem Livshits <alivshits@confluent.io>, Justine Olshan <jolshan@confluent.io>
As part of KIP-584, brokers expose a range of supported versions for each feature. For example,
metadata.version might be supported from 1 to 21. (Note that feature level ranges are always
inclusive, so this would include both level 1 and 21.)
These supported ranges are supposed to be able to include 0. For example, it should be possible for
a broker to support a kraft.version between 0 and 1. However, in older software versions, there is
an assertion in org.apache.kafka.common.feature.SupportedVersionRange that prevents this. This
causes problems when the older software attempts to deserialize an ApiVersionsResponse containing
such a range.
To resolve this problem, this PR bumps the version of ApiVersionsRequest from 3 to 4.
Clients which send v4 promise to be able to handle ranges including 0. Clients which send v3 will
not be exposed to such ranges -- the feature will simply be omitted from the response. This work
is part of KIP-1022.
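A minimal sketch of the filtering, over simplified types (the real response types differ):

```java
import java.util.HashMap;
import java.util.Map;

// For v3 and older ApiVersionsRequests, omit any feature whose supported range
// includes level 0; v4+ clients receive the full map. range[0] is the minimum
// level and range[1] the maximum, both inclusive.
static Map<String, short[]> featuresForResponse(Map<String, short[]> supported, short requestVersion) {
    if (requestVersion >= 4) return supported;
    Map<String, short[]> filtered = new HashMap<>();
    supported.forEach((feature, range) -> {
        if (range[0] > 0) filtered.put(feature, range);
    });
    return filtered;
}
```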
This PR fixes consumer close to avoid updating the subscription state object in the app thread. Close now simply triggers an UnsubscribeEvent that is handled in the background to trigger callbacks, clear the assignment, and send a leave heartbeat. Note that after triggering the event, the unsubscribe continuously processes background events until the event completes, to ensure that callbacks can run in the app thread.
The logic around what happens if the unsubscribe fails remains unchanged: close will log, keep the first exception, and carry on.
It also removes the redundant LeaveOnClose event (it did exactly the same thing as the UnsubscribeEvent; both called membershipMgr.leaveGroup).
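A minimal sketch of that drain loop, assuming a CompletableFuture-backed event and a queue of background events (names are illustrative, not the consumer's internals):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Keep draining background events until the unsubscribe completes, so that
// callbacks queued by the background thread get to run in the app thread.
static void awaitUnsubscribe(CompletableFuture<Void> unsubscribeDone,
                             BlockingQueue<Runnable> backgroundEvents) throws InterruptedException {
    while (!unsubscribeDone.isDone()) {
        Runnable event = backgroundEvents.poll(10, TimeUnit.MILLISECONDS);
        if (event != null) event.run();
    }
}
```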
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
In `GroupMetadataManager#toTopicPartitions`, we generate a list of `ConsumerGroupHeartbeatRequestData.TopicPartitions` from the deserialized input subscription. Currently the input subscription is `ConsumerPartitionAssignor.Subscription`, where the topic partitions are stored as (topic, partition) pairs, whereas in `ConsumerGroupHeartbeatRequestData.TopicPartitions` we need the topic partitions stored as (topic, partition list) pairs.
`ConsumerProtocolSubscription` is an intermediate data structure in the deserialization where the topic partitions are already stored as (topic, partition list) pairs. This PR uses `ConsumerProtocolSubscription` as the input subscription instead, making `toTopicPartitions` more efficient.
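The gain, sketched with plain collections (the real code uses `ConsumerProtocolSubscription` and `ConsumerGroupHeartbeatRequestData.TopicPartitions`):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Before: owned partitions arrive as flat (topic, partition) pairs and must be
// regrouped into per-topic lists on every conversion.
static Map<String, List<Integer>> regroup(List<Map.Entry<String, Integer>> pairs) {
    Map<String, List<Integer>> byTopic = new HashMap<>();
    for (Map.Entry<String, Integer> tp : pairs)
        byTopic.computeIfAbsent(tp.getKey(), t -> new ArrayList<>()).add(tp.getValue());
    return byTopic;
}
// After: ConsumerProtocolSubscription already stores (topic -> partition list),
// so the lists can be passed through without regrouping.
```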
Reviewers: David Jacot <djacot@confluent.io>
We observed that some runs of the test suite caused CI pipelines to stall.
A thread dump revealed that the test runner was blocked trying to read from a
socket while attempting to close it [[0]]. It turns out this is
due to a bug in the JDK very similar to JDK-8274524, but affecting
the else branch of `SSLSocketImpl::bruteForceCloseInput` [[1]], which wasn't
fixed by JDK-8274524.
Since the blocking happens in a native call, the test runner's timeouts have
no effect as the blocked test runner thread doesn't seem to respond to
interrupts.
As a mitigation in Kafka's test suite, this change adds `SO_TIMEOUT` of
30 seconds to all the TLS sockets handled by `EchoServer`. The timeout is
reasonably high for tests and a finite upper bound avoids infinite
blocking of the test suite.
[0]: https://issues.apache.org/jira/secure/attachment/13066427/timeout.log
[1]: 890adb6410/src/java.base/share/classes/sun/security/ssl/SSLSocketImpl.java (L808)
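A minimal sketch of the mitigation (EchoServer's real wiring differs; this shows only the timeout on an accepted TLS socket):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import javax.net.ssl.SSLServerSocketFactory;

static Socket acceptWithReadTimeout() throws IOException {
    ServerSocket server = SSLServerSocketFactory.getDefault().createServerSocket(0);
    Socket socket = server.accept();
    socket.setSoTimeout(30_000); // 30 s: generous for tests, but finite
    return socket;
}
```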
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Define the interfaces and RPCs for share-group persistence (KIP-932). This PR contains only the RPCs and interfaces, so that the broker components which depend upon them can be built. The implementation will follow in subsequent PRs.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Apoorv Mittal <apoorvmittal10@gmail.com>
Allow the committed offsets fetch to run for as long as needed. This handles the case where a user invokes Consumer.poll() with a very small timeout (including zero).
Reviewers: Andrew Schofield <aschofield@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
In a previous PR (#16048), I mistakenly excluded the underscore (_) from the set of valid characters for the protocol,
making it impossible to correctly parse connection strings for SASL_PLAINTEXT. This bug fix addresses the
issue and includes corresponding tests.
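An illustrative before/after (the exact pattern lives in the connection-string parsing code; these regexes are assumptions that show the underscore issue):

```java
import java.util.regex.Pattern;

Pattern broken = Pattern.compile("^([A-Za-z0-9]+)://");   // "SASL_PLAINTEXT://..." fails to match
Pattern fixed  = Pattern.compile("^([A-Za-z0-9_]+)://");  // underscore now accepted

// fixed.matcher("SASL_PLAINTEXT://broker:9092").find()  -> true
// broken.matcher("SASL_PLAINTEXT://broker:9092").find() -> false
```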
Reviewers: Andrew Schofield <aschofield@confluent.io>, Luke Chen <showuon@gmail.com>
Fixes for the leave group flow (unsubscribe/close):
- Fix to send the heartbeat to leave the group on close even if the callbacks fail.
- Fix to ensure that if a member gets fenced while blocked on callbacks (e.g. on unsubscribe), it clears its epoch so that it is not included in commit requests.
- Fix to avoid a race on the subscription state object on unsubscribe, updating it only on the background thread when the callbacks to leave complete (success or failure).
Also improves logging in this area.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Philip Nee <pnee@confluent.io>
Completely migrates ConsumerNetworkThreadTest away from ConsumerTestBuilder, removing all usages of spy objects and replacing them with mocks. Removes testEnsureMetadataUpdateOnPoll() since it was doing integration testing. Also adds new tests for more complete coverage of ConsumerNetworkThread.
Reviewers: Kirk True <kirk@kirktrue.pro>, Lianet Magrans <lianetmr@gmail.com>, Philip Nee <pnee@confluent.io>, Matthias J. Sax <matthias@confluent.io>
This commit implements KIP-899: Allow producer and consumer clients to rebootstrap. It introduces the new setting `metadata.recovery.strategy`, applicable to all client types.
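For example (`none` preserves the old behavior; `rebootstrap` re-resolves the bootstrap servers when the client loses contact with all known brokers):

```java
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("metadata.recovery.strategy", "rebootstrap");
```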
Reviewers: Greg Harris <gharris1727@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
KAFKA-16570: FenceProducers API returns "unexpected error" when successful
* Client handling of ConcurrentTransactionsException as retriable
* Unit test
* Integration test
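Example invocation (assuming configured admin properties in scope; with the fix, the transient CONCURRENT_TRANSACTIONS error is retried internally instead of surfacing as an unexpected error):

```java
import java.util.List;
import org.apache.kafka.clients.admin.Admin;

try (Admin admin = Admin.create(props)) {
    admin.fenceProducers(List.of("my-transactional-id")).all().get();
}
```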
Reviewers: Chris Egerton <chrise@aiven.io>, Justine Olshan <jolshan@confluent.io>
This patch is the continuation of https://github.com/apache/kafka/pull/15964. It introduces record coalescing to the CoordinatorRuntime, along with a new configuration `group.coordinator.append.linger.ms` which allows administrators to choose the linger time or disable coalescing with zero. The new configuration defaults to 10ms.
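For example, in the broker configuration:

```
# Coalesce coordinator writes for up to 10 ms (the default); 0 disables coalescing.
group.coordinator.append.linger.ms=10
```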
Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>
Add support for KIP-853 KRaft quorum reconfiguration in the DescribeQuorum request and response.
Also add support to AdminClient.describeQuorum, so that users will be able to find the current set of
quorum nodes, as well as their directories, via these RPCs.
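Example usage via the Admin client (the directory-id accessor name below is an assumption based on this work):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.QuorumInfo;

static void printQuorum(Admin admin) throws Exception {
    QuorumInfo info = admin.describeMetadataQuorum().quorumInfo().get();
    info.voters().forEach(v ->
        System.out.println(v.replicaId() + " -> " + v.replicaDirectoryId()));
}
```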
Reviewers: Luke Chen <showuon@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Andrew Schofield <aschofield@confluent.io>
Propagate metadata errors from the background thread to the application thread. This fix ensures that metadata errors are thrown to the user on consumer.poll().
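From the user's perspective (the snippet assumes a consumer in scope; the concrete exception depends on the metadata error):

```java
try {
    consumer.poll(java.time.Duration.ofMillis(100));
} catch (org.apache.kafka.common.KafkaException e) {
    // Metadata errors raised in the background thread now surface here.
}
```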
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Philip Nee <pnee@confluent.io>, Lianet Magrans <lianetmr@gmail.com>
Fix to complete a Future that was stuck because exception.getCause() threw an error.
The fix in #16217 unblocked the blocking thread, but the exception caught from the blocking get() call was wrapped in an ExecutionException, which is not the case in the async workflow; hence getCause() is not needed.
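Sketched, the difference between the two workflows (illustrative; `future` is a CompletableFuture and `handle` a stand-in error handler):

```java
// Blocking workflow: exceptions thrown inside get() arrive wrapped:
//   try { future.get(); } catch (ExecutionException e) { handle(e.getCause()); }

// Async workflow: the callback receives the exception directly, so
// getCause() may be null and must not be used when completing the future.
future.whenComplete((result, error) -> {
    if (error != null) handle(error); // not error.getCause()
});
```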
Reviewers: Luke Chen <showuon@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
This PR includes changes for AsyncKafkaConsumer to avoid evaluating the subscription regex on every poll if the metadata hasn't changed. A metadataVersionSnapshot field was introduced to detect whether the metadata has changed; if it has, the current subscription regex is re-evaluated.
This is the same mechanism used by the LegacyConsumer.
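A minimal sketch of the snapshot check (field and method names assumed):

```java
private int metadataVersionSnapshot = -1;

void maybeReevaluateRegex(int currentMetadataVersion) {
    if (currentMetadataVersion > metadataVersionSnapshot) {
        metadataVersionSnapshot = currentMetadataVersion;
        evaluateSubscriptionRegex(); // hypothetical helper: re-resolve matching topics
    }
}
```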
Reviewers: Lianet Magrans <lianetmr@gmail.com>, Matthias J. Sax <matthias@confluent.io>