This PR introduces an ExpiringErrorCache that temporarily stores topic
creation errors, allowing the system to provide detailed failure reasons
in subsequent heartbeat responses.
Key Designs:
Time-based expiration: Errors are cached with a TTL based on the
streams group heartbeat interval (2x heartbeat interval). This ensures
errors remain available for at least one retry cycle while preventing
unbounded growth. 2. Priority queue for efficient expiry: Uses a
min-heap to track entries by expiration time, enabling efficient cleanup
of expired entries during cache operations. 3. Capacity enforcement:
Limits cache size to prevent memory issues under high error rates. When
capacity is exceeded, oldest entries are evicted first. 4. Reference
equality checks: Uses eq for object identity comparison when cleaning up
stale entries, avoiding expensive value comparisons while correctly
handling entry updates.
Reviewers: Lucas Brutschy <lucasbru@apache.org>
- Move the `RaftManager` interface to raft module, and remove the
`register` and `leaderAndEpoch` methods since they are already part of
the RaftClient APIs.
- Rename RaftManager.scala to KafkaRaftManager.scala.
Reviewers: Ken Huang <s7133700@gmail.com>, TengYao Chi
<kitingiao@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
Clarify timeout errors received on send if the case is topic not in
metadata vs partition not in metadata. Add integration tests showcases
the difference Follow-up from 4.1 fix for misleading timeout error
message (https://issues.apache.org/jira/browse/KAFKA-8862)
Reviewers: TengYao Chi <frankvicky@apache.org>, Kuan-Po Tseng
<brandboat@gmail.com>
This PR migrates the `TransactionLogTest` from Scala to Java for better
consistency with the rest of the test suite and to simplify future
maintenance.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
The ApiError.fromThrowable(t) is going to return a generic
Errors.UNKNOWN_SERVER_ERROR back to the calling client (CLI for
instance) (eg if the broker has an authZ issue with ZK) and such
UnknownServerException should have a matching ERROR level log in the
broker logs IHMO to make it easier to troubleshoot
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
## Changes
This PR improves the stability of the
PlaintextAdminIntegrationTest.testElectPreferredLeaders test by
introducing short Thread.sleep( ) delays before invoking:
- changePreferredLeader( )
- waitForBrokersOutOfIsr( )
## Reasons
- Metadata propagation for partition2 :
Kafka requires time to propagate the updated leader metadata across all
brokers. Without waiting, metadataCache may return outdated leader
information for partition2.
- Eviction of broker1 from the ISR :
To simulate a scenario where broker1 is no longer eligible as leader,
the test relies on broker1 being removed from the ISR (e.g., due to
intentional shutdown). This eviction is not instantaneous and requires a
brief delay before Kafka reflects the change.
Reviewers: PoAn Yang <payang@apache.org>, TengYao Chi
<kitingiao@gmail.com>, TaiJuWu <tjwu1217@gmail.com>, Ken Huang
<s7133700@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
This PR is a follow-up from https://github.com/apache/kafka/pull/20468.
Basically makes two things:
1. Moves the variable to the catch block as it is used only there.
2. Refactor FeaturesPublisher to handle exceptions the same as
ScramPublisher or other publishers :)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
---------
Signed-off-by: see-quick <maros.orsak159@gmail.com>
The method rollbackOrProcessStateUpdates in SharePartition received 2
separate lists of updatedStates (InFlightState) and stateBatches
(PersisterStateBatch). This PR introduces a new subclass called
`PersisterBatch` which encompasses both these objects.
Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>
Integration tests for Stream Admin related API
Previous PR: https://github.com/apache/kafka/pull/20244
This one adds:
- Integration test for Admin#listStreamsGroupOffsets API
- Integration test for Admin#deleteStreamsGroupOffsets API
- Integration test for Admin#alterStreamsGroupOffsets API
Reviewers: Alieh Saeedi <asaeedi@confluent.io>, Lucas Brutschy
<lucasbru@apache.org>
As the PR title suggests, this PR is an attempt to perform some cleanups
in the server module. The changes are mostly around the use of Record
type for some classes, changes to use enhanced switch, etc.
Reviewers: Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai
<chia7712@gmail.com>
We add the three main changes in this PR
- Disallowing null values for most LIST-type configurations makes sense,
since users cannot explicitly set a configuration to null in a
properties file. Therefore, only configurations with a default value of
null should be allowed to accept null.
- Disallowing duplicate values is reasonable, as there are currently no
known configurations in Kafka that require specifying the same value
multiple times. Allowing duplicates is both rare in practice and
potentially confusing to users.
- Disallowing empty list, even though many configurations currently
accept them. In practice, setting an empty list for several of these
configurations can lead to server startup failures or unexpected
behavior. Therefore, enforcing non-empty lists helps prevent
misconfiguration and improves system robustness.
These changes may introduce some backward incompatibility, but this
trade-off is justified by the significant improvements in safety,
consistency, and overall user experience.
Additionally, we introduce two minor adjustments:
- Reclassify some STRING-type configurations as LIST-type, particularly
those using comma-separated values to represent multiple entries. This
change reflects the actual semantics used in Kafka.
- Update the default values for some configurations to better align with
other configs.
These changes will not introduce any compatibility issues.
Reviewers: Jun Rao <junrao@gmail.com>, Chia-Ping Tsai
<chia7712@gmail.com>
This reverts commit d86ba7f54a.
Reverting since we are planning to change how KIP-966 is implemented. We
should revert this RPC until we have more clarity on how this KIP will
be executed.
Reviewers: José Armando García Sancio <jsancio@apache.org>
This change adds:
- Integration test for `Admin#describeStreamsGroups` API
- Integration test for `Admin#deleteStreamsGroup` API
Reviewers: Alieh Saeedi <asaeedi@confluent.io>, Lucas Brutschy
<lucasbru@apache.org>
---------
Co-authored-by: Lucas Brutschy <lbrutschy@gmail.com>
* Log error message if `broker.heartbeat.interval.ms * 2` is large than
`broker.session.timeout.ms`.
* Add test case
`testLogBrokerHeartbeatIntervalMsShouldBeLowerThanHalfOfBrokerSessionTimeoutMs`.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Add a lower bound to num.replica.fetchers.
Reviewers: PoAn Yang <payang@apache.org>, TaiJuWu <tjwu1217@gmail.com>,
Ken Huang <s7133700@gmail.com>, jimmy <wangzhiwang611@gmail.com>,
Jhen-Yung Hsu <jhenyunghsu@gmail.com>, Chia-Ping Tsai
<chia7712@gmail.com>
Cleanup default configs in
AutoTopicCreationManager#createStreamsInternalTopics. The streams
protocol would like to be consistent with the kafka streams using the
classic protocol - which would create the internal topics using
CreateTopic and therefore use the controller config.
Reviewers: Lucas Brutschy <lucasbru@apache.org>
These tests were written while finalizing approach for making inflight
state class thread safe but later approach changed and the lock is now
always required by SharePartition to change inflight state. Hence these
tests are incorrect and do not add any value.
Reviewers: Andrew Schofield <aschofield@confluent.io>
Minor PR to update name of maxInFlightMessages to maxInFlightRecords to
maintain consistency in share partition related classes.
Reviewers: Andrew Schofield <aschofield@confluent.io>
As per the suggestion by @adixitconfluent and @chirag-wadhwa5,
[here](https://github.com/apache/kafka/pull/20395#discussion_r2300810004),
I have refactored the code with variable and method names.
Reviewers: Andrew Schofield <aschofield@confluent.io>, Chirag Wadhwa
<cwadhwa@confluent.io>
This is the first part of the implementation of
[KIP-1023](https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Follower+fetch+from+tiered+offset)
The purpose of this pull request is for the broker to start returning
the correct offset when it receives a -6 as a timestamp in a ListOffsets
API request.
Added unit tests for the new timestamp.
Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
The PR fixes the batch alignment issue when partitions are re-assigned.
During initial read of state the batches can be broken arbitrarily. Say
the start offset is 10 and cache contains [15-18] batch during
initialization. When fetch happens at offset 10 and say the fetched
batch contain 10 records i.e. [10-19] then correct batches will be
created if maxFetchRecords is greater than 10. But if maxFetchRecords is
less than 10 then last offset of batch is determined, which will be 19.
Hence acquire method will incorrectly create a batch of [10-19] while
[15-18] already exists. Below check is required t resolve the issue:
```
if (isInitialReadGapOffsetWindowActive() && lastAcquiredOffset >
lastOffset) {
lastAcquiredOffset = lastOffset;
}
```
While testing with other cases, other issues were determined while
updating the gap offset, acquire of records prior share partitions end
offset and determining next fetch offset with compacted topics. All
these issues can arise mainly during initial read window after partition
re-assignment.
Reviewers: Andrew Schofield <aschofield@confluent.io>, Abhinav Dixit
<adixit@confluent.io>, Chirag Wadhwa <cwadhwa@confluent.io>
This is followup PR for https://github.com/apache/kafka/pull/19699.
* Update TransactionLog#readTxnRecordValue to initialize
TransactionMetadata with non-empty topic partitions
* Update `TxnTransitMetadata` comment, because it's not immutable.
Reviewers: TengYao Chi <kitingiao@gmail.com>, Justine Olshan
<jolshan@confluent.io>, Kuan-Po Tseng <brandboat@gmail.com>, Chia-Ping
Tsai <chia7712@gmail.com>
**Changes**: Use ClusterTest to rewrite
EligibleLeaderReplicasIntegrationTest.
**Validation**: Run the test 50 times locally with consistent success.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
As per the current implementation in archiveRecords, when LSO is
updated, if we have multiple record batches before the new LSO, then
only the first one gets archived. This is because of the following lines
of code ->
`isAnyOffsetArchived = isAnyOffsetArchived ||
archivePerOffsetBatchRecords(inFlightBatch, startOffset, endOffset - 1,
initialState);`
`isAnyBatchArchived = isAnyBatchArchived ||
archiveCompleteBatch(inFlightBatch, initialState);`
The first record / batch will make `isAnyOffsetArchived` /
`isAnyBatchArchived` true, after which this line of code will
short-circuit and the methods `archivePerOffsetBatchRecords` /
`archiveCompleteBatch` will not be called again. This PR changes the
order of the expressions so that the short-circuit does not prevent from
archiving all the required batches.
Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>
The `record-size` and `throughput` arguments don’t work in
`TestRaftServer`. The `recordsPerSec` and `recordSize` values are always
hard-coded.
- Fix `recordsPerSec` and `recordSize` values hard-coded issue
- Add "Required" description to command-line options to make it clear to
users.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Follow-up to
[KAFKA-18486](https://issues.apache.org/jira/browse/KAFKA-18486)
* Replace PartitionState with PartitionRegistration in
makeFollower/makeLeader
* Remove PartitionState.java since it is no longer referenced
Reviewers: TaiJuWu <tjwu1217@gmail.com>, Ken Huang <s7133700@gmail.com>,
Chia-Ping Tsai <chia7712@gmail.com>
Fixes a false positive in `BrokerRegistrationRequestTest` caused by
`isMigratingZkBroker`, and migrates the test from Scala to Java.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
1. Move TransactionMetadata to transaction-coordinator module.
2. Rewrite TransactionMetadata in Java.
3. The `topicPartitions` field uses `HashSet` instead of `Set`, because
it's mutable field.
4. In Scala, when calling `prepare*` methods, they can use current value
as default input in `prepareTransitionTo`. However, in Java, it doesn't
support function default input value. To avoid a lot of duplicated code
or assign value to wrong field, we add a private class `TransitionData`.
It can get current `TransactionMetadata` value as default value and
`prepare*` methods just need to assign updated value.
Reviewers: Justine Olshan <jolshan@confluent.io>, Artem Livshits
<alivshits@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
This PR does the following:
- Rewrite to new test infra.
- Rewrite to java.
- Move to clients-integration-tests.
- Add `ensureConsistentMetadata` method to `ClusterInstance`,
similar to `ensureConsistentKRaftMetadata` in the old infra, and
refactors related code.
Reviewers: TengYao Chi <frankvicky@apache.org>, Ken Huang
<s7133700@gmail.com>
Log a warning for each topic that failed to be created as a result of an
automatic creation. This makes the underlying cause more visible so
users can take action.
Previously, at the default log level, you could only see logs that the
broker was attempting to autocreate topics. If the creation failed, then
it was logged at debug.
Signed-off-by: Robert Young <robertyoungnz@gmail.com>
Reviewers: Luke Chen <showuon@gmail.com>, Kuan-Po Tseng <brandboat@gmail.com>
Add the `controller.quorum.auto.join.enable` configuration. When enabled
with KIP-853 supported, follower controllers who are observers (their
replica id + directory id are not in the voter set) will:
- Automatically remove voter set entries which match their replica id
but not directory id by sending the `RemoveVoterRPC` to the leader.
- Automatically add themselves as a voter when their replica id is not
present in the voter set by sending the `AddVoterRPC` to the leader.
Reviewers: José Armando García Sancio
[jsancio@apache.org](mailto:jsancio@apache.org), Chia-Ping Tsai
[chia7712@gmail.com](mailto:chia7712@gmail.com)
### Summary
Adds comprehensive test coverage for the StorageTool format command
feature validation, including tests for valid feature overrides, invalid
feature detection, and multiple feature specifications. Also adds debug
output to help with troubleshooting format operations.
### Changes Made
#### Test Coverage Improvements
- **`testFormatWithReleaseVersionAndFeatureOverride()`**: Tests that
feature overrides work correctly when specified with `--feature` flag
- **`testFormatWithInvalidFeatureThrowsError()`**: Tests error handling
for unsupported features
- **`testFormatWithMultipleFeatures()`**: Tests multiple feature
specifications in a single format command
#### Debug Output Enhancement
- **Formatter.java**: Added debug output to print bootstrap metadata
during format operations
- Helps with troubleshooting format issues by showing the complete
bootstrap metadata being written
- Improves visibility into what features and configurations are being
applied
#### Test Updates
- **FormatterTest.java**: Updated existing tests to account for the new
debug output\
### Related
- KIP-1022: [Formatting and Updating Features
](https://cwiki.apache.org/confluence/display/KAFKA/KIP-1022%3A+Formatting+and+Updating+Features)
Reviewers: Kevin Wu <kevin.wu2412@gmail.com>, Justine Olshan
<jolshan@confluent.io>
In the current implementation, some delayed share fetch operations get
trapped in the delayed share fetch purgatory when the partition
leaderships change during share consumption. This is because there is no
check in code to make sure the current broker is still the partition
leader corresponding to the share partitions. So, when leadership
changes, the share partitions cannot be acquired, because they have
already been fenced, and tryComplete returns false. Although the
operatio does get completed when the timer expires for it, but it is too
late by then, and the operation get stuck in the watchers list waiting
for it to get purged when estimated operations increase to more than
1000. This Pr resolves this by adding the required check so that if
partition leadership changes, then the delayed share fetches waiting on
it gets completed instantaneously.
Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>, Andrew Schofield
<aschofield@confluent.io>
Fixing max delivery check on acquisition lock timeout and write state
RPC failure.
When acquisition lock is already timed out and write state RPC failure
occurs then we need to check if records need to be archived. However
with the fix we do not persist the information, which is relevant as
some records may be archived or delivery count is bumped. The
information will be persisted eventually.
The persister call has failed already hence issuing another persister
call due to a failed persister call earlier is not correct. Rather let
the data persist in future persister calls.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Abhinav Dixit
<adixit@confluent.io>
Minor PR to move persister call outside of the lock. The lock is not
required while making the persister call.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Abhinav Dixit <adixit@confluent.io>
The PR removes unnecessary updates for find next fetch offset. When the
state is in transition and not yet completed then anyways respective
offsets should not be considered for acquisition. The find next fetch
offset is updated finally when transition is completed.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Abhinav Dixit
<adixit@confluent.io>
This flag also skips control records, so the description needs to be
updated.
---------
Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
Reviewers: Luke Chen <showuon@gmail.com>, Jhen-Yung Hsu <jhenyunghsu@gmail.com>, Vincent Potucek
## Changes:
- Replaced all references to boundPort with brokerBoundPort.
## Reasons
- boundPort and brokerBoundPort share the same definition and behavior.
Reviewers: TaiJuWu <tjwu1217@gmail.com>, Ken Huang <s7133700@gmail.com>,
Chia-Ping Tsai <chia7712@gmail.com>
The PR fixes following:
1. In case share partition arrive at a state which should be treated as
final state
of that batch/offset (example - LSO movement which causes offset/batch
to be ARCHIVED permanently), the result of pending write state RPCs for
that offset/batch override the ARCHIVED state. Hence track such updates
and apply when transition is completed.
2. If an acquisition lock timeout occurs while an offset/batch is
undergoing transition followed by write state RPC failure, then
respective batch/offset can
land in a scenario where the offset stays in ACQUIRED state with no
acquisition lock timeout task.
3. If a timer task is cancelled, but due to concurrent execution of
timer task and acknowledgement, there can be a scenario when timer task
has processed post cancellation. Hence it can mark the offset/batch
re-avaialble despite already acknowledged.
Reviewers: Andrew Schofield <aschofield@confluent.io>, Abhinav Dixit
<adixit@confluent.io>
The code had a mixture of "acknowledgement type" and "acknowledge type".
The latter is preferred.
Reviewers: TengYao Chi <frankvicky@apache.org>, Lan Ding
<isDing_L@163.com>
This PR adds a check to the storage tool's format command which throws a
TerseFailure when the controller.quorum.voters config is defined and the
node is formatted with the --standalone flag or the
--initial-controllers flag.
Without this check, it is possible to have two voter sets. For example,
in a three node setup, the two nodes that formatted with
--no-initial-controllers could form quorum with each other since they
have the static voter set, and the --standalone node would ignore the
config and read the voter set of itself from its log, forming its own
quorum of 1.
Reviewers: José Armando García Sancio <jsancio@apache.org>, TaiJuWu
<tjwu1217@gmail.com>, Alyssa Huang <ahuang@confluent.io>