This patch adds reconciliation logic to migrating ZK brokers to deal with pending topic deletions as well as missed StopReplicas.
During the hybrid mode of the ZK migration, the KRaft controller asynchronously sends UMR and LISR requests to the ZK brokers to propagate metadata. Since this process is essentially "best effort", it is possible for a broker to miss a StopReplicas. The new logic lets the ZK broker compare its local logs against the full set of replicas in a "Full" LISR. Any local logs that are not present in the request's replica set are removed from ReplicaManager and marked as "stray".
To avoid inadvertent data loss with this new behavior, the brokers do not delete the "stray" partitions. Instead, they rename the directories and log warning messages during log recovery. It is up to the operator to manually delete the stray partitions. We can possibly enhance this in the future to clean up old stray logs.
This patch makes use of the previously unused Type field on LeaderAndIsrRequest. This was added as part of KIP-516 but never implemented. Since its introduction, an implicit 0 has been sent in all LISRs. The KRaft controller will now send a value of 2 to indicate a full LISR (as specified by the KIP). The presence of this value acts as a trigger for the ZK broker to perform the log reconciliation.
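A minimal sketch of the reconciliation step, using illustrative names rather than the actual ReplicaManager/LogManager code:

```scala
import java.io.File

// On a "Full" LISR (type = 2), any local log whose partition is absent from the
// request's replica set is treated as a stray replica. The directory is renamed
// (e.g. with a "-stray" suffix) instead of deleted, so an operator can remove it
// manually later.
case class LocalLog(topicPartition: String, dir: File)

def findStrayLogs(localLogs: Seq[LocalLog], partitionsInFullRequest: Set[String]): Seq[LocalLog] =
  localLogs.filterNot(log => partitionsInFullRequest.contains(log.topicPartition))

def markStray(log: LocalLog): Unit = {
  val strayDir = new File(log.dir.getParentFile, log.dir.getName + "-stray")
  // Rename rather than delete, to avoid inadvertent data loss.
  if (!log.dir.renameTo(strayDir))
    println(s"WARN: failed to rename stray partition directory ${log.dir}")
}
```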
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Conflicts:
- ReplicaManagerTest.scala: fix imports
- ZkMigrationIntegrationTest.scala: handle absence of KIP-919 changes that added a different way to
fetch the quorum voters config.
- KRaftMigrationDriverTest.java: handle absence of KIP-919 changes that added
setupDeltaForMigration.
TestUtils.createTopicWithAdmin calls waitForAllPartitionsMetadata, which waits for the partition(s) to be present in each broker's metadata cache. This is a sufficient check in ZK mode because the controller sends an LISR request before sending an UpdateMetadataRequest, which means the partition in the ReplicaManager will be updated before the metadata cache.
In KRaft mode, the metadata cache is updated first, so the check may return before partitions and other metadata listeners are fully initialized.
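A minimal sketch of a stronger wait, with hypothetical helper names (not the actual TestUtils API):

```scala
// In KRaft mode, also wait until each broker's ReplicaManager hosts every
// partition, since the metadata cache alone can be ahead of it.
def waitUntilTrue(condition: () => Boolean, msg: String, timeoutMs: Long = 15000L): Unit = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (!condition()) {
    if (System.currentTimeMillis() > deadline) throw new AssertionError(msg)
    Thread.sleep(100)
  }
}

// `partitionIsInReplicaManager` stands in for a check against each broker's ReplicaManager.
def waitForPartitionsInReplicaManagers(brokerIds: Seq[Int], topic: String, numPartitions: Int,
                                       partitionIsInReplicaManager: (Int, String, Int) => Boolean): Unit =
  for (brokerId <- brokerIds; partition <- 0 until numPartitions)
    waitUntilTrue(() => partitionIsInReplicaManager(brokerId, topic, partition),
      s"Partition $topic-$partition not yet in broker $brokerId's ReplicaManager")
```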
Testing:
Insert a Thread.sleep(100) in BrokerMetadataPublisher.onMetadataUpdate after
// Publish the new metadata image to the metadata cache.
metadataCache.setImage(newImage)
and run EdgeCaseRequestTest.testProduceRequestWithNullClientId; the test will fail locally nearly deterministically. After the change(s), the test no longer fails.
Conflicts:
core/src/test/scala/unit/kafka/server/GroupCoordinatorBaseRequestTest.scala
core/src/test/scala/integration/kafka/admin/TopicCommandIntegrationTest.scala
Reviewers: Justine Olshan <jolshan@confluent.io>, Divij Vaidya <diviv@amazon.com>, David Mao <dmao@confluent.io>
With the new callback mechanism, we were accidentally passing the context with the wrong request local. Now a RequestLocal is included as an explicit argument to the callback.
Also make the arguments passed through the callback clearer by separating the method out.
Added a test to ensure we use the request handler's request local and not the one passed in when the callback is executed via the request handler.
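A minimal sketch of the shape of the change, with hypothetical names rather than the actual KafkaRequestHandler API:

```scala
// The callback no longer closes over the RequestLocal of the thread that
// created it; the executing request handler thread supplies its own
// RequestLocal when it runs the callback.
trait RequestLocal

def wrapAsyncCallback[T](fn: (RequestLocal, T) => Unit, arg: T): RequestLocal => Unit =
  requestLocal => fn(requestLocal, arg)

// Later, on the request handler thread:
//   val callback = wrapAsyncCallback(appendResultsCallback, results)
//   callback(handlerThreadRequestLocal)   // uses the handler's own RequestLocal
```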
Reviewers: Ismael Juma <ismael@juma.me.uk>, Divij Vaidya <diviv@amazon.com>, David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>, Artem Livshits <alivshits@confluent.io>, Jun Rao <junrao@gmail.com>
Conflicts:
core/src/main/scala/kafka/server/KafkaRequestHandler.scala
core/src/main/scala/kafka/server/ReplicaManager.scala
core/src/test/scala/kafka/server/KafkaRequestHandlerTest.scala
Conflicts around the verification guard, running the callback on the same thread, and checking the coordinator node before AddPartitionsToTxnManager. Removed a test that is not applicable since we don't have 08aa33127a.
This patch fixes a problem where we migrate the current producer ID batch to KRaft instead of the next producer ID batch. Since KRaft stores the next batch in the log, we end up serving up a duplicate batch to the first caller of AllocateProducerIds once the KRaft controller has taken over.
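A worked example of the off-by-one-block bug, using hypothetical block size and offsets:

```scala
// ZK has already handed out the block starting at 5000, so the next unused
// block starts at 6000. KRaft stores the *next* block in its log, so migrating
// the current block would make the first AllocateProducerIds call after the
// takeover hand out 5000..5999 a second time.
val blockSize = 1000L
val currentZkBlockStart = 5000L

val wrongValueToMigrate = currentZkBlockStart                // re-issues IDs already in use
val correctValueToMigrate = currentZkBlockStart + blockSize  // 6000, the next fresh block
```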
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This commit fixes a bug in ProduceRequest#partitionSizes() that could cause the method to incorrectly return an empty or incomplete result to one thread while another thread is in the process of initialising it.
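A minimal sketch of the safe-publication pattern behind such a fix, using generic types rather than the actual ProduceRequest code:

```scala
import java.util.concurrent.atomic.AtomicReference

// Build the map completely in a local variable, then publish it atomically,
// so a concurrent reader never observes a partially filled result.
class LazilyComputedSizes(compute: () => Map[String, Int]) {
  private val cached = new AtomicReference[Map[String, Int]]()

  def sizes(): Map[String, Int] = {
    val existing = cached.get()
    if (existing != null) existing
    else {
      val fullyBuilt = compute()           // finished before anyone can see it
      cached.compareAndSet(null, fullyBuilt)
      cached.get()                         // whichever thread won the race
    }
  }
}
```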
Reviewers: Divij Vaidya <diviv@amazon.com>, hudeqi <1217150961@qq.com>, vamossagar12 <sagarmeansocean@gmail.com>
--------------------------------
Co-authored-by: fangxiaobing <fangxiaobing@kuaishou.com>
Adding the note from the kafka-site repo to the main repo. I also included the fixed link.
apache/kafka-site@9fa596c apache/kafka-site@4eb2409
Reviewers: Divij Vaidya <diviv@amazon.com>, Ismael Juma <ismael@juma.me.uk>
After finishing restoration, we should only log the active tasks. Standby tasks are not part of restoration and it can be confusing to see them show up on this log message.
Reviewers: Matthias J. Sax <matthias@confluent.io>
git gc moves commit hashes from individual files under .git/refs/heads/ into .git/packed-refs, which is not read by the determineCommitId function.
Replace the existing lookup within the .git directory with a GrGit lookup that handles packed and
unpacked refs transparently.
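An illustrative sketch (not the actual build logic) of why a loose-ref lookup breaks after `git gc` and what GrGit handles transparently:

```scala
import java.io.File
import scala.io.Source

// After `git gc`, the per-branch file under .git/refs/heads/ disappears and the
// SHA moves into .git/packed-refs, which has to be parsed separately.
def resolveCommitId(gitDir: File, branch: String): Option[String] = {
  val looseRef = new File(gitDir, s"refs/heads/$branch")
  if (looseRef.exists()) {
    Some(Source.fromFile(looseRef).mkString.trim)
  } else {
    val packedRefs = new File(gitDir, "packed-refs")
    if (!packedRefs.exists()) None
    else Source.fromFile(packedRefs).getLines()
      .filterNot(line => line.startsWith("#") || line.startsWith("^")) // comments and peeled tags
      .map(_.split(" ", 2))
      .collectFirst { case Array(sha, ref) if ref.trim == s"refs/heads/$branch" => sha }
  }
}
```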
Reviewers: Ismael Juma <ismael@juma.me.uk>
Fixing a bad test setup. We tried to fix an upgrade bug for FK-joins in the 3.1 release, but it later turned out that the PR was not sufficient to fix it. We finally fixed it in the 3.4 release.
This PR updates the system test matrix to only test working versions with FK-joins, limited to available test versions.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Hao Li <hli@confluent.io>, Mickael Maison <mickael.maison@gmail.com>
When a remote log segment contains multiple epochs, it gets considered multiple times when checking for a breach of retention size/time/start-offset. This affects deletion by remote log retention size, as it deletes fewer segments than expected. This is a follow-up of KAFKA-15352.
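A worked example of the over-counting, with hypothetical sizes and epochs:

```scala
// If retention-size accounting walks segments once per leader epoch, a segment
// that spans an epoch change is counted twice, the "bytes to delete" budget is
// consumed too quickly, and fewer segments are deleted than expected.
case class RemoteSegment(id: String, sizeBytes: Long, epochs: Set[Int])

val segments = List(
  RemoteSegment("seg-0", 100, Set(0)),
  RemoteSegment("seg-1", 100, Set(0, 1)), // spans the epoch change
  RemoteSegment("seg-2", 100, Set(1))
)

// Per-epoch iteration counts seg-1 twice: 400 bytes instead of the real 300.
val perEpochTotal = segments.map(s => s.sizeBytes * s.epochs.size).sum
// Deduplicating by segment gives the correct retained size.
val dedupedTotal = segments.map(_.sizeBytes).sum
```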
Reviewers: Divij Vaidya <diviv@amazon.com>, Christo Lolov <lolovc@amazon.com>, Satish Duggana <satishd@apache.org>
All block cache metrics are being multiplied by the total number of
column families. In a `RocksDBTimestampedStore`, we have 2 column
families (the default, and the timestamped values), which causes all
block cache metrics in these stores to become doubled.
The cause is that our metrics recorder uses `getAggregatedLongProperty`
to fetch block cache metrics. `getAggregatedLongProperty` queries the
property on each column family in the database, and sums the results.
Since we always configure all column families to share the same block
cache, that causes the same block cache to be queried multiple times for
its metrics, with the results added together, effectively multiplying
the real value by the total number of column families.
To fix this, we should simply use `getLongProperty`, which queries a
single column family (the default one). Since all column families share
the same block cache, querying just one of them will give us the correct
metrics for that shared block cache.
Note: the same block cache is shared among all column families of a store
irrespective of whether the user has configured a shared block cache
across multiple stores.
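A minimal sketch of the difference, with error handling elided; the property name is the standard RocksDB block cache usage property:

```scala
import org.rocksdb.RocksDB

// With two column families sharing one block cache, the aggregated call sums
// the same cache's value once per column family.
def blockCacheUsage(db: RocksDB): Long = {
  // Doubled for RocksDBTimestampedStore (2 column families, 1 shared cache):
  //   db.getAggregatedLongProperty("rocksdb.block-cache-usage")

  // Correct: query a single (default) column family; since all column families
  // share the same block cache, one query covers the shared cache.
  db.getLongProperty("rocksdb.block-cache-usage")
}
```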
Reviewers: Matthias J. Sax <matthias@confluent.io>, Bruno Cadonna <cadonna@apache.org>
This change adds the upgrade documentation for 3.6.0 and fixes the position of the notable changes in 3.5.0.
In previous releases, notable changes always come after the upgrade instructions.
Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>
With https://issues.apache.org/jira/browse/KAFKA-10575, StateRestoreListener#onRestoreSuspended was added. But local tests show that it is never called because DelegatingStateRestoreListener was not updated to call the new method.
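A self-contained sketch of the bug's shape; the trait below is a stand-in for the real StateRestoreListener, not its actual definition:

```scala
// When a new method with a default implementation is added to the listener, a
// delegating wrapper that is not updated silently swallows the callback
// instead of forwarding it.
trait RestoreListener {
  def onRestoreEnd(store: String, totalRestored: Long): Unit
  def onRestoreSuspended(store: String, totalRestored: Long): Unit = () // new default method
}

class DelegatingRestoreListener(delegate: RestoreListener) extends RestoreListener {
  override def onRestoreEnd(store: String, totalRestored: Long): Unit =
    delegate.onRestoreEnd(store, totalRestored)
  // The fix: forward the new callback too; without this override the default
  // no-op runs and the user listener never sees the suspension.
  override def onRestoreSuspended(store: String, totalRestored: Long): Unit =
    delegate.onRestoreSuspended(store, totalRestored)
}
```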
Reviewers: Anna Sophie Blee-Goldman <sophie@responsive.dev>, Bruno Cadonna <cadonna@confluent.io>
We have observed an issue where the inter-broker SSL listener does not come up when running with TLSv1.3 / JDK 17.
SSL debug logs show that TLSv1.3 post-handshake messages larger than 16K are not getting read, causing the SslEngineValidator to get stuck while validating the provided trust/key store.
- Right now, WRAP returns if there is already data in the buffer. But if more data needs to be wrapped for UNWRAP to succeed, we end up looping forever. To fix this, we now always attempt WRAP and only return early on BUFFER_OVERFLOW (see the sketch after this list).
- Update SslEngineValidator to unwrap post-handshake messages from peer when local handshake status is FINISHED.
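A simplified sketch of the WRAP change, with buffer flushing elided; this is not the actual SslEngineValidator code:

```scala
import java.nio.ByteBuffer
import javax.net.ssl.{SSLEngine, SSLEngineResult}
import javax.net.ssl.SSLEngineResult.HandshakeStatus

// Rather than returning as soon as the network buffer already holds data, keep
// wrapping while the engine still wants to produce output, and only stop early
// on BUFFER_OVERFLOW so the caller can drain or enlarge the buffer.
def wrapPending(engine: SSLEngine, appBuffer: ByteBuffer, netBuffer: ByteBuffer): Unit = {
  var status = SSLEngineResult.Status.OK
  while (status == SSLEngineResult.Status.OK &&
         engine.getHandshakeStatus == HandshakeStatus.NEED_WRAP) {
    status = engine.wrap(appBuffer, netBuffer).getStatus
  }
  // On BUFFER_OVERFLOW the caller flushes netBuffer to the peer validator and retries.
}
```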
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
I've added details for VerificationFailureRate and VerificationTimeMs.
I considered adding the documentation for the AddPartitionsToTxnVerification metrics, but I noticed that all the request metrics simply listed Produce|FetchConsumer|FetchFollower. If we don't already report the AddPartitionsToTxn request metrics in this file, it doesn't make sense to add the verification variant (the same goes for all the other APIs we report).
Filed a followup jira if we want to redo that whole section.
Reviewers: Divij Vaidya <diviv@amazon.com>
A bug in the RemoteIndexCache leads to a situation where the cache does not replace a corrupted index with a new index instance fetched from remote storage. This commit fixes the bug by adding correct handling for `CorruptIndexException`.
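A self-contained sketch of the handling shape; the exception and helper names below are stand-ins, not the actual RemoteIndexCache API:

```scala
// When the cached index turns out to be corrupt, drop it and fetch a fresh copy
// from remote storage instead of keeping the broken instance in the cache.
class CorruptIndexException(msg: String) extends RuntimeException(msg)

def getOrRefreshIndex[Index](loadCached: () => Index, fetchFromRemote: () => Index): Index =
  try loadCached()                    // may throw on a corrupted on-disk index
  catch {
    case _: CorruptIndexException =>
      fetchFromRemote()               // replace the corrupted entry
  }
```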
Reviewers: Divij Vaidya <diviv@amazon.com>, Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Alexandre Dupriez <duprie@amazon.com>
DeleteSegmentsDueToLogStartOffsetBreach configures the segment such that it can hold at most 2 record batches, and it asserts the local-log-start-offset based on the assumption that each segment will contain exactly two messages.
During a leader switch, the segment can get rotated and may not always contain two records. Previously, we checked whether the expected local-log-start-offset was equal to the base offset of the first local log segment. With this patch, we scan the first local log segment for the expected offset.
Reviewers: Divij Vaidya <diviv@amazon.com>
Bump the snappy-java version to 1.1.10.4, and add more tests to verify that compressed data can be correctly decompressed and read.
For LogCleanerParameterizedIntegrationTest, we increased the message size for snappy decompression since, in the new snappy version, the decompressed size has increased compared with the previous version. But since the compression algorithm is not in Kafka's scope, all we need to do is make sure the compressed data can be successfully decompressed and parsed/read.
Reviewers: Divij Vaidya <diviv@amazon.com>, Ismael Juma <ismael@juma.me.uk>, Josep Prat <josep.prat@aiven.io>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
As part of validating 3.6.0 RC0, I ran the ZK migration system tests at the RC tag. Pretty much all of them failed due to recent changes (particularly, disallowing migrations with JBOD). All of the changes here are test fixes, so not a release blocker.
================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.11.3
session_id: 2023-09-19--007
run time: 8 minutes 51.147 seconds
tests run: 5
passed: 5
flaky: 0
failed: 0
ignored: 0
Reviewers: Luke Chen <showuon@gmail.com>