Commit Graph

11741 Commits

Author SHA1 Message Date
Manikumar Reddy 728666f3ad KAFKA-15502: Update SslEngineValidator to handle large stores (#14445)
We have observed an issue where the inter-broker SSL listener does not come up when running with TLSv1.3/JDK 17.
SSL debug logs show that TLSv1.3 post-handshake messages larger than 16K are not read, which causes the SslEngineValidator process to get stuck while validating the provided trust/key store.

- Right now, WRAP returns if there is already data in the buffer. But if we need more data to be wrapped for UNWRAP to succeed, we end up looping forever. To fix this, now we always attempt WRAP and only return early on BUFFER_OVERFLOW.
- Update SslEngineValidator to unwrap post-handshake messages from peer when local handshake status is FINISHED.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
2023-10-08 12:28:40 +05:30
Matthias J. Sax c9ae44e811 MINOR: update Kafka versions for system tests (#14501)
Reviewers: Bill Bejeck <bill@confluent.io>
2023-10-05 11:00:44 -07:00
Justine Olshan 9d7a821273 KAFKA-15330: Add missing documentation of metrics introduced as part of KAFKA-15028 (#14480)
I've added details for VerificationFailureRate and VerificationTimeMs.

I considered adding the documentation for the AddPartitionsToTxnVerification metrics, but I noticed that all the request metrics simply listed Produce|FetchConsumer|FetchFollower. If we don't already report the AddPartitionsToTxn request metrics in this file, it doesn't make sense to add the verification variant. (As well as all the other APIs we report)

Filed a followup jira if we want to redo that whole section.

Reviewers: Divij Vaidya <diviv@amazon.com>
2023-10-04 13:30:50 -07:00
Satish Duggana 2edd22bcab MINOR Update 3.6 branch version to 3.6.1-SNAPSHOT 2023-10-03 14:04:42 -07:00
Satish Duggana 2097c8fa4c Merge tag '3.6.0-rc2' into 3.6
3.6.0-rc2
2023-10-03 13:41:20 -07:00
David Arthur 0022949281 KAFKA-15483: Add KIP-938 and KIP-866 metrics to bundled docs (#14421)
Reviewers: Divij Vaidya <diviv@amazon.com>, Ron Dagostino <rdagostino@confluent.io>
2023-10-03 13:41:41 +02:00
Lucas Brutschy 72e275f6ea MINOR: Logging fix in StreamsPartitionAssignor (#14435)
Fix broken log message

Reviewer: A. Sophie Blee-Goldman <ableegoldman@apache.org>
2023-10-02 12:33:09 +02:00
Hao Li 3a793b094c MINOR: only log error when rack aware assignment is enabled (#14415)
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Matthias J. Sax <matthias@confluent.io>
2023-09-29 10:17:37 -07:00
iit2009060 1897af3ef9 KAFKA-15511: Handle CorruptIndexException in RemoteIndexCache (#14459)
A bug in the RemoteIndexCache leads to a situation where the cache does not replace the corrupted index with a new index instance fetched from remote storage. This commit fixes the bug by adding correct handling for `CorruptIndexException`.
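
The replacement behaviour described above can be sketched as follows. This is a minimal illustration, not the RemoteIndexCache code: `CorruptIndexError`, `load_index`, and the dict-based cache are hypothetical stand-ins, and the sanity check and remote fetch are passed in as callables.

```python
class CorruptIndexError(Exception):
    """Stand-in for Kafka's CorruptIndexException (hypothetical name)."""


def load_index(cache, segment_id, sanity_check, fetch_from_remote):
    """Return a usable index for segment_id, refetching if the cached copy is corrupt.

    cache: dict mapping segment id -> index object (hypothetical in-memory cache)
    sanity_check: callable that raises CorruptIndexError for a damaged index
    fetch_from_remote: callable that downloads a fresh index for a segment id
    """
    index = cache.get(segment_id)
    if index is not None:
        try:
            sanity_check(index)
            return index
        except CorruptIndexError:
            # Drop the damaged entry so it is replaced rather than served again.
            del cache[segment_id]
    index = fetch_from_remote(segment_id)
    sanity_check(index)
    cache[segment_id] = index
    return index
```

The point of the sketch is only that a corrupt entry is evicted and replaced by a freshly fetched instance instead of staying in the cache.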

Reviewers: Divij Vaidya <diviv@amazon.com>, Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Alexandre Dupriez <duprie@amazon.com>
2023-09-29 10:28:37 +00:00
Satish Duggana 60e845626d Bump version to 3.6.0 2023-09-28 21:56:28 -07:00
Kamal Chandraprakash 0d553cc9c6 KAFKA-15499: Fix the flaky DeleteSegmentsDueToLogStartOffsetBreach test (#14439)
DeleteSegmentsDueToLogStartOffsetBreach configures the segment such that it can hold at most 2 record batches. It then asserts the local-log-start-offset on the assumption that each segment will contain exactly two messages.

During leader switch, the segment can get rotated and may not always contain two records. Previously, we were checking whether the expected local-log-start-offset is equal to the base-offset-of-the-first-local-log-segment. With this patch, we will scan the first local-log-segment for the expected offset.
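
The difference between the two checks can be sketched as below. This is an illustration under assumptions, not the test code: the segment is modelled as a list of hypothetical `(base_offset, last_offset)` tuples, one per record batch, and both function names are made up.

```python
def base_offset_matches(first_segment_batches, expected_offset):
    """Old check: the expected offset must be the base offset of the segment."""
    return bool(first_segment_batches) and first_segment_batches[0][0] == expected_offset


def segment_contains_offset(first_segment_batches, expected_offset):
    """New check: any record batch in the segment may cover the expected offset."""
    return any(base <= expected_offset <= last for base, last in first_segment_batches)
```

If a segment was rotated early and starts at offset 3 while the expected offset is 4, the first check fails and the second passes, which is the flakiness the patch removes.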

Reviewers: Divij Vaidya <diviv@amazon.com>
2023-09-28 13:06:40 +00:00
Luke Chen 4fdac6136b KAFKA-15498: bump snappy-java version to 1.1.10.4 (#14434)
Bump snappy-java version to 1.1.10.4 and add more tests to verify that the compressed data can be correctly decompressed and read.

For LogCleanerParameterizedIntegrationTest, we increased the message size for snappy decompression because the decompressed size is larger in the new version of snappy than in the previous one. Since the compression algorithm is outside Kafka's scope, all we need to do is make sure the compressed data can be successfully decompressed and parsed/read.

Reviewers: Divij Vaidya <diviv@amazon.com>, Ismael Juma <ismael@juma.me.uk>, Josep Prat <josep.prat@aiven.io>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
2023-09-27 19:02:04 +08:00
Divij Vaidya a6dd6c58e2 Upgrade Jetty to 9.4.52.v20230823 (#14438)
Reviewers: Viktor Somogyi-Vass <viktorsomogyi@gmail.com>, Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>
2023-09-25 10:26:08 -07:00
Luke Chen be527ea36c MINOR: fix kraft upgrade system test (#14424)
We should use DEV_BRANCH instead of DEV_VERSION in this case; otherwise, the following error is thrown:

RunnerClient: kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.6.0-SNAPSHOT.metadata_quorum=ISOLATED_KRAFT: FAIL: RemoteCommandError({'ssh_config': {'host': 'ducker10', 'hostname': 'ducker10', 'user': 'ducker', 'port': 22, 'password': '', 'identityfile': '/home/ducker/.ssh/id_rsa', 'connecttimeout': None}, 'hostname': 'ducker10', 'ssh_hostname': 'ducker10', 'user': 'ducker', 'externally_routable_ip': 'ducker10', '_logger': <Logger kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.6.0-SNAPSHOT.metadata_quorum=ISOLATED_KRAFT-2 (DEBUG)>, 'os': 'linux', '_ssh_client': <paramiko.client.SSHClient object at 0xffffb35d5820>, '_sftp_client': <paramiko.sftp_client.SFTPClient object at 0xffffb35f8ca0>, '_custom_ssh_exception_checks': None}, '/opt/kafka-3.6.0-SNAPSHOT/bin/kafka-storage.sh format --ignore-formatted --config /mnt/kafka/kafka.properties --cluster-id I2eXt9rvSnyhct8BYmW6-w', 127, b'bash: line 1: /opt/kafka-3.6.0-SNAPSHOT/bin/kafka-storage.sh: No such file or directory\n')

Reviewers: Satish Duggana <satishd@apache.org>
2023-09-25 16:15:51 +08:00
Divij Vaidya e8dffea9ab MINOR: Fix kafka-site formatting (#14419)
Reviewers: Satish Duggana <satishd@apache.org>, Josep Prat <jlprat@apache.org>
2023-09-21 09:31:04 +00:00
David Arthur 01fa95c216 MINOR: Fix the ZK migration system tests (#14409)
As part of validating 3.6.0 RC0, I ran the ZK migration system tests at the RC tag. Pretty much all of them failed due to recent changes (particularly, disallowing migrations with JBOD). All of the changes here are test fixes, so not a release blocker.

================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.11.3
session_id:       2023-09-19--007
run time:         8 minutes 51.147 seconds
tests run:        5
passed:           5
flaky:            0
failed:           0
ignored:          0

Reviewers: Luke Chen <showuon@gmail.com>
2023-09-20 14:36:50 +08:00
Greg Harris ae352b6397 KAFKA-15473: Hide duplicate plugins in /connector-plugins (#14398)
Reviewers: Yash Mayya <yash.mayya@gmail.com>, Sagar Rao <sagarmeansocean@gmail.com>, Hector Geraldino <hgeraldino@gmail.com>, Chris Egerton <chrise@aiven.io>
2023-09-19 22:30:18 +05:30
Satish Duggana 193d8c5be8 Added missing licenses for libraries (#14393)
Reviewers: Luke Chen <showuon@gmail.com>
2023-09-15 23:23:28 +05:30
Luke Chen 8319163062 KAFKA-15442: add a section in doc for tiered storage (#14382)
Added the 6.11: Tiered Storage section and notable changes in v3.6.0

Reviewers: Satish Duggana <satishd@apache.org>, Gantigmaa Selenge <gselenge@redhat.com>
2023-09-14 21:13:26 +05:30
Kamal Chandraprakash 2508e30670 KAFKA-15439: Transactions test with tiered storage (#14347)
This test extends the existing TransactionsTest. It configures the broker and topic with tiered storage and expects at least one log segment to be uploaded to the remote storage.

Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>, Divij Vaidya <diviv@amazon.com>
2023-09-14 09:52:46 +08:00
Justine Olshan f13367de4e KAFKA-15459: Convert coordinator retriable errors to a known producer response error (#14378)
KIP-890 Part 1 tries to address hanging transactions on old clients. Thus, the produce version cannot be bumped and no new errors can be added. Previously, we used the Java client's notion of retriable and abortable errors: retriable errors are defined as such by extending the retriable error class, fatal errors are defined explicitly, and abortable errors are the remainder. However, many other clients treat unspecified errors as fatal, which means many retriable errors kill the application.

Stuck between having specific errors for Java clients that are handled correctly (i.e., we retry) or specific fatal errors for cases that should not be fatal, we opted for a middle ground: a non-specific error, with a message in the response giving the specifics.

This patch converts some of the coordinator error codes to NOT_ENOUGH_REPLICAS, which is a known produce response error.
It also correctly adds the old errors to the produce response. (We were not doing this correctly before.)

Added tests for the new errors and messages.
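
A rough sketch of the conversion follows. The commit message does not list which coordinator errors are converted, so the set below is an assumption chosen for illustration, as are the function name and the message format.

```python
# Assumed subset of coordinator error names; the exact set converted by the
# patch is not stated in the commit message.
COORDINATOR_RETRIABLE_ERRORS = {
    "COORDINATOR_NOT_AVAILABLE",
    "COORDINATOR_LOAD_IN_PROGRESS",
    "NOT_COORDINATOR",
}


def to_produce_response_error(error_name):
    """Map a verification error to (produce_error, message).

    Coordinator errors become NOT_ENOUGH_REPLICAS, which older clients already
    know how to retry; the original error survives only in the message text.
    """
    if error_name in COORDINATOR_RETRIABLE_ERRORS:
        return "NOT_ENOUGH_REPLICAS", f"Unable to verify the partition: {error_name}"
    return error_name, None
```

A client that only understands the error code retries as usual, while a client that reads the message can still see the underlying cause.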

Reviewers: Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>
2023-09-13 14:23:41 -07:00
Federico Valeri 4902884edd MINOR: Fix metadata.version reference in "ZooKeeper to KRaft Migration" documentation (#14366)
In the "ZooKeeper to KRaft Migration" documentation, we still report 3.4 as the metadata version. This reworks that phrase to make it clearer and avoid the need to update it in the future.

Signed-off-by: Federico Valeri <fedevaleri@gmail.com>

Reviewers: Luke Chen <showuon@gmail.com>
2023-09-13 17:20:25 +08:00
Luke Chen 89e4976770 MINOR: Fix errors in javadoc and docs in tiered storage (#14379)
Reviewers: Satish Duggana <satishd@apache.org>
2023-09-13 12:46:52 +05:30
Luke Chen 6b91043bfb MINOR: reduce default RLMM retry interval (#14374)
Reduce default remote.log.metadata.initialization.retry.interval.ms value to 100ms.

Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
2023-09-12 23:03:09 +05:30
David Arthur 50fea09724 KAFKA-15450 Don't allow ZK migration with JBOD (#14367)
Reviewers: Ron Dagostino <rndgstn@gmail.com>
2023-09-12 10:29:03 -04:00
Abhijeet Kumar 9c44f705b3 KAFKA-14993: Improve TransactionIndex instance handling while copying to and fetching from RSM (#14363)
- Updated the contract for RSM's fetchIndex to throw a ResourceNotFoundException instead of returning an empty InputStream when it does not have a TransactionIndex.
- Updated the LocalTieredStorage implementation to adhere to the new contract.
- Added Unit Tests for the change.
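
The new contract can be sketched as follows. This is not the RemoteStorageManager interface itself: `ResourceNotFoundError`, `fetch_index`, `read_transaction_index`, and the dict-based store are hypothetical stand-ins used to show the caller-side difference.

```python
import io


class ResourceNotFoundError(Exception):
    """Stand-in for the ResourceNotFoundException named in the commit."""


def fetch_index(stored_indexes, segment_id, index_type):
    """Return a stream for the requested index, or raise if it does not exist."""
    data = stored_indexes.get((segment_id, index_type))
    if data is None:
        raise ResourceNotFoundError(f"no {index_type} index for segment {segment_id}")
    return io.BytesIO(data)


def read_transaction_index(stored_indexes, segment_id):
    """Caller side: a missing transaction index is expected and maps to None."""
    try:
        return fetch_index(stored_indexes, segment_id, "transaction").read()
    except ResourceNotFoundError:
        return None
```

With an exception, the caller can tell "no transaction index exists" apart from "an index exists but is empty", which an empty stream cannot express.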

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Christo Lolov <lolovc@amazon.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
2023-09-12 17:54:57 +05:30
Christo Lolov 4e831b967c KAFKA-15352: Update log-start-offset before initiating deletion of remote segments (#14349)
With this change, the current leader updates the log-start-offset before the segments are deleted from remote storage. This gives followers a best-effort way to receive the log-start-offset from the leader, so that they can update their own log-start-offset before becoming leader.
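
The ordering can be sketched as below. This is a simplified model, not the RemoteLogManager code: the log is a plain dict, segments are hypothetical `(base_offset, end_offset)` tuples with an inclusive end, and the delete call is passed in.

```python
def delete_remote_segments(log, segments_to_delete, delete_from_remote):
    """Advance log_start_offset past the doomed segments, then delete them.

    log: dict with a "log_start_offset" key (stand-in for the partition log)
    segments_to_delete: list of (base_offset, end_offset) tuples, end inclusive
    delete_from_remote: callable that removes one segment from remote storage
    """
    if not segments_to_delete:
        return
    new_start = max(end for _, end in segments_to_delete) + 1
    # Step 1: publish the new start offset so followers can learn it.
    log["log_start_offset"] = max(log["log_start_offset"], new_start)
    # Step 2: only now remove the data from remote storage.
    for segment in segments_to_delete:
        delete_from_remote(segment)
```

Because the offset moves first, there is no window in which the leader advertises a start offset whose data has already been removed.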

Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Divij Vaidya <diviv@amazon.com>, Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>
2023-09-12 10:13:44 +05:30
Kamal Chandraprakash 2a56edc0ea MINOR: Removed the RSM and RLMM classpath config validator (#14358)
- The RSM and RLMM classpaths are optional and can be empty, so the non-empty string validator was removed.
- Fix getting the `localTieredStorage` by brokerId after stopping a broker.

Reviewers: Christo Lolov <lolovc@amazon.com>, Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>
2023-09-09 19:03:18 +05:30
David Arthur 5318390e71 KAFKA-15441 Allow broker heartbeats to complete in metadata transaction (#14351)
This patch allows broker heartbeat events to be completed while a metadata transaction is in-flight.

More generally, this patch allows any RUNS_IN_PREMIGRATION event to complete while the controller
is in pre-migration mode even if the migration transaction is in-flight.

We had a problem with broker heartbeats timing out because they could not be completed while a large
ZK migration transaction was in-flight. This resulted in the controller fencing all the ZK brokers which 
has many undesirable downstream effects. 

Reviewers: Akhilesh Chaganti <akhileshchg@users.noreply.github.com>, Colin Patrick McCabe <cmccabe@apache.org>
2023-09-08 16:36:36 -04:00
David Arthur 365308b52d KAFKA-15435 Fix counts in MigrationManifest (#14342)
Reviewers: Liu Zeyu <zeyu.luke@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
2023-09-08 09:14:00 -04:00
Lucas Brutschy 99bc91b73f MINOR: fix currentLag javadoc (#14224)
Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-09-07 19:26:13 -07:00
atu-sharm bb98b61009 KAFKA-15338: The metric group documentation for metrics added in KAFKA-13945 is incorrect (#14221)
Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-09-07 19:06:13 -07:00
Kamal Chandraprakash 946ab8f410 KAFKA-15410: Delete records with tiered storage integration test (4/4) (#14330)
* Added the integration test for DELETE_RECORDS API for tiered storage enabled topic
* Added validation checks before removing remote log segments for log-start-offset breach

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Christo Lolov <lolovc@amazon.com>
2023-09-08 05:16:28 +05:30
José Armando García Sancio 522263d195 KAFKA-14273; Close file before atomic move (#14354)
On Windows, an atomic move is not allowed if the file has another open handle. For example:

__cluster_metadata-0\quorum-state: The process cannot access the file because it is being used by another process
        at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92)
        at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
        at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:403)
        at java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:293)
        at java.base/java.nio.file.Files.move(Files.java:1430)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:949)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:932)
        at org.apache.kafka.raft.FileBasedStateStore.writeElectionStateToFile(FileBasedStateStore.java:152)

This is fixed by first closing the temporary quorum-state file before attempting to move it.
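
The same close-then-move pattern can be sketched in Python. This is an analogue of the fix, not the FileBasedStateStore code; `write_state_atomically` is a hypothetical helper and `os.replace` stands in for the atomic move.

```python
import os
import tempfile


def write_state_atomically(path, payload):
    """Write payload to a temp file, close it, then atomically replace path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The with-block has closed the handle, so no open handle blocks the move.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the move.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The move happens only after the `with` block exits, which is the ordering the commit introduces for the temporary quorum-state file.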

Reviewers: Colin Patrick McCabe <cmccabe@apache.org>
Co-Authored-By: Renaldo Baur Filho <renaldobf@gmail.com>
2023-09-07 16:35:03 -07:00
Chris Egerton 0db8e8c5f2 KAFKA-15416: Fix flaky TopicAdminTest::retryEndOffsetsShouldRetryWhenTopicNotFound test case (#14313)
Reviewers: Philip Nee <pnee@confluent.io>, Greg Harris <greg.harris@aiven.io>
2023-09-07 19:25:03 -04:00
Chris Egerton 5d185a88e4 KAFKA-15425: Fail fast in Admin::listOffsets when topic (but not partition) metadata is not found (#14314)
This restores previous behavior for Admin::listOffsets, which was to fail immediately if topic metadata could not be found, and only retry if metadata for one or more specific partitions could not be found.

There is a subtle difference here: prior to https://github.com/apache/kafka/pull/13432, the operation would be retried if any metadata error was reported for any individual topic partition, even if an error was also reported for the entire topic. With this change, the operation always fails if an error is reported for the entire topic, even if an error is also reported for one or more individual topic partitions.

I am not aware of any cases where brokers might return both topic- and topic partition-level errors for a metadata request, and if there are none, then this change should be safe. However, if there are such cases, we may need to refine this PR to remove the discrepancy in behavior.
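
The decision rule described above can be sketched as a small function. This is an illustration only: `next_action` and its return values are hypothetical, and the real logic lives in the admin client's metadata handling.

```python
def next_action(topic_error, partition_errors):
    """Decide how to handle a metadata response for one topic.

    topic_error: error name for the whole topic, or None
    partition_errors: dict of partition id -> error name (None when healthy)
    """
    if topic_error is not None:
        # A topic-level error wins, even if some partitions also report errors.
        return "fail"
    if any(err is not None for err in partition_errors.values()):
        return "retry"
    return "proceed"
```

The first branch is where the behaviour differs from the state after pull request 13432: a topic-level error now fails the call even when a partition-level error is also present.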

Reviewers: Justine Olshan <jolshan@confluent.io>
2023-09-07 14:04:27 -07:00
Lucia Cerchie d571408672 KAFKA-15307: Removes non-existent configs (#14341)
`partition.grouper` was removed in 3.0 release.

Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-09-07 13:00:58 -07:00
Luke Chen a5e3f0ded4 MINOR: Update the javadoc in RSM (#14352)
Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
2023-09-07 20:57:11 +05:30
Kamal Chandraprakash 5d7840e1b2 KAFKA-15351: Update log-start-offset after leader election for topics enabled with remote storage (#14340)
On leadership failover, the new leader's start offset may be older than the start offset of the old leader. This works fine in the local storage scenario because the new leader still contains the data associated with the stale start offset. With remote storage, however, although the new leader has a stale offset, the data associated with it has already been deleted from remote storage by the old leader. Hence, we end up in a situation where the leader has a start offset but no data associated with it.

This commit fixes the situation by ensuring that on every leadership failover, for topics with remote storage, the leader will update its start offset from the base of the first segment in the current leader chain present in the remote storage (if any).
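
A minimal sketch of that update is shown below. It assumes the start offset only ever moves forward, which the commit message does not state explicitly; the function name and the list of remote base offsets are hypothetical.

```python
def start_offset_after_election(current_start_offset, remote_base_offsets):
    """Return the start offset a new leader should use.

    remote_base_offsets: base offsets of the remote segments that belong to the
    current leader chain (may be empty if nothing has been uploaded yet).
    """
    if not remote_base_offsets:
        return current_start_offset
    # Assumption: never move the start offset backwards.
    return max(current_start_offset, min(remote_base_offsets))
```

For example, a new leader holding a stale start offset of 5 while the earliest remote segment starts at 20 would move to 20, so its start offset again points at data that exists.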

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Christo Lolov <lolovc@amazon.com>, Divij Vaidya <diviv@amazon.com>
2023-09-07 14:37:22 +00:00
Proven Provenzano 940f329007 KAFKA-15422: Update documentation for delegation tokens when working with Kafka with KRaft (#14339)
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
2023-09-06 10:42:30 +05:30
Kamal Chandraprakash 2be8b15323 KAFKA-15410: Delete topic integration test with LocalTieredStorage and TBRLMM (3/4) (#14329)
Added delete topic integration tests for tiered storage enabled topics with LocalTieredStorage and TBRLMM

Reviewers: Satish Duggana <satishd@apache.org>, Divij Vaidya <diviv@amazon.com>, Luke Chen <showuon@gmail.com>
2023-09-06 06:00:05 +05:30
Yash Mayya 4f855576e6 KAFKA-14876: Add stopped state to Kafka Connect Administration docs section (#14336)
Reviewers: Chris Egerton <chrise@aiven.io>
2023-09-05 14:44:24 -04:00
Yash Mayya 3c50c382af MINOR: Update the documentation's table of contents to add missing headings for Kafka Connect (#14337)
Reviewers: Chris Egerton <chrise@aiven.io>
2023-09-05 13:59:35 -04:00
Abhijeet Kumar 7f50497925 KAFKA-15293 Added documentation for tiered storage metrics (#14331)
Reviewers: Divij Vaidya <diviv@amazon.com>, Satish Duggana <satishd@apache.org>
2023-09-05 22:19:53 +05:30
Luke Chen b7df99abec MINOR: Update comment in consumeAction (#14335)
Reviewers: Satish Duggana <satishd@apache.org>, Divij Vaidya <diviv@amazon.com>
2023-09-05 21:36:57 +05:30
Kamal Chandraprakash 33b385e3fa KAFKA-15410: Reassign replica expand, move and shrink integration tests (2/4) (#14328)
- Updated the log-start-offset to the correct value while building the replica state in ReplicaFetcherTierStateMachine#buildRemoteLogAuxState

Integration tests added:
1. ReassignReplicaExpandTest
2. ReassignReplicaMoveTest and
3. ReassignReplicaShrinkTest

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>
2023-09-05 19:29:35 +05:30
Kamal Chandraprakash 991c5c0610 KAFKA-15410: Expand partitions, segment deletion by retention and enable remote log on topic integration tests (1/4) (#14307)
Added the below integration tests with tiered storage:
 - PartitionsExpandTest
 - DeleteSegmentsByRetentionSizeTest
 - DeleteSegmentsByRetentionTimeTest
 - EnableRemoteLogOnTopicTest

These are enabled for both ZK and KRaft modes.

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Christo Lolov <lolovc@amazon.com>, Divij Vaidya <diviv@amazon.com>
2023-09-05 10:28:35 +05:30
Justine Olshan d8d7d3127a KAFKA-15424: Make the transaction verification a dynamic configuration (#14324)
This will allow enabling and disabling transaction verification (KIP-890 part 1) without having to roll the cluster.

Tested that restarting the cluster persists the configuration.

If verification is disabled/enabled while we have an inflight request, depending on the step of the process, the change may or may not be seen in the inflight request (enabling will typically fail unverified requests, but we may still verify and reject when we first disable). Subsequent requests/retries will behave as expected for verification.

Sequence checks will continue to take place after disabling until the first message is written to the partition (thus clearing the verification entry with the tentative sequence) or the broker restarts/partition is reassigned, which will clear the memory. On enabling, we will only track sequences for requests received after the verification is enabled.

Reviewers: Jason Gustafson <jason@confluent.io>, Satish Duggana <satishd@apache.org>
2023-09-04 20:42:34 -07:00
Dimitar Dimitrov c6af3dac00 KAFKA-15052 Fix the flaky testBalancePartitionLeaders - part II (#13908)
A follow-up to https://github.com/apache/kafka/pull/13804.
This follow-up adds the alternative fix approach mentioned in
the PR above: bumping the session timeout used in the test
by 1 second.

Reproducing the flake-out locally has been much harder than
on the CI runs, as neither Gradle with Java 11 or Java 14 nor
IntelliJ with Java 14 could show it, but IntelliJ with Java 11
could occasionally reproduce the failure the first time
immediately after a rebuild. While I was unable to see the
failure with the bumped session timeout, the testing procedure
definitely didn't provide sufficient reassurance for the
fix as even without it often I'd see hundreds of consecutive
successful test runs when the first run didn't fail.

Reviewers: Luke Chen <showuon@gmail.com>, Christo Lolov <lolovc@amazon.com>
2023-09-04 17:03:39 +08:00
Abhijeet Kumar 6d3aa70b26 KAFKA-15260: RLM Task should handle uninitialized RLMM for the associated topic-partition (#14113)
With this change, the RLM task handles the retriable exception thrown when it tries to copy segments to remote storage before the RLMM is initialized. On encountering the exception, we log the error and throw the exception back to the caller. We also make sure that the failure metrics are not updated, since this is a temporary error caused by the RLMM not yet being initialized.

Added unit tests to verify RLM task does not attempt to copy segments to remote on encountering the retriable exception and that failure metrics remain unchanged.
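
The behaviour verified by those tests can be sketched as follows. This is a simplified model rather than the RLM task code: `RetriableError`, `copy_log_segments`, and the `failed_copies` counter are hypothetical names.

```python
class RetriableError(Exception):
    """Stand-in for a retriable exception raised while the RLMM is not ready."""


def copy_log_segments(rlmm_initialized, pending_segments, copy_segment, metrics):
    """Copy segments to remote storage unless the RLMM is still initializing.

    rlmm_initialized: callable returning True once the RLMM is ready
    metrics: dict with a "failed_copies" counter (hypothetical failure metric)
    """
    try:
        if not rlmm_initialized():
            raise RetriableError("RLMM is not initialized yet")
        for segment in pending_segments:
            copy_segment(segment)
    except RetriableError as err:
        # Temporary condition: log it and rethrow without touching failure metrics.
        print(f"retriable error while copying segments: {err}")
        raise
    except Exception:
        # Any other failure counts as a real copy failure.
        metrics["failed_copies"] += 1
        raise
```

Only non-retriable failures reach the counter, so a slow RLMM start-up does not show up as copy failures.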

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
2023-09-04 09:14:29 +05:30