Commit Graph

11852 Commits

Author SHA1 Message Date
Lucas Brutschy 07a18478be
KAFKA-15326: [7/N] Processing thread non-busy waiting (#14180)
Avoid busy waiting for processable tasks. We need to be a bit careful here to not have the task executors to sleep when work is available. We have to make sure to signal on the condition variable any time a task becomes "processable". Here are some situations where a task becomes processable:

- Task is unassigned from another TaskExecutor.
- Task state is changed (should only happen inside when a task is locked inside the polling phase).
- When tasks are unlocked.
- When tasks are added.
- New records available.
- A task is resumed.

So in summary, we

- We should probably lock tasks when they are paused and unlock them when they are resumed. We should also wake the task executors after every polling phase. This belongs to the StreamThread integration work (separate PR). We add DefaultTaskManager.signalProcessableTasks for this.
- We need to awake the task executors in DefaultTaskManager.unassignTask, DefaultTaskManager.unlockTasks and DefaultTaskManager.add.


Reviewers: Walker Carlson <wcarlson@confluent.io>, Bruno Cadonna <cadonna@apache.org>
2023-09-11 09:58:20 +02:00
Ruslan Krivoshein b72d92919f
KAFKA-14581: Moving GetOffsetShell to tools (#13562)
This PR moves GetOffsetShell from core module to tools module with rewriting from Scala to Java.

Reviewers: Federico Valeri fedevaleri@gmail.com, Ziming Deng dengziming1993@gmail.com, Mickael Maison mimaison@apache.org.
2023-09-11 10:30:22 +08:00
Kamal Chandraprakash 672ea644f0
MINOR: Removed the RSM and RLMM classpath config validator (#14358)
- RSM and RLMM classpath can be empty since it's optional so removed the non-empty string validator
- Fix getting the `localTieredStorage` by brokerId after stopping a broker.

Reviewers: Christo Lolov <lolovc@amazon.com>, Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>
2023-09-09 19:02:42 +05:30
David Arthur b24ccd65b7
KAFKA-15441 Allow broker heartbeats to complete in metadata transaction (#14351)
This patch allows broker heartbeat events to be completed while a metadata transaction is in-flight.

More generally, this patch allows any RUNS_IN_PREMIGRATION event to complete while the controller
is in pre-migration mode even if the migration transaction is in-flight.

We had a problem with broker heartbeats timing out because they could not be completed while a large
ZK migration transaction was in-flight. This resulted in the controller fencing all the ZK brokers which 
has many undesirable downstream effects. 

Reviewers: Akhilesh Chaganti <akhileshchg@users.noreply.github.com>, Colin Patrick McCabe <cmccabe@apache.org>
2023-09-08 16:36:13 -04:00
Nikhil Ramakrishnan 84c49c6a09
MINOR: Log when remote fetch execution is rejected (#14350)
Log at warning level when remote fetch execution is rejected. This can be potentially useful to troubleshoot UNKNOWN_SERVER_ERROR responses for remote fetch requests.

Reviewers: Luke Chen <showuon@gmail.com>, Christo Lolov <lolovc@amazon.com>
2023-09-08 19:51:12 +08:00
Lucia Cerchie 17862ffaf2
KAFKA-15418: update Kafka design docs with decompression information (#14322)
Reviewers: Divij Vaidya <diviv@amazon.com>, Matthias J. Sax <mjsax@apache.org>

---------

Co-authored-by: Cerchie <lcerchie@confluent.io>
2023-09-08 11:58:32 +02:00
Lucas Brutschy 01b91af59c
MINOR: fix currentLag javadoc (#14224)
Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-09-07 19:25:31 -07:00
atu-sharm 9ecf6f7f1c
KAFKA-15338: The metric group documentation for metrics added in KAFKA-13945 is incorrect (#14221)
Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-09-07 19:05:14 -07:00
Chris Egerton 54ab5b29e4
KAFKA-15416: Fix flaky TopicAdminTest::retryEndOffsetsShouldRetryWhenTopicNotFound test case (#14313)
Reviewers: Philip Nee <pnee@confluent.io>, Greg Harris <greg.harris@aiven.io>
2023-09-07 19:24:17 -04:00
José Armando García Sancio 7b669e8806
KAFKA-14273; Close file before atomic move (#14354)
In the Windows OS atomic move are not allowed if the file has another open handle. E.g

__cluster_metadata-0\quorum-state: The process cannot access the file because it is being used by another process
        at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92)
        at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
        at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:403)
        at java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:293)
        at java.base/java.nio.file.Files.move(Files.java:1430)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:949)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:932)
        at org.apache.kafka.raft.FileBasedStateStore.writeElectionStateToFile(FileBasedStateStore.java:152)

This is fixed by first closing the temporary quorum-state file before attempting to move it.

Reviewers: Colin Patrick McCabe <cmccabe@apache.org>
Co-Authored-By: Renaldo Baur Filho <renaldobf@gmail.com>
2023-09-07 16:17:03 -07:00
Kirk True a2de7d32c8
KAFKA-14274 #1: basic refactoring (#14305)
This change introduces some basic clean up and refactoring for forthcoming commits related to the revised fetch code for the consumer threading refactor project.

Reviewers: Christo Lolov <lolovc@amazon.com>, Jun Rao <junrao@gmail.com>
2023-09-07 15:23:44 -07:00
Colin Patrick McCabe 41b695b6e3
KAFKA-15369: Implement KIP-919: Allow AC to Talk Directly with Controllers (#14306)
Implement KIP-919: Allow AdminClient to Talk Directly with the KRaft Controller Quorum and add
Controller Registration. This KIP adds a new version of DescribeClusterRequest which is supported
by KRaft controllers. It also teaches AdminClient how to use this new DESCRIBE_CLUSTER request to
talk directly with the controller quorum. This is all gated behind a new MetadataVersion,
IBP_3_7_IV0.

In order to share the DESCRIBE_CLUSTER logic between broker and controller, this PR factors it out
into AuthHelper.computeDescribeClusterResponse.

The KIP adds three new errors codes: MISMATCHED_ENDPOINT_TYPE, UNSUPPORTED_ENDPOINT_TYPE, and
UNKNOWN_CONTROLLER_ID. The endpoint type errors can be returned from DescribeClusterRequest

On the controller side, the controllers now try to register themselves with the current active
controller, by sending a CONTROLLER_REGISTRATION request. This, in turn, is converted into a
RegisterControllerRecord by the active controller. ClusterImage, ClusterDelta, and all other
associated classes have been upgraded to propagate the new metadata. In the metadata shell, the
cluster directory now contains both broker and controller subdirectories.

QuorumFeatures previously had a reference to the ApiVersions structure used by the controller's
NetworkClient. Because this PR removes that reference, QuorumFeatures now contains only immutable
data. Specifically, it contains the current node ID, the locally supported features, and the list
of quorum node IDs in the cluster.

Reviewers: David Arthur <mumrah@gmail.com>, Ziming Deng <dengziming1993@gmail.com>, Luke Chen <showuon@gmail.com>
2023-09-07 15:21:52 -07:00
Chris Egerton 77a91be22e
KAFKA-15425: Fail fast in Admin::listOffsets when topic (but not partition) metadata is not found (#14314)
This restores previous behavior for Admin::listOffsets, which was to fail immediately if topic metadata could not be found, and only retry if metadata for one or more specific partitions could not be found.

There is a subtle difference here: prior to https://github.com/apache/kafka/pull/13432, the operation would be retried if any metadata error was reported for any individual topic partition, even if an error was also reported for the entire topic. With this change, the operation always fails if an error is reported for the entire topic, even if an error is also reported for one or more individual topic partitions.

I am not aware of any cases where brokers might return both topic- and topic partition-level errors for a metadata request, and if there are none, then this change should be safe. However, if there are such cases, we may need to refine this PR to remove the discrepancy in behavior.

Reviewers: Justine Olshan <jolshan@confluent.io>
2023-09-07 14:02:57 -07:00
Lucia Cerchie 01c7c7a399
KAFKA-15307: Removes non-existent configs (#14341)
`partition.grouper` was removed in 3.0 release.

Reviewers: Matthias J. Sax <matthias@confluent.io>
2023-09-07 12:58:13 -07:00
Nikolay 0029bc4897
KAFKA-14595: ReassignPartitionsCommandArgsTest rewritten in java (#14217)
Reviewers: Taras Ledkov <tledkov@apache.org>, Greg Harris <greg.harris@aiven.io>
2023-09-07 10:12:07 -07:00
Yash Mayya 88b554fdbd
KAFKA-15179: Add integration tests for the file sink and source connectors (#14279)
Reviewers: Ashwin Pankaj <apankaj@confluent.io>, Chris Egerton <chrise@aiven.io>
2023-09-07 12:24:13 -04:00
Kamal Chandraprakash 6e818c6b02
KAFKA-15410: Delete records with tiered storage integration test (4/4) (#14330)
* Added the integration test for DELETE_RECORDS API for tiered storage enabled topic
* Added validation checks before removing remote log segments for log-start-offset breach

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Christo Lolov <lolovc@amazon.com>
2023-09-07 21:02:39 +05:30
Luke Chen cd897e6c76
MINOR: Update the javadoc in RSM (#14352)
Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>
2023-09-07 20:55:11 +05:30
Max Riedel 90e646052a
KAFKA-14509; [1/2] Define ConsumerGroupDescribe API request and response schemas and classes. (#14124)
This patch adds the schemas of the new ConsumerGroupDescribe API (KIP-848) and adds an handler to the KafkaApis.

Reviewers: Jeff Kim <jeff.kim@confluent.io>, David Jacot <djacot@confluent.io>
2023-09-07 08:05:04 -07:00
Kamal Chandraprakash 6d762480c9
KAFKA-15351: Update log-start-offset after leader election for topics enabled with remote storage (#14340)
On leadership failover, the new leader's start offset may be older than the start offset of old leader. This works fine for local storage scenario because the new leader still contains data associated with stale start offset. But in case of remote storage, although new leader has a stale offset, the data associated with it has been deleted from remote by the old leader. Hence, we end up in a situation where leader has a start offset but no data associated with it.

This commit fixes the situation by ensuring that on every leadership failover, for topics with remote storage, the leader will update it's start offset from the base of first segment in current leader chain present in the remote storage (if any).

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Christo Lolov <lolovc@amazon.com>, Divij Vaidya <diviv@amazon.com>
2023-09-07 16:32:16 +02:00
David Arthur 65e2ecffab
KAFKA-15435 Fix counts in MigrationManifest (#14342)
Reviewers: Liu Zeyu <zeyu.luke@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
2023-09-06 13:02:13 -04:00
Lucas Brutschy eb39c95080
MINOR: StoreChangelogReaderTest fails with log-level DEBUG (#14300)
A mocked method is executed unexpectedly when we enable DEBUG
log level, leading to confusing test failures during debugging.
Since the log message itself seems useful, we adapt the test
to take the additional mocked method call into account).

Reviewer: Bruno Cadonna <cadonna@apache.org>
2023-09-06 14:49:48 +02:00
Ritika Reddy cc289d04c7
MINOR: Fix trailing white spaces on reviewers.py (#14343)
Fixing trailing white spaces on reviewers.py.

Reviewers: Jeff Kim <jeff.kim@confluent.io>, David Jacot <djacot@confluent.io>
2023-09-06 00:28:23 -07:00
David Jacot 7054625c45
KAFKA-14499: [6/N] Add MemberId and MemberEpoch to OffsetFetchRequest (#14321)
This patch adds the MemberId and the MemberEpoch fields to the OffsetFetchRequest. Those fields will be populated when the new consumer group protocol is used to ensure that the member fetching the offset has the correct member id and epoch. If it does not, UNKNOWN_MEMBER_ID or STALE_MEMBER_EPOCH are returned to the client.

Our initial idea was to implement the same for the old protocol. The field is called GenerationIdOrMemberEpoch in KIP-848 to materialize this. As a second though, I think that we should only do it for the new protocol. The effort to implement it in the old protocol is not worth it in my opinion.

Reviewers: Ritika Reddy <rreddy@confluent.io>, Calvin Liu <caliu@confluent.io>, Justine Olshan <jolshan@confluent.io>
2023-09-05 23:36:38 -07:00
Kamal Chandraprakash 80982c4ae3
KAFKA-15410: Delete topic integration test with LocalTieredStorage and TBRLMM (3/4) (#14329)
Added delete topic integration tests for tiered storage enabled topics with LocalTieredStorage and TBRLMM

Reviewers: Satish Duggana <satishd@apache.org>, Divij Vaidya <diviv@amazon.com>, Luke Chen <showuon@gmail.com>
2023-09-06 05:50:12 +05:30
Andrew Schofield b49013b73e
KAFKA-9800: Exponential backoff for Kafka clients - KIP-580 (#14111)
Implementation of KIP-580 to add exponential back-off to situations in which retry.backoff.ms
is used to delay backoff attempts. This KIP adds exponential backoff behavior with a maximum
controlled by a new config retry.backoff.max.ms, together with a +/- 20% of jitter to spread the
retry attempts of the client fleet.

Reviewers: Mayank Shekhar Narula <mayanks.narula@gmail.com>, Milind Luthra <i.milind.luthra@gmail.com>, Kirk True <kirk@mustardgrain.com>, Jun Rao<junrao@gmail.com>
2023-09-05 11:57:51 -07:00
Yash Mayya 1f473ebb5e
KAFKA-14876: Add stopped state to Kafka Connect Administration docs section (#14336)
Reviewers: Chris Egerton <chrise@aiven.io>
2023-09-05 14:39:49 -04:00
Yash Mayya 79598b49d6
MINOR: Update the documentation's table of contents to add missing headings for Kafka Connect (#14337)
Reviewers: Chris Egerton <chrise@aiven.io>
2023-09-05 13:58:44 -04:00
Abhijeet Kumar 37a51e286d
KAFKA-15293 Added documentation for tiered storage metrics (#14331)
Reviewers: Divij Vaidya <diviv@amazon.com>, Satish Duggana <satishd@apache.org>
2023-09-05 22:19:10 +05:30
Luke Chen be0de2124a
MINOR: Update comment in consumeAction (#14335)
Reviewers: Satish Duggana <satishd@apache.org>, Divij Vaidya <diviv@amazon.com>
2023-09-05 21:36:28 +05:30
Kamal Chandraprakash 9f2ac375c2
KAFKA-15410: Reassign replica expand, move and shrink integration tests (2/4) (#14328)
- Updated the log-start-offset to the correct value while building the replica state in ReplicaFetcherTierStateMachine#buildRemoteLogAuxState

Integration tests added:
1. ReassignReplicaExpandTest
2. ReassignReplicaMoveTest and
3. ReassignReplicaShrinkTest

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>
2023-09-05 19:28:17 +05:30
Justine Olshan b892acae5e
KAFKA-15424: Make the transaction verification a dynamic configuration (#14324)
This will allow enabling and disabling transaction verification (KIP-890 part 1) without having to roll the cluster.

Tested that restarting the cluster persists the configuration.

If a verification is disabled/enabled while we have an inflight request, depending on the step of the process, the change may or may not be seen in the inflight request (enabling will typically fail unverified requests, but we may still verify and reject when we first disable) Subsequent requests/retries will behave as expected for verification.

Sequence checks will continue to take place after disabling until the first message is written to the partition (thus clearing the verification entry with the tentative sequence) or the broker restarts/partition is reassigned which will clear the memory. On enabling, we will only track sequences that for requests received after the verification is enabled.

Reviewers: Jason Gustafson <jason@confluent.io>, Satish Duggana <satishd@apache.org>
2023-09-04 20:40:50 -07:00
Kamal Chandraprakash caaa4c55fe
KAFKA-15410: Expand partitions, segment deletion by retention and enable remote log on topic integration tests (1/4) (#14307)
Added the below integration tests with tiered storage
 - PartitionsExpandTest
 - DeleteSegmentsByRetentionSizeTest
 - DeleteSegmentsByRetentionTimeTest and
 - EnableRemoteLogOnTopicTest
 - Enabled the test for both ZK and Kraft modes.

These are enabled for both ZK and Kraft modes.

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Christo Lolov <lolovc@amazon.com>, Divij Vaidya <diviv@amazon.com>
2023-09-05 05:13:16 +05:30
Yash Mayya d34d84dbef
KAFKA-7438: Migrate WindowStoreBuilderTest from EasyMock to Mockito (#14152)
Reviewers: Divij Vaidya <diviv@amazon.com>
2023-09-04 13:54:18 +02:00
Christo Lolov 7a516b0386
KAFKA-14133: Move AbstractStreamTest and RocksDBMetricsRecordingTriggerTest to Mockito (#14223)
Reviewers: Divij Vaidya <diviv@amazon.com>
2023-09-04 12:58:50 +02:00
Dimitar Dimitrov 78c59cd2b0
KAFKA-15052 Fix the flaky testBalancePartitionLeaders - part II (#13908)
A follow-up to https://github.com/apache/kafka/pull/13804.
This follow-up adds the alternative fix approach mentioned in
the PR above - bumping the session timeout used in the test
with 1 second.

Reproducing the flake-out locally has been much harder than
on the CI runs, as neither Gradle with Java 11 or Java 14 nor
IntelliJ with Java 14 could show it, but IntelliJ with Java 11
could occasionally reproduce the failure the first time
immediately after a rebuild. While I was unable to see the
failure with the bumped session timeout, the testing procedure
definitely didn't provide sufficient reassurance for the
fix as even without it often I'd see hundreds of consecutive
successful test runs when the first run didn't fail.

Reviewers: Luke Chen <showuon@gmail.com>, Christo Lolov <lolovc@amazon.com>
2023-09-04 17:02:32 +08:00
Proven Provenzano a6409e8e61
KAFKA-15422: Update documenttion for delegation tokens when working with Kafka with KRaft (#14318)
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
2023-09-04 14:16:12 +05:30
Abhijeet Kumar 5074c8038e
KAFKA-15260: RLM Task should handle uninitialized RLMM for the associated topic-parititon (#14113)
This change is about RLM task handling retriable exception when it tries to copy segments to remote but the RLMM is not yet initialized. On encountering the exception, we log the error and throw the exception back to the caller. We also make sure that the failure metrics are updated since this is a temporary error because RLMM is not yet initialized.

Added unit tests to verify RLM task does not attempt to copy segments to remote on encountering the retriable exception and that failure metrics remain unchanged.

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>
2023-09-04 09:13:04 +05:30
Luke Chen da99879df7
KAFKA-15421: fix network thread leak in testThreadPoolResize (#14320)
In SocketServerTest, we create SocketServer and enableRequestProcessing on each test class initialization. That's fine since we shutdown it in @AfterEach. The issue we have is we disabled 2 tests in this test suite. And when running these disabled tests, we will go through class initialization, but without @AfterEach. That causes 2 network thread leaked.

Compared the error message in DynamicBrokerReconfigurationTest#testThreadPoolResize test here:

org.opentest4j.AssertionFailedError: Invalid threads: expected 6, got 8: List(data-plane-kafka-socket-acceptor-ListenerName(INTERNAL)-SSL-0, data-plane-kafka-socket-acceptor-ListenerName(PLAINTEXT)-PLAINTEXT-0, data-plane-kafka-socket-acceptor-ListenerName(INTERNAL)-SSL-0, data-plane-kafka-socket-acceptor-ListenerName(EXTERNAL)-SASL_SSL-0, data-plane-kafka-socket-acceptor-ListenerName(INTERNAL)-SSL-0, data-plane-kafka-socket-acceptor-ListenerName(EXTERNAL)-SASL_SSL-0, data-plane-kafka-socket-acceptor-ListenerName(PLAINTEXT)-PLAINTEXT-0, data-plane-kafka-socket-acceptor-ListenerName(EXTERNAL)-SASL_SSL-0) ==> expected: <true> but was: <false>

The 2 unexpected network threads are leaked from SocketServerTest.

Reviewers: Satish Duggana <satishd@apache.org>, Christo Lolov <lolovc@amazon.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash <kchandraprakash@uber.com>, Chris Egerton <chrise@aiven.io>
2023-09-03 16:16:54 +08:00
Rohan cc53889aaa
KAFKA-15429: reset transactionInFlight on StreamsProducer close (#14326)
Resets the value of transactionInFlight to false when closing the
StreamsProducer. This ensures we don't try to commit against a
closed producer

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
2023-09-02 18:14:14 -07:00
Rohan d293cd0735
KAFKA-15429: catch+log errors from unsubscribe in streamthread shutdown (#14325)
Preliminary fix for KAFKA-15429 which updates StreamThread.completeShutdown to
catch-and-log errors from consumer.unsubscribe. Though this does not prevent
the exception, it does preserve the original exception that caused the stream
thread to exit.

Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
2023-09-02 18:13:16 -07:00
Lianet Magrans 1bb8c11f5a
KAFKA-14965 - OffsetsRequestsManager implementation & API integration (#14308)
Implementation of the OffsetRequestsManager, responsible for building requests and processing responses for requests related to partition offsets.

In this PR, the manager includes support for ListOffset requests, generated when the user makes any of the following consumer API calls:

beginningOffsets
endOffsets
offsetsForTimes
All previous consumer API calls interact with the OffsetsRequestsManager by generating a ListOffsetsApplicationEvent.

Includes tests to cover the new functionality and to ensure no API level changes are introduced.

This covers KAFKA-14965 and KAFKA-15081.

Reviewers: Philip Nee <pnee@confluent.io>, Kirk True <kirk@mustardgrain.com>, Jun Rao<junrao@gmail.com>
2023-09-01 13:57:17 -07:00
Christo Lolov 134f6c07a4
KAFKA-15427: Fix resource leak in integration tests for tiered storage (#14319)
Co-authored-by: Nikhil Ramakrishnan <nikrmk@amazon.com>

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>
2023-09-01 23:12:57 +05:30
Jeff Kim 6391c66035
KAFKA-14500; [7/7] Refactor GroupMetadataManagerTest (#14122)
This patch makes the styling consistent inside GroupMetadataManagerTest. Also, it adds JoinResult to simplify the JoinGroup API responses in the tests.

Reviewers: David Arthur <mumrah@gmail.com>, David Jacot <djacot@confluent.io>
2023-09-01 06:36:33 -07:00
David Jacot dcff0878c4
KAFKA-14499: [5/N] Refactor GroupCoordinator.fetchOffsets and GroupCoordinator.fetchAllOffsets (#14310)
This patch refactors the GroupCoordinator.fetchOffsets and GroupCoordinator.fetchAllOffsets methods to take an OffsetFetchRequestGroup and to return an OffsetFetchResponseGroup. It prepares the ground for adding the member id and the member epoch to the OffsetFetchRequest. This change also makes those two methods more aligned with the others in the interface.

Reviewers: Calvin Liu <caliu@confluent.io>, Justine Olshan <jolshan@confluent.io>
2023-09-01 03:45:24 -07:00
Kamal Chandraprakash d0f3cf1f9f
KAFKA-15351: Ensure log-start-offset not updated to local-log-start-offset when remote storage enabled (#14301)
When tiered storage is enabled on the topic, and the last-standing-replica is restarted, then the log-start-offset should not reset its offset to first-local-log-segment-base-offset.

Reviewers: Satish Duggana <satishd@apache.org>, Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Christo Lolov <lolovc@amazon.com>
2023-09-01 06:33:33 +05:30
Lucas Brutschy 16dc983ad6
Kafka Streams Threading: Timeout behavior (#14171)
Implement setting and clearing task timeouts, as well as changing the output on exceptions to make
it similar to the existing code path. 

Reviewer: Walker Carlson <wcarlson@apache.org>
2023-08-31 15:21:01 -05:00
Kamal Chandraprakash 43fe13350f
KAFKA-15404: Disable the flaky integration tests. (#14296)
Disabled the below tests to fix the thread leak:

1. kafka.server.DynamicBrokerReconfigurationTest.testThreadPoolResize() and
2. org.apache.kafka.tiered.storage.integration.OffloadAndConsumeFromLeaderTest

Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Justine Olshan <jolshan@confluent.io>
2023-08-31 11:39:26 -07:00
Luke Chen c2bb8eb875
MINOR: Close topic based RLMM correctly in integration tests (#14315)
Reviewers: Divij Vaidya <diviv@amazon.com>
2023-08-31 10:44:32 +02:00
A. Sophie Blee-Goldman 95e1cdc4ef
HOTFIX: avoid placement of unnecessary transient standby tasks & improve assignor logging (#14149)
Minor fix to avoid creating unnecessary standby tasks, especially when these may be surprising or unexpected as in the case of an application with num.standby.replicas = 0 and warmup replicas disabled.

The "bug" here was introduced during the fix for an issue with cooperative rebalancing and in-memory stores. The fundamental problem is that in-memory stores cannot be unassigned from a consumer for any period, however temporary, without being closed and losing all the accumulated state. This caused some grief when the new HA task assignor would assign an active task to a node based on the readiness of the standby version of that task, but would have to remove the active task from the initial assignment so it could first be revoked from its previous owner, as per the cooperative rebalancing protocol. This temporary gap in any version of that task among the consumer's assignment for that one intermediate rebalance would end up causing the consumer to lose all state for it, in the case of in-memory stores.

To fix this, we simply began to place standby tasks on the intended recipient of an active task awaiting revocation by another consumer. However, the fix was a bit of an overreach, as we assigned these temporary standby tasks in all cases, regardless of whether there had previously been a standby version of that task. We can narrow this down without sacrificing any of the intended functionality by only assigning this kind of standby task where the consumer had previously owned some version of it that would otherwise potentially be lost.

Also breaks up some of the long log lines in the StreamsPartitionAssignor and expands the summary info while moving it all to the front of the line (following reports of missing info due to truncation of long log lines in larger applications)
2023-08-30 13:29:38 -07:00