To avoid leaking producers, we should add a `closed` flag to `StreamsProducer` indicating whether we should reset the producer.
Reviewers: Guozhang Wang <guozhang.wang.us@gmail.com>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
Improve descriptive information in Kafka protocol documentation.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Andrew Schofield <aschofield@confluent.io>, Apoorv Mittal <apoorvmittal10@gmail.com>
This PR introduces the DescribeGroups v6 API as part of KIP-1043. It adds an error message for the described groups so that it is possible to get some context on the error. It also changes the behaviour when the group ID cannot be found, by returning error code GROUP_ID_NOT_FOUND rather than NONE.
Reviewers: David Jacot <djacot@confluent.io>
Covers wrapping of processors and state stores for KStream-KTable joins
Reviewers: Almog Gavra <almog@responsive.dev>, Guozhang Wang <guozhang.wang.us@gmail.com>
Minor followup to #17929 based on this discussion
Also includes some very minor refactoring/renaming on the side. The only real change is in the KGroupedStreamImpl class.
Reviewers: Guozhang Wang <guozhang.wang.us@gmail.com>
Currently, KTable foreign key joins only allow extracting the foreign key from the value of the source record. This forces users to duplicate data that might already exist in the key into the value when the foreign key needs to be derived from both the key and value. This leads to:
- Data duplication
- Additional storage overhead
- Potential data inconsistency if the duplicated data gets out of sync
- Less intuitive API when the foreign key is naturally derived from both key and value
This change allows users to extract the foreign key from both the key and the value of the source record.
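The shape of such an extractor can be sketched in plain Java. This is an illustrative sketch only, not the Streams API: the class and method names below are hypothetical, and only the BiFunction idea mirrors the change described above.

```java
import java.util.function.BiFunction;

// Hypothetical sketch of deriving a foreign key from both the key and the
// value of the source record, instead of from the value alone.
public class FkExtractorSketch {
    // Before: the foreign key had to live (duplicated) inside the value,
    // e.g. encoded as "<foreignKey>:<payload>".
    static String fromValueOnly(String value) {
        return value.split(":")[0];
    }

    // After: a BiFunction-style extractor can combine key and value, so the
    // key data no longer needs to be copied into the value.
    static final BiFunction<String, String, String> FROM_KEY_AND_VALUE =
        (key, value) -> key + "-" + value;

    public static void main(String[] args) {
        System.out.println(fromValueOnly("region:42"));
        System.out.println(FROM_KEY_AND_VALUE.apply("region", "42"));
    }
}
```

With the two-argument extractor, the record value can carry only the payload, avoiding the duplication and drift described above.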
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
As part of KIP-1112, to maximize the utility of the new ProcessorWrapper, we need to migrate the DSL operators to the new method of attaching state stores by implementing ProcessorSupplier#stores, which makes these stores available for inspection by the user's wrapper.
This PR covers the aggregate operator for both KStream and KTable.
Reviewers: Guozhang Wang <guozhang.wang.us@gmail.com>, Rohan Desai <rohan@responsive.dev>
This PR is part of the implementation for KIP-1112 (KAFKA-18026). In order to have DSL operators be properly wrapped by the interface suggestion in 1112, we need to make sure they all use the ConnectedStoreProvider#stores method to connect stores instead of manually calling addStateStore.
This is a refactor only; there are no new behaviors.
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
The KStreamRepartitionIntegrationTest.shouldThrowAnExceptionWhenNumberOfPartitionsOfRepartitionOperationDoNotMatchSourceTopicWhenJoining test was taking two minutes due to not reaching an expected condition. By updating the StreamsUncaughtExceptionHandler to return SHUTDOWN_CLIENT, the expected ERROR state is now reached. The root cause was using the Thread.UncaughtExceptionHandler to handle the exception.
Without this fix, the test takes 2 minutes to run, and now it's 1 second.
Reviewers: Matthias Sax <mjsax@apache.org>
Noticed that RangeQueryIntegrationTest was taking approximately 20-30 minutes to run.
A deep dive into the logs showed consumer-rebalancing errors, and the test was stuck in a loop.
Because the same application.id was reused across tests, the Kafka Streams application failed to track its state correctly across rebalances.
Reviewers: Bill Bejeck <bbejeck@apache.org>
KIP-1043 introduced Admin.listGroups as the way to list all types of groups. As a result, Admin.listShareGroups has been removed. This PR is the final step of the removal.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
- Deprecates OffsetResetStrategy enum
- Adds new internal class AutoOffsetResetStrategy
- Replaces all OffsetResetStrategy enum usages with AutoOffsetResetStrategy
- Deprecates old and adds new constructors to MockConsumer
Reviewers: Andrew Schofield <aschofield@confluent.io>, Matthias J. Sax <matthias@confluent.io>
This PR includes the API for KIP-1112 and a partial implementation, which wraps any processors added through the PAPI and the DSL processors that are written to the topology through the ProcessorParameters#addProcessorTo method.
Further PRs will complete the implementation by converting the remaining DSL operators to use the #addProcessorTo method, and future-proof the processor-writing mechanism to prevent new DSL operators from being implemented incorrectly or without the wrapper.
Reviewers: Almog Gavra <almog@responsive.dev>, Guozhang Wang <guozhang.wang.us@gmail.com>
When Kafka Streams skips over corrupted messages, it might not resume previously paused partitions
if more than one record is skipped at once and the buffer drops below the max-buffer limit at the same time.
Reviewers: Matthias J. Sax <matthias@confluent.io>
With KAFKA-17872, we changed some internals that affect the conditions
of this test, introducing a race condition in when the expected log
messages are printed.
This PR adds additional wait-conditions to the test to close the race
condition.
Reviewers: Bill Bejeck <bill@confluent.io>
This PR updates the key for storing the KIP-714 client instance id for the global consumer to follow a more consistent pattern of the other embedded Kafka Streams consumer clients.
Reviewers: Matthias Sax <mjsax@apache.org>
Following KIP-1033, a FailedProcessingException is passed to the Streams-specific uncaught exception handler.
The goal of this PR is to unwrap a FailedProcessingException into a StreamsException when an exception occurs during the flushing or closing of a store.
Reviewer: Bruno Cadonna <cadonna@apache.org>
The thread that evaluates the gauge for oldest-iterator-open-since-ms runs concurrently
with threads that open/close iterators (stream threads and interactive query threads). This PR
fixes a race condition between `openIterators.isEmpty()` and `openIterators.first()` by catching
a potential exception. Because we expect the race condition to be rare, we prefer catching the
exception over introducing a guard via locking.
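A minimal pure-JDK sketch of this pattern (class and field names are illustrative, not the Streams implementation):

```java
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListSet;

// Sketch: openIterators may be emptied by another thread between isEmpty()
// and first(), so the gauge catches NoSuchElementException rather than
// guarding the pair of calls with a lock.
public class OldestIteratorGauge {
    // set of open-iterator timestamps, ordered oldest-first
    static final ConcurrentSkipListSet<Long> openIterators = new ConcurrentSkipListSet<>();

    static long oldestOpenSinceMs() {
        if (openIterators.isEmpty()) {
            return -1L; // no iterator open
        }
        try {
            return openIterators.first();
        } catch (NoSuchElementException swallow) {
            // another thread closed the last iterator in between;
            // the race is expected to be rare, so swallowing is cheap
            return -1L;
        }
    }

    public static void main(String[] args) {
        System.out.println(oldestOpenSinceMs());
    }
}
```

Catching the exception keeps the hot open/close path lock-free; the gauge only pays for the race when it actually happens.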
Reviewers: Matthias J. Sax <matthias@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
This commit adds AK 3.9 to the system tests on trunk.
Follow-up of #17797
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Bruno Cadonna <cadonna@apache.org>
Remove or update custom display names to make sure we actually include the test method as the first part of the display name.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Bill Bejeck <bill@confluent.io>
Implementation of https://cwiki.apache.org/confluence/display/KAFKA/KIP-1102%3A+Enable+clients+to+rebootstrap+based+on+timeout+or+error+code
- Introduces rebootstrap trigger interval config metadata.recovery.rebootstrap.trigger.ms, set to 5 minutes by default
- Makes rebootstrap the default for metadata.recovery.strategy
- Adds new error code REBOOTSTRAP_REQUIRED, introduces top-level error code in metadata response. On this error, clients rebootstrap.
- Configs apply to producers, consumers, share consumers, admin clients, Connect and KStreams clients.
Reviewers: Apoorv Mittal <apoorvmittal10@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
The logs for Kafka Streams local state restoration incorrectly refer to the StreamThread instead of the StateUpdater thread, which is responsible for decoupling the restoration process. The restore consumer also references StreamThread instead of StateUpdater.
This commit corrects the log message for more clarity.
Reviewer: Bruno Cadonna <cadonna@apache.org>
When we introduced "startup tasks" in #16922, we initialized them with
no input partitions, because they aren't known until assignment.
However, when we update them during assignment, it's possible that we
update the topology with the incorrect source topics for some internal
topics, due to a difference in the way internal topics are handled for
StandbyTasks.
To resolve this, we now initialize startup tasks with the correct input
partitions, by calculating them from the Topology.
When we assign our startup tasks, we now conditionally update their
input partitions only if they've actually changed, just as we do for
regular StandbyTasks.
With this, the E2E tests now pass, as expected.
Reviewer: Bruno Cadonna <cadonna@apache.org>
TimestampExtractor allows dropping records by returning a timestamp of -1. For this case, we still need to update consumed offsets to allow us to commit progress.
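The intended behavior can be sketched in plain Java (this is an analogue, not the Streams code; the class and map names are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fix: a record whose extracted timestamp is -1 is dropped
// from processing, but its offset is still recorded so that progress over
// the dropped record can be committed.
public class DroppedRecordOffsets {
    static final Map<String, Long> consumedOffsets = new HashMap<>();

    static void poll(String partition, long offset, long extractedTimestamp) {
        // update the consumed offset even for dropped records
        consumedOffsets.put(partition, offset);
        if (extractedTimestamp < 0) {
            return; // record dropped; no further processing
        }
        process(extractedTimestamp);
    }

    static void process(long ts) { /* downstream processing elided */ }

    public static void main(String[] args) {
        poll("p0", 3L, -1L);
        System.out.println(consumedOffsets);
    }
}
```

Without the offset update, a stream consisting only of dropped records would never commit and would be fully re-consumed on restart.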
Reviewers: Bill Bejeck <bill@confluent.io>
Fixes an issue with the TTD in the specific case where users don't specify an initial time for the driver and also don't specify a start timestamp for the TestInputTopic, then pipe input records without timestamps. This combination results in a slight mismatch in the expected timestamps for the piped records, which can be noticeable when writing tests with very small time deltas.
The problem is that, while both the TTD and the TestInputTopic will be initialized to the "current time" when not otherwise specified, it's possible for some milliseconds to have passed between the creation of the TTD and the creation of the TestInputTopic. This can result in a TestInputTopic getting a start timestamp that's several ms larger than the driver's time, and ultimately causing the piped input records to have timestamps slightly in the future relative to the driver.
In practice, even those who hit this issue might not notice it if they aren't manipulating time in their tests, or are advancing time by enough to negate the several milliseconds of difference. However, we noticed a test fail due to this because we were testing a ttl-based processor and had advanced the driver time by only 1 millisecond past the ttl. The piped record should have been expired, but because its timestamp was a few milliseconds larger than the driver's start time, the test ended up failing.
Reviewers: Matthias Sax <mjsax@apache.org>, Bruno Cadonna <cadonna@apache.org>, Lucas Brutschy <lbrutschy@confluent.io>
Kafka Streams actively purges records from repartition topics. Prior to this PR, Kafka Streams would retrieve the offset from the consumedOffsets map, but there are a couple of edge cases where consumedOffsets can get ahead of the committedOffsets map. In these cases, Kafka Streams could purge a repartition record before it is committed.
Updated the current StreamTask test to cover this case.
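One plausible shape of the fix, sketched in plain Java (names are illustrative and this is not the StreamTask code): the purgeable offset must never pass the committed offset, even when consumption has run ahead.

```java
import java.util.Map;

// Sketch: bound the purgeable offset by the committed offset, so a
// repartition record can never be purged before it is committed.
public class PurgeOffsets {
    static long purgeableOffset(Map<String, Long> consumed,
                                Map<String, Long> committed,
                                String topicPartition) {
        long consumedOffset = consumed.getOrDefault(topicPartition, 0L);
        long committedOffset = committed.getOrDefault(topicPartition, 0L);
        // consumed can run ahead of committed in edge cases;
        // purging must not pass the committed offset
        return Math.min(consumedOffset, committedOffset);
    }

    public static void main(String[] args) {
        System.out.println(purgeableOffset(Map.of("t-0", 10L), Map.of("t-0", 7L), "t-0"));
    }
}
```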
Reviewers: Matthias Sax <mjsax@apache.org>
Implementation of KIP-1076 to allow for adding client application metrics to the KIP-714 framework
Reviewers: Apoorv Mittal <amittal@confluent.io>, Andrew Schofield <aschofield@confluent.io>, Matthias Sax <mjsax@apache.org>
Updates KafkaStreams.clientInstanceIds method to correctly populate the client-id -> clientInstanceId map that was altered in a previous refactoring.
Added a test that confirms ClientInstanceIds is correctly storing consumer and producer instance ids
Reviewers: Matthias Sax <mjsax@apache.org>
Instead of waiting until Tasks are assigned to us, we pre-emptively
create a StandbyTask for each non-empty Task directory found on-disk.
We do this before starting any StreamThreads, and on our first
assignment (after joining the consumer group), we recycle any of these
StandbyTasks that were assigned to us, either as an Active or a
Standby.
We can't just use these "initial Standbys" as-is, because they were
constructed outside the context of a StreamThread, so we first have to
update them with the context (log context, ChangelogReader, and source
topics) of the thread they have been assigned to.
The motivation for this is to (in a later commit) read StateStore
offsets for unowned Tasks from the StateStore itself, rather than the
.checkpoint file, which we plan to deprecate and remove.
There are a few additional benefits:
Initializing these Tasks on start-up, instead of on-assignment, will
reduce the time between a member joining the consumer group and beginning
processing. This is especially important when active tasks are being moved
over, for example, as part of a rolling restart.
If a Task has corrupt data on-disk, it will be discovered on startup and
wiped under EOS. This is preferable to wiping the state after being
assigned the Task, because another instance may have non-corrupt data and
would not need to restore (as much).
There is a potential performance impact: we open all on-disk Task
StateStores, and keep them all open until we have our first assignment.
This could require large amounts of memory, in particular when there are
a large number of local state stores on-disk.
However, since old local state for Tasks we don't own is automatically
cleaned up after a period of time, in practice, we will almost always
only be dealing with the state that was last assigned to the local
instance.
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>, Bruno Cadonna <cadonna@apache.org>, Matthias Sax <mjsax@apache.org>
This PR augments Streams messages with the leader epoch. In case of empty buffer queues, the last offset and leader epoch are retrieved from the streams task's cache of nextOffsets.
Co-authored-by: Lucas Brutschy <lbrutschy@confluent.io>
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Matthias J. Sax <matthias@confluent.io>
This PR implements KIP-1094.
Reviewers: Andrew Schofield <aschofield@confluent.io>, Kirk True <ktrue@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>
With EOSv1 removed, we don't need to collect producer's clientInstanceIds per task any longer. While we never completed this feature, we can remove the corresponding scaffolding code.
Reviewers: Bill Bejeck <bill@confluent.io>
This PR leverages the updates brought by KIP-1033 to get the name of the processor node which raised a processing exception and display it in the stacktrace instead of the source node.
Reviewer: Bruno Cadonna <cadonna@apache.org>
Integration tests should run with either at-least-once or exactly-once.
There is no need to run them twice.
This PR removes the corresponding test parameters and picks either one of both.
Reviewers: Matthias J. Sax <matthias@confluent.io>
Deprecate for 4.0. This feature will be phased out; new functionality will generally not support named topologies, so users are encouraged to migrate off of named topologies as soon as possible.
Those who are using this experimental feature are encouraged to reach out by filing a Jira ticket so we can better understand your use case and how to support it going forward.
Reviewers: Matthias Sax <mjsax@apache.org>
This PR disables metrics push for the AdminClient by default. KafkaStreams enables metrics push for its internal AdminClient.
Tests are included that assert an error if a user disables either the main consumer or admin client metrics push while the Kafka Streams metrics push config is enabled.
Reviewers: Matthias Sax <mjsax@apache.org>
Unit test shouldRestoreSingleActiveStatefulTask() in DefaultStateUpdaterTest is flaky.
The flakiness comes from the fact that the state updater thread could call changelogReader.allChangelogsCompleted() for the first time before it calls changelogReader.completedChangelogs() for the first time. That happens if runOnce() runs before the state updater thread reads a task from the input queue.
This commit fixes the flakiness by making the stubbing of changelogReader.allChangelogsCompleted() depend on the stubbing of changelogReader.completedChangelogs().
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Matthias J. Sax <matthias@confluent.io>
With EOSv1 removal, we don't have producer-per-task any longer,
and thus can remove the corresponding code which handles task producers.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Bill Bejeck <bill@confluent.io>
We don't pass in a client-supplier into `StreamsProducer` any longer,
so we can simplify `StreamsProducerTest` and remove client-supplier.
Reviewers: Bill Bejeck <bill@confluent.io>
This PR adds a Reporter instance that will add streams client metrics to the telemetry pipeline.
For testing, the PR adds a unit test.
Reviewers: Matthias Sax <mjsax@apache.org>
This PR adds a Reporter instance that will add streams thread metrics to the telemetry pipeline.
For testing, the PR adds a unit test.
Reviewers: Matthias Sax <mjsax@apache.org>
This patch completely removes the compile-time dependency on core for both test and main sources by introducing two new modules.
1) `test-common` include all the common test implementation code (including dependency on :core for BrokerServer, ControllerServer, etc)
2) `test-common:api` new sub-module that just includes interfaces including our junit extension
Reviewers: David Arthur <mumrah@gmail.com>
When calling KafkaStreams#close from teardown methods in integration tests, we need to pass a timeout to avoid potentially blocking forever during teardown.
Reviewers: Matthias J. Sax <matthias@confluent.io>
This PR implements exponential backoff for failed initializations of tasks due to lock exceptions. It increases the time between two consecutive attempts of initializing the tasks.
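The backoff schedule can be sketched in a few lines of plain Java. The constants and names below are illustrative only, not the ones used in Streams:

```java
// Sketch of exponential backoff between task-initialization attempts that
// fail with a lock exception: the delay doubles per failed attempt, up to
// a cap, increasing the time between consecutive attempts.
public class LockBackoff {
    static final long INITIAL_BACKOFF_MS = 50L;   // illustrative value
    static final long MAX_BACKOFF_MS = 1_000L;    // illustrative cap

    static long backoffForAttempt(int failedAttempts) {
        // shift is clamped so the multiplication cannot overflow a long
        long backoff = INITIAL_BACKOFF_MS << Math.min(failedAttempts, 30);
        return Math.min(backoff, MAX_BACKOFF_MS);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt + " -> " + backoffForAttempt(attempt) + "ms");
        }
    }
}
```

The cap keeps a long-held lock from pushing retries out indefinitely, while the doubling avoids hammering the lock on every loop iteration.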
Reviewer: Bruno Cadonna <cadonna@apache.org>
Implements KIP-1087
Reviewers: Matthias J. Sax <matthias@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
With EOSv1 removed, we don't need to create a producer per task, and thus can simplify the code by removing KafkaClientSupplier from the deeply nested StreamsProducer.
Reviewers: Bill Bejeck <bill@confluent.io>
The existing check is not correct, because the `byte` range is -128...127.
This PR fixes the check to use `< 0`.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Found two issues in the flaky tests (log analysis is under the Jira comments):
1) The error java.nio.file.DirectoryNotEmptyException occurs if the flush() of kafkaStreams.close() and purgeLocalStreamsState() are triggered at the same time. (The current timeout is 5 sec, which is too short since the CI is unstable and slow.)
2) Racing issue: tasks to be restored in ks-1 are rebalanced to ks-2 before entering the active restoring state, so no onRestoreSuspend() was triggered.
To solve the issues:
1) Remove the timeout in kafkaStreams.close()
2) Ensure all tasks in ks-1 are actively restoring before starting the second KafkaStreams (ks-2)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Failed tasks discovered when removed from the state updater during assignment or revocation are added to the task registry. From there they are retrieved and handled as normal tasks. This leads to a couple of IllegalStateExceptions because it breaks some invariants that ensure that only good tasks are assigned and processed.
This commit fixes the bug by distinguishing failed from non-failed tasks in the task registry.
Reviewer: Lucas Brutschy <lbrutschy@confluent.io>
This PR implements exponential backoff for state directory lock to increase the time between two consecutive attempts of acquiring the lock.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
Implements KIP-1056:
- deprecates default.deserialization.exception.handler in favor of deserialization.exception.handler
- deprecates default.production.exception.handler in favor of production.exception.handler
Reviewers: Matthias J. Sax <matthias@confluent.io>
As described in KAFKA-14460, one of the functional requirements of KeyValueStore is that "The returned iterator must not return null values" on methods which return iterator.
This is not completely the case today for InMemoryKeyValueStore. To iterate over the store, we copy the keySet in order not to block access for other threads. However, entries that are removed from the store after the iterator is initialized are returned with null values by the iterator.
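A simplified pure-JDK sketch of the problem and the fix (this is not the InMemoryKeyValueStore code; the class is made up for illustration): the iterator walks a snapshot of the keys, so a key removed after iterator creation resolves to a null value, and the fix is to skip such entries instead of returning them.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

// Sketch: iterate over a snapshot of the keys, skipping entries whose
// value has become null because the key was removed concurrently.
public class SkippingIterator implements Iterator<String> {
    private final Map<String, String> store;
    private final Iterator<String> keys;
    private String nextValue;

    SkippingIterator(TreeMap<String, String> store) {
        this.store = store;
        // snapshot of the keys so other threads are not blocked
        this.keys = new TreeMap<>(store).keySet().iterator();
    }

    @Override
    public boolean hasNext() {
        while (nextValue == null && keys.hasNext()) {
            // null if the key was removed after the snapshot was taken
            nextValue = store.get(keys.next());
        }
        return nextValue != null;
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String value = nextValue;
        nextValue = null;
        return value;
    }
}
```

The look-ahead in hasNext() is what guarantees the "must not return null values" contract while keeping the snapshot approach.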
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
This test has a tricky race condition. We want the restoration to go slow enough so that when a second Kafka Streams instance starts, the restoration of a given TopicPartition pauses due to task re-assignment. But after that point, we'd like the test to proceed faster to avoid any timeout assertions. To that end, here are the changes in this PR:
Increase the restore pause to 2 seconds; this should slow the restoration enough so that the process is still in progress once the second instance starts. But once tasks are re-assigned and onRestorePause is called, the restore pause is decremented to zero, allowing the test to proceed faster.
Increase the restore batch to its original value of 5; otherwise, the test moved too slowly.
Decrease the number of test records produced to the original value of 100: with each batch's restore slowed until Kafka Streams calls onRestorePause, and the intentional slowness then removed, 100 records proved good enough in local testing.
Reviewers: Matthias J. Sax <mjsax@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>, Yu-Lin Chen <kh87313@gmail.com>
Part of KIP-1033.
Co-authored-by: Dabz <d.gasparina@gmail.com>
Co-authored-by: loicgreffier <loic.greffier@michelin.com>
Reviewers: Matthias J. Sax <matthias@confluent.io>
Increased the number of records while decreasing the restore batch size to ensure the restoration does not complete before the second Kafka Streams instance starts up.
Reviewers: Matthias J. Sax <mjsax@apache.org>
KAFKA-17100 changed the behavior of GlobalStreamThread, introducing a race condition for state changes that was exposed by failing (flaky) tests in GlobalStreamThreadTest.
This PR moves the state transition to fix the race condition.
Reviewers: Bill Bejeck <bbejeck@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
We currently use a `CountDownLatch` to signal when a thread has
completed shutdown to the blocking `shutdown` method. However, this
latch triggers _before_ the thread has fully exited.
Dependent on the OS thread scheduling, it's possible that this thread
will still be "alive" after the latch has unblocked the `shutdown`
method.
In practice, this is mostly a problem for `StreamThreadTest`, which now
checks that there are no `TaskExecutor` or `StateUpdater` threads
immediately after shutting them down.
Sometimes, after shutdown returns, we find that these threads are still
"alive", usually completing execution of the "thread shutdown" log
message, or even the `Thread#exit` JVM method that's invoked to clean up
threads just before they exit. This causes sporadic test failures, even
though these threads did indeed shutdown correctly.
Instead of using a `CountDownLatch`, let's just await the thread to exit
directly, using `Thread#join`. Just as before, we set a timeout, and if
the Thread is still alive after the timeout, we throw a
`StreamsException`, maintaining the contract of the `shutdown` method.
There should be no measurable impact on production code here. This will
mostly just improve the reliability of tests that require that these
threads have fully exited after calling `shutdown`.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>, Bruno Cadonna <cadonna@apache.org>
Streams is not compatible with the new consumer rebalance protocol proposed in KIP-848. Thus, Streams should set/override config group.protocol to classic at startup to ensure that the classic protocol is used.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
testStateGlobalThreadClose() fails sometimes, with an unclear root
cause. This PR is an attempt to fix it by cleaning up and improving the
test code across the board.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
The branch method in both the Java and Scala KStream classes was deprecated in version 2.8:
1) org.apache.kafka.streams.scala.kstream.KStream#branch
2) org.apache.kafka.streams.kstream.KStream#branch(org.apache.kafka.streams.kstream.Predicate<? super K,? super V>...)
3) org.apache.kafka.streams.kstream.KStream#branch(org.apache.kafka.streams.kstream.Named, org.apache.kafka.streams.kstream.Predicate<? super K,? super V>...)
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
- StreamsConfig#RETRIES_CONFIG was deprecated in AK 2.7 and is no longer in use.
- StreamsConfig#DEFAULT_WINDOWED_KEY_SERDE_INNER_CLASS and StreamsConfig#DEFAULT_WINDOWED_VALUE_SERDE_INNER_CLASS were deprecated in AK 3.0.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>