This is a port of #5605 for the 11.3 branch
Reviewers: John Roesler <john@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
This is a cherry-pick PR from #5207
1. add the committed offsets to checkpointable offset map.
2. add the restoration integration test for the source KTable case.
Now that we support re-initializing state stores, we need to clear the segments when the store is closed so that they can be re-opened.
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Bill Bejeck <bbejeck@gmail.com>, Guozhang Wang <wangguoz@gmail.com>, Ted Yu <yuzhihong@gmail.com>
Closes#4324 from dguy/kafka-6360
As titled, not starting new transaction since during restoration producer would have not activity and hence may cause txn expiration. Also delay starting new txn in resuming until initializing topology.
Reviewers: Matthias J. Sax <mjsax@apache.org>, Bill Bejeck <bill@confluent.io>
Initialize topology after state store restoration.
Although IMHO updating some of the existing tests demonstrates the correct order of operations, I'll probably add an integration test, but I wanted to get this PR in for feedback on the approach.
Author: Bill Bejeck <bill@confluent.io>
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>
Closes#4415 from bbejeck/KAFKA-6205_restore_state_stores_before_initializing_topology
minor log4j edits
As Frederic reported on mailing list under the subject "kafka-streams Invalid transition attempted from state READY to state ABORTING_TRANSACTION", producer#abortTransaction should only be called when transactionInFlight is true.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <matthias@confluent.io>
Previously, we failed to remove sensors from the parentSensors map, effectively a memory leak.
Add a test to verify that removed sensors get removed from the underlying registry as well as the parentSensors map.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#4284 from mjsax/minor-improve-eos-docs
Use TestUtil test directory for state directory instead of default /tmp/kafka-streams
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Damian Guy <damian.guy@gmail.com>
Closes#4246 from mjsax/improve-flaky-streams-tests
Clarify that state directory must use `storeName`
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#4228 from mjsax/minor-state-store-javadoc
(cherry picked from commit b604540fbd)
Signed-off-by: Guozhang Wang <wangguoz@gmail.com>
Calculate offset using consumer.position() in GlobalStateManagerImp#restoreState
Author: Alex Good <alexjsgood@gmail.com>
Reviewers: Matthias J. Sax <matthias@confluent.io>, Damian Guy <damian.guy@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
Closes#4197 from alexjg/0.11.0
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Damian Guy <damian.guy@gmail.com>
Closes#4186 from guozhangwang/K6179-cleanup-timestamp-tracker-on-clear
(cherry picked from commit ee1aaa091f)
Signed-off-by: Guozhang Wang <wangguoz@gmail.com>
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Damian Guy <damian.guy@gmail.com>, Xavier Léauté <xavier@confluent.io>
Closes#4130 from guozhangwang/KHotfix-0110-remove-logging
Mirror of #4096 for 0.11.01.
During the restoration phase, when thread state is still in PARTITION_ASSIGNED not RUNNING yet, call poll() on the normal consumer with 0 millisecond timeout, to unblock the restoration of other tasks as soon as possible.
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Bill Bejeck <bill@confluent.io>, Damian Guy <damian.guy@gmail.com>, Matthias J. Sax <matthias@confluent.io>, Xavier Léauté <xavier@confluent.io>
Closes#4085 from guozhangwang/KHotfix-0110-restore-only
A couple of root causes of this flaky test is fixed:
1. The MockTime was incorrectly used across multiple test methods within the class, as a class rule. Instead we set it on each test case; also remove the scala MockTime dependency.
2. List topics may not contain the deleted topics while their ZK paths are yet to be deleted; so the delete-check-recreate pattern may fail to successfully recreate the topic at all. Change the checking to read from zk path directly instead.
Another minor fix is to remove the misleading wait condition error message as the accumData is always empty.
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Bill Bejeck <bill@confluent.io>, Damian Guy <damian.guy@gmail.com>, Matthias J. Sax <matthias@confluent.io>
Closes#4095 from guozhangwang/KMinor-reset-integration-test
(cherry picked from commit d3f24798f9)
Signed-off-by: Guozhang Wang <wangguoz@gmail.com>
package name: org.apache.kafka.streams.state.internals
Minor change to approximateNumEntries() method in CompositeReadOnlyKeyValueStore class.
long total = 0;
for (ReadOnlyKeyValueStore<K, V> store : stores) {
total += store.approximateNumEntries();
}
return total < 0 ? Long.MAX_VALUE : total;
The check for negative value seems to account for wrapping. However, wrapping can happen within the for loop. So the check should be performed inside the loop.
Author: siva santhalingam <ssanthalingam@netskope.com>
Reviewers: Matthias J. Sax <matthias@confluent.io>, Damian Guy <damian.guy@gmail.com>
Closes#3988 from shivsantham/trunk
(cherry picked from commit 5afeddaa99)
Signed-off-by: Damian Guy <damian.guy@gmail.com>
When logging is disabled and there are state stores the task never transitions from restoring to running. This is because we only ever check if the task has state stores and return false on initialization if it does. The check should be if we have changelog partitions, i.e., we need to restore.
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, tedyu <yuzhihong@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes#3983 from dguy/restore-test
(cherry picked from commit 3107a6c5c8)
Signed-off-by: Damian Guy <damian.guy@gmail.com>
This is the backport of #3748 for trunk
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Damian Guy <damian.guy@gmail.com>
Closes#3806 from guozhangwang/K5797-handle-metadata-available-0110
This is a manual cherry-pick of https://github.com/apache/kafka/pull/3769 for 0.11.0
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Damian Guy <damian.guy@gmail.com>, Bill Bejeck <bill@confluent.io>, Matthias J. Sax <matthias@confluent.io>
Closes#3771 from guozhangwang/KMinor-logging-improvements
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#3793 from dguy/sign-mvn-jars
(cherry picked from commit d78eb03fad)
Signed-off-by: Guozhang Wang <wangguoz@gmail.com>
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Ted Yu <yuzhihong@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
Closes#3779 from mjsax/kafka-5818-kafkaStreams-state-transition-01101
If a task fails during initialization due to a LockException, its changelog partitions are not immediately added to the StoreChangelogReader as the thread doesn't hold the lock. However StoreChangelogReader#restore will be called and it sets the initialized flag. On a subsequent successfull call to initialize the new tasks the partitions are added to the StoreChangelogReader, however as it is already initialized these new partitions will never be restored. So the task would remain in a non-running state forever.
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#3747 from dguy/kafka-5787-0.11
Suggested fix for the bug
Author: radzish <radzish@gmail.com>
Reviewers: Damian Guy <damian.guy@gmail.com>
Closes#3737 from radzish/KAFKA-5771
(cherry picked from commit 05e3850b2e)
Signed-off-by: Damian Guy <damian.guy@gmail.com>
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>, Damian Guy <damian.guy@gmail.com>
Closes#3722 from mjsax/kafka-5603-dont-abort-tx-for-zombie-tasks-01101
In `onPartitionsAssigned`:
1. release all locks for non-assigned suspended tasks.
2. resume any suspended tasks.
3. Create new tasks, but don't attempt to take the state lock.
4. Pause partitions for any new tasks.
5. set the state to `PARTITIONS_ASSIGNED`
In `StreamThread#runLoop`
1. poll
2. if state is `PARTITIONS_ASSIGNED`
2.1 attempt to initialize any new tasks, i.e, take out the state locks and init state stores
2.2 restore some data for changelogs, i.e., poll once on the restore consumer and return the partitions that have been fully restored
2.3 update tasks with restored partitions and move any that have completed restoration to running
2.4 resume consumption for any tasks where all partitions have been restored.
2.5 if all active tasks are running, transition to `RUNNING` and assign standby partitions to the restoreConsumer.
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Bill Bejeck <bill@confluent.io>
Closes#3653 from dguy/0.11.0-restore-on-poll
Cherry picked from https://github.com/apache/kafka/pull/3432
Author: Eno Thereska <eno.thereska@gmail.com>
Reviewers: Damian Guy <damian.guy@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
Closes#3622 from enothereska/KAFKA-5571-0.11
Backported from trunk: https://github.com/apache/kafka/pull/3516
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Eno Thereska <eno.thereska@gmail.com>
Closes#3654 from dguy/cherry-pick-stream-thread-cleanup
cache eviction logging at debug level is too high volume. This was already done on trunk but didn't make it into 0.11
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#3652 from dguy/minor-cache-log-level
Fixed a bug in the InMemoryKeyValueStore restoration where a key with a `null` value is written in to the map rather than being deleted.
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Bill Bejeck <bbejeck@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
Closes#3650 from dguy/kafka-5717
(cherry picked from commit c35c479813)
Signed-off-by: Damian Guy <damian.guy@gmail.com>
A couple of fixes to metric names to match the KIP
- Removed extra strings in the metric names that are already in the tags
- add a separate metric for "all"
Author: Eno Thereska <eno.thereska@gmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#3491 from enothereska/hotfix-metric-names
(cherry picked from commit 6bee1e9e57)
Signed-off-by: Guozhang Wang <wangguoz@gmail.com>
1. Capture `CommitFailedException` in `StreamThread#suspendTasksAndState`.
2. Remove `Cache` from AbstractTask as it is not needed any more; remove not used cleanup related variables from StreamThread (cc dguy to double check).
3. Also fix log4j outputs for error and warn, such that for WARN we do not print stack trace, and for ERROR we remove the dangling colon since the exception stack trace will start in newline.
4. Update one log4j entry to always print as WARN for errors closing a zombie task (cc mjsax ).
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Matthias J. Sax <matthias@confluent.io>, Damian Guy <damian.guy@gmail.com>
Closes#3574 from guozhangwang/KHotfix-handle-commit-failed-exception-in-suspend
(cherry picked from commit 228a4fdb6d)
Signed-off-by: Guozhang Wang <wangguoz@gmail.com>
fix unit test
`MeteredKeyValueStore` wasn't thread safe. Interleaving operations could modify the state, i.e, the `key` and/or `value` which could result in incorrect behaviour.
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#3588 from dguy/hotfix-metered-kv-store
(cherry picked from commit 4059fa5763)
Signed-off-by: Damian Guy <damian.guy@gmail.com>