This patch copies over existing metrics and adds new consumer group metrics to the new GroupCoordinatorService.
Now that each coordinator is responsible for a topic partition, this patch introduces a GroupCoordinatorMetrics that records gauges for global metrics such as the number of generic groups in the PreparingRebalance state. For GroupCoordinatorShard-specific metrics, GroupCoordinatorMetrics activates new GroupCoordinatorMetricsShards that are responsible for incrementing/decrementing TimelineLong objects; the totals are then aggregated across all shards.
As the CoordinatorRuntime/CoordinatorShard does not care about group metadata, we have introduced CoordinatorMetrics.java/CoordinatorMetricsShard.java so that transaction coordinator metrics can be onboarded in a similar fashion in the future.
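A minimal sketch of the shard aggregation idea (illustrative names; plain counters stand in for the TimelineLong-backed values and the real metrics registration is omitted):
```
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

// Each shard owns its own counters; the global metrics object sums them for gauges.
class MetricsShardSketch {
    final AtomicLong groupsPreparingRebalance = new AtomicLong();

    void incrementPreparingRebalance() { groupsPreparingRebalance.incrementAndGet(); }
    void decrementPreparingRebalance() { groupsPreparingRebalance.decrementAndGet(); }
}

class CoordinatorMetricsSketch {
    private final List<MetricsShardSketch> shards = new CopyOnWriteArrayList<>();

    // Called when a coordinator shard for a __consumer_offsets partition is activated/deactivated.
    void activateShard(MetricsShardSketch shard) { shards.add(shard); }
    void deactivateShard(MetricsShardSketch shard) { shards.remove(shard); }

    // A global gauge reads the sum across all active shards.
    long numGroupsPreparingRebalance() {
        return shards.stream().mapToLong(s -> s.groupsPreparingRebalance.get()).sum();
    }
}
```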
Main files to look at:
GroupCoordinatorMetrics.java
GroupCoordinatorMetricsShard.java
CoordinatorMetrics.java
CoordinatorMetricsShard.java
CoordinatorRuntime.java
Metrics to add after #14408 is merged:
offset deletions sensor (OffsetDeletions); Meter(offset-deletion-rate, offset-deletion-count)
Metrics to add after https://issues.apache.org/jira/browse/KAFKA-14987 is merged:
offset expired sensor (OffsetExpired); Meter(offset-expiration-rate, offset-expiration-count)
Reviewers: Justine Olshan <jolshan@confluent.io>
Instead of only sending failed log directory UUIDs in the heartbeat
request until a successful response is received, the broker sends
the full cumulative set of failed directories since startup time.
This aims to simplify the handling of log directory failures on the
controller side, given the overload-mode handling of heartbeat
requests, which returns an undifferentiated reply.
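A minimal sketch of the cumulative-set approach (hypothetical names): failed directory UUIDs are accumulated for the lifetime of the process, and the full set is included in every heartbeat rather than being dropped once a successful response is received.
```
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FailedDirTracker {
    private final Set<String> cumulativeFailedDirs = new HashSet<>();

    synchronized void onLogDirFailure(String dirUuid) {
        cumulativeFailedDirs.add(dirUuid); // accumulated since startup, never cleared
    }

    synchronized List<String> dirsForNextHeartbeat() {
        return new ArrayList<>(cumulativeFailedDirs); // always the full cumulative set
    }
}
```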
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>
This patch contains a few small clean-ups in LogValidator and associated classes:
1. Set shallowOffsetOfMaxTimestamp consistently as the last offset in the
batch for v2 compressed and non-compressed data.
2. Rename `RecordConversionStats` to `RecordValidationStats` since one of its
fields `temporaryMemoryBytes` does not depend on conversion.
3. Rename `batchIndex` to `recordIndex` in loops over the records in each batch
inside `LogValidator`.
Reviewers: Qichao Chu <5326144+ex172000@users.noreply.github.com>, Jun Rao <junrao@gmail.com>
When returning fetched records, the code was not properly honouring max.poll.records.
When records had been fetched for multiple topic-partitions, the intent was to accumulate records up to
max.poll.records in total. However, whenever it accumulated records from a partition, it was prepared to add
up to max.poll.records for that partition regardless of how many records it had already accumulated,
rather than reducing the number of records to add based on the records already added from other partitions.
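A minimal sketch of the corrected accumulation (hypothetical types and names): the number of records taken from each partition is capped by how many slots remain overall, not by max.poll.records per partition.
```
import java.util.ArrayList;
import java.util.List;

class MaxPollRecordsSketch {
    static <T> List<T> collectFetchedRecords(List<List<T>> recordsByPartition, int maxPollRecords) {
        List<T> result = new ArrayList<>(maxPollRecords);
        for (List<T> partitionRecords : recordsByPartition) {
            int remaining = maxPollRecords - result.size(); // slots left across all partitions
            if (remaining <= 0)
                break;
            result.addAll(partitionRecords.subList(0, Math.min(remaining, partitionRecords.size())));
        }
        return result;
    }
}
```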
I added a test case which failed before the code change in the PR and passes afterwards.
Reviewers: Kirk True <ktrue@confluent.io>, Justine Olshan <jolshan@confluent.io>
Building AssignReplicasToDirsRequestData relies on iteration over Map entries, which can produce an inconsistent ordering. The order does not matter to the semantics of the request, but it can cause issues with test assertions. This issue was introduced in #14369.
Reviewers: Divij Vaidya <diviv@amazon.com>, Ismael Juma <ismael@juma.me.uk>, David Jacot <djacot@confluent.io>
Updates for the client state machine and the Heartbeat and Commit managers, to support the states and transitions required as the member interacts with a consumer group based on heartbeat requests.
Includes support for the subscribe and unsubscribe API calls, which drive the full lifecycle of a member as it joins a group, reconciles assignments, commits offsets if needed, handles fencing and fatal errors, and leaves the group.
This also includes support for topic IDs, used only when needed for the reconciliation process.
Reviewers: Kirk True <ktrue@confluent.io>, Philip Nee <pnee@confluent.io>, Andrew Schofield <aschofield@confluent.io>, David Jacot <djacot@confluent.io>
When a node starts or restarts, we send a CREATE_TOPICS request to the controller to create the internal __remote_log_metadata topic.
Topic creation is a costly event handled by the controller. During a rebalance, the controller can have pending requests in its queue, which can lead to CREATE_TOPICS timeouts. Instead of firing the CREATE_TOPICS request when a node restarts, send a METADATA request (a topic describe), which is handled by the least-loaded node, before sending the request to create the topic.
Reviewers: Satish Duggana <satishd@apache.org>, Christo Lolov <lolovc@amazon.com>
The commit covers a few important points:
- Exception handling: We should throw RetriableCommitException when the commit exception is retriable. We should throw FencedIdException on commit and poll, similar to the current implementation. Other errors should be thrown as they are.
- Callback invocation: The callbacks need to be invoked on the main/application thread; however, the future is completed in the background thread. To achieve this, I created an Invoker class with a queue, so that these callbacks can be invoked during consumer.poll().
Note: One change I made is to remove the DefaultOffsetCommit callback. Since the callback is purely for logging, I think it is reasonable to move the logging to the background thread instead of relying on the application thread to trigger it.
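A minimal sketch of the Invoker idea (simplified, illustrative names): the background thread enqueues completed commit results and the application thread drains the queue during poll(), so callbacks always run on the application thread.
```
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetCommitCallback;
import org.apache.kafka.common.TopicPartition;

class OffsetCommitCallbackInvoker {
    private final ConcurrentLinkedQueue<Runnable> pendingInvocations = new ConcurrentLinkedQueue<>();

    // Called from the background thread when the commit future completes.
    void enqueue(OffsetCommitCallback callback,
                 Map<TopicPartition, OffsetAndMetadata> offsets,
                 Exception exception) {
        pendingInvocations.add(() -> callback.onComplete(offsets, exception));
    }

    // Called from the application thread inside consumer.poll().
    void invokeAll() {
        Runnable invocation;
        while ((invocation = pendingInvocations.poll()) != null) {
            invocation.run();
        }
    }
}
```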
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
We only move Java classes that have minimal or no dependencies on Scala classes in this PR.
Details:
* Configured `server` module in build files.
* Changed `ControllerRequestCompletionHandler` to be an interface since it has no implementations.
* Cleaned up various import control files.
* Minor build clean-ups for `server-common`.
* Disabled `testAssignmentAggregation` when executed with Java 8, this is an existing issue (see #14794).
For broader context on this change, please check:
* KAFKA-15852: Move server code from `core` to `server` module
Reviewers: Divij Vaidya <diviv@amazon.com>
The topic name was displayed as `Optional<String>` when the topic was created.
```
% bin/kafka-topics.sh --create --topic test --bootstrap-server localhost:9092
Created topic Optional[test].
```
This PR fixes the output to print the topic name as a `String` instead of an `Optional<String>`.
```
% bin/kafka-topics.sh --create --topic test --bootstrap-server localhost:9092
Created topic test.
```
Reviewers: Ismael Juma <ismael@juma.me.uk>
The legacy consumer detects any concurrent operations from different
threads. This was not enforced in the new consumer. To avoid inconsistencies
in the behavior, we enforce the same restriction in the new consumer.
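A minimal sketch of the kind of lightweight guard this refers to (illustrative names, modeled on the legacy consumer's acquire/release pattern):
```
import java.util.ConcurrentModificationException;
import java.util.concurrent.atomic.AtomicLong;

class SingleThreadAccessGuard {
    private static final long NO_THREAD = -1L;
    private final AtomicLong currentThread = new AtomicLong(NO_THREAD);
    private int refCount = 0; // supports re-entrant calls from the owning thread

    void acquire() {
        long threadId = Thread.currentThread().getId();
        if (threadId != currentThread.get() && !currentThread.compareAndSet(NO_THREAD, threadId))
            throw new ConcurrentModificationException("Consumer is not safe for multi-threaded access");
        refCount++;
    }

    void release() {
        if (--refCount == 0)
            currentThread.set(NO_THREAD);
    }
}
```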
Reviewers: Bruno Cadonna <cadonna@apache.org>
The state updater waits on a condition variable if no tasks exist that need to be updated. The condition variable is wrapped in a loop to account for spurious wake-ups, and the check for whether updating tasks exist is done in the condition of the loop. However, only the state updater thread can change whether updating tasks exist, and since the state updater thread is the one waiting on the condition variable, the check will always return the same value for as long as it is waiting. Thus, the check only needs to be done once, before entering the loop.
This commit moves the check before the loop, which also makes the usage of mocks more robust since the processing becomes more deterministic.
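A minimal, illustrative sketch of the resulting structure (hypothetical names, not the actual DefaultStateUpdater code): state that only the waiting thread itself can modify is checked once before the loop, while the loop only re-checks state that other threads can change.
```
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class WaitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition tasksAdded = lock.newCondition();
    private final Queue<Object> inputQueue = new ArrayDeque<>(); // filled by other threads
    private int updatingTasks = 0;                               // modified only by this thread
    private volatile boolean running = true;

    void waitIfNoTasksToUpdate() throws InterruptedException {
        lock.lock();
        try {
            if (updatingTasks == 0) {                  // checked once: cannot change while we wait
                while (running && inputQueue.isEmpty()) {
                    tasksAdded.await();                // loop handles spurious wake-ups
                }
            }
        } finally {
            lock.unlock();
        }
    }
}
```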
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
When the group coordinator does not host any __consumer_offsets partitions, the existing ListGroup implementation won't schedule any operation, thus a `new CompletableFuture<>()` is returned directly and never gets completed. This patch fixes the issue.
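A minimal sketch of the fix (simplified, hypothetical signature): when nothing is scheduled, return an already-completed future instead of one that never completes.
```
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class ListGroupsSketch {
    static CompletableFuture<List<String>> listGroups(List<CompletableFuture<List<String>>> shardFutures) {
        if (shardFutures.isEmpty()) {
            // Before the fix: return new CompletableFuture<>(); // never completed
            return CompletableFuture.completedFuture(Collections.emptyList());
        }
        return CompletableFuture.allOf(shardFutures.toArray(new CompletableFuture[0]))
            .thenApply(nothing -> shardFutures.stream()
                .flatMap(future -> future.join().stream())
                .collect(Collectors.toList()));
    }
}
```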
Reviewers: David Jacot <djacot@confluent.io>
https://github.com/apache/kafka/pull/14369 introduced a compilation error in ReplicaManagerConcurrencyTest for Scala 2.12.
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
A new AssignmentsManager accumulates, batches, and sends KIP-858
assignment events to the Controller. Assignments are sent via
AssignReplicasToDirs requests.
Move QuorumTestHarness.formatDirectories into TestUtils so it can be
used in other test contexts.
Fix a bug in ControllerRegistration.java where the wrong version of the
record was being generated in ControllerRegistration.toRecord.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>, Omnia G H Ibrahim <o.g.h.ibrahim@gmail.com>
This KIP-951 commit was reverted to investigate the org.apache.kafka.tiered.storage.integration.ReassignReplicaShrinkTest test failure (#14738).
A fix for that was merged in #14757, hence unreverting this change.
This reverts commit a98bd7d.
Reviewers: Justine Olshan <jolshan@confluent.io>, Mayank Shekhar Narula <mayanks.narula@gmail.com>
This patch adds the concept of a "Full" UpdateMetadataRequest, similar to what is used in
LeaderAndIsr. A new tagged field is added to UpdateMetadataRequest at version 8 which allows the
KRaft controller to indicate if a UMR contains all the metadata or not. Since UMR is implicitly
treated as incremental by the ZK broker, we needed a way to detect topic deletions when the KRaft
broker sends a metadata snapshot to the ZK broker. By sending a "Full" flag, the broker can now
compare existing topic IDs to incoming topic IDs and calculate which topics should be removed from
the MetadataCache.
This patch only removes deleted topics from the MetadataCache. Partition/log management was
implemented in KAFKA-15605.
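A minimal sketch (with simplified types) of the deletion detection that the "Full" flag enables: any topic currently in the cache that is absent from a full request must have been deleted.
```
import java.util.HashSet;
import java.util.Set;

class FullMetadataDiffSketch {
    static Set<String> topicsToRemove(Set<String> cachedTopicIds, Set<String> topicIdsInFullRequest) {
        Set<String> deleted = new HashSet<>(cachedTopicIds);
        deleted.removeAll(topicIdsInFullRequest);
        return deleted; // remove these from the MetadataCache
    }
}
```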
Reviewers: Colin P. McCabe <cmccabe@apache.org>
A few variables are no longer used but still exist. This commit removes those unused variables.
Co-authored-by: Sagar Rao <sagarrao@Sagars-MacBook-Pro.local>
Reviewers: Lucas Brutschy <lbrutschy@confluent.io>
This patch fixes a bug in the LeaveGroupResponse construction. Basically, when a top-level error is set, no members are expected, but the current check always requires one for versions prior to version 3.
Reviewers: David Jacot <djacot@confluent.io>
The consumer refactoring project introduced another `Consumer` implementation, creating two different, coexisting implementations of the `Consumer` interface:
* `KafkaConsumer` (AKA "existing", "legacy" consumer)
* `PrototypeAsyncConsumer` (AKA "new", "refactored" consumer)
The goal of this task is to refactor the code via the delegation pattern so that we can keep a top-level `KafkaConsumer` but then delegate to another implementation under the covers. There will be two delegates at first:
* `LegacyKafkaConsumer`
* `AsyncKafkaConsumer`
`LegacyKafkaConsumer` is essentially a renamed `KafkaConsumer`. That implementation handles the existing group protocol. `AsyncKafkaConsumer` is renamed from `PrototypeAsyncConsumer` and will implement the new consumer group protocol from KIP-848. Both of those implementations will live in the `internals` sub-package to discourage their use.
This task is part of the work to implement support for the new KIP-848 consumer group protocol.
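A minimal sketch of the delegation pattern described above (greatly simplified: only two methods, and the delegate is injected rather than chosen from configuration as the real constructor would do):
```
import java.time.Duration;
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRecords;

interface ConsumerDelegate<K, V> {
    void subscribe(Collection<String> topics);
    ConsumerRecords<K, V> poll(Duration timeout);
}

class DelegatingKafkaConsumer<K, V> {
    private final ConsumerDelegate<K, V> delegate;

    DelegatingKafkaConsumer(ConsumerDelegate<K, V> delegate) {
        // In the real code the delegate (legacy vs. async) is selected from configuration.
        this.delegate = delegate;
    }

    public void subscribe(Collection<String> topics) {
        delegate.subscribe(topics); // all public API calls forward to the chosen delegate
    }

    public ConsumerRecords<K, V> poll(Duration timeout) {
        return delegate.poll(timeout);
    }
}
```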
Reviewers: Philip Nee <pnee@confluent.io>, Andrew Schofield <aschofield@confluent.io>, David Jacot <djacot@confluent.io>
In 91fa196, I accidentally removed the action queue parameter that was added in 7d147cf. I also renamed the actionQueue so as not to confuse this in the future.
I don't think this broke anything since we don't use verification for group coordinator commits, but I should fix it to be as it was before.
Reviewers: Artem Livshits <alivshits@confluent.io>, Jason Gustafson <jason@confluent.io>
See the motivation in the JIRA description: https://issues.apache.org/jira/browse/KAFKA-15824
This was discovered when ReassignReplicaShrinkTest started to fail with the KIP-951 changes. The KIP-951 changes have since been reverted (PR) and will be put back once this is in.
Reviewers: Walker Carlson <wcarlson@apache.org>, Andrew Schofield <aschofield@confluent.io>
This test fails because `localhost` resolved to more than one IP on CI. This patch updates the ClusterConnectionStatesTest.testSingleIP test to use a static resolver.
Reviewers: Luke Chen <showuon@gmail.com>
This patch fixes the compilation error for JDK 21 introduced in https://github.com/apache/kafka/pull/14708.
Reviewers: Ismael Juma <ismael@juma.me.uk>, David Jacot <djacot@confluent.io>
This reverts commit f38b0d8.
Trying to find the root cause of org.apache.kafka.tiered.storage.integration.ReassignReplicaShrinkTest failing in CI.
Reviewers: Justine Olshan <jolshan@confluent.io>
This is a follow-up from #14659 that ports the windowed classes to use the StoreFactory abstraction as well. A side benefit is that the materialization code is no longer duplicated for each StreamImpl/CogroupedStreamImpl class.
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>, Matthias Sax <mjsax@apache.org>
This patch fixes a compilation error introduced by https://github.com/apache/kafka/pull/14392 for Scala 2.12.
```
> Task :core:compileScala
[Error] /home/jenkins/workspace/Kafka_kafka-pr_PR-14392/core/src/main/scala/kafka/server/BrokerLifecycleManager.scala:305:49: value incl is not a member of scala.collection.immutable.Set[org.apache.kafka.common.Uuid]
```
Reviewers: Luke Chen <showuon@gmail.com>
The latest metadata version is now 3.7. Fix the KRaft upgrade
test to upgrade to that version instead of 3.6.
Change the vagrant setup and gradle dependencies to use 3.3.2 instead of 3.3.1.
Reviewers: David Arthur <mumrah@gmail.com>
BrokerLifecycleManager should send the offline log directories in the BrokerHeartbeatRequests it
sends. Also, when handling BrokerHeartbeatResponses, do so by enqueuing a BrokerLifecycleManager
event, rather than trying to do the handling directly in the callback.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Proven Provenzano <pprovenzano@confluent.io>
Update BrokerRegistration to use a Builder. This fixes the proliferation of different constructors,
and makes it clear what arguments are being used where.
Reviewers: Colin P. McCabe <cmccabe@confluent.io>
meta.properties files are used by Kafka to identify log directories within the filesystem.
Previously, the code for handling them was in BrokerMetadataCheckpoint.scala. This PR rewrites the
code for handling them as Java and moves it to the apache.kafka.metadata.properties namespace. It
also gets rid of the separate types for v0 and v1 meta.properties objects. Having separate types
wasn't so bad back when we had a strict rule that zk clusters used v0 and kraft clusters used v1.
But ZK migration has blurred the lines. Now, a zk cluster may have either v0 or v1, if it is
migrating, and a kraft cluster may have either v0 or v1, at any time.
The new code distinguishes between an individual meta.properties file, which is represented by
MetaProperties, and a collection of meta.properties files, which is represented by
MetaPropertiesEnsemble. It is useful to have this distinction, because in JBOD mode, even if some
log directories are inaccessible, we can still use the ensemble to extract needed information like
the cluster ID. (Of course, even when not in JBOD mode, KRaft servers have always been able to
configure a metadata log directory separate from the main log directory.)
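A minimal sketch of the ensemble idea (simplified, illustrative names, not the actual MetaPropertiesEnsemble API): load whatever meta.properties files are readable and extract shared information such as the cluster ID from the accessible ones, even if some log directories are inaccessible.
```
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Properties;

class MetaPropertiesEnsembleSketch {
    private final List<Properties> accessible = new ArrayList<>();

    void addLogDir(File logDir) {
        File metaFile = new File(logDir, "meta.properties");
        try (FileInputStream in = new FileInputStream(metaFile)) {
            Properties props = new Properties();
            props.load(in);
            accessible.add(props);
        } catch (IOException e) {
            // Inaccessible directories are tolerated; the rest of the ensemble is still usable.
        }
    }

    Optional<String> clusterId() {
        return accessible.stream()
            .map(p -> p.getProperty("cluster.id"))
            .filter(id -> id != null)
            .findFirst();
    }
}
```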
Since we recently added a unique directory.id to each meta.properties file, the previous convention
of passing a "canonical" MetaProperties object for the cluster around to various places in the code
needs to be revisited. After all, we can no longer assume all of the meta.properties files are the
same. This PR fixes these parts of the code. For example, it fixes the constructors of
ControllerApis and RaftManager to just take a cluster ID, rather than a MetaProperties object. It
fixes some other parts of the code, like the constructor of SharedServer, to take a
MetaPropertiesEnsemble object.
Another goal of this PR was to centralize meta.properties validation a bit more and make it
unit-testable. For this purpose, the PR adds MetaPropertiesEnsemble.verify, and a few other
verification methods. These enforce invariants like "the metadata directory must be readable," and
so on.
Reviewers: Igor Soarez <soarez@apple.com>, David Arthur <mumrah@gmail.com>, Divij Vaidya <diviv@amazon.com>, Proven Provenzano <pprovenzano@confluent.io>
The task-level commit metrics were removed without deprecation in KIP-447 and the corresponding PR #8218. However, we forgot to update the docs and to remove the code that creates the task-level commit sensor.
This PR removes the task-level commit metrics from the docs and removes the code that creates the task-level commit sensor. The code was effectively dead since it was only used in tests.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <matthias@confluent.io>
When a task is corrupted, uncorrupted tasks are committed. That is also true for standby tasks. Committing standby tasks actually means that they are checkpointed.
When the state updater is enabled, standbys are owned by the state updater. The stream thread should not checkpoint them when handling corrupted tasks. That is not a big limitation since the state updater periodically checkpoints standbys anyway. Additionally, when handling corrupted tasks, the important thing is to commit active running tasks in order to abort the transaction. Committing standby tasks does not abort any transaction.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>
With the new callback mechanism we were accidentally passing context with the wrong request local. Now include a RequestLocal as an explicit argument to the callback.
Also make the arguments passed through the callback clearer by separating the method out.
Added a test to ensure we use the request handler's request local and not the one passed in when the callback is executed via the request handler.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Divij Vaidya <diviv@amazon.com>, David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>, Artem Livshits <alivshits@confluent.io>, Jun Rao <junrao@gmail.com>
Only advance the HWM for a partition if the ISR size is equal to or above the min ISR config. This patch also sets an upper bound on the min ISR config so it cannot exceed the number of replicas.
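A minimal sketch (hypothetical names) of the two rules described above:
```
final class IsrRules {
    // The high watermark may only advance when the ISR meets the min ISR.
    static boolean canAdvanceHighWatermark(int isrSize, int minIsr) {
        return isrSize >= minIsr;
    }

    // The min ISR config is capped so it never exceeds the number of replicas.
    static int effectiveMinIsr(int configuredMinIsr, int replicationFactor) {
        return Math.min(configuredMinIsr, replicationFactor);
    }
}
```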
Reviewers: David Arthur <mumrah@gmail.com>
Avoid calling cluster.partitionsForTopic(topic) if cluster.availablePartitionsForTopic(topic) is not empty
Reviewers: Divij Vaidya <diviv@amazon.com>, hudeqi <1217150961@qq.com>
This commit fixes a bug in ProduceRequest#partitionSizes() that may cause the method to incorrectly return an empty or incomplete result for a thread when another thread is in the process of initialising it.
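A minimal sketch of a safe lazy initialisation of this kind (illustrative, not the exact Kafka code): the map is fully populated into a local variable and published via a volatile field, so a concurrent reader never observes a partially built result.
```
import java.util.HashMap;
import java.util.Map;

class PartitionSizesSketch {
    private volatile Map<String, Integer> partitionSizes;

    Map<String, Integer> partitionSizes() {
        Map<String, Integer> result = partitionSizes;
        if (result == null) {
            synchronized (this) {
                result = partitionSizes;
                if (result == null) {
                    Map<String, Integer> computed = new HashMap<>();
                    computed.put("topic-0", 42); // placeholder for the real size computation
                    partitionSizes = result = computed; // publish only once fully built
                }
            }
        }
        return result;
    }
}
```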
Reviewers: Divij Vaidya <diviv@amazon.com>, hudeqi <1217150961@qq.com>, vamossagar12 <sagarmeansocean@gmail.com>
--------------------------------
Co-authored-by: fangxiaobing <fangxiaobing@kuaishou.com>
This PR sets up the necessary prerequisites to respect configurations such as dsl.default.store.type and dsl.store.suppliers.class (introduced in #14648) without requiring them to be passed in to StreamBuilder#new(TopologyConfig) (passing them only into new KafkaStreams(...)).
It essentially makes StoreBuilder an external-only API, while internally the StoreFactory equivalent is used. It replaces KeyValueStoreMaterializer with an implementation of StoreFactory that creates the store builder only after configure() is called (in a future PR we will create the missing equivalents for all of the places where the same thing happens for window stores, such as in all the *WindowKStreamImpl classes).
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>
TestUtils.createTopicWithAdmin calls waitForAllPartitionsMetadata which waits for partition(s) to be present in each brokers' metadata cache. This is a sufficient check in ZK mode because the controller sends an LISR request before sending an UpdateMetadataRequest which means that the partition in the ReplicaManager will be updated before the metadata cache.
In KRaft mode, the metadata cache is updated first, so the check may return before partitions and other metadata listeners are fully initialized.
Testing:
Insert a Thread.sleep(100) in BrokerMetadataPublisher.onMetadataUpdate after
// Publish the new metadata image to the metadata cache.
metadataCache.setImage(newImage)
and run EdgeCaseRequestTest.testProduceRequestWithNullClientId; the test will fail locally nearly deterministically. After the change(s), the test no longer fails.
Reviewers: Justine Olshan <jolshan@confluent.io>
The patch includes the following changes as part of KIP-966:
* Allow ISR shrink to empty
* Allow leader election with ELR members
* Allow electing the last known leader
Reviewers: Artem Livshits <alivshits@confluent.io>, David Arthur <mumrah@gmail.com>