Replace BrokerStates.scala with BrokerState.java, to make it easier to use from Java code if needed. This also makes it easier to go from a numeric type to an enum.
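For illustration, a minimal sketch of the shape this enables; the constant names and byte values below are approximate, not the authoritative definitions in BrokerState.java:
```
// Sketch only: a Java enum makes it easy to map a numeric state value
// back to an enum constant. The real constants and values are defined
// in BrokerState.java.
public enum BrokerState {
    NOT_RUNNING((byte) 0),
    STARTING((byte) 1),
    RUNNING((byte) 3);  // values here are illustrative

    private final byte value;

    BrokerState(byte value) {
        this.value = value;
    }

    public byte value() {
        return value;
    }

    public static BrokerState fromValue(byte value) {
        for (BrokerState state : values()) {
            if (state.value == value) {
                return state;
            }
        }
        throw new IllegalArgumentException("Unknown broker state value: " + value);
    }
}
```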
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Tests involving `BrokerToControllerChannelManager` are simplified by being able to leverage `MockClient`. This patch introduces a `MockBrokerToControllerChannelManager` implementation which makes that possible.
The patch updates `ForwardingManagerTest` to use `MockBrokerToControllerChannelManager`. We also add a couple additional timeout cases, which exposed a minor bug. Previously we were using the wrong `TimeoutException`, which meant that expected timeout errors were in fact translated to `UNKNOWN_SERVER_ERROR`.
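For context, Kafka maps exceptions to error codes by exception type, so using the JDK's `TimeoutException` instead of Kafka's falls through to the generic error. A small sketch of the distinction, assuming the standard `Errors.forException` mapping:
```
import org.apache.kafka.common.protocol.Errors;

public class TimeoutMappingSketch {
    public static void main(String[] args) {
        // Kafka's own TimeoutException maps to a timeout error code...
        Errors kafkaTimeout = Errors.forException(
            new org.apache.kafka.common.errors.TimeoutException("timed out"));
        // ...while the JDK's TimeoutException is not a Kafka ApiException,
        // so it falls back to UNKNOWN_SERVER_ERROR.
        Errors jdkTimeout = Errors.forException(
            new java.util.concurrent.TimeoutException("timed out"));
        System.out.println(kafkaTimeout + " vs " + jdkTimeout);
    }
}
```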
Reviewers: David Arthur <david.arthur@confluent.io>
Add prefix scan support to state stores. Currently, only the RocksDB and in-memory key-value stores are supported.
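Assuming the KIP-614 `prefixScan(prefix, prefixKeySerializer)` method on the key-value store interface, usage might look roughly like this (class name and store types are illustrative):
```
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

final class PrefixScanSketch {
    // Sketch: iterate over all entries whose key starts with "user-".
    // The store is assumed to be a RocksDB or in-memory key-value store
    // obtained elsewhere, e.g. via an interactive query.
    static void printUsers(ReadOnlyKeyValueStore<String, Long> store) {
        try (KeyValueIterator<String, Long> iter =
                 store.prefixScan("user-", Serdes.String().serializer())) {
            while (iter.hasNext()) {
                KeyValue<String, Long> entry = iter.next();
                System.out.println(entry.key + " -> " + entry.value);
            }
        }
    }
}
```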
Reviewers: Bruno Cadonna <bruno@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Kafka is consuming over 50% of all our Travis executors according to Apache Infra, so let's disable it for now.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
`ProducerIdManager` is an existing class that talks to ZooKeeper directly. We won't have ZooKeeper
when using a Raft-based metadata quorum, so we need an abstraction for the functionality of
generating producer IDs. This PR introduces `ProducerIdGenerator` for this purpose, and we pass
an implementation when instantiating `TransactionCoordinator` rather than letting
`TransactionCoordinator.apply()` itself always create a ZooKeeper-based instance.
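Sketched below in Java for brevity (the actual abstraction is a Scala trait in the core module), the essential shape is just a supplier of producer IDs that the coordinator receives from its caller rather than constructing itself:
```
// Minimal sketch of the abstraction: the transaction coordinator only needs
// a way to obtain the next producer ID, not a ZooKeeper client. The method
// name is illustrative.
interface ProducerIdGenerator {
    long generateProducerId();
}
```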
Reviewers: David Arthur <mumrah@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Brokers receive metadata from the Raft metadata quorum very differently than they do from
ZooKeeper today, and this has implications for ReplicaManager. In particular, when a broker
reads the metadata log it may not arrive at the ultimate state for a partition until it reads multiple
messages. In normal operation the multiple messages associated with a state change will all
appear in a single batch, so they can and will be coalesced and applied together. There are
circumstances where messages associated with partition state changes will appear across
multiple batches and we will be forced to coalesce these multiple batches together. The
circumstances when this occurs are as follows:
- When the broker restarts it must "catch up" on the metadata log, and it is likely that the
broker will see multiple partition state changes for a single partition across different
batches while it is catching up. For example, it will see the `TopicRecord` and the
`PartitionRecords` for the topic creation, and then it will see any `IsrChangeRecords`
that may have been recorded since the creation. The broker does not know the state of
the topic partitions until it reads and coalesces all the messages.
- The broker will have to "catch up" on the metadata log if it becomes fenced and then
regains its lease and resumes communication with the metadata quorum.
- A fenced broker may ultimately have to perform a "soft restart" if it was fenced for so
long that the point at which it needs to resume fetching the metadata log has been
subsumed into a metadata snapshot and is no longer independently fetchable. A soft
restart will entail some kind of metadata reset based on the latest available snapshot
plus a catchup phase to fetch after the snapshot end point.
The first case -- during startup -- occurs before clients are able to connect to the broker.
Clients are able to connect to the broker in the second case. It is unclear whether clients will be
able to connect to the broker during a soft restart (the third case).
We need a way to defer the application of topic partition metadata in all of the above cases,
and while we are deferring the application of the metadata the broker will not service clients
for the affected partitions.
As a side note, it is arguable whether the broker should be able to service clients while it is
catching up. The decision not to service clients has no impact in the startup case -- clients can't
connect yet at that point anyway. In the third case it is not yet clear what we are going to do,
but being unable to service clients while performing a soft restart seems reasonable. In the
second case it is most likely true that we will catch up quickly; it would be unusual to
reestablish communication with the metadata quorum such that we gain a new lease and
begin to catch up only to lose our lease again.
So we need a way to defer the application of partition metadata and make those partitions
unavailable while deferring state changes. This PR adds a new internal partition state to
ReplicaManager to accomplish this. Currently the available partition states are simply
`Online`, `Offline` (meaning a log dir failure), and `None` (meaning we don't know about it).
We add a new `Deferred` state. We also rename a couple of methods that refer to
"nonOffline" partitions to instead refer to "online" partitions.
**The new `Deferred` state never happens when using ZooKeeper for metadata storage.**
Partitions can only enter the `Deferred` state when using a KIP-500 Raft metadata quorum
and one of the above 3 cases occurs. The testing strategy is therefore to leverage existing
tests to confirm that there is no functionality change in the ZooKeeper case. We will add
the logic for deferring/applying/reacting to deferred partition state in separate PRs since
that code will never be invoked in the ZooKeeper world.
Reviewers: Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Add DuplicateBrokerRegistrationException, as specified in KIP-631.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Jason Gustafson <jason@confluent.io>
This PR moves static property definitions for user client quotas into a
new class called QuotaConfigs in the clients module under the
o.a.k.common.config.internals package. This is needed to support the
client quotas work in the quorum based controller.
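A rough sketch of the kind of definitions being centralized; the constant names below are hypothetical, but the string keys are the standard dynamic quota properties:
```
package org.apache.kafka.common.config.internals;

// Hypothetical sketch: a central place for the quota property names used by
// both the broker and the quorum-based controller.
public final class QuotaConfigs {
    public static final String PRODUCER_BYTE_RATE = "producer_byte_rate";
    public static final String CONSUMER_BYTE_RATE = "consumer_byte_rate";
    public static final String REQUEST_PERCENTAGE = "request_percentage";

    private QuotaConfigs() { }
}
```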
Reviewers: Colin McCabe <cmccabe@apache.org>
First issue: When more than one worker joins the Connect group, the incremental cooperative assignor revokes and reassigns at most the average number of tasks per worker.
Side-effect: The additional workers that join the group stay idle, and more rebalances are required in the future to reach an even distribution of tasks.
Fix: As part of the task assignment calculation following a deployment, the reassignment of tasks is calculated by revoking all the tasks above the rounded-up (ceiling) average number of tasks per worker (see the sketch below).
Second issue: When more than one worker is lost and rejoins the group, at most one worker is reassigned the lost tasks from all the workers that left the group.
Side-effect: In scenarios where more than one worker is lost and rejoins the group, only one of them gets assigned all the tasks that were lost, and the additional workers that have joined do not get any tasks assigned to them until a future rebalance.
Fix: As part of lost-task reassignment, all the new workers that have joined the group are considered for task assignment and receive the lost tasks in a round-robin fashion.
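A simplified sketch of the first fix's revocation rule, assuming each worker's load is just a task count (names are illustrative, not the assignor's actual code):
```
import java.util.List;

final class RevocationSketch {
    // Each existing worker gives up the tasks it holds above the rounded-up
    // average computed over all workers (including newly joined, empty ones),
    // so the new workers can receive a fair share in the same rebalance.
    static int tasksToRevoke(List<Integer> tasksPerExistingWorker, int totalWorkers) {
        int totalTasks = tasksPerExistingWorker.stream().mapToInt(Integer::intValue).sum();
        int ceilAverage = (totalTasks + totalWorkers - 1) / totalWorkers; // ceil(total / workers)
        return tasksPerExistingWorker.stream()
            .mapToInt(count -> Math.max(0, count - ceilAverage))
            .sum();
    }
}
```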
Testing strategy:
* System testing in a Kubernetes environment completed
* New integration tests to test for balanced tasks
* Updated unit tests.
Co-authored-by: Rameshkrishnan Muthusamy <rameshkrishnan_muthusamy@apple.com>
Co-authored-by: Randall Hauch <rhauch@gmail.com>
Co-authored-by: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Randall Hauch <rhauch@gmail.com>, Konstantine Karantasis <k.karantasis@gmail.com>
Add docs for KIP-663.
Reviewers: Jim Galasyn <jim.galasyn@confluent.io>, Walker Carlson <wcarlson@confluent.io>, Matthias J. Sax <matthias@confluent.io>
mTLS is enabled if a listener-prefixed ssl.client.auth is configured for SASL_SSL listeners. As before, the broker-wide ssl.client.auth is not applied to SASL_SSL listeners, but we now log a warning when it is set.
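For example, assuming a listener named SASL_SSL, the listener-prefixed form that enables mTLS would be set roughly as follows (Java `Properties` used here just to show the keys; this is not a complete broker config):
```
import java.util.Properties;

final class SaslSslMtlsConfigSketch {
    static Properties brokerProps() {
        Properties props = new Properties();
        // Only the listener-prefixed property enables client authentication
        // (mTLS) on the SASL_SSL listener.
        props.put("listener.name.sasl_ssl.ssl.client.auth", "required");
        // The broker-wide property is not applied to SASL_SSL listeners and
        // now only triggers a warning if set:
        // props.put("ssl.client.auth", "required");
        return props;
    }
}
```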
Reviewers: David Jacot <djacot@confluent.io>
This patch adds the schemas and request/response objects for the `BrokerHeartbeat` and `BrokerRegistration` APIs that were added as part of KIP-631. These APIs are only exposed by the KIP-500 controller and not advertised by the broker.
Reviewers: Jason Gustafson <jason@confluent.io>
This patch factors out a trait to allow for other ways to provide the controller `Node` object to `BrokerToControllerChannelManager`. In KIP-500, the controller will be provided from the Raft client and not the metadata cache.
Reviewers: David Arthur <david.arthur@confluent.io>
Rewrite the test case so that it is deterministic and does not depend on multiple threads.
Reviewers: Boyang Chen <boyang@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
This patch attempts to fix `CustomQuotaCallbackTest#testCustomQuotaCallback`. The test creates 99 partitions in a topic, and expects that we can get the partition info for all of them after 15 seconds. If we cannot, then we'll get the error:
```
org.scalatest.exceptions.TestFailedException: Partition [group1_largeTopic,69] metadata not propagated after 15000 ms
```
Fifteen seconds is not always enough to create 99 partitions on a slow system. We fix this by explicitly waiting until the topic has the expected number of partitions before retrieving each partition's info, and by increasing the wait time for all partition metadata to be propagated to 60 seconds.
Reviewers: Jason Gustafson <jason@confluent.io>
Since the Raft leader is already doing the work of assigning offsets and the leader epoch, we can skip the same logic in `Log.appendAsLeader`. This lets us avoid an unnecessary round of decompression.
Reviewers: dengziming <dengziming1993@gmail.com>, Jason Gustafson <jason@confluent.io>
This patch contains the new handling of `meta.properties` required by the KIP-500 server as specified in KIP-631. When using the self-managed quorum, the `meta.properties` file is required in each log directory with the new `version` property set to 1. It must include the `cluster.id` property and it must have a `node.id` matching that in the configuration.
The behavior of `meta.properties` for the ZooKeeper-based `KafkaServer` does not change. We treat `meta.properties` as optional and as if it were `version=0`. We continue to generate the clusterId and/or the brokerId through ZooKeeper as needed.
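As a sketch of what the new `version=1` format requires (property names per the description above; the loading code is illustrative, not the actual implementation):
```
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

// Illustrative loader for the KIP-500 meta.properties format: version=1
// with cluster.id and node.id, the latter of which must match the
// configured node id.
final class MetaPropertiesSketch {
    static void validate(String path, int configuredNodeId) throws IOException {
        Properties props = new Properties();
        try (FileReader reader = new FileReader(path)) {
            props.load(reader);
        }
        if (!"1".equals(props.getProperty("version"))) {
            throw new RuntimeException("Expected version=1 in " + path);
        }
        if (props.getProperty("cluster.id") == null) {
            throw new RuntimeException("Missing cluster.id in " + path);
        }
        if (Integer.parseInt(props.getProperty("node.id")) != configuredNodeId) {
            throw new RuntimeException("node.id does not match the configuration");
        }
    }
}
```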
Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
Adds support for nonzero log start offsets.
Changes to `Log`:
1. Add a new "reason" for increasing the log start offset. This is used by `KafkaMetadataLog` when a snapshot is generated.
2. `LogAppendInfo` now reports whether the log was rolled because of the records append. A log is rolled when a new segment is created. This is used by `KafkaMetadataLog` to, in some cases, delete the newly created segment based on the log start offset.
Changes to `KafkaMetadataLog`:
1. Update both append functions to delete old segments based on the log start offset whenever the log is rolled.
2. Update `lastFetchedEpoch` to return the epoch of the latest snapshot whenever the log is empty.
3. Add a function that empties the log whenever the latest snapshot is newer than (ends past) the replicated log. This is used when first loading the `KafkaMetadataLog` and whenever the `KafkaRaftClient` downloads a snapshot from the leader.
Changes to `KafkaRaftClient`:
1. Improve `validateFetchOffsetAndEpoch` so that it can handle a fetch offset and last fetched epoch that are smaller than the log start offset, in addition to the existing code that checks for a diverging log. The raft client uses this to determine whether the Fetch response should include a diverging epoch or a snapshot id (see the sketch after this list).
2. When a follower finishes fetching a snapshot from the leader, fully truncate the local log.
3. When polling the current state, the raft client checks whether the state machine has generated a new snapshot and updates the log start offset accordingly.
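A simplified sketch of the fetch-validation decision in item 1 above; method and parameter names are illustrative, not the actual `KafkaRaftClient` signatures:
```
// Decide whether a Fetch response should point the follower at a snapshot,
// report a diverging epoch, or serve records normally.
final class FetchValidationSketch {
    enum Result { FETCH_SNAPSHOT, DIVERGING_EPOCH, VALID }

    static Result validate(long fetchOffset, long logStartOffset,
                           long endOffsetOfLastFetchedEpoch) {
        if (fetchOffset < logStartOffset) {
            // The requested offset has been subsumed into a snapshot:
            // the response carries the latest snapshot id.
            return Result.FETCH_SNAPSHOT;
        } else if (fetchOffset > endOffsetOfLastFetchedEpoch) {
            // The follower's log diverges from ours: the response carries
            // a diverging epoch and end offset so the follower can truncate.
            return Result.DIVERGING_EPOCH;
        } else {
            return Result.VALID;
        }
    }
}
```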
Reviewers: Jason Gustafson <jason@confluent.io>
Add a timeout to the remove-thread operation, and have the removed thread explicitly leave the group even in the case of static membership.
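Assuming the `KafkaStreams#removeStreamThread(Duration)` overload that carries the new timeout, usage looks roughly like this:
```
import java.time.Duration;
import java.util.Optional;
import org.apache.kafka.streams.KafkaStreams;

final class RemoveThreadSketch {
    // Sketch: ask Streams to remove a thread, giving up after 60 seconds.
    // The removed thread now leaves the group explicitly even when static
    // membership is configured, so its tasks can be rebalanced promptly.
    static void shrink(KafkaStreams streams) {
        Optional<String> removed = streams.removeStreamThread(Duration.ofSeconds(60));
        removed.ifPresent(name -> System.out.println("Removed " + name));
    }
}
```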
Reviewers: Bruno Cadonna <bruno@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>
Updated CreateTopicsResponse and DeleteTopicsRequest/Response, and added some new AdminClient methods and classes. The newly created topic ID is now returned in CreateTopicsResult (in TopicMetadataAndConfig), and topics can be deleted by supplying topic IDs through deleteTopicsWithIds, which returns a DeleteTopicsWithIdsResult.
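A rough usage sketch of the ID-based deletion path; the exact accessor for reading the topic ID out of CreateTopicsResult is approximated here, and error handling is omitted:
```
import java.util.Collections;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.CreateTopicsResult;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.Uuid;

final class TopicIdSketch {
    // Sketch: create a topic, read back its topic ID from the result,
    // then delete it by ID rather than by name.
    static void createAndDelete(Admin admin) throws Exception {
        CreateTopicsResult created = admin.createTopics(
            Collections.singleton(new NewTopic("demo", 1, (short) 1)));
        Uuid topicId = created.topicId("demo").get();
        admin.deleteTopicsWithIds(Collections.singleton(topicId)).all().get();
    }
}
```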
Reviewers: dengziming <dengziming1993@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
We should ensure `NetworkClient` is closed properly when `InterBrokerSendThread` is shut down. Also, use `initiateShutdown` instead of `wakeup()` to alert the polling thread.
Reviewers: David Jacot <djacot@confluent.io>
Co-authored-by: Bruno Cadonna <bruno@confluent.io>
Reviewers: Jim Galasyn <jim.galasyn@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <guozhang@confluent.io>, Bruno Cadonna <bruno@confluent.io>
Make the default state store directory location follow the OS-specific temporary directory settings or the java.io.tmpdir JVM parameter, using Utils#getTempDir.
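Applications that need a stable location rather than whatever java.io.tmpdir resolves to can keep pinning the state directory explicitly; a minimal sketch:
```
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

final class StateDirSketch {
    // Sketch: the default state.dir now follows java.io.tmpdir, but it can
    // still be set explicitly when a stable location is required.
    static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
        return props;
    }
}
```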
Reviewers: Matthias J. Sax <mjsax@apache.org>, John Roesler <vvcephei@apache.org>
With KIP-595, we previously expected `RaftConfig` to specify the quorum voter endpoints upfront on startup. In the general case, this works fine. However, for testing, where the bound port is not known ahead of time, we need a lazier approach that discovers the other voters in the quorum after startup.
In this patch, we take the voter endpoint initialization out of `KafkaRaftClient.initialize` and move it to `RaftManager`. We use a special address to indicate that the voter addresses will be provided later. This approach also lends itself well to future use cases where we might discover voter addresses through an external service (for example).
Reviewers: Jason Gustafson <jason@confluent.io>
By default, MirrorMaker 2 creates herders for all the possible cluster combinations even if the "links" are not enabled.
This is because heartbeats are emitted from the "opposite" herder.
If there is a replication flow from A to B and heartbeats are required, two herders are needed:
- A->B for the MirrorSourceConnector
- B->A for the MirrorHeartbeatConnector
The MirrorHeartbeatConnector on B->A emits heartbeats into the heartbeats topic on cluster A.
The MirrorSourceConnector on A->B then replicates whichever topics are configured, as well as heartbeats.
In cases with multiple clusters (10 and more), this leads to a very large number of unnecessary connections, file descriptors, and configuration topics created in every target cluster.
With this code change, we leverage the top-level property "emit.heartbeats.enabled", which defaults to "true".
We skip creating the A->B herder whenever A->B.emit.heartbeats.enabled=false (defaults to true) and A->B.enabled=false (defaults to false).
Existing users will not see any change, and if they depend on these "opposite" herders for their monitoring, it will still work.
New users with more complex use cases can change this property and fine-tune their heartbeat generation.
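For example, with cluster aliases A and B (placeholders), a one-way flow that does not need the reverse heartbeat herder could be declared roughly like this in the MirrorMaker 2 properties file:
```
clusters = A, B
A->B.enabled = true
# The B->A herder is skipped because neither replication nor heartbeats
# are required in that direction.
B->A.enabled = false
B->A.emit.heartbeats.enabled = false
```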
Reviewers: Ryanne Dolan <ryannedolan@gmail.com>, Sanjana Kaundinya <skaundinya@gmail.com>, Jason Gustafson <jason@confluent.io>
- According to https://docs.gradle.org/current/userguide/performance.html#parallel_execution,
Gradle executes builds serially by default.
- With this change, the build performance is significantly better (~2x) on multi-core machines.
The time to run the following command went from 7m20s to 3m34s on one machine:
`clean install assemble spotlessScalaCheck checkstyleMain checkstyleTest spotbugsMain`
Reviewers: Ismael Juma <ismael@juma.me.uk>
As this is only an API extension to match the Java API with the Named object, with lots of duplication, I only tested the logic once.
Reviewers: Bill Bejeck <bbejeck@apache.org>
This adds a new user-facing documentation "Geo-replication (Cross-Cluster Data Mirroring)" section to the Kafka Operations documentation that covers MirrorMaker v2.
Was already merged to kafka-site via apache/kafka-site#324.
Reviewers: Bill Bejeck <bbejeck@apache.org>
A few important fixes:
* ZOOKEEPER-3829: Zookeeper refuses request after node expansion
* ZOOKEEPER-3842: Rolling scale up of zookeeper cluster does not work with reconfigEnabled=false
* ZOOKEEPER-3830: After add a new node, zookeeper cluster won't commit any proposal if this new node is leader
Full release notes: https://zookeeper.apache.org/doc/r3.5.9/releasenotes.html
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>