Replace BrokerStates.scala with BrokerState.java, to make it easier to use from Java code if needed. This also makes it easier to go from a numeric type to an enum.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Tests involving `BrokerToControllerChannelManager` are simplified by being able to leverage `MockClient`. This patch introduces a `MockBrokerToControllerChannelManager` implementation which makes that possible.
The patch updates `ForwardingManagerTest` to use `MockBrokerToControllerChannelManager`. We also add a couple additional timeout cases, which exposed a minor bug. Previously we were using the wrong `TimeoutException`, which meant that expected timeout errors were in fact translated to `UNKNOWN_SERVER_ERROR`.
Reviewers: David Arthur <david.arthur@confluent.io>
`ProducerIdManager` is an existing class that talks to ZooKeeper directly. We won't have ZooKeeper
when using a Raft-based metadata quorum, so we need an abstraction for the functionality of
generating producer IDs. This PR introduces `ProducerIdGenerator` for this purpose, and we pass
an implementation when instantiating `TransactionCoordinator` rather than letting
`TransactionCoordinator.apply()` itself always create a ZooKeeper-based instance.
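As a rough sketch (the method names and signatures below are illustrative and may not match the committed code exactly), the abstraction only needs to hand out the next producer ID:

```scala
// Illustrative sketch of the abstraction; names/signatures may differ from the real code.
trait ProducerIdGenerator {
  def generateProducerId(): Long   // returns the next available producer ID
  def shutdown(): Unit = {}        // optional cleanup hook
}
```

The ZooKeeper-backed `ProducerIdManager` then becomes one implementation of this trait, and a Raft-based implementation can be plugged in later.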
Reviewers: David Arthur <mumrah@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Brokers receive metadata from the Raft metadata quorum very differently than they do from
ZooKeeper today, and this has implications for ReplicaManager. In particular, when a broker
reads the metadata log it may not arrive at the ultimate state for a partition until it reads multiple
messages. In normal operation the multiple messages associated with a state change will all
appear in a single batch, so they can and will be coalesced and applied together. There are
circumstances where messages associated with partition state changes will appear across
multiple batches and we will be forced to coalesce these multiple batches together. The
circumstances when this occurs are as follows:
- When the broker restarts it must "catch up" on the metadata log, and it is likely that the
broker will see multiple partition state changes for a single partition across different
batches while it is catching up. For example, it will see the `TopicRecord` and the
`PartitionRecords` for the topic creation, and then it will see any `IsrChangeRecords`
that may have been recorded since the creation. The broker does not know the state of
the topic partitions until it reads and coalesces all the messages.
- The broker will have to "catch up" on the metadata log if it becomes fenced and then
regains its lease and resumes communication with the metadata quorum.
- A fenced broker may ultimately have to perform a "soft restart" if it was fenced for so
long that the point at which it needs to resume fetching the metadata log has been
subsumed into a metadata snapshot and is no longer independently fetchable. A soft
restart will entail some kind of metadata reset based on the latest available snapshot
plus a catchup phase to fetch after the snapshot end point.
The first case -- during startup -- occurs before clients are able to connect to the broker.
Clients are able to connect to the broker in the second case. It is unclear if clients will be
able to connect to the broker during a soft restart (the third case).
We need a way to defer the application of topic partition metadata in all of the above cases,
and while we are deferring the application of the metadata the broker will not service clients
for the affected partitions.
As a side note, it is arguable whether the broker should be able to service clients while catching
up. The decision to not service clients has no impact in the startup case -- clients can't
connect yet at that point anyway. In the third case it is not yet clear what we are going to do,
but being unable to service clients while performing a soft reset seems reasonable. In the
second case it is most likely true that we will catch up quickly; it would be unusual to
reestablish communication with the metadata quorum such that we gain a new lease and
begin to catch up only to lose our lease again.
So we need a way to defer the application of partition metadata and make those partitions
unavailable while deferring state changes. This PR adds a new internal partition state to
ReplicaManager to accomplish this. Currently the available partition states are simply
`Online`, `Offline` (meaning a log dir failure), and `None` (meaning we don't know about it).
We add a new `Deferred` state. We also rename a couple of methods that refer to
"nonOffline" partitions to instead refer to "online" partitions.
**The new `Deferred` state never happens when using ZooKeeper for metadata storage.**
Partitions can only enter the `Deferred` state when using a KIP-500 Raft metadata quorum
and one of the above 3 cases occurs. The testing strategy is therefore to leverage existing
tests to confirm that there is no functionality change in the ZooKeeper case. We will add
the logic for deferring/applying/reacting to deferred partition state in separate PRs since
that code will never be invoked in the ZooKeeper world.
Reviewers: Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This PR moves static property definitions for user client quotas into a
new class called QuotaConfigs in the clients module under the
o.a.k.common.config.internals package. This is needed to support the
client quotas work in the quorum-based controller.
Reviewers: Colin McCabe <cmccabe@apache.org>
mTLS is enabled if a listener-prefixed ssl.client.auth is configured for SASL_SSL listeners. The broker-wide ssl.client.auth is, as before, not applied to SASL_SSL listeners, but we now print a warning when it is set.
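For example, a broker configuration along these lines (the listener name is chosen for illustration) enables client certificate authentication on a SASL_SSL listener via the listener-prefixed property:

```
listeners=SASL_SSL://:9093
listener.name.sasl_ssl.ssl.client.auth=required
```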
Reviewers: David Jacot <djacot@confluent.io>
This patch adds the schemas and request/response objects for the `BrokerHeartbeat` and `BrokerRegistration` APIs that were added as part of KIP-631. These APIs are only exposed by the KIP-500 controller and not advertised by the broker.
Reviewers: Jason Gustafson <jason@confluent.io>
This patch factors out a trait to allow for other ways to provide the controller `Node` object to `BrokerToControllerChannelManager`. In KIP-500, the controller will be provided from the Raft client and not the metadata cache.
Reviewers: David Arthur <david.arthur@confluent.io>
This patch attempts to fix `CustomQuotaCallbackTest#testCustomQuotaCallback`. The test creates 99 partitions in a topic, and expects that we can get the partition info for all of them after 15 seconds. If we cannot, then we'll get the error:
```
org.scalatest.exceptions.TestFailedException: Partition [group1_largeTopic,69] metadata not propagated after 15000 ms
```
15 seconds is not enough to finish creating 99 partitions on a slow system. We fix this by explicitly waiting until the expected number of partitions exists before retrieving each partition's info, and by increasing the wait time for all partition metadata to be propagated to 60 seconds.
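A sketch of the fix (helper names like `partitionCount`, `topic`, and `expectedPartitions` are placeholders, not the actual test code):

```scala
// Wait (up to 60s) until the broker reports the expected number of partitions for the
// topic before fetching each partition's metadata.
TestUtils.waitUntilTrue(
  () => partitionCount(topic) == expectedPartitions,
  s"Only ${partitionCount(topic)} of $expectedPartitions partitions of $topic were created",
  waitTimeMs = 60000L
)
```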
Reviewers: Jason Gustafson <jason@confluent.io>
Since the Raft leader is already doing the work of assigning offsets and the leader epoch, we can skip the same logic in `Log.appendAsLeader`. This lets us avoid an unnecessary round of decompression.
Reviewers: dengziming <dengziming1993@gmail.com>, Jason Gustafson <jason@confluent.io>
This patch contains the new handling of `meta.properties` required by the KIP-500 server as specified in KIP-631. When using the self-managed quorum, the `meta.properties` file is required in each log directory with the new `version` property set to 1. It must include the `cluster.id` property and it must have a `node.id` matching that in the configuration.
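A minimal example of what such a `meta.properties` file looks like under the new handling (the cluster and node IDs below are placeholders):

```
version=1
cluster.id=<generated-cluster-id>
node.id=1
```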
The behavior of `meta.properties` for the Zookeeper-based `KafkaServer` does not change. We treat `meta.properties` as optional and as if it were `version=0`. We continue to generate the clusterId and/or the brokerId through Zookeeper as needed.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
Adds support for nonzero log start offsets.
Changes to `Log`:
1. Add a new "reason" for increasing the log start offset. This is used by `KafkaMetadataLog` when a snapshot is generated.
2. `LogAppendInfo` now indicates whether the log was rolled as a result of the records append. A log is rolled when a new segment is created. `KafkaMetadataLog` uses this to, in some cases, delete the newly created segment based on the log start offset.
Changes to `KafkaMetadataLog`:
1. Update both append functions to delete old segments based on the log start offset whenever the log is rolled.
2. Update `lastFetchedEpoch` to return the epoch of the latest snapshot whenever the log is empty.
3. Add a function that empties the log whenever the latest snapshot extends past the end of the replicated log. This is used when first loading the `KafkaMetadataLog` and whenever the `KafkaRaftClient` downloads a snapshot from the leader.
Changes to `KafkaRaftClient`:
1. Improve `validateFetchOffsetAndEpoch` so that it can handle a fetch offset and last fetched epoch that are smaller than the log start offset. This is in addition to the existing code that checks for a diverging log. The raft client uses this to determine whether the Fetch response should include a diverging epoch or a snapshot id (see the sketch after this list).
2. When a follower finishes fetching a snapshot from the leader, fully truncate the local log.
3. When polling the current state, the raft client should check whether the state machine has generated a new snapshot and update the log start offset accordingly.
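The sketch referenced in (1), with hypothetical names (the real logic lives in `KafkaRaftClient`), illustrates how the raft client decides between returning a snapshot id and a diverging epoch:

```scala
// Illustrative sketch only; all names are hypothetical and the real implementation
// is in KafkaRaftClient#validateFetchOffsetAndEpoch.
sealed trait FetchOutcome
case object ServeRecords extends FetchOutcome
final case class SendSnapshotId(endOffset: Long, epoch: Int) extends FetchOutcome
final case class SendDivergingEpoch(endOffset: Long, epoch: Int) extends FetchOutcome

def validateFetch(fetchOffset: Long, lastFetchedEpoch: Int, logStartOffset: Long,
                  latestSnapshot: Option[(Long, Int)],
                  divergingEndOffset: (Long, Int) => Option[(Long, Int)]): FetchOutcome = {
  latestSnapshot match {
    case Some((snapshotOffset, snapshotEpoch)) if fetchOffset < logStartOffset =>
      // The requested position has been subsumed by a snapshot: point the follower at it.
      SendSnapshotId(snapshotOffset, snapshotEpoch)
    case _ =>
      // Existing handling: check whether the follower's last fetched epoch diverges from the local log.
      divergingEndOffset(fetchOffset, lastFetchedEpoch) match {
        case Some((endOffset, epoch)) => SendDivergingEpoch(endOffset, epoch)
        case None                     => ServeRecords
      }
  }
}
```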
Reviewers: Jason Gustafson <jason@confluent.io>
Updated CreateTopicsResponse and DeleteTopicsRequest/Response, and added some new AdminClient methods and classes. The newly created topic ID is now returned in CreateTopicsResult (within TopicMetadataAndConfig), and topics can be deleted by supplying topic IDs through deleteTopicsWithIds, which returns a DeleteTopicsWithIdsResult.
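A hedged usage sketch of the new flow (exact method names and result types may differ from the final API):

```scala
import java.util.Collections
import org.apache.kafka.clients.admin.{Admin, NewTopic}
import org.apache.kafka.common.Uuid

// Sketch only: create a topic, read its ID from the result, then delete it by ID.
def createAndDeleteById(admin: Admin): Unit = {
  val createResult = admin.createTopics(Collections.singleton(new NewTopic("example-topic", 3, 1.toShort)))
  val topicId: Uuid = createResult.topicId("example-topic").get() // topic ID now exposed on the result
  val deleteResult = admin.deleteTopicsWithIds(Collections.singleton(topicId))
  deleteResult.all().get()
}
```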
Reviewers: dengziming <dengziming1993@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
We should ensure `NetworkClient` is closed properly when `InterBrokerSendThread` is shut down. Also use `initiateShutdown` instead of `wakeup()` to alert the polling thread.
Reviewers: David Jacot <djacot@confluent.io>
Make the default state store directory location follow the OS-specific temporary directory
settings, as given by the java.io.tmpdir JVM parameter, using Utils#getTempDir.
Reviewers: Matthias J. Sax <mjsax@apache.org>, John Roesler <vvcephei@apache.org>
With KIP-595, we previously expected `RaftConfig` to specify the quorum voter endpoints upfront on startup. In the general case, this works fine. However, for testing where the bound port is not known ahead of time, we need a lazier approach that discovers the other voters in the quorum after startup.
In this patch, we take the voter endpoint initialization out of `KafkaRaftClient.initialize` and move it to `RaftManager`. We use a special address to indicate that the voter addresses will be provided later. This approach also lends itself well to future use cases where we might discover voter addresses through an external service (for example).
Reviewers: Jason Gustafson <jason@confluent.io>
A few small cleanups in `GroupCoordinator` and related classes:
- Remove redundant `groupId` field from `MemberMetadata`
- Remove redundant `isStaticMember` field from `MemberMetadata`
- Fix broken log message in `GroupMetadata.loadGroup` and apply it to all loaded members
- Remove ancient TODOs and no-op methods from `GroupCoordinator`
- Move removal of static members into `GroupMetadata.remove`
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Jacot <djacot@confluent.io>
This patch bumps the version of the Metadata API and removes the `IncludeClusterAuthorizedOperations` field from the request and the corresponding `ClusterAuthorizedOperations` field from the response, from version 11 onward.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
The primary purpose of this patch is to remove the internal `enable.metadata.quorum` configuration. Instead, we rely on `process.roles` to determine if the self-managed quorum has been enabled. As a part of this, I've done the following:
1. Replace the notion of "disabled" APIs with "controller-only" APIs. We previously marked some APIs which were intended only for the KIP-500 controller as "disabled" so that they would not be unintentionally exposed. For example, the Raft quorum APIs were disabled. Marking them as "controller-only" has the same effect, but makes the intent clearer: they should only be exposed by the KIP-500 controller.
2. Make `ForwardingManager` optional in `KafkaServer` and `KafkaApis`. Previously we used `null` if forwarding was enabled and relied on the metadata quorum check.
3. Make `zookeeper.connect` an optional configuration if `process.roles` is defined.
4. Update the raft README to remove the reference to `zookeeper.connect`
Reviewers: Colin Patrick McCabe <cmccabe@confluent.io>, Boyang Chen <boyang@confluent.io>
This patch moves Raft config definitions from `RaftConfig` to `KafkaConfig`, where they are re-defined as internal configs until we are ready to expose them. It also adds the missing "controller" prefix that was added by KIP-631.
Reviewers: Jason Gustafson <jason@confluent.io>
Topics processed by the controller and newly created topics will only be given topic IDs if the inter-broker protocol version on the controller is 2.8 or higher. This PR also adds a Kafka config to specify whether the IBP is greater than or equal to 2.8. System tests have been modified to include topic ID checks for upgrade/downgrade tests. This PR also adds a new integration test file for requests/responses that are not gated by the IBP (e.g. metadata).
Reviewers: dengziming <dengziming1993@gmail.com>, Lucas Bradstreet <lucas@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
Set up the controller API version intersection for client queries. When an unsupported version exception is hit in the EnvelopeResponse, close the client connection so that the client rediscovers the API versions.
Reviewers: Jason Gustafson <jason@confluent.io>
This patch attempts to generalize server initialization for KIP-500. It adds a `Server` trait which `KafkaServer` extends for the legacy Zookeeper server, and a new `KafkaRaftServer` for the new server. I have also added stubs for `KafkaRaftBroker` and `KafkaRaftController` to give a clearer idea how this will be used.
Note that this patch removes `KafkaServerStartable`, which was intended to enable custom startup logic, but was not codified into an official API and is not planned to be supported after KIP-500.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Colin P. McCabe <cmccabe@apache.org>
We would like to be able to use `KafkaRaftClient` for tooling/debugging use cases. For this, we need the localId to be optional so that the client can be used more like a consumer. This is already supported in the `Fetch` protocol by setting `replicaId=-1`, which the Raft implementation checks for. We just need to alter `QuorumState` so that the `localId` is optional. The main benefit of doing this is that it saves tools the need to generate an arbitrary id (which might cause conflicts given limited Int32 space) and it lets the leader avoid any local state for these observers (such as `ReplicaState` inside `LeaderState`).
Reviewers: Ismael Juma <ismael@juma.me.uk>, Boyang Chen <boyang@confluent.io>
Rename AdminManager to ZkAdminManager to emphasize the fact that it is not used by the KIP-500 code paths.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Boyang Chen <boyang@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
In the [KIP-500 development branch](https://github.com/confluentinc/kafka/tree/kip-500),
we have a separate ControllerApis that shares a lot of functionality with KafkaApis. We
introduced a utility class ApisUtils to pull out the common code. Some things were moved
to RequestChannel as well.
We'd like to upstream this work now so we don't continue to diverge (since KafkaApis is
a frequently modified class). There should be no logical changes in this PR, only shuffling
code around.
Reviewers: Jason Gustafson <jason@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Jose Sancio <jsancio@users.noreply.github.com>, Ismael Juma <ismael@juma.me.uk>
It is helpful to delay initialization of the `RaftClient` configuration including the voter string until after construction. This helps in integration test cases where the voter ports may not be known until sockets are bound.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
This patch removes the periodic scheduling of a thread to send AlterISR requests and instead sends a request as soon as an ISR update is ready to send. When a response is received, the callback checks for any unsent items and will schedule another request if necessary.
Prior to this patch, envelope handling was a shared responsibility between `SocketServer` and `KafkaApis`. The former was responsible for parsing and validation, while the latter was responsible for authorization. This patch consolidates logic in `KafkaApis` so that envelope requests follow the normal request flow.
Reviewers: David Jacot <djacot@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
This patch factors out a `RaftManager` class from `TestRaftServer` which will be needed when we integrate this layer into the server. This class encapsulates the logic to build `KafkaRaftClient` as well as its IO thread.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
This PR includes the following changes:
1. Replace @Test(expected = Exception.class) with assertThrows (see the sketch after this list).
2. Remove references to org.scalatest.Assertions.
3. Change the magic value from 1 to 2 for testAppendAtInvalidOffset to test ZSTD.
4. Rename testMaybeAddPartitionToTransactionXXXX to testNotReadyForSendXXX.
5. Increase maxBlockMs from 1s to 3s to avoid unexpected timeouts in TransactionsTest#testTimeout.
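A sketch of the pattern in item 1, using JUnit 4's assertThrows for illustration (the failing operation here is a stand-in, not an actual test):

```scala
import org.junit.Assert.assertThrows

// Before: the expected exception was declared on the annotation:
//   @Test(expected = classOf[IllegalArgumentException])
// After: assert the exception explicitly at the point where it is expected.
assertThrows(classOf[IllegalArgumentException],
  () => require(false, "illustrative failing operation"))
```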
Reviewers: Ismael Juma <ismael@confluent.io>