Commit Graph

3933 Commits

Author SHA1 Message Date
Vincent Jiang 8544dba721 KAFKA-15375: fix broken clean shutdown detection logic in LogManager
When running in kraft mode, LogManager.startup is called in a different thread than the main broker (#14239)
startup thread (by BrokerMetadataPublisher when the first metadata update is received.) If a fatal
error happens during broker startup, before LogManager.startup is completed, LogManager.shutdown may
 mark log dirs as clean shutdown improperly.

This PR includes following change:
1. During LogManager startup time:
  - track hadCleanShutdwon info for each log dir
  - track loadLogsCompleted status for each log dir
2. During LogManager shutdown time:
  - do not write clean shutdown marker file for log dirs which have hadCleanShutdown==false and loadLogsCompleted==false

Reviewers: Colin P. McCabe <cmccabe@apache.org>
2023-08-30 12:42:04 -07:00
Bo Gao 041afb73ec KAFKA-15053: Use case insensitive validator for security.protocol config (#13831)
Fixed a regression described in KAFKA-15053 that security.protocol only allows uppercase values like PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL. With this fix, both lower case and upper case values will be supported (e.g. PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL, plaintext, ssl, sasl_plaintext, sasl_ssl)

Reviewers: Chris Egerton <chrise@aiven.io>, Divij Vaidya <diviv@amazon.com>
2023-06-29 11:03:36 +00:00
Jeff Kim 8164a70ec1
KAFKA-14869: Bump coordinator value records to flexible versions (KIP-915, Part-2) (#13526) (#13603)
This patch implemented the second part of KIP-915. It bumps the versions of the value records used by the group coordinator and the transaction coordinator to make them flexible versions. The new versions are not used when writing to the partitions but only when reading from the partitions. This allows downgrades from future versions that will include tagged fields.

Reviewers: David Jacot <djacot@confluent.io>
2023-04-25 13:53:06 +02:00
Jeff Kim 80ad01f390
KAFKA-14869: Ignore unknown record types for coordinators (KIP-915, Part-1) (#13599)
This patch implemented the first part of KIP-915. It updates the group coordinator and the transaction coordinator to ignores unknown record types while loading their respective state from the partitions. This allows downgrades from future versions that will include new record types.

Reviewers: Alexandre Dupriez <alexandre.dupriez@gmail.com>, David Jacot <djacot@confluent.io>
2023-04-21 17:06:46 +02:00
Ron Dagostino bffab97465
KAFKA-14887: FinalizedFeatureChangeListener should not shut down when ZK session expires
FinalizedFeatureChangeListener shuts the broker down when it encounters an issue trying to process feature change
events. However, it does not distinguish between issues related to feature changes actually failing and other
exceptions like ZooKeeper session expiration. This introduces the possibility that Zookeeper session expiration
could cause the broker to shutdown, which is not intended. This patch updates the code to distinguish between
these two types of exceptions. In the case of something like a ZK session expiration it logs a warning and continues.
We shutdown the broker only for FeatureCacheUpdateException.

Reviewers: Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Christo Lolov <christololov@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
2023-04-20 20:18:40 -04:00
David Jacot 7f242fee15 KAFKA-14880; TransactionMetadata with producer epoch -1 should be expirable (#13499)
We have seen the following error in logs:

```
"Mar 22, 2019 @ 21:57:56.655",Error,"kafka-0-0","transaction-log-manager-0","Uncaught exception in scheduled task 'transactionalId-expiration'","java.lang.IllegalArgumentException: Illegal new producer epoch -1
```

Investigations showed that it is actually possible for a transaction metadata object to still have -1 as producer epoch when it transitions to Dead.

When a transaction metadata is created for the first time (in handleInitProducerId), it has -1 as its producer epoch. Then a producer epoch is attributed and the transaction coordinator tries to persist the change. If the write fail for instance because there is an under min isr, the transaction metadata remains with its epoch as -1 forever or until the init producer id is retried.

This means that it is possible for transaction metadata to remain with -1 as producer epoch until it gets expired. At the moment, this is not allowed because we enforce a producer epoch greater or equals to 0 in prepareTransitionTo.

Reviewers: Luke Chen <showuon@gmail.com>, Justine Olshan <jolshan@confluent.io>
2023-04-06 08:51:58 +02:00
David Jacot b8e01e2406 KAFKA-14704; Follower should truncate before incrementing high watermark (#13230)
When a leader becomes a follower, it is likely that it has uncommitted records in its log. When it reaches out to the leader, the leader will detect that they have diverged and it will return the diverging epoch and offset. The follower truncates it log based on this.

There is a small caveat in this process. When the leader return the diverging epoch and offset, it also includes its high watermark, low watermark, start offset and end offset. The current code in the `AbstractFetcherThread` works as follow. First it process the partition data and then it checks whether there is a diverging epoch/offset. The former may accidentally expose uncommitted records as this step updates the local watermark to whatever is received from the leader. As the follower, or the former leader, may have uncommitted records, it will be able to updated the high watermark to a larger offset if the leader has a higher watermark than the current local one. This result in exposing uncommitted records until the log is finally truncated. The time window is short but a fetch requests coming at the right time to the follower could read those records. This is especially true for clients out there which uses recent versions of the fetch request but without implementing KIP-320.

When this happens, the follower logs the following messages: 
* `Truncating XXX to offset 21434 below high watermark 21437`
* `Non-monotonic update of high watermark from (offset=21437 segment=[20998:98390]) to (offset=21434 segment=[20998:97843])`.

This patch proposes to mitigate the issue by starting by checking on whether a diverging epoch/offset is provided by the leader and skip processing the partition data if it is. This basically means that the first fetch request will result in truncating the log and a subsequent fetch request will update the low/high watermarks.

Reviewers: Ritika Reddy <rreddy@confluent.io>, Justine Olshan <jolshan@confluent.io>, Jason Gustafson <jason@confluent.io>
2023-02-14 09:58:37 +01:00
Manikumar Reddy 0360577ccc MINOR: Add FeatureZNode ZK node path to ZK root paths 2023-02-08 00:10:29 +05:30
José Armando García Sancio 124cff3c80 KAFKA-14557; Lock metadata log dir (#13058)
This change makes sure that Kafka grabs a log dir lock in the following additional cases:

1. When a Kafka node runs in controller only. The current implementation doesn't grab a file lock because the LogManager is never instantiated.
2. When the metadata log dir is different from the log dir(s). The current implementation of LogManager doesn't load or grab a lock on the metadata dir.

Reviewers: Ron Dagostino <rdagostino@confluent.io> , dengziming <dengziming1993@gmail.com>
2023-01-10 12:08:10 -08:00
Edoardo Comar 27847e0c78 KAFKA-14571: Include rack info in ZkMetadataCache.getClusterMetadata (#13073)
ZkMetadataCache.getClusterMetadata returns a Cluster object where the aliveNodes were
missing their rack info.

Problem: when ZkMetadataCache updates the metadataSnapshot, includes the rack in
aliveBrokers but not in aliveNodes

Trivial fix with matching assertion in existing unit test.

Note that the Cluster object returned from `MetadataCache.getClusterMetadata(...)`
is passed to `ClientQuotaCallback.updateClusterMetadata(...)`
so it is used, though not by clients, but by service providers.

Reviewers: Ismael Juma <ismael@juma.me.uk>
2023-01-04 15:32:24 -08:00
Calvin Liu 228a2de767 MINOR: Fix testPreferredReplicaAsLeaderWhenSameRackFollowerIsOutOfIsr (#12982)
Reviewers: David Jacot <djacot@confluent.io>
2022-12-21 09:44:31 +01:00
Justine Olshan 0b75a38dab KAFKA-14417: Address incompatible error code returned by broker from `InitProducerId` (#12968)
Older clients can not handle the `REQUEST_TIMED_OUT` error that is returned from `InitProducerId` when the next producerId block cannot be fetched from the controller. In this patch, we return `COORDINATOR_LOAD_IN_PROGRESS` instead which is retriable.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-12-19 09:50:42 -08:00
Artem Livshits 8d32c49fd5 MINOR: Fix FetchFromFollowerIntegrationTest.testFetchFromFollowerWithRoll (#12987)
The test was added with a fix to KAFKA-14379, the problem was that the replication factor for the offset topic was 1 and consumer group coordinator got unavailable when one of the brokers got shut down.

Reviewers: David Jacot <djacot@confluent.io>
2022-12-14 08:32:06 +01:00
Ron Dagostino 22af3f29ce KAFKA-14392: Fix overly long request timeouts in BrokerToControllerChannelManager (#12856)
In BrokerToControllerChannelManager, set the request timeout to the minimum of the retry timeout
and the controller socket timeout. This fixes some cases where we were unintentionally setting an
overly long request timeout. 

Also, the channel manager used by the BrokerLifecycleManager should set a retry timeout equal to
half of the broker session timeout, rather than the entire broker session timeout, to allow for a
retransmission if the initial attempt fails.

These two fixes should address some cases where heartbeat broker requests were not being resent
in a timely fashion after a network glitch.

Reviewers: Colin P. McCabe <cmccabe@apache.org>, José Armando García Sancio <jsancio@apache.org>
2022-12-12 09:49:33 -08:00
Artem Livshits 2e87992ab0 KAFKA-14379: Consumer should refresh preferred read replica on update metadata (#12956)
The consumer (fetcher) used to refresh the preferred read replica on
three conditions:
    
1. the consumer receives an OFFSET_OUT_OF_RANGE error
2. the follower does not exist in the client's metadata (i.e., offline)
3. after metadata.max.age.ms (5 min default)
    
For other errors, it will continue to reach to the possibly unavailable
follower and only after 5 minutes will it refresh the preferred read
replica and go back to the leader.
    
Another problem is that the client might have stale metadata and not
send fetches to preferred replica, even after the leader redirects to
the preferred replica.
    
A specific example is when a partition is reassigned. the consumer will
get NOT_LEADER_OR_FOLLOWER which triggers a metadata update but the
preferred read replica will not be refreshed as the follower is still
online. it will continue to reach out to the old follower until the
preferred read replica expires.
    
The consumer can instead refresh its preferred read replica whenever it
makes a metadata update request, so when the consumer receives i.e.
NOT_LEADER_OR_FOLLOWER it can find the new preferred read replica without
waiting for the expiration.
    
Generally, we will rely on the leader to choose the correct preferred
read replica and have the consumer fail fast (clear preferred read replica
cache) on errors and reach out to the leader.

Co-authored-by: Jeff Kim <jeff.kim@confluent.io>

Reviewers: David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>
2022-12-12 10:00:41 +01:00
José Armando García Sancio 0a7ea3f7dc KAFKA-14358; Disallow creation of cluster metadata topic (#12885)
With KRaft the cluster metadata topic (__cluster_metadata) has a different implementation compared to regular topic. The user should not be allowed to create this topic. This can cause issues if the metadata log dir is the same as one of the log dirs.

This change returns an authorization error if the user tries to create the cluster metadata topic.

Reviewers: David Arthur <mumrah@gmail.com>
2022-12-02 08:26:13 -08:00
David Jacot 0de727af1e MINOR: JoinGroup and SyncGroup callbacks should catch exceptions (#12910)
We recently had a bug causing the JoinGroup callback to thrown an exception (https://github.com/apache/kafka/pull/12909). When it happens, the exception is propagated to the caller and the JoinGroup callback is never completed. To make it worst, the member whose callback failed become a zombie because the group coordinator does not expire member with a pending callback.

This patch catch exceptions for both invocation of JoinGroup and SyncGroup callbacks and retry to complete them with a `UNKNOWN_SERVER_ERROR` error if they failed.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-11-29 09:49:46 +01:00
zou shengfu 16938e5b0a KAFKA-14009: Rebalance timeout should be updated when static member rejoins
When consumers use static membership protocol, they can not update the rebalance timeout because the group coordinator simply ignore any new values. This patch fixes this.

Reviewers: David Jacot <djacot@confluent.io>
2022-11-24 15:18:54 +01:00
Jeff Kim fd3bf0ca65 KAFKA-14372: Choose replicas only from ISR for preferred read replica (#12877)
The default replica selector chooses a replica on whether the broker.rack matches the client.rack in the fetch request and whether the offset exists in the follower. If the follower is not in the ISR, we know it's lagging behind which will also lag the consumer behind. there are two cases:
1. the follower recovers and joins the isr. the consumer will no longer fall behind.
2. the follower continues to lag behind. after 5 minutes, the consumer will refresh its preferred read replica and the leader will return the same lagging follower since the offset the consumer fetched up to is capped by the follower's HWM. this can go on indefinitely.

If the replica selector chooses a broker in the ISR then we can ensure that at least every 5 minutes the consumer will consume from an up-to-date replica. 

Reviewers: David Jacot <djacot@confluent.io>
2022-11-24 15:05:22 +01:00
Igor Soarez e05224edab
KAFKA-14303 Producer.send without record key and batch.size=0 goes into infinite loop (#12752) (#12872)
This fixes an bug which causes a call to producer.send(record) with a record without a key and configured with batch.size=0 never to return.

Without specifying a key or a custom partitioner the new BuiltInPartitioner, as described in KIP-749 kicks in.

BuiltInPartitioner seems to have been designed with the reasonable assumption that the batch size will never be lower than one.

However, documentation for producer configuration states batch.size=0 as a valid value, and even recommends its use directly. [1]

[1] clients/src/main/java/org/apache/kafka/clients/producer/ProducerConfig.java:87

Reviewers: Artem Livshits <alivshits@confluent.io>, Luke Chen <showuon@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
2022-11-18 17:53:39 +08:00
Jeff Kim a310cc8e77 KAFKA-14334: Complete delayed purgatory after replication (#12783)
When a consumer makes a fetch request to a follower (KIP-392), the fetch request will sit in the purgatory until `fetch.max.wait.ms` is reached because the purgatory is not completed after replication. This patch aims to complete the delayed fetch purgatory after successfully replicating from the leader.

Reviewers: Artem Livshits <alivshits@confluent.io>, Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
2022-11-16 14:55:25 +01:00
Luke Chen 026f98f8c4 KAFKA-14337; Correctly remove topicsWithCollisionChars after topic deletion (#12790)
In https://github.com/apache/kafka/pull/11910 , we added a feature to prevent topics with conflicting metrics names from being created. We added a map to store the normalized topic name to the topic names, but we didn't remove it correctly while deleting topics. This PR fixes this bug and add a test.

Reviewers: Igor Soarez <i@soarez.me>, dengziming <dengziming1993@gmail.com>, Jason Gustafson <jason@confluent.io>
2022-10-28 10:11:19 -07:00
Niket 5d6f8d3e37 MINOR: Address flakiness in `KRaftClusterTest::testDescribeQuorumRequestToBrokers` (#12738)
We have seen some errors such as the following:
```
org.opentest4j.AssertionFailedError: expected: not equal but was: <OptionalLong.empty>
Stacktrace
org.opentest4j.AssertionFailedError: expected: not equal but was: <OptionalLong.empty>
	at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:39)
	at org.junit.jupiter.api.AssertNotEquals.failEqual(AssertNotEquals.java:276)
	at org.junit.jupiter.api.AssertNotEquals.assertNotEquals(AssertNotEquals.java:265)
	at org.junit.jupiter.api.AssertNotEquals.assertNotEquals(AssertNotEquals.java:260)
	at org.junit.jupiter.api.Assertions.assertNotEquals(Assertions.java:2815)
	at kafka.server.KRaftClusterTest.$anonfun$testDescribeQuorumRequestToBrokers$5(KRaftClusterTest.scala:818)
	at java.util.ArrayList.forEach(ArrayList.java:1259)
	at kafka.server.KRaftClusterTest.testDescribeQuorumRequestToBrokers(KRaftClusterTest.scala:814)
```
The patch changes some of the assertions to wait longer for the condition to be satisfied.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-10-24 16:31:46 -07:00
Jason Gustafson 3acf1514b7 MINOR: Fix incorrect example in feature command help (#12732)
When using `kafka-features.sh` with the `--feature` parameter, we expect a numeric feature level (e.g. `metadata.version=5`). The help example suggests that we can also use a descriptive version string for `metadata.version` such as `3.3-IV3`, which doesn't work.

Reviewers: David Jacot <djacot@confluent.io>
2022-10-24 16:05:54 -07:00
José Armando García Sancio e8c6a7bb69 MINOR; Fix error message when validating KRaft config (#12717)
The error message reported when advertised.listeners is used in
controller only is confusing. When the KRaft server is configured to
controller only the following must be true:

1. `advertised.listeners` is not set
2. `listeners` contains a listener for every name in `controller.listener.names`
3. `controller.listener.names` contains a name for every listener in `listeners`

Reviewers: Jason Gustafson <jason@confluent.io>, Igor Soarez <i@soarez.me>
2022-10-24 15:27:09 -07:00
Jason Gustafson 5992f5141c MINOR: Add initial property tests for StandardAuthorizer (#12703)
In https://github.com/apache/kafka/pull/12695, we discovered a gap in our testing of `StandardAuthorizer`. We addressed the specific case that was failing, but I think we need to establish a better methodology for testing which incorporates randomized inputs. This patch is a start in that direction. We implement a few basic property tests using jqwik which focus on prefix searching. It catches the case from https://github.com/apache/kafka/pull/12695 prior to the fix. In the future, we can extend this to cover additional operation types, principal matching, etc.

Reviewers: David Arthur <mumrah@gmail.com>
2022-10-24 15:26:53 -07:00
Divij Vaidya 7bd43f5466 KAFKA-14248; Fix flaky test PlaintextAdminIntegrationTest.testCreateTopicsReturnsConfigs (#12669)
The test is failing intermittently because we do not wait for propagation of the altered config (LogRetentionTimeMillisProp) across all brokers before proceeding ahead with the test.

This PR makes the following changes:
1. Wait for propagation of altered configuration to propagate to all brokers.
2. Use the existing `killBroker` utility method which waits for shutdown using `awaitshutdown`.
3. Improve code readability by using `TestUtils.incrementalAlterConfigs` to send alter config requests.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-10-24 14:50:04 -07:00
Chia-Ping Tsai 8176519463 MINOR: Appending value to LIST config should not generate empty string with … (#12503)
Reviewers: dengziming <dengziming1993@gmail.com>, Luke Chen <showuon@gmail.com>
2022-10-24 14:49:42 -07:00
Jason Gustafson 23b04ec121 MINOR: Update security docs for kraft Authorizer configuration (#12673)
Update security documentation to describe how to configure the KRaft `Authorizer` implementation and include a note about principal forwarding.

Additionally, this patch renames `KafkaConfig.Defaults.DefaultPrincipalSerde` to `DefaultPrincipalBuilder` since the former is somewhat misleading.

Reviewers: David Arthur <mumrah@gmail.com>
2022-10-24 14:44:30 -07:00
Luke Chen 5aaf0685a1 MINOR; Add kraft controller log level in log4j prop (#12707)
Add KRaft controller log level entry in log4j.properties, otherwise, the controller.log is empty in KRaft mode.

Reviewers: José Armando García Sancio <jsancio@apache.org>, Ismael Juma <ismael@juma.me.uk>
2022-10-14 11:28:41 -07:00
David Jacot fd64b53e6a KAFKA-14296; Partition leaders are not demoted during kraft controlled shutdown (#12741)
When the `BrokerServer` starts its shutting down process, it transitions to `SHUTTING_DOWN` and sets `isShuttingDown` to `true`. With this state change, the follower state changes are short-cutted. This means that a broker which was serving as leader would remain acting as a leader until controlled shutdown completes. Instead, we want the leader and ISR state to be updated so that requests will return NOT_LEADER and the client can find the new leader.

We missed this case while implementing https://github.com/apache/kafka/pull/12187.

This patch fixes the issue and updates an existing test to ensure that `isShuttingDown` has not effect. We should consider adding integration tests for this as well. We can do this separately.

Reviewers: Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@users.noreply.github.com>, Jason Gustafson <jason@confluent.io>
2022-10-13 17:00:34 -07:00
Niket 7617f1a0c9 KAFKA-14275; KRaft Controllers should crash after failing to apply any metadata record (#12709)
Make all faults in metadata processing on standby controllers be fatal. This is the same behavior-wise as the active controller. This prevents a standby controller from eventually becoming active with incomplete state.

Reviewers: Colin Patrick McCabe <cmccabe@apache.org>, Jason Gustafson <jason@confluent.io>
2022-10-13 11:20:00 -07:00
Huilin Shi e213674615 KAFKA-14225; Fix deadlock caused by lazy val exemptSensor (#12634)
There is a chance to cause deadlock when multiple threads access ClientRequestQuotaManager. In the version Scala 2.12, the lazy val initialization is under the object lock. The deadlock could happen in the following condition:

In thread a, when ClientRequestQuotaManager.exemptSensor is being initialized, it has acquired the object lock and enters the the actual initialization block. The initialization of 'exemptSensor' requires another lock private val lock = new ReentrantReadWriteLock() and it is waiting for this lock.

In thread b, at the same time, ClientQuotaManager.updateQuota() is called and it has already acquired ReentrantReadWriteLock lock by calling lock.writeLock().lock(). And then it executes info(). If this is the first time accessing Logging.logger, which is also a lazy val, it need to wait for the object lock.

The deadlock happens.

Since the lazy val initialization is under the object lock, we should avoid using lazy val if the initialization function holds another lock to prevent holding two locks at the same time which is prone for deadlock. Change to create exemptSensor during ClientRequestQuotaManager initialization with an expiration time of Long.MaxValue to prevent expiration if request quota is not enabled at that time.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-10-11 10:30:56 -07:00
Colin P. McCabe 1780f2660e KAFKA-14265: Prefix ACLs may shadow other prefix ACLs 2022-09-29 09:26:47 -07:00
David Arthur 6174f95d61
MINOR: update configuration.html with KRaft details (#12678) 2022-09-26 10:16:12 -04:00
Colin Patrick McCabe ebf3cfe136 KAFKA-14243: Temporarily disable unsafe downgrade (#12664)
Reviewers: David Arthur <mumrah@gmail.com>
2022-09-20 15:49:25 -04:00
Jason Gustafson 0c08c80afa KAFKA-14240; Validate kraft snapshot state on startup (#12653)
We should prevent the metadata log from initializing in a known bad state. If the log start offset of the first segment is greater than 0, then must be a snapshot an offset greater than or equal to it order to ensure that the initialized state is complete.

Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>
2022-09-19 11:53:51 -07:00
Luke Chen c2c71efddc KAFKA-14233: disable testReloadUpdatedFilesWithoutConfigChange first to fix the build (#12658)
disable testReloadUpdatedFilesWithoutConfigChange first to fix the build

Reviewers: Jason Gustafson <jason@confluent.io>
2022-09-19 12:28:47 +08:00
José Armando García Sancio 74c4bbfaf9 KAFKA-14238; KRaft metadata log should not delete segment past the latest snapshot (#12655)
Disable segment deletion based on size and time by setting the KRaft metadata log's `RetentionMsProp` and `RetentionBytesProp` to `-1`. This will cause `UnifiedLog.deleteRetentionMsBreachedSegments` and `UnifiedLog.deleteRetentionSizeBreachedSegments` to short circuit instead of deleting segments.

Without this changes the included test would fail. This happens because `deleteRetentionMsBreachedSegments` is able to delete past the `logStartOffset`. Deleting past the `logStartOffset` would violate the invariant that if the `logStartOffset` is greater than 0 then there is a snapshot with an end offset greater than or equal to the log start offset.

Reviewers: Luke Chen <showuon@gmail.com>, Jason Gustafson <jason@confluent.io>
2022-09-17 06:32:31 -07:00
Jason Gustafson 85fc267100 KAFKA-14215; Ensure forwarded requests are applied to broker request quota (#12624)
Currently forwarded requests are not applied to any quotas on either the controller or the broker. The controller-side throttling requires the controller to apply the quota changes from the log to the quota managers, which will be done separately. In this patch, we change the response logic on the broker side to also apply the broker's request quota. The enforced throttle time is the maximum of the throttle returned from the controller (which is 0 until we fix the aforementioned issue) and the broker's request throttle time.

Reviewers: David Arthur <mumrah@gmail.com>
2022-09-12 20:51:32 -07:00
David Arthur 89f7f31ac2 KAFKA-14203 Disable snapshot generation on broker after metadata errors (#12596) 2022-09-12 17:01:01 -07:00
David Jacot cfa1f098d6 KAFKA-14201; Consumer should not send group instance ID if committing with empty member ID (server side) (#12598)
The consumer group instance ID is used to support a notion of "static" consumer groups. The idea is to be able to identify the same group instance across restarts so that a rebalance is not needed. However, if the user sets `group.instance.id` in the consumer configuration, but uses "simple" assignment with `assign()`, then the instance ID nevertheless is sent in the OffsetCommit request to the coordinator. This may result in a surprising UNKNOWN_MEMBER_ID error.

This PR attempts to fix this issue for existing consumers by relaxing the validation in this case. One way is to simply ignore the member id and the static id when the generation id is -1. -1 signals that the request comes from either the admin client or a consumer which does not use the group management. This does not apply to transactional offsets commit.

Reviewers: Jason Gustafson <jason@confluent.io>
2022-09-08 14:36:57 -07:00
Manikumar Reddy 015d7aede6 MINOR: Add configurable max receive size for SASL authentication requests
This adds a new configuration `sasl.server.max.receive.size` that sets the maximum receive size for requests before and during authentication.

Reviewers: Tom Bentley <tbentley@redhat.com>, Mickael Maison <mickael.maison@gmail.com>

Co-authored-by: Manikumar Reddy <manikumar.reddy@gmail.com>
Co-authored-by: Mickael Maison <mickael.maison@gmail.com>
2022-09-08 23:44:46 +05:30
Colin Patrick McCabe b2b928338c MINOR: Add more validation during KRPC deserialization
When deserializing KRPC (which is used for RPCs sent to Kafka, Kafka Metadata records, and some
    other things), check that we have at least N bytes remaining before allocating an array of size N.

    Remove DataInputStreamReadable since it was hard to make this class aware of how many bytes were
    remaining. Instead, when reading an individual record in the Raft layer, simply create a
    ByteBufferAccessor with a ByteBuffer containing just the bytes we're interested in.

    Add SimpleArraysMessageTest and ByteBufferAccessorTest. Also add some additional tests in
    RequestResponseTest.

    Reviewers: Tom Bentley <tbentley@redhat.com>, Mickael Maison <mickael.maison@gmail.com>, Colin McCabe <colin@cmccabe.xyz>

    Co-authored-by: Colin McCabe <colin@cmccabe.xyz>
    Co-authored-by: Manikumar Reddy <manikumar.reddy@gmail.com>
    Co-authored-by: Mickael Maison <mickael.maison@gmail.com>
2022-09-08 23:44:46 +05:30
Colin Patrick McCabe 9cd54f5407 KAFKA-14200: kafka-features.sh must exit with non-zero error code on error (#12586)
kafka-features.sh must exit with a non-zero error code on error. We must do this in order to catch
regressions like KAFKA-13990.

Reviewers: David Arthur <mumrah@gmail.com>
2022-09-07 09:04:31 -07:00
Divij Vaidya b7e9fc88ad KAFKA-14122: Fix flaky test DynamicBrokerReconfigurationTest#testKeyStoreAlter (#12452)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2022-08-31 17:01:27 -07:00
Akhilesh C c4fd780bd3 KAFKA-14170: Fix NPE in the deleteTopics() code path of KRaft Controller (#12533)
Fix a bug in ReplicationControlManager where we got a NullPointerException when removing a topic
with no offline replicas, and there were other topics that did have offline replicas.

Fix an issue in MetadataDelta#replay where we were replaying RemoveTopicRecord twice.

Reviewers: Colin P. McCabe <cmccabe@apache.org>, dengziming <dengziming1993@gmail.com>
2022-08-31 14:34:58 -07:00
dengziming 562c30f187 KAFKA-13990: KRaft controller should return right features in ApiVersionResponse (#12294)
Previously, the KRaft controller was incorrectly reporting an empty feature set in
ApiVersionResponse. This was preventing any multi-node clusters from being upgraded via
kafka-features.sh, since they would incorrectly believe that metadata.version was not a supported
feature. This PR adds a regression test for this bug, KRaftClusterTest.testUpdateMetadataVersion.

Reviewers: José Armando García Sancio <jsancio@gmail.com>, Colin P. McCabe <cmccabe@apache.org>
2022-08-31 11:37:15 -07:00
dengziming c1c7f2a9c7 MINOR: Enable testUpdateFeaturesWithForwarding (#12059)
This test was removed in #11667 since UpdateFeatures is not properly handled in KRaft mode, now we can bring it back since UpdateFeatures is properly handled after #12036.

Reviewers: Luke Chen <showuon@gmail.com>
2022-08-31 11:37:06 -07:00
Colin Patrick McCabe 26a884cc30 KAFKA-14187: kafka-features.sh: add support for --metadata (#12571)
This PR adds support to kafka-features.sh for the --metadata flag, as specified in KIP-778.  This
flag makes it possible to upgrade to a new metadata version without consulting a table mapping
version names to short integers. Change --feature to use a key=value format.

FeatureCommandTest.scala: make most tests here true unit tests (that don't start brokers) in order
to improve test run time, and allow us to test more cases. For the integration test part, test both
KRaft and ZK-based clusters. Add support for mocking feature operations in MockAdminClient.java.

upgrade.html: add a section describing how the metadata.version should be upgraded in KRaft
clusters.

Add kraft_upgrade_test.py to test upgrades between KRaft versions.

Reviewers: David Arthur <mumrah@gmail.com>, dengziming <dengziming1993@gmail.com>, José Armando García Sancio <jsancio@gmail.com>
2022-08-30 16:57:24 -07:00