Commit Graph

1815 Commits

Author SHA1 Message Date
A. Sophie Blee-Goldman a1f2ece323
KAFKA-9525: add enforceRebalance method to Consumer API (#8087)
As described in KIP-568.

Waiting on acceptance of the KIP to write the tests, on the off chance something changes. But rest assured, unit tests are coming.

Will also kick off existing Streams system tests which leverage this new API (e.g. version probing, sometimes broker bounce).

Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
2020-02-29 18:44:22 -08:00
Boyang Chen ede07306a7
KAFKA-9620: Do not throw in the middle of consumer user callbacks (#8187)
One way of fixing it forward.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-02-28 12:06:58 -08:00
Boyang Chen 399d18fd8e
HOTFIX: testInitTransactionTimeout should use prepareResponse instead of respond (#8179)
We have seen flaky behavior caused by using #respond instead of #prepareResponse in the txn test.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-02-27 09:05:55 -08:00
Tom Bentley 5216da3de2
MINOR: Consistent terminal period in Errors.defaultExceptionMessage (#3909)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2020-02-25 17:11:17 +00:00
Jason Gustafson 2df7ea5a4a
KAFKA-9530; Fix flaky test `testDescribeGroupWithShortInitializationTimeout` (#8154)
With a short timeout, a call in KafkaAdminClient may time out and the client might disconnect. Currently this can be exposed to the user as either a TimeoutException or a DisconnectException. To be consistent, rather than exposing the underlying retriable error, we handle both cases with a TimeoutException.

Reviewers: Boyang Chen <boyang@confluent.io>, Ismael Juma <ismael@juma.me.uk>
2020-02-23 10:16:32 -08:00
Lucas Bradstreet 1a8dcffe4a
KAFKA-9577; SaslClientAuthenticator incorrectly negotiates SASL_HANDSHAKE version (#8142)
The SaslClientAuthenticator incorrectly negotiates the supported SaslHandshakeRequest version and uses the maximum version supported by the broker whether or not the client supports it.

This bug was exposed by a recent version bump in 0a2569e2b9.

This PR rolls back the recent SaslHandshake[Request,Response] bump, fixes the version negotiation, and adds a test to prevent anyone from accidentally bumping the version without a workaround such as a new ApiKey. The existing key will be difficult to support for clients < 2.5 due to the incorrect negotiation.

Reviewers: Ron Dagostino <rdagostino@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>, Colin P. McCabe <cmccabe@apache.org>, Jason Gustafson <jason@confluent.io>
2020-02-21 21:49:11 -08:00
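
The negotiation fix described in KAFKA-9577 can be illustrated with a minimal sketch: pick the highest version that both sides support, rather than the broker's maximum. The class and method names below are hypothetical, and `IllegalStateException` is only a stand-in for whatever error the real client raises; this is not the code from the PR.

```java
// Hypothetical sketch of API version negotiation; not the SaslClientAuthenticator code.
final class VersionNegotiationSketch {

    // Returns the highest version that lies inside both the client and broker ranges.
    static short negotiate(short clientMin, short clientMax, short brokerMin, short brokerMax) {
        short candidate = (short) Math.min(clientMax, brokerMax);
        if (candidate < clientMin || candidate < brokerMin) {
            // The two ranges do not overlap, so no version is usable.
            throw new IllegalStateException("No overlapping version range");
        }
        return candidate;
    }

    public static void main(String[] args) {
        // Client supports 0..1, broker supports 0..2: the client must not pick 2.
        System.out.println(negotiate((short) 0, (short) 1, (short) 0, (short) 2));
    }
}
```

The bug in the commit message corresponds to returning `brokerMax` unconditionally, which only works while the broker's maximum does not exceed the client's.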
Matthias J. Sax 97d107a270
KAFKA-9441: Add internal TransactionManager (#8105)
Upfront refactoring for KIP-447.

Introduces `StreamsProducer`, which allows sharing a producer across multiple tasks and tracking the TX status.

Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <guozhang@confluent.io>
2020-02-22 06:40:28 +01:00
Agam Brahma 84c4025fdd
KAFKA-9206; Throw KafkaException on CORRUPT_MESSAGE error in Fetch response (#8111)
If a completed fetch has an error code signifying a _corrupt message_, throw a `KafkaException` that notes the fetch offset and the topic-partition.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-02-21 10:03:09 -08:00
Guozhang Wang 3b6573c150
KAFKA-9481: Graceful handling TaskMigrated and TaskCorrupted (#8058)
1. Removed task field from TaskMigrated; the only caller that encodes a task id from StreamTask actually does not throw, so we only log it. To handle it on StreamThread we just always enforce rebalance (and we would call onPartitionsLost to remove all tasks as dirty).

2. Added TaskCorruptedException with a set of task-ids. The first scenario of this is the restoreConsumer.poll which throws InvalidOffset indicating that the logs are truncated / compacted. To handle it on StreamThread we first close the corresponding tasks as dirty (if EOS is enabled we would also wipe out the state stores), and then revive them into the CREATED state.

3. Also fixed a bug while investigating KAFKA-9572: when suspending / closing a restoring task we should not commit the new offsets but only update the checkpoint file.

4. Re-enabled the unit test.
2020-02-20 16:14:45 -08:00
Mickael Maison 8ab0994919
MINOR: Fix a number of warnings in clients test (#8073)
Reviewers: Ismael Juma <ismael@juma.me.uk>, Andrew Choi <li_andchoi@microsoft.com>
2020-02-20 14:54:37 +00:00
Stanislav Kozlovski f51d06712a
MINOR: Add missing @Test annotation to MetadataTest#testMetadataMerge (#8141)
Reviewers: Brian Byrne <bbyrne@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
2020-02-20 11:10:08 +00:00
Jason Gustafson 5a19fe6cd1
KAFKA-9544; Fix flaky test `AdminClientTest.testDefaultApiTimeoutOverride` (#8101)
There is a race condition with the backoff sleep in the test case and setting the next allowed send time in the AdminClient. To fix it, we allow the test case to do the backoff sleep multiple times if needed.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
2020-02-19 09:24:26 -08:00
Sanjana Kaundinya eb7dfef245
KAFKA-9558; Fix retry logic in KafkaAdminClient listOffsets (#8119)
This PR fixes the retry logic for `getListOffsetsCalls`. Previously, if there were partitions with errors, it would only pass in the current call object to retry after a metadata refresh. However, if there's a leader change, the call object never gets updated with the correct leader node to query. This PR fixes this by making another call to `getListOffsetsCalls` with only the error topic partitions as the next calls to be made after the metadata refresh. It also adds a test for the scenario where a leader change occurs.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-02-19 09:11:45 -08:00
Boyang Chen 913c61934e
MINOR: Reduce log level to Trace for fetch offset downgrade (#8093)
Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-02-18 16:00:22 -08:00
Boyang Chen 863b534f83
KAFKA-9535; Update metadata before retrying partitions when fetching offsets (#8088)
Today, if we attempt to list offsets with a fenced leader epoch, the consumer will retry without updating the metadata until the timeout is reached. This affects synchronous APIs such as `offsetsForTimes`, `beginningOffsets`, and `endOffsets`. The fix in this patch is to trigger the metadata update call whenever we see a retriable error before additional attempts.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-02-16 12:06:33 -08:00
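
The behaviour described in KAFKA-9535 (request a metadata update after a retriable error, before trying again) can be sketched as a small retry loop. Everything here is an assumption made for illustration: `RetriableError`, `fetchWithRefresh`, and the attempt limit are hypothetical, and the real consumer bounds retries by a timeout rather than by an attempt count.

```java
import java.util.function.Supplier;

// Hypothetical sketch of "refresh metadata before retrying"; not the consumer's Fetcher code.
final class RetryWithMetadataRefreshSketch {

    // Stand-in for a retriable client error such as a fenced leader epoch.
    static final class RetriableError extends RuntimeException {
        RetriableError(String message) {
            super(message);
        }
    }

    // Runs the attempt; after each retriable failure, requests a metadata update and retries.
    static <T> T fetchWithRefresh(Supplier<T> attempt, Runnable requestMetadataUpdate, int maxAttempts) {
        RetriableError last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return attempt.get();
            } catch (RetriableError e) {
                last = e;
                // Without this call, every retry would use the same stale leader information.
                requestMetadataUpdate.run();
            }
        }
        throw new IllegalStateException("Gave up after " + maxAttempts + " attempts", last);
    }
}
```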
Bob Barrett 937f1f741c
KAFKA-8805; Bump producer epoch on recoverable errors (#7389)
This change is the client-side part of KIP-360. It identifies cases where it is safe to abort a transaction, bump the producer epoch, and allow the application to continue without closing the producer. In these cases, when KafkaProducer.abortTransaction() is called, the producer sends an InitProducerId following the transaction abort, which causes the producer epoch to be bumped. The application can then start a new transaction and continue processing.

For recoverable errors in the idempotent producer, the epoch is bumped locally. In-flight requests for partitions with an error are rewritten to reflect the new epoch, and in-flights of all other partitions are allowed to complete using the old epoch. 

Reviewers: Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
2020-02-15 22:47:10 -08:00
Guozhang Wang d8756e81c5
KAFKA-9274: Gracefully handle timeout exception (#8060)
1. Delay the initialization (producer.initTxn) from construction to maybeInitialize; if it times out we just swallow and retry in the next iteration.

2. If completeRestoration (consumer.committed) times out, just swallow and retry in the next iteration.

3. For other calls (producer.partitionsFor, producer.commitTxn, consumer.commit), treat the timeout exception as fatal.

Reviewers: Matthias J. Sax <matthias@confluent.io>
2020-02-14 17:28:14 -08:00
Xavier Léauté 7e1c39f75a
KAFKA-9106 make metrics exposed via jmx configurable (#7674)
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Rajini Sivaram <rajinisivaram@googlemail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
2020-02-13 10:21:14 -08:00
Stanislav Kozlovski ea72edebf2
MINOR: Do not override retries for idempotent producers (#8097)
The KafkaProducer code would set infinite retries (MAX_INT) if the producer was configured with idempotence and no retries were configured by the user. This is superfluous because KIP-91 made the retry functionality time-based and changed the default retries config to MAX_INT.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-02-12 23:42:02 -08:00
Boyang Chen 07db26c20f
KAFKA-9417: New Integration Test for KIP-447 (#8000)
This change mainly has 2 components:

1. extend the existing transactions_test.py to also try out new sendTxnOffsets(groupMetadata) API to make sure we are not introducing any regression or compatibility issue
  a. We shrink the time window to 10 seconds for the txn timeout scheduler on the broker so that we can trigger expiration sooner rather than later

2. create a completely new system test class called group_mode_transactions_test which is more complicated than the existing system test, as we are taking rebalance into consideration and using multiple partitions instead of one. For further breakdown:
  a. The message count was done on partition level, instead of global, as we need to visualize the per partition order throughout the test. For this purpose, we extend ConsoleConsumer to print out the data partition as well to help the message copier interpret the per partition data.
  b. The progress count includes the time for completing the pending txn offset expiration
  c. More visibility and feature improvements on TransactionMessageCopier to better work under either standalone or group mode.

Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
2020-02-12 12:34:12 -08:00
Matthias J. Sax aa0d0ec32a
KAFKA-6607: Commit correct offsets for transactional input data (#8091)
Reviewers: Guozhang Wang <guozhang@confluent.io>
2020-02-12 12:19:34 -08:00
Jason Gustafson 0a5dec0b3a
MINOR: Fix unnecessary metadata fetch before group assignment (#8095)
The recent increase in the flakiness of one of the offset reset tests (KAFKA-9538) traces back to https://github.com/apache/kafka/pull/7941. After investigation, we found that following this patch, the consumer was sending an additional metadata request prior to performing the group assignment. This slight timing difference was enough to trigger the test failures. The problem turned out to be due to a bug in `SubscriptionState.groupSubscribe`, which no longer counted the local subscription when determining if there were new topics to fetch metadata for. Hence the extra metadata update. This patch restores the old logic.

Without the fix, we saw 30-50% test failures locally. With it, I could no longer reproduce the failure. However, #6561 is probably still needed to improve the resilience of this test.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
2020-02-12 11:45:06 -08:00
Guozhang Wang e70e5d913a
KAFKA-9505: Only loop over topics-to-validate in retries (#8039)
Found this bug from the repeated flaky runs of system tests. It seems to have been lurking for a long time, but it only happens when there are frequent rebalances / topic creations within a short time, which is exactly the case in some of our smoke system tests.

Also added a unit test.

Reviewers: Boyang Chen <boyang@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>, Matthias J. Sax <matthias@confluent.io>
2020-02-10 12:59:14 -08:00
Manikumar Reddy 41fdae35df
MINOR: Update schema field names in DescribeAcls Request/Response
Author: Manikumar Reddy <manikumar.reddy@gmail.com>

Reviewers: Ismael Juma <ismael@juma.me.uk>, Colin Patrick McCabe <cmccabe@apache.org>

Closes #8075 from omkreddy/KAFKA-9026-Fix
2020-02-11 00:41:48 +05:30
Sönke Liebau 3b1c61385b
KAFKA-9423: Refine layout of configuration options on website and make individual settings directly linkable (#7955)
Reviewers: Mickael Maison <mickael.maison@gmail.com>
2020-02-10 18:05:17 +00:00
Brian Byrne 0f8698a329
KAFKA-8904: Improve producer's topic metadata fetching. (#7781)
When the producer encounters new topic(s), it now only fetches the metadata for the new topics. For cases where a producer interacts with a lot of topics, this reduces the cost for the topic being evicted from the cache, and during startup when populating the topic cache.

Additionally adds a new producer configuration variable 'metadata.max.idle.ms', which controls how long topic metadata may be idle (i.e. not produced to) before it's finally discarded from the metadata cache.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, dengziming <dengziming1993@gmail.com>
2020-02-10 14:54:04 +00:00
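
To illustrate what an idle limit such as 'metadata.max.idle.ms' means in practice, here is a minimal sketch of a cache that drops topics not produced to within the limit. The class, its methods, and the use of plain timestamps are hypothetical; the producer's actual metadata cache is structured differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of idle-based eviction; not the producer's metadata cache.
final class IdleTopicCacheSketch {
    private final long maxIdleMs;
    private final Map<String, Long> lastProducedMs = new HashMap<>();

    IdleTopicCacheSketch(long maxIdleMs) {
        this.maxIdleMs = maxIdleMs;
    }

    // Records that a record was produced to the topic at the given time.
    void recordProduce(String topic, long nowMs) {
        lastProducedMs.put(topic, nowMs);
    }

    // Removes every topic whose last produce time is older than the idle limit.
    void evictIdle(long nowMs) {
        lastProducedMs.entrySet().removeIf(entry -> nowMs - entry.getValue() > maxIdleMs);
    }

    boolean contains(String topic) {
        return lastProducedMs.containsKey(topic);
    }
}
```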
Ron Dagostino 342f13a838
KAFKA-8843: KIP-515: Zookeeper TLS support
Signed-off-by: Ron Dagostino <rdagostino@confluent.io>

Author: Ron Dagostino <rdagostino@confluent.io>

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>

Closes #8003 from rondagostino/KAFKA-8843
2020-02-08 21:16:48 +05:30
Kun Song 87eaa5396d
MINOR: Simplify KafkaProducerTest (#8044)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Ron Dagostino <rndgstn@gmail.com>
2020-02-08 14:14:21 +00:00
David Mao 7a2a198d1e
KAFKA-9507; AdminClient should check for missing committed offsets (#8057)
Addresses exception being thrown by `AdminClient` when `listConsumerGroupOffsets` returns a negative offset. A negative offset indicates the absence of a committed offset for a requested partition, and should result in a null in the returned offset map.

Reviewers: Anna Povzner <anna@confluent.io>, Jason Gustafson <jason@confluent.io>
2020-02-07 16:43:51 -08:00
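
The mapping described in KAFKA-9507 (a negative offset means "no committed offset" and becomes a null entry) can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and `String` keys and `Long` values stand in for the topic-partition and offset-metadata types the real `AdminClient` returns.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; String keys stand in for topic-partitions.
final class CommittedOffsetsSketch {

    // Copies each offset, replacing negative values (no committed offset) with null.
    static Map<String, Long> toCommittedOffsets(Map<String, Long> rawOffsets) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> entry : rawOffsets.entrySet()) {
            long offset = entry.getValue();
            result.put(entry.getKey(), offset < 0 ? null : Long.valueOf(offset));
        }
        return result;
    }
}
```

Callers of such a map need to distinguish a key that maps to null from a key that is absent, for example with `containsKey`.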
Joel Hamill 83e1a8d71c
DOCS - clarify transactionalID and idempotent behavior (#7821)
If transactional.id is set without setting enable.idempotence, the producer will set enable.idempotence to true implicitly. The docs should reflect this.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-02-07 13:08:25 -08:00
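
The documented behaviour can be sketched as a small resolution function: an explicit `enable.idempotence` wins, otherwise setting `transactional.id` implies idempotence. The class and method are hypothetical, the default is passed in as a parameter rather than assumed, and any validation the real producer performs on conflicting settings is left out.

```java
import java.util.Properties;

// Hypothetical sketch of how the effective idempotence setting is derived.
final class IdempotenceDefaultSketch {

    static boolean resolveIdempotence(Properties props, boolean defaultValue) {
        String explicit = props.getProperty("enable.idempotence");
        if (explicit != null) {
            // The user set the value explicitly; use it as given.
            return Boolean.parseBoolean(explicit);
        }
        if (props.getProperty("transactional.id") != null) {
            // transactional.id without enable.idempotence implies idempotence.
            return true;
        }
        return defaultValue;
    }
}
```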
Boyang Chen 9d17bf98b6
KAFKA-9447: Add new customized EOS model example (#8031)
With the improvement of KIP-447, we are now offering developers a better experience writing their customized EOS apps with group subscription, instead of manual assignments. With the demo, users should be able to get started more quickly on writing their own EOS app, and understand the processing logic much better.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
2020-02-05 16:51:07 -08:00
Jason Gustafson ae0c6e58e5
KAFKA-9261; Client should handle unavailable leader metadata (#7770)
The client caches metadata fetched from Metadata requests. Previously, each metadata response overwrote all of the metadata from the previous one, so we could rely on the expectation that the broker only returned the leaderId for a partition if it had connection information available. This behavior changed with KIP-320 since having the leader epoch allows the client to filter out partition metadata which is known to be stale. However, because of this, we can no longer rely on the request-level guarantee of leader availability. There is no mechanism similar to the leader epoch to track the staleness of broker metadata, so we still overwrite all of the broker metadata from each response, which means that the partition metadata can get out of sync with the broker metadata in the client's cache. Hence it is no longer safe to validate inside the `Cluster` constructor that each leader has an associated `Node`.

Fixing this issue was unfortunately not straightforward because the cache was built to maintain references to broker metadata through the `Node` object at the partition level. In order to keep the state consistent, each `Node` reference would need to be updated based on the new broker metadata. Instead of doing that, this patch changes the cache so that it is structured more closely with the Metadata response schema. Broker node information is maintained at the top level in a single collection and cached partition metadata only references the id of the broker. To accommodate this, we have removed `PartitionInfoAndEpoch` and we have altered `MetadataResponse.PartitionMetadata` to eliminate its `Node` references.

Note that one of the side benefits of the refactor here is that we virtually eliminate one of the hotspots in Metadata request handling in `MetadataCache.getEndpoints` (which was renamed to `maybeFilterAliveReplicas`). The only reason this was expensive was because we had to build a new collection for the `Node` representations of each of the replica lists. This information was doomed to just get discarded on serialization, so the whole effort was wasteful. Now, we work with the lower level id lists and no copy of the replicas is needed (at least for all versions other than 0).

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Ismael Juma <ismael@juma.me.uk>
2020-02-05 09:13:11 -08:00
David Jacot 5db02ead60
MINOR: Fix typos introduced in KIP-559 (#8042)
A few references to KIP-559 in the schema definitions needed to be fixed.

Reviewers: Brajesh Kumar <bristy@users.noreply.github.com>, Ron Dagostino <rdagostino@confluent.io>, Jason Gustafson <jason@confluent.io>
2020-02-05 08:23:57 -08:00
Guozhang Wang 4090f9a2b0
KAFKA-9113: Clean up task management and state management (#7997)
This PR is a collaboration between Guozhang Wang and John Roesler. It is a significant tech debt cleanup on task management and state management, and is broken down into several sub-tasks listed below:

Extract embedded clients (producer and consumer) into RecordCollector from StreamTask.
guozhangwang#2
guozhangwang#5

Consolidate the standby updating and active restoring logic into ChangelogReader and extract out of StreamThread.
guozhangwang#3
guozhangwang#4

Introduce Task state life cycle (created, restoring, running, suspended, closing), and refactor the task operations based on the current state.
guozhangwang#6
guozhangwang#7

Consolidate AssignedTasks into TaskManager and simplify the logic of changelog management and task management (since they are already moved in step 2) and 3)).
guozhangwang#8
guozhangwang#9

Also simplified the StreamThread logic a bit as the embedded clients / changelog restoration logic has been moved into step 1) and 2).
guozhangwang#10

Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bruno Cadonna <bruno@confluent.io>, Boyang Chen <boyang@confluent.io>
2020-02-04 21:06:39 -08:00
Colin Patrick McCabe a16dfe6739
MINOR: fix checkstyle issue in ConsumerConfig.java (#8038)
Reviewers: Ismael Juma <ismael@juma.me.uk>
2020-02-04 12:38:04 -08:00
Alexandra Rodoni 7748fc2fc6
KAFKA-9477 Document RoundRobinAssignor as an option for partition.assignment.strategy (#8007)
Reviewers: Colin P. McCabe <cmccabe@apache.org>
2020-02-04 09:44:12 -08:00
Rajini Sivaram 281ed90cd8
KAFKA-9492; Ignore record errors in ProduceResponse for older versions (#8030)
Fixes NPE in brokers when processing record errors in produce response for older versions.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
2020-02-04 17:01:08 +00:00
Ismael Juma 738e14edb8
KAFKA-9027, KAFKA-9028: Convert create/delete acls requests/response to use generated protocol (#7725)
Also add support for flexible versions to both protocol types.

Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Colin Patrick McCabe <cmccabe@apache.org>

Co-authored-by: Rajini Sivaram <rajinisivaram@googlemail.com>
Co-authored-by: Jason Gustafson <jason@confluent.io>
2020-02-03 07:12:00 -08:00
David Jacot 96c4ce4803
KAFKA-9437; Make the Kafka Protocol Friendlier with L7 Proxies [KIP-559] (#7994)
This PR implements the KIP-559: https://cwiki.apache.org/confluence/display/KAFKA/KIP-559%3A+Make+the+Kafka+Protocol+Friendlier+with+L7+Proxies
- it adds the Protocol Type and the Protocol Name fields in JoinGroup and SyncGroup API;
- it validates that the fields are provided by the client when the new version of the API is used and ensures that they are consistent; it errors out otherwise;
- it validates that the fields are consistent in the client and errors out otherwise;
- it adds many tests related to the API changes but also extends the testing coverage of the requests/responses themselves;
- it standardises the naming in the coordinator: `ProtocolType` and `ProtocolName` are now used across the board in the coordinator instead of having a mix of protocol type, protocol name, subprotocol, protocol, etc.

Reviewers: Jason Gustafson <jason@confluent.io>
2020-01-31 13:54:07 -08:00
Karan Kumar c8d97c6d51
KAFKA-9375: Add names to all Connect threads (#7901)
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Ryanne Dolan <ryannedolan@gmail.com>, gcsaba2
2020-01-31 18:21:21 +00:00
Jason Gustafson 4317325fbc
KAFKA-8503; Add default api timeout to AdminClient (KIP-533) (#8011)
This PR implements `default.api.timeout.ms` as documented by KIP-533. This is a rebased version of #6913 with some additional test cases and small cleanups.

Reviewers: David Arthur <mumrah@gmail.com>

Co-authored-by: huxi <huxi_2b@hotmail.com>
2020-01-30 22:48:51 -08:00
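
As a rough illustration of how the new setting might be supplied, the sketch below builds admin client properties with `default.api.timeout.ms`. The class name, the bootstrap address, and the timeout values are assumptions chosen for the example, not recommendations; the properties are only constructed here, so the snippet has no Kafka dependency.

```java
import java.util.Properties;

// Hypothetical configuration sketch; values are illustrative assumptions.
final class AdminTimeoutConfigSketch {

    static Properties adminProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("default.api.timeout.ms", "60000");     // assumed overall per-API timeout
        props.setProperty("request.timeout.ms", "30000");         // assumed per-request timeout
        return props;
    }

    public static void main(String[] args) {
        // In an application these properties would be handed to the admin client.
        adminProps().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```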
Edoardo Comar d37d95b359
KAFKA-8162: IBM JDK Class not found error when handling SASL (#6524)
Attempt to load multiple IBM classes but fall back to loading the Sun class if the IBM one is not found.

Reviewers: Mickael Maison <mickael.maison@gmail.com>, Ismael Juma <ismael@juma.me.uk>
2020-01-30 17:46:03 +00:00
Brian Byrne 57cef765f5
KAFKA-9474: Adds 'float64' to the RPC protocol types (#8012)
Reviewers: Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
2020-01-30 04:54:27 -08:00
Ismael Juma bd5a1c4d36
KAFKA-4203: Align broker default for max.message.bytes with Java producer default (#4154)
Also: Improve error message, Add test, Minor code quality fixes
Verified that the test fails if the broker default for max message bytes is lower or higher than the currently set value.

Reviewers: Andrew Choi <andchoi@linkedin.com>, Viktor Somogyi <viktorsomogyi@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
2020-01-29 13:03:35 -08:00
belugabehr b4d7560b4f
KAFKA-9426: Use switch instead of chained if/else in OffsetsForLeaderEpochClient (#7959)
Reviewers: Ismael Juma <ismael@juma.uk>
2020-01-29 13:03:17 -08:00
belugabehr aecd3936a3
KAFKA-9405: Use Map.computeIfAbsent where applicable (#7937)
Reviewers: Ismael Juma <ismael@juma.me.uk>
2020-01-29 13:00:22 -08:00
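
For readers unfamiliar with the idiom, the kind of change KAFKA-9405 describes looks roughly like this. The class, method names, and the topic-to-partitions map are made up for illustration and are not taken from the PR; both methods produce the same result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative before/after for Map.computeIfAbsent; not code from the PR.
final class ComputeIfAbsentSketch {

    // Explicit get / null check / put.
    static void addWithNullCheck(Map<String, List<Integer>> byTopic, String topic, int partition) {
        List<Integer> partitions = byTopic.get(topic);
        if (partitions == null) {
            partitions = new ArrayList<>();
            byTopic.put(topic, partitions);
        }
        partitions.add(partition);
    }

    // Same behaviour using computeIfAbsent, which creates the list only when the key is missing.
    static void addWithComputeIfAbsent(Map<String, List<Integer>> byTopic, String topic, int partition) {
        byTopic.computeIfAbsent(topic, key -> new ArrayList<>()).add(partition);
    }
}
```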
Mickael Maison 40b35178e8
KAFKA-9026: Use automatic RPC generation in DescribeAcls (#7560)
Reviewers: Ismael Juma <ismael@juma.me.uk>
2020-01-29 12:45:15 -08:00
Nikolay 172409c44b
KAFKA-9460: Enable only TLSv1.2 by default and disable other TLS protocol versions (KIP-553) (#7998)
Reviewers: Ron Dagostino <rndgstn@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
2020-01-28 18:57:23 +00:00
Ron Dagostino a3509c0870
MINOR: MiniKdc JVM shutdown hook fix (#7946)
Also made all shutdown hooks consistent and added tests

Reviewers: Ismael Juma <ismael@juma.me.uk>, Rajini Sivaram <rajinisivaram@googlemail.com>
2020-01-24 22:21:12 +00:00
Rajini Sivaram a565d1a182
KAFKA-9181; Maintain clean separation between local and group subscriptions in consumer's SubscriptionState (#7941)
Reviewers: Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
2020-01-24 10:38:21 +00:00