From c2a8b86117ede2ffda4cc4a8800b46f65ef9922d Mon Sep 17 00:00:00 2001
From: Eno Thereska
Date: Wed, 19 Oct 2016 14:21:53 -0700
Subject: [PATCH] MINOR: Added more basic concepts to the documentation

Author: Eno Thereska

Reviewers: Michael G. Noll, Matthias J. Sax, Guozhang Wang

Closes #2030 from enothereska/minor-kip63-docs
---
 docs/streams.html | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/docs/streams.html b/docs/streams.html
index 9c21ec4d7a6..74620ec8cc7 100644
--- a/docs/streams.html
+++ b/docs/streams.html
@@ -51,7 +51,7 @@ We first summarize the key concepts of Kafka Streams.
@@ -74,8 +74,11 @@ Common notions of time in streams are:
-
+

+The choice between event-time and ingestion-time is actually made through the configuration of Kafka (not Kafka Streams): from Kafka 0.10.x onwards, timestamps are automatically embedded into Kafka messages. Depending on Kafka's configuration, these timestamps represent either event-time or ingestion-time. The respective Kafka configuration setting can be specified on the broker level or per topic. The default timestamp extractor in Kafka Streams will retrieve these embedded timestamps as-is. Hence, the effective time semantics of your application depend on the effective Kafka configuration for these embedded timestamps.

 Kafka Streams assigns a timestamp to every data record via the TimestampExtractor interface.
@@ -87,6 +90,15 @@ per-record timestamps describe the progress of a stream with regards to time (al
 are leveraged by time-dependent operations such as joins.
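For instance, a custom extractor can derive event-time from the payload and fall back to the timestamp embedded in the Kafka message. A minimal sketch, assuming the 0.10.x-era single-method TimestampExtractor interface (the class name and the assumption that the payload is an epoch-millis value are illustrative, not from the original docs):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.streams.processor.TimestampExtractor;

    // Derive event-time from the payload when possible; otherwise fall back
    // to the timestamp embedded in the Kafka message itself.
    public class PayloadTimestampExtractor implements TimestampExtractor {
        @Override
        public long extract(ConsumerRecord<Object, Object> record) {
            Object value = record.value();
            if (value instanceof Long) {       // assumption: payload carries epoch-millis
                return (Long) value;
            }
            return record.timestamp();         // embedded event-time or ingestion-time
        }
    }

Such an extractor would then be registered through the timestamp.extractor setting (StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG).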

+

+ Finally, whenever a Kafka Streams application writes records to Kafka, it will also assign timestamps to these new records. The way the timestamps are assigned depends on the context:

+  • When new output records are generated via processing some input record, for example, context.forward() triggered in the process() function call, output record timestamps are inherited from input record timestamps directly.
+  • When new output records are generated via periodic functions such as punctuate(), the output record timestamp is defined as the current internal time (obtained through context.timestamp()) of the stream task.
+  • For aggregations, the timestamp of a resulting aggregate update record will be that of the latest arrived input record that triggered the update.
+
States

@@ -102,6 +114,9 @@ Every task in Kafka Streams embeds one or more state stores that can be accessed
 These state stores can either be a persistent key-value store, an in-memory hashmap, or another convenient data structure.
 Kafka Streams offers fault-tolerance and automatic recovery for local state stores.
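As an illustration, a persistent key-value store can be declared through the Stores factory. A minimal sketch, assuming the 0.10.x-era API (the store name "Counts" and the serde choices are illustrative):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.processor.StateStoreSupplier;
    import org.apache.kafka.streams.state.Stores;

    // A persistent key-value store; a changelog topic backs it, which is what
    // enables the fault-tolerance and automatic recovery described above.
    StateStoreSupplier countStore = Stores.create("Counts")
        .withKeys(Serdes.String())
        .withValues(Serdes.Long())
        .persistent()
        .build();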

+

+ Kafka Streams allows direct read-only queries of the state stores by methods, threads, processes, or applications external to the stream processing application that created the state stores. This is provided through a feature called Interactive Queries. All stores are named, and Interactive Queries exposes only the read operations of the underlying implementation.
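A minimal sketch of such a query, assuming a running KafkaStreams instance `streams` and a key-value store named "view-counts" materialized by the topology (both names are illustrative):

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    // Only the read interface of the underlying store is exposed to the caller.
    ReadOnlyKeyValueStore<String, Long> store =
        streams.store("view-counts", QueryableStoreTypes.keyValueStore());
    Long count = store.get("alice");   // read-only point lookup; no put()/delete()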


 As we have mentioned above, the computational logic of a Kafka Streams application is defined as a processor topology.
@@ -246,7 +261,12 @@ In the next section we present another way to build the processor topology: the
 To build a processor topology using the Streams DSL, developers can apply the KStreamBuilder class, which is extended from the TopologyBuilder.
 A simple example is included with the source code for Kafka in the streams/examples package. The rest of this section will walk
 through some code to demonstrate the key steps in creating a topology using the Streams DSL, but we recommend developers to read the full example source
-codes for details.
+codes for details.
+

KStream and KTable
+The DSL uses two main abstractions. A KStream is an abstraction of a record stream, where each data record represents a self-contained datum in the unbounded data set. A KTable is an abstraction of a changelog stream, where each data record represents an update. More precisely, the value in a data record is considered an update of the last value for the same record key, if any (if a corresponding key doesn't exist yet, the update is considered a create). To illustrate the difference between a KStream and a KTable, imagine the following two data records being sent to the stream: ("alice", 1) --> ("alice", 3). If these records were read as a KStream and the stream processing application summed the values, it would return 4. If they were read as a KTable, the result would be 3, since the later record is treated as an update of the earlier one. A sketch of this contrast follows.
+
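A minimal sketch of the contrast, assuming the 0.10.1-era KStreamBuilder API with default serdes; the topic and store names ("user-scores", "scores-store", "sum-store") are illustrative:

    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;
    import org.apache.kafka.streams.kstream.KTable;

    KStreamBuilder builder = new KStreamBuilder();

    // As a record stream: ("alice", 1) and ("alice", 3) are independent facts,
    // so summing per key yields 4 for "alice".
    KStream<String, Integer> asStream = builder.stream("user-scores");
    KTable<String, Integer> sums = asStream
        .groupByKey()
        .reduce((v1, v2) -> v1 + v2, "sum-store");

    // As a changelog stream: ("alice", 3) overwrites ("alice", 1),
    // so the latest value for "alice" is 3.
    KTable<String, Integer> asTable = builder.table("user-scores", "scores-store");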
Create Source Streams from Kafka
@@ -263,6 +283,25 @@ from a single topic).
 KTable<String, GenericRecord> source2 = builder.table("topic3", "stateStoreName");
+
Windowing a stream
+A stream processor may need to divide data records into time buckets, i.e., to window the stream by time. This is usually needed for join and aggregation operations. Kafka Streams currently defines the following types of windows (a windowed count sketch follows the list):
    +
+  • Hopping time windows are windows based on time intervals. They model fixed-sized, (possibly) overlapping windows. A hopping window is defined by two properties: the window's size and its advance interval (aka "hop"). The advance interval specifies by how much a window moves forward relative to the previous one. For example, you can configure a hopping window with a size of 5 minutes and an advance interval of 1 minute. Since hopping windows can overlap, a data record may belong to more than one such window.
+  • Tumbling time windows are a special case of hopping time windows and, like the latter, are windows based on time intervals. They model fixed-size, non-overlapping, gap-less windows. A tumbling window is defined by a single property: the window's size. A tumbling window is a hopping window whose window size is equal to its advance interval. Since tumbling windows never overlap, a data record will belong to one and only one window.
+  • Sliding windows model a fixed-size window that slides continuously over the time axis; here, two data records are said to be included in the same window if the difference of their timestamps is within the window size. Thus, sliding windows are not aligned to the epoch, but to the data record timestamps. In Kafka Streams, sliding windows are used only for join operations, and can be specified through the JoinWindows class.
+ +
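As promised above, a minimal sketch of a windowed count, assuming the 0.10.1-era DSL (window sizes in milliseconds; the topic and store names are illustrative). Dropping advanceBy() would turn the hopping window into a tumbling window:

    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.kstream.Windowed;

    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> views = builder.stream("page-views");

    // Count page views per user over 5-minute windows advancing by 1 minute;
    // since the windows overlap, each record falls into five of them.
    KTable<Windowed<String>, Long> counts = views
        .groupByKey()
        .count(TimeWindows.of(TimeUnit.MINUTES.toMillis(5))
                          .advanceBy(TimeUnit.MINUTES.toMillis(1)),
               "view-counts");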
Joins
+A join operation merges two streams based on the keys of their data records, and yields a new stream. A join over record streams usually needs to be performed on a windowing basis, because otherwise the number of records that must be maintained for performing the join may grow indefinitely. In Kafka Streams, you may perform the following join operations (a join sketch follows this list):
    +
+  • KStream-to-KStream Joins are always windowed joins, since otherwise the memory and state required to compute the join would grow infinitely in size. Here, a newly received record from one of the streams is joined with the other stream's records within the specified window interval to produce one result for each matching pair, based on a user-provided ValueJoiner. A new KStream instance representing the result stream of the join is returned from this operator.
+  • KTable-to-KTable Joins are join operations designed to be consistent with their counterparts in relational databases. Here, both changelog streams are first materialized into local state stores. When a new record is received from one of the streams, it is joined with the other stream's materialized state store to produce one result for each matching pair, based on a user-provided ValueJoiner. A new KTable instance representing the result stream of the join, which is also a changelog stream of the represented table, is returned from this operator.
+  • KStream-to-KTable Joins allow you to perform table lookups against a changelog stream (KTable) upon receiving a new record from another record stream (KStream). An example use case would be to enrich a stream of user activities (KStream) with the latest user profile information (KTable). Only records received from the record stream will trigger the join and produce results via the ValueJoiner, not vice versa (i.e., records received from the changelog stream will be used only to update the materialized state store). A new KStream instance representing the result stream of the join is returned from this operator.
+
+Depending on the operands, the following join operations are supported: inner joins, outer joins, and left joins. Their semantics are similar to the corresponding operators in relational databases.
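A minimal sketch of the third case, a KStream-to-KTable left join that enriches an activity stream with profile data, assuming default serdes; the topic and store names are illustrative:

    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;
    import org.apache.kafka.streams.kstream.KTable;

    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> activities = builder.stream("user-activities");
    KTable<String, String> profiles = builder.table("user-profiles", "profiles-store");

    // Each activity is joined with the latest profile for the same user key.
    // Because this is a left join, activities without a matching profile
    // still produce a result (with a null profile passed to the joiner).
    KStream<String, String> enriched = activities.leftJoin(profiles,
        (activity, profile) -> activity + " / " + (profile == null ? "no-profile" : profile));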
Transform a stream