KAFKA-2665: Add images to code github
…art of the code github

Author: Gwen Shapira <cshapi@gmail.com>

Reviewers: Guozhang Wang

Closes #325 from gwenshap/KAFKA-2665
@@ -306,7 +306,7 @@ This functionality is inspired by one of LinkedIn's oldest and most successful p
Here is a high-level picture that shows the logical structure of a Kafka log with the offset for each message.
<p>
<img src="/images/log_cleaner_anatomy.png">
|
||||
<img src="images/log_cleaner_anatomy.png">
<p>
The head of the log is identical to a traditional Kafka log. It has dense, sequential offsets and retains all messages. Log compaction adds an option for handling the tail of the log. The picture above shows a log with a compacted tail. Note that the messages in the tail of the log retain the original offset assigned when they were first written—that never changes. Note also that all offsets remain valid positions in the log, even if the message with that offset has been compacted away; in this case this position is indistinguishable from the next highest offset that does appear in the log. For example, in the picture above the offsets 36, 37, and 38 are all equivalent positions and a read beginning at any of these offsets would return a message set beginning with 38.
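To make that read behavior concrete, here is a hedged sketch using the Java consumer's seek API; the broker address, topic name, and partition are placeholders, and it assumes a log shaped like the picture above, so the first record returned carries offset 38:
<pre>
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CompactedReadSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("compacted-topic", 0); // hypothetical topic
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 36); // offset 36 was compacted away in the picture above
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1)))
                System.out.println(record.offset()); // first offset printed is 38
        }
    }
}
</pre>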
<p>
@@ -314,7 +314,7 @@ Compaction also allows for deletes. A message with a key and a null payload will
<p>
The compaction is done in the background by periodically recopying log segments. Cleaning does not block reads and can be throttled to use no more than a configurable amount of I/O throughput to avoid impacting producers and consumers. The actual process of compacting a log segment looks something like this:
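Compaction is enabled per topic; as a minimal sketch using the current Java admin client (the topic name and broker address are placeholders), you can create a compacted topic like this. The cleaner's I/O can be bounded on the broker with the log.cleaner.io.max.bytes.per.second setting:
<pre>
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        try (Admin admin = Admin.create(props)) {
            // one partition, replication factor 1; cleanup.policy=compact turns on compaction
            NewTopic topic = new NewTopic("compacted-topic", 1, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
</pre>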
<p>
<img src="/images/log_compaction.png">
|
||||
<img src="images/log_compaction.png">
<p>
<h4>What guarantees does log compaction provide?</h4>
[10 binary image files added; sizes: 26 KiB, 131 KiB, 33 KiB, 38 KiB, 19 KiB, 18 KiB, 40 KiB, 17 KiB, 8.5 KiB, 81 KiB]
@@ -191,7 +191,7 @@ payload : n bytes
<p>
The use of the message offset as the message id is unusual. Our original idea was to use a GUID generated by the producer, and maintain a mapping from GUID to offset on each broker. But since a consumer must maintain an ID for each server, the global uniqueness of the GUID provides no value. Furthermore, the complexity of maintaining the mapping from a random id to an offset requires a heavyweight index structure which must be synchronized with disk, essentially requiring a full persistent random-access data structure. Thus, to simplify the lookup structure we decided to use a simple per-partition atomic counter which could be coupled with the partition id and node id to uniquely identify a message; this makes the lookup structure simpler, though multiple seeks per consumer request are still likely. However, once we settled on a counter, the jump to directly using the offset seemed natural—both, after all, are monotonically increasing integers unique to a partition. Since the offset is hidden from the consumer API this decision is ultimately an implementation detail and we went with the more efficient approach.
</p>
<img src="../images/kafka_log.png">
|
||||
<img src="images/kafka_log.png">
<h4>Writes</h4>
<p>
The log allows serial appends which always go to the last file. This file is rolled over to a fresh file when it reaches a configurable size (say 1GB). The log takes two configuration parameters: <i>M</i>, which gives the number of messages to write before forcing the OS to flush the file to disk, and <i>S</i>, which gives the number of seconds after which a flush is forced. This gives a durability guarantee of losing at most <i>M</i> messages or <i>S</i> seconds of data in the event of a system crash.
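As an illustrative sketch (names and structure are hypothetical, not Kafka's actual internals), the <i>M</i>/<i>S</i> flush decision could look like this, with both thresholds checked on every append:
<pre>
import java.io.IOException;
import java.nio.channels.FileChannel;

// Hypothetical sketch of the M-messages / S-seconds flush policy described above.
class FlushPolicy {
    private final long maxUnflushedMessages; // M
    private final long maxUnflushedMillis;   // S, converted to milliseconds
    private long unflushed = 0;
    private long lastFlushMs = System.currentTimeMillis();

    FlushPolicy(long m, long sMillis) {
        this.maxUnflushedMessages = m;
        this.maxUnflushedMillis = sMillis;
    }

    // Called after each append to the active (last) segment file.
    void afterAppend(FileChannel activeSegment) throws IOException {
        unflushed++;
        long now = System.currentTimeMillis();
        if (unflushed >= maxUnflushedMessages || now - lastFlushMs >= maxUnflushedMillis) {
            activeSegment.force(true); // force the OS to flush the file to disk
            unflushed = 0;
            lastFlushMs = now;
        }
    }
}
</pre>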
@@ -30,7 +30,7 @@ First let's review some basic messaging terminology:
So, at a high level, producers send messages over the network to the Kafka cluster which in turn serves them up to consumers like this:
<div style="text-align: center; width: 100%">
-<img src="../images/producer_consumer.png">
+<img src="images/producer_consumer.png">
</div>
Communication between the clients and the servers is done with a simple, high-performance, language agnostic <a href="https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol">TCP protocol</a>. We provide a Java client for Kafka, but clients are available in <a href="https://cwiki.apache.org/confluence/display/KAFKA/Clients">many languages</a>.
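For example, a minimal send with the Java producer might look like this (the broker address and topic are placeholders):
<pre>
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous; close() flushes any outstanding records
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}
</pre>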
@@ -40,7 +40,7 @@ Let's first dive into the high-level abstraction Kafka provides—the topic.
<p>
A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log that looks like this:
<div style="text-align: center; width: 100%">
<img src="../images/log_anatomy.png">
|
||||
<img src="images/log_anatomy.png">
</div>
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log. The messages in the partitions are each assigned a sequential id number called the <i>offset</i> that uniquely identifies each message within the partition.
<p>
@@ -76,7 +76,7 @@ More commonly, however, we have found that topics have a small number of consume
<p>
<div style="float: right; margin: 20px; width: 500px" class="caption">
<img src="../images/consumer-groups.png"><br>
|
||||
<img src="images/consumer-groups.png"><br>
A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
</div>
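In client terms, the grouping in the figure above is just a shared group.id. Here is a hedged sketch of one instance in group A (broker address, group name, and topic are placeholders); running a second copy with the same group.id splits the partitions between the two instances:
<pre>
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMemberSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "group-a"); // instances sharing this id divide the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1)))
                    System.out.printf("partition=%d offset=%d%n", r.partition(), r.offset());
            }
        }
    }
}
</pre>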
<p>
@@ -98,7 +98,7 @@ Since running this command can be tedious you can also configure Kafka to do thi
We refer to the process of replicating data <i>between</i> Kafka clusters as "mirroring" to avoid confusion with the replication that happens amongst the nodes in a single cluster. Kafka comes with a tool for mirroring data between Kafka clusters. The tool reads from one or more source clusters and writes to a destination cluster, like this:
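The ops documentation shows an invocation along these lines, where the consumer property file points the tool at the source cluster, the producer property file points it at the destination, and the whitelist selects topics (the file names and topic here are placeholders):
<pre>
bin/kafka-mirror-maker.sh
      --consumer.config consumer.properties
      --producer.config producer.properties
      --whitelist my-topic
</pre>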
<p>
<img src="/images/mirror-maker.png">
|
||||
<img src="images/mirror-maker.png">
<p>
A common use case for this kind of mirroring is to provide a replica in another datacenter. This scenario will be discussed in more detail in the next section.
<p>