mirror of https://github.com/apache/kafka.git
MINOR: Update MirrorMaker docs to remove multiple --consumer.config options
See:
- https://issues.apache.org/jira/browse/KAFKA-1650
- https://mail-archives.apache.org/mod_mbox/kafka-users/201512.mbox/%3CCAHwHRrUeTq_-EHXiUXdrbgHcRt-0E_t0+5kOYaF9Qy4aNVqYkAmail.gmail.com%3E

Author: Andrew Otto <acotto@gmail.com>

Reviewers: Gwen Shapira

Closes #1654 from ottomata/mirror-maker-doc-fix
parent 20155ef87e
commit b4ddd021b4
@@ -111,10 +111,8 @@ However if racks are assigned different numbers of brokers, the assignment of re
<h4><a id="basic_ops_mirror_maker" href="#basic_ops_mirror_maker">Mirroring data between clusters</a></h4>
-We refer to the process of replicating data <i>between</i> Kafka clusters as "mirroring" to avoid confusion with the replication that happens amongst the nodes in a single cluster. Kafka comes with a tool for mirroring data between Kafka clusters. The tool reads from a source cluster and writes to a destination cluster, like this:
-<p>
-<img src="images/mirror-maker.png">
-<p>
+We refer to the process of replicating data <i>between</i> Kafka clusters as "mirroring" to avoid confusion with the replication that happens amongst the nodes in a single cluster. Kafka comes with a tool for mirroring data between Kafka clusters. The tool consumes from a source cluster and produces to a destination cluster.
A common use case for this kind of mirroring is to provide a replica in another datacenter. This scenario will be discussed in more detail in the next section.
<p>
You can run many such mirroring processes to increase throughput and for fault-tolerance (if one process dies, the others will take over the additional load).
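The takeover relies on Kafka's consumer groups: mirroring processes that share a consumer group divide the source partitions among themselves, and when one process exits the group rebalances so that the survivors pick up its partitions. A minimal sketch of the shared consumer setting (the host name and group name below are placeholders, not values prescribed by these docs):
<pre>
# consumer.properties shared by every mirroring process (illustrative values)
bootstrap.servers=source-kafka:9092   # placeholder source cluster
group.id=kafka-mirror                 # same group.id means partitions are split
                                      # across processes and rebalanced on failure
</pre>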
@@ -123,10 +121,10 @@ Data will be read from topics in the source cluster and written to a topic with
<p>
The source and destination clusters are completely independent entities: they can have different numbers of partitions and the offsets will not be the same. For this reason the mirror cluster is not really intended as a fault-tolerance mechanism (as the consumer position will be different); for that we recommend using normal in-cluster replication. The mirror maker process will, however, retain and use the message key for partitioning so order is preserved on a per-key basis.
<p>
-Here is an example showing how to mirror a single topic (named <i>my-topic</i>) from two input clusters:
+Here is an example showing how to mirror a single topic (named <i>my-topic</i>) from an input cluster:
<pre>
> bin/kafka-mirror-maker.sh
-      --consumer.config consumer-1.properties --consumer.config consumer-2.properties
+      --consumer.config consumer.properties
--producer.config producer.properties --whitelist my-topic
</pre>
Note that we specify the list of topics with the <code>--whitelist</code> option. This option accepts any regular expression using <a href="http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html">Java-style regular expression</a> syntax. So you could mirror two topics named <i>A</i> and <i>B</i> using <code>--whitelist 'A|B'</code>. Or you could mirror <i>all</i> topics using <code>--whitelist '.*'</code>. Make sure to quote any regular expression to ensure the shell doesn't try to expand it as a file path. For convenience we allow the use of ',' instead of '|' to specify a list of topics.
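For reference, the two properties files passed to the tool might look like the following sketch; the host names are placeholders:
<pre>
# consumer.properties: the cluster to mirror from (illustrative values)
bootstrap.servers=source-kafka:9092
group.id=kafka-mirror

# producer.properties: the cluster to mirror to (illustrative values)
bootstrap.servers=destination-kafka:9092
</pre>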
@@ -1187,7 +1185,7 @@ Operationally, we do the following for a healthy ZooKeeper installation:
<li>I/O segregation: if you do a lot of write-type traffic you'll almost definitely want the transaction logs on a dedicated disk group. Writes to the transaction log are synchronous (but batched for performance), and consequently concurrent writes can significantly affect performance. ZooKeeper snapshots can be one such source of concurrent writes, and ideally should be written on a disk group separate from the transaction log. Snapshots are written to disk asynchronously, so it is typically ok to share them with the operating system and message log files. You can configure a server to use a separate disk group with the dataLogDir parameter (see the zoo.cfg sketch after this list).</li>
<li>Application segregation: Unless you really understand the application patterns of other apps that you want to install on the same box, it can be a good idea to run ZooKeeper in isolation (though this can be a balancing act with the capabilities of the hardware).</li>
<li>Use care with virtualization: it can work, depending on your cluster layout, read/write patterns, and SLAs, but the tiny overheads introduced by the virtualization layer can add up and throw off ZooKeeper, as it can be very time sensitive.</li>
<li>ZooKeeper configuration: it's Java, so make sure you give it 'enough' heap space (we usually run them with 3-5G, but that's mostly due to the data set size we have here). Unfortunately we don't have a good formula for it, but keep in mind that allowing for more ZooKeeper state means that snapshots can become large, and large snapshots affect recovery time. In fact, if the snapshot becomes too large (a few gigabytes), then you may need to increase the initLimit parameter to give servers enough time to recover and join the ensemble.</li>
<li>Monitoring: both JMX and the four-letter-word (4lw) commands are very useful; they do overlap in some cases (and in those cases we prefer the four-letter commands, as they seem more predictable, or at the very least, they work better with the LI monitoring infrastructure).</li>
<li>Don't overbuild the cluster: large clusters, especially with a write-heavy usage pattern, mean a lot of intracluster communication (quorums on the writes and subsequent cluster-member updates), but don't underbuild it (and risk swamping the cluster). Having more servers adds to your read capacity.</li>
</ul>
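To make the dataLogDir and initLimit advice above concrete, here is a minimal zoo.cfg sketch; the paths and values are illustrative placeholders rather than recommendations:
<pre>
# zoo.cfg (illustrative values)
tickTime=2000
dataDir=/data/zookeeper       # snapshots; typically ok to share a disk with the OS
dataLogDir=/txlog/zookeeper   # transaction log on its own dedicated disk group
initLimit=20                  # ticks a server may take to sync at startup;
                              # raise this if snapshots grow to gigabytes
syncLimit=5
</pre>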