Update!

- Test failover with single producer and 10 brokers.
consumed: 2012-08-07_18:33:50.846859719 4316
consumed: 2012-08-07_18:33:51.629561192 4709

In a single-producer, 10-broker setup, I lost 42 out of 393 messages (about 10%), as expected.  It took ~800ms for ZooKeeper to notify the producer and for the producer to reconfigure.
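(For reference, the ~800ms and the 393 fall straight out of the two consumed lines above: the middle field is the timestamp and the last field is the sequence number.  A quick sanity check -- just a throwaway snippet, not one of the gist scripts:

printf '%s\n' \
  "consumed: 2012-08-07_18:33:50.846859719 4316" \
  "consumed: 2012-08-07_18:33:51.629561192 4709" | awk -F'[_ ]' '
  # $3 is HH:MM:SS.nanoseconds, $4 is the sequence number
  { split($3, t, ":"); secs[NR] = t[1]*3600 + t[2]*60 + t[3]; seq[NR] = $4 }
  END { printf "window: %.3f s, messages sent in window: %d\n", secs[2] - secs[1], seq[2] - seq[1] }'

That prints a window of ~0.783 s spanning 393 sequence numbers, which is where the 42-out-of-393 figure comes from.)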


- Test failover with 10 producers and 10 brokers.
I've put the results here:
https://gist.github.com/3286692#file_results%20for%2010%20producers%2C%2010%20brokers%2C%201%20failed%20broker

You can read the details if you like.  In summary, with 10 brokers and 10 producers, when one of the brokers failed, each producer still lost only about 10% of its messages, for an average of 0.75 seconds.

ZooKeeper reconfiguration time seems constant, independent of the number of producers and brokers.  I can't be sure that this is true, since I'm only testing with 10 producers.  If we use this in production, we'll have hundreds of producers.  

Since sent messages are (by default) load balanced across all brokers, the more brokers we have, the fewer messages we'd lose if one goes down.  The scaling here looks pretty linear: lose x% of brokers and you lose x% of messages for the amount of time it takes ZooKeeper to reconfigure the producers.
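To put rough numbers on that, here's a back-of-the-envelope calculation.  The inputs below are just example values (roughly matching the single-producer run above), not measurements from any particular test:

rate=500        # messages/sec from one producer
brokers=10      # total brokers in the pool
failed=1        # brokers that go down
reconfig=0.8    # seconds for ZooKeeper to notify producers and for them to reconfigure

awk -v r="$rate" -v b="$brokers" -v f="$failed" -v t="$reconfig" 'BEGIN {
  printf "expected loss per producer: ~%.0f messages (%.0f%% of traffic for %.1fs)\n", r*t*f/b, 100*f/b, t
}'

With those inputs it prints ~40 messages (10% of traffic for 0.8s), which lines up with the 42 out of 393 I actually saw.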



On Aug 7, 2012, at 2:45 PM, Andrew Otto <otto@wikimedia.org> wrote:

 I don't have millisecond times on my messages right now (maybe I should add that in).
Update:  I've done just that.

consumed: 2012-08-07_18:33:47.128786953 2393
consumed: 2012-08-07_18:33:47.724846361 2733

In a single async producer, two-broker setup, when taking down a single broker, I lose half of the messages sent for about 600ms.  If we had 10 brokers and one went down, I would expect to lose 1/10th of the messages for about 600ms.  It might take longer for producers to reconfigure with more brokers and producers, though.

Next steps:
- Test failover with single producer and 10 brokers.
- Test failover with 10 producers and 10 brokers.



On Aug 7, 2012, at 12:21 PM, Andrew Otto <otto@wikimedia.org> wrote:

I'm working on testing Kafka broker failover, to see if (and how many) messages are lost when a broker dies while producers are sending messages.  Here's what I'm doing to test.  It's nothing rigorous, just a try-it-and-see-what-happens approach.  Oh, and here's a gist with the commands and scripts I'm using:  https://gist.github.com/3286692

1)  Start up a broker on each of an03 and an04:
$ bin/kafka-server-start.sh config/server.properties

2)  Start producing on an03 (sequence_generate.sh is roughly sketched after the steps below):
./sequence_generate.sh 20000 10000 | bin/kafka-console-producer.sh --zookeeper analytics1003:2181 --topic test6

3)  While logs are being produced, kill a Kafka broker.

4)  Produce some more, and at some point while still producing, re-start the downed broker:
$ bin/kafka-server-start.sh config/server.properties

5) Finish producing.  Kill the producer to make sure it doesn't have anything batched.  (This should only matter with an asynchronous producer.)

6) Start consumer on an04, saving stdout:
bin/kafka-consumer-shell.sh --props config/consumer.properties --topic test6 | tee -a /tmp/test6.log

7) Check logs to make sure all messages made it through (sequence_check.sh is also sketched below):
cat /tmp/test6.log | egrep '^consumed:' | awk '{print $3}' | sort -g | ~/bin/sequence_check.sh 1 20000
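For reference, here's roughly the shape of the two helper scripts.  These are minimal sketches only -- the real versions are in the gist above, and I'm glossing over sequence_generate.sh's second argument here:

# sequence_generate.sh, roughly: emit numbered, nanosecond-timestamped lines to stdout.
# First argument = how many messages to generate; the second argument is ignored in this sketch.
count=$1
for ((i = 1; i <= count; i++)); do
  echo "$(date +%Y-%m-%d_%H:%M:%S.%N) $i"
done

# sequence_check.sh, roughly: read sequence numbers on stdin and report any
# that are missing from the range [start, end].
start=$1; end=$2
awk -v start="$start" -v end="$end" '
  { seen[$1] = 1 }
  END { for (i = start; i <= end; i++) if (!(i in seen)) print "missing:", i }'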



When I kill one of the brokers while messages are being produced, I lose between 10 and 60 log messages (depending, I think, on the producer type, async vs. sync) in the final consumed output.  I'm sure this is from the producer firing off messages to the downed broker before ZooKeeper can notify the producer of the pool reconfiguration.


So, I guess this is kind of as expected.  In my most recent test, I lost 56 messages when I killed the broker.  All of those messages were generated during the same second.  Including those 56, there were about 500 messages fired off during that second.  I don't have millisecond times on my messages right now (maybe I should add that in).  As would be expected, no messages were lost when I brought the second broker back online.  ZooKeeper reconfigured and the producer rerouted with no problems there.

Summary: 

In a 2-node broker pool with one producer, it takes Kafka/ZooKeeper less than a second to notice a failed broker and reroute messages.

Is this acceptable?  We will discuss :)

-Ao