Understanding, Operating and Monitoring Apache Kafka
Apache Kafka is an attractive service because it’s conceptually simple and powerful. It’s easy to understand writing messages to a log in one place, then reading messages from that log in another place. This simplicity not only allows for a nice separation of concerns, but also is relatively easier to understand than more complex alternatives. Plus, Kafka is powerful enough to pass messages at wire speed, going as fast as your network allows.
Understanding Apache Kafka
At the same time, Apache Kafka has a surprising number of moving parts that can be monitored, and it can be a little hard to know even where to start. In broad strokes, there are two categories to monitor: infrastructure and application performance. Most of the metric space for Kafka centers on measuring the movement of messages over a network, and the underlying mechanisms that support correct passage/delivery of those messages. Kafka labels a subset of the metrics involved as “topics” and “consumer groups,” which typically address application performance, providing answers to questions such as “how many messages have been sent to topic X,” or, “how far behind is consumer group Y?”
The Big Picture
To make things a bit more complicated, Kafka provides node-level metrics, and doesn’t provide a coordinated rollup across the logical service. So, you may want to monitor “active.controller.count” but each node tells you “zero” and only the one coordinator node tells you “one,” and that might move around from node to node as instances are cycled. Or, you may want to monitor “messages.in” but each node only tells you the number of messages it’s seen, rather than the sum total across the brokers. In these cases, it is necessary to create an aggregation across the nodes, to get the sum total of messages or partitions or other kinds of counts.
Now that we have the big picture, let’s get specific. There are lots of good sources for comprehensive monitoring and laundry lists of metrics, so instead I’ll give a couple of embarrassing stories from our own experience when we were first adopting this wonderful technology.
Embarrassing story #1:
It doesn’t work when the disk is full. I think this one must happen to everyone at some point, because you sort of hope that Kafka will manage disk space holistically. Instead, Kafka has a fine-grained policy for the limits of a partition. It doesn’t have an overall limit and it cannot work backward to keep the disk from becoming full by throwing away older data. Instead, it just fills the disk and tips over, as if that’s what you asked it to do, even though you didn’t mean to really. So, we had just rolled out a small staging cluster with mostly default settings, and within a few days the disks were full and everything failed. Luckily, it failed fast.

If you tell this embarrassing story to an experienced Kafka dev, they might say “oh, you didn’t configure things properly.” And, yes, it is possible to take a guess at how many topics you need to support at most, and then work backward from a number of partitions and configure settings that might keep the disk from filling so long as you stay under that number of partitions. However, the number of partitions is a moving target; topics can be made on demand, topics with more partitions than the default can be made, etc. You can make an imaginary partition budget, but there’s no way to enforce that budget. What you really want, in the general case, is to use the disk up to a point, then hope Kafka acts like a FIFO and throws away old things to make room for new things. Until that’s available, I don’t think there’s a robust way to defend against filling the disk with Kafka configuration alone, so monitoring disk utilization is essential.
We use a nice machine learning model in our product that learns disk utilization trends and warns us if the trend predicts a full disk in the near future. We get a warning if the full disk is three days out (the long weekend), and a critical alert if eight hours out. What’s great about this approach is that it adapts to real utilization automatically, pays attention to the velocity of data rather than some arbitrary threshold, and is independent of the number of active topics, and independent of the activity level in different topics. See how I turned our embarrassment into a positive? Pretty slick, right?
Embarrassing story #2:
Broker down is bad. We deployed three Kafka brokers in staging, and the story we told ourselves was that one broker could go down and the other two would pick up the slack. Like any quorum-based service, we hoped, two out of three would mean a majority would always be up and running. So, sure enough a broker went down, and we were like “cool, we’ll fix that later, but leave it for now because it might offer good clues we’ll want to investigate.” Then, much later, one of the producers wedged. Then another. Then it was an outage. It turns out we had configured those Kafka Producers to require acks from each of the replicas, and there were three replicas. So, when one of the brokers went down, there was no fourth broker to take up the missing replica and the Producers were only able to get two acks. Everything had seemed to keep working for a long while. But, while the broker was down, the Producers kept resources for each message hoping for the last ack, then eventually crashed the jvm when it was out of resources. This one was a little tricky to figure out because there were no logged failures on the Producers, and the Consumers were happy to read from the replicas that were working, so it wasn’t obvious that there was a problem with Kafka.
The fix was to configure the Producers to only require one acknowledgement before considering the message delivered. Also, we had to stop thinking about Kafka as if it were another quorum-type service. A better mental image is to think of Kafka brokers as partition servers, and the number of brokers should be the number of partitions you need to serve in parallel at full speed. It won’t quite work out that way in practice, but that mental model is better than thinking about quorum. Then, once you have a number of brokers established, think about replication as providing failure tolerance, but work backward. If one broker goes down, the replication factor should be equal or less than the number of remaining brokers, because otherwise there’s nowhere to move the replicas. Three replicas on three brokers means when a broker goes down, there’s nothing to take on the third replica and you have a policy violation / outage.
Also, note that when the broker is down, it naturally doesn’t have any metrics to report. You can’t tell from inspecting the down node that something is wrong with the cluster. You have to go add up all the metrics from the other brokers. So, it is important to have service-wide aggregations for key performance indicators, such as under.replicated.partitions, offline.partition.count, bytes.in, bytes.out, messages.in, etc. Adding up all the active.controller.counts will tell you that there is no leader if the sum is zero, and some split-brain catastrophe has happened if the sum is more than one. With service aggregations, you can assess the impact of a broker going down: either the cluster recovers and is fine, or it gets into trouble pretty quickly.

Lastly, I’d like to point out that it may be possible in some environments to have a strategy for dealing with the large list of metrics in Kafka, but still collect everything that’s interesting. Whenever a topic or partition is created, data about those are reported essentially forever. A topic may be abandoned and all the message data expired, but all the metrics around that topic will continue to report values. In staging and testing-type environments, especially, we’ve seen large collections of abandoned topics. But even in production, given enough time, the collection of inactive topics can grow substantially. Kafka doesn’t particularly encourage cleaning up old topics, so eventually you have a long list of all the topics ever made. As an adaptive filter, one can track message counts and ignore all the metrics related to topics that had no increase in the number of messages. That way, you can reasonably “watch everything live” and avoid having to maintain a list of topics to be monitored, or otherwise manually tune your data collection.
In the next part of Monitoring Kafka series we will discuss Kafka consumer lag and some other concerns in a little more detail.
Understanding, Operating and Monitoring Apache Kafka的更多相关文章
- Apache Kafka监控之Kafka Web Console
Kafka Web Console:是一款开源的系统,源码的地址在https://github.com/claudemamo/kafka-web-console中.Kafka Web Console也 ...
- Understanding When to use RabbitMQ or Apache Kafka
https://content.pivotal.io/rabbitmq/understanding-when-to-use-rabbitmq-or-apache-kafka How do humans ...
- Understanding When to use RabbitMQ or Apache Kafka Kafka RabbitMQ 性能对比
Understanding When to use RabbitMQ or Apache Kafka https://content.pivotal.io/rabbitmq/understanding ...
- 【转载】Understanding When to use RabbitMQ or Apache Kafka
https://content.pivotal.io/rabbitmq/understanding-when-to-use-rabbitmq-or-apache-kafka RabbitMQ: Erl ...
- Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)
I wrote a blog post about how LinkedIn uses Apache Kafka as a central publish-subscribe log for inte ...
- 重磅开源 KSQL:用于 Apache Kafka 的流数据 SQL 引擎 2017.8.29
Kafka 的作者 Neha Narkhede 在 Confluent 上发表了一篇博文,介绍了Kafka 新引入的KSQL 引擎——一个基于流的SQL.推出KSQL 是为了降低流式处理的门槛,为处理 ...
- Apache Kafka: Next Generation Distributed Messaging System---reference
Introduction Apache Kafka is a distributed publish-subscribe messaging system. It was originally dev ...
- Install and Configure Apache Kafka on Ubuntu 16.04
https://devops.profitbricks.com/tutorials/install-and-configure-apache-kafka-on-ubuntu-1604-1/ by hi ...
- An Overview of End-to-End Exactly-Once Processing in Apache Flink (with Apache Kafka, too!)
01 Mar 2018 Piotr Nowojski (@PiotrNowojski) & Mike Winters (@wints) This post is an adaptation o ...
随机推荐
- java web 学习 --第三天(Java三级考试)
第二天的学习内容这里:http://www.cnblogs.com/tobecrazy/p/3446646.html Jsp中的动作标签 <jsp:include> 实现动态包含,在一个文 ...
- 1.【转】spring MVC入门示例(hello world demo)
1. Spring MVC介绍 Spring Web MVC是一种基于Java的实现了Web MVC设计模式的请求驱动类型的轻量级Web框架,即使用了MVC架构模式的思想,将web层进行职责解耦,基于 ...
- andriod一次退出所有的Activity
自己实现了一个Activity管理,可以实现一次退出所有的Activity.在Activity启动的时候,将调用里面的put方法,将Activity对象加入进来.在要退出某个activity的时候,将 ...
- 【leetcode】Binary Search Tree Iterator(middle)
Implement an iterator over a binary search tree (BST). Your iterator will be initialized with the ro ...
- 【hadoop2.6.0】利用JAVA API 实现数据上传
原本的目的是想模拟一个流的上传过程,就是一边生成数据,一边存储数据,因为能用上HADOOP通常情况下原本数据的大小就大到本地硬盘存不下.这一般是通过把数据先一部分一部分的缓冲到本地的某个文件夹下,hd ...
- IOS-KVO&KVC
KVC(key value coding) 我们一般是通过调用set方法或属性的点语法来直接更改对象的状态,即对象的属性值,比如[stu setAge:10]; stu.age = 9; lKVC, ...
- 解决sqlite3_key的问题
报错内容显示如下: ld: warning: ignoring file /Users/rowling/Library/Developer/Xcode/DerivedData/zhinengbango ...
- August 10th, 2016, Week 33rd, Wednesday
The degree of loving is measured by the degree of giving. 爱的深浅是用给与的多少来衡量的. Some say that if you love ...
- Redis事件管理(一)
Redis统一的时间管理器,同时管理文件事件和定时器, 这个管理器的定义: #if defined(__APPLE__) #define HAVE_TASKINFO 1 #endif /* Test ...
- oracle 监控
sqlplus "/as sysdba" .监控当前数据库谁在运行什么SQL语句 SELECT osuser, username, sql_text from v$session ...
