Apache Kafka is an attractive service because it’s conceptually simple and powerful. It’s easy to understand writing messages to a log in one place, then reading messages from that log in another place. This simplicity allows for a clean separation of concerns and makes Kafka easier to reason about than more complex alternatives. Plus, Kafka is powerful enough to pass messages at wire speed, going as fast as your network allows.

Understanding Apache Kafka

At the same time, Apache Kafka has a surprising number of moving parts that can be monitored, and it can be a little hard to know where to start. In broad strokes, there are two categories to monitor: infrastructure and application performance. Most of Kafka’s metric space centers on measuring the movement of messages over a network, and the underlying mechanisms that support correct delivery of those messages. Kafka labels a subset of the metrics involved as “topics” and “consumer groups,” which typically address application performance, providing answers to questions such as “how many messages have been sent to topic X?” or “how far behind is consumer group Y?”

The Big Picture

To make things a bit more complicated, Kafka provides node-level metrics and doesn’t provide a coordinated rollup across the logical service. So, you may want to monitor “active.controller.count,” but every node tells you “zero” except the one controller node, which tells you “one,” and the controller role can move from node to node as instances are cycled. Or, you may want to monitor “messages.in,” but each node only tells you the number of messages it has seen, rather than the sum total across the brokers. In these cases, it is necessary to create an aggregation across the nodes to get the sum total of messages, partitions, or other kinds of counts.
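As a rough sketch of what that rollup looks like, here is a toy aggregator. The metric names match Kafka’s real per-broker metrics, but the broker names and readings are invented for illustration; a real collector would scrape these values over JMX rather than hard-code them.

```python
# Sketch: roll up per-broker metric readings into service-wide totals.
# The readings below are hand-written examples, not real JMX output.

def aggregate(per_broker_metrics):
    """Sum each metric name across all brokers."""
    totals = {}
    for broker, metrics in per_broker_metrics.items():
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0) + value
    return totals

readings = {
    "broker-1": {"active.controller.count": 0, "messages.in": 1200},
    "broker-2": {"active.controller.count": 1, "messages.in": 900},
    "broker-3": {"active.controller.count": 0, "messages.in": 1500},
}

totals = aggregate(readings)
print(totals["active.controller.count"])  # 1 -> exactly one controller
print(totals["messages.in"])              # 3600 messages across the cluster
```

The point is that neither number is visible from any single broker: only the aggregate answers “is there a controller?” or “how many messages is the cluster taking in?”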

Now that we have the big picture, let’s get specific. There are lots of good sources for comprehensive monitoring and laundry lists of metrics, so instead I’ll give a couple of embarrassing stories from our own experience when we were first adopting this wonderful technology.

Embarrassing story #1:

It doesn’t work when the disk is full. I think this one must happen to everyone at some point, because you sort of hope that Kafka will manage disk space holistically. Instead, Kafka has a fine-grained policy for the limits of a partition. It doesn’t have an overall limit, and it cannot work backward to keep the disk from becoming full by throwing away older data. Instead, it just fills the disk and tips over, as if that’s what you asked it to do, even though you never meant to ask for that. So, we had just rolled out a small staging cluster with mostly default settings, and within a few days the disks were full and everything failed. Luckily, it failed fast.

If you tell this embarrassing story to an experienced Kafka dev, they might say “oh, you didn’t configure things properly.” And, yes, it is possible to guess how many topics you need to support at most, then work backward from a number of partitions and configure settings that might keep the disk from filling, so long as you stay under that number of partitions. However, the number of partitions is a moving target: topics can be created on demand, topics can be created with more partitions than the default, and so on. You can make an imaginary partition budget, but there’s no way to enforce that budget. What you really want, in the general case, is to use the disk up to a point, then have Kafka act like a FIFO and throw away old data to make room for new data. Until that’s available, I don’t think there’s a robust way to defend against filling the disk with Kafka configuration alone, so monitoring disk utilization is essential.
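To make the “work backward” arithmetic concrete, here is a minimal sketch of backing a per-partition retention limit out of a disk budget. The headroom factor and partition count are assumptions you would have to supply yourself, and nothing in Kafka enforces the partition budget, which is exactly the weakness described above.

```python
def per_partition_retention(disk_bytes, max_partitions_per_broker, headroom=0.85):
    """Back a per-partition retention limit out of a disk budget.

    Only holds while the broker actually stays under
    max_partitions_per_broker, which Kafka does not enforce.
    """
    usable = disk_bytes * headroom  # leave room for indexes and in-flight segments
    return int(usable // max_partitions_per_broker)

# 1 TB disk, at most 200 partition replicas hosted on this broker:
limit = per_partition_retention(1_000_000_000_000, 200)
print(limit)  # 4250000000 -> about 4.25 GB retention per partition
```

If someone later creates a topic that pushes the broker to 400 partition replicas, this limit is instantly wrong, and the disk fills anyway.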

We use a nice machine learning model in our product that learns disk utilization trends and warns us if the trend predicts a full disk in the near future. We get a warning if the full disk is three days out (the long weekend), and a critical alert if it is eight hours out. What’s great about this approach is that it adapts to real utilization automatically, pays attention to the velocity of data rather than some arbitrary threshold, and is independent of both the number of active topics and the activity level in each topic. See how I turned our embarrassment into a positive? Pretty slick, right?
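Our product’s model is fancier than this, but the core idea can be sketched with a simple least-squares trend over recent disk samples. The samples and thresholds below are invented for illustration; the three-day and eight-hour cutoffs mirror the alerting described above.

```python
def hours_until_full(samples, capacity_bytes):
    """Fit a line to (hour, bytes_used) samples and extrapolate to
    when usage crosses capacity. Returns None if usage is flat or shrinking."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    slope_num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    slope_den = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = slope_num / slope_den          # bytes per hour
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    return (capacity_bytes - intercept) / slope

# Disk usage sampled hourly, growing ~10 GB/hour toward a 1 TB disk:
samples = [(0, 500e9), (1, 510e9), (2, 520e9), (3, 530e9)]
eta = hours_until_full(samples, 1e12)   # 50.0 hours out
warn = eta is not None and eta <= 72    # three days -> warning
page = eta is not None and eta <= 8     # eight hours -> critical alert
```

Because the prediction comes from the observed slope, it tracks real data velocity rather than a fixed utilization threshold, which is the property the paragraph above is praising.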

Embarrassing story #2:

Broker down is bad. We deployed three Kafka brokers in staging, and the story we told ourselves was that one broker could go down and the other two would pick up the slack. Like any quorum-based service, we hoped, two out of three would mean a majority was always up and running. So, sure enough, a broker went down, and we were like “cool, we’ll fix that later, but leave it for now because it might offer good clues we’ll want to investigate.” Then, much later, one of the Producers wedged. Then another. Then it was an outage. It turns out we had configured those Kafka Producers to require acks from each of the replicas, and there were three replicas. So, when one of the brokers went down, there was no fourth broker to take up the missing replica, and the Producers were only able to get two acks. Everything had seemed to keep working for a long while. But while the broker was down, the Producers kept resources allocated for each message, hoping for the last ack, and eventually crashed the JVM when it ran out of resources. This one was a little tricky to figure out because there were no logged failures on the Producers, and the Consumers were happy to read from the replicas that were working, so it wasn’t obvious that there was a problem with Kafka.

The fix was to configure the Producers to require only one acknowledgement before considering a message delivered. Also, we had to stop thinking about Kafka as if it were another quorum-type service. A better mental image is to think of Kafka brokers as partition servers, where the number of brokers is the number of partitions you need to serve in parallel at full speed. It won’t quite work out that way in practice, but that mental model is better than thinking about quorum. Then, once you have a number of brokers established, think about replication as providing failure tolerance, but work backward: if one broker goes down, the replication factor should be equal to or less than the number of remaining brokers, because otherwise there’s nowhere to move the replicas. Three replicas on three brokers means that when a broker goes down, there’s nothing left to take on the third replica, and you have a policy violation and an outage.
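The backward-working rule above reduces to a one-line inequality, sketched here as a hypothetical sizing check (the function and its names are ours, not part of any Kafka tooling):

```python
def replication_is_safe(brokers, replication_factor, tolerated_failures=1):
    """After losing tolerated_failures brokers, the survivors must still
    be able to host every replica: rf <= brokers - failures."""
    return replication_factor <= brokers - tolerated_failures

# Our staging cluster: 3 brokers, replication factor 3. One broker down
# leaves nowhere to place the third replica.
print(replication_is_safe(3, 3))  # False -- exactly our outage
print(replication_is_safe(4, 3))  # True  -- one broker can die safely
```

Run at provisioning time, a check like this would have flagged our three-on-three layout before the first broker ever went down; pairing it with a relaxed producer ack setting closes both halves of the failure.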

Also, note that when a broker is down, it naturally doesn’t have any metrics to report. You can’t tell from inspecting the down node that something is wrong with the cluster; you have to add up the metrics from the other brokers. So, it is important to have service-wide aggregations for key performance indicators, such as under.replicated.partitions, offline.partition.count, bytes.in, bytes.out, messages.in, etc. Adding up all the active.controller.counts will tell you that there is no leader if the sum is zero, and that some split-brain catastrophe has happened if the sum is more than one. With service aggregations, you can assess the impact of a broker going down: either the cluster recovers and is fine, or it gets into trouble pretty quickly.
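The controller-count interpretation can be captured in a few lines. This is a sketch of the alerting rule, not any existing tool; note that the input list only contains readings from brokers that responded, which is exactly why a down broker is invisible here.

```python
def controller_health(counts):
    """Interpret summed active.controller.count readings from the
    brokers that actually responded to the metrics scrape."""
    total = sum(counts)
    if total == 1:
        return "healthy"
    if total == 0:
        return "no leader"
    return "split brain"

print(controller_health([0, 1, 0]))  # healthy
print(controller_health([0, 0]))     # no leader -- or the controller is the down broker
print(controller_health([1, 1, 0]))  # split brain
```

A “no leader” result is ambiguous on its own: either no controller has been elected, or the controller is the broker that stopped reporting, so this check belongs alongside a reachability check on every broker.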

Lastly, I’d like to point out a strategy that may make it possible, in some environments, to cope with Kafka’s large list of metrics while still collecting everything that’s interesting. Whenever a topic or partition is created, metrics about it are reported essentially forever. A topic may be abandoned and all of its message data expired, but the metrics around that topic will continue to report values. In staging and testing environments especially, we’ve seen large collections of abandoned topics. But even in production, given enough time, the collection of inactive topics can grow substantially. Kafka doesn’t particularly encourage cleaning up old topics, so eventually you have a long list of all the topics ever made. As an adaptive filter, you can track message counts and ignore all the metrics related to topics that show no increase in the number of messages. That way, you can reasonably “watch everything live” and avoid having to maintain a list of topics to be monitored, or otherwise manually tune your data collection.
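The adaptive filter amounts to comparing message counts between scrapes and dropping topics that didn’t move. A minimal sketch, with invented topic names and counts:

```python
def active_topics(previous_counts, current_counts):
    """Keep only topics whose message count increased since the last
    scrape. Topics seen for the first time are treated as active."""
    active = set()
    for topic, count in current_counts.items():
        if count > previous_counts.get(topic, -1):
            active.add(topic)
    return active

prev = {"orders": 100, "dead-topic": 42}
curr = {"orders": 180, "dead-topic": 42, "new-topic": 5}
print(sorted(active_topics(prev, curr)))  # ['new-topic', 'orders']
```

Collection then only follows the topics this returns, so abandoned topics fall out of the metric stream automatically instead of accumulating forever.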

In the next part of the Monitoring Kafka series, we will discuss Kafka consumer lag and some other concerns in a little more detail.
