Understanding, Operating and Monitoring Apache Kafka
Apache Kafka is an attractive service because it’s conceptually simple and powerful. It’s easy to understand writing messages to a log in one place, then reading messages from that log in another place. This simplicity not only allows for a nice separation of concerns, but also is relatively easier to understand than more complex alternatives. Plus, Kafka is powerful enough to pass messages at wire speed, going as fast as your network allows.
Understanding Apache Kafka
At the same time, Apache Kafka has a surprising number of moving parts that can be monitored, and it can be a little hard to know even where to start. In broad strokes, there are two categories to monitor: infrastructure and application performance. Most of the metric space for Kafka centers on measuring the movement of messages over a network, and the underlying mechanisms that support correct passage/delivery of those messages. Kafka labels a subset of the metrics involved as “topics” and “consumer groups,” which typically address application performance, providing answers to questions such as “how many messages have been sent to topic X,” or, “how far behind is consumer group Y?”
The Big Picture
To make things a bit more complicated, Kafka provides node-level metrics, and doesn’t provide a coordinated rollup across the logical service. So, you may want to monitor “active.controller.count” but each node tells you “zero” and only the one coordinator node tells you “one,” and that might move around from node to node as instances are cycled. Or, you may want to monitor “messages.in” but each node only tells you the number of messages it’s seen, rather than the sum total across the brokers. In these cases, it is necessary to create an aggregation across the nodes, to get the sum total of messages or partitions or other kinds of counts.
Now that we have the big picture, let’s get specific. There are lots of good sources for comprehensive monitoring and laundry lists of metrics, so instead I’ll give a couple of embarrassing stories from our own experience when we were first adopting this wonderful technology.
Embarrassing story #1:
It doesn’t work when the disk is full. I think this one must happen to everyone at some point, because you sort of hope that Kafka will manage disk space holistically. Instead, Kafka has a fine-grained policy for the limits of a partition. It doesn’t have an overall limit and it cannot work backward to keep the disk from becoming full by throwing away older data. Instead, it just fills the disk and tips over, as if that’s what you asked it to do, even though you didn’t mean to really. So, we had just rolled out a small staging cluster with mostly default settings, and within a few days the disks were full and everything failed. Luckily, it failed fast.
If you tell this embarrassing story to an experienced Kafka dev, they might say “oh, you didn’t configure things properly.” And, yes, it is possible to take a guess at how many topics you need to support at most, and then work backward from a number of partitions and configure settings that might keep the disk from filling so long as you stay under that number of partitions. However, the number of partitions is a moving target; topics can be made on demand, topics with more partitions than the default can be made, etc. You can make an imaginary partition budget, but there’s no way to enforce that budget. What you really want, in the general case, is to use the disk up to a point, then hope Kafka acts like a FIFO and throws away old things to make room for new things. Until that’s available, I don’t think there’s a robust way to defend against filling the disk with Kafka configuration alone, so monitoring disk utilization is essential.
We use a nice machine learning model in our product that learns disk utilization trends and warns us if the trend predicts a full disk in the near future. We get a warning if the full disk is three days out (the long weekend), and a critical alert if eight hours out. What’s great about this approach is that it adapts to real utilization automatically, pays attention to the velocity of data rather than some arbitrary threshold, and is independent of the number of active topics, and independent of the activity level in different topics. See how I turned our embarrassment into a positive? Pretty slick, right?
Embarrassing story #2:
Broker down is bad. We deployed three Kafka brokers in staging, and the story we told ourselves was that one broker could go down and the other two would pick up the slack. Like any quorum-based service, we hoped, two out of three would mean a majority would always be up and running. So, sure enough a broker went down, and we were like “cool, we’ll fix that later, but leave it for now because it might offer good clues we’ll want to investigate.” Then, much later, one of the producers wedged. Then another. Then it was an outage. It turns out we had configured those Kafka Producers to require acks from each of the replicas, and there were three replicas. So, when one of the brokers went down, there was no fourth broker to take up the missing replica and the Producers were only able to get two acks. Everything had seemed to keep working for a long while. But, while the broker was down, the Producers kept resources for each message hoping for the last ack, then eventually crashed the jvm when it was out of resources. This one was a little tricky to figure out because there were no logged failures on the Producers, and the Consumers were happy to read from the replicas that were working, so it wasn’t obvious that there was a problem with Kafka.
The fix was to configure the Producers to only require one acknowledgement before considering the message delivered. Also, we had to stop thinking about Kafka as if it were another quorum-type service. A better mental image is to think of Kafka brokers as partition servers, and the number of brokers should be the number of partitions you need to serve in parallel at full speed. It won’t quite work out that way in practice, but that mental model is better than thinking about quorum. Then, once you have a number of brokers established, think about replication as providing failure tolerance, but work backward. If one broker goes down, the replication factor should be equal or less than the number of remaining brokers, because otherwise there’s nowhere to move the replicas. Three replicas on three brokers means when a broker goes down, there’s nothing to take on the third replica and you have a policy violation / outage.
Also, note that when the broker is down, it naturally doesn’t have any metrics to report. You can’t tell from inspecting the down node that something is wrong with the cluster. You have to go add up all the metrics from the other brokers. So, it is important to have service-wide aggregations for key performance indicators, such as under.replicated.partitions, offline.partition.count, bytes.in, bytes.out, messages.in, etc. Adding up all the active.controller.counts will tell you that there is no leader if the sum is zero, and some split-brain catastrophe has happened if the sum is more than one. With service aggregations, you can assess the impact of a broker going down: either the cluster recovers and is fine, or it gets into trouble pretty quickly.
Lastly, I’d like to point out that it may be possible in some environments to have a strategy for dealing with the large list of metrics in Kafka, but still collect everything that’s interesting. Whenever a topic or partition is created, data about those are reported essentially forever. A topic may be abandoned and all the message data expired, but all the metrics around that topic will continue to report values. In staging and testing-type environments, especially, we’ve seen large collections of abandoned topics. But even in production, given enough time, the collection of inactive topics can grow substantially. Kafka doesn’t particularly encourage cleaning up old topics, so eventually you have a long list of all the topics ever made. As an adaptive filter, one can track message counts and ignore all the metrics related to topics that had no increase in the number of messages. That way, you can reasonably “watch everything live” and avoid having to maintain a list of topics to be monitored, or otherwise manually tune your data collection.
In the next part of Monitoring Kafka series we will discuss Kafka consumer lag and some other concerns in a little more detail.
Understanding, Operating and Monitoring Apache Kafka的更多相关文章
- Apache Kafka监控之Kafka Web Console
Kafka Web Console:是一款开源的系统,源码的地址在https://github.com/claudemamo/kafka-web-console中.Kafka Web Console也 ...
- Understanding When to use RabbitMQ or Apache Kafka
https://content.pivotal.io/rabbitmq/understanding-when-to-use-rabbitmq-or-apache-kafka How do humans ...
- Understanding When to use RabbitMQ or Apache Kafka Kafka RabbitMQ 性能对比
Understanding When to use RabbitMQ or Apache Kafka https://content.pivotal.io/rabbitmq/understanding ...
- 【转载】Understanding When to use RabbitMQ or Apache Kafka
https://content.pivotal.io/rabbitmq/understanding-when-to-use-rabbitmq-or-apache-kafka RabbitMQ: Erl ...
- Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)
I wrote a blog post about how LinkedIn uses Apache Kafka as a central publish-subscribe log for inte ...
- 重磅开源 KSQL:用于 Apache Kafka 的流数据 SQL 引擎 2017.8.29
Kafka 的作者 Neha Narkhede 在 Confluent 上发表了一篇博文,介绍了Kafka 新引入的KSQL 引擎——一个基于流的SQL.推出KSQL 是为了降低流式处理的门槛,为处理 ...
- Apache Kafka: Next Generation Distributed Messaging System---reference
Introduction Apache Kafka is a distributed publish-subscribe messaging system. It was originally dev ...
- Install and Configure Apache Kafka on Ubuntu 16.04
https://devops.profitbricks.com/tutorials/install-and-configure-apache-kafka-on-ubuntu-1604-1/ by hi ...
- An Overview of End-to-End Exactly-Once Processing in Apache Flink (with Apache Kafka, too!)
01 Mar 2018 Piotr Nowojski (@PiotrNowojski) & Mike Winters (@wints) This post is an adaptation o ...
随机推荐
- SQL触发器中若取到null值可能引发的问题
declare @code varchar(20), @cs varchar(20),@zc varchar(20)set @cs='('+@cs+'*'+@zc+')'print '字符'+@csi ...
- Effective C++ -----条款24:若所有参数皆需类型转换,请为此采用non-member函数
如果你需要为某个函数的所有参数(包括被this指针所指的那个隐喻参数)进行类型转换,那么这个函数必须是个non-member.
- Enum:EXTENDED LIGHTS OUT(POJ 1222)
亮灯 题目大意:有一个5*6的灯组,按一盏灯会让其他上下左右4栈和他自己灯变为原来相反的状态,要怎么按才会把所有的灯都按灭? 3279翻版题目,不多说,另外这一题还可以用其他方法,比如DFS,BFS, ...
- Jquery 实现 “下次自动登录” 记住用户名密码功能
转载自:http://blog.csdn.net/aspnet_lyc/article/details/12030039?utm_source=tuicool&utm_medium=refer ...
- wkwebview 代理介绍
iOS 8引入了一个新的框架——WebKit,之后变得好起来了.在WebKit框架中,有WKWebView可以替换UIKit的UIWebView和AppKit的WebView,而且提供了在两个平台可以 ...
- stm32——NFC芯片--PN532的使用
stm32——NFC芯片--PN532的使用 一.NFC简介 NFC(Near Field Communication)近场通信,是一种短距高频的无线电技术,在13.56MHz频率运行于20厘米距离内 ...
- 查询Oracle中字段名带"."的数据
SDE中的TT_L线层会有SHAPE.LEN这样的字段,使用: SQL>select shape.len from tt_l; 或 SQL>select t.shape.len from ...
- Android 更改字体
1. 将字体ttf文件放在assets目录下 2. 使用: Typeface mTypeFaceLight = Typeface.createFromAsset(context.getAssets() ...
- AJAX JSON类型返回
文本样式和下拉样式 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://w ...
- Android -- View setScale, setTranslation 对View矩阵的处理
参考: 1.Android Matrix理论与应用详解 2.2D平面中关于矩阵(Matrix)跟图形变换的讲解 3.Android中关于矩阵(Matrix)前乘后乘的一些认识 4.Android Ma ...