why’s kafka so fast
As we all know that Kafka is very fast, much faster than most of its competitors. So what’s the reason here?
Avoid Random Disk Access
Kafka writes everything onto the disk in order and consumers fetch data in order too. So disk access always works sequentially instead of randomly. For traditional hard disks(HDD), sequential access is much faster than random access. Here is a comparison:
hardware | sequential writes | random writes |
---|---|---|
6 * 7200rpm SATA RAID-5 | 300MB/s | 50KB/s |
Kafka Writes Everything Onto The Disk Instead of Memory
Yes, you read that right. Kafka writes everything onto the disk instead of memory. But wait a moment, isn’t memory supposed to be faster than disks? Typically it’s the case, for Random Disk Access. But for sequential access, the difference is much smaller. Here is a comparison taken from https://queue.acm.org/detail.cfm?id=1563874.
As you can see, it’s not that different. But still, sequential memory access is faster than Sequential Disk Access, why not choose memory? Because Kafka runs on top of JVM, which gives us two disadvantages.
1.The memory overhead of objects is very high, often doubling the size of the data stored(or even higher).
2.Garbage Collection happens every now and then, so creating objects in memory is very expensive as in-heap data increases because we will need more time to collect unused data(which is garbage).
So writing to file systems may be better than writing to memory. Even better, we can utilize MMAP(memory mapped files) to make it faster.
Memory Mapped Files(MMAP)
Basically, MMAP(Memory Mapped Files) can map the file contents from the disk into memory. And when we write something into the mapped memory, the OS will flush the change onto the disk sometime later. So everything is faster because we are using memory actually, but in an indirect way. So here comes the question. Why would we use MMAP to write data onto disks, which later will be mapped into memory? It seems to be a roundabout route. Why not just write data into memory directly? As we have learned previously, Kafka runs on top of JVM, if we wrote data into memory directly, the memory overhead would be high and GC would happen frequently. So we use MMAP here to avoid the issue.
Zero Copy
Suppose that we are fetching data from the memory and sending them to the Internet. What is happening in the process is usually twofold.
1.To fetch data from the memory, we need to copy those data from the Kernel Context into the Application Context.
2.To send those data to the Internet, we need to copy the data from the Application Context into the Kernel Context.
As you can see, it’s redundant to copy data between the Kernel Context and the Application Context. Can we avoid it? Yes, using Zero Copy we can copy data directly from the Kernel Context to the Kernel Context.
Batch Data
Kafka only sends data when batch.size
is reached instead of one by one. Assuming the bandwidth is 10MB/s, sending 10MB data in one go is much faster than sending 10000 messages one by one(assuming each message takes 100 bytes).
"I would be fine and made no troubles."
why’s kafka so fast的更多相关文章
- Apache Kafka for Item Setup
At Walmart.com in the U.S. and at Walmart's 11 other websites around the world, we provide seamless ...
- Apache Kafka - Schema Registry
关于我们为什么需要Schema Registry? 参考, https://www.confluent.io/blog/how-i-learned-to-stop-worrying-and-love- ...
- Understanding, Operating and Monitoring Apache Kafka
Apache Kafka is an attractive service because it's conceptually simple and powerful. It's easy to un ...
- Build an ETL Pipeline With Kafka Connect via JDBC Connectors
This article is an in-depth tutorial for using Kafka to move data from PostgreSQL to Hadoop HDFS via ...
- How to choose the number of topics/partitions in a Kafka cluster?
This is a common question asked by many Kafka users. The goal of this post is to explain a few impor ...
- kafka producer源码
producer接口: /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor l ...
- Flume-ng+Kafka+storm的学习笔记
Flume-ng Flume是一个分布式.可靠.和高可用的海量日志采集.聚合和传输的系统. Flume的文档可以看http://flume.apache.org/FlumeUserGuide.html ...
- Apache Kafka: Next Generation Distributed Messaging System---reference
Introduction Apache Kafka is a distributed publish-subscribe messaging system. It was originally dev ...
- Exploring Message Brokers: RabbitMQ, Kafka, ActiveMQ, and Kestrel--reference
[This article was originally written by Yves Trudeau.] http://java.dzone.com/articles/exploring-mess ...
随机推荐
- Eureka源码解析系列文章汇总
先看一张图 0 这个图是Eureka官方提供的架构图,整张图基本上把整个Eureka的核心功能给列出来了,当你要阅读Eureka的源码时可以参考着这个图和下方这些文章 EurekaServer Eur ...
- 怎么把使用vuepress搭建的博客部署到Github Pages
推荐在这里阅读效果更佳 背景 网上搜了很多教程,包括官网的教程,但是还是费了一番功夫, 如果你使用自动化部署脚本部署不成功的话,可以参考我的这个笨方法 这是部署后的效果 前提 我假设你本地运行OK, ...
- Bootstrap Table列宽拖动的方法
在之前做过的一个web项目中,前端表格是基于jQuery和Bootstrap Table实现的,要求能利用拖动改变列宽,现将实现的过程记录如下: 1. Bootstrap Table可拖动,需要用到它 ...
- 关于重学Linux的随笔
毕业已有半年, 现在想想真的后悔, 大学没有认真学Linux, 导致现在Linux操作抓瞎, 连服务器都搭不起来. 下定决心重学Linux, 不追求能比上大佬, 但是要熟练, 常用命令要熟悉. 作为一 ...
- VUE注册局部组件
// 局部组件命名规范 /* 1文件夹名大驼峰 MyLocalBtn.vue 2 使用的时候 将驼峰转化为横杠 <my-local-btn></my-local-btn> */ ...
- 1.编译chromium
1. 前言 做了两年Chromium相关的开发,最近项目遇到瓶颈,自己有点迷茫.回顾之前做的工作,发现对chromium的认识还停留在非常表面的水平.因此,一直想对之前做的做个总结,只有总结反思才能提 ...
- JS高阶---定时器相关
首先看几个问题: [主体] (1)定时器真的时定时执行的吗? 顺序验证: 测试结果: 接下来对上述代码做下修改,增加一个长时间工作的消耗,此时再来验证下定时器运行的精准度 结果如下: (2)定时器回调 ...
- html5 localStorage讲解
早期的web中使用cookies在客户端保存诸如用户名等简单的信息,但是,在使用cookies存储永久数据存在以下问题. 1.cookies的大小限制在4kB,不适合大量的数据存储. 2.浏览器还限制 ...
- 【Spring IoC】依赖注入DI(四)
平常的Java开发中,程序员在某个类中需要依赖其它类的方法.通常是new一个依赖类再调用类实例的方法,这种开发存在的问题是new的类实例不好统一管理. Spring提出了依赖注入的思想,即依赖类不由程 ...
- idea 配置 scala
在setting 中,通过plugin 安装 Scala 然后重启idea 重启后,建一个scala文件,根据上面提示安装scala