As we all know, Kafka is very fast, much faster than most of its competitors. So what makes it so fast?

Avoid Random Disk Access

Kafka writes everything to disk in order, and consumers fetch data in order too, so disk access is always sequential rather than random. On traditional hard disks (HDDs), sequential access is dramatically faster than random access. Here is a comparison:

hardware                    sequential writes    random writes
6 * 7200rpm SATA RAID-5     300 MB/s             50 KB/s
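To make this concrete, here is a minimal sketch of an append-only log writer, the access pattern Kafka relies on (the file name and record format are invented for the example; this is not Kafka's actual segment code):

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class AppendOnlyLog {
        public static void main(String[] args) throws Exception {
            // APPEND mode means every write goes to the end of the file,
            // so the disk head (on an HDD) never has to seek.
            try (FileChannel log = FileChannel.open(Path.of("segment.log"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                    StandardOpenOption.APPEND)) {
                for (int i = 0; i < 1000; i++) {
                    ByteBuffer record = ByteBuffer.wrap(
                            ("record-" + i + "\n").getBytes(StandardCharsets.UTF_8));
                    log.write(record); // strictly sequential writes
                }
            }
        }
    }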

Kafka Writes Everything Onto The Disk Instead of Memory

Yes, you read that right. Kafka writes everything to disk instead of keeping it in memory. But wait a moment, isn't memory supposed to be faster than disk? That is typically true for random access. For sequential access, however, the gap is much smaller; see the comparison chart in https://queue.acm.org/detail.cfm?id=1563874.

As you can see, the difference is not that large. Still, sequential memory access is faster than sequential disk access, so why not choose memory? Because Kafka runs on top of the JVM, which brings two disadvantages:

1. The memory overhead of objects is very high, often doubling the size of the data stored (or even more).

2. Garbage collection becomes more and more expensive as the amount of in-heap data grows, since the collector needs more time to find and reclaim unused objects.

So writing to the file system may be better than writing to memory. Better still, we can use memory-mapped files (MMAP) to speed it up further.

Memory Mapped Files (MMAP)

Basically, MMAP (memory-mapped files) maps file contents from disk into memory. When we write into the mapped memory, the OS flushes the changes to disk some time later. So everything is fast because we are actually using memory, just indirectly. Which raises a question: why go through MMAP to write data that ends up mapped into memory anyway? It seems like a roundabout route; why not just write into memory directly? As we learned above, Kafka runs on top of the JVM: writing data directly into heap memory would inflate its size and make GC frequent and expensive. A mapped file instead lives in the OS page cache, outside the JVM heap, so MMAP sidesteps both problems.
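Here is a minimal sketch of writing through a memory-mapped file in Java (the file name and mapping size are arbitrary choices for the example):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;

    public class MmapWrite {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile("demo.log", "rw");
                 FileChannel channel = file.getChannel()) {
                // Map the first 4 KB of the file into memory. The buffer lives
                // in the OS page cache, not in the JVM heap, so it creates no
                // garbage for the collector to track.
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                buffer.put("hello kafka".getBytes(StandardCharsets.UTF_8));
                // The OS flushes dirty pages in the background; force() asks
                // it to flush this region right away.
                buffer.force();
            }
        }
    }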

Zero Copy

Suppose we are fetching data from memory and sending it out to the Internet. The process usually involves two copies:

1. To fetch the data, we copy it from the kernel context into the application context.

2. To send the data to the Internet, we copy it from the application context back into the kernel context.

As you can see, copying data back and forth between the kernel context and the application context is redundant. Can we avoid it? Yes. With zero copy, the data never enters the application context at all: it is copied within the kernel, straight from the page cache to the socket buffer.
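In Java this capability is exposed as FileChannel.transferTo(), which Kafka uses to push log data to consumers; on Linux it translates to the sendfile system call. A minimal sketch (the host, port, and file name are placeholders):

    import java.io.FileInputStream;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class ZeroCopySend {
        public static void main(String[] args) throws Exception {
            try (FileChannel file = new FileInputStream("demo.log").getChannel();
                 SocketChannel socket = SocketChannel.open(
                         new InetSocketAddress("localhost", 9000))) {
                long position = 0;
                long size = file.size();
                // transferTo() asks the OS to move the bytes itself
                // (sendfile on Linux): they go from the page cache straight
                // to the socket, never touching our application buffers.
                while (position < size) {
                    position += file.transferTo(position, size - position, socket);
                }
            }
        }
    }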

Batch Data

Kafka producers do not send messages one at a time; records accumulate into batches, and a batch is sent once it reaches batch.size or once linger.ms expires. The win comes from amortizing per-request overhead: with 100-byte messages, sending 10 MB as one request is much faster than sending the same data as 100,000 individual requests, even over the same 10 MB/s link, because round trips, protocol headers, and system calls are paid once instead of 100,000 times.
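As a sketch, here is a producer configured for batching; the broker address, topic name, and tuning values are illustrative assumptions, not recommendations:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BatchingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            // Accumulate up to 64 KB per partition before sending a request...
            props.put("batch.size", 65536);
            // ...but never hold a batch back longer than 10 ms.
            props.put("linger.ms", 10);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 100_000; i++) {
                    // send() is asynchronous: the record joins the current
                    // batch and ships with it in a single network request.
                    producer.send(new ProducerRecord<>("demo-topic", "msg-" + i));
                }
            }
        }
    }

Setting linger.ms above zero trades a little latency for much higher throughput, since each network request then carries many records.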

"I would be fine and made no troubles."
