Why malloc+memset is slower than calloc?
https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc/
The short version: Always use calloc() instead of malloc()+memset(). In most cases, they will be the same. In some cases, calloc() will do less work because it can skip memset() entirely. In other cases, calloc() can even cheat and not allocate any memory! However, malloc()+memset() will always do the full amount of work.
Understanding this requires a short tour of the memory system.
Quick tour of memory
There are four main parts here: your program, the standard library, the kernel, and the page tables. You already know your program, so...
Memory allocators like malloc() and calloc() are mostly there to take small allocations (anything from 1 byte to 100s of KB) and group them into larger pools of memory. For example, if you allocate 16 bytes, malloc() will first try to get 16 bytes out of one of its pools, and then ask for more memory from the kernel when the pool runs dry. However, since the program you're asking about is allocating for a large amount of memory at once, malloc() and calloc() will just ask for that memory directly from the kernel. The threshold for this behavior depends on your system, but I've seen 1 MiB used as the threshold.
The kernel is responsible for allocating actual RAM to each process and making sure that processes don't interfere with the memory of other processes. This is called memory protection, it has been dirt common since the 1990s, and it's the reason why one program can crash without bringing down the whole system. So when a program needs more memory, it can't just take the memory, but instead it asks for the memory from the kernel using a system call like mmap() or sbrk(). The kernel will give RAM to each process by modifying the page table.
The page table maps memory addresses to actual physical RAM. Your process's addresses, 0x00000000 to 0xFFFFFFFF on a 32-bit system, aren't real memory but instead are addresses in virtual memory. The processor divides these addresses into 4 KiB pages, and each page can be assigned to a different piece of physical RAM by modifying the page table. Only the kernel is permitted to modify the page table.
How it doesn't work
Here's how allocating 256 MiB does not work:
- Your process calls - calloc()and asks for 256 MiB.
- The standard library calls - mmap()and asks for 256 MiB.
- The kernel finds 256 MiB of unused RAM and gives it to your process by modifying the page table. 
- The standard library zeroes the RAM with - memset()and returns from- calloc().
- Your process eventually exits, and the kernel reclaims the RAM so it can be used by another process. 
How it actually works
The above process would work, but it just doesn't happen this way. There are three major differences.
- When your process gets new memory from the kernel, that memory was probably used by some other process previously. This is a security risk. What if that memory has passwords, encryption keys, or secret salsa recipes? To keep sensitive data from leaking, the kernel always scrubs memory before giving it to a process. We might as well scrub the memory by zeroing it, and if new memory is zeroed we might as well make it a guarantee, so - mmap()guarantees that the new memory it returns is always zeroed.
- There are a lot of programs out there that allocate memory but don't use the memory right away. Some times memory is allocated but never used. The kernel knows this and is lazy. When you allocate new memory, the kernel doesn't touch the page table at all and doesn't give any RAM to your process. Instead, it finds some address space in your process, makes a note of what is supposed to go there, and makes a promise that it will put RAM there if your program ever actually uses it. When your program tries to read or write from those addresses, the processor triggers a page fault and the kernel steps in assign RAM to those addresses and resumes your program. If you never use the memory, the page fault never happens and your program never actually gets the RAM. 
- Some processes allocate memory and then read from it without modifying it. This means that a lot of pages in memory across different processes may be filled with pristine zeroes returned from - mmap(). Since these pages are all the same, the kernel makes all these virtual addresses point a single shared 4 KiB page of memory filled with zeroes. If you try to write to that memory, the processor triggers another page fault and the kernel steps in to give you a fresh page of zeroes that isn't shared with any other programs.
The final process looks more like this:
- Your process calls - calloc()and asks for 256 MiB.
- The standard library calls - mmap()and asks for 256 MiB.
- The kernel finds 256 MiB of unused address space, makes a note about what that address space is now used for, and returns. 
- The standard library knows that the result of - mmap()is always filled with zeroes (or will beonce it actually gets some RAM), so it doesn't touch the memory, so there is no page fault, and the RAM is never given to your process.
- Your process eventually exits, and the kernel doesn't need to reclaim the RAM because it was never allocated in the first place. 
If you use memset() to zero the page, memset() will trigger the page fault, cause the RAM to get allocated, and then zero it even though it is already filled with zeroes. This is an enormous amount of extra work, and explains why calloc() is faster than malloc() and memset(). If end up using the memory anyway, calloc() is still faster than malloc() and memset() but the difference is not quite so ridiculous.
This doesn't always work
Not all systems have paged virtual memory, so not all systems can use these optimizations. This applies to very old processors like the 80286 as well as embedded processors which are just too small for a sophisticated memory management unit.
This also won't always work with smaller allocations. With smaller allocations, calloc() gets memory from a shared pool instead of going directly to the kernel. In general, the shared pool might have junk data stored in it from old memory that was used and freed with free(), so calloc()could take that memory and call memset() to clear it out. Common implementations will track which parts of the shared pool are pristine and still filled with zeroes, but not all implementations do this.
Dispelling some wrong answers
Depending on the operating system, the kernel may or may not zero memory in its free time, in case you need to get some zeroed memory later. Linux does not zero memory ahead of time, and Dragonfly BSD recently also removed this feature from their kernel. Some other kernels do zero memory ahead of time, however. Zeroing pages durign idle isn't enough to explain the large performance differences anyway.
The calloc() function is not using some special memory-aligned version of memset(), and that wouldn't make it much faster anyway. Most memset() implementations for modern processors look kind of like this:
function memset(dest, c, len)
    // one byte at a time, until the dest is aligned...
    while (len > 0 && ((unsigned int)dest & 15))
        *dest++ = c
        len -= 1
    // now write big chunks at a time (processor-specific)...
    // block size might not be 16, it's just pseudocode
    while (len >= 16)
        // some optimized vector code goes here
        // glibc uses SSE2 when available
        dest += 16
        len -= 16
    // the end is not aligned, so one byte at a time
    while (len > 0)
        *dest++ = c
        len -= 1So you can see, memset() is very fast and you're not really going to get anything better for large blocks of memory.
The fact that memset() is zeroing memory that is already zeroed does mean that the memory gets zeroed twice, but that only explains a 2x performance difference. The performance difference here is much larger (I measured more than three orders of magnitude on my system between malloc()+memset() and calloc()).
Party trick
Instead of looping 10 times, write a program that allocates memory until malloc() or calloc()returns NULL.
What happens if you add memset()?
Why malloc+memset is slower than calloc?的更多相关文章
- Linux C 堆内存管理函数malloc()、calloc()、realloc()、free()详解
		C 编程中,经常需要操作的内存可分为下面几个类别: 堆栈区(stack):由编译器自动分配与释放,存放函数的参数值,局部变量,临时变量等等,它们获取的方式都是由编译器自动执行的 堆区(heap):一般 ... 
- malloc、calloc、realloc函数说明
		malloc 函数 #include <stdlib.h> void* malloc(int n); n为要分配的字节数,如果成功,返回获得空间的首地址,如果分配失败,则返回NULL,ma ... 
- malloc calloc realloc
		三个函数的申明分别是: void* realloc(void* ptr, unsigned newsize); void* malloc(unsigned size); void* calloc(si ... 
- Linux中brk()系统调用,sbrk(),mmap(),malloc(),calloc()的异同【转】
		转自:http://blog.csdn.net/kobbee9/article/details/7397010 brk和sbrk主要的工作是实现虚拟内存到内存的映射.在GNUC中,内存分配是这样的: ... 
- malloc,calloc,realloc
		与堆操作相关的两个函数 malloc #include<stdio.h> #include<stdlib.h> #include<string.h> int mai ... 
- malloc realloc calloc free
		自上次发现自己对这几个C函数不熟悉,就打算抽空整理一下,也就现在吧.这几个函数都是跟堆内存打交道的,还有一个好玩的函数--alloca,它是跟栈内存打交道的,我想留在以后研究出好玩点的来,再专门为其写 ... 
- 关于new,delete,malloc,free的一些总结
		首先,new,delete都是c++的关键字并不是函数,通过特定的语法组成表达式,new可以在编译的时候确定其返回值.可以直接使用string *p=new string("asdfgh&q ... 
- 动态内存管理详解:malloc/free/new/delete/brk/mmap
		c++ 内存获取和释放 new/delete,new[]/delete[] c 内存获取和释放 malloc/free, calloc/realloc 上述8个函数/操作符是c/c++语言里常用来做动 ... 
- 动态内存管理:malloc/free/new/delete/brk/mmap
		这是我去腾讯面试的时候遇到的一个问题——malloc()是如何申请内存的? c++ 内存获取和释放 new/delete,new[]/delete[] c 内存获取和释放 malloc/free, c ... 
随机推荐
- vue -- 打包资源正确引用及背景图引入
			一般情况下,通过webpack+vuecli默认打包的css.js等资源,路径都是绝对的. 但当部署到带有文件夹的项目中,这种绝对路径就会出现问题,因为把配置的static文件夹当成了根路径,那么要解 ... 
- mysql之SQL入门与提升(二)
			在mysql之SQL入门与提升(一)我们已经有了些许基础,今天继续深化 先造表 SET NAMES utf8;SET FOREIGN_KEY_CHECKS = 0; -- -------------- ... 
- EasyUI/TopJUI之如何动态改变下拉列表框ComboBox输入框的背景颜色
			简单记录一下 前段时间接到客户需求:动态改变下拉列表框ComboBox输入框的背景颜色. 刚开始想的很简单在用户选择列表项的时候,判断一下列表框的value值添加相应的背景颜色就OK了,然而在实际操作 ... 
- redux中createStore方法的默认参数
			一般使用方法: createStore(reducer, applyMiddleware(thunk)) 传递默认参数: createStore(reducer, defaultState, appl ... 
- chromedriver对应chrom版本
			chromedriver版本 支持的Chrome版本 v2.37 v64-66 v2.36 v63-65 v2.35 v62-64 v2.34 v61-63 v2.33 v60-62 v2.32 v5 ... 
- JMeter - REST API测试 - 完整的数据驱动方法(翻译)
			https://github.com/vinsguru/jmeter-rest-data-drivern/tree/master 在本文中,我想向您展示一种用于REST API测试的数据驱动方法.如果 ... 
- heapq模块
			该模块提供了堆排序算法的实现.堆是二叉树,最大堆中父节点大于或等于两个子节点,最小堆父节点小于或等于两个子节点. 创建堆 heapq有两种方式创建堆, 一种是使用一个空列表,然后使用heapq.hea ... 
- Docker从入门到实战(一)
			Docker从入门到实战(一) 一:容器技术与Docker概念 1 什么是容器 容器技术并不是一个全新的概念,它又称为容器虚拟化.虚拟化技术目前主要有硬件虚拟化.半虚拟化.操作系统虚拟化等.1.1关于 ... 
- Vue 项目: npm run dev  b报错 “'webpack-dev-server' 不是内部或外部命令,也不是可运行的程序 或批处理文件。”
			前提: 电脑已经安装了nodeJS和npm, 项目是直接下载的zip包. 报错步骤为1:cd /d 目录: 2. npm ren dev -------> 报错如下: > webpac ... 
- C# 或与非
			或 ||与 &&非 ! 逻辑或 | 逻辑与 & 逻辑非 ~ 
