Most of these tools do not actually read a single line at a time from a file; instead, they read chunks of the file into an in-memory buffer and then operate on that buffer a line at a time.

NOTE: By "line" I mean text split on \n in grep's case, or on whatever character is designated as the delimiter when the tool is invoked.
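
To make that concrete, here is a minimal C sketch of the same chunked-read pattern (my illustration, not grep's actual source): read a 32k chunk, scan it for complete lines, and carry any unfinished trailing line over into the next read.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 32768   /* same chunk size we see grep use below */

/* Stand-in for the real work, e.g. regex matching.
 * Here: does the line contain the character '5'? */
static void process_line(const char *line, size_t len)
{
    if (memchr(line, '5', len)) {
        fwrite(line, 1, len, stdout);
        fputc('\n', stdout);
    }
}

int main(int argc, char **argv)
{
    char buf[BUFSZ];
    size_t have = 0;   /* bytes of an unfinished line carried over */
    int fd;
    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 2;
    for (;;) {
        ssize_t n = read(fd, buf + have, sizeof buf - have);
        if (n <= 0)
            break;     /* 0 = EOF, -1 = error */
        size_t avail = have + (size_t)n;
        char *p = buf, *nl;
        /* hand every complete line in this chunk to process_line() */
        while ((nl = memchr(p, '\n', avail - (p - buf))) != NULL) {
            process_line(p, nl - p);
            p = nl + 1;
        }
        /* keep the partial trailing line for the next read
         * (assumes no single line is longer than BUFSZ) */
        have = avail - (p - buf);
        memmove(buf, p, have);
    }
    if (have)   /* last line had no trailing newline */
        process_line(buf, have);
    close(fd);
    return 0;
}

Run this under strace against the sample file below and you will see the same pattern of 32768-byte read(2) calls that grep produces.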

Here's an example to illustrate this effect.

Sample data

Create a file, afile100k, containing 100,000 lines:

$ for i in $(seq 100000);do echo "file$i" >> afile100k; done
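
(With GNU coreutils, seq -f 'file%g' 100000 > afile100k should build the same file in one shot; -f here is GNU seq's printf-style format option.)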

Using strace

We can use strace to peek inside a running process, in this case the grep command.

$ strace -s 2000 -o log100k grep 5 afile100k

This will log strace's output, up to 2000 characters per line, to the file log100k. The command we'll be tracing is grep 5 afile100k.
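
If the full trace is too noisy, strace's -e filter can narrow it to just the calls we care about here, for example:

$ strace -e trace=read,write,lseek -s 2000 -o log100k grep 5 afile100k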

Here's some key output from the log:

ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fff8bf73b20) = -1 ENOTTY (Inappropriate ioctl for device)

read(3, "file1\nfile2\nfile3\nfile4\nfile5\nfile6\nfile7\nfile8\nfile9\nfile10\nfile11\nfile12\nfile13\nfile14\nfile15\nfile16\nfile17\nfile18\nfile19\nfile20\nfile21\nfile22\nfile23\nfile24\nfile25\nfile26\nfile27\nfile28\nfile29\nfile30\nfile31\nfile32\nfile33\nfile34\nfile35\nfile36\nfile37\nfile38\nfile39\nfile40\nfile41\nfile42\nfile43\nfile44\nfile45\nfile46\nfile47\nfile48\nfile49\nfile50\nfile51\nfile52\nfile53\nfile54\nfile55\nfile56\nfile57\nfile58\nfile59\nfile60\nfile61\nfile62\nfile63\nfile64\nfile65\nfile66\nfile67\nfile68\nfile69\nfile70\nfile71\nfile72\nfile73\nfile74\nfile75\nfile76\nfile77\nfile78\nfile79\nfile80\nfile81\nfile82\nfile83\nfile84\nfile85\nfile86\nfile87\nfile88\nfile89\nfile90\nfile91\nfile92\nfile93\nfile94\nfile95\nfile96\nfile97\nfile98\nfile99\nfile100\nfile101\nfile102\nfile103\nfile104\nfile105\nfile106\nfile107\nfile108\nfile109\nfile110\nfile111\nfile112\nfile113\nfile114\nfile115\nfile116\nfile117\nfile118\nfile119\nfile120\nfile121\nfile122\nfile123\nfile124\nfile125\nfile126\nfile127\nfile128\nfile129\nfile130\nfile131\nfile132\nfile133\nfile134\nfile135\nfile136\nfile137\nfile138\nfile139\nfile140\nfile141\nfile142\nfile143\nfile144\nfile145\nfile146\nfile147\nfile148\nfile149\nfile150\nfile151\nfile152\nfile153\nfile154\nfile155\nfile156\nfile157\nfile158\nfile159\nfile160\nfile161\nfile162\nfile163\nfile164\nfile165\nfile166\nfile167\nfile168\nfile169\nfile170\nfile171\nfile172\nfile173\nfile174\nfile175\nfile176\nfile177\nfile178\nfile179\nfile180\nfile181\nfile182\nfile183\nfile184\nfile185\nfile186\nfile187\nfile188\nfile189\nfile190\nfile191\nfile192\nfile193\nfile194\nfile195\nfile196\nfile197\nfile198\nfile199\nfile200\nfile201\nfile202\nfile203\nfile204\nfile205\nfile206\nfile207\nfile208\nfile209\nfile210\nfile211\nfile212\nfile213\nfile214\nfile215\nfile216\nfile217\nfile218\nfile219\nfile220\nfile221\nfile222\nfile223\nfile224\nfile225\nfile226\nfile227\nfile228\nfile229\nfile230\nfile231\nfile232\nfile233\nfile234\nfile235\nfile236\nfile237\nfile238\nfile239\nfile240\nfile241\nfile242\nfile243\nfile244\nfile245\nfile246\nfile247\nfile248\nfile249\nfile250\nfile251\nfile252\nfile253\nfile254\nfile255\nfile256\nfile257\nfile258\nfile259\nfile260\nfile261\nfile262\nfile263\nfile"..., 32768) = 32768

lseek(3, 32768, SEEK_HOLE)              = 988895
lseek(3, 32768, SEEK_SET) = 32768
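
Those two lseek calls look like GNU grep's sparse-file check: SEEK_HOLE asks for the first hole at or after offset 32768 and gets back 988895, which is simply the end of the file (i.e. there are no holes), and SEEK_SET then moves the offset back to 32768 so reading can continue where it left off. The 988895 is consistent with the file's size, which we can check by hand: lines file1-file9 are 6 bytes each ("fileN\n"), file10-file99 are 7 bytes, and so on, giving 9×6 + 90×7 + 900×8 + 9000×9 + 90000×10 + 11 = 988895 bytes.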

Notice that grep reads 32k (32768 bytes) at a time. NOTE: I've tried to break up the log a bit so that it's easier to read here on SE.

Now it starts writing out results:

fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 5), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcafcfff000
write(1, "file5\n", 6) = 6
write(1, "file15\n", 7) = 7
write(1, "file25\n", 7) = 7
write(1, "file35\n", 7) = 7
write(1, "file45\n", 7) = 7
write(1, "file50\n", 7) = 7
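
A few details worth noting: descriptor 1 is stdout, and the fstat(1, ...) showing S_IFCHR (a character device) tells grep it is writing to a terminal; the anonymous 4096-byte mmap is presumably its output buffer being allocated. Because the destination is a terminal, output is flushed line by line, which is why every match gets its own write(2) call; redirect the output to a file and you would typically see matches batched into larger writes. Each write's return value is the byte count, including the newline (6 for "file5\n", 7 for "file15\n").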

After exhausting the contents of this buffer, grep reads the next 32k (32768-byte) chunk from the file:

write(1, "file3759\n", 9)               = 9
read(3, "\nfile3765\nfile3766\nfile3767\nfile3768\nfile3769\nfile3770\nfile3771\nfile3772\nfile3773....\nfile3986\nf"..., 32768) = 32768
Followed by more writes:
write(1, "file3765\n", 9)               = 9
write(1, "file3775\n", 9) = 9
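
Notice that the read above begins with a bare \n: the previous chunk ended partway through a line, so grep has to hold on to that unfinished tail and complete it with data from the next chunk before it can test and print it, the same carry-over the memmove handles in the C sketch earlier.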

Grep continues like this until it has completely exhausted the contents of the file, at which point it exits.

write(1, "file99995\n", 10)             = 10
read(3, "", 24576) = 0
close(3) = 0
close(1) = 0
munmap(0x7fcafcfff000, 4096) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
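
The read(3, "", 24576) = 0 is the end-of-file condition: a read(2) that returns 0 means there is nothing left to consume, so grep closes its descriptors, unmaps its output buffer, and exits. The exit status of 0 also tells the shell that at least one match was found; grep exits with 1 when nothing matches.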

At work I ran into grep -w being slow, which led me to trace the process with strace -p. A quick Google search turned up this write-up, which is clear and easy to follow, so I'm recording it here without translating it.

Original link
