How Python Handles Big Files
Python has become increasingly popular for data analysis and processing. It is easy to read and maintain, and pandas, a fast, flexible and easy-to-use data analysis and manipulation library built on top of Python, comes with a rich set of functions and methods. pandas is one of the main reasons Python is an efficient and powerful environment for data analysis.
pandas works in memory. It does a great job when the data to be manipulated fits into memory, but it is inconvenient, and sometimes unusable, for data that cannot be loaded as a whole. Large files, such as those exported from a database or downloaded from the web, are common in real-world business, so we need ways to manage them. This article looks at how.
By “big data” here I do not mean the TB- or PB-level data that requires distributed processing. I mean GB-level files that cannot fit into the memory of an ordinary PC but can be held on disk. This is the more common big-file processing scenario.
Since a big file can’t be loaded into memory at once, we usually have to read it line by line or chunk by chunk for further processing. Both Python and pandas support this kind of retrieval, but neither provides a cursor. Without a cursor mechanism, we have to write the chunk-by-chunk loop ourselves around every function or method we want to use, and sometimes we even have to implement the functions and methods ourselves. Below I list the typical big-file processing scenarios with code examples to show how Python deals with them.
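As a rough illustration of what writing the chunk-by-chunk retrieval yourself can look like in plain Python, here is a minimal sketch. The helper name `read_chunks`, the sample file and its two columns are made up for the example and are not part of the original programs.

```python
from itertools import islice
import os
import tempfile


def read_chunks(path, chunk_lines, encoding="utf-8"):
    """Yield (header, lines) pairs, each holding at most chunk_lines data lines."""
    with open(path, "r", encoding=encoding) as f:
        header = f.readline().rstrip("\n")
        while True:
            # islice pulls at most chunk_lines lines without loading the rest
            chunk = [line.rstrip("\n") for line in islice(f, chunk_lines)]
            if not chunk:
                break
            yield header, chunk


# Tiny made-up sample so the sketch runs on its own
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("id\tamount\n1\t10.5\n2\t20.0\n3\t4.5\n")

chunk_sizes = [len(lines) for _, lines in read_chunks(sample_path, 2)]
print(chunk_sizes)  # [2, 1]
```

Because `read_chunks` is a generator, only one chunk is held in memory at a time; every operation still has to be written as a loop over it.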
I. Aggregation
A simple aggregation traverses the values in the target column and applies the specified aggregate operation: sum adds up the traversed values, count records how many values were traversed, and mean keeps both a sum and a count and divides the former by the latter. Let’s look at how Python does a sum.
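Below is a part of the file, orders.txt. It is tab-separated with a header row; the examples rely on its amount column (the fifth field) and its state column.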
To calculate the total sales amount, sum the amount column:
1. Retrieve file line by line
    total = 0
    with open("orders.txt", 'r') as f:            # Open the file
        line = f.readline()                       # Read the header row
        while True:
            line = f.readline()                   # Read detail data line by line
            if not line:                          # Reading finishes when all lines are traversed
                break
            total += float(line.split("\t")[4])   # Get the cumulated value
    print(total)
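The same line-by-line traversal covers the other aggregates described earlier. A mean, for instance, needs a running sum and a running count. The sketch below uses the standard csv module; the function name `column_mean` and the sample rows are assumptions made for illustration.

```python
import csv
import io


def column_mean(lines, column, sep="\t"):
    """Return the mean of one numeric column, reading one row at a time."""
    reader = csv.DictReader(lines, delimiter=sep)
    total = 0.0
    count = 0
    for row in reader:
        total += float(row[column])
        count += 1
    # Avoid dividing by zero when the file holds only a header
    return total / count if count else float("nan")


# Made-up rows standing in for the real file
sample = io.StringIO("orderid\tamount\n1\t10.0\n2\t20.0\n3\t60.0\n")
print(column_mean(sample, "amount"))  # 30.0
```

With a real file you would pass an open file object, for example `open("orders.txt", newline="")`, instead of the `StringIO` sample.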
2. Retrieve file chunk by chunk in pandas
pandas supports data retrieval chunk by chunk. Below is the workflow diagram:
    import pandas as pd

    # Retrieve the file chunk by chunk; each chunk contains 100,000 lines
    chunk_data = pd.read_csv("orders.txt", sep="\t", chunksize=100000)
    total = 0
    for chunk in chunk_data:
        total += chunk['amount'].sum()            # Add up the amounts of all chunks
    print(total)
pandas is good at retrieving and processing data in large chunks. In theory, the bigger the chunk, the faster the processing, provided the chunk still fits into the available memory. If chunksize is set to 1, retrieval becomes line by line, which is extremely slow, so I do not recommend line-by-line retrieval when handling large files in pandas.
II. Filtering
The workflow diagram for filtering in pandas:
Similar to the aggregation, pandas will divide a big file into multiple chunks (n), filter each data chunk and concatenate the filtering results.
To get the sales records for New York state from the above file:
1. When the result set is small
    import pandas as pd

    chunk_data = pd.read_csv("orders.txt", sep="\t", chunksize=100000)
    chunk_list = []                               # Define an empty list for storing the result set
    for chunk in chunk_data:
        chunk_list.append(chunk[chunk.state == "New York"])   # Filter chunk by chunk
    res = pd.concat(chunk_list)                   # Concatenate the filtering results
    print(res)
2. When the result set is large
    import pandas as pd

    chunk_data = pd.read_csv("orders.txt", sep="\t", chunksize=100000)
    n = 0
    for chunk in chunk_data:
        need_data = chunk[chunk.state == 'New York']
        if n == 0:
            # First chunk: write to the target file with headers retained and index removed
            need_data.to_csv("orders_filter.txt", index=None)
            n += 1
        else:
            # Other chunks: append to the target file with both headers and index removed
            need_data.to_csv("orders_filter.txt", index=None, mode='a', header=None)
The logic of aggregation and filtering is simple. But because Python doesn’t provide a cursor data type, we need to write a lot of code to get them done.
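One way to reduce the repetition is to wrap the loop in a generator that behaves a little like a cursor: it yields matching rows one at a time, and a separate function decides whether to collect them or stream them to a file. This is only a sketch using the standard csv module; `filtered_rows`, `write_rows` and the sample data are hypothetical names and values chosen for the example.

```python
import csv
import os
import tempfile


def filtered_rows(path, predicate, sep="\t"):
    """Yield rows (as dicts) that satisfy predicate, one at a time."""
    with open(path, "r", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=sep):
            if predicate(row):
                yield row


def write_rows(rows, out_path, fieldnames, sep="\t"):
    """Stream rows to out_path and return the number of rows written."""
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=sep,
                                lineterminator="\n")
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            written += 1
    return written


# Made-up sample so the sketch runs on its own
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "orders_sample.txt")
dst = os.path.join(work_dir, "orders_filter.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("orderid\tstate\tamount\n1\tNew York\t10\n2\tTexas\t20\n3\tNew York\t30\n")

matches = filtered_rows(src, lambda r: r["state"] == "New York")
n_written = write_rows(matches, dst, ["orderid", "state", "amount"])
print(n_written)  # 2
```

For a small result set, `list(filtered_rows(...))` collects the rows in memory instead. The trade-off is that row-at-a-time processing in pure Python is slower than chunk-wise pandas operations.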
III. Sorting
The workflow diagram for sorting in pandas:
Sorting is complicated because you need to:
1. Retrieve one chunk at a time;
2. Sort the chunk;
3. Write the sorted chunk to a temporary file;
4. Maintain a list of k elements (k is the number of chunks) that holds one row from each temporary file;
5. Sort the records in the list by the sorting field (in the same direction as in step 2);
6. Write the record with the smallest value (for ascending order) or the largest value (for descending order) to the result file;
7. Put the next row from the temporary file that supplied the written record into the list;
8. Repeat steps 5 to 7 until all records are written to the result file.
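The merge phase (steps 4 onward) can also be sketched with the standard library's `heapq.merge`, which keeps one pending line per input file in a heap. This is an illustrative alternative under simplifying assumptions, not the program presented in this article: every input file is non-empty, has the same header, ends each line with a newline, and is already sorted in ascending order on a numeric column. The function name `merge_sorted_files` and the sample files are made up.

```python
import heapq
import os
import tempfile
from contextlib import ExitStack


def merge_sorted_files(paths, out_path, key_index, sep="\t"):
    """Merge header-bearing files that are each sorted ascending on a numeric column."""

    def sort_key(line):
        return float(line.rstrip("\n").split(sep)[key_index])

    with ExitStack() as stack:
        files = [stack.enter_context(open(p, "r", encoding="utf-8")) for p in paths]
        headers = [f.readline() for f in files]  # consume one header per file
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(headers[0])
            # heapq.merge lazily yields lines in key order across all inputs
            for line in heapq.merge(*files, key=sort_key):
                out.write(line)


# Two made-up, already sorted "temporary files"
work_dir = tempfile.mkdtemp()
parts = []
for name, amounts in [("a.txt", [1.0, 5.0]), ("b.txt", [2.0, 3.0])]:
    path = os.path.join(work_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write("orderid\tamount\n")
        for i, amt in enumerate(amounts):
            f.write(f"{i}\t{amt}\n")
    parts.append(path)

merged = os.path.join(work_dir, "merged.txt")
merge_sorted_files(parts, merged, key_index=1)
with open(merged, encoding="utf-8") as f:
    merged_amounts = [float(line.split("\t")[1]) for line in f.readlines()[1:]]
print(merged_amounts)  # [1.0, 2.0, 3.0, 5.0]
```

A heap finds the next record in O(log k) time instead of re-sorting the k-element list on every step, but the work is still done one line at a time in Python.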
To sort the above file by amount in ascending order, I wrote a complete Python program that implements the external sorting algorithm:
    import pandas as pd
    import os
    import time
    import shutil
    import uuid
    import traceback

    # Function: parse the data type of a string
    def parse_type(s):
        if s.isdigit():
            return int(s)
        try:
            res = float(s)
            return res
        except ValueError:
            return s

    # Function: find the position of the sorting column among the headers
    def pos_by(by, head, sep):
        by_num = 0
        for col in head.split(sep):
            if col.strip() == by:
                break
            else:
                by_num += 1
        return by_num

    # Function: external merge sort
    def merge_sort(directory, ofile, by, ascending=True, sep=","):
        with open(ofile, 'w') as outfile:
            # List the temporary files
            file_list = os.listdir(directory)
            # Open the temporary files
            file_chunk = [open(directory + "/" + file, 'r') for file in file_list]
            # Read the headers
            k_row = [file_chunk[i].readline() for i in range(len(file_chunk))]
            # Get the position of the sorting column among the headers
            by = pos_by(by, k_row[0], sep)
            # Export the headers
            outfile.write(k_row[0])
        # Read the first line of detail data from each temporary file
        k_row = [file_chunk[i].readline() for i in range(len(file_chunk))]
        # Maintain a list of k elements to store the k sorting column values
        k_by = [parse_type(k_row[i].split(sep)[by].strip()) for i in range(len(file_chunk))]
        with open(ofile, 'a') as outfile:
            while True:
                # Read and process the temporary files one by one
                for i in range(len(k_by)):
                    if i >= len(k_by):
                        break
                    # Sort the list in the requested direction
                    sorted_k_by = sorted(k_by) if ascending else sorted(k_by, reverse=True)
                    if k_by[i] == sorted_k_by[0]:
                        # Export the row with the smallest (or, when descending, largest) value
                        outfile.write(k_row[i])
                        k_row[i] = file_chunk[i].readline()
                        if not k_row[i]:
                            # Finish reading the file
                            file_chunk[i].close()
                            del(file_chunk[i])
                            del(k_row[i])
                            del(k_by[i])
                        else:
                            # The file isn't finished: continue reading and update the list
                            k_by[i] = parse_type(k_row[i].split(sep)[by].strip())
                if len(k_by) == 0:
                    break

    # Function: external sort
    def external_sort(file_path, by, ofile, tmp_dir, ascending=True, chunksize=50000, sep=',',
                      usecols=None, index_col=None):
        # Create a directory to store the temporary files
        os.makedirs(tmp_dir, exist_ok=True)
        try:
            # Retrieve the file chunk by chunk
            data_chunk = pd.read_csv(file_path, sep=sep, usecols=usecols, index_col=index_col, chunksize=chunksize)
            for chunk in data_chunk:
                # Sort the chunks one by one
                chunk = chunk.sort_values(by, ascending=ascending)
                # Write the sorted chunk to a temporary file
                chunk.to_csv(tmp_dir + "/" + "chunk" + str(int(time.time()*10**7)) + str(uuid.uuid4()) + ".csv", index=None, sep=sep)
            # External merge sort
            merge_sort(tmp_dir, ofile=ofile, by=by, ascending=ascending, sep=sep)
        except Exception:
            print(traceback.format_exc())
        finally:
            # Delete the temporary directory
            shutil.rmtree(tmp_dir, ignore_errors=True)

    # Main program: call the external sort function
    if __name__ == "__main__":
        infile = "D:/python_question_data/orders.txt"
        ofile = "D:/python_question_data/extra_sort_res_py.txt"
        tmp = "D:/python_question_data/tmp"
        external_sort(infile, 'amount', ofile, tmp, ascending=True, chunksize=1000000, sep='\t')
The merge step reads and writes line by line in plain Python. I didn’t use pandas for it because pandas is extremely slow at line-wise retrieval, although a chunk-wise merge in pandas would be fast. You can compare their speeds if you want to.
The code is far more complicated than that for aggregation and filtering, and is beyond the ability of most non-professional programmers. The second problem is that it is slow to execute.
The third problem is that it only handles standard structured files and single-column sorting. If the file has no header row, if rows contain a variable number of separators, if the sorting column contains dates in a nonstandard format, or if there are multiple sorting columns, the code becomes even more complicated.
IV. Grouping
Grouping and summarizing a big file in Python isn’t easy either. A convenient way out is to sort the file by the grouping column and then traverse the sorted file: a record goes into the same group as the previous record if their grouping column values are equal, and starts a new group otherwise. If the result set is too large, we need to write the grouping results to disk before memory runs out.
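Once the rows arrive in grouping-column order, the traversal described above maps onto `itertools.groupby`, which starts a new group whenever the key changes. A minimal sketch, assuming the rows have already been split into fields; the function name `sum_sorted_groups` and the sample rows are made up for illustration.

```python
from itertools import groupby


def sum_sorted_groups(rows, key_index, value_index):
    """Yield (key, total) for rows that are already ordered by the key column."""
    for key, group in groupby(rows, key=lambda r: r[key_index]):
        # group is consumed lazily, so only one row is held at a time
        yield key, sum(float(r[value_index]) for r in group)


# Made-up rows, already sorted by state (column 0)
rows = [
    ["California", "5"],
    ["New York", "10"],
    ["New York", "30"],
    ["Texas", "20"],
]
for state, total in sum_sorted_groups(rows, 0, 1):
    print(state, total)
```

Because the function is a generator, each group's result can be written out as soon as it is complete, which is what keeps memory use flat when the result set is large.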
This is convenient yet slow because the whole file has to be sorted first. Databases generally use hash grouping to increase speed. It’s effective but much more complicated, and almost impossible for non-professionals to implement.
So, handling big files in Python is inconvenient and difficult because there is no cursor data type and no related functions. We have to write all the code ourselves, and that code is inefficient.
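In its simplest form, hash grouping is a dictionary of running totals, since a Python dict is a hash table. The sketch below assumes that the number of distinct groups fits in memory. The hard part mentioned above, spilling partitions to disk when it does not, is deliberately left out. `hash_group_sum` and the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_group_sum(rows, key_index, value_index):
    """Accumulate per-key totals in a dict; the input order does not matter."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += float(row[value_index])
    return dict(totals)


# Made-up, unsorted rows
rows = [
    ["New York", "10"],
    ["Texas", "20"],
    ["New York", "30"],
]
print(hash_group_sum(rows, 0, 1))  # {'New York': 40.0, 'Texas': 20.0}
```

No sort is needed, so the file is read only once; the memory cost grows with the number of groups rather than the number of rows.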
If only there were a language that a non-professional programmer could use to process large files. Luckily, we have esProc SPL.
It’s convenient and easy to use because SPL is designed to process structured data and comes with a richer function library than pandas as well as a built-in cursor data type. It handles large files concisely, effortlessly and efficiently.
1. Aggregation
| | A |
| 1 | =file(file_path).cursor@tc() |
| 2 | =A1.total(sum(col)) |
2. Filtering
| | A | B |
| 1 | =file(file_path).cursor@tc() | |
| 2 | =A1.select(key==condition) | |
| 3 | =A2.fetch() | / Fetch data from a small result set |
| 4 | =file(out_file).export@tc(A2) | / Write a large result set to a target file |
3. Sorting
| | A |
| 1 | =file(file_path).cursor@tc() |
| 2 | =A1.sortx(key) |
| 3 | =file(out_file).export@tc(A2) |
4. Grouping
| | A | B |
| 1 | =file(file_path).cursor@tc() | |
| 2 | =A1.groups(key;sum(coli):total) | / Return a small result set directly |
| 3 | =A1.groupx(key;sum(coli):total) | |
| 4 | =file(out_file).export@tc(A3) | / Write a large result set to a target file |
SPL also uses the hash grouping algorithm mentioned above to increase performance.
SPL has built-in parallel processing that makes full use of multi-core CPUs to boost performance. Adding the @m option is all it takes to make a function compute in parallel.
| | A |
| 1 | =file(file_path).cursor@mtc() |
| 2 | =A1.groups(key;sum(coli):total) |
Parallel programs can be written in Python in many ways, but none of them is simple enough.