Huge CSV and XML Files in Python, Error: field larger than field limit (131072)
Huge CSV and XML Files in Python
January 22, 2009. Filed under python
I, like most people, never realized I'd be dealing with large files. Oh, I knew there would be some files with megabytes of data, but I never suspected I'd be begging Perl to process hundreds of megabytes of XML, nor that this week I'd be asking Python to process 6.4 gigabytes of CSV into 6.5 gigabytes of XML [1].
As a few out-of-memory experiences will teach you, the trick for dealing with large files is pretty easy: use code that treats everything as a stream. For inputs, read from disk in chunks. For outputs, frequently write to disk and let system memory forge onward unburdened.
When reading and writing files yourself, this is easier to do correctly...
    from __future__ import with_statement # for python 2.5

    with open('data.in','r') as fin:
        with open('data.out','w') as fout:
            for line in fin:
                fout.write(','.join(line.split(' ')))
...than it is to do incorrectly...
    with open('data.in','r') as fin:
        data = fin.readlines()  # the entire file is now in memory
    data2 = [ ','.join(x.split(' ')) for x in data ]  # plus a second full copy
    with open('data.out','w') as fout:
        fout.writelines(data2)
...at least in simple cases.
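Iterating by line only bounds memory when the input actually contains line breaks. For inputs that don't, a fixed-size read loop gives the same streaming behavior. The following is a minimal sketch in Python 3 rather than the Python 2.5 used above; the name `copy_in_chunks` and the 1 MiB default are illustrative assumptions, not part of the original script.

```python
import io

def copy_in_chunks(fin, fout, chunk_size=1024 * 1024):
    """Copy fin to fout, holding at most chunk_size bytes in memory at once."""
    total = 0
    # iter() with a sentinel keeps calling fin.read() until it returns b'' (EOF).
    for chunk in iter(lambda: fin.read(chunk_size), b''):
        fout.write(chunk)
        total += len(chunk)
    return total

# Small in-memory demonstration; real use would pass files opened in 'rb'/'wb' mode.
src = io.BytesIO(b'x' * 2500)
dst = io.BytesIO()
copied = copy_in_chunks(src, dst, chunk_size=1000)
```

The same loop works unchanged on real file objects, since it only relies on `read` and `write`.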
Loading Large CSV Files in Python
Python has an excellent csv library, which can handle large files right out of the box. Sort of.
    >>> import csv
    >>> r = csv.reader(open('doc.csv', 'rb'))
    >>> for row in r:
    ...     print row
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    _csv.Error: field larger than field limit (131072)
Staring at the module documentation [2], I couldn't find anything of use. So I cracked open the csv.py file and confirmed what the _csv in the error message suggests: the bulk of the module's code (and the input parsing in particular) is implemented in C rather than Python.
After a while staring at that error, I began dreaming of how I would create a stream pre-processor using StringIO, but it didn't take too long to figure out I would need to recreate my own version of csv in order to accomplish that.
So back to the blogs, one of which held the magic grain of information I was looking for: csv.field_size_limit.
    >>> import csv
    >>> csv.field_size_limit()
    131072
    >>> csv.field_size_limit(1000000000)
    131072
    >>> csv.field_size_limit()
    1000000000
Yep. That's all there is to it. The sucker just works after that.
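If you would rather not pick a number, one common approach is to ask for the largest value the platform accepts. This is a sketch in Python 3 under the assumption that `csv.field_size_limit` raises `OverflowError` when the value does not fit in a C long (as happens on some platforms); the helper name `raise_field_size_limit` is made up for illustration.

```python
import csv
import io
import sys

def raise_field_size_limit(limit=sys.maxsize):
    """Set the csv field size limit as high as the platform allows."""
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value did not fit in a C long here, so try a smaller one.
            limit //= 2

applied = raise_field_size_limit()

# A field well past the 131072 default now parses without error.
big_field = 'x' * 200000
rows = list(csv.reader(io.StringIO('id,%s\n' % big_field)))
```

Note that the limit is a module-wide setting, so it affects every reader created afterwards in the same process.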
Well, almost. I did run into an issue with a NUL byte 1.5 gigs into the data. Because the streaming code is written using C-based I/O, the NUL byte cuts the read short in an abrupt and non-recoverable manner. To get around this we need to pre-process the stream somehow. You could do that in Python by wrapping the file with a custom class that cleans each line before returning it, but I went with some command line utilities for simplicity.
    cat data.in | tr -d '\0' > data.out
After that, the 6.4 gig CSV file processed without any issues.
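For comparison, the in-Python wrapper mentioned above can be quite small. This is a rough sketch in Python 3, not what I actually ran; `strip_nul` is a hypothetical helper name, and a generator stands in for the custom class because `csv.reader` accepts any iterable of strings.

```python
import csv
import io

def strip_nul(lines):
    """Yield each line with NUL characters removed, one line at a time."""
    for line in lines:
        yield line.replace('\0', '')

# In-memory stand-in for a file with a stray NUL in the first row.
sample = io.StringIO('a,b\x00,c\n1,2,3\n')
rows = list(csv.reader(strip_nul(sample)))
```

Because the generator hands over one cleaned line at a time, memory use stays flat regardless of file size.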
Creating Large XML Files in Python
This part of the process, taking each row of csv and converting it into an XML element, went fairly smoothly thanks to the xml.sax.saxutils.XMLGenerator class. The API for creating elements isn't an example of simplicity, but it is--unlike many of the more creative schemes--predictable, and has one killer feature: it correctly writes output to a stream.
As I mentioned, the mechanism for creating elements was a bit verbose, so I made a couple of wrapper functions to simplify (note that I am sending output to standard out, which lets me simply print strings to the file I am generating, for example creating the XML file's version declaration).
    import sys
    from xml.sax.saxutils import XMLGenerator
    from xml.sax.xmlreader import AttributesNSImpl

    g = XMLGenerator(sys.stdout, 'utf-8')

    def start_tag(name, attr={}, body=None, namespace=None):
        attr_vals = {}
        attr_keys = {}
        for key, val in attr.iteritems():
            key_tuple = (namespace, key)
            attr_vals[key_tuple] = val
            attr_keys[key_tuple] = key
        attr2 = AttributesNSImpl(attr_vals, attr_keys)
        g.startElementNS((namespace, name), name, attr2)
        if body:
            g.characters(body)

    def end_tag(name, namespace=None):
        g.endElementNS((namespace, name), name)

    def tag(name, attr={}, body=None, namespace=None):
        start_tag(name, attr, body, namespace)
        end_tag(name, namespace)
From there, usage looks like this:
    print """<?xml version="1.0" encoding="utf-8"?>"""
    start_tag(u'list', {u'id': u'10'})  # attribute values must be strings
    for item in some_list:
        start_tag(u'item', {u'id': item[0]})
        tag(u'title', body=item[1])
        tag(u'desc', body=item[2])
        end_tag(u'item')
    end_tag(u'list')
    g.endDocument()
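To show the whole pipeline in one place, here is a compact sketch in Python 3 that streams CSV rows straight into XMLGenerator. It departs from my script in two labeled ways: it uses the plain `startElement`/`endElement` calls rather than the namespace-aware ones, and it lets `startDocument()` write the XML declaration instead of printing it. The function name `rows_to_xml` and the three-column row layout are assumptions for illustration.

```python
import csv
import io
from xml.sax.saxutils import XMLGenerator

def rows_to_xml(rows, out):
    """Write each (id, title, desc) row to out as an <item> element."""
    g = XMLGenerator(out, 'utf-8')
    g.startDocument()  # writes the XML declaration
    g.startElement('list', {'id': '10'})
    for item_id, title, desc in rows:
        # Attribute values must already be strings.
        g.startElement('item', {'id': item_id})
        for name, text in (('title', title), ('desc', desc)):
            g.startElement(name, {})
            g.characters(text)  # escapes &, < and > on the way out
            g.endElement(name)
        g.endElement('item')
    g.endElement('list')
    g.endDocument()

# In-memory stand-ins for the input CSV and the output file.
source = io.StringIO('1,First,A & B\n2,Second,x < y\n')
out = io.StringIO()
rows_to_xml(csv.reader(source), out)
xml_text = out.getvalue()
```

Only one row is held at a time, since both `csv.reader` and the generator work incrementally.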
The one issue I did run into (in my data) was some form feed characters floating around (^L, aka decimal 12, aka \x0c). These are not legal characters in XML 1.0 and were tripping up the XML encoder, but you can strip them out in a variety of places, for example by rewriting the main loop:
    for item in some_list:
        item = [ x.replace('\x0c','') for x in item ]
        # etc
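Form feed is only one of several control characters that XML 1.0 forbids, so a slightly broader filter may save a second pass over the data. The sketch below is in Python 3 and was not part of my script; `ILLEGAL_XML_CHARS` and `clean_field` are illustrative names.

```python
import re

# XML 1.0 forbids every character below 0x20 except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
ILLEGAL_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def clean_field(text):
    """Remove control characters that cannot appear in an XML 1.0 document."""
    return ILLEGAL_XML_CHARS.sub('', text)

item = ['42', 'Title\x0c', 'line one\nline two\x00']
cleaned = [clean_field(x) for x in item]
```

This version also removes NUL bytes, while tabs and line breaks pass through untouched.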
Really, the XMLGenerator just worked, even when dealing with a quite large file.
Performance
Although my script created a different mix of XML elements than the above example, it wasn't any more complex, and had fairly reasonable performance. Processing the 6.4 gig CSV file into a 6.5 gig XML file took between 19 and 24 minutes, which means it was able to read-process-write about five megabytes per second.
In terms of raw speed, that isn't particularly epic. But a similar operation with Perl's XML::Twig (actually XML to XML rather than CSV to XML) took eight minutes to process a ~100 megabyte file, so I'm pretty pleased with the quality of the Python standard library and how it handles large files.
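For anyone who wants to check the arithmetic, the figures quoted above work out roughly as follows. This sketch assumes 1 gigabyte = 1000 megabytes, and `throughput_mb_per_s` is just an illustrative helper.

```python
def throughput_mb_per_s(size_mb, minutes):
    """Average throughput in megabytes per second."""
    return size_mb / (minutes * 60.0)

python_fast = throughput_mb_per_s(6400, 19)  # quickest Python run, about 5.6 MB/s
python_slow = throughput_mb_per_s(6400, 24)  # slowest Python run, about 4.4 MB/s
perl_twig = throughput_mb_per_s(100, 8)      # the XML::Twig job, about 0.2 MB/s
```

The two jobs were not identical, so this is a ballpark comparison rather than a benchmark.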
The breadth and depth of the standard library really make Python a joy to work with for these simple one-shot scripts. If only it had Perl's easier to use regex syntax...
[1] This is a peculiar nature of data, which makes it different from media: data files can--with a large system--become infinitely large. Media files, on the other hand, can be extremely dense (a couple of gigs for a high quality movie), but conform to predictable limits.
If you are dealing with large files, you're probably dealing with a company's logs from the last decade or the entire dump of their MySQL database.
[2] I really want to like the new Python documentation. I mean, it certainly looks much better, but I think it has made it harder to actually find what I'm looking for. I think they've hit the same stumbling block as the Django documentation: the more you customize your documentation, the greater the learning curve for using your documentation.
I think the big thing is just the incompleteness of the documentation that gives me trouble. They are certain to cover all the important and frequently used components (along with helpful overviews and examples), but the new docs often don't even mention less important methods and objects.
For the time being, I am throwing around a lot more dir commands.