Fast data loading from files to R
Recently we were building a Shiny App in which we had to load data from a very large dataframe. It was directly impacting the app initialization time, so we had to look into different ways of reading data from files to R (in our case customer provided csv files) and identify the best one.
The goal of my post is to compare:
read.csv
fromutils
, which was the standard way of reading csvfiles to R in RStudio,read_csv
fromreadr
which replaced the former method as a standard way of doing it in RStudio,load
andreadRDS
frombase
, andread_feather
fromfeather
andfread
fromdata.table
.
Data
First let’s generate some random data
set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))
and save the files on a disk to evaluate the loading time. Besides thecsv
format we will also need feather
, RDS
and Rdata
files.
path_csv <- '../assets/data/fast_load/df.csv'
path_feather <- '../assets/data/fast_load/df.feather'
path_rdata <- '../assets/data/fast_load/df.RData'
path_rds <- '../assets/data/fast_load/df.rds'
library(feather)
library(data.table)
write.csv(df, file = path_csv, row.names = F)
write_feather(df, path_feather)
save(df, file = path_rdata)
saveRDS(df, path_rds)
Next let’s check our files sizes:
files <- c('../assets/data/fast_load/df.csv', '../assets/data/fast_load/df.feather', '../assets/data/fast_load/df.RData', '../assets/data/fast_load/df.rds')
info <- file.info(files)
info$size_mb <- info$size/(1024 * 1024)
print(subset(info, select=c("size_mb")))
## size_mb
## ../assets/data/fast_load/df.csv 1780.3005
## ../assets/data/fast_load/df.feather 1145.2881
## ../assets/data/fast_load/df.RData 285.4836
## ../assets/data/fast_load/df.rds 285.4837
As we can see both csv
and feather
format files are taking much more storage space. Csv
more than 6 times and feather
more than 4 times comparing to RDS
and RData
.
Benchmark
We will use microbenchmark
library to compare the reading times of the following methods:
- utils::read.csv
- readr::read_csv
- data.table::fread
- base::load
- base::readRDS
- feather::read_feather
in 10 rounds.
library(microbenchmark)
benchmark <- microbenchmark(readCSV = utils::read.csv(path_csv),
readrCSV = readr::read_csv(path_csv, progress = F),
fread = data.table::fread(path_csv, showProgress = F),
loadRdata = base::load(path_rdata),
readRds = base::readRDS(path_rds),
readFeather = feather::read_feather(path_feather), times = 10)
print(benchmark, signif = 2)
##Unit: seconds
## expr min lq mean median uq max neval
## readCSV 200.0 200.0 211.187125 210.0 220.0 240.0 10
## readrCSV 27.0 28.0 29.770890 29.0 32.0 33.0 10
## fread 15.0 16.0 17.250016 17.0 17.0 22.0 10
## loadRdata 4.4 4.7 5.018918 4.8 5.5 5.9 10
## readRds 4.6 4.7 5.053674 5.1 5.3 5.6 10
## readFeather 1.5 1.8 2.988021 3.4 3.6 4.1 10
And the winner is… feather
! However, using feather
requires prior conversion of the file to the feather format.
Using load
or readRDS
can improve performance (second and third place in terms of speed) and has a benefit of storing smaller/compressed file. In both cases you will have to convert your file to the proper format first.
When it comes to reading from csv
format fread
significantly beatsread_csv
and read.csv
, and thus is the best option to read a csv
file.
In our case we decided to go with feather
file since conversion fromcsv
to this format is just a one time job and we didn’t have a strict limitation on a storage space to consider usage of Rds
or RData
format.
The final workflow was:
- reading a
csv
file provided by our customer usingfread
, - writing it to
feather
usingwrite_feather
, and - loading a
feather
file on app initialization usingread_feather
.
First two tasks were done once and outside of a Shiny App context.
There is also quite interesting benchmark done by Hadley here on reading complete files to R. Unfortunately, if you use functions defined in that post, you will end up with an character type object, and you will have to apply string manipulations to obtain a commonly and widely used dataframe.
转自:http://blog.appsilondatascience.com/rstats/2017/04/11/fast-data-load.html
Fast data loading from files to R的更多相关文章
- pytorch例子学习-DATA LOADING AND PROCESSING TUTORIAL
参考:https://pytorch.org/tutorials/beginner/data_loading_tutorial.html DATA LOADING AND PROCESSING TUT ...
- Redisql: the lightning fast data polyglot【翻译】 - Linvo's blog - 博客频道 - CSDN.NET
Redisql: the lightning fast data polyglot[翻译] - Linvo's blog - 博客频道 - CSDN.NET Redisql: the lightnin ...
- 安装mysql时出现initialize specified but the data directory has files in in.Aborting.该如何解决
eclipse中写入sql插入语句时,navicat中显示的出现乱码(???). 在修改eclipse工作空间编码.navicate中的数据库编码.mysql中my.ini中的配置之后还是出现乱码. ...
- The multi-part request contained parameter data (excluding uploaded files) that exceeded the limit for maxPostSize set on the associated connector.
springboot 表单体积过大时报错: The multi-part request contained parameter data (excluding uploaded files) tha ...
- Springboot 上传报错: Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The multi-part request contained parameter data (excluding uploaded files) that exceede
Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The ...
- MYSQL常见安装错误集:[ERROR] --initialize specified but the data directory has files in it. Abort
1.[ERROR] --initialize specified but the data directory has files in it. Abort [错误] -初始化指定,但数据目录中有文件 ...
- Data Manipulation with dplyr in R
目录 select The filter and arrange verbs arrange filter Filtering and arranging Mutate The count verb ...
- 启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting.
启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting 解决方法: vim /etc/m ...
- STM32 GPIO fast data transfer with DMA
AN2548 -- 使用 STM32F101xx 和 STM32F103xx 的 DMA 控制器 DMA控制器 DMA是AMBA的先进高性能总线(AHB)上的设备,它有2个AHB端口: 一个是从端口, ...
随机推荐
- 前端jquery validate表单验证框架的使用
一.框架本身校验方法的扩展 建议写在页内用于扩展框架本身的一些校验方法, 使用频繁也可以直接在源码上修改 例如扩展手机号码的校验: /*手机号码验证扩展 最新的号码 mobile: class的表示 ...
- Node.js基本开发流程
创建一个hello world: 1.打开一个文本编辑器,在其中输入console.log("hello world"),并保存为hello.js; 注意:输入中文如果编码不是ut ...
- (转)Python 遍历List三种方式
转自: http://www.cnblogs.com/pizitai/archive/2017/02/14/6398276.html # 方法1 print '遍历列表方法1:' for i in l ...
- Unity属性的封装、继承、方法隐藏
(一)Unity属性封装.继承.方法隐藏的学习和总结 一.属性的封装 1.属性封装的定义:通过对属性的读和写来保护类中的域. 2.格式例子: private string departname; // ...
- MVC框架中,遇到 [程序集清单定义与程序集引用不匹配]怎么办?
项目里有一个WinForm程序,它需要使用一套第三方控件.而我的机器上存有这套控件的两种版本(一个是源码版,一个是演示版).结果经常出现“程序集清单定义与程序集引用不匹配的问题”的异常.最要命的是有时 ...
- Python 操作 MySQL 的正确姿势
欢迎大家关注腾讯云技术社区-博客园官方主页,我们将持续在博客园为大家推荐技术精品文章哦~ 作者:邵建永 使用Python进行MySQL的库主要有三个,Python-MySQL(更熟悉的名字可能是MyS ...
- selenium + python 登录页面,输入账号、密码,元素定位问题
示例简介: 要求:登录QQ邮箱,输入账号.密码 出现问题:页面中含有iframe框架,因此直接进行元素的查找与操作,出现找不到元素的现象,首先需进行iframe框架的转换,使用switch_to_fr ...
- javascript原生方法实现extend
var extend = (function () { for(var p in {toString:null}){ //检查当前浏览器是否支持forin循环去遍历出一个不可枚举的属性,比如toStr ...
- Android打开其它应用程序
PackageManager pm = getPackageManager(); Intent i = pm.getLaunchIntentForPackage(packageName); start ...
- hdu1520 Anniversary party 简单树形DP
题目链接:http://acm.hdu.edu.cn/showproblem.php?pid=1520 思路:树形DP的入门题 定义dp[root][1]表示以root为根节点的子树,且root本身参 ...