Fast data loading from files to R
Recently we were building a Shiny App in which we had to load data from a very large dataframe. It was directly impacting the app initialization time, so we had to look into different ways of reading data from files to R (in our case customer provided csv files) and identify the best one.
The goal of my post is to compare:
read.csv
fromutils
, which was the standard way of reading csvfiles to R in RStudio,read_csv
fromreadr
which replaced the former method as a standard way of doing it in RStudio,load
andreadRDS
frombase
, andread_feather
fromfeather
andfread
fromdata.table
.
Data
First let’s generate some random data
set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))
and save the files on a disk to evaluate the loading time. Besides thecsv
format we will also need feather
, RDS
and Rdata
files.
path_csv <- '../assets/data/fast_load/df.csv'
path_feather <- '../assets/data/fast_load/df.feather'
path_rdata <- '../assets/data/fast_load/df.RData'
path_rds <- '../assets/data/fast_load/df.rds'
library(feather)
library(data.table)
write.csv(df, file = path_csv, row.names = F)
write_feather(df, path_feather)
save(df, file = path_rdata)
saveRDS(df, path_rds)
Next let’s check our files sizes:
files <- c('../assets/data/fast_load/df.csv', '../assets/data/fast_load/df.feather', '../assets/data/fast_load/df.RData', '../assets/data/fast_load/df.rds')
info <- file.info(files)
info$size_mb <- info$size/(1024 * 1024)
print(subset(info, select=c("size_mb")))
## size_mb
## ../assets/data/fast_load/df.csv 1780.3005
## ../assets/data/fast_load/df.feather 1145.2881
## ../assets/data/fast_load/df.RData 285.4836
## ../assets/data/fast_load/df.rds 285.4837
As we can see both csv
and feather
format files are taking much more storage space. Csv
more than 6 times and feather
more than 4 times comparing to RDS
and RData
.
Benchmark
We will use microbenchmark
library to compare the reading times of the following methods:
- utils::read.csv
- readr::read_csv
- data.table::fread
- base::load
- base::readRDS
- feather::read_feather
in 10 rounds.
library(microbenchmark)
benchmark <- microbenchmark(readCSV = utils::read.csv(path_csv),
readrCSV = readr::read_csv(path_csv, progress = F),
fread = data.table::fread(path_csv, showProgress = F),
loadRdata = base::load(path_rdata),
readRds = base::readRDS(path_rds),
readFeather = feather::read_feather(path_feather), times = 10)
print(benchmark, signif = 2)
##Unit: seconds
## expr min lq mean median uq max neval
## readCSV 200.0 200.0 211.187125 210.0 220.0 240.0 10
## readrCSV 27.0 28.0 29.770890 29.0 32.0 33.0 10
## fread 15.0 16.0 17.250016 17.0 17.0 22.0 10
## loadRdata 4.4 4.7 5.018918 4.8 5.5 5.9 10
## readRds 4.6 4.7 5.053674 5.1 5.3 5.6 10
## readFeather 1.5 1.8 2.988021 3.4 3.6 4.1 10
And the winner is… feather
! However, using feather
requires prior conversion of the file to the feather format.
Using load
or readRDS
can improve performance (second and third place in terms of speed) and has a benefit of storing smaller/compressed file. In both cases you will have to convert your file to the proper format first.
When it comes to reading from csv
format fread
significantly beatsread_csv
and read.csv
, and thus is the best option to read a csv
file.
In our case we decided to go with feather
file since conversion fromcsv
to this format is just a one time job and we didn’t have a strict limitation on a storage space to consider usage of Rds
or RData
format.
The final workflow was:
- reading a
csv
file provided by our customer usingfread
, - writing it to
feather
usingwrite_feather
, and - loading a
feather
file on app initialization usingread_feather
.
First two tasks were done once and outside of a Shiny App context.
There is also quite interesting benchmark done by Hadley here on reading complete files to R. Unfortunately, if you use functions defined in that post, you will end up with an character type object, and you will have to apply string manipulations to obtain a commonly and widely used dataframe.
转自:http://blog.appsilondatascience.com/rstats/2017/04/11/fast-data-load.html
Fast data loading from files to R的更多相关文章
- pytorch例子学习-DATA LOADING AND PROCESSING TUTORIAL
参考:https://pytorch.org/tutorials/beginner/data_loading_tutorial.html DATA LOADING AND PROCESSING TUT ...
- Redisql: the lightning fast data polyglot【翻译】 - Linvo's blog - 博客频道 - CSDN.NET
Redisql: the lightning fast data polyglot[翻译] - Linvo's blog - 博客频道 - CSDN.NET Redisql: the lightnin ...
- 安装mysql时出现initialize specified but the data directory has files in in.Aborting.该如何解决
eclipse中写入sql插入语句时,navicat中显示的出现乱码(???). 在修改eclipse工作空间编码.navicate中的数据库编码.mysql中my.ini中的配置之后还是出现乱码. ...
- The multi-part request contained parameter data (excluding uploaded files) that exceeded the limit for maxPostSize set on the associated connector.
springboot 表单体积过大时报错: The multi-part request contained parameter data (excluding uploaded files) tha ...
- Springboot 上传报错: Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The multi-part request contained parameter data (excluding uploaded files) that exceede
Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The ...
- MYSQL常见安装错误集:[ERROR] --initialize specified but the data directory has files in it. Abort
1.[ERROR] --initialize specified but the data directory has files in it. Abort [错误] -初始化指定,但数据目录中有文件 ...
- Data Manipulation with dplyr in R
目录 select The filter and arrange verbs arrange filter Filtering and arranging Mutate The count verb ...
- 启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting.
启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting 解决方法: vim /etc/m ...
- STM32 GPIO fast data transfer with DMA
AN2548 -- 使用 STM32F101xx 和 STM32F103xx 的 DMA 控制器 DMA控制器 DMA是AMBA的先进高性能总线(AHB)上的设备,它有2个AHB端口: 一个是从端口, ...
随机推荐
- Spring+SpringMVC+MyBatis+easyUI整合优化篇(五)结合MockMvc进行服务端的单元测试
日常啰嗦 承接前一篇文章<Spring+SpringMVC+MyBatis+easyUI整合优化篇(四)单元测试实例>,已经讲解了dao层和service层的单元测试,还有控制器这层也不能 ...
- add spring-boot-modules to maven project
spring boot 项目中 多modules parent 冲突 在IDEAJ 中,如果建立多多modules 项目,pom文件应该是这样: <groupId>cn.ifengkou& ...
- jQuery / zepto ajax 全局默认设置
jQuery / zepto 的 $.ajax 方法需要配置很多选项, 有些是很常用的每个 ajax 请求都要用到的, 可以全局设置, 避免每次都写. 注意: 此处用的 jQuery 版本是 1.8. ...
- GitHub上最受欢迎的iOS开源项目TOP20
AFNetworking 在众多iOS开源项目中,AFNetworking可以称得上是最受开发者欢迎的库项目.AFNetworking是一个轻量级的iOS.Mac OS X网络通信类库,现在是GitH ...
- 浅谈聚类算法(K-means)
聚类算法(K-means)目的是将n个对象根据它们各自属性分成k个不同的簇,使得簇内各个对象的相似度尽可能高,而各簇之间的相似度尽量小. 而如何评测相似度呢,采用的准则函数是误差平方和(因此也叫K-均 ...
- nginx源码分析——http模块
源码:nginx 1.12.0 一.nginx http模块简介 由于nginx的性能优势,现在已经有越来越多的单位.个人采用nginx或者openresty. ...
- Linux-进程描述(5)之进程环境
main函数和启动例程 当内核使用一个exec函数执行C程序时,在调用main函数之前先调用一个特殊的启动例程,可执行程序将此例程指定为程序的起始地址.启动例程从内核获取命令行参数和环境变量,然后为调 ...
- PHP 分支与循环
一.概述: 上面一章我们讲解了PHP当中的运算符和表达式,通过上面的知识点我们就可以完成一些基本的运算操作了.但是涉及到一些比较复杂的逻辑,分支与循环就必不可少了.通过分支和循环的结合使用可以使业务更 ...
- 最新的css3动画按钮效果
效果演示 插件下载
- bzoj4800 [Ceoi2015]Ice Hockey World Championship
Description 有n个物品,m块钱,给定每个物品的价格,求买物品的方案数. Input 第一行两个数n,m代表物品数量及钱数 第二行n个数,代表每个物品的价格 n<=40,m<=1 ...