Fast data loading from files to R
Recently we were building a Shiny App in which we had to load data from a very large dataframe. It was directly impacting the app initialization time, so we had to look into different ways of reading data from files to R (in our case customer provided csv files) and identify the best one.
The goal of my post is to compare:
read.csvfromutils, which was the standard way of reading csvfiles to R in RStudio,read_csvfromreadrwhich replaced the former method as a standard way of doing it in RStudio,loadandreadRDSfrombase, andread_featherfromfeatherandfreadfromdata.table.
Data
First let’s generate some random data
set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))
and save the files on a disk to evaluate the loading time. Besides thecsv format we will also need feather, RDS and Rdata files.
path_csv <- '../assets/data/fast_load/df.csv'
path_feather <- '../assets/data/fast_load/df.feather'
path_rdata <- '../assets/data/fast_load/df.RData'
path_rds <- '../assets/data/fast_load/df.rds'
library(feather)
library(data.table)
write.csv(df, file = path_csv, row.names = F)
write_feather(df, path_feather)
save(df, file = path_rdata)
saveRDS(df, path_rds)
Next let’s check our files sizes:
files <- c('../assets/data/fast_load/df.csv', '../assets/data/fast_load/df.feather', '../assets/data/fast_load/df.RData', '../assets/data/fast_load/df.rds')
info <- file.info(files)
info$size_mb <- info$size/(1024 * 1024)
print(subset(info, select=c("size_mb")))
## size_mb
## ../assets/data/fast_load/df.csv 1780.3005
## ../assets/data/fast_load/df.feather 1145.2881
## ../assets/data/fast_load/df.RData 285.4836
## ../assets/data/fast_load/df.rds 285.4837
As we can see both csv and feather format files are taking much more storage space. Csv more than 6 times and feather more than 4 times comparing to RDS and RData.
Benchmark
We will use microbenchmark library to compare the reading times of the following methods:
- utils::read.csv
- readr::read_csv
- data.table::fread
- base::load
- base::readRDS
- feather::read_feather
in 10 rounds.
library(microbenchmark)
benchmark <- microbenchmark(readCSV = utils::read.csv(path_csv),
readrCSV = readr::read_csv(path_csv, progress = F),
fread = data.table::fread(path_csv, showProgress = F),
loadRdata = base::load(path_rdata),
readRds = base::readRDS(path_rds),
readFeather = feather::read_feather(path_feather), times = 10)
print(benchmark, signif = 2)
##Unit: seconds
## expr min lq mean median uq max neval
## readCSV 200.0 200.0 211.187125 210.0 220.0 240.0 10
## readrCSV 27.0 28.0 29.770890 29.0 32.0 33.0 10
## fread 15.0 16.0 17.250016 17.0 17.0 22.0 10
## loadRdata 4.4 4.7 5.018918 4.8 5.5 5.9 10
## readRds 4.6 4.7 5.053674 5.1 5.3 5.6 10
## readFeather 1.5 1.8 2.988021 3.4 3.6 4.1 10
And the winner is… feather! However, using feather requires prior conversion of the file to the feather format.
Using load or readRDS can improve performance (second and third place in terms of speed) and has a benefit of storing smaller/compressed file. In both cases you will have to convert your file to the proper format first.
When it comes to reading from csv format fread significantly beatsread_csv and read.csv, and thus is the best option to read a csv file.
In our case we decided to go with feather file since conversion fromcsv to this format is just a one time job and we didn’t have a strict limitation on a storage space to consider usage of Rds or RDataformat.
The final workflow was:
- reading a
csvfile provided by our customer usingfread, - writing it to
featherusingwrite_feather, and - loading a
featherfile on app initialization usingread_feather.
First two tasks were done once and outside of a Shiny App context.
There is also quite interesting benchmark done by Hadley here on reading complete files to R. Unfortunately, if you use functions defined in that post, you will end up with an character type object, and you will have to apply string manipulations to obtain a commonly and widely used dataframe.
转自:http://blog.appsilondatascience.com/rstats/2017/04/11/fast-data-load.html
Fast data loading from files to R的更多相关文章
- pytorch例子学习-DATA LOADING AND PROCESSING TUTORIAL
参考:https://pytorch.org/tutorials/beginner/data_loading_tutorial.html DATA LOADING AND PROCESSING TUT ...
- Redisql: the lightning fast data polyglot【翻译】 - Linvo's blog - 博客频道 - CSDN.NET
Redisql: the lightning fast data polyglot[翻译] - Linvo's blog - 博客频道 - CSDN.NET Redisql: the lightnin ...
- 安装mysql时出现initialize specified but the data directory has files in in.Aborting.该如何解决
eclipse中写入sql插入语句时,navicat中显示的出现乱码(???). 在修改eclipse工作空间编码.navicate中的数据库编码.mysql中my.ini中的配置之后还是出现乱码. ...
- The multi-part request contained parameter data (excluding uploaded files) that exceeded the limit for maxPostSize set on the associated connector.
springboot 表单体积过大时报错: The multi-part request contained parameter data (excluding uploaded files) tha ...
- Springboot 上传报错: Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The multi-part request contained parameter data (excluding uploaded files) that exceede
Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The ...
- MYSQL常见安装错误集:[ERROR] --initialize specified but the data directory has files in it. Abort
1.[ERROR] --initialize specified but the data directory has files in it. Abort [错误] -初始化指定,但数据目录中有文件 ...
- Data Manipulation with dplyr in R
目录 select The filter and arrange verbs arrange filter Filtering and arranging Mutate The count verb ...
- 启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting.
启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting 解决方法: vim /etc/m ...
- STM32 GPIO fast data transfer with DMA
AN2548 -- 使用 STM32F101xx 和 STM32F103xx 的 DMA 控制器 DMA控制器 DMA是AMBA的先进高性能总线(AHB)上的设备,它有2个AHB端口: 一个是从端口, ...
随机推荐
- C语言枚举类型(Enum)深入理解
在实际编程中,有些数据的取值往往是有限的,只能是非常少量的整数,并且最好为每个值都取一个名字,以方便在后续代码中使用,比如一个星期只有七天,一年只有十二个月,一个班每周有六门课程等. 以每周七天为例, ...
- 1147: 零起点学算法54——Fibonacc
1147: 零起点学算法54--Fibonacc Time Limit: 1 Sec Memory Limit: 64 MB 64bit IO Format: %lldSubmitted: 20 ...
- 1137: 零起点学算法44——多组测试数据输出II
1137: 零起点学算法44--多组测试数据输出II Time Limit: 1 Sec Memory Limit: 64 MB 64bit IO Format: %lldSubmitted: ...
- Use “error_messages” in Rails 3.2? (raises “undefined method” error)
I am getting the following error in my Rails 3.2 functional tests: ActionView::Template::Error: unde ...
- css高级选择器
并集选择器 p,h1{} 交集选择器 p.first{} 后代选择器:嵌套标签 h1 span{} 子元素选择器 h1>span{} 属性选择器 input[type="passwor ...
- 从零开始用 Flask 搭建一个网站(三)
从零开始用 Flask 搭建一个网站(二) 介绍了有关于数据库的运用,接下来我们在完善一下数据在前端以及前端到后端之间的交互.本节涉及到前端,因此也会讲解一下 jinja2 模板.jQuery.aja ...
- IOS 私有变量 私有属性的书写方法
一.早期只能定义在.h文件中.用@private 关键字来定义私有变量. @interface ViewController{ @private Bool _isBool; } @end 二.允许在. ...
- Java中如何动态创建接口的实现
有很多应用场景,用到了接口动态实现,下面举几个典型的应用: 1.mybatis / jpa 等orm框架,可以在接口上加注解进行开发,不需要编写实现类,运行时动态产生实现. 2.dubbo等分布式服务 ...
- java多线程基本概述(二十)——中断
线程中断我们已经直到可以使用 interrupt() 方法,但是你必须要持有 Thread 对象,但是新的并发库中似乎在避免直接对 Thread 对象的直接操作,尽量使用 Executor 来执行所有 ...
- 将ImageList中的图片转化成byte数组
Image imgwd = this.imageList1.Images["wd.png"]; var bytes = ImageToBytes(imgwd); public by ...