Fast data loading from files to R
Recently we were building a Shiny App in which we had to load data from a very large dataframe. It was directly impacting the app initialization time, so we had to look into different ways of reading data from files to R (in our case customer provided csv files) and identify the best one.
The goal of my post is to compare:
read.csvfromutils, which was the standard way of reading csvfiles to R in RStudio,read_csvfromreadrwhich replaced the former method as a standard way of doing it in RStudio,loadandreadRDSfrombase, andread_featherfromfeatherandfreadfromdata.table.
Data
First let’s generate some random data
set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))
and save the files on a disk to evaluate the loading time. Besides thecsv format we will also need feather, RDS and Rdata files.
path_csv <- '../assets/data/fast_load/df.csv'
path_feather <- '../assets/data/fast_load/df.feather'
path_rdata <- '../assets/data/fast_load/df.RData'
path_rds <- '../assets/data/fast_load/df.rds'
library(feather)
library(data.table)
write.csv(df, file = path_csv, row.names = F)
write_feather(df, path_feather)
save(df, file = path_rdata)
saveRDS(df, path_rds)
Next let’s check our files sizes:
files <- c('../assets/data/fast_load/df.csv', '../assets/data/fast_load/df.feather', '../assets/data/fast_load/df.RData', '../assets/data/fast_load/df.rds')
info <- file.info(files)
info$size_mb <- info$size/(1024 * 1024)
print(subset(info, select=c("size_mb")))
## size_mb
## ../assets/data/fast_load/df.csv 1780.3005
## ../assets/data/fast_load/df.feather 1145.2881
## ../assets/data/fast_load/df.RData 285.4836
## ../assets/data/fast_load/df.rds 285.4837
As we can see both csv and feather format files are taking much more storage space. Csv more than 6 times and feather more than 4 times comparing to RDS and RData.
Benchmark
We will use microbenchmark library to compare the reading times of the following methods:
- utils::read.csv
- readr::read_csv
- data.table::fread
- base::load
- base::readRDS
- feather::read_feather
in 10 rounds.
library(microbenchmark)
benchmark <- microbenchmark(readCSV = utils::read.csv(path_csv),
readrCSV = readr::read_csv(path_csv, progress = F),
fread = data.table::fread(path_csv, showProgress = F),
loadRdata = base::load(path_rdata),
readRds = base::readRDS(path_rds),
readFeather = feather::read_feather(path_feather), times = 10)
print(benchmark, signif = 2)
##Unit: seconds
## expr min lq mean median uq max neval
## readCSV 200.0 200.0 211.187125 210.0 220.0 240.0 10
## readrCSV 27.0 28.0 29.770890 29.0 32.0 33.0 10
## fread 15.0 16.0 17.250016 17.0 17.0 22.0 10
## loadRdata 4.4 4.7 5.018918 4.8 5.5 5.9 10
## readRds 4.6 4.7 5.053674 5.1 5.3 5.6 10
## readFeather 1.5 1.8 2.988021 3.4 3.6 4.1 10
And the winner is… feather! However, using feather requires prior conversion of the file to the feather format.
Using load or readRDS can improve performance (second and third place in terms of speed) and has a benefit of storing smaller/compressed file. In both cases you will have to convert your file to the proper format first.
When it comes to reading from csv format fread significantly beatsread_csv and read.csv, and thus is the best option to read a csv file.
In our case we decided to go with feather file since conversion fromcsv to this format is just a one time job and we didn’t have a strict limitation on a storage space to consider usage of Rds or RDataformat.
The final workflow was:
- reading a
csvfile provided by our customer usingfread, - writing it to
featherusingwrite_feather, and - loading a
featherfile on app initialization usingread_feather.
First two tasks were done once and outside of a Shiny App context.
There is also quite interesting benchmark done by Hadley here on reading complete files to R. Unfortunately, if you use functions defined in that post, you will end up with an character type object, and you will have to apply string manipulations to obtain a commonly and widely used dataframe.
转自:http://blog.appsilondatascience.com/rstats/2017/04/11/fast-data-load.html
Fast data loading from files to R的更多相关文章
- pytorch例子学习-DATA LOADING AND PROCESSING TUTORIAL
参考:https://pytorch.org/tutorials/beginner/data_loading_tutorial.html DATA LOADING AND PROCESSING TUT ...
- Redisql: the lightning fast data polyglot【翻译】 - Linvo's blog - 博客频道 - CSDN.NET
Redisql: the lightning fast data polyglot[翻译] - Linvo's blog - 博客频道 - CSDN.NET Redisql: the lightnin ...
- 安装mysql时出现initialize specified but the data directory has files in in.Aborting.该如何解决
eclipse中写入sql插入语句时,navicat中显示的出现乱码(???). 在修改eclipse工作空间编码.navicate中的数据库编码.mysql中my.ini中的配置之后还是出现乱码. ...
- The multi-part request contained parameter data (excluding uploaded files) that exceeded the limit for maxPostSize set on the associated connector.
springboot 表单体积过大时报错: The multi-part request contained parameter data (excluding uploaded files) tha ...
- Springboot 上传报错: Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The multi-part request contained parameter data (excluding uploaded files) that exceede
Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The ...
- MYSQL常见安装错误集:[ERROR] --initialize specified but the data directory has files in it. Abort
1.[ERROR] --initialize specified but the data directory has files in it. Abort [错误] -初始化指定,但数据目录中有文件 ...
- Data Manipulation with dplyr in R
目录 select The filter and arrange verbs arrange filter Filtering and arranging Mutate The count verb ...
- 启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting.
启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting 解决方法: vim /etc/m ...
- STM32 GPIO fast data transfer with DMA
AN2548 -- 使用 STM32F101xx 和 STM32F103xx 的 DMA 控制器 DMA控制器 DMA是AMBA的先进高性能总线(AHB)上的设备,它有2个AHB端口: 一个是从端口, ...
随机推荐
- 【js数据结构】可逐次添加叶子的二叉树(非最优二叉树)
最近小菜鸟西瓜莹看到了一道面试题: 给定二叉树,按层打印.例如1的子节点是2.3, 2的子节点是3.4, 5的子节点是6,7. 需要建立如图二叉树: 但是西瓜莹找到的相关代码都是用js构建最优二叉树, ...
- memcached+tomcat转发forward时 sessionid一直变化的问题
今天遇到了一个很奇怪的问题, 我在tomcat过滤器 中, 对请求过来的静态资源及html页面做了forword转发操作,核心代码如下: private void redirectMobile(Htt ...
- javaScript 中String的常用方法
1.length() 字符串的长度 例:char chars[]={'a','b'.'c'}; String s=new String(chars); int len=s.length(); 2.ch ...
- 【iOS系列】-多图片多线程异步下载
多图片多线程异步下载 开发中非常常用的就是就是图片下载,我们常用的就是SDWebImage,但是作为开发人员,不仅要能会用,还要知道其原理.本文就会介绍多图下载的实现. 本文中的示例Demno地址,下 ...
- 简单c语言子集词法分析器
概述 词法分析是编译的第一个环节,其输入是高级语言程序,输出是单词串.词法分析器的主要任务是将高级语言程序作为字符串输入,然后依据词法规则将字符串组合成单词,并输出单词串. 为了方便之后的编译环节,通 ...
- 第二章 Struts 2的应用
2.1 Struts 2的应用 2.1.1 使用步骤 1.创建web项目,添加jar包,创建helloWorld.jsp页面 2.创建HelloWorldAction ...
- phpcms基础
CSM基础(做中小型企业网站) 做一个企业站,三个页面比较重要1.首页2.列表页3.内容页 做企业站的流程:1.由美工出一张,设计效果图2.将设计图静态化3.开始安装CMS4.强模板文件放到CSM里面 ...
- 系统启动 之 Linux系统启动概述(2)
博客:http://blog.csdn.net/younger_china/article/details/51615916 Linu系统启动是一个"冗长乏味"的过程,那么我们现就 ...
- 系统启动 之 Linux系统启动概述(1)
随着智能终端功能的越来越庞大,与之,硬件配置越来越高,开机时间却越来越长.人们在享受强大功能的同时,对冗长的智能终端的开机时间却越来越缺乏耐心. 为了"取悦"用户,需要提供较好的用 ...
- 【算法系列学习】[kuangbin带你飞]专题二 搜索进阶 D - Escape (BFS)
Escape 参考:http://blog.csdn.net/libin56842/article/details/41909459 [题意]: 一个人从(0,0)跑到(n,m),只有k点能量,一秒消 ...