Recently we were building a Shiny App in which we had to load data from a very large dataframe. It was directly impacting the app initialization time, so we had to look into different ways of reading data from files to R (in our case customer provided csv files) and identify the best one.

The goal of my post is to compare:

  1. read.csv from utils, which was the standard way of reading csvfiles to R in RStudio,
  2. read_csv from readr which replaced the former method as a standard way of doing it in RStudio,
  3. load and readRDS from base, and
  4. read_feather from feather and fread from data.table.

Data

First let’s generate some random data

set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))

and save the files on a disk to evaluate the loading time. Besides thecsv format we will also need featherRDS and Rdata files.

path_csv <- '../assets/data/fast_load/df.csv'
path_feather <- '../assets/data/fast_load/df.feather'
path_rdata <- '../assets/data/fast_load/df.RData'
path_rds <- '../assets/data/fast_load/df.rds'
library(feather)
library(data.table)
write.csv(df, file = path_csv, row.names = F)
write_feather(df, path_feather)
save(df, file = path_rdata)
saveRDS(df, path_rds)

Next let’s check our files sizes:

files <- c('../assets/data/fast_load/df.csv', '../assets/data/fast_load/df.feather', '../assets/data/fast_load/df.RData', '../assets/data/fast_load/df.rds')
info <- file.info(files)
info$size_mb <- info$size/(1024 * 1024)
print(subset(info, select=c("size_mb")))
##                                       size_mb
## ../assets/data/fast_load/df.csv 1780.3005
## ../assets/data/fast_load/df.feather 1145.2881
## ../assets/data/fast_load/df.RData 285.4836
## ../assets/data/fast_load/df.rds 285.4837

As we can see both csv and feather format files are taking much more storage space. Csv more than 6 times and feather more than 4 times comparing to RDS and RData.

Benchmark

We will use microbenchmark library to compare the reading times of the following methods:

  • utils::read.csv
  • readr::read_csv
  • data.table::fread
  • base::load
  • base::readRDS
  • feather::read_feather

in 10 rounds.

library(microbenchmark)
benchmark <- microbenchmark(readCSV = utils::read.csv(path_csv),
readrCSV = readr::read_csv(path_csv, progress = F),
fread = data.table::fread(path_csv, showProgress = F),
loadRdata = base::load(path_rdata),
readRds = base::readRDS(path_rds),
readFeather = feather::read_feather(path_feather), times = 10)
print(benchmark, signif = 2)
##Unit: seconds
## expr min lq mean median uq max neval
## readCSV 200.0 200.0 211.187125 210.0 220.0 240.0 10
## readrCSV 27.0 28.0 29.770890 29.0 32.0 33.0 10
## fread 15.0 16.0 17.250016 17.0 17.0 22.0 10
## loadRdata 4.4 4.7 5.018918 4.8 5.5 5.9 10
## readRds 4.6 4.7 5.053674 5.1 5.3 5.6 10
## readFeather 1.5 1.8 2.988021 3.4 3.6 4.1 10

And the winner is… feather! However, using feather requires prior conversion of the file to the feather format.
Using load or readRDS can improve performance (second and third place in terms of speed) and has a benefit of storing smaller/compressed file. In both cases you will have to convert your file to the proper format first.

When it comes to reading from csv format fread significantly beatsread_csv and read.csv, and thus is the best option to read a csv file.

In our case we decided to go with feather file since conversion fromcsv to this format is just a one time job and we didn’t have a strict limitation on a storage space to consider usage of Rds or RDataformat.

The final workflow was:

  1. reading a csv file provided by our customer using fread,
  2. writing it to feather using write_feather, and
  3. loading a feather file on app initialization using read_feather.

First two tasks were done once and outside of a Shiny App context.

There is also quite interesting benchmark done by Hadley here on reading complete files to R. Unfortunately, if you use functions defined in that post, you will end up with an character type object, and you will have to apply string manipulations to obtain a commonly and widely used dataframe.

转自:http://blog.appsilondatascience.com/rstats/2017/04/11/fast-data-load.html

Fast data loading from files to R的更多相关文章

  1. pytorch例子学习-DATA LOADING AND PROCESSING TUTORIAL

    参考:https://pytorch.org/tutorials/beginner/data_loading_tutorial.html DATA LOADING AND PROCESSING TUT ...

  2. Redisql: the lightning fast data polyglot【翻译】 - Linvo's blog - 博客频道 - CSDN.NET

    Redisql: the lightning fast data polyglot[翻译] - Linvo's blog - 博客频道 - CSDN.NET Redisql: the lightnin ...

  3. 安装mysql时出现initialize specified but the data directory has files in in.Aborting.该如何解决

    eclipse中写入sql插入语句时,navicat中显示的出现乱码(???). 在修改eclipse工作空间编码.navicate中的数据库编码.mysql中my.ini中的配置之后还是出现乱码. ...

  4. The multi-part request contained parameter data (excluding uploaded files) that exceeded the limit for maxPostSize set on the associated connector.

    springboot 表单体积过大时报错: The multi-part request contained parameter data (excluding uploaded files) tha ...

  5. Springboot 上传报错: Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The multi-part request contained parameter data (excluding uploaded files) that exceede

    Failed to parse multipart servlet request; nested exception is java.lang.IllegalStateException: The ...

  6. MYSQL常见安装错误集:[ERROR] --initialize specified but the data directory has files in it. Abort

    1.[ERROR] --initialize specified but the data directory has files in it. Abort [错误] -初始化指定,但数据目录中有文件 ...

  7. Data Manipulation with dplyr in R

    目录 select The filter and arrange verbs arrange filter Filtering and arranging Mutate The count verb ...

  8. 启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting.

    启动MySQL5.7时报错:initialize specified but the data directory has files in it. Aborting 解决方法: vim /etc/m ...

  9. STM32 GPIO fast data transfer with DMA

    AN2548 -- 使用 STM32F101xx 和 STM32F103xx 的 DMA 控制器 DMA控制器 DMA是AMBA的先进高性能总线(AHB)上的设备,它有2个AHB端口: 一个是从端口, ...

随机推荐

  1. Java线程池ThreadPoolExecutor使用和分析(三) - 终止线程池原理

    相关文章目录: Java线程池ThreadPoolExecutor使用和分析(一) Java线程池ThreadPoolExecutor使用和分析(二) - execute()原理 Java线程池Thr ...

  2. JavaScript高级内容:原型链、继承、执行上下文、作用域链、闭包

    了解这些问题,我先一步步来看,先从基础说起,然后引出这些概念. 本文只用实例验证结果,并做简要说明,给大家增加些印象,因为单独一项拿出来都需要大篇幅讲解. 1.值类型 & 引用类型 funct ...

  3. js中计算两个日期之差

    js中计算两个日期之差            var aBgnDate, aEndDate;            var oBgnDate, oEndDate;            var nYl ...

  4. 简单的利用JS来判断页面是在手机端还是在PC端打开的方法

    在移动设备应用越来越广泛的今天,许多网站都开始做移动端的界面展示,两者屏幕尺寸差异很大,所以展示的内容也有所差别.于是就遇到一个问题,如何判断你的页面是在移动端还是在PC端打开的,很简单的问题,那我们 ...

  5. 简单聊聊Storm的流分组策略

    简单聊聊Storm的流分组策略 首先我要强调的是,Storm的分组策略对结果有着直接的影响,不同的分组的结果一定是不一样的.其次,不同的分组策略对资源的利用也是有着非常大的不同,本文主要讲一讲loca ...

  6. phpcms基础

    CSM基础(做中小型企业网站) 做一个企业站,三个页面比较重要1.首页2.列表页3.内容页 做企业站的流程:1.由美工出一张,设计效果图2.将设计图静态化3.开始安装CMS4.强模板文件放到CSM里面 ...

  7. DirectFB学习笔记四

    本篇目的,实现按钮的点击事件捕获,也就是鼠标点击,如果点击在方框范围内,则响应,在方框外,则忽略. 由于鼠标移动和点击都会产生事件,因此,我们可以在鼠标移动的时候记录坐标,在点击时比较坐标是否在方框范 ...

  8. JS模式---命令模式

    var opendoor = { execute: function () { console.log("开门"); } }; var closedoor = { execute: ...

  9. Swing学习篇 API之JButton组件

    按钮(Jbutton) Swing中的按钮是Jbutton,它是javax.swing.AbstracButton类的子类,swing中的按钮可以显示图像,并且可以将按钮设置为窗口的默认图标,而且还可 ...

  10. ASP.NET MVC 常用扩展点:过滤器、模型绑定等

    一.过滤器(Filter) ASP.NET MVC中的每一个请求,都会分配给对应Controller(以下简称“控制器”)下的特定Action(以下简称“方法”)处理,正常情况下直接在方法里写代码就可 ...