Every now and again someone comes along and writes an R package that I consider to be a ‘game changer’ for the language and its application to Data Science. For example, I consider dplyr one such package, as it has made data munging/manipulation that much more intuitive and productive than it had been before. Although I only first read about it at the beginning of this week, my instinct tells me that in Henrik Bengtsson’s future package we might have another such game-changing R package.

The future package provides an API for futures (or promises) in R. To quote Wikipedia, a future or promise is,

… a proxy for a result that is initially unknown, usually because the computation of its value is yet incomplete.

A classic example would be a request made to a web server via HTTP that has yet to return, and whose value remains unknown until it does (and which has promised to return at some point in the future). This ‘promise’ is an object assigned to a variable in R like any other, and allows code execution to progress until the moment the code explicitly requires the future to be resolved (i.e. to ‘make good’ on its promise). So the code does not need to wait for the web server until the very moment that the information anticipated in its response is actually needed. In the intervening execution time we can send requests to other web servers, run some other computations, etc. Ultimately, this leads to faster and more efficient code. This way of working also opens the door to distributed (i.e. parallel) computation, as the computation assigned to each new future can be executed in a separate R process (on a different core of the same machine, or on another machine/node).
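As a minimal sketch of this idea, the snippet below hands a slow computation to a future, carries on with other work, and only blocks when the result is needed. Note that slow_square() is a hypothetical stand-in for a slow web request (it just sleeps), and setting the plan globally with plan(multisession, workers = 2) is my choice for the illustration rather than anything required.

```r
library(future)

# Assumption: run futures in two background R sessions on this machine.
plan(multisession, workers = 2)

# Hypothetical stand-in for a slow request: sleeps, then returns a value.
slow_square <- function(x) {
  Sys.sleep(1)
  x^2
}

f <- future(slow_square(4))   # starts in the background; does not wait for the result

other_work <- sum(1:10)       # the main session is free to do something else

result <- value(f)            # blocks only here, until the future is resolved
print(c(other_work = other_work, result = result))

plan(sequential)              # shut the background workers down again
```

The only point at which the main session waits is the call to value().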

The future API is extremely expressive and the associated documentation is excellent. My motivation here is not to repeat any of this, but rather to give a few examples to serve as inspiration for how futures could be used for day-to-day Data Science tasks in R.

Creating a Future to be Executed on a Different Core to that Running the Main Script

To demonstrate the syntax and structure required to achieve this aim, I am going to delegate to a future the task of estimating the mean of 10 million random samples from the normal distribution, and ask it to spawn a new R process on a different core in order to do so. The code to achieve this is as follows,

library(future)

f <- future({
  samples <- rnorm(10000000)
  mean(samples)
}) %plan% multiprocess

w <- value(f)
w
# [1] 3.046653e-05

future({...}) assigns the code (actually a construct known as a closure) to be computed asynchronously from the main script. The code starts executing the moment this initial assignment is made;
%plan% multiprocess sets the future’s execution plan so that it runs in a separate R process on a different core. Note that more recent releases of future have deprecated multiprocess in favour of multisession and multicore; and,
value asks for the return value of the future. This blocks further code execution until the future has been resolved.

The above example can easily be turned into a function that outputs dots (...) to the console until the future is resolved, and then returns its value,

f_dots <- function() {
  f <- future({
    s <- rnorm(10000000)
    mean(s)
  }) %plan% multiprocess

  while (!resolved(f)) {
    cat("...")
  }
  cat("\n")

  value(f)
}

f_dots()
# ............
# [1] -0.0001872372

Here, resolved(f) will return FALSE until the future f has finished executing.
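A loop like the one above checks the future as fast as the main session can spin, which keeps a core busy doing nothing useful. A gentler variant, sketched below under the assumption that a short pause between checks is acceptable, sleeps for a fraction of a second on each pass. The 0.2 second interval, the sleeping placeholder task and the globally set plan are all illustrative choices of mine.

```r
library(future)

# Assumption: set the plan once, globally, instead of using %plan% per future.
plan(multisession, workers = 2)

f <- future({
  Sys.sleep(1)   # placeholder for a long-running computation
  42
})

polls <- 0
while (!resolved(f)) {
  cat(".")
  polls <- polls + 1
  Sys.sleep(0.2)   # pause between checks instead of spinning
}
cat("\n")

answer <- value(f)   # returns straight away, as the future is already resolved
print(answer)

plan(sequential)     # shut the background workers down again
```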

Useful Use Cases

I can recall many situations where futures would have been handy when writing R scripts. The examples below are the most obvious that come to mind. No doubt there will be many more.

Distributed (Parallel) Computation

In the past, when I’ve felt the need to distribute a calculation I have usually used the mclapply function (i.e. multi-core lapply) from the parallel package that comes bundled with base R. Computing the mean of 100 million random samples from the normal distribution would look something like,

library(parallel)

sub_means <- mclapply(
  X = 1:4,
  FUN = function(x) { samples <- rnorm(25000000); mean(samples) },
  mc.cores = 4)

# combine the four sub-sample means into one overall mean
final_mean <- mean(unlist(sub_means))
final_mean
# [1] -0.0002100956

Perhaps more importantly, the script will be ‘blocked’ until sub_means has finished executing. We can achieve the same end-result, but without blocking, using futures,

single_thread_mean <- function() {
  samples <- rnorm(25000000)
  mean(samples)
}

multi_thread_mean <- function() {
  f1 <- future({ single_thread_mean() }) %plan% multiprocess
  f2 <- future({ single_thread_mean() }) %plan% multiprocess
  f3 <- future({ single_thread_mean() }) %plan% multiprocess
  f4 <- future({ single_thread_mean() }) %plan% multiprocess

  # wrap the values in c() so that all four sub-means are averaged
  mean(c(value(f1), value(f2), value(f3), value(f4)))
}

multi_thread_mean()
# [1] -4.581293e-05
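One thing to watch when combining the values of several futures: mean() averages only its first argument, and any further positional arguments are matched to trim and na.rm rather than treated as data. The toy numbers below are purely illustrative stand-ins for four sub-sample means, and show why the values need to be wrapped in c() (or collected in a list and unlisted) first.

```r
sub_means <- c(1, 2, 3, 4)   # illustrative stand-ins for four sub-sample means

wrong <- mean(1, 2, 3, 4)    # only the first argument is treated as data
right <- mean(sub_means)     # all four values are averaged

print(wrong)
print(right)
```

Because the four sub-samples are the same size, the mean of the sub-means is equal to the mean over all of the samples.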

We can compare computation time between the single and multi-threaded versions of the mean computation (using the microbenchmark package),

library(microbenchmark)

microbenchmark({ samples <- rnorm(100000000); mean(samples) },
               multi_thread_mean(),
               times = 10)
# Unit: seconds
#                  expr      min       lq     mean   median       uq      max neval
#  single_thread(1e+08) 7.671721 7.729608 7.886563 7.765452 7.957930 8.406778    10
#   multi_thread(1e+08) 2.046663 2.069641 2.139476 2.111769 2.206319 2.344448    10

We can see that the multi-threaded version is roughly 3.7 times faster (comparing median timings), which is not surprising given that the work is spread across four processes rather than one. Note that time is lost spawning the extra processes and combining their results (usually referred to as ‘overhead’), such that distributing a calculation can actually increase computation time if the benefit of parallelisation is less than the cost of the overhead.
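To put a number on this, the short calculation below takes the median timings reported in the benchmark output above and works out the speed-up and the parallel efficiency (speed-up divided by the number of workers). Treating the four futures as four workers is my reading of the setup.

```r
median_single <- 7.765452   # seconds, from the benchmark output above
median_multi  <- 2.111769   # seconds, from the benchmark output above
workers       <- 4          # assumption: one worker per future

speedup    <- median_single / median_multi
efficiency <- speedup / workers

print(round(speedup, 2))
print(round(efficiency, 2))
```

An efficiency below 1 is the overhead showing up in the numbers: four workers deliver somewhat less than a four-fold speed-up.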

Non-Blocking Asynchronous Input/Output

I have often found myself in the situation where I need to read several large CSV files, each of which can take a long time to load. Because the files can only be loaded sequentially, I have had to wait for one file to be read before the next one can start loading, which compounds the time devoted to input. Thanks to futures, we can now achieve asynchronous input and output as follows,

library(readr)

df1 <- future({ read_csv("data/csv1.csv") }) %plan% multiprocess
df2 <- future({ read_csv("data/csv2.csv") }) %plan% multiprocess
df3 <- future({ read_csv("data/csv3.csv") }) %plan% multiprocess
df4 <- future({ read_csv("data/csv4.csv") }) %plan% multiprocess

df <- rbind(value(df1), value(df2), value(df3), value(df4))

Running microbenchmark on the above code illustrates the speed-up (each file is ~50MB in size),

# Unit: seconds
#          expr      min       lq     mean   median       uq      max neval
#   synchronous 7.880043 8.220015 8.502294 8.446078 8.604284 9.447176    10
#  asynchronous 4.203271 4.256449 4.494366 4.388478 4.490442 5.748833    10
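When the number of files is not fixed, the same idea can be written as a loop over file paths. The sketch below is self-contained, so it first writes three small temporary CSV files as stand-ins for the real data; it also uses base R's read.csv() rather than readr, and sets the plan once with plan(multisession), all of which are assumptions made for the illustration.

```r
library(future)

# Assumption: two background R sessions, set once for all futures below.
plan(multisession, workers = 2)

# Stand-in data: three small CSV files in a temporary directory.
paths <- file.path(tempdir(), sprintf("csv%d.csv", 1:3))
for (i in seq_along(paths)) {
  write.csv(data.frame(file = i, x = 1:5), paths[i], row.names = FALSE)
}

# Launch one future per file; a call only waits if all workers are busy.
csv_futures <- lapply(paths, function(p) future(read.csv(p)))

# Collect the results (blocking until each read has finished) and stack them.
df <- do.call(rbind, lapply(csv_futures, value))
print(dim(df))

plan(sequential)   # shut the background workers down again
```

Using do.call(rbind, ...) on a list of results avoids having to name each future individually.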

The same pattern can be applied to making HTTP requests asynchronously. In the following example I make an asynchronous HTTP GET request to the OpenCPU public API, to retrieve the Boston housing dataset via JSON. While I’m waiting for the future to resolve the response I keep making more asynchronous requests, but this time to http://time.jsontest.com to get the current time. Once the original future has resolved, I block output until all remaining futures have been resolved.

library(httr)
library(jsonlite)

time_futures <- list()

data_future <- future({
  response <- GET("http://public.opencpu.org/ocpu/library/MASS/data/Boston/json")
  fromJSON(content(response, as = "text"))
}) %plan% multiprocess

while (!resolved(data_future)) {
  time_futures <- append(time_futures, future({ GET("http://time.jsontest.com") }) %plan% multiprocess)
}

values(time_futures)

# [[1]]
# Response [http://time.jsontest.com/]
#   Date: 2016-11-02 01:31
#   Status: 200
#   Content-Type: application/json; charset=ISO-8859-1
#   Size: 100 B
# {
#   "time": "01:31:19 AM",
#   "milliseconds_since_epoch": 1478050279145,
#   "date": "11-02-2016"
# }

head(value(data_future))
#     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
# 1 0.0063 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
# 2 0.0273  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
# 3 0.0273  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
# 4 0.0324  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
# 5 0.0690  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
# 6 0.0298  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

The same logic applies to accessing databases and executing SQL queries via ODBC or JDBC. For example, large complex queries can be split into ‘chunks’ and sent asynchronously to the database server in order to have them executed on multiple server threads. The output can then be unified in R (e.g. with dplyr) once the server has sent back the chunks. This is a strategy that I have been using with Apache Spark, but I could now implement it within R. Similarly, multiple database tables can be accessed concurrently, and so on.
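The chunking idea can be sketched without a real database. In the example below, run_chunk() is a hypothetical stand-in for sending one piece of a query to the server (here it simply filters and aggregates an in-memory data frame), and the table, column names and chunk boundaries are all made up for the illustration. With a real database each future would most likely need to open its own connection inside the future, since connection objects generally cannot be shared across R processes.

```r
library(future)

# Assumption: two background R sessions, set once for all futures below.
plan(multisession, workers = 2)

# Stand-in for a database table.
orders <- data.frame(id = 1:100, amount = rep(c(10, 20), 50))

# Hypothetical stand-in for one chunk of a larger query:
# the total amount for ids between lo and hi (inclusive).
run_chunk <- function(tbl, lo, hi) {
  rows <- tbl[tbl$id >= lo & tbl$id <= hi, ]
  data.frame(lo = lo, hi = hi, total = sum(rows$amount))
}

# Split the id range into four chunks and send each one off as a future.
lower <- c(1, 26, 51, 76)
upper <- c(25, 50, 75, 100)
chunk_futures <- Map(function(lo, hi) future(run_chunk(orders, lo, hi)),
                     lower, upper)

# Unify the partial results once they have all come back.
result <- do.call(rbind, lapply(chunk_futures, value))
print(sum(result$total))

plan(sequential)   # shut the background workers down again
```

Passing the table into run_chunk() explicitly keeps everything the future needs visible in the expression itself.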

Final Thoughts

I have only really scratched the surface of what is possible with futures. For example, future supports multiple execution plans including lazy and cluster (for multiple machines/nodes), whereas I have only focused on increasing performance on a single machine with multiple cores. If this post has provided some inspiration or left you curious, then head over to the official future docs for the full details (which are a joy to read and work through).

Reposted from: https://alexioannides.com/2016/11/02/asynchronous-and-distributed-programming-in-r-with-the-future-package/
