Running R jobs quickly on many machines(转)
As we demonstrated in “A gentle introduction to parallel computing in R” one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs multiple CPUs/cores to running jobs multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.

Colossus supercomputer : The Forbin Project
R itself is not a language designed for parallel computing. It doesn’t have a lot of great user exposed parallel constructs. What saves us is the data science tasks we tend to use R for are themselves are very well suited for parallel programming and many people have prepared very goodpragmatic libraries to exploit this. There are three main ways for a user to benefit from library supplied parallelism:
- Link against superior and parallel libraries such as the Intel BLAS library (supplied on Linux, OSX, and Windows as part of theMicrosoft R Open distribution of R). This replaces libraries you are already using with parallel ones, and you get a speed up for free (on appropriate tasks, such as linear algebra portions of lm()/glm()).
- Ship your modeling tasks out of R into an external parallel system for processing. This is strategy of systems such as rx methods from RevoScaleR, now Microsoft Open R, h2o methods from h2o.ai, orRHadoop.
- Use R’s
parallelfacility to ship jobs to cooperating R instances.This is the strategy used in “A gentle introduction to parallel computing in R” and many libraries that sit on top ofparallel. This is essentially implementing remote procedure call through sockets or networking.
We are going to write more about the third technique.
The third technique is essentially very course grained remote procedure call. It depends on shipping copies of code and data to remote processes and then returning results. It is ill suited for very small tasks. But very well suited a reasonable number of moderate to large tasks. This is the strategy used by R’s parallel library and Python‘s multiprocessinglibrary (though with Python multiprocessing you pretty much need to bring in additional libraries to move from single machine to cluster computing).
This method may seem less efficient and less sophisticated than shared memory methods, but relying on object transmission means it is in principle very easy to extend the technique from a single machine to many machines (also called “cluster computing”). This is what we will demonstrate the R portion of here (in moving from a single machine to a cluster we necessarily bring in a lot of systems/networking/security issues which we will have to defer on).
Here is the complete R portion of the lesson. This assumes you already understand how to configure “ssh” or have a systems person who can help you with the ssh system steps.
Take the examples from “A gentle introduction to parallel computing in R” and instead of starting your parallel cluster with the command: “parallelCluster <- parallel::makeCluster(parallel::detectCores()).”
Do the following:
Collect a list of addresses of machines you can ssh. This is the hard part, depends on your operating system, and something you should get help with if you have not tried it before. In this case I am using ipV4 addresses, but when using Amazon EC2 I use hostnames.
In my case my list is:
- My machine (primary): “192.168.1.235”, user “johnmount”
- Another Win-Vector LLC machine: “192.168.1.70”, user “johnmount”
Notice we are not collecting passwords, as we are assuming we have set up proper “authorized_keys” and keypairs in the “.ssh” configurations of all of these machines. We are calling the machine we are using to issue the overall computation “primary.”
It is vital you try all of these addresses with “ssh” in a terminal shell before trying them with R. Also the machine address you choose as “primary” must be an address the worker machines can use reach back to the primary machine (so you can’t use “localhost”, or use an unreachable machine as primary). Try ssh by hand back and forth from primary to all of these machines and from all of these machines back to your primary before trying to use ssh with R.
Now with the system stuff behind us the R part is as follows. Start your cluster with:
primary <- '192.168.1.235'
machineAddresses <- list(
list(host=primary,user='johnmount',
ncore=4),
list(host='192.168.1.70',user='johnmount',
ncore=4)
) spec <- lapply(machineAddresses,
function(machine) {
rep(list(list(host=machine$host,
user=machine$user)),
machine$ncore)
})
spec <- unlist(spec,recursive=FALSE) parallelCluster <- parallel::makeCluster(type='PSOCK',
master=primary,
spec=spec)
print(parallelCluster)
## socket cluster with 8 nodes on hosts
## ‘192.168.1.235’, ‘192.168.1.70’
And that is it. You can now run your job on many cores on many machines. For the right tasks this represents a substantial speedup. As always separate your concerns when starting: first get a trivial “hello world” task to work on your cluster, then get a smaller version of your computation to work on a local machine, and only after these throw your real work at the cluster.
As we have mentioned before, with some more system work you canspin up transient Amazon ec2 instances to join your computation. At this point your credit card becomes a supercomputer (though you do have to remember to shut them down to prevent extra expenses!).
转自:http://www.win-vector.com/blog/2016/01/running-r-jobs-quickly-on-many-machines/
Running R jobs quickly on many machines(转)的更多相关文章
- 社交网络分析的 R 基础:(四)循环与并行
前三章中列出的大多数示例代码都很短,并没有涉及到复杂的操作.从本章开始将会把前面介绍的数据结构组合起来,构成真正的程序.大部分程序是由条件语句和循环语句控制,R 语言中的条件语句(if-else)和 ...
- Graphics for R
https://cran.r-project.org/web/views/Graphics.html CRAN Task View: Graphic Displays & Dynamic Gr ...
- Configuring and Running Django + Celery in Docker Containers
Configuring and Running Django + Celery in Docker Containers Justyna Ilczuk Oct 25, 2016 0 Commen ...
- Python调用R编程——rpy2
在Python调用R,最常见的方式是使用rpy2模块. 简介 模块 The package is made of several sub-packages or modules: rpy2.rinte ...
- 配置 Sublime Text 3 作为Python R LaTeX Markdown IDE
配置 Sublime Text 3 作为Python R LaTeX Markdown IDE 配置 Sublime Text 3 作为Python IDE IDE的基本功能:代码提醒.补全:编译文件 ...
- Data Science With R In Visual Studio
R Projects Similar to Python, when we installed the data science tools we get an “R” section in our ...
- How-to: Do Statistical Analysis with Impala and R
sklearn实战-乳腺癌细胞数据挖掘(博客主亲自录制视频教程) https://study.163.com/course/introduction.htm?courseId=1005269003&a ...
- [SQL in Azure] Getting Started with SQL Server in Azure Virtual Machines
This topic provides guidelines on how to sign up for SQL Server on a Azure virtual machine and how t ...
- Scheduled Jobs with Custom Clock Processes in Java with Quartz and RabbitMQ
原文地址: https://devcenter.heroku.com/articles/scheduled-jobs-custom-clock-processes-java-quartz-rabbit ...
随机推荐
- JavaScript 简易版 自动轮播 手动轮播 菜鸟交流
本人刚刚接触前端,许多知识还不了解,以前经常到博客园查询自己需要的东西,现在也终于反客为主了.作为新手,所展示的东西也是浅显易懂,希望同是新手的伙伴们共同交流.共同进步,若是成功捕获一位大大,也请您赐 ...
- 深入浅出数据结构C语言版(7)——特殊的表:队列与栈
从深入浅出数据结构(4)到(6),我们分别讨论了什么是表.什么是链表.为什么用链表以及如何用数组模拟链表(游标数组),而现在,我们要进入到对线性表(特意加了"线性"二字是因为存在多 ...
- 统计学习方法:KNN
作者:桂. 时间:2017-04-19 21:20:09 链接:http://www.cnblogs.com/xingshansi/p/6736385.html 声明:欢迎被转载,不过记得注明出处哦 ...
- 计算单词出现的次数--linq
1.直接给出代码:声明数据,也可以是txt等文件,通过File类的静态方法读取其中的文本,再转换成List<string>数组. private static List<string ...
- HTTP上传 文件上传 图片上传 HTTP上传原理 文件上传原理 图片上传原理
1.概述 在最初的http协议中,没有上传文件方面的功能.rfc1867(http://www.ietf.org/rfc/rfc1867.txt )为http协议添加了这个功能.浏览器按照此规范将用户 ...
- 使用charles抓取htpps的方法
自己整理的步骤做个记录 1.下载证书,官方地址:http://www.charlesproxy.com/ssl.zip 可直接点击链接下载:http://charlesproxy.com/getssl ...
- vector介绍
vector(向量,也可称为容器): C++中的一种数据结构,确切的说是一个类.它相当于一个动态的数组,当程序员无法知道自己需要的数组的规模多大时,用其来解决问题可以达到最大节约空间的目的. 1.1 ...
- Xcode8插件安装
一.创建一个自定义证书并且为Xcode重新签名1.打开钥匙串 2.创建自定义签名证书 3.重新签名Xcode(速度比较慢,大概要等1分钟) $ sudo codesign -f -s XcodeSig ...
- vue的Virtual Dom实现- snabbdom解密
vue在官方文档中提到与react的渲染性能对比中,因为其使用了snabbdom而有更优异的性能. JavaScript 开销直接与求算必要 DOM 操作的机制相关.尽管 Vue 和 React 都使 ...
- linux中/etc/profile、/etc/profile.d/、/etc/bashrc、~/.bashrc、~/.bash_profile、~/.bash_logout的作用与区别
作用: /etc/profile:登录时用来设置环境变量,执行文件中的命令,对所有用户生效. /etc/profile.d/:登录时和执行bash命令打开子shell时执行目录下所有已.sh结尾的脚本 ...