As we demonstrated in “A gentle introduction to parallel computing in R”, one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs on multiple CPUs/cores to running jobs on multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.


Colossus supercomputer : The Forbin Project

R itself is not a language designed for parallel computing. It doesn’t have a lot of great user-exposed parallel constructs. What saves us is that the data science tasks we tend to use R for are themselves very well suited to parallel programming, and many people have prepared very good pragmatic libraries to exploit this. There are three main ways for a user to benefit from library-supplied parallelism:

  • Link against superior and parallel libraries such as the Intel BLAS library (supplied on Linux, OSX, and Windows as part of the Microsoft R Open distribution of R). This replaces libraries you are already using with parallel ones, and you get a speedup for free (on appropriate tasks, such as the linear algebra portions of lm()/glm()).
  • Ship your modeling tasks out of R into an external parallel system for processing. This is the strategy of systems such as the rx methods from RevoScaleR (now Microsoft Open R), the h2o methods from h2o.ai, or RHadoop.
  • Use R’s parallel facility to ship jobs to cooperating R instances. This is the strategy used in “A gentle introduction to parallel computing in R” and many libraries that sit on top of parallel. This is essentially implementing remote procedure call through sockets or networking.

We are going to write more about the third technique.

The third technique is essentially very coarse-grained remote procedure call. It depends on shipping copies of code and data to remote processes and then returning results. It is ill suited to very small tasks, but very well suited to a reasonable number of moderate to large tasks. This is the strategy used by R’s parallel library and Python’s multiprocessing library (though with Python multiprocessing you pretty much need to bring in additional libraries to move from single machine to cluster computing).
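As a minimal sketch of this coarse-grained pattern, the following ships a piece of data and a function to two local worker processes and collects the results. The local two-worker cluster is only a stand-in for remote machines, and the names scale_factor, blocks, and block_sums are illustrative assumptions, not part of the original examples.

```r
library(parallel)

# Start two local worker R processes (a stand-in for remote machines).
cl <- makeCluster(2)

# Workers start with empty workspaces, so data they need must be shipped.
scale_factor <- 10
clusterExport(cl, "scale_factor", envir = environment())

# Keep each task moderate in size: four blocks of 100,000 numbers each,
# rather than 400,000 tiny one-number tasks.
blocks <- split(1:400000, rep(1:4, each = 100000))
block_sums <- parLapply(cl, blocks, function(block) {
  # as.numeric avoids integer overflow in the per-block sum
  sum(as.numeric(block)) * scale_factor
})

# Combine the per-block results back on the primary process.
total <- sum(unlist(block_sums))
stopCluster(cl)
print(total)
```

Each call pays a fixed cost to serialize the function, its arguments, and the result, which is why a few larger blocks tend to work better than many tiny tasks.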

This method may seem less efficient and less sophisticated than shared-memory methods, but relying on object transmission means it is in principle very easy to extend the technique from a single machine to many machines (also called “cluster computing”). Here we demonstrate only the R portion of that extension; moving from a single machine to a cluster necessarily brings in many systems, networking, and security issues, which we have to defer.

Here is the complete R portion of the lesson. This assumes you already understand how to configure “ssh” or have a systems person who can help you with the ssh system steps.

Take the examples from “A gentle introduction to parallel computing in R” and, instead of starting your parallel cluster with the command “parallelCluster <- parallel::makeCluster(parallel::detectCores())”, do the following:

Collect a list of addresses of machines you can ssh to. This is the hard part: it depends on your operating system, and it is something you should get help with if you have not tried it before. In this case I am using IPv4 addresses, but when using Amazon EC2 I use hostnames.

In my case my list is:

  • My machine (primary): “192.168.1.235”, user “johnmount”
  • Another Win-Vector LLC machine: “192.168.1.70”, user “johnmount”

Notice we are not collecting passwords, as we are assuming we have set up proper “authorized_keys” and keypairs in the “.ssh” configurations of all of these machines. We are calling the machine we are using to issue the overall computation “primary.”

It is vital you try all of these addresses with “ssh” in a terminal shell before trying them with R. Also, the machine address you choose as “primary” must be an address the worker machines can use to reach back to the primary machine (so you can’t use “localhost”, or use an unreachable machine as primary). Try ssh by hand back and forth from primary to all of these machines and from all of these machines back to your primary before trying to use ssh with R.
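One way to organize the by-hand check is to have R print the ssh commands to paste into a terminal. This is only a sketch: the list name machines and the choice of running “hostname” on the remote side are assumptions for illustration, and the BatchMode option is used so that ssh fails instead of prompting for a password when key-based login is not set up.

```r
# Machines to test, using the addresses and user listed above.
machines <- list(
  list(host = '192.168.1.235', user = 'johnmount'),
  list(host = '192.168.1.70',  user = 'johnmount')
)

# Build one test command per machine; nothing is executed here.
sshCommands <- vapply(machines, function(m) {
  sprintf("ssh -o BatchMode=yes %s@%s hostname", m$user, m$host)
}, character(1))

# Print the commands so they can be run by hand in a terminal shell.
cat(sshCommands, sep = "\n")
```

Run the printed commands from the primary, then run the equivalent command toward the primary from each worker machine.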

Now with the system stuff behind us the R part is as follows. Start your cluster with:

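# Address the workers use to connect back to the primary machine.
primary <- '192.168.1.235'

# One entry per machine: address, ssh user, and number of workers to start.
machineAddresses <- list(
  list(host=primary, user='johnmount',
       ncore=4),
  list(host='192.168.1.70', user='johnmount',
       ncore=4)
)

# Repeat each machine's entry once per core, so one worker is started per core.
spec <- lapply(machineAddresses,
               function(machine) {
                 rep(list(list(host=machine$host,
                               user=machine$user)),
                     machine$ncore)
               })
spec <- unlist(spec, recursive=FALSE)

# Start the socket cluster over ssh.
parallelCluster <- parallel::makeCluster(type='PSOCK',
                                         master=primary,
                                         spec=spec)
print(parallelCluster)
## socket cluster with 8 nodes on hosts
##   ‘192.168.1.235’, ‘192.168.1.70’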

And that is it. You can now run your job on many cores on many machines. For the right tasks this represents a substantial speedup. As always separate your concerns when starting: first get a trivial “hello world” task to work on your cluster, then get a smaller version of your computation to work on a local machine, and only after these throw your real work at the cluster.
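A trivial “hello world” check might look like the sketch below. To keep it runnable on one machine, it assumes a local two-worker cluster (the hypothetical name helloCluster) in place of the multi-machine cluster; on a real cluster you would pass your own cluster object instead.

```r
library(parallel)

# Assumption: a local 2-worker cluster stands in for the multi-machine cluster.
helloCluster <- makeCluster(2)

# "Hello world": ask every worker which host and process it is running as.
greetings <- clusterCall(helloCluster, function() {
  sprintf("hello from %s (pid %d)", Sys.info()[["nodename"]], Sys.getpid())
})
print(unlist(greetings))

# A slightly larger smoke test: square the numbers 1..8 across the workers.
squares <- unlist(parLapply(helloCluster, 1:8, function(i) i^2))
print(squares)

# Release the worker processes when done.
stopCluster(helloCluster)
```

On a multi-machine cluster the greetings should name every host you listed; a missing host suggests the ssh or network setup needs another look before real work is sent out.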

As we have mentioned before, with some more system work you can spin up transient Amazon EC2 instances to join your computation. At this point your credit card becomes a supercomputer (though you do have to remember to shut the instances down to prevent extra expenses!).

Reposted from: http://www.win-vector.com/blog/2016/01/running-r-jobs-quickly-on-many-machines/
