https://grapeot.me/easy-and-cheap-cluster-building-on-aws.html

Thu 17 July 2014, by Yan Wang | Linux, Parallel, github, Image

Why?

Machine learning and computer vision research often requires a lot of computational resources, for example to extract features from many images or to train large-scale (or many) classifiers. People therefore use more than one machine for the task. The usual procedure is: copy executables and data files to all the machines, configure the environments, manually divide the tasks, run the commands, and collect the results. In addition to the complicated workflow, another practical problem is where to get the machines. Maintaining your own cluster is definitely an option, but an extremely expensive and time-consuming one. Renting from AWS, especially using spot instances, is a much cheaper and more practical alternative.

But several factors keep spot instances from being really useful (I assume you already know how spot instances work):

  • Spot instances don't have persistent storage, which means whatever you have on the hard disk may be lost at any minute. How do you deal with this?
  • This property of spot instances also makes system configuration a problem -- how do you easily make a blank system usable?
  • How do you efficiently copy bulk data to AWS?
  • Manual task division and command execution doesn't sound right. How do you make it easier and smarter (and faster)?

Over quite a few months, I have gradually accumulated a tool chain that handles all of these problems.

What will you get?

Here is an example of a 128-core, 240 GB cluster. It takes ~10 minutes to build from scratch (or ~1 minute to build from an AMI), and costs about 1 dollar per hour. Like any AWS instances, the instances themselves cost nothing when you are not using them (that is, once you shut them down). All your data stays on your own hard disk, so the loss due to a spot request failure is minimized. The best thing is that task submission is fairly simple -- a single bash command line does the job, like

cat cluster.sh | parallel --sshlogin 8/m1 --sshlogin 8/m2 --sshlogin 8/m3 --sshlogin 8/m4 bash -c '{}'

It automatically distributes each line of cluster.sh to the four nodes and displays all their stdout on your screen. Whenever a node has fewer than 8 tasks running, parallel automatically dispatches another one to it.
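
Since every line of cluster.sh is one independent task, the file itself is usually generated by a small loop. Here is a minimal sketch, assuming a hypothetical ./extract_features executable and images/ and features/ directories (none of these names come from my actual setup):

```shell
#!/bin/sh
# Hypothetical example: emit one feature-extraction command per input image.
# ./extract_features and the directory names are placeholders.
mkdir -p images features
touch images/a.jpg images/b.jpg images/c.jpg   # stand-in inputs for this demo

: > cluster.sh                                  # start with an empty task file
for img in images/*.jpg; do
    name=$(basename "$img" .jpg)
    # Each line must be a complete command that can run on its own.
    echo "./extract_features $img > features/$name.txt" >> cluster.sh
done

cat cluster.sh
```

Because the lines are independent, parallel is free to hand them to whichever node has a free slot.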

How? (TL; DR)

  • Use an automated script for fast system configuration.
  • Use sshfs for selective file transfer with compression, including training data transfer and result collection.
  • Use GNU parallel for job submission.
  • An AMI can also be used to further speed up virtual machine initialization.
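
To make the sshfs bullet concrete, here is one possible arrangement, sketched under assumptions: it runs on a cluster node and mounts a directory from the machine that holds your data, so only the files a task actually reads get transferred. The host alias home, the remote path, and the mount point are all hypothetical. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: mount a remote data directory over sshfs with compression.
# "home", /data/project and $HOME/project are placeholders (assumptions).
DRY_RUN=${DRY_RUN:-1}
REMOTE=${REMOTE:-home:/data/project}
MOUNTPOINT=${MOUNTPOINT:-$HOME/project}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run mkdir -p "$MOUNTPOINT"
# -C enables ssh compression; -o reconnect re-establishes a dropped connection.
run sshfs -C -o reconnect "$REMOTE" "$MOUNTPOINT"
# When the work is done, unmount with:
run fusermount -u "$MOUNTPOINT"
```

Results written under the mount point land directly on the remote disk, which is what keeps the loss small if a spot instance disappears.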

How?

  1. Create spot instances on AWS.
  2. On each machine, run curl https://grapeot.me/aws.sh | sh if that fits you. Or git clone http://github.com/grapeot/debianinit and execute setup-ubuntu.sh to initialize the system. Note that the script is personalized for me, with python and vim support. Fork it to add your own stuff.
  3. That's it for configuration. To submit jobs, use parallel. Let's look at this example:
cat cluster.sh | parallel --sshlogin 8/m1 --sshlogin 8/m2 --sshlogin 8/m3 --sshlogin 8/m4 bash -c '{}'

We already explained what it does; here are more details. In a switch like --sshlogin 8/m1, --sshlogin tells parallel to send tasks to a remote machine. 8/m1 tells parallel to send them to an ssh host named m1, which you can configure in ~/.ssh/config, and to keep at most 8 tasks running on that host. bash -c '{}' is the actual command to execute on the remote machine, with {} as the placeholder for each line from stdin. parallel is much more flexible than this, and I'll leave the exploration of more switches and usages to you. :)
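
The host names m1 to m4 only work if ssh knows them. A minimal sketch of generating the matching ~/.ssh/config entries follows; the addresses (from a documentation-only range), the user name ubuntu, and the key path are placeholders, so substitute the values of your own instances. The script writes to a separate demo file rather than touching ~/.ssh/config directly.

```shell
#!/bin/sh
# Sketch: write ssh host aliases m1..m4 to a demo file.
# Addresses, user name and key path are assumptions, not real values.
out=ssh_config.demo
: > "$out"

i=1
for addr in 203.0.113.11 203.0.113.12 203.0.113.13 203.0.113.14; do
    cat >> "$out" <<EOF
Host m$i
    HostName $addr
    User ubuntu
    IdentityFile ~/.ssh/aws-key.pem
EOF
    i=$((i + 1))
done

cat "$out"
```

After reviewing the output, append it to ~/.ssh/config; ssh m1 should then log in without extra arguments, which is what parallel relies on.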
