Easy and cheap cluster building on AWS
https://grapeot.me/easy-and-cheap-cluster-building-on-aws.html
Thu 17 July 2014, by Yan Wang | Linux, Parallel, github

Why?
Machine learning / computer vision research often requires a lot of computational resources, such as extracting features from a lot of images, or training large-scale or many classifiers. Therefore people use more than one machine for the task. The procedure usually looks like this: copy executable/data files to all the machines, configure the environments, manually divide the tasks, actually run the commands, and collect the results. In addition to the complicated workflow, another practical problem is where to get the machines. Maintaining your own cluster is definitely an option, but an extremely expensive and time-consuming one. Renting from AWS, especially using spot instances, is a much cheaper and more practical alternative.
But several factors prevent spot instances from being really useful (I assume you already know how spot instances work):
- Spot instances don't have persistent storage, which means whatever you have on the hard disk may be lost in the next minute. How to deal with this?
- This property of spot instances also makes system configuration a problem -- how do you easily make a blank system usable?
- How to efficiently copy large amounts of data to AWS?
- Manual task division and command execution don't sound right. How to make them easier and smarter (and faster)?
Over the past few months, I have gradually accumulated a tool chain to handle all of these problems.
What will you get?

Here is an example of a 128-core, 240GB cluster. It takes ~10 minutes to build from scratch (or ~1 minute to build from an AMI image), and costs about 1 dollar per hour. Like any AWS instance, the machines cost nothing when you are not using them (by shutting them down). All your data will stay on your own hard disk, and the loss due to spot request failures will be minimized. The best part is that task submission is fairly simple -- a single line of bash will do the job, like
cat cluster.sh | parallel --sshlogin 8/m1 --sshlogin 8/m2 --sshlogin 8/m3 --sshlogin 8/m4 bash -c '{}'
It will automatically distribute every line of cluster.sh to the four nodes, and display all the stdout output on your screen. Whenever a node has fewer than 8 tasks running, parallel automatically dispatches one more to it.
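Here cluster.sh is just a plain text file with one independent shell command per line. As a hypothetical illustration (the extract_features binary and the paths are made up), you could generate such a file like this:

    # Generate cluster.sh: one feature-extraction command per image.
    # Each line is an independent job that parallel can dispatch to any node.
    for f in images/*.jpg; do
        echo "./extract_features --input $f --output ${f%.jpg}.feat"
    done > cluster.sh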
How? (TL; DR)
- Use an automated script to do fast system configuration.
- Use sshfs to do selective file transfer with compression, including training data transfer and result collection (see the sketch after this list).
- Use GNU parallel to do job submission.
- An AMI can also be used to further expedite virtual machine initialization.
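As a minimal sketch of the sshfs step (assuming the ssh host alias m1 configured as in the next section, and hypothetical directory names):

    # Mount the remote data directory locally over ssh, with compression on;
    # only the files you actually read or write get transferred.
    mkdir -p ~/aws-data
    sshfs -o compression=yes m1:/home/ubuntu/data ~/aws-data
    # ... copy training data in, collect results out, then unmount:
    fusermount -u ~/aws-data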
How?
- Create spot instances on AWS.
- On each machine, run curl https://grapeot.me/aws.sh | sh if that fits you. Or git clone http://github.com/grapeot/debianinit and execute setup-ubuntu.sh to initialize the system. Note the script is personalized for me, with python and vim support. Fork it to add your own stuff.
- That's it for configuration. To submit jobs, use parallel. Let's look at this example:
cat cluster.sh | parallel --sshlogin 8/m1 --sshlogin 8/m2 --sshlogin 8/m3 --sshlogin 8/m4 bash -c '{}'
We already explained what it means; here are more details. For switches like --sshlogin 8/m1, --sshlogin means to send tasks to a remote machine. 8/m1 tells parallel to send them to an ssh host named m1, which you can configure in ~/.ssh/config, and to keep at most 8 tasks running on that host. bash -c '{}' is the actual command to execute on the remote machine, with {} as the placeholder for each line from stdin. parallel is much more flexible than this, and I'll leave the exploration of more switches and usage to you. :)
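For completeness, the ~/.ssh/config entries behind the m1..m4 aliases might look like this (the hostnames and key path below are hypothetical):

    # ~/.ssh/config -- short aliases for parallel's --sshlogin switches
    Host m1
        HostName ec2-203-0-113-1.compute-1.amazonaws.com
        User ubuntu
        IdentityFile ~/.ssh/aws-spot.pem
    # ... repeat for m2, m3, and m4 with their own HostNames.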