Easy and cheap cluster building on AWS backup
https://grapeot.me/easy-and-cheap-cluster-building-on-aws.html
Thu 17 July 2014 , by Yan Wang | 2 Comments Linux Parallel github ImageWhy?
It often requires a lot of computational resources to do machine learning / computer vision research, like extracting features from a lot of images, and training large-scale or many classifiers. Therefore people use more than one machines to do the task. The procedures are often like, copy executable/data files to all the machines, configure environments, manually divide the tasks, actually run the commands, and collect the results. In addition to the complicated workflow, another practical problem is where to get the machines. Maintaining your own cluster is definitely an option, an extremely expensive and time-costing option. Renting from AWS, especially using spot instances, is a much cheaper and more practical alternative.
But a lot of factors prevent them to be really useful (I assume you already know how spot instances work):
- Spot instances don't have persistent storage, which means whatever you have on the hard disk may lost in the next minute. How to deal with this?
- This property of spot instances also makes system configuration a problem -- how do you easily make a blank system usable?
- How to efficiently copy bulk of data to AWS?
- Manual task division and command execution doesn't sound right. How to make it easier and smarter (and faster)?
After quite a few months, I gradually accumulate a tool chain to handle all of these problems.
What will you get?

Here is an example of a 128-core 240GB cluster. It requires ~10 minutes to build it from scratch (or ~1 minute to build from AMI image), and costs about 1 dollar per hour. Like any AWS instances, the instances themselves cost nothing if you don't use them (by shutting them down). All your data will be on your hard disk and the loss due to spot request failure will be minimized. The best thing is, task submission is fairly simple -- one single line of bash command will do the job, like
cat cluster.sh | parallel --sshlogin 8/m1 --sshlogin 8/m2 --sshlogin 8/m3 --sshlogin 8/m4 bash -c '{}'
It will automatically distribute every line of cluster.sh to the four nodes, and display all the stdouts on your screen. Whenever a node has less than 8 tasks running, the script will automatically dispatch one to it.
How? (TL; DR)
- Use automated script to do fast system configuration.
- Use
sshfsto do selective file transfer with compression, including training data transfer and result collection. - Use GNU
parallelto do job submission. - AMI can also be used to further expedite virtual machine initialization
How?
- Create spot instances on AWS.
- On each machine, run
curl https://grapeot.me/aws.sh | shif that fits you. Orgit clone http://github.com/grapeot/debianinitand executesetup-ubuntu.shto initialize the system. Note the script is personalized for me withpythonandvimsupport. Folk it to add your own stuffs. - That's it for configuration. To submit jobs, use
parallel. Let's look at this example:
cat cluster.sh | parallel --sshlogin 8/m1 --sshlogin 8/m2 --sshlogin 8/m3 --sshlogin 8/m4 bash -c '{}'
We already explained what it means, and here are more details. For switches like --sshlogin 8/m1, --sshlogin means to send the task to remote machines. 8/m1 tells parallel to send it to a ssh host named m1, which you can configure in ~/.ssh/config, and maintain at most 8 tasks on that host. bash -c '{}' is the actual command to execute on the remote machine, with {} as the placeholder for each line from stdin. parallel is much more flexible than this, and I'd leave the exploration of more switches and usage to you. :)
Easy and cheap cluster building on AWS backup的更多相关文章
- Nacos Cluster Building
原文链接:https://www.javaspring.net/nacos/nacos-cluster-building Continue to talk about the Nacos build ...
- AWS backup
shadowsocks ssserver -c /etc/shadowsocks/config.json start/stop/reset
- AWS 免费套餐
AWS 免费套餐 转载自:https://aws.amazon.com/cn/free/?sc_channel=PS&sc_campaign=acquisition_CN&sc_pub ...
- AWS 存储服务(三)
目录 AWS S3 业务场景 挑战 解决方案 S3的好处 S3 属性 存储桶 Buckets 对象 Object S3 特性 S3 操作 可用性和持久性 一致性 S3 定价策略 S3高级功能 存储级别 ...
- Awesome Go
A curated list of awesome Go frameworks, libraries and software. Inspired by awesome-python. Contrib ...
- Go 语言相关的优秀框架,库及软件列表
If you see a package or project here that is no longer maintained or is not a good fit, please submi ...
- Awesome Go (http://awesome-go.com/)
A curated list of awesome Go frameworks, libraries and software. Inspired by awesome-python. Contrib ...
- Awesome Go精选的Go框架,库和软件的精选清单.A curated list of awesome Go frameworks, libraries and software
Awesome Go financial support to Awesome Go A curated list of awesome Go frameworks, libraries a ...
- RAC的QA
RAC: Frequently Asked Questions [ID 220970.1] 修改时间 13-JAN-2011 类型 FAQ 状态 PUBLISHED Appli ...
随机推荐
- VS IIS Express 支持局域网访问
使用Visual Studio开发Web网页的时候有这样的情况:想要在调试模式下让局域网的其他设备进行访问,以便进行测试.虽然可以部署到服务器中,但是却无法进行调试,就算是注入进程进行调试也是无法达到 ...
- Cordova 混合开发
详细的教程在以下博客 https://blog.csdn.net/csdn100861/article/details/78585333
- kbengine学习1 安装
KBengine一年前就知道了,但是没来得及学(只记得是C++ + python脚本),前一个项目unity3d+fkask+socketio+sqlite硬怼出来的.这半年也没来得及管.(好像当时看 ...
- Codeforces 1016 E - Rest In The Shades
E - Rest In The Shades 思路: 相似 红色的长度等于(y - s) / y 倍的 A' 和 B' 之间的 fence的长度 A' 是 p 和 A 连线和 x 轴交点, B'同理 ...
- js怎么把一个数组里面的值作为一个属性添加到另一数组包含的对象里(小程序)
上面这个需求我说的似乎不太明白,之前也是没有碰到过,也是最近在搞小程序,涉及到小程序前后台数据交互,展示的部分!!不太明白没关系等会我给大家举个例子,就明白了说起来有点拗口,一看就明白了,其实如果是原 ...
- echarts的学习
博客1.:https://zrysmt.github.io/ 博客2:http://blog.csdn.net/future_todo/article/details/60956942 工作中一个需求 ...
- SSD固态硬盘是会掉速的。
也没什么好的办法. 只是自己不再疑神疑鬼,总觉得中病毒了. 下面的文章还是挺有参考意义的. http://diy.pconline.com.cn/627/6271636_all.html SSD变慢了 ...
- MySql常用函数全部汇总
MySQL数据库中提供了很丰富的函数.MySQL函数包括数学函数.字符串函数.日期和时间函数.条件判断函数.系统信息函数.加密函数.格式化函数等.通过这些函数,可以简化用户的操作.例如,字符串连接函数 ...
- String的intern()方法和java关键字、保留字
String s1 = new StringBuilder("hel").append("lo").toString(); //hello System.out ...
- 20181013xlVba成绩报表优化
Public Sub 成绩报表优化() Application.ScreenUpdating = False Application.DisplayAlerts = False Application ...