Easy and cheap cluster building on AWS backup
https://grapeot.me/easy-and-cheap-cluster-building-on-aws.html
Thu 17 July 2014, by Yan Wang | Linux Parallel github
Why?
Machine learning and computer vision research often requires a lot of computational resources, for example to extract features from a large number of images or to train large-scale (or simply many) classifiers. People therefore use more than one machine for the task. The procedure usually goes like this: copy executables and data files to all the machines, configure the environments, manually divide the tasks, actually run the commands, and collect the results. Beyond the complicated workflow, another practical problem is where to get the machines. Maintaining your own cluster is definitely an option, but an extremely expensive and time-consuming one. Renting from AWS, especially using spot instances, is a much cheaper and more practical alternative.
But several factors prevent spot instances from being really useful (I assume you already know how spot instances work):
- Spot instances don't have persistent storage, which means whatever you have on the hard disk may be lost at any minute. How to deal with this?
- This property of spot instances also makes system configuration a problem -- how do you easily make a blank system usable?
- How to efficiently copy bulk data to AWS?
- Manual task division and command execution don't sound right. How to make them easier and smarter (and faster)?
After quite a few months, I have gradually accumulated a tool chain that handles all of these problems.
What will you get?

Here is an example of a 128-core, 240GB cluster. It takes ~10 minutes to build from scratch (or ~1 minute to build from an AMI image), and costs about 1 dollar per hour. Like any AWS instances, the instances themselves cost nothing when you are not using them (that is, when they are shut down). All your data will be on your hard disk, so the loss due to a spot request failure is minimized. The best thing is that task submission is fairly simple: a single line of bash will do the job, like
cat cluster.sh | parallel --sshlogin 8/m1 --sshlogin 8/m2 --sshlogin 8/m3 --sshlogin 8/m4 bash -c '{}'
It automatically distributes every line of cluster.sh to the four nodes and displays all the stdout on your screen. Whenever a node has fewer than 8 tasks running, a new task is automatically dispatched to it.
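As an illustration, cluster.sh is just a text file with one shell command per line. Each line should be runnable on its own, since the host and order are chosen at dispatch time. A minimal sketch of generating such a file, assuming a hypothetical ./extract_features program and hypothetical images/ and results/ paths (none of these come from the original setup):

```shell
#!/bin/sh
# Generate cluster.sh with one independent command per line.
# ./extract_features and the images/ and results/ paths are hypothetical
# placeholders; substitute whatever your own tasks look like.
: > cluster.sh                      # start from an empty file
i=1
while [ "$i" -le 100 ]; do
    echo "./extract_features images/part_$i.txt > results/part_$i.out" >> cluster.sh
    i=$((i + 1))
done
wc -l < cluster.sh                  # number of tasks that will be dispatched
```

Splitting the work into many more lines than there are job slots lets faster nodes pick up extra tasks instead of sitting idle.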
How? (TL;DR)
- Use an automated script to do fast system configuration.
- Use sshfs to do selective file transfer with compression, including training data transfer and result collection.
- Use GNU parallel to do job submission.
- An AMI can also be used to further expedite virtual machine initialization.
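The sshfs step might look roughly like the sketch below. It is only an illustration under stated assumptions: the host name datahost, the remote path, and the mount point are hypothetical, and the post does not say which machine mounts which. The script defaults to a dry run that prints the commands rather than executing them.

```shell
#!/bin/sh
# Sketch: mount a remote data directory over SSH with compression.
# "datahost", the remote path, and the mount point are hypothetical placeholders.
REMOTE="${REMOTE:-datahost:/data/images}"
MOUNT_POINT="${MOUNT_POINT:-$HOME/remote-data}"
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run mkdir -p "$MOUNT_POINT"
# -C is sshfs shorthand for -o compression=yes.
# Only the files a job actually opens are transferred.
run sshfs -C "$REMOTE" "$MOUNT_POINT"
# ... run jobs that read from (or write results to) the mount point ...
# Unmount when done (fusermount -u on Linux).
run fusermount -u "$MOUNT_POINT"
```

Run it with DRY_RUN=0 once the placeholders point at real hosts and paths.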
How?
- Create spot instances on AWS.
- On each machine, run curl https://grapeot.me/aws.sh | sh, if that fits you. Or run git clone http://github.com/grapeot/debianinit and execute setup-ubuntu.sh to initialize the system. Note that the script is personalized for me, with python and vim support. Fork it to add your own stuff.
- That's it for configuration. To submit jobs, use parallel. Let's look at this example:
cat cluster.sh | parallel --sshlogin 8/m1 --sshlogin 8/m2 --sshlogin 8/m3 --sshlogin 8/m4 bash -c '{}'
We already explained what it does; here are more details. In a switch like --sshlogin 8/m1, --sshlogin tells parallel to send tasks to a remote machine. 8/m1 means to send them to an SSH host named m1, which you can configure in ~/.ssh/config, and to keep at most 8 tasks running on that host. bash -c '{}' is the actual command to execute on the remote machine, with {} as the placeholder for each line read from stdin. parallel is much more flexible than this, and I'll leave the exploration of more switches and usages to you. :)
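For the host names m1 to m4 to resolve, each needs an entry in ~/.ssh/config. A minimal sketch that writes such entries to a separate file for review; the addresses (taken from the 203.0.113.x documentation range), the user name ubuntu, and the key path ~/.ssh/aws-key.pem are all hypothetical and need to be replaced with your instances' values:

```shell
#!/bin/sh
# Sketch: generate Host entries m1..m4 suitable for ~/.ssh/config.
# Addresses, user name, and key path below are hypothetical placeholders.
OUT="${OUT:-cluster_ssh_config}"
: > "$OUT"                          # start from an empty file
n=1
for addr in 203.0.113.11 203.0.113.12 203.0.113.13 203.0.113.14; do
    cat >> "$OUT" <<EOF
Host m$n
    HostName $addr
    User ubuntu
    IdentityFile ~/.ssh/aws-key.pem
EOF
    n=$((n + 1))
done
cat "$OUT"                          # review, then append to ~/.ssh/config
```

Once the entries are in ~/.ssh/config, ssh m1 hostname should connect without extra flags, which is the same lookup parallel uses when it logs in to each node.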