Easy and cheap cluster building on AWS
https://grapeot.me/easy-and-cheap-cluster-building-on-aws.html
Thu 17 July 2014, by Yan Wang

Why?
Machine learning and computer vision research often requires a lot of computational resources, like extracting features from a large number of images, or training large-scale or many classifiers. Therefore people use more than one machine for the task. The procedure usually looks like this: copy the executable/data files to all the machines, configure the environments, manually divide the tasks, actually run the commands, and collect the results. In addition to the complicated workflow, another practical problem is where to get the machines. Maintaining your own cluster is definitely an option, but an extremely expensive and time-consuming one. Renting from AWS, especially using spot instances, is a much cheaper and more practical alternative.
But several factors prevent spot instances from being really useful (I assume you already know how spot instances work):
- Spot instances don't have persistent storage, which means whatever you have on the hard disk may be lost in the next minute. How do you deal with this?
- This property of spot instances also makes system configuration a problem -- how do you easily make a blank system usable?
- How do you efficiently copy bulk data to AWS?
- Manual task division and command execution doesn't sound right. How do you make it easier and smarter (and faster)?
After quite a few months, I have gradually accumulated a toolchain to handle all of these problems.
What will you get?

Here is an example of a 128-core, 240GB cluster. It takes ~10 minutes to build from scratch (or ~1 minute to build from an AMI image), and costs about 1 dollar per hour. Like any AWS instances, they cost nothing when you are not using them (by shutting them down). All your data will stay on your hard disk, and the loss due to a spot request failure will be minimized. The best thing is, task submission is fairly simple -- a single line of bash will do the job, like
cat cluster.sh | parallel --sshlogin 8/m1 --sshlogin 8/m2 --sshlogin 8/m3 --sshlogin 8/m4 bash -c '{}'
It will automatically distribute every line of cluster.sh to the four nodes, and display all of their stdout on your screen. Whenever a node has fewer than 8 tasks running, the script will automatically dispatch one to it.
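To make this concrete, here is what a hypothetical cluster.sh might look like for a feature-extraction job. The extract_feature binary and the file paths are made up for illustration; the only requirement is that each line is an independent shell command:

# cluster.sh -- one independent command per line; parallel treats each line as a separate job.
./extract_feature --input images/batch_001.tar --output feats/batch_001.bin
./extract_feature --input images/batch_002.tar --output feats/batch_002.bin
./extract_feature --input images/batch_003.tar --output feats/batch_003.bin
./extract_feature --input images/batch_004.tar --output feats/batch_004.bin

Since the lines are dispatched independently, there is no guarantee about the order they finish in, so each command should write to its own output file.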
How? (TL;DR)
- Use an automated script to do fast system configuration.
- Use sshfs to do selective file transfer with compression, including training data transfer and result collection (see the sketch after this list).
- Use GNU parallel to do job submission.
- An AMI can also be used to further expedite virtual machine initialization.
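As a sketch of the sshfs step (the host alias m1 and the paths are placeholders, not from the original post), you can mount a remote directory over SSH with compression enabled, so only the files you actually touch get transferred:

# Mount the remote data directory locally over SSH; compression helps on slow links.
# "m1" and the paths are placeholders -- substitute your own host and directories.
mkdir -p ~/data
sshfs -o Compression=yes m1:/home/ubuntu/data ~/data

# ... run jobs that read from / write to ~/data ...

# Unmount when done.
fusermount -u ~/data

This gives you selective transfer for free: results written under the mount point land directly on your machine, which covers the result collection mentioned above.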
How?
- Create spot instances on AWS.
- On each machine, run curl https://grapeot.me/aws.sh | sh if that fits you. Or git clone http://github.com/grapeot/debianinit and execute setup-ubuntu.sh to initialize the system. Note the script is personalized for me, with python and vim support. Fork it to add your own stuff.
- That's it for configuration. To submit jobs, use parallel. Let's look at this example:
cat cluster.sh | parallel --sshlogin 8/m1 --sshlogin 8/m2 --sshlogin 8/m3 --sshlogin 8/m4 bash -c '{}'
We already explained what it means, and here are more details. For switches like --sshlogin 8/m1, --sshlogin tells parallel to send tasks to a remote machine. 8/m1 tells parallel to send them to an SSH host named m1, which you can configure in ~/.ssh/config (a sample configuration follows), and to keep at most 8 tasks running on that host. bash -c '{}' is the actual command to execute on the remote machine, with {} as the placeholder for each line from stdin. parallel is much more flexible than this, and I'll leave the exploration of more switches and usage to you. :)
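For reference, here is a minimal ~/.ssh/config sketch defining the m1 and m2 aliases used above (m3 and m4 follow the same pattern). The host names, user, and key path are placeholders you would replace with your own instances' details:

# ~/.ssh/config -- hypothetical entries; use your instances' public DNS names
# and your own key pair.
Host m1
    HostName ec2-203-0-113-10.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/my-aws-key.pem

Host m2
    HostName ec2-203-0-113-11.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/my-aws-key.pem

With these entries in place, a plain ssh m1 works without extra flags, and parallel's --sshlogin 8/m1 resolves the alias through the same file.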