FireCaffe

Forrest N. Iandola et al., "FireCaffe: near-linear acceleration of deep neural network training on compute clusters," 2016.

Problem statements from data scientists

4 key pain points summarized by Jeff Dean from Google:

1. DNN researchers and users want results of experiments quickly.

2. There is a “patience threshold”: no one wants to wait more than a few days or a week for results.

3. This significantly affects the scale of problems that can be tackled.

4. We sometimes optimize for experiment turnaround time, rather than for the absolute minimum of system resources needed to run the experiments.

Problem analysis

The speed and scalability of distributed algorithms are almost always limited by the overhead of communication between servers, and DNN training is no exception.
So the design focuses on reducing communication overhead, including:

1. Upgrade to high-throughput interconnects, e.g. InfiniBand.
2. Decrease the data transmission volume during training, which includes:
  a) Balancing carefully between data parallelism and model parallelism
  b) Increasing the batch size to reduce communication frequency, and identifying hyperparameters that work well with large batch sizes
  c) Balancing communication volume among nodes to avoid a single point of dependency

Key take-aways

Parallelism Scheme: Model Parallelism or Data Parallelism

Model parallelism

Each worker gets a subset of the model parameters, and the workers communicate by exchanging data gradients and activations; the data quantity is therefore determined by the size of the activations.
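A rough sketch in my own notation (not the paper's equation): with batch size $D$ and $|a_\ell|$ the activation volume at each layer boundary that crosses workers, the traffic per iteration is on the order of

$$\text{comm}_{\text{model}} \;\approx\; 2\, D \sum_{\ell \in \text{boundaries}} |a_\ell|,$$

counting activations in the forward pass and data gradients in the backward pass, so it grows with the batch size.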

Data parallelism

Each worker gets a subset of the batch, and the workers communicate by exchanging weight gradient updates; the data quantity is therefore determined by the size of the model parameters.
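Using the same kind of sketch: with $|W|$ the total number of model parameters, each worker ships its weight gradients and receives updated weights once per iteration, so the traffic per iteration is on the order of

$$\text{comm}_{\text{data}} \;\approx\; 2\,|W|,$$

independent of the batch size.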

Convolutional layers and fully connected layers have very different data/weight ratios, so they can use different parallelism schemes.


So a basic conclusion is: convolutional layers fit data parallelism well, and fully connected layers fit model parallelism well.
Furthermore, for more recent CNNs such as GoogLeNet and ResNet, which are dominated by convolutional layers, data parallelism alone can be used directly, as this paper does.
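A quick back-of-the-envelope comparison of the two layer types; the layer shapes and batch size below are hypothetical examples, not numbers from the paper:

```python
# Rough data/weight ratio for a conv layer vs. a fully connected (fc) layer.
# Shapes are illustrative only.

def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def conv_activations(c_out, h, w):
    """Output activation count of a conv layer for one image."""
    return c_out * h * w

batch = 256

# Hypothetical conv layer: 256 -> 256 channels, 3x3 kernel, 28x28 output.
conv_w = conv_params(256, 256, 3)
conv_a = conv_activations(256, 28, 28) * batch

# Hypothetical fc layer: 4096 -> 4096 neurons.
fc_w = 4096 * 4096
fc_a = 4096 * batch

print(f"conv: weights={conv_w:,} activations={conv_a:,} ratio={conv_a / conv_w:.1f}")
print(f"fc:   weights={fc_w:,} activations={fc_a:,} ratio={fc_a / fc_w:.3f}")

# conv: activations dwarf weights -> exchanging weight gradients is cheaper
#       -> data parallelism.
# fc:   weights dwarf activations -> exchanging activations is cheaper
#       -> model parallelism.
```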

Gradient Aggregation Scheme: Parameter Server or Reduction Tree

A picture showing how a parameter server and a reduction tree work in data parallelism:

Parameter Server

Parameter communication time versus the number of workers in the parameter server scheme:

The communication time scales linearly as we increase the number of workers, so a single parameter server becomes the scalability bottleneck.
Microsoft Adam and Google DistBelief relieve this issue by defining a pool of nodes that collectively behave as a parameter server. The bigger the parameter server hierarchy gets, the more it looks like a reduction tree.

Reduction Tree

The idea is the same as allreduce in the message-passing model. Parameter communication time versus the number of workers in the reduction tree scheme:

It scales logarithmically with the number of workers.
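A tiny model of the two scaling behaviors; the parameter count, link bandwidth, and the factor of two for send/receive are my assumptions, not measurements from the paper:

```python
import math

W_BYTES = 4 * 13_000_000   # ~13M fp32 parameters, roughly GoogLeNet-scale (assumed)
LINK_BW = 4e9              # bytes/s per link (assumed)

def param_server_time(p):
    # A single server must receive p gradient sets and send back p weight
    # sets over its own link, so time grows linearly with p.
    return 2 * p * W_BYTES / LINK_BW

def reduction_tree_time(p):
    # Gradients are summed pairwise up a binary tree and the result is
    # broadcast back down, so time grows with log2(p).
    return 2 * math.ceil(math.log2(p)) * W_BYTES / LINK_BW

for p in (4, 16, 64, 256):
    print(f"p={p:4d}  parameter server: {param_server_time(p):6.3f}s"
          f"  reduction tree: {reduction_tree_time(p):6.3f}s")
```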

Batch size selection

A larger batch size leads to less frequent communication and therefore enables more scalability in a distributed setting, as quantified below. But for larger batch sizes, we need to identify a suitable hyperparameter setting to maintain the speed and accuracy of DNN training.
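To make that concrete under data parallelism: with $N$ training images, batch size $D$, and parameter size $|W|$, there are $N/D$ weight-gradient exchanges per epoch, so roughly

$$\text{comm per epoch} \;\approx\; \frac{N}{D} \cdot 2\,|W|,$$

which shrinks as $D$ grows.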
The hyperparameters in question include:

1. Initial learning rate

2. learning rate update scheme

3. weight decay

4. momentum

Weight update rule used ($t$ below denotes the iteration index):
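The concrete form I assume here is Caffe's standard momentum SGD update with weight decay ($\alpha_t$ learning rate, $\mu$ momentum, $\lambda$ weight decay); the paper's exact formulation may differ in details:

$$v_{t+1} = \mu\, v_t - \alpha_t \left( \nabla_w L(w_t) + \lambda\, w_t \right), \qquad w_{t+1} = w_t + v_{t+1}$$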


Learning rate update rule:
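A common choice, and the one I assume here, is Caffe's polynomial decay policy with initial rate $\alpha_0$, total iteration budget $t_{\max}$, and power $p$:

$$\alpha_t = \alpha_0 \left( 1 - \frac{t}{t_{\max}} \right)^{p}$$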

I will write a separate article on how to choose hyperparameters according to batch size.

Results

Final results on a GPU cluster with GoogLeNet.

More thinking

1. The approaches above are essentially lossless. To further reduce communication overhead, people have begun to try lossy approaches that trade off between training speed and accuracy. Typical examples:

     1) Reduce parameter size using 16-bit floating point - Google (a sketch of this and of gradient thresholding follows this list)
     2) Use 16-bit weights and 8-bit activations
     3) 1-bit gradient backpropagation - Microsoft
     4) Discard gradients whose numerical values fall below a certain threshold - Amazon
     5) Compress (e.g. using PCA) weights before transmitting
     6) Network pruning/encoding/quantization - Intel, DeePhi
2. Use new low-level technologies to reduce communication overhead - Matrix
     1) RDMA rather than traditional TCP/IP?
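A small sketch of two of the lossy ideas above (16-bit gradients and threshold-based dropping); the function names and threshold value are made up for illustration:

```python
import numpy as np

def compress_gradients(grad: np.ndarray, threshold: float = 1e-4):
    """Keep only gradients with magnitude >= threshold, cast to fp16."""
    keep = np.abs(grad) >= threshold             # discard tiny gradients
    idx = np.nonzero(keep)[0].astype(np.uint32)  # positions to transmit
    vals = grad[keep].astype(np.float16)         # 16-bit floating point
    return idx, vals

def decompress_gradients(idx, vals, size):
    """Rebuild a dense fp32 gradient vector on the receiving side."""
    dense = np.zeros(size, dtype=np.float32)
    dense[idx] = vals.astype(np.float32)
    return dense

grad = np.random.randn(1_000_000).astype(np.float32) * 1e-4
idx, vals = compress_gradients(grad)
ratio = (idx.nbytes + vals.nbytes) / grad.nbytes
print(f"kept {len(idx)} of {grad.size} gradients, {ratio:.1%} of the original bytes")
restored = decompress_gradients(idx, vals, grad.size)
```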
