Distributed Deep Learning

I strongly recommend the book 《分布式机器学习》 (Distributed Machine Learning) by Tie-Yan Liu (刘铁岩).

And two posts from an expert's blog:

https://zhuanlan.zhihu.com/p/29032307

https://zhuanlan.zhihu.com/p/30976469

Principles of Distributed Deep Learning

Many tutorials explain how DL training works. Let's briefly review it:

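So what happens when the scale is too large and training has to be distributed? Distributed machine learning broadly follows these approaches: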

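When the computation is too heavy (computation parallelism), it can be parallelized across multiple threads or nodes. A commonly used algorithm is synchronous stochastic gradient descent, which is roughly equivalent to mini-batch SGD over K mini-batches, where K is the number of nodes. [ch6.2]
When there is too much training data (data parallelism, which is also the most common case), the data has to be partitioned across multiple nodes for training. Each node first trains a sub-model on its local data while staying in communication with the other nodes (for example, to update parameters), so that the training results from all nodes can eventually be merged effectively into a global ML model. [ch6.3]
When the model is too large, the model itself (for example, the different layers of an NN) has to be partitioned across nodes for training. In this case the nodes may need to sync frequently. [ch6.4]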
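The claim in the first approach can be checked with a minimal sketch of synchronous SGD on a one-parameter least-squares problem. The toy model, the data, the learning rate, and the function names are all assumptions made for illustration and are not taken from the book. With equally sized mini-batches, averaging the K local gradients gives the same update as one SGD step on the K mini-batches combined.

```python
# Toy model (assumed for illustration): y = w * x with a mean-squared-error loss.

def grad(w, batch):
    """Gradient of the mean squared error w.r.t. w on one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sync_sgd_step(w, batches, lr):
    """One synchronous step: each of the K nodes computes a local gradient,
    the gradients are averaged, and every node applies the same update."""
    local_grads = [grad(w, b) for b in batches]  # runs in parallel in practice
    avg_grad = sum(local_grads) / len(local_grads)
    return w - lr * avg_grad


# Hypothetical data generated by y = 3x, split into K = 2 equal mini-batches.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
batches = [data[:2], data[2:]]

w_distributed = sync_sgd_step(0.0, batches, lr=0.01)
w_single = 0.0 - 0.01 * grad(0.0, data)  # one step on the combined batch

print(w_distributed, w_single)
```

The two values agree only because the mini-batches have equal size; with unequal sizes the average would have to be weighted by batch size.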

They can be summarized in the figure below:

Taking data parallelism as an example, the whole pipeline looks like this:

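Partition the data across the nodes
Train locally on each node
Handle communication between nodes and design the overall topology [ch7]
Aggregate the trained sub-models [ch8]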
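The four steps above can be sketched end to end on a toy problem. This is only a sketch under stated assumptions: the model is y = w * x, the shards are assigned round-robin, and the aggregation is plain parameter averaging, which is just one of several possible aggregation schemes. All names and constants here are hypothetical.

```python
def partition(data, k):
    """Step 1: split the dataset into k shards, round-robin."""
    return [data[i::k] for i in range(k)]


def local_train(w, shard, lr, epochs):
    """Step 2: plain SGD on one node's shard for the toy model y = w * x."""
    for _ in range(epochs):
        for x, y in shard:
            w -= lr * 2.0 * (w * x - y) * x
    return w


def aggregate(sub_models):
    """Step 4: combine the sub-models by simple parameter averaging."""
    return sum(sub_models) / len(sub_models)


# Hypothetical data generated by y = 3x, so the true parameter is w = 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = partition(data, k=4)

w_global = 0.0
for _ in range(20):  # Step 3: one communication round per iteration
    sub_models = [local_train(w_global, s, lr=0.005, epochs=1) for s in shards]
    w_global = aggregate(sub_models)  # every node restarts from the average

print(w_global)
```

Here the communication topology is trivial (every node reports to one aggregator each round); how often to communicate and over which topology are exactly the design questions the pipeline leaves open.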

Distributed DL model

Three Distributed DL approaches are currently common in industry: [ch7.3]

1. PyTorch: AllReduce Model
MPI is a common framework for building distributed machine learning systems. The main idea is to use the AllReduce API to synchronize messages across nodes; it supports any operation that satisfies the Reduce rules. Machine learning models are usually aggregated by summing or averaging, so AllReduce handles this naturally. The standard AllReduce API has several implementations.
AllReduce is simple and convenient, which makes it well suited to parallel training with synchronous algorithms. Many deep learning systems still use it for communication in distributed training, for example the gloo communication library from Caffe2, Baidu's DeepSpeech system, and NVIDIA's NCCL communication library.
However, AllReduce supports only synchronous communication, and every worker node runs the same logic, which means each worker must hold the complete model. This makes it unsuitable for very large models.
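To make the AllReduce semantics concrete, here is a single-process simulation of one possible implementation, the ring variant. The choice of the ring algorithm, the function name, and the simplification that each worker's vector has exactly as many elements as there are workers are assumptions made for this sketch; real libraries work on chunked tensors over a network.

```python
def ring_allreduce(vectors):
    """Simulate a sum AllReduce over n workers arranged in a ring.

    vectors[i] is worker i's local vector. For simplicity every vector has
    exactly n elements, so "chunk j" is just element j.
    Returns one buffer per worker; all buffers hold the element-wise sum.
    """
    n = len(vectors)
    bufs = [list(v) for v in vectors]

    # Phase 1 (scatter-reduce): in step s, worker i sends chunk (i - s) % n
    # to its right neighbour, which adds it to its own copy. After n - 1
    # steps, worker i holds the complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, bufs[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            bufs[(src + 1) % n][chunk] += value

    # Phase 2 (all-gather): the completed chunks travel around the ring,
    # overwriting the partial sums, until every worker has every chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, bufs[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            bufs[(src + 1) % n][chunk] = value

    return bufs


# Hypothetical local gradients on three workers.
grads = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
summed = ring_allreduce(grads)
averaged = [[v / len(grads) for v in buf] for buf in summed]
print(averaged[0])
```

Every worker ends up with the same summed vector, and dividing by the number of workers gives the averaged gradient that synchronous training applies.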
Limitations of AllReduce:
As the number of worker nodes grows and the computation becomes unbalanced, training speed is determined by the slowest node in the system; if one worker fails, the whole system has to stop.
Also, when a model has too many parameters, it exceeds the memory capacity of a single machine.
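Both limitations come down to simple arithmetic, sketched below. The per-worker timings are made-up numbers, and the memory estimate assumes 4-byte (float32) parameters and ignores gradients, optimizer state, and activations.

```python
def sync_step_time(worker_times):
    """A synchronous step finishes only when the slowest worker does."""
    return max(worker_times)


def utilization(worker_times):
    """Fraction of the step that workers spend computing rather than waiting."""
    step = sync_step_time(worker_times)
    return sum(worker_times) / (step * len(worker_times))


def param_memory_gib(num_params, bytes_per_param=4):
    """Memory needed just to store the parameters, in GiB (float32 assumed)."""
    return num_params * bytes_per_param / 2**30


balanced = [1.0, 1.0, 1.0, 1.0]    # hypothetical per-step compute times (s)
unbalanced = [1.0, 1.0, 1.0, 4.0]  # same cluster with one straggler

print(sync_step_time(balanced), utilization(balanced))
print(sync_step_time(unbalanced), utilization(unbalanced))
print(param_memory_gib(2**30))
```

With one straggler that is four times slower, the step takes four times as long and the other workers sit idle for most of it; and since every worker holds the full model, the parameter memory cannot be spread across machines.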

2. MXNet: Parameter Server Model
In the parameter server framework, the nodes in the system are logically divided into workers and servers. Each worker is responsible for its local training task and communicates with the parameter server through the server's interface, either to fetch the latest model parameters or to send its latest locally trained model. With a parameter server, training can be synchronous, asynchronous, or a mix of the two.
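The worker/server split can be sketched in a few lines. The class names, the `pull`/`push` method names, and the choice to push gradients rather than whole models are assumptions for this illustration and do not reflect MXNet's actual API. Workers are interleaved sequentially here; in a real system they run concurrently, so the parameters a worker pulled may already be stale when it pushes.

```python
class ParameterServer:
    """Holds the global parameter; workers reach it only through pull/push."""

    def __init__(self, w, lr):
        self.w = w
        self.lr = lr
        self.version = 0  # number of updates applied so far

    def pull(self):
        """Return the latest global parameter."""
        return self.w

    def push(self, grad):
        """Apply a gradient as soon as it arrives (asynchronous update)."""
        self.w -= self.lr * grad
        self.version += 1


class Worker:
    """Trains on its own data shard for the toy model y = w * x."""

    def __init__(self, server, shard):
        self.server = server
        self.shard = shard

    def step(self):
        w = self.server.pull()
        g = sum(2.0 * (w * x - y) * x for x, y in self.shard) / len(self.shard)
        self.server.push(g)


# Hypothetical data generated by y = 3x, split across two workers.
server = ParameterServer(w=0.0, lr=0.01)
workers = [
    Worker(server, [(1.0, 3.0), (2.0, 6.0)]),
    Worker(server, [(3.0, 9.0), (4.0, 12.0)]),
]

# Worker 0 is "faster": it completes two steps for every step of worker 1.
for _ in range(30):
    for idx in (0, 0, 1):
        workers[idx].step()

print(server.w, server.version)
```

A synchronous variant would instead collect one gradient from every worker before applying a single averaged update, which is the mode the server chooses, not the workers.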

3. TensorFlow: Dataflow Model
Computational graph model in TensorFlow: computation is described as a directed acyclic dataflow graph. The nodes in the graph represent computations and the edges represent the flow of data.
Dataflow-based distributed machine learning systems borrow the flexibility of DAG-based big data processing systems and describe a computing task as a directed acyclic dataflow graph. The nodes in the graph represent operations on the data, and the edges represent the dependencies between operations.
The system automatically provides distributed execution of the dataflow graph, so the user only needs to care about designing an appropriate dataflow graph to express the algorithm to be executed.
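The idea of nodes as operations and edges as dependencies can be illustrated with a toy single-machine graph executor. This is not TensorFlow's API; the `Node` class and `run` function are hypothetical names for this sketch, and it does not attempt distributed placement. It only shows that once the graph is declared, an executor can derive a valid execution order from the dependencies alone.

```python
class Node:
    """One operation in the graph; `inputs` are the nodes it depends on."""

    def __init__(self, name, op, inputs=()):
        self.name = name
        self.op = op  # None marks a placeholder fed from outside
        self.inputs = list(inputs)


def run(output, feed):
    """Evaluate `output` by visiting its dependencies first (depth-first),
    caching values so that a shared node is computed only once.
    Returns the output value and the order in which nodes were evaluated."""
    cache = {}
    order = []

    def visit(node):
        if node.name in cache:
            return cache[node.name]
        if node.op is None:
            value = feed[node.name]
        else:
            value = node.op(*[visit(n) for n in node.inputs])
        cache[node.name] = value
        order.append(node.name)
        return value

    return visit(output), order


# Graph for x * w + b.
x = Node("x", None)
w = Node("w", None)
b = Node("b", None)
mul = Node("mul", lambda a, c: a * c, [x, w])
add = Node("add", lambda a, c: a + c, [mul, b])

result, order = run(add, {"x": 2.0, "w": 3.0, "b": 1.0})
print(result, order)
```

Because `mul` and `b` do not depend on each other, an executor is free to compute them on different devices or machines, which is the property a distributed dataflow system exploits.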
Below, a dataflow graph from TensorFlow is used as an example to introduce a typical dataflow graph.

Distributed Machine Learning Algorithms

[ch9]