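1. Launching parallel training on a single machine with multiple GPUs

PaddlePaddle 2.0 adds the paddle.distributed.spawn function for launching single-machine multi-GPU training. The existing paddle.distributed.launch approach is still available.

  • paddle.distributed.launch takes a program file and starts multiple processes from that file to run synchronous multi-GPU training. The AI Studio script-task instructions used to recommend this method for multi-GPU jobs. The launch approach places higher demands on process management.
  • paddle.distributed.spawn starts the processes from a function rather than a file. It gives finer control over the processes and behaves better when printing logs and when training exits. This is the currently recommended usage.

The two methods are described in turn below.

1.1 Single-machine multi-GPU method 1: launch

1.1.1 With the high-level API

  • When training is driven by the high-level paddle.Model API, starting single-machine multi-GPU training is very simple. The code needs no changes; only add -m paddle.distributed.launch when starting it.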

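    # Single machine, single card: uses card 0 by default
    $ python train.py
    # Single machine, multiple cards: uses all currently visible cards by default
    $ python -m paddle.distributed.launch train.py
    # Single machine, multiple cards: use cards 0 and 1
    $ python -m paddle.distributed.launch --gpus='0,1' train.py
    # Single machine, multiple cards: use cards 0 and 1
    $ export CUDA_VISIBLE_DEVICES='0,1'
    $ python -m paddle.distributed.launch train.py
  • The following is an example using the high-level API. Running the cell writes hapitrain.py to the root directory; the training can then be started with python.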

%%writefile hapitrain.py
import paddle
from paddle.vision.transforms import ToTensor

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# The network inherits from paddle.nn.Layer; paddle.Model wraps it and adds the training functionality
model = paddle.Model(lenet)

# Set the optimizer, loss and metric needed for training
model.prepare(
    paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
    paddle.nn.CrossEntropyLoss(),
    paddle.metric.Accuracy(topk=(1, 2))
)

# Start training
model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)
# Start evaluation
model.evaluate(test_dataset, log_freq=100, batch_size=64)

Single machine, single card (uses card 0 by default)

# Single machine, single card: uses card 0 by default
!python hapitrain.py
Begin to download

Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-labels-idx1-ubyte.gz
Begin to download
........
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-images-idx3-ubyte.gz
Begin to download Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-labels-idx1-ubyte.gz
Begin to download
..
Download finished
W0628 15:25:11.488023 114 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:25:11.614305 114 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0555 - acc_top1: 0.9217 - acc_top2: 0.9649 - 50ms/step
step 800/938 - loss: 0.0300 - acc_top1: 0.9454 - acc_top2: 0.9782 - 39ms/step
step 938/938 - loss: 0.0213 - acc_top1: 0.9498 - acc_top2: 0.9803 - 38ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0057 - acc_top1: 0.9731 - acc_top2: 0.9927 - 28ms/step
step 157/157 - loss: 0.0013 - acc_top1: 0.9785 - acc_top2: 0.9945 - 28ms/step
Eval samples: 10000
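
The step counts in the log above follow from the dataset size and the batch size: 938 training steps with 64-sample batches, and 157 evaluation steps for the 10000 evaluation samples. The sketch below reproduces that arithmetic in plain Python. The helper name steps_per_epoch is hypothetical, the figure of 60000 MNIST training samples is not stated in the log, and the num_workers argument assumes that each process receives an equal share of the data when several cards are used, which depends on how the data loader is set up.

```python
import math


def steps_per_epoch(num_samples, batch_size, num_workers=1):
    """Return how many batches one worker processes per epoch.

    Assumes the samples are split evenly across workers and that the
    last, possibly smaller, batch is kept rather than dropped.
    """
    samples_per_worker = math.ceil(num_samples / num_workers)
    return math.ceil(samples_per_worker / batch_size)


print(steps_per_epoch(60000, 64))      # training set, one card
print(steps_per_epoch(10000, 64))      # evaluation set, one card
print(steps_per_epoch(60000, 64, 2))   # training set shared by two cards
```

Under these assumptions, an unchanged step count in a multi-card run is a hint that only one process was actually started.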

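Single machine, multiple cards (uses all currently visible cards by default)

# Single machine, multiple cards: uses all currently visible cards by default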
!python -m paddle.distributed.launch hapitrain.py
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
----------- Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
WARNING 2021-06-28 15:26:17,473 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:26:17,475 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:35079 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:35079 |
| FLAGS_selected_gpus 0 |
+=======================================================================================+
INFO 2021-06-28 15:26:17,475 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
W0628 15:26:24.305920 285 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:26:24.311555 285 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0586 - acc_top1: 0.9130 - acc_top2: 0.9611 - 38ms/step
step 800/938 - loss: 0.0288 - acc_top1: 0.9397 - acc_top2: 0.9759 - 39ms/step
step 938/938 - loss: 0.0545 - acc_top1: 0.9448 - acc_top2: 0.9785 - 40ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0035 - acc_top1: 0.9677 - acc_top2: 0.9911 - 36ms/step
step 157/157 - loss: 0.0057 - acc_top1: 0.9723 - acc_top2: 0.9929 - 36ms/step
Eval samples: 10000
INFO 2021-06-28 15:27:26,569 launch.py:240] Local processes completed.

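Single machine, multiple cards, intended to select cards 0 and 1. The run below also completes on the single-card AI Studio machine. Note that the variable name in the command is misspelled (the real name is CUDA_VISIBLE_DEVICES) and is assigned without export, so the setting never reaches the launcher. The log shows one process on card 0, the same as the default run.

# Single machine, multiple cards, intended to select cards 0 and 1 (the misspelled, unexported variable has no effect)
!CUDA_VISIABLE_DEVICES='0,1' && python -m paddle.distributed.launch hapitrain.py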
-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
WARNING 2021-06-28 15:28:10,632 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:28:10,637 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:46909 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:46909 |
| FLAGS_selected_gpus 0 |
+=======================================================================================+
INFO 2021-06-28 15:28:10,637 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
W0628 15:28:19.819196 448 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:28:19.905493 448 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0376 - acc_top1: 0.9136 - acc_top2: 0.9610 - 37ms/step
step 800/938 - loss: 0.0159 - acc_top1: 0.9423 - acc_top2: 0.9764 - 35ms/step
step 938/938 - loss: 0.0444 - acc_top1: 0.9479 - acc_top2: 0.9791 - 35ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0039 - acc_top1: 0.9767 - acc_top2: 0.9939 - 36ms/step
step 157/157 - loss: 0.0029 - acc_top1: 0.9815 - acc_top2: 0.9952 - 35ms/step
Eval samples: 10000
INFO 2021-06-28 15:29:19,766 launch.py:240] Local processes completed.
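
The two ways of selecting cards used in this section, a launcher flag and the CUDA_VISIBLE_DEVICES environment variable, can also be driven from Python. This is a minimal sketch, assuming the gpus option that appears in the Configuration Arguments listing of the logs; the helper names launch_command and launch_env are hypothetical. The point of launch_env is that the variable has to be part of the child process's environment, which a plain shell assignment followed by && does not achieve.

```python
import os
import sys


def launch_command(script, gpus=None):
    """Build the argument list for `python -m paddle.distributed.launch`."""
    cmd = [sys.executable, "-m", "paddle.distributed.launch"]
    if gpus:
        cmd.append("--gpus=" + ",".join(str(g) for g in gpus))
    cmd.append(script)
    return cmd


def launch_env(visible_gpus, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in visible_gpus)
    return env


if __name__ == "__main__":
    print(launch_command("hapitrain.py", gpus=[0, 1]))
    print(launch_env([0, 1], base_env={})["CUDA_VISIBLE_DEVICES"])
    # To actually start training (requires paddle and the listed cards):
    # import subprocess
    # subprocess.run(launch_command("hapitrain.py"), env=launch_env([0, 1]), check=True)
```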

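1.1.2 With the basic API

  • To launch single-machine multi-GPU training from code written with the basic API, the single-card code needs three changes. Compare the unmodified and modified versions below.

The three changes:

  • Change 1: import the library

import paddle.distributed as dist

  • Change 2: initialize the parallel environment

dist.init_parallel_env()

  • Change 3: wrap the network with paddle.DataParallel

net = paddle.DataParallel(paddle.vision.models.LeNet())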
import paddle  # unmodified version
from paddle.vision.transforms import ToTensor

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# Load the training set with batch_size 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=lenet.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data, y_data = data
            predicts = lenet(x_data)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            acc = paddle.metric.accuracy(predicts, y_data, k=1)
            avg_acc = paddle.mean(acc)
            loss.backward()
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))
            adam.step()
            adam.clear_grad()

# Start training
train()

> epoch: 0, batch_id: 0, loss is: [2.7922328], acc is: [0.15625]
> epoch: 0, batch_id: 400, loss is: [0.10373791], acc is: [0.96875]
> epoch: 0, batch_id: 800, loss is: [0.01435608], acc is: [1.]

This is the basic-API version with the three changes.
As before, the %%writefile normaltrain.py command first saves the file to the root directory.

%%writefile normaltrain.py
import paddle  # version with the three changes
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist  # Change 1: import the library

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    # Change 2: initialize the parallel environment
    dist.init_parallel_env()
    # Change 3: wrap the network with paddle.DataParallel
    net = paddle.DataParallel(paddle.vision.models.LeNet())  # the manual omits the full module path of LeNet here
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly writes avg_loss here
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the manual mistakenly writes avg_loss here
            adam.step()
            adam.clear_grad()

# Start training
train()
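# Single machine, single card (card 0 by default). Running the modified code this way reports an error
# !python normaltrain.py
# Single machine, multiple cards: uses all currently visible cards by default
!python -m paddle.distributed.launch normaltrain.py
# Single machine, multiple cards, intended to select cards 0 and 1. The variable name is misspelled
# (the real name is CUDA_VISIBLE_DEVICES) and not exported, so all currently visible cards are used;
# with only a single card this does not report an error
!CUDA_VISIABLE_DEVICES='0,1' && python -m paddle.distributed.launch normaltrain.py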
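
Every process started by the launcher runs the same script, so every process prints its own log lines. The Distributed Envs table in the launch logs lists PADDLE_TRAINER_ID and PADDLE_TRAINERS_NUM, and a script could read them to print from one process only. This is a sketch under the assumption that the worker processes inherit these variables, as the table suggests; the helper names trainer_info and should_log are hypothetical.

```python
import os


def trainer_info(env=None):
    """Return (trainer_id, trainers_num) from the launcher's variables.

    Falls back to (0, 1), i.e. a single process, when they are not set.
    """
    env = os.environ if env is None else env
    trainer_id = int(env.get("PADDLE_TRAINER_ID", "0"))
    trainers_num = int(env.get("PADDLE_TRAINERS_NUM", "1"))
    return trainer_id, trainers_num


def should_log(batch_id, every=400, env=None):
    """True only on trainer 0 and only every `every` batches."""
    trainer_id, _ = trainer_info(env)
    return trainer_id == 0 and batch_id % every == 0


# A made-up environment standing in for the second of two workers.
fake_env = {"PADDLE_TRAINER_ID": "1", "PADDLE_TRAINERS_NUM": "2"}
print(trainer_info(fake_env))
print(should_log(400, env=fake_env))
print(should_log(400, env={}))
```

In the training loop, the condition batch_id % 400 == 0 could then be replaced by should_log(batch_id).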

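1.2 Single-machine multi-GPU method 2: spawn (recommended)

Like putting an item in a box to ship it, you only need to hand the train function that is to run in parallel to paddle.distributed.spawn. The calls look like this:

import paddle.distributed as dist

# Start multi-process training of train, using all visible GPU cards by default
if __name__ == '__main__':
    dist.spawn(train)

# Start 2 processes running train, using the first 2 currently visible cards by default
if __name__ == '__main__':
    dist.spawn(train, nprocs=2)

# Start 2 processes running train, using cards 4 and 5
if __name__ == '__main__':
    dist.spawn(train, nprocs=2, selected_gpus='4,5')
  • Basic-API case (whether or not the code is changed as for launch): reports an error in an AI Studio notebook, works in a real multi-GPU environment, and works from the AI Studio command line.
  • High-level-API case: reports an error in an AI Studio notebook, works from the AI Studio command line.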
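
How the process count and the card selection combine in the three spawn calls above can be pictured as a mapping from process rank to card index. The function below only illustrates that reading of the comments; it is not Paddle's implementation, and assign_cards and its arguments are hypothetical names.

```python
def assign_cards(visible_cards, nprocs=-1, selected_gpus=None):
    """Map process rank -> card index, following the comments above.

    visible_cards: card indices the program can see, e.g. [0, 1, 2, 3]
    nprocs:        number of processes; -1 means one per candidate card
    selected_gpus: optional comma-separated string such as '4,5'
    """
    if selected_gpus is not None:
        candidates = [int(c) for c in selected_gpus.split(",")]
    else:
        candidates = list(visible_cards)
    if nprocs == -1:
        nprocs = len(candidates)
    if nprocs > len(candidates):
        raise ValueError("more processes requested than cards available")
    return {rank: candidates[rank] for rank in range(nprocs)}


print(assign_cards([0, 1, 2, 3]))                                   # all visible cards
print(assign_cards([0, 1, 2, 3], nprocs=2))                         # first two visible cards
print(assign_cards(list(range(8)), nprocs=2, selected_gpus="4,5"))  # cards 4 and 5
```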
%%writefile normal3spawn.py
import paddle  # version with the three changes
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist  # Change 1: import the library

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    # Change 2: initialize the parallel environment
    dist.init_parallel_env()
    # Change 3: wrap the network with paddle.DataParallel
    net = paddle.DataParallel(paddle.vision.models.LeNet())  # the manual omits the full module path of LeNet here
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly writes avg_loss here
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the manual mistakenly writes avg_loss here
            adam.step()
            adam.clear_grad()

# Start multi-process training of train, using all visible GPU cards by default
if __name__ == '__main__':
    dist.spawn(train)

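1.3 Brief summary for single-machine multi-GPU

The error that spawn reports in a notebook is probably caused by the notebook's restrictions on process management; this is a guess. From the command line, or from a cell with a leading exclamation mark, there is no problem.

spawn does not require changes to the inner part of the code. Adding the single call dist.spawn(train) puts a multi-process shell around the training code. It is simple and convenient, and it is the recommended way to set up single-machine multi-GPU training. (The normal3spawn.py example still keeps dist.init_parallel_env() and paddle.DataParallel inside train.)

Where spawn is not supported, fall back to launch for single-machine multi-GPU training.

PaddlePaddle's full set of parallel modes:

  • Data parallelism: for data parallelism, the mode most widely used in industry, PaddlePaddle has refined several techniques around real business needs. It provides both a collective communication architecture and a parameter server architecture, supports the synchronous and asynchronous training mechanisms common in industrial practice, and provides distributed optimization algorithms whose convergence is assured.
  • Pipeline parallelism: aimed at heterogeneous hardware, pipeline parallelism splits the model computation across different hardware and pipelines it fully, which greatly raises the overall utilization of heterogeneous hardware.
  • Model parallelism: for very large-scale classification problems, PaddlePaddle provides model parallelism that parallelizes both computation and storage, solving problems that a single GPU cannot.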
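
Synchronous data parallelism can be illustrated without any framework: every worker computes a gradient on its own shard of the batch, the gradients are averaged across workers, and all workers apply the same update. The sketch below uses a one-parameter least-squares model in plain Python and shows that, with equal shard sizes, the averaged gradient equals the gradient over the whole batch. It is a toy model of the idea, not Paddle's collective communication implementation, and all names in it are illustrative.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def shard(items, num_workers):
    """Give worker r every num_workers-th item, starting at index r."""
    return [items[r::num_workers] for r in range(num_workers)]


def data_parallel_grad(w, xs, ys, num_workers):
    """Average the per-worker gradients, as an all-reduce mean would."""
    x_shards, y_shards = shard(xs, num_workers), shard(ys, num_workers)
    local = [grad_mse(w, xs_r, ys_r) for xs_r, ys_r in zip(x_shards, y_shards)]
    return sum(local) / num_workers


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)                             # one worker, whole batch
split = data_parallel_grad(0.5, xs, ys, num_workers=2)   # two workers, half each
print(full, split)
```

The equality relies on every worker holding the same number of samples; with uneven shards the plain mean of local gradients would weight samples differently.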

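1.4 Launching distributed jobs with fleetrun

1.4.1 Launching distributed jobs with fleetrun

Paddle provides the fleetrun command-line launcher. Together with Paddle's high-level distributed API paddle.distributed.fleet, it starts distributed jobs in either collective communication mode or parameter server mode. fleetrun works in both static graph and dynamic graph scenarios.

Note: for dynamic graph distributed training, paddle.distributed.fleet currently supports only the collective communication (Collective Communication) mode, not the parameter server (Parameter-Server) mode.

  • GPU training on a single machine with multiple cards

To start a job on 4 cards of one machine, specify 4 idle cards with --gpus.

    fleetrun --gpus=0,1,2,3 train.py

Note: if export CUDA_VISIBLE_DEVICES=0,1,2,3 has been set, you can run directly:

    export CUDA_VISIBLE_DEVICES=0,1,2,3
    fleetrun train.py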
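  • GPU training on multiple machines with multiple cards

[Example 1] 2 machines, 8 cards (4 cards per node)

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1,2,3 train.py

Note: if export CUDA_VISIBLE_DEVICES=0,1,2,3 has been set on every machine, you can start directly on each node:

    export CUDA_VISIBLE_DEVICES=0,1,2,3
    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" train.py

[Example 2] 2 machines, 16 cards (8 cards per node, assuming all 8 cards on each machine are available)

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" train.py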
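
The card counts in the two examples are the number of machines times the cards per machine. The sketch below enumerates the resulting trainer processes for given --ips and --gpus values. The ordering of ranks (machines in the order listed, then cards in the order listed) is an assumption made for illustration, and trainer_layout is a hypothetical helper, not part of fleetrun.

```python
def trainer_layout(ips, gpus):
    """List (rank, ip, card) for every trainer process.

    ips:  comma-separated machine addresses, as passed to --ips
    gpus: comma-separated card indices used on each machine, as passed to --gpus
    Rank order is assumed: machines first, then cards within a machine.
    """
    ip_list = [ip.strip() for ip in ips.split(",")]
    card_list = [int(g) for g in gpus.split(",")]
    layout = []
    for ip in ip_list:
        for card in card_list:
            layout.append((len(layout), ip, card))
    return layout


two_by_four = trainer_layout("xx.xx.xx.xx,yy.yy.yy.yy", "0,1,2,3")
print(len(two_by_four))   # number of trainers in example 1
print(two_by_four[4])     # first trainer on the second machine
```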

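1.4.2 Single-machine multi-GPU training with Fleet

Dynamic graph distributed training with the Fleet interface is simple. A basic-API program needs only three changes:

  • Import the paddle.distributed.fleet package

      from paddle.distributed import fleet
  • Initialize the fleet environment

      fleet.init(is_collective=True)
  • Obtain the distributed optimizer and the distributed model through fleet

      strategy = fleet.DistributedStrategy()
      adam = fleet.distributed_optimizer(adam, strategy=strategy)
      dp_layer = fleet.distributed_model(layer)

The example from the Fleet manual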

%%writefile train_fleet.py
# -*- coding: UTF-8 -*-
import paddle
import paddle.nn as nn
# Distributed step 1: import the paddle.distributed.fleet package
from paddle.distributed import fleet

# Define a fully connected network; it must inherit from nn.Layer
class LinearNet(nn.Layer):
    def __init__(self):
        super(LinearNet, self).__init__()
        self._linear1 = nn.Linear(10, 10)
        self._linear2 = nn.Linear(10, 1)

    def forward(self, x):
        return self._linear2(self._linear1(x))

# 1. Enable dynamic graph mode
paddle.disable_static()

# Distributed step 2: initialize fleet
fleet.init(is_collective=True)

# 2. Define the network object, loss function and optimizer
layer = LinearNet()
loss_fn = nn.MSELoss()
adam = paddle.optimizer.Adam(
    learning_rate=0.001, parameters=layer.parameters())

# Distributed step 3: obtain the distributed optimizer and distributed model through fleet
strategy = fleet.DistributedStrategy()
adam = fleet.distributed_optimizer(adam, strategy=strategy)
dp_layer = fleet.distributed_model(layer)

for step in range(20):
    # 3. Run the forward pass
    inputs = paddle.randn([10, 10], 'float32')
    outputs = dp_layer(inputs)
    labels = paddle.randn([10, 1], 'float32')
    loss = loss_fn(outputs, labels)

    print("step:{}\tloss:{}".format(step, loss.numpy()))

    # 4. Run the backward pass and update the parameters
    loss.backward()

    adam.step()
    adam.clear_grad()
!fleetrun --gpus=0 train_fleet.py
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
----------- Configuration Arguments -----------
gpus: 0
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: train_fleet.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
WARNING 2021-06-28 15:56:16,986 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:56:16,990 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:47263 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:47263 |
| FLAGS_selected_gpus 0 |
+=======================================================================================+
INFO 2021-06-28 15:56:16,991 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
W0628 15:56:18.760403 1539 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:56:18.826562 1539 device_context.cc:372] device: 0, cuDNN Version: 7.6.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py:633: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
"It is recommended to use DistributedStrategy "
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py:423: UserWarning: The program will return to single-card operation. Please check 1, whether you use spawn or fleetrun to start the program. 2, Whether it is a multi-card program. 3, Is the current environment multi-card.
warnings.warn("The program will return to single-card operation. "
step:0 loss:[2.747072]
step:1 loss:[3.9464068]
step:2 loss:[3.3363562]
step:3 loss:[1.7597802]
step:4 loss:[2.4984336]
step:5 loss:[1.3766874]
step:6 loss:[3.3678422]
step:7 loss:[1.8410085]
step:8 loss:[1.6417965]
step:9 loss:[4.009201]
step:10 loss:[1.7387416]
step:11 loss:[1.6013482]
step:12 loss:[1.6388085]
step:13 loss:[3.7573469]
step:14 loss:[0.9461777]
step:15 loss:[2.4906065]
step:16 loss:[2.613153]
step:17 loss:[2.8367076]
step:18 loss:[2.170548]
step:19 loss:[2.2705061]
INFO 2021-06-28 15:56:35,049 launch.py:240] Local processes completed.

2. Handwritten digit recognition with Fleet, in several API versions

2.1 Handwritten digit recognition, basic API, Fleet version

%%writefile normal_fleet.py
import paddle  # version with the three changes
from paddle.vision.transforms import ToTensor
# Distributed step 1: import the paddle.distributed.fleet package
from paddle.distributed import fleet

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

# Distributed step 2: initialize fleet
fleet.init(is_collective=True)

def train():
    epochs = 1
    net = paddle.vision.models.LeNet()
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    # Distributed step 3: obtain the distributed optimizer and distributed model through fleet
    strategy = fleet.DistributedStrategy()
    adam = fleet.distributed_optimizer(adam, strategy=strategy)
    net = fleet.distributed_model(net)
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly writes avg_loss here
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the manual mistakenly writes avg_loss here
            adam.step()
            adam.clear_grad()

if __name__ == '__main__':
    train()
!fleetrun --gpus=0 normal_fleet.py
 +=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:42501 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:42501 |
| FLAGS_selected_gpus 0 |
+=======================================================================================+
epoch: 0, batch_id: 0, loss is: [2.5425684], acc is: [0.234375]
epoch: 0, batch_id: 400, loss is: [0.05207598], acc is: [1.]
epoch: 0, batch_id: 800, loss is: [0.04818164], acc is: [1.]

2.2 Handwritten digit recognition, high-level API, Fleet version

%%writefile hapi_fleet.py
import paddle
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# The network inherits from paddle.nn.Layer; paddle.Model wraps it and adds the training functionality
model = paddle.Model(lenet)

# Set the optimizer, loss and metric needed for training
model.prepare(
    paddle.optimizer.Adam(learning_rate=0.1, parameters=model.parameters()),
    paddle.nn.CrossEntropyLoss(),
    paddle.metric.Accuracy(topk=(1, 2))
)

def train():
    # Start training
    # With VisualDL visualization
    callback = paddle.callbacks.VisualDL(log_dir='visualdl_log')
    model.fit(train_dataset, epochs=1, batch_size=64, callbacks=callback, log_freq=400)
    # Without VisualDL visualization
    # model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)
    # Start evaluation
    # model.evaluate(test_dataset, log_freq=20, batch_size=64)

if __name__ == '__main__':
    train()
!fleetrun hapi_fleet.py

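2.3 Handwritten digit recognition on multiple machines with multiple cards

Moving from single-machine multi-GPU to multi-machine multi-GPU training needs no code changes; only the launch command changes. With 2 machines and 4 cards as an example:

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1 dygraph_fleet.py

Run the command above on each of the 2 machines. fleetrun starts a multi-process job in the background on each machine and runs the distributed multi-machine training. The log output appears on the console of the machine whose IP is xx.xx.xx.xx.

AI Studio is used again below to demonstrate the multi-machine command form. With the local address 127.0.0.1 as the only entry in --ips, it runs on a single machine:

!fleetrun --ips="127.0.0.1" --gpus=0 normal_fleet.py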

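3. Summary of parallel computing in PaddlePaddle 2.0

PaddlePaddle 2.0 offers a complete solution for parallel computing, in a training framework that has been tested on very large-scale business data. Parallel computing really is this simple.

3.1 For a single machine with multiple cards, prefer spawn

The advantage of spawn is that the code hardly changes: import paddle.distributed and call the training function through spawn at the end. spawn also gives finer control over the processes and behaves better when printing logs and when training exits.

Only these statements need to be added to the program:

    import paddle.distributed as dist
    if __name__ == '__main__':
        dist.spawn(train)

Then start training directly with python train.py.

3.2 For multiple machines with multiple cards, use fleet

A basic-API program needs three corresponding changes:

  • Import the paddle.distributed.fleet package

      from paddle.distributed import fleet
  • Initialize the fleet environment

      fleet.init(is_collective=True)
  • Obtain the distributed optimizer and the distributed model through fleet

      strategy = fleet.DistributedStrategy()
      adam = fleet.distributed_optimizer(adam, strategy=strategy)
      dp_layer = fleet.distributed_model(layer)
  • Then run the command:
    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1 train.py

3.3 With high-level API code, the program needs no changes; just run the fleetrun command.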

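4. Visualizing parallel training with VisualDL

VisualDL is a visualization tool designed for deep learning tasks. It presents data in a rich set of charts, so users can see the characteristics and trends of the data more directly and clearly. This helps with analyzing data, spotting errors early, and improving the design of neural network models. If you like it, consider giving the project a star.

AI Studio Notebook projects (Paddle 1.8.0 and later) have VisualDL built in; the VisualDL service can be started from the visualization tab.

4.1 Visualization with VisualDL

In a high-level API program, add callback = paddle.callbacks.VisualDL(log_dir='visualdl_log') and pass callbacks=callback to model.fit, that is: model.fit(train_dataset, epochs=1, batch_size=64, callbacks=callback, log_freq=400)

The hapi_fleet.py code above already includes the VisualDL callback, and an earlier cell has already run !fleetrun hapi_fleet.py, so the visualization can be opened in AI Studio right away:

In the left-hand tab bar: Visualization -> set logdir -> click Add -> choose visualdl_log/ -> click Start VisualDL service -> click Open VisualDL. The page that opens shows the training loss, accuracy and other statistics.

4.2 Sharing visualization results with VisualDL-Service

  • This feature was added in VisualDL 2.0.4 and requires VisualDL 2.0.4 or later. A single command, visualdl service upload, uploads your log files to the remote service.

  • This feature is highly recommended: once the files are uploaded there is no need to keep them locally; just open the generated link.

  • If VisualDL 2.0.4 or later is not installed, install it with pip install visualdl==2.0.5

  • After running the code below, open the generated link; anyone can then view and analyze the training process.

!pip install -U visualdl -q # ==2.0.5

!visualdl service upload --logdir visualdl_log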
