[MXNet] Part 7: Multi-GPU Parallel Programming

Original source

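I. Overview of the Approach

Suppose a machine has k GPUs. Given a model to train, each GPU independently maintains a complete copy of the model parameters.

In any training iteration, given a mini-batch, we split its samples into k shards and assign one shard to each GPU.

Each GPU then computes the gradient of the model parameters from the training samples it received and the parameter copy it maintains.

Next, we add up the gradients computed on the k GPUs to obtain the gradient for the current mini-batch.

Finally, every GPU uses this mini-batch gradient to update its own complete copy of the model parameters.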
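The five steps above can be tried out without any GPU. The following is a minimal pure-Python sketch, assuming a one-parameter least-squares model (y ≈ w·x) with plain Python lists standing in for devices. The names `grad_sum`, `shards`, and `replicas` and the toy data are made up for illustration and do not come from the original tutorial. It shows that adding the per-shard gradients reproduces the gradient of the whole mini-batch.

```python
# Toy data-parallel step: Python lists stand in for GPUs (illustrative assumption).

def grad_sum(w, xs, ys):
    """Sum over samples of d/dw 0.5 * (w*x - y)^2, which is (w*x - y) * x."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

k = 2                                   # number of simulated devices
xs = [1.0, 2.0, 3.0, 4.0]               # one mini-batch of 4 samples
ys = [2.0, 4.0, 6.0, 8.0]
replicas = [0.5] * k                    # step 1: each device keeps a full copy of w

# Step 2: split the mini-batch into k equal shards.
m = len(xs) // k
shards = [(xs[i * m:(i + 1) * m], ys[i * m:(i + 1) * m]) for i in range(k)]

# Step 3: each device computes a gradient from its own shard and its own copy.
local_grads = [grad_sum(w, sx, sy) for w, (sx, sy) in zip(replicas, shards)]

# Step 4: add the k local gradients to obtain the mini-batch gradient.
total_grad = sum(local_grads)

# Step 5: every device applies the same update to its own copy.
lr = 0.1
replicas = [w - lr * total_grad / len(xs) for w in replicas]

# Reference: the gradient computed on the whole mini-batch at once.
full_batch_grad = grad_sum(0.5, xs, ys)
print(total_grad, full_batch_grad, replicas)
```

Because every replica starts from the same value and receives the same summed gradient, the copies stay identical after the update, which is what allows any one of them to be used for evaluation.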

II. Network and Helper Functions

We use the LeNet from "Convolutional Neural Networks: From Scratch" as the example for this section:

# Imports used throughout this post. `gb` is assumed to be the tutorial's
# helper package (gluonbook), which must be installed separately.
import gluonbook as gb
import mxnet as mx
from mxnet import autograd, nd
from mxnet.gluon import loss as gloss
from time import time

# Initialize the model parameters.
scale = 0.01
W1 = nd.random.normal(scale=scale, shape=(20, 1, 3, 3))
b1 = nd.zeros(shape=20)
W2 = nd.random.normal(scale=scale, shape=(50, 20, 5, 5))
b2 = nd.zeros(shape=50)
W3 = nd.random.normal(scale=scale, shape=(800, 128))
b3 = nd.zeros(shape=128)
W4 = nd.random.normal(scale=scale, shape=(128, 10))
b4 = nd.zeros(shape=10)
params = [W1, b1, W2, b2, W3, b3, W4, b4]

# Define the model.
def lenet(X, params):
    h1_conv = nd.Convolution(data=X, weight=params[0], bias=params[1],
                             kernel=(3, 3), num_filter=20)
    h1_activation = nd.relu(h1_conv)
    h1 = nd.Pooling(data=h1_activation, pool_type="avg", kernel=(2, 2),
                    stride=(2, 2))
    h2_conv = nd.Convolution(data=h1, weight=params[2], bias=params[3],
                             kernel=(5, 5), num_filter=50)
    h2_activation = nd.relu(h2_conv)
    h2 = nd.Pooling(data=h2_activation, pool_type="avg", kernel=(2, 2),
                    stride=(2, 2))
    h2 = nd.flatten(h2)
    h3_linear = nd.dot(h2, params[4]) + params[5]
    h3 = nd.relu(h3_linear)
    y_hat = nd.dot(h3, params[6]) + params[7]
    return y_hat

# Cross-entropy loss function.
loss = gloss.SoftmaxCrossEntropyLoss()
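The first dimension of W3 (800) has to match the number of features left after the second pooling layer. The short sketch below works through that arithmetic, assuming 28x28 single-channel inputs (as in Fashion-MNIST), no padding, and output sizes rounded down. The helper `conv_out` is a made-up name for illustration, not an MXNet function.

```python
def conv_out(size, kernel, stride=1):
    """Output width of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1

size = 28                                   # assumed input width and height
size = conv_out(size, kernel=3)             # first convolution, 3x3 kernel
size = conv_out(size, kernel=2, stride=2)   # first 2x2 average pooling
size = conv_out(size, kernel=5)             # second convolution, 5x5 kernel
size = conv_out(size, kernel=2, stride=2)   # second 2x2 average pooling

flat_features = 50 * size * size            # 50 channels come out of the second convolution
print(size, flat_features)                  # prints: 4 800
```

The spatial size shrinks 28, 26, 13, 9, 4, so flattening 50 channels of 4x4 gives the 800 rows that W3 expects.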

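Copying the parameter list to a given device

The following function copies the model parameters [param 1, param 2, ...] to a specific GPU and marks them for gradient computation: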

def get_params(params, ctx):
    new_params = [p.copyto(ctx) for p in params]
    for p in new_params:
        p.attach_grad()
    return new_params

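Synchronizing one parameter across devices

The following function adds up the data of the same parameter held on each GPU, then broadcasts the result to all GPUs: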

def allreduce(data):  # input is a list holding the same parameter on different devices
    for i in range(1, len(data)):
        data[0][:] += data[i].copyto(data[0].context)  # copy entry i to device 0 and add it to entry 0
    for i in range(1, len(data)):
        data[0].copyto(data[i])  # overwrite entry i with the accumulated entry 0
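The two loops in allreduce (reduce into slot 0, then broadcast) can be traced on the CPU. The following pure-Python sketch uses nested lists in place of NDArrays on different devices; `allreduce_lists` is an illustrative name, and the slice assignments play the role of the in-place `[:]` update and `copyto` in the original.

```python
def allreduce_lists(data):
    """Sum every copy into data[0], then overwrite the other copies in place."""
    for i in range(1, len(data)):
        # Reduce: accumulate entry i into entry 0.
        data[0][:] = [a + b for a, b in zip(data[0], data[i])]
    for i in range(1, len(data)):
        # Broadcast: replace the contents of entry i with the accumulated sum.
        data[i][:] = data[0]

grads = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]  # one gradient per simulated device
allreduce_lists(grads)
print(grads)  # [[111.0, 222.0], [111.0, 222.0], [111.0, 222.0]]
```

After the call every entry holds the same sum, while each entry is still its own list, just as each GPU keeps its own buffer.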

Splitting data across devices

Given a batch of data samples, the following function splits them and copies the shards to the GPUs:

def split_and_load(data, ctx):
    n, k = data.shape[0], len(ctx)
    m = n // k
    assert m * k == n, 'number of examples is not divisible by number of devices.'
    return [data[i * m: (i + 1) * m].as_in_context(ctx[i]) for i in range(k)]
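The slicing logic of split_and_load can be checked on an ordinary list. This is a minimal sketch that leaves out the device copy (`as_in_context`); `split_evenly` is a hypothetical name, and it raises ValueError where the original uses an assert.

```python
def split_evenly(data, k):
    """Cut data into k contiguous shards of equal length, or fail loudly."""
    n = len(data)
    m = n // k
    if m * k != n:
        raise ValueError('%d examples cannot be split evenly across %d devices' % (n, k))
    return [data[i * m:(i + 1) * m] for i in range(k)]

batch = list(range(6))
print(split_evenly(batch, 2))   # [[0, 1, 2], [3, 4, 5]]
print(split_evenly(batch, 3))   # [[0, 1], [2, 3], [4, 5]]

try:
    split_evenly(batch, 4)      # 6 is not a multiple of 4
except ValueError as err:
    print('rejected:', err)
```

Shard i always holds samples i*m through (i+1)*m - 1, so features and labels split with the same function stay aligned on each device.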

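III. Training Procedure

We copy the complete model parameters to multiple GPUs and, in each iteration, train on a single mini-batch using all of them: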

def train(num_gpus, batch_size, lr):
    train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
    ctx = [mx.gpu(i) for i in range(num_gpus)]  # list of device contexts
    print('running on:', ctx)
    # Copy the model parameters to each of the num_gpus GPUs.
    gpu_params = [get_params(params, c) for c in ctx]  # one parameter list per device
    for epoch in range(1, 6):
        start = time()
        for X, y in train_iter:
            # Train on a single mini-batch using all GPUs.
            train_batch(X, y, gpu_params, ctx, lr)
        nd.waitall()
        print('epoch %d, time: %.1f sec' % (epoch, time() - start))
        # Evaluate the model on GPU 0.
        net = lambda x: lenet(x, gpu_params[0])
        test_acc = gb.evaluate_accuracy(test_iter, net, ctx[0])
        print('validation accuracy: %.4f' % test_acc)

Multi-GPU training on a single mini-batch is implemented as follows:

def train_batch(X, y, gpu_params, ctx, lr):
    # Split the mini-batch and copy the shards to the GPUs.
    gpu_Xs = split_and_load(X, ctx)
    gpu_ys = split_and_load(y, ctx)
    # Compute the loss on each GPU.
    with autograd.record():
        ls = [loss(lenet(gpu_X, gpu_W), gpu_y)  # one loss array per device
              for gpu_X, gpu_y, gpu_W in zip(gpu_Xs, gpu_ys, gpu_params)]
    # Run backpropagation on each GPU.
    for l in ls:
        l.backward()
    # Add up the gradients from all GPUs, then broadcast the sum to every GPU.
    for i in range(len(gpu_params[0])):  # gpu_params[0]: all parameters on device 0
        allreduce([gpu_params[c][i].grad for c in range(len(ctx))])  # sum and broadcast the gradients
    # Update the complete parameter copy maintained on each GPU.
    for param in gpu_params:  # each device updates its own copy
        gb.sgd(param, lr, X.shape[0])
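Note that gb.sgd receives the size of the whole mini-batch, X.shape[0], and not the size of one shard. A plausible reading, assuming gb.sgd applies the usual rule param - lr * grad / batch_size, is that the allreduced gradient is already a sum over every sample in the mini-batch, so it has to be averaged over all of them. The sketch below illustrates this with made-up numbers; `sgd_step` is a stand-in written for this example, not the gluonbook function.

```python
def sgd_step(param, grad, lr, batch_size):
    """Assumed update rule: param - lr * grad / batch_size."""
    return param - lr * grad / batch_size

shard_grads = [-7.5, -37.5]        # per-device gradient sums (made-up numbers)
summed = sum(shard_grads)          # what allreduce leaves on every device
batch_size = 4                     # total samples across all devices
shard_size = batch_size // len(shard_grads)

# Dividing by the full batch size averages over every sample exactly once.
w_full = sgd_step(0.5, summed, lr=0.1, batch_size=batch_size)

# Dividing by the shard size instead would scale the step by the number of devices.
w_shard = sgd_step(0.5, summed, lr=0.1, batch_size=shard_size)
print(w_full, w_shard)
```

With two devices the second variant moves the parameter twice as far, which amounts to silently doubling the learning rate.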

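IV. Gluon Implementation

Initializing model parameters on multiple devices

Earlier we saw how to use the ctx argument of the initialize function to initialize model parameters on the CPU or on a single GPU. In fact, ctx also accepts a list of CPUs/GPUs, in which case the initialized parameters are copied to every CPU/GPU in ctx.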

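# Assumes `net` is a Gluon network defined earlier (its definition is not shown in this post).
from mxnet import gluon, init
from mxnet.gluon import utils as gutils
ctx = [mx.gpu(0), mx.gpu(1)]
net.initialize(init=init.Normal(sigma=0.01), ctx=ctx)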

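At this point the net object corresponds to a set of replicas with the same structure on different devices.

Distributing data to the devices

Gluon provides its own version of the split_and_load function implemented in the previous section. It splits a mini-batch of samples and copies the shards to the CPUs/GPUs. The model computation then runs on whichever CPU/GPU holds the input data.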

x = nd.random.uniform(shape=(4, 1, 28, 28))
gpu_x = gutils.split_and_load(x, ctx)
net(gpu_x[0]), net(gpu_x[1])

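By default, weight.data() returns the parameter values on the CPU. Because we initialized the parameters on two GPUs, we have to name the GPU we want to read from. As the output shows, the same parameter holds identical values on the different GPUs.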

weight = net[1].params.get('weight')
try:
    weight.data()
except RuntimeError:
    print('not initialized on', mx.cpu())
weight.data(ctx[0])[0], weight.data(ctx[1])[0]

not initialized on cpu(0)
(
 [[[-0.01473444 -0.01073093 -0.01042483]
   [-0.01327885 -0.01474966 -0.00524142]
   [ 0.01266256  0.00895064 -0.00601594]]]
 <NDArray 1x3x3 @gpu(0)>,
 [[[-0.01473444 -0.01073093 -0.01042483]
   [-0.01327885 -0.01474966 -0.00524142]
   [ 0.01266256  0.00895064 -0.00601594]]]
 <NDArray 1x3x3 @gpu(1)>)

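Parameter synchronization

When we train on multiple GPUs, gluon.Trainer handles the synchronization step of data parallelism: it sums the gradients across the GPUs and broadcasts the result back to all of them. The mini-batch itself is still split and copied to the GPUs explicitly with split_and_load, as in the function that follows. This makes the training function easy to write.

In addition, net.collect_params().reset_ctx() can be used to move the parameters to a different set of devices.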

loss = gloss.SoftmaxCrossEntropyLoss()

def train(num_gpus, batch_size, lr):
    train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
    ctx = [mx.gpu(i) for i in range(num_gpus)]
    print('running on:', ctx)
    net.initialize(init=init.Normal(sigma=0.01), ctx=ctx, force_reinit=True)  # initialize the network on every device
    trainer = gluon.Trainer(
        net.collect_params(), 'sgd', {'learning_rate': lr})  # the trainer picks up the device list from the parameters
    for epoch in range(1, 6):
        start = time()
        for X, y in train_iter:
            gpu_Xs = gutils.split_and_load(X, ctx)  # split the data across the devices
            gpu_ys = gutils.split_and_load(y, ctx)
            with autograd.record():
                ls = [loss(net(gpu_X), gpu_y) for gpu_X, gpu_y in zip(
                    gpu_Xs, gpu_ys)]  # record the loss on each device
            for l in ls:
                l.backward()  # backpropagate on each device separately
            trainer.step(batch_size)  # the update synchronizes gradients across devices
        nd.waitall()
        print('epoch %d, training time: %.1f sec' % (epoch, time() - start))
        test_acc = gb.evaluate_accuracy(test_iter, net, ctx[0])
        print('validation accuracy: %.4f' % test_acc)

train(num_gpus=2, batch_size=512, lr=0.3)
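Any split that insists on equal shards needs every mini-batch, including the last and usually smaller one, to be divisible by the number of GPUs. The sketch below checks this for the call above, assuming the 60,000 Fashion-MNIST training images and assuming the data loader keeps the final partial batch instead of dropping it. `batch_sizes` and `splits_evenly` are hypothetical helpers written for this example.

```python
def batch_sizes(num_examples, batch_size):
    """Sizes of successive mini-batches when the last partial batch is kept."""
    full, rest = divmod(num_examples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

def splits_evenly(num_examples, batch_size, num_gpus):
    """True if every mini-batch can be cut into num_gpus equal shards."""
    return all(b % num_gpus == 0 for b in batch_sizes(num_examples, batch_size))

# 60000 = 117 * 512 + 96, so the last batch holds 96 samples.
print(batch_sizes(60000, 512)[-1])

for num_gpus in (1, 2, 3, 4):
    print(num_gpus, splits_evenly(60000, 512, num_gpus))
```

Under these assumptions a batch size of 512 works with 1, 2, or 4 GPUs but not with 3, because 512 itself is not a multiple of 3.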