Multi-GPU Training with MinkowskiEngine

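Currently, MinkowskiEngine supports multi-GPU training through data parallelization. In data parallelization, each training iteration takes a set of mini-batches and feeds each one to its own replica of the network.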
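The idea can be illustrated without any GPUs. The following is a minimal pure-Python sketch, not MinkowskiEngine code: the toy model `y = w * x`, the helper names `grad_mse` and `split_evenly`, and the data are all assumptions made for illustration. It shows that averaging the gradients computed by the replicas gives the full-batch gradient when the mini-batches have equal size.

```python
# Pure-Python sketch of data parallelism on a toy model y = w * x with
# squared-error loss. No GPUs are involved; a "replica" is just a copy of w.

def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def split_evenly(batch, num_replicas):
    """Split a batch into num_replicas equally sized mini-batches."""
    size = len(batch) // num_replicas
    return [batch[i * size:(i + 1) * size] for i in range(num_replicas)]

w = 0.5
full_batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Each replica computes a gradient on its own mini-batch.
mini_batches = split_evenly(full_batch, num_replicas=2)
replica_grads = [grad_mse(w, mb) for mb in mini_batches]

# Averaging the replica gradients matches the full-batch gradient
# because every mini-batch has the same number of samples.
averaged = sum(replica_grads) / len(replica_grads)
full = grad_mse(w, full_batch)
print(averaged, full)
```

In the real workflow, the replicas live on different GPUs and the mini-batches are sparse tensors, but the bookkeeping is the same.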

First, define a network.

import MinkowskiEngine as ME
from examples.minkunet import MinkUNet34C

# Copy the network to GPU
# (target_device is the GPU that holds the master copy, e.g. devices[0])
net = MinkUNet34C(3, 20, D=3)
net = net.to(target_device)

Synchronized Batch Norm

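Next, create a new network in which every ME.MinkowskiBatchNorm is replaced by ME.MinkowskiSyncBatchNorm. This lets the network use a large total batch size and maintain the same performance as single-GPU training.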

# Synchronized batch norm
net = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(net)
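To see why synchronization matters, consider the statistics a batch norm layer computes. The sketch below is plain Python and only illustrates the general idea of synchronized batch norm; it is not how MinkowskiEngine implements it, and the helper names and sample values are hypothetical. Per-device statistics differ from each other, while combining per-device counts, sums, and sums of squares recovers the statistics of the whole batch.

```python
# Pure-Python sketch: why batch-norm statistics need synchronizing.
# Each inner list stands for one feature channel on one device.

def mean_var(values):
    """Biased mean and variance, as used for batch normalization."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

def synced_mean_var(per_device):
    """Combine per-device count, sum, and sum of squares into global stats."""
    n = sum(len(v) for v in per_device)
    s = sum(sum(v) for v in per_device)
    sq = sum(sum(x * x for x in v) for v in per_device)
    m = s / n
    return m, sq / n - m * m

per_device = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]

local_stats = [mean_var(v) for v in per_device]        # differ per device
global_stats = mean_var(per_device[0] + per_device[1])  # whole batch
synced_stats = synced_mean_var(per_device)              # matches global_stats
print(local_stats, global_stats, synced_stats)
```

Without synchronization, each replica would normalize with its own local statistics, so the effective batch size for batch norm would be the per-GPU batch size rather than the total.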

Next, create replicas of the network and of the final loss layer (if you use one).

import torch.nn as nn
import torch.nn.parallel as parallel

# devices is the list of GPU ids to train on, e.g. list(range(num_devices))
criterion = nn.CrossEntropyLoss()
criterions = parallel.replicate(criterion, devices)

Loading multiple batches

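During training, every iteration needs a set of mini-batches, one per device. Here we use a function that returns one mini-batch per call, but you do not have to follow this pattern.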

# Get new data
inputs, labels = [], []
for i in range(num_devices):
    coords, feat, label = data_loader()  # parallel data loaders can be used
    with torch.cuda.device(devices[i]):
        inputs.append(ME.SparseTensor(feat, coords, device=devices[i]))
        labels.append(label.to(devices[i]))
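One alternative pattern is to draw from a single iterator and hand out consecutive mini-batches to the devices. The sketch below is an assumption for illustration only: `groups_of` is a hypothetical helper, and `fake_loader` stands in for a real data loader. It groups an iterable of mini-batches into lists of `num_devices` items and drops an incomplete final group.

```python
from itertools import islice

def groups_of(iterable, num_devices):
    """Yield lists of num_devices consecutive mini-batches; drop a short tail."""
    it = iter(iterable)
    while True:
        group = list(islice(it, num_devices))
        if len(group) < num_devices:
            return
        yield group

# Hypothetical stand-in for a data loader: each item is one mini-batch.
fake_loader = [f"batch{i}" for i in range(5)]
groups = list(groups_of(fake_loader, num_devices=2))
print(groups)
```

Each group would then be moved to the devices in the same way as in the loop above, with item `i` of the group going to `devices[i]`.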

Copying weights to devices

First, copy the weights to all devices.

replicas = parallel.replicate(net, devices)

Applying replicas to all batches

Next, feed each mini-batch to the replica of the network on the corresponding device. Then feed all output features to the loss layers.

outputs = parallel.parallel_apply(replicas, inputs, devices=devices)

# Extract features from the sparse tensors to use a pytorch criterion
out_features = [output.F for output in outputs]
losses = parallel.parallel_apply(
    criterions, tuple(zip(out_features, labels)), devices=devices)

Gather all losses on the target device.

loss = parallel.gather(losses, target_device, dim=0).mean()
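Note what this final `.mean()` averages. With the default `reduction="mean"`, `nn.CrossEntropyLoss` already averages over the points on each device, so the gathered value is a mean of per-device means. The pure-Python sketch below uses made-up loss values to show that this equals the average over all points only when every device holds the same number of points, which is not guaranteed for sparse tensors.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical per-point losses on two devices with different point counts.
per_device_point_losses = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0]]

# Mean of per-device means: each device counts equally.
per_device_loss = [mean(v) for v in per_device_point_losses]
mean_of_means = mean(per_device_loss)

# Average over all points: each point counts equally.
all_points = [x for v in per_device_point_losses for x in v]
point_weighted = mean(all_points)
print(mean_of_means, point_weighted)
```

Whether the difference matters depends on how uneven the point counts are; the sketch only points out that the two averages are not the same quantity.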

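The rest of the training loop, such as the backward pass and the optimizer step, is the same as in single-GPU training. See the complete multi-GPU example below for more details.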

import os
import argparse
import numpy as np
from time import time
from urllib.request import urlretrieve

try:
    import open3d as o3d
except ImportError:
    raise ImportError("Please install open3d-python with `pip install open3d`.")

import torch
import torch.nn as nn
from torch.optim import SGD

import MinkowskiEngine as ME
from examples.minkunet import MinkUNet34C

import torch.nn.parallel as parallel

# Download the sample point cloud if it is not already present
if not os.path.isfile("1.ply"):
    urlretrieve("http://cvgl.stanford.edu/data2/minkowskiengine/1.ply", "1.ply")

parser = argparse.ArgumentParser()
parser.add_argument("--file_name", type=str, default="1.ply")
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--max_ngpu", type=int, default=2)

# Cache loaded point clouds so the file is read only once
cache = {}


def load_file(file_name, voxel_size):
    if file_name not in cache:
        pcd = o3d.io.read_point_cloud(file_name)
        cache[file_name] = pcd

    pcd = cache[file_name]
    quantized_coords, feats = ME.utils.sparse_quantize(
        np.array(pcd.points, dtype=np.float32),
        np.array(pcd.colors, dtype=np.float32),
        quantization_size=voxel_size,
    )
    # Dummy labels (all zeros); this example only measures speed
    random_labels = torch.zeros(len(feats))

    return quantized_coords, feats, random_labels


def generate_input(file_name, voxel_size):
    # Create a batch, this process is done in a data loader during training in parallel.
    batch = [load_file(file_name, voxel_size)]
    coordinates_, features_, labels_ = list(zip(*batch))
    coordinates, features, labels = ME.utils.sparse_collate(
        coordinates_, features_, labels_
    )

    # Normalize features; the caller creates the sparse tensor
    return coordinates, (features - 0.5).float(), labels


if __name__ == "__main__":
    # loss and network
    config = parser.parse_args()
    num_devices = torch.cuda.device_count()
    num_devices = min(config.max_ngpu, num_devices)
    devices = list(range(num_devices))
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print("' WARNING: This example is deprecated. '")
    print("' Please use DistributedDataParallel or pytorch-lightning'")
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print(
        f"Testing {num_devices} GPUs. Total batch size: {num_devices * config.batch_size}"
    )

    # For copying the final loss back to one GPU
    target_device = devices[0]

    # Copy the network to GPU
    net = MinkUNet34C(3, 20, D=3)
    net = net.to(target_device)

    # Synchronized batch norm
    net = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(net)
    optimizer = SGD(net.parameters(), lr=1e-1)

    # Copy the loss layer
    criterion = nn.CrossEntropyLoss()
    criterions = parallel.replicate(criterion, devices)
    min_time = np.inf

    for iteration in range(10):
        optimizer.zero_grad()

        # Get new data
        inputs, all_labels = [], []
        for i in range(num_devices):
            coordinates, features, labels = generate_input(config.file_name, 0.05)
            with torch.cuda.device(devices[i]):
                inputs.append(ME.SparseTensor(features, coordinates, device=devices[i]))
                all_labels.append(labels.long().to(devices[i]))

        # The raw version of the parallel_apply
        st = time()
        replicas = parallel.replicate(net, devices)
        outputs = parallel.parallel_apply(replicas, inputs, devices=devices)

        # Extract features from the sparse tensors to use a pytorch criterion
        out_features = [output.F for output in outputs]
        losses = parallel.parallel_apply(
            criterions, tuple(zip(out_features, all_labels)), devices=devices
        )
        loss = parallel.gather(losses, target_device, dim=0).mean()
        # Gradient
        loss.backward()
        optimizer.step()

        t = time() - st
        min_time = min(t, min_time)
        print(
            f"Iteration: {iteration}, Loss: {loss.item()}, Time: {t}, Min time: {min_time}"
        )

        # Must clear cache at regular interval
        if iteration % 10 == 0:
            torch.cuda.empty_cache()

Speedup experiments

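The experiments were run on 4x Titan XP with various batch sizes, with the load divided evenly among the GPUs. For example, for a total batch size of 8, 1 GPU processes a batch of 8, 2 GPUs process a batch of 4 each, and 4 GPUs process a batch of 2 each.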

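Total batch size: 8
GPUs	Batch size per GPU	Time per iteration	Speedup (ideal)
1 GPU	8	1.611 s	x1 (x1)
2 GPUs	4	0.916 s	x1.76 (x2)
4 GPUs	2	0.689 s	x2.34 (x4)

Total batch size: 12
GPUs	Batch size per GPU	Time per iteration	Speedup (ideal)
1 GPU	12	2.691 s	x1 (x1)
2 GPUs	6	1.413 s	x1.90 (x2)
3 GPUs	4	1.064 s	x2.53 (x3)
4 GPUs	3	1.006 s	x2.67 (x4)

Total batch size: 16
GPUs	Batch size per GPU	Time per iteration	Speedup (ideal)
1 GPU	16	3.543 s	x1 (x1)
2 GPUs	8	1.933 s	x1.83 (x2)
4 GPUs	4	1.322 s	x2.68 (x4)

Total batch size: 18
GPUs	Batch size per GPU	Time per iteration	Speedup (ideal)
1 GPU	18	4.391 s	x1 (x1)
2 GPUs	9	2.114 s	x2.08 (x2)
3 GPUs	6	1.660 s	x2.65 (x3)

Total batch size: 20
GPUs	Batch size per GPU	Time per iteration	Speedup (ideal)
1 GPU	20	4.639 s	x1 (x1)
2 GPUs	10	2.426 s	x1.91 (x2)
4 GPUs	5	1.707 s	x2.72 (x4)

Total batch size: 21
GPUs	Batch size per GPU	Time per iteration	Speedup (ideal)
1 GPU	21	4.894 s	x1 (x1)
3 GPUs	7	1.877 s	x2.61 (x3)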
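The speedup column is the single-GPU time divided by the multi-GPU time for the same total batch size. The short sketch below recomputes it from the measured times and also reports parallel efficiency, defined here as speedup divided by the number of GPUs. Efficiency does not appear in the tables and is added only as an illustration.

```python
# Measured seconds per iteration, keyed by total batch size
# and then by number of GPUs.
times = {
    8: {1: 1.611, 2: 0.916, 4: 0.689},
    12: {1: 2.691, 2: 1.413, 3: 1.064, 4: 1.006},
    16: {1: 3.543, 2: 1.933, 4: 1.322},
    18: {1: 4.391, 2: 2.114, 3: 1.660},
    20: {1: 4.639, 2: 2.426, 4: 1.707},
    21: {1: 4.894, 3: 1.877},
}

def speedup(total_batch, num_gpus):
    """Single-GPU time divided by the time with num_gpus GPUs."""
    per_gpu = times[total_batch]
    return per_gpu[1] / per_gpu[num_gpus]

def efficiency(total_batch, num_gpus):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup(total_batch, num_gpus) / num_gpus

for total_batch, per_gpu in times.items():
    for n in per_gpu:
        print(f"batch {total_batch:2d}, {n} GPU(s): "
              f"x{speedup(total_batch, n):.2f}, "
              f"efficiency {efficiency(total_batch, n):.0%}")
```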

Analysis

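With small batch sizes, the speedup is modest. With large batch sizes (e.g., 18 and 20), the speedup improves because the thread initialization overhead is amortized over a larger workload.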

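Also, in every case using 4 GPUs is inefficient, and the extra speedup over 3 GPUs is small (x2.65 for 3 GPUs with a total batch size of 18 vs. x2.72 for 4 GPUs with a total batch size of 20). It is therefore recommended to use at most 3 GPUs, with large batch sizes.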

GPUs	Average speedup (ideal)
1 GPU	x1 (x1)
2 GPUs	x1.90 (x2)
3 GPUs	x2.60 (x3)
4 GPUs	x2.60 (x4)

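The modest speedup is due to heavy CPU usage. In MinkowskiEngine, all sparse tensor coordinates are managed on the CPU, and building the kernel in-out maps requires significant CPU computation. To increase the speedup, we therefore recommend a faster CPU, since the CPU can become the bottleneck for large point clouds.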
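One way to quantify how much of each iteration fails to parallelize is the Karp-Flatt metric, which estimates an effective serial fraction from a measured speedup. This metric is not part of the original analysis; the sketch below applies it to the average speedups reported above as a rough illustration, under the assumption that a simple Amdahl-style model is adequate here.

```python
def serial_fraction(speedup, num_gpus):
    """Karp-Flatt metric: experimentally determined serial fraction."""
    return (1.0 / speedup - 1.0 / num_gpus) / (1.0 - 1.0 / num_gpus)

# Average speedups from the summary table.
average_speedup = {2: 1.90, 3: 2.60, 4: 2.60}
fractions = {n: serial_fraction(s, n) for n, s in average_speedup.items()}

for n, e in fractions.items():
    print(f"{n} GPUs: estimated serial fraction {e:.3f}")
```

The estimate grows with the number of GPUs, which suggests overhead that increases with device count. That is consistent with a shared CPU-side bottleneck, though it does not by itself prove one.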
