Multi-GPU processing with data parallelism

If you write your software in a language like C++ for a single CPU core, making it run on multiple GPUs in parallel would require rewriting it from scratch. But this is not the case with TensorFlow. Because of its symbolic nature, TensorFlow can hide all of that complexity, making it effortless to scale your program across many CPUs and GPUs.

Let’s start with the simple example of adding two vectors on the CPU:

```python
import tensorflow as tf

with tf.device(tf.DeviceSpec(device_type='CPU', device_index=0)):
    a = tf.random_uniform([1000, 100])
    b = tf.random_uniform([1000, 100])
    c = a + b

tf.Session().run(c)
```
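As an aside, if you are not sure which devices TensorFlow can see on your machine, the device_lib utility can list them. A minimal sketch, not part of the original example:

```python
from tensorflow.python.client import device_lib

# Prints every device TensorFlow has registered, e.g. /cpu:0, /gpu:0, ...
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)
```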

The same thing can be done just as simply on a GPU:

```python
with tf.device(tf.DeviceSpec(device_type='GPU', device_index=0)):
    a = tf.random_uniform([1000, 100])
    b = tf.random_uniform([1000, 100])
    c = a + b
```
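To verify that the ops really were placed on the GPU, you can ask the session to log device placement. A small sketch using the standard tf.ConfigProto option (again, not part of the original example):

```python
# log_device_placement makes TensorFlow print, for every op,
# the device it was assigned to when the graph runs.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(c)
```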
But what if we have two GPUs and want to utilize both? To do that, we can split the data and use a separate GPU for processing each half:
```python
split_a = tf.split(a, 2)
split_b = tf.split(b, 2)

split_c = []
for i in range(2):
    with tf.device(tf.DeviceSpec(device_type='GPU', device_index=i)):
        split_c.append(split_a[i] + split_b[i])

c = tf.concat(split_c, axis=0)
```
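Note that tf.split splits along axis 0 by default, so each half here is a [500, 100] tensor. A quick check (my own illustration, assuming the shapes from the example above):

```python
a = tf.random_uniform([1000, 100])
halves = tf.split(a, 2)  # splits along axis 0 by default

print(halves[0].shape)  # (500, 100)
print(halves[1].shape)  # (500, 100)
```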
Let's rewrite this in a more general form, so that we can replace addition with any other set of operations:

```python
def make_parallel(fn, num_gpus, **kwargs):
    # Split every input tensor into num_gpus pieces along the batch dimension.
    in_splits = {}
    for k, v in kwargs.items():
        in_splits[k] = tf.split(v, num_gpus)

    # Run fn on each GPU with its share of the inputs, then concatenate.
    out_split = []
    for i in range(num_gpus):
        with tf.device(tf.DeviceSpec(device_type='GPU', device_index=i)):
            with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0):
                out_split.append(fn(**{k: v[i] for k, v in in_splits.items()}))

    return tf.concat(out_split, axis=0)

def model(a, b):
    return a + b

c = make_parallel(model, 2, a=a, b=b)
```

You can replace the model with any function that takes a set of tensors as input and returns a tensor as a result, with the condition that both the inputs and the output are batched along their first dimension. Note that we also added a variable scope and enabled reuse for every split after the first. This makes sure that we use the same variables for processing both splits, something that will come in handy in our next example.
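To see the variable sharing in action, consider a model that actually creates a variable. The linear model below is a hypothetical illustration (not from the text); both GPU towers end up reading the same "w":

```python
def linear_model(x):
    # Created on the first tower, reused (not duplicated) on the second.
    w = tf.get_variable("w", shape=[100, 10])
    return tf.matmul(x, w)

x = tf.random_uniform([1000, 100])
out = make_parallel(linear_model, 2, x=x)

# Only a single copy of "w" exists, shared across both splits.
print([v.name for v in tf.global_variables()])
```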

Let’s look at a slightly more practical example. We want to train a neural network on multiple GPUs. During training we not only need to compute the forward pass but also need to compute the backward pass (the gradients). But how can we parallelize the gradient computation? This turns out to be pretty easy.

Recall from the first item that we wanted to fit a second-degree polynomial to a set of samples. We reorganized the code a bit so that the bulk of the operations happen in the model function:

```python
import numpy as np
import tensorflow as tf

def model(x, y):
    # Fit yhat = w[0] * x**2 + w[1] * x + w[2] to the samples.
    w = tf.get_variable("w", shape=[3, 1])
    f = tf.stack([tf.square(x), x, tf.ones_like(x)], 1)
    yhat = tf.squeeze(tf.matmul(f, w), 1)
    loss = tf.square(yhat - y)
    return loss

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

loss = model(x, y)
train_op = tf.train.AdamOptimizer(0.1).minimize(
    tf.reduce_mean(loss))

def generate_data():
    x_val = np.random.uniform(-10.0, 10.0, size=100)
    y_val = 5 * np.square(x_val) + 3
    return x_val, y_val

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for _ in range(1000):
    x_val, y_val = generate_data()
    _, loss_val = sess.run([train_op, loss], {x: x_val, y: y_val})

print(sess.run(tf.contrib.framework.get_variables_by_name("w")))
```

Now let’s use the make_parallel function that we just wrote to parallelize this. We only need to change two lines of the code above:

```python
loss = make_parallel(model, 2, x=x, y=y)

train_op = tf.train.AdamOptimizer(0.1).minimize(
    tf.reduce_mean(loss),
    colocate_gradients_with_ops=True)
```

The only thing we need to change to parallelize the backpropagation of gradients is to set the colocate_gradients_with_ops flag to True. This ensures that each gradient op runs on the same device as its corresponding forward op.
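For reference, the multi-tower pattern used in TensorFlow's CIFAR-10 tutorial makes this gradient handling explicit: it computes the gradients on each GPU separately and averages them before applying. The sketch below is an illustrative alternative to colocate_gradients_with_ops, not part of the original example; it reuses model, x, and y from above:

```python
optimizer = tf.train.AdamOptimizer(0.1)

x_splits = tf.split(x, 2)
y_splits = tf.split(y, 2)

tower_grads = []
for i in range(2):
    with tf.device(tf.DeviceSpec(device_type='GPU', device_index=i)):
        with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0):
            tower_loss = tf.reduce_mean(model(x_splits[i], y_splits[i]))
            # compute_gradients returns a list of (gradient, variable) pairs.
            tower_grads.append(optimizer.compute_gradients(tower_loss))

# Average each variable's gradient across the towers, then apply once.
avg_grads = []
for grad_and_vars in zip(*tower_grads):
    grads = tf.stack([g for g, _ in grad_and_vars])
    avg_grads.append((tf.reduce_mean(grads, axis=0), grad_and_vars[0][1]))

train_op = optimizer.apply_gradients(avg_grads)
```

With equal split sizes the two approaches compute the same update here; colocate_gradients_with_ops simply lets TensorFlow handle the placement for you.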
