Purpose of this article

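Where the official documentation covers distributed training with Estimator, it is out of sync with the actual API because of version updates. Specifically, when distributed Estimator training takes a dataset as input, in version 1.12 the data trained on is just the dataset itself: all devices combined make one pass over the data.

In version 2.0, by contrast, the data trained on is the dataset multiplied by the number of distributed devices. In other words, every device makes a complete pass over all of the dataset's data.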
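The difference in data volume can be illustrated with a toy model in plain Python. This is only a sketch of the behavior described above, not TensorFlow code; the function names `shared_pass`, `full_pass_per_device`, and `total_examples` are hypothetical, and the dataset size is assumed to divide evenly by the device count.

```python
def shared_pass(dataset, num_devices):
    """Toy model of the 1.12 behavior: the devices split a single pass."""
    return {d: dataset[d::num_devices] for d in range(num_devices)}


def full_pass_per_device(dataset, num_devices):
    """Toy model of the 2.0 behavior: every device sees the whole dataset."""
    return {d: list(dataset) for d in range(num_devices)}


def total_examples(per_device):
    """Count the examples consumed across all devices."""
    return sum(len(examples) for examples in per_device.values())


dataset = list(range(8))
total_v1 = total_examples(shared_pass(dataset, 2))
total_v2 = total_examples(full_pass_per_device(dataset, 2))
print(total_v1, total_v2)  # prints "8 16"
```

With 8 examples and 2 devices, the first model consumes 8 examples in total and the second consumes 16, which is the "dataset multiplied by the number of devices" effect.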

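Data reading in version 1.12

1. Creating the graph in the main thread

In the code below, the client calls the input function and obtains an iterator. This code is part of Estimator's distributed training path.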

with ops.Graph().as_default() as g:
  # We want to create the iterations variable outside the distribution scope
  # as that is just stored on the host and mainly used to drive the loop
  # and doesn't need to be a Mirrored/Device variable.
  if is_tpu_strategy:
    steps_per_run_variable = training.get_or_create_steps_per_run_variable()
  with self._train_distribution.scope():
    random_seed.set_random_seed(self._config.tf_random_seed)
    iterator, input_hooks = self._get_iterator_from_input_fn(
        input_fn, model_fn_lib.ModeKeys.TRAIN, self._train_distribution)

_get_iterator_from_input_fn: this function creates the iterator that the subsequent training reads data from.

def _get_iterator_from_input_fn(self, input_fn, mode, distribution=None):
  if distribution is not None:
    result = distribution.distribute_dataset(
        lambda: self._call_input_fn(input_fn, mode))
  else:
    result = self._call_input_fn(input_fn, mode)
  iterator = result.make_initializable_iterator()
  input_hooks = [estimator_util._DatasetInitializerHook(iterator)]  # pylint: disable=protected-access
  return iterator, input_hooks

Here distribute_dataset is called to produce the dataset.

Stepping into it shows that it creates a PerDeviceDataset like this:

class PerDeviceDataset(object):
  """Like `tf.data.Dataset` split devices, producing `PerDevice` data."""

  def __init__(self, dataset, devices, prefetch_on_device=None):
    self._devices = devices

    # Default to using prefetching in graph mode, unless specified.
    # TODO(priyag): Enable prefetching in eager mode.
    self._prefetch_on_device = prefetch_on_device
    if self._prefetch_on_device is None:
      self._prefetch_on_device = not context.executing_eagerly()
    assert not (self._prefetch_on_device and context.executing_eagerly()), (
        "Prefetching is only supported in graph mode currently")

    if self._prefetch_on_device:
      self._dataset = dataset.apply(
          prefetching_ops_v2.prefetch_to_devices(self._devices))
    else:
      # TODO(priyag): If dropping remainder is not appropriate, find another
      # approach to distributing the dataset when not possible to divide evenly.
      # Possibly not an issue when we start using PartitionedDataset.
      self._dataset = dataset.batch(len(devices), drop_remainder=True)

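The last line shows that one more batch layer is wrapped around the original dataset, with a batch size equal to the number of devices, so the data is split across the devices.

The iterator created afterwards is likewise wrapped as a PerDeviceDataIterator, which builds a dictionary mapping each device to its own data, split by index within the batch.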
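The split by batch index can be sketched in plain Python. This is a simplified stand-in for the non-prefetching branch, not the TensorFlow implementation; `batch_drop_remainder` and `per_device_batches` are hypothetical helper names, and the device strings are placeholders.

```python
def batch_drop_remainder(elements, batch_size):
    """Plain-Python stand-in for dataset.batch(batch_size, drop_remainder=True)."""
    usable = len(elements) - len(elements) % batch_size
    return [elements[i:i + batch_size] for i in range(0, usable, batch_size)]


def per_device_batches(elements, devices):
    """Yield one {device: element} dict per step, keyed by index in the batch."""
    for group in batch_drop_remainder(elements, len(devices)):
        yield {device: group[index] for index, device in enumerate(devices)}


devices = ["/gpu:0", "/gpu:1"]
steps = list(per_device_batches(["a", "b", "c", "d", "e"], devices))
print(steps)
# prints [{'/gpu:0': 'a', '/gpu:1': 'b'}, {'/gpu:0': 'c', '/gpu:1': 'd'}]
```

The fifth element is dropped because it cannot fill a complete batch of two, which is the consequence of drop_remainder=True noted in the TODO comment of the quoted source.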

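Distributed training

Training in version 1.12 is fairly simple. For MirroredStrategy, one thread is created per device.

One drawback is that the threads are created anew on every run; a TODO in the source suggests this is expected to be optimized away later.

Below is the client-side code that fetches data from the iterator and passes it to each device for computation through self._train_distribution.call_for_each_tower: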

features, labels = estimator_util.parse_iterator_result(
    iterator.get_next())
grouped_estimator_spec = self._train_distribution.call_for_each_tower(
    self._call_model_fn,
    features,
    labels,  # although this will be None it seems
    model_fn_lib.ModeKeys.TRAIN,
    self.config)
loss = self._train_distribution.unwrap(
    self._train_distribution.reduce(
        distribute_lib.get_loss_reduction(),
        grouped_estimator_spec.loss,
        destinations='/device:CPU:0'))[0]
distributed_train_op = grouped_estimator_spec.train_op

call_for_each_tower is the interface that runs training on each device:

def _call_for_each_tower(distribution, fn, *args, **kwargs):
  """Run `fn` in separate threads, once per tower/worker device."""
  run_concurrently = kwargs.pop("run_concurrently", True)
  if not context.executing_eagerly():
    # Lots of TF library code isn't thread-safe in graph mode, and
    # there is little to be gained by turning on multithreading when
    # constructing a graph.
    run_concurrently = False
    # Needed for per-thread device, etc. contexts in graph mode.
    ops.get_default_graph().switch_to_thread_local()
  elif run_concurrently is None:
    run_concurrently = True

  coord = coordinator.Coordinator(clean_stop_exception_types=(_RequestedStop,))

  shared_variable_store = {}

  # TODO(isaprykin): Create these threads once instead of during every run()
  # call.
  threads = []
  for index, d in enumerate(distribution.worker_devices):
    variable_creator_fn = shared_variable_creator.make_fn(
        shared_variable_store, index)
    t = MirroredStrategy._MirroredTowerThread(  # pylint: disable=protected-access
        distribution, coord, d, variable_creator_fn, fn,
        *values.select_device(d, args), **values.select_device(d, kwargs))
    threads.append(t)

  for t in threads:
    t.start()

Here, select_device simply picks out the value stored under the key for the given device. That completes the distributed training flow.
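The thread-per-device pattern and the per-device lookup can be sketched with the standard library alone. This is a simplified illustration, not the MirroredStrategy code: `select_device` here is a hypothetical stand-in that treats a plain dict as the per-device value, `call_for_each_device` is a hypothetical name, and unlike graph mode in the quoted source the threads are simply started together.

```python
import threading


def select_device(device, value):
    """Hypothetical stand-in: take this device's entry from a per-device dict."""
    if isinstance(value, dict):
        return value[device]
    return value  # values that are not per-device are passed through unchanged


def call_for_each_device(devices, fn, *args):
    """Run `fn` once per device, each in its own thread, and collect results."""
    results = {}

    def run_on(device, device_args):
        results[device] = fn(*device_args)

    threads = []
    for device in devices:
        device_args = [select_device(device, arg) for arg in args]
        threads.append(
            threading.Thread(target=run_on, args=(device, device_args)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


features = {"/gpu:0": [1, 2], "/gpu:1": [3, 4]}
results = call_for_each_device(
    ["/gpu:0", "/gpu:1"], lambda feats, scale: sum(feats) * scale, features, 10)
print(results == {"/gpu:0": 30, "/gpu:1": 70})  # prints True
```

Like the quoted source, this sketch creates its threads on every call, which is the overhead the TODO in the original code refers to.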
