Packing Variable-Length Sequence Data (Text) into TensorFlow TFRecord Files

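In a lab setting, the whole dataset is usually loaded into memory at once and then split by a hand-written mini-batch function. This approach breaks down on very large datasets: 1) memory is too small to hold all the data at once; 2) batching and model training cannot run asynchronously, so training is easily blocked by the time spent slicing mini-batches; 3) it cannot be deployed to a distributed environment.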

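The snippet below uses the TFRecord file format and supports variable-length sequences with dynamic padding, which largely covers the needs of NLP and other sequence tasks. It is written against the TensorFlow 1.x API.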

import tensorflow as tf

def generate_tfrecords(tfrecord_filename):
    # Toy data: variable-length integer sequences, one label per sequence.
    sequences = [[1], [2, 2], [3, 3, 3], [4, 4, 4, 4], [5, 5, 5, 5, 5],
                 [1], [2, 2], [3, 3, 3], [4, 4, 4, 4]]
    labels = [1, 2, 3, 4, 5, 1, 2, 3, 4]
    with tf.python_io.TFRecordWriter(tfrecord_filename) as writer:
        for sequence, label in zip(sequences, labels):
            # One Feature per time step; together they form a FeatureList.
            frame_features = [
                tf.train.Feature(int64_list=tf.train.Int64List(value=[token_id]))
                for token_id in sequence]
            example = tf.train.SequenceExample(
                # Fixed-size, per-example data goes in the context.
                context=tf.train.Features(feature={
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))}),
                # Variable-length, per-step data goes in the feature lists.
                feature_lists=tf.train.FeatureLists(feature_list={
                    'sequence': tf.train.FeatureList(feature=frame_features)
                })
            )
            writer.write(example.SerializeToString())

def single_example_parser(serialized_example):
    # The context holds the scalar label.
    context_features = {
        "label": tf.FixedLenFeature([], dtype=tf.int64)
    }
    # Each step of 'sequence' is a scalar; the number of steps may vary.
    sequence_features = {
        "sequence": tf.FixedLenSequenceFeature([], dtype=tf.int64)
    }
    context_parsed, sequence_parsed = tf.parse_single_sequence_example(
        serialized=serialized_example,
        context_features=context_features,
        sequence_features=sequence_features
    )
    labels = context_parsed['label']
    sequences = sequence_parsed['sequence']
    return sequences, labels

def batched_data(tfrecord_filename, single_example_parser, batch_size, padded_shapes, num_epochs=1, buffer_size=1000):
    # Shuffle individual examples before batching; shuffling after
    # padded_batch would only reorder whole batches.
    dataset = tf.data.TFRecordDataset(tfrecord_filename)\
        .map(single_example_parser)\
        .shuffle(buffer_size)\
        .padded_batch(batch_size, padded_shapes=padded_shapes)\
        .repeat(num_epochs)
    return dataset.make_one_shot_iterator().get_next()

if __name__ == "__main__":
    def model(features, labels):
        # Placeholder model: returns its inputs unchanged.
        return features, labels

    tfrecord_filename = 'test.tfrecord'
    generate_tfrecords(tfrecord_filename)
    # padded_shapes: pad 'sequence' to the longest one in each batch; labels are scalars.
    out = model(*batched_data(tfrecord_filename, single_example_parser, 2, ([None], [])))

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        init_op = tf.group(tf.global_variables_initializer(),
                           tf.local_variables_initializer())
        sess.run(init_op)
        # tf.data iterators need no queue runners or Coordinator;
        # the iterator raises OutOfRangeError once the data is exhausted.
        try:
            while True:
                print(sess.run(out))
        except tf.errors.OutOfRangeError:
            print("done training")
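
The dynamic padding that padded_batch performs with a padded shape of [None] can be illustrated without TensorFlow. The sketch below is only an approximation in plain Python: the helper names pad_batch and padded_batches are hypothetical, shuffling is left out, and the pad value 0 is assumed to match the default used for integer tensors. Each batch is padded only to the length of its own longest sequence, not to a global maximum.

```python
def pad_batch(sequences, pad_value=0):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


def padded_batches(sequences, batch_size, pad_value=0):
    """Yield consecutive batches, each padded to its own longest sequence."""
    for start in range(0, len(sequences), batch_size):
        yield pad_batch(sequences[start:start + batch_size], pad_value)


# Hypothetical toy input, a prefix of the sequences written to the TFRecord file.
sequences = [[1], [2, 2], [3, 3, 3], [4, 4, 4, 4], [5, 5, 5, 5, 5]]
for batch in padded_batches(sequences, batch_size=2):
    print(batch)
# [[1, 0], [2, 2]]
# [[3, 3, 3, 0], [4, 4, 4, 4]]
# [[5, 5, 5, 5, 5]]
```

Because the batch width follows the longest sequence in each batch, short sequences are not padded out to the length of the longest sequence in the whole dataset.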
