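TensorFlow: TFRecord Files in Detail

TFRecord is TensorFlow's built-in file format. It is a binary format with the following advantages:

1. It gives all kinds of input files a single, uniform interface

2. It makes better use of memory and is easy to copy and move

3. It stores the binary data and the labels in the same file

Introduction

Before going into TFRecord itself, it helps to understand the following operations.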

tf.train.Int64List(value=list_data)

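It wraps the elements of a list in a protobuf message, storing each element as one value entry.

Note: the input must be a Python list (or another iterable) of plain Python values, not a tensor. All elements must have the same type, and that type must match the list type, here an integer.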

# value = tf.constant([1, 2])     ### passing a tensor like this raises an error
ss = 1               ### Int64List elements must be int (or long in Python 2); the other list types follow the same rule
tt = 2
out1 = tf.train.Int64List(value = [ss, tt])
print(out1)
# value: 1
# value: 2

ss = [1, 2]
out2 = tf.train.Int64List(value = ss)
print(out2)
# value: 1
# value: 2

There are two more methods of the same kind:

tf.train.FloatList

tf.train.BytesList
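As a rough sketch of how the three list types are used side by side (the values below are made up for illustration): FloatList takes Python floats, and BytesList takes bytes objects, so a str has to be encoded first.

```python
import tensorflow as tf

# Illustrative values only; none of them come from a real dataset.
floats = tf.train.FloatList(value=[0.5, 1.5])
raw = tf.train.BytesList(value=[b"cat", "dog".encode("utf-8")])
ints = tf.train.Int64List(value=[3, 4])

# The repeated `value` field behaves like a Python sequence.
print(list(floats.value))  # [0.5, 1.5]
print(list(raw.value))     # [b'cat', b'dog']
print(list(ints.value))    # [3, 4]
```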

tf.train.Feature(int64_list=)

It builds a feature of a single type, for example an integer feature:

out = tf.train.Feature(int64_list=tf.train.Int64List(value=[33, 22]))
print(out)
# int64_list {
#   value: 33
#   value: 22
# }

The other types work the same way:

tf.train.Feature(float_list=tf.train.FloatList())

tf.train.Feature(bytes_list=tf.train.BytesList())

tf.train.Features(feature=dict_data)

It builds a feature set that can mix several types, expressed as a dict that maps each feature name to a Feature:

out = tf.train.Features(feature={
    "suibian": tf.train.Feature(int64_list=tf.train.Int64List(value=[1, 2, 4])),
    "a": tf.train.Feature(float_list=tf.train.FloatList(value=[5., 7.]))
})
print(out)
# feature {
#   key: "a"
#   value {
#     float_list {
#       value: 5.0
#       value: 7.0
#     }
#   }
# }
# feature {
#   key: "suibian"
#   value {
#     int64_list {
#       value: 1
#       value: 2
#       value: 4
#     }
#   }
# }

tf.train.Example(features=tf.train.Features())

It creates one sample; each Example corresponds to exactly one sample:

example = tf.train.Example(features=
                           tf.train.Features(feature={
                               'a': tf.train.Feature(int64_list=tf.train.Int64List(value=range(2))),
                               'b': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'm',b'n']))
                           }))
print(example)
# features {
#   feature {
#     key: "a"
#     value {
#       int64_list {
#         value: 0
#         value: 1
#       }
#     }
#   }
#   feature {
#     key: "b"
#     value {
#       bytes_list {
#         value: "m"
#         value: "n"
#       }
#     }
#   }
# }

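[Figure: a diagram summarizing the code above]

The Example protocol buffer

It is essentially a data storage format, similar to XML or JSON.

The methods above are what build data in this format.

One Example protocol buffer corresponds to one sample. A sample has several features, and each feature holds several elements (see the figure above).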

message Example {
    Features features = 1;
}

message Features {
    map<string, Feature> feature = 1;
}

message Feature {
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
}

A TFRecord file stores its records as serialized Example protocol buffers.
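The message definitions map directly onto attribute access in Python. The following is a minimal sketch, with made-up feature names "a" and "b", that walks from a sample to its features and then to each feature's elements, using WhichOneof to find out which of the three list types a Feature holds.

```python
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "a": tf.train.Feature(int64_list=tf.train.Int64List(value=[0, 1])),
    "b": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"m", b"n"])),
}))

rows = []
# Example -> Features -> map of name to Feature -> one of the three lists.
for name in sorted(example.features.feature):
    feature = example.features.feature[name]
    kind = feature.WhichOneof("kind")          # e.g. "int64_list"
    values = list(getattr(feature, kind).value)
    rows.append((name, kind, values))
    print(name, kind, values)
# a int64_list [0, 1]
# b bytes_list [b'm', b'n']
```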

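TFRecord files

TFRecord files can be written, and files of other types can be converted into TFRecord files. A conversion simply reads the other file first and then writes its contents into a TFRecord file.

TFRecord files can also be read.

Writing TFRecord files

Writing takes two steps:

1. Create a writer

2. Build an Example protocol buffer for each sample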

tf.python_io.TFRecordWriter(file_name)

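Creates the writer. The writer has two commonly used methods:

write(record): writes one sample to the file
close(): closes the writer

Note: record here is a serialized Example, produced by Example.SerializeToString(). Serialization encodes the Example's map as a compact binary string, which saves a lot of space.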
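To make the serialization step concrete, here is a small sketch (the "label" feature and its value are invented for the example). It serializes an Example to bytes and parses it back with FromString, the standard protobuf class method, to show that nothing is lost in the round trip.

```python
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
}))

serialized = example.SerializeToString()            # a bytes object
restored = tf.train.Example.FromString(serialized)  # parse it back

print(type(serialized).__name__)                                # bytes
print(restored.features.feature["label"].int64_list.value[0])   # 7
print(restored == example)                                      # True
```

The bytes object is exactly what write(record) expects.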

Example 1: saving the MNIST dataset as a TFRecord file

import tensorflow as tf
import numpy as np
import input_data   # input_data.py helper, assumed to be on the import path

# Build an integer feature
def _int64_feature(value):
    return tf.train.Feature(int64_list = tf.train.Int64List(value = [value]))

# Build a bytes (string) feature, used here for the image content
def _string_feature(value):
    return tf.train.Feature(bytes_list = tf.train.BytesList(value = [value]))

# Load the image data and some of its properties
mnist = input_data.read_data_sets('../../../data/MNIST_data',dtype=tf.uint8, one_hot=True)
images = mnist.train.images
labels = mnist.train.labels
pixels = images.shape[1]        # images.shape is (55000, 784), so pixels is 784
num_examples = mnist.train.num_examples

file_name = 'output.tfrecords'          ### output file name
writer = tf.python_io.TFRecordWriter(file_name)     ### writer

for index in range(num_examples):
    ### iterate over the samples
    image_raw = images[index].tobytes()        ### image as raw bytes (tostring() is the deprecated name)
    example = tf.train.Example(features = tf.train.Features(feature = {
        'pixel': _int64_feature(pixels),
        'label': _int64_feature(int(np.argmax(labels[index]))),   ### one-hot label back to a class index
        'image_raw': _string_feature(image_raw)
    }))
    writer.write(example.SerializeToString())       ### write to the TFRecord file

writer.close()

Example 2: saving a CSV file as a TFRecord file

import pandas as pd
import tensorflow as tf

train_frame = pd.read_csv("../myfiles/xx3.csv")
train_labels_frame = train_frame.pop(item="label")      # remove the label column and keep it separately
train_values = train_frame.values
train_labels = train_labels_frame.values
print("values shape: ", train_values.shape)     # values shape:  (2, 3)
print("labels shape:", train_labels.shape)      # labels shape: (2,)

writer = tf.python_io.TFRecordWriter("xx3.tfrecords")
for i in range(train_values.shape[0]):
    image_raw = train_values[i].tobytes()       # one CSV row as raw bytes
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                "image_raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_raw])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(train_labels[i])]))
            }
        )
    )
    writer.write(record=example.SerializeToString())

writer.close()

Example 3: saving PNG files as a TFRecord file

import glob
import tensorflow as tf
from PIL import Image

# filenames = tf.train.match_filenames_once('../myfiles/*.png')
filenames = glob.iglob('../myfiles/*.png')      # forward slashes avoid invalid escape sequences and also work on Windows

writer = tf.python_io.TFRecordWriter('png.tfrecords')
for filename in filenames:
    img = Image.open(filename)
    img_raw = img.tobytes()     # raw pixel bytes; the image size is not stored
    label = 1                   # every image gets the same label in this example
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                "image_raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_raw])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
            }
        )
    )
    writer.write(record=example.SerializeToString())

writer.close()

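Reading TFRecord files

Reading a TFRecord file works much like TensorFlow's other data-reading methods; see my blog post on reading data.

tf.TFRecordReader()

Creates a reader. Its read method takes a filename queue and returns a (key, value) pair for the next record.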

tf.parse_single_example(serialized,features=None,name= None)

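Parses a single Example protocol buffer.

serialized: a scalar string Tensor holding one serialized Example, that is, the value returned by the reader
features: a dict whose keys are the names of the features to read and whose values are FixedLenFeature objects
return: a dict mapping each feature name to its parsed Tensor

A value in features can also be tf.VarLenFeature(), although this is used less often. It returns a SparseTensor, a format that stores only the values that are actually present together with their indices. It is enough to know that it exists.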
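For readers who want to see what VarLenFeature returns, here is a minimal sketch. It assumes TensorFlow 2.x (eager execution, tf.io names), and the feature name "tags" is invented for the example. The parsed value is a SparseTensor, which tf.sparse.to_dense converts to an ordinary tensor.

```python
import tensorflow as tf  # assumes TensorFlow 2.x

# A feature whose length may differ from sample to sample.
example = tf.train.Example(features=tf.train.Features(feature={
    "tags": tf.train.Feature(int64_list=tf.train.Int64List(value=[4, 0, 9])),
}))

parsed = tf.io.parse_single_example(
    example.SerializeToString(),
    {"tags": tf.io.VarLenFeature(tf.int64)})

sparse = parsed["tags"]                 # a tf.sparse.SparseTensor
dense = tf.sparse.to_dense(sparse)      # back to a regular tensor
print(dense.numpy().tolist())           # [4, 0, 9]
```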

tf.FixedLenFeature(shape,dtype)

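shape: the shape of the data stored under this feature; usually left as an empty list, which means a single scalar value
dtype: the data type, which must match the type written to the file; only float32, int64, and string are allowed
return: the parsed feature comes back as a dense, fixed-length Tensor (zeros are stored as well)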
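The tf.TFRecordReader and tf.parse_single_example names used in this article belong to the TensorFlow 1.x API. As a hedged sketch of the same write-then-parse flow under TensorFlow 2.x (an assumption, as are the file name demo.tfrecords and the feature names "label" and "scores"), the records can be read with tf.data.TFRecordDataset and parsed with tf.io.parse_single_example:

```python
import os
import tempfile

import tensorflow as tf  # assumes TensorFlow 2.x (eager execution)

# Hypothetical file and feature names, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecords")

# Write three tiny samples.
with tf.io.TFRecordWriter(path) as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
            "scores": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5, 1.5])),
        }))
        writer.write(example.SerializeToString())

# The parsing spec must use the same keys and dtypes as the writer.
spec = {
    "label": tf.io.FixedLenFeature([], tf.int64),      # scalar
    "scores": tf.io.FixedLenFeature([2], tf.float32),  # vector of length 2
}

dataset = tf.data.TFRecordDataset(path).map(
    lambda record: tf.io.parse_single_example(record, spec))

labels = [int(parsed["label"].numpy()) for parsed in dataset]
print(labels)  # [0, 1, 2]
```

An empty shape yields a scalar per sample, while [2] yields a vector of two floats; a mismatch between the declared shape and the number of stored values makes parsing fail.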

Example code

import tensorflow as tf     # TensorFlow 1.x queue-based API
from PIL import Image

filename = 'png.tfrecords'
file_queue = tf.train.string_input_producer([filename], shuffle=True)
reader = tf.TFRecordReader()
key, value = reader.read(file_queue)

### the keys in features must match those used when writing, and so must the dtypes; shape may be empty
dict_data = tf.parse_single_example(value, features={'label': tf.FixedLenFeature(shape=(), dtype=tf.int64),
                                                     'image_raw': tf.FixedLenFeature(shape=(), dtype=tf.string)})
label = tf.cast(dict_data['label'], tf.int32)
img = tf.decode_raw(dict_data['image_raw'], tf.uint8)       ### decode the bytes string into numbers (here uint8)
image_tensor = tf.reshape(img, [500, 500, -1])              ### assumes every image is 500 x 500

sess = tf.Session()
sess.run(tf.local_variables_initializer())
tf.train.start_queue_runners(sess=sess)

while 1:
    ### loops forever, showing one image per iteration
    # print(sess.run(key))        # b'png.tfrecords:0'
    image = sess.run(image_tensor)
    img_PIL = Image.fromarray(image)
    img_PIL.show()
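The raw bytes stored in the image_raw feature carry no dtype or shape information, which is why the reading code has to supply both again. A small NumPy sketch of the same round trip, using a made-up 2 x 2 RGB array in place of a real image:

```python
import numpy as np

# A made-up 2 x 2 RGB "image" for illustration.
image = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)

raw = image.tobytes()       # what gets stored in the bytes feature
print(len(raw))             # 12

# Decoding needs the original dtype and shape to be supplied again.
restored = np.frombuffer(raw, dtype=np.uint8).reshape(2, 2, 3)
print(np.array_equal(image, restored))  # True

# With the wrong dtype the bytes are silently misinterpreted.
wrong = np.frombuffer(raw, dtype=np.uint16)
print(wrong.shape)          # (6,)
```

One way to avoid hard-coding the shape when reading is to store the height, width, and channel count as extra int64 features next to the image bytes.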

References:

https://blog.csdn.net/chengshuhao1991/article/details/78656724

https://www.cnblogs.com/yanshw/articles/12419616.html