How to Convert notMNIST to MNIST Format

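Anyone familiar with machine learning will know MNIST. Yaroslav Bulatov, an engineer at Google, created notMNIST, which is similar to MNIST: the images are 28x28 and there are also 10 labels (A-J).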

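TensorFlow already ships a function for reading the MNIST dataset, read_data_sets() (it lives under tensorflow.contrib in TensorFlow 1.x; that package was removed in 2.0):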

from tensorflow.contrib.learn.python.learn.datasets.mnist import read_data_sets

mnist = read_data_sets("data", one_hot=True, reshape=False, validation_size=0)
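The one_hot=True argument asks read_data_sets() to return each label as a 10-element vector containing a single 1 instead of an integer class id. The following is a minimal NumPy sketch of that encoding; to_one_hot is a hypothetical helper written for illustration, not TensorFlow's own implementation.

```python
import numpy

def to_one_hot(labels, num_classes=10):
    """Turn integer class ids (0-9) into one-hot row vectors."""
    labels = numpy.asarray(labels, dtype=numpy.int64)
    # Row i of the identity matrix is the one-hot vector for class i
    return numpy.eye(num_classes, dtype=numpy.float32)[labels]

# Labels 0, 2 and 9 correspond to the letters A, C and J in notMNIST
encoded = to_one_hot([0, 2, 9])
print(encoded.shape)
print(encoded[1])
```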

However, the notMNIST format is not quite the same as the MNIST format, so a TensorFlow model written for MNIST cannot read the notMNIST images directly.

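Someone on GitHub has written conversion code (https://github.com/davidflanagan/notMNIST-to-MNIST). After conversion, the data can be read directly with read_data_sets(), so the model code needs very little change. This article consists of the annotations I made while reading through that code.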

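 import numpy, imageio, glob, sys, os, random

 # imageio provides a simple interface for reading and writing image data

 # glob works like a file search and matches file names with just three wildcards: "*", "?" and "[]". "*" matches zero or more characters; "?" matches a single character; "[]" matches any character in the given range, e.g. [0-9] matches a digit. (It is imported here but not actually used below.)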

 def get_labels_and_files(folder, number):
   # Make a list of lists of files for each label
   filelists = []
   for label in range(0,10):
     filelist = []
     filelists.append(filelist)
     # label runs from 0 to 9, so chr(ord('A') + label) yields 'A' through 'J'
     # and dirname becomes folder/[A-J]
     dirname = os.path.join(folder, chr(ord('A') + label))
     # os.listdir returns a list of the names of the entries in that directory
     for file in os.listdir(dirname):
       if (file.endswith('.png')):
         fullname = os.path.join(dirname, file)
         if (os.path.getsize(fullname) > 0):
           filelist.append(fullname)
         else:
           print('file ' + fullname + ' is empty')
     # sort each list of files so they start off in the same order
     # regardless of the order the OS returns them in
     filelist.sort()

   # Take the specified number of items for each label and
   # build them into an array of (label, filename) pairs
   # If the RNG was seeded (see main), we get the same sample each run
   labelsAndFiles = []
   for label in range(0,10):
     # randomly sample the requested number of file names
     # (raises ValueError if the folder holds fewer than `number` files)
     filelist = random.sample(filelists[label], number)
     for filename in filelist:
       # A Python tuple is like a list except that its elements cannot be
       # modified; tuples are written with parentheses, lists with brackets
       labelsAndFiles.append((label, filename))
   return labelsAndFiles

 def make_arrays(labelsAndFiles):
   images = []
   labels = []
   for i in range(0, len(labelsAndFiles)):
     # display progress, since this can take a while
     if (i % 100 == 0):
       # "\r" moves the cursor back to the start of the line, so each
       # update overwrites the previous one
       sys.stdout.write("\r%d%% complete" % ((i * 100)/len(labelsAndFiles)))
       sys.stdout.flush()
     filename = labelsAndFiles[i][1]
     try:
       image = imageio.imread(filename)
       images.append(image)
       labels.append(labelsAndFiles[i][0])
     except Exception:
       # If this happens we won't have the requested number
       print("\nCan't read image file " + filename)
   count = len(images)
   imagedata = numpy.zeros((count,28,28), dtype=numpy.uint8)
   labeldata = numpy.zeros(count, dtype=numpy.uint8)
   # iterate over the images actually read; looping to len(labelsAndFiles)
   # would raise IndexError whenever a file above failed to load
   for i in range(0, count):
     imagedata[i] = images[i]
     labeldata[i] = labels[i]
   print("\n")
   return imagedata, labeldata

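 def write_labeldata(labeldata, outputfile):
   # IDX header: magic number 0x0801 (unsigned bytes, 1 dimension) followed by
   # the item count, both stored as big-endian 32-bit integers ('>i4')
   header = numpy.array([0x0801, len(labeldata)], dtype='>i4')
   # Open the file for writing in binary mode. The with statement guarantees
   # the file is closed when the block exits, even if an exception is raised.
   with open(outputfile, "wb") as f:
     # write the raw bytes
     f.write(header.tobytes())
     f.write(labeldata.tobytes())

 def write_imagedata(imagedata, outputfile):
   # magic number 0x0803 (unsigned bytes, 3 dimensions), then count, rows, columns
   header = numpy.array([0x0803, len(imagedata), 28, 28], dtype='>i4')
   with open(outputfile, "wb") as f:
     f.write(header.tobytes())
     f.write(imagedata.tobytes())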

 def main(argv):
   # Uncomment the line below if you want to seed the random
   # number generator in the same way I did to produce the
   # specific data files in this repo.
   # random.seed(int("notMNIST", 36))
   # With the same seed, the same random numbers are generated on every run;
   # without a seed, each run produces different ones.

   # argv[1]: notMNIST folder, argv[2]: number of images to take per label
   labelsAndFiles = get_labels_and_files(argv[1], int(argv[2]))
   # shuffle into random order
   random.shuffle(labelsAndFiles)
   imagedata, labeldata = make_arrays(labelsAndFiles)
   # argv[3]: output label file, argv[4]: output image file
   write_labeldata(labeldata, argv[3])
   write_imagedata(imagedata, argv[4])

 # Make a script both importable and executable:
 # when this .py file is run directly, __name__ == '__main__' is True;
 # when it is imported by another module, __name__ != '__main__',
 # so main() is not executed.
 if __name__=='__main__':
   main(sys.argv)
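The effect of the commented-out random.seed(int("notMNIST", 36)) line can be checked without any image files. The sketch below mirrors the sample-then-shuffle steps of the script on made-up file names; sample_per_label and fake_lists are hypothetical names introduced only for this illustration.

```python
import random

def sample_per_label(filelists, number, seed=None):
    """Pick `number` items from each list, then shuffle the (label, item) pairs."""
    if seed is not None:
        random.seed(seed)
    pairs = []
    for label, filelist in enumerate(filelists):
        for filename in random.sample(filelist, number):
            pairs.append((label, filename))
    random.shuffle(pairs)
    return pairs

# Made-up file names standing in for the contents of the A/ and B/ folders
fake_lists = [["A/%d.png" % i for i in range(20)],
              ["B/%d.png" % i for i in range(20)]]

seed = int("notMNIST", 36)  # same seed expression as in main()
first = sample_per_label(fake_lists, 5, seed)
second = sample_per_label(fake_lists, 5, seed)
print(first == second)  # True: same seed, same sample and same order
```

Without the seed argument, two calls would generally return different samples, which is why the conversion only reproduces a particular pair of output files when the seed line is enabled.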

Usage

Download and extract notMNIST:

curl -o notMNIST_small.tar.gz http://yaroslavvb.com/upload/notMNIST/notMNIST_small.tar.gz

curl -o notMNIST_large.tar.gz http://yaroslavvb.com/upload/notMNIST/notMNIST_large.tar.gz

tar xzf notMNIST_small.tar.gz

tar xzf notMNIST_large.tar.gz

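Run the conversion script. As main() shows, it takes four arguments: the notMNIST folder, the number of images to sample per label (shown here as the placeholder <count>), the output label file, and the output image file. For example, a count of 1000 yields 10 x 1000 = 10,000 images, the size of the MNIST test set. The data directory must already exist.

python convert_to_mnist_format.py notMNIST_small <count> data/t10k-labels-idx1-ubyte data/t10k-images-idx3-ubyte

python convert_to_mnist_format.py notMNIST_large <count> data/train-labels-idx1-ubyte data/train-images-idx3-ubyte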

gzip data/*ubyte
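To sanity-check the generated files before feeding them to read_data_sets(), one option is to parse the IDX header by hand. The sketch below assumes the layout written by write_labeldata() and write_imagedata() above (magic number followed by big-endian 32-bit dimension sizes, then unsigned bytes); read_idx is a hypothetical helper, and the self-check runs on a tiny made-up file in a temporary directory rather than on real notMNIST data.

```python
import gzip
import os
import struct
import tempfile

import numpy

def read_idx(path):
    """Read a (possibly gzip-compressed) IDX file into a uint8 NumPy array."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        data = f.read()
    # Bytes 0-1 are zero, byte 2 is the data type (0x08 = unsigned byte),
    # byte 3 is the number of dimensions
    zeros, dtype_code, ndim = struct.unpack(">HBB", data[:4])
    if zeros != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file: " + path)
    # Each dimension size is a big-endian 32-bit integer
    shape = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    body = numpy.frombuffer(data, dtype=numpy.uint8, offset=4 + 4 * ndim)
    return body.reshape(shape)

# Self-check on a tiny made-up label file written the same way as write_labeldata()
demo_labels = numpy.array([0, 3, 9], dtype=numpy.uint8)
demo_path = os.path.join(tempfile.mkdtemp(), "demo-labels-idx1-ubyte.gz")
with gzip.open(demo_path, "wb") as f:
    f.write(numpy.array([0x0801, len(demo_labels)], dtype=">i4").tobytes())
    f.write(demo_labels.tobytes())

loaded = read_idx(demo_path)
print(loaded.shape, loaded.tolist())
```

Applied to the converted data, read_idx("data/t10k-images-idx3-ubyte.gz") should return an array whose shape is (number of images, 28, 28), and the matching label file should have the same first dimension.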
