Data Formats and Data Loading Options in PDNN
The training and validation data are specified on the command line as variables, in the following way:
--train-data "train.pfile,context=5,ignore-label=0:3-9,map-label=1:0/2:1,partition=1000m"
--valid-data "valid.pfile,stream=False,random=True"
The part before the first comma (if there is one) specifies the file name.
Glob-style wildcards can be used to specify multiple files (currently not supported for Kaldi data files).
Data files may also be compressed with gzip or bz2, in which case the original extension is followed by an additional ".gz" or ".bz2".
After the file name, you may specify any number of data loading options in the format "key=value". The functions of these options are described in the sections below.
Supported Data Formats
PDNN currently supports three data formats: PFiles, Python pickle files, and Kaldi files.
PFiles
PFile is the ICSI feature file archive format. PFiles have the extension ".pfile". A PFile can store multiple sentences, each of which is a sequence of frames.
Each frame is associated with a feature vector and one or more labels. Below are the contents of an example PFile:
| Sentence ID | Frame ID | Feature Vector | Class Label |
| 0 | 0 | [0.2, 0.3, 0.5, 1.4, 1.8, 2.5] | 10 |
| 0 | 1 | [1.3, 2.1, 0.3, 0.1, 1.4, 0.9] | 179 |
| 1 | 0 | [0.3, 0.5, 0.5, 1.4, 0.8, 1.4] | 32 |
For speech processing, sentences and frames correspond to utterances and frames, respectively. Frames are indexed within each sentence.
For other applications, you can use fake sentence and frame indices.
For example, if you have N instances, you may set all the sentence indices to 0, and let the frame indices run from 0 to N-1.
A standard toolkit for manipulating PFiles is pfile_utils-v0_51. This script will install it automatically if you are running Linux. HTK users can use this Python script to convert HTK features and labels into PFiles. Refer to the comments at the top of the script for more information.
Python Pickle Files
Python pickle files have the extension ".pickle" or ".pkl". A Python pickle file serializes a tuple of two numpy arrays, (feature, label). There is no notion of "sentences" in pickle files; in other words, a pickle file stores exactly one sentence. feature is a 2-D numpy array, where each row is the feature vector of one instance; label is a 1-D numpy array, where each element is the class label of one instance.
To read a (gzip-compressed) pickle file in Python:
import cPickle, numpy, gzip
with gzip.open('filename.pkl.gz', 'rb') as f:
    feature, label = cPickle.load(f)
To create a (gzip-compressed) pickle file in Python:
import cPickle, numpy, gzip
feature = numpy.array([[0.2, 0.3, 0.5, 1.4], [1.3, 2.1, 0.3, 0.1], [0.3, 0.5, 0.5, 1.4]], dtype = 'float32')
label = numpy.array([2, 0, 1])
with gzip.open('filename.pkl.gz', 'wb') as f:
    cPickle.dump((feature, label), f)
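The snippets above use the Python 2 cPickle module. Under Python 3, the same file format can be handled with the built-in pickle module. This is a sketch; the encoding='latin1' argument is only needed when reading files that were written by Python 2:

```python
import gzip
import pickle

import numpy

# Write a (feature, label) tuple in the same format as above.
feature = numpy.array([[0.2, 0.3], [1.3, 2.1]], dtype='float32')
label = numpy.array([1, 0])
with gzip.open('filename.pkl.gz', 'wb') as f:
    pickle.dump((feature, label), f)

# Read it back; encoding='latin1' helps with files written by Python 2.
with gzip.open('filename.pkl.gz', 'rb') as f:
    feature2, label2 = pickle.load(f, encoding='latin1')
```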
Kaldi Files
In PDNN, Kaldi data files are specified with Kaldi script files, which have the extension ".scp". These files contain pointers to the actual feature data, which is stored in Kaldi archive files with the extension ".ark". A Kaldi script file specifies the name of an utterance (equivalent to a sentence in PFiles) and its offset in the Kaldi archive file, like this:
utt01 train.ark:15213
The labels corresponding to the features are stored in "alignment files" with the extension ".ali". To specify an alignment file, use the option "label=filename.ali". Alignment files are plain text files, in which each line specifies the name of an utterance, followed by the label of each frame in this utterance. For example:
utt01 0 51 51 51 51 51 51 48 48 7 7 7 7 51 51 51 51 48
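As an illustration of the format (a sketch, not PDNN's actual loader), such a file can be parsed with a few lines of Python:

```python
def parse_alignment(path):
    """Read an alignment file: each line holds an utterance name
    followed by one integer label per frame."""
    labels = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts:
                labels[parts[0]] = [int(x) for x in parts[1:]]
    return labels

# A tiny example file in the format shown above.
with open('demo.ali', 'w') as f:
    f.write('utt01 0 51 51 48\n')

ali = parse_alignment('demo.ali')  # {'utt01': [0, 51, 51, 48]}
```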
On-the-Fly Context Padding and Label Manipulation
Oftentimes, we want to include the features of neighboring frames in the feature vector of the current frame. Of course you can do this when you prepare the data, but this makes the files much larger. A smarter way is to perform context padding on the fly; PDNN provides the option "context" for this purpose. Specifying "context=5" pads each frame with the 5 frames on either side, so a feature vector becomes 11 times its original dimensionality. Specifying "context=5:1" pads each frame with 5 frames to the left and 1 frame to the right. Alternatively, you may specify "lcxt=5,rcxt=1". Context padding does not cross sentence boundaries: at the beginning and end of each sentence, the first or last frame is repeated when the context reaches beyond the sentence boundary.
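To make the behavior concrete, here is a minimal sketch of context padding for a single sentence. The function pad_context and its arguments are illustrative names, not PDNN API, and PDNN's own implementation differs in detail:

```python
import numpy

def pad_context(frames, lcxt, rcxt):
    """Concatenate each frame with its lcxt left and rcxt right
    neighbours; the first/last frame is repeated where the context
    would cross the sentence boundary."""
    n = len(frames)
    padded = []
    for i in range(n):
        window = [frames[min(max(i + d, 0), n - 1)]
                  for d in range(-lcxt, rcxt + 1)]
        padded.append(numpy.concatenate(window))
    return numpy.array(padded)

feats = numpy.arange(6, dtype='float32').reshape(3, 2)  # 3 frames, dim 2
out = pad_context(feats, 5, 1)                          # "context=5:1"
# Each output vector has dimension (5 + 1 + 1) * 2 = 14.
```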
Some frames in the data files may be garbage frames (i.e. they do not belong to any of the classes to be classified), but they are important in making up the context for useful frames. To ignore such frames, you can assign a special class label (say c) to these frames, and specify the option "ignore-label=c". The garbage frames will be discarded; but the context of neighboring frames will still be correct, as the garbage frames are only discarded after context padding happens. Sometimes you may also want to train a classifier for only a subset of the classes in a data file. In such cases, you may specify multiple class labels to be ignored, e.g. "ignore-label=0:2:7-9". Multiple class labels are separated by colons; contiguous class labels may be specified with a dash.
When training a classifier of N classes, PDNN requires that their class labels be 0, 1, ..., N-1. When you ignore some class labels, the remaining class labels may not form such a sequence. In this situation, you may use the "map-label" option to map the remaining class labels to 0, 1, ..., N-1. For example, to map the classes 1, 3, 4, 5, 6 to 0, 1, 2, 3, 4, you can specify "map-label=1:0/3:1/4:2/5:3/6:4". Each pair of labels are separated by a colon; pairs are separated by slashes. The label mapping happens after unwanted labels are discarded; all the mappings are applied simultaneously (therefore class 3 is mapped to class 1 and is not further mapped to class 0). You may also use this option to merge classes. For example, "map-label=1:0/3:1/4-6:2" will map all the labels 4, 5, 6 to class 2.
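The combined effect of the two options can be sketched as follows (illustrative code, not PDNN's implementation; the discard happens after context padding, as noted above):

```python
import numpy

def filter_and_map(feature, label, ignore, mapping):
    """Drop instances whose label is in `ignore`, then remap the
    remaining labels; all mappings are applied simultaneously."""
    keep = ~numpy.isin(label, list(ignore))
    feature, label = feature[keep], label[keep]
    label = numpy.array([mapping.get(int(l), int(l)) for l in label])
    return feature, label

feature = numpy.zeros((5, 2), dtype='float32')
label = numpy.array([1, 0, 3, 4, 1])
# Equivalent of "ignore-label=0" and "map-label=1:0/3:1/4:2".
f2, l2 = filter_and_map(feature, label, {0}, {1: 0, 3: 1, 4: 2})
# l2 is now [0, 1, 2, 0]
```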

Partitions, Streaming and Shuffling
The training / validation corpus may be too large to fit in the CPU or GPU memory. Therefore it is broken down into several levels of units: files, partitions, and minibatches. Such division happens after context padding and label manipulation, so the concept of "sentences" is no longer relevant. As a result, a sentence may be broken up across partitions or minibatches.
Both the training and validation corpora may consist of multiple files that can be matched by a single glob-style pattern. At any point in time, at most one file is held in the CPU memory. This means if you have multiple files, all the files will be reloaded every epoch. This can be very inefficient; you can avoid this inefficiency by lumping all the data into a single file if they can fit in the CPU memory.
A partition is the amount of data that is fed to the GPU at a time. For pickle files, a partition is always an entire file; for other files, you may specify the partition size with the option "partition", e.g. "partition=1000m". The partition size is specified in megabytes (2^20 bytes); the suffix "m" is optional. The default partition size is 600 MB.
Files may be read in either the "stream" or the "non-stream" mode, controlled by the option "stream=True" or "stream=False". In the non-stream mode, an entire file is kept in the CPU memory. If there is only one file in the training / validation corpus, the file is loaded only once (and this is efficient). In the stream mode, only a partition is kept in the CPU memory. This is useful when the corpus is too large to fit in the CPU memory. Currently, PFiles can be loaded in either the stream mode or the non-stream mode; pickle files can only be loaded in the non-stream mode; Kaldi files can only be loaded in the stream mode.
It is usually desirable that instances of different classes be mixed evenly in the training data. To achieve this, you may specify the option "random=True". This option shuffles the order of the training instances loaded into the CPU memory at a time: in the stream mode, instances are shuffled partition by partition; in the non-stream mode, instances are shuffled across an entire file. The latter achieves better mixing, so it is again recommended to turn off the stream mode when the files can fit in the CPU memory.
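Conceptually, shuffling applies one shared permutation to the features and labels held in memory, so that feature/label pairs stay aligned. A sketch:

```python
import numpy

rng = numpy.random.default_rng(0)         # fixed seed for reproducibility
feature = numpy.arange(10).reshape(5, 2)  # row i is [2*i, 2*i + 1]
label = numpy.arange(5)

perm = rng.permutation(len(label))        # one permutation for both arrays
feature, label = feature[perm], label[perm]
```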
A minibatch is the amount of data consumed by the training procedure between successive updates of the model parameters. The minibatch size is not specified as a data loading option, but as a separate command-line argument to the training scripts. A partition may not consist of a whole number of minibatches; the last instances in each partition that are not enough to make a minibatch are discarded.
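For example, the number of instances in a partition that actually get used can be computed as follows (a hypothetical helper, not a PDNN function):

```python
def usable_instances(n_instances, batch_size):
    """Instances that form whole minibatches; the remainder is discarded."""
    return (n_instances // batch_size) * batch_size

# 1000 instances with a minibatch size of 256: 3 full minibatches,
# i.e. 768 instances used and 232 discarded.
n_used = usable_instances(1000, 256)
```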