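Deep Learning with CNTK (2): Training an RNN-Based Language Model
The previous post, Deep Learning with CNTK (1): Getting Started, showed how to build a simple feed-forward neural network with CNTK, so I assume readers now know the basics of using CNTK. This time we build something slightly more complex, and very popular in natural language processing: a language model based on a recurrent neural network.
Drawn as a diagram, a recurrent neural network (RNN) is a network whose hidden layer connects back to itself (this is, of course, only one kind of RNN):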

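Unlike an ordinary neural network, an RNN does not assume that samples are independent of one another. Suppose we want to predict the character that follows “上”: the characters that came before it matter a great deal. If “工作” (work) appeared earlier, the text is probably saying “上班” (go to work); if “家乡” (hometown) appeared earlier, it is probably “上海” (Shanghai). An RNN can learn such sequential features well. Put simply, an RNN treats the hidden-layer values from the previous time step as an additional kind of feature and feeds them in as part of the input at the next time step.
Here we build the following kind of language model: given a word, predict the word most likely to come next.
The input to this RNN is a dim-dimensional vector, where dim equals the vocabulary size. The input vector is 1 only in the component that represents the current word and 0 everywhere else, i.e. [0,0,0,...0,1,0,...0]. The output is also a dim-dimensional vector, giving the probability of each word appearing next.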
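To make the one-hot input and the recurrence concrete, here is a minimal pure-Python sketch of a single RNN time step. It is an illustration only, not the CNTK implementation: the toy vocabulary, the weight values, and the helper names (one_hot, rnn_step, and so on) are all made up for this example, and it uses a plain softmax over the whole vocabulary rather than the class-based output used later in this post.

```python
import math

def one_hot(index, dim):
    """Return a dim-length list that is 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * dim
    vec[index] = 1.0
    return vec

def mat_vec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x, h_prev, w_in, w_rec, w_out):
    """One time step: the previous hidden state is fed back in with the input."""
    pre = [a + b for a, b in zip(mat_vec(w_in, x), mat_vec(w_rec, h_prev))]
    h = [sigmoid(z) for z in pre]
    return h, softmax(mat_vec(w_out, h))

# Hypothetical toy setup: 4-word vocabulary, 2 hidden units, arbitrary weights.
vocab = ["</s>", "work", "go", "home"]
dim = len(vocab)
hidden = 2
w_in = [[0.1, 0.2, -0.1, 0.0], [0.0, -0.2, 0.3, 0.1]]       # hidden x dim
w_rec = [[0.5, -0.5], [0.2, 0.1]]                           # hidden x hidden
w_out = [[0.3, -0.1], [0.0, 0.2], [-0.2, 0.4], [0.1, 0.1]]  # dim x hidden

h = [0.1] * hidden  # arbitrary initial hidden state
for word in ["work", "go"]:
    x = one_hot(vocab.index(word), dim)
    h, probs = rnn_step(x, h, w_in, w_rec, w_out)

print(probs)  # a distribution over the 4 vocabulary words
```

Because the hidden state h is carried from one iteration to the next, the prediction after "go" depends on having seen "work" first, which is exactly the sequential dependence described above.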
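Building an RNN model in CNTK differs from building an ordinary neural network in two main ways:
(1) Input format. The input is now text split into sentences, and the words within a sentence are ordered, so the input must be specified in the LMSequenceReader format. This format is cumbersome (another gripe: I do not fully understand it myself, so I will not explain it in detail; you can work it out from the config that follows).
(2) Model. A recurrent model is required, which mainly comes down to using the Delay() function.
A working configuration follows (the official tutorial cost me a lot of time yet again; this code is adapted from CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\Config ):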
# Parameters can be overwritten on the command line
# for example: cntk configFile=myConfigFile RootDir=../..
# For running from Visual Studio add
# currentDirectory=$(SolutionDir)/<path to corresponding data folder>
RootDir = ".."

ConfigDir = "$RootDir$/Config"
DataDir = "$RootDir$/Data"
OutputDir = "$RootDir$/Output"
ModelDir = "$OutputDir$/Models"

# deviceId=-1 for CPU, >=0 for GPU devices, "auto" chooses the best GPU, or CPU if no usable GPU is available
deviceId = "-1"

command = writeWordAndClassInfo:train
#command = write

precision = "float"
traceLevel =
modelPath = "$ModelDir$/rnn.dnn"

# uncomment the following line to write logs to a file
stderr=$OutputDir$/rnnOutput

type = double
numCPUThreads =

confVocabSize =
confClassSize =

#trainFile = "ptb.train.txt"
trainFile = "review_tokens_split_first5w_lines.txt"
#validFile = "ptb.valid.txt"
testFile = "review_tokens_split_first10_lines.txt"

writeWordAndClassInfo = [
action = "writeWordAndClass"
inputFile = "$DataDir$/$trainFile$"
outputVocabFile = "$ModelDir$/vocab.txt"
outputWord2Cls = "$ModelDir$/word2cls.txt"
outputCls2Index = "$ModelDir$/cls2idx.txt"
vocabSize = "$confVocabSize$"
nbrClass = "$confClassSize$"
cutoff =
printValues = true
]

#######################################
# TRAINING CONFIG #
#######################################

train = [
action = "train"
minibatchSize =
traceLevel =
epochSize =
recurrentLayer =
defaultHiddenActivity = 0.1
useValidation = true
rnnType = "CLASSLM"

# uncomment below and comment SimpleNetworkBuilder section to use NDL to train RNN LM
NDLNetworkBuilder=[
networkDescription="D:\tools\Deep Learning\CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\AdditionalFiles\RNNLM\rnnlm.ndl"
]

SGD = [
learningRatesPerSample = 0.1
momentumPerMB =
gradientClippingWithTruncation = true
clippingThresholdPerSample = 15.0
maxEpochs =
unroll = false
numMBsToShowResult =
gradUpdateType = "none"
loadBestModel = true

# settings for Auto Adjust Learning Rate
AutoAdjust = [
autoAdjustLR = "adjustAfterEpoch"
reduceLearnRateIfImproveLessThan = 0.001
continueReduce = false
increaseLearnRateIfImproveMoreThan =
learnRateDecreaseFactor = 0.5
learnRateIncreaseFactor = 1.382
numMiniBatch4LRSearch =
numPrevLearnRates =
numBestSearchEpoch =
]

dropoutRate = 0.0
]

reader = [
readerType = "LMSequenceReader"
randomize = "none"
nbruttsineachrecurrentiter =

# word class info
wordclass = "$ModelDir$/vocab.txt"

# if writerType is set, we will cache to a binary file
# if the binary file exists, we will use it instead of parsing this file
# writerType=BinaryReader

# write definition
wfile = "$OutputDir$/sequenceSentence.bin"

# wsize - initial size of the file in MB
# if calculated size would be bigger, that is used instead
wsize =

# wrecords - number of records we should allocate space for in the file
# files cannot be expanded, so this should be large enough. If known modify this element in config before creating file
wrecords =

# windowSize - number of records we should include in BinaryWriter window
windowSize = "$confVocabSize$"

file = "$DataDir$/$trainFile$"

# additional features sections
# for now store as expanded category data (including label in)
features = [
# sentence has no features, so need to set dimension to zero
dim =
# write definition
sectionType = "data"
]

# sequence break table, list indexes into sequence records, so we know when a sequence starts/stops
sequence = [
dim =
wrecords =
# write definition
sectionType = "data"
]

#labels sections
labelIn = [
dim =
labelType = "Category"
beginSequence = "</s>"
endSequence = "</s>"

# vocabulary size
labelDim = "$confVocabSize$"
labelMappingFile = "$OutputDir$/sentenceLabels.txt"

# Write definition
# sizeof(unsigned) which is the label index type
elementSize =
sectionType = "labels"
mapping = [
# redefine number of records for this section, since we don't need to save it for each data record
wrecords =
# variable size so use an average string size
elementSize =
sectionType = "labelMapping"
]

category = [
dim =
# elementSize = sizeof(ElemType) is default
sectionType = "categoryLabels"
]
]

# labels sections
labels = [
dim =
labelType = "NextWord"
beginSequence = "O"
endSequence = "O"

# vocabulary size
labelDim = "$confVocabSize$"
labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"

# Write definition
# sizeof(unsigned) which is the label index type
elementSize =
sectionType = "labels"
mapping = [
# redefine number of records for this section, since we don't need to save it for each data record
wrecords =
# variable size so use an average string size
elementSize =
sectionType = "labelMapping"
]

category = [
dim =
# elementSize = sizeof(ElemType) is default
sectionType = categoryLabels
]
]
]
]

write = [
action = "write"

outputPath = "$OutputDir$/Write"
#outputPath = "-" # "-" will write to stdout; useful for debugging
outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word" # when processing one sentence per minibatch, this is the sentence posterior
#format = [
#sequencePrologue = "log P(W)=" # (using this to demonstrate some formatting strings)
#type = "real"
#]

minibatchSize = # choose this to be big enough for the longest sentence
# need to be small since models are updated for each minibatch
traceLevel =
epochSize =

reader = [
# reader to use
readerType = "LMSequenceReader"
randomize = "none" # BUGBUG: This is ignored.
nbruttsineachrecurrentiter = # one sentence per minibatch
cacheBlockSize = # workaround to disable randomization

# word class info
wordclass = "$ModelDir$/vocab.txt"

# if writerType is set, we will cache to a binary file
# if the binary file exists, we will use it instead of parsing this file
# writerType = "BinaryReader"

# write definition
wfile = "$OutputDir$/sequenceSentence.bin"
# wsize - initial size of the file in MB
# if calculated size would be bigger, that is used instead
wsize =

# wrecords - number of records we should allocate space for in the file
# files cannot be expanded, so this should be large enough. If known modify this element in config before creating file
wrecords =

# windowSize - number of records we should include in BinaryWriter window
windowSize = "$confVocabSize$"

file = "$DataDir$/$testFile$"

# additional features sections
# for now store as expanded category data (including label in)
features = [
# sentence has no features, so need to set dimension to zero
dim =
# write definition
sectionType = "data"
]

#labels sections
labelIn = [
dim =

# vocabulary size
labelDim = "$confVocabSize$"
labelMappingFile = "$OutputDir$/sentenceLabels.txt"

labelType = "Category"
beginSequence = "</s>"
endSequence = "</s>"

# Write definition
# sizeof(unsigned) which is the label index type
elementSize =
sectionType = "labels"

mapping = [
# redefine number of records for this section, since we don't need to save it for each data record
wrecords =
# variable size so use an average string size
elementSize =
sectionType = "labelMapping"
]

category = [
dim =
# elementSize = sizeof(ElemType) is default
sectionType = "categoryLabels"
]
]

#labels sections
labels = [
dim =
labelType = "NextWord"
beginSequence = "O"
endSequence = "O"

# vocabulary size
labelDim = "$confVocabSize$"

labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"
# Write definition
# sizeof(unsigned) which is the label index type
elementSize =
sectionType = "labels"

mapping = [
# redefine number of records for this section, since we don't need to save it for each data record
wrecords =
# variable size so use an average string size
elementSize =
sectionType = "labelMapping"
]

category = [
dim =
# elementSize = sizeof(ElemType) is default
sectionType = "categoryLabels"
]
]
]
]
rnnlm.ndl:
run=ndlCreateNetwork

ndlCreateNetwork=[
# vocabulary size
featDim=
# vocabulary size
labelDim=
# hidden layer size
hiddenDim=
# number of classes
nbrClass=

initScale=

features=SparseInput(featDim, tag="feature")

# labels in classbasedCrossEntropy is dense and contain values for each sample
labels=Input(, tag="label")

# define network
WFeat2Hid=Parameter(hiddenDim, featDim, init="uniform", initValueScale=initScale)
WHid2Hid=Parameter(hiddenDim, hiddenDim, init="uniform", initValueScale=initScale)

# WHid2Word is special that it is hiddenSize X labelSize
WHid2Word=Parameter(hiddenDim, labelDim, init="uniform", initValueScale=initScale)
WHid2Class=Parameter(nbrClass, hiddenDim, init="uniform", initValueScale=initScale)

PastHid = Delay(hiddenDim, HidAfterSig, delayTime=, needGradient=true)
HidFromHeat = Times(WFeat2Hid, features)
HidFromRecur = Times(WHid2Hid, PastHid)
HidBeforeSig = Plus(HidFromHeat, HidFromRecur)
HidAfterSig = Sigmoid(HidBeforeSig)

Out = TransposeTimes(WHid2Word, HidAfterSig) #word part

ClassProbBeforeSoftmax=Times(WHid2Class, HidAfterSig)

cr = ClassBasedCrossEntropyWithSoftmax(labels, HidAfterSig, WHid2Word, ClassProbBeforeSoftmax, tag="criterion")
EvalNodes=(Cr)
OutputNodes=(Cr)
]
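As the code shows, CNTK makes you spend a large share of your effort on the data reader.
writeWordAndClassInfo simply gathers statistics over the whole vocabulary and clusters the words. The model here is a class-based RNN: to speed up computation, the words are first partitioned into disjoint classes. The file this module outputs has four columns: word index, frequency, word, and class.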
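The idea behind a class-based output layer can be sketched in a few lines of Python. This is a hedged illustration of the general factorization P(word | h) = P(class | h) * P(word | class, h), not CNTK's ClassBasedCrossEntropyWithSoftmax code. The function names, the sample line, the scores, and the assumption that the four columns are whitespace-separated are all hypothetical.

```python
import math

def parse_vocab_line(line):
    """Split one vocabulary line into (index, frequency, word, class).

    Assumes whitespace-separated columns in the order described above;
    the real file may use a different separator.
    """
    idx, freq, word, cls = line.split()
    return int(idx), float(freq), word, int(cls)

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_based_prob(word, word2cls, class_scores, word_scores):
    """P(word | h) = P(class | h) * P(word | class, h)."""
    cls = word2cls[word]
    class_probs = softmax(class_scores)
    # Normalize only over the words that share this word's class.
    members = sorted(w for w, c in word2cls.items() if c == cls)
    member_probs = softmax([word_scores[w] for w in members])
    return class_probs[cls] * member_probs[members.index(word)]

# Made-up example: 5 words in 2 disjoint classes, arbitrary scores.
word2cls = {"the": 0, "a": 0, "cat": 1, "dog": 1, "fish": 1}
class_scores = [0.2, 1.0]
word_scores = {"the": 0.5, "a": -0.5, "cat": 1.0, "dog": 0.3, "fish": -0.2}

total = sum(class_based_prob(w, word2cls, class_scores, word_scores)
            for w in word2cls)
print(parse_vocab_line("0 120 the 0"))
print(total)  # the factorized probabilities still sum to 1 (up to rounding)
```

The speed-up comes from normalizing over the number of classes plus the size of one class, instead of over the full vocabulary, for every training word.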
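train, of course, trains the model. With a large amount of text, training is still quite slow.
write is the output module. Note this line: outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word"
What most people probably care about is how to get the hidden-layer values for a sentence after running it through the trained RNN. My approach is to save the trained RNN's parameters; from those parameters anyone, whether working in Java or Python, can rebuild the RNN and then do whatever they like with it.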
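As a rough sketch of that approach, the following Python rebuilds the hidden-layer recurrence from the rnnlm.ndl definition, HidAfterSig = Sigmoid(WFeat2Hid * features + WHid2Hid * PastHid), given the two saved weight matrices. Several things here are assumptions rather than facts about CNTK: that each saved parameter can be read as whitespace-separated rows of numbers (parse_matrix is a hypothetical helper, and the real output may need transposing), that the initial hidden state equals the defaultHiddenActivity value of 0.1 from the config, and the toy weights themselves.

```python
import math

def parse_matrix(text):
    """Parse rows of whitespace-separated numbers into a list of lists.

    Assumed file layout; check it against what the write action produces.
    """
    return [[float(v) for v in line.split()]
            for line in text.strip().splitlines()]

def hidden_states(word_ids, w_feat2hid, w_hid2hid, init=0.1):
    """Return the hidden vector after each word of a sentence.

    w_feat2hid is hiddenDim x vocabSize, w_hid2hid is hiddenDim x hiddenDim.
    """
    h = [init] * len(w_hid2hid)
    states = []
    for wid in word_ids:
        # Multiplying by a one-hot input just selects column `wid`.
        from_feat = [row[wid] for row in w_feat2hid]
        from_recur = [sum(w * v for w, v in zip(row, h)) for row in w_hid2hid]
        h = [1.0 / (1.0 + math.exp(-(a + b)))
             for a, b in zip(from_feat, from_recur)]
        states.append(h)
    return states

# Toy weights standing in for the saved WFeat2Hid and WHid2Hid matrices.
w_feat2hid = parse_matrix("0.1 0.2 -0.1\n0.0 -0.2 0.3")  # 2 hidden x 3 words
w_hid2hid = parse_matrix("0.5 -0.5\n0.2 0.1")            # 2 x 2

sentence = [0, 2, 1]  # word indices, e.g. looked up in vocab.txt
for step, h in enumerate(hidden_states(sentence, w_feat2hid, w_hid2hid)):
    print(step, h)
```

The last vector returned can serve as a fixed-size representation of the sentence, and the per-step vectors can be used as features for each word.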
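In train I used a model I defined myself via NDLNetworkBuilder. You can also use the generic recurrent model, in which case you only need to specify a few parameters, for example: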
SimpleNetworkBuilder=[
trainingCriterion=classcrossentropywithsoftmax
evalCriterion=classcrossentropywithsoftmax
nodeType=Sigmoid
initValueScale=6.0
layerSizes=::
addPrior=false
addDropoutNodes=false
applyMeanVarNorm=false
uniformInit=true;

# these are for the class information for class-based language modeling
vocabSize=
nbrClass=
]
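I use my own network definition here mainly because I plan to change it to an LSTM structure later.
This is an original blog post; please do not repost it without permission.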