Deep Learning with CNTK (2): Training an RNN-Based Language Model

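The previous post, Deep Learning with CNTK (1): Getting Started, showed how to build a simple feed-forward neural network with CNTK. This post assumes you already know the basics of using CNTK and moves on to something slightly more complex, and very popular in natural language processing: building a language model with a recurrent neural network.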

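Pictured as a graph, a recurrent neural network (RNN) is a network whose hidden layer connects back to itself (this is, of course, only one kind of RNN).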

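Unlike an ordinary neural network, an RNN does not assume that samples are independent of one another. For example, to predict the character that follows “上”, the characters that appeared before it matter a great deal: if “工作” (work) appeared earlier, the text is probably saying “上班” (go to work); if “家乡” (hometown) appeared earlier, it is probably “上海” (Shanghai). An RNN can learn such sequential features well. Put simply, an RNN treats the hidden-layer values from the previous time step as an additional set of features and feeds them in as part of the input at the next time step.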
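The recurrence described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the weights, sizes, and function names are made up for the example, and the initial hidden value of 0.1 is chosen to mirror the defaultHiddenActivity setting in the configuration further down.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(w_in, w_rec, x, h_prev):
    """One time step: h_t = sigmoid(W_in * x_t + W_rec * h_{t-1})."""
    from_input = matvec(w_in, x)
    from_recur = matvec(w_rec, h_prev)
    return [sigmoid(a + b) for a, b in zip(from_input, from_recur)]

# Toy sizes (assumed): 3-word vocabulary, 2 hidden units, made-up weights.
W_IN = [[0.5, -0.3, 0.1],
        [0.2, 0.4, -0.6]]
W_REC = [[0.1, 0.0],
         [0.0, 0.1]]

def run_sequence(one_hot_inputs, initial_hidden=0.1):
    """Feed the inputs in order and return the hidden state after each one."""
    h = [initial_hidden] * len(W_REC)
    states = []
    for x in one_hot_inputs:
        h = rnn_step(W_IN, W_REC, x, h)
        states.append(h)
    return states
```

Feeding the same word twice in a row gives two different hidden states, because the second step also sees the hidden state left behind by the first. That dependence on history is what lets the network tell “上班” from “上海”.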

Here we build the following kind of language model: given a word, predict the word most likely to come next.

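The input to this RNN is a dim-dimensional vector, where dim equals the vocabulary size. The input vector is 1 only in the component that represents the current word and 0 everywhere else, i.e. [0,0,0,...0,1,0,...0]. The output is also a dim-dimensional vector, giving the probability of each word appearing next.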
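As a small illustration of these two vectors, the sketch below builds a one-hot input and turns a vector of raw scores into a probability distribution with a softmax. The five-word vocabulary and the helper names are assumptions made for the example; they do not come from the CNTK configuration.

```python
import math

# Hypothetical toy vocabulary; a real one comes from the training text.
VOCAB = ["</s>", "go", "to", "work", "Shanghai"]
WORD_TO_INDEX = {word: i for i, word in enumerate(VOCAB)}

def one_hot(index, dim):
    """Vector of length dim with a 1 at position index and 0 elsewhere."""
    vec = [0.0] * dim
    vec[index] = 1.0
    return vec

def encode(word):
    """One-hot input vector for a word in the toy vocabulary."""
    return one_hot(WORD_TO_INDEX[word], len(VOCAB))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Here `encode("work")` has a single 1 in the fourth position, and `softmax` applied to any dim-dimensional score vector gives one probability per vocabulary word.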

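Building an RNN model in CNTK differs from building an ordinary neural network in two main ways:

(1) Input format. The input is now text split into sentences, and the words within a sentence are ordered, so the input has to be specified in the LMSequenceReader format. This format is a hassle (to grumble once more: I don't fully understand it myself, so I won't explain it in detail; you can work it out from the example configuration).

(2) Model. A recurrent model is required, which mainly comes down to using the Delay() function.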
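To make the input side a little more concrete, here is a conceptual sketch of the (current word, next word) pairs that a next-word model is trained on. It is not how LMSequenceReader is implemented. It assumes one sentence per line of whitespace-separated tokens and a boundary token of `</s>` at both ends, and the function name is hypothetical.

```python
def next_word_pairs(sentence, boundary="</s>"):
    """Return (input word, word to predict) pairs for one sentence.

    The boundary token is added at both ends so that the model also
    learns which words start and end a sentence.
    """
    tokens = [boundary] + sentence.split() + [boundary]
    return list(zip(tokens[:-1], tokens[1:]))
```

For the line `I go to work` this yields five pairs, starting with `("</s>", "I")` and ending with `("work", "</s>")`. Pairs are never formed across two sentences, which is why the reader needs to know where each sentence begins and ends.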

A working configuration follows (the official tutorial tripped me up for a long time once again; the code here is adapted from CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\Config):

# Parameters can be overwritten on the command line

# for example: cntk configFile=myConfigFile RootDir=../..

# For running from Visual Studio add

# currentDirectory=$(SolutionDir)/<path to corresponding data folder>

RootDir = ".."

ConfigDir = "$RootDir$/Config"

DataDir = "$RootDir$/Data"

OutputDir = "$RootDir$/Output"

ModelDir = "$OutputDir$/Models"

# deviceId=-1 for CPU, >=0 for GPU devices, "auto" chooses the best GPU, or CPU if no usable GPU is available

deviceId = "-1"

command = writeWordAndClassInfo:train

#command = write

precision = "float"

traceLevel =

modelPath = "$ModelDir$/rnn.dnn"

# uncomment the following line to write logs to a file

stderr=$OutputDir$/rnnOutput

type = double

numCPUThreads = 

confVocabSize =

confClassSize = 

#trainFile = "ptb.train.txt"

trainFile = "review_tokens_split_first5w_lines.txt"

#validFile = "ptb.valid.txt"

testFile = "review_tokens_split_first10_lines.txt"

writeWordAndClassInfo = [

    action = "writeWordAndClass"

    inputFile = "$DataDir$/$trainFile$"

    outputVocabFile = "$ModelDir$/vocab.txt"

    outputWord2Cls = "$ModelDir$/word2cls.txt"

    outputCls2Index = "$ModelDir$/cls2idx.txt"

    vocabSize = "$confVocabSize$"

    nbrClass = "$confClassSize$"

    cutoff =

    printValues = true

]

#######################################

#  TRAINING CONFIG                    #

#######################################

train = [

    action = "train"

    minibatchSize =

    traceLevel =

    epochSize =

    recurrentLayer =

    defaultHiddenActivity = 0.1

    useValidation = true

    rnnType = "CLASSLM"

     # uncomment below and comment SimpleNetworkBuilder section to use NDL to train RNN LM

     NDLNetworkBuilder=[

        networkDescription="D:\tools\Deep Learning\CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\AdditionalFiles\RNNLM\rnnlm.ndl"

     ]

    SGD = [

        learningRatesPerSample = 0.1

        momentumPerMB =

        gradientClippingWithTruncation = true

        clippingThresholdPerSample = 15.0

        maxEpochs =

        unroll = false

        numMBsToShowResult =

        gradUpdateType = "none"

        loadBestModel = true

        # settings for Auto Adjust Learning Rate

        AutoAdjust = [

            autoAdjustLR = "adjustAfterEpoch"

            reduceLearnRateIfImproveLessThan = 0.001

            continueReduce = false

            increaseLearnRateIfImproveMoreThan =

            learnRateDecreaseFactor = 0.5

            learnRateIncreaseFactor = 1.382

            numMiniBatch4LRSearch =

            numPrevLearnRates =

            numBestSearchEpoch =

        ]

        dropoutRate = 0.0

    ]

    reader = [

        readerType = "LMSequenceReader"

        randomize = "none"

        nbruttsineachrecurrentiter = 

        # word class info

        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file

        # if the binary file exists, we will use it instead of parsing this file

        # writerType=BinaryReader

        # write definition

        wfile = "$OutputDir$/sequenceSentence.bin"

        # wsize - initial size of the file in MB

        # if calculated size would be bigger, that is used instead

        wsize = 

        # wrecords - number of records we should allocate space for in the file

        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file

        wrecords = 

        # windowSize - number of records we should include in BinaryWriter window

        windowSize = "$confVocabSize$"

        file = "$DataDir$/$trainFile$"

        # additional features sections

        # for now store as expanded category data (including label in)

        features = [

            # sentence has no features, so need to set dimension to zero

            dim =

            # write definition

            sectionType = "data"

        ]

        # sequence break table, list indexes into sequence records, so we know when a sequence starts/stops

        sequence = [

            dim =

            wrecords =

            # write definition

            sectionType = "data"

        ]

        #labels sections

        labelIn = [

            dim =

            labelType = "Category"

            beginSequence = "</s>"

            endSequence = "</s>"

            # vocabulary size

            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.txt"

            # Write definition

            # sizeof(unsigned) which is the label index type

            elementSize =

            sectionType = "labels"

            mapping = [

                # redefine number of records for this section, since we don't need to save it for each data record

                wrecords =

                # variable size so use an average string size

                elementSize =

                sectionType = "labelMapping"

            ]

            category = [

                dim =

                # elementSize = sizeof(ElemType) is default

                sectionType = "categoryLabels"

            ]

        ]

        # labels sections

        labels = [

            dim =

            labelType = "NextWord"

            beginSequence = "O"

            endSequence = "O"

            # vocabulary size

            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"

            # Write definition

            # sizeof(unsigned) which is the label index type

            elementSize =

            sectionType = "labels"

            mapping = [

                # redefine number of records for this section, since we don't need to save it for each data record

                wrecords =

                # variable size so use an average string size

                elementSize =

                sectionType = "labelMapping"

            ]

            category = [

                dim =

                # elementSize = sizeof(ElemType) is default

                sectionType = "categoryLabels"

            ]

        ]

    ]

]

write = [

    action = "write"

    outputPath = "$OutputDir$/Write"

    #outputPath = "-"                    # "-" will write to stdout; useful for debugging

    outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word" # when processing one sentence per minibatch, this is the sentence posterior

    #format = [

        #sequencePrologue = "log P(W)="    # (using this to demonstrate some formatting strings)

        #type = "real"

    #]

    minibatchSize =               # choose this to be big enough for the longest sentence

    # need to be small since models are updated for each minibatch

    traceLevel =

    epochSize = 

    reader = [

        # reader to use

        readerType = "LMSequenceReader"

        randomize = "none"              # BUGBUG: This is ignored.

        nbruttsineachrecurrentiter =   # one sentence per minibatch

        cacheBlockSize =               # workaround to disable randomization

        # word class info

        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file

        # if the binary file exists, we will use it instead of parsing this file

        # writerType = "BinaryReader"

        # write definition

        wfile = "$OutputDir$/sequenceSentence.bin"

        # wsize - initial size of the file in MB

        # if calculated size would be bigger, that is used instead

        wsize = 

        # wrecords - number of records we should allocate space for in the file

        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file

        wrecords = 

        # windowSize - number of records we should include in BinaryWriter window

        windowSize = "$confVocabSize$"

        file = "$DataDir$/$testFile$"

        # additional features sections

        # for now store as expanded category data (including label in)

        features = [

            # sentence has no features, so need to set dimension to zero

            dim =

            # write definition

            sectionType = "data"

        ]

        #labels sections

        labelIn = [

            dim = 

            # vocabulary size

            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.txt"

            labelType = "Category"

            beginSequence = "</s>"

            endSequence = "</s>"

            # Write definition

            # sizeof(unsigned) which is the label index type

            elementSize =

            sectionType = "labels"

            mapping = [

                # redefine number of records for this section, since we don't need to save it for each data record

                wrecords =

                # variable size so use an average string size

                elementSize =

                sectionType = "labelMapping"

            ]

            category = [

                dim =

                # elementSize = sizeof(ElemType) is default

                sectionType = "categoryLabels"

            ]

        ]

        #labels sections

        labels = [

            dim =

            labelType = "NextWord"

            beginSequence = "O"

            endSequence = "O"

            # vocabulary size

            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"

            # Write definition

            # sizeof(unsigned) which is the label index type

            elementSize =

            sectionType = "labels"

            mapping = [

                # redefine number of records for this section, since we don't need to save it for each data record

                wrecords =

                # variable size so use an average string size

                elementSize =

                sectionType = "labelMapping"

            ]

            category = [

                dim =

                # elementSize = sizeof(ElemType) is default

                sectionType = "categoryLabels"

            ]

        ]

    ]

]

rnnlm.ndl:

run=ndlCreateNetwork

ndlCreateNetwork=[

    # vocabulary size

    featDim=

    # vocabulary size

    labelDim=

    # hidden layer size

    hiddenDim=

    # number of classes

    nbrClass=

    initScale=

    features=SparseInput(featDim, tag="feature")

    # labels in classbasedCrossEntropy is dense and contain  values for each sample

    labels=Input(, tag="label")

    # define network

    WFeat2Hid=Parameter(hiddenDim, featDim, init="uniform", initValueScale=initScale)

    WHid2Hid=Parameter(hiddenDim, hiddenDim, init="uniform", initValueScale=initScale)

    # WHid2Word is special that it is hiddenSize X labelSize

    WHid2Word=Parameter( hiddenDim,labelDim,  init="uniform", initValueScale=initScale)

     WHid2Class=Parameter(nbrClass, hiddenDim, init="uniform", initValueScale=initScale)

    PastHid = Delay(hiddenDim, HidAfterSig, delayTime=, needGradient=true)

    HidFromHeat = Times(WFeat2Hid, features)

    HidFromRecur = Times(WHid2Hid, PastHid)

    HidBeforeSig = Plus(HidFromHeat, HidFromRecur)

    HidAfterSig = Sigmoid(HidBeforeSig)

    Out = TransposeTimes(WHid2Word, HidAfterSig)  #word part

    ClassProbBeforeSoftmax=Times(WHid2Class, HidAfterSig)

    cr = ClassBasedCrossEntropyWithSoftmax(labels, HidAfterSig, WHid2Word, ClassProbBeforeSoftmax, tag="criterion")

    EvalNodes=(Cr)

    OutputNodes=(Cr)

]

Judging from the code, CNTK makes you spend a large share of your effort on the data reader.

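writeWordAndClassInfo simply gathers statistics over the whole vocabulary and clusters the words into classes. This is a class-based RNN: to speed up computation, the words are first partitioned into disjoint classes. The file this module writes has four columns: word index, frequency, word, and class.
train, of course, trains the model. With a large amount of text, training is still quite slow.
write is the output module. Note this line: outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word"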
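The speed-up from word classes comes from factoring the output probability as P(word | h) = P(class | h) × P(word | class, h), so each prediction normalizes over the classes and then over the words of one class instead of over the whole vocabulary. The sketch below shows that factorization in plain Python. It illustrates the general technique only, not CNTK's ClassBasedCrossEntropyWithSoftmax node; the scores, the class assignment, and the function names are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_based_prob(word, class_scores, word_scores, word_to_class):
    """P(word | h) = P(class | h) * P(word | class, h).

    class_scores: one raw score per class
    word_scores: dict mapping each word to its raw score
    word_to_class: dict mapping each word to exactly one class index
    """
    cls = word_to_class[word]
    class_probs = softmax(class_scores)
    # Normalize only over the words that share this word's class.
    members = [w for w, c in word_to_class.items() if c == cls]
    member_probs = softmax([word_scores[w] for w in members])
    return class_probs[cls] * member_probs[members.index(word)]
```

Because the classes are disjoint and cover the vocabulary, the probabilities of all words still sum to 1, even though no single softmax runs over the full vocabulary.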

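What most people probably care about is how to get the hidden-layer values for a sentence after running the trained RNN on it. My approach is to save the trained RNN's parameters. After that, whether you work in Java or Python, you can rebuild the RNN from those parameters and then do whatever you want with it.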
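A minimal sketch of that idea in plain Python follows. It assumes the two matrices have already been loaded into nested lists, with WFeat2Hid of shape hiddenDim × featDim and WHid2Hid of shape hiddenDim × hiddenDim as declared in rnnlm.ndl; parsing the files produced by the write action is left out because their layout is not described here. The function name and the initial hidden value are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_states(word_indices, w_feat2hid, w_hid2hid, initial_hidden=0.1):
    """Replay HidAfterSig = Sigmoid(WFeat2Hid*features + WHid2Hid*PastHid).

    word_indices: vocabulary indices of the words in one sentence
    Returns the hidden vector after each word.
    """
    hidden_dim = len(w_hid2hid)
    h = [initial_hidden] * hidden_dim
    states = []
    for idx in word_indices:
        new_h = []
        for row in range(hidden_dim):
            # With a one-hot input, WFeat2Hid * features is just column idx.
            from_feat = w_feat2hid[row][idx]
            from_recur = sum(w_hid2hid[row][k] * h[k] for k in range(hidden_dim))
            new_h.append(sigmoid(from_feat + from_recur))
        h = new_h
        states.append(h)
    return states
```

The last entry of the returned list is the hidden state after the whole sentence, which can serve as a sentence representation. Whether the initial hidden value should match the training configuration exactly is something to check against your own setup.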

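In train I used a model I defined myself through NDLNetworkBuilder. You can also use the generic recurrent model, in which case you only need to set some parameters, for example: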

SimpleNetworkBuilder=[

        trainingCriterion=classcrossentropywithsoftmax

        evalCriterion=classcrossentropywithsoftmax

        nodeType=Sigmoid

        initValueScale=6.0

        layerSizes=::

        addPrior=false

        addDropoutNodes=false

        applyMeanVarNorm=false

        uniformInit=true;

        # these are for the class information for class-based language modeling

        vocabSize=

        nbrClass=

    ]

I use a self-defined network here mainly because I want to change it to an LSTM structure later.

This is an original blog post. Please do not repost without permission.
