NLP (23): Language Modeling with LSTM to Predict the Best Next Word

Original link: http://www.one2know.cn/nlp23/

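N-gram models

An N-gram model predicts the next word from the consecutive words that have already been typed.

If two consecutive words are extracted at a time, the model is called a bigram model.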
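
As a small illustration of what "extracting consecutive words" means, the sketch below builds bigrams from a token list using only the standard library. The helper name `extract_ngrams` and the sample sentence are made up for this example; the article itself uses `nltk.ngrams` for the same job.

```python
def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n shifted copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "alice was beginning to get very tired".split()
bigrams = extract_ngrams(tokens, 2)
print(bigrams[:3])
# [('alice', 'was'), ('was', 'beginning'), ('beginning', 'to')]
```

Calling `extract_ngrams(tokens, 3)` gives trigrams in the same way.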
Preparation

The dataset is Alice in Wonderland.

Extract N-grams from the raw text:

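import nltk
import string

# Read the raw text of the book
with open('alice_in_wonderland.txt', 'r') as content_file:
    content = content_file.read()

# Replace punctuation with spaces and collapse runs of whitespace
content2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in content]).split())
# word_tokenize needs NLTK's Punkt tokenizer data to be installed
tokens = nltk.word_tokenize(content2)
# Lowercase everything and drop single-character tokens
tokens = [word.lower() for word in tokens if len(word)>=2]

N = 3
# With N = 3 these are trigrams, despite the variable name
quads = list(nltk.ngrams(tokens,N))
"""
    Return the ngrams generated from a sequence of items, as an iterator.
    For example:
        >>> from nltk.util import ngrams
        >>> list(ngrams([1,2,3,4,5], 3))
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
"""

# Join each tuple of words back into a space-separated string
newl_app = []
for ln in quads:
    new1 = ' '.join(ln)
    newl_app.append(new1)
print(newl_app[:3])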

Output:

['alice adventures in', 'adventures in wonderland', 'in wonderland alice']
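
To make the idea of predicting from N-grams concrete, here is a minimal count-based sketch: it tallies which word follows each two-word context and returns the most frequent one. This is a simple frequency baseline written for illustration, not the neural model built later in the article, and the function names `build_trigram_counts` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_trigram_counts(trigram_strings):
    """Map each two-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for line in trigram_strings:
        first, second, third = line.split()
        counts[(first, second)][third] += 1
    return counts


def predict_next(counts, first, second):
    """Return the most frequent follower of the context, or None if unseen."""
    followers = counts.get((first, second))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# The three trigram strings printed above
sample = ['alice adventures in', 'adventures in wonderland', 'in wonderland alice']
counts = build_trigram_counts(sample)
print(predict_next(counts, 'adventures', 'in'))  # wonderland
```

A context that never occurred in the data returns `None`, which is the basic weakness of pure counting and one motivation for a learned model.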

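How to implement it

1. Preprocessing: convert the words into word vectors

2. Model building and validation: a convergent-divergent model that maps the input to the output

3. Prediction: predict the best next word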
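
The preprocessing step can be sketched by hand on the three sample trigrams. The code below is an assumption-laden illustration of count-vector encoding: the helpers `build_vocab` and `encode_counts` are made up, and a single shared, alphabetically sorted vocabulary is used for inputs and targets, whereas the full script relies on scikit-learn's `CountVectorizer`.

```python
def build_vocab(words):
    """Assign each distinct word a column index, in sorted order."""
    return {word: idx for idx, word in enumerate(sorted(set(words)))}


def encode_counts(text, vocab):
    """Encode a string as a count vector with one column per vocabulary word."""
    vec = [0] * len(vocab)
    for word in text.split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec


trigrams = ['alice adventures in', 'adventures in wonderland', 'in wonderland alice']
contexts = [" ".join(t.split()[:2]) for t in trigrams]  # first two words
targets = [t.split()[2] for t in trigrams]              # word to predict

# Columns: adventures=0, alice=1, in=2, wonderland=3
vocab = build_vocab(" ".join(trigrams).split())
x = [encode_counts(c, vocab) for c in contexts]
y = [encode_counts(t, vocab) for t in targets]
print(x[0], y[0])  # [1, 1, 0, 0] [0, 0, 1, 0]
```

Each input row marks the two context words, and each target row is a one-hot vector, which is why a softmax output layer with categorical cross-entropy fits this setup.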
Code

from __future__ import print_function
from sklearn.model_selection import train_test_split
import nltk
import numpy as np
import string

# Read the raw text of the book
with open('alice_in_wonderland.txt', 'r') as content_file:
    content = content_file.read()

# Replace punctuation with spaces and collapse runs of whitespace
content2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in content]).split())
# word_tokenize needs NLTK's Punkt tokenizer data to be installed
tokens = nltk.word_tokenize(content2)
# Lowercase everything and drop single-character tokens
tokens = [word.lower() for word in tokens if len(word)>=2]

N = 3
# With N = 3 these are trigrams, despite the variable name
quads = list(nltk.ngrams(tokens,N))
"""
    Return the ngrams generated from a sequence of items, as an iterator.
    For example:
        >>> from nltk.util import ngrams
        >>> list(ngrams([1,2,3,4,5], 3))
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
"""

# Join each tuple of words back into a space-separated string
newl_app = []
for ln in quads:
    new1 = ' '.join(ln)
    newl_app.append(new1)
# print(newl_app[:3])

# Vectorize the words
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer() # word => count vector over the vocabulary
# Example from the CountVectorizer documentation (recent scikit-learn releases
# use get_feature_names_out() in place of get_feature_names()):
"""
    >>> corpus = [
    ...     'This is the first document.',
    ...     'This document is the second document.',
    ...     'And this is the third one.',
    ...     'Is this the first document?',
    ... ]
    >>> vectorizer = CountVectorizer()
    >>> X = vectorizer.fit_transform(corpus)
    >>> print(vectorizer.get_feature_names())
    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
    >>> print(X.toarray())  # doctest: +NORMALIZE_WHITESPACE
    [[0 1 1 1 0 0 1 0 1]
     [0 2 0 1 0 1 1 0 1]
     [1 0 0 1 1 0 1 1 1]
     [0 1 1 1 0 0 1 0 1]]
"""

# Split each trigram into its first N-1 words (input) and its last word (target)
x_trigm = []
y_trigm = []
for l in newl_app:
    x_str = " ".join(l.split()[0:N-1])
    y_str = l.split()[N-1]
    x_trigm.append(x_str)
    y_trigm.append(y_str)

x_trigm_check = vectorizer.fit_transform(x_trigm).todense()
# The vectorizer is refit on the targets, so vocabulary_ below indexes the columns of Y
y_trigm_check = vectorizer.fit_transform(y_trigm).todense()

# Dictionaries from word to integer and integer to word
dictnry = vectorizer.vocabulary_
rev_dictnry = {v:k for k,v in dictnry.items()}

X = np.array(x_trigm_check)
Y = np.array(y_trigm_check)

# Hold out 30% for testing; x_trigm is split alongside X and Y so the
# context words can be displayed next to the predictions later
Xtrain, Xtest, Ytrain, Ytest,xtrain_tg,xtest_tg = train_test_split(X, Y,x_trigm, test_size=0.3,random_state=1)
print("X Train shape",Xtrain.shape, "Y Train shape" , Ytrain.shape)
print("X Test shape",Xtest.shape, "Y Test shape" , Ytest.shape)

# Model Building
from keras.layers import Input,Dense,Dropout
from keras.models import Model

np.random.seed(1)
BATCH_SIZE = 128
NUM_EPOCHS = 20

# Convergent-divergent stack: 1000 -> 800 -> 1000 units, then softmax over the vocabulary
input_layer = Input(shape = (Xtrain.shape[1],),name="input")
first_layer = Dense(1000,activation='relu',name = "first")(input_layer)
first_dropout = Dropout(0.5,name="firstdout")(first_layer)
second_layer = Dense(800,activation='relu',name="second")(first_dropout)
third_layer = Dense(1000,activation='relu',name="third")(second_layer)
third_dropout = Dropout(0.5,name="thirdout")(third_layer)
fourth_layer = Dense(Ytrain.shape[1],activation='softmax',name = "fourth")(third_dropout)

# Note: `history` holds the model itself, not a training history
history = Model(input_layer,fourth_layer)
history.compile(optimizer = "adam",loss="categorical_crossentropy",metrics=["accuracy"])
# summary() prints the table and returns None, hence the "None" after the table in the output
print (history.summary())

# Model Training (20% of the training data is held out for validation)
history.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,epochs=NUM_EPOCHS, verbose=1,validation_split = 0.2)

# Model Prediction
Y_pred = history.predict(Xtest)

# Test: show context, actual next word and predicted next word for random test samples
print ("Prior bigram words","|Actual","|Predicted","\n")
import random
NUM_DISPLAY = 10
for i in random.sample(range(len(xtest_tg)),NUM_DISPLAY):
    print (i,xtest_tg[i],"|",rev_dictnry[np.argmax(Ytest[i])],"|",rev_dictnry[np.argmax(Y_pred[i])])

Output:

X Train shape (17947, 2559) Y Train shape (17947, 2559)
X Test shape (7692, 2559) Y Test shape (7692, 2559)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 2559)              0
_________________________________________________________________
first (Dense)                (None, 1000)              2560000
_________________________________________________________________
firstdout (Dropout)          (None, 1000)              0
_________________________________________________________________
second (Dense)               (None, 800)               800800
_________________________________________________________________
third (Dense)                (None, 1000)              801000
_________________________________________________________________
thirdout (Dropout)           (None, 1000)              0
_________________________________________________________________
fourth (Dense)               (None, 2559)              2561559
=================================================================
Total params: 6,723,359
Trainable params: 6,723,359
Non-trainable params: 0
_________________________________________________________________
None
Prior bigram words |Actual |Predicted

595 words don | fit | know
3816 in tone | of | of
5792 queen had | only | been
2757 who seemed | to | to
5393 her and | she | she
4197 heard of | one | its
2464 sneeze were | the | of
1590 done with | said | whiting
3039 and most | things | of
4226 the queen | of | said
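
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add no parameters. A short sketch (the helper name `dense_params` is made up for this check):

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


# Layer widths from the summary: input, first, second, third, fourth
layer_sizes = [2559, 1000, 800, 1000, 2559]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [2560000, 800800, 801000, 2561559]
print(sum(counts))  # 6723359
```

The two layers that touch the 2559-wide vocabulary account for most of the 6.7 million parameters.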

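The results are poor: only 3 of the 10 sampled predictions above match the actual word. The word vectors are very high-dimensional (2559), while the dataset contains relatively few words. Besides word-level prediction, the model could also predict character by character, treating each space as the end of a word.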
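
For the character-level variant mentioned above, the training pairs can be built with a sliding window over the raw text. The sketch below is only an illustration under assumed choices: the function name `char_training_pairs` and the window size of 4 are hypothetical, and the pairs would still need to be encoded before being fed to a model.

```python
def char_training_pairs(text, window):
    """Slide a fixed-size window over text; each window predicts the next character."""
    pairs = []
    for start in range(len(text) - window):
        context = text[start:start + window]
        target = text[start + window]
        pairs.append((context, target))
    return pairs


pairs = char_training_pairs("alice was", window=4)
for context, target in pairs[:3]:
    print(repr(context), "->", repr(target))
# 'alic' -> 'e'
# 'lice' -> ' '
# 'ice ' -> 'w'
```

The second pair shows the space being predicted like any other character, which is how word boundaries would be recovered from a character-level model. The output vocabulary shrinks from thousands of words to a few dozen characters.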


