『TensorFlow』RNN for Chinese Text (Part 1)

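Preprocessing pipeline for Chinese text

Text processing
- Read the file and remove special characters
- Sort the entries by length

Building the auxiliary data structures
- Build a {character: count} dictionary
- Build a list of characters sorted by count
- Build a {character: index} dictionary
- Build the lists of indices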

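Building a vocabulary during text preprocessing requires deduplicating the characters. The usual approach is a set, but here collections.Counter is used instead, since it deduplicates and counts occurrences at the same time.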
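To make the difference concrete, here is a minimal sketch on a toy string (the sample text is only an illustration, not part of the dataset used in this post). It shows that a Counter holds the same distinct keys as a set while also keeping the counts.

```python
from collections import Counter

# Toy input (an assumption for illustration only)
text = "床前明月光疑是地上霜举头望明月"

# A set removes duplicates but forgets how often each character occurred
unique_chars = set(text)

# A Counter has one key per distinct character and also stores its count
counts = Counter(text)

print(set(counts) == unique_chars)              # same distinct characters
print(counts["明"], counts["月"], counts["光"])  # per-character counts
```

Because the counts are available, the vocabulary can later be ordered by frequency, which a plain set cannot provide.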

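The text used here is the Complete Tang Poems (全唐诗), normally one poem per line. The goal is to strip the leading title/author field and wrap each poem body as "[poem body]" for use as RNN input. So that poems can be brought to equal length, shorter poems are padded later with a filler character; the code below uses a space ' ' for this.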

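Next the poems are converted to a numeric format: first build the {character: index} dictionary, then use it to turn each poem into a list of indices (an "[index poem]"), ready for the embedding step.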
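As a rough illustration of where this ends up, the sketch below runs the same idea on a two-item toy corpus. The names poems, chars, char_to_index and encoded are hypothetical and chosen for this example only.

```python
from collections import Counter

# Toy corpus, already wrapped in '[' and ']' (hypothetical data)
poems = ["[abab]", "[abc]"]

counter = Counter("".join(poems))
# Most frequent first; sorted() is stable, so ties keep first-seen order
chars = [ch for ch, _ in sorted(counter.items(), key=lambda kv: -kv[1])]
chars.append(" ")  # the padding character gets the last index

char_to_index = {ch: i for i, ch in enumerate(chars)}
encoded = [[char_to_index[ch] for ch in poem] for poem in poems]

print(char_to_index)
print(encoded)  # [[2, 0, 1, 0, 1, 3], [2, 0, 1, 4, 3]]
```

Frequent characters receive small indices, and the bracket markers are encoded like any other character.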

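import numpy as np
import tensorflow as tf
from collections import Counter

poetry_file = 'poetry.txt'
poetrys = []
with open(poetry_file, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            # Each line is expected to look like "title:content"; a line with
            # no ':' or with several raises ValueError here and is skipped
            title, content = line.strip().split(':')
        except ValueError:
            continue
        content = content.replace(' ', '')    # strip spaces (not actually needed in practice)
        # Skip poems that contain annotation marks or other special characters
        if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content:
            continue
        # Keep only poems of 5 to 79 characters
        if len(content) < 5 or len(content) > 79:
            continue
        content = '[' + content + ']'          # mark the start and end of the poem
        poetrys.append(content)

# Sort the poems by length
poetrys = sorted(poetrys, key=lambda poetry: len(poetry))
print('Number of Tang poems:', len(poetrys))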

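# Count how often each character occurs
all_words = []
for poetry in poetrys:
    all_words += [word for word in poetry]
counter = Counter(all_words)
print(counter.items())

# items() yields every dictionary entry as a (character, count) 2-tuple;
# sort those pairs by descending count
count_pairs = sorted(counter.items(), key=lambda x: -x[1])

# Use zip to pull out the characters. These are native Python structures,
# which are far less flexible to slice than numpy arrays
words, _ = zip(*count_pairs)
print(words)
words = words + (' ',)                       # ' ' is used later to pad short poems
print(words)

# Build the {character: index} dictionary
word_num_map = dict(zip(words, range(len(words))))

# Convert each poem into a list of indices. Every character is already in
# the dictionary, so the fallback index len(words) is never used here
to_num = lambda word: word_num_map.get(word, len(words))
poetry_vector = [list(map(to_num, poetry)) for poetry in poetrys]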

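Next, build the batches fed to the RNN together with their labels. Poems are padded with the ' ' character that was appended to the vocabulary, because every sequence within a batch must have the same length: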

batch_size = 1
n_chunk = len(poetry_vector) // batch_size
x_batches = []
y_batches = []
for i in range(n_chunk):
    start_index = i * batch_size
    end_index = start_index + batch_size
    batches = poetry_vector[start_index:end_index]
    length = max(map(len, batches))          # length of the longest poem in this batch
    # Start from a matrix filled with the padding index, then copy each poem in
    xdata = np.full((batch_size, length), word_num_map[' '], np.int32)
    for row in range(batch_size):
        xdata[row, :len(batches[row])] = batches[row]
    # Labels are the inputs shifted left by one step; the last column is unchanged
    ydata = np.copy(xdata)
    ydata[:, :-1] = xdata[:, 1:]
    # xdata             ydata
    # [6,2,4,6,9]       [2,4,6,9,9]
    # [1,4,2,8,5]       [4,2,8,5,5]
    x_batches.append(xdata)                  # x_batches: (n_chunk, batch, length)
    y_batches.append(ydata)
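With batch_size = 1 the padding never actually triggers, so it can be helpful to see the same pad-and-shift logic on a batch of unequal lengths. The sketch below wraps the loop body in a helper; make_batch and pad_index are hypothetical names introduced for this illustration, and the input sequences are made up.

```python
import numpy as np

def make_batch(sequences, pad_index):
    """Pad index sequences to a common length and build left-shifted labels."""
    length = max(len(seq) for seq in sequences)
    # Fill with the padding index first, then copy each sequence into its row
    x = np.full((len(sequences), length), pad_index, dtype=np.int32)
    for row, seq in enumerate(sequences):
        x[row, :len(seq)] = seq
    # Labels: inputs shifted left by one; the last column keeps its value
    y = np.copy(x)
    y[:, :-1] = x[:, 1:]
    return x, y

x, y = make_batch([[6, 2, 4, 6, 9], [1, 4, 2]], pad_index=0)
print(x)  # [[6 2 4 6 9], [1 4 2 0 0]]
print(y)  # [[2 4 6 9 9], [4 2 0 0 0]]
```

Because the poems were sorted by length beforehand, neighbouring poems in a batch have similar lengths, which keeps the amount of padding small when batch_size is larger than 1.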

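Since this post only covers preprocessing, the embedding (vectorization) step is simply listed below. After this step a batch can be fed to the RNN as input: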

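# TensorFlow 1.x graph-mode API; under TensorFlow 2.x the same calls are
# reachable through tf.compat.v1 once eager execution is disabled
input_data = tf.placeholder(tf.int32, [batch_size, None])
output_targets = tf.placeholder(tf.int32, [batch_size, None])
# One trainable 128-dimensional vector per vocabulary entry
embedding = tf.get_variable("embedding", [len(words), 128])
inputs = tf.nn.embedding_lookup(embedding, input_data)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
# The printed shape is (batch_size, length, 128)
print(sess.run(inputs, feed_dict={input_data: x_batches[0]}).shape)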

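A note on tf.nn.embedding_lookup: I implemented something similar in an earlier cs231n assignment. It maps [batch [data]] to [batch [data [vector]]], so it needs a lookup matrix created in advance, of shape {total number of characters × RNN input size}.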
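Conceptually the lookup is just row selection from that matrix. The NumPy sketch below mimics the forward pass with integer-array indexing; it is an analogy under toy sizes (vocab_size and embed_dim are assumptions for this example), not TensorFlow's actual implementation.

```python
import numpy as np

vocab_size, embed_dim = 6, 4    # toy sizes; the post uses len(words) and 128
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))   # one row per character

ids = np.array([[2, 0, 1, 5],
                [2, 4, 3, 5]])                         # (batch, length)

# Indexing with an integer array picks one row per index
vectors = embedding[ids]                               # (batch, length, embed_dim)
print(vectors.shape)  # (2, 4, 4)
```

Each index in the batch is replaced by the corresponding row of the matrix, which is why the matrix needs one row for every entry of the vocabulary, including the padding character.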
