tensorflow在文本处理中的使用—

代码来源于：tensorflow机器学习实战指南（曾益强译，2017年9月）——第七章：自然语言处理

代码地址：https://github.com/nfmcclure/tensorflow-cookbook

数据：http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz

问题：加载和使用预训练的嵌套，并使用这些单词嵌套进行情感分析，通过训练线性逻辑回归模型来预测电影的好坏

步骤如下：

必要包
声明模型参数
读取并转换文本数据集，划分训练集和测试集
构建图
训练

step1：必要包

import tensorflow as tf

import matplotlib.pyplot as plt

import numpy as np

import random

import os

import pickle

import string

import requests

import collections

import io

import tarfile

import urllib.request

import text_helpers

from nltk.corpus import stopwords

from tensorflow.python.framework import ops

ops.reset_default_graph()

os.chdir(os.path.dirname(os.path.realpath(__file__)))

# Start a graph session

sess = tf.Session()

step2：声明模型参数

# Declare model parameters

embedding_size = 200

vocabulary_size = 2000

batch_size = 100

max_words = 100

# Declare stop words

stops = stopwords.words('english')

step3：读取并转换本文数据集，划分训练集和测试集

参考：tensorflow在文本处理中的使用——辅助函数

# Load Data

print('Loading Data')

data_folder_name = 'temp'

texts, target = text_helpers.load_movie_data(data_folder_name)

# Normalize text

print('Normalizing Text Data')

texts = text_helpers.normalize_text(texts, stops)

# Texts must contain at least 3 words

target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]

texts = [x for x in texts if len(x.split()) > 2]

# Split up data set into train/test

train_indices = np.random.choice(len(target), round(0.8*len(target)), replace=False)

test_indices = np.array(list(set(range(len(target))) - set(train_indices)))

texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]

texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]

target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])

target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])

# Load dictionary and embedding matrix加载CBOW嵌套中保存的单词字典

dict_file = os.path.join(data_folder_name, 'movie_vocab.pkl')

word_dictionary = pickle.load(open(dict_file, 'rb'))

# Convert texts to lists of indices根据单词字典将加载的句子转化为数值型numpy数组

text_data_train = np.array(text_helpers.text_to_numbers(texts_train, word_dictionary))

text_data_test = np.array(text_helpers.text_to_numbers(texts_test, word_dictionary))

# Pad/crop movie reviews to specific length电影影评长度不一，不满100维的用0凑满，超过100维的取前100维

text_data_train = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_train]])

text_data_test = np.array([x[0:max_words] for x in [y+[0]*max_words for y in text_data_test]])

step4：构建图

print('Creating Model')

# Define Embeddings:创建嵌套变量，用于之后加载CBOW训练好的嵌套向量

embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# Define model:

# Create variables for logistic regression变量

A = tf.Variable(tf.random_normal(shape=[embedding_size,1]))

b = tf.Variable(tf.random_normal(shape=[1,1]))

# Initialize placeholders数据占位符

x_data = tf.placeholder(shape=[None, max_words], dtype=tf.int32)

y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)

# Lookup embeddings vectors

embed = tf.nn.embedding_lookup(embeddings, x_data)

# Take average of all word embeddings in documents计算句子中所有单词的平均嵌套

embed_avg = tf.reduce_mean(embed, 1)

# Declare logistic model (sigmoid in loss function)

model_output = tf.add(tf.matmul(embed_avg, A), b)

# Declare loss function (Cross Entropy loss)

loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(model_output, y_target))

# Actual Prediction

prediction = tf.round(tf.sigmoid(model_output))

predictions_correct = tf.cast(tf.equal(prediction, y_target), tf.float32)

accuracy = tf.reduce_mean(predictions_correct)

# Declare optimizer

my_opt = tf.train.AdagradOptimizer(0.005)

train_step = my_opt.minimize(loss)

step5：训练

# Intitialize Variables

init = tf.initialize_all_variables()

sess.run(init)

# Load model embeddings加载CBOW训练好的嵌套矩阵

model_checkpoint_path = os.path.join(data_folder_name,'cbow_movie_embeddings.ckpt')

saver = tf.train.Saver({"embeddings": embeddings})

saver.restore(sess, model_checkpoint_path)

# Start Logistic Regression

print('Starting Model Training')

train_loss = []

test_loss = []

train_acc = []

test_acc = []

i_data = []

for i in range(10000):

    rand_index = np.random.choice(text_data_train.shape[0], size=batch_size)

    rand_x = text_data_train[rand_index]

    rand_y = np.transpose([target_train[rand_index]])

    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})

    # Only record loss and accuracy every 100 generations

    if (i+1)%100==0:

        i_data.append(i+1)

        train_loss_temp = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})

        train_loss.append(train_loss_temp)

        test_loss_temp = sess.run(loss, feed_dict={x_data: text_data_test, y_target: np.transpose([target_test])})

        test_loss.append(test_loss_temp)

        train_acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x, y_target: rand_y})

        train_acc.append(train_acc_temp)

        test_acc_temp = sess.run(accuracy, feed_dict={x_data: text_data_test, y_target: np.transpose([target_test])})

        test_acc.append(test_acc_temp)

    if (i+1)%500==0:

        acc_and_loss = [i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]

        acc_and_loss = [np.round(x,2) for x in acc_and_loss]

        print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))

可视化结果展示：

# Plot loss over time

plt.plot(i_data, train_loss, 'k-', label='Train Loss')

plt.plot(i_data, test_loss, 'r--', label='Test Loss', linewidth=4)

plt.title('Cross Entropy Loss per Generation')

plt.xlabel('Generation')

plt.ylabel('Cross Entropy Loss')

plt.legend(loc='upper right')

plt.show()

# Plot train and test accuracy

plt.plot(i_data, train_acc, 'k-', label='Train Set Accuracy')

plt.plot(i_data, test_acc, 'r--', label='Test Set Accuracy', linewidth=4)

plt.title('Train and Test Accuracy')

plt.xlabel('Generation')

plt.ylabel('Accuracy')

plt.legend(loc='lower right')

plt.show()

tensorflow在文本处理中的使用——Word2Vec预测的更多相关文章

tensorflow在文本处理中的使用——Doc2Vec情感分析
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——CBOW词嵌入模型
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——skip-gram模型
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——TF-IDF算法
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——词袋
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——辅助函数
代码来源于:tensorflow机器学习实战指南(曾益强译,2017年9月)——第七章:自然语言处理代码地址:https://github.com/nfmcclure/tensorflow-coo ...
tensorflow在文本处理中的使用——skip-gram & CBOW原理总结
摘自:http://www.cnblogs.com/pinard/p/7160330.html 先看下列三篇,再理解此篇会更容易些(个人意见) skip-gram,CBOW,Word2Vec 词向量基 ...
TensorFlow实现文本情感分析详解
http://c.biancheng.net/view/1938.html 前面我们介绍了如何将卷积网络应用于图像.本节将把相似的想法应用于文本. 文本和图像有什么共同之处?乍一看很少.但是,如果将句 ...
jQuery文本框中的事件应用
jQuery文本框中的事件应用 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "ht ...

随机推荐

[C#] 使用WebSocket进行通讯
客户端客户端很简单 string url = "ws://localhost:24900/" + "test.ashx"; try { System.Net. ...
Xici drop flower
var xici_user_api = "http://www.xici.net/apps/wedding/?method=wedding.user.getusername&from ...
JavaScript学习之倒计时
倒计时很常见,例如离XX活动还有XX天XX小时XX分XX秒,然后逐秒减少,实现很简单,我只是想记录这过程中的一点小坑. 先上代码: <html> <head> <meta ...
从零学React Native之12 组件的生命周期
一个React Native组件从它被加载,到最终被卸载会经历一个完整的生命周期.所谓生命周期,就是一个对象从开始生成到最后消亡所经历的状态,理解生命周期,是合理开发的关键. ES6语法和之前的ES5 ...
UVa 679 【思维题】
UVA 679 紫书P148例题. 题目大意:小球从一棵所有叶子深度相同的二叉树的顶点开始向下落,树开始所有节点都为0.若小球落到节点为0的则往左落,否则向右落.并且小球会改变它经过的节点,0变1,1 ...
Unity3dShader学习合集暂存
http://www.cnblogs.com/flappy/archive/2012/08/10/2631348.html 1. Unity3d的參考手冊, 里面涵盖Unity3d官方的描写叙述 ht ...
mysql ip常见异常
这次的项目采用mysql数据库,以前没怎么接触过,所以遇到很多问题,在此小小总结一下: (1)com.mysql.jdbc.exceptions.jdbc4.CommunicationsExcepti ...
即插即用，基于阿里云Ganos快速构建云上开源GIS方案
对于轻量级GIS应用,选择具备时空能力的云上数据库再搭配开源GIS软件,能够快速构建稳定.廉价.实用的GIS解决方案.Ganos是阿里云自研时空基础设施(PaaS层)的核心引擎,该引擎整合了云上异构计 ...
Android 使用SystemBarTint设置状态栏颜色
做项目时,发现APP的状态栏是系统默认的颜色,突然想到,为啥别的APP是自己设置的颜色(和APP本身很相搭),于是也想给自己的APP设置系统状态栏的颜色,更加美美哒... 搜了下,发现原来设置状态栏居 ...
20182019-acmicpc-asia-dhaka-regional F .Path Intersection 树链剖分
直接进行树链剖分,每次对路径区间内的所有点值+1,线段树进行维护,然后查询线段树的最大值的个数!!! 查询线段树区间最大值个数,可以先维护区间和,在维护区间最值,如果区间和等于区间最值乘以区间长度,那 ...

tensorflow在文本处理中的使用——Word2Vec预测

tensorflow在文本处理中的使用——Word2Vec预测的更多相关文章

随机推荐

热门专题