Using TensorFlow for Text Processing: Bag of Words

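Code source: TensorFlow Machine Learning Cookbook (Chinese edition "tensorflow机器学习实战指南", translated by 曾益强, September 2017), Chapter 7: Natural Language Processing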

Code repository: https://github.com/nfmcclure/tensorflow-cookbook

Problem: predict spam text messages using a "bag of words" embedding, with logistic regression as the classifier.

Drawbacks: the order of the words is not taken into account, and long texts are hard to handle.
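
To make the first drawback concrete, here is a minimal sketch in plain Python. It is not code from the book: the helper name `bag_of_words` and the two sample sentences are made up for illustration. Two sentences with opposite meanings produce exactly the same word counts, so a model that only sees those counts cannot tell them apart.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order (hypothetical helper)."""
    return Counter(text.lower().split())


# Two illustrative sentences with the same words in a different order.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# The counts are identical even though the meanings differ.
same_counts = (a == b)
print(same_counts)
```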

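The steps are as follows:

Step 1: Import the required packages

Step 2: Prepare the dataset

Step 3: Choose the parameters (how many words to keep per text, and the minimum word frequency)

Step 4: Build the bag of words

Step 5: Split the dataset

Step 6: Build the graph

Step 7: Train

Step 8: Test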

Step 1: Import the required packages

import tensorflow as tf

import matplotlib.pyplot as plt

import os

import numpy as np

import csv

import string

import requests

import io

from zipfile import ZipFile

from tensorflow.contrib import learn

from tensorflow.python.framework import ops

ops.reset_default_graph()

# Start a graph session

sess = tf.Session()

Step 2: Prepare the dataset

# Check if data was downloaded, otherwise download it and save for future use

save_file_name = os.path.join('temp','temp_spam_data.csv')

if os.path.isfile(save_file_name):

    text_data = []

    with open(save_file_name, 'r', newline='') as temp_output_file:

        reader = csv.reader(temp_output_file)

        for row in reader:

            text_data.append(row)

else:

    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'

    r = requests.get(zip_url)

    z = ZipFile(io.BytesIO(r.content))

    file = z.read('SMSSpamCollection')

    # Format Data

    text_data = file.decode()

    text_data = text_data.encode('ascii',errors='ignore')

    text_data = text_data.decode().split('\n')

    text_data = [x.split('\t') for x in text_data if len(x)>=1]

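    # And write to csv (create the 'temp' directory first if it does not exist)

    os.makedirs('temp', exist_ok=True)

    with open(save_file_name, 'w', newline='') as temp_output_file: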

        writer = csv.writer(temp_output_file)

        writer.writerows(text_data)

texts = [x[1] for x in text_data]

target = [x[0] for x in text_data]

# Relabel 'spam' as 1, 'ham' as 0

target = [1 if x=='spam' else 0 for x in target]

# Normalize text to reduce the number of meaningless tokens

# Lower case

texts = [x.lower() for x in texts]

# Remove punctuation

texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

# Remove numbers

texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

# Trim extra whitespace

texts = [' '.join(x.split()) for x in texts]
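
The four normalization passes above can be checked on a single message. The sketch below wraps them in one function; the name `normalize` and the sample message are made up for illustration and are not part of the book's code.

```python
import string


def normalize(text):
    """Lowercase, strip punctuation and digits, and collapse whitespace."""
    text = text.lower()
    text = ''.join(c for c in text if c not in string.punctuation)
    text = ''.join(c for c in text if c not in string.digits)
    return ' '.join(text.split())


# A made-up message, not taken from the SMS dataset.
sample = "WINNER!!  Call 0800-123 now,   to claim."
cleaned = normalize(sample)
print(cleaned)
```

Keeping the steps in one function makes it easy to apply exactly the same cleaning to any new message before it is scored.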

Step 3: Choose the parameters

# Plot histogram of text lengths (number of words per text)

text_lengths = [len(x.split()) for x in texts]

text_lengths = [x for x in text_lengths if x < 50]

plt.hist(text_lengths, bins=25)

plt.title('Histogram of # of Words in Texts')
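
Besides reading the histogram by eye, one way to compare candidate cut-offs is to compute what fraction of texts fits entirely within each one. This is only a sketch: the helper `coverage` and the word counts in `example_lengths` are made up, standing in for the real `text_lengths`.

```python
def coverage(lengths, max_len):
    """Fraction of texts whose word count is at most max_len."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Made-up word counts standing in for text_lengths.
example_lengths = [3, 7, 12, 18, 22, 25, 26, 31, 40, 55]

# Compare a few candidate values for the maximum sentence length.
for candidate in (25, 30, 40):
    print(candidate, coverage(example_lengths, candidate))
```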

Step 4: Build the bag of words

# Choose max text word length at 25 (30 or 40 would also work)

sentence_size = 25

min_word_freq = 3

# Setup vocabulary processor

vocab_processor = learn.preprocessing.VocabularyProcessor(sentence_size, min_frequency=min_word_freq)

# Have to fit transform to get length of unique words.

vocab_processor.fit_transform(texts)

embedding_size = len(vocab_processor.vocabulary_)
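
To see roughly what the vocabulary processor is doing, here is a plain-Python sketch of the same idea: keep the frequent words, give each one an integer id, and turn every text into a fixed-length list of ids. This is not the implementation of `VocabularyProcessor`; the helpers `build_vocab` and `to_ids`, the `>=` threshold rule, the alphabetical id order, and the sample texts are all assumptions made for illustration, and the real class may differ in those details.

```python
from collections import Counter


def build_vocab(texts, min_freq):
    """Map each sufficiently frequent word to an integer id; 0 is reserved."""
    counts = Counter(word for text in texts for word in text.split())
    kept = sorted(w for w, c in counts.items() if c >= min_freq)
    return {word: idx for idx, word in enumerate(kept, start=1)}


def to_ids(text, vocab, sentence_size):
    """Convert a text to a fixed-length id list, truncating or padding with 0."""
    ids = [vocab.get(word, 0) for word in text.split()][:sentence_size]
    return ids + [0] * (sentence_size - len(ids))


docs = ["free prize now", "call me now", "free call"]  # made-up texts
vocab = build_vocab(docs, min_freq=2)
ids = to_ids("free prize call now please", vocab, sentence_size=6)
print(vocab)
print(ids)
```

In this sketch the number of distinct ids is `len(vocab) + 1`, because id 0 is shared by unknown words and padding.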

Step 5: Split the dataset

# Split up data set into train/test

train_indices = np.random.choice(len(texts), round(len(texts)*0.8), replace=False)

test_indices = np.array(list(set(range(len(texts))) - set(train_indices)))

texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]

texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]

target_train = [x for ix, x in enumerate(target) if ix in train_indices]

target_test = [x for ix, x in enumerate(target) if ix in test_indices]
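
The list comprehensions above test `ix in train_indices` against a NumPy array, which scans the array on every lookup. A possible alternative, sketched here with the standard library only, keeps the training indices in a set so each membership test is a hash lookup. The helper `split_indices` and its `seed` parameter are made up for illustration.

```python
import random


def split_indices(n, train_fraction=0.8, seed=0):
    """Return disjoint train and test index lists that together cover range(n)."""
    rng = random.Random(seed)
    train = set(rng.sample(range(n), round(n * train_fraction)))
    test = [i for i in range(n) if i not in train]
    return sorted(train), test


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

The resulting index lists can then be used to pick the matching entries from `texts` and `target`.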

Step 6: Build the graph

Step 6.1: Build the text vectors

# Setup Index Matrix for one-hot-encoding; this matrix is used to look up the sparse (one-hot) vector for each word

identity_mat = tf.diag(tf.ones(shape=[embedding_size]))

# Create variables for logistic regression

A = tf.Variable(tf.random_normal(shape=[embedding_size,1]))

b = tf.Variable(tf.random_normal(shape=[1,1]))

# Initialize placeholders

x_data = tf.placeholder(shape=[sentence_size], dtype=tf.int32)

y_target = tf.placeholder(shape=[1, 1], dtype=tf.float32)

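# Text-Vocab Embedding: use TensorFlow's embedding lookup to map each word index in the sentence to a one-hot row of the identity matrix, then sum the rows

# tf.nn.embedding_lookup(params, ids): ids are the indices; it returns the rows of params at those indices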

x_embed = tf.nn.embedding_lookup(identity_mat, x_data)

x_col_sums = tf.reduce_sum(x_embed, 0)

# Declare model operations

x_col_sums_2D = tf.expand_dims(x_col_sums, 0)

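Question: how does the bag of words turn a text into a vector? Each word index selects one row of the identity matrix (a one-hot vector), and summing those rows gives a vector that counts how often each vocabulary entry occurs in the text.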
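
The lookup-and-sum from step 6.1 can be reproduced without TensorFlow. The sketch below uses plain Python lists; the helpers `one_hot` and `bag_of_words_vector` and the sample `word_ids` are made up for illustration.

```python
def one_hot(index, size):
    """Row `index` of a size-by-size identity matrix."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def bag_of_words_vector(word_ids, vocab_size):
    """Look up a one-hot row per word id and sum the rows column-wise."""
    rows = [one_hot(i, vocab_size) for i in word_ids]
    return [sum(col) for col in zip(*rows)]


# Made-up sentence of length 5 over a 4-entry vocabulary; id 0 is padding/unknown.
word_ids = [2, 0, 1, 2, 0]
vector = bag_of_words_vector(word_ids, vocab_size=4)
print(vector)
```

Note that in this sketch the padding id 0 is counted like any other id, so position 0 of the vector reflects how many padded or unknown slots the sentence has.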

Step 6.2: Build the graph

model_output = tf.add(tf.matmul(x_col_sums_2D, A), b)

# Declare loss function (Cross Entropy loss)

loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))

# Prediction operation

prediction = tf.sigmoid(model_output)

# Declare optimizer

my_opt = tf.train.GradientDescentOptimizer(0.001)

train_step = my_opt.minimize(loss)
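
For a single message, the model output and loss above reduce to a dot product, a sigmoid, and a cross-entropy term. The sketch below works through that arithmetic in plain Python. The input vector, weights, and bias are made-up numbers, and the function names are hypothetical; the stable formula `max(x, 0) - x*z + log(1 + exp(-|x|))` is the one given in the TensorFlow documentation for sigmoid cross entropy, compared here against the textbook definition.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_cross_entropy(logit, label):
    """Numerically stable form: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_cross_entropy(logit, label):
    """Textbook form: -(z*log(p) + (1-z)*log(1-p)) with p = sigmoid(x)."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


# Made-up bag-of-words vector, weights, and bias.
x = [2.0, 1.0, 0.0]
A = [0.5, -1.0, 0.3]
b = 0.1

logit = sum(xi * ai for xi, ai in zip(x, A)) + b
loss_spam = sigmoid_cross_entropy(logit, 1.0)
print(logit, sigmoid(logit), loss_spam)
```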

Step 7: Train

# Initialize variables

init = tf.global_variables_initializer()

sess.run(init)

# Start Logistic Regression

print('Starting Training Over {} Sentences.'.format(len(texts_train)))

loss_vec = []

train_acc_all = []

train_acc_avg = []

# Reuse the vocabulary fitted on the full corpus in step 4 instead of re-fitting it here

for ix, t in enumerate(vocab_processor.transform(texts_train)):

    y_data = [[target_train[ix]]]

    sess.run(train_step, feed_dict={x_data: t, y_target: y_data})

    temp_loss = sess.run(loss, feed_dict={x_data: t, y_target: y_data})

    loss_vec.append(temp_loss)

    if (ix+1)%10==0:

        print('Training Observation #' + str(ix+1) + ': Loss = ' + str(temp_loss))

    # Keep trailing average of past 50 observations accuracy

    # Get prediction of single observation

    [[temp_pred]] = sess.run(prediction, feed_dict={x_data:t, y_target:y_data})

    # Get True/False if prediction is accurate

    train_acc_temp = target_train[ix]==np.round(temp_pred)

    train_acc_all.append(train_acc_temp)

    if len(train_acc_all) >= 50:

        train_acc_avg.append(np.mean(train_acc_all[-50:]))
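
The trailing-average bookkeeping in the training loop can be isolated into a small function. This is a sketch in plain Python; the helper `trailing_average` and the sample flags are made up, and a window of 3 is used only to keep the example short (the training loop uses 50).

```python
def trailing_average(flags, window=50):
    """Mean of the last `window` values, emitted once `window` values exist."""
    averages = []
    for end in range(window, len(flags) + 1):
        recent = flags[end - window:end]
        averages.append(sum(recent) / window)
    return averages


# Made-up correctness flags: True when a prediction matched its label.
flags = [True, False, True, True, True, False]
avgs = trailing_average(flags, window=3)
print(avgs)
```

As in the loop above, nothing is recorded until the first full window is available, so the result has `len(flags) - window + 1` entries.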

Step 8: Test

# Get test set accuracy

print('Getting Test Set Accuracy For {} Sentences.'.format(len(texts_test)))

test_acc_all = []

# Use transform (not fit_transform) so the test texts are encoded with the already fitted vocabulary

for ix, t in enumerate(vocab_processor.transform(texts_test)):

    y_data = [[target_test[ix]]]

    if (ix+1)%50==0:

        print('Test Observation #' + str(ix+1))    

    # Get prediction of single observation

    [[temp_pred]] = sess.run(prediction, feed_dict={x_data:t, y_target:y_data})

    # Get True/False if prediction is accurate

    test_acc_temp = target_test[ix]==np.round(temp_pred)

    test_acc_all.append(test_acc_temp)

print('\nOverall Test Accuracy: {}'.format(np.mean(test_acc_all)))

# Plot training accuracy over time

plt.plot(range(len(train_acc_avg)), train_acc_avg, 'k-', label='Train Accuracy')

plt.title('Avg Training Acc Over Past 50 Generations')

plt.xlabel('Generation')

plt.ylabel('Training Accuracy')

plt.show()
