代码来源于:tensorflow机器学习实战指南(曾益强 译,2017年9月)——第七章:自然语言处理

代码地址:https://github.com/nfmcclure/tensorflow-cookbook

在讲述skip-gram,CBOW,Word2Vec,Doc2Vec模型时需要复用的函数

  • 加载数据函数
  • 归一化文本函数
  • 生成词汇表函数
  • 生成单词索引表
  • 生成批量数据函数

加载数据函数

# Load the movie review data
# Check if data was downloaded, otherwise download it and save for future use
def load_movie_data(data_folder_name):
pos_file = os.path.join(data_folder_name, 'rt-polarity.pos')
neg_file = os.path.join(data_folder_name, 'rt-polarity.neg') # Check if files are already downloaded
if os.path.isfile(pos_file):
pos_data = []
with open(pos_file, 'r') as temp_pos_file:
for row in temp_pos_file:
pos_data.append(row)
neg_data = []
with open(neg_file, 'r') as temp_neg_file:
for row in temp_neg_file:
neg_data.append(row)
else: # If not downloaded, download and save
movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
stream_data = urllib.request.urlopen(movie_data_url)
tmp = io.BytesIO()
while True:
s = stream_data.read(16384)
if not s:
break
tmp.write(s)
stream_data.close()
tmp.seek(0) tar_file = tarfile.open(fileobj=tmp, mode="r:gz")
pos = tar_file.extractfile('rt-polaritydata/rt-polarity.pos')
neg = tar_file.extractfile('rt-polaritydata/rt-polarity.neg')
# Save pos/neg reviews
pos_data = []
for line in pos:
pos_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
neg_data = []
for line in neg:
neg_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
tar_file.close()
# Write to file
if not os.path.exists(save_folder_name):
os.makedirs(save_folder_name)
# Save files
with open(pos_file, "w") as pos_file_handler:
pos_file_handler.write(''.join(pos_data))
with open(neg_file, "w") as neg_file_handler:
neg_file_handler.write(''.join(neg_data))
texts = pos_data + neg_data
target = [1]*len(pos_data) + [0]*len(neg_data)
return(texts, target)

归一化文本函数

# Normalize text
def normalize_text(texts, stops):
# Lower case
texts = [x.lower() for x in texts] # Remove punctuation
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts] # Remove numbers
texts = [''.join(c for c in x if c not in '') for x in texts] # Remove stopwords
texts = [' '.join([word for word in x.split() if word not in (stops)]) for x in texts] # Trim extra whitespace
texts = [' '.join(x.split()) for x in texts] return(texts)

生成词汇表函数

# Build dictionary of words构建词汇表(单词和单词数对),词频不够的单词(即标记为unknown的单词)标记为RARE
def build_dictionary(sentences, vocabulary_size):
# Turn sentences (list of strings) into lists of words
split_sentences = [s.split() for s in sentences]
words = [x for sublist in split_sentences for x in sublist] # Initialize list of [word, word_count] for each word, starting with unknown
count = [['RARE', -1]] # Now add most frequent words, limited to the N-most frequent (N=vocabulary size)
count.extend(collections.Counter(words).most_common(vocabulary_size-1)) # Now create the dictionary
word_dict = {}
# For each word, that we want in the dictionary, add it, then make it
# the value of the prior dictionary length
for word, word_count in count:
word_dict[word] = len(word_dict) return(word_dict)

生成单词索引表

# Turn text data into lists of integers from dictionary
def text_to_numbers(sentences, word_dict):
# Initialize the returned data
data = []
for sentence in sentences:
sentence_data = []
# For each word, either use selected index or rare word index
for word in sentence.split():
if word in word_dict:
word_ix = word_dict[word]
else:
word_ix = 0
sentence_data.append(word_ix)
data.append(sentence_data)
return(data)

生成批量数据函数

# Generate data randomly (N words behind, target, N words ahead)
def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'):
# Fill up data batch
batch_data = []
label_data = []
while len(batch_data) < batch_size:
# select random sentence to start
rand_sentence_ix = int(np.random.choice(len(sentences), size=1))
rand_sentence = sentences[rand_sentence_ix]
# Generate consecutive windows to look at
window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
# Denote which element of each window is the center word of interest
label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)] # Pull out center word of interest for each window and create a tuple for each window
if method=='skip_gram':
batch_and_labels = [(x[y], x[:y] + x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
# Make it in to a big list of tuples (target word, surrounding word)
tuple_data = [(x, y_) for x,y in batch_and_labels for y_ in y]
batch, labels = [list(x) for x in zip(*tuple_data)]
elif method=='cbow':
batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
# Only keep windows with consistent 2*window_size
batch_and_labels = [(x,y) for x,y in batch_and_labels if len(x)==2*window_size]
batch, labels = [list(x) for x in zip(*batch_and_labels)]
elif method=='doc2vec':
# For doc2vec we keep LHS window only to predict target word
batch_and_labels = [(rand_sentence[i:i+window_size], rand_sentence[i+window_size]) for i in range(0, len(rand_sentence)-window_size)]
batch, labels = [list(x) for x in zip(*batch_and_labels)]
# Add document index to batch!! Remember that we must extract the last index in batch for the doc-index
batch = [x + [rand_sentence_ix] for x in batch]
else:
raise ValueError('Method {} not implmented yet.'.format(method)) # extract batch and labels
batch_data.extend(batch[:batch_size])
label_data.extend(labels[:batch_size])
# Trim batch and label at the end
batch_data = batch_data[:batch_size]
label_data = label_data[:batch_size] # Convert to numpy array
batch_data = np.array(batch_data)
label_data = np.transpose(np.array([label_data])) return(batch_data, label_data)

tensorflow在文本处理中的使用——辅助函数的更多相关文章

  1. tensorflow在文本处理中的使用——Doc2Vec情感分析

    代码来源于:tensorflow机器学习实战指南(曾益强 译,2017年9月)——第七章:自然语言处理 代码地址:https://github.com/nfmcclure/tensorflow-coo ...

  2. tensorflow在文本处理中的使用——Word2Vec预测

    代码来源于:tensorflow机器学习实战指南(曾益强 译,2017年9月)——第七章:自然语言处理 代码地址:https://github.com/nfmcclure/tensorflow-coo ...

  3. tensorflow在文本处理中的使用——CBOW词嵌入模型

    代码来源于:tensorflow机器学习实战指南(曾益强 译,2017年9月)——第七章:自然语言处理 代码地址:https://github.com/nfmcclure/tensorflow-coo ...

  4. tensorflow在文本处理中的使用——skip-gram模型

    代码来源于:tensorflow机器学习实战指南(曾益强 译,2017年9月)——第七章:自然语言处理 代码地址:https://github.com/nfmcclure/tensorflow-coo ...

  5. tensorflow在文本处理中的使用——TF-IDF算法

    代码来源于:tensorflow机器学习实战指南(曾益强 译,2017年9月)——第七章:自然语言处理 代码地址:https://github.com/nfmcclure/tensorflow-coo ...

  6. tensorflow在文本处理中的使用——词袋

    代码来源于:tensorflow机器学习实战指南(曾益强 译,2017年9月)——第七章:自然语言处理 代码地址:https://github.com/nfmcclure/tensorflow-coo ...

  7. tensorflow在文本处理中的使用——skip-gram & CBOW原理总结

    摘自:http://www.cnblogs.com/pinard/p/7160330.html 先看下列三篇,再理解此篇会更容易些(个人意见) skip-gram,CBOW,Word2Vec 词向量基 ...

  8. TensorFlow实现文本情感分析详解

    http://c.biancheng.net/view/1938.html 前面我们介绍了如何将卷积网络应用于图像.本节将把相似的想法应用于文本. 文本和图像有什么共同之处?乍一看很少.但是,如果将句 ...

  9. jQuery文本框中的事件应用

    jQuery文本框中的事件应用 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "ht ...

随机推荐

  1. Linux C socket 基于 UDP

    /*************************************************************************     > File Name: serve ...

  2. 如何用SPSS分析学业情绪量表数据

    如何用SPSS分析学业情绪量表数据 1.数据检验.由于问卷.量表的题目是主观判断和选择,因而难免有些人不认真填,所以,筛选出有效.高质量的数据非常关键.通常需要作如下检查:(1)是否有人回答互相矛盾, ...

  3. Hibernate的DetachedCriteria使用(含Criteria)转载

    https://www.cnblogs.com/deng-cc/p/6428599.html 1.背景了解:Hibernate的三种查询方式 Hibernate总的来说共有三种查询方式:HQL.QBC ...

  4. 【JZOJ3852】【NOIP2014八校联考第2场第2试9.28】单词接龙(words)

    DDD Bsny从字典挑出N个单词,并设计了接龙游戏,只要一个单词的最后两个字母和另一个单词的前两个字母相同,那么这两个单词就可以有序的连接起来. Bsny想要知道在所给的所有单词中能否按照上述方式接 ...

  5. C++之以分隔符的形式获取字符串

    void CConvert::Split(const std::string& src, const std::string& separator, std::vector<st ...

  6. git 密钥

    为什么配置SHH呢?是为了方便我们剪切代码的时间免密码输入,特别方便如何配置呢? 首先安装git: 先到官网下载:官网下载git 然后安装后在桌面任意空白处右击,选择Git Base Here即可如下 ...

  7. 撸了一个微信小程序项目

    学会一项开发技能最快的步骤就是:准备,开火,瞄准.最慢的就是:准备,瞄准,瞄准,瞄准-- 因为微信小程序比较简单,直接开撸就行,千万别瞄准. 于是乎,趁着今天上午空气质量不错,撸了一个小程序,放在了男 ...

  8. ACM:树的变换,依据表达式建立表达式树

    题目:输入一个表达式.建立一个表达式树. 分析:找到最后计算的运算符(它是整棵表达式树的根),然后递归处理!             在代码中.仅仅有当p==0的时候.才考虑这个运算符,由于括号中的运 ...

  9. [转]embedding technic:从SNE到t-SNE再到LargeVis

    http://bindog.github.io/blog/2016/06/04/from-sne-to-tsne-to-largevis/#top

  10. 20.libgdx,stage中默认相机的使用

    主要思路: 通过查资料得知,stage中的默认封装的相机为OrthographicCamera,要操纵该相机,直接把他转化为OrthographicCamera即可使用 但是这会导致一个问题,即原本固 ...