Explaining the Word2Vec Code in TensorFlow (Python)

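Preface:

As a die-hard deep learning enthusiast, after studying the theory I had long wanted a hands-on project to learn how deep learning frameworks and architectures are applied in practice. The wish was a good one, but opportunities were hard to come by. A suitable project came up recently, and while practicing on it I noticed that people in various machine learning and TensorFlow discussion groups were running into similar problems. So I would like to use this project to share some of what I understood and learned while running the code, in the hope that it helps with your work and prompts criticism and pointers from professionals in the field. Thank you all for your support!

This first post is split into two parts. This part explains how the basic version of Word2Vec officially provided with TensorFlow is constructed, and how to build a CBOW model to fill in the architecture missing from that version. In the next part I will focus on comparing the results of three Word2Vec versions: TensorFlow basic, TensorFlow optimised, and gensim.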

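Code walkthrough:

First of all, the basic tutorial provided by TensorFlow already explains what Word2Vec is and how TensorFlow builds the network to train it. The tutorial can be found here, and the code for the basic version can be found here.

The structure of the code looks messy but is actually straightforward. Line 61 limits the demo to a vocabulary of 50,000 distinct words. Then, in build_dataset(words), line 65 shows off the power of Python by tallying the whole input in a single line. After the UNK entry in count (UNK stands for unknown, the placeholder for rare words that fall outside the vocabulary), extend appends the vocabulary_size - 1 most frequent words, counted from the highest frequency down. Any word that does not make it into those top 49,999 is out of luck: count leaves it out.

Once count is built, dictionary is derived from it: the counts are dropped and each word's rank in the frequency ordering is used as its code. The word itself is the key of the dict and its rank is the value. After that, the input words are converted into their codes in dictionary, the number of input words that are not in dictionary is counted, and the UNK count is set to that number. Finally, reverse_dictionary is built to map each code back to its word.

In this way build_dataset rebuilds the input data and produces a code-to-word lookup table: data is used to train the model, while dictionary and reverse_dictionary serve as the translation table for looking up the relationship between vectors and words at the end.

What if you do not want to cap the dictionary at vocabulary_size? The answer is simple. According to Mikolov's original paper it is enough to drop words whose frequency is below roughly 3 to 10, and we can achieve that with the following modification to the function: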

import collections

def build_dataset(words, min_cut_freq):
  # Collect the frequency of every word in the input.
  count_org = [['UNK', -1]]
  count_org.extend(collections.Counter(words).most_common())
  count = [['UNK', -1]]
  for word, c in count_org:
    word_tuple = [word, c]
    if word == 'UNK':  # Keep the UNK slot at index 0 for later use.
      count[0][1] = c
      continue
    # min_cut_freq is the cutoff parameter: with this test, words that occur
    # min_cut_freq times or fewer are dropped.
    if c > min_cut_freq:
      count.append(word_tuple)
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reverse_dictionary
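To make the effect of the frequency cutoff concrete, here is a minimal sketch on a made-up nine-word corpus. The helper name build_vocab and the toy sentence are assumptions for illustration only; the helper is a compact stand-in for build_dataset that mirrors its `c > min_cut_freq` test, not the function itself.

```python
import collections

def build_vocab(words, min_cut_freq):
    """Hypothetical compact variant of the frequency cutoff in build_dataset."""
    counts = collections.Counter(words)
    # Index 0 is reserved for UNK; kept words follow in descending frequency.
    dictionary = {"UNK": 0}
    for word, c in counts.most_common():
        if c > min_cut_freq:  # same comparison as in build_dataset
            dictionary[word] = len(dictionary)
    # Every word outside the dictionary is mapped to code 0 (UNK).
    data = [dictionary.get(word, 0) for word in words]
    unk_count = data.count(0)
    reverse_dictionary = {code: word for word, code in dictionary.items()}
    return data, unk_count, dictionary, reverse_dictionary

words = "the cat sat on the mat the cat ran".split()
data, unk_count, dictionary, reverse_dictionary = build_vocab(words, min_cut_freq=1)
print(dictionary)  # {'UNK': 0, 'the': 1, 'cat': 2}
print(data)        # [1, 2, 0, 0, 1, 0, 1, 2, 0]
print(unk_count)   # 4
```

Only "the" (3 occurrences) and "cat" (2 occurrences) survive a cutoff of 1; the four words that occur once all collapse onto the UNK code.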

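Next, generate_batch at line 91 of the source is really the entry point for building the skip-gram model, rather than the graph that follows with graph.as_default() at line 137. The code after line 137 just sets up a simple MLP for the tensors to flow through. It is the form of the input tensor and its target that defines the model.

Read carefully and you will see the following. Take the input sentence "蝙蝠侠战胜了超人,美国队长却被钢铁侠暴打" ("Batman defeated Superman, but Captain America got beaten up by Iron Man"). After build_dataset, 蝙蝠侠 (Batman) might be replaced by its code in dictionary, 3; 战胜了 (defeated) by 90; 超人 (Superman) by 600; 美国队长 (Captain America) by 58; 被 (the passive marker) by 77; 钢铁侠 (Iron Man) by 888; and 暴打 (beaten up) by 965. The sentence thus becomes [3,90,600,58,77,888,965]. Assume the window spans 3 words (skip_window = 1, one word on each side of the center) and the model is skip-gram. Starting from 90, generate_batch outputs the batch [90,90,600,600,58,58,77,77,888,888] and the target [3,600,90,58,600,77,58,888,77,965].

So how do we build a CBOW model? It looks simple: the input and the prediction of CBOW are exactly the reverse of skip-gram, so why not just swap batch on line 109 with labels on line 110? The code is as follows: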

import collections
import random

import numpy as np

# Relies on the globals data and data_index defined earlier in the script.
def generate_cbow_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [ skip_window ]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      # Skip-gram version of batch and labels:
      # batch[i * num_skips + j] = buffer[skip_window]
      # labels[i * num_skips + j, 0] = buffer[target]
      # CBOW version: the two skip-gram lines above with their right-hand sides swapped.
      batch[i * num_skips + j] = buffer[target]
      labels[i * num_skips + j, 0] = buffer[skip_window]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

With that, all we need to do is replace the function in the later call batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window) with your CBOW function.
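The pairs from the example sentence can be reproduced with a small deterministic sketch. The helper skipgram_pairs is an assumption for illustration: unlike generate_batch it does not pick context positions at random and does not use the global data_index, so it only shows which (input, label) pairs arise for skip_window = 1 and what swapping them gives.

```python
def skipgram_pairs(ids, skip_window=1):
    """Pair every center word with each neighbour inside the window (hypothetical helper)."""
    inputs, labels = [], []
    for center in range(skip_window, len(ids) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset == 0:
                continue  # the center word is never its own label
            inputs.append(ids[center])
            labels.append(ids[center + offset])
    return inputs, labels

sentence = [3, 90, 600, 58, 77, 888, 965]
sg_inputs, sg_labels = skipgram_pairs(sentence)
print(sg_inputs)  # [90, 90, 600, 600, 58, 58, 77, 77, 888, 888]
print(sg_labels)  # [3, 600, 90, 58, 600, 77, 58, 888, 77, 965]

# The swap described above: context words become inputs, center words become labels.
swapped_inputs, swapped_labels = sg_labels, sg_inputs
print(swapped_inputs[:4], swapped_labels[:4])  # [3, 600, 90, 58] [90, 90, 600, 600]
```

After the swap each context word still predicts the center word on its own, one pair at a time.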

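Important update (2016-05-21):

Thanks to Teacher Chen of Shenzhen University for recommending the word embedding paper How to Generate a Good Word Embedding. The paper not only explains how to analyse the quality of word vectors, it also gives a thorough account of the differences between models. While reading it I found that the difference between Skip-Gram and CBOW is not merely that their inputs and outputs are reversed. There is another distinctive difference in the model itself: the input layer of CBOW is a sum whose result is the average of the input (context) vectors, whereas Skip-gram uses a single word to represent the context, i.e. "one of the context words as the representation of the context".

Taking this into account and comparing it with the generate_cbow_batch code above, the problem is that the expected batch and labels should not be [3,600,90,58,600,77,58,888,77,965] and [90,90,600,600,58,58,77,77,888,888]. Instead, the input should be [[3,600], [90,58], [600,77], [58,888], [77,965]] and the output [90, 600, 58, 77, 888]. How do we modify generate_cbow_batch to do this? The change is simple: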

def generate_cbow_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  # batch is now a 2-D array; each row holds the context of one word.
  # Note: the input placeholder used later has width skip_window * 2, so
  # num_skips has to equal 2 * skip_window for the shapes to match.
  batch = np.ndarray(shape=(batch_size, num_skips), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size):
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [ skip_window ]
    # Temporary array that collects the context words before they are
    # written into the batch.
    batch_temp = np.ndarray(shape=(num_skips), dtype=np.int32)
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch_temp[j] = buffer[target]
    batch[i] = batch_temp
    labels[i,0] = buffer[skip_window]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels
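The expected CBOW batch for the example sentence can be checked with another small deterministic sketch. The helper cbow_examples is an assumption for illustration: it always lists the left context before the right context, whereas generate_cbow_batch fills each row in random order.

```python
def cbow_examples(ids, skip_window=1):
    """Build one (context row, center label) pair per center word (hypothetical helper)."""
    contexts, labels = [], []
    for center in range(skip_window, len(ids) - skip_window):
        left = ids[center - skip_window:center]
        right = ids[center + 1:center + skip_window + 1]
        contexts.append(left + right)  # all context words of this center, in one row
        labels.append(ids[center])
    return contexts, labels

sentence = [3, 90, 600, 58, 77, 888, 965]
contexts, labels = cbow_examples(sentence)
print(contexts)  # [[3, 600], [90, 58], [600, 77], [58, 888], [77, 965]]
print(labels)    # [90, 600, 58, 77, 888]
```

Each row now carries the whole context of one center word, which is what lets the model average the context vectors before predicting.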

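Next, because the CBOW model differs structurally from the Skip-Gram model, we need to define an intermediate layer that sums the context embeddings and averages the result before it is passed to the output. We therefore make the following changes to TensorFlow's skip-gram model: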

graph = tf.Graph()
with graph.as_default():
  # Input data.
  # Change 1:
  #---------------------------------------------------------------------------------------------------------------
  # This input is the skip-gram one; its size is batch_size x 1.
  #train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
  # For CBOW every word comes with a whole context as input, so the input
  # size is batch_size x context.
  train_inputs = tf.placeholder(tf.int32,shape=[batch_size, skip_window * 2])
  #---------------------------------------------------------------------------------------------------------------
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  # Ops and variables pinned to the CPU because of missing GPU implementation
  with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    # Embedding size is calculated as shape(train_inputs) + shape(embeddings)[1:]
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    # Change 2:
    #---------------------------------------------------------------------------------------------------------------
    # The addition here first sums the embed tensor and then averages it.
    # Note that the second argument of reduce_sum is set to 1. Suppose
    # batch_size is 200, the context size is 4 and the word vector size is
    # 200: we get a tensor of shape 200 x 4 x 200, because one run processes
    # 200 examples, each example has 4 context words, and each word is a
    # 200-dimensional vector. We want to sum over the context words, i.e. over
    # the dimension of size 4, so the argument must be 1. Setting it to 0
    # would sum over the 200 examples, and 2 would sum over the components of
    # each word vector.
    reduced_embed = tf.div(tf.reduce_sum(embed, 1), skip_window*2)
    #---------------------------------------------------------------------------------------------------------------
    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
  # Compute the average NCE loss for the batch.
  # tf.nce_loss automatically draws a new sample of the negative labels each
  # time we evaluate the loss.
  loss = tf.reduce_mean(
      tf.nn.nce_loss(nce_weights, nce_biases, reduced_embed, train_labels,
                     num_sampled, vocabulary_size))
  # Construct the SGD optimizer using a learning rate of 1.0.
  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
  # Compute the cosine similarity between minibatch examples and all embeddings.
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
      normalized_embeddings, valid_dataset)
  similarity = tf.matmul(
      valid_embeddings, normalized_embeddings, transpose_b=True)
  # Add variable initializer.
  init = tf.initialize_all_variables()

# Step 5: Begin training.
num_steps = 100001
with tf.Session(graph=graph) as session:
  # We must initialize all variables before we use them.
  init.run()
  print("Initialized")
  average_loss = 0
  for step in xrange(num_steps):
    # Change 3:
    #---------------------------------------------------------------------------------------------------------------
    # Replace generate_batch (or generate_skipgram_batch) with
    # generate_cbow_batch here.
    batch_inputs, batch_labels = generate_cbow_batch(
        batch_size, num_skips, skip_window)
    #---------------------------------------------------------------------------------------------------------------
    feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}
    # We perform one update step by evaluating the optimizer op (including it
    # in the list of returned values for session.run()
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val
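The shape bookkeeping behind change 2 can be checked outside TensorFlow. The following is a minimal NumPy sketch with made-up sizes and word ids (all values here are assumptions for illustration); integer-array indexing stands in for tf.nn.embedding_lookup and a sum over axis 1 stands in for tf.reduce_sum(embed, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_size = 10, 4  # toy sizes, not the tutorial's
context_size = 2                         # plays the role of skip_window * 2

embeddings = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))
train_inputs = np.array([[3, 6], [9, 5], [6, 7]])  # 3 examples, 2 context ids each

# Lookup: the result shape is shape(train_inputs) + shape(embeddings)[1:].
embed = embeddings[train_inputs]
print(embed.shape)  # (3, 2, 4)

# Sum over axis 1 (the context words), then divide to get the average.
reduced_embed = embed.sum(axis=1) / context_size
print(reduced_embed.shape)  # (3, 4)

# Row 0 is the mean of the vectors for word ids 3 and 6.
print(np.allclose(reduced_embed[0], (embeddings[3] + embeddings[6]) / 2))  # True
```

Summing over axis 0 instead would mix different training examples together, and summing over axis 2 would collapse each word vector to a single number, which is why axis 1 is the one to reduce.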

A trial run of this program gave the following results:

Nearest to to: cruel, must, would, should, will, could, nigeria, captive,

Nearest to may: can, would, could, will, might, must, should, cannot,

Nearest to was: is, had, has, were, became, be, been, perceive,

Nearest to into: through, delicious, from, comrades, reflexive, pellets, awarding, slowly,

Nearest to some: many, these, any, various, several, both, their, wise,

Nearest to that: which, meadow, how, battlefront, however, powell, animism, this,

Nearest to also: never, still, often, actually, sometimes, usually, originally, below,

Nearest to are: were, have, is, be, include, do, sprites, been,

Nearest to new: nominally, dns, fermentable, final, proprietorships, aloe, junior, reservoirs,

Nearest to their: its, his, her, the, your, some, my, whose,

Nearest to years: decades, year, history, times, days, months, marmoset, wrangler,

Nearest to there: they, it, she, he, these, generally, lemon, we,

Nearest to th: eight, zero, nine, plasticizers, fairies, characteristic, documentation, anecdotes,

Nearest to many: some, several, these, such, most, various, wise, other,

Nearest to but: however, and, although, while, pursuing, marmoset, glowing, components,

Nearest to see: wants, atomic, charlotte, crimson, tanaka, caius, maine, scuttled,

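As these results show, the system runs reasonably well. For example, the nearest words to are include were, have, is, be, include and do; anyone with a grounding in English will recognise that these are indeed similar to are in usage and meaning. Many other words, their among them, also appear to give decent results. Interested readers are welcome to read my source code.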
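The "Nearest to" lists come from the cosine similarity computed in the graph: the embeddings are normalised to unit length and multiplied against each other. A minimal NumPy sketch of that idea follows; the function name nearest and the four toy 2-D vectors are assumptions for illustration, not part of the tutorial code.

```python
import numpy as np

def nearest(embeddings, word_id, top_k=3):
    """Return the ids of the top_k rows most cosine-similar to row word_id."""
    norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
    normalized = embeddings / norm
    # With unit-length rows, the dot product equals the cosine similarity.
    sim = normalized @ normalized[word_id]
    order = np.argsort(-sim)  # most similar first
    return [int(i) for i in order if i != word_id][:top_k]

toy = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [-1.0, 0.0]])
print(nearest(toy, 0, top_k=2))  # [1, 2]
```

Row 1 points in almost the same direction as row 0, so it ranks first; row 3 points the opposite way and ranks last.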
