From: https://www.youtube.com/watch?v=pw187aaz49o

Ref: http://blog.csdn.net/abcjennifer/article/details/46397829

Ref: Word2Vec (Part 1): NLP With Deep Learning with Tensorflow (Skip-gram) [Nice!]

Ref: Word2Vec (Part 2): NLP With Deep Learning with Tensorflow (CBOW) [Nice!]

Vector Representations of Words

In this tutorial we look at the word2vec model by Mikolov et al. This model is used for learning vector representations of words, called "word embeddings".

Highlights

This tutorial is meant to highlight the interesting, substantive parts of building a word2vec model in TensorFlow.

  • We start by giving the motivation for why we would want to represent words as vectors.
  • We look at the intuition behind the model and how it is trained (with a splash of math for good measure).
  • We also show a simple implementation of the model in TensorFlow.
  • Finally, we look at ways to make the naive version scale better.

tensorflow/examples/tutorials/word2vec/word2vec_basic.py

This basic example contains the code needed to download some data, train on it a bit and visualize the result.

Once you get comfortable with reading and running the basic version, you can graduate to models/tutorials/embedding/word2vec.py

which is a more serious implementation that showcases some more advanced TensorFlow principles about how to efficiently use threads to move data into a text model, how to checkpoint during training, etc.

Motivation: Why Learn Word Embeddings?

However, natural language processing systems traditionally treat words as discrete atomic symbols, and therefore 'cat' may be represented as Id537 and 'dog' as Id143.

This means that the model can leverage very little of what it has learned about 'cats' when it is processing data about 'dogs' (such that they are both animals, four-legged, pets, etc.).

Vector space models

Vector space models (VSMs) represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points ('are embedded nearby each other').

VSMs have a long, rich history in NLP, but all methods depend in some way or another on the Distributional Hypothesis, which states that words that appear in the same contexts share semantic meaning.

The different approaches that leverage this principle can be divided into two categories:

This distinction is elaborated in much more detail by Baroni et al., but in a nutshell:

  • Count-based methods compute the statistics of how often some word co-occurs with its neighbor words in a large text corpus, and then map these count-statistics down to a small, dense vector for each word.
  • Predictive models directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model).

Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text.

It comes in two flavors,

  • the Continuous Bag-of-Words model (CBOW)
  • the Skip-Gram model (Section 3.1 and 3.2 in Mikolov et al.).

Algorithmically, these models are similar, except that

  • CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'),
  • the skip-gram does the inverse and predicts source context-words from the target words.
  • CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets.
  • skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets. We will focus on the skip-gram model in the rest of this tutorial.

Scaling up with Noise-Contrastive Training

Neural probabilistic language models are traditionally trained using the maximum likelihood (ML) principle to maximize the probability of the next word wt (for "target") given the previous words h (for "history") in terms of a softmax function,

 

On the other hand, for feature learning in word2vec we do not need a full probabilistic model.

The CBOW and skip-gram models are instead trained using a binary classification objective (logistic regression) to discriminate the real target words wt from k imaginary (noise) words w~,

in the same context. We illustrate this below for a CBOW model. For skip-gram the direction is simply inverted.

 
 

The Skip-gram Model

We can visualize the learned vectors by projecting them down to 2 dimensions using for instance something like the t-SNE dimensionality reduction technique.

When we inspect these visualizations it becomes apparent that the vectors capture some general, and in fact quite useful, semantic information about words and their relationships to one another.

It was very interesting when we first discovered that certain directions in the induced vector space specialize towards certain semantic relationships, e.g. male-femaleverb tense and even country-capital relationships between words, as illustrated in the figure below (see also for example Mikolov et al., 2013).

This explains why these vectors are also useful as features for many canonical NLP prediction tasks, such as part-of-speech tagging or named entity recognition (see for example the original work by Collobert et al., 2011 (pdf), or follow-up work by Turian et al., 2010).

But for now, let's just use them to draw pretty pictures!

Building the Graph

This is all about embeddings, so let's define our embedding matrix. This is just a big random matrix to start. We'll initialize the values to be uniform in the unit cube.

embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

The noise-contrastive estimation loss is defined in terms of a logistic regression model. For this, we need to define the weights and biases for each word in the vocabulary (also called the output weights as opposed to the input embeddings). So let's define that.

nce_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

Now that we have the parameters in place, we can define our skip-gram model graph. For simplicity, let's suppose we've already integerized our text corpus with a vocabulary

so that each word is represented as an integer (seetensorflow/examples/tutorials/word2vec/word2vec_basic.py for the details).

The skip-gram model takes two inputs. One is a batch full of integers representing the source context words, the other is for the target words. Let's create placeholder nodes for these inputs, so that we can feed in data later.

# Placeholders for inputs
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

Now what we need to do is look up the vector for each of the source words in the batch. TensorFlow has handy helpers that make this easy.

embed = tf.nn.embedding_lookup(embeddings, train_inputs)

Ok, now that we have the embeddings for each word,

we'd like to try to predict the target word using the noise-contrastive training objective.

# Compute the NCE loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
tf.nn.nce_loss(weights=nce_weights,
biases=nce_biases,
labels=train_labels,
inputs=embed,
num_sampled=num_sampled,
num_classes=vocabulary_size))

Now that we have a loss node, we need to add the nodes required to compute gradients and update the parameters, etc. For this we will use stochastic gradient descent, and TensorFlow has handy helpers to make this easy as well.

# We use the SGD optimizer.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

Training the Model

Training the model is then as simple as using a feed_dict to push data into the placeholders and calling tf.Session.run with this new data in a loop.

for inputs, labels in generate_batch(...):
feed_dict = {train_inputs: inputs, train_labels: labels}
_, cur_loss = session.run([optimizer, loss], feed_dict=feed_dict)

See the full example code in tensorflow/examples/tutorials/word2vec/word2vec_basic.py.

 

Visualizing the Learned Embeddings

After training has finished we can visualize the learned embeddings using t-SNE.

As expected, words that are similar end up clustering nearby each other.

For a more heavy weight implementation of word2vec that showcases more of the advanced features of TensorFlow, see the implementation in models/tutorials/embedding/word2vec.py.

Evaluating Embeddings: Analogical Reasoning

Embeddings are useful for a wide variety of prediction tasks in NLP. Short of training a full-blown part-of-speech model or named-entity model, one simple way to evaluate embeddings is to directly use them to predict syntactic and semantic relationships like king is to queen as father is to ?. This is called analogical reasoning and the task was introduced by Mikolov and colleagues. Download the dataset for this task from download.tensorflow.org.

To see how we do this evaluation, have a look at the build_eval_graph() and eval() functions inmodels/tutorials/embedding/word2vec.py.

The choice of hyperparameters can strongly influence the accuracy on this task. To achieve state-of-the-art performance on this task requires training over a very large dataset, carefully tuning the hyperparameters and making use of tricks like subsampling the data, which is out of the scope of this tutorial.

Optimizing the Implementation

Loss function change.

For example, changing the training objective is as simple as swapping out the call to tf.nn.nce_loss() for an off-the-shelf alternative such as tf.nn.sampled_softmax_loss().

If you have a new idea for a loss function, you can manually write an expression for the new objective in TensorFlow and let the optimizer compute its derivatives.

Run more efficiently

If you find your model is seriously bottlenecked on input data, you may want to implement a custom data reader for your problem, as described in New Data Formats. For the case of Skip-Gram modeling, we've actually already done this for you as an example in models/tutorials/embedding/word2vec.py.

Custom API for more performance

If your model is no longer I/O bound but you want still more performance, you can take things further by writing your own TensorFlow Ops, as described in Adding a New Op.

Again we've provided an example of this for the Skip-Gram case models/tutorials/embedding/word2vec_optimized.py. Feel free to benchmark these against each other to measure performance improvements at each stage.

Conclusion

In this tutorial we covered the word2vec model, a computationally efficient model for learning word embeddings. We motivated why embeddings are useful, discussed efficient training techniques and showed how to implement all of this in TensorFlow. Overall, we hope that this has show-cased how TensorFlow affords you the flexibility you need for early experimentation, and the control you later need for bespoke optimized implementation.

[IR] Word Embeddings的更多相关文章

  1. 翻译 | Improving Distributional Similarity with Lessons Learned from Word Embeddings

    翻译 | Improving Distributional Similarity with Lessons Learned from Word Embeddings 叶娜老师说:"读懂论文的 ...

  2. 论文阅读笔记 Word Embeddings A Survey

    论文阅读笔记 Word Embeddings A Survey 收获 Word Embedding 的定义 dense, distributed, fixed-length word vectors, ...

  3. 课程五(Sequence Models),第二 周(Natural Language Processing & Word Embeddings) —— 1.Programming assignments:Operations on word vectors - Debiasing

    Operations on word vectors Welcome to your first assignment of this week! Because word embeddings ar ...

  4. Word Embeddings

    能够充分意识到W的这些属性不过是副产品而已是很重要的.我们没有尝试着让相似的词离得近.我们没想把类比编码进不同的向量里.我们想做的不过是一个简单的任务,比如预测一个句子是不是成立的.这些属性大概也就是 ...

  5. Papers of Word Embeddings

    首先解释一下什么叫做embedding.举个例子:地图就是对于现实地理的embedding,现实的地理地形的信息其实远远超过三维 但是地图通过颜色和等高线等来最大化表现现实的地理信息. embeddi ...

  6. NLP:单词嵌入Word Embeddings

    深度学习.自然语言处理和表征方法 原文链接:http://blog.jobbole.com/77709/ 一个感知器网络(perceptron network).感知器 (perceptron)是非常 ...

  7. Word Embeddings: Encoding Lexical Semantics

    Word Embeddings: Encoding Lexical Semantics Getting Dense Word Embeddings Word Embeddings in Pytorch ...

  8. [C5W2] Sequence Models - Natural Language Processing and Word Embeddings

    第二周 自然语言处理与词嵌入(Natural Language Processing and Word Embeddings) 词汇表征(Word Representation) 上周我们学习了 RN ...

  9. deeplearning.ai 序列模型 Week 2 NLP & Word Embeddings

    1. Word representation One-hot representation的缺点:把每个单词独立对待,导致对相关词的泛化能力不强.比如训练出“I want a glass of ora ...

随机推荐

  1. Centos 7 安装 Mysql 5.5 5.6 5.7

    环境 [root@node1 ~]# cat /etc/redhat-release CentOS Linux release (Core) [root@node1 ~]# uname -a Linu ...

  2. Microsoft.mshtml.dll 添加引用及类型选择错误问题解决办法

    在比较早的文章中,提到使用 Microsoft.mshtml.dll 进行模拟浏览器点击的例子. 1.添加引用的问题 一般在开发环境下会在三个地方存有microsoft.mshtml.dll文件.所以 ...

  3. android http中请求访问添加 cookie

    第一种HashMap<String, String> map = new HashMap<String, String>();map.put("cookie" ...

  4. AutoRegister ASM AOP 字节码 案例 原理 MD

    Markdown版本笔记 我的GitHub首页 我的博客 我的微信 我的邮箱 MyAndroidBlogs baiqiantao baiqiantao bqt20094 baiqiantao@sina ...

  5. hadoop权威指南学习(一) - 天气预报MapReduce程序的开发和部署

    看过Tom White写的Hadoop权威指南(大象书)的朋友一定得从第一个天气预报的Map Reduce程序所吸引, 殊不知,Tom White大牛虽然在书中写了程序和讲解了原理,但是他以为你们都会 ...

  6. grid - 初识

    Grid有三个参数 目前介绍以下两种:grid.inline-grid <view class="grid"> <view class='grid-row'> ...

  7. SCWS 中文分词_测试成功

    地址: http://www.xunsearch.com/scws/index.php

  8. Redis相关技巧

    一. 内存占用过大,设置内存最大上限. vi /etc/redis.conf maxmemory 1g maxmemory-policy allkeys-lru (慎用) appendonly yes ...

  9. JAVA常用代码

    一. 判断是否包含某个注解.    1). 声明接口 @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE) @Documented ...

  10. 湾区求职分享:三个月刷题拿到 Google offer,欢迎踊跃提问

    本文仅以个人经历和个人观点作为参考.如能受益,不胜荣幸. 本文会不断的修正,更新.希望通过大家的互动最后能写出一份阅者受益的文章. 本文纯手打,会有错别字,欢迎指出,虚心接受及时更改. 小马过河,大牛 ...