Paper Reference: word2vec Parameter Learning Explained

1. One-word context Model

In our setting, the vocabulary size is $V$, and the hidden layer size is $N$.

The input $x$ is a one-hot vector: for a given input context word, only one of the $V$ units $\{x_1,\cdots,x_V\}$ is 1, and all other units are 0. For example,

\[x=[0,\cdots,1,\cdots,0]\]

The weights between the input layer and the hidden layer can be represented by a $V \times N$ matrix $W$. Each row of $W$ is the $N$-dimensional vector representation $v_w$ of the corresponding word of the input layer.

Given a context (a single word), and assuming $x_k=1$ and $x_{k'}=0$ for $k'\neq k$, we have

\[h=x^T W=W_{(k,\cdot)}:=v_{w_I}\]

which is just copying the $k$-th row of $W$ to $h$. $v_{w_I}$ is the vector representation of the input word $w_I$. This implies that the link (activation) function of the hidden layer units is simply linear (i.e., directly passing its weighted sum of inputs to the next layer).

From the hidden layer to the output layer, there is a different weight matrix $W'=\{w'_{ij}\}$, which is an $N \times V$ matrix. Using these weights, we can compute a score $u_j$ for each word in the vocabulary:

\[ u_j={v'_{w_j}}^T \cdot h \]

where $v'_{w_j}$ is the $j$-th column of the matrix $W'$. Then we can use the softmax classification model to obtain the posterior distribution of the words, which is a multinomial distribution:

\[p(w_j|w_I)=y_j=\frac{\exp(u_j)}{\sum_{j'=1}^V{\exp(u_{j'})}}\]

where $y_j$ is the output of the $j$-th unit in the output layer.

Substituting $h = v_{w_I}$ and $u_j = {v'_{w_j}}^T h$, we finally obtain:

\[p(w_j | w_I) = y_j = \frac{\exp\left( {v'_{w_j}}^T v_{w_I}\right)}{\sum_{j'=1}^V{\exp\left( {v'_{w_{j'}}}^T v_{w_I}\right)}}\]

Note that $v_w$ and $v'_w$ are two representations of the word $w$: $v_w$ comes from the rows of $W$, the input $\to$ hidden weight matrix, and $v'_w$ comes from the columns of $W'$, the hidden $\to$ output matrix. In the subsequent analysis, we call $v_w$ the "input vector" and $v'_w$ the "output vector" of the word $w$.
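
For concreteness, here is a minimal NumPy sketch of this forward pass. The sizes, the random initialization, and the index `k` are illustrative assumptions, not values from the paper.

```python
import numpy as np

# A minimal sketch of the one-word-context forward pass; V, N, k are assumed.
V, N = 10, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))        # input -> hidden weights; row i is the input vector v_{w_i}
W_prime = rng.normal(scale=0.1, size=(N, V))  # hidden -> output weights; column j is the output vector v'_{w_j}

k = 3                                         # index of the input context word w_I
h = W[k]                                      # h = x^T W: copy the k-th row of W, i.e. v_{w_I}
u = W_prime.T @ h                             # scores u_j = v'_{w_j}^T h for every word in the vocabulary
y = np.exp(u - u.max())
y /= y.sum()                                  # softmax posterior p(w_j | w_I)
```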

1.2. Cost Function

Let's derive the weight update equations for this model. Although the actual computation is impractical, we still derive the update equations to gain insight into the original model without any tricks.

The training objective is to maximize the conditional probability of observing the actual output word $w_o$ (denote its index in the output layer as $j*$) given the input context word $w_I$, with respect to the weights:

\[ \max p(w_o|w_I)=\max y_{j*}\]

which is equivalent to minimizing the negative log probability:

\[ E=-\log p(w_o|w_I)=-u_{j*}+\log \sum\limits_{j'=1}^V{\exp(u_{j'})}\]

where $j*$ is the index of the actual output word in the output layer. Note that this loss function can be understood as a special case of the cross-entropy measurement between two probability distributions, which was discussed in a previous post: Negative log-likelihood function.
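
As a quick check, the loss for one training pair can be computed directly from the scores `u` in the previous sketch (reusing its variables; `j_star` is an assumed index of the actual output word):

```python
# Reusing u and y from the sketch above; j_star is an assumed example index.
j_star = 7
log_sum = u.max() + np.log(np.exp(u - u.max()).sum())  # numerically stable log-sum-exp
E = -u[j_star] + log_sum                                # E = -u_{j*} + log sum_j' exp(u_{j'})
# Equivalently, E equals -np.log(y[j_star]) up to floating-point error.
```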

1.3. Update the hidden $\to$ output weights

Let's derive the update equation for the weights between the hidden and output layers, $W'_{N\times V}$.

First, take the derivative of $E$ with respect to the $j$-th unit's net input $u_j$:

\[ \frac{\partial E}{\partial u_j}= y_j-t_j := e_j \]

where

\[t_j = \begin{cases}1, & j=j* \\ 0, & j \neq j* \end{cases} \]

i.e., $t_j$ is 1 only when the $j$-th word is the actual output word; otherwise $t_j=0$. Note that this derivative is simply the prediction error $e_j$ of the output layer.
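
The claim that $\frac{\partial E}{\partial u_j}=y_j-t_j$ can be verified with a quick finite-difference check; this hedged sketch reuses `u`, `y`, and `j_star` from the snippets above, and the probed index `j` is arbitrary.

```python
# Numerical check of dE/du_j = y_j - t_j, reusing u, y, j_star from above.
def loss(u_vec):
    m = u_vec.max()
    return -u_vec[j_star] + m + np.log(np.exp(u_vec - m).sum())

eps = 1e-6
j = 0
u_plus = u.copy();  u_plus[j]  += eps
u_minus = u.copy(); u_minus[j] -= eps
numeric = (loss(u_plus) - loss(u_minus)) / (2 * eps)
analytic = y[j] - (1.0 if j == j_star else 0.0)
# numeric and analytic should agree to within ~1e-6
```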

Next, we take the derivative with respect to $w'_{ij}$ to obtain the gradient on the hidden $\to$ output weights $W'_{N \times V}$:

\[  \frac{\partial E}{\partial w'_{ij}}=  \frac{\partial E}{\partial u_j} \cdot  \frac{\partial u_j}{\partial w'_{ij} }= e_j\cdot h_i  \]

Therefore, using stochastic gradient descent (SGD), we obtain the update equation for the hidden $\to$ output weights:

\[ w'_{ij} = w'_{ij} - \alpha \cdot e_j \cdot h_i\]

or, in vector notation:

\[ v'_{w_j}=v'_{w_j} - \alpha\cdot e_j \cdot h ~~~~~~j=1,2,\cdots,V\]

where $\alpha$ is the learning rate, $e_j = y_j - t_j$, and $h_i$ is the $i$-th unit in the hidden layer; $v'_{w_j}$ is the output vector of $w_j$.

Note that this update equation implies that we have to go through every word in the vocabulary, check its output probability $y_j$, and compare $y_j$ with its expected output $t_j$ (either 0 or 1).

If $y_j > t_j$ ("overestimating"), then we subtract a proportion of the hidden vector $h$ (i.e., $v_{w_I}$) from $v'_{w_j}$, thus moving $v'_{w_j}$ farther away from $v_{w_I}$; if $y_j < t_j$ ("underestimating", which can only happen when $t_j=1$, i.e., $w_j=w_o$), we add some of $h$ to $v'_{w_o}$, thus moving $v'_{w_o}$ closer to $v_{w_I}$. If $y_j \approx t_j$, then according to the update equation, little change is made to the weights. Note again that $v_w$ (input vector) and $v'_w$ (output vector) are two different vector representations of the word $w$.
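
A hedged sketch of this update, continuing the variables from the snippets above (`alpha` is a hypothetical learning rate, not a value from the paper):

```python
# SGD update of the hidden -> output weights; alpha is an assumed learning rate.
alpha = 0.025
t = np.zeros(V)
t[j_star] = 1.0                      # t_j = 1 only for the actual output word
e = y - t                            # prediction errors e_j = y_j - t_j
W_prime -= alpha * np.outer(h, e)    # w'_{ij} <- w'_{ij} - alpha * e_j * h_i, i.e. every v'_{w_j} <- v'_{w_j} - alpha * e_j * h
```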

1.4. Update the input $\to$ hidden weights

Having obtained the update equations for $W'$, we now move on to $W$. Taking the derivative of $E$ with respect to the output of the hidden layer, we obtain:

\[\frac{\partial E}{\partial h_i}=\sum\limits_{j=1}^V\frac{\partial E}{\partial u_j}\cdot\frac{\partial u_j}{\partial h_i}=\sum\limits_{j=1}^V{e_j\cdot w'_{ij}}:=EH_i\]

where $h_i$ is the output of the $i$-th unit of the hidden layer; $u_j$ is the net input of the $j$-th unit in the output layer; and $e_j=y_j-t_j$ is the prediction error of the $j$-th word in the output layer. $EH$, an $N$-dimensional vector, is the sum of the output vectors of all words in the vocabulary, weighted by their prediction errors.

Next, we take the derivative of $E$ with respect to $W$. First, recall that the hidden layer is a linear function of the input layer:

\[h_i=\sum\limits_{k=1}^V{x_k}\cdot w_{ki}\]

Then we obtain:

\[\frac{\partial E}{\partial w_{ki}}=\frac{\partial E}{\partial h_i} \cdot \frac{\partial h_i}{\partial w_{ki}}=EH_i \cdot x_k\]

\[\frac{\partial E}{\partial W}=x\otimes EH = x\, EH^T\]

which gives a $V\times N$ matrix. Since only one component of $x$ is non-zero, only one row of $\frac{\partial E}{\partial W}$ is non-zero, and the value of that row is $EH$, an $N$-dimensional vector.

We obtain the update equation for $W$ as:

\[v_{w_I}=v_{w_I}-\alpha\cdot EH\]

where $v_{w_I}$ is a row of $W$, the "input vector" of the only context word, and is the only row of $W$ whose gradient is non-zero. All the other rows of $W$ remain unchanged in this iteration.

Since the vector $EH$ is the sum of the output vectors of all words in the vocabulary weighted by their prediction errors $e_j=y_j-t_j$, this update can be understood as adding a portion of every output vector in the vocabulary to the input vector of the context word.
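
Continuing the same sketch, the input-vector update touches only the row of `W` belonging to the context word. (In a strict single SGD step, `EH` would be computed from `W_prime` before the update in the previous snippet; it is shown in isolation here.)

```python
# Backpropagate to the input -> hidden weights; only row k of W changes.
EH = W_prime @ e          # EH_i = sum_j e_j * w'_{ij}, an N-dim vector
W[k] -= alpha * EH        # v_{w_I} <- v_{w_I} - alpha * EH; all other rows of W stay unchanged
```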

As we iteratively update the model parameters by going through context-target word pairs generated from a training corpus, the effects on the vectors accumulate. We can imagine that the output vector of a word $w$ is "dragged" back and forth by the input vectors of $w$'s co-occurring neighbors, as if there were physical strings between the vector of $w$ and the vectors of its neighbors. Similarly, an input vector can also be considered as being dragged by many output vectors. This interpretation is reminiscent of gravity, or of a force-directed graph layout. The equilibrium length of each imaginary string is related to the strength of co-occurrence between the associated pair of words, as well as to the learning rate. After many iterations, the relative positions of the input and output vectors will eventually stabilize.

2. Multi-word context Model

Now, we move on to the model with a multi-word context setting.

When computing the hidden layer output, the CBOW model computes the mean value of the inputs:

\[h=\frac{1}{C}W^T (x_1+x_2+\cdots+x_C)=\frac{1}{C} (v_{w_1}+v_{w_2}+\cdots+v_{w_C})\]

where $C$ is the number of words in the context, $w_1,\cdots,w_C$ are the words in the context, and $v_w$ is the input vector of a word $w$. The loss function is:

\[E = - \log p(w_o | w_{I,1},\cdots,w_{I,C}) \]

\[=-u_{j*}+\log\sum\limits_{j'=1}^{V}{\exp(u_{j'})}\]

2.2. Update the hidden $\to$ output weights

The update equation for the hidden $\to$ output weights stays the same as in the one-word-context model:

\[ v'_{w_j}=v'_{w_j} - \alpha\cdot e_j \cdot h ~~~~~~j=1,2,\cdots,V\]

Note that we need to apply this to every element of the hidden $\to$ output weight matrix for each training instance.

2.3. Update the input $\to$ hidden weights

The update equation for the input $\to$ hidden weights is similar to that in 1.4, except that now we need to apply the following equation to every word $w_{I,c}$ in the context:

\[ v_{w_{I,c}}=v_{w_{I,c}} -\frac{1}{C}\cdot  \alpha \cdot EH \]

where $v_{w_{I,c}}$ is the input vector of the $c$-th word in the input context; $\alpha$ is the learning rate; and $EH$ is the vector whose $i$-th component is $\frac{\partial E}{\partial h_i}$, as defined in 1.4.
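
Putting the two CBOW updates together, here is a self-contained sketch of one training step with the full softmax (no sampling tricks); the vocabulary size, dimensions, learning rate, and word indices are illustrative assumptions.

```python
import numpy as np

# One CBOW training step; all sizes and indices are assumed for illustration.
V, N, alpha = 10, 4, 0.025
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))        # input vectors v_w (rows)
W_prime = rng.normal(scale=0.1, size=(N, V))  # output vectors v'_w (columns)

context = [1, 4, 6]                           # indices of the C context words w_{I,1},...,w_{I,C}
j_star = 2                                    # index of the actual output word w_o
C = len(context)

h = W[context].mean(axis=0)                   # h = (1/C) * (v_{w_1} + ... + v_{w_C})
u = W_prime.T @ h                             # scores for every word in the vocabulary
y = np.exp(u - u.max()); y /= y.sum()         # softmax posterior

t = np.zeros(V); t[j_star] = 1.0
e = y - t                                     # prediction errors e_j = y_j - t_j
EH = W_prime @ e                              # computed with the pre-update output vectors
W_prime -= alpha * np.outer(h, e)             # hidden -> output update (same as section 1.3)
W[context] -= (alpha / C) * EH                # each context input vector moves by (alpha/C) * EH
```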
