PyTorch --- word2vec implementation -- "Efficient Estimation of Word Representations in Vector Space"

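The paper is "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al.

Paper link: 66666

The paper introduces two models (CBOW and Skip-gram); the theory behind them is not explained here.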

Skip-gram code with comments, based on https://github.com/graykode/nlp-tutorial:

# -*- coding: utf-8 -*-
# @time : 2019/11/9  12:53
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable  # no-op wrapper since PyTorch 0.4, kept for compatibility
import matplotlib.pyplot as plt

dtype = torch.FloatTensor

# Toy corpus: short sentences of three to six words
sentences = [ "i like dog", "i like cat", "i like animal",
              "dog cat animal", "apple cat dog like", "dog fish milk like",
              "dog cat eyes like", "i like apple", "apple i hate",
              "apple i movie book music like", "cat dog hate", "cat dog like"]

word_sequence = " ".join(sentences).split()  # the whole corpus as one flat token sequence
word_list = sorted(set(word_sequence))  # unique vocabulary, sorted so indices are reproducible
word_dict = {w: i for i, w in enumerate(word_list)}  # word -> index

# Word2Vec Parameter
batch_size = 20  # number of (target, context) pairs per training step
embedding_size = 2  # To show 2 dim embedding graph
voc_size = len(word_list)

# Sample `size` skip-gram pairs: each input is a one-hot vector, each label is a class index (not one-hot)
def random_batch(data, size):
    random_inputs = []
    random_labels = []
    random_index = np.random.choice(range(len(data)), size, replace=False)
    for i in random_index:
        random_inputs.append(np.eye(voc_size)[data[i][0]])  # one-hot vector of the target (center) word
        random_labels.append(data[i][1])  # index of the context word
    return random_inputs, random_labels

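# Make skip-grams with a window size of one
skip_grams = []
# Start at the second word (index=1) and stop before the last one. Each center word
# predicts its left and right neighbours, so for index=1 the pairs [index=1, index=0]
# and [index=1, index=2] are appended to skip_grams.
# The sentences were joined into one sequence, so some pairs cross sentence boundaries.
for i in range(1, len(word_sequence) - 1):
    target = word_dict[word_sequence[i]]
    context = [word_dict[word_sequence[i - 1]], word_dict[word_sequence[i + 1]]]
    for w in context:
        skip_grams.append([target, w])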

# Model
class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        # W and WT are two independent weight matrices; WT is not the transpose of W.
        # Both are initialised uniformly in (-1, 1]. The cast happens before wrapping in
        # nn.Parameter so that the attributes stay registered parameters.
        self.W = nn.Parameter((-2 * torch.rand(voc_size, embedding_size) + 1).type(dtype))  # voc_size -> embedding_size weight
        self.WT = nn.Parameter((-2 * torch.rand(embedding_size, voc_size) + 1).type(dtype))  # embedding_size -> voc_size weight

    def forward(self, X):
        # X : [batch_size, voc_size]
        hidden_layer = torch.matmul(X, self.W)  # hidden_layer : [batch_size, embedding_size]
        output_layer = torch.matmul(hidden_layer, self.WT)  # output_layer : [batch_size, voc_size]
        return output_layer

model = Word2Vec()
criterion = nn.CrossEntropyLoss()  # expects raw scores and class indices, applies softmax internally
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training
for epoch in range(5000):
    input_batch, target_batch = random_batch(skip_grams, batch_size)
    # Stack the one-hot rows into one array first; building a tensor from a list of arrays is slow
    input_batch = Variable(torch.Tensor(np.array(input_batch)))
    target_batch = Variable(torch.LongTensor(target_batch))

    optimizer.zero_grad()
    output = model(input_batch)
    # output : [batch_size, voc_size], target_batch : [batch_size] (LongTensor, not one-hot)
    loss = criterion(output, target_batch)
    if (epoch + 1) % 1000 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))

    loss.backward()
    optimizer.step()

# Why row i of W is the embedding of word i:
# the input has shape [batch_size, voc_size] (each word is a one-hot vector of length voc_size)
# W has shape [voc_size, embedding_size]
# multiplying a one-hot word by W, for example
# [1,0,0]*[w1,w4
#          w2,w5
#          w3,w6]
# gives [w1,w4], so that is the embedding vector of the word,
# i.e. W[i][0], W[i][1]
W, WT = model.parameters()  # parameters come back in registration order: W, then WT
for i, label in enumerate(word_list):
    x, y = float(W[i][0]), float(W[i][1])
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()
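
Two claims made in the comments of the script can be checked on a tiny example: that a window of one pairs each interior word with its left and right neighbour, and that multiplying a one-hot vector by W simply selects one row of W. The following is a minimal sketch in plain Python lists, so it runs without PyTorch. The names make_skip_grams, one_hot, vec_mat and W_demo are hypothetical and introduced here only for illustration, and W_demo holds made-up numbers rather than trained weights.

```python
# Illustrative sketch only: these helpers are not part of the script above.

def make_skip_grams(tokens, vocab):
    """Pair every interior token with its left and right neighbour (window size 1)."""
    pairs = []
    for i in range(1, len(tokens) - 1):
        target = vocab[tokens[i]]
        for j in (i - 1, i + 1):
            pairs.append([target, vocab[tokens[j]]])
    return pairs


def one_hot(index, size):
    """Return a one-hot row vector of the given length as a list of floats."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def vec_mat(vec, mat):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(len(mat[0]))]


tokens = "i like dog".split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}  # {'dog': 0, 'i': 1, 'like': 2}
pairs = make_skip_grams(tokens, vocab)
print(pairs)  # [[2, 1], [2, 0]]: center word "like" paired with "i" and with "dog"

# A made-up 3 x 2 matrix standing in for the trained W (assumption, not real weights).
W_demo = [[0.1, 0.4],
          [0.2, 0.5],
          [0.3, 0.6]]

# One-hot times W returns exactly row i of W, for every word index i.
for i in range(len(W_demo)):
    assert vec_mat(one_hot(i, len(W_demo)), W_demo) == W_demo[i]
print(vec_mat(one_hot(0, 3), W_demo))  # [0.1, 0.4]
```

Because the product only ever picks out a row, reading W[i] directly, as the plotting loop does, gives the same vector without any multiplication.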
