Word Embeddings: Encoding Lexical Semantics
- Getting Dense Word Embeddings
- Word Embeddings in Pytorch
- An Example: N-Gram Language Modeling
- Exercise: Computing Word Embeddings: Continuous Bag-of-Words
Word Embeddings in Pytorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5) # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)
Out:
tensor([[ 0.6614, 0.2669, 0.0617, 0.6213, -0.4519]],
grad_fn=<EmbeddingBackward>)
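The embedding table can also be indexed with several words at once. A minimal sketch, reusing the embeds and word_to_ix defined above (this check is an addition, not part of the original tutorial):

# Look up both vocabulary words in one call; the result has one row per index.
all_idxs = torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long)
print(embeds(all_idxs).shape)  # torch.Size([2, 5])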
An Example: N-Gram Language Modeling
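Recall that in an n-gram language model, given a sequence of words w, we want to compute P(w_i | w_{i-1}, ..., w_{i-n+1}), the probability of a word given the words that precede it. Here we condition on the two preceding words (CONTEXT_SIZE = 2), and the word embeddings are learned as part of the model.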
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples. Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

vocab = set(test_sentence)  # the elements of a set are distinct, so this deduplicates
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
        # Step 1. Prepare the inputs: turn the context words into
        # integer indices and wrap them in a tensor.
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. PyTorch accumulates gradients, so zero them out
        # before passing in a new instance.
        model.zero_grad()

        # Step 3. Forward pass: log probabilities over the vocabulary.
        log_probs = model(context_idxs)

        # Step 4. Compute the loss against the target word's index.
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Backward pass and parameter update.
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    losses.append(total_loss)
print(losses)
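Each entry of losses should be smaller than the last. As a quick sanity check (a sketch added here, not in the original tutorial), the learned vector for any word in the vocabulary can be read straight out of the embedding table:

# The trained 10-dimensional embedding for one word. "beauty" occurs
# in the sonnet, so it is guaranteed to be in word_to_ix.
print(model.embeddings.weight[word_to_ix["beauty"]])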
Exercise: Computing Word Embeddings: Continuous Bag-of-Words
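The Continuous Bag-of-Words (CBOW) model predicts a target word from the window of words around it. Unlike the n-gram model above, the context includes words on both sides of the target, and their order does not matter: the context embeddings are typically summed.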
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)
word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])
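Since the windows are built deterministically, the first tuple pairs the words around index 2: (['We', 'are', 'to', 'study'], 'about').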
class CBOW(nn.Module):

    def __init__(self):
        pass  # exercise: define the embedding and projection layers here

    def forward(self, inputs):
        pass  # exercise: compute log probabilities from the context indices


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)  # example usage
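For reference, here is one possible way to complete the exercise. This is a sketch, not the tutorial's official solution: it sums the four context embeddings and projects the result to log probabilities over the vocabulary, mirroring the n-gram model above. It reuses EMBEDDING_DIM from the n-gram example; the class name CBOWSketch is ours.

class CBOWSketch(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super(CBOWSketch, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: LongTensor of context indices, shape (2 * CONTEXT_SIZE,)
        embeds = self.embeddings(inputs).sum(dim=0, keepdim=True)  # (1, embedding_dim)
        out = self.linear(embeds)
        return F.log_softmax(out, dim=1)  # (1, vocab_size)


# Example usage: one forward pass on the first training pair.
cbow = CBOWSketch(vocab_size, EMBEDDING_DIM)
context_vector = make_context_vector(data[0][0], word_to_ix)
print(cbow(context_vector).shape)  # torch.Size([1, vocab_size])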