Word Embeddings: Encoding Lexical Semantics
- Getting Dense Word Embeddings
- Word Embeddings in Pytorch
- An Example: N-Gram Language Modeling
- Exercise: Computing Word Embeddings: Continuous Bag-of-Words
Word Embeddings in Pytorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5) # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)
Out:
tensor([[ 0.6614, 0.2669, 0.0617, 0.6213, -0.4519]],
grad_fn=<EmbeddingBackward>)
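Because nn.Embedding is just a lookup table, you can fetch several rows at once by passing a tensor with more than one index. A small aside, not in the original listing:

# Batched lookup: one row of the embedding table per index in the tensor.
lookup_both = torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long)
print(embeds(lookup_both).shape)  # torch.Size([2, 5])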
An Example: N-Gram Language Modeling
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples. Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

vocab = set(test_sentence)  # the elements of a set are distinct
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
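A side note, not part of the original tutorial: combining F.log_softmax with nn.NLLLoss, as this model does, is equivalent to passing the raw scores from linear2 directly to nn.CrossEntropyLoss, which fuses both steps. A quick sanity check on made-up scores:

# CrossEntropyLoss == log_softmax + NLLLoss; `scores` and `target` are dummies.
scores = torch.randn(1, 5)   # fake raw scores for a 5-word vocab
target = torch.tensor([2])   # arbitrary target index
nll = nn.NLLLoss()(F.log_softmax(scores, dim=1), target)
ce = nn.CrossEntropyLoss()(scores, target)
print(torch.allclose(nll, ce))  # True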
losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e., turn the
        # words into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in
        # a new instance, zero out the gradients from the old instance.
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next words.
        log_probs = model(context_idxs)

        # Step 4. Compute the loss (the target word must be wrapped in a tensor).
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters.
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element tensor by calling tensor.item().
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)
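The printed losses should decrease over the 10 epochs. To look at the embedding the model learned for a particular word, e.g. "beauty" (any key of word_to_ix works), index the embedding layer's weight matrix directly:

# Each row of the weight matrix is the learned vector for one vocabulary word.
print(model.embeddings.weight[word_to_ix["beauty"]])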
Exercise: Computing Word Embeddings: Continuous Bag-of-Words
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)
word_to_ix = {word: i for i, word in enumerate(vocab)}

data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])  # first entry: (['We', 'are', 'to', 'study'], 'about')


class CBOW(nn.Module):

    def __init__(self):
        pass  # TODO: define the embedding table and output layer here

    def forward(self, inputs):
        pass  # TODO: return log probabilities over the vocabulary


# Helper to turn a list of context words into a tensor of indices.
def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)  # example use
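The CBOW class above is left blank on purpose; filling it in is the exercise. For reference, below is one minimal sketch of a solution, not the official answer: it sums the four context embeddings and scores every vocabulary word with a single linear layer. The class name CBOWSolution and the reuse of EMBEDDING_DIM from the n-gram example are assumptions of this sketch.

class CBOWSolution(nn.Module):
    # One possible answer: sum context embeddings, then a linear + log_softmax.

    def __init__(self, vocab_size, embedding_dim):
        super(CBOWSolution, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs is a 1-D tensor of context word indices.
        embeds = self.embeddings(inputs).sum(dim=0, keepdim=True)
        out = self.linear(embeds)  # score each word as the possible center word
        return F.log_softmax(out, dim=1)


cbow = CBOWSolution(vocab_size, EMBEDDING_DIM)
log_probs = cbow(make_context_vector(data[0][0], word_to_ix))
print(log_probs.shape)  # torch.Size([1, vocab_size])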