Word Embeddings: Encoding Lexical Semantics
- Getting Dense Word Embeddings
- Word Embeddings in PyTorch
- An Example: N-Gram Language Modeling
- Exercise: Computing Word Embeddings: Continuous Bag-of-Words
Word Embeddings in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5) # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)
Out:
tensor([[ 0.6614, 0.2669, 0.0617, 0.6213, -0.4519]],
grad_fn=<EmbeddingBackward>)
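As a small follow-up sketch (not part of the original snippet, reusing the `embeds` and `word_to_ix` defined above), the same embedding table can look up several indices in one call and returns one row per index:

# Assumed follow-up example: look up both vocabulary words at once.
# The result has shape (2, 5), one 5-dimensional row per index.
lookup_both = torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long)
print(embeds(lookup_both).shape)  # torch.Size([2, 5])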
An Example: N-Gram Language Modeling
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples. Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
vocab = set(test_sentence)  # the elements of a set are distinct
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
        # turn the two context words into a tensor of integer indices
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        # gradients accumulate in PyTorch, so zero them before each new instance
        model.zero_grad()
        log_probs = model(context_idxs)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # the total loss should decrease with every epoch
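After training, the learned vector for any word is simply the corresponding row of the embedding layer's weight matrix. A minimal check, added here as a sketch that assumes the `model` and `word_to_ix` built above:

# Sketch: read off the learned 10-dimensional embedding for "beauty"
# directly from the embedding layer's weight matrix.
print(model.embeddings.weight[word_to_ix["beauty"]])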
Exercise: Computing Word Embeddings: Continuous Bag-of-Words
CONTEXT_SIZE = 2
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)
word_to_ix = {word: i for i, word in enumerate(vocab)}

data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self):
        pass

    def forward(self, inputs):
        pass


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)
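The `CBOW` class above is deliberately left as a stub for the exercise. As one possible completion, here is a minimal sketch (my own solution, not the tutorial's reference answer; `EMBEDDING_DIM` is an assumed hyperparameter): sum the context word embeddings, project the sum to log-probabilities over the vocabulary, and train exactly as in the n-gram example.

EMBEDDING_DIM = 10  # assumed for this sketch


class CBOWSketch(nn.Module):
    # One possible CBOW completion (a sketch, not the official solution).

    def __init__(self, vocab_size, embedding_dim):
        super(CBOWSketch, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: tensor of context word indices, shape (2 * CONTEXT_SIZE,)
        embeds = self.embeddings(inputs).sum(dim=0, keepdim=True)  # (1, embedding_dim)
        return F.log_softmax(self.linear(embeds), dim=1)


loss_function = nn.NLLLoss()
model = CBOWSketch(vocab_size, EMBEDDING_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        model.zero_grad()
        log_probs = model(context_vector)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(total_loss)  # the loss should fall over the epochs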