Word2vec Tutorial
RADIM ŘEHŮŘEK · 2014-02-02 · GENSIM, PROGRAMMING
I never got round to writing a tutorial on how to use word2vec in gensim. It’s simple enough and the API docs are straightforward, but I know some people prefer more verbose formats. Let this post be a tutorial and a reference example.
UPDATE: the complete HTTP server code for the interactive word2vec demo below is now open sourced on Github. For a high-performance similarity server for documents, see ScaleText.com.
Preparing the Input
Starting from the beginning, gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):
# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
Keeping the input as a Python built-in list is convenient, but can use up a lot of RAM when the input is large.
Gensim only requires that the input provide sentences sequentially when iterated over. No need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…
For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file, line by line:
import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/some/directory')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
Say we want to further preprocess the words from the files — convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside the MySentences iterator and word2vec doesn’t need to know. All that is required is that the input yields one sentence (list of utf8 words) after another.
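For illustration, a variant of the iterator that lowercases tokens and drops purely numeric ones could look like this (a minimal sketch — the preprocessing rules here are arbitrary examples, not anything word2vec itself requires):

import os

class PreprocessedSentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                # lowercase each token and drop tokens that are purely numeric
                words = [w.lower() for w in line.split() if not w.isdigit()]
                if words:
                    yield words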
Note to advanced users: calling Word2Vec(sentences, iter=1) will run two passes over the sentences iterator (or, in general, iter+1 passes; default iter=5). The first pass collects words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These two (or, iter+1) passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass) and you’re able to initialize the vocabulary some other way:
model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
model.train(other_sentences)  # can be a non-repeatable, 1-pass generator
In case you’re confused about iterators, iterables and generators in Python, check out our tutorial on Data Streaming in Python.
Training
Word2vec accepts several parameters that affect both training speed and quality.
One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to train anything meaningful on those words, so it’s best to ignore them:
model = Word2Vec(sentences, min_count=10)  # default value is 5
A reasonable value for min_count is between 0 and 100, depending on the size of your dataset.
Another parameter is the size of the NN layers, which correspond to the “degrees” of freedom the training algorithm has:
model = Word2Vec(sentences, size=200)  # default value is 100
Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.
The last of the major parameters (full list here) is for training parallelization, to speed up training:
model = Word2Vec(sentences, workers=4)  # default = 1 worker = no parallelization
The workers parameter only has an effect if you have Cython installed. Without Cython, you’ll only be able to use one core, because of the GIL (and word2vec training will be miserably slow).
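One quick way to check whether the optimized Cython routines compiled successfully on your machine is the FAST_VERSION flag:

from gensim.models import word2vec
print(word2vec.FAST_VERSION)  # >= 0 means the fast Cython code is in use; -1 means plain NumPy (slow)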
Memory
At its core, word2vec model parameters are stored as matrices (NumPy arrays). Each array is #vocabulary (controlled by the min_count parameter) times #size (the size parameter) of floats (single precision, aka 4 bytes).
Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer size=200, the model will require approx. 100,000*200*4*3 bytes = ~229MB.
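The same back-of-the-envelope arithmetic in Python, using the numbers above:

vocab_size, size = 100000, 200
bytes_needed = vocab_size * size * 4 * 3   # 4 bytes per float32, 3 matrices
print('%.0f MB' % (bytes_needed / 1024.0 ** 2))   # ~229 MB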
There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
Evaluating
Word2vec training is an unsupervised task; there’s no good way to objectively evaluate the result. Evaluation depends on your end application.
Google has released a test set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task: https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/gensim/test/test_data/questions-words.txt.
Gensim supports the same evaluation set, in exactly the same format:
model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)
This accuracy method takes an optional restrict_vocab parameter, which limits which test examples are considered.
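For example, to only consider questions whose words are among the 30,000 most frequent words in the model (a sketch; the exact cutoff is up to you):

model.accuracy('/tmp/questions-words.txt', restrict_vocab=30000)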
Once again, good performance on this test set doesn’t mean word2vec will work well in your application, or vice versa. It’s always best to evaluate directly on your intended task.
Storing and loading models
You can store/load models using the standard gensim methods:
model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
which uses pickle internally, optionally mmap'ing the model’s internal large NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.
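For example, a saved model can be loaded back with its big arrays memory-mapped read-only, so that several processes can share them (a sketch, reusing the path from above):

model = gensim.models.Word2Vec.load('/tmp/mymodel', mmap='r')  # mmap the large NumPy arrays from disk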
In addition, you can load models created by the original C tool, both using its text and binary formats:
model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)
# using gzipped/bz2 input works too, no need to unzip:
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)
Online training / Resuming training
Advanced users can load a model and continue training it with more sentences:
model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)
You may need to tweak the total_words parameter to train(), depending on what learning rate decay you want to simulate.
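For example, if the extra corpus holds roughly two million words (a made-up figure, purely for illustration):

model.train(more_sentences, total_words=2000000)  # hint the expected corpus size for learning-rate decay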
Note that it’s not possible to resume training with models generated by the C tool and loaded via load_word2vec_format(). You can still use them for querying/similarity, but the information vital for training (the vocab tree) is missing there.
Using the model
Word2vec supports several word similarity tasks out of the box:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]

model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'

model.similarity('woman', 'man')
0.73723527
If you need the raw output vectors in your application, you can access these either on a word-by-word basis
model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)
…or en-masse as a 2D NumPy matrix from model.syn0.
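A short sketch of working with that matrix directly — each row is one word’s vector, in the order given by model.index2word:

vectors = model.syn0                         # 2D NumPy array, shape: (vocabulary size, size)
print(vectors.shape)
print(model.index2word[0], vectors[0][:5])   # first word in the vocabulary and the start of its vector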
Bonus app
As before with finding similar articles in the English Wikipedia with Latent Semantic Analysis, here’s a bonus web app for those who managed to read this far. It uses the word2vec model trained by Google on the Google News dataset, on about 100 billion words.
If you don’t get “queen” back, something went wrong and baby SkyNet cries.
Try more examples too: “he” is to “his” as “she” is to ?, “Berlin” is to “Germany” as “Paris” is to ? (click to fill in).
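Those demo queries correspond directly to most_similar calls against the model (assuming the words are present in its vocabulary):

model.most_similar(positive=['his', 'she'], negative=['he'], topn=1)             # "he" is to "his" as "she" is to ?
model.most_similar(positive=['Germany', 'Paris'], negative=['Berlin'], topn=1)   # "Berlin" is to "Germany" as "Paris" is to ?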
Try: U.S.A.; Monty_Python; PHP; Madiba (click to fill in).
Also try: “monkey ape baboon human chimp gorilla”; “blue red green crimson transparent” (click to fill in).
Which phrase doesn’t fit?
The model contains 3,000,000 unique phrases, built with a layer size of 300.
Note that the similarities were trained on a news dataset, and that Google did very little preprocessing there. So the phrases are case sensitive: watch out! Especially with proper nouns.
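A quick way to check which casing of a phrase the model actually knows is to test vocabulary membership directly (a sketch; model.vocab is the word-to-entry mapping):

'Paris' in model.vocab    # check the capitalized form
'paris' in model.vocab    # the lowercase form may be missing, or map to a different, rarer entry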
On a related note, I noticed about half the queries people entered into the LSA@Wiki demo contained typos/spelling errors, so they found nothing. Ouch.
To make it a little less challenging this time, I added phrase suggestions to the forms above. Start typing to see a list of valid phrases from the actual vocabulary of Google News’ word2vec model.
The “suggested” phrases are simply ten phrases starting from whatever bisect_left(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far) from Python’s built-in bisect module returns.
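A minimal sketch of that lookup, assuming the model’s vocabulary has been dumped into a sorted Python list beforehand:

from bisect import bisect_left

all_phrases = sorted(model.vocab.keys())   # alphabetically sorted vocabulary of the model

def suggest(prefix, k=10):
    # return the k phrases that sort immediately at or after the typed prefix
    start = bisect_left(all_phrases, prefix)
    return all_phrases[start:start + k]

suggest('Mont')   # e.g. phrases around "Mont...", such as "Monty_Python", if present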
See the complete HTTP server code for this “bonus app” on github (using CherryPy).
Outro
Full word2vec API docs here; get gensim here. Original C toolkit and word2vec papers by Google here.