Comparison of FastText and Word2Vec

 

Facebook Research open-sourced a great project yesterday - fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially since fastText embeddings are based on word2vec.

 

Download data

In [ ]:
import nltk
# Only the brown corpus is needed in case you don't have it.
# Alternately, you can simply download the pretrained models below if you wish to avoid downloading and training.
nltk.download('brown')

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
In [ ]:
# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training
!wget http://mattmahoney.net/dc/text8.zip
# unzip to obtain the raw 'text8' file used for training below
!unzip text8.zip
In [ ]:
# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt
 

Train models

 

If you wish to avoid training, you can download pre-trained models instead in the next section. For training the fastText models yourself, you'll have to follow the setup instructions for fastText and run the training with -

In [ ]:
!./fasttext skipgram -input brown_corp.txt -output brown_ft
!./fasttext skipgram -input text8 -output text8_ft
 

For training the gensim models -

In [ ]:
import logging
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

MODELS_DIR = 'models/'

brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
 

Download models

In case you wish to avoid downloading the corpora and training the models yourself, you can download the pretrained models with -

In [ ]:
# download the fastText and gensim models trained on the brown corpus and text8 corpus
!wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz
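
The download is a gzipped tarball. Assuming it unpacks into the models/ directory expected below (an assumption about the archive layout - adjust the path if your extraction differs), you can extract it with -

In [ ]:
# extract the pretrained .vec files (assumed to unpack into models/)
!tar -xzvf models.tar.gz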
 

Once you have downloaded or trained the models (make sure they're in the models/ directory, or that you've appropriately changed MODELS_DIR) and downloaded questions-words.txt, you're ready to run the comparison.

 

Comparisons

In [1]:
from gensim.models import Word2Vec

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        total = total if total else 1
        accuracy = 100*float(correct)/total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
    # the first 5 sections of questions-words.txt are semantic analogies;
    # the remaining sections (except the final 'total' section) are syntactic
    sem_correct = sum(len(acc[i]['correct']) for i in range(5))
    sem_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100*float(sem_correct)/sem_total))
    syn_correct = sum(len(acc[i]['correct']) for i in range(5, len(acc)-1))
    syn_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5, len(acc)-1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100*float(syn_correct)/syn_total))

MODELS_DIR = 'models/'
word_analogies_file = 'questions-words.txt'

print('\nLoading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('\nLoading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
 
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
27/182, 14.84%, Section: family
539/702, 76.78%, Section: gram1-adjective-to-adverb
106/132, 80.30%, Section: gram2-opposite
656/1056, 62.12%, Section: gram3-comparative
136/210, 64.76%, Section: gram4-superlative
439/650, 67.54%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
165/1260, 13.10%, Section: gram7-past-tense
327/552, 59.24%, Section: gram8-plural
245/342, 71.64%, Section: gram9-plural-verbs
2640/5086, 51.91%, Section: total

Semantic: 27/182, Accuracy: 14.84%
Syntactic: 2613/4904, Accuracy: 53.28%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
53/182, 29.12%, Section: family
8/702, 1.14%, Section: gram1-adjective-to-adverb
0/132, 0.00%, Section: gram2-opposite
75/1056, 7.10%, Section: gram3-comparative
0/210, 0.00%, Section: gram4-superlative
16/650, 2.46%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
30/1260, 2.38%, Section: gram7-past-tense
4/552, 0.72%, Section: gram8-plural
8/342, 2.34%, Section: gram9-plural-verbs
194/5086, 3.81%, Section: total

Semantic: 53/182, Accuracy: 29.12%
Syntactic: 141/4904, Accuracy: 2.88%
 

Word2vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. This makes sense, since fastText embeddings are trained to capture morphological nuances, and most of the syntactic analogies are morphology-based.

Let me explain that better.

According to the paper [1], embeddings for words are represented by the sum of their character n-gram embeddings. This is meant to be useful for morphologically rich languages - so, theoretically, the embedding for 'apparently' would include information from the character n-grams 'apparent' and 'ly' (as well as other n-grams), with the n-grams combining in a simple, linear manner. This is very similar to what most of our syntactic tasks look like.

Example analogy:

amazing amazingly calm calmly

This analogy is marked correct if:

embedding(amazing) - embedding(amazingly) = embedding(calm) - embedding(calmly)

Both these subtractions would result in a very similar set of remaining n-grams, so it's no surprise that the fastText embeddings do extremely well on this.
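
To make the n-gram picture concrete, here is a minimal sketch (not fastText's actual implementation; the char_ngrams helper and the n-gram sizes are just illustrative) of how a word is broken into the character n-grams whose vectors fastText sums, using the < and > boundary symbols described in the paper -

In [ ]:
def char_ngrams(word, min_n=3, max_n=6):
    # wrap the word in boundary symbols, as in the fastText paper, so that
    # prefixes and suffixes (e.g. '<app', 'tly>') are distinguishable
    wrapped = '<' + word + '>'
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    return ngrams

# 'amazing' and 'amazingly' share most of their n-grams; the n-grams left over
# after removing the shared ones mostly encode the '-ly' suffix
print(char_ngrams('amazingly', 3, 4) - char_ngrams('amazing', 3, 4))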

A brief note on hyperparameters - the Gensim word2vec implementation and the fastText word embedding implementation use largely the same defaults (dim_size = 100, window_size = 5, num_epochs = 5). Of course, they are two completely different models (albeit with a few similarities).
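
For reference, here's roughly how those shared defaults look when spelled out explicitly in gensim - a sketch only, using the parameter names from the gensim version used here (later releases renamed size to vector_size and iter to epochs), with the corresponding fastText flags noted in comments -

In [ ]:
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

model = Word2Vec(
    Text8Corpus('text8'),
    size=100,   # embedding dimensionality (fastText: -dim 100)
    window=5,   # context window size (fastText: -ws 5)
    iter=5,     # number of training epochs (fastText: -epoch 5)
)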

Let's try with a larger corpus now - text8 (a collection of wiki articles). I'm especially curious about the impact on semantic accuracy - for models trained on the brown corpus, both the difference in semantic accuracy and the accuracy values themselves are too small to be conclusive. Hopefully a larger corpus helps, and the text8 corpus likely has a lot more information about capitals, currencies, cities, etc., which should be relevant to the semantic tasks.
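
As a concrete illustration of what the semantic sections test: an analogy line like 'Athens Greece Baghdad Iraq' from the capital-common-countries section is counted as correct if the word closest to vector('Greece') - vector('Athens') + vector('Baghdad') is 'Iraq'. Once the text8 models in the next cell are loaded, a single analogy can be inspected directly with most_similar - a quick sketch, assuming all four words (lowercased, since text8 is lowercased) are in the vocabulary -

In [ ]:
# athens : greece :: baghdad : ?
# if the model answers this analogy correctly, the top result will be 'iraq'
gs_model.most_similar(positive=['greece', 'baghdad'], negative=['athens'], topn=1)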

In [2]:
print('Loading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('Loading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
 
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

298/506, 58.89%, Section: capital-common-countries
625/1452, 43.04%, Section: capital-world
37/268, 13.81%, Section: currency
291/1511, 19.26%, Section: city-in-state
151/306, 49.35%, Section: family
567/756, 75.00%, Section: gram1-adjective-to-adverb
188/306, 61.44%, Section: gram2-opposite
809/1260, 64.21%, Section: gram3-comparative
303/506, 59.88%, Section: gram4-superlative
528/992, 53.23%, Section: gram5-present-participle
1291/1371, 94.16%, Section: gram6-nationality-adjective
451/1332, 33.86%, Section: gram7-past-tense
853/992, 85.99%, Section: gram8-plural
360/650, 55.38%, Section: gram9-plural-verbs
6752/12208, 55.31%, Section: total

Semantic: 1402/4043, Accuracy: 34.68%
Syntactic: 5350/8165, Accuracy: 65.52%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

138/506, 27.27%, Section: capital-common-countries
248/1452, 17.08%, Section: capital-world
28/268, 10.45%, Section: currency
158/1571, 10.06%, Section: city-in-state
227/306, 74.18%, Section: family
85/756, 11.24%, Section: gram1-adjective-to-adverb
54/306, 17.65%, Section: gram2-opposite
739/1260, 58.65%, Section: gram3-comparative
178/506, 35.18%, Section: gram4-superlative
297/992, 29.94%, Section: gram5-present-participle
718/1371, 52.37%, Section: gram6-nationality-adjective
325/1332, 24.40%, Section: gram7-past-tense
389/992, 39.21%, Section: gram8-plural
200/650, 30.77%, Section: gram9-plural-verbs
3784/12268, 30.84%, Section: total

Semantic: 799/4103, Accuracy: 19.47%
Syntactic: 2985/8165, Accuracy: 36.56%
 

With the text8 corpus, the semantic accuracy for the fastText model increases significantly, and it surpasses word2vec on both the semantic and syntactic analogies. However, the increase in syntactic accuracy from the larger corpus is much higher for word2vec.

These preliminary results seem to indicate fastText embeddings might be better than word2vec at encoding semantic and especially syntactic information. It'd be interesting to see how transferable these embeddings are by comparing their performance in a downstream supervised task.

 

References

[1] Enriching Word Vectors with Subword Information - Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov (2016). https://arxiv.org/abs/1607.04606
