Comparison of FastText and Word2Vec
Facebook Research open sourced a great project yesterday - fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious to compare these embeddings to other commonly used embeddings, and word2vec seemed like the obvious choice, especially since fastText embeddings are based on word2vec.
Download data
import nltk
nltk.download()
# Only the brown corpus is needed in case you don't have it.
# Alternately, you can simply download the pretrained models below if you wish to avoid downloading and training.

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training
!wget http://mattmahoney.net/dc/text8.zip
# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt
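questions-words.txt groups four-word analogy questions into sections, with section headers starting with ':'. If you want a quick, optional peek at the format -
# peek at the analogy file: section headers start with ':', every other line is a 4-word analogy
!head -n 5 questions-words.txt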
Train models
If you wish to avoid training, you can download pre-trained models instead in the next section. For training the fastText models yourself, you'll have to follow the setup instructions for fastText and run the training with -
!./fasttext skipgram -input brown_corp.txt -output brown_ft
!./fasttext skipgram -input text8.txt -output text8_ft
For training the gensim models -
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

MODELS_DIR = 'models/'

brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
Download models
In case you wish to avoid downloading the corpus and training the models, you can download pretrained models with -
# download the fastText and gensim models trained on the brown corpus and text8 corpus
!wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz
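The archive needs to be unpacked before running the comparison. A minimal sketch, assuming the tarball extracts into the models/ directory used below -
# extract the downloaded models (assumption: the archive contains the models/ directory)
!tar -xzvf models.tar.gz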
Once you have downloaded or trained the models (make sure they're in the models/ directory, or that you've appropriately changed MODELS_DIR) and downloaded questions-words.txt, you're ready to run the comparison.
Comparisons
from gensim.models import Word2Vec

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        total = total if total else 1
        accuracy = 100 * float(correct) / total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
    # The first 5 sections of questions-words.txt are semantic analogies; the remaining
    # sections (except the final 'total' entry) are syntactic.
    sem_correct = sum((len(acc[i]['correct']) for i in range(5)))
    sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100 * float(sem_correct) / sem_total))
    syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc) - 1)))
    syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5, len(acc) - 1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100 * float(syn_correct) / syn_total))

MODELS_DIR = 'models/'
word_analogies_file = 'questions-words.txt'

print('\nLoading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('\nLoading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
27/182, 14.84%, Section: family
539/702, 76.78%, Section: gram1-adjective-to-adverb
106/132, 80.30%, Section: gram2-opposite
656/1056, 62.12%, Section: gram3-comparative
136/210, 64.76%, Section: gram4-superlative
439/650, 67.54%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
165/1260, 13.10%, Section: gram7-past-tense
327/552, 59.24%, Section: gram8-plural
245/342, 71.64%, Section: gram9-plural-verbs
2640/5086, 51.91%, Section: total

Semantic: 27/182, Accuracy: 14.84%
Syntactic: 2613/4904, Accuracy: 53.28%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
53/182, 29.12%, Section: family
8/702, 1.14%, Section: gram1-adjective-to-adverb
0/132, 0.00%, Section: gram2-opposite
75/1056, 7.10%, Section: gram3-comparative
0/210, 0.00%, Section: gram4-superlative
16/650, 2.46%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
30/1260, 2.38%, Section: gram7-past-tense
4/552, 0.72%, Section: gram8-plural
8/342, 2.34%, Section: gram9-plural-verbs
194/5086, 3.81%, Section: total

Semantic: 53/182, Accuracy: 29.12%
Syntactic: 141/4904, Accuracy: 2.88%
Word2vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. This makes sense, since fastText embeddings are trained to capture morphological nuances, and most of the syntactic analogies are morphology-based.
Let me explain that better.
According to the paper [1], embeddings for words are represented by the sum of their n-gram embeddings. This is meant to be useful for morphologically rich languages - so theoretically, the embedding for apparently would include information from both the character n-grams apparent and ly (as well as other n-grams), and the n-grams would combine in a simple, linear manner. This is very similar to what most of our syntactic tasks look like.
Example analogy:
amazing amazingly calm calmly
This analogy is marked correct if:
embedding(amazing) - embedding(amazingly) = embedding(calm) - embedding(calmly)
Both these subtractions would result in a very similar set of remaining n-grams. No surprise the fastText embeddings do extremely well on this.
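To make this concrete, here is a toy sketch (my own illustration, not fastText's actual implementation, which hashes n-grams into a fixed number of buckets) that extracts character n-grams the way the paper describes, with '<' and '>' added as word boundary markers -
def char_ngrams(word, nmin=3, nmax=6):
    # Collect all character n-grams of length nmin..nmax, with boundary markers added.
    w = '<' + word + '>'
    return {w[i:i + n] for n in range(nmin, nmax + 1) for i in range(len(w) - n + 1)}

# The n-grams unique to each inflected form cluster around the '-ly' suffix,
# so both subtractions above remove very similar sets of n-gram vectors.
print(sorted(char_ngrams('amazingly') - char_ngrams('amazing')))
print(sorted(char_ngrams('calmly') - char_ngrams('calm')))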
A brief note on hyperparameters - the Gensim word2vec implementation and the fastText word embedding implementation use largely the same defaults (dim_size = 100, window_size = 5, num_epochs = 5). Of course, they are two completely different models (albeit with a few similarities).
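For reference, this is roughly how those defaults would be pinned down explicitly on the gensim side (a sketch reusing the old-style keyword arguments from the training cells above, illustrative rather than a reproduction of the exact training run) -
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

# Make the shared defaults explicit: 100-dimensional vectors, context window of 5, 5 epochs.
# sg=1 selects skip-gram to mirror the `fasttext skipgram` command; gensim defaults to CBOW.
text8_explicit = Word2Vec(Text8Corpus('text8'), size=100, window=5, iter=5, sg=1)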
Let's try with a larger corpus now - text8 (a collection of Wikipedia articles). I'm especially curious about the impact on semantic accuracy - for the models trained on the brown corpus, the difference in semantic accuracy and the accuracy values themselves are too small to be conclusive. Hopefully a larger corpus helps; the text8 corpus likely has a lot more information about capitals, currencies, cities, etc., which should be relevant to the semantic tasks.
print('Loading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('Loading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

298/506, 58.89%, Section: capital-common-countries
625/1452, 43.04%, Section: capital-world
37/268, 13.81%, Section: currency
291/1511, 19.26%, Section: city-in-state
151/306, 49.35%, Section: family
567/756, 75.00%, Section: gram1-adjective-to-adverb
188/306, 61.44%, Section: gram2-opposite
809/1260, 64.21%, Section: gram3-comparative
303/506, 59.88%, Section: gram4-superlative
528/992, 53.23%, Section: gram5-present-participle
1291/1371, 94.16%, Section: gram6-nationality-adjective
451/1332, 33.86%, Section: gram7-past-tense
853/992, 85.99%, Section: gram8-plural
360/650, 55.38%, Section: gram9-plural-verbs
6752/12208, 55.31%, Section: total

Semantic: 1402/4043, Accuracy: 34.68%
Syntactic: 5350/8165, Accuracy: 65.52%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

138/506, 27.27%, Section: capital-common-countries
248/1452, 17.08%, Section: capital-world
28/268, 10.45%, Section: currency
158/1571, 10.06%, Section: city-in-state
227/306, 74.18%, Section: family
85/756, 11.24%, Section: gram1-adjective-to-adverb
54/306, 17.65%, Section: gram2-opposite
739/1260, 58.65%, Section: gram3-comparative
178/506, 35.18%, Section: gram4-superlative
297/992, 29.94%, Section: gram5-present-participle
718/1371, 52.37%, Section: gram6-nationality-adjective
325/1332, 24.40%, Section: gram7-past-tense
389/992, 39.21%, Section: gram8-plural
200/650, 30.77%, Section: gram9-plural-verbs
3784/12268, 30.84%, Section: total

Semantic: 799/4103, Accuracy: 19.47%
Syntactic: 2985/8165, Accuracy: 36.56%
With the text8 corpus, the semantic accuracy of the fastText model increases significantly, and it surpasses word2vec on both the semantic and the syntactic analogies. However, the increase in syntactic accuracy from the larger corpus is much higher for word2vec.
These preliminary results seem to indicate fastText embeddings might be better than word2vec at encoding semantic and especially syntactic information. It'd be interesting to see how transferable these embeddings are by comparing their performance in a downstream supervised task.
References
[1] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov. Enriching Word Vectors with Subword Information (2016).