Comparison of FastText and Word2Vec

 

Facebook Research open-sourced a great project yesterday - fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially since fastText embeddings are based on word2vec.

 

Download data

In [ ]:
import nltk
# Only the brown corpus is needed in case you don't have it.
# Alternately, you can simply download the pretrained models below if you wish to avoid downloading and training.
nltk.download('brown')

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
In [ ]:
# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training
!wget http://mattmahoney.net/dc/text8.zip
# unzip to obtain the raw 'text8' file used for training below
!unzip text8.zip
In [ ]:
# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt
 

Train models

 

If you wish to avoid training, you can download pre-trained models instead in the next section. For training the fastText models yourself, you'll have to follow the setup instructions for fastText and run the training with -

In [ ]:
!./fasttext skipgram -input brown_corp.txt -output brown_ft
!./fasttext skipgram -input text8 -output text8_ft
 

For training the gensim models -

In [ ]:
import logging
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

MODELS_DIR = 'models/'

brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
 

Download models

In case you wish to avoid downloading the corpora and training the models yourself, you can download the pretrained models with -

In [ ]:
# download the fastText and gensim models trained on the brown corpus and text8 corpus
!wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz
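
The download is a gzipped tarball. Assuming it unpacks into the models/ directory expected below (an assumption about the archive layout - adjust the path if your extraction differs), you can extract it with -

In [ ]:
# extract the pretrained .vec files (assumed to unpack into models/)
!tar -xzvf models.tar.gz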
 

Once you have downloaded or trained the models (make sure they're in the models/ directory, or that you've appropriately changed MODELS_DIR) and downloaded questions-words.txt, you're ready to run the comparison.

 

Comparisons

In [1]:
from gensim.models import Word2Vec

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        total = total if total else 1
        accuracy = 100*float(correct)/total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
    # the first 5 sections of questions-words.txt are semantic analogies;
    # the remaining sections (except the final 'total' section) are syntactic
    sem_correct = sum(len(acc[i]['correct']) for i in range(5))
    sem_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100*float(sem_correct)/sem_total))
    syn_correct = sum(len(acc[i]['correct']) for i in range(5, len(acc)-1))
    syn_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5, len(acc)-1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100*float(syn_correct)/syn_total))

MODELS_DIR = 'models/'
word_analogies_file = 'questions-words.txt'

print('\nLoading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('\nLoading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
 
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
27/182, 14.84%, Section: family
539/702, 76.78%, Section: gram1-adjective-to-adverb
106/132, 80.30%, Section: gram2-opposite
656/1056, 62.12%, Section: gram3-comparative
136/210, 64.76%, Section: gram4-superlative
439/650, 67.54%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
165/1260, 13.10%, Section: gram7-past-tense
327/552, 59.24%, Section: gram8-plural
245/342, 71.64%, Section: gram9-plural-verbs
2640/5086, 51.91%, Section: total

Semantic: 27/182, Accuracy: 14.84%
Syntactic: 2613/4904, Accuracy: 53.28%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
53/182, 29.12%, Section: family
8/702, 1.14%, Section: gram1-adjective-to-adverb
0/132, 0.00%, Section: gram2-opposite
75/1056, 7.10%, Section: gram3-comparative
0/210, 0.00%, Section: gram4-superlative
16/650, 2.46%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
30/1260, 2.38%, Section: gram7-past-tense
4/552, 0.72%, Section: gram8-plural
8/342, 2.34%, Section: gram9-plural-verbs
194/5086, 3.81%, Section: total

Semantic: 53/182, Accuracy: 29.12%
Syntactic: 141/4904, Accuracy: 2.88%
 

Word2vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. This makes sense, since fastText embeddings are trained to capture morphological nuances, and most of the syntactic analogies are morphology-based.

Let me explain that better.

According to the paper [1], embeddings for words are represented by the sum of their character n-gram embeddings. This is meant to be useful for morphologically rich languages - so, theoretically, the embedding for 'apparently' would include information from the character n-grams 'apparent' and 'ly' (as well as other n-grams), with the n-grams combining in a simple, linear manner. This is very similar to what most of our syntactic tasks look like.

Example analogy:

amazing amazingly calm calmly

This analogy is marked correct if:

embedding(amazing) - embedding(amazingly) = embedding(calm) - embedding(calmly)

Both these subtractions would result in a very similar set of remaining n-grams, so it's no surprise that the fastText embeddings do extremely well on this.
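
To make the n-gram picture concrete, here is a minimal sketch (not fastText's actual implementation; the char_ngrams helper and the n-gram sizes are just illustrative) of how a word is broken into the character n-grams whose vectors fastText sums, using the < and > boundary symbols described in the paper -

In [ ]:
def char_ngrams(word, min_n=3, max_n=6):
    # wrap the word in boundary symbols, as in the fastText paper, so that
    # prefixes and suffixes (e.g. '<app', 'tly>') are distinguishable
    wrapped = '<' + word + '>'
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    return ngrams

# 'amazing' and 'amazingly' share most of their n-grams; the n-grams left over
# after removing the shared ones mostly encode the '-ly' suffix
print(char_ngrams('amazingly', 3, 4) - char_ngrams('amazing', 3, 4))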

A brief note on hyperparameters - the Gensim word2vec implementation and the fastText word embedding implementation use largely the same defaults (dim_size = 100, window_size = 5, num_epochs = 5). Of course, they are two completely different models (albeit with a few similarities).
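
For reference, here's roughly how those shared defaults look when spelled out explicitly in gensim - a sketch only, using the parameter names from the gensim version used here (later releases renamed size to vector_size and iter to epochs), with the corresponding fastText flags noted in comments -

In [ ]:
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

model = Word2Vec(
    Text8Corpus('text8'),
    size=100,   # embedding dimensionality (fastText: -dim 100)
    window=5,   # context window size (fastText: -ws 5)
    iter=5,     # number of training epochs (fastText: -epoch 5)
)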

Let's try with a larger corpus now - text8 (a collection of wiki articles). I'm especially curious about the impact on semantic accuracy - for models trained on the brown corpus, both the difference in semantic accuracy and the accuracy values themselves are too small to be conclusive. Hopefully a larger corpus helps, and the text8 corpus likely has a lot more information about capitals, currencies, cities, etc., which should be relevant to the semantic tasks.
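
As a concrete illustration of what the semantic sections test: an analogy line like 'Athens Greece Baghdad Iraq' from the capital-common-countries section is counted as correct if the word closest to vector('Greece') - vector('Athens') + vector('Baghdad') is 'Iraq'. Once the text8 models in the next cell are loaded, a single analogy can be inspected directly with most_similar - a quick sketch, assuming all four words (lowercased, since text8 is lowercased) are in the vocabulary -

In [ ]:
# athens : greece :: baghdad : ?
# if the model answers this analogy correctly, the top result will be 'iraq'
gs_model.most_similar(positive=['greece', 'baghdad'], negative=['athens'], topn=1)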

In [2]:
print('Loading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('Loading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
 
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

298/506, 58.89%, Section: capital-common-countries
625/1452, 43.04%, Section: capital-world
37/268, 13.81%, Section: currency
291/1511, 19.26%, Section: city-in-state
151/306, 49.35%, Section: family
567/756, 75.00%, Section: gram1-adjective-to-adverb
188/306, 61.44%, Section: gram2-opposite
809/1260, 64.21%, Section: gram3-comparative
303/506, 59.88%, Section: gram4-superlative
528/992, 53.23%, Section: gram5-present-participle
1291/1371, 94.16%, Section: gram6-nationality-adjective
451/1332, 33.86%, Section: gram7-past-tense
853/992, 85.99%, Section: gram8-plural
360/650, 55.38%, Section: gram9-plural-verbs
6752/12208, 55.31%, Section: total

Semantic: 1402/4043, Accuracy: 34.68%
Syntactic: 5350/8165, Accuracy: 65.52%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

138/506, 27.27%, Section: capital-common-countries
248/1452, 17.08%, Section: capital-world
28/268, 10.45%, Section: currency
158/1571, 10.06%, Section: city-in-state
227/306, 74.18%, Section: family
85/756, 11.24%, Section: gram1-adjective-to-adverb
54/306, 17.65%, Section: gram2-opposite
739/1260, 58.65%, Section: gram3-comparative
178/506, 35.18%, Section: gram4-superlative
297/992, 29.94%, Section: gram5-present-participle
718/1371, 52.37%, Section: gram6-nationality-adjective
325/1332, 24.40%, Section: gram7-past-tense
389/992, 39.21%, Section: gram8-plural
200/650, 30.77%, Section: gram9-plural-verbs
3784/12268, 30.84%, Section: total

Semantic: 799/4103, Accuracy: 19.47%
Syntactic: 2985/8165, Accuracy: 36.56%
 

With the text8 corpus, the semantic accuracy for the fastText model increases significantly, and it surpasses word2vec on both the semantic and syntactic analogies. However, the increase in syntactic accuracy from the larger corpus is much higher for word2vec.

These preliminary results seem to indicate fastText embeddings might be better than word2vec at encoding semantic and especially syntactic information. It'd be interesting to see how transferable these embeddings are by comparing their performance in a downstream supervised task.

 

References

[1] Enriching Word Vectors with Subword Information - Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov (2016). https://arxiv.org/abs/1607.04606
