File Exchange


Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding

Pre-trained English Word Embedding Model for Machine Learning and Deep Learning with Text


Updated 11 Sep 2019

This Add-on provides a pre-trained word embedding and sentence classification model using fastText for use in machine learning and deep learning algorithms. fastText is an open-source library that provides efficient and scalable tools for text analytics. For more information on the pre-trained word vector model, see the fastText project documentation.

Opening the fasttext.mlpkginstall file from your operating system or from within MATLAB will initiate the installation process for the release you have.
This mlpkginstall file is functional for R2018a and beyond.
Usage Example:
% Load the trained model
emb = fastTextWordEmbedding;

% Find the top 10 closest words to "impedance" according to this word embedding
impedanceVec = word2vec(emb,"impedance");
vec2word(emb, impedanceVec,10)

ans =

10×1 string array


Comments and Ratings (5)

I think that something is wrong. Look at the most common example in many scientific papers ("king" - "man" + "woman" -> "queen"):

manVec = word2vec(emb,"man");
womanVec = word2vec(emb,"woman");
kingVec = word2vec(emb,"king");

answer = kingVec - manVec + womanVec;
res1 = vec2word(emb, answer,5)
(vecnorm((word2vec(emb,res1) - answer)'))'

The five nearest words, with their distances from answer:

It is a surprise for me that "king" is first and "queen" is only second. I think this is a problem. What do you suggest? Are you sure that a vector length of 300 is enough?
Or am I doing something incorrectly? Thank you.

P.S. I tested different forms of the words "man", "Man", "MAN", and also the average 0.5 * (word2vec("man") + word2vec("Man")), but the first result is never "queen".
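One subtlety worth noting: published analogy results usually exclude the query words themselves from the candidate answers, which is why "king" (the nearest vector to its own offset) tends to come back first. A minimal sketch of that filtering, using the same functions as the comment above:

```matlab
% Load the pretrained embedding
emb = fastTextWordEmbedding;

% king - man + woman
answer = word2vec(emb,"king") - word2vec(emb,"man") + word2vec(emb,"woman");

% Retrieve extra neighbors, then drop the query words themselves
candidates = vec2word(emb,answer,8);
candidates = candidates(~ismember(candidates,["king" "man" "woman"]));

% Top five candidates after filtering
res1 = candidates(1:5)
```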

To add words to the embedding vocabulary, follow the steps below to create a new embedding object after reading it in:
>> emb = fastTextWordEmbedding;
>> vocab = emb.Vocabulary;
>> mat = word2vec(emb,vocab);
>> newvocab = [vocab "sample1" "sample2"];
>> newmat = [mat ; randn(2,300)];
>> newemb = wordEmbedding(newvocab,newmat);
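An alternative sketch of the same idea: persist the extended vocabulary and matrix in a plain word-per-line text format and reload it with readWordEmbedding. The file name "extended.vec" is an assumption, and looping over the full fastText vocabulary (roughly a million words) is slow, so this is illustrative rather than a recommended production path:

```matlab
% Load the pretrained embedding and extend it with two made-up words
emb = fastTextWordEmbedding;
vocab = emb.Vocabulary;
mat = word2vec(emb,vocab);
newvocab = [vocab "sample1" "sample2"];
newmat = [mat ; randn(2,300)];

% Write "word v1 v2 ... v300" per line (hypothetical file name)
fid = fopen("extended.vec","w");
for i = 1:numel(newvocab)
    fprintf(fid,"%s",newvocab(i));
    fprintf(fid," %g",newmat(i,:));
    fprintf(fid,"\n");
end
fclose(fid);

% Read the extended embedding back in
newemb = readWordEmbedding("extended.vec");
```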

Is it possible to add additional words to the pretrained vocabulary? If so, how is this done?

MATLAB Release Compatibility
Created with R2018a
Compatible with R2018a to R2019b
Platform Compatibility
Windows macOS Linux