The spaCy pretrain command

spaCy 2.1 released an interesting command, spacy pretrain. It loads pre-trained vectors (https://spacy.io/models/) and trains a CNN model to predict each word's pre-trained vector instead of the word itself. The authors termed this technique Language Modelling with Approximate Outputs (LMAO).
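To make the objective concrete, here is a toy numpy sketch of the LMAO idea. All sizes and names here are invented, and the "model" is reduced to a trainable table for simplicity; spaCy's real model is a CNN over context windows, and spacy pretrain also offers a cosine variant of the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all sizes invented): 50 words with frozen 20-d "pre-trained"
# vectors standing in for en_vectors_web_lg, and a trainable table W that
# plays the role of the model's predictions for each word.
vocab_size, vec_dim = 50, 20
pretrained = rng.normal(size=(vocab_size, vec_dim))    # frozen targets
W = rng.normal(scale=0.1, size=(vocab_size, vec_dim))  # trainable predictions

def lmao_loss(word_ids):
    # The LMAO objective: L2 distance between the model's predicted vector
    # and the word's static pre-trained vector, instead of a softmax over
    # the whole vocabulary.
    diff = W[word_ids] - pretrained[word_ids]
    return float((diff ** 2).sum(axis=1).mean())

batch = rng.integers(0, vocab_size, size=32)  # token ids in a "sentence"
lr = 0.1
first = lmao_loss(batch)
for _ in range(200):
    grad = 2 * (W[batch] - pretrained[batch]) / len(batch)
    np.add.at(W, batch, -lr * grad)  # SGD step; the targets stay frozen
last = lmao_loss(batch)  # shrinks as predictions approach the targets
```

The key point is that the output layer regresses onto a fixed, dense vector per word rather than classifying over the whole vocabulary, which is what makes the objective cheap to train.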

According to the creator of spaCy, this approach is especially useful when you have limited training data for text classification and parsing tasks. He used the pretrained weights to train a text classifier on 1,000 samples and reported a high F1-score of 87% on a test set of 5,000 samples.

To get familiar with this technique, I ran a similar experiment on a new dataset of 120,000 sentences with 4 classes. All the code and data are available at https://github.com/tienduccao/spacy-pretrain-polyaxon, under the lmao-imdb-1k folder.

First, you need to convert your unlabeled text into .jsonl format using python generate_jsonl.py. You also need to download the spaCy pre-trained vector file (spacy download en_vectors_web_lg). Then run spacy pretrain to obtain the pre-trained weights.
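For reference, spacy pretrain reads newline-delimited JSON with one object per line, each carrying a "text" key (or pre-tokenized input under a "tokens" key). A minimal sketch of what generate_jsonl.py presumably produces, with invented sample sentences:

```python
import json

# Invented sample sentences; generate_jsonl.py would read your own corpus.
texts = [
    "This movie was a complete waste of time.",
    "An instant classic, beautifully shot.",
]

# One JSON object per line, each with a "text" key -- the raw-text variant
# of the input format spacy pretrain expects.
with open("corpus.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")
```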

spacy pretrain corpus.jsonl en_vectors_web_lg weights

The details of this command can be found at https://spacy.io/api/cli#pretrain. This task took 24.5 hours for 832 iterations on a GeForce GTX 1080. Some pre-trained weights are stored here.

Without using the pre-trained weights (python pretrain_textcat.py 96 2000), I obtained an F1-score of 73.4%.

The next step is training a text classifier using pre-trained weights.

python pretrain_textcat.py -t2v weights/model800.bin 96 2000

With the pre-trained weights, the F1-score increases to 85.4%.

spaCy's text classifier is a CNN model and it is sensitive to hyperparameters. It is worth trying different parameters of pretrain_textcat.py to achieve optimal performance. We could also try different pre-trained vector files (word2vec, GloVe, fastText) to see whether they lead to better results.
