Natural language processing - introduction and state-of-the-art
This post provides a summary of introductory articles I found useful to better understand what's possible in NLP, specifically what the current state of the art is and which areas are worth prioritizing for future exploration.
NLP - Basics
- Overview / ImageNet moment in NLP
Embeddings
The first step is often turning input words into vectors using embeddings (see e.g. here for an explanation of word embeddings). These vectors capture some semantics (king - man + woman ≈ queen) and typically have somewhere in the range of 200-300 dimensions.
Non-contextual embeddings
Embedding explanation with a Word2Vec focus on Medium. A key constraint of Word2Vec is that it assigns a single vector per word, so it can't capture multiple interpretations of the same word ("stick"); see the sketch after the links below.
- Word2Vec - generate your own - Google project
- GloVe.
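A minimal sketch of non-contextual embeddings and the classic analogy, using gensim's downloader API. The choice of "glove-wiki-gigaword-100" (100-dimensional GloVe vectors) is just one convenient pre-trained set, not something prescribed by the articles above.

```python
# Minimal sketch, assuming gensim is installed; the pre-trained vector set
# "glove-wiki-gigaword-100" is one common choice available via gensim's downloader.
import gensim.downloader as api

# Load 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
vectors = api.load("glove-wiki-gigaword-100")

# The classic analogy: king - man + woman ≈ queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Each word maps to exactly one fixed vector, regardless of context.
print(vectors["stick"].shape)  # (100,)
```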
Contextualized embeddings
Contextualized word embeddings factor in the multiple interpretations a word can have. ELMo uses a bi-directional LSTM to create embeddings based on the context in which a word appears; a small illustration follows the link below.
- Video about using sentence embedding for fact checking
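To make the idea of context-dependent vectors concrete: ELMo itself is usually loaded via AllenNLP or TensorFlow Hub, so this sketch illustrates the same idea with a BERT model through the Hugging Face transformers library instead ("bert-base-uncased" is just one commonly used checkpoint).

```python
# Rough sketch of contextualized embeddings: the same surface form "stick"
# gets a different vector depending on the sentence it appears in.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    """Return the contextual vector of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = embed_word("let's stick to the plan", "stick")
v2 = embed_word("he threw the stick for the dog", "stick")
# Similar but not identical: the embedding depends on the surrounding context.
print(torch.cosine_similarity(v1, v2, dim=0))
```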
Attention
Overview and original paper "Attention Is All You Need". The basic premise is that enabling the neural network to pay attention to a specific subset of hidden states leads to better results. See also "Seq2seq pay Attention to Self Attention: Part 1".
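A minimal sketch of the scaled dot-product attention described in the paper: attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The shapes and toy data here are my own illustration, not code from the paper.

```python
# Scaled dot-product attention in plain NumPy.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # how much each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of the values

# Toy self-attention: 3 positions, 4-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))                       # self-attention: all from the same input
print(scaled_dot_product_attention(Q, K, V).shape)        # (3, 4)
```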
Transformers
- Basics - Transformer: a great high-level overview. Originally introduced in "Attention Is All You Need".
Fully annotated code to the paper from Harvard NLP group is here.
And TensorFlow's tensor2tensor library; interactive tensor2tensor on Google Colab.
Hugging Face Transformer quick tutorial
Transformer introduction from Google; introductory article for the Hugging Face NLP library (a minimal pipeline sketch follows below).
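A minimal sketch of the Hugging Face transformers pipeline API mentioned above; the default sentiment model it downloads is chosen by the library, not specified here.

```python
# Quick taste of the Hugging Face pipeline API.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make transfer learning in NLP remarkably easy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```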
OpenAI transformer
Language modelling using 12 stacked decoder layers, applied to language tasks such as classification, similarity, and multiple-choice questions - source. Note: this transformer is unidirectional (in that respect a step back from bi-directional LSTM-based embeddings such as ELMo).
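A small sketch of what decoder-only (unidirectional) language modelling looks like in practice. GPT-2 is used here purely as an accessible stand-in for the original OpenAI transformer; the pipeline call and parameters are my own illustration.

```python
# Decoder-only language model generating a continuation left-to-right.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_length=30, num_return_sequences=1))
```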
BERT
- BERT original paper
- Illustrated BERT, great explanation
Basic premise: semi-supervised training on large amounts of text to build the model. Uses an encoder stack (as opposed to the OpenAI transformer, which uses a decoder stack).
Key concepts applied: semi-supervised sequence learning, Transformers, and the ideas BERT evolves from (ELMo, ULMFiT).
BERT is pre-trained with two tasks: (1) predict missing words and (2) figure out whether sentence B follows sentence A.
Predict missing words: masked language modelling (roughly 15% of the input tokens are masked out with a [MASK] token and the model has to predict them).
Two-sentence task: predict whether sentence B follows sentence A. Other notes: BERT doesn't operate on whole words but on WordPieces, which are smaller sub-word chunks (see the sketch below).
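A minimal sketch of both points, using the Hugging Face transformers library; "bert-base-uncased" is just one commonly used checkpoint and the example sentences are my own.

```python
# WordPiece tokenization and the masked-language-modelling objective.
from transformers import AutoTokenizer, pipeline

# WordPiece splits rarer words into smaller sub-word chunks.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))   # ['em', '##bed', '##ding', '##s']

# Fill-mask: BERT predicts the token hidden behind [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```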
Using BERT
Once BERT is pre-trained, it can be combined with supervised training on a specific labeled data set.
Simplest model: train a binary classifier with (1) BERT -> (2) feed-forward neural network + softmax. The BERT model is modified only slightly during this training phase (details in the great article here).
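A minimal sketch of that "BERT encoder + small feed-forward head" setup. The use of the [CLS] position and the layer sizes follow the common fine-tuning recipe; this is my own illustration, not the exact code from the article linked above.

```python
# BERT encoder with a small classification head on the [CLS] position.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBinaryClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)          # pre-trained encoder stack
        self.head = nn.Linear(self.bert.config.hidden_size, 2)     # small task-specific head

    def forward(self, **inputs):
        outputs = self.bert(**inputs)
        cls_vector = outputs.last_hidden_state[:, 0]               # hidden state at [CLS]
        return torch.softmax(self.head(cls_vector), dim=-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertBinaryClassifier()
batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
print(model(**batch).shape)   # (2, 2): one probability pair per sentence
```

During fine-tuning only the small head is new; the BERT weights are updated gently, which matches the "modified only slightly" point above.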
- Interactive BERT in Google Colab with specific language tasks
- BERT and Binary classification
- BERT and Semantic Similarity in Sentences on Medium
- Sentence embeddings with BERT; other sentence embeddings with the Universal Encoder Light Google Colab sheet (a minimal sketch follows this list)
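A minimal sketch of sentence embeddings using the sentence-transformers library; the model name "all-MiniLM-L6-v2" is my own assumption, any pre-trained sentence-embedding checkpoint would do.

```python
# Sentence embeddings: one fixed-size vector per sentence.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sits on the mat.",
    "A feline is resting on a rug.",
    "Stock markets fell sharply today.",
]
emb = model.encode(sentences)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically similar sentences end up close together in embedding space.
print(cosine(emb[0], emb[1]))   # relatively high
print(cosine(emb[0], emb[2]))   # relatively low
```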
Extensions of BERT
- VideoBERT: A joint model for video and language representation learning [Sep 2019]. Uses a pre-trained video ConvNet (TBD) to extract features - in their example S3D (TBD), which adds separable temporal convolutions to an Inception network backbone (TBD). Further reading:
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
Implementation
- Getting a big model into production
Self improvement / learning
- Learning course on GANs from Google
Data-sets
- Stanford sentiment data set (a loading sketch follows below)
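A minimal sketch of loading the Stanford Sentiment Treebank (binary version) through the Hugging Face datasets library; the ("glue", "sst2") identifier is one common way to access it, not the only one.

```python
# Load SST-2 (binary Stanford sentiment) via the GLUE benchmark config.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(sst2)               # train / validation / test splits
print(sst2["train"][0])   # {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```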