Natural language processing - introduction and state-of-the-art
This post provides a summary of introductory articles I found useful to better understand what's possible in NLP, specifically what the current state of the art is and which areas are worth prioritizing for future exploration.
NLP - Basics
- Overview / ImageNet moment in NLP
Embeddings
The first step is often turning input words into vectors using embeddings (see e.g. here for an explanation of word embeddings). These vectors capture some semantics (king - man + woman ≈ queen) and typically have somewhere in the range of 200-300 dimensions.
Non-contextual embeddings
Embedding explanation with a Word2Vec focus on Medium. A key constraint of Word2Vec is that it assigns a single vector per word, so it can't capture multiple interpretations of the same word ("stick"); see the sketch after the links below.
- Word2Vec - generate your own - Google project
- GloVe.
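A minimal sketch of non-contextual embeddings and the classic analogy, using gensim's downloader API. The choice of "glove-wiki-gigaword-100" (100-dimensional GloVe vectors) is just one convenient pre-trained set, not something prescribed by the articles above.

```python
# Minimal sketch, assuming gensim is installed; the pre-trained vector set
# "glove-wiki-gigaword-100" is one common choice available via gensim's downloader.
import gensim.downloader as api

# Load 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
vectors = api.load("glove-wiki-gigaword-100")

# The classic analogy: king - man + woman ≈ queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Each word maps to exactly one fixed vector, regardless of context.
print(vectors["stick"].shape)  # (100,)
```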
Contextualized embeddings
Contextualized word embeddings factor in the multiple interpretations a word can have. ELMo uses a bi-directional LSTM to create embeddings based on the context in which a word appears; a small illustration follows the link below.
- Video about using sentence embedding for fact checking
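To make the idea of context-dependent vectors concrete: ELMo itself is usually loaded via AllenNLP or TensorFlow Hub, so this sketch illustrates the same idea with a BERT model through the Hugging Face transformers library instead ("bert-base-uncased" is just one commonly used checkpoint).

```python
# Rough sketch of contextualized embeddings: the same surface form "stick"
# gets a different vector depending on the sentence it appears in.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    """Return the contextual vector of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = embed_word("let's stick to the plan", "stick")
v2 = embed_word("he threw the stick for the dog", "stick")
# Similar but not identical: the embedding depends on the surrounding context.
print(torch.cosine_similarity(v1, v2, dim=0))
```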
Attention
Overview and original paper "Attention Is All You Need". The basic premise is that enabling the neural network to pay attention to a specific subset of hidden states leads to better results. See also "Seq2seq pay Attention to Self Attention: Part 1".
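A minimal sketch of the scaled dot-product attention described in the paper: attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The shapes and toy data here are my own illustration, not code from the paper.

```python
# Scaled dot-product attention in plain NumPy.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # how much each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of the values

# Toy self-attention: 3 positions, 4-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))                       # self-attention: all from the same input
print(scaled_dot_product_attention(Q, K, V).shape)        # (3, 4)
```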
Transformers
- Basics - Transformer: a great high-level overview. Originally introduced in "Attention Is All You Need".
Fully annotated code to the paper from Harvard NLP group is here.
And TensorFlow's tensor2tensor library; interactive tensor2tensor on Google Colab.
Hugging Face Transformer quick tutorial
Transformer introduction from Google; introductory article for the Hugging Face NLP library (a minimal pipeline sketch follows below).
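A minimal sketch of the Hugging Face transformers pipeline API mentioned above; the default sentiment model it downloads is chosen by the library, not specified here.

```python
# Quick taste of the Hugging Face pipeline API.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make transfer learning in NLP remarkably easy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```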
OpenAI transformer
Language modelling using 12 stacked decoder layers, applied to language tasks such as classification, similarity, and multiple-choice questions - source. Note: this transformer is unidirectional (in that respect a step back from bi-directional LSTM-based embeddings such as ELMo).
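A small sketch of what decoder-only (unidirectional) language modelling looks like in practice. GPT-2 is used here purely as an accessible stand-in for the original OpenAI transformer; the pipeline call and parameters are my own illustration.

```python
# Decoder-only language model generating a continuation left-to-right.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_length=30, num_return_sequences=1))
```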
BERT
- BERT original paper
- Illustrated BERT, great explanation
Basic premise: semi-supervised training on large amounts of text to build the model. Uses an encoder stack (as opposed to the OpenAI transformer, which uses a decoder stack).
Key concepts applied: semi-supervised sequence learning, Transformers, and the ideas BERT evolves from (ELMo, ULMFiT).
BERT is pre-trained with two tasks: (1) predict missing words and (2) figure out whether sentence B follows sentence A.
Predict missing words: masked language modelling (roughly 15% of the input tokens are masked out with a [MASK] token and the model has to predict them).
Two-sentence task: predict whether sentence B follows sentence A. Other notes: BERT doesn't operate on whole words but on WordPieces, which are smaller sub-word chunks (see the sketch below).
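A minimal sketch of both points, using the Hugging Face transformers library; "bert-base-uncased" is just one commonly used checkpoint and the example sentences are my own.

```python
# WordPiece tokenization and the masked-language-modelling objective.
from transformers import AutoTokenizer, pipeline

# WordPiece splits rarer words into smaller sub-word chunks.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))   # ['em', '##bed', '##ding', '##s']

# Fill-mask: BERT predicts the token hidden behind [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```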
Using BERT
Once BERT is pre-trained, it can be combined with supervised training on a specific labeled data set.
Simplest model: train a binary classifier with (1) BERT -> (2) feed-forward neural network + softmax. The BERT model is modified only slightly during this training phase (details in the great article here).
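A minimal sketch of that "BERT encoder + small feed-forward head" setup. The use of the [CLS] position and the layer sizes follow the common fine-tuning recipe; this is my own illustration, not the exact code from the article linked above.

```python
# BERT encoder with a small classification head on the [CLS] position.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBinaryClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)          # pre-trained encoder stack
        self.head = nn.Linear(self.bert.config.hidden_size, 2)     # small task-specific head

    def forward(self, **inputs):
        outputs = self.bert(**inputs)
        cls_vector = outputs.last_hidden_state[:, 0]               # hidden state at [CLS]
        return torch.softmax(self.head(cls_vector), dim=-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertBinaryClassifier()
batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
print(model(**batch).shape)   # (2, 2): one probability pair per sentence
```

During fine-tuning only the small head is new; the BERT weights are updated gently, which matches the "modified only slightly" point above.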
- Interactive BERT in Google Colab with specific language tasks
- BERT and Binary classification
- BERT and Semantic Similarity in Sentences on Medium
- Sentence embeddings with BERT; other sentence embeddings with the Universal Encoder Light Google Colab sheet (a minimal sketch follows this list)
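A minimal sketch of sentence embeddings using the sentence-transformers library; the model name "all-MiniLM-L6-v2" is my own assumption, any pre-trained sentence-embedding checkpoint would do.

```python
# Sentence embeddings: one fixed-size vector per sentence.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sits on the mat.",
    "A feline is resting on a rug.",
    "Stock markets fell sharply today.",
]
emb = model.encode(sentences)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically similar sentences end up close together in embedding space.
print(cosine(emb[0], emb[1]))   # relatively high
print(cosine(emb[0], emb[2]))   # relatively low
```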
Extensions of BERT
- VideoBERT: A joint model for video and language representation learning [Sep 2019]. Uses a pre-trained video ConvNet (TBD) to extract features - in their example S3D (TBD), which adds separable temporal convolutions to an Inception network backbone (TBD). Further reading:
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
Implementation
- Getting a big model into production
Self improvement / learning
- Learning course on GANs from Google
Data-sets
- Stanford sentiment data set (a loading sketch follows below)
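A minimal sketch of loading the Stanford Sentiment Treebank (binary version) through the Hugging Face datasets library; the ("glue", "sst2") identifier is one common way to access it, not the only one.

```python
# Load SST-2 (binary Stanford sentiment) via the GLUE benchmark config.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(sst2)               # train / validation / test splits
print(sst2["train"][0])   # {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```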