
Natural language processing - Introduction

Gerald Haslhofer · Nov 29, 2019 · 3 mins read

Natural language processing - introduction and state-of-the-art

This post summarizes introductory articles I found useful for understanding what’s possible in NLP, specifically what the current state of the art is and which areas to prioritize for future exploration.

NLP - Basics

Embeddings

The first step is often turning input words into vectors using embeddings (see e.g. here for an explanation of word embeddings). These vectors capture some semantics (king - man + woman = queen) and are typically in the 200-300 dimension range.
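As a quick illustration of that vector arithmetic, here is a minimal sketch using gensim and a pre-trained Word2Vec model (the model file name below is a placeholder for whatever embedding file you have on hand):

```python
# Minimal sketch: word-vector arithmetic with a pre-trained Word2Vec model (gensim).
from gensim.models import KeyedVectors

# Placeholder path - substitute your own pre-trained embedding file.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected: 'queen' at or near the top

print(vectors["king"].shape)  # (300,) - a 300-dimensional vector
```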

Non-contextual embeddings

Embedding explanation with a Word2Vec focus on Medium. A key constraint of Word2Vec is that it doesn’t capture multiple interpretations of a single word (“stick” gets one vector whether it means a branch or the verb).
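To make that limitation concrete, here is a hedged sketch (gensim 4-style API, made-up toy corpus): no matter which sense of “stick” appears in a sentence, Word2Vec stores exactly one vector for it.

```python
# Sketch: Word2Vec assigns one static vector per word, regardless of context.
from gensim.models import Word2Vec

# Toy corpus (made up) mixing two senses of "stick".
sentences = [
    ["he", "threw", "the", "stick", "for", "the", "dog"],
    ["the", "stamps", "stick", "to", "the", "envelope"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# Only one entry for "stick" exists in the vocabulary - both senses collapse into it.
print(model.wv["stick"].shape)            # (50,)
print("stick" in model.wv.key_to_index)   # True - a single, context-free vector
```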

Contextualized embeddings

Contextualized word embeddings factor in multiple interpretations of words. ELMo uses a bi-directional LSTM to create embeddings based on the context in which words appear.
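Below is a conceptual PyTorch sketch of that idea, not the actual ELMo architecture or weights: running a bi-directional LSTM over token embeddings gives the same word a different vector depending on its neighbours.

```python
# Conceptual sketch of contextual embeddings (not the real ELMo model).
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "he": 1, "threw": 2, "the": 3, "stick": 4,
         "stamps": 5, "to": 6, "envelope": 7}
embed = nn.Embedding(len(vocab), 32)
# Bi-directional LSTM: each token's output depends on words to its left AND right.
bilstm = nn.LSTM(input_size=32, hidden_size=32, bidirectional=True, batch_first=True)

def contextual_vectors(tokens):
    ids = torch.tensor([[vocab[t] for t in tokens]])
    out, _ = bilstm(embed(ids))   # shape: (1, seq_len, 64)
    return out[0]

v1 = contextual_vectors(["he", "threw", "the", "stick"])[-1]                      # "stick" as noun
v2 = contextual_vectors(["the", "stamps", "stick", "to", "the", "envelope"])[2]   # "stick" as verb
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1.0: different vectors for the same word
```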

Attention

Overview and original paper, “Attention Is All You Need”. The basic premise is that enabling the neural network to pay attention to a specific subset of hidden states leads to better results. See also “Seq2seq pay Attention to Self Attention: Part 1”.
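As a refresher on the core operation from the paper, here is a minimal NumPy sketch of scaled dot-product attention (toy shapes, no multi-head machinery):

```python
# Minimal sketch of scaled dot-product attention (as in "Attention Is All You Need").
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # how much each query attends to each key
    weights = softmax(scores, axis=-1)    # each row is a distribution over positions
    return weights @ V

# Toy example: 4 positions, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```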

Transformers

OpenAI transformer

Language modelling using 12 stacked decoder layers, applied to language tasks such as classification, similarity, and multiple-choice questions (source). Note: this transformer is unidirectional, so in that respect a step back from LSTM-based embeddings such as ELMo, which read context in both directions.
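To illustrate what “unidirectional” means here, this is a small NumPy sketch (toy sizes) of the causal mask a decoder-style transformer applies, so that each position can only attend to itself and earlier positions:

```python
# Sketch: causal (look-left-only) attention mask used by decoder-style transformers.
import numpy as np

seq_len = 5
# Position i may attend to positions <= i only; future positions are masked out.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
scores[mask] = -np.inf   # masked positions get zero weight after softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # upper triangle is all zeros: no attention to future tokens
```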

BERT

Basic premise: semi-supervised training on large amounts of text to build the model. Uses an encoder stack (whereas the OpenAI transformer uses a decoder stack).

Key concepts applied: semi-supervised sequence learning; Transformers; ELMo (evolution of); ULMFiT (evolution of).

BERT is pre-trained with two tasks: (1) predict missing words, and (2) figure out whether sentence B follows sentence A.

Predict missing words: masked language modelling (mask 15% of the input tokens with the [MASK] token and have the model predict them).
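A toy sketch of that masking step in plain Python (the example sentence is made up; real BERT additionally replaces some selected tokens with random words or leaves them unchanged, which is skipped here):

```python
# Toy sketch of masked-language-model input preparation (simplified: every
# selected token becomes [MASK]).
import random

tokens = "the quick brown fox jumps over the lazy dog".split()
random.seed(0)

num_to_mask = max(1, int(round(0.15 * len(tokens))))   # ~15% of positions
masked_positions = random.sample(range(len(tokens)), num_to_mask)

inputs = [("[MASK]" if i in masked_positions else t) for i, t in enumerate(tokens)]
labels = {i: tokens[i] for i in masked_positions}       # the model must predict these

print(inputs)
print(labels)
```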

Two-sentence task: given sentences A and B, predict whether B actually follows A. Other notes: BERT doesn’t operate on whole words but on WordPieces, which are smaller sub-word chunks.
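To see WordPieces and the sentence-pair packing in practice, here is a short sketch using the Hugging Face transformers library (assumes the bert-base-uncased checkpoint can be downloaded):

```python
# Sketch: WordPiece tokenization with the Hugging Face transformers library.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Rare or long words are split into smaller, more frequent pieces ("##" marks continuations).
print(tokenizer.tokenize("embeddings"))
# e.g. ['em', '##bed', '##ding', '##s']

# Sentence pairs are packed as [CLS] A [SEP] B [SEP] for the two-sentence task.
encoded = tokenizer("He went to the store.", "He bought a gallon of milk.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```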

Using BERT

Once BERT is pre-trained, it can be fine-tuned with supervised training on a specific labeled data set.

Simplest model: train a binary classifier with (1) BERT -> (2) a feed-forward neural network + softmax. The BERT model is modified only slightly during the training phase (details in the great article here).
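A hedged sketch of that setup using PyTorch and Hugging Face transformers (the feed-forward head and the two-class setup are illustrative placeholders): take the final hidden state of the [CLS] token from pre-trained BERT and feed it through a linear layer plus softmax.

```python
# Sketch: binary classifier on top of pre-trained BERT (PyTorch + Hugging Face).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)   # feed-forward head -> 2 classes

inputs = tokenizer("this movie was great", return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state        # (batch, seq_len, hidden)
cls_vector = hidden[:, 0, :]                         # the [CLS] token summarises the sentence
probs = torch.softmax(classifier(cls_vector), dim=-1)
print(probs)                                         # untrained head: roughly uniform

# During fine-tuning, both the head and (slightly) the BERT weights are updated
# with a standard cross-entropy loss on the labeled data set.
```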

Extensions of BERT

  • VideoBERT: A joint model for video and language representation learning [Sep 2019]. Uses a pre-trained video ConvNet (TBD) to extract features - in their example S3D (TBD), which adds separable temporal convolutions to an Inception network backbone (TBD). Further reading:

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.

Implementation

Self improvement / learning

  • Google’s learning course on GANs

Data-sets