The goal of this exercise was to explore the most recent advances in natural language processing and apply them to a real-world problem:

Given two different collections of articles, each represented by its headlines, which collection should a new, unseen article be added to?

Secondarily, given two to-do lists, which list should a new to-do item be added to?

The solution applied here builds a similarity measure that picks the top-n most similar articles based on semantic similarity. It leverages BERT, a state-of-the-art Transformer-based model.
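To make the idea concrete, here is a minimal sketch of such a similarity measure, assuming every headline has already been turned into a fixed-length vector. The names `cosine_similarity` and `top_n_similar` and the toy 3-dimensional vectors are illustrative only; they are not taken from BERT or from bert-as-service.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_similar(query_vec, doc_vecs, n=3):
    """Return (index, score) pairs for the n documents most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:n]


# Toy vectors standing in for sentence embeddings (made up for illustration).
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
ranking = top_n_similar([1.0, 0.0, 0.0], docs, n=2)
print(ranking)
```

The collection that contributes most of the top-ranked headlines is then the natural home for the new article.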

Step 0: Understand what BERT is

See the previous blog post on NLP, section “Transformers”.

Step 1: Data science environment setup on a Surface Book

I opted for the Surface Book laptop with a reasonable GPU. Let’s start with setting up the environment with Python, and a great editor (Visual Studio Code).

Understand Visual Studio Code and Python

Introduction to Visual Studio Code and Python

Downloads

Visual Studio Code download
Python installer for Python 3.8.1. Make sure that the environment variables are set
Leverage the Anaconda distribution
Create a directory and start Visual Studio Code (cd dir; code .)
Select environment: CTRL+SHIFT+P, “Python: Select Interpreter”
Create a terminal window: CTRL+SHIFT+P, “Terminal: Create New Integrated Terminal”
Install TensorFlow GPU with Conda (Conda comes preinstalled with Anaconda): open a new command window and run conda install tensorflow-gpu

Conda and Pip coexistence: https://www.anaconda.com/using-pip-in-a-conda-environment/

Create a new environment with Conda so that you control the versions of packages such as TensorFlow (bert-as-service does not work with TensorFlow 2.0)

conda create -n environname
conda activate environname
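Because bert-as-service does not work with TensorFlow 2.0, it can help to check the installed version right after activating the environment. The sketch below uses a hypothetical helper, `is_tf1`, which is not part of any library. It only inspects a version string, and treating every 1.x release as acceptable is an assumption that should be checked against the bert-as-service README.

```python
def is_tf1(version: str) -> bool:
    """Return True if a version string such as '1.15.0' belongs to the 1.x series."""
    major = version.split(".")[0]
    return major == "1"


# In the activated environment you would pass in the real version:
#   import tensorflow as tf
#   print(is_tf1(tf.__version__))
print(is_tf1("1.15.0"))
print(is_tf1("2.0.0"))
```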

Make sure the GPU works for TensorFlow

# verify GPU availability
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Step 2: Choose tutorial to get started

These are the tutorials I considered. I ultimately chose bert-as-service because it allowed the most straightforward experiments.

Tutorial 1: Bert Explained
Tutorial 2: Intent classification
Tutorial 3: Huggingface Transformers
Tutorial 4: BERT word embedding tutorial
Tutorial 6: BERT as service (our choice)

Step 3: Set up Bert-as-a-service

https://github.com/hanxiao/bert-as-service

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`

Download the BERT model from Google (link: Download location of original model) and extract it to

 C:\Users\gerhas\code\python\uncased_L-12_H-768_A-12

Start the BERT server component

bert-serving-start -model_dir c:\users\gerhas\code\python\uncased_L-12_H-768_A-12 -num_worker=4
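Once the server reports that it is ready, a quick smoke test from a second terminal confirms that sentences come back as vectors. The sketch below assumes the server runs on the same machine with its default ports. `check_embeddings` and `smoke_test_server` are hypothetical helpers, and the expected width of 768 is inferred from the H-768 in the model name. The part that talks to the server is wrapped in a function that is not called automatically, because `BertClient()` waits until a server responds.

```python
def check_embeddings(vectors, n_sentences, dim=768):
    """Verify that an encoder returned one vector of width `dim` per sentence."""
    rows = len(vectors)
    widths = {len(row) for row in vectors}
    return rows == n_sentences and widths == {dim}


def smoke_test_server():
    """Encode two sentences via a running bert-as-service server (not called here)."""
    from bert_serving.client import BertClient  # needs bert-serving-client installed

    sentences = ["Examining BERT's raw embeddings", "Become What You Are"]
    bc = BertClient()  # blocks until the server answers
    vectors = bc.encode(sentences)
    return check_embeddings(vectors, len(sentences))
```

Call `smoke_test_server()` while the server started above is still running; it should return True if both sentences were encoded.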

Step 4: Create sample dataset

We create two collections and add headlines to each of them.

I chose two distinct sets of headlines: one with articles about machine learning, and one with general self-improvement articles, sourced from Medium.com and other sites.

Sample collection #1: AI-related topics

2020 NLP wish lists, HuggingFace + fastai, NeurIPS 2019, GPT-2 things, Machine Learning Interviews Revue

Google Brain’s AI achieves state-of-the-art text summarization performance

Image Segmentation: Kaggle experience

Article: TensorFlow 2 Tutorial: Get Started in Deep Learning With tf.keras

Examining BERT’s raw embeddings

Eight Surprising Predictions for AI in 2020.

Recent Advancements in NLP (1/2)

Sample collection #2: Self-improvement and general interest books

7 Reasons Why Smart, Hardworking People Don’t Become Successful by Melissa Chu

The Secret To Being Loved The Way You Dream Of

‘The World Beyond Your Head’ by Matthew B. Crawford

Become What You Are by Alan Watts

‘The Sense of Style’ by Steven Pinker

‘The Storm Before the Storm’ by Mike Duncan

The 1 Thing I Did That Changed My Entire Life For The Better

The Five Qualities You Need in a Partner
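The loader in Step 5 expects these headlines in a text file where every line starts with a fixed marker, followed by a 4-character collection label, one separator character, and the headline. That layout is inferred from the slicing in the Step 5 code. The labels `[AI]` and `[GI]` below are made up for illustration, and `format_line` and `parse_line` are hypothetical helpers.

```python
PREFIX = "##### **Q:** "  # marker the loader in Step 5 looks for


def format_line(category, headline):
    """Build one line: prefix, 4-character category label, a space, the headline."""
    if len(category) != 4:
        raise ValueError("category label must be exactly 4 characters")
    return f"{PREFIX}{category} {headline}"


def parse_line(line):
    """Split a formatted line back into (category, headline), as Step 5 does."""
    body = line.replace(PREFIX, "").strip()
    return body[:4], body[5:]


# "[AI]" and "[GI]" are made-up labels; any 4-character tags would work.
lines = [
    format_line("[AI]", "Examining BERT's raw embeddings"),
    format_line("[GI]", "Become What You Are by Alan Watts"),
]
print("\n".join(lines))
```

Writing the printed lines to collections.txt gives a file that the Step 5 code can read.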

Step 5: Writing the code

The code is a simple adaptation of the sample code in the bert-as-service documentation.

from bert_serving.client import BertClient
import numpy as np

# Every usable line in the file starts with this marker.
prefix_q = '##### **Q:** '
with open('C:\\Users\\gerhas\\code\\BERT-experiments\\collections.txt') as fp:
    questions_raw = [v.replace(prefix_q, '').strip() for v in fp if v.strip() and v.startswith(prefix_q)]
    print('%d titles of collections loaded, avg. len of %d' % (len(questions_raw), np.mean([len(d.split()) for d in questions_raw])))

# The first 4 characters are the collection label; the title starts at index 5.
questions_cat = [v[:4] for v in questions_raw]
questions = [v[5:] for v in questions_raw]

topk = len(questions)  # rank every known title
bc = BertClient()  # connects to the server started in Step 3
doc_vecs = bc.encode(questions)

while True:
    query = input('Enter title of new article (empty line to quit): ')
    if not query.strip():
        break
    query_vec = bc.encode([query])[0]
    # Dot product normalized by the length of each document vector. The query
    # norm is the same for every document, so omitting it keeps the ranking.
    score = np.sum(query_vec * doc_vecs, axis=1) / np.linalg.norm(doc_vecs, axis=1)
    topk_idx = np.argsort(score)[::-1][:topk]
    # Use a separate rank counter so that topk is not overwritten between queries.
    for rank, idx in enumerate(topk_idx, start=1):
        print('> %s\t%s\t%s\t%s' % (rank, score[idx], questions_cat[idx], questions[idx]))
print("end")

Step 6: Running the model

Now to the exciting part: let’s enter a new title, and see a ranked list of most to least similar articles in the base dataset. We can use this ranking to determine whether the new article should be added to collection #1 (AI articles), or collection #2 (General Interest).
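In the experiments below, the decision is read off the ranked list by eye. One possible way to automate it is a majority vote over the top-k entries. This is a sketch under that assumption rather than what was run for this article; `choose_collection`, the labels and the example ranking are all made up for illustration.

```python
from collections import Counter


def choose_collection(ranked_categories, k=5):
    """Pick the category that occurs most often among the k top-ranked entries."""
    votes = Counter(ranked_categories[:k])
    # most_common(1) returns [(category, count)] for the leading category.
    return votes.most_common(1)[0][0]


# Category labels in ranked order, best match first (made-up example output).
ranked = ["[AI]", "[AI]", "[GI]", "[AI]", "[GI]", "[GI]", "[GI]"]
print(choose_collection(ranked, k=5))
```

With two collections, an odd k avoids ties. A small k trusts only the closest matches, while a k equal to the list length just returns the larger collection.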

Experiment 1: New article named “neural networks”

We have a new article called “neural networks”. Which collection should it belong to?

The results are stunning: entering “neural networks” leads to all AI-related articles being ranked higher than all of the general interest ones.

[Image: ranked list for the query “neural networks”]

Note that none of the existing headlines contains the words “neural networks” literally, so a traditional term-matching approach such as TF-IDF would have found nothing in common, all the more so with such a small text corpus.
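Any score built on shared terms, TF-IDF included, is zero when the query and a headline have no word in common. The sketch below is a simplification: it only checks word overlap and does not compute TF-IDF weights. `tokens` and `shared_terms` are hypothetical helpers, and the three headlines are taken from collection #1.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shared_terms(query, headline):
    """Return the words that the query and the headline have in common."""
    return tokens(query) & tokens(headline)


headlines = [
    "Image Segmentation: Kaggle experience",
    "Examining BERT's raw embeddings",
    "Recent Advancements in NLP (1/2)",
]
overlaps = [shared_terms("neural networks", h) for h in headlines]
print(overlaps)
```

Every overlap is empty, so a term-based ranking has nothing to work with, whereas the embedding-based score still separates the two collections.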

Experiment 2: New article named “How to grow your career and prosper”

In this experiment we chose a general-purpose title that clearly had nothing to do with AI.

The results are again very compelling: the top 4 articles by similarity to the new title are all in the general interest category.

[Image: ranked list for the query “How to grow your career and prosper”]

Experiment 3: To-do lists

I created two to-do lists:

#1: A shopping list with the content “Buy milk, meat, groceries”

#2: A travel list with the content “Book airplane, reserve seats, check in and remember frequent flyer miles”

For the new items I tried, the system consistently chose the right list. For example, for “flour” it ranks the shopping list higher, and for “rental car” it chooses the travel list.

[Image: ranked lists for the to-do items]

Bottom line: BERT appears to be a very promising approach!

Smaller BERT models

Updates

https://www.python.org/downloads/release/python-376/

https://automaticaddison.com/how-to-install-tensorflow-2-on-windows-10/

conda create -n mypython3 python=3
conda activate mypython3
python -m pip install bert-serving-server

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
virtualenv --system-site-packages -p python ./venv
.\venv\Scripts\activate

Appendix

Sentence embedding

“Quick Semantic Search using Siamese-BERT Networks” by Aneesha Bakharia https://link.medium.com/kX6QF9oqG3

Deploying Python service to Azure

Django, Flask

https://azure.microsoft.com/en-us/develop/python/

https://www.youtube.com/watch?time_continue=837&v=0Bk0dw2Ktbg&feature=emb_logo

https://cloudblogs.microsoft.com/opensource/2020/01/21/microsoft-onnx-open-source-optimizations-transformer-inference-gpu-cpu/

https://github.com/onnx/tutorials/blob/master/tutorials/Inference-TensorFlow-Bert-Model-for-High-Performance-in-ONNX-Runtime.ipynb

How do I assign an article to one of two collections? Using BERT and state-of-the-art NLP