Semantic Analysis

Semantic analysis draws meaning from natural language text. It considers signs and symbols (semiotics) and collocations (words that often go together).

Techniques

  • Classification Models: Topic Classification, Sentiment Analysis, Intent Classification
  • Extraction Models: Keyword Extraction, Entity Extraction

Word Vectors

Word vectors - also called word embeddings - are mathematical representations of individual words, built so that words that frequently appear together in similar contexts have similar values. In this way we can mathematically derive context: the word vector for “lion”, for example, will be closer in value to “cat” than to “dandelion”.

In en_core_web_md, there are 20k unique vectors (300 dimensions).
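
A quick way to see this (a minimal sketch, assuming spaCy and the en_core_web_md model are installed) is to load the model and inspect a token's vector directly. The nlp object loaded here is reused in the examples that follow.

import spacy
nlp = spacy.load('en_core_web_md')   # medium English model that ships with 300-dimensional vectors

lion = nlp(u'lion')[0]
print(lion.vector.shape)                      # (300,)
print(lion.similarity(nlp(u'cat')[0]))        # expected to be relatively high
print(lion.similarity(nlp(u'dandelion')[0]))  # expected to be noticeably lower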

What’s interesting is that Doc and Span objects themselves have vectors, derived from the average of their individual token vectors.
This makes it possible to compare similarities between whole documents.
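
For example (a small sketch; the sentences are arbitrary), whole Doc objects and slices of them (Span objects) can be compared directly:

doc1 = nlp(u'I like salty fries and hamburgers.')
doc2 = nlp(u'Fast food tastes very good.')

print(doc1.similarity(doc2))        # document-level similarity from the averaged token vectors
print(doc1[2:5].similarity(doc2))   # a Span supports .similarity() in the same way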

Identifying similar vectors

The easiest way to expose vector relationships is through the .similarity() method on individual tokens.

# Create a three-token Doc object:
tokens = nlp(u'dog cat monkey')

# Iterate through token combinations:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

#dog dog 1.0
#dog cat 0.80168545
#dog monkey 0.47752646
#cat dog 0.80168545
#cat cat 1.0
#cat monkey 0.5351813
#monkey dog 0.47752646
#monkey cat 0.5351813
#monkey monkey 1.0

Note that order doesn’t matter. token1.similarity(token2) has the same value as token2.similarity(token1).

Opposites are not necessarily different

Words that have opposite meanings but often appear in the same context may have similar vectors.

# Create a four-token Doc object:
tokens = nlp(u'like love dislike hate')

# Iterate through token combinations:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

#like like 1.0
#like love 0.65790397
#like dislike 0.5194136
#like hate 0.6574652
#love like 0.65790397
#love love 1.0
#love dislike 0.49095604
#love hate 0.6393099
#dislike like 0.5194136
#dislike love 0.49095604
#dislike dislike 1.0
#dislike hate 0.77664256
#hate like 0.6574652
#hate love 0.6393099
#hate dislike 0.77664256
#hate hate 1.0

Vector norms

It’s sometimes helpful to aggregate the 300 dimensions into a single Euclidean (L2) norm, computed as the square root of the sum of the squared vector components. This is accessible as the .vector_norm token attribute. Other helpful attributes include .has_vector and .is_oov (out of vocabulary).

tokens = nlp(u'dog cat nowaythere')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

#dog True 7.0336733 False
#cat True 6.6808186 False
#nowaythere False 0.0 True
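
As a sanity check (a minimal sketch, assuming numpy is available), .vector_norm should match the L2 norm computed by hand:

import numpy as np

dog = nlp(u'dog')[0]
manual_norm = np.sqrt((dog.vector ** 2).sum())   # square root of the sum of squared components
print(manual_norm, dog.vector_norm)              # the two values should agree (about 7.03 above)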

Vector arithmetic

Believe it or not, we can calculate new vectors by adding and subtracting related vectors. A famous example suggests:

"king" - "man" + "woman" = "queen"
from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

queen = nlp.vocab['queen'].vector

new_vector = king - man + woman

cosine_similarity(new_vector, queen)
#0.7880843877792358

cosine_similarity(new_vector, woman)
#0.5150813460350037

# Now find the closest vectors in the vocabulary to the result of "king" - "man" + "woman":
computed_similarities = []

for word in nlp.vocab:
    # Ignore words without vectors, non-lowercase words, and non-alphabetic tokens:
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))

computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

print([w[0].text for w in computed_similarities[:10]])
#['king', 'queen', 'commoner', 'highness', 'prince', 'sultan', 'maharajas', 'princes', 'kumbia', 'kings']
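
The same brute-force search can be wrapped in a small helper for reuse. most_similar_words below is a hypothetical convenience function built on the objects defined above (nlp, cosine_similarity), not part of spaCy:

def most_similar_words(vec, topn=10):
    # Return the topn lowercase, alphabetic vocabulary words closest to vec by cosine similarity.
    sims = []
    for word in nlp.vocab:
        if word.has_vector and word.is_lower and word.is_alpha:
            sims.append((word.text, cosine_similarity(vec, word.vector)))
    return sorted(sims, key=lambda item: -item[1])[:topn]

print([w for w, s in most_similar_words(king - man + woman)])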

Sentiment Analysis

Sentiment analysis detects polarity (e.g., positive or negative opinion) within text, as well as emotions, urgency, and intention.

VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis model included in NLTK.

Download the VADER lexicon. You only need to do this once.

import nltk
nltk.download('vader_lexicon')

VADER’s SentimentIntensityAnalyzer() takes in a string and returns a dictionary of scores in each of four categories:

  • negative
  • neutral
  • positive
  • compound (computed by normalizing the scores above)

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
a = 'This was a good movie.'
sid.polarity_scores(a)
#{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

a = 'This was the best, most awesome movie EVER MADE!!!'
sid.polarity_scores(a)
#{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}

a = 'This was the worst film to ever disgrace the screen.'
sid.polarity_scores(a)
#{'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074}

Use VADER to analyze Amazon Reviews

import numpy as np
import pandas as pd

df = pd.read_csv('amazonreviews.tsv', sep='\t')
df.head()
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = [] # start with an empty list

for i, lb, rv in df.itertuples():  # iterate over the DataFrame
    if type(rv) == str:            # avoid NaN values
        if rv.isspace():           # test 'review' for whitespace
            blanks.append(i)       # add matching index numbers to the list

df.drop(blanks, inplace=True)

sid.polarity_scores(df.loc[0]['review'])
#{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}
df.loc[0]['label']
#'pos'

Adding Scores and Labels to the DataFrame

df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')
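
A quick spot check (illustrative only) that the new columns line up with the original labels before scoring accuracy:

df[['label', 'compound', 'comp_score']].head()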

Report on Accuracy

from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
accuracy_score(df['label'],df['comp_score'])
#0.7091
print(classification_report(df['label'],df['comp_score']))
              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

   micro avg       0.71      0.71      0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000
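
confusion_matrix was imported above but not shown; a short sketch of how it breaks the errors down (rows are the true labels, columns the predictions, in sorted label order: neg, pos):

print(confusion_matrix(df['label'], df['comp_score']))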