【NLP】Semantic & Sentiment Analysis
Semantic Analysis
Draws meaning from natural language text; it considers signs and symbols (semiotics) and collocations (words that often appear together).
Techniques
- Classification Models
    - Topic Classification
    - Sentiment Analysis
    - Intent Classification
- Extraction Models
    - Keyword Extraction
    - Entity Extraction
Word Vectors
Word vectors - also called word embeddings - are mathematical descriptions of individual words, built so that words that frequently appear together in the language have similar values. In this way we can mathematically derive context. For example, the word vector for “lion” will be closer in value to “cat” than to “dandelion”.
In en_core_web_md there are 20k unique vectors (300 dimensions each).
What’s interesting is that Doc and Span objects themselves have vectors, derived from the averages of individual token vectors.
This makes it possible to compare similarities between whole documents.
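A quick sketch of the averaging idea, using made-up 3-dimensional token vectors in place of real 300-dimensional spaCy embeddings (the values below are illustrative only):

```python
import numpy as np

# Made-up token vectors for a two-token "document"
# (real ones would come from nlp(...) and have 300 dimensions)
token_vectors = np.array([
    [0.2, 0.8, 0.1],   # e.g. "dog"
    [0.3, 0.7, 0.2],   # e.g. "cat"
])

# A Doc/Span vector is the element-wise average of its token vectors
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)
```

Two documents can then be compared by measuring the similarity of their averaged vectors.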
Identifying similar vectors
The best way to expose vector relationships is through the .similarity() method of Doc tokens.
```python
import spacy
nlp = spacy.load('en_core_web_md')

# Create a three-token Doc object:
tokens = nlp(u'dog cat pet')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

# dog dog 1.0
# ...
```
Note that order doesn’t matter: token1.similarity(token2) has the same value as token2.similarity(token1).
Opposites are not necessarily different
Words that have opposite meanings, but that often appear in the same context, may have similar vectors.
```python
# Create a three-token Doc object:
tokens = nlp(u'like love hate')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

# like like 1.0
# ...
```
Vector norms
It’s sometimes helpful to aggregate the 300 dimensions into a Euclidean (L2) norm, computed as the square root of the sum of the squared components. This is accessible as the .vector_norm token attribute. Other helpful attributes include .has_vector and .is_oov (out of vocabulary).
```python
tokens = nlp(u'dog cat nowaythere')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
```
Vector arithmetic
Believe it or not, we can actually calculate new vectors by adding & subtracting related vectors. A famous example suggests
"king" - "man" + "woman" = "queen"
```python
from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

# Now we find the closest vector in the vocabulary to the result of "king" - "man" + "woman"
new_vector = nlp.vocab['king'].vector - nlp.vocab['man'].vector + nlp.vocab['woman'].vector
```
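To show the nearest-neighbour lookup end to end without loading a model, here is a self-contained sketch using tiny made-up vectors; the words and values below are illustrative only, not real embeddings:

```python
import numpy as np

# Toy 3-d "embeddings" (made up for illustration; real vectors have 300 dims)
vocab = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.2, 0.1]),
    'man':   np.array([0.5, 0.9, 0.0]),
    'woman': np.array([0.5, 0.3, 0.0]),
    'dog':   np.array([0.1, 0.1, 0.9]),
}

def cosine_similarity(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# "king" - "man" + "woman" should land nearest to "queen"
new_vector = vocab['king'] - vocab['man'] + vocab['woman']
ranked = sorted(vocab, key=lambda w: -cosine_similarity(new_vector, vocab[w]))
print(ranked[0])
```

With real embeddings the same ranking loop runs over the whole model vocabulary, usually filtered to lowercase alphabetic words that have vectors.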
Sentiment Analysis
Detects polarity (e.g., a positive or negative opinion) within text, along with emotion, urgency and intention.
VADER
Valence Aware Dictionary for Sentiment Reasoning (VADER) is a lexicon- and rule-based sentiment model included in NLTK.
Download the VADER lexicon (you only need to do this once):

```python
import nltk
nltk.download('vader_lexicon')
```
VADER’s SentimentIntensityAnalyzer() takes in a string and returns a dictionary of scores in each of four categories:
- negative
- neutral
- positive
- compound (computed by normalizing the scores above)
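In the NLTK implementation, the compound score is the sum of the lexicon valences squashed into [-1, 1]. A minimal sketch of that normalization, assuming the standard alpha = 15:

```python
import math

def normalize(score, alpha=15):
    # Squash an unbounded valence sum into [-1, 1],
    # as VADER does to produce its compound score
    return score / math.sqrt(score * score + alpha)

print(normalize(3.1))    # strongly positive raw sum -> compound near +0.6
print(normalize(-1.5))   # mildly negative raw sum -> negative compound
```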
```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
```
Use VADER to analyze Amazon Reviews
```python
import numpy as np
import pandas as pd
```
Adding Scores and Labels to the DataFrame
```python
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
```
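The labels half of this step is not shown above; one way to sketch it, with made-up dicts standing in for real polarity_scores() output:

```python
import pandas as pd

# Made-up polarity_scores()-style dicts standing in for real VADER output
df = pd.DataFrame({
    'review': ['Great product, loved it!', 'Terrible. Do not buy.'],
    'scores': [
        {'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound': 0.66},
        {'neg': 0.6, 'neu': 0.4, 'pos': 0.0, 'compound': -0.72},
    ],
})

# Pull the compound score out of each dict, then map it to a pos/neg label
df['compound'] = df['scores'].apply(lambda d: d['compound'])
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

print(df[['review', 'compound', 'comp_score']])
```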
Report on Accuracy
```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
```

classification_report prints precision, recall, f1-score and support for each label.
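A self-contained sketch of the reporting step, using made-up true/predicted label lists in place of the DataFrame columns:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up ground-truth labels and VADER-derived predictions for six reviews
y_true = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
y_pred = ['pos', 'neg', 'neg', 'neg', 'pos', 'pos']

print(accuracy_score(y_true, y_pred))         # fraction of labels that match
print(confusion_matrix(y_true, y_pred))       # rows = true class, cols = predicted
print(classification_report(y_true, y_pred))  # per-class precision / recall / f1 / support
```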