NLP Fundamentals & Common Techniques

What is NLP

NLP uses a set of common techniques to create structure out of raw text data by applying language-specific grammatical rules and semantics.

Common NLP Techniques

NLP Technique                     Description
Tokenization (word segmentation)  Convert raw text into separate words or tokens.
Parsing & Tagging                 Parsing builds a tree-like structure over the words, focusing on the relationships between them. Tagging attaches additional information to tokens.
Stemming                          Reduce words to a base form using rules.
Lemmatization                     Reduce words to their dictionary base form (called the lemma).
Stop Word Filtering               Filter out very frequent words such as a and the.
Part of Speech Tagging            Use linguistic knowledge to add useful information to tokens.
Named Entity Recognition          Locate and classify named entities.

Tokenization

uses prefix, suffix and infix characters, and punctuation rules

Stemming

Porter’s Algorithm

cats→cat

Lemmatization

The lemma of was is be

Stop word Filtering

Words like a and the appear so frequently that they don’t require tagging as thoroughly as nouns, verbs and modifiers.

We call these stop words.

Part of Speech Tagging

using linguistic knowledge to add useful information to tokens.

Named Entity Recognition (NER)

seeks to locate and classify named entities.

Introduction to NLTK and spaCy

What is NLTK

Natural Language Toolkit

What is spaCy

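An open-source NLP library that typically implements a single method for each task, choosing the most efficient method currently available.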

Getting Started with spaCy

Install the English model

pip install --upgrade spacy
python -m spacy download en_core_web_sm

If the download fails, visit the “English · spaCy Models Documentation” page, download the .whl file to disk, and install it with pip:

pip install somewhere/en_core_web_sm-3.7.1-py3-none-any.whl

Using spaCy

# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(u'Apple is looking at buying a U.K. startup for $1 Billion')

# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)
Apple PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
a DET det
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
Billion NUM pobj

spaCy Objects

After importing the spacy module in the cell above we loaded a model and named it nlp.
Next we created a Doc object by applying the model to our text, and named it doc.
spaCy also builds a companion Vocab object that we’ll cover in later sections.
The Doc object that holds the processed text is our focus here.

Pipeline

When we run nlp, our text enters a processing pipeline that first breaks down the text and then performs a series of operations to tag, parse and describe the data.
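
As a rough mental model only (this is not how spaCy is implemented), a pipeline can be pictured as a tokenizer followed by an ordered list of components, each of which annotates the output of the previous step. The function names `tokenize`, `tag_alpha`, `flag_stops` and `run_pipeline` below are hypothetical, and the tokenizing rule is deliberately crude.

```python
import re


def tokenize(text):
    # Break the text into word and punctuation tokens (a crude rule).
    return [{"text": t} for t in re.findall(r"\w+|[^\w\s]", text)]


def tag_alpha(tokens):
    # Annotate each token with whether it is purely alphabetic.
    for tok in tokens:
        tok["is_alpha"] = tok["text"].isalpha()
    return tokens


def flag_stops(tokens, stop_words=frozenset({"a", "the", "is"})):
    # Annotate each token with whether it is in a tiny stop list.
    for tok in tokens:
        tok["is_stop"] = tok["text"].lower() in stop_words
    return tokens


def run_pipeline(text, components=(tag_alpha, flag_stops)):
    # Tokenize first, then let each component annotate the tokens in order.
    doc = tokenize(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline("Apple is looking at a startup.")
for tok in doc:
    print(tok)
```

Each component receives the full token list and returns it, so components can be added, removed or reordered without touching the others.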

Tokenization & POS & Dependencies

The first step in processing text is to split up all the component parts (words & punctuation) into “tokens”. These tokens are annotated inside the Doc object to contain descriptive information.

The next step after splitting the text up into tokens is to assign parts of speech.

We also looked at the syntactic dependencies assigned to each token. Apple is identified as an nsubj or the nominal subject of the sentence.

doc2 = nlp(u"Apple isn't looking into startups anymore.")
for token in doc2:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.dep_:{10}}')
Apple      PROPN      nsubj
is         VERB       aux
n't        ADV        neg
looking    VERB       ROOT
into       ADP        prep
startups   NOUN       pobj
anymore    ADV        advmod
.          PUNCT      punct

Notice how isn't has been split into two tokens. spaCy recognizes both the verb is and the negation n't attached to it. Notice also that the period at the end of the sentence is assigned its own token.

To see the full name of a tag use spacy.explain(tag)

spacy.explain('nsubj')
# 'nominal subject'
spacy.explain(str(doc[0].pos_))
# 'proper noun'

Additional Token Attributes

Attribute  Value for doc2[0]          Description
.text      Apple                      The original word text
.lemma_    apple                      The base form of the word
.pos_      PROPN/proper noun          The simple part-of-speech tag
.tag_      NNP/noun, proper singular  The detailed part-of-speech tag
.shape_    Xxxxx                      The word shape: capitalization, punctuation, digits
.is_alpha  True                       Does the token consist of alphabetic characters?
.is_stop   False                      Is the token part of a stop list, i.e. the most common words of the language?

Spans

Large Doc objects can be hard to work with at times. A span is a slice of a Doc object, in the form Doc[start:stop].

doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

life_quote = doc3[16:30]
print(life_quote)
# "Life is what happens to us while we are making other plans"
type(life_quote)
# spacy.tokens.span.Span

Sentences

Certain tokens inside a Doc object may also receive a “start of sentence” tag. While this doesn’t immediately build a list of sentences, these tags enable the generation of sentence segments through Doc.sents. Later we’ll write our own segmentation rules.

doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc4.sents:
    print(sent)
# This is the first sentence.
# This is another sentence.
# This is the last sentence.
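
To illustrate the idea that sentences are derived from per-token “start of sentence” flags rather than stored as a ready-made list, here is a plain-Python sketch. The `(text, is_sent_start)` pairs and the `iter_sentences` helper are hypothetical stand-ins, not spaCy objects.

```python
def iter_sentences(tokens):
    """Yield sentences lazily from (text, is_sent_start) pairs."""
    current = []
    for text, is_sent_start in tokens:
        # A start flag closes the sentence collected so far.
        if is_sent_start and current:
            yield current
            current = []
        current.append(text)
    # Emit whatever remains after the last token.
    if current:
        yield current


tokens = [
    ("This", True), ("is", False), ("first", False), (".", False),
    ("This", True), ("is", False), ("last", False), (".", False),
]
sentences = list(iter_sentences(tokens))
for sent in sentences:
    print(" ".join(sent))
```

Because the function is a generator, no sentence list exists until it is iterated, which mirrors the description of Doc.sents above.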

Tokenization

  • Prefix: Character(s) at the beginning ▸ $ ( “ ¿
  • Suffix: Character(s) at the end ▸ km ) , . ! ”
  • Infix: Character(s) in between ▸ - -- / ...
  • Exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied ▸ St. U.S.
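
The four rule types above can be sketched in plain Python. This is a toy illustration under strong assumptions, not spaCy's tokenizer: the rule sets `PREFIXES`, `SUFFIXES`, `INFIXES` and `EXCEPTIONS` are hypothetical and heavily reduced, only single-character prefixes and suffixes are handled, and URL or email matching is left out entirely.

```python
import re

# Hypothetical, heavily reduced rule sets (single characters only).
PREFIXES = ("$", "(", '"')
SUFFIXES = (")", ",", ".", "!", '"')
INFIXES = re.compile(r"(--|-|/)")
# Special cases: keep abbreviations whole, split contractions.
EXCEPTIONS = {"St.": ["St."], "U.S.": ["U.S."], "Let's": ["Let", "'s"]}


def tokenize_chunk(chunk):
    prefixes, suffixes = [], []
    # Peel prefix characters from the front, unless an exception matches.
    while chunk and chunk not in EXCEPTIONS and chunk.startswith(PREFIXES):
        prefixes.append(chunk[0])
        chunk = chunk[1:]
    # Peel suffix characters from the end, unless an exception matches.
    while chunk and chunk not in EXCEPTIONS and chunk.endswith(SUFFIXES):
        suffixes.insert(0, chunk[-1])
        chunk = chunk[:-1]
    if chunk in EXCEPTIONS:
        middle = list(EXCEPTIONS[chunk])
    else:
        # Split what is left around infix characters, keeping them as tokens.
        middle = [part for part in INFIXES.split(chunk) if part]
    return prefixes + middle + suffixes


def tokenize(text):
    # Split on whitespace first, then apply the rules to each chunk.
    return [tok for chunk in text.split() for tok in tokenize_chunk(chunk)]


print(tokenize("Let's visit St. Louis in the U.S. next year."))
print(tokenize("Send (snail-mail) for $5!"))
```

Checking the exception table before each peeling step is what keeps the trailing period of “U.S.” attached while the period after “year” is split off.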

Prefixes, Suffixes and Infixes

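spaCy isolates punctuation that does not form an integral part of a word: quotation marks, commas and sentence-final punctuation are assigned their own tokens. However, punctuation that exists as part of an email address, website or numerical value is kept as part of the token.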

doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)
......
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!

Note that the exclamation points, comma, and the hyphen in ‘snail-mail’ are assigned their own tokens, yet both the email address and website are preserved.

Exceptions

Punctuation that exists as part of a known abbreviation will be kept as part of the token.

doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
    print(t)
Let
's
visit
St.
Louis
in
the
U.S.
next
year
.

Here the abbreviations for “Saint” and “United States” are both preserved.

Counting Vocab Entries

Vocab objects contain a full library of items!

len(doc.vocab)
# 57852

NOTE: This number changes based on the language library loaded at the start, and any new lexemes introduced to the vocab when the Doc was created.

Tokens cannot be reassigned

Although Doc objects can be considered lists of tokens, they do not support item reassignment.

Named Entities

Going a step beyond tokens, named entities add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the ents property of a Doc object.

doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')
for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit

Noun Chunks

Similar to Doc.ents, Doc.noun_chunks is another property of the Doc object. Noun chunks are “base noun phrases”: flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing it. For example, in Sheb Wooley’s 1958 song, a “one-eyed, one-horned, flying, purple people-eater” would be one long noun chunk.

doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
    print(chunk.text)
# He
# a one-eyed, one-horned, flying, purple people-eater

Built-in Visualizers

spaCy includes a built-in visualization tool called displaCy. displaCy is able to detect whether you’re working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

from spacy import displacy
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)
Over the last quarter DATE Apple ORG sold nearly 20 thousand CARDINAL iPods PRODUCT for a profit of $6 million MONEY .

Stemming

Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for “boat” might also return “boats” and “boating”. Here, “boat” would be the stem for [boat, boater, boating, boats].

spaCy does not include a stemmer, relying on lemmatization instead. To demonstrate stemming, we’ll use another popular NLP tool called NLTK.

Porter Stemmer

One of the most common - and effective - stemming tools is Porter’s Algorithm developed by Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set of mapping rules.

# Import the toolkit and the Porter stemmer class
import nltk

from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word+' --> '+p_stemmer.stem(word))
run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli

Note that the stemmer leaves “runner” unchanged rather than reducing it to “run”. Also, the adverbs “easily” and “fairly” are stemmed to the unusual roots “easili” and “fairli”.
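
To give a feel for what one set of mapping rules looks like, here is a minimal sketch of step 1a of Porter’s original 1980 algorithm, which handles plural endings. It covers this single step only; the later steps, and any extensions NLTK applies on top of the original algorithm, are left out, so its output will not always match `PorterStemmer`. The names `STEP_1A_RULES` and `porter_step_1a` are hypothetical.

```python
# Step 1a of Porter's 1980 algorithm: plural suffix rules.
# Rules are tried in order, so the longest matching suffix wins.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (unchanged)
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word):
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            # Swap the matched suffix for its replacement and stop.
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats", "run"]:
    print(w, "-->", porter_step_1a(w))
```

The “ies” rule shows why stems such as “easili” appear: the rules rewrite suffixes mechanically and make no attempt to produce a dictionary word.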

Snowball Stemmer

This is somewhat of a misnomer, as Snowball is the name of a stemming language developed by Martin Porter. The algorithm used here is more accurately called the “English Stemmer” or “Porter2 Stemmer”. It offers a slight improvement over the original Porter stemmer, both in logic and speed.

from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')
for word in words:
    print(word+' --> '+s_stemmer.stem(word))
run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair

In this case the stemmer performed the same as the Porter Stemmer, except that it handled “fairly” more appropriately, stemming it to “fair”.

Stemming has its drawbacks. If given the token saw, stemming might always return saw, whereas lemmatization would likely return either see or saw depending on whether the use of the token was as a verb or a noun. As an example, consider the following:

phrase = 'I am meeting him tomorrow at the meeting'
for word in phrase.split():
    print(word+' --> '+p_stemmer.stem(word))
I --> I
am --> am
meeting --> meet
him --> him
tomorrow --> tomorrow
at --> at
the --> the
meeting --> meet

Here the word “meeting” appears twice - once as a verb, and once as a noun, and yet the stemmer treats both equally.


Lemmatization

In contrast to stemming, lemmatization looks beyond word reduction, and considers a language’s full vocabulary to apply a morphological analysis to words. The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. Further, the lemma of ‘meeting’ might be ‘meet’ or ‘meeting’ depending on its use in a sentence.

import spacy
nlp = spacy.load('en_core_web_sm')
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(token.text, '\t\t', token.pos_, '\t', token.lemma, '\t\t', token.lemma_)
I 		 PRON 	 561228191312463089 		 -PRON-
am VERB 10382539506755952630 be
a DET 11901859001352538922 a
runner NOUN 12640964157389618806 runner
running VERB 12767647472892411841 run
in ADP 3002984154512732771 in
a DET 11901859001352538922 a
race NOUN 8048469955494714898 race
because ADP 16950148841647037698 because
I PRON 561228191312463089 -PRON-
love VERB 3702023516439754181 love
to PART 3791531372978436496 to
run VERB 12767647472892411841 run
since ADP 10066841407251338481 since
I PRON 561228191312463089 -PRON-
ran VERB 12767647472892411841 run
today NOUN 11042482332948150395 today

In the above sentence, running, run and ran all point to the same lemma run (…11841) to avoid duplication.

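Also notice that spaCy, in the 2.x version that produced this output, does not try to find a lemma for personal pronouns; instead it assigns them all the same symbol, -PRON-. spaCy 3 dropped this convention, so newer versions print a regular lemma for pronouns.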
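
The integer in the third column of the output above is an ID for the lemma string: tokens that share a lemma share the ID, so the string itself only needs to be stored once. The sketch below illustrates that idea with a hypothetical `StringTable` class. As a simplifying assumption it hands out small sequential IDs, whereas spaCy uses 64-bit hash values.

```python
class StringTable:
    """Map each distinct string to one integer ID (toy string interning)."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def add(self, string):
        # Reuse the existing ID if the string was seen before.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def lookup(self, string_id):
        # Resolve an ID back to its string.
        return self._strings[string_id]


table = StringTable()
# Lemmas of: running, run, ran, race
lemma_ids = [table.add(lemma) for lemma in ["run", "run", "run", "race"]]
print(lemma_ids)
print([table.lookup(i) for i in lemma_ids])
```

This is the relationship between token.lemma (the ID) and token.lemma_ (the string) shown in the output above.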

def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')
doc3 = nlp(u"I am meeting him tomorrow at the meeting.")

show_lemmas(doc3)
I            PRON   561228191312463089     -PRON-
am           VERB   10382539506755952630   be
meeting      VERB   6880656908171229526    meet
him          PRON   561228191312463089     -PRON-
tomorrow     NOUN   3573583789758258062    tomorrow
at           ADP    11667289587015813222   at
the          DET    7425985699627899538    the
meeting      NOUN   14798207169164081740   meeting
.            PUNCT  12646065887601541794   .

Here the lemma of meeting is determined by its Part of Speech tag.
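
A minimal way to picture POS-dependent lemmatization is a lookup keyed on both the word and its part-of-speech tag. The table and the `lemmatize` helper below are hypothetical and tiny; a real lemmatizer combines much larger tables with morphological rules. The tags follow the output shown above.

```python
# Toy lookup table keyed on (lowercased word, coarse POS tag).
LEMMA_TABLE = {
    ("was", "VERB"): "be",
    ("am", "VERB"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lemmatize(word, pos):
    # Fall back to the lowercased word when no entry exists.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


tagged = [("am", "VERB"), ("meeting", "VERB"), ("the", "DET"), ("meeting", "NOUN")]
lemmas = [lemmatize(word, pos) for word, pos in tagged]
print(lemmas)
```

Unlike the stemmer, which maps both occurrences of “meeting” to the same stem, the lookup returns a different lemma for each because the POS tag is part of the key.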


Stop Words

Words like “a” and “the” appear so frequently that they don’t require tagging as thoroughly as nouns, verbs and modifiers. We call these stop words, and they can be filtered from the text to be processed. spaCy holds a built-in list of some 305 English stop words (the exact count depends on the spaCy version).
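
Filtering itself is a simple membership test. The sketch below uses a tiny hypothetical stop list and a hypothetical `remove_stop_words` helper on plain strings; with spaCy you would test each token's is_stop attribute instead.

```python
# A tiny hypothetical stop list; spaCy's built-in list is far longer.
STOP_WORDS = {"a", "an", "the", "is", "at", "of", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The startup is looking at a factory".split()
content = remove_stop_words(tokens)
print(content)
```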

Add a stop word

There may be times when you wish to add a stop word to the default set. Perhaps you decide that 'btw' (common shorthand for “by the way”) should be considered a stop word.

# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')

# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

When adding stop words, always use lowercase. Lexemes are converted to lowercase before being added to vocab.

Remove a stop word

Alternatively, you may decide that 'beyond' should not be considered a stop word.

# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('beyond')

# Remove the stop_word tag from the lexeme
nlp.vocab['beyond'].is_stop = False