NLP Fundamentals & Common Techniques

What is NLP

NLP applies a variety of common techniques to create structure out of raw text data, using language-specific grammatical rules and semantics.

Common NLP Techniques

| NLP Technique | Description |
| --- | --- |
| Tokenization (word segmentation) | Convert raw text into separate words or tokens. |
| Parsing & Tagging | Parsing creates a tree-like structure over words, focusing on the relationships between them; tagging attaches additional information to tokens. |
| Stemming | Reducing words to their base form using rules. |
| Lemmatization | Reducing words to their base dictionary form (called the lemma). |
| Stop Word Filtering | Removing very frequent words such as “a” and “the” that carry little standalone meaning. |
| Parts of Speech Tagging | Labeling each token with its grammatical role (noun, verb, etc.) using linguistic knowledge. |
| Named Entity Recognition | Locating and classifying named entities such as people, organizations and locations. |

Tokenization

Tokenization splits raw text into tokens using prefix, suffix and infix characters, together with punctuation rules.

Stemming

Porter’s Algorithm

cats→cat

Lemmatization

The lemma of ‘was’ is ‘be’.

Stop word Filtering

Words like “a” and “the” appear so frequently that they don’t require tagging as thoroughly as nouns, verbs and modifiers.

We call these stop words.

Part of Speech Tagging

Uses linguistic knowledge to add useful grammatical information (noun, verb, adjective, etc.) to tokens.

Named Entity Recognition (NER)

NER seeks to locate and classify named entities such as people, organizations, locations and monetary values.

Introduction to NLTK and spaCy

What is NLTK

NLTK (the Natural Language Toolkit) is a widely used open-source Python library for working with human language data.

What is spaCy

spaCy is an open-source NLP library designed for production use. Rather than exposing many alternative algorithms for each task, it generally chooses the most efficient method available and applies it by default.

Getting Started with spaCy

Install the English model (en_core_web_sm)

pip install --upgrade spacy
python -m spacy download en_core_web_sm

If the download fails, visit the English · spaCy Models Documentation page (https://spacy.io/models/en), download the .whl file to disk, and install it with pip:


pip install somewhere/en_core_web_sm-3.7.1-py3-none-any.whl

Using spaCy

# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(u'Apple is looking at buying a U.K. startup for $1 Billion')

# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)
Apple PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
a DET det
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
Billion NUM pobj

spaCy Objects

After importing the spacy module in the cell above we loaded a model and named it nlp.
Next we created a Doc object by applying the model to our text, and named it doc.
spaCy also builds a companion Vocab object that we’ll cover in later sections.
The Doc object that holds the processed text is our focus here.

Pipeline

When we run nlp, our text enters a processing pipeline that first breaks down the text and then performs a series of operations to tag, parse and describe the data.
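As a quick sanity check (a minimal sketch; the exact component names depend on the model and spaCy version), you can list the components of the loaded pipeline:

# Inspect the components in the loaded pipeline
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']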

Tokenization & POS & Dependencies

The first step in processing text is to split up all the component parts (words & punctuation) into “tokens”. These tokens are annotated inside the Doc object to contain descriptive information.

The next step after splitting the text up into tokens is to assign parts of speech.

We also looked at the syntactic dependencies assigned to each token. Apple is identified as an nsubj or the nominal subject of the sentence.

doc2 = nlp(u"Apple isn't looking into startups anymore.")
for token in doc2:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.dep_:{10}}')
Apple      PROPN      nsubj     
is VERB aux
n't ADV neg
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct

Notice how isn't has been split into two tokens. spaCy recognizes both the root verb is and the negation attached to it. Notice also that the period at the end of the sentence is assigned its own token.

To see the full name of a tag use spacy.explain(tag)

spacy.explain('nsubj')
# 'nominal subject'
spacy.explain(str(doc[0].pos_))
# 'proper noun'

Additional Token Attributes

| Tag | Description | doc2[0] |
| --- | --- | --- |
| .text | The original word text | Apple |
| .lemma_ | The base form of the word | apple |
| .pos_ | The simple part-of-speech tag | PROPN / proper noun |
| .tag_ | The detailed part-of-speech tag | NNP / noun, proper singular |
| .shape_ | The word shape – capitalization, punctuation, digits | Xxxxx |
| .is_alpha | Is the token an alpha character? | True |
| .is_stop | Is the token part of a stop list, i.e. the most common words of the language? | False |
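As a small sketch, these attributes can be printed directly from the token (the values shown match the table above; they may differ slightly by model version):

# Print the main attributes of the first token of doc2 ('Apple')
token = doc2[0]
print(token.text, token.lemma_, token.pos_, token.tag_, token.shape_, token.is_alpha, token.is_stop)
# Apple apple PROPN NNP Xxxxx True False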

Spans

Large Doc objects can be hard to work with at times. A span is a slice of a Doc object in the form doc[start:stop].

doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

life_quote = doc3[16:30]
print(life_quote)
# "Life is what happens to us while we are making other plans"
type(life_quote)
# spacy.tokens.span.Span

Sentences

Certain tokens inside a Doc object may also receive a “start of sentence” tag. While this doesn’t immediately build a list of sentences, these tags enable the generation of sentence segments through Doc.sents. Later we’ll write our own segmentation rules.

doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc4.sents:
    print(sent)
# This is the first sentence.
# This is another sentence.
# This is the last sentence.

Tokenization

  • Prefix: Character(s) at the beginning ▸ $ ( “ ¿
  • Suffix: Character(s) at the end ▸ km ) , . ! ”
  • Infix: Character(s) in between ▸ - -- / ...
  • Exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied ▸ St. U.S.

Prefixes, Suffixes and Infixes

spaCy isolates punctuation that does not form an integral part of a word (quotation marks, commas, sentence-final punctuation and so on) into its own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)
......
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!

Note that the exclamation points, comma, and the hyphen in ‘snail-mail’ are assigned their own tokens, yet both the email address and website are preserved.

Exceptions

Punctuation that exists as part of a known abbreviation will be kept as part of the token.

doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
    print(t)
Let
's
visit
St.
Louis
in
the
U.S.
next
year
.

Here the abbreviations for “Saint” and “United States” are both preserved.

Counting Vocab Entries

Vocab objects contain the full library of lexemes (word-type entries) known to the model:

len(doc.vocab)
# 57852

NOTE: This number changes based on the language library loaded at the start, and any new lexemes introduced to the vocab when the Doc was created.
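To see this in action (a rough sketch; exact counts vary by model and by what text has already been processed), compare the vocab size before and after processing a string containing a previously unseen word:

before = len(doc.vocab)
nlp(u'A made-up word like flibbertigibbetish adds a new lexeme')
after = len(doc.vocab)
print(before, after)   # the second count is larger if new lexemes were created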

Tokens cannot be reassigned

Although Doc objects can be considered lists of tokens, they do not support item reassignment.
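For example (a sketch; the exact error message depends on the spaCy version), attempting to overwrite a token raises a TypeError:

doc5 = nlp(u'My dinner was horrible.')
doc6 = nlp(u'Your dinner was delicious.')

try:
    doc5[3] = doc6[3]   # attempt to swap 'horrible' for 'delicious'
except TypeError as e:
    print('Tokens cannot be reassigned:', e)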

Named Entities

Going a step beyond tokens, named entities add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the ents property of a Doc object.

doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')
for ent in doc8.ents:
    print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit

Noun Chunks

Similar to Doc.ents, Doc.noun_chunks are another object property. Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in Sheb Wooley’s 1958 song, a “one-eyed, one-horned, flying, purple people-eater” would be one long noun chunk.

doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
    print(chunk.text)
# He
# a one-eyed, one-horned, flying, purple people-eater

Built-in Visualizers

spaCy includes a built-in visualization tool called displaCy. displaCy is able to detect whether you’re working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

# Import the displaCy visualizer
from spacy import displacy

doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)
Over the last quarter DATE Apple ORG sold nearly 20 thousand CARDINAL iPods PRODUCT for a profit of $6 million MONEY .
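displaCy can also draw the dependency parse; a minimal sketch (the options dict is optional and only adjusts the spacing between arcs):

# Render the dependency parse of the same Doc
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})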

Stemming

Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for “boat” might also return “boats” and “boating”. Here, “boat” would be the stem for [boat, boater, boating, boats].

spaCy doesn’t include a stemmer, opting to rely entirely on lemmatization instead, so for stemming we’ll use another popular NLP tool called NLTK.

Porter Stemmer

One of the most common - and effective - stemming tools is Porter’s Algorithm developed by Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set of mapping rules.

# Import the toolkit and the full Porter Stemmer library
import nltk

from nltk.stem.porter import *
p_stemmer = PorterStemmer()
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word + ' --> ' + p_stemmer.stem(word))
run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli

Note how the stemmer treats “runner” as a distinct noun rather than a verb form or participle, leaving it unchanged. Also, the adverbs “easily” and “fairly” are stemmed to the unusual roots “easili” and “fairli”.

Snowball Stemmer

This is somewhat of a misnomer, as Snowball is the name of a stemming language developed by Martin Porter. The algorithm used here is more accurately called the “English Stemmer” or “Porter2 Stemmer”. It offers a slight improvement over the original Porter stemmer, both in logic and speed.

from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')
for word in words:
    print(word + ' --> ' + s_stemmer.stem(word))
run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair

In this case the Snowball stemmer performed the same as the Porter stemmer, with the exception that it handled the stem of “fairly” more appropriately, returning “fair”.

Stemming has its drawbacks. If given the token saw, stemming might always return saw, whereas lemmatization would likely return either see or saw depending on whether the use of the token was as a verb or a noun. As an example, consider the following:

phrase = 'I am meeting him tomorrow at the meeting'
for word in phrase.split():
    print(word + ' --> ' + p_stemmer.stem(word))
I --> I
am --> am
meeting --> meet
him --> him
tomorrow --> tomorrow
at --> at
the --> the
meeting --> meet

Here the word “meeting” appears twice - once as a verb, and once as a noun, and yet the stemmer treats both equally.


Lemmatization

In contrast to stemming, lemmatization looks beyond word reduction, and considers a language’s full vocabulary to apply a morphological analysis to words. The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. Further, the lemma of ‘meeting’ might be ‘meet’ or ‘meeting’ depending on its use in a sentence.

import spacy
nlp = spacy.load('en_core_web_sm')
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(token.text, '\t\t', token.pos_, '\t', token.lemma, '\t\t', token.lemma_)
I 		 PRON 	 561228191312463089 		 -PRON-
am VERB 10382539506755952630 be
a DET 11901859001352538922 a
runner NOUN 12640964157389618806 runner
running VERB 12767647472892411841 run
in ADP 3002984154512732771 in
a DET 11901859001352538922 a
race NOUN 8048469955494714898 race
because ADP 16950148841647037698 because
I PRON 561228191312463089 -PRON-
love VERB 3702023516439754181 love
to PART 3791531372978436496 to
run VERB 12767647472892411841 run
since ADP 10066841407251338481 since
I PRON 561228191312463089 -PRON-
ran VERB 12767647472892411841 run
today NOUN 11042482332948150395 today

In the sentence above, running, run and ran all point to the same lemma run (hash ending in …11841), so they are not treated as three unrelated words.

Also notice that spaCy does not try to find a lemma for personal pronouns; instead it assigns them the placeholder symbol -PRON- (a behavior of the older spaCy 2.x models – newer models simply return the pronoun itself).
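As a quick illustration of that de-duplication (a sketch; exact numbers depend on the model), collecting the distinct lemmas of doc1 collapses running, run and ran into a single entry:

# Count surface forms vs. distinct lemmas in doc1
print(len([token.text for token in doc1]))     # 17 tokens
print(len({token.lemma_ for token in doc1}))   # fewer distinct lemmas, since running/run/ran share 'run'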

def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')
doc3 = nlp(u"I am meeting him tomorrow at the meeting.")

show_lemmas(doc3)
I            PRON   561228191312463089     -PRON-
am VERB 10382539506755952630 be
meeting VERB 6880656908171229526 meet
him PRON 561228191312463089 -PRON-
tomorrow NOUN 3573583789758258062 tomorrow
at ADP 11667289587015813222 at
the DET 7425985699627899538 the
meeting NOUN 14798207169164081740 meeting
. PUNCT 12646065887601541794 .

Here the lemma of meeting is determined by its Part of Speech tag.


Stop Words

Words like “a” and “the” appear so frequently that they don’t require tagging as thoroughly as nouns, verbs and modifiers. We call these stop words, and they can be filtered from the text to be processed. spaCy holds a built-in list of some 305 English stop words.
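A minimal check of the built-in list (the exact count and membership depend on the spaCy version):

print(len(nlp.Defaults.stop_words))   # around 305-326 depending on the version
print(nlp.vocab['the'].is_stop)       # True
print(nlp.vocab['mystery'].is_stop)   # False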

Add a stop word

There may be times when you wish to add a stop word to the default set. Perhaps you decide that 'btw' (common shorthand for “by the way”) should be considered a stop word.

# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')

# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

When adding stop words, always use lowercase. Lexemes are converted to lowercase before being added to vocab.

Remove a stop word

Alternatively, you may decide that 'beyond' should not be considered a stop word.

# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('beyond')

# Remove the stop_word tag from the lexeme
nlp.vocab['beyond'].is_stop = False
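To confirm both changes took effect (a small sketch):

print(nlp.vocab['btw'].is_stop)              # True  - added above
print(nlp.vocab['beyond'].is_stop)           # False - removed above
print('beyond' in nlp.Defaults.stop_words)   # False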