【NLP】NLP Fundamentals & Introduction to spaCy
NLP Fundamentals & Common Techniques
What is NLP
NLP applies a variety of common techniques to create structure out of raw text data, using language-specific grammatical rules and semantics.
Common NLP Techniques
NLP Technique | Description |
---|---|
Tokenization (word segmentation) | Convert raw text into separate words or tokens. |
Parsing & Tagging | Parsing creates a tree-like structure over the words, focusing on the relationships between them. Tagging attaches additional information to tokens. |
Stemming | Reducing words to their base form using rules. |
Lemmatization | Reducing words to their base dictionary form (called a lemma). |
Stop Word Filtering | Filtering out very common words like "a" and "the" that carry little meaning on their own. |
Parts of Speech Tagging | Using linguistic knowledge to add useful grammatical information to tokens. |
Named Entity Recognition | Locating and classifying named entities such as people, organizations and locations. |
Tokenization
Splits raw text into tokens using prefix, suffix and infix characters, and punctuation rules.
Stemming
Porter’s Algorithm
cats → cat
Lemmatization
The lemma of `was` is `be`.
Stop Word Filtering
Words like `a` and `the` appear so frequently that they don’t require tagging as thoroughly as nouns, verbs and modifiers.
We call these stop words.
Part of Speech Tagging
Uses linguistic knowledge to add useful grammatical information to tokens.
Named Entity Recognition (NER)
Seeks to locate and classify named entities such as people, organizations and locations.
Introduction to NLTK and spaCy
What is NLTK
NLTK, the Natural Language Toolkit, is a widely used open-source Python library for NLP.
What is spaCy
Rather than offering many alternatives for each task, spaCy chooses the single most efficient method currently available.
Getting Started with spaCy
Install the English model
```bash
pip install --upgrade spacy
python -m spacy download en_core_web_sm
```
If the download fails, visit English · spaCy Models Documentation, download the .whl file to disk, and install it from there:
```bash
pip install somewhere/en_core_web_sm-3.7.1-py3-none-any.whl
```
Using spaCy
```python
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a Doc object and print each token's text, part of speech and dependency
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')  # example sentence (assumed)
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
First line of output: `Apple PROPN nsubj`
spaCy Objects
After importing the spacy module in the cell above, we loaded a model and named it `nlp`.
Next we created a Doc object by applying the model to our text, and named it `doc`.
spaCy also builds a companion Vocab object that we’ll cover in later sections.
The Doc object that holds the processed text is our focus here.
Pipeline
When we run `nlp`, our text enters a processing pipeline that first breaks down the text and then performs a series of operations to tag, parse and describe the data.
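A quick way to see which components make up the loaded pipeline is `nlp.pipe_names`; the exact list depends on the model and spaCy version, so the output below is illustrative:
```python
# List the processing components in the loaded pipeline
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
```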
Tokenization & POS & Dependencies
The first step in processing text is to split up all the component parts (words & punctuation) into “tokens”. These tokens are annotated inside the Doc object to contain descriptive information.
The next step after splitting the text up into tokens is to assign parts of speech.
We also looked at the syntactic dependencies assigned to each token. `Apple` is identified as an `nsubj`, the nominal subject of the sentence.
```python
doc2 = nlp(u"Apple isn't   looking into startups anymore.")  # the extra spaces before "looking" are intentional
for token in doc2:
    print(token.text, token.pos_, token.dep_)
```
First line of output: `Apple PROPN nsubj`
Notice how `isn't` has been split into two tokens. spaCy recognizes both the root verb `is` and the negation attached to it. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens.
To see the full name of a tag, use `spacy.explain(tag)`:
```python
spacy.explain('nsubj')
# 'nominal subject'
```
Additional Token Attributes
Attribute | Description | Value for doc2[0] |
---|---|---|
`.text` | The original word text | Apple |
`.lemma_` | The base form of the word | apple |
`.pos_` | The simple part-of-speech tag | PROPN / proper noun |
`.tag_` | The detailed part-of-speech tag | NNP / noun, proper singular |
`.shape_` | The word shape – capitalization, punctuation, digits | Xxxxx |
`.is_alpha` | Is the token an alpha character? | True |
`.is_stop` | Is the token part of a stop list, i.e. the most common words of the language? | False |
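All of these attributes can be read directly off a token; a short sketch using the first token of `doc2`:
```python
# Inspect several attributes of the first token of doc2 ("Apple")
token = doc2[0]
print(token.text, token.lemma_, token.pos_, token.tag_, token.shape_, token.is_alpha, token.is_stop)
```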
Spans
Large Doc objects can be hard to work with at times. A span is a slice of a Doc object in the form `Doc[start:stop]`.
```python
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
```
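A minimal sketch of creating and inspecting a Span (the sentence, variable names and slice indices here are illustrative):
```python
# Slice a Doc to get a Span covering tokens 0 through 5
doc_quote = nlp(u"Life is what happens to us while we are making other plans.")
life_quote = doc_quote[0:6]
print(life_quote)         # Life is what happens to us
print(type(life_quote))   # <class 'spacy.tokens.span.Span'>
```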
Sentences
Certain tokens inside a Doc object may also receive a “start of sentence” tag. While this doesn’t immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we’ll write our own segmentation rules.
```python
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc4.sents:
    print(sent)
```
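You can also check the “start of sentence” tag on an individual token through its `is_sent_start` attribute; here token 6 is the `This` that opens the second sentence:
```python
# The token that begins the second sentence carries the start-of-sentence flag
print(doc4[6].is_sent_start)   # True
```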
Tokenization
- Prefix: Character(s) at the beginning ▸ `$ ( “ ¿`
- Suffix: Character(s) at the end ▸ `km ) , . ! ”`
- Infix: Character(s) in between ▸ `- -- / ...`
- Exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied ▸ `St. U.S.`
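A short sketch of these rules in action (the example string is an assumption; it combines leading/trailing quotes, a contraction and an abbreviation):
```python
# Prefix, suffix and exception handling on a small sample string
mystring = '"We\'re moving to L.A.!"'
doc_tok = nlp(mystring)
print([token.text for token in doc_tok])
# ['"', 'We', "'re", 'moving', 'to', 'L.A.', '!', '"']
```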
Prefixes, Suffixes and Infixes
However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.
```python
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")
for token in doc2:
    print(token.text)
```
Note that the exclamation points, comma, and the hyphen in ‘snail-mail’ are assigned their own tokens, yet both the email address and website are preserved.
Exceptions
Punctuation that exists as part of a known abbreviation will be kept as part of the token.
```python
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")
for token in doc4:
    print(token.text)
```
First token of output: `Let`
Here the abbreviations for “Saint” and “United States” are both preserved.
Counting Vocab Entries
`Vocab` objects contain a full library of items!
```python
len(doc.vocab)
```
NOTE: This number changes based on the language library loaded at the start, and any new lexemes introduced to the `vocab` when the `Doc` was created.
Tokens cannot be reassigned
Although `Doc` objects can be considered lists of tokens, they do not support item reassignment.
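A small sketch of what happens if you try (the sentence is illustrative, and the exact error text may vary by spaCy version):
```python
doc_demo = nlp(u"My dinner was horrible.")
try:
    # Attempt to overwrite the token "horrible" with another token
    doc_demo[3] = nlp(u"delicious")[0]
except TypeError as err:
    print(err)   # e.g. 'spacy.tokens.doc.Doc' object does not support item assignment
```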
Named Entities
Going a step beyond tokens, named entities add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.
```python
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')
for ent in doc8.ents:
    print(ent.text, '-', ent.label_, '-', str(spacy.explain(ent.label_)))
```
First line of output: `Apple - ORG - Companies, agencies, institutions, etc.`
Noun Chunks
Similar to `Doc.ents`, `Doc.noun_chunks` are another object property. Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in Sheb Wooley’s 1958 song, a “one-eyed, one-horned, flying, purple people-eater” would be one long noun chunk.
```python
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")
for chunk in doc11.noun_chunks:
    print(chunk.text)
```
Built-in Visualizers
spaCy includes a built-in visualization tool called displaCy. displaCy is able to detect whether you’re working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.
For more info visit https://spacy.io/usage/visualizers
```python
from spacy import displacy

doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
# Render the named entities inline in the notebook (style='ent' is assumed here; use style='dep' for the dependency parse)
displacy.render(doc, style='ent', jupyter=True)
```
Stemming
Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for “boat” might also return “boats” and “boating”. Here, “boat” would be the stem for [boat, boater, boating, boats].
spaCy doesn’t include a stemmer, opting instead to rely entirely on lemmatization, so for stemming we’ll use another popular NLP tool called NLTK.
Porter Stemmer
One of the most common - and effective - stemming tools is Porter’s Algorithm developed by Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set of mapping rules.
```python
# Import the toolkit and the full Porter Stemmer library
import nltk
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()

# Sample words (this list is assumed; chosen to match the words discussed below)
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']
for word in words:
    print(word + ' --> ' + p_stemmer.stem(word))
```
First line of output: `run --> run`
Note how the stemmer recognizes “runner” as a noun, not a verb form or participle. Also, the adverbs “easily” and “fairly” are stemmed to the unusual roots “easili” and “fairli”.
Snowball Stemmer
This is somewhat of a misnomer, as Snowball is the name of a stemming language developed by Martin Porter. The algorithm used here is more accurately called the “English Stemmer” or “Porter2 Stemmer”. It offers a slight improvement over the original Porter stemmer, both in logic and speed.
```python
from nltk.stem.snowball import SnowballStemmer

# The Snowball stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')

for word in words:   # re-using the word list from the Porter example above
    print(word + ' --> ' + s_stemmer.stem(word))
```
First line of output: `run --> run`
In this case the stemmer performed the same as the Porter Stemmer, with the exception that it handled the stem of “fairly” more appropriately, returning “fair”.
Stemming has its drawbacks. If given the token `saw`, stemming might always return `saw`, whereas lemmatization would likely return either `see` or `saw` depending on whether the use of the token was as a verb or a noun. As an example, consider the following:
```python
phrase = 'I am meeting him tomorrow at the meeting'
for word in phrase.split():
    print(word + ' --> ' + p_stemmer.stem(word))
```
First line of output: `I --> I`
Here the word “meeting” appears twice - once as a verb, and once as a noun, and yet the stemmer treats both equally.
Lemmatization
In contrast to stemming, lemmatization looks beyond word reduction, and considers a language’s full vocabulary to apply a morphological analysis to words. The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. Further, the lemma of ‘meeting’ might be ‘meet’ or ‘meeting’ depending on its use in a sentence.
```python
import spacy
nlp = spacy.load('en_core_web_sm')

# The example sentence is assumed; it is chosen to include "running", "run" and "ran" as discussed below
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")
for token in doc1:
    print(token.text, token.pos_, token.lemma, token.lemma_)
```
First line of output: `I PRON 561228191312463089 -PRON-`
In the above sentence, `running`, `run` and `ran` all point to the same lemma `run` (…11841) to avoid duplication.
Also notice that spaCy does not try to find the lemma of personal pronouns; instead it assigns them the same symbol `-PRON-`.
```python
# Helper that prints each token's text, POS tag, lemma hash and lemma (the formatting widths are an assumption)
def show_lemmas(text):
    for token in text:
        print(f'{token.text:<12} {token.pos_:<6} {token.lemma:<22} {token.lemma_}')

doc3 = nlp(u"I am meeting him tomorrow at the meeting.")
show_lemmas(doc3)
```
First line of output: `I PRON 561228191312463089 -PRON-`
Here the lemma of `meeting` is determined by its part of speech tag.
Stop Words
Words like “a” and “the” appear so frequently that they don’t require tagging as thoroughly as nouns, verbs and modifiers. We call these stop words, and they can be filtered from the text to be processed. spaCy holds a built-in list of some 305 English stop words.
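The default list and the per-word flag are exposed through the standard spaCy API, so you can inspect them directly:
```python
# The default English stop words live on the language class defaults
print(len(nlp.Defaults.stop_words))   # around 305, depending on the spaCy version

# Check whether individual words are stop words
print(nlp.vocab['the'].is_stop)       # True
print(nlp.vocab['mystery'].is_stop)   # False
```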
Add a stop word
There may be times when you wish to add a stop word to the default set. Perhaps you decide that `'btw'` (common shorthand for “by the way”) should be considered a stop word.
```python
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')

# Also set the stop_word flag on the lexeme itself
nlp.vocab['btw'].is_stop = True
```
When adding stop words, always use lowercase. Lexemes are converted to lowercase before being added to vocab.
Remove a stop word
Alternatively, you may decide that `'beyond'` should not be considered a stop word.
```python
# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('beyond')

# Remove the stop_word flag from the lexeme
nlp.vocab['beyond'].is_stop = False
```