Part of Speech Tagging

View token tags

  • To view the coarse POS tag use token.pos_
  • To view the fine-grained tag use token.tag_
  • To view the syntactic dependency use token.dep_
  • To view the description of either type of tag use spacy.explain(tag_)

token.pos and token.tag return integer hash values!

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a simple Doc object
doc = nlp(u"Apple is looking at buying U.K. startup for $1 Billion.")

for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')
Apple      PROPN    NNP    noun, proper singular
is         VERB     VBZ    verb, 3rd person singular present
looking    VERB     VBG    verb, gerund or present participle
at         ADP      IN     conjunction, subordinating or preposition
buying     VERB     VBG    verb, gerund or present participle
U.K.       PROPN    NNP    noun, proper singular
startup    NOUN     NN     noun, singular or mass
for        ADP      IN     conjunction, subordinating or preposition
$          SYM      $      symbol, currency
1          NUM      CD     cardinal number
Billion    NUM      CD     cardinal number
.          PUNCT    .      punctuation mark, sentence closer
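
As noted above, pos and tag (without the trailing underscore) return integer IDs rather than strings. A minimal sketch to see both side by side (the integer values vary by spaCy version, so they are shown here only as placeholders):

token = doc[0]                 # 'Apple'
print(token.pos, token.pos_)   # <integer ID> PROPN
print(token.tag, token.tag_)   # <64-bit hash> NNP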

Coarse-grained Part-of-speech Tags

Every token is assigned a POS Tag from the following list:

| POS   | DESCRIPTION               | EXAMPLES                                      |
|-------|---------------------------|-----------------------------------------------|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |
| ADP   | adposition                | in, to, during                                |
| ADV   | adverb                    | very, tomorrow, down, where, there            |
| AUX   | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ  | conjunction               | and, or, but                                  |
| CCONJ | coordinating conjunction  | and, or, but                                  |
| DET   | determiner                | a, an, the                                    |
| INTJ  | interjection              | psst, ouch, bravo, hello                      |
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |
| NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART  | particle                  | 's, not                                       |
| PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT | punctuation               | ., (, ), ?                                    |
| SCONJ | subordinating conjunction | if, while, that                               |
| SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝             |
| VERB  | verb                      | run, runs, running, eat, ate, eating          |
| X     | other                     | sfpksdpsxmsa                                  |
| SPACE | space                     |                                               |

Fine-grained Part-of-speech Tags

For a current list of tags for all languages visit https://spacy.io/api/annotation#pos-tagging

Working with POS Tags

In the English language, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important. spaCy uses machine learning algorithms to best predict the use of a token in a sentence. Is “I read books on NLP” present or past tense? Is “wind” a verb or a noun?

doc = nlp(u'I read books on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')
# read       VERB     VBP    verb, non-3rd person singular present

doc = nlp(u'I read a book on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')
# read       VERB     VBD    verb, past tense

In the first example, with no other cues to work from, spaCy assumed that read was present tense.

In the second example, the present-tense form would be I am reading a book, so spaCy assigned the past tense.

Counting POS Tags

The Doc.count_by() method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.

doc = nlp(u"Apple is looking at buying U.K. startup for $1 Billion.")

# Count the frequencies of different coarse-grained POS tags:
POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts
#{96: 1, 98: 1, 99: 3, 84: 2, 91: 1, 92: 2, 95: 2}

This isn’t very helpful until you decode the attribute ID:

doc.vocab[96].text
# 'PUNCT'
# Count the different fine-grained tags:
TAG_counts = doc.count_by(spacy.attrs.TAG)

for k, v in sorted(TAG_counts.items()):
    print(f'{k:<{23}} {doc.vocab[k].text:{4}}: {v:<{5}} {spacy.explain(doc.vocab[k].text)}')
1292078113972184607     IN  : 2     conjunction, subordinating or preposition
1534113631682161808     VBG : 2     verb, gerund or present participle
8427216679587749980     CD  : 2     cardinal number
11283501755624150392    $   : 1     symbol, currency
12646065887601541794    .   : 1     punctuation mark, sentence closer
13927759927860985106    VBZ : 1     verb, 3rd person singular present
15308085513773655218    NN  : 1     noun, singular or mass
15794550382381185553    NNP : 2     noun, proper singular

Why did the ID numbers get so big?

In spaCy, certain text values are hardcoded into Doc.vocab and take up the first several hundred ID numbers. Strings like ‘NOUN’ and ‘VERB’ are used frequently by internal operations. Others, like fine-grained tags, are assigned hash values as needed.
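
A minimal sketch of the contrast, looking both strings up in the StringStore (the small reserved ID varies by spaCy version; the VBZ hash matches the output above):

print(doc.vocab.strings[u'VERB'])   # a small reserved ID for the coarse POS label
print(doc.vocab.strings[u'VBZ'])    # a full hash value: 13927759927860985106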

Why don’t SPACE tags appear?

In spaCy, only strings of spaces (two or more) are assigned tokens. Single spaces are not.
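
A quick way to see this, reusing the nlp object from above (the run of two spaces becomes its own token, which should be tagged SPACE):

doc = nlp(u'hello  world')   # note the double space
for token in doc:
    print(f'{token.text!r:{10}} {token.pos_}')   # the quoted ' ' token should show SPACE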

Visualizing Parts of Speech

spaCy offers an outstanding visualizer called displaCy:

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Import the displaCy library
from spacy import displacy

# Create a simple Doc object
doc = nlp(u"A quick brown fox jumps over the lazy dog.")

# Render the dependency parse immediately inside Jupyter:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

(displaCy dependency parse of the sentence, rendered inline)

The dependency parse shows the coarse POS tag for each token, as well as the dependency tag if given:

for token in doc:
    print(f'{token.text:{10}} {token.pos_:{7}} {token.dep_:{7}} {spacy.explain(token.dep_)}')
A          DET     det     determiner
quick      ADJ     amod    adjectival modifier
brown      ADJ     amod    adjectival modifier
fox        NOUN    nsubj   nominal subject
jumps      VERB    ROOT    None
over       ADP     prep    prepositional modifier
the        DET     det     determiner
lazy       ADJ     amod    adjectival modifier
dog        NOUN    pobj    object of preposition
.          PUNCT   punct   punctuation

Creating Visualizations Outside of Jupyter

If you’re using another Python IDE or writing a script, you can choose to have spaCy serve up HTML separately.

Instead of displacy.render(), use displacy.serve():

displacy.serve(doc, style='dep', options={'distance': 110})

Handling Large Text

displacy.serve() accepts a single Doc or list of Doc objects. Since large texts are difficult to view in one line, you may want to pass a list of spans instead. Each span will appear on its own line:

doc2 = nlp(u"This is a sentence. This is another, possibly longer sentence.")

# Create spans from Doc.sents:
spans = list(doc2.sents)

displacy.serve(spans, style='dep', options={'distance': 110})

Customizing the Appearance

Besides setting the distance between tokens, you can pass other arguments to the options parameter:

| NAME    | TYPE    | DESCRIPTION                                                 | DEFAULT |
|---------|---------|-------------------------------------------------------------|---------|
| compact | bool    | “Compact mode” with square arrows that takes up less space. | False   |
| color   | unicode | Text color (HEX, RGB or color names).                       | #000000 |
| bg      | unicode | Background color (HEX, RGB or color names).                 | #ffffff |
| font    | unicode | Font name or font family for all text.                      | Arial   |

For a full list of options visit https://spacy.io/api/top-level#displacy_options
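
For example, a sketch combining several of these options (any valid CSS color values should work here):

options = {'distance': 110, 'compact': True, 'color': 'yellow', 'bg': '#09a3d5', 'font': 'Times'}
displacy.serve(doc, style='dep', options=options)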

Named Entity Recognition

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

doc = nlp("I am heading to New York City and will visit Statue of Liberty tomorrow")

show_ents(doc)
New York City - GPE - Countries, cities, states
Statue of Liberty - ORG - Companies, agencies, institutions, etc.
tomorrow - DATE - Absolute or relative dates or periods

Entity annotations

Doc.ents are token spans with their own set of annotations.

| Annotation     | Description                                          |
|----------------|------------------------------------------------------|
| ent.text       | The original entity text                             |
| ent.label      | The entity type’s hash value                         |
| ent.label_     | The entity type’s string description                 |
| ent.start      | The token span’s start index position in the Doc     |
| ent.end        | The token span’s stop index position in the Doc      |
| ent.start_char | The entity text’s start character offset in the Doc  |
| ent.end_char   | The entity text’s stop character offset in the Doc   |
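
A short illustration of these attributes (the example sentence is our own; the exact entities your model finds may differ):

doc = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')

for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)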

NER Tags

Tags are accessible through the .label_ property of an entity.

| TYPE        | DESCRIPTION                                           | EXAMPLE                                      |
|-------------|-------------------------------------------------------|----------------------------------------------|
| PERSON      | People, including fictional.                          | Fred Flintstone                              |
| NORP        | Nationalities or religious or political groups.       | The Republican Party                         |
| FAC         | Buildings, airports, highways, bridges, etc.          | Logan International Airport, The Golden Gate |
| ORG         | Companies, agencies, institutions, etc.               | Microsoft, FBI, MIT                          |
| GPE         | Countries, cities, states.                            | France, UAR, Chicago, Idaho                  |
| LOC         | Non-GPE locations, mountain ranges, bodies of water.  | Europe, Nile River, Midwest                  |
| PRODUCT     | Objects, vehicles, foods, etc. (Not services.)        | Formula 1                                    |
| EVENT       | Named hurricanes, battles, wars, sports events, etc.  | Olympic Games                                |
| WORK_OF_ART | Titles of books, songs, etc.                          | The Mona Lisa                                |
| LAW         | Named documents made into laws.                       | Roe v. Wade                                  |
| LANGUAGE    | Any named language.                                   | English                                      |
| DATE        | Absolute or relative dates or periods.                | 20 July 1969                                 |
| TIME        | Times smaller than a day.                             | Four hours                                   |
| PERCENT     | Percentage, including “%”.                            | Eighty percent                               |
| MONEY       | Monetary values, including unit.                      | Twenty Cents                                 |
| QUANTITY    | Measurements, as of weight or distance.               | Several kilometers, 55kg                     |
| ORDINAL     | “first”, “second”, etc.                               | 9th, Ninth                                   |
| CARDINAL    | Numerals that do not fall under another type.         | 2, Two, Fifty-two                            |
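
The same descriptions are available programmatically via spacy.explain():

print(spacy.explain('NORP'))   # Nationalities or religious or political groups
print(spacy.explain('FAC'))    # Buildings, airports, highways, bridges, etc.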

Adding a Named Entity to a Span

doc = nlp(u'Tesla is planning to build a new U.K. factory for $6 million')

show_ents(doc)
#U.K. - GPE - Countries, cities, states
#$6 million - MONEY - Monetary values, including unit

# doc.ents returns a tuple
type(doc.ents)
#tuple

# each element in the entities tuple is of type Span
type(doc.ents[0])
#spacy.tokens.span.Span

Right now, spaCy does not recognize “Tesla” as a company.

The method to add a named entity of your own is simple: create a Span in the document for the phrase you are interested in, give it the TYPE of entity you want (e.g. PERSON or ORG), and manually add it to the entities tuple.

from spacy.tokens import Span

# Get the hash value of the ORG entity label
ORG = doc.vocab.strings[u'ORG']
# ORG

# Create a Span for the new entity
new_ent = Span(doc, 0, 1, label=ORG)

# Add the entity to the existing Doc object
doc.ents = list(doc.ents) + [new_ent]

In the code above, the arguments passed to Span() are:

  • doc - the name of the Doc object
  • 0 - the start index position of the span
  • 1 - the stop index position (exclusive)
  • label=ORG - the label assigned to our entity

show_ents(doc)
Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit

Adding Named Entities to All Matching Spans

What if we want to tag all occurrences of “Tesla”? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc.

The process is slightly involved. The steps are:

(1) Build a simple list of phrases to match,

(2) Create a phrase matcher using vocab,

(3) Create phrase patterns and add them to the matcher,

(4) Apply the matcher to the doc, which finds the spans of matches,

(5) Use the found matches to create spans and add them to doc.ents as we did previously.

# Import PhraseMatcher and create a matcher object:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create a Doc in which to tag the matches:
doc = nlp(u"Our company plans to introduce a new vacuum cleaner. "
          u"If successful, the vacuum cleaner will be our first product.")

# Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]

# Notice the list comprehension above: each element is really a Doc object
type(phrase_patterns[0])
#spacy.tokens.doc.Doc

# Apply the patterns to our matcher object:
matcher.add('newproduct', None, *phrase_patterns)

# Apply the matcher to our Doc object:
matches = matcher(doc)

# See what matches occur:
matches
#[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]
# Here we create Spans from each match, and create named entities from them:
from spacy.tokens import Span

PROD = doc.vocab.strings[u'PRODUCT']

new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]

doc.ents = list(doc.ents) + new_ents

show_ents(doc)
vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
first - ORDINAL - "first", "second", etc.

Counting Entities

While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:

doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')

len([ent for ent in doc.ents if ent.label_=='MONEY'])
#2
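
If you want counts for every entity label at once, collections.Counter from the standard library works well (a sketch, not a built-in spaCy tool):

from collections import Counter

Counter(ent.label_ for ent in doc.ents)
#Counter({'MONEY': 2})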

Problem with Line Breaks

There's a known issue with spaCy v2.0.12 where some linebreaks are interpreted as `GPE` entities:

doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')

show_ents(doc)
29.50 - MONEY - Monetary values, including unit

- GPE - Countries, cities, states
five dollars - MONEY - Monetary values, including unit

However, there is a simple fix that can be added to the nlp pipeline:

# Quick function to remove ents formed on whitespace:
def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

# Insert this into the pipeline AFTER the ner component:
nlp.add_pipe(remove_whitespace_entities, after='ner')

# Rerun nlp on the text above, and show ents:
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')

show_ents(doc)
29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit

Noun Chunks

Doc.noun_chunks are base noun phrases: token spans that include the noun and words describing the noun. Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.

Where Doc.ents rely on the ner pipeline component, Doc.noun_chunks are provided by the parser.
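
You can confirm that the parser component is available (component names here assume a freshly loaded v2 en_core_web_sm model):

nlp.pipe_names
#['tagger', 'parser', 'ner']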

noun_chunks components:

| Component       | Description                                                                    |
|-----------------|--------------------------------------------------------------------------------|
| .text           | The original noun chunk text.                                                  |
| .root.text      | The original text of the word connecting the noun chunk to the rest of the parse. |
| .root.dep_      | Dependency relation connecting the root to its head.                           |
| .root.head.text | The text of the root token’s head.                                             |

doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc.noun_chunks:
    print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)
Autonomous cars - cars - nsubj - shift
insurance liability - liability - dobj - shift
manufacturers - manufacturers - pobj - toward

Doc.noun_chunks is a generator function

Previously we mentioned that Doc objects do not retain a list of sentences, but they’re available through the Doc.sents generator.

It’s the same with Doc.noun_chunks - lists can be created if needed:

len(list(doc.noun_chunks))
#3

Visualizing Named Entities

Besides viewing Part of Speech dependencies with style='dep', displaCy offers a style='ent' visualizer:

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Import the displaCy library
from spacy import displacy

doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
          u'By contrast, Sony sold only 7 thousand Walkman music players.')

displacy.render(doc, style='ent', jupyter=True)
Over the last quarter DATE Apple ORG sold nearly 20 thousand CARDINAL iPods PRODUCT for a profit of $6 million MONEY . By contrast, Sony ORG sold only 7 thousand CARDINAL Walkman PRODUCT music players.

Viewing Sentences Line by Line

Unlike the displaCy dependency parse, the NER viewer has to take in a Doc object with an ents attribute. For this reason, we can’t just pass a list of spans to .render(), we have to create a new Doc from each span.text:

for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)

If a span does not contain any entities, displaCy will issue a harmless warning.

Viewing Specific Entities

You can pass a list of entity types to restrict the visualization:

options = {'ents': ['ORG', 'PRODUCT']}

displacy.render(doc, style='ent', jupyter=True, options=options)
Over the last quarter Apple ORG sold nearly 20 thousand iPods PRODUCT for a profit of $6 million. By contrast, Sony ORG sold only 7 thousand Walkman PRODUCT music players.

Customizing Colors and Effects

You can also pass background color and gradient options:

#colors = {'ORG': 'orange', 'PRODUCT':'yellow'}
colors = {'ORG': 'linear-gradient(90deg, orange, green)', 'PRODUCT': 'radial-gradient(pink, purple)'}
#colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT': 'radial-gradient(yellow, green)'}

options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}

displacy.render(doc, style='ent', jupyter=True, options=options)
Over the last quarter Apple ORG sold nearly 20 thousand iPods PRODUCT for a profit of $6 million. By contrast, Sony ORG sold only 7 thousand Walkman PRODUCT music players.

Sentence Segmentation

In this section we’ll learn how sentence segmentation works, and how to set our own segmentation rules.

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# From Spacy Basics:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc.sents:
    print(sent)
This is the first sentence.
This is another sentence.
This is the last sentence.

Doc.sents is a generator

It is important to note that doc.sents is a generator. That is, a Doc is not segmented until doc.sents is called. This means that, where you could print the second Doc token with print(doc[1]), you can’t call the “second Doc sentence” with print(doc.sents[1]):
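
A quick sketch of the difference (the exact TypeError message may vary by Python version):

print(doc[1])        # fine: a Doc supports token indexing
print(doc.sents[1])  # TypeError: 'generator' object is not subscriptable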

Convert doc.sents to a list

list(doc.sents)

sents are spans

type(list(doc.sents)[0])
#spacy.tokens.span.Span

You can build a sentence collection by running doc.sents and saving the result to a list.

Adding Rules for sentence segmentation

spaCy’s built-in sentencizer relies on the dependency parse and end-of-sentence punctuation to determine segmentation rules. We can add rules of our own, but they have to be added before the Doc object is created, because segment start tokens are set during the nlp pipeline run:

# Parsing the segmentation start tokens happens during the nlp pipeline
doc2 = nlp(u'This is a sentence. This is a sentence. This is a sentence.')

for token in doc2:
    print(token.is_sent_start, ' '+token.text)
None  This
None  is
None  a
None  sentence
None  .
True  This
None  is
None  a
None  sentence
None  .
True  This
None  is
None  a
None  sentence
None  .

Notice we haven’t run doc2.sents, and yet token.is_sent_start was set to True on two tokens in the Doc.

Let’s add a semicolon to our existing segmentation rules. That is, whenever the sentencizer encounters a semicolon (;), the next token should start a new segment.

# SPACY'S DEFAULT BEHAVIOR
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc3.sents:
    print(sent)
#"Management is doing things right; leadership is doing the right things."
#-Peter Drucker

# ADD A NEW RULE TO THE PIPELINE
# Every token in a Doc object maintains a fixed index position,
# which does not change. We will take advantage of that.

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before='parser')

nlp.pipe_names
#['tagger', 'set_custom_boundaries', 'parser', 'ner']

The new rule has to run before the document is parsed. Here we can either pass the argument before='parser' or first=True.

# Re-run the Doc object creation:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc4.sents:
    print(sent)
"Management is doing things right;
leadership is doing the right things."
-Peter Drucker

# And yet the new rule doesn't apply to the older Doc object:
for sent in doc3.sents:
    print(sent)
#"Management is doing things right; leadership is doing the right things."
#-Peter Drucker

Why not change the token directly?

Why not simply set the .is_sent_start value to True on existing tokens?

Because spaCy refuses to change the tag after the document is parsed to prevent inconsistencies in the data.
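
A sketch of what happens if you try (the exact error text differs across spaCy versions; E043 is the v2 error code):

doc3[7].is_sent_start = True   # the token after the semicolon
# ValueError: [E043] Refusing to write to token.sent_start if its document is parsed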

Changing the Rules

In some cases we want to replace spaCy’s default sentencizer with our own set of rules. In this section we’ll see how the default sentencizer breaks on periods. We’ll then replace this behavior with a sentencizer that breaks on linebreaks.

mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."

# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)

for sent in doc.sents:
    print([token.text for token in sent])
['This', 'is', 'a', 'sentence', '.']
['This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n', 'third', 'sentence', '.']
# CHANGING THE RULES
from spacy.pipeline import SentenceSegmenter

def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'):  # handles multiple occurrences
            seen_newline = True
    yield doc[start:]  # handles the last group of tokens


sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)
nlp.pipe_names
#['tagger', 'parser', 'ner', 'sbd']
doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])
['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n']
['third', 'sentence', '.']

Here we see that periods no longer affect segmentation, only linebreaks do. This would be appropriate when working with a long list of tweets, for instance.