Part of Speech Tagging

View token tags

To view the coarse POS tag use token.pos_
To view the fine-grained tag use token.tag_
To view the syntactic dependency use token.dep_
To view the description of either type of tag use spacy.explain(tag_)

token.pos and token.tag return integer hash values !

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a simple Doc object
doc = nlp(u"Apple is looking at buying U.K. startup for $1 Billion.")

for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

Apple      PROPN    NNP    noun, proper singular
is         VERB     VBZ    verb, 3rd person singular present
looking    VERB     VBG    verb, gerund or present participle
at         ADP      IN     conjunction, subordinating or preposition
buying     VERB     VBG    verb, gerund or present participle
U.K.       PROPN    NNP    noun, proper singular
startup    NOUN     NN     noun, singular or mass
for        ADP      IN     conjunction, subordinating or preposition
$          SYM      $      symbol, currency
1          NUM      CD     cardinal number
Billion    NUM      CD     cardinal number
.          PUNCT    .      punctuation mark, sentence closer

Coarse-grained Part-of-speech Tags

Every token is assigned a POS Tag from the following list:

POS	DESCRIPTION	EXAMPLES
ADJ	adjective	big, old, green, incomprehensible, first
ADP	adposition	in, to, during
ADV	adverb	very, tomorrow, down, where, there
AUX	auxiliary	is, has (done), will (do), should (do)
CONJ	conjunction	and, or, but
CCONJ	coordinating conjunction	and, or, but
DET	determiner	a, an, the
INTJ	interjection	psst, ouch, bravo, hello
NOUN	noun	girl, cat, tree, air, beauty
NUM	numeral	1, 2017, one, seventy-seven, IV, MMXIV
PART	particle	‘s, not,
PRON	pronoun	I, you, he, she, myself, themselves, somebody
PROPN	proper noun	Mary, John, London, NATO, HBO
PUNCT	punctuation	., (, ), ?
SCONJ	subordinating conjunction	if, while, that
SYM	symbol	$, %, §, ©, +, −, ×, ÷, =, :), 😝
VERB	verb	run, runs, running, eat, ate, eating
X	other	sfpksdpsxmsa
SPACE	space

Fine-grained Part-of-speech Tags

For a current list of tags for all languages visit https://spacy.io/api/annotation#pos-tagging

Working with POS Tags

In the English language, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important. spaCy uses machine learning algorithms to best predict the use of a token in a sentence. Is “I read books on NLP” present or past tense? Is wind a verb or a noun?

doc = nlp(u'I read books on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')
#read       VERB     VBP    verb, non-3rd person singular present

doc = nlp(u'I read a book on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')
#read       VERB     VBD    verb, past tense

In the first example, with no other cues to work from, spaCy assumed that read was present tense.

In the second example the present tense form would be I am reading a book, so spaCy assigned the past tense.

Counting POS Tags

The Doc.count_by() method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.

doc = nlp(u"Apple is looking at buying U.K. startup for $1 Billion.")

# Count the frequencies of different coarse-grained POS tags:
POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts
#{96: 1, 98: 1, 99: 3, 84: 2, 91: 1, 92: 2, 95: 2}

This isn’t very helpful until you decode the attribute ID:

doc.vocab[96].text
# 'PUNCT'

# Count the different fine-grained tags:
TAG_counts = doc.count_by(spacy.attrs.TAG)

for k,v in sorted(TAG_counts.items()):
    print(f'{k:<{23}} {doc.vocab[k].text:{4}}: {v:<{5}} {spacy.explain(doc.vocab[k].text)}')

1292078113972184607     IN  : 2     conjunction, subordinating or preposition
1534113631682161808     VBG : 2     verb, gerund or present participle
8427216679587749980     CD  : 2     cardinal number
11283501755624150392    $   : 1     symbol, currency
12646065887601541794    .   : 1     punctuation mark, sentence closer
13927759927860985106    VBZ : 1     verb, 3rd person singular present
15308085513773655218    NN  : 1     noun, singular or mass
15794550382381185553    NNP : 2     noun, proper singular

Why did the ID numbers get so big?

In spaCy, certain text values are hardcoded into Doc.vocab and take up the first several hundred ID numbers. Strings like ‘NOUN’ and ‘VERB’ are used frequently by internal operations. Others, like fine-grained tags, are assigned hash values as needed.

Why don’t SPACE tags appear?

In spaCy, only strings of spaces (two or more) are assigned tokens. Single spaces are not.

Visualizing Parts of Speech

spaCy offers an outstanding visualizer called displaCy:

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Import the displaCy library
from spacy import displacy

# Create a simple Doc object
doc = nlp(u"A quick brown fox jumps over the lazy dog.")

# Render the dependency parse immediately inside Jupyter:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

The dependency parse shows the coarse POS tag for each token, as well as the dependency tag if given:

for token in doc:
    print(f'{token.text:{10}} {token.pos_:{7}} {token.dep_:{7}} {spacy.explain(token.dep_)}')

A          DET     det     determiner
quick      ADJ     amod    adjectival modifier
brown      ADJ     amod    adjectival modifier
fox        NOUN    nsubj   nominal subject
jumps      VERB    ROOT    None
over       ADP     prep    prepositional modifier
the        DET     det     determiner
lazy       ADJ     amod    adjectival modifier
dog        NOUN    pobj    object of preposition
.          PUNCT   punct   punctuation

Creating Visualizations Outside of Jupyter

If you’re using another Python IDE or writing a script, you can choose to have spaCy serve up HTML separately.

Instead of displacy.render(), use displacy.serve():

displacy.serve(doc, style='dep', options={'distance': 110})

Handling Large Text

displacy.serve() accepts a single Doc or list of Doc objects. Since large texts are difficult to view in one line, you may want to pass a list of spans instead. Each span will appear on its own line:

doc2 = nlp(u"This is a sentence. This is another, possibly longer sentence.")

# Create spans from Doc.sents:
spans = list(doc2.sents)

displacy.serve(spans, style='dep', options={'distance': 110})

Customizing the Appearance

Besides setting the distance between tokens, you can pass other arguments to the options parameter:

NAME	TYPE	DESCRIPTION	DEFAULT
`compact`	bool	“Compact mode” with square arrows that takes up less space.	`False`
`color`	unicode	Text color (HEX, RGB or color names).	`#000000`
`bg`	unicode	Background color (HEX, RGB or color names).	`#ffffff`
`font`	unicode	Font name or font family for all text.	`Arial`

For a full list of options visit https://spacy.io/api/top-level#displacy_options

Named Entity Recognition

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')
        
doc = nlp("I am heading to New York City and will visit Statue of Liberty tomorrow")

show_ents(doc)

New York City - GPE - Countries, cities, states
Statue of Liberty - ORG - Companies, agencies, institutions, etc.
tomorrow - DATE - Absolute or relative dates or periods

Entity annotations

Doc.ents are token spans with their own set of annotations.

	Annotation
`ent.text`	The original entity text
`ent.label`	The entity type’s hash value
`ent.label_`	The entity type’s string description
`ent.start`	The token span’s start index position in the Doc
`ent.end`	The token span’s stop index position in the Doc
`ent.start_char`	The entity text’s start index position in the Doc
`ent.end_char`	The entity text’s stop index position in the Doc

NER Tags

Tags are accessible through the .label_ property of an entity.

TYPE	DESCRIPTION	EXAMPLE
`PERSON`	People, including fictional.	Fred Flintstone
`NORP`	Nationalities or religious or political groups.	The Republican Party
`FAC`	Buildings, airports, highways, bridges, etc.	Logan International Airport, The Golden Gate
`ORG`	Companies, agencies, institutions, etc.	Microsoft, FBI, MIT
`GPE`	Countries, cities, states.	France, UAR, Chicago, Idaho
`LOC`	Non-GPE locations, mountain ranges, bodies of water.	Europe, Nile River, Midwest
`PRODUCT`	Objects, vehicles, foods, etc. (Not services.)	Formula 1
`EVENT`	Named hurricanes, battles, wars, sports events, etc.	Olympic Games
`WORK_OF_ART`	Titles of books, songs, etc.	The Mona Lisa
`LAW`	Named documents made into laws.	Roe v. Wade
`LANGUAGE`	Any named language.	English
`DATE`	Absolute or relative dates or periods.	20 July 1969
`TIME`	Times smaller than a day.	Four hours
`PERCENT`	Percentage, including “%”.	Eighty percent
`MONEY`	Monetary values, including unit.	Twenty Cents
`QUANTITY`	Measurements, as of weight or distance.	Several kilometers, 55kg
`ORDINAL`	“first”, “second”, etc.	9th, Ninth
`CARDINAL`	Numerals that do not fall under another type.	2, Two, Fifty-two

Adding a Named Entity to a Span

doc = nlp(u'Tesla is planning build a new U.K. factory for $6 million')

show_ents(doc)
#U.K. - GPE - Countries, cities, states
#$6 million - MONEY - Monetary values, including unit

# doc.ents returns a tuple
type(doc.ents)
#tuple

# each elementin the entities tuple is of type Span
type(doc.ents[0])
#spacy.tokens.span.Span

Right now, spaCy does not recognize “Tesla” as a company.

The method to add a named entity of your own is simple. Create a span in document for the phrase you are interested in with the TYPE of entity you want (i.e. PERSON, or ORG) or and manually add it to the entities tuple.

from spacy.tokens import Span

# Get the hash value of the ORG entity label
ORG = doc.vocab.strings[u'ORG']  
# ORG 

# Create a Span for the new entity
new_ent = Span(doc, 0, 1, label=ORG)

# Add the entity to the existing Doc object
doc.ents = list(doc.ents) + [new_ent]

In the code above, the arguments passed to Span() are:

doc - the name of the Doc object
0 - the start index position of the span
1 - the stop index position (exclusive)
label=ORG - the label assigned to our entity

show_ents(doc)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit

Adding Named Entities to All Matching Spans

What if we want to tag all occurrences of “Tesla”? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc

The process is slghtly involved. The steps are:

(1) build a simple list of phrases to match,

(2) Create a phrase matcher using vocab,

(3) Create phrase patterns and add them to the matcher,

(4) Apply matcher to the doc, which finds spans of matches,

(5) use the found matches to create spans and add to the doc.ents as we did previously

# Import PhraseMatcher and create a matcher object:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

# Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]

# notice the use of list comprehension above, each element is really a doc object
type(phrase_patterns[0])
#spacy.tokens.doc.Doc

# Apply the patterns to our matcher object:
matcher.add('newproduct', None, *phrase_patterns)

# Apply the matcher to our Doc object:
matches = matcher(doc)

# See what matches occur:
matches
#[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]

# Here we create Spans from each match, and create named entities from them:
from spacy.tokens import Span

PROD = doc.vocab.strings[u'PRODUCT']

new_ents = [Span(doc, match[1],match[2],label=PROD) for match in matches]

doc.ents = list(doc.ents) + new_ents

show_ents(doc)

vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
first - ORDINAL - "first", "second", etc.

Counting Entities

While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:

doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')

len([ent for ent in doc.ents if ent.label_=='MONEY'])
#2

Problem with Line Breaks

There's a known issue with spaCy v2.0.12 where some linebreaks are interpreted as `GPE` entities:

doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')

show_ents(doc)

29.50 - MONEY - Monetary values, including unit

 - GPE - Countries, cities, states
five dollars - MONEY - Monetary values, including unit

However, there is a simple fix that can be added to the nlp pipeline:

# Quick function to remove ents formed on whitespace:
def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

# Insert this into the pipeline AFTER the ner component:
nlp.add_pipe(remove_whitespace_entities, after='ner')

# Rerun nlp on the text above, and show ents:
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')

show_ents(doc)

29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit

Noun Chunks

Doc.noun_chunks are base noun phrases: token spans that include the noun and words describing the noun. Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.

Where Doc.ents rely on the ner pipeline component, Doc.noun_chunks are provided by the parser.

`noun_chunks` components:

	Description
`.text`	The original noun chunk text.
`.root.text`	The original text of the word connecting the noun chunk to the rest of the parse.
`.root.dep_`	Dependency relation connecting the root to its head.
`.root.head.text`	The text of the root token’s head.

doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc.noun_chunks:
    print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)

Autonomous cars - cars - nsubj - shift
insurance liability - liability - dobj - shift
manufacturers - manufacturers - pobj - toward

`Doc.noun_chunks` is a generator function

Previously we mentioned that Doc objects do not retain a list of sentences, but they’re available through the Doc.sents generator.

It’s the same with Doc.noun_chunks - lists can be created if needed:

len(list(doc.noun_chunks))
#3

Visualizing Named Entities

Besides viewing Part of Speech dependencies with style='dep', displaCy offers a style='ent' visualizer:

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Import the displaCy library
from spacy import displacy

doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, Sony sold only 7 thousand Walkman music players.')

displacy.render(doc, style='ent', jupyter=True)

Over the last quarter DATE Apple ORG sold nearly 20 thousand CARDINAL iPods PRODUCT for a profit of $6 million MONEY . By contrast, Sony ORG sold only 7 thousand CARDINAL Walkman PRODUCT music players.

Viewing Sentences Line by Line

Unlike the displaCy dependency parse, the NER viewer has to take in a Doc object with an ents attribute. For this reason, we can’t just pass a list of spans to .render(), we have to create a new Doc from each span.text:

for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)

If a span does not contain any entities, displaCy will issue a harmless warning

Viewing Specific Entities

You can pass a list of entity types to restrict the visualization:

options = {'ents': ['ORG', 'PRODUCT']}

displacy.render(doc, style='ent', jupyter=True, options=options)

Over the last quarter Apple ORG sold nearly 20 thousand iPods PRODUCT for a profit of $6 million. By contrast, Sony ORG sold only 7 thousand Walkman PRODUCT music players.

Customizing Colors and Effects

You can also pass background color and gradient options:

#colors = {'ORG': 'orange', 'PRODUCT':'yellow'}
colors = {'ORG': 'Linear-gradient(90deg, orange, green)', 'PRODUCT': 'radial-gradient(pink, purple)'}
#colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT': 'radial-gradient(yellow, green)'}

options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}

displacy.render(doc, style='ent', jupyter=True, options=options)

Over the last quarter Apple ORG sold nearly 20 thousand iPods PRODUCT for a profit of $6 million. By contrast, Sony ORG sold only 7 thousand Walkman PRODUCT music players.

Sentence Segmentation

In this section we’ll learn how sentence segmentation works, and how to set our own segmentation rules.

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# From Spacy Basics:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.

`Doc.sents` is a generator

It is important to note that doc.sents is a generator. That is, a Doc is not segmented until doc.sents is called. This means that, where you could print the second Doc token with print(doc[1]), you can’t call the “second Doc sentence” with print(doc.sents[1]):

Convert `doc.sents` to a list

list(doc.sents)

`sents` are spans

type(list(doc.sents)[0])
#spacy.tokens.span.Span

You can build a sentence collection by running doc.sents and saving the result to a list.

Adding Rules for sentence segmentation

spaCy’s built-in sentencizer relies on the dependency parse and end-of-sentence punctuation to determine segmentation rules. We can add rules of our own, but they have to be added before the creation of the Doc object, as that is where the parsing of segment start tokens happens:

# Parsing the segmentation start tokens happens during the nlp pipeline
doc2 = nlp(u'This is a sentence. This is a sentence. This is a sentence.')

for token in doc2:
    print(token.is_sent_start, ' '+token.text)

None  This
None  is
None  a
None  sentence
None  .
True  This
None  is
None  a
None  sentence
None  .
True  This
None  is
None  a
None  sentence
None  .

Notice we haven’t run doc2.sents, and yet token.is_sent_start was set to True on two tokens in the Doc.

Let’s add a semicolon to our existing segmentation rules. That is, whenever the sentencizer encounters a semicolon(;), the next token should start a new segment.

# SPACY'S DEFAULT BEHAVIOR
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc3.sents:
    print(sent)
#"Management is doing things right; leadership is doing the right things."
#-Peter Drucker

# ADD A NEW RULE TO THE PIPELINE
# Every token in document object maintains a fixed index position, 
# which does not change. We will take advantage of that.

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before='parser')

nlp.pipe_names
#['tagger', 'set_custom_boundaries', 'parser', 'ner']

The new rule has to run before the document is parsed. Here we can either pass the argument before='parser' or first=True.

# Re-run the Doc object creation:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc4.sents:
    print(sent)

"Management is doing things right;
leadership is doing the right things."
-Peter Drucker

# And yet the new rule doesn't apply to the older Doc object:
for sent in doc3.sents:
    print(sent)
#"Management is doing things right; leadership is doing the right things."
#-Peter Drucker

Why not change the token directly?

Why not simply set the .is_sent_start value to True on existing tokens?

Because spaCy refuses to change the tag after the document is parsed to prevent inconsistencies in the data.

Changing the Rules

In some cases we want to replace spaCy’s default sentencizer with our own set of rules. In this section we’ll see how the default sentencizer breaks on periods. We’ll then replace this behavior with a sentencizer that breaks on linebreaks.

mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."

# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)

for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'sentence', '.']
['This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n', 'third', 'sentence', '.']

# CHANGING THE RULES
from spacy.pipeline import SentenceSegmenter

def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'): # handles multiple occurrences
            seen_newline = True
    yield doc[start:]      # handles the last group of tokens


sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)
nlp.pipe_names
#['tagger', 'parser', 'ner', 'sbd']

doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n']
['third', 'sentence', '.']

Here we see that periods no longer affect segmentation, only linebreaks do. This would be appropriate when working with a long list of tweets, for instance.