To view the description of either type of tag, pass it to spacy.explain(), e.g. spacy.explain(token.tag_).
Note that token.pos and token.tag (without the trailing underscore) return integer IDs; token.pos_ and token.tag_ return the human-readable strings.
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')
# Create a simple Doc object
doc = nlp(u"Apple is looking at buying U.K. startup for $1 Billion.")
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')
Apple      PROPN    NNP    noun, proper singular
is         VERB     VBZ    verb, 3rd person singular present
looking    VERB     VBG    verb, gerund or present participle
at         ADP      IN     conjunction, subordinating or preposition
buying     VERB     VBG    verb, gerund or present participle
U.K.       PROPN    NNP    noun, proper singular
startup    NOUN     NN     noun, singular or mass
for        ADP      IN     conjunction, subordinating or preposition
$          SYM      $      symbol, currency
1          NUM      CD     cardinal number
Billion    NUM      CD     cardinal number
.          PUNCT    .      punctuation mark, sentence closer
Coarse-grained Part-of-speech Tags
Every token is assigned a coarse-grained POS tag from the following list: ADJ (adjective), ADP (adposition), ADV (adverb), AUX (auxiliary), CONJ (conjunction), CCONJ (coordinating conjunction), DET (determiner), INTJ (interjection), NOUN (noun), NUM (numeral), PART (particle), PRON (pronoun), PROPN (proper noun), PUNCT (punctuation), SCONJ (subordinating conjunction), SYM (symbol), VERB (verb), X (other), and SPACE (space).
In English, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important. spaCy uses machine learning algorithms to best predict the use of a token in a sentence. Is "I read books on NLP" present or past tense? Is "wind" a verb or a noun?
doc = nlp(u'I read books on NLP.')
r = doc[1]
print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')
# read       VERB     VBP    verb, non-3rd person singular present

doc = nlp(u'I read a book on NLP.')
r = doc[1]
print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')
# read       VERB     VBD    verb, past tense
In the first example, with no other cues to work from, spaCy assumed that "read" was present tense.
In the second example, the present-tense form would normally be "I am reading a book", so spaCy assigned the past tense.
Counting POS Tags
The Doc.count_by() method accepts a specific token attribute as its argument and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer IDs of the attribute values, and dictionary values are their frequencies. Counts of zero are not included.
doc = nlp(u"Apple is looking at buying U.K. startup for $1 Billion.")
# Count the frequencies of different coarse-grained POS tags:
POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts
# {96: 1, 98: 1, 99: 3, 84: 2, 91: 1, 92: 2, 95: 2}
This isn’t very helpful until you decode the attribute ID:
doc.vocab[96].text
# 'PUNCT'
# Count the different fine-grained tags:
TAG_counts = doc.count_by(spacy.attrs.TAG)
for k, v in sorted(TAG_counts.items()):
    print(f'{k:<{23}} {doc.vocab[k].text:{4}}: {v:<{5}} {spacy.explain(doc.vocab[k].text)}')
1292078113972184607     IN  : 2     conjunction, subordinating or preposition
1534113631682161808     VBG : 2     verb, gerund or present participle
8427216679587749980     CD  : 2     cardinal number
11283501755624150392    $   : 1     symbol, currency
12646065887601541794    .   : 1     punctuation mark, sentence closer
13927759927860985106    VBZ : 1     verb, 3rd person singular present
15308085513773655218    NN  : 1     noun, singular or mass
15794550382381185553    NNP : 2     noun, proper singular
Why did the ID numbers get so big?
In spaCy, certain text values are hardcoded into Doc.vocab and take up the first several hundred ID numbers. Strings like ‘NOUN’ and ‘VERB’ are used frequently by internal operations. Others, like fine-grained tags, are assigned hash values as needed.
Why don’t SPACE tags appear?
In spaCy, only strings of two or more consecutive spaces are assigned their own tokens (tagged SPACE). Single spaces between words are not tokens at all.
Visualizing Parts of Speech
spaCy offers an outstanding visualizer called displaCy:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')
# Import the displaCy library
from spacy import displacy
# Create a simple Doc object
doc = nlp(u"A quick brown fox jumps over the lazy dog.")
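To draw the parse inline in a Jupyter notebook, call displacy.render() with style='dep' (the distance option below is an illustrative assumption, used here only to compress the arcs):

displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})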
The dependency parse shows the coarse POS tag for each token, as well as the dependency tag if given:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{7}} {token.dep_:{7}} {spacy.explain(token.dep_)}')
A          DET     det     determiner
quick      ADJ     amod    adjectival modifier
brown      ADJ     amod    adjectival modifier
fox        NOUN    nsubj   nominal subject
jumps      VERB    ROOT    None
over       ADP     prep    prepositional modifier
the        DET     det     determiner
lazy       ADJ     amod    adjectival modifier
dog        NOUN    pobj    object of preposition
.          PUNCT   punct   punctuation
Creating Visualizations Outside of Jupyter
If you’re using another Python IDE or writing a script, you can choose to have spaCy serve up HTML separately.
Instead of displacy.render(), use displacy.serve():
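A minimal sketch (displaCy serves on port 5000 by default):

displacy.serve(doc, style='dep')
# Navigate to http://127.0.0.1:5000 in your browser; press Ctrl-C to shut down the server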
displacy.serve() accepts a single Doc or list of Doc objects. Since large texts are difficult to view in one line, you may want to pass a list of spans instead. Each span will appear on its own line:
doc2 = nlp(u"This is a sentence. This is another, possibly longer sentence.")
# Create spans from Doc.sents:
spans = list(doc2.sents)
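Then pass the list of spans to serve; each span renders on its own line, as described above:

displacy.serve(spans, style='dep')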
Named Entity Recognition

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')
# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

doc = nlp("I am heading to New York City and will visit Statue of Liberty tomorrow")
show_ents(doc)
New York City - GPE - Countries, cities, states
Statue of Liberty - ORG - Companies, agencies, institutions, etc.
tomorrow - DATE - Absolute or relative dates or periods
Entity annotations
Doc.ents are token spans with their own set of annotations.
Annotation      Description
ent.text        The original entity text
ent.label       The entity type's hash value
ent.label_      The entity type's string description
ent.start       The token span's start index position in the Doc
ent.end         The token span's stop index position in the Doc
ent.start_char  The entity text's start index position in the Doc
ent.end_char    The entity text's stop index position in the Doc
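A quick way to see these annotations together, using the Doc from the example above:

for ent in doc.ents:
    # Print each entity's text, token-span indexes, character offsets, and label
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)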
NER Tags
Tags are accessible through the .label_ property of an entity.
TYPE         DESCRIPTION                                            EXAMPLE
PERSON       People, including fictional.                           Fred Flintstone
NORP         Nationalities or religious or political groups.        The Republican Party
FAC          Buildings, airports, highways, bridges, etc.           Logan International Airport, The Golden Gate
ORG          Companies, agencies, institutions, etc.                Microsoft, FBI, MIT
GPE          Countries, cities, states.                             France, UAR, Chicago, Idaho
LOC          Non-GPE locations, mountain ranges, bodies of water.   Europe, Nile River, Midwest
PRODUCT      Objects, vehicles, foods, etc. (Not services.)         Formula 1
EVENT        Named hurricanes, battles, wars, sports events, etc.   Olympic Games
WORK_OF_ART  Titles of books, songs, etc.                           The Mona Lisa
LAW          Named documents made into laws.                        Roe v. Wade
LANGUAGE     Any named language.                                    English
DATE         Absolute or relative dates or periods.                 20 July 1969
TIME         Times smaller than a day.                              Four hours
PERCENT      Percentage, including "%".                             Eighty percent
MONEY        Monetary values, including unit.                       Twenty Cents
QUANTITY     Measurements, as of weight or distance.                Several kilometers, 55kg
ORDINAL      "first", "second", etc.                                9th, Ninth
CARDINAL     Numerals that do not fall under another type.          2, Two, Fifty-two
Adding a Named Entity to a Span
doc = nlp(u'Tesla is planning to build a new U.K. factory for $6 million')
show_ents(doc)
# U.K. - GPE - Countries, cities, states
# $6 million - MONEY - Monetary values, including unit
# doc.ents returns a tuple
type(doc.ents)
# tuple
# Each element in the entities tuple is of type Span
type(doc.ents[0])
# spacy.tokens.span.Span
Right now, spaCy does not recognize “Tesla” as a company.
The method to add a named entity of your own is simple: create a Span in the document for the phrase you are interested in, give it the TYPE of entity you want (e.g. PERSON or ORG), and manually add it to the entities tuple.
from spacy.tokens import Span
# Get the hash value of the ORG entity label
ORG = doc.vocab.strings[u'ORG']
# Create a Span for the new entity
new_ent = Span(doc, 0, 1, label=ORG)
# Add the entity to the existing Doc object
doc.ents = list(doc.ents) + [new_ent]
In the code above, the arguments passed to Span() are:
doc - the name of the Doc object
0 - the start index position of the span
1 - the stop index position (exclusive)
label=ORG - the label assigned to our entity
show_ents(doc)
Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
Adding Named Entities to All Matching Spans
What if we want to tag all occurrences of "Tesla"? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc.
The process is slightly involved. The steps are:
(1) build a simple list of phrases to match,
(2) create a PhraseMatcher object from the model's vocab,
(3) create phrase patterns and add them to the matcher,
(4) apply the matcher to the Doc, which finds the spans of any matches,
(5) use the found matches to create Spans and add them to doc.ents, as we did previously.
# Import PhraseMatcher and create a matcher object:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
# Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]
# Note the use of a list comprehension above; each element is really a Doc object
type(phrase_patterns[0])
# spacy.tokens.doc.Doc
# Apply the patterns to our matcher object:
matcher.add('newproduct', None, *phrase_patterns)
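The Doc for this example is not shown above. A text like the following (a reconstruction chosen to be consistent with the match offsets below, not necessarily the original) would produce the matches shown:

# Hypothetical example text containing both phrase variants
doc = nlp(u"Our company plans to introduce a new vacuum cleaner. "
          u"If successful, the vacuum-cleaner will be our first product.")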
# Apply the matcher to our Doc object:
matches = matcher(doc)
# See what matches occur:
matches
# [(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]
# Here we create Spans from each match, and create named entities from them:
from spacy.tokens import Span

# (The span-creation code below is reconstructed to match the output that follows)
PROD = doc.vocab.strings[u'PRODUCT']
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]
doc.ents = list(doc.ents) + new_ents

show_ents(doc)
vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
first - ORDINAL - "first", "second", etc.
Counting Entities
While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:
doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')
len([ent for ent in doc.ents if ent.label_ == 'MONEY'])
# 2
Problem with Line Breaks
There's a known issue with spaCy v2.0.12 where some linebreaks are interpreted as `GPE` entities:
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')
show_ents(doc)
29.50 - MONEY - Monetary values, including unit
 - GPE - Countries, cities, states
five dollars - MONEY - Monetary values, including unit
However, there is a simple fix that can be added to the nlp pipeline:
# Quick function to remove ents formed on whitespace:
def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc
# Insert this into the pipeline AFTER the ner component:
nlp.add_pipe(remove_whitespace_entities, after='ner')
# Rerun nlp on the text above, and show ents:
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')
show_ents(doc)
29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit
Noun Chunks
Doc.noun_chunks are base noun phrases: token spans that include the noun and words describing the noun. Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.
Where Doc.ents rely on the ner pipeline component, Doc.noun_chunks are provided by the parser.
noun_chunks components:

Component    Description
.text        The original noun chunk text.
.root.text   The original text of the word connecting the noun chunk to the rest of the parse.
.root.dep_   Dependency relation connecting the root to its head.
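A quick example of iterating over noun chunks (the sentence below is an illustrative assumption; conveniently, it yields the three chunks counted further down):

doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")
for chunk in doc.noun_chunks:
    print(chunk.text + ' - ' + chunk.root.text + ' - ' + chunk.root.dep_)
# Autonomous cars - cars - nsubj
# insurance liability - liability - dobj
# manufacturers - manufacturers - pobj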
Previously we mentioned that Doc objects do not retain a list of sentences, but they’re available through the Doc.sents generator.
It’s the same with Doc.noun_chunks - lists can be created if needed:
len(list(doc.noun_chunks))
# 3
Visualizing Named Entities
Besides visualizing part-of-speech tags and dependencies with style='dep', displaCy offers a style='ent' visualizer:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')
# Import the displaCy library
from spacy import displacy
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
          u'By contrast, Sony sold only 7 thousand Walkman music players.')
displacy.render(doc, style='ent', jupyter=True)
[displaCy renders the text with highlighted entity spans: "the last quarter" DATE, "Apple" ORG, "nearly 20 thousand" CARDINAL, "iPods" PRODUCT, "$6 million" MONEY, "Sony" ORG, "only 7 thousand" CARDINAL, "Walkman" PRODUCT]
Viewing Sentences Line by Line
Unlike the displaCy dependency parse, the NER viewer has to take in a Doc object with an ents attribute. For this reason, we can't just pass a list of spans to .render(); we have to create a new Doc from each span.text:
for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)
If a span does not contain any entities, displaCy will issue a harmless warning.
Viewing Specific Entities
You can pass a list of entity types to restrict the visualization:
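A minimal sketch using displaCy's documented 'ents' option (the choice of ORG and PRODUCT here matches the rendering below):

options = {'ents': ['ORG', 'PRODUCT']}
displacy.render(doc, style='ent', jupyter=True, options=options)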
[displaCy now highlights only "Apple" ORG, "iPods" PRODUCT, "Sony" ORG and "Walkman" PRODUCT; the DATE, CARDINAL and MONEY spans are left unmarked]
Customizing Colors and Effects
You can also pass background color and gradient options:
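A sketch using displaCy's documented 'colors' option; the specific CSS gradient values are illustrative assumptions:

colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT': 'radial-gradient(yellow, green)'}
options = {'ents': ['ORG', 'PRODUCT'], 'colors': colors}
displacy.render(doc, style='ent', jupyter=True, options=options)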
[The same visualization renders again, this time with the custom background gradients applied to the ORG and PRODUCT spans]
Sentence Segmentation
In this section we’ll learn how sentence segmentation works, and how to set our own segmentation rules.
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')
# From Spacy Basics:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc.sents:
    print(sent)
This is the first sentence.
This is another sentence.
This is the last sentence.
Doc.sents is a generator
It is important to note that doc.sents is a generator. That is, a Doc is not segmented until doc.sents is called. This means that, where you could print the second Doc token with print(doc[1]), you can’t call the “second Doc sentence” with print(doc.sents[1]):
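Attempting to index the generator raises a TypeError:

print(doc.sents[1])
# TypeError: 'generator' object is not subscriptable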
Convert doc.sents to a list
list(doc.sents)
sents are spans
type(list(doc.sents)[0])
# spacy.tokens.span.Span
You can build a sentence collection by running doc.sents and saving the result to a list.
Adding Rules for Sentence Segmentation
spaCy’s built-in sentencizer relies on the dependency parse and end-of-sentence punctuation to determine segmentation rules. We can add rules of our own, but they have to be added before the creation of the Doc object, as that is where the parsing of segment start tokens happens:
# Parsing the segmentation start tokens happens during the nlp pipeline
doc2 = nlp(u'This is a sentence. This is a sentence. This is a sentence.')
for token in doc2:
    print(token.is_sent_start, ' ' + token.text)
None  This
None  is
None  a
None  sentence
None  .
True  This
None  is
None  a
None  sentence
None  .
True  This
None  is
None  a
None  sentence
None  .
Notice we haven’t run doc2.sents, and yet token.is_sent_start was set to True on two tokens in the Doc.
Let's add a semicolon to our existing segmentation rules. That is, whenever the sentencizer encounters a semicolon (;), the next token should start a new segment.
# SPACY'S DEFAULT BEHAVIOR
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc3.sents:
    print(sent)

# "Management is doing things right; leadership is doing the right things."
# -Peter Drucker
# ADD A NEW RULE TO THE PIPELINE
# Every token in a Doc object maintains a fixed index position,
# which does not change. We will take advantage of that.
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc
The new rule has to run before the document is parsed. Here we can either pass the argument before='parser' or first=True.
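The add_pipe call itself isn't shown above; under spaCy v2's API it would look like this:

nlp.add_pipe(set_custom_boundaries, before='parser')
nlp.pipe_names
# ['tagger', 'set_custom_boundaries', 'parser', 'ner']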
# Re-run the Doc object creation:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc4.sents:
    print(sent)
"Management is doing things right; leadership is doing the right things." -Peter Drucker
# And yet the new rule doesn't apply to the older Doc object:
for sent in doc3.sents:
    print(sent)

# "Management is doing things right; leadership is doing the right things."
# -Peter Drucker
Why not change the token directly?
Why not simply set the .is_sent_start value to True on existing tokens?
Because spaCy refuses to change is_sent_start after the document has been parsed; this prevents inconsistencies in the data.
Changing the Rules
In some cases we want to replace spaCy’s default sentencizer with our own set of rules. In this section we’ll see how the default sentencizer breaks on periods. We’ll then replace this behavior with a sentencizer that breaks on linebreaks.
mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."
# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])
# CHANGING THE RULES
from spacy.pipeline import SentenceSegmenter
def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'):  # handles multiple occurrences
            seen_newline = True
    yield doc[start:]  # handles the last group of tokens
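To wire this in, spaCy v2's SentenceSegmenter (removed in v3) takes the generator as its strategy; a minimal sketch under that assumption:

# Replace the default segmentation strategy with the custom one
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)

doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])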
Here we see that periods no longer affect segmentation, only linebreaks do. This would be appropriate when working with a long list of tweets, for instance.