Containers

Containers are data types intended to hold a collection of data.

  • List
  • Tuple
  • Set
  • Dictionary

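List: ordered, mutable, allows duplicates.

Set: unordered, mutable, no duplicates. A set is not kept sorted; call sorted() on it when a sorted view is needed.

Tuple: ordered, immutable (read-only once created), allows duplicates. It has only two methods:

  • count
  • index

Dictionary: mutable, no duplicate keys. Since Python 3.7 a dictionary preserves insertion order, but items are looked up by key rather than by position.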
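
The properties of the four containers can be checked directly. The snippet below is a minimal sketch; the variable names and values are made up for illustration and are not part of the original notes.

```python
# Illustrative values only; these names are hypothetical.
numbers_list = [3, 1, 3]
numbers_list.append(2)            # lists are mutable and keep duplicates

numbers_tuple = (3, 1, 3)
try:
    numbers_tuple[0] = 9          # tuples are immutable, so this fails
    tuple_error = None
except TypeError as err:
    tuple_error = type(err).__name__

numbers_set = {3, 1, 3}           # the duplicate 3 collapses into one element

ages = {'ann': 30, 'bob': 25}
ages['ann'] = 31                  # assigning to an existing key overwrites it

print(numbers_list)                                     # [3, 1, 3, 2]
print(tuple_error)                                      # TypeError
print(numbers_tuple.count(3), numbers_tuple.index(1))   # 2 1
print(len(numbers_set))                                 # 2
print(ages)                                             # {'ann': 31, 'bob': 25}
```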

Packing and unpacking values

# I am interested only in what is at the head and tail of the list
l = [1, 2, 3, 4, 5, 6, 7, 20]
head, *body, tail = l
print(head, tail)
1 20

Formatted String

Formatted String Literals

Introduced in Python 3.6, f-strings offer several benefits over the older .format() string method.

name = 'Antonio'

# Using the old .format() method:
print('His name is {}.'.format(name))

# Using f-strings:
print(f'His name is {name}.')

Pass !r to get the repr() form of the value, which for a string includes the quotes:

print(f'His name is {name!r}')
His name is 'Antonio'

Be careful not to let quotation marks in the replacement fields conflict with the quoting used in the outer string:

d = {'first':123,'second':456}
# wrong before Python 3.12 (SyntaxError): the inner quotes end the outer string early
print(f'Address: {d['first']} Main Street')
# right
print(f"Address: {d['first']} Main Street")

Minimum Widths, Alignment and Padding

To set the alignment, use the character < for left-align, ^ for center, > for right.

To set padding, precede the alignment character with the padding character (- and . are common choices).
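
A short sketch of these format options, using made-up sample values. The number after the alignment character is the minimum field width.

```python
# Hypothetical sample values chosen for this illustration.
left = f"{'Name':<10}|"       # left-align in a field 10 characters wide
center = f"{'Name':^10}|"     # center in the same width
right = f"{'Name':>10}|"      # right-align
dashed = f"{'Name':-<10}|"    # left-align, pad with '-'
dotted = f"{42:.>8}"          # right-align a number, pad with '.'

print(left)     # Name      |
print(center)   #    Name   |
print(right)    #       Name|
print(dashed)   # Name------|
print(dotted)   # ......42
```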

Text Files

# Create a file with the %%writefile cell magic.
# This magic is specific to IPython and Jupyter notebooks.
# Alternatively, create the file in a text editor.
%%writefile test.txt
Hello, this is first line
This is second line

%%writefile -a test.txt
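
Outside a notebook, the same result can be had with the built-in open() function. This is a minimal sketch: it writes into a temporary directory so no real file is touched, and the appended line is invented text, not content from the original notes.

```python
from pathlib import Path
import tempfile

# Work in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "test.txt"

    # Mode 'w' creates the file, or truncates it if it already exists.
    with open(path, "w") as f:
        f.write("Hello, this is first line\n")
        f.write("This is second line\n")

    # Mode 'a' appends to the end, similar to %%writefile -a.
    with open(path, "a") as f:
        f.write("This is an appended line\n")   # hypothetical text

    # Read everything back as a list of lines without the newlines.
    with open(path) as f:
        lines = f.read().splitlines()

print(lines)
```

The with blocks close each file automatically, even if an error occurs while writing.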

PDF Files

Using PyPDF2

pip install PyPDF2

Reading PDFs

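import PyPDF2

# Names below follow PyPDF2 3.x; older releases used PdfFileReader, getPage and extractText
f = open('Magna-carta-translation.pdf','rb')
pdf_reader = PyPDF2.PdfReader(f)
page_one = pdf_reader.pages[0]

We can then extract the text:

page_one_text = page_one.extract_text()

Adding to PDFs

# Create a writer and add the page to save before writing it out
pdf_writer = PyPDF2.PdfWriter()
pdf_writer.add_page(page_one)
pdf_output = open("first_page_of_doi.pdf","wb")
pdf_writer.write(pdf_output)
# Close both files once the new PDF has been written
pdf_output.close()
f.close()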

Regular Expressions

Patterns

r'mypattern'

Placing the r in front of the string makes it a raw string, so Python passes each \ in the pattern to the regex engine unchanged instead of treating it as the start of a string escape.
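
The difference shows up with \b, which is a backspace character in a normal string but a word boundary in a regex. The sample text below is made up for this sketch.

```python
import re

plain = "\bcat"    # normal string: \b is turned into a single backspace character
raw = r"\bcat"     # raw string: backslash and 'b' are kept, so regex sees a word boundary

print(len(plain), len(raw))               # 4 5

# Hypothetical sample text: 'cat' as a word and 'cat' inside 'concat'.
raw_matches = re.findall(raw, "cat concat")
plain_matches = re.findall(plain, "cat concat")

print(raw_matches)     # ['cat']  (only the standalone word)
print(plain_matches)   # []       (the text contains no backspace character)
```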

Identifiers for Characters

The table below lists the common character identifiers:

Character   Description                  Example Pattern Code   Example Match
\d          A digit                      file_\d\d              file_25
\w          Alphanumeric or underscore   \w-\w\w\w              A-b_1
\s          White space                  a\sb\sc                a b c
\D          A non digit                  \D\D\D                 ABC
\W          Non-alphanumeric             \W\W\W\W\W             *-+=)
\S          Non-whitespace               \S\S\S\S               Yoyo

Quantifiers

Character   Description                 Example Pattern Code   Example Match
+           Occurs one or more times    Version \w-\w+         Version A-b1_1
{3}         Occurs exactly 3 times      \D{3}                  abc
{2,4}       Occurs 2 to 4 times         \d{2,4}                123
{3,}        Occurs 3 or more times      \w{3,}                 anycharacters
*           Occurs zero or more times   A*B*C*                 AAACC
?           Once or none                plurals?               plural

Identifiers and quantifiers combine to describe a phone number:

import re

text = "My phone number is 614-292-5800"
re.search(r'\d{3}-\d{3}-\d{4}',text)
<re.Match object; span=(19, 31), match='614-292-5800'>

Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? Groups, written with parentheses, let us match the whole pattern and then break the match down into its parts.

phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = re.search(phone_pattern,text)
# The entire result
results.group()
'614-292-5800'
# Can then also call by group position.
# remember groups were separated by parentheses ()
# Something to note is that group ordering starts at 1. Passing in 0 returns everything
results.group(1)
'614'
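
To pull out every group at once, the match object also offers groups(), which returns a tuple. The sketch below reuses the phone number from above; the names area_code, exchange and line_number are only illustrative.

```python
import re

text = "My phone number is 614-292-5800"
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
match = phone_pattern.search(text)

# groups() returns all captured groups as a tuple, in order.
area_code, exchange, line_number = match.groups()

print(match.group(0))    # 614-292-5800  (group 0 is the entire match)
print(area_code)         # 614
print(match.groups())    # ('614', '292', '5800')
```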

Additional Regex Syntax

Or operator |

Use the pipe operator | to match one alternative or another. For example:

re.search(r"man|woman","This man was here.")

The Wildcard Character

Use a “wildcard” as a placeholder that matches any single character (by default, any character except a newline). A simple period . does this. For example:

re.findall(r".at","The cat in the hat sat here.")
# ['cat', 'hat', 'sat']
re.findall(r"...at","The bat went splat")
# ['e bat', 'splat']

\S: Non-whitespace

+: Occurs one or more times

# One or more non-whitespace that ends with 'at'
re.findall(r'\S+at',"The bat went splat")
# ['bat', 'splat']

Starts With and Ends With

We can use ^ to signal "starts with" and $ to signal "ends with". Both apply to the entire string, not to individual words.
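
A minimal sketch of both anchors, using made-up sample phrases:

```python
import re

# Hypothetical sample phrases for this illustration.
phrase = "2 is a number, and so is 7"

starts_with_digit = re.findall(r'^\d', phrase)   # digit at the very start of the string
ends_with_digit = re.findall(r'\d$', phrase)     # digit at the very end of the string
not_at_start = re.findall(r'^\d', "The 2 is not first")

print(starts_with_digit)   # ['2']
print(ends_with_digit)     # ['7']
print(not_at_start)        # []  (the digit is not at the start of the string)
```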

Exclusion

To exclude characters, put ^ as the first character inside a set of brackets []. The set then matches any character except the ones listed. For example:

phrase = "there are 3 numbers 34 inside 5 this sentence."
re.findall(r'[^\d]+',phrase)
# ['there are ', ' numbers ', ' inside ', ' this sentence.']

We can use this to remove punctuation from a sentence.

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))
# 'This is a string But it has punctuation How can we remove it'

Brackets for Grouping

As shown above, brackets group together character options. For example, to find hyphenated words:

text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'
re.findall(r'[\w]+-[\w]+',text)
# ['hyphen-words', 'long-ish']

Parentheses for Multiple Options

If we have multiple options for matching, we can use parentheses to list out these options. For example:

# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"
re.search(r'cat(fish|nap|claw)',text)
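
Running the same pattern against all three sentences shows which ones match. This is a small sketch; the loop and the results list are additions for illustration.

```python
import re

pattern = re.compile(r'cat(fish|nap|claw)')

texts = [
    'Hello, would you like some catfish?',
    "Hello, would you like to take a catnap?",
    "Hello, have you seen this caterpillar?",
]

results = []
for sentence in texts:
    match = pattern.search(sentence)
    # search() returns None when no option follows 'cat'.
    results.append(match.group() if match else None)

print(results)   # ['catfish', 'catnap', None]
```

The third sentence gives None because 'caterpillar' continues with 'erpillar', which is not one of the listed options.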