Lecture 9: List Processing Patterns & File Writing

So far in the course, we have learned how to read from a text file and turn it into a Python data structure (such as a list of words). Today we will look at how to read from a CSV (comma-separated values) file, process the entries, and write or append the results to a different text file.

In the process, we will look at some code patterns involving lists, strings and counters that are useful when analyzing data.

Acknowledgement. This notebook has been adapted from the Wellesley CS111 Spring 2019 course materials (http://cs111.wellesley.edu/spring19).

Reading in a CSV File

CSV Format. A CSV (Comma Separated Values) file is a type of plain text file that stores tabular data. Each row of a table is a line in the text file, with the columns of the row separated by commas. This format is one of the most common import and export formats for spreadsheets and databases.

For example, a simple table with the columns Name and Age, such as the following, would be represented in a CSV file as:

Table:

Name Age
Harry 14
Hermoine 14
Dumbledor 60

CSV:

Name,Age
Harry,14
Hermoine,14
Dumbledor,60
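
To make the correspondence between the table and the CSV text concrete, here is a minimal sketch that splits each line of the example above on commas using plain string methods. The names csvText and table are illustrative only, and this naive approach assumes that no field contains a comma itself.

```python
# The CSV text from the example above, as one multi-line string
csvText = "Name,Age\nHarry,14\nHermoine,14\nDumbledor,60\n"

table = []
for line in csvText.splitlines():
    # each line is one row; commas separate the columns
    table.append(line.split(','))

header = table[0]   # the column names
rows = table[1:]    # the data rows (every value is still a string)
print(header)       # ['Name', 'Age']
print(rows)
```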

Python's csv module provides an easy way to read and iterate over a CSV file.

In [1]:
import csv # the module must be explicitly imported
In [2]:
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    # csvf is a reader object that can be iterated over
    print(csvf)
# the file is implicitly closed at the end of the with block
<_csv.reader object at 0x103ee26d0>

Iterating over a CSV object

When we iterate over a regular text file, the loop variable is a string and takes the value of each line in the file, one by one, in order. When we iterate over a CSV reader object, the loop variable is a list and takes the value of each row, one by one, in order.

In [3]:
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    for row in csvf:
        print(row)
        
['Ahmad,Omar', '23AAA', '02 (Shikha)']
['Bennett,Zoe', '23AAA', '02 (Shikha)']
['Le,Long N.', '21AAA', '02 (Shikha)']
['Bal,Gabriella H.', '20AAA', '02 (Shikha)']
['Chen,Kary', '23AAA', '02 (Shikha)']
['Dean,Sarah R.', '23AAA', '02 (Shikha)']
['Eckerle,Jacob M.', '23AAA', '02 (Shikha)']
['Kilinc,Onder', '23AAA', '02 (Shikha)']
['Litton,William', '23AAA', '02 (Shikha)']
['Lynch,Lauren E.', '23AAA', '02 (Shikha)']
['McCarey,Lauren R.', '23AAA', '02 (Shikha)']
['Mohan,Avery E.', '23AAA', '02 (Shikha)']
['Mojarradi,Mohammad Mehdi', '23AAA', '02 (Shikha)']
['Peters,Maximilian E.', '23AAA', '02 (Shikha)']
['Robayo,Salvador', '23AAA', '02 (Shikha)']
['Ruschil,Evan U.', '23AAA', '02 (Shikha)']
['Paul,Jonathan S.', '23WWA', '02 (Shikha)']
['Shi,Sarah', '23AAA', '02 (Shikha)']
['Siu,Benjamin A.', '23AAA', '02 (Shikha)']
['Su,April', '23AAA', '02 (Shikha)']
['Ho,Ching-Hsien', '20AAA', '02 (Shikha)']
['Massey-Bierman,Marika E.', '22AAA', '02 (Shikha)']
['Osman,Islam N.', '20AAA', '02 (Shikha)']
['Cortés,Bernal D.', '22AAA', '02 (Shikha)']
['Grossman,Keith T.', '22AAA', '02 (Shikha)']
['Wolf,Samuel T.', '21AAA', '02 (Shikha)']
['Job,Sebastian M.', '22AAA', '02 (Shikha)']
['Michalska,Victoria', '22AAA', '02 (Shikha)']
['Murray-Stark,Abigail R.', '22AAA', '02 (Shikha)']
['Yacoub,George E.', '21AAA', '02 (Shikha)']
['Watson,Ryan H.', '22AAA', '02 (Shikha)']
['White,Olivia R.', '22AAA', '02 (Shikha)']
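
Notice that a name such as 'Ahmad,Omar' stays together as one field even though it contains a comma. That is what one would expect if the names are quoted in roster.csv. The raw contents of the file are not shown here, so the sample line below is an assumption. A minimal sketch contrasting csv.reader with a naive split:

```python
import csv
import io

# Assumed shape of one roster line: the name is quoted because it contains a comma
sampleLine = '"Ahmad,Omar",23AAA,02 (Shikha)\n'

# Naive splitting breaks the quoted name into two pieces
naive = sampleLine.strip().split(',')

# csv.reader respects the quotes and keeps the name as a single field;
# io.StringIO lets us treat a string like an open file
parsed = next(csv.reader(io.StringIO(sampleLine)))

print(naive)   # ['"Ahmad', 'Omar"', '23AAA', '02 (Shikha)']
print(parsed)  # ['Ahmad,Omar', '23AAA', '02 (Shikha)']
```

This is the main reason to prefer the csv module over str.split when reading CSV files.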

Accumulating the rows of the CSV as a Nested List

We can iterate over a CSV file and accumulate all rows (each of which is a list) into a mega list.

In [4]:
rosterList = []
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    for row in csvf:
        rosterList.append(row)
In [5]:
rosterList # let's see what is in rosterList
Out[5]:
[['Ahmad,Omar', '23AAA', '02 (Shikha)'],
 ['Bennett,Zoe', '23AAA', '02 (Shikha)'],
 ['Le,Long N.', '21AAA', '02 (Shikha)'],
 ['Bal,Gabriella H.', '20AAA', '02 (Shikha)'],
 ['Chen,Kary', '23AAA', '02 (Shikha)'],
 ['Dean,Sarah R.', '23AAA', '02 (Shikha)'],
 ['Eckerle,Jacob M.', '23AAA', '02 (Shikha)'],
 ['Kilinc,Onder', '23AAA', '02 (Shikha)'],
 ['Litton,William', '23AAA', '02 (Shikha)'],
 ['Lynch,Lauren E.', '23AAA', '02 (Shikha)'],
 ['McCarey,Lauren R.', '23AAA', '02 (Shikha)'],
 ['Mohan,Avery E.', '23AAA', '02 (Shikha)'],
 ['Mojarradi,Mohammad Mehdi', '23AAA', '02 (Shikha)'],
 ['Peters,Maximilian E.', '23AAA', '02 (Shikha)'],
 ['Robayo,Salvador', '23AAA', '02 (Shikha)'],
 ['Ruschil,Evan U.', '23AAA', '02 (Shikha)'],
 ['Paul,Jonathan S.', '23WWA', '02 (Shikha)'],
 ['Shi,Sarah', '23AAA', '02 (Shikha)'],
 ['Siu,Benjamin A.', '23AAA', '02 (Shikha)'],
 ['Su,April', '23AAA', '02 (Shikha)'],
 ['Ho,Ching-Hsien', '20AAA', '02 (Shikha)'],
 ['Massey-Bierman,Marika E.', '22AAA', '02 (Shikha)'],
 ['Osman,Islam N.', '20AAA', '02 (Shikha)'],
 ['Cortés,Bernal D.', '22AAA', '02 (Shikha)'],
 ['Grossman,Keith T.', '22AAA', '02 (Shikha)'],
 ['Wolf,Samuel T.', '21AAA', '02 (Shikha)'],
 ['Job,Sebastian M.', '22AAA', '02 (Shikha)'],
 ['Michalska,Victoria', '22AAA', '02 (Shikha)'],
 ['Murray-Stark,Abigail R.', '22AAA', '02 (Shikha)'],
 ['Yacoub,George E.', '21AAA', '02 (Shikha)'],
 ['Watson,Ryan H.', '22AAA', '02 (Shikha)'],
 ['White,Olivia R.', '22AAA', '02 (Shikha)']]

List of lists format. Notice that each item in the list is a row in the original file (in order) and the overall list is a list of rowLists. How can we access the information of a particular student from this nested list?

In [6]:
len(rosterList)  # number of students in class
Out[6]:
32
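
Because rosterList is a list of lists, one index selects a row and a second index selects a field within that row. A small sketch using the first three rows shown above (sampleRoster is a stand-in name so that the example runs on its own):

```python
sampleRoster = [['Ahmad,Omar', '23AAA', '02 (Shikha)'],
                ['Bennett,Zoe', '23AAA', '02 (Shikha)'],
                ['Le,Long N.', '21AAA', '02 (Shikha)']]

firstRow = sampleRoster[0]       # the whole row for the first student
nameField = sampleRoster[0][0]   # row 0, column 0: the name field
yearField = sampleRoster[-1][1]  # last row, column 1: the class-year field

print(firstRow)    # ['Ahmad,Omar', '23AAA', '02 (Shikha)']
print(nameField)   # Ahmad,Omar
print(yearField)   # 21AAA
```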

Generating random indices. Remember Homework 1, where you were asked to design an algorithm for generating random numbers? Let's play a game where we generate random numbers between 0 and 31 and index our list with that number to see whose name comes up.

In [7]:
import random # import module to help generate random numbers
In [8]:
randomIndex = random.randint(0, 31)  
# generates a random integer between 0 and 31
In [9]:
rosterList[randomIndex]
Out[9]:
['Kilinc,Onder', '23AAA', '02 (Shikha)']
In [10]:
randomIndex = random.randint(0, 31)
In [11]:
rosterList[randomIndex]  # great way of cold calling in lectures !
Out[11]:
['Dean,Sarah R.', '23AAA', '02 (Shikha)']
In [12]:
rosterList[random.randint(0,31)][0]   
# Accessing just the name
Out[12]:
'Yacoub,George E.'
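
The calls above hard-code 31 as the largest index, which only works for a roster of exactly 32 rows. One hedged alternative is to let the list length drive the range, using random.randrange or random.choice from the same standard random module. The sketch below uses a short stand-in list (sampleRoster) so that it runs on its own.

```python
import random

sampleRoster = [['Ahmad,Omar', '23AAA', '02 (Shikha)'],
                ['Bennett,Zoe', '23AAA', '02 (Shikha)'],
                ['Le,Long N.', '21AAA', '02 (Shikha)']]

# randrange(n) returns an integer from 0 up to n-1, so it adapts to the list length
idx = random.randrange(len(sampleRoster))
print(sampleRoster[idx])

# random.choice picks one element directly, with no index needed
picked = random.choice(sampleRoster)
print(picked[0])   # just the name field
```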

Reorganizing Data

Sometimes your CSV may have unnecessary data that you want to discard (such as the last column in our class roster). Additionally, your rows might have integer values stored as strings (such as the class year) that you may want to convert to integers. Let us write a helper function that takes as input a list (which is a row of the CSV file) and outputs a cleaned row as a tuple. The returned tuple must have three items:

  • The first item must be the student's first name as a string
  • The second item must be the student's last name as a string
  • The third item must represent the graduation year (23, 22, 21, 20) as an int
In [13]:
def reorgData(rowList):
    """Takes a row of a CSV (as a list) and returns
    a tuple of student information"""
    # tuple assignment, splitting last name
    # and first(with middle) name
    lName, fmName = rowList[0].split(',')  
    fName = fmName.split()[0]
    year = rowList[1]  # takes the form '23AAA'
    yy = int(year[:2])
    return fName, lName, yy

Let us test our reorgData function on a particular random rowList from the rosterList.

In [14]:
randomIndex = random.randint(0, 31)
In [15]:
reorgData(rosterList[randomIndex])
Out[15]:
('Ching-Hsien', 'Ho', 20)
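
To see how reorgData arrives at its result, the following sketch traces the same steps on one row from the roster and prints each intermediate value.

```python
rowList = ['Le,Long N.', '21AAA', '02 (Shikha)']

# Step 1: split the name field at the comma
lName, fmName = rowList[0].split(',')
print(lName)    # Le
print(fmName)   # Long N.

# Step 2: keep only the first word of the first-and-middle name
fName = fmName.split()[0]
print(fName)    # Long

# Step 3: take the first two characters of the year field and convert to an int
yy = int(rowList[1][:2])
print(yy)       # 21
```

Note that the tuple assignment in step 1 assumes the name field contains exactly one comma; otherwise Python raises a ValueError.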

Accumulation with Lists

In previous lectures we have seen that it is common to use loops in conjunction with accumulation variables that collect results from processing elements within the loop. Let us write some functions that exercise commonly seen accumulation patterns using lists.

Exercise: Number of Students by Year

Let's get to know our class better! We will write a function yearList which takes two arguments, classList (a list of lists) and year (an int), and returns the list of names of the students in the class with that graduating year.

In [16]:
def yearList(classList, year):
    """Returns the names of the students in classList
    whose graduation year equals year"""
    result = []
    for sList in classList:
        # tuple assignment:
        fName, lName, yy = reorgData(sList) 
        if yy == year:
            result.append(fName + ' ' + lName)
    return result
In [17]:
len(yearList(rosterList, 23)) # how many first years in class?
Out[17]:
18
In [18]:
yearList(rosterList, 23)  # Names of first years
Out[18]:
['Omar Ahmad',
 'Zoe Bennett',
 'Kary Chen',
 'Sarah Dean',
 'Jacob Eckerle',
 'Onder Kilinc',
 'William Litton',
 'Lauren Lynch',
 'Lauren McCarey',
 'Avery Mohan',
 'Mohammad Mojarradi',
 'Maximilian Peters',
 'Salvador Robayo',
 'Evan Ruschil',
 'Jonathan Paul',
 'Sarah Shi',
 'Benjamin Siu',
 'April Su']
In [19]:
len(yearList(rosterList, 22)) # how many sophomores?
Out[19]:
8
In [20]:
yearList(rosterList, 22)  # Names of sophomores
Out[20]:
['Marika Massey-Bierman',
 'Bernal Cortés',
 'Keith Grossman',
 'Sebastian Job',
 'Victoria Michalska',
 'Abigail Murray-Stark',
 'Ryan Watson',
 'Olivia White']
In [21]:
len(yearList(rosterList, 21))  # how many juniors?
Out[21]:
3
In [22]:
yearList(rosterList, 21) # names of juniors
Out[22]:
['Long Le', 'Samuel Wolf', 'George Yacoub']
In [23]:
len(yearList(rosterList, 20))  # how many seniors
Out[23]:
3
In [24]:
yearList(rosterList, 20)  # name of seniors
Out[24]:
['Gabriella Bal', 'Ching-Hsien Ho', 'Islam Osman']
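
Calling yearList once per year scans the roster four times. An alternative accumulation pattern is to tally every year in a single pass with a counter. This sketch uses collections.Counter from the standard library on a few sample rows; sampleRoster and yearCounts are illustrative names, not part of the lecture code.

```python
from collections import Counter

sampleRoster = [['Ahmad,Omar', '23AAA', '02 (Shikha)'],
                ['Le,Long N.', '21AAA', '02 (Shikha)'],
                ['Bal,Gabriella H.', '20AAA', '02 (Shikha)'],
                ['Chen,Kary', '23AAA', '02 (Shikha)']]

yearCounts = Counter()
for row in sampleRoster:
    yy = int(row[1][:2])   # same year extraction as in reorgData
    yearCounts[yy] += 1    # a missing key counts as 0 in a Counter

print(yearCounts[23])  # 2
print(yearCounts[22])  # 0 (no sophomores in this small sample)
```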

Exercise: Use our sequenceTools

We built an assortment of functions last week as part of our sequences toolkit. Let's use some of those functions now to find out fun facts about the class. Function names in the __all__ variable of our toolkit:

  • isVowel
  • countAllVowels
  • countChar
  • wordStartEndCount
  • wordStartEndList
  • isPalindrome

We can import these functions from our module into our current interactive Python session using the import statement.

In [25]:
from sequenceTools import *
In [26]:
help(countAllVowels)
Help on function countAllVowels in module sequenceTools:

countAllVowels(word)
    Returns number of vowels in the word.
    >>> countAllVowels('Williams')
    3
    >>> countAllVowels('Eephs')
    2

In [27]:
countAllVowels('onomatopoeia')  # test that the import works
Out[27]:
8

Another helper function. As we will be analyzing student names, let's create a helper function that extracts the name out of a CSV row (a list).

In [28]:
def getName(sInfo):
    """Takes in a row of the CSV (as a list) and returns the
    student's first name concatenated with their last name"""
    fName, lName, yy = reorgData(sInfo)
    return fName + ' ' + lName
In [29]:
getName(rosterList[random.randint(0, 31)])  # test on a random student!
Out[29]:
'Maximilian Peters'

Fun Facts. Who has the largest number of vowels in their name?

In [30]:
def mostVowelName(classList):
    currentMax = 0 # initialize max value
    persons = []  # initialize list for names
    for sInfo in classList:
        name = getName(sInfo)
        numVowels = countAllVowels(name)
        if numVowels > currentMax:
            # found someone whose name has more vowels
            # than current max: update currentMax, reset persons
            currentMax = numVowels 
            persons = [name]
        elif numVowels == currentMax:
            # tie with the current max: add this name too
            persons.append(name)
    return persons, currentMax
In [31]:
mostVowelName(rosterList)  # which student has most vowels in their name?
Out[31]:
(['Marika Massey-Bierman'], 8)

Fun Facts. How about the least number of vowels? We can reuse the getName helper function to extract the student names again.

In [32]:
def leastVowelName(classList):
    currentMin = 20 # initialize min value (assumes no name has more than 20 vowels)
    persons = []  # initialize list for names
    for sInfo in classList:
        name = getName(sInfo)
        numVowels = countAllVowels(name)
        if numVowels < currentMin:
            # found someone whose name has fewer vowels
            # than current min: update currentMin, reset persons
            currentMin = numVowels
            persons = [name]
        elif numVowels == currentMin:
            # tie with the current min: add this name too
            persons.append(name)
    return persons, currentMin
In [33]:
leastVowelName(rosterList)  # which students have the fewest vowels in their name?
Out[33]:
(['Long Le', 'Kary Chen'], 2)
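
The same extreme-with-ties pattern can also be written in two passes: find the best count first, then collect every name that reaches it. The sketch below is self-contained, so it defines its own countVowels helper as a stand-in for countAllVowels from sequenceTools (whose source is not shown here); it assumes the vowels are a, e, i, o, u in either case. The name mostVowelNames is hypothetical.

```python
def countVowels(word):
    """Stand-in vowel counter (assumes vowels are a, e, i, o, u)."""
    count = 0
    for ch in word:
        if ch.lower() in 'aeiou':
            count += 1
    return count

def mostVowelNames(names):
    """Returns the list of names with the most vowels, and that count."""
    best = 0
    for name in names:            # first pass: find the largest count
        best = max(best, countVowels(name))
    winners = []
    for name in names:            # second pass: collect every name that ties it
        if countVowels(name) == best:
            winners.append(name)
    return winners, best

sample = ['Long Le', 'Kary Chen', 'Marika Massey-Bierman', 'Omar Ahmad']
print(mostVowelNames(sample))   # (['Marika Massey-Bierman'], 8)
```

The two-pass version does more work than the single loop above, but it needs no reset logic for the list of names.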

Writing to Files

We can write all the results that we are computing into a file (a persistent structure). To open a file for writing, we use open with the mode 'w'.

The following code will create a new file named studentFacts.txt in the current working directory and write in it results of our function calls.

In [34]:
with open('studentFacts.txt', 'w') as sFile:
    sFile.write('Fun facts about CS134 students.\n')  # need newlines
    sFile.write('No. of first years in CS134: {}\n'.format(len(yearList(rosterList, 23)))) 
    sFile.write('No. of sophomores in CS134: {}\n'.format(len(yearList(rosterList, 22))))
    sFile.write('No. of juniors in CS134: {}\n'.format(len(yearList(rosterList, 21))))
    sFile.write('No. of seniors in CS134: {}\n'.format(len(yearList(rosterList, 20))))

We can use ls to see that a new file studentFacts.txt has been created:

In [35]:
ls # new file information
134-Lecture9.key              lec_listPatterns_solns.ipynb
__pycache__/                  roster.csv
csvExample.py                 sequenceTools.py
faculty.csv                   studentFacts.txt
lec_listPatterns.ipynb

Use the OS command more to view the contents of the file:

In [36]:
more studentFacts.txt

Alternatively, go to Finder (on a Mac) or Windows Explorer (PC) to view the contents of the file.
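
The file can also be inspected from Python itself by opening it in the default read mode. The sketch below writes two lines and reads them back. It uses a temporary directory from the standard tempfile module so that it does not touch the real studentFacts.txt; the file name facts.txt and the line contents are placeholders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpDir:
    path = os.path.join(tmpDir, 'facts.txt')   # hypothetical file name

    with open(path, 'w') as outFile:
        outFile.write('first line\n')    # write does not add newlines for us
        outFile.write('second line\n')

    with open(path) as inFile:           # the default mode is 'r' (read)
        contents = inFile.read()

print(contents)
```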

Appending to files

How do we add lines to the end of an existing file? We can't open the file in write mode (with a 'w'), because that erases all previous contents and starts with an empty file.

Instead, we open the file in append mode (with an 'a'). Any subsequent writes are made after the existing contents.

In [37]:
with open('studentFacts.txt', 'a') as sFile:
    sFile.write('Name with most vowels: {}\n'.format(mostVowelName(rosterList)))
    sFile.write('Name with least vowels: {}\n'.format(leastVowelName(rosterList)))

Open the file studentFacts.txt again to view it, or use the OS command more:

In [38]:
more studentFacts.txt
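
To see the difference between the two modes side by side, here is a minimal sketch that appends to a file and then reopens it in write mode. It works in a temporary directory (via the standard tempfile module) with a placeholder file name, so the real studentFacts.txt is left alone.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpDir:
    path = os.path.join(tmpDir, 'demo.txt')   # hypothetical file name

    with open(path, 'w') as f:
        f.write('one\n')
    with open(path, 'a') as f:   # append mode: existing contents are kept
        f.write('two\n')
    with open(path) as f:
        afterAppend = f.read()

    with open(path, 'w') as f:   # write mode: existing contents are erased
        f.write('three\n')
    with open(path) as f:
        afterWrite = f.read()

print(repr(afterAppend))  # 'one\ntwo\n'
print(repr(afterWrite))   # 'three\n'
```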

List Accumulation Patterns

When iterating over lists, several accumulation patterns come up a lot. In the following examples, the premise is that we have a list we are iterating over and we are returning a new list. There are two common categories of tasks:

  • Mapping patterns: when you want to perform the same action on every item in the list
  • Filter patterns: when you want to retain only some items of the list

We can simplify the mapping and filtering patterns with a syntactic device called a list comprehension. Let's take an example of each.

Mapping Pattern via List Comprehension

We can generate a new list by performing an operation on every element in a given list. This is called mapping.

In [39]:
def mapDouble(nums):
    """Given a list of numbers, returns a *new* list,
    in which each element is twice the corresponding
    element in the input list.
    """
    result = []
    for n in nums:
        result.append(2*n)
    return result
In [40]:
mapDouble([2, 3, 4, 5])
Out[40]:
[4, 6, 8, 10]

Succinct form using list comprehension.

In [41]:
def mapDoubleShort(nums):
    return [2*n for n in nums]
In [42]:
mapDoubleShort([6, 7, 8])
Out[42]:
[12, 14, 16]

List of Names. Suppose we want to iterate over our nested list rosterList and collect all the student names in a list. We can do that with a simple mapping list comprehension!

In [43]:
nameList = [getName(sInfo) for sInfo in rosterList]
In [44]:
nameList
Out[44]:
['Omar Ahmad',
 'Zoe Bennett',
 'Long Le',
 'Gabriella Bal',
 'Kary Chen',
 'Sarah Dean',
 'Jacob Eckerle',
 'Onder Kilinc',
 'William Litton',
 'Lauren Lynch',
 'Lauren McCarey',
 'Avery Mohan',
 'Mohammad Mojarradi',
 'Maximilian Peters',
 'Salvador Robayo',
 'Evan Ruschil',
 'Jonathan Paul',
 'Sarah Shi',
 'Benjamin Siu',
 'April Su',
 'Ching-Hsien Ho',
 'Marika Massey-Bierman',
 'Islam Osman',
 'Bernal Cortés',
 'Keith Grossman',
 'Samuel Wolf',
 'Sebastian Job',
 'Victoria Michalska',
 'Abigail Murray-Stark',
 'George Yacoub',
 'Ryan Watson',
 'Olivia White']

Another example. Suppose we want to iterate over a list of names and return a list of first names in lower case.

In [45]:
def firstNames(nameList):
    """Given a list of names as firstName lastName, returns a list of
    the first names in lower case.
    """
    return [name.split()[0].lower() for name in nameList]  
In [46]:
firstNames(['Shikha Singh', 'Iris Howley', 'Lida Doret'])
Out[46]:
['shikha', 'iris', 'lida']

Filtering Pattern via List Comprehensions

Another common way to produce a new list is to filter an existing list, keeping only those elements that satisfy a certain predicate.

In [47]:
def filterNames(nameList):
    """Given a list of first names, returns a *new* list of all
    names in the input list that have length >= 9.
    """
    result = []
    for name in nameList:
        if len(name) >= 9:
            result.append(name)
    return result
In [48]:
filterNames(firstNames(nameList))
Out[48]:
['gabriella', 'maximilian', 'ching-hsien', 'sebastian']

We can also do this filtering pattern very succinctly using list comprehensions!

In [49]:
def filterNamesShort(nameList):
    return [name for name in nameList if len(name) >= 9]
In [50]:
filterNamesShort(firstNames(nameList))
Out[50]:
['gabriella', 'maximilian', 'ching-hsien', 'sebastian']

List Comprehensions Exercises

In [51]:
# Given a list of numbers nums,
# create a list of all numbers that are even
nums = [1, 2, 3, 4, 5, 6, 7]
result = [n for n in nums if n%2 == 0]
print(result)
[2, 4, 6]
In [52]:
# add the ending 'th' to all words in a phrase
phrase = "mine dog ate your shoe"
# expected phrase: ["mineth", "dogth", "ateth", "yourth", "shoeth"]
newPhrase = [word + 'th' for word in phrase.split()]
newPhrase
Out[52]:
['mineth', 'dogth', 'ateth', 'yourth', 'shoeth']

List Comprehensions with Mapping and Filtering

It is possible to do both mapping and filtering in a single list comprehension. Examine the example below which filters a list by even numbers and creates a new list of their squares.

In [53]:
[(x**2) for x in range(10) if x % 2 == 0]
Out[53]:
[0, 4, 16, 36, 64]

Note that our expression for mapping still comes before the "for" and our filtering with "if" still comes after our sequence. Below is the equivalent code without list comprehensions.

In [54]:
newList = []
for x in range(10):
    if x % 2 == 0:
        newList.append(x**2)
newList
Out[54]:
[0, 4, 16, 36, 64]
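
The yearList function from earlier in this lecture is itself a map-and-filter loop, so it can be condensed the same way. The following is a sketch only: yearListShort is a hypothetical name, reorgData is repeated in compact form so that the block runs on its own, and sampleRoster is a small stand-in for rosterList.

```python
def reorgData(rowList):
    """Compact copy of the lecture's helper: returns (first, last, year)."""
    lName, fmName = rowList[0].split(',')
    fName = fmName.split()[0]
    return fName, lName, int(rowList[1][:2])

def yearListShort(classList, year):
    """Names (first last) of the students in classList with the given year."""
    cleaned = [reorgData(row) for row in classList]          # mapping
    return [fName + ' ' + lName                              # mapping
            for fName, lName, yy in cleaned if yy == year]   # filtering

sampleRoster = [['Ahmad,Omar', '23AAA', '02 (Shikha)'],
                ['Le,Long N.', '21AAA', '02 (Shikha)'],
                ['Chen,Kary', '23AAA', '02 (Shikha)']]

print(yearListShort(sampleRoster, 23))   # ['Omar Ahmad', 'Kary Chen']
```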

YOUR TURN: Try to write the following list comprehension examples:

In [56]:
# Example 1: Write a list comprehension that filters the vowels from a word 
# such as beauteous and returns a list of its capitalized vowels.
word = "beauteous"
newList = [char.upper() for char in word if isVowel(char)]
newList
Out[56]:
['E', 'A', 'U', 'E', 'O', 'U']
In [57]:
# Example 2: Write a list comprehension that filters a list of proper nouns by length.
# It should extract nouns of length greater than 4 but less than 8 and return a list
# where the first letter is properly capitalized
# This is a challenge!
properNouns = ["cher", "bjork", "sting", "beethoven", "prince", "madonna"]
newList = [word[0].upper() + word[1:] for word in properNouns if 4 < len(word) < 8]
newList
Out[57]:
['Bjork', 'Sting', 'Prince', 'Madonna']
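
As a possible variation on Example 2, Python's built-in str.capitalize method can do the capitalization step. It is not a drop-in replacement for word[0].upper() + word[1:], because it also lower-cases the remaining letters, as this short sketch shows (the string "mcCarey" is just an illustrative input).

```python
properNouns = ["cher", "bjork", "sting", "beethoven", "prince", "madonna"]

# str.capitalize upper-cases the first character and lower-cases the rest
viaCapitalize = [word.capitalize() for word in properNouns if 4 < len(word) < 8]
print(viaCapitalize)   # ['Bjork', 'Sting', 'Prince', 'Madonna']

# The two approaches differ when later letters are already upper case
mixed = "mcCarey"
sliced = mixed[0].upper() + mixed[1:]
print(sliced)               # McCarey
print(mixed.capitalize())   # Mccarey
```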