So far in the course, we have learned how to read from a text file and turn it into a Python data structure (such as a list of words). Today we will look at how to read from a CSV (comma-separated values) file, process the entries, and write or append to a different text file.
In the process, we will look at some code patterns involving lists, strings and counters that are useful when analyzing data.
Acknowledgement. This notebook has been adapted from the Wellesley CS111 Spring 2019 course materials (http://cs111.wellesley.edu/spring19).
CSV Format. A CSV (Comma Separated Values) file is a type of plain text file that stores tabular data. Each row of a table is a line in the text file, with the columns in that row separated by commas. This format is the most common import and export format for spreadsheets and databases.
For example, a simple table such as the following, with columns Name and Age, would be represented in a CSV as:
Table:
Name | Age |
---|---|
Harry | 14 |
Hermione | 14 |
Dumbledore | 60 |
CSV:
Name,Age
Harry,14
Hermione,14
Dumbledore,60
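To see how CSV text maps onto rows, here is a minimal sketch that parses text in the same shape as the example above from an in-memory string rather than a file. The use of io.StringIO and the variable names are assumptions made for illustration; they are not part of the course materials.

```python
import csv
import io

# CSV text in the same shape as the example, held in a string instead of a file.
csvText = "Name,Age\nHarry,14\nHermione,14\nDumbledore,60\n"

# io.StringIO wraps the string so it can be read like an open file.
rows = list(csv.reader(io.StringIO(csvText)))

header = rows[0]    # the first line holds the column names
records = rows[1:]  # the remaining lines hold the data

print(header)
print(records)
```

Note that every value comes back as a string, including the ages, so numeric columns need an explicit conversion such as int(...).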
Python's csv module provides an easy way to read and iterate over a CSV file.
import csv  # the module must be explicitly imported
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    print(csvf)
# leaving the with block implicitly closes the file
# csvf is a reader object that can be iterated over
When we iterate over a regular text file, the loop variable is a string and takes the value of each line in the file, one by one, in order. When we iterate over a CSV reader object, the loop variable is a list and takes the value of each row, one by one, in order.
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    for row in csvf:
        print(row)
We can iterate over a CSV file and accumulate all rows (each of which is a list) into a mega list.
rosterList = []
with open('roster.csv') as myFile:
    csvf = csv.reader(myFile)
    for row in csvf:
        rosterList.append(row)
rosterList # let's see what is in the rosterList
List of lists format. Notice that each item in the list is a row in the original file (in order) and the overall list is a list of rowLists. How can we access the information of a particular student from this nested list?
len(rosterList) # number of students in class
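As a sketch of how indexing works on such a nested list, the example below uses a small made-up roster. The names, the column layout, and the variable sampleRoster are hypothetical stand-ins, not the real class data.

```python
# Hypothetical stand-in for rosterList: each inner list is one CSV row.
sampleRoster = [
    ['Potter,Harry James', '23AAA', 'unused'],
    ['Granger,Hermione Jean', '22AAA', 'unused'],
]

secondRow = sampleRoster[1]      # the whole row for the second student
secondName = sampleRoster[1][0]  # the first column of that row

print(secondRow)
print(secondName)
```

The first index selects a row and the second index selects a column within that row.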
Generating random indices. Remember Homework 1, where you were asked to design an algorithm for generating random numbers? Let's play a game where we generate random numbers between 0 and 31 and index our list with that number to see whose name comes up.
import random # import module to help generate random numbers
randomIndex = random.randint(0, 31)
# generates a random integer between 0 and 31
rosterList[randomIndex]
randomIndex = random.randint(0, 31)
rosterList[randomIndex] # great way of cold calling in lectures !
rosterList[random.randint(0,31)][0]
# Accessing just the name
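The hard-coded 31 assumes the roster has exactly 32 rows. As a hedged alternative, the sketch below derives the upper bound from the list itself, or skips the index entirely with random.choice. The list sampleRoster is a made-up stand-in for rosterList.

```python
import random

# Hypothetical stand-in for rosterList.
sampleRoster = [['A'], ['B'], ['C'], ['D']]

# randint includes both endpoints, so the largest valid index is len - 1.
randomIndex = random.randint(0, len(sampleRoster) - 1)
print(sampleRoster[randomIndex])

# random.choice picks an element directly, with no index arithmetic.
randomRow = random.choice(sampleRoster)
print(randomRow)
```

Either form keeps working if students are added to or dropped from the roster.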
Sometimes your CSV may have unnecessary data that you want to discard (such as the last column in our class roster). Additionally, your rows might have integer values stored as strings (such as class year) that you may want to convert to integers. Let us write a helper function that takes as input a list (which is a row of the CSV file) and outputs a cleaned row as a tuple. The returned tuple must have three items: the first name, the last name, and the two-digit class year.
def reorgData(rowList):
    """Takes a row of a CSV (as a list) and returns
    a tuple of student information"""
    # tuple assignment, splitting last name
    # and first (with middle) name
    lName, fmName = rowList[0].split(',')
    fName = fmName.split()[0]
    year = rowList[1] # takes the form '23AAA'
    yy = int(year[:2])
    return fName, lName, yy
Let us test our reorgData function on a random rowList from the rosterList.
randomIndex = random.randint(0, 31)
reorgData(rosterList[randomIndex])
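Because the index is random, the result of the test above changes on every run. For a repeatable check, here is a sketch that applies the same logic to one made-up row. The row contents are hypothetical, and reorgData is repeated only so that the example runs on its own.

```python
def reorgData(rowList):
    """Same logic as reorgData above, repeated so this example runs alone."""
    lName, fmName = rowList[0].split(',')
    fName = fmName.split()[0]
    yy = int(rowList[1][:2])
    return fName, lName, yy

# Hypothetical row: 'Last,First Middle', class year code, unused column.
sampleRow = ['Potter,Harry James', '23AAA', 'unused']
print(reorgData(sampleRow))  # ('Harry', 'Potter', 23)
```

The tuple assignment on the first line of the function expects exactly one comma in the name field; a name containing a second comma would raise a ValueError.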
In previous lectures we have seen that it is common to use loops in conjunction with accumulation variables that collect results from processing elements within the loop. Let us write some functions that exercise commonly seen accumulation patterns using lists.
Let's get to know our class better! We will write a function yearList which takes two arguments, rosterList (a list of lists) and year (an int), and returns the list of students in the class with that graduating year.
def yearList(classList, year):
    result = []
    # iterate over the parameter classList, not the global rosterList
    for sList in classList:
        # tuple assignment:
        fName, lName, yy = reorgData(sList)
        if yy == year:
            result.append(fName + ' ' + lName)
    return result
len(yearList(rosterList, 23)) # how many first years in class?
yearList(rosterList, 23) # Names of first years
len(yearList(rosterList, 22)) # how many sophomores?
yearList(rosterList, 22) # Names of sophomores
len(yearList(rosterList, 21)) # how many juniors?
yearList(rosterList, 21) # names of juniors
len(yearList(rosterList, 20)) # how many seniors
yearList(rosterList, 20) # name of seniors
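Calling yearList once per year walks the roster four times. As an alternative sketch, collections.Counter from the standard library can tally every year in a single pass. The tuples below are made up and have the (first name, last name, year) shape that reorgData returns.

```python
from collections import Counter

# Hypothetical cleaned rows in the shape returned by reorgData.
students = [
    ('Harry', 'Potter', 23),
    ('Hermione', 'Granger', 23),
    ('Fred', 'Weasley', 21),
]

# Count how many students share each graduating year.
yearCounts = Counter(yy for fName, lName, yy in students)

print(yearCounts[23])  # 2
print(yearCounts[20])  # 0, a year that never appears counts as zero
```

This only gives counts; yearList is still the right tool when the names themselves are needed.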
We built an assortment of functions last week as part of our sequences toolkit. Let's use some of those functions now to find out fun facts about the class. Function names in the __all__ variable of our toolkit:
We can import these functions from our module into our current interactive Python session using the import statement.
from sequenceTools import *
help(countAllVowels)
countAllVowels('onomatopoeia') # test if import works
Another helper function. As we will be analyzing student names, let's create a helper function which extracts names out of the CSV rows (lists).
def getName(sInfo):
    """Takes in a row of the CSV (as a list) and returns
    the string first name concatenated with last name"""
    fName, lName, yy = reorgData(sInfo)
    return fName + ' ' + lName
getName(rosterList[random.randint(0, 31)]) # test on a random student!
Fun Facts. Who has the most number of vowels in their name?
def mostVowelName(classList):
    currentMax = 0 # initialize max value
    persons = [] # initialize list for names
    for sInfo in classList:
        name = getName(sInfo)
        numVowels = countAllVowels(name)
        if numVowels > currentMax:
            # found someone whose name has more vowels
            # than current max: update persons, currentMax
            currentMax = numVowels
            persons = [name] # restart the list with this name
        elif numVowels == currentMax:
            # this name ties the current max, so keep it too
            persons.append(name)
    return persons, currentMax
mostVowelName(rosterList) # which student has most vowels in their name?
Fun Facts. How about the least number of vowels? We can reuse our getName helper function to extract the student names.
def leastVowelName(classList):
    currentMin = 20 # initialize min value
    persons = [] # initialize list for names
    for sInfo in classList:
        name = getName(sInfo)
        numVowels = countAllVowels(name)
        if numVowels < currentMin:
            currentMin = numVowels # update state of current min
            persons = [name] # restart the list with this name
        elif numVowels == currentMin:
            persons.append(name)
    return persons, currentMin
leastVowelName(rosterList) # which student has the fewest vowels in their name?
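The initial value 20 works only as long as some name has fewer than 20 vowels. A sketch of the same minimum-with-ties pattern that avoids guessing a bound starts from float('inf') instead. The names below are made up, and countVowels is a hypothetical stand-in for countAllVowels that assumes only a, e, i, o, u count as vowels.

```python
def countVowels(s):
    """Hypothetical stand-in for countAllVowels: counts a, e, i, o, u."""
    return sum(1 for ch in s.lower() if ch in 'aeiou')

def fewestVowels(names):
    """Returns the names with the fewest vowels, and that vowel count."""
    currentMin = float('inf')  # larger than any possible count
    persons = []
    for name in names:
        numVowels = countVowels(name)
        if numVowels < currentMin:
            currentMin = numVowels
            persons = [name]      # new minimum: restart the list
        elif numVowels == currentMin:
            persons.append(name)  # tie: keep both names
    return persons, currentMin

print(fewestVowels(['Cho Chang', 'Ron Weasley', 'Tom Burns']))
# (['Cho Chang', 'Tom Burns'], 2)
```

One consequence of this choice is that an empty list returns ([], inf), which makes the "nothing found" case easy to spot.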
We can write all the results that we are computing into a file (a persistent structure). To open a file for writing, we use open with the mode 'w'.
The following code will create a new file named studentFacts.txt in the current working directory and write the results of our function calls in it.
with open('studentFacts.txt', 'w') as sFile:
    sFile.write('Fun facts about CS134 students.\n') # need newlines
    sFile.write('No. of first years in CS134: {}\n'.format(len(yearList(rosterList, 23))))
    sFile.write('No. of sophomores in CS134: {}\n'.format(len(yearList(rosterList, 22))))
    sFile.write('No. of juniors in CS134: {}\n'.format(len(yearList(rosterList, 21))))
    sFile.write('No. of seniors in CS134: {}\n'.format(len(yearList(rosterList, 20))))
We can use ls -l to see that a new file studentFacts.txt has been created:
ls -l # new file information
Use the OS command more to view the contents of the file:
more studentFacts.txt
Alternatively, go to Finder (on a Mac) or Windows Explorer (PC) to view the contents of the file.
How do we add lines to the end of an existing file? We can't open the file in write mode (with a 'w'), because that erases all previous contents and starts with an empty file.
Instead, we open the file in append mode (with an 'a'). Any subsequent writes are made after the existing contents.
with open('studentFacts.txt', 'a') as sFile:
    sFile.write('Name with most vowels: {}\n'.format(mostVowelName(rosterList)))
    sFile.write('Name with least vowels: {}\n'.format(leastVowelName(rosterList)))
Open the file studentFacts.txt again to view it, or use the OS command more:
more studentFacts.txt
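The difference between the two modes can also be checked from Python. The sketch below works in a temporary directory so that studentFacts.txt is left alone; the file name demo.txt and the variable names are assumptions for illustration.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmpDir:
    path = os.path.join(tmpDir, 'demo.txt')

    with open(path, 'w') as f:  # 'w' creates the file (or empties it)
        f.write('first\n')
    with open(path, 'a') as f:  # 'a' adds after the existing contents
        f.write('second\n')
    with open(path) as f:
        afterAppend = f.read()

    with open(path, 'w') as f:  # 'w' again discards the old contents
        f.write('third\n')
    with open(path) as f:
        afterRewrite = f.read()

print(repr(afterAppend))   # 'first\nsecond\n'
print(repr(afterRewrite))  # 'third\n'
```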
When iterating over lists, there are several accumulation patterns which come up a lot. In the following questions, the premise is that we have a list we are iterating over and we are returning a new list. There are two common categories of tasks: mapping and filtering.
We can simplify the mapping/filtering patterns with a syntactic device called list comprehension. Let's take an example of each.
We can generate a new list by performing an operation on every element in a given list. This is called mapping.
def mapDouble(nums):
    """Given a list of numbers, returns a *new* list,
    in which each element is twice the corresponding
    element in the input list.
    """
    result = []
    for n in nums:
        result.append(2*n)
    return result
mapDouble([2, 3, 4, 5])
Succinct form using list comprehension.
def mapDoubleShort(nums):
    return [2*n for n in nums]
mapDoubleShort([6, 7, 8])
List of Names. Suppose we want to iterate over our nested list rosterList and collect all the student names in a list. We can do that with a simple mapping list comprehension!
nameList = [getName(sInfo) for sInfo in rosterList]
nameList
Another example. Suppose we want to iterate over a list of names and return a list of first names in lower case.
def firstNames(nameList):
    """Given a list of names in the form 'firstName lastName',
    returns a list of the first names in lower case.
    """
    return [name.split()[0].lower() for name in nameList]
firstNames(['Shikha Singh', 'Iris Howley', 'Lida Doret'])
Another common way to produce a new list is to filter an existing list, keeping only those elements that satisfy a certain predicate.
def filterNames(nameList):
    """Given a list of first names, returns a *new* list of all
    names in the input list that have length >= 9.
    """
    result = []
    for name in nameList:
        if len(name) >= 9:
            result.append(name)
    return result
filterNames(firstNames(nameList))
We can also do this filtering pattern very succinctly using list comprehensions!
def filterNamesShort(nameList):
    return [name for name in nameList if len(name) >= 9]
filterNamesShort(firstNames(nameList))
# Given a list of numbers nums,
# create a list of all the numbers that are even
nums = [1, 2, 3, 4, 5, 6, 7]
result = [n for n in nums if n%2 == 0]
print(result)
# add the ending 'th' to all words in a phrase
phrase = "mine dog ate your shoe"
# expected phrase: ["mineth", "dogth", "ateth", "yourth", "shoeth"]
newPhrase = [word + 'th' for word in phrase.split()]
newPhrase
It is possible to do both mapping and filtering in a single list comprehension. Examine the example below which filters a list by even numbers and creates a new list of their squares.
[(x**2) for x in range(10) if x % 2 == 0]
Note that our expression for mapping still comes before the "for" and our filtering with "if" still comes after our sequence. Below is the equivalent code without list comprehensions.
newList = []
for x in range(10):
    if x % 2 == 0:
        newList.append(x**2)
newList
YOUR TURN: Try to write the following list comprehension examples:
# Example 1: Write a list comprehension that filters the vowels from a word
# such as beauteous and returns a list of its capitalized vowels.
word = "beauteous"
newList = [char.upper() for char in word if isVowel(char)]
newList
# Example 2: Write a list comprehension that filters a list of proper nouns by length.
# It should extract nouns of length greater than 4 but less than 8 and return a list
# where the first letter is properly capitalized
# This is a challenge!
properNouns = ["cher", "bjork", "sting", "beethoven", "prince", "madonna"]
newList = [word[0].upper() + word[1:] for word in properNouns if len(word) > 4 and len(word) < 8]
newList