In today's lecture we will learn how we can sort data in other ways than the default sorting behaviour given by Python, using anonymous lambda functions. We will use this sorting functionality to plot some data using Python's matplotlib
module.
Acknowlegement. This notebook has been adapted from the Wellesley CS111 Spring 2019 course materials (http://cs111.wellesley.edu/spring19).
The built-in function sorted
always returns a new list:
numbers = [35, -2, 17, -9, 0, 12, 19]
sorted(numbers)
Make sure that numbers
hasn't changed:
numbers
By default the list is sorted in the ascending order, but we can easily reverse the order:
sorted(numbers, reverse=True)
Sorting other sequences Strings and tuples can also be sorted in the same way. The result is always going to be a new list.
phrase = 'Red Code 1'
sorted(phrase) # sorted based on ascii ordering
Just as in the case of the list numbers
in the above example, the string value of phrase
hasn't changed:
phrase
This is to be expected, because strings are immutable.
digits = (9, 7, 5, 3, 1) # this is a tuple
sorted(digits)
Notice that the result of the sorting is a list, not a tuple.
digits
The original tuple value hasn't changed.
We can sort list of sequences such as list of strings, list of tuples, and list of lists.
Sorting the list of tuples and the list of lists is going to be similar. The same principles will apply.
# a long string that we will split into a list of words
phrase = "99 red balloons *floating* in the Summer sky"
words = phrase.split()
words
sorted(words)
listOfLists = [[3, 1], [3, 10], [15, 2]]
sorted(listOfLists)
Tuples are compared element by element, starting with the one at index 0. This is known as lexicographic order, which is a generalization of dictionary order on strings in which each tuple element generalizes a character in a string.
triples = [(8, 'a', '$'), (7, 'c', '@'),
(7, 'b', '+'), (8, 'a', '!')]
sorted(triples)
Question. What happens in the case of ties for the first elements of tuples?
Notice that Python keeps comparing elements of the tuples until we find two that are not the same.
ord('!') < ord('$')
Question. How do we sort a tuple based on the second item rather than first?
sorted()
on dictionaries and a list of dictionaries¶What does the sorted function do when we use it of a dictionary or a list of dictionaries? Recall that a dictionary inherently is an unordered collection.
daysOfMonth = {'Jan': 31, 'Feb': 28, 'Mar': 31, 'Apr': 30,
'May': 31, 'Jun': 30, 'Jul': 31, 'Aug': 31,
'Sep': 30, 'Oct': 31, 'Nov': 30, 'Dec': 31}
sorted(daysOfMonth)
Question. How do we sort a dictionary by its values rather than keys?
Sorting a list of dictionaries. Suppose we wanted to sort the following list of fruit dictionaries by their weight parameter. Python does not know how to sort dictionaries by default as they have no pre-defined order.
fruitDictsList = [{'name': 'apple', 'weight': 20},
{'name': 'orange', 'weight': 5},
{'name': 'kiwi', 'weight': 15.7}]
sorted(listOfDicts)
# by default Python does not know how to compare two dicts
Question. How do we sort a list of dictionary by some parameter, such as their weights in the above example?
Now we will look at the general question of how do we sort in a way that is different from Python's default behaviour (sorting lexicographically)?
missionTuples = [('Apollo 11', 8), ('Mercury-Redstone 3', 1),
('Apollo 13', 5), ('Gemini 3', 1), ('Little Joe 6', 1)]
missionLists = [['Apollo 11', 8], ['Mercury-Redstone 3', 1],
['Apollo 13', 5], ['Gemini 3', 1], ['Little Joe 6', 1]]
sorted(missionTuples) # by default sorts by item 1: missionName
Our Problem: How to sort the missions not by name (default), but by days?
key
parameter¶The function sorted
allows us to perform sorting by using the key
parameter, which specifies a function that for each element determines how it should be compared to other elements.
help(sorted)
def days(missions):
"""Takes a sequence and returns the item at index 1"""
return missions[1]
sorted(missionTuples, key=days) # sort by num of days
sorted(missionLists, key=days, reverse=True) # works the same for lists
Exercise Write a function lenName
that when used as key
will allow us to sort by the length of mission names.
# define function that returns length of mission names
def lenName(mission):
"""Takes a sequence and returns the length of item at index 0"""
return len(mission[0])
# use the function as key for sorting
sorted(missionTuples, key=lenName)
Things get more complicated in situations where elements in a list to be sorted have the same key, and we want to sort by a second key. For example consider the following mission tuples. The first three when sorted by days have a tie (each go on for 1 day).
sorted(missionTuples, key=days) # sorts by days
Question. The resulting tuples are indeed sorted by days, but how to explain the order for items with the same days?
Python's sorting functions/methods are stable, which means that items that are equal according to the sorting key have the same relative order as in the original list.
How can we sort missions with the same number of days, alphabetically by the mission name?
def programName(mission):
"""Takes a sequence and returns the item at index 0"""
return mission[0]
# A key function that returns a tuple to specify
# lexicographic ordering by the elements of that tuple.
def daysProgram(missions):
return (days(missions), programName(missions))
# write the sorted expression to sort first by days
# and then alphabetically by name
sorted(missionTuples, key=daysProgram)
lambda
Notation for Anonymous Functions¶Thus far we have always used def
to create functions that have names.
But Python has a lambda
notation for anonymous functions that need not be named:
lambda
parameterName :
expressionForReturnedValue
Below is how we can use lambda
to replace the function days
:
sorted(missionTuples, key=lambda missions: missions[1])
lambda
notation is very convenient specifying a function exactly where we need it (like in the key
parameter of sorted
) rather than having to define a named function elsewhere via def
and then use the name.
lambda
¶The function created by lambda
can be invoked directly on an argument:
(lambda x: x*x)(4)
Although functions created by lambda
are initially anonymous, they can be named by variables:
sq = lambda x: x*x
sq(4)
lambda
¶lambda
functions are particularly handy to express a key
function in sorting.
Predict the element ordering in the result of the following sorting examples, then verify.
sorted(missionLists, key=lambda missions: len(missions[0]))
Exercise. Write an expression to sort with lambda notation misstionTuples
in ascending order first by number of days and then by length of the program name.
sorted(missionLists, key=lambda missions: (missions[1], len(missions[0])))
Questions.
weight
key in the fruitDictsList
example?daysOfMonth = {'Jan': 31, 'Feb': 28, 'Mar': 31, 'Apr': 30,
'May': 31, 'Jun': 30, 'Jul': 31, 'Aug': 31,
'Sep': 30, 'Oct': 31, 'Nov': 30, 'Dec': 31}
sorted(daysOfMonth)
Function to access keys To use the key parameter of sorted we need a way a function which when given a dictionary, accesses its key if it exists, and returns the corresponding value.
sorted(daysOfMonth, key=daysOfMonth.get) # replace ? here
Sorting a list of dictionaries. Recall the the following lists of dictionaries fruitDictsList
where we want to sort them by their weights.
fruitDicts = [{'name': 'apple', 'weight': 20},
{'name': 'orange', 'weight': 15},
{'name': 'kiwi', 'weight': 15}]
# by default Python does not know how to compare two dicts
Using Sorting with Lambda. We can sort the above list of dictionaries based on the weight values of each key using lambda sorting. What should we write to accomplish that?
# write the expression
sorted(fruitDicts, key=lambda fDict: fDict['weight'])
Sorting by some function of name. What if we wanted to sort the dictionaries first by weight and then alphabetically by fruit name if therea are ties?
# write expression here
Now, we will see how we can use the matplotlib
library to visualize data through plots.
%matplotlib inline
# previous line causes the plots to show inside the notebook,
# as opposed to in a separate window
import matplotlib.pyplot as plt
# as plt means we can now call the functions in the module with plt
# rather than having to type matplotlib.pyplot each time
#the next line to fix resolution for lecture
plt.rcParams['figure.dpi']= 100
plot
function¶We'll start simple by plotting a simple line, and continue from there.
To plot a line, if we provide two values, we can draw a line between them. Notice that if only provide Y coordinates, matplotlib assigns X coordinates as indices (0, 1, 2, etc.)
plt.figure(figsize=(2,2))
plt.plot([10, 20]) # plot y coordinates 10 and 20
plt.show()
Notice that the two points being plotted are (0, 10) and (1, 20).
We provide two separate lists, one with all x coordinate, and one with all y coordinates.
plt.figure(figsize=(3, 3))
plt.plot([1, 2, 5], [1, 5, 7])
plt.show()
We can add more elements to the plot, such as tick values that we want, labels for the axes, title of the plot, etc.
plt.figure()
plt.plot([0, 5, 10], [4, 12, 14])
plt.xticks([0, 5, 10], # points where to show the tick
['x:p1', 'x:p2', 'x:p3']) # values to show for ticks
plt.yticks([4, 12, 14],
['y:p1', 'y:p2', 'y:p3'],
rotation=90) # rotate the tick labels, because are shown horizontally
plt.xlabel("plotting some x values")
plt.ylabel("plotting some y values")
plt.title("Learning how to plot")
plt.show()
frequencies
with get
¶Recall we wrote the following frequencies
function that takes in a list words and returns a dictionary where each key is a word in the list and its value is the number of times that word appears in the list.
# we wrote this together in last lecture
def frequencies(wordList):
"""Given a list of words, returns a dictionary of word frequencies"""
freqDict = {}
for word in wordList:
freqDict[word] = freqDict.get(word, 0) + 1
return freqDict
# you wrote this in lab 3
def fileToList(filename):
wordList = []
for line in open(filename):
wordList.extend(line.strip().split())
return wordList
bookWords = fileToList('prideandprejudice.txt')
# test our function on pride and prejudice
pridePrejDict = frequencies(bookWords)
len(pridePrejDict)
Question. How do we sort the dictionary of words based on the frequency of the words (highest to lowest)?
sortedWordList = sorted(pridePrejDict, key=pridePrejDict.get, reverse=True) # replace ? with expression
sortedWordList[1:10] # lot of common words!
Removing boring stopwords. Such words are common when processing text files, they are called "stop words" which you remove when analyzing the words. This stopwords.txt
is a collection of such words downloaded from https://gist.github.com/larsyencken/1440509.
stopWords = fileToList('stopwords.txt')
List Comprehension to Filter. We want to filter out all words in list stopWords
from the list sortedWordList
and return the result as a new list. How can we accomplish this using a list comprehension?
topWords = [word for word in sortedWordList if word not in stopWords] # write list comprehension
len(topWords)
topTenWords = topWords [0:10] # top ten words
topTenWords
Lets plot the frequency of the top ten words in Pride and Prejudice as a bar plot using matplotlib
.
labels = topTenWords[::-1] # using words as labels
# reverse order to plot least freq to most
positions = list(range(len(labels)))
# for each word in topSortedWords[0:10]
# lets get their frequencies from the dictionary
values = [pridePrejDict[word] for word in labels]
# Create a new figure:
plt.figure()
# Create a bar chart
plt.bar(positions, values)
# Set x tick labels from names
# rotate by 90 so labels are vertical and do not overlap
plt.xticks(positions, labels, rotation=90)
# Set title and label axes
plt.title("Frequency of common words in Pride and Prejudice")
plt.xlabel("words")
plt.ylabel("frequency")
# Use a 'tight' layout to avoid cutting off rotated xticks
plt.tight_layout()
# Show our chart:
#plt.show()
plt.savefig('wordFreqPlot.pdf')