CSCI 136 :: Fall 2020
Data Structures & Advanced Programming
Optional Lab : revisit Lab 2 with hash tables
This lab handout describes an optional exercise to explore some of the benefits of hash tables.
In our second lab, we used the Vector and Association classes to implement "map" functionality, although we didn't call it that at the time.
By revisiting our lab 2 implementation, we hope to demonstrate the power of hash tables—both their performance and interface.
We also hope that you can see how far you've come this semester!
Step 1: Lab Program
Consider the following three quotes:
Call me Ishmael. Some years ago--never mind how long
precisely--having repeatedly smelt the spleen respectfully, not to
say reverentially, of a bad cold in his grego pockets, and throwing
grim about with his tomahawk from the bowsprit?
Call me Ishmael. Some years ago--never mind how long
precisely--having little or no money in my purse, and nothing
particular to interest me on shore, I thought I would sail about a
little and see the watery part of the world.
Call me Ishmael, said I to myself. We hast seen that the lesser man is
far more prevalent than winds from the fishery.
The second quote is the first sentence of Herman Melville’s 1851 novel, Moby Dick. The other two quotes were generated “in Melville’s style” using a simple algorithm developed by Claude Shannon in 1948 [1]. You will implement Shannon’s algorithm in this week’s lab.
Level 0 analysis
The following algorithm is based on letter probability distributions. Imagine taking the novel Tom Sawyer and counting the frequency with which each character occurs. You’d probably find that spaces are the most common, that the character 'e' is fairly common, and that the character 'q' is rather uncommon. After completing this level 0 analysis, you’d be able to produce Tom Sawyer-esque text based on character frequencies. The idea is to randomly select each character based on the probability of its occurrence in the text. This random text wouldn’t have much in common with the real novel, but characters would tend to appear with the same proportions. In fact, here’s an example of what a level 0 analysis might produce:
rla bsht eS ststofo hhfosdsdewno oe wee h .mr ae irii ela
iad o r te u t mnyto onmalysnce, ifu en c fDwn oee iteo
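To make this concrete, here is a minimal sketch (not one of the required lab classes) of how a level 0 analysis could look in Java: count each character's occurrences, then sample one character in proportion to those counts. The class name and the sample text are placeholders.

    import java.util.Hashtable;
    import java.util.Random;

    // Level 0 sketch: count character frequencies, then sample in proportion to them.
    public class LevelZero {
        public static void main(String[] args) {
            String text = "call me ishmael. some years ago";   // placeholder source text
            Hashtable<Character, Integer> counts = new Hashtable<>();
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                counts.put(c, counts.getOrDefault(c, 0) + 1);
            }
            // Sample one character: pick a random position among all counted occurrences.
            Random rng = new Random();
            int target = rng.nextInt(text.length());
            for (char c : counts.keySet()) {
                target -= counts.get(c);
                if (target < 0) {
                    System.out.println("sampled: " + c);
                    break;
                }
            }
        }
    }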
Level 1 analysis
Now imagine doing a slightly more sophisticated level 1 analysis by determining the frequency with which each character follows every other character. You would probably discover that 'h' is more likely to follow 't' than 'x', and you would probably discover that a space follows '.' more frequently than it follows ','.
More concretely, if you are analyzing the text "the theater is their thing", 'e' appears after 'h' three times, and 'i' appears after 'h' one time; no other letters appear after 'h'. So, given this example, the probability that 'e' follows 'h' is 0.75, the probability that 'i' follows 'h' is 0.25, and the probability that any other character follows 'h' is 0. Using a level 1 analysis, you could produce randomly generated Tom Sawyer-esque text by 1) picking a starting character, and then 2) repeatedly choosing each next character based on the probabilities of characters following the previously chosen character. Here’s an example of level 1 random text:
"Shand tucthiney m?" le ollds mind Theybooure
He, he s whit Pereg lenigabo Jodind alllld ashanthe ainofevids tre
lin--p asto oun theanthadomoere
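Returning to the "the theater is their thing" example above, here is a small, self-contained Java sketch (again, not one of the required lab classes) that tallies which characters appear immediately after 'h'; it recovers the 3-to-1 split described earlier.

    import java.util.Hashtable;

    // Level 1 sketch: count the characters that appear immediately after 'h'.
    public class AfterH {
        public static void main(String[] args) {
            String text = "the theater is their thing";
            Hashtable<Character, Integer> afterH = new Hashtable<>();
            for (int i = 0; i + 1 < text.length(); i++) {
                if (text.charAt(i) == 'h') {
                    char next = text.charAt(i + 1);
                    afterH.put(next, afterH.getOrDefault(next, 0) + 1);
                }
            }
            // Prints the counts: 'e' -> 3, 'i' -> 1 (print order may vary).
            System.out.println(afterH);
        }
    }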
Level k analysis
Now imagine doing a level k analysis by determining the probability with which each character follows every possible sequence of characters of length k. For example, a level 5 analysis of Tom Sawyer would reveal that 'r' follows "Sawye" more frequently than any other character. After a level k analysis, you would be able to generate Tom Sawyer-esque text by always choosing the next character based on the previous k characters and the probabilities revealed by the analysis.
At only a moderate level of analysis (levels 5-7), randomly-generated text begins to take on many of the characteristics of the source text. It probably won’t make complete sense, but you’ll be able to tell that it was derived from Tom Sawyer as opposed to, say, Moby Dick. Here are some more examples:
(Level 2:) "Yess been." for gothin, Tome oso; ing, in to
weliss of an'te cle - armit. Papper a comeasione, and smomenty,
fropeck hinticer, sid, a was Tom, be suck tied. He sis tred a
youck to themen
(Level 4:) en themself, Mr. Welshman, but him awoke, the
balmy shore. I'll give him that he couple overy because in the
slated snufflindeed structure's kind was rath. She said that the
wound the door a fever eyes that WITH him.
(Level 6:) people had eaten, leaving. Come - didn't stand it
better judgment; His hands and bury it again, tramped herself!
She'd never would be. He found her spite of anything the one was a
prime feature sunset, and hit upon that of the forever.
(Level 8:) look-a-here - I told you before, Joe. I've heard
a pin drop. The stillness was complete, however, this is awful
crime, beyond the village was sufficient. He would be a good
enough to get that night, Tom and Becky.
(Level 10:) you understanding that they don't come around in
the cave should get the word "beauteous" was over-fondled, and
that together" and decided that he might as we used to do - it's
nobby fun. I'll learn you."
Once we have processed input text and stored it in a table structure that allows us to check probabilities, we pick k letters (for example, the first k in the input text) to use as a “seed” for our new text. Then we choose subsequent characters based on the preceding k characters and their probability information.
For now, we will focus on implementing a level 2 analysis. That is, we will compute the next character to print based on the previous two characters only.
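As a rough sketch of that generation loop for k = 2 (assuming the Table interface described under Program Design below; the method and variable names here are otherwise placeholders):

    // Sketch: generate newLength characters of output from an already-built Table.
    // Assumes table.generateOneLetter(kStr) returns one character as a String.
    static String generate(Table table, String inputText, int newLength) {
        StringBuilder output = new StringBuilder(inputText.substring(0, 2));  // seed: first k = 2 chars
        while (output.length() < newLength) {
            String kStr = output.substring(output.length() - 2);   // previous k characters
            // You will need to decide what to do if kStr never appeared in the input.
            output.append(table.generateOneLetter(kStr));
        }
        return output.toString();
    }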
Program Design
When thinking about the design, focus on what would constitute a good data structure for this problem. Your data structure needs to keep a table of frequency information and be able to support two key operations: recording that a particular character followed a given k-character sequence, and producing a next character chosen according to the frequencies recorded for a given k-character sequence.
You might try to save the frequency information in a big array, but the size will quickly become too large. For k = 2, you would need a three-dimensional array with roughly 30 × 30 × 30 = 27,000 entries if you include blanks and punctuation. For larger k, the number of entries grows rapidly.
Instead, you will develop a Table class which is implemented using a Hashtable<String,FrequencyList>: each key should be a k-character sequence (stored as a String), and the associated value should be a FrequencyList. FrequencyList is a class that keeps track of which characters appear after the given sequence, along with a frequency. There are many ways to implement a frequency list. A good start is… another Hashtable!
The frequency list’s Hashtable<String,Integer> should consist of single-character keys (which can be stored simply as a String) and values that are the counts of the number of times that a given letter occurs after the k-character sequence with which the list is associated (stored as an Integer). Think carefully about what methods the frequency list needs and any instance variables that might be useful.
The data structure design built from these two classes has the benefit of requiring only as many entries as necessary for the given input text. You may find it helpful to look carefully at the word frequency program in Section 3.3 of Bailey.
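One possible shape for FrequencyList, sketched with java.util.Hashtable, is shown below. Only the add and sample methods mentioned later in this handout are assumed; the field names and the running total are just one reasonable arrangement, not the required implementation. If you are using the Hashtable from Bailey's structure package instead, the same idea applies, though the available methods may differ slightly.

    import java.util.Hashtable;
    import java.util.Random;

    // Sketch: tracks how often each single character follows one k-character sequence.
    public class FrequencyList {
        protected Hashtable<String, Integer> counts;  // single-character key -> occurrence count
        protected int total;                          // total characters recorded so far
        protected Random rng;

        public FrequencyList() {
            counts = new Hashtable<String, Integer>();
            total = 0;
            rng = new Random();
        }

        // Record one more occurrence of ch after this list's k-character sequence.
        public void add(String ch) {
            counts.put(ch, counts.getOrDefault(ch, 0) + 1);
            total++;
        }

        // Return a character chosen with probability proportional to its recorded count.
        public String sample() {
            int target = rng.nextInt(total);          // assumes at least one character was added
            for (String ch : counts.keySet()) {
                target -= counts.get(ch);
                if (target < 0) {
                    return ch;
                }
            }
            return null;  // unreachable when total > 0
        }
    }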
Your main method for the program should be written in a third class, WordGen, which reads the input text, builds the table, and prints out a randomly-generated string based on the character sequence probabilities from the input text.
To summarize, you will write three classes for this lab: Table, FrequencyList, and WordGen.
Revisit your design from Lab 2! What does each class do?
Allocate the functionality in the appropriate location.
Your WordGen class should handle parsing the input text and calling table.add(String kStr, String nextChar), which should find (or create) the frequency list associated with kStr, then call freqList.add(nextChar) on that FrequencyList object.
To generate new text, your WordGen class should call table.generateOneLetter(kStr), which should find the frequency list associated with kStr, and then call freqList.sample() on that FrequencyList object.
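Here is one possible sketch of Table under that interface. Only the add and generateOneLetter signatures above are given; everything else is simply one way to arrange it.

    import java.util.Hashtable;

    // Sketch: maps each k-character sequence to the FrequencyList of characters that follow it.
    public class Table {
        protected Hashtable<String, FrequencyList> table;

        public Table() {
            table = new Hashtable<String, FrequencyList>();
        }

        // Record that nextChar followed the k-character sequence kStr.
        public void add(String kStr, String nextChar) {
            FrequencyList list = table.get(kStr);
            if (list == null) {                       // first time we have seen kStr
                list = new FrequencyList();
                table.put(kStr, list);
            }
            list.add(nextChar);
        }

        // Return one character chosen according to the frequencies recorded for kStr.
        public String generateOneLetter(String kStr) {
            FrequencyList list = table.get(kStr);
            return (list == null) ? null : list.sample();
        }
    }

In WordGen, the table might then be built with a loop along these lines before generating output (again, just a sketch; how you read the input is up to you):

    // Hypothetical table-building loop in WordGen, for k = 2.
    for (int i = 0; i + 2 < text.length(); i++) {
        table.add(text.substring(i, i + 2), text.substring(i + 2, i + 3));
    }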
How To Proceed
Since this activity is 100% optional, we hope that you work with any number of friends if that helps you to learn (or have fun!). You all have Lab 2 repositories on evolene—these are great starting points. You can directly edit your files from your old submission, you can copy your WordGen, Table, and FrequencyList files to a new directory within your repository, or you could create a branch. It is completely up to you! If you wish to work collaboratively and you would benefit from having a shared space to do so, reach out to one of the instructors and we can create an empty repository for you and any friends you'd like us to add.
Food for Thought
There are a few questions we encourage you to think about as you work on this activity.
Optional Deeper Dive: Word-level Analysis
This section (like the lab) is completely optional. However, we feel it is an interesting extension and we encourage you to explore it.
Instead of working at the character level, you could work at the word level. How would this affect your program design? If you designed your classes appropriately, your FrequencyList.java and Table.java classes need not change. Does word-level analysis make the results better or worse in any way? Is your word-level analysis more or less sensitive to your choice of k? Why?
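If you do try the word-level variant, one hypothetical starting point is to split the input on whitespace and treat k consecutive words (rather than a k-character substring) as the key; your generation step would then need to insert spaces between sampled words. For example, reusing the Table interface sketched above:

    // Hypothetical word-level tokenization for k = 2 words.
    String[] words = text.split("\\s+");
    for (int i = 0; i + 2 < words.length; i++) {
        String kStr = words[i] + " " + words[i + 1];   // key: two consecutive words
        table.add(kStr, words[i + 2]);                 // record the word that follows them
    }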
[1] Claude Shannon, “A mathematical theory of communication”, Bell System Technical Journal, 1948. Shannon’s technique was popularized by A. K. Dewdney’s “A potpourri of programmed prose and prosody”, which appeared in Scientific American, 122-TK, 1989.