Lab 2: Random Writing
This week we will continue to build our skills with the command line and git
.
We will write a program with several classes. This week’s lab program will analyze letter frequencies in text documents and then generate new documents based on those frequencies. Each class will implement unique functionality and together they will solve a larger problem.
We will explore what is called a has a relationship: where one class has a member variable that stores an instance of another class. Careful planning will help tremendously, so take the time to write a well thought-out design document before arriving for lab.
PRE-LAB Step 0: Explore static
There is nothing to hand in for this part of the lab, but make sure you understand the following.
-
What is printed by the following program?
class Example { public int num; public Example(int initial) { num = initial; } public int getNum() { num = num + 1; return num; } public static void main(String args[]) { Example first = new Example(10); Example second = new Example(20); System.out.println(first.getNum()); System.out.println(second.getNum()); } }
- What would be printed if the declaration of
num
were changes as follows?public static int num;
- What does this tell you about
static
fields?
Be sure to ask your TAs and instructors if you have any questions. The static
keyword is one of the more confusing features of the Java programming language, but it is an important concept that will show up in all of our labs (and exams!).
PRE-LAB Step 1: Explore the Structure Package
Browse the javadoc documentation for the structure5
classes. Look up the Vector
and Association
classes. Become familiar with how to look up information on classes in the structure5
package. For this lab, we link to the documentation directly for convenience. Learn how to navigate the documentation yourself, as in future labs we will assume that you know where to find documentation.
PRE-LAB Step 2: Design Document
Please read through the rest of this document before lab. You should prepare a written design for your program before lab on Wednesday. You must think about this lab before writing code! In the past, students have found it very helpful to draw a picture showing the relationship between each of the classes.
Be sure to bring two printouts of your design document to lab, as we will be collecting one of them during lab.
Step 0: PROBLEMS.md
By the start of lab, you should see a new private repository called cs136lab2-USERNAME
in your GitHub account (where USERNAME
is replaced by your GitHub username). Submit answers to the following book questions inside PROBLEMS.md
, a file that was included in the Lab 2 starter code. (Note the difference between Self Check problems and Regular problems in the book!):
- Self Check Problem 3.2
- Self Check Problem 3.3
- Self Check Problem 3.4
- Regular Problem 3.6. Don’t write the class—just answer what the advantage would be.
Step 1: Lab Program
Consider the following three quotes:
Call me Ishmael. Some years ago--never mind how long
precisely--having repeatedly smelt the spleen respectfully, not to
say reverentially, of a bad cold in his grego pockets, and throwing
grim about with his tomahawk from the bowsprit?
Call me Ishmael. Some years ago--never mind how long
precisely--having little or no money in my purse, and nothing
particular to interest me on shore, I thought I would sail about a
little and see the watery part of the world.
Call me Ishmael, said I to myself. We hast seen that the lesser man is
far more prevalent than winds from the fishery.
The second quote is the first sentence of Herman Melville’s 1851 novel, Moby Dick. The other two quotes were generated “in Melville’s style” using a simple algorithm developed by Claude Shannon in 1948 [footnote: 1]. You will implement Shannon’s algorithm in this week’s lab.
In addition to producing mildly entertaining output, this lab introduces you to:
- the
Vector
andAssociation
classes, - building complex structures out of these basic building blocks, and
- the design and implementation of several new classes.
Level 0 analysis
The following algorithm is based on letter probability distributions. Imagine taking the novel Tom Sawyer and counting the frequency with which each character occurs. You’d probably find that spaces are the most common, that the character 'e'
is fairly common, and that the character 'q'
is rather uncommon. After completing this level 0 analysis, you’d be able to produce Tom Sawyer-esque text based on character frequencies. The idea is to randomly select each character based on the probability of it’s occurrence in the text. This random text wouldn’t have much in common with the real novel, but characters would tend to appear with the same proportions. In fact, here’s an example of what a level 0 analysis might produce:
rla bsht eS ststofo hhfosdsdewno oe wee h .mr ae irii ela
iad o r te u t mnyto onmalysnce, ifu en c fDwn oee iteo
Level 1 analysis
Now imagine doing a slightly more sophisticated level 1 analysis by determining the frequency with which each character follows every other character. You would probably discover that 'h'
is more likely to follow 't'
than 'x'
, and you would probably discover that a space follows '.'
more frequently than ','
does.
More concretely, if you are analyzing the text "the theater is their thing"
, 'e'
appears after ‘h'
three times, and 'i'
appears after 'h'
one time; no other letters appear after 'h'
. So, given this example,
- the probability that
'e'
follows'h'
is0.75
, - the probability that
'i'
follows'h'
is0.25
, - the probability that any other letter follows
'h'
is0
.
Using a level 1 analysis, you could produce randomly generated Tom Sawyer-esque text by 1) picking a character, and then 2) always choosing the next character based on the probability of the next character given the chosen character. Here’s an example of level 1 random text:
"Shand tucthiney m?" le ollds mind Theybooure
He, he s whit Pereg lenigabo Jodind alllld ashanthe ainofevids tre
lin--p asto oun theanthadomoere
Level k analysis
Now imagine doing a level k analysis by determining the probability with which each character follows every possible sequence of characters of length k. For example, a level 5 analysis of Tom Sawyer would reveal that 'r'
follows "Sawye"
more frequently than any other character. After a level k analysis, you would be able to generate Tom Sawyer-esque text by always choosing the next character based on the previous k characters and the probabilities revealed by the analysis.
At only a moderate level of analysis (levels 5—7), randomly-generated text begins to take on many of the characteristics of the source text. It probably won’t make complete sense, but you’ll be able to tell that it was derived from Tom Sawyer as opposed to, say, Moby Dick. Here are some more examples:
(Level 2:) "Yess been." for gothin, Tome oso; ing, in to
weliss of an'te cle - armit. Papper a comeasione, and smomenty,
fropeck hinticer, sid, a was Tom, be suck tied. He sis tred a
youck to themen
(Level 4) en themself, Mr. Welshman, but him awoke, the
balmy shore. I'll give him that he couple overy because in the
slated snufflindeed structure's kind was rath. She said that the
wound the door a fever eyes that WITH him.
(Level 6:) people had eaten, leaving. Come - didn't stand it
better judgment; His hands and bury it again, tramped herself!
She'd never would be. He found her spite of anything the one was a
prime feature sunset, and hit upon that of the forever.
(Level 8:) look-a-here - I told you before, Joe. I've heard
a pin drop. The stillness was complete, however, this is awful
crime, beyond the village was sufficient. He would be a good
enough to get that night, Tom and Becky.
(Level 10:) you understanding that they don't come around in
the cave should get the word "beauteous" was over-fondled, and
that together" and decided that he might as we used to do - it's
nobby fun. I'll learn you."
Once we have processed input text and stored it in a table structure that allows us to check probabilities, we pick k letters (for example, the first k in the input text) to use as a “seed” for our new text. Then we choose subsequent characters based on the preceding k characters and their probability information.
For now, we will focus on implementing a level 2 analysis. That is, we will compute the next character to print based on the previous two characters only.
Program Design
Think about the design and prepare a written design description of this program. Just like last week, you should have your design prepared when you come to lab on Wednesday. We will briefly discuss the general structure of the classes together at the beginning of lab. We have provided implementation strategies below to help you decide which classes to include and what functionality each class should implement.
When thinking about the design, focus on what would constitute a good data structure for this problem. Your data structure needs to keep a table of info and be able to support two key operations:
- update the probabilities in the table, given a string of 2 characters and the succeeding character.
- select a next character, given a string 2 characters and the probabilities stored in your table.
You might try to save the frequency information in a big array, but the size will quickly become too large. For k = 2, you would need a three-dimensional array whose size is about 27,000 entries if you include blanks and punctuation. For larger k, the number of entries grows rapidly.
Instead, you will develop a Table
class which is implemented as a Vector
of Association
s. Each Association
should have a character sequence (stored as a String
) as its key, along with a value which is a FrequencyList
. FrequencyList
is a class that keeps track of which characters appear after the given sequence, along with a frequency.
There are many ways to implement a frequency list. A good start is… another Vector
of Association
s. Thus the frequency list’s Vector
consists of Association
s in which the key is a single character (which can be stored simply as a String
) and the value is a count of the number of times that the given letter occurs after the k-character sequence with which the list is associated (stored as an Integer). Think carefully about what methods the frequency list needs and any instance variables that might be useful.
The data structure design built from these two classes has the benefit of requiring only as many entries as necessary for the given input text. You may find it helpful to look carefully at the word frequency program in Section 3.3 of Bailey.
Your main
method for the program should be written in a third class, WordGen
, which reads the input text, builds the table, and prints out a randomly-generated string based on the character sequence probabilities from the input text.
To summarize, you will write three classes for this lab: Table
, FrequencyList
, and WordGen
. The design doc you prepare should include a description of each of them. (In addition to text, you may also find it helpful to illustrate how the three classes interact with each other by creating a drawing or diagram.)
Importing Class Definitions
To ensure that your program can find the implementations of Vector
and Association
, you will need to add the following import
statement to the top of your Java files:
import structure5.*;
The *
tells Java to import all classes inside the structure package. Note: To import the Random
and Scanner
classes use the following imports:
import java.util.Random;
import java.util.Scanner;
In these latter imports, observe that there is no *
, so only the specific class Random
and the specific class Scanner
will be imported from the java.util
package. If you mistakenly write import java.util.*;
, the Java compiler will produce an error message, since another Vector
class is already defined in java.util
. You must use the structure5
version of Vector
for this assignment.
Implementation Strategy
You should build your program in stages that you have planned out ahead of time. While writing and debugging your code, try using a fixed String
constant as the input (e.g., "the theater is their thing"
) and use a fixed k = 2. Using these fixed parameters will help you ensure the correctness of your code since you can verify the correct answers by hand on paper.
After the input is processed you should generate and print out new text using the frequencies in the table. You may start with a fixed sequence of letters that appears in the table or choose starting characters randomly. Generate and print about 500 letters of randomly-generated text so that we can see how your program works. Be sure to handle the special case where you encounter a sequence that has no known successor character in a reasonable way (i.e., your program should not throw an exception).
Once the basic program is working, change it to accept input from the keyboard using the Scanner
class. When using the Scanner
, build up a string of the entire input line by line before performing any frequency analysis. You can use the hasNextLine()
method to find out if there is another line of input ready and the nextLine()
to read the next line if it exists, as in the following code snippet. (This code uses a StringBuffer
to create large strings efficiently.)
Scanner in = new Scanner(System.in);
StringBuffer textBuffer = new StringBuffer();
while (in.hasNextLine()) {
String line = in.nextLine();
textBuffer.append(line);
textBuffer.append("\n");
}
String text = textBuffer.toString();
// 'text' now contains the full contents of the input.
You can signal the end of the input for Java on the Mac or on any Unix system by typing Ctrl-D
on a new line.
You may also read your input from any text file, e.g. whosonfirst.txt
, with this command:
$ java WordGen < whosonfirst.txt
This way of reading files is not specific to Java. “Input redirection” can be used with any program running on any Unix system.
Finally, change WordGen
to support any arbitrary level of analysis k other than 2. If you have designed the structure well, simply passing longer strings to Table
’s methods should be sufficient, and you should not need to change any code except in the WordGen
class.
You can also change your main
method to use a command line parameter to specify the level of analysis. For example, it would be convenient to be able to run the program like so:
$ java WordGen 5 < whosonfirst.txt
to specify a level 5 analysis. Command line options are passed to your main method as an array of strings. You can use the following skeleton code to ensure that there is at least one argument and to convert the first argument to an int
:
public static void main(String args[]) {
if (args.length == 0) {
// no args, so print usage line and do nothing else
System.out.println("Usage: java WordGen ");
} else {
// convert first argument to k
int k = Integer.parseInt(args[0]);
// ... the rest of your code here ...
}
}
Style
As we continue to develop our sense of programming style,
we will continue to add and refine the rules in our checkstyle
tool.
This week, we fixed some of the issues with the first week’s checkstyle
rules. To do so, we refined our “[ERROR]” rule from Lab 1, as explained below.
We are also adding one style requirement this week
that does not involve checkstyle; it involves the standard Java compiler,
javac
.
To earn full style points this week, your code must do two things:
- Pass all of the
checkstyle
“[ERROR]” checks - Compile using
javac
without any messages about “unchecked or unsafe operations”.
Style Details
checkstyle
requirements:
This week, checkstyle
will report an “[ERROR]” if your program
declares a class variable and does not specify an appropriate visibility.
(All class variables must be declared as public
, private
, or protected
.
If you do not specify a class variable’s visibility, the variable is given
“default” visibility, which is very likely not what you want.)
We are defining appropriate as follows:
- all class variables that are not
final
must be declaredprivate
orprotected
(i.e., nopublic
member variables unless they are constants)
To run checkstyle
, you would type the following command at the terminal:
./checkstyle
The ./
is a Unix thing: it tells the terminal where to look for the
checkstyle
program. The .
(dot) tells Unix to look in the current directory.
This will run checkstyle
on every Java program in your directory.
To run checkstyle
on just one Java file, you would instead specify:
./checkstyle SomeFile.java
javac
requirements:
At least two of the classes that you will use to implement your program
use Java generics (Vector
and Association
).
Proper use of Java generics will be enforced as follows.
- Compiling your code with
javac
must not produce the following message:Note: YourProgram.java uses unchecked or unsafe operations. Note: Recompile with -Xlint:unchecked for details.
To fix this problem, make sure to specify the appropriate types for all of your
generic classes when you declare the variable’s type and when you instantiate
an object. For example, to create a Vector
that stores Integer
objects, I would type:
Vector<Integer> intVec = new Vector<Integer>();
That message also tells you how to get more information. Recompile your program as follows (replace with the appropriate file name):
javac YourProgram.java -Xlint:unchecked
The output will give you details about the issue.
Lab Deliverables
When you are finished with your program, answer the questions from Step 0, and put your answers in a separate file called PROBLEMS.md
. Commit and push PROBLEMS.md along with your other lab files as part of your repository. Your repository should have the files:
cs136lab02_wordgen-USERNAME/
checkstyle
FrequencyList.java
MISTAKES.md (optional)
PROBLEMS.md
README.md
Table.java
Wordgen.java
texts/mobydick.txt
texts/Seuss.txt
texts/Twain.txt
texts/whosonfirst.txt
Submitting Your Lab
As you complete portions of this lab, you should commit
your changes and push
them. Commit early and often. When the deadline arrives, we will retrieve the latest version of your code. If you are confident that you are done, please use the phrase "Lab Submission"
as the commit message for your final commit. If you later decide that you have more edits to make, it is OK. We will look at the latest commit before the deadline.
- Be sure to push your changes to GitHub</.u>
- Verify your changes on Github. Navigate in your web browser to your private repository on GitHub. It should be available at https://github.com/williams-cs/cs136lab02_wordgen-USERNAME. You should see all changes reflected in the files that you
push
. If not, go back and make sure you have both committed and pushed.
We will know that the files are yours because they are in your git
repository. Do not include identifying information in the code that you submit. We grade your lab programs anonymously to avoid bias. In your README.md
file, please cite any sources of inspiration or collaboration (e.g., conversations with classmates). We take the honor code very seriously, and so should you. Please include the statement "I am the sole author of the work in this repository."
in the comments at the top your Java files.
Bonus: Word-level Analysis.
Instead of working at the character level, you could work at the word level. Only attempt this after you get the required work finished and be sure to submit it as a separate class so that we can grade your character-level analysis separately. Does this make the results better or worse in any way?
Bonus: Mistakes
Did you find any mistakes in this writeup? If so, add a file called MISTAKES.md
to your GitHub repository and describe the mistakes using bullets. For example, you might write
* Where it says "bypass the auxiliary sensor" you should have written "bypass the primary sensor".
* You spelled "college" wrong ("collej").
* A quadrilateral has four edges, not "too many to count" as you state.
You will receive 1 bonus point on this assignment for each mistake we are able to validate. However, submitted mistakes cannot be used to boost your score above 100%.
[1] Claude Shannon, “A mathematical theory of communication”, Bell System Technical Journal, 1948. Shannon’s technique was popularized by A.K. Dewdney’s A potpourri of programmed prose and prosody that appeared in Scientific American, 122-TK, 1989.