Lab 10: Hollywood Science

For your last lab, you will write the software necessary to perform data analysis tasks. The goals of this lab are to:

  • Understand and implement hash table data structures.
  • Compute statistics on a large data set using your hash table implementations.

Data oops!

The starter code inadvertently supplied an old version of the movie_data.dat file that has some data quality issues. You can download the updated version here. Note that this data file still has a few duplicate movie titles, so you will need to check for them, but it does not have any missing fields. Keep in mind that matching movie titles alone are not sufficient to determine whether a movie is a duplicate. For example, “A Star Is Born” has been remade four times (1, 2, 3, 4)!


PRE-LAB: Step 0

Due Monday, 4/29 by 5pm.

  • Partner preference form. In this last lab, you will have the opportunity to work with a partner or by yourself. Please fill out the following Google Form to indicate the person you plan to work with. If we do not hear from you, and we have no partner preference from you, we will assume that you intend to work by yourself.

Part 1: Movie statistics

Due Sunday, 5/5 by 11:59pm.

The first part of your assignment is to analyze a large dataset containing 2,994 movies.

Input

The input to your program will be a text file containing movie information. Each row is a distinct movie. A typical row might look like:

The Matrix|Mar 31, 1999|88%|85%|147|33324202|R|Action & Adventure;Science Fiction & Fantasy|Lilly Wachowski;Lana Wachowski|Lana Wachowski;Lilly Wachowski|136|Warner Bros. Pictures

Each row contains the following set of fields:

Title|Release Date|Critics Score|Audience Score|Number of Critic Reviews|Number of Audience Reviews|MPAA Rating|Genres|Directors|Screenwriters|Duration in Minutes|Studio

Note that if you decide to use String.split to split by the | character, you actually need to call .split("\\|"). The reason is that split interprets its argument as a regular expression, in which | is the alternation (“or”) operator, so "|" means “the empty string OR the empty string”.

The Genres, Directors, and Screenwriters fields may contain multiple entries. Entries are semicolon-separated (;).
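As a sketch, here is how those two splits might look in practice (the class and method names are illustrative, not required):

```java
import java.util.Arrays;
import java.util.List;

public class RowSplitExample {
    // Split one pipe-delimited row into its fields. Note the escaped
    // pipe: "|" alone is the regex alternation operator.
    public static String[] fields(String row) {
        return row.split("\\|");
    }

    // Split a semicolon-separated field (Genres, Directors, Screenwriters).
    public static List<String> entries(String field) {
        return Arrays.asList(field.split(";"));
    }

    public static void main(String[] args) {
        String row = "The Matrix|Mar 31, 1999|88%|85%|147|33324202|R|"
            + "Action & Adventure;Science Fiction & Fantasy|"
            + "Lilly Wachowski;Lana Wachowski|Lana Wachowski;Lilly Wachowski|"
            + "136|Warner Bros. Pictures";
        String[] f = fields(row);
        System.out.println(f[0]);                 // The Matrix
        System.out.println(entries(f[7]).size()); // 2
    }
}
```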

You may find it useful to store each movie’s data in a Movie class. If you do so, and you plan to use a Movie as a key in a hash table, be sure to override both equals and hashCode; otherwise, your hash table will not function properly. You do not need to do this if you simply store Movie as a value in a hash table.
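A minimal sketch of those overrides, assuming (hypothetically) that a title plus a release date identifies a movie; choose whichever fields make sense for your design:

```java
import java.util.Date;
import java.util.Objects;

public class Movie {
    private final String title;
    private final Date releaseDate;
    // ... other fields omitted for brevity

    public Movie(String title, Date releaseDate) {
        this.title = title;
        this.releaseDate = releaseDate;
    }

    // Two movies are equal iff their identifying fields match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Movie)) return false;
        Movie m = (Movie) o;
        return title.equals(m.title) && releaseDate.equals(m.releaseDate);
    }

    // equals and hashCode must agree: equal movies get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(title, releaseDate);
    }
}
```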

Date comparison is made significantly simpler by using the java.util.Date class.

DateFormat df = new SimpleDateFormat("MMM dd, yyyy");
Date d = df.parse("Jul 14, 2000");

To use the date parser, you will need to import java.text.DateFormat, java.text.SimpleDateFormat, and java.text.ParseException. Note that you also need to wrap the parse call in a try-catch statement. In Java, code that may throw a checked exception must either catch that exception or declare that it throws it. The DateFormat.parse method may throw a ParseException, so if you use that method, you should wrap your code in a try-catch statement.

Here’s a little function that you may find helpful:

public static Date parseDate(String input) {
	DateFormat df = new SimpleDateFormat("MMM dd, yyyy");
	Date d = null;
	try {
		d = df.parse(input);
	} catch (ParseException e) {
		// handle the failure
		System.out.println("Bad date: " + input);
		System.exit(1);
	}
	return d;
}

There should be no malformed dates in the movie_data.dat file, but you should still plan for failure (e.g., your input may be malformed because your Scanner code is incorrect). The code above tells me that something went wrong and immediately terminates the program, which helps me find and fix the problem easily.


Output

The output of the program should be answers to the following questions:

  1. How many movies are in the dataset?
  2. What movie in the dataset has the earliest (i.e., oldest) release date?
  3. What movie in the dataset has the most recent release date?
  4. How many movies are there for each genre? (List genres in alphabetical order)
  5. What is the most popular genre by mean audience score?†
  6. What are the top five movies by average score (where average score is defined as (critic score + audience score) / 2)?†
  7. What are the top five “cult classics”? We define a cult classic as a movie with a high critic score and a low audience score. The “top” cult classics have the biggest difference between critic score and audience score.†
  8. What are the top five movies that are “so bad it’s good” (SBIG)? We define a SBIG as a movie with a high audience score and a low critic score. The “top” SBIGs have the biggest difference between audience score and critic score.†
  9. Who has directed the most movies?
  10. Which screenwriter wrote for the most genres?
  11. Which director has worked for the most studios? (“N/A” means that the studio is not known, so it is not a valid studio and should not be counted)
  12. What percent of movies are over 3 hours (180 minutes) in length?
  13. Not all movies have the same number of ratings. Compute the movie at the 25th percentile for the number of audience ratings (i.e., the movie that has more audience ratings than 25% of the entire dataset). What is the 50th? The 75th? The 99th?
  14. Compute the 25th, 50th, 75th, and 99th percentile movies for the number of critic ratings.
  15. Come up with your own “top 5” movie ranking based on data in the dataset. Your ranking should have: a) a catchy name (like “cult classic”), and b) a concise mathematical formulation.

For any question marked with a †, only include movies with at least 1,000 audience reviews and at least 50 critic reviews.

The index of the element at the \(p\)th percentile (with \(p\) expressed as a fraction, e.g., 0.25 for the 25th percentile) is defined as (int) Math.ceil(p * n) - 1, where n is the number of elements in an ordered collection ranked from worst to best by some criterion (e.g., number of audience reviews).
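In code, the formula looks like this (a small sketch; the collection is assumed to be sorted worst to best already):

```java
public class PercentileExample {
    // Index of the element at the pth percentile (p as a fraction,
    // e.g. 0.25) in a collection of n elements sorted worst to best.
    public static int percentileIndex(double p, int n) {
        return (int) Math.ceil(p * n) - 1;
    }

    public static void main(String[] args) {
        int n = 2994; // movies in the dataset
        System.out.println(percentileIndex(0.25, n)); // 748
        System.out.println(percentileIndex(0.99, n)); // 2964
    }
}
```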

For many of these questions, you should use a Map to maintain counts for a set of categories. For example, when counting the number of movies for each genre, you would create a Map<String,Integer> where the key is a genre and the value is a count. Note that you may also find it useful to employ a Set data structure.
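For example, a sketch of the counting pattern, shown with java.util.TreeMap for brevity (in the lab itself you would create the map via your MapMaker; TreeMap is used here only because it keeps keys alphabetical):

```java
import java.util.Map;
import java.util.TreeMap;

public class GenreCountExample {
    // Count occurrences of each genre, given one semicolon-separated
    // Genres field per movie. A TreeMap keeps keys in alphabetical
    // order, which is convenient for question 4.
    public static Map<String, Integer> countGenres(String[] genreFields) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String field : genreFields) {
            for (String genre : field.split(";")) {
                Integer c = counts.get(genre);
                counts.put(genre, (c == null) ? 1 : c + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] rows = { "Drama;Comedy", "Drama", "Action & Adventure" };
        System.out.println(countGenres(rows)); // {Action & Adventure=1, Comedy=1, Drama=2}
    }
}
```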

You may find the code for counting the frequency of first letters to be a useful example.


Parameterizing your code for an arbitrary hash table implementation

In Part 2 of this assignment, you will want to be able to call your movie statistics code with an arbitrary hash table. To facilitate this, you will need to create several classes that implement the following interface. Put the following code into a file called MapMaker.java:

import structure5.*;

public interface MapMaker {
    public <K,V> Map<K,V> newMap();
}

Then you will need to create a new maker class for each hash table implementation you plan to use. For example, to use the Hashtable implementation that comes with structure5, put the following code into a file called HashtableMaker.java:

import structure5.*;

public class HashtableMaker implements MapMaker {
    public <K,V> Map<K,V> newMap() {
        return new Hashtable<K,V>();
    }
}

If you put your statistics code in a function, stats, with the following method signature:

public static void stats(MapMaker mm)
public static void stats(MapMaker mm, Set<Movie> movies)

then you can call your statistics program with an arbitrary hash table maker. (Note 2019-05-04: you may find it easier to use a stats program that takes a MapMaker and a Set<Movie>)

To use the MapMaker, instead of calling

... = new Hashtable<K,V>();

you will call

... = mm.newMap();
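To see why this helps, here is a self-contained analogue of the pattern built on java.util.Map (your lab version returns structure5 maps instead; all names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// A stand-in for MapMaker, using java.util.Map so it runs without structure5.
interface Maker {
    <K, V> Map<K, V> newMap();
}

class HashMapMaker implements Maker {
    public <K, V> Map<K, V> newMap() { return new HashMap<K, V>(); }
}

class TreeMapMaker implements Maker {
    public <K, V> Map<K, V> newMap() { return new TreeMap<K, V>(); }
}

public class MakerExample {
    // Statistics-style code asks the maker for a map instead of naming
    // a concrete class, so swapping implementations is a one-line change.
    public static int countDistinct(Maker mm, String[] words) {
        Map<String, Integer> counts = mm.newMap();
        for (String w : words) counts.put(w, 1);
        return counts.size();
    }

    public static void main(String[] args) {
        String[] words = { "a", "b", "a" };
        System.out.println(countDistinct(new HashMapMaker(), words)); // 2
        System.out.println(countDistinct(new TreeMapMaker(), words)); // 2
    }
}
```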

Running your program

Your movie program should be called MovieStatistics.java, and you should be able to call it like:

$ java MovieStatistics < movie_data.dat

Part 2: Other hash algorithms

Due Sunday, 5/12 by 11:59pm.

The second part of your assignment is to write the code for one alternative hash table implementation. It should implement the structure5 Map<K,V> interface. The Map<K,V> interface has the following methods:

public interface Map<K,V>
{
    public int size();                   // returns the number of entries in the map
    public boolean isEmpty();            // returns true iff this map does not contain any entries
    public boolean containsKey(K k);     // returns true iff k is in the domain of the map
    public boolean containsValue(V v);   // returns true iff v is the target of at least one map entry
    public V get(K k);                   // returns the value mapped to from k, or null
    public V put(K k, V v);              // inserts a mapping from k to v in the map
    public V remove(K k);                // removes any mapping from k to a value from the map
    public void putAll(Map<K,V> other);  // all the mappings of other are installed in this map, overriding any conflicting mappings
    public void clear();                 // removes all map entries associated with this map
    public Set<K> keySet();              // returns a set of all keys associated with this map
    public Structure<V> values();        // returns a structure that contains the range of the map
    public Set<Association<K,V>> entrySet(); // returns a set of (key-value) pairs, generated from this map
}

Your hash table implementations should also implement the java.lang.Iterable<V> interface.

You may refer to the structure5 hash table source for guidance.

We suggest that you utilize the HashAssociation class supplied with structure5 in order to store pairs of keys and values.

The backing store (e.g., array or Vector) for your implementation should expand when the load factor exceeds a threshold. A widely used default load factor is 0.75. This is the default for the Java HashMap implementation. When your hash table expands, you need to rehash and reinsert all entries, because their locations may change in the new table. A reasonable default backing store size is 997.
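A hedged sketch of the load-factor check and rehash loop, shown with simple linear probing rather than the double hashing you will implement (field and method names are assumptions, not required API):

```java
public class LoadFactorSketch {
    private Object[] keys;
    private Object[] vals;
    private int count;
    private static final double MAX_LOAD = 0.75;

    public LoadFactorSketch(int capacity) {
        keys = new Object[capacity];
        vals = new Object[capacity];
    }

    // h(k) = h1(k) mod |T|, kept non-negative by clearing the sign bit.
    private int indexFor(Object k, int len) {
        return (k.hashCode() & 0x7fffffff) % len;
    }

    public void put(Object k, Object v) {
        // Expand before the load factor would exceed the threshold.
        if ((double) (count + 1) / keys.length > MAX_LOAD) rehash();
        int i = indexFor(k, keys.length);
        while (keys[i] != null && !keys[i].equals(k)) i = (i + 1) % keys.length;
        if (keys[i] == null) count++;
        keys[i] = k;
        vals[i] = v;
    }

    public Object get(Object k) {
        int i = indexFor(k, keys.length);
        while (keys[i] != null) {
            if (keys[i].equals(k)) return vals[i];
            i = (i + 1) % keys.length;
        }
        return null;
    }

    // Double the backing store and reinsert every entry: positions
    // depend on the table size, so they must be recomputed.
    private void rehash() {
        Object[] oldKeys = keys, oldVals = vals;
        keys = new Object[oldKeys.length * 2];
        vals = new Object[oldKeys.length * 2];
        count = 0;
        for (int i = 0; i < oldKeys.length; i++)
            if (oldKeys[i] != null) put(oldKeys[i], oldVals[i]);
    }

    public int capacity() { return keys.length; }
}
```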


Java hashCode

Hash tables are frequently used in Java. So frequently, in fact, that the designers of the language sought to make it easy to add custom hash functions. These custom hash functions, called hash codes, are defined for every type that extends Object (i.e., every type that is a class). Having a hash code defined on every object in Java allows those objects to be used as keys in hash tables.

The hash code facility also makes it easy for programmers to use hash tables without having to customize the table itself. Instead, when you parameterize a Map, you specify the key (K) and value (V) as generic types (i.e., Map<K,V>) and the Map knows that it must call the hashCode() method on your key (K) type.

Suppose that a hash table utilizes the following hash function:

\(h(k) = h_1(k) \bmod |T| \)

where \(k\) is the key and \(|T|\) is the size of the table, \(T\). Since only the hash table implementation itself knows how big its underlying array-like storage is, \(h(k)\) must be computed by the hash table. But \(h_1(k)\) is independent of the table size, \(|T|\), so it can be defined elsewhere. In Java, \(h_1(k) = \) hashCode(). Built-in types, such as String, Date, etc., already have good hashCode()s defined for them, but you must define hashCode() for any class that you write yourself.
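A small sketch of computing \(h(k)\) inside a table; the bit mask guards against negative hashCode() values, which are legal in Java (and would make % produce a negative, out-of-range index):

```java
public class IndexExample {
    // Compute h(k) = h1(k) mod |T|, where h1 is Java's hashCode().
    // hashCode() may be negative, and Java's % can return a negative
    // result, so we clear the sign bit before taking the remainder.
    public static int indexFor(Object key, int tableSize) {
        return (key.hashCode() & 0x7fffffff) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(indexFor("The Matrix", 997));
    }
}
```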


Your objective: Open addressing with double hashing

Double hashing is a technique designed to improve the behavior of hash tables that use open-addressing. Unlike linear probing, which sequentially scans a hash’s internal array for an available location, double hashing uses a second hash function to find an available location. For Part 2 of the lab, you will need to write a class called DoubleHashtable that implements the double hashing algorithm.

To understand why double hashing is a good idea, we need to introduce you to a concept called “clustering.” In hash tables that use linear probing, when a number of keys hash to the same location, linear probing ends up producing consecutive runs of occupied slots. These “runs” are known as clusters. Clusters are bad because any key that hashes into the neighborhood of a cluster ends up colliding, even though its hash value may differ from every other key’s. Since, in linear probing, you resolve collisions by scanning until you find an empty bucket, clusters grow over time, producing very bad lookup times for keys in certain neighborhoods even when the table is still mostly empty.

Double hashing addresses this problem by searching for the “next available location” by hashing again. It repeats hashing until it finds an empty location. If we have a good hash function, one that promotes uniformity, hash clusters are unlikely.

This alternative hash function is defined as:

\(h(i,k) = (h_1(k) + i \cdot h_2(k)) \bmod |T| \)

where \(k\) is the key to be hashed, \(i\) is the number of the rehash attempt (starting at zero), and \(|T|\) is the size of the table, \(T\). Observe that the first time you call this hash function, \(i\) is zero, so the result is just \(h_1(k) \bmod |T|\).

What is \(h_1(k)\)? Recall, as above, we let \(h_1(k)\) stand for calling hashCode() on the key.

What is \(h_2(k)\)? Here, we must supply a second hash function.


Computing \(h_2(k)\) with the SHA-1 hash function

For \(h_2(k)\), we will use the SHA-1 hash function, which is a high-quality (cryptographic) hash function built into Java. After importing java.security.* and java.nio.ByteBuffer, you may use the following helper method to create SHA-1 hashes:

protected static int toSHA1(byte[] bytes) {
    MessageDigest md = null;
    try {
        md = MessageDigest.getInstance("SHA-1");
    } catch (NoSuchAlgorithmException e) {
        e.printStackTrace();
        System.exit(1);
    }

    // The SHA-1 digest is 20 bytes; getInt() reads the first 4 as an int.
    ByteBuffer bb = ByteBuffer.wrap(md.digest(bytes));
    return bb.getInt();
}

Remember that your hash table must be able to store keys and values of any type, K and V, respectively. One obstacle that you may immediately encounter is that it is not obvious how to obtain the byte representation for any key, which you will need in order to call toSHA1. A good workaround hinges on the fact that you can obtain the hashCode() for any key: instead of getting the byte representation for the key, get the byte representation for the key’s hashCode().

To be clear, this means that you need to write the following function:

protected static <T> byte[] keyToBytes(T key) {
	int hc = key.hashCode();
	byte[] bytes = new byte[4];
	// fill in the rest here;
	return bytes;
}

and then compute \(h_2(k)\) for the key by calling:

protected static <T> int h2(T key) {
	return toSHA1(keyToBytes(key));
}

Obtaining a byte from a word

Recall that we learned how to obtain the values of bits in an int value in lab 7. In this lab, you will extend that lesson to obtain the values of groups of bits from an int value.

The most basic grouping of bits is called a byte. A byte is eight bits. There are 4 bytes in an int. We often refer to 4-byte ints as a word.

Since the toSHA1 function above takes a byte[], you need to convert the int you obtained from hashCode() into a byte[]. We will use the same basic idea that we did in Lab 7: we will shift and mask bits.

Suppose you have the decimal number 3132610186. This number is usually represented on a computer as:

10111010 10110111 11010110 10001010

(spaces added to aid readability)

From left to right, bytes are organized from most to least significant. In other words, the rightmost byte is the least significant part of the number.

What if we want to get the first byte (eight bits) from the right? As in Lab 7, we can mask off the bits we don’t care about using a bitmask and the logical and operator:

  10111010 10110111 11010110 10001010
& 00000000 00000000 00000000 11111111
---------------------------------------
  00000000 00000000 00000000 10001010

The byte 11111111 can be conveniently expressed in hexadecimal (base 16) as 0xFF. You can then cast the result to a byte, à la

byte b = (byte) (3132610186L & 0xFF);

(The L suffix is needed because 3132610186 is too large for a Java int literal; the int you get from hashCode() needs no suffix.)

But what if you want to obtain a different byte? Suppose you want the third byte from the right? You can accomplish this by first shifting. Use the unsigned right shift operator, >>>, which fills the vacated bits on the left with zeros (important because the leftmost bit may be 1, as it is here, and hashCode() values may be negative),

   10111010 10110111 11010110 10001010
>>> 2 * 8
---------------------------------------
   00000000 00000000 10111010 10110111

and then masking:

  00000000 00000000 10111010 10110111
& 00000000 00000000 00000000 11111111
---------------------------------------
  00000000 00000000 00000000 10110111

In code,

byte b = (byte) ((3132610186L >>> 2 * 8) & 0xFF);

In fact, when we extracted the first byte from the right, above, we could have written:

byte b = (byte) ((3132610186L >>> 0 * 8) & 0xFF);

Now, you need to figure out how to obtain the \(i\)th byte from the right. See the pattern?


\(h(i, k)\): putting it all together

To help you with this task, we’ve provided a simple SHATest program that you can look at here. Note that it is missing a portion of keyToBytes that you should figure out yourself.

Once you have a working keyToBytes method, you can compute \(h_2(k)\) by calling toSHA1(keyToBytes(key)). This will allow you to implement \(h(i,k)\) which you will need to perform the probing procedure. Your \(h(i,k)\) method should be defined using the following method signature:

protected int hash(int i, T key);
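One possible shape for that method, sketched with a stand-in h2 (in your lab, h2 is the SHA-1-based function above; the long arithmetic and sign fix-up are one way to keep the result a valid, non-negative index):

```java
public class DoubleHashSketch {
    private final int tableSize;

    public DoubleHashSketch(int tableSize) {
        this.tableSize = tableSize;
    }

    // Stand-in for the SHA-1-based h2 from the writeup; any secondary
    // hash works for illustrating the probe sequence.
    private static int h2(Object key) {
        int h = key.hashCode();
        return h ^ (h >>> 16);
    }

    // h(i,k) = (h1(k) + i * h2(k)) mod |T|, kept non-negative.
    public int hash(int i, Object key) {
        long h = (long) key.hashCode() + (long) i * h2(key);
        int m = (int) (h % tableSize);
        return m < 0 ? m + tableSize : m;
    }

    public static void main(String[] args) {
        DoubleHashSketch t = new DoubleHashSketch(997);
        // i = 0 is just h1(k) mod |T|; later attempts jump by h2(k).
        for (int i = 0; i < 3; i++)
            System.out.println(t.hash(i, "The Matrix"));
    }
}
```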

Finally, use your hash function to implement two methods. The first, locate, returns the index of the given key, or -1 if it is not in the table.

protected int locate(K key);

The second method, locateEmpty, returns the first available (empty) index for the key. We use locateEmpty to perform insertions.

protected int locateEmpty(K key);

Your implementation should be in a file called DoubleHashtable.java.


Bonus: Secondary clustering

One last thing: although double hashing does not suffer from the same clustering problem as linear probing, it is susceptible to another form of clustering for certain table sizes. The scenario you would like to avoid is when the hash value and the table size share a common factor. Common factors lead to a phenomenon called “secondary clustering.”

A little more formally, suppose your hash value is \(a\) and your table size is \(b\). Since our hash function does not necessarily produce values of \(a < b\), we compute \(a \bmod b\) to find the hash table location. But now suppose that \(a\) and \(b\) share a common factor, \(c\). Let \(a = a’ \cdot c\) and let \(b = b’ \cdot c\). Finding a hash location is now \((c \cdot a’) \bmod (c \cdot b’) = c \cdot (a’ \bmod b’)\), which is always a multiple of \(c\). What this tells us is that, if our secondary hash produces values sharing a factor with the table size, we haven’t actually helped the situation. Instead of having \(b\) possible locations to hash to, we can only reach \(b’\) of them, which is \(\frac{1}{c}\) the number of locations we thought we had. An effectively smaller hash table means: more collisions.

The fix is to ensure that \(b\), the table size, shares no common factor with the hash value. Cool theorem: any positive value \(v\) less than a prime number \(p\) is “relatively prime” to \(p\). If \(v\) is relatively prime to \(p\), it just means that their greatest common divisor is 1. So we just need a prime hash table size. For this assignment, you can just double the table size, but as an optional extension, you can find a prime table size (that is at least 2x) instead.
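If you attempt that extension, a simple trial-division sketch suffices at these table sizes (method names are illustrative):

```java
public class PrimeSizeExample {
    // Smallest prime >= n, by trial division.
    public static int nextPrime(int n) {
        if (n <= 2) return 2;
        int c = (n % 2 == 0) ? n + 1 : n;
        while (!isPrime(c)) c += 2;
        return c;
    }

    private static boolean isPrime(int n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (int d = 3; (long) d * d <= n; d += 2)
            if (n % d == 0) return false;
        return true;
    }

    public static void main(String[] args) {
        // Growing a 997-slot table: at least double, and prime.
        System.out.println(nextPrime(2 * 997)); // 1997
    }
}
```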


Separate chaining using the move-to-front heuristic

(NOTE: No longer required as of 2019-05-05.)

An alternative to open-address hashing is called external chaining. In this scheme, a linked list is used when key collisions occur. The ChainedHashtable implementation supplied with the structure5 package uses a SinglyLinkedList to maintain this chain. Upon key collisions, the ChainedHashtable adds the key and value to the appropriate “chain” (SinglyLinkedList).

You will implement a small optimization to this scheme: instead of using an ordinary SinglyLinkedList, your hash table implementation should use a self-reorganizing singly-linked list called an MTFSinglyLinkedList. This list will use the move-to-front heuristic: when the list is searched, and a key is found, the node containing the key is moved to the front of the list.
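Although this part is no longer required, the heuristic itself is small; here is a sketch on a bare node chain (field and class names are assumptions, not the structure5 API):

```java
public class MTFListSketch {
    static class Node {
        Object value;
        Node next;
        Node(Object value, Node next) { this.value = value; this.next = next; }
    }

    private Node head;

    public void addFirst(Object v) { head = new Node(v, head); }

    // Search for v; if found, unlink its node and move it to the front
    // so that frequently accessed values stay near the head.
    public boolean contains(Object v) {
        Node prev = null;
        for (Node cur = head; cur != null; prev = cur, cur = cur.next) {
            if (cur.value.equals(v)) {
                if (prev != null) {        // not already at the front
                    prev.next = cur.next;  // unlink
                    cur.next = head;       // splice in at the head
                    head = cur;
                }
                return true;
            }
        }
        return false;
    }

    public Object first() { return head == null ? null : head.value; }
}
```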

Your hash table should be in a file called MTFChainedHashtable.java.


Map evaluation

Now that you have several Map implementations to work with, you can answer the following questions about your implementations.

  1. For which hash table does your movie analysis code run the fastest? Compare DoubleHashtable and MTFChainedHashtable with the Hashtable and ChainedHashtable from structure5. This means that you will need to implement MapMakers for every hash table implementation you use.
  2. Write a small exercise program that inserts 10,000 random entries (i.e., a random key; use a counter value for the value) and then prints them back out.
    1. Which hash table implementation has the fastest insertion?
    2. Which hash table implementation has the fastest retrieval?
    3. Which hash table implementation is the fastest overall?
    4. Which hash table implementation suffers from the fewest collisions? (only need to evaluate DoubleHashtable and MTFChainedHashtable for this question)
    5. How many collisions does DoubleHashtable suffer from? It’s OK for the number of collisions to vary from run to run since you are using random entries. What’s a typical number?

You can measure the time of a task in Java using:

final long startTime = System.currentTimeMillis();
// your code
final long endTime = System.currentTimeMillis();
final long totalTime = endTime - startTime; // duration in ms

You should perform all insertion operations first, and then perform all retrievals. This will also make measuring time more convenient.
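A hedged sketch of that two-phase shape, using java.util.HashMap as a stand-in for whichever implementation you are measuring (the seed and counts here are arbitrary choices):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class TimingSketch {
    // Insert n random keys (counter values as values), then retrieve
    // them all, timing each phase separately.
    public static long[] measure(Map<Integer, Integer> map, int n, long seed) {
        Random rng = new Random(seed);
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) keys[i] = rng.nextInt();

        // Phase 1: all insertions, timed on their own.
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < n; i++) map.put(keys[i], i);
        long insertMs = System.currentTimeMillis() - t0;

        // Phase 2: all retrievals.
        t0 = System.currentTimeMillis();
        long found = 0;
        for (int i = 0; i < n; i++)
            if (map.get(keys[i]) != null) found++;
        long retrieveMs = System.currentTimeMillis() - t0;

        return new long[] { insertMs, retrieveMs, found };
    }

    public static void main(String[] args) {
        long[] r = measure(new HashMap<Integer, Integer>(), 10000, 136L);
        System.out.println("insert ms: " + r[0]);
        System.out.println("retrieve ms: " + r[1]);
        System.out.println("found: " + r[2]);
    }
}
```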


Running your program

Your movie program should be called MapEvaluation.java, and you should be able to call it like:

$ java MapEvaluation

Extensions

There are many possible hash table implementations. For bonus credit, try implementing one or more of the following alternative hash table designs. For all of these, you may use existing structure5 data structures if they help you:

  • Cuckoo hash
  • Chaining using a self-balancing binary search tree
  • Coalesced hash
  • Hopscotch hash

Be sure to tell us which extensions you implemented/attempted to implement by putting a note in your README.md.


Thought Questions

In addition, answer the following questions in a file called PROBLEMS.md, and submit it with the rest of your code this week.

  1. Under what condition is a MapList preferable to a Hashtable?
  2. When using linear probing with a rehashing jump size of greater than 1, why is it necessary to have the hash table size and jump size be relatively prime?
  3. Design a hashCode method for a class that represents a telephone number.
  4. When 23 randomly selected people are brought together, chances are greater than 50 percent that two have the same birthday. What does this tell us about uniformly distributed hash codes for keys in a hash table?
  5. Because all objects in Java inherit a hashCode method, our hash table objects also have hash codes. Under what circumstances would such hash codes be useful? Can you think of a simple hash code for hash tables?

Lab Deliverables

We provide several input files, small.txt, medium.txt, and large.txt to help you design and test your program. We recommend that you initially develop your code using the small.txt file, and then move on to the larger files as you gain confidence in your code.

By the start of lab, you should see a new private repository called cs136lab10_movies-{USERNAMES} in your GitHub account (where USERNAMES is replaced by your usernames).

For this lab, please submit the following:

cs136lab10_movies-{USERNAMES}/
    README.md
    ChainedHashtableMaker.java
    DoubleHashtable.java
    DoubleHashtableMaker.java
    MovieStatistics.java
    movie_data.dat
    HashtableMaker.java
    MapEvaluation.java
    MapMaker.java
    {any other classes}

where {any other classes} should contain any other classes you may need to create to solve this lab.

Recall that in previous labs you were given a Java file containing a convenient main method pre-populated with a variety of helpful tests. It is always good practice to create a small set of tests to facilitate development, and you are encouraged to do so here.

As in all labs, you will be graded on design, documentation, style, and correctness. Be sure to document your program with appropriate comments, a general description at the top of the file, and a description of each method with pre- and post-conditions where appropriate. Also, use comments and descriptive variable names to clarify sections of the code which may not be clear to someone trying to understand it.

Whenever you see yourself duplicating functionality, consider moving that code to a helper method. There are several opportunities in this lab to simplify your code by using helper methods.


Submitting Your Lab

As you complete portions of this lab, you should commit your changes and push them. Commit early and often. When the deadline arrives, we will retrieve the latest version of your code. If you are confident that you are done, please use the phrase "Lab Submission" as the commit message for your final commit. If you later decide that you have more edits to make, it is OK. We will look at the latest commit before the deadline.

  • Be sure to push your changes to GitHub.
  • Verify your changes on Github. Navigate in your web browser to your private repository on GitHub. It should be available at https://github.com/williams-cs/cs136lab10_movies-USERNAMES. You should see all changes reflected in the files that you push. If not, go back and make sure you have both committed and pushed.

We will know that the files are yours because they are in your git repository. Do not include identifying information in the code that you submit. We grade your lab programs anonymously to avoid bias. In your README.md file, please cite any sources of inspiration or collaboration (e.g., conversations with classmates). We take the honor code very seriously, and so should you. Please include the statement "We are the sole authors of the work in this repository." in the comments at the top of your Java files.


The squeaky wheel gets the grease

We are always looking to make our labs better. Please submit answers to the following questions using the anonymous feedback form for this class:

  1. How difficult was this assignment on a scale from 1 to 5 (1 = super easy, …, 5 = super hard).
  2. Anything else that you want to tell us?

Bonus: Mistakes

Did you find any mistakes in this writeup? If so, add a file called MISTAKES.md to your GitHub repository and describe the mistakes using bullets. For example, you might write

* Where it says "bypass the auxiliary sensor" you should have written "bypass the primary sensor".
* You spelled "college" wrong ("collej").
* A quadrilateral has four edges, not "too many to count" as you state.

You will receive 1 bonus point on this assignment for each mistake we are able to validate.


  • CSCI 136: Data Structures and Advanced Programming, Spring 2019