CSCI 136 :: Spring 2021

Data Structures & Advanced Programming

Home | Schedule | Labs | Handouts | Links | CS@Williams

Lab 9: Spot the Difference

Up until this point in class, we've been using git to submit assignments. You may have wondered what exactly gets sent to the server when you run git push. If you make one change, is the entire file reuploaded? It would be much more efficient if, instead, git could calculate what changes were made to the file, and only upload those changes to the server. In fact, git does make this calculation!

      Our goal in this lab is to create a Java program that does exactly this: given two files (an old version and a new version of the file), calculate the minimum number of changes required to transform the old version into the new version of the file.

What's the Difference?

Our goal is to list the differences between the two files on a line by line basis:

For example, let's say the old version of our file is:

match
onlyA

and the new version is:

match
onlyB

Then the output should be:

 < 2: onlyA
 > 2: onlyB
	

This calculation is quite similar to the output of the unix utility diff. You can try running

	% diff sample/verySmallA sample/verySmallB
	

and you will see a similar output. Paying homage to this utility, we'll call our result the diff of two files.

Calculating the Optimal Differences

In this lab, we want to calculate the minimum sequence of differences between the old and new version of the file. How can we make sure that the differences we output are, in fact, the minimum? To ensure an optimal solution, we fall back to a classic technique in the course: we brute force all possibilities using recursion. Let's say we have two files file1 and file2, each stored as Strings. To start, there are two possibilities to obtain the minimum difference between these files:

If the first lines of file 1 and file 2 are the same, then we have a third option:

For any pair of files, we have either 2 or 3 possibilities for how to proceed. We calculate the cost of each recursively. We take the choice that gives the minimum cost, and return that as our solution.

Finally, any recursive algorithm needs a base case. If file1 is empty, then the remaining difference consists of all of file2, with the > character and line number prepended to each line. Similarly, if file2 is empty, then the remaining difference consists of all of file1, with the < character and line number prepended to each line.

Simplifying the Design

As-is, this approach is a bit difficult to work with: we need to spend lots of work splitting the first line off of each file. This is inefficient, and it would significantly clutter up our code. Instead, let's split each file into a Vector of Strings, where each entry is a line of the file.

With this change, we can recurse by keeping track of which lines remain in each file. To recurse, we'll use a helper method:

public Association<Integer, String> diffHelper(int remainingFile1Index, int remainingFile2Index)

diffHelper finds the optimal diff between the suffix of file1 starting at line remainingFile1Index, and the suffix of file2 starting at line remainingFile2Index. It returns an Association between an int (representing the cost of the solution), and a String (representing the actual diff to return).

diffHelper works recursively as described above. We can determine if diffHelper is in a base case (where one of the files is empty) by comparing remainingFile1Index and remainingFile2Index to the size of each vector. If both files are nonempty, diffHelper considers two or three possibilities, each calculated with a recursive call. The first two recursive calls remove one line from file1 and file2; these recursive calls can be made by modifying remainingFile1Index and remainingFile2Index respectively.

Task 1: Implement The Difference Algorithm

Your first task is to implement diffHelper using the recursive method described above, in the file Diff.java. You should store the files using a Vector<String> to simplify your design (this has already been implemented in the starter code). You will probably want to use helper methods to keep your diffHelper method short. You don't need to use any other class yet; task 1 can be completed by only modifying Diff.java.

First, implement the base case, where one of the files has no remaining lines. You may want to use a helper function for the base case. After the base case is implemented, Diff should work correctly when one of the files is empty. In particular, if you run the command

		% java Diff samples/verySmallA.txt samples/empty.txt
	

then the result you obtain should exactly match the sample solution in emptySolution.txt. This is a good opportunity to make sure the format of your output matches the expected format---make sure the output matches before you proceed further in the lab.

Next, implement the recursive algorithm. In some cases, there may be multiple optimal solutions. Make sure you break ties according to the following rules so that your output matches the sample output:

In other words, if removing a line from file1 and file2 leads to the same cost, break the tie towards deleting the line from file1. Only remove a line from both files if it is the unique best solution (of course, we only consider deleting a line from both files if their current lines are equal).

After implementing diffHelper, your java program should run successfully on the inputs verySmallA.txt and verySmallB.txt (matching the output in verySmallSolution.txt), as well as smallA.txt and smallB.txt (matching the output in smallSolution.txt). It will also run correctly on the other sample files, but the time it takes will be prohibitively large.

Improving Efficiency

The recursive algorithm described above is very slow. In fact, if file1 and file2 have n and m lines respectively, then this algorithm takes Ω(2^(n+m)) time.

Part of the reason this algorithm is so slow is due to diffHelper computing the same thing over and over. Each new recursive call to diffHelper requires an entire, possibly slow computation—even if an identical call was made in the past.

We need a way to keep track of previous calls to diffHelper. If diffHelper is called a second time with the same arguments, rather than recomputing the solution from scratch, we want to look up the solution calculated last time it was called. Then, we can return it immediately.

We need a few things to accomplish this. First, we need a way to keep track of which calls have been made to diffHelper. Since diffHelper takes two ints as arguments, let's create a class IntPair that stores two ints.

Then, we want to store, for each diffHelper we have called in the past, the final return value of the method (which is an Association<Integer, String>). We want to be able to lookup if we've made a call with a given IntPair in the past, and if so, find the associated Association and return it. Otherwise, at the end of diffHelper, we want to keep track of the IntPair and the newly calculated return value.

This is an ideal situation for hashing. At the beginning of a recursive call, we get an IntPair consisting of remainingFile1Index and remainingFile2Index to see if their final value has been previously caluclated. If so, we return it. Otherwise, after completing a new recursive call, we want to put the arguments along with the calculated return value into the hash table.

Task 2: Speed Up Diff With Hashing

Implement a version of diffHelper that uses a hash table to avoid repeated calculations in HashDiff.java. The HashDiff class extends the Diff class, so you will have access to all of the variables and methods you already wrote in task 1. You should use your diffHelper from Diff as a staring point, incorporating the hashtable checks described above. Make sure your diffHelper method works correctly before starting task 2! Otherwise, if you made a mistake in your original implementation, you'll need to fix it in two places.

You'll need to create an IntPair class (much of it is already implemented in the starter code). Don't forget to create a hashCode() method for IntPair so that it can be used as a key in a hash table. The hashCode() method of IntPair should return the sum of the integers stored in IntPair.

Once this is completed, HashDiff should be able to correctly diff maSenate2016.txt and maSenate2020.txt in a few seconds. Your solution should match maSenateSol.txt.

		% java HashDiff samples/maSenate2016.txt samples/maSenate2020.txt
	

Task 3: Getting a Better Hash Function

Unfortunately, the hash described above has suboptimal performance: many pairs share the same hash code. For example, if remainingFile1Index is 3, and remainingFile2Index is 1, we'll get the same hash code as when remainingFile1Index is 2 and remainingFile2Index is 2.

As a result, you may find that your program still cannot compute the diff of whosonfirst.txt and whosonfirst2.txt in a reasonable time. (It takes just over 15 minutes on my machine.)

To speed up execution, implement the following hashCode() method in the BetterIntPair class (BetterIntPair extends IntPair, so you only need to override the hashCode() method). The instance variables used in this method are already included in the starter code.

		return ((hashMult * int1 + hashAdd) % hashPrime) +
				((hashMult * int2 + hashAdd) % hashPrime);
	

Multiplying by a large random number, adding a second large random number, and finally taking the result modulo a large random prime is one of the best-known methods for hashing integers (or, in this case, pairs of integers). And it should work particularly well here.

Create a new implementation of diffHelper() in BetterHashDiff.java that uses the BetterIntPair class in the hash table. This implementation will likely be very similar to the version of diffHelper() implemented in HashDiff.java. With this improvement, BetterHashDiff should be able to quickly and correctly calculate the diff for all sample files. In particular, BetterHashDiff should be able to quickly calculate the diff of whosonfirst.txt and whosonfirst2.txt, obtaining a solution matching whosonfirstSol.txt

		% java BetterHashDiff samples/whosonfirst.txt samples/whosonfirst2.txt
	

Task 4: Thought Questions

Answer the following questions in PROBLEMS.md.

Pre-lab

Before lab, please do the following:

checkstyle requirements:

For this lab, we will be not be adding any new checkstyle rules to the set of rules that we have used in previous weeks.

We STRONGLY ENCOURAGE you to run checkstyle early and often when developing your code, and try to program in a way that minimizes WARNING messages. The checkstyle rules that we use in this course are based on real-world style guides; internalizing good style practices will help us write more readable code.

In total, checkstyle will enforce the following guidelines:



To run checkstyle, you would type the following command at the terminal:


          $ ./checkstyle
        

The ./ is peculiar to Unix: it tells the terminal to look for the checkstyle program in the current directory. This command will run checkstyle on every Java program in your directory. To run checkstyle on a specific Java file, type:


          $ ./checkstyle SomeFile.java
    

Lab Deliverables

For this lab, please submit the following:

As in all labs, you will be graded on design, documentation, style, and correctness. Be sure to document your program with appropriate comments, including a general description at the top of each Java class, a description of each method with pre- and post-conditions where appropriate. Also use comments and descriptive variable names to clarify sections of the code which may not be clear to someone trying to understand it.

Submitting Your Lab

As you complete various milestones, you should commit your changes and push them. Commit early and often. When the deadline arrives, we will retrieve the latest version of your code. If you are confident that you are done, please include "Lab Submission" as the commit message for your final commit. If you later decide that you have more edits to make, it is OK. We will look at the latest commit before the deadline.

We will know that the files are yours because they are in your git repository. Do not include identifying information in the code that you submit. Our goal is to grade the programs anonymously to avoid any bias. However, in your README.md file, please cite any sources of inspiration or collaboration (e.g., conversations with classmates). We take the honor code very seriously, and so should you. Please include the statement "I am the sole author of the work in this repository" in the comments at the top each of your java files.