CSCI 136

CSCI 136 :: Spring 2021

Data Structures & Advanced Programming

Lab 9: Spot the Difference

Up until this point in class, we've been using git to submit assignments. You may have wondered what exactly gets sent to the server when you run git push. If you make one change, is the entire file reuploaded? It would be much more efficient if, instead, git could calculate what changes were made to the file, and only upload those changes to the server. In fact, git does make this calculation!

Our goal in this lab is to create a Java program that does exactly this: given two files (an old version and a new version of the file), calculate the minimum number of changes required to transform the old version into the new version of the file.

What's the Difference?

Our goal is to list the differences between the two files on a line by line basis:

If a line was not included in the old version of the file, but is included in the new version, it should be prefixed by > and its line number.
On the other hand, if a line was in the old version, but has been removed in the new version, it should be prefixed by < and its line number.
If a line stays the same between the two versions, it should not be included in the output.

For example, let's say the old version of our file is:

match
onlyA

and the new version is:

match
onlyB

Then the output should be:

 < 2: onlyA
 > 2: onlyB

This calculation is quite similar to the output of the unix utility diff. You can try running

	% diff sample/verySmallA sample/verySmallB

and you will see a similar output. Paying homage to this utility, we'll call our result the diff of two files.

Calculating the Optimal Differences

In this lab, we want to calculate the minimum sequence of differences between the old and new version of the file. How can we make sure that the differences we output are, in fact, the minimum? To ensure an optimal solution, we fall back to a classic technique in the course: we brute force all possibilities using recursion. Let's say we have two files file1 and file2, each stored as Strings. To start, there are two possibilities to obtain the minimum difference between these files:

It may be that the minimum difference is obtained by deleting the first line of file1. Then the minimum number of differences is 1 plus the minimum number of differences between the remainder of file1 (with the first line deleted), and file2. In this case, the solution begins with the first line of file1, with < and the line number prepended, and continues with the recursive solution.
On the other hand, perhaps the minimum difference involves deleting the first line from file2. Then the minimum number of differences is 1 plus the minimum number of differences between file1, and the remainder of file2 (with the first line deleted). In this case, the solution begins with the first line of file2, with > and the line number prepended, and continues with the recursive solution.

If the first lines of file 1 and file 2 are the same, then we have a third option:

Perhaps the first line of file1 should remain unchanged in file2. Then the minimum number of differences is the minimum number of differences between the remainder of file1 and file2 (with one line deleted from each).

For any pair of files, we have either 2 or 3 possibilities for how to proceed. We calculate the cost of each recursively. We take the choice that gives the minimum cost, and return that as our solution.

Finally, any recursive algorithm needs a base case. If file1 is empty, then the remaining difference consists of all of file2, with the > character and line number prepended to each line. Similarly, if file2 is empty, then the remaining difference consists of all of file1, with the < character and line number prepended to each line.

Simplifying the Design

As-is, this approach is a bit difficult to work with: we need to spend lots of work splitting the first line off of each file. This is inefficient, and it would significantly clutter up our code. Instead, let's split each file into a Vector of Strings, where each entry is a line of the file.

With this change, we can recurse by keeping track of which lines remain in each file. To recurse, we'll use a helper method:

public Association<Integer, String> diffHelper(int remainingFile1Index, int remainingFile2Index)

diffHelper finds the optimal diff between the suffix of file1 starting at line remainingFile1Index, and the suffix of file2 starting at line remainingFile2Index. It returns an Association between an int (representing the cost of the solution), and a String (representing the actual diff to return).

diffHelper works recursively as described above. We can determine if diffHelper is in a base case (where one of the files is empty) by comparing remainingFile1Index and remainingFile2Index to the size of each vector. If both files are nonempty, diffHelper considers two or three possibilities, each calculated with a recursive call. The first two recursive calls remove one line from file1 and file2; these recursive calls can be made by modifying remainingFile1Index and remainingFile2Index respectively.

Task 1: Implement The Difference Algorithm

Your first task is to implement diffHelper using the recursive method described above, in the file Diff.java. You should store the files using a Vector<String> to simplify your design (this has already been implemented in the starter code). You will probably want to use helper methods to keep your diffHelper method short. You don't need to use any other class yet; task 1 can be completed by only modifying Diff.java.

First, implement the base case, where one of the files has no remaining lines. You may want to use a helper function for the base case. After the base case is implemented, Diff should work correctly when one of the files is empty. In particular, if you run the command

		% java Diff samples/verySmallA.txt samples/empty.txt

then the result you obtain should exactly match the sample solution in emptySolution.txt. This is a good opportunity to make sure the format of your output matches the expected format---make sure the output matches before you proceed further in the lab.

Next, implement the recursive algorithm. In some cases, there may be multiple optimal solutions. Make sure you break ties according to the following rules so that your output matches the sample output:

If removing a line from file1 leads to an optimal solution, then return the solution that removes a line from file1 (even if other solutions are also optimal).
Otherwise, if removing a line from file2 leads to the same cost as removing a line from both files, return the solution that removes a line from file2.

In other words, if removing a line from file1 and file2 leads to the same cost, break the tie towards deleting the line from file1. Only remove a line from both files if it is the unique best solution (of course, we only consider deleting a line from both files if their current lines are equal).

After implementing diffHelper, your java program should run successfully on the inputs verySmallA.txt and verySmallB.txt (matching the output in verySmallSolution.txt), as well as smallA.txt and smallB.txt (matching the output in smallSolution.txt). It will also run correctly on the other sample files, but the time it takes will be prohibitively large.

Improving Efficiency

The recursive algorithm described above is very slow. In fact, if file1 and file2 have n and m lines respectively, then this algorithm takes Ω(2^(n+m)) time.

Part of the reason this algorithm is so slow is due to diffHelper computing the same thing over and over. Each new recursive call to diffHelper requires an entire, possibly slow computation—even if an identical call was made in the past.

We need a way to keep track of previous calls to diffHelper. If diffHelper is called a second time with the same arguments, rather than recomputing the solution from scratch, we want to look up the solution calculated last time it was called. Then, we can return it immediately.

We need a few things to accomplish this. First, we need a way to keep track of which calls have been made to diffHelper. Since diffHelper takes two ints as arguments, let's create a class IntPair that stores two ints.

Then, we want to store, for each diffHelper we have called in the past, the final return value of the method (which is an Association<Integer, String>). We want to be able to lookup if we've made a call with a given IntPair in the past, and if so, find the associated Association and return it. Otherwise, at the end of diffHelper, we want to keep track of the IntPair and the newly calculated return value.

This is an ideal situation for hashing. At the beginning of a recursive call, we get an IntPair consisting of remainingFile1Index and remainingFile2Index to see if their final value has been previously caluclated. If so, we return it. Otherwise, after completing a new recursive call, we want to put the arguments along with the calculated return value into the hash table.

Task 2: Speed Up Diff With Hashing

Implement a version of diffHelper that uses a hash table to avoid repeated calculations in HashDiff.java. The HashDiff class extends the Diff class, so you will have access to all of the variables and methods you already wrote in task 1. You should use your diffHelper from Diff as a staring point, incorporating the hashtable checks described above. Make sure your diffHelper method works correctly before starting task 2! Otherwise, if you made a mistake in your original implementation, you'll need to fix it in two places.

You'll need to create an IntPair class (much of it is already implemented in the starter code). Don't forget to create a hashCode() method for IntPair so that it can be used as a key in a hash table. The hashCode() method of IntPair should return the sum of the integers stored in IntPair.

Once this is completed, HashDiff should be able to correctly diff maSenate2016.txt and maSenate2020.txt in a few seconds. Your solution should match maSenateSol.txt.

		% java HashDiff samples/maSenate2016.txt samples/maSenate2020.txt

Task 3: Getting a Better Hash Function

Unfortunately, the hash described above has suboptimal performance: many pairs share the same hash code. For example, if remainingFile1Index is 3, and remainingFile2Index is 1, we'll get the same hash code as when remainingFile1Index is 2 and remainingFile2Index is 2.

As a result, you may find that your program still cannot compute the diff of whosonfirst.txt and whosonfirst2.txt in a reasonable time. (It takes just over 15 minutes on my machine.)

To speed up execution, implement the following hashCode() method in the BetterIntPair class (BetterIntPair extends IntPair, so you only need to override the hashCode() method). The instance variables used in this method are already included in the starter code.

		return ((hashMult * int1 + hashAdd) % hashPrime) +
				((hashMult * int2 + hashAdd) % hashPrime);

Multiplying by a large random number, adding a second large random number, and finally taking the result modulo a large random prime is one of the best-known methods for hashing integers (or, in this case, pairs of integers). And it should work particularly well here.

Create a new implementation of diffHelper() in BetterHashDiff.java that uses the BetterIntPair class in the hash table. This implementation will likely be very similar to the version of diffHelper() implemented in HashDiff.java. With this improvement, BetterHashDiff should be able to quickly and correctly calculate the diff for all sample files. In particular, BetterHashDiff should be able to quickly calculate the diff of whosonfirst.txt and whosonfirst2.txt, obtaining a solution matching whosonfirstSol.txt

		% java BetterHashDiff samples/whosonfirst.txt samples/whosonfirst2.txt

Task 4: Thought Questions

Answer the following questions in PROBLEMS.md.

This question is about the reason why the more complex hash function in Task 3 led to a significant speedup over the simpler hash function in Task 2. Let's say that throughout all recursive calls, remainingFile1Index is always between 0 and n-1, and remainingFile2Index is always between 0 and m-1.
- In this case, how many possible IntPairs can there be throughout the execution of the program?
- How many possible hashCodes can be output by all IntPairs? (Hint: What is the maximum value that can be output by hashCode()? What is the minimum value?)
Rather than using an IntPair in this lab, we could have used an Association<Integer, Integer>. What would be the downside of doing this?

Pre-lab

Before lab, please do the following:

In this lab, the amount of code you write won't be too large, but the concepts behind the lab are fairly new. For this reason, be sure to carefully read through the lab description before you get started.

checkstyle requirements:

For this lab, we will be not be adding any new checkstyle rules to the set of rules that we have used in previous weeks.

We STRONGLY ENCOURAGE you to run checkstyle early and often when developing your code, and try to program in a way that minimizes WARNING messages. The checkstyle rules that we use in this course are based on real-world style guides; internalizing good style practices will help us write more readable code.

In total, checkstyle will enforce the following guidelines:

All class variables that are not final must be declared private or protected (i.e., no public member variables unless they are constants). (We don't expect this to be an issue this week.)
All public methods must include a “Javadoc” comment (starts with /** and ends with */; it should include descriptions of the function at the top, descriptions of return values after a @return tag, descriptions of each argument after a @param tag, and pre/post conditions after the @pre or @post tags).
No methods should be longer than 30 lines (excluding whitespace and single-line comments).

To run checkstyle, you would type the following command at the terminal:


          $ ./checkstyle

The ./ is peculiar to Unix: it tells the terminal to look for the checkstyle program in the current directory. This command will run checkstyle on every Java program in your directory. To run checkstyle on a specific Java file, type:


          $ ./checkstyle SomeFile.java

Lab Deliverables

For this lab, please submit the following:

Your well-documented source code in Diff.java and IntPair.java.
The PROBLEMS.md file that contains answers to the questions described above, as well as any other information about your submission.

As in all labs, you will be graded on design, documentation, style, and correctness. Be sure to document your program with appropriate comments, including a general description at the top of each Java class, a description of each method with pre- and post-conditions where appropriate. Also use comments and descriptive variable names to clarify sections of the code which may not be clear to someone trying to understand it.

Submitting Your Lab

As you complete various milestones, you should commit your changes and push them. Commit early and often. When the deadline arrives, we will retrieve the latest version of your code. If you are confident that you are done, please include "Lab Submission" as the commit message for your final commit. If you later decide that you have more edits to make, it is OK. We will look at the latest commit before the deadline.

Be sure to push your changes to GitLab
Verify your changes on GitLab. Navigate in your web browser to your private repository on GitLab. You should see all changes reflected in the various files that you submit. If not, go back and make sure you committed and pushed.

We will know that the files are yours because they are in your git repository. Do not include identifying information in the code that you submit. Our goal is to grade the programs anonymously to avoid any bias. However, in your README.md file, please cite any sources of inspiration or collaboration (e.g., conversations with classmates). We take the honor code very seriously, and so should you. Please include the statement "I am the sole author of the work in this repository" in the comments at the top each of your java files.