Lab 9: Spot the Difference
Up until this point in class, we’ve been using git
to submit
assignments. You may have wondered what exactly gets sent to the server
when you run git push
. If you make one change, is the entire file
reuploaded? It would be much more efficient if, instead, git
could
calculate what changes were made to the file, and only upload those
changes to the server. In fact, git
does make this calculation!
Our goal in this lab is to create a Java program that does exactly this: given two files (an old version and a new version of the file), calculate the minimum number of changes required to transform the old version into the new version of the file.
What’s the Difference?
Our goal is to list the differences between the two files on a line by line basis:
- If a line was not included in the old version of the file, but is included in the new version, it should be prefixed by > and its line number.
- On the other hand, if a line was in the old version, but has been removed in the new version, it should be prefixed by < and its line number.
- If a line stays the same between the two versions, it should not be included in the output.
For example, let’s say the old version of our file is:
match
onlyA
and the new version is:
match
onlyB
Then the output should be:
< 2: onlyA
> 2: onlyB
This calculation is quite similar to the output of the unix utility diff. You can try running
% diff sample/verySmallA.txt sample/verySmallB.txt
on a Unix computer and you will see a similar output. Paying homage to this utility, we’ll call our result the diff of two files.
Calculating the Optimal Differences
In this lab, we want to calculate the minimum sequence of differences
between the old and new version of the file. How can we make sure that
the differences we output are, in fact, the minimum? To ensure an
optimal solution, we fall back to a classic technique in the course: we
brute force all possibilities using recursion. Let’s say we have two
files file1
and file2
, each stored as String
s. To start, there are two
possibilities to obtain the minimum difference between these files:
- It may be that the minimum difference is obtained by deleting the
first line of
file1
. Then the minimum number of differences is 1 plus the minimum number of differences between the remainder offile1
(with the first line deleted), andfile2
. In this case, the solution begins with the first line offile1
, with < and the line number prepended, and continues with the recursive solution. - On the other hand, perhaps the minimum difference involves deleting
the first line from
file2
. Then the minimum number of differences is 1 plus the minimum number of differences betweenfile1
, and the remainder offile2
(with the first line deleted). In this case, the solution begins with the first line offile2
, with > and the line number prepended, and continues with the recursive solution.
If the first lines of file1
and file2
are the same, then we have a
third option:
- Perhaps the first line of
file1
should remain unchanged infile2
. Then the minimum number of differences is the minimum number of differences between the remainder offile1
andfile2
(with one line deleted from each).
For any pair of files, we have either 2 or 3 possibilities for how to proceed. We calculate the cost of each recursively. We take the choice that gives the minimum cost, and return that as our solution.
Finally, any recursive algorithm needs a base case. If file1
is empty,
then the remaining difference consists of all of file2
, with the >
character and line number prepended to each line. Similarly, if file2
is
empty, then the remaining difference consists of all of file1
, with the
< character and line number prepended to each line.
Simplifying the Design
As-is, this approach is a bit difficult to work with: we need to spend lots of work splitting the first line off of each file. This is inefficient, and it would significantly clutter up our code. Instead, let’s split each file into a Vector of Strings, where each entry is a line of the file.
With this change, we can recurse by keeping track of which lines remain in each file. To recurse, we’ll use a helper method:
public Association<Integer, String> diffHelper(int remainingFile1Index, int remainingFile2Index)
diffHelper
finds the optimal diff between the suffix of file1
starting
at line remainingFile1Index
, and the suffix of file2
starting at line
remainingFile2Index
. It returns an Association
between an int
(representing the cost of the solution), and a String
(representing the
actual diff to return).
diffHelper
works recursively as described above. We can determine if
diffHelper
is in a base case (where one of the files is empty) by
comparing remainingFile1Index
and remainingFile2Index
to the size of
each vector. If both files are nonempty, diffHelper
considers two or
three possibilities, each calculated with a recursive call. The first
two recursive calls remove one line from file1
and file2
; these
recursive calls can be made by modifying remainingFile1Index
and
remainingFile2Index
respectively.
You should not change the parameters of diffHelper
.
Task 1: Implement The Difference Algorithm
Your first task is to implement diffHelper
using the recursive method
described above, in the file Diff.java
. You should store the files using
a Vector<String>
to simplify your design (this has already been
implemented in the starter code). You will probably want to use helper
methods to keep your diffHelper method short. You don’t need to use any
other class yet; task 1 can be completed by only modifying Diff.java
.
First, implement the base case, where one of the files has no remaining
lines. You may want to use a helper function for the base case. After
the base case is implemented, Diff
should work correctly when one of the
files is empty. In particular, if you run the command
% java Diff samples/verySmallA.txt samples/empty.txt
then the result you obtain should exactly match the sample solution in
emptySolution.txt
. This is a good opportunity to make sure the format of
your output matches the expected format—make sure the output matches
before you proceed further in the lab.
Next, implement the recursive algorithm. In some cases, there may be multiple optimal solutions. Make sure you break ties according to the following rules so that your output matches the sample output:
- If removing a line from
file1
leads to an optimal solution, then return the solution that removes a line fromfile1
(even if other solutions are also optimal). - Otherwise, if removing a line from
file2
leads to the same cost as removing a line from both files, return the solution that removes a line fromfile2
.
In other words, if removing a line from file1
and file2
leads to the
same cost, break the tie towards deleting the line from file1
. Only
remove a line from both files if it is the unique best solution (of
course, we only consider deleting a line from both files if their
current lines are equal).
After implementing diffHelper
, your java program should run successfully
on the inputs verySmallA.txt
and verySmallB.txt
(matching the output in
verySmallSolution.txt
), as well as smallA.txt
and smallB.txt
(matching
the output in smallSolution.txt
). It will also run correctly on the
other sample files, but the time it takes will be prohibitively large.
Improving Efficiency
The recursive algorithm described above is very slow. In fact, if file1
and file2
have n and m lines respectively, then this algorithm takes
O(2^(n+m)) time.
Part of the reason this algorithm is so slow is due to diffHelper
computing the same thing over and over. Each new recursive call to
diffHelper
requires an entire, possibly slow computation—even if an
identical call was made in the past.
We need a way to keep track of previous calls to diffHelper
. If
diffHelper
is called a second time with the same arguments, rather than
recomputing the solution from scratch, we want to look up the solution
calculated last time it was called. Then, we can return it immediately.
We need a few things to accomplish this. First, we need a way to keep
track of which calls have been made to diffHelper
. Since diffHelper
takes two ints as arguments, let’s create a class IntPair
that stores
two ints.
Then, we want to store, for each diffHelper
we have called in the past,
the final return value of the method
(which is an Association<Integer, String>
).
We want to be able to lookup if we’ve made a call with a
given IntPair
in the past, and if so, find the associated Association
and return it. Otherwise, at the end of diffHelper
, we want to keep
track of the IntPair
and the newly calculated return value.
This is an ideal situation for hashing. At the beginning of a recursive
call, we get an IntPair
consisting of remainingFile1Index
and
remainingFile2Index
to see if their final value has been previously
caluclated. If so, we return it. Otherwise, after completing a new
recursive call, we want to put the arguments along with the calculated
return value into the hash table.
Task 2: Speed Up Diff With Hashing
Implement a version of diffHelper
that uses a hash table to avoid
repeated calculations in HashDiff.java
. The HashDiff
class extends the
Diff
class, so you will have access to all of the variables and methods
you already wrote in task 1. You should use your diffHelper
from Diff
as
a staring point, incorporating the hashtable checks described above.
Make sure your diffHelper
method works correctly before starting task
2! Otherwise, if you made a mistake in your original implementation,
you’ll need to fix it in two places.
You’ll need to create an IntPair
class (much of it is already
implemented in the starter code). Don’t forget to create a hashCode()
method for IntPair
so that it can be used as a key in a hash table. The
hashCode()
method of IntPair
should return the sum of the integers
stored in IntPair
.
Once this is completed, HashDiff
should be able to correctly diff
maSenate2016.txt
and maSenate2020.txt
in a few seconds. Your solution
should match maSenateSol.txt
.
% java HashDiff samples/maSenate2016.txt samples/maSenate2020.txt
Task 3: Getting a Better Hash Function
Unfortunately, the hash described above has suboptimal performance: many
pairs share the same hash code. For example, if remainingFile1Index
is
3, and remainingFile2Index
is 1, we’ll get the same hash code as when
remainingFile1Index
is 2 and remainingFile2Index
is 2.
As a result, you may find that your program still cannot compute the
diff of whosonfirst.txt
and whosonfirst2.txt
in a reasonable time. (It
takes just over 15 minutes on my machine.)
To speed up execution, implement the following hashCode()
method in the
BetterIntPair
class (BetterIntPair
extends IntPair
, so you only need to
override the hashCode()
method). The instance variables used in this
method are already included in the starter code.
return ((hashMult * int1 + hashAdd) % hashPrime) +
((hashMult * int2 + hashAdd) % hashPrime);
Multiplying by a large random number, adding a second large random number, and finally taking the result modulo a large random prime is one of the best-known methods for hashing integers (or, in this case, pairs of integers). And it should work particularly well here.
Create a new implementation of diffHelper()
in BetterHashDiff.java
that
uses the BetterIntPair
class in the hash table. This implementation will
likely be very similar to the version of diffHelper()
implemented in
HashDiff.java
. With this improvement, BetterHashDiff
should be able to
quickly and correctly calculate the diff for all sample files. In
particular, BetterHashDiff
should be able to quickly calculate the diff
of whosonfirst.txt
and whosonfirst2.txt
, obtaining a solution matching
whosonfirstSol.txt
% java BetterHashDiff samples/whosonfirst.txt samples/whosonfirst2.txt
Checkstyle requirements
For this lab, we will be not be adding any new checkstyle
rules to the
set of rules that we have used in previous weeks.
We STRONGLY ENCOURAGE you to run checkstyle early and often when developing your code, and try to program in a way that minimizes WARNING messages. The checkstyle rules that we use in this course are based on real-world style guides; internalizing good style practices will help us write more readable code.
In total, checkstyle will enforce the following guidelines:
- All class variables that are not
final
must be declaredprivate
orprotected
(i.e., nopublic
member variables unless they are constants). (We don’t expect this to be an issue this week.) - All
public
methods must include a “Javadoc” comment (starts with/**
and ends with*/
; it should include descriptions of the function at the top, descriptions of return values after a@return
tag, descriptions of each argument after a@param
tag, and pre/post conditions after the@pre
or@post
tags). - No methods should be longer than 30 lines (excluding whitespace and single-line comments).
To run checkstyle
, you would type the following command at the
terminal:
$ ./checkstyle
The ./
is peculiar to Unix: it tells the terminal to look for the
checkstyle
program in the current directory. This command will run
checkstyle
on every Java program in your directory. To run
checkstyle
on a specific Java file, type:
$ ./checkstyle SomeFile.java
Lab Deliverables
For this lab, please submit the following:
lab09-diff/
README.md
Diff.java
HashDiff.java
IntPair.java
BetterIntPair.java
samples/
empty.txt
emptySolution.txt
maSenate2016.txt
maSenate2020.txt
maSenateSol.txt
smallA.txt
smallB.txt
smallSolution.txt
verySmallA.txt
verySmallB.txt
verySmallSolution.txt
whosonfirst.txt
whosonfirst2.txt
whosonfirstSol.txt
You should have modified Diff.java
, HashDiff.java
, IntPair.java
, BetterIntPair.java
.
As in all labs, you will be graded on design, documentation, and functionality. Be sure to document your program with appropriate comments, a general description at the top of the file, and a description of each method with pre- and post-conditions where appropriate. Also use comments and descriptive variable names to clarify sections of the code which may not be clear to someone trying to understand it.
Submitting Your Lab
As you complete portions of this lab, you should commit
your changes and push
them. Commit early and often.
- Be sure to push your changes to GitLab.
- Verify your changes on GitLab. Navigate in your web browser to your private repository on GitLab. You should see all changes reflected in the files that you
push
. If not, go back and make sure you have both committed and pushed.
We will know that the files are yours because they are in your git
repository. Do not include identifying information in the code that you submit. We grade your lab programs anonymously to avoid bias. In your README.md
file, please cite any sources of inspiration or collaboration (e.g., conversations with classmates). We take the honor code very seriously, and so should you. Please include the statement "I am the sole author of the work in this repository."
in the comments at the top of your Java files.