Scala Patience Diff · Owen Stephens' Website

Scala Patience Diff🔗

24th December 2010

Consider diffing the following two files:

File1

#include <stdio.h>

// Frobs foo heartily
int frobnitz(int foo)
{
    int i;
    for(i = 0; i < 10; i++)
    {
        printf("Your answer is: ");
        printf("%d\n", foo);
    }
}

int fact(int n)
{
    if(n > 1)
    {
        return fact(n-1) * n;
    }
    return 1;
}

int main(int argc, char **argv)
{
    frobnitz(fact(10));
}

File2

#include <stdio.h>

int fib(int n)
{
    if(n > 2)
    {
        return fib(n-1) + fib(n-2);
    }
    return 1;
}

// Frobs foo heartily
int frobnitz(int foo)
{
    int i;
    for(i = 0; i < 10; i++)
    {
        printf("%d\n", foo);
    }
}

int main(int argc, char **argv)
{
    frobnitz(fib(10));
}

It’s easy to see that the call to the “fib” function has been added instead of the “fact” function call. The “fact” function was removed, with the “fib” function being added a few lines below. Let’s examine how a typical diff tool presents these changes.

First compare the result of diffing file1 and file2 using GNU diff:

diff -u file1.txt file2.txt

 #include <stdio.h>

-// Frobs foo heartily
-int frobnitz(int foo)
+int fib(int n)
 {
-    int i;
-    for(i = 0; i < 10; i++)
+    if(n > 2)
     {
-        printf("Your answer is: ");
-        printf("%d\n", foo);
+        return fib(n-1) + fib(n-2);
     }
+    return 1;
 }

-int fact(int n)
+// Frobs foo heartily
+int frobnitz(int foo)
 {
-    if(n > 1)
+    int i;
+    for(i = 0; i < 10; i++)
     {
-        return fact(n-1) * n;
+        printf("%d\n", foo);
     }
-    return 1;
 }

 int main(int argc, char **argv)
 {
-    frobnitz(fact(10));
+    frobnitz(fib(10));
 }

And now, the result using the patience diff algorithm (as described here and here):

scala -classpath . Main file1.txt file2.txt

 #include <stdio.h>

+int fib(int n)
+{
+    if(n > 2)
+    {
+        return fib(n-1) + fib(n-2);
+    }
+    return 1;
+}
+
 // Frobs foo heartily
 int frobnitz(int foo)
 {
     int i;
     for(i = 0; i < 10; i++)
     {
-        printf("Your answer is: ");
         printf("%d\n", foo);
     }
 }

-int fact(int n)
-{
-    if(n > 1)
-    {
-        return fact(n-1) * n;
-    }
-    return 1;
-}
-
 int main(int argc, char **argv)
 {
-    frobnitz(fact(10));
+    frobnitz(fib(10));
 }

It’s clear for this example that the patience diff has produced a superior diff output; it represents something closer to how one would describe the changes verbally.

The reason that the patience diff algorithm (whose name comes from the fact that a patience sort is used to calculate the LCS) produces more readable diff outputs is because it only considers common lines that are unique within each file. What this means practically, is that lines that are blank, or contain a single brace are not considered as forming part of the common subsequence; the diff algorithm will not focus on trying to match braces (the GNU diff output shown above does this) but instead, focusses on unique lines (e.g. function definitions).

My Scala implementation can be found on Github, here. An example usage (which takes the two filenames to diff on stdin and prints the diff) is shown below. Patience diff is used by the Bazaar version control system, whose implementation I consulted for ideas whilst coding my Scala version.

import scala.io.Source._
import OwenDiff._
object Main {
    def main(args : Array[String]) : Unit = {
    	val file1 = fromFile(args(0)).getLines.toList
    	val file2 = fromFile(args(1)).getLines.toList

        val diffList = Diff.diff(file1, file2)

        println(diffList.mkString)
    }
}

The main diff method is fairly simple; it recursively creates a list of pairs of matching line indices between the two files and coalesces them into contiguous blocks if possible, which are used to calculate the list of line insertion/deletion/changes required to go from file1->file2, as per diff.

A slight modification of my patience sort code, with the inclusion of a bisect method implementation gives the code that calculates the longest increasing subsequence (LIS) of its argument. The LIS code is used to determine the LCS of the line pairs.