Post Three - 26 May 2007

Surprisingly, the difference between 1.pub and 2.pub, once the trivia has been dealt with, amounts to 541 changes.
The difference between the files? Well, for 1.pub, I created a document, inserted a text box, added the words "Hello world", selected File -> Save, and chose the name C:\1.pub.
For 2.pub, I waited a minute, selected File -> Save, and chose the name 2.pub.
So the only differences are the file names and the save times, which are (I think slightly over) one minute apart. 1.pub was saved at 23:42 on 26 May 2007 (GMT); 2.pub was saved at 23:43.
You can see the differences here. The first column is the address in Decimal, followed by the address in Octal (which makes it easier to tie in with the output from od -c myfile.pub). Then there are three columns for each file: Hex, Decimal, and (if printable) the ASCII character. If 1.pub's value is higher than 2.pub's value, then the row is light blue, otherwise it's yellow.
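To make the column layout concrete, here is a minimal sketch of how one row of that output could be built. It is not the code from phtml.c or getdiffs.c: the function names, the plain-text row format, and the colour strings are all assumptions made for illustration, and the real tools emit HTML rather than plain text.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Row colour as described above: light blue when the first file's byte
   is higher, yellow otherwise. The colour strings are assumptions. */
const char *row_colour(unsigned char a, unsigned char b)
{
    return a > b ? "lightblue" : "yellow";
}

/* The printable ASCII character, or a space if the byte is not printable. */
char printable(unsigned char c)
{
    return isprint(c) ? (char)c : ' ';
}

/* Format one row: decimal address, octal address (to match od -c), then
   hex, decimal and ASCII for each file's byte, then the row colour. */
int format_row(char *out, size_t outlen, unsigned long addr,
               unsigned char a, unsigned char b)
{
    return snprintf(out, outlen, "%lu %lo | %02X %u %c | %02X %u %c | %s",
                    addr, addr,
                    (unsigned)a, (unsigned)a, printable(a),
                    (unsigned)b, (unsigned)b, printable(b),
                    row_colour(a, b));
}

/* Print a row for every offset at which the two buffers differ, over
   their common length. Returns the number of differing bytes. */
size_t print_diffs(FILE *fp, const unsigned char *a, size_t alen,
                   const unsigned char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    size_t count = 0;
    char row[96];

    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            format_row(row, sizeof row, (unsigned long)i, a[i], b[i]);
            fprintf(fp, "%s\n", row);
            count++;
        }
    }
    return count;
}
```

To use it, read both files into memory (for example with fread) and pass the two buffers and their lengths to print_diffs.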

New tools: getdiffs.c is a hacked version of phtml.c. The latter, phtml, does a simple binary "diff" of the files, as described above, and also includes the common text, which is marked in white. The former, getdiffs, shows only the differences, but marks the areas of commonality with a simple horizontal rule, so you can see which chunks go together.
Both also have skipA() and skipB() functions, which let you mark places in the binaries where one file includes a longer entry than the other. So, where one file stores its own filename as "C:\x.pub" and the other stores it as "C:\somedir\y.pub", you can skip over the extra "somedir\" to leave a single difference, "x.pub" vs "y.pub".
This is needed because binary diff is very difficult to automate. Text-based diffs know that an end-of-line is a significant character, and can therefore spot related lines; with a binary file, you've just got a load of ones and zeroes (which I'm lazily interpreting as bytes). If you don't deal with these length differences, everything after them is misaligned and you won't spot later commonalities.
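The skipping idea can be sketched roughly as follows. This is not the actual skipA()/skipB() code: the struct, the function names, and the idea of passing skips as a table are hypothetical. The sketch assumes each file's skip regions are sorted by offset and do not overlap. Each file gets its own cursor, and a cursor jumps forward whenever it reaches a region marked as extra.

```c
#include <stddef.h>

/* A region to step over in one file. Hypothetical structure, not taken
   from getdiffs.c. */
struct skip {
    size_t offset;  /* where the extra bytes start in that file */
    size_t length;  /* how many bytes to step over */
};

/* Advance pos past any skip region that starts exactly at pos.
   Assumes the table is sorted by offset and regions do not overlap. */
size_t apply_skips(size_t pos, const struct skip *skips, size_t nskips)
{
    for (size_t i = 0; i < nskips; i++)
        if (skips[i].offset == pos)
            pos += skips[i].length;
    return pos;
}

/* Compare two buffers byte by byte, each with its own cursor and its
   own skip table, and count the positions where the bytes differ. */
size_t count_diffs(const unsigned char *a, size_t alen,
                   const struct skip *skips_a, size_t na,
                   const unsigned char *b, size_t blen,
                   const struct skip *skips_b, size_t nb)
{
    size_t i = 0, j = 0, diffs = 0;

    for (;;) {
        i = apply_skips(i, skips_a, na);
        j = apply_skips(j, skips_b, nb);
        if (i >= alen || j >= blen)
            break;
        if (a[i] != b[j])
            diffs++;
        i++;
        j++;
    }
    return diffs;
}
```

Comparing "C:\x.pub" with "C:\somedir\y.pub" and no skips, the first eight bytes differ in five places. With one skip on the second file at offset 3 for the eight bytes of "somedir\", the two cursors line up again and only the 'x' vs 'y' byte differs.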

Update

I created a file (x.pub), then did a File/SaveAs to y.pub, followed by a File/SaveAs to z.pub. These files all differ, even y and z, which don't even have the "unique" feature of being SaveAs vs. Save.
Here's the diff on y and z. Their only crime is to have been saved within a couple of seconds of each other, with one character difference in their names.