Publisher - Reverse Engineering

This is a log of my attempt to reverse-engineer the Microsoft Publisher file format. I once had an interest in it, though like many proprietary formats it has lost relevance over time. Since writing this, I have received correspondence about OLE2 and Escher, and a few people have pointed me to POI-HPBF - A Guide to the Publisher File Format. My personal interest in this project has ended, and it seems that others have done more work than is represented in this document. Still, the rest is left here as a historical relic.

It appears that nobody can decode MS Publisher files, on any platform or with any application. (Update: Publisher to InDesign is possible, and Corel Draw X4 can read them also. Still, there is no documented description of the format, and no F/OSS software to demonstrate the format.) I would like to decode it. If you would, too, then please join me - share what you've got. If we all collaborate, we can learn much more than individuals working alone.

I will try to keep the links up top (One, Two, Three, Four, etc) updated. It might be worth trying the next post number, though, just in case I have put something up here and forgotten to link it...

Post One - 29 April 2007 : Retrieving the Text. That was easy! Here it is: ms_publisher.c

However, it's not that simple: there is also formatting and positioning. I had to go back a long way and work out what all the elements of a Publisher document are.
This is not easy. The best I can do so far is This Text File. It seems accurate enough so far, though that is not saying much for a vague set of observations like these; I have only just started. Actually, I've got a slightly better idea now, but not much better, and I can't place much confidence in it, as I have not tested it against many sample documents.

There are a few basic utility scripts I've written around this project. Most get hacked around with each update; here are some relatively generic versions:
getdiffs.c - just gets the diffs between two binary files and writes them to diffs.html. Yellow and blue colours mark which file has the higher value at each differing byte.
phtml.c - does the same thing, but includes every byte, with matching bytes shown in white. The output is broken out into a series of files (outfile1.html, outfile2.html, outfile3.html, and so on), with a #define'd break for every 10 KB of binary file.
Both have skipA() and skipB() functions, which can be used to allow for known but uninteresting differences between files. Text diff utilities have less need for this, because line breaks are frequent enough to show where to pick up the trace again; binary files have no such common marker.
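
As an illustration of the approach just described, here is a minimal sketch of a byte-by-byte comparison with skip hooks. It is not the actual getdiffs.c: the function diff_buffers(), the signatures of skipA() and skipB(), and the plain-text output (rather than coloured HTML) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical skip hooks: return the number of bytes to skip at this
 * offset in file A or file B (0 means "compare normally"). Edit these
 * to step over known but uninteresting differences. */
static size_t skipA(size_t offset) { (void)offset; return 0; }
static size_t skipB(size_t offset) { (void)offset; return 0; }

/* Walk two buffers in step and report each differing byte.
 * Returns the number of differing bytes found. If out is NULL,
 * nothing is printed and only the count is returned. */
static size_t diff_buffers(const unsigned char *a, size_t alen,
                           const unsigned char *b, size_t blen,
                           FILE *out)
{
    size_t i = 0, j = 0, ndiffs = 0;

    assert(a != NULL && b != NULL);

    while (i < alen && j < blen) {
        size_t sa = skipA(i);
        size_t sb = skipB(j);

        if (sa != 0 || sb != 0) {
            /* A hook asked us to jump ahead in one or both files. */
            i += sa;
            j += sb;
            continue;
        }

        if (a[i] != b[j]) {
            if (out != NULL) {
                /* Note which file holds the higher value, in place of
                 * the yellow/blue colouring used in the HTML output. */
                fprintf(out, "A+0x%04zx: %02x   B+0x%04zx: %02x   (%s higher)\n",
                        i, a[i], j, b[j], a[i] > b[j] ? "A" : "B");
            }
            ndiffs++;
        }
        i++;
        j++;
    }
    return ndiffs;
}
```

To use it, read each file into memory (for example with fread()) and call diff_buffers(bufA, lenA, bufB, lenB, stdout). Because the two offsets advance independently, a skip hook can resynchronise the comparison after an insertion in one file, which is the job that line breaks do for text diffs.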

Post Two - 22 May 2007 : Moving a text box, see the changes

Post Three - 26 May 2007 : Saving the same file twice, see the changes

Post Four - 9 June 2007 : And again, but in "extreme mode" - "deadsimple"!

Post Five - 10 June 2007 : Getting even more simplistic.

Post Six - 12 June 2007 : Branching out again.

Experiment 7 - 28 June 2007 - seeing the difference between bold and normal text. Apparently, quite a lot :-(

Post Eight - 30 June 2007 : Writing my own .pub files (well, okay, hacking existing ones)