Post One - 29 April 2007

This is an attempt to retrieve the **TEXT** from a Microsoft Publisher document. It does not deal with any formatting. I just want the text.

It seems (April 2007) that there is no current tool to read MS Publisher (.pub) files, other than MS Publisher itself, on any platform - Win, Mac, Linux, etc. This tool certainly doesn't allow you to edit an MS Publisher file, but it attempts to get some content out of it, in text form. The file format is undocumented, and this is an initial release after only an hour or so of reverse-engineering, so the tool gathers (hopefully) all of the text in the document, plus other strings that are not (or should not be) part of the document's text.

This tool has been written on the basis of one PUB file (v3.0), and the output of "od -c" on that file. It helps that I had some idea about the content of the file. YMMV.

So, here's the code so far: ms_publisher.c, released under GPLv2. A Linux binary is also available.

Strings are of the form char \0 char \0 char \0 char \0 :

0550660      \0      \0      \0   A  \0   c  \0   t  \0   i  \0   v  \0
0550700   i  \0   t  \0   y  \0      \0   b  \0   a  \0   g  \0   s  \0

This says "Activity bags", preceded in this excerpt by three spaces (od -c shows a space byte as a blank column).
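The pattern above can be picked out with a simple scan. What follows is only a minimal sketch, not the code from ms_publisher.c: the function name extract_wide_strings, the min_chars threshold and the restriction to printable ASCII are all my assumptions for illustration. It looks for runs of "printable byte, zero byte" pairs and copies the printable bytes into an output buffer, one run per line.

```c
#include <stddef.h>

/* Assumption: only printable ASCII (0x20..0x7e) counts as text. */
static int is_text_byte(unsigned char c)
{
    return c >= 0x20 && c <= 0x7e;
}

/*
 * Hypothetical helper (not from ms_publisher.c).
 * Scans buf for runs of (text byte, 0x00) pairs. Every run of at least
 * min_chars characters is copied to out, followed by '\n'. out is always
 * NUL-terminated and silently truncated if out_cap is too small.
 * Returns the number of qualifying runs found.
 */
size_t extract_wide_strings(const unsigned char *buf, size_t len,
                            size_t min_chars, char *out, size_t out_cap)
{
    size_t runs = 0, used = 0, i = 0;

    if (out_cap == 0)
        return 0;

    while (i + 1 < len) {
        size_t start = i;

        /* Advance for as long as we keep seeing "char, \0" pairs. */
        while (i + 1 < len && is_text_byte(buf[i]) && buf[i + 1] == 0x00)
            i += 2;

        size_t chars = (i - start) / 2;
        if (chars > 0 && chars >= min_chars) {
            /* Keep two bytes spare for the newline and the terminator. */
            for (size_t j = start; j < i && used + 2 < out_cap; j += 2)
                out[used++] = (char)buf[j];
            if (used + 1 < out_cap)
                out[used++] = '\n';
            runs++;
        }

        /* No pair at this offset: slide one byte, since a string may
         * begin at an odd offset in the file. */
        if (i == start)
            i++;
    }

    out[used] = '\0';
    return runs;
}
```

To try it, read the whole .pub file into memory, call the function with a min_chars of 4 or so, and print out. A low threshold lets through a lot of junk; a high one drops short labels.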

Much of the output from this tool is useless; it errs on the side of caution. It appears that there are flags within the PUB file to mark the start of the content (and possibly also the end of the content). This warrants further investigation.

I'll need to get some other source, and permission to distribute the source - along with its text - before I can go any further. Still, I hope that the above will be of some use to someone.

Update Oct 2007: Raymond Chen explained the Unicode encoding on his blog back in 2004. We seem to have little-endian Unicode (UTF-16) with BOM in Publisher: in the dump above, each character's byte comes first and its zero byte second.
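A byte-order mark, where one is present, tells you which way round the two bytes of each character go: FF FE means little-endian, FE FF means big-endian. Here is a small sketch of checking for one and reading a 16-bit unit accordingly. The names utf16_bom and utf16_unit are made up for this example, and I have not established where (or how consistently) Publisher writes a BOM, so treat the lookup position as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

enum utf16_order { UTF16_UNKNOWN, UTF16_LE, UTF16_BE };

/* Hypothetical helper: classify the first two bytes at p as a UTF-16 BOM. */
enum utf16_order utf16_bom(const unsigned char *p, size_t len)
{
    if (len < 2)
        return UTF16_UNKNOWN;
    if (p[0] == 0xFF && p[1] == 0xFE)
        return UTF16_LE;
    if (p[0] == 0xFE && p[1] == 0xFF)
        return UTF16_BE;
    return UTF16_UNKNOWN;
}

/* Combine the two bytes at p into one 16-bit code unit.
 * Anything other than UTF16_BE is read as little-endian. */
uint16_t utf16_unit(const unsigned char *p, enum utf16_order order)
{
    if (order == UTF16_BE)
        return (uint16_t)((p[0] << 8) | p[1]);
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

For plain ASCII text the little-endian reading gives 'A' as the bytes 41 00, which is the "char \0" shape seen in the od output.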