/*
 * (c) 2007 Steve Parker http://steve-parker.org/
 * GPLv2
 *
 * Attempt to retrieve the **TEXT** from a Microsoft Publisher document.
 * This will not deal with any formatting. I just want the text.
 *
 * It seems (April 2007) that there is no current tool to read MS Publisher (.pub)
 * files, other than MS Publisher itself, on any platform - Win, Mac, Linux, etc.
 * This tool certainly doesn't allow you to edit an MS Publisher file, but
 * it attempts to get some content out of it, in text form.
 * Due to a lack of documentation about the file format, and this being
 * an initial release after only an hour or so of reverse-engineering,
 * this tool gathers (hopefully) all of the text in the document, plus
 * other text which is not (or should not be) contained in the file.
 *
 * This tool has been written upon the basis of one PUB file (v3.0), and
 * the output of "od -c" on that file. It helps that I had some idea about
 * the content of the file. YMMV.
 *
 * Much of the output from this tool is useless; it errs on the side of caution.
 * It appears that there are flags within the PUB file to mark the start of
 * the content (and possibly also the end of the content). This warrants
 * further investigation.
 */

#include <stdio.h>   /* FILE, fgetc, EOF */
#include <stdlib.h>  /* assumed: the second header name is missing from this copy */

/*
 * Strings are of the form char \0 char \0 char \0 char \0 :
 *
 * 0550660 \0 \0 \0 A \0 c \0 t \0 i \0 v \0
 * 0550700 i \0 t \0 y \0 \0 b \0 a \0 g \0 s \0
 *
 * This says " Activity bags"
 */

/* I don't want multiple spaces / linebreaks. */
#define MAXSPACES 2
#define MAXNEWLINES 2

void readstrings(FILE *f)
{
  int prev=0;           /* previous byte read */
  int c=0;              /* current byte */
  int i=0;              /* count of bytes read so far */
  int runofnewlines=0;
  int runofspaces=0;

  while ((c=fgetc(f))!=EOF)
  {
    i++;
    if (c==0)
    {
      /* A NUL following a printable ASCII byte (or a tab) looks like one
       * character of a "char \0" string. */
      if ( ((prev>31) && (prev<127)) || (prev=='\t') )
      {
        if ((prev!=32) || (runofspaces