OpenOffice.org XML File format

Having played about a little with the OpenOffice.org (and StarOffice) file format, I thought I'd share my experiences and lessons learned.

I have a feeling this format will take off. I could be wrong, but it's very complete, accurate, well-specified, and clear. And it's documented not just by DTDs and written documentation but also by code, if you want to check exactly what OpenOffice.org does with the data.

So far, I've not had to look at the code (it's just nice to know that it's there).

The .sxw file is a .JAR file (which is basically the same as a .ZIP file).

Creating an empty document in OpenOffice.org 1.0 and saving it gives a file, blank.sxw.

If my approach seems strange (create a blank document, look at it, make a change, look at it again, and so on), it is because of why I want to know how the format works: I'm writing an application which creates report files using various standard bits of formatting (headers, tables, etc). It currently creates HTML and RTF documents, and makes a token stab at plain-text documents, even though plain text cannot do things like tables nicely. I'm now adding OpenOffice.org output to that application, because I think it's a good format, and I feel it has a good chance of catching on.

So I start with a blank file, and add features to the file once I've understood what's already in there...

The blank .SXW package contains the following files:

content.xml - the text of the document
styles.xml - the styles used by the document
meta.xml - various meta-information (author, etc)
settings.xml - the Word Processor settings used.
META-INF/manifest.xml - a list of all these files, plus any others used in the document

The manifest is the difference between .ZIP and .JAR, apparently (a true .JAR keeps its manifest in META-INF/MANIFEST.MF; the .sxw package uses META-INF/manifest.xml in much the same role). It makes finding the files easier, and the documentation claims that this file is stored uncompressed, so that it can be easily read. I have not tested this, but it seems like a good idea.
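
The claim about the manifest being stored uncompressed should be easy enough to check, since the package is an ordinary ZIP archive. A minimal Python sketch follows (Python is used purely for illustration, and the function name list_package is made up for this example). It lists each member of a package and reports whether it is stored without compression:

```python
import zipfile


def list_package(package):
    """Return (member name, stored uncompressed?) for each file in a package.

    `package` is a path or a binary file object for a ZIP-based file such
    as blank.sxw.
    """
    with zipfile.ZipFile(package) as archive:
        return [
            (info.filename, info.compress_type == zipfile.ZIP_STORED)
            for info in archive.infolist()
        ]
```

Calling list_package("blank.sxw") on the blank document should include the five files listed above; whether manifest.xml really comes back flagged as stored is exactly the part I have not tested.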

I have not investigated all these files fully, just the content.xml and styles.xml files.

Personally, from my experience of working out RTF, it's much easier to create a blank document, look at what's in there, then add some feature I'd like to use, and see what's changed. Since MS Word is the reference implementation of RTF, and OpenOffice.org is the reference implementation of the OpenOffice.org XML file format, this seems to make sense.

Where OpenOffice.org beats RTF, though, is validation. Many RTF documents I've tried have crashed MS Word or, more often, the whole PC. At least I can validate genuine XML: though this is my first foray into "proper" XML, there are lots of open-source tools out there to validate XML code.

The xml.openoffice.org website has some documentation on these files, but the really interesting stuff is hidden away in the CVS. The files are listed from the front page, but to get the latest versions you must choose "OpenOffice.org XML File Format DTD" from the front page, then select a .MOD file (start with office.mod), which will show you the CVS log for that file. There is a line which says something like:

Revision 1.46 / (download) - [select for diffs] , Mon May 6 09:34:55 2002 UTC (4 months ago) by bm

Select the "download" link to get the file. As at 09 Sep 2002, the latest is this file.

They also offer a 548-page PDF file, which goes through the whole DTD, describing each element and attribute. I recommend you download this, but it's not the best way of working out what's what, and what goes where. I recommend the .MOD files to work out what goes where, what needs what, and so on.

It may seem strange that I've extolled the virtues of the OpenOffice.org file format for its openness, then rubbished their documentation attempts. It's a good document, but frankly, my personal opinion is that the DTD is more useful.

The information in the xml_specification.pdf file about tables is particularly uninformative. I emailed the author with a question and did not receive a reply, but I have finally worked out the simplest code to create a table:

<table:table table:name="Table 1">
  <table:table-column/>
  <table:table-column/>
  <table:table-row>
    <table:table-cell>
      <text:p>Element 1</text:p>
    </table:table-cell>
    <table:table-cell>
      <text:p>Element 2</text:p>
    </table:table-cell>
  </table:table-row>
  <table:table-row>
    <table:table-cell>
      <text:p>Element 3</text:p>
    </table:table-cell>
    <table:table-cell>
      <text:p>Element 4</text:p>
    </table:table-cell>
  </table:table-row>
</table:table>

This creates a table:

Element 1	Element 2
Element 3	Element 4

though without the border - I only added that in the HTML of this page to show more clearly that it is a table.

You would not believe how long it took me to get to this point - I was using the xml_specification.pdf file, and looking at the documents created by OpenOffice.org... I eventually found the table.mod file on xml.openoffice.org, and everything became clear.

The above can be simplified a little. The table:table-column tag declares the table's columns, and to avoid repeating ourselves (if you had 10 columns, it would be silly to repeat the "column" line 10 times) we can say: <table:table-column table:number-columns-repeated="2"/>.

This will create a table as wide as the paper width, with all columns having the same width. Creating columns of different widths, eg:

Element 1 Element 2
Element 3 Element 4

is the next thing I'll be looking into.
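
For the equal-width case, the table markup above is regular enough to generate mechanically. A rough Python sketch of how a report generator might emit it follows. The function name table_xml and its arguments are invented for this example, and it only produces the table fragment, not a complete content.xml with its namespace declarations:

```python
from xml.sax.saxutils import escape, quoteattr


def table_xml(name, rows):
    """Build a <table:table> fragment from equal-length rows of cell text."""
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("rows must be non-empty and all the same length")

    lines = [f"<table:table table:name={quoteattr(name)}>"]
    # One column declaration stands in for every column in the table.
    lines.append(
        f'  <table:table-column table:number-columns-repeated="{len(rows[0])}"/>'
    )
    for row in rows:
        lines.append("  <table:table-row>")
        for cell in row:
            lines.append("    <table:table-cell>")
            # Escape &, < and > so the cell text cannot break the XML.
            lines.append(f"      <text:p>{escape(str(cell))}</text:p>")
            lines.append("    </table:table-cell>")
        lines.append("  </table:table-row>")
    lines.append("</table:table>")
    return "\n".join(lines)
```

Something like table_xml("Table 1", [["Element 1", "Element 2"], ["Element 3", "Element 4"]]) should then give the same two-by-two table as the hand-written version, with the single repeated column line in place of the two separate ones.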

Of course, there are ways and ways of doing this stuff - I just prefer to find the cleanest way, then start adding to it.

Just a footnote:

For those who say that OpenSource is unprofitable: I'm creating a closed-source application which is using this open standard; in order to achieve this, I must understand the format I'm using. I'm having to do this work anyway, and document what I've learned. Is there any extra cost to me, or to my employer, in making this information public? No.

It may give a competitor an advantage: it's taken me a few days to work out what I've needed to know, and I have a few things yet to work out. But by publishing the work I've done, I am also promoting the document format, which is another push for the format itself. If 10 open source programmers read this page for every proprietary programmer, then that benefits open source.

The whole OpenOffice.org project is funded by Sun; now my employer has also added to the knowledge-base, if anybody finds these pages useful. And it has cost my employer nothing - I would have documented this anyway. Sun? They sell StarOffice, which uses the same format - it's the same code as OpenOffice.org - so they benefit too.

Free != Cheap