OpenOffice.org Findings

I've been looking at the work done by the OpenOffice.org team (http://openoffice.org/) and the file format. It's very good: it's small, open, XML-based, and allows for nearly infinite expansion and backwards-compatibility in the future.

So far, I've documented some of my findings about the .sxw file format here.

There has clearly been a lot of work put into the xml.openoffice.org site, and even more work into the file format, but the documentation they have doesn't really tell you what you need to know to start getting your teeth into the file format.

I'm starting to attack the file format myself, and have started creating some pretty decent-looking documents in the format. If I can show you a way in to understanding what it's about, and where the useful information is to be found, then I'll consider my goal achieved.

The reason I'm interested in the format is twofold. Firstly, I believe it's such an excellent format that it will take off and end up having huge significance - think of the influence Microsoft have over millions of PC users who have to upgrade to the latest version of MS Office simply because everyone else has. Secondly, I'm in charge of an application which creates user-editable reports. So far, it creates HTML and RTF documents... HTML is not a very powerful WP language (it's not a WP language!), and RTF is - effectively - undocumented. The RTF documentation is here. Okay, this is version 1.5, which Office 97 supports, so it's a bit old, but please feel free to (a) find it on www.microsoft.com, and once you've cleared that hurdle, (b) let me know how it documents putting page numbers into footers... it doesn't. It gives footers a passing mention, and says nothing about putting page numbers into them - a pretty standard thing to do with a document.

A valid question to ask is, "Who are you to be talking about XML? Your website's HTML is simplistic and not even valid HTML, let alone XML." That's for a very good reason - see the Any Damn Browser icon at the bottom of the page - it might be simple and wrong, but it works. I am not breaking the HTML rules lightly; I am breaking them because it makes the site visible in everything including Lynx, Mosaic, Netscape, Internet Explorer, and Mozilla. I assume it works in Opera and everything else I've not tried, too... For XML, the rules are different: all tags must be closed, and everything is properly defined. The OpenOffice.org File Format defines a standard which every word processor can implement, and which allows for extra features, and for different features offered by different word processors, seamlessly - so long as they all follow the specification. That is why the OpenOffice.org spec. is good - it's not just good, it's properly documented.

Gotchas

One gotcha I've found so far: when trying out your code, adding features and so on, don't use "File|Reload" to see if your changes have worked. I believe the team are working on this problem, but at the moment, opening a file either succeeds or fails - it gives no explanation of what the problem is. This has a rather drastic effect on the File|Reload feature... if the file was okay before and you then broke it, the document will fail to load, but the cursor will go to the start of the current document, giving the impression that the file has been reloaded and that what you're seeing is the document you've just created. I have opened bug #7597 on this issue.
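
Until the loader explains its failures, one workaround is to check the XML for well-formedness before opening the file in OpenOffice.org at all. The following is a minimal sketch in Python, not anything provided by the OpenOffice.org project: the function name check_sxw is hypothetical, and it assumes only that an .sxw file is a ZIP archive whose .xml members ought to parse. It will catch well-formedness errors (such as an unclosed tag), but it says nothing about whether the document is valid against the OpenOffice.org DTDs.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def check_sxw(path_or_file):
    """Return (member name, parser message) for each XML member that fails to parse."""
    problems = []
    with zipfile.ZipFile(path_or_file) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue  # images and other binary members are skipped
            try:
                ET.fromstring(archive.read(name))
            except ET.ParseError as err:
                # The parser's message includes the line and column of the error.
                problems.append((name, str(err)))
    return problems


# Demonstration on an in-memory archive with a deliberately unclosed <p> tag.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("content.xml", "<doc><p>Hello</doc>")
demo.seek(0)

demo_problems = check_sxw(demo)
for name, message in demo_problems:
    print(f"{name}: {message}")
```

Run against a real file by passing its path instead of the in-memory buffer; an empty result means every XML member parsed, which is a much clearer signal than a reload that silently did nothing.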

Benefits

The benefits I see to the OpenOffice.org File Format

The OpenOffice.org file format is, IMHO, great.

I'm not a Word-Processor developer, though I am working on an application which creates WP files, which makes me a much lower life-form. I don't have to worry about everything a user might do, I just have to make sure that the files I create are acceptable to the word-processors. The WP programmers can then worry about what strange things users might do to the document.

The file size is small, because it's XML (which is really just text) which is then zipped. As I see it, this is more efficient than a binary format, because a binary format would have to do all its compression before creating the file; by zipping the file after it's created, a better compression rate is possible. A typical text document that takes 24,648 bytes in MS .DOC format comes to a mere 9,329 bytes as .SXW. Most of the world are still using 56K modems; in the UK we pay by the minute for internet access, so downloading a document in 5 minutes instead of 15 minutes makes a big difference.
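
To see the "XML, then zipped" structure for yourself, any ZIP tool will open an .sxw file. As a rough sketch in Python (the helper name list_members is hypothetical, and the demonstration uses a made-up in-memory archive rather than a real document), this lists each member of an archive with its original and compressed sizes:

```python
import io
import zipfile


def list_members(path_or_file):
    """Return (name, original size, compressed size) for each member of a ZIP archive."""
    with zipfile.ZipFile(path_or_file) as archive:
        return [(info.filename, info.file_size, info.compress_size)
                for info in archive.infolist()]


# Stand-in for a document: repetitive XML-like text, deflated as a ZIP member.
sample_xml = "<doc>" + "<p>Some paragraph text</p>" * 500 + "</doc>"
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("content.xml", sample_xml)
demo.seek(0)

members = list_members(demo)
for name, original, compressed in members:
    print(f"{name}: {original} bytes -> {compressed} bytes")
```

The sample text is far more repetitive than a real document, so the ratio it prints flatters the format; the point is only that the members are ordinary deflated text files which you can list and read with standard tools.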

It's clever. It uses XML (and there seems to have been some "discussion" internally about this), but it also supports binary file formats (such as graphics). It would be stupid for OO.org to add a new image format to all the ones already out there, and it would have been silly for OO.org to insist on a single format. It would also have made the file size huge if all images were encoded in a text representation (such as base-64, as used by MIME in email), and this is really why the .SXW file format is a .ZIP file - it just includes the image files as they are. Sure, a well-compressed image isn't going to be compressed any further by .ZIP, but it isn't going to be noticeably bigger either. This really is the best of both worlds. Think about HTML - there is a pretty good argument for WWW documents to be passed around in this way - get the content, style, images, and Binary Large OBjects (BLOBs) separately, as and when required. By putting them in different files, the whole issue is clearly resolved.
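
The size argument is easy to check with a little arithmetic: base-64 turns every 3 bytes into 4 characters, so an embedded image grows by about a third, whereas adding the same bytes to a ZIP costs almost nothing. A small Python sketch, using pseudo-random bytes as a stand-in for an already-compressed image (an assumption for illustration; no real image is involved, and the member name image.png is a placeholder):

```python
import base64
import io
import random
import zipfile

# 30,000 pseudo-random bytes stand in for an already-compressed image.
rng = random.Random(0)
image_bytes = bytes(rng.getrandbits(8) for _ in range(30000))

# Text representation: 4 output characters for every 3 input bytes.
encoded = base64.b64encode(image_bytes)

# The same bytes stored as a binary member of a ZIP archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("image.png", image_bytes)

# Reopen the finished archive to read back the member's compressed size.
with zipfile.ZipFile(buffer) as archive:
    zipped_size = archive.getinfo("image.png").compress_size

print("raw bytes:     ", len(image_bytes))
print("base-64 chars: ", len(encoded))
print("inside the ZIP:", zipped_size)
```

The ZIP figure can come out a few bytes above the raw size, because incompressible data still carries a little block overhead, but that is negligible next to the one-third growth of the text encoding.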

Knocks HTML/CSS into a cocked hat, if you ask me.

Queries - Questions I want to put to the OpenOffice.org team


Styles seem to be contained in styles.xml and content.xml - presumably content.xml overrules styles.xml?
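
While waiting for an answer, one way to investigate is to list the style names declared in each file and see which appear in both. This is only a sketch for exploring the question, not a statement of how OpenOffice.org resolves it. It matches elements by local name ("style") and attributes by local name ("name") so that no namespace URI is hard-coded; the helper name style_names and the sample XML with its urn:example namespace are invented for the demonstration.

```python
import xml.etree.ElementTree as ET


def style_names(xml_text):
    """Collect the 'name' attribute of every element whose local name is 'style'."""
    names = set()
    for element in ET.fromstring(xml_text).iter():
        # ElementTree spells qualified names as '{namespace-uri}local'.
        if element.tag.rsplit("}", 1)[-1] != "style":
            continue
        for key, value in element.attrib.items():
            if key.rsplit("}", 1)[-1] == "name":
                names.add(value)
    return names


# Invented sample data; a real document would supply styles.xml and content.xml.
NS = 'xmlns:style="urn:example:style"'
styles_xml = (f'<root {NS}><style:style style:name="Standard"/>'
              f'<style:style style:name="Heading"/></root>')
content_xml = (f'<root {NS}><style:style style:name="P1"/>'
               f'<style:style style:name="Standard"/></root>')

in_both = style_names(styles_xml) & style_names(content_xml)
print("declared in both files:", sorted(in_both))
```

Any name that turns up in both sets is a candidate for a test document: change its definition in one file only, open the result, and see which definition wins.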


Is anything standard? How about Standard? OpenOffice.org 1.0 and StarOffice 6 are (roughly) the same thing - SO6 has extra fonts (incl. Times, the default; OO.org's default is Thorndale).

It seems strange that when my document specifies Times / family "times", OpenOffice.org uses Thorndale, as does StarOffice (which includes Times as a value-add over OO.org). How do I make StarOffice (ideally any system with the Times font installed) use Times instead of Thorndale?