Microsoft's last chance for OOXML standard approval

The final decision on whether to approve Microsoft's OOXML specification as a standard of the International Organization for Standardization (ISO) is to be made over the weekend of 29-30 March 2008.

Acceptance assumes a level of satisfaction with the current definition of the proposed standard. Rejection at this stage would not imply complete rejection of MS-OOXML, but it would allow time to highlight the glaring inconsistencies inherent in the current specification and ensure proper examination and revision of the proposed formats.

The worst possible outcome would be the acceptance of a specification that is not fit for purpose. Unfortunately, this is also the likely outcome.

Other people's browsers

Standards are about interoperability, or they are about nothing. By definition a standard assumes a level of commonality that enables multiple implementations that are totally conversant with one another.

The basic requirement of a standard data format for office suites is that it preserves the integrity and neutrality of the data. Governments and other organisations have a vested interest in the implementation of open standards because they want to ensure that the documents of today will be readable tomorrow. Vendors and developers want open standards because they allow the opportunity to develop alternative ways to edit, interpret and view the data.

Too many users save their documents in binary formats that are both proprietary and transitory. The justification for this practice is that the proprietary formats are "de facto" standards. A "de facto standard" (usually an undocumented format that dominates because it is owned by the dominant player in a particular market) effectively confers ownership of documents on the "owner" of the standard.

Currently, Microsoft owns the "de facto standard" for office documents. Not so long ago the "de facto standards" for office computing were owned by WordPerfect, WordStar or Lotus. A current monopoly does not ensure a future monopoly. A "de facto standard" has a limited lifespan and confers no guarantees.

Microsoft has not encouraged independent implementations of its protocols or data formats, and in the past has published them, if at all, in piecemeal fashion. Each release of Office has coincided with changes in the data formats, designed to encourage conformance to the upgrade cycle. Competitors have been forced to reverse engineer Microsoft Office outputs to achieve some level of compatibility. Such circumstances inevitably inhibit competition and innovation. Monopolistic or proprietary control of a "standard" reduces both the incentive and the opportunity for competitive innovation, especially when the "owner" of the "standard" has a history of extending and breaking the parameters of that standard.

Bill Gates illustrated Microsoft's historic attitude to interoperability and standards in an internal memo (pdf) from December 1998:

"One thing we have got to change in our strategy - allowing Office documents to be rendered very well by other peoples browsers is one of the most destructive things we could do to the company", he wrote.

"We have to stop putting any effort into this and make sure that Office documents very well depends on PROPRIETARY IE capabilities." (The emphasis was Gates' own)

Can't do that

For the prospective developer the first problem in writing a piece of software that reads or writes data to the OOXML specification is its excessive length. This is a restriction in itself - but with lack of brevity comes repetition and inconsistency. OOXML suffers from being a partial description of what is implemented in Office 2007, rather than a specification for describing a universal document.

For instance, part of the reason for the excessive length of the OOXML specification is that equivalent elements in Word, Excel and PowerPoint have entirely different definitions, although they fulfil exactly the same purpose. The specification does not show the signs of careful engineering, or of a document designed for consideration as an international standard, but merely reflects the inconsistencies, kluges and bugs of the software it describes, accumulated over nearly three decades.

In some ways it is an indictment of Microsoft's own internal software development processes, and as such, it is a surprise that Microsoft has allowed it to see the light of day.

Rob Weir gives the example of OOXML's representation of "a staple of document formats: text colour and alignment", which is represented in three different ways, according to whether the document is text, sheet or presentation, and is driven to ask: "does this represent a reasonable engineering judgment?

"ODF uses the W3C's XSL-FO vocabulary for text styling, and uses this vocabulary consistently. OOXML's representations, on the other hand, appear incompatible with any deliberate design methodology."

Weir also undertook a review of unreported errors in the OOXML specification and claimed to find 64 errors in the first 25 random pages of the specification that he studied, including the following serious flaws:

* Storage of plain text passwords in database connection strings

* Undefined mappings between CSS and DrawingML

* Errors in XML Schema definitions

* Dependencies on proprietary Microsoft Internet Explorer features

* Spreadsheet functions that break with non-Latin characters

* Dependencies on Microsoft OLE method calls

* Numerous undefined terms and features

The flaws are too numerous to itemise, and a great number are showstoppers.

A programmer's hell

These observations are critical to the practicality of third-party implementations of OOXML. The specification of a standard requires clarity, precision and brevity to enable efficient implementation. It doesn't really matter that Microsoft's internal documents might use a plethora of different data descriptions for the same action. It matters when a third-party developer is trying to interpret that data.

It is no accident that so many commentators have noted the unprecedented length and complexity of the specification. The ODF specification is 600 pages. The specification for MS-OOXML has reputedly grown to upwards of 7000 pages during the fast track process. One has to wonder whether interoperability, the first consideration for the acceptance of a standard, has ever been contemplated in the definition of OOXML, which looks and sounds like a programmer's hell.

Moreover, OOXML contains binary specifications, specifications that contradict existing standards, and elements that are covered by "undisclosed patents and incomplete licensing terms".

The specification has been shown to be incomplete in its detail and to contain internal and historic contradictions. The design, implementation, limitations and further development of OOXML are exclusive to Microsoft. This goes against the grain of the standards process, which usually requires third-party participation in the design and implementation of a proposed standard.

But the worst part for the programmer is that you can implement every last line of the specification (assuming that were possible), only to find that when you test it against the only existing implementation of OOXML, Office 2007, it will crash because Office 2007 has extensions that are not present in the specification. These include undocumented non-XML data and XML tags in the Office 2007 outputs that do not appear in the ECMA 376 specification.

The reference implementation of MS-OOXML is Office 2007, so if we are to understand OOXML as a standard, we have to be able to translate any document written by Office 2007 to the OOXML format. But Office 2007 documents don't themselves conform to the standard.

The Free Software Foundation Europe has come to a similar conclusion and records that "XLSX documents created by MS Office 2007 have binary content in addition to content described in the proposed MS-OOXML specification."

Using a document downloaded from microsoft.com the FSFE asserts that "the binary content consists of three implementation defined files called printerSettings1.bin, printerSettings2.bin and printerSettings3.bin. They originate from Microsoft and their content is not described in the proposed specification. Examining the binary files in a HEX editor reveals references to 'Microsoft OneNote Import' and 'Letter'. 'Letter' appears to be a reference to page size."
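
Readers who want to repeat this kind of check can do so without a hex editor, because an XLSX file is an ordinary ZIP archive. The following is only a minimal sketch in Python, not the FSFE's own method: the function name list_non_xml_parts and the rule that a part counts as XML if its name ends in .xml or .rels are assumptions made for illustration.

```python
import sys
import zipfile

# Assumption for illustration: a part is treated as XML if its name
# ends in one of these suffixes. Anything else is reported.
XML_SUFFIXES = (".xml", ".rels")


def list_non_xml_parts(package):
    """Return the sorted names of archive members that are not XML parts.

    `package` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(package) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Skip directory entries, then keep members without an XML suffix.
            if not name.endswith("/")
            and not name.lower().endswith(XML_SUFFIXES)
        )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py some_workbook.xlsx
    for part in list_non_xml_parts(sys.argv[1]):
        print(part)
```

Run against a workbook saved by Office 2007, a script along these lines would be expected to list any embedded .bin parts, though it says nothing about what those parts contain. It will also report legitimate non-XML content such as embedded images, so the output is a starting point for inspection rather than a conformance test.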

"Referencing page size in a implementation defined binary file is problematic. Page size information is critical for ensuring the correct layout of a document. European applications without access to the binary information may use A4 page size instead of Letter for displaying the document, thus allowing for more content on each page. Two different users could get the impression they are discussing very different documents when their page numbers do not match."

De facto interoperability

MS-OOXML is not a gift from Microsoft to the standards community. OOXML was devised as a response to the Open Document Format (ODF). Without ODF there would be no OOXML.

From the beginning Microsoft was invited to join the process to develop ODF as a default standard for describing office documents, but chose not to participate. After all, Microsoft "owned" the "de facto" standard (its proprietary data formats) and had every confidence that ODF, which was championed by the free and open source software community as well as by many of the major players, would gain no traction.

Once ODF became an ISO standard and governments around the world began to show an active interest in deploying documents that conformed to the standard, the publication of MS-OOXML became an imperative for Microsoft, to protect its proprietary hold on Office documents, and to perpetuate its monopoly share of the market. Fast tracking MS-OOXML guarantees "ISO approval" for documents created by Microsoft Office, which in turn assures Microsoft's grip on the market, but OOXML itself does not ensure interoperability.

Both ODF and MS-OOXML are imperfect solutions to the problem of data neutrality and the integrity of office documents, because both lack an interoperability framework that might preclude proprietary extensions to the format. This is why many experts in the field are turning to the World Wide Web Consortium's Compound Document Formats standard.

ODF at least benefits from the active participation of multiple vendors and a long passage through the standardisation process. There are multiple implementations of ODF, which is the first prerequisite for acceptance of a standard. It is not clear that any other vendor will ever be able to implement MS-OOXML or to guarantee interoperability with Microsoft Office, and it is not clear that interoperability has ever been an imperative for Microsoft.

The way to interoperate with Microsoft Office is to use Microsoft Office, which is why, whatever the outcome of the ISO decision-making process, many governments across the world will opt to support ODF.