Markup Compatibility and Extensibility

Note from Eric White: This is the second post in a series of guest posts by Florian Reuter. He has written a pretty cool library for working with OPC files (published at http://libopc.codeplex.com/). In the next post, he is going to cover his libopc library.

In the previous blog post, we briefly explained the lowest layer of the Office Open XML specification, the Open Packaging Conventions. In this blog post, we will cover the next layer of the Office Open XML specification, called Markup Compatibility and Extensibility, or MCE for short.

MCE defines a small set of XML elements and attributes that can be used to annotate any XML document. An MCE-aware XML processor can then preprocess the annotated XML document and automatically generate a transformed XML document which ensures the best possible compatibility for the consumer of the XML document.

MCE is a very powerful mechanism when you have to deal with XML documents and schemas which need to be extended in a backward compatible way.

To illustrate the problem, imagine an office productivity software suite which is constantly improved. When a new feature gets implemented in the next version of the software and the file format, the challenge is to introduce the new feature in the file format in a way that both old and new versions of the software can interoperate in the best possible way. If you just add new XML markup to the existing file format, the previous version of the software would likely be unable to open the file, since it would no longer conform to its version of the XML schema, so you need a way to tell the previous version how to handle that unknown markup.

mce:Ignorable

The simplest extensibility mechanism of MCE is “ignorable” elements and attributes. When a new element or attribute is introduced it can be marked as ignorable. Consider the following real life example in Microsoft Word 2010. Word 2010 has a new feature called “glowing text”.

This feature is not present in Word 2007 nor specified in the current version of the Office Open XML specification. Using MCE, this new feature can be introduced in a backward compatible way so that Word 2007 as well as current Office Open XML validators don’t break:

<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
            xmlns:mce="http://schemas.openxmlformats.org/markup-compatibility/2006"
            mce:Ignorable="w14">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:sz w:val="96"/>
          <w14:glow w14:rad="63500">
            <w14:schemeClr w14:val="accent1">
              <w14:alpha w14:val="60000"/>
              <w14:satMod w14:val="175000"/>
            </w14:schemeClr>
          </w14:glow>
        </w:rPr>
        <w:t>“glowing text”</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>

The mechanism works as follows:

1.       A new namespace is defined for the XML markup of the new feature.

2.       The new feature is encoded using that new namespace. In this example: <w14:glow w14:rad="63500">…</w14:glow>.

3.       The new namespace is marked as ignorable using the attribute mce:Ignorable="w14".

Now, each time the file is loaded, an MCE preprocessor is initialized with the set of namespaces the consuming application understands:

·         Word 2010 will tell the MCE preprocessor that it understands “http://schemas.microsoft.com/office/word/2010/wordml”. The preprocessor will then pass the new elements and attributes to Word 2010.

·         Word 2007 and current Office Open XML validators know nothing about the w14 namespace and thus will *not* tell their MCE preprocessors that they understand the namespace. MCE will therefore remove all elements and attributes in the w14 namespace and thus neither Word 2007 nor other Office Open XML validators will get confused by the new namespaces. They will receive valid ISO/IEC 29500:2008 XML markup from the MCE preprocessor.
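To make the mechanism more concrete, here is a minimal sketch of the ignorable step in Python, using only the standard library. It is not how Word or any particular MCE implementation works. The function name strip_ignorable is our own invention, and the sketch assumes that every namespace prefix is bound only once in the document.

```python
import io
import xml.etree.ElementTree as ET

MCE_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"


def namespace_of(name):
    """Return the namespace URI of an ElementTree name such as '{uri}local'."""
    return name[1:].split("}")[0] if name.startswith("{") else ""


def strip_ignorable(xml_text, understood):
    """Hypothetical helper: drop ignorable markup the consumer does not understand."""
    # ElementTree forgets prefixes, so collect the prefix -> URI bindings first.
    # Assumption: each prefix is bound to exactly one URI in the whole document.
    prefixes = dict(
        binding
        for _, binding in ET.iterparse(io.StringIO(xml_text), events=("start-ns",))
    )
    root = ET.fromstring(xml_text)

    def walk(elem, ignored):
        # mce:Ignorable is a whitespace-separated list of prefixes. It applies
        # to the element carrying it and to all of its descendants.
        declared = elem.attrib.pop("{%s}Ignorable" % MCE_NS, "").split()
        ignored = ignored | {
            prefixes[p] for p in declared if prefixes[p] not in understood
        }
        for name in list(elem.attrib):
            if namespace_of(name) in ignored:
                del elem.attrib[name]
        for child in list(elem):
            if namespace_of(child.tag) in ignored:
                elem.remove(child)  # the whole sub tree goes away
            else:
                walk(child, ignored)

    walk(root, frozenset())
    return root


DOC = (
    '<w:document xmlns:w="%s" xmlns:w14="%s" xmlns:mce="%s" mce:Ignorable="w14">'
    '<w:rPr><w:sz w:val="96"/><w14:glow w14:rad="63500"/></w:rPr>'
    "</w:document>" % (W, W14, MCE_NS)
)

word2007 = strip_ignorable(DOC, understood={W})
word2010 = strip_ignorable(DOC, understood={W, W14})
print(len(word2007[0]), len(word2010[0]))  # prints "1 2"
```

The consumer that only understands the “w” namespace sees a w:rPr with a single w:sz child, while the consumer that also understands “w14” still sees the glow element.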

The strength of MCE is the ability to make backward compatible extensions, so Word 2010 can have “glow” without impacting the ability of other processors to consume the rest of the document.

mce:ProcessContent

The MCE ignorable element feature is accompanied by the mce:ProcessContent feature. By default, MCE skips the entire sub tree of an ignorable, non-understood element. Consider the following markup:

<w:document mce:Ignorable="ext">
  <w:body>
    <ext:group>
      <w:p><w:r><w:t>Sample</w:t></w:r></w:p>
      <w:p><w:r><w:t>Content</w:t></w:r></w:p>
    </ext:group>
  </w:body>
</w:document>

An MCE preprocessor which does not understand the “ext” namespace will produce the following output:

<w:document>
  <w:body>
  </w:body>
</w:document>

Since “ext:group” is not understood, the element and all its children are removed.

By using MCE’s mce:ProcessContent attribute you can force the MCE processor to process the content of the ignored element:

<w:document mce:Ignorable="ext" mce:ProcessContent="ext:group">
  <w:body>
    <ext:group>
      <w:p><w:r><w:t>Sample</w:t></w:r></w:p>
      <w:p><w:r><w:t>Content</w:t></w:r></w:p>
    </ext:group>
  </w:body>
</w:document>

The mce:ProcessContent="ext:group" attribute tells the MCE preprocessor to “process the content”, i.e., to drop the ext:group element itself but preserve the sub tree under it. The MCE processor will produce the following output when “ext” is not understood:

<w:document>
  <w:body>
    <w:p><w:r><w:t>Sample</w:t></w:r></w:p>
    <w:p><w:r><w:t>Content</w:t></w:r></w:p>
  </w:body>
</w:document>
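The difference between dropping a sub tree and unwrapping it can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function apply_process_content and the “urn:example:…” namespaces are made up, the set of ignored namespaces is passed in directly instead of being read from mce:Ignorable, and text between elements is not handled.

```python
import xml.etree.ElementTree as ET

EXT = "urn:example:ext"  # made-up namespace URI for the "ext" prefix


def namespace_of(name):
    """Return the namespace URI of an ElementTree name such as '{uri}local'."""
    return name[1:].split("}")[0] if name.startswith("{") else ""


def apply_process_content(elem, ignored_ns, process_content):
    """Hypothetical helper: drop ignored children, unwrapping those that are
    listed in process_content (element names in '{uri}local' form)."""
    kept = []
    for child in list(elem):
        apply_process_content(child, ignored_ns, process_content)
        if namespace_of(child.tag) not in ignored_ns:
            kept.append(child)  # understood markup stays as it is
        elif child.tag in process_content:
            kept.extend(list(child))  # keep the content, lose the wrapper
        # otherwise the element and its whole sub tree are discarded
    elem[:] = kept


body = ET.fromstring(
    '<w:body xmlns:w="urn:example:w" xmlns:ext="urn:example:ext">'
    "<ext:group><w:p>Sample</w:p><w:p>Content</w:p></ext:group>"
    "<ext:other><w:p>Dropped</w:p></ext:other>"
    "</w:body>"
)
apply_process_content(body, {EXT}, {"{%s}group" % EXT})
texts = [p.text for p in body]
print(texts)  # prints "['Sample', 'Content']"
```

The paragraphs under ext:group survive because ext:group is listed as “process content”, while the paragraph under ext:other disappears together with its parent.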

mce:AlternateContent

Besides ignorable elements and attributes accompanied by the ProcessContent flag, MCE has another powerful mechanism: alternate content blocks. If you are a developer, think of a “switch..case..default” statement:

<w:document>
  <w:body>
    <mce:AlternateContent>
      <mce:Choice Requires="w14">
        <w:p>
          <w:r>
            <w:t>New feature!</w:t>
          </w:r>
        </w:p>
      </mce:Choice>
      <mce:Fallback>
        <w:p>
          <w:r>
            <w:t>Fallback content.</w:t>
          </w:r>
        </w:p>
      </mce:Fallback>
    </mce:AlternateContent>
  </w:body>
</w:document>

An MCE processor evaluates the mce:Choice elements in document order. The first mce:Choice whose Requires attribute lists only namespaces the application understands is selected, and its sub tree is processed further. If no mce:Choice is selected, the content of the mce:Fallback element is used instead.
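The “switch..case..default” analogy translates almost literally into code. The following Python sketch is an illustration only: resolve_alternate_content and render are hypothetical names, the prefix-to-namespace map is handed in by the caller rather than read from the document, and mce:Fallback is assumed to come after all mce:Choice elements.

```python
import xml.etree.ElementTree as ET

MCE = "{http://schemas.openxmlformats.org/markup-compatibility/2006}"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"


def resolve_alternate_content(elem, understood, prefixes):
    """Hypothetical helper: replace every AlternateContent block below elem
    with the content of the branch that was selected."""
    resolved = []
    for child in list(elem):
        if child.tag != MCE + "AlternateContent":
            resolve_alternate_content(child, understood, prefixes)
            resolved.append(child)
            continue
        for branch in child:
            if branch.tag == MCE + "Choice":
                required = branch.get("Requires", "").split()
                # "case": every required namespace must be understood
                if all(prefixes.get(p) in understood for p in required):
                    break
            elif branch.tag == MCE + "Fallback":
                break  # "default": no earlier Choice matched
        else:
            continue  # nothing selected, so the block produces no output
        resolve_alternate_content(branch, understood, prefixes)
        resolved.extend(list(branch))
    elem[:] = resolved


DOC = (
    '<w:body xmlns:w="urn:example:w" '
    'xmlns:mce="http://schemas.openxmlformats.org/markup-compatibility/2006">'
    "<mce:AlternateContent>"
    '<mce:Choice Requires="w14"><w:p>New feature!</w:p></mce:Choice>'
    "<mce:Fallback><w:p>Fallback content.</w:p></mce:Fallback>"
    "</mce:AlternateContent>"
    "</w:body>"
)


def render(understood):
    """Parse DOC, resolve it for one consumer and return the paragraph texts."""
    body = ET.fromstring(DOC)
    resolve_alternate_content(body, understood, prefixes={"w14": W14})
    return [p.text for p in body]


print(render({W14}))  # prints "['New feature!']"
print(render(set()))  # prints "['Fallback content.']"
```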

mce:MustUnderstand

Another useful MCE feature is the mce:MustUnderstand attribute. The idea is very simple: If a processor does not understand a namespace listed in the mce:MustUnderstand attribute, the processing should stop immediately. This allows a preprocessor to halt at the point it encounters the mce:MustUnderstand rather than when it encounters an unknown element or attribute later in the file.

<w:document mce:MustUnderstand="w20 w21 w22">
  <w:body>
    <!-- new incompatible feature here -->
  </w:body>
</w:document>
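A check like this is easy to sketch. The Python fragment below is a rough illustration with invented names (NotUnderstoodError, check_must_understand, the “urn:example:w20” namespace). It works on an already parsed tree for brevity, whereas a real preprocessor would more likely perform the check while streaming through the file.

```python
import xml.etree.ElementTree as ET

MCE = "{http://schemas.openxmlformats.org/markup-compatibility/2006}"


class NotUnderstoodError(Exception):
    """Raised when a namespace listed in mce:MustUnderstand is not understood."""


def check_must_understand(root, understood, prefixes):
    """Hypothetical helper: fail at the first MustUnderstand that cannot be met."""
    for elem in root.iter():  # document order, starting with the root itself
        for prefix in elem.get(MCE + "MustUnderstand", "").split():
            if prefixes.get(prefix) not in understood:
                raise NotUnderstoodError("namespace prefix %r is required" % prefix)


root = ET.fromstring(
    '<w:document xmlns:w="urn:example:w" '
    'xmlns:mce="http://schemas.openxmlformats.org/markup-compatibility/2006" '
    'mce:MustUnderstand="w20"><w:body/></w:document>'
)
prefixes = {"w20": "urn:example:w20"}  # supplied by the caller in this sketch

try:
    check_must_understand(root, understood={"urn:example:w"}, prefixes=prefixes)
    halted = False
except NotUnderstoodError:
    halted = True

print(halted)  # prints "True"
```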

Suspending MCE processing

Finally, an MCE processor needs to be able to “suspend” processing for selected elements. An MCE processor is therefore initialized with a list of elements in which MCE processing should be suspended. Consider a file format which embeds XHTML fragments like this:

<my:format>
  <my:xhtml-note>
    <xhtml:p>My XHTML Note</xhtml:p>
  </my:xhtml-note>
</my:format>

The “my”-format allows XHTML in the <my:xhtml-note>..</my:xhtml-note> tags. Since XHTML knows nothing about MCE, the MCE processing needs to be disabled while parsing XHTML. A “my”-format processing application will therefore tell the MCE parser to suspend MCE processing inside the my:xhtml-note tag.

If we additionally allow Office Open XML paragraphs inside another <my:ooxml-note>…</my:ooxml-note> tag, the naive approach would be to simply not tell the MCE processor to suspend processing there. However, not turning MCE processing off is not enough. Sometimes we have to turn MCE processing on explicitly, as the following nested example shows:

<my:format>
  <my:xhtml-note>
    <xhtml:p>XHTML content</xhtml:p>
    <xhtml:embed>
      <my:ooxml-note>
        <w:p>OOXML content</w:p>
      </my:ooxml-note>
    </xhtml:embed>
  </my:xhtml-note>
</my:format>

In order to parse the above fragment correctly an MCE processor needs to

1.       Turn MCE processing on inside “my:format”, since our format uses MCE by design;

2.       Turn MCE processing off inside the “my:xhtml-note”, since XHTML knows nothing about MCE; and

3.       Turn MCE processing on again inside the “my:ooxml-note”, since OOXML uses MCE by design too.
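The three steps above amount to carrying an on/off flag down the element tree. Here is a minimal Python sketch of that idea. The function mce_states, the two configuration sets and the “urn:example:my” namespace are assumptions made for this illustration and are not taken from any real MCE library.

```python
import xml.etree.ElementTree as ET

MY = "{urn:example:my}"  # made-up namespace for the "my" format


def mce_states(elem, turn_on, turn_off, active=False):
    """Hypothetical helper: yield (local name, active) for every element,
    where active tells whether MCE processing applies inside that element."""
    if elem.tag in turn_on:
        active = True
    elif elem.tag in turn_off:
        active = False
    # every other element simply inherits the state of its parent
    yield elem.tag.split("}")[-1], active
    for child in elem:
        yield from mce_states(child, turn_on, turn_off, active)


DOC = (
    '<my:format xmlns:my="urn:example:my" '
    'xmlns:xhtml="http://www.w3.org/1999/xhtml" '
    'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<my:xhtml-note><xhtml:p>XHTML content</xhtml:p>"
    "<xhtml:embed><my:ooxml-note><w:p>OOXML content</w:p></my:ooxml-note></xhtml:embed>"
    "</my:xhtml-note></my:format>"
)

states = list(
    mce_states(
        ET.fromstring(DOC),
        turn_on={MY + "format", MY + "ooxml-note"},
        turn_off={MY + "xhtml-note"},
    )
)
for name, active in states:
    print(name, "on" if active else "off")
```

Running this prints “on” for format, “off” for xhtml-note, the XHTML p and embed, and “on” again for ooxml-note and the paragraph inside it.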

The main purpose of the above example is to show how critical correct initialization of the MCE processor is for obtaining correct results. An MCE processor needs to be configured with a set of elements where MCE processing needs to be turned on and a set of elements where MCE processing needs to be turned off. We will now discuss a real world example which requires correct initialization.

Office Open XML Extension Lists

Office Open XML has another extension mechanism called “extLst”. The following is a real example produced by PowerPoint 2010:

<dgm:extLst>
  <a:ext uri="http://schemas.microsoft.com/office/drawing/2008/diagram">
    <dsp:dataModelExt minVer="http://schemas.openxmlformats.org/drawingml/2006/diagram"
                      relId="rId6"
                      xmlns:dsp="http://schemas.microsoft.com/office/drawing/2008/diagram"/>
  </a:ext>
</dgm:extLst>

According to the Office Open XML specification, these extension lists can contain arbitrary elements. PowerPoint 2010 uses these extension lists to encode new features as shown in the XML fragment above. To parse extension lists correctly, MCE processing needs to be suspended inside the <dgm:extLst> element. Otherwise MCE would remove the <dsp:dataModelExt> element in the above example, since MCE is not aware of the “dsp” namespace. Alternatively, the MCE processor could see this as an error (an unknown, non-ignorable namespace) and fail processing. However, be aware that it might be required to turn MCE processing on again for selected elements, if the element inside the extension list is aware of MCE by design. The MCE processor needs to know implicitly whether a given namespace/markup language understands and uses MCE.

mce:PreserveElements and mce:PreserveAttributes

The Office Open XML specification also defines two additional attributes: mce:PreserveElements and mce:PreserveAttributes.

These attributes define which elements and attributes should be automatically preserved by MCE. When an element or attribute is “ignored” by MCE, mce:PreserveElements and mce:PreserveAttributes define which elements or attributes should be preserved rather than discarded from the content passed from the preprocessor to the consumer application.

However, the problem is that automatic preservation of ignored content is very hard and can only be done with the help of the layers above MCE. The mce:PreserveElements and mce:PreserveAttributes are more or less “hints” to the upper layers. An MCE processor will not try to preserve the elements itself. A mechanism could work like this:

1.       If an MCE processor encounters an element which has “ignored” child elements or “ignored” attributes it will generate a unique token and store the ignored elements and attributes under that token in its internal map. The MCE processor will give this token to the application.

2.       When the application saves the MCE stream again it will pass the token back to the MCE processor which will automatically add the previously ignored elements and attributes to the element which came with the token.
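The token idea can be sketched as follows in Python. Again, this is just one possible reading of the mechanism described above: the PreservationMap class and its method names are invented, tokens are plain integers, and restored children are simply appended at the end of the element instead of being put back at their original position.

```python
import itertools
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"


def namespace_of(name):
    """Return the namespace URI of an ElementTree name such as '{uri}local'."""
    return name[1:].split("}")[0] if name.startswith("{") else ""


class PreservationMap:
    """Hypothetical internal map from tokens to ignored markup."""

    def __init__(self):
        self._tokens = itertools.count(1)
        self._stash = {}

    def strip(self, elem, ignored_ns):
        """Step 1: remove ignored children and attributes, return a token."""
        children = [c for c in elem if namespace_of(c.tag) in ignored_ns]
        attrs = {k: v for k, v in elem.attrib.items() if namespace_of(k) in ignored_ns}
        if not children and not attrs:
            return None  # nothing was ignored, so no token is needed
        for child in children:
            elem.remove(child)
        for name in attrs:
            del elem.attrib[name]
        token = next(self._tokens)
        self._stash[token] = (children, attrs)
        return token

    def restore(self, elem, token):
        """Step 2: add the stashed markup back when the element is saved."""
        children, attrs = self._stash.pop(token)
        elem.extend(children)  # simplification: original order is not kept
        elem.attrib.update(attrs)


rpr = ET.fromstring(
    '<w:rPr xmlns:w="%s" xmlns:w14="%s">'
    '<w:sz w:val="96"/><w14:glow w14:rad="63500"/></w:rPr>' % (W, W14)
)
store = PreservationMap()
token = store.strip(rpr, {W14})
tags_after_strip = [c.tag.split("}")[-1] for c in rpr]
store.restore(rpr, token)
tags_after_restore = [c.tag.split("}")[-1] for c in rpr]
print(tags_after_strip, tags_after_restore)  # prints "['sz'] ['sz', 'glow']"
```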

The drawback of this approach is that the MCE processor might use a lot of memory for its internal map. Moreover, the standards body that maintains the Office Open XML specification recently decided to drop mce:PreserveElements and mce:PreserveAttributes in the next edition of the standard. Additionally, mce:PreserveElements and mce:PreserveAttributes are currently not used by Microsoft Office. For example, Microsoft Word 2010 did not use the mce:PreserveElements attribute to indicate that the w14:glow element should be preserved. For more information please refer to the “MCE Best Practices” whitepaper which discusses how MCE is used inside Microsoft Office.

Next

Now that we have covered the “theoretical” basics of OPC and MCE, we are ready to dive into some code. The next blog post will introduce libopc, an open-source library for reading and writing OPC and MCE.