libopc: Open Packaging Convention

Note from Eric White: This is the first post in a series of guest posts by Florian Reuter. He has written a pretty cool library for working with OPC files (published at http://libopc.codeplex.com/). In upcoming posts, he is going to cover Markup Compatibility and Extensibility (MCE) and his libopc library.

The Open Packaging Convention (OPC) is Part 2 of the Office Open XML standard, the standard behind the new .docx, .xlsx and .pptx Office formats.

The OPC defines a container format which can store any kind of data; it is not tied to the Office formats. For example, the XML Paper Specification (XPS) also uses OPC as its packaging layer.

In many ways OPC can be seen as the successor of the OLE containers used by the proprietary .DOC, .XLS and .PPT formats. Unlike OLE containers, which are modeled on the FAT file system format, OPC containers are valid ZIP archives plus some extra metadata.

This means that any OPC container can be opened with a ZIP program. Try it yourself: create a .docx, .xlsx or .pptx file and change the extension to .zip. A simple double-click will expose the container structure in the built-in Windows viewer.

The metadata is encoded in the additional “_rels” folders and in the “[Content_Types].xml” file.
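The same experiment can be done from code. The following is only a sketch in Python using the standard zipfile module: it builds a tiny stand-in package in memory (the helper names and the placeholder part contents are made up for illustration, this is not a valid Word document) and then separates the OPC metadata entries from the ordinary parts.

```python
import io
import zipfile


def build_sample_package():
    """Create a tiny stand-in package in memory (placeholder contents only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/_rels/document.xml.rels", "<Relationships/>")
    return buffer.getvalue()


def list_entries(data):
    """Return the names of all ZIP entries in the container."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.namelist()


def is_metadata(name):
    """True for the content type list and for anything inside a "_rels" folder."""
    return (
        name == "[Content_Types].xml"
        or name.startswith("_rels/")
        or "/_rels/" in name
    )


entries = list_entries(build_sample_package())
metadata = [name for name in entries if is_metadata(name)]
parts = [name for name in entries if not is_metadata(name)]

print("metadata:", metadata)
print("parts:   ", parts)
```

To look at a real file, replace the in-memory bytes with zipfile.ZipFile("Hello World.docx"); no renaming to .zip is needed.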

In order to really understand the OPC it is important to understand the abstract OPC container structure first.

Abstract OPC container structure

First of all, every OPC container specifies a set of MIME types, also known as content types. Typical content types in a .DOCX document are:

Content Types

application/vnd.openxmlformats-officedocument.customXmlProperties+xml

application/vnd.openxmlformats-officedocument.extended-properties+xml

application/vnd.openxmlformats-officedocument.theme+xml

application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml

application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml

application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml

application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml

application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml

application/vnd.openxmlformats-package.core-properties+xml

application/vnd.openxmlformats-package.relationships+xml

application/xml

Additionally, every OPC container has a “default” binding between an extension and a content type. E.g.:

Extension Type
rels application/vnd.openxmlformats-package.relationships+xml
xml application/xml
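Inside the ZIP archive these bindings live in “[Content_Types].xml”. Here is a minimal sketch in Python of how a content type could be looked up. The sample XML and the function name content_type_of are made up for illustration. Note that besides the Default elements, the file format also allows Override elements that bind one specific part name to a content type; the sketch assumes that an Override wins over a Default.

```python
import xml.etree.ElementTree as ET

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Hand-written sample, reduced to three entries for illustration.
SAMPLE_TYPES = """
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels"
           ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml"
            ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>
""".strip()


def content_type_of(part_name, types_xml):
    """Return the content type of a part, or None if nothing matches."""
    root = ET.fromstring(types_xml)

    # 1. A per-part Override takes precedence.
    for override in root.findall(CT_NS + "Override"):
        if override.get("PartName") == "/" + part_name:
            return override.get("ContentType")

    # 2. Otherwise fall back to the default binding for the extension.
    _, dot, extension = part_name.rpartition(".")
    if dot:
        for default in root.findall(CT_NS + "Default"):
            if default.get("Extension", "").lower() == extension.lower():
                return default.get("ContentType")
    return None


document_type = content_type_of("word/document.xml", SAMPLE_TYPES)
item_type = content_type_of("customXml/item1.xml", SAMPLE_TYPES)
rels_type = content_type_of("_rels/.rels", SAMPLE_TYPES)
image_type = content_type_of("word/media/image1.png", SAMPLE_TYPES)

print(document_type)
print(item_type)
print(rels_type)
print(image_type)
```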

Next, every OPC container defines a set of “relation types”. Relation types have the same form as XML namespace names. Typical relation types in a .DOCX file are:

Relation Types

http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml

http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps

http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties

http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable

http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink

http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument

http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings

http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles

http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme

http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings

http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties

An OPC container also keeps a list of all external relations. E.g. when a .DOCX document contains a hyperlink to “http://naverage.com”, this external link is stored as an external relation:

External Relations
http://naverage.com

Data is stored inside an OPC container as parts. A part has a hierarchical name and a type. Here are the typical parts of a .DOCX document:

Part                      Type
customXml/item1.xml       application/xml
customXml/itemProps1.xml  application/vnd.openxmlformats-officedocument.customXmlProperties+xml
docProps/app.xml          application/vnd.openxmlformats-officedocument.extended-properties+xml
docProps/core.xml         application/vnd.openxmlformats-package.core-properties+xml
word/document.xml         application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
word/fontTable.xml        application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml
word/settings.xml         application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml
word/styles.xml           application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml
word/theme/theme1.xml     application/vnd.openxmlformats-officedocument.theme+xml
word/webSettings.xml      application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml

Finally, an OPC container stores relations between parts. Consider e.g. the parts “word/document.xml” and “word/styles.xml”. There is obviously a relation between these two parts, in that the “word/styles.xml” part contains the style definitions referenced in the “word/document.xml” part. Therefore a typical .DOCX document establishes a relation similar to the following:

Source             Id    Destination      Type
word/document.xml  rId2  word/styles.xml  http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles

A relation inside an OPC container has a source part, a destination part as well as a relation id and a relation type. The relation id is unique with respect to the source part, i.e. no two relations which leave a source part have the same id.

An OPC container also has a virtual root part (here denoted with “[root]” or “/”), which is used to model the root of the relation hierarchy.

Here are the typical relations found in a .DOCX file:

Source  Id    Destination        Type
[root]  rId1  word/document.xml  http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument
[root]  rId2  docProps/core.xml  http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties
[root]  rId3  docProps/app.xml   http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties

Source               Id    Destination               Type
customXml/item1.xml  rId1  customXml/itemProps1.xml  http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps

Source             Id    Destination            Type
word/document.xml  rId1  customXml/item1.xml    http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml
word/document.xml  rId2  word/styles.xml        http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles
word/document.xml  rId3  word/settings.xml      http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings
word/document.xml  rId4  word/webSettings.xml   http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings
word/document.xml  rId5  http://naverage.com    http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink
word/document.xml  rId6  word/fontTable.xml     http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable
word/document.xml  rId7  word/theme/theme1.xml  http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme
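In the ZIP archive, the relations leaving a part are stored in a “.rels” file inside a “_rels” folder next to that part, and the targets of internal relations are written relative to the source part. The following Python sketch turns such a file back into rows like the ones above. The sample XML is hand-written and shortened, and the function names rels_part_name and read_relations are made up for illustration; the empty string stands for [root].

```python
import posixpath
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
REL_BASE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"

# Hand-written, shortened sample of word/_rels/document.xml.rels.
SAMPLE_RELS = f"""
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="{REL_BASE}customXml" Target="../customXml/item1.xml"/>
  <Relationship Id="rId2" Type="{REL_BASE}styles" Target="styles.xml"/>
  <Relationship Id="rId5" Type="{REL_BASE}hyperlink" Target="http://naverage.com"
                TargetMode="External"/>
  <Relationship Id="rId7" Type="{REL_BASE}theme" Target="theme/theme1.xml"/>
</Relationships>
""".strip()


def rels_part_name(source):
    """Name of the ZIP entry holding the relations that leave `source`."""
    folder, base = posixpath.split(source)
    return posixpath.join(folder, "_rels", base + ".rels")


def read_relations(source, rels_xml):
    """Return (source, id, destination, type) rows for one .rels file."""
    rows = []
    folder = posixpath.dirname(source)
    for rel in ET.fromstring(rels_xml).findall(REL_NS + "Relationship"):
        target = rel.get("Target")
        if rel.get("TargetMode") != "External":
            # Internal targets are resolved against the source part's folder.
            target = posixpath.normpath(posixpath.join(folder, target)).lstrip("/")
        rows.append((source or "[root]", rel.get("Id"), target, rel.get("Type")))
    return rows


root_rels_name = rels_part_name("")
document_rels_name = rels_part_name("word/document.xml")
rows = read_relations("word/document.xml", SAMPLE_RELS)

print(root_rels_name, document_rels_name)
for row in rows:
    print(*row)
```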

Navigating through an OPC container

One of the peculiarities of the OPC is how you navigate within an OPC container. Although most APIs let you access parts directly, the relations are usually used to find the right part.

Let’s suppose you want to open the document part of a DOCX document. The straightforward, but wrong, way would be to check whether the OPC container has a “word/document.xml” stream and open it if present. Even if you additionally check that the “word/document.xml” stream has the content type “application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml”, this would be the wrong way to handle a DOCX document, since the part name “word/document.xml” is only a convention; what identifies the document part is the relation that points to it.

The right way to access the document part of a DOCX document is to check whether the OPC container has a relation leaving [root] of type “http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument”. If so, we follow the relation and then check the content type of the relation’s target part. If the content type of the target part is “application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml”, we have a DOCX document.
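The lookup described above can be sketched in a few lines of Python with only the standard library. This is an illustration under assumptions, not how libopc or the Windows APIs do it: the helper names are made up, and the sample package is built in memory with its document part deliberately stored as “content/main.xml” to show that the part name plays no role.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
OFFICE_DOCUMENT = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)
WORD_MAIN = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
)


def build_sample_package():
    """Stand-in package whose document part is NOT named word/document.xml."""
    content_types = (
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Default Extension="rels"'
        ' ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
        '<Default Extension="xml" ContentType="application/xml"/>'
        f'<Override PartName="/content/main.xml" ContentType="{WORD_MAIN}"/>'
        "</Types>"
    )
    root_rels = (
        '<Relationships'
        ' xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOCUMENT}" Target="content/main.xml"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", content_types)
        package.writestr("_rels/.rels", root_rels)
        package.writestr("content/main.xml", "<document/>")  # placeholder content
    return buffer.getvalue()


def content_type_of(package, part_name):
    """Look up a part's content type: Override first, then Default by extension."""
    types = ET.fromstring(package.read("[Content_Types].xml"))
    for override in types.findall(CT_NS + "Override"):
        if override.get("PartName") == "/" + part_name:
            return override.get("ContentType")
    _, dot, extension = part_name.rpartition(".")
    for default in types.findall(CT_NS + "Default"):
        if dot and default.get("Extension", "").lower() == extension.lower():
            return default.get("ContentType")
    return None


def find_main_document(data):
    """Follow the officeDocument relation from [root]; return the part name or None."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root_rels = ET.fromstring(package.read("_rels/.rels"))
        for rel in root_rels.findall(REL_NS + "Relationship"):
            if rel.get("Type") != OFFICE_DOCUMENT:
                continue
            if rel.get("TargetMode") == "External":
                continue
            # Targets of [root] relations are relative to the package root.
            target = posixpath.normpath(rel.get("Target")).lstrip("/")
            if content_type_of(package, target) == WORD_MAIN:
                return target
    return None


main_part = find_main_document(build_sample_package())
print(main_part)
```

A check for a hard-coded “word/document.xml” entry would reject this package, while following the relation finds the document part under whatever name it was stored.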

Libraries / libopc

Windows comes with two different libraries for handling OPC containers: an unmanaged COM-based API and a managed .NET-based API. Documentation for the two APIs can be found at http://msdn.microsoft.com/en-us/library/windows/desktop/dd742822.aspx and http://msdn.microsoft.com/en-us/library/system.io.packaging.aspx.

In this series of blog posts we will use libopc (libopc.codeplex.com), a FREE and open source library for dealing with the OPC which can be used on Windows as well as on Linux, iOS and Android.

Libopc comes with a command-line tool, “opc_dump”, which dumps the structure of an OPC container. This tool is very handy and can be used like this:

> opc_dump "Hello World.docx" > dump.txt

Next

In the next post we will take a look at the layer above OPC, called Markup Compatibility and Extensibility (MCE), before we take a closer look at libopc.

Florian Reuter (CEO of Naverage UG http://naverage.com and coordinator of http://libopc.codeplex.com)