Keeping Meta-Data with Content Items in Open XML WordprocessingML
Using Open XML WordprocessingML, we can associate metadata with content items such as paragraphs,
tables, rows, cells, numbered and bulleted lists, and list items.
This approach was inspired by some conversations with Ken Stearn and Jan Urbanski of Spectra Global Solutions.
The end user can edit this document that contains metadata, and after editing, our program can retrieve the metadata associated with each and every content item. If the user has modified a paragraph, no matter how much, we can retrieve the metadata for that paragraph. If the user has cut and pasted (or moved) content, we can retrieve the correct metadata from the newly located paragraph. If the user has copied and pasted content, we can retrieve the correct metadata from the original paragraph or content, and we can know that the pasted paragraph or content has been inserted, and therefore has no metadata associated with it.
This approach, made possible by processing documents using Open XML, is not immediately obvious. However, once you understand the approach, which I will explain in depth in this blog post and associated videos, it is not very difficult. It is made easier by a new class, IdentityManager, that I am introducing into PowerTools for Open XML.
So, if associating metadata with content in Open XML WordprocessingML is interesting to you, grab a cup of coffee, settle in, and
we’ll explore exactly how we can use WordprocessingML fields, revision tracking markup, and document editing protection to achieve correct round-tripping of metadata in a WordprocessingML document. First, watch the following video:
[View:http://www.youtube.com/watch?v=fteKdmozA1U]
The Scenario
One of the most interesting scenarios around Open XML is where application developers enable their end users to use Word 2007, 2010, or
2013 to maintain complex structured information. The process is typically as follows:
- The software system pulls information from an enterprise application or line-of-business (LOB) system.
- The system constructs an Open XML WordprocessingML document from that information.
- The system then enables the end user to edit that structured information using Word.
- After the user finishes editing that information, the software extracts the information from the Word document and updates the data behind the enterprise application or LOB system.
There are a lot of reasons why system architects design a software system along these lines.
- The data in the underlying system may have a similar structure to a word-processing document. It may contain paragraphs, tables, images, and numbered lists. Because of this similarity in the structure of the data, using the familiar and powerful user interface of Word appeals to the system architect. This approach reduces training costs.
- Microsoft Word is a robust, debugged application that handles many scenarios. By relying on Word as an integral component of the system, we reduce development cost. We need not invest scarce development resources in developing a user interface for a portion of the software system. Word provides the user interface.
- Sometimes in the process of editing in Word, before integrating back into the enterprise application or LOB system, the user wants to use the collaborative features of Word; they want to send the document to others for comments or further revisions. The system architect may even want to integrate with SharePoint, and have the Word documents participate in SharePoint workflows.
In order to process this data after the user edits it, we must
associate metadata with content items. We need this metadata so that we can
properly update the data in the back-end system.
We could associate metadata with content controls. However,
editing a document with content controls can be cumbersome. It takes a fair
amount of time to train an end user to properly use the Developer tab to insert
a content control, and even for a trained user, the process can be error-prone.
Content controls shine in certain scenarios, but in others they get in the
user's way.
Overview of the Approach
In short, we are going to take the following approach:
- We will insert a custom Field at the beginning of each and every paragraph of the document. This custom field will have one and only one argument, which is an index into a custom XML part, which will contain the metadata for each Content Item in the document. The editing behavior of Word with regards to fields is exactly what we want in order to maintain integrity of metadata.
- We will turn on revision tracking for the document. Further, we are going to ‘lock’ revision tracking into the ‘on’ state, so that the user can’t turn it off.
- We will use the IdentityManager class of PowerTools for Open XML to post-process the markup of the document. With a little bit of knowledge of revision tracking markup, we can understand exactly how the document was edited, and we can process the revision tracking markup in such a way that the metadata is properly maintained through the process of the user editing the document.
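To make the second step a little more concrete, here is a minimal sketch, in Python with only the standard library, of what turning on and locking revision tracking can look like at the markup level. It assumes that the lock is expressed as an enforced tracked-changes document protection in the settings part, which is what the w:trackRevisions and w:documentProtection elements of WordprocessingML provide; the post's own example code may do this differently. The function name lock_revision_tracking is my own, and the sketch works on an empty w:settings element. A real settings part constrains the order of its child elements, so simply appending, as done here, is a simplification.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(name):
    """Return the namespace-qualified name for a WordprocessingML element or attribute."""
    return f"{{{W}}}{name}"


def lock_revision_tracking(settings):
    """Turn on revision tracking and enforce tracked-changes protection.

    Assumption: `settings` is the w:settings root element. Child order is
    schema-constrained in real documents; this sketch just appends.
    """
    if settings.find(w("trackRevisions")) is None:
        ET.SubElement(settings, w("trackRevisions"))
    protection = settings.find(w("documentProtection"))
    if protection is None:
        protection = ET.SubElement(settings, w("documentProtection"))
    protection.set(w("edit"), "trackedChanges")
    protection.set(w("enforcement"), "1")
    return settings


settings = ET.fromstring(f'<w:settings xmlns:w="{W}"/>')
lock_revision_tracking(settings)
print(ET.tostring(settings, encoding="unicode"))
```

No password is set in this sketch, so the protection only guards against turning tracking off by accident.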
All of this, although not complicated, requires a bit of
explanation. First, let’s discuss what I mean by ‘Content Items’.
Content Items
Content Items are those artifacts in a
word-processing document for which we need to keep metadata. There are six
varieties of Content Items:
- Paragraphs
- Tables
- Rows
- Cells
- Lists
- List Items
Paragraphs, tables, rows, and cells do not require much
explanation. We need to maintain integrity of metadata when the user moves a
paragraph, cuts and pastes a paragraph, copies and pastes a paragraph, deletes
a paragraph, or inserts a paragraph. The same applies to tables – if the user
deletes and inserts a table, we need to properly associate the table’s metadata
with the table at its new location. This also applies to rows. If the user
moves, copies and pastes, inserts, or deletes rows, we want to know it, and to
maintain integrity of metadata. The same also applies to cells.
Lists require a bit more explanation. There is no markup
artifact for a list as such in WordprocessingML markup.
We only have the markup for all of the items in a list. In effect, a list consists
of all of the list items that make it up. However, we can easily use the ListItemRetriever
class of PowerTools for Open XML to identify all of the list items in a
list. We can then place the metadata for a list on the first list item of the
list. I’ve simplified the matter a little bit here, but this is the gist of
the idea. I’ll be elaborating on this in greater detail in subsequent posts
and videos.
One additional point to make – as I’ve noted, we’ll
associate the metadata for a list with the first list item in a list. In a
similar fashion, we’ll associate the metadata for a table with the first
paragraph in the first cell in the table. We’ll associate the metadata for a
row with the first paragraph in the first cell in the row. We’ll associate the
metadata for a cell with the first paragraph in a cell.
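The placement rule for tables can be written down as a small function. The following is only a sketch in Python: the function name items_for_table is hypothetical, and the type names TABLE, ROW, CELL, and PARA are the values that appear in the t attribute of the custom XML part in this post. For each cell, in row-major order, it lists the Content Items whose metadata rides on the field in that cell's first paragraph.

```python
def items_for_table(rows, cols):
    """List, per cell in row-major order, the Content Item types carried
    by the field in the first paragraph of that cell."""
    result = []
    for r in range(rows):
        for c in range(cols):
            types = []
            if r == 0 and c == 0:
                types.append("TABLE")  # first paragraph of the first cell of the table
            if c == 0:
                types.append("ROW")    # first paragraph of the first cell of the row
            types += ["CELL", "PARA"]  # every cell's first paragraph carries these
            result.append(types)
    return result


for types in items_for_table(2, 2):
    print(types)
```

For a table with two rows and two columns, only the first cell carries the table's metadata, and only the first cell of each row carries that row's metadata.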
A picture here will be helpful. In the picture below, you
can see a number of Content Items:
- Two paragraphs
- A small table with two rows and two columns
- A list with three list items
Each paragraph contains a field with a Field
Id, which is simply a pointer into the custom XML part. I’ve denoted the Field
Ids by putting them in curly braces, starting at 1, going through 9. These
Field Ids are not semantically significant. We don’t really care what
the numbers are – only that they are unique, and that they point to the correct
metadata element in the custom XML. Each Content Item contains a Content
Item Id, which uniquely identifies it. Each Content Item also
contains a Content Item Type. And finally, any metadata is stored with
the Content Item.
The custom XML part looks like this:
<?xml version="1.0" encoding="utf-8"?>
<identityManager>
  <f id="1">
    <i t="PARA" u="1">
      <metadataGoesHere anyAttribute="some data">
        <anyElement>some data</anyElement>
      </metadataGoesHere>
    </i>
  </f>
  <f id="2">
    <i t="PARA" u="2">
      <metadataGoesHere anyAttribute="some data">
        <anyElement>some data</anyElement>
      </metadataGoesHere>
    </i>
  </f>
  <f id="3">
    <i t="TABLE" u="3">
      <metadata/>
    </i>
    <i t="ROW" u="4">
      <metadata/>
    </i>
    <i t="CELL" u="5">
      <metadata/>
    </i>
    <i t="PARA" u="6">
      <metadata/>
    </i>
  </f>
  <f id="4">
    <i t="CELL" u="7">
      <metadata/>
    </i>
    <i t="PARA" u="8">
      <metadata/>
    </i>
  </f>
  <f id="5">
    <i t="ROW" u="9">
      <metadata/>
    </i>
    <i t="CELL" u="10">
      <metadata/>
    </i>
    <i t="PARA" u="11">
      <metadata/>
    </i>
  </f>
  <f id="6">
    <i t="CELL" u="12">
      <metadata/>
    </i>
    <i t="PARA" u="13">
      <metadata/>
    </i>
  </f>
  <f id="7">
    <i t="PARA" u="14">
      <metadata/>
    </i>
    <i t="LIST" u="15">
      <metadata/>
    </i>
    <i t="ITEM" u="16">
      <metadata/>
    </i>
  </f>
  <f id="8">
    <i t="PARA" u="17">
      <metadata/>
    </i>
    <i t="ITEM" u="18">
      <metadata/>
    </i>
  </f>
  <f id="9">
    <i t="PARA" u="19">
      <metadata/>
    </i>
    <i t="ITEM" u="20">
      <metadata/>
    </i>
  </f>
</identityManager>
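Reading this part back is straightforward. The following is a minimal sketch in Python, using only the standard library, of looking up the Content Items for a given Field Id. The element and attribute names (f, i, id, t, u) are taken from the part above; the function name items_for_field and the abbreviated sample part are my own, for illustration only.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the custom XML part, limited to two fields.
PART = """
<identityManager>
  <f id="3">
    <i t="TABLE" u="3"><metadata/></i>
    <i t="ROW" u="4"><metadata/></i>
    <i t="CELL" u="5"><metadata/></i>
    <i t="PARA" u="6"><metadata/></i>
  </f>
  <f id="4">
    <i t="CELL" u="7"><metadata/></i>
    <i t="PARA" u="8"><metadata/></i>
  </f>
</identityManager>
"""


def items_for_field(root, field_id):
    """Return (type, content_item_id, metadata_element) for each Content
    Item that a Field Id points to, or an empty list for an unknown id."""
    field = root.find(f"./f[@id='{field_id}']")
    if field is None:
        return []
    return [
        (item.get("t"), item.get("u"), item[0] if len(item) else None)
        for item in field.findall("i")
    ]


root = ET.fromstring(PART)
for item_type, item_id, metadata in items_for_field(root, 3):
    print(item_type, item_id, metadata.tag)
```

A Field Id that is missing from the part yields an empty list, which is how this sketch represents a paragraph that has no metadata associated with it.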
How the Technique Works
Now that we’ve identified what a Content Item is, we
can discuss how the technique works. The procedure would be:
- You extract data from your database or other data source. You construct your WordprocessingML document from the data.
- You insert field codes at the beginning of every paragraph of the document. Each field code has a single argument, which is an integer that points into the custom XML part. In addition, you construct the custom XML part with all appropriate metadata. I’ve provided an example that shows how to do this field code insertion. You can find it in the InsertIdentityManagerFieldCodes example that is in the project in the zip file that is attached to this post.
- You turn on revision tracking, and you lock the document so that the end user can’t turn off revision tracking. The InsertIdentityManagerFieldCodes example also contains an example of this code.
- You turn the document over to your end user to edit.
- After the end user has edited the document, you process the edited document with the IdentityManager class. This code accepts all tracked revisions, and at the same time, it reconstructs the document such that each paragraph contains a single field code that points to the custom XML part, and the custom XML part contains the appropriate metadata.
- You then process the resulting document. You can use the metadata along with the content in the document to update the LOB system or enterprise application.
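The field code insertion step can be sketched at the markup level as follows. This is not the InsertIdentityManagerFieldCodes example from the zip file; it is a minimal Python illustration, using only the standard library, that places a simple field (w:fldSimple) at the start of every paragraph. The field name IDMGR is a hypothetical placeholder, since this post does not name the field, and the function name insert_field_codes is also my own.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)

FIELD_NAME = "IDMGR"  # hypothetical placeholder; the post does not name the field


def w(name):
    """Return the namespace-qualified name for a WordprocessingML element or attribute."""
    return f"{{{W}}}{name}"


def insert_field_codes(body):
    """Give every w:p under `body` a leading simple field whose single
    argument is a unique integer. Returns the Field Ids in document order."""
    field_ids = []
    # Materialize the list first so the tree is not modified while iterating.
    for field_id, para in enumerate(list(body.iter(w("p"))), start=1):
        field = ET.Element(w("fldSimple"), {w("instr"): f" {FIELD_NAME} {field_id} "})
        # Paragraph properties (w:pPr), when present, must stay the first child.
        has_ppr = len(para) > 0 and para[0].tag == w("pPr")
        para.insert(1 if has_ppr else 0, field)
        field_ids.append(field_id)
    return field_ids


body = ET.fromstring(
    f'<w:body xmlns:w="{W}">'
    "<w:p><w:pPr/><w:r><w:t>First</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second</w:t></w:r></w:p>"
    "</w:body>"
)
ids = insert_field_codes(body)
print(ET.tostring(body, encoding="unicode"))
```

The returned Field Ids are the values you would then use as the id attributes of the f elements when you construct the custom XML part.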
In the following video, I am going to walk through several
cases and demonstrate how the IdentityManager re-assigns identity to
each content item in the document. In the process of assigning
identity to each content item, it also associates the correct
metadata with each content item.
In the video following that, I am going to walk through the IdentityManager code, explaining exactly how the code works.
Download – Example Code