Keeping Meta-Data with Content Items in Open XML WordprocessingML

Using Open XML WordprocessingML, we can associate metadata with content items such as paragraphs,
tables, rows, cells, numbered and bulleted lists, and list items.

This approach was inspired by some conversations with Ken Stearn and Jan Urbanski of Spectra Global Solutions.

The end user can edit a document that contains metadata, and after editing, our program can retrieve the metadata associated with every content item.  If the user has modified a paragraph, no matter how much, we can retrieve the metadata for that paragraph.  If the user has cut and pasted (or moved) content, we can retrieve the correct metadata from the paragraph at its new location.  If the user has copied and pasted content, we can retrieve the correct metadata from the original paragraph or content, and we can tell that the pasted copy was inserted and therefore has no metadata associated with it.

This approach, made possible by processing documents using Open XML, is not immediately obvious.  However, once you understand the approach, which I will explain in depth in this blog post and associated videos, it is not very difficult.  It is made easier by a new class, IdentityManager, that I am introducing into PowerTools for Open XML.

So, if associating metadata with content in Open XML WordprocessingML is interesting to you, grab a cup of coffee, settle in, and
we’ll explore exactly how we can use WordprocessingML fields, revision tracking markup, and document editing protection to achieve correct round-tripping of metadata in a WordprocessingML document.  First, watch the following video:

[View:http://www.youtube.com/watch?v=fteKdmozA1U]

The Scenario

One of the most interesting scenarios around Open XML is where application developers enable their end users to use Word 2007, 2010, or
2013 to maintain complex structured information.  The process is typically as follows:

There are a lot of reasons why system architects design a software system along these lines.

In order to process this data after the user edits it, we must
associate metadata with content items.  We need this metadata so that we can
properly update the data in the back-end system.

We could associate metadata with content controls.  However,
editing a document with content controls can be cumbersome.  It takes a fair
amount of time to train an end user to properly use the Developer tab to insert
a content control, and even after training, the process can be error-prone.
Content controls shine in certain scenarios, but in others they get in the
user’s way.

Overview of the Approach

In short, we are going to take the following approach:

All of this, although not complicated, requires a bit of
explanation.  First, let’s discuss what I mean by ‘Content Items’.

Content Items

Content Items are those artifacts in a
word-processing document for which we need to keep metadata.  There are six
varieties of Content Items:

  1. Paragraphs
  2. Tables
  3. Rows
  4. Cells
  5. Lists
  6. List Items

Paragraphs, tables, rows, and cells do not require much
explanation.  We need to maintain integrity of metadata when the user moves a
paragraph, cuts and pastes a paragraph, copies and pastes a paragraph, deletes
a paragraph, or inserts a paragraph.  The same applies to tables – if the user
deletes and inserts a table, we need to properly associate the table’s metadata
with the table at its new location.  This also applies to rows.  If the user
moves, copies and pastes, inserts, or deletes rows, we want to know it, and to
maintain integrity of metadata.  The same also applies to cells.

Lists require a bit more explanation.  There is no markup
artifact for a list as such in WordprocessingML.
We only have the markup for the individual items in a list.  In effect, a list
is simply the set of its list items.  However, we can easily use the ListItemRetriever
class of PowerTools for Open XML to identify all of the list items in a
list.  We can then place the metadata for a list on the first list item of the
list.  I’ve simplified the matter a little bit here, but this is the gist of
the idea.  I’ll be elaborating on this in greater detail in subsequent posts
and videos.
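
To make the idea of "the first list item of a list" concrete, here is a minimal sketch in Python using only the standard library. It is not the ListItemRetriever class, which handles far more. It assumes that a run of consecutive paragraphs sharing the same w:numId forms one list, and it ignores numbering that comes from paragraph styles. The helper names (para, num_id_of, group_lists) and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def para(text, num_id=None):
    """Build a tiny w:p element as a string; num_id marks it as a list item."""
    num = ""
    if num_id is not None:
        num = (
            '<w:pPr><w:numPr><w:ilvl w:val="0"/>'
            f'<w:numId w:val="{num_id}"/></w:numPr></w:pPr>'
        )
    return f"<w:p>{num}<w:r><w:t>{text}</w:t></w:r></w:p>"


# Hypothetical document body: one plain paragraph, a two-item list, one more paragraph.
DOC = (
    f'<w:body xmlns:w="{W}">'
    + para("Intro") + para("One", 1) + para("Two", 1) + para("Outro")
    + "</w:body>"
)


def num_id_of(p):
    """Return the w:numId value set directly on the paragraph, or None."""
    el = p.find("w:pPr/w:numPr/w:numId", NS)
    return el.get(f"{{{W}}}val") if el is not None else None


def group_lists(body):
    """Group consecutive body-level paragraphs that share a w:numId (simplified)."""
    lists = []
    previous = None
    for p in body.findall("w:p", NS):
        current = num_id_of(p)
        if current is not None:
            if current == previous:
                lists[-1].append(p)   # same list continues
            else:
                lists.append([p])     # a new list starts here
        previous = current
    return lists


lists = group_lists(ET.fromstring(DOC))
first_item = lists[0][0]  # the paragraph that would carry the LIST metadata
```

In this sketch, the first paragraph of each group is the one that would carry the metadata for the list as well as its own list item metadata.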

One additional point to make – as I’ve noted, we’ll
associate the metadata for a list with the first list item in a list.  In a
similar fashion, we’ll associate the metadata for a table with the first
paragraph in the first cell in the table.  We’ll associate the metadata for a
row with the first paragraph in the first cell in the row.  We’ll associate the
metadata for a cell with the first paragraph in a cell.
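
The anchoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the IdentityManager code: the function names (anchors, text_of) and the sample table are hypothetical, and the sketch looks only at direct child paragraphs of each cell.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical 2 x 2 table; cell A1 holds two paragraphs.
DOC = f"""<w:body xmlns:w="{W}">
  <w:tbl>
    <w:tr>
      <w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p><w:p><w:r><w:t>A1 more</w:t></w:r></w:p></w:tc>
      <w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc>
    </w:tr>
    <w:tr>
      <w:tc><w:p><w:r><w:t>A2</w:t></w:r></w:p></w:tc>
      <w:tc><w:p><w:r><w:t>B2</w:t></w:r></w:p></w:tc>
    </w:tr>
  </w:tbl>
</w:body>"""


def text_of(paragraph):
    """Concatenate the text of all w:t elements in a paragraph."""
    return "".join(t.text or "" for t in paragraph.iter(f"{{{W}}}t"))


def anchors(table):
    """Return (content item type, anchor paragraph) pairs for one table."""
    result = []
    for r, row in enumerate(table.findall("w:tr", NS)):
        for c, cell in enumerate(row.findall("w:tc", NS)):
            first_para = cell.find("w:p", NS)  # first paragraph in the cell
            if first_para is None:
                continue
            if r == 0 and c == 0:
                result.append(("TABLE", first_para))  # table: first cell of first row
            if c == 0:
                result.append(("ROW", first_para))    # row: first cell of the row
            result.append(("CELL", first_para))       # cell: its own first paragraph
    return result


table = ET.fromstring(DOC).find("w:tbl", NS)
for kind, p in anchors(table):
    print(kind, text_of(p))
```

Note that the first paragraph of the first cell ends up anchoring three Content Items (the table, the first row, and the first cell) in addition to being a paragraph in its own right.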

A picture here will be helpful.  In the picture below, you
can see a number of Content Items:

Each paragraph contains a field with a Field Id, which is simply a pointer
into the custom XML part.  I’ve denoted the Field Ids by putting them in curly
braces, starting at 1, going through 9.  These Field Ids are not semantically
significant.  We don’t really care what the numbers are – only that they are
unique, and that they point to the correct metadata element in the custom XML.
Each Content Item contains a Content Item Id, which uniquely identifies it.
Each Content Item also contains a Content Item Type.  And finally, any metadata
is stored with the Content Item.

The custom XML part looks like this:

<?xml version="1.0" encoding="utf-8"?>
<identityManager>
  <f id="1">
    <i t="PARA" u="1">
      <metadataGoesHere anyAttribute="some data">
        <anyElement>some data</anyElement>
      </metadataGoesHere>
    </i>
  </f>
  <f id="2">
    <i t="PARA" u="2">
      <metadataGoesHere anyAttribute="some data">
        <anyElement>some data</anyElement>
      </metadataGoesHere>
    </i>
  </f>
  <f id="3">
    <i t="TABLE" u="3">
      <metadata/>
    </i>
    <i t="ROW" u="4">
      <metadata/>
    </i>
    <i t="CELL" u="5">
      <metadata/>
    </i>
    <i t="PARA" u="6">
      <metadata/>
    </i>
  </f>
  <f id="4">
    <i t="CELL" u="7">
      <metadata/>
    </i>
    <i t="PARA" u="8">
      <metadata/>
    </i>
  </f>
  <f id="5">
    <i t="ROW" u="9">
      <metadata/>
    </i>
    <i t="CELL" u="10">
      <metadata/>
    </i>
    <i t="PARA" u="11">
      <metadata/>
    </i>
  </f>
  <f id="6">
    <i t="CELL" u="12">
      <metadata/>
    </i>
    <i t="PARA" u="13">
      <metadata/>
    </i>
  </f>
  <f id="7">
    <i t="PARA" u="14">
      <metadata/>
    </i>
    <i t="LIST" u="15">
      <metadata/>
    </i>
    <i t="ITEM" u="16">
      <metadata/>
    </i>
  </f>
  <f id="8">
    <i t="PARA" u="17">
      <metadata/>
    </i>
    <i t="ITEM" u="18">
      <metadata/>
    </i>
  </f>
  <f id="9">
    <i t="PARA" u="19">
      <metadata/>
    </i>
    <i t="ITEM" u="20">
      <metadata/>
    </i>
  </f>
</identityManager>
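
To show how a program might read this part after the user has edited the document, here is a minimal Python sketch that maps each Field Id to the Content Items it points to. It uses only the standard library and assumes the custom XML part has already been extracted from the package as a string. The function name load_content_items, the dictionary layout, and the shortened sample are illustrative assumptions, not part of PowerTools for Open XML.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the custom XML part shown above.
CUSTOM_XML = """<identityManager>
  <f id="1">
    <i t="PARA" u="1">
      <metadataGoesHere anyAttribute="some data">
        <anyElement>some data</anyElement>
      </metadataGoesHere>
    </i>
  </f>
  <f id="3">
    <i t="TABLE" u="3"><metadata/></i>
    <i t="ROW" u="4"><metadata/></i>
    <i t="CELL" u="5"><metadata/></i>
    <i t="PARA" u="6"><metadata/></i>
  </f>
</identityManager>"""


def load_content_items(xml_text):
    """Map each Field Id to the list of Content Items stored under it."""
    root = ET.fromstring(xml_text)
    items_by_field = {}
    for field in root.findall("f"):
        items = []
        for item in field.findall("i"):
            items.append({
                "type": item.get("t"),   # Content Item Type, e.g. PARA or TABLE
                "id": item.get("u"),     # Content Item Id
                "metadata": list(item),  # child elements holding the metadata
            })
        items_by_field[field.get("id")] = items
    return items_by_field


items = load_content_items(CUSTOM_XML)
for entry in items["3"]:
    print(entry["type"], entry["id"])
```

Given the Field Id found in a paragraph, a lookup in this dictionary returns every Content Item anchored on that paragraph, together with its metadata elements.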

How the Technique Works

Now that we’ve identified what a Content Item is, we
can discuss how the technique works.  The procedure would be:

In the following video, I am going to walk through several
cases and demonstrate how the IdentityManager re-assigns identity to
each content item in the document.  In the process of assigning
identity to each content item, it also associates the correct
metadata with each content item.

In the video following that, I am going to walk through the IdentityManager code, explaining exactly how the code works.

Download – Example Code