Merge Comments from Multiple Open XML WordprocessingML Documents into a Single Document

Several years ago, I was at a Microsoft conference, and a couple of attendees asked me how to merge comments from multiple Open XML WordprocessingML documents into a single document. Those developers had attempted to write code to do this, and were not successful. I took this as a challenge, and a couple of months later, I released some sample code to merge comments. In that original version of the sample code, I attempted to determine where breaks were needed in runs, and then split the runs as appropriate. This approach turned out to be problematic, and far more complicated than it needed to be. It was hard to debug, and hard to get all of the edge cases right.

Fast forward three years: I recently wrote some code to do search-and-replace of text in a WordprocessingML document. In that code, after identifying a paragraph that contains the search string, the algorithm splits all runs in the paragraph into runs of a single character each. It then becomes a trivial exercise to find a sequence of single-character runs that matches the sequence of characters in the search string. This approach also plays nicely with those ‘characters’ that are represented by special markup, such as tab characters, break characters, soft hyphen characters, and the like (see the child elements of the w:r element for a complete list). The transform makes sure that each of those ‘characters’ is placed in its own run as well. That algorithm was easy to code and debug. It does create a fair number of extra short-lived objects on the heap; however, the .NET heap is optimized for this scenario. I did some experiments where I watched the working set (the amount of memory allocated) while I processed a number of documents using this approach, and memory usage stabilized at a reasonably small amount (at least with the test documents I had, a few of which were artificially large).
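To make the run-splitting idea concrete, here is a minimal sketch in Python using only the standard library. It is not the PowerTools for Open XML code. The function names (split_runs, run_char, find_text) are hypothetical, and the sketch assumes a simple paragraph whose w:r elements are direct children of the w:p and whose matching runs are not interrupted by other markup.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def q(tag):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


def split_runs(paragraph):
    """Return a new w:p in which every run holds one character or one special element."""
    new_p = ET.Element(paragraph.tag, paragraph.attrib)
    for child in paragraph:
        if child.tag != q("r"):
            # Anything that is not a run (w:pPr, comment markers, ...) is copied as-is.
            new_p.append(copy.deepcopy(child))
            continue
        r_pr = child.find(q("rPr"))
        for item in child:
            if item.tag == q("rPr"):
                continue
            if item.tag == q("t"):
                # Text is split into one w:t per character.
                units = []
                for ch in item.text or "":
                    t = ET.Element(q("t"))
                    t.text = ch
                    t.set(XML_SPACE, "preserve")
                    units.append(t)
            else:
                # Special markup (w:tab, w:br, ...) counts as a single 'character'.
                units = [copy.deepcopy(item)]
            for unit in units:
                run = ET.SubElement(new_p, q("r"))
                if r_pr is not None:
                    run.append(copy.deepcopy(r_pr))  # each new run keeps the formatting
                run.append(unit)
    return new_p


def run_char(run):
    """Return the single character of a split run, or None for special markup."""
    t = run.find(q("t"))
    return t.text if t is not None else None


def find_text(paragraph, search):
    """Return the child index where consecutive runs spell `search`, or -1."""
    chars = [run_char(c) if c.tag == q("r") else None for c in paragraph]
    n = len(search)
    for i in range(len(chars) - n + 1):
        if chars[i:i + n] == list(search):
            return i
    return -1
```

After the split, a search string that originally straddled two runs is found by a plain sequence comparison, and a tab or break between two characters correctly prevents a match. A real implementation would also coalesce adjacent runs with identical formatting once processing is finished.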

I decided that I liked this approach to doing intricate manipulations of text content in WordprocessingML documents, so I used it to merge comments from two WordprocessingML documents that are identical except for their comments. Once the algorithm identifies a paragraph that contains comments that need to be merged, it breaks up the runs of that paragraph (in both source documents) into runs of a single character each. It can then ‘walk’ down the list of runs of both source paragraphs, identifying common characters, identifying comment markup, and creating a new paragraph that contains the comments from both source documents (unless the comments happen to be identical, in which case only a single copy of the comment is placed in the merged document). The algorithm was far easier to code and debug than the first version of comment merging that I wrote, so I am even more convinced that this approach of splitting runs up, processing them, and then merging them again is a good one.
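The walk over the two split paragraphs can be sketched roughly as follows. This is a simplified model in Python, not the actual PowerTools code: each paragraph is represented as a list of tuples rather than XML, where ("char", c) stands for a single-character run and ("start", key) or ("end", key) stands for comment range markup. The tuple layout, the comment key (author plus text), and the names chars, take_markers, and merge_paragraphs are all assumptions made for the illustration.

```python
def chars(text):
    """Build single-character tokens for a piece of text."""
    return [("char", c) for c in text]


def take_markers(tokens, pos):
    """Collect consecutive comment markers starting at pos; return them and the new position."""
    markers = []
    while pos < len(tokens) and tokens[pos][0] != "char":
        markers.append(tokens[pos])
        pos += 1
    return markers, pos


def merge_paragraphs(a, b):
    """Merge comment markers from two token lists whose text content is identical."""
    merged, i, j = [], 0, 0
    while True:
        markers_a, i = take_markers(a, i)
        markers_b, j = take_markers(b, j)
        merged.extend(markers_a)
        # A marker for an identical comment at the same position is emitted only once.
        merged.extend(m for m in markers_b if m not in markers_a)
        if i == len(a) and j == len(b):
            return merged
        if i == len(a) or j == len(b) or a[i] != b[j]:
            raise ValueError("paragraph text differs between the two documents")
        merged.append(a[i])  # the common character
        i += 1
        j += 1


# Usage: two reviewers commented on different parts of the same text "Hi".
ann = ("Ann", "Greeting is too casual")
bob = ("Bob", "Check the spelling")
doc_a = [("start", ann)] + chars("Hi") + [("end", ann)]
doc_b = chars("H") + [("start", bob)] + chars("i") + [("end", bob)]
merged = merge_paragraphs(doc_a, doc_b)
```

Because every character sits in its own token, comment boundaries can only fall between tokens, so no run ever has to be split in the middle of the walk. In a real document the merged comments would also need new, non-conflicting ids and matching entries in the comments part, which this sketch leaves out.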

The following screen-cast demonstrates the PowerTools for Open XML cmdlet that merges comments from multiple documents into a single document.  It also discusses the code that merges comments.

One caveat about the code: it does not handle comments that are inserted into a math formula. The algorithm for merging comments in a math formula would be very similar to the one for merging comments in paragraphs and text. However, I simply ran out of time, so I deferred this functionality until later.