Generating Open XML WordprocessingML Documents using XPath Expressions in Content Controls
Over the last few days, I have completed a new prototype of an approach to Open XML WordprocessingML document generation. In this approach, I control the document generation process by placing XPath expressions in content controls. In contrast, the previous approach in this series of posts on document generation was controlled by writing C# code in content controls.
This post is the 13th in a series of blog posts on generating Open XML documents. Here is the complete list: Generating Open XML WordprocessingML Documents Blog Post Series
When I started down this path of discovery around document generation, I would not have predicted it, but the XPath-in-Content-Controls approach is, in my opinion, much superior to the C#-in-Content-Controls approach. Going forward, I am going to abandon the C#-in-Content-Controls approach, and focus on this approach using XPath. There are some very cool places that we can take this approach.
To compare and contrast, the C#-in-Content-Controls prototype consists of fewer than 400 lines of code. While it was not fully fleshed out, and many necessary refinements remain, I would expect that a finished version would be perhaps 3000 lines of code.
The XPath-in-Content-Controls prototype that I am introducing in this post is even smaller: fewer than 240 lines of code. It is simpler, more robust, and more amenable to polishing. I expect that the finished example, including integration into a document-level add-in for Word 2010, will be fewer than 1000 lines of code. I’ll be posting V1 of the prototype with the next post in this series.
Driven from an XML Document
One of the nice things about the C#-in-Content-Controls approach is that you could drive the document generation process from literally any data you could get your hands on from the .NET framework. In contrast, with this approach, there is one and only one form of data source, which is an XML document. And in this first prototype, I am restricting the data to an XML document that contains XML in no namespace. Allowing for namespaces in the XML means that I would need to provide mapping between namespaces and namespace prefixes, and that would get in the way of discussing the architecture and merits of this approach. I’ll deal with this in the future.
In the meantime, if you have XML that uses namespaces (or any other variety of data sources), your first task is to transform that data source to XML in no namespace.
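As a rough illustration of that preprocessing step, here is a minimal sketch in Python's standard library (not the C# of the prototype). The namespace URI, the `c` prefix, and the `strip_namespaces` helper are all made up for the example. It first shows why namespaces get in the way (the XPath API needs a prefix-to-namespace mapping before prefixed expressions resolve) and then rewrites the element names so that plain expressions work.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced input; the URI and prefix are invented for this sketch.
NAMESPACED = """\
<c:Customers xmlns:c="http://example.com/customers">
  <c:Customer>
    <c:CustomerID>1</c:CustomerID>
    <c:Name>Andrew</c:Name>
  </c:Customer>
</c:Customers>"""


def strip_namespaces(root):
    """Rewrite every element name in place so that it is in no namespace."""
    for elem in root.iter():
        # ElementTree stores qualified names as "{namespace-uri}local-name".
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root


root = ET.fromstring(NAMESPACED)

# With namespaces, every expression needs a prefix plus a prefix mapping.
prefix_map = {"c": "http://example.com/customers"}
with_mapping = root.findall("./c:Customer/c:Name", prefix_map)

# The plain expression finds nothing while the names are still namespaced.
without_mapping = root.findall("./Customer/Name")

# After stripping, the plain expression works.
strip_namespaces(root)
after_stripping = root.findall("./Customer/Name")
plain_xml = ET.tostring(root, encoding="unicode")
print(plain_xml)
```

This sketch only renames elements. Namespaced attributes, if the data has any, would need the same treatment.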
The XML document should look something like this:
<Customers>
  <Customer>
    <CustomerID>1</CustomerID>
    <Name>Andrew</Name>
    <HighValueCustomer>True</HighValueCustomer>
    <Orders>
      <Order>
        <ProductDescription>Bike</ProductDescription>
        <Quantity>2</Quantity>
        <OrderDate>5/1/2002</OrderDate>
      </Order>
      <Order>
        <ProductDescription>Sleigh</ProductDescription>
        <Quantity>2</Quantity>
        <OrderDate>11/1/2000</OrderDate>
      </Order>
      <Order>
        <ProductDescription>Plane</ProductDescription>
        <Quantity>2</Quantity>
        <OrderDate>2/19/2000</OrderDate>
      </Order>
    </Orders>
  </Customer>
  <Customer>
    <CustomerID>2</CustomerID>
    <Name>Bob</Name>
    <HighValueCustomer>False</HighValueCustomer>
    <Orders>
      <Order>
        <ProductDescription>Boat</ProductDescription>
        <Quantity>2</Quantity>
        <OrderDate>8/9/2000</OrderDate>
      </Order>
      <Order>
        <ProductDescription>Boat</ProductDescription>
        <Quantity>4</Quantity>
        <OrderDate>3/25/2001</OrderDate>
      </Order>
      <Order>
        <ProductDescription>Bike</ProductDescription>
        <Quantity>1</Quantity>
        <OrderDate>6/5/2002</OrderDate>
      </Order>
    </Orders>
  </Customer>
  <Customer>
    <CustomerID>3</CustomerID>
    <Name>Celcin</Name>
    <HighValueCustomer>False</HighValueCustomer>
    <Orders>
      <Order>
        <ProductDescription>Bike</ProductDescription>
        <Quantity>2</Quantity>
        <OrderDate>2/24/2001</OrderDate>
      </Order>
      <Order>
        <ProductDescription>Boat</ProductDescription>
        <Quantity>4</Quantity>
        <OrderDate>5/6/2001</OrderDate>
      </Order>
    </Orders>
  </Customer>
</Customers>
While it isn’t required, it is more convenient to use a form where the Orders element is a child of the Customer element. The reason for this will become clear.
The XPath-in-Content-Controls Template Document
The next step in introducing this approach is to take a look at the template document that will drive document generation. While looking at this template, you can compare and contrast it to the template that contains C# code in content controls.
In this template document, I am going to borrow some nomenclature from XSLT. One of the attributes of the xsl:apply-templates element is the select attribute. If you place an XPath expression in the optional select attribute, XSLT will apply templates to the set of nodes that are selected by the XPath expression. The XPath expression is applied relative to the current context of the node that is currently being transformed by the sequence constructor. I am going to use a very similar approach in the template document. In effect, I am going to turn an Open XML WordprocessingML document into something that is analogous to an XSLT style sheet. Don’t worry if this is not immediately clear. It will be before the end of this blog post series. The point of this paragraph is that I’m going to use the term Select to indicate an XPath expression that will be evaluated, and the results of the evaluation will become the current context for other operations.
As usual, I am going to show content controls in design mode. Here is the template document, in its entirety. Of course, the circles and arrows are added by me to aid in explanation.
The Config Content Control (*1)
Starting at the bottom of the document, there is the Config content control, which contains XML, with a root element of Config.
The DataFileName element specifies the source XML document that contains the data that drives the document generation process.
The SelectDocuments element specifies an XPath expression that, when evaluated against the root element of the document, returns a collection of elements, each of which represents a document to be generated. In the case of the XML data file that I presented earlier, the XPath expression “./Customer” returns a collection of the Customer child elements of the root Customers element. Given that source data file, the document generation process will generate three documents.
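To make the SelectDocuments step concrete, here is a minimal sketch that evaluates “./Customer” against the root element. It uses Python's standard library rather than the C# of the prototype, and the data is trimmed to the fields needed here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample data (orders omitted for brevity).
DATA = """\
<Customers>
  <Customer><CustomerID>1</CustomerID><Name>Andrew</Name></Customer>
  <Customer><CustomerID>2</CustomerID><Name>Bob</Name></Customer>
  <Customer><CustomerID>3</CustomerID><Name>Celcin</Name></Customer>
</Customers>"""

root = ET.fromstring(DATA)

# The SelectDocuments expression from the Config content control.
select_documents = "./Customer"

# Each selected element stands for one document to generate.
documents = root.findall(select_documents)
names = [doc.findtext("./Name") for doc in documents]

print(len(documents), "documents to generate for:", ", ".join(names))
```

Each element in `documents` then serves as the context for the expressions in the rest of the template.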
The DocumentGenerationInfo element and its child elements contain the information needed to control the actual physical generation of the documents: the directory where the documents will be placed, and a .NET StringFormat that works in conjunction with the SelectDocumentName XPath expression to assemble the generated FileName.
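The way a format string and the SelectDocumentName expression might combine can be sketched as follows. This is Python, not the prototype's C#, and the directory name, the format string “Letter-{0}.docx”, and the expression “./CustomerID” are all assumptions made for illustration; the real values live in the Config content control. Python's `str.format` happens to accept the simple `{0}` placeholder that .NET uses, but .NET format specifiers beyond that would not carry over.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

DATA = """\
<Customers>
  <Customer><CustomerID>1</CustomerID><Name>Andrew</Name></Customer>
  <Customer><CustomerID>2</CustomerID><Name>Bob</Name></Customer>
</Customers>"""

# Hypothetical settings standing in for the children of DocumentGenerationInfo.
directory = "GeneratedDocuments"
string_format = "Letter-{0}.docx"
select_document_name = "./CustomerID"


def generated_path(document_element):
    """Build the output path for one document element."""
    # Evaluate SelectDocumentName in the context of the document element.
    name_part = document_element.findtext(select_document_name)
    if name_part is None:
        raise LookupError(
            f"SelectDocumentName {select_document_name!r} selected nothing")
    # Substitute the selected value into the format string.
    file_name = string_format.format(name_part)
    return str(PurePosixPath(directory) / file_name)


root = ET.fromstring(DATA)
paths = [generated_path(customer) for customer in root.findall("./Customer")]
print(paths)
```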
As an aside, I initially played around with nested content controls instead of having a single content control that contains XML. While this approach works, maintaining nested content controls using the Word 2007 or Word 2010 user interface is idiosyncratic. I could write a pretty detailed bug report around the maintainability of nested content controls. Maintaining the XML in a single content control is a more satisfactory approach.
The SelectValue Content Control (*2)
At the top of the template document, you can see the SelectValue content controls. As mentioned in the last section, the SelectDocuments XPath expression selects multiple Customer elements. While generating each document in turn, each Customer element becomes the current context. The SelectValue XPath expression is then evaluated in the context of each Customer element in turn. One of the circled SelectValue XPath expressions selects the Name child element of the Customer element. The other circled SelectValue XPath expression selects the CustomerID child element of the Customer element. In XML, the value of an element is defined to be the concatenated descendant text nodes (in other words, its textual content). The document generation engine retrieves the value of the selected element and replaces the content control with the value.
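Here is a minimal sketch of the SelectValue evaluation in Python's standard library (the prototype itself is C#). The helper names `element_value` and `select_value` are made up for the example. It also shows what “concatenated descendant text nodes” means when the selected element has child elements.

```python
import xml.etree.ElementTree as ET

DATA = """\
<Customers>
  <Customer>
    <CustomerID>1</CustomerID>
    <Name>Andrew</Name>
    <Orders>
      <Order><ProductDescription>Bike</ProductDescription><Quantity>2</Quantity></Order>
    </Orders>
  </Customer>
</Customers>"""


def element_value(elem):
    """The XML value of an element: its descendant text nodes, concatenated."""
    return "".join(elem.itertext())


def select_value(context, xpath):
    """Evaluate a SelectValue expression relative to the context element."""
    selected = context.find(xpath)
    if selected is None:
        raise LookupError(f"SelectValue {xpath!r} selected nothing")
    return element_value(selected)


# The current context is one Customer element chosen by SelectDocuments.
customer = ET.fromstring(DATA).find("./Customer")

name = select_value(customer, "./Name")
customer_id = select_value(customer, "./CustomerID")
# Selecting an element with children concatenates all of their text.
order_value = select_value(customer, "./Orders/Order")

print(name, customer_id, order_value)
```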
The Table Content Control (*3)
Just as the SelectValue content control is evaluated in the context of a Customer element, the SelectRows content control is also evaluated in the context of a Customer element. The difference is that SelectValue is expected to select a single element, whereas the SelectRows expression is expected to select a collection of elements, one for each row in the table. For customer #1 (Andrew), the SelectRows XPath expression selects three Order elements. The XPath expressions (pointed to by *4) stored in the prototype row (the second row in the table) are evaluated in the context of each row selected by the SelectRows expression.
You also often see a similar pattern in properly written XSLT style sheets. One template is evaluated in the context of the root element, which selects a set of elements. An xsl:apply-templates causes an XPath expression to be evaluated in the context of each element selected by the first template. And an xsl:apply-templates in the sequence constructor of the second template causes an XPath expression to be evaluated in the context of each element selected by the second template, thereby causing a third set of templates to be applied.
Once you are familiar with this approach (sometimes called the ‘push’ approach), you never write XSLT style sheets in any other way. Inexperienced XSLT developers sometimes try to write style sheets by using loops and calling templates explicitly, instead of letting the pattern-matching power of XSLT do the heavy lifting. This incorrect approach is sometimes called the ‘pull’ approach.
To summarize, the SelectDocuments expression selects multiple elements, one for each document. The SelectRows expression, evaluated in the context of the elements selected by SelectDocuments, selects multiple elements, one for each row. The XPath expressions in the prototype row are evaluated in the context of the row elements selected by SelectRows.
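The three levels of context can be sketched as two nested loops. This is Python standard library code, not the prototype's C#. The SelectRows expression “./Orders/Order” and the three cell expressions are my guesses at what the template contains, since the template itself is only shown as an image.

```python
import xml.etree.ElementTree as ET

DATA = """\
<Customers>
  <Customer>
    <Name>Andrew</Name>
    <Orders>
      <Order><ProductDescription>Bike</ProductDescription><Quantity>2</Quantity><OrderDate>5/1/2002</OrderDate></Order>
      <Order><ProductDescription>Sleigh</ProductDescription><Quantity>2</Quantity><OrderDate>11/1/2000</OrderDate></Order>
      <Order><ProductDescription>Plane</ProductDescription><Quantity>2</Quantity><OrderDate>2/19/2000</OrderDate></Order>
    </Orders>
  </Customer>
  <Customer>
    <Name>Celcin</Name>
    <Orders>
      <Order><ProductDescription>Bike</ProductDescription><Quantity>2</Quantity><OrderDate>2/24/2001</OrderDate></Order>
      <Order><ProductDescription>Boat</ProductDescription><Quantity>4</Quantity><OrderDate>5/6/2001</OrderDate></Order>
    </Orders>
  </Customer>
</Customers>"""

select_documents = "./Customer"      # from the Config content control
select_rows = "./Orders/Order"       # assumed SelectRows expression
cell_expressions = [                 # assumed prototype-row expressions
    "./ProductDescription",
    "./Quantity",
    "./OrderDate",
]


def build_tables(root):
    """Return {customer name: table rows}, one table per generated document."""
    tables = {}
    # Level 1: SelectDocuments is evaluated against the root element.
    for document in root.findall(select_documents):
        rows = []
        # Level 2: SelectRows is evaluated in the context of each document.
        for row in document.findall(select_rows):
            # Level 3: each cell expression is evaluated in the row's context.
            rows.append([row.findtext(expr, default="")
                         for expr in cell_expressions])
        tables[document.findtext("./Name")] = rows
    return tables


tables = build_tables(ET.fromstring(DATA))
for customer_name, rows in tables.items():
    print(customer_name, rows)
```

Because Order is nested inside Customer, the relative expression “./Orders/Order” is all that is needed to find the right rows for each document.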
The Conditional Content Control (*5)
The conditional content control works in exactly the same way as SelectValue and SelectRows. The SelectTestValue expression is evaluated in the context of the Customer element. The retrieved value is compared to the contents of the Match content control. If there is a match, the Conditional content control is replaced by the contents of the Content content control in the generated document.
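A minimal sketch of the conditional logic follows, in Python rather than the prototype's C#. The expression “./HighValueCustomer”, the match value “True”, and the content text are assumed for illustration, and so is the behavior when there is no match: here the control is simply replaced by nothing.

```python
import xml.etree.ElementTree as ET

DATA = """\
<Customers>
  <Customer><Name>Andrew</Name><HighValueCustomer>True</HighValueCustomer></Customer>
  <Customer><Name>Bob</Name><HighValueCustomer>False</HighValueCustomer></Customer>
</Customers>"""


def apply_conditional(context, select_test_value, match, content):
    """Return the replacement for a Conditional content control."""
    # Evaluate SelectTestValue in the context of the current element.
    selected = context.find(select_test_value)
    value = "" if selected is None else "".join(selected.itertext())
    # Compare the retrieved value with the contents of the Match control.
    # Assumption: a non-matching conditional contributes no content.
    return content if value == match else ""


root = ET.fromstring(DATA)
results = {
    customer.findtext("./Name"): apply_conditional(
        customer,
        "./HighValueCustomer",                 # assumed SelectTestValue
        "True",                                # assumed Match
        "Thank you for being a valued customer.",  # assumed Content
    )
    for customer in root.findall("./Customer")
}
print(results)
```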
Advantages of the XPath-in-Content-Controls Approach
There are several advantages to the XPath-in-Content-Controls approach over the C#-in-Content-Controls approach:
- We eliminate the two-step process for generating documents. The program that processes the template (and processes all of the XPath expressions in the template) does the actual document generation. We don’t need to generate code, and then compile and run the generated code.
- We can catch errors in the XPath expressions, and supply the template designer with good error messages that indicate the specific XPath expression that contains the error.
- We eliminate all of the issues associated with typing C# code into content controls. When entering C# code in Word, of course there is no Intellisense. It could be difficult to catch errors in the C# code. The issues associated with Word replacing single or double quotes with smart quotes are significantly reduced. Note that the issues around quotes are not entirely eliminated. There are circumstances where the template designer may need to use single or double quotes in XPath expressions.
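The error-reporting advantage can be sketched like this, again in Python rather than the prototype's C#. The `TemplateError` class and `evaluate` helper are hypothetical names. The sketch relies on Python's ElementTree raising SyntaxError for at least some malformed paths (an absolute path on an element is one such case) and also treats an expression that selects nothing as an error worth reporting. Other XPath processors signal errors differently.

```python
import xml.etree.ElementTree as ET

DATA = """\
<Customers>
  <Customer><CustomerID>1</CustomerID><Name>Andrew</Name></Customer>
</Customers>"""


class TemplateError(Exception):
    """Names the content control and the XPath expression that failed."""


def evaluate(context, xpath, control_name):
    """Evaluate an expression, turning failures into designer-friendly errors."""
    try:
        selected = context.findall(xpath)
    except SyntaxError as exc:
        # The XPath processor rejected the expression itself.
        raise TemplateError(
            f"{control_name}: cannot evaluate XPath {xpath!r}: {exc}") from exc
    if not selected:
        # Well-formed, but it matches nothing (for example, a misspelled name).
        raise TemplateError(
            f"{control_name}: XPath {xpath!r} selected no elements")
    return selected


customer = ET.fromstring(DATA).find("./Customer")

messages = []
for control_name, xpath in [
    ("SelectValue", "./Name"),              # fine
    ("SelectValue", "./Nmae"),              # misspelled element name
    ("SelectRows", "/Customers/Customer"),  # absolute path on an element
]:
    try:
        evaluate(customer, xpath, control_name)
    except TemplateError as error:
        messages.append(str(error))

for message in messages:
    print(message)
```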
In the next post, I’ll show a video of this approach in action.
Future posts:
- Show this approach at scale
- Review XPath semantics of LINQ to XML
- Examine the issues around namespaces in the source XML document
- Show the process of changing the schema
- Add robustness and error handling
- Integrate as a document-level managed add-in for Word 2010.
This is fun!
Phil Nolan said,
April 4, 2011 @ 3:00 pm
Thanks Eric, this looks great. Have not had time to play as yet but look forward to downloading and giving it a go.
One quick question.
Does the conditional content control handle nested conditions? For example, say we wanted a paragraph that was targeted at high value customers whose name is Eric (HighValueCustomer=True & Name=Eric).
Keep up the great work,
Phil
Eric White said,
April 7, 2011 @ 3:48 pm
Hi Phil,
The code is written in a simple recursive fashion, so it should work. That said, I haven’t tried it. I plan on writing a comprehensive test suite in the near future and will include that case.
-Eric
Jason Harrop said,
May 4, 2011 @ 12:46 pm
Hi Eric
Nice to read about your approach, and to contrast it with the content control/XPath-based approach I have found powerful.
While we both use XPaths and content controls, I have chosen to rely on content control data binding where possible.
So for example, rather than having a SelectValue content control, I just have a normal w:dataBinding XPath.
We both have a notion of conditional, and where you have a Table Content Control, I have a “repeat”, which allows a table row, a block level content control, or indeed an inline content control, to be repeated.
I have a pre-processing step which handles these special content controls, replicating them in the case of a repeat, or removing conditionals which have evaluated to false. This can be done inside Word (even via a macro, so everything travels inside the docx) or outside of it, but having done that, the beauty of the approach is that custom xml databinding does the rest. (I like it that you can allow the user to edit the document, and have their changes reflected in your custom xml part).
Another thing we have in common is a need for better support for content controls in Word for the Mac. Perhaps you can influence some folk about that?
If you are interested in reading more about my approach, please see . It supports both interactive (interview style) and non-interactive (data driven) generation. For an example of interactive generation, see
cheers .. Jason
Jesse said,
October 11, 2011 @ 4:54 pm
I must’ve missed the obvious, but I’m using Word 2007 and it took me awhile to figure out HOW to nest content controls.
I found that the parent control had to be a Rich Text control, as stated here:
http://blogs.msdn.com/b/ericwhite/archive/2010/03/02/using-nested-content-controls-for-data-and-content-extraction.aspx
However, this didn’t explain the whole picture to me (or perhaps I’m still doing it incorrectly?). For example, I have to have a sentence first, or a regular Text content control, and then I can select that text or Text content control, and then and only then can I add additional content controls.
If I have a blank document, and then add a Rich Text content control, then all of the content control options are greyed-out when I’m inside the content control.
This only works after having some pre-existing text/controls/whatever, highlighting these items, and then clicking Rich Text content control to add a container around the entire selected stuff.
Is this the right way to do this, or am I missing something?
Eric White said,
October 11, 2011 @ 5:10 pm
Hi Jesse, I agree, the user interface for nested content controls has issues. I’ll add this to my list of possible screen-casts. Currently, I am avoiding nesting them, for that very reason.
-Eric
Jesse said,
October 11, 2011 @ 6:40 pm
Ah, well then!
I’ve really enjoyed the posts. As far as explaining OpenXML solutions (vs. just being presented with the problem, some code, and the result), your blog definitely rocks the hardest.
Thanks!
Eric White said,
October 12, 2011 @ 10:48 am
Thanks! 🙂 I’m happy that it is helpful.
inforium said,
November 4, 2011 @ 2:28 am
Thanks Eric for your posts!
Your sample uses simple content controls. My template includes other content controls (e.g., checkboxes). How do I handle my case?
Thanks!
Eric White said,
November 4, 2011 @ 2:48 am
Well, the question is, what do you want to do with these other content controls? If you want to link them to a field in the XML data, then you may have to do something like define a tag value that indicates the type of replacement content control, and then modify the transform so that it replaces the XPath content control with the checkbox content control. This currently is not a feature of the example, though.
inforium said,
January 6, 2012 @ 9:22 am
Thanks Eric!
In some cases, I need to add an option button, but I cannot find an option button (radio button) content control in the Developer tab of Microsoft Word. Can you show me how, please?
Tor said,
August 8, 2013 @ 4:50 pm
inforium,
It may be a little late to reply to this question, but perhaps someone else can use the information.
I currently use this code inside my Transform code block to generate different styles of checkboxes. I also give the XML elements short names ( i.e. True ) so that the column width in a Word table that contains checkboxes remains close to the size of the checkbox (i.e. Font-Size 12 using MS Gothic). The checkboxes are placed in one column next to the text in the second column, like this:
Col1 Col2
./A1. Meals
./B1 Educational Items
The Content Control’s Title and Tag = “CheckBox” or “RadioButton” etc. The xPath expression inside the content control becomes like this: ./A1
Here is the code I use:
static object Transform(XNode node, XElement document)
etc….
if ((tag.ToUpper() == "CHECKBOX") || (tag.ToUpper() == "TICKCHECKBOX") ||
    (tag.ToUpper() == "DIAMONDCHECKBOX") || (tag.ToUpper() == "TICKMARK") ||
    (tag.ToUpper() == "RADIOBUTTON") || (tag.ToUpper() == "CIRCLECHECK"))
{
    XElement run = element.Element(w + "sdtContent").Element(w + "r");
    string valueSelector = GetContentControlContents(element);
    string xPathVal = document.XPathSelectElement(valueSelector).Value.ToUpper();
    string newValue = string.Empty;
    //====================================================================================
    // Note:
    //====================================================================================
    // Converts a "True" or "False" in the XML data file to the selected checkbox style.
    // Uses Unicode symbols, e.g. U+2612 -> checked and U+2610 -> unchecked, written in C#
    // as the escape sequence "\u" followed by the four-digit hexadecimal code point.
    //
    // In the Word document template, create a Text content control, and
    // set the Tag to "CheckBox", "TickCheckbox", "DiamondCheckbox", etc.
    // Set the font for the content control to MS Gothic.
    //
    // See this url: http://msdn.microsoft.com/en-us/library/documentformat.openxml.office2010.word.sdtcontentcheckbox.aspx
    //
    //------------------------------------------------------------------------------------
    switch (tag.ToUpper())
    {
        case "CHECKBOX":
            newValue = ((xPathVal == "TRUE") ? "\u2612" : "\u2610");
            break;
        case "TICKCHECKBOX":
            newValue = ((xPathVal == "TRUE") ? "\u2611" : "\u2610");
            break;
        case "DIAMONDCHECKBOX":
            newValue = ((xPathVal == "TRUE") ? "\u25C6" : "\u25C7");
            break;
        case "TICKMARK":
            newValue = ((xPathVal == "TRUE") ? "\u2713" : "");
            break;
        case "RADIOBUTTON":
            newValue = ((xPathVal == "TRUE") ? "\u29BF" : "\u25CB");
            break;
        case "CIRCLECHECK":
            newValue = ((xPathVal == "TRUE") ? "\u25CF" : "\u25CB");
            break;
    }
    // Keep the run's existing formatting, but replace its text with the symbol.
    return (new XElement(w + "r",
        run.Elements().Where(e => e.Name != w + "t"),
        new XElement(w + "t", newValue)));
}
— etc.
Hope that someone can use this.
Thanks,
Elias Nissiotis said,
December 10, 2011 @ 7:44 pm
Hi Eric! Very nice post.
The problem in my case is that I have a dataset that is hierarchical, and its depth in some cases reaches 10 levels! I went through your sample and I saw that you do nest content controls and you use the table to iterate, but inside the table you do not use content controls with XPath, only the XPath alone. This prohibits the recursive call, as per my understanding. What should I do to nest multiple levels? Can you send me a sample with multiple levels or guide me to do it?
Thanks in advance,
Kristen said,
February 24, 2012 @ 7:50 pm
Hi Eric,
Thank you for your replies to past questions; I appreciate it so much.
I am finally really catching on, and it is fascinating to me how much more can be done with the new Open XML formats.
As someone else requested, please do provide videos, if possible, of how to actually set up a test run of this system (including of course what programs are and are not necessary).
I wish we could clone you a few times so this information could be produced that much faster. There should be more experts willing to share their knowledge like you do. Thank you again, so, so very much.
Kristen
Steve Kissh said,
June 19, 2012 @ 4:07 pm
You mention that the data xml is limited to no namespaces. I hoped you could help me understand why using name space prefixes in the Xpath expression wouldn’t work.
Eric White said,
June 19, 2012 @ 5:18 pm
Hi Steve,
The main reason is that when you instantiate an XPath processor, you need to pass in the mapping between namespaces and namespace prefixes. This is part of the XPath API. It could be possible that in the configuration information for the document, you could have some sort of configuration that specifies the mapping between prefixes and namespaces, and then prefixes in the various XPath expressions could then be mapped to those namespaces. I didn’t do that work. However, the example is pretty simple – it would not be hard to add this capability.
Joao Heleno said,
July 4, 2012 @ 9:02 am
Hi Eric.
I was trying out this approach and I made my own template. The problem is I’m now getting a NullReferenceException in line 89 of DocumentGenerator.cs. The run variable is null (XElement run = element.Element(w + "sdtContent").Element(w + "r");).
I’m getting the error when the current node is the following: check http://pastebin.com/34ARXCB2
If I use your template everything runs fine.
I appreciate your help.
Thanks,
Joao
Crocop said,
September 10, 2012 @ 5:26 am
Hi Eric! I was given a task where a program generates .docx files, after which the user changes the styles and formatting in that .docx file (for example, makes the table borders bold or deletes a column of a table). After all of this, the program must use that previously generated .docx file to generate new .docx files with the same formatting and styles.
Is this really possible with Open XML?
If you know an approach for doing this, could you give me any advice?
Thanks a lot.
Royce Lithgo said,
October 8, 2014 @ 1:34 am
Would it be possible to do something similar in Excel for report generation? I’d like to set up an Excel sheet with XPath expressions in cells, which would be dynamically replaced with values from an embedded XML document when the workbook is opened. I would first create the workbook and format it as required, then a program would copy the ‘template’ workbook and inject the XML document (after retrieving the data from the database and creating the XML). Then, when the user opens the workbook, some auto-load VBA would replace the XPath expressions with the data in the embedded XML, and the report would be ready.
I’m thinking of storing the XPath references in cell comments so they could be retained even after the XML data was inserted into the cells.