Generating Open XML WordprocessingML Documents

Generating word-processing documents is perhaps the single most compelling use of Open XML.  The archetypal case is an insurance company or bank that needs to generate tens of thousands of documents per month, archive them, and then make them available online, send them electronically, or print and mail them.  But there are about a million variations on this theme.  In this blog series, I am going to examine the various approaches to document generation, and present code that demonstrates each one.

This post is the first in a series of blog posts.  Here is the complete list: Generating Open XML WordprocessingML Documents Blog Post Series

I have some goals for the code that I’ll be publishing:

  • First and foremost, I want the document generation process to be data-driven from content controls that you configure in a template document. 
  • The approach that I want to take is that the template designer creates a document, inserts content controls with specific tags, and then inserts specific instructions into each content control.
  • The data that we will supply to the document generation process will be a data-centric XML document.  I’ll place a few constraints on this document.  Some time ago, I wrote about Document-Centric Transforms using LINQ to XML.  That post discusses data-centric vs. document-centric XML documents.  When generating documents from another data source, such as a SQL database or an internal or secure Web service, the task will be to generate a data-centric XML document from that source, and then kick off the document generation process.
  • This code should be short and sweet.  I don’t want to create some monolithic code base that would require a design process, formalized coding and testing procedures, and the like.  The question is: how simple and how powerful can such a system be made?  I’m hoping to stay under 1,000 lines of code.  We have some powerful tools at our disposal, most importantly LINQ to XML used in a functional style.  I will also probably code a few recursive functional transforms.
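
To make the "generate a data-centric XML document from that source" step concrete, here is a minimal sketch.  It is written in Python with only the standard library so that it is self-contained; it is not the LINQ to XML code of this series.  The `customers` table, its columns, and the `Customers`/`Customer` element names are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_data_xml(conn):
    """Project rows of a hypothetical 'customers' table into data-centric XML."""
    root = ET.Element("Customers")
    cursor = conn.execute("SELECT name, city FROM customers ORDER BY name")
    for name, city in cursor:
        # One element per record, one child element per field: no mixed content.
        customer = ET.SubElement(root, "Customer")
        ET.SubElement(customer, "Name").text = name
        ET.SubElement(customer, "City").text = city
    return root


# An in-memory database stands in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Contoso", "Seattle"), ("Adventure Works", "Portland")],
)

data_xml = rows_to_data_xml(conn)
print(ET.tostring(data_xml, encoding="unicode"))
```

The resulting tree is what would be handed to the document generation process, whatever form that process ends up taking.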

I am contemplating four approaches for the instructions that the template designer will place in the content controls.  The content controls could contain:

  • Parameterized XPath expressions: This approach might be the easiest for the template designer to configure.
  • XSLT sequence constructors: This approach might possibly be the easiest to code.  It might be very, very short if you exclude existing code such as transforming OPC back and forth to Flat OPC, OpenXmlCodeTester, and the axes I detailed in Mastering Text in Open XML WordprocessingML Documents.  I am contemplating using XSLT 2.0.
  • .NET code (either VB or C#): This approach reminds me of code that I presented in OpenXmlCodeTester: Validating Code in Open XML Documents.  It might be cool to put a LINQ expression in a content control that projects a collection of rows and columns that become a table in the word-processing document.  There could be some cool and easy ways to supply formatting.
  • Some XML dialect that I invent as I go along.
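
As a rough illustration of the first approach, the sketch below treats the text inside a content control as a path expression, evaluates it against a data-centric XML document, and replaces the content control with a plain run holding the result.  It is only a sketch under several assumptions: it is Python rather than the .NET code of this series, the `Value` tag and the `Order` data document are hypothetical, only run-level (inline) content controls are handled, no parameterization is attempted, and `xml.etree.ElementTree` supports just a subset of XPath.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
ET.register_namespace("w", W)

# Hypothetical template fragment: one content control tagged "Value"
# whose text is the path to evaluate against the data document.
TEMPLATE = f"""
<w:body xmlns:w="{W}">
  <w:p>
    <w:r><w:t xml:space="preserve">Dear </w:t></w:r>
    <w:sdt>
      <w:sdtPr><w:tag w:val="Value"/></w:sdtPr>
      <w:sdtContent><w:r><w:t>Customer/Name</w:t></w:r></w:sdtContent>
    </w:sdt>
  </w:p>
</w:body>
"""

# Hypothetical data-centric XML document.
DATA = "<Order><Customer><Name>Contoso</Name></Customer><Total>42.50</Total></Order>"


def fill_template(body, data):
    """Replace each 'Value' content control with a run containing the looked-up text."""
    # ElementTree has no parent pointers, so visit every element as a
    # potential parent and replace matching children in place.
    for parent in list(body.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != f"{{{W}}}sdt":
                continue
            tag = child.find("w:sdtPr/w:tag", NS)
            if tag is None or tag.get(f"{{{W}}}val") != "Value":
                continue
            # The instruction is the concatenated text of the content control.
            path = "".join(t.text or "" for t in child.iterfind(".//w:t", NS)).strip()
            run = ET.Element(f"{{{W}}}r")
            ET.SubElement(run, f"{{{W}}}t").text = data.findtext(path, default="")
            run.tail = child.tail
            parent[index] = run
    return body


filled = fill_template(ET.fromstring(TEMPLATE.strip()), ET.fromstring(DATA))
generated_text = "".join(t.text for t in filled.iter(f"{{{W}}}t"))
print(generated_text)
```

A real implementation would also need to open the package, handle block-level content controls and tables, and decide what to do when an expression matches nothing; this sketch simply substitutes an empty string in that case.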

I’m not sure which approach I’ll take.  I want to play around with all four approaches, and see which one is easiest to use, and which one is easiest to develop.  As I start playing around with these (and posting the code as I go along), I’ll make some design decisions, and list my reasons for the decisions.

By the way, I really love to have discussions about these things.  If you agree or disagree with any of my design decisions, feel free to chime in.  You can register so we can have more of a discussion, or post anonymously, as you like.

In the next post, I’m going to examine template documents, and define exactly what I mean by a template document.

16 Comments »

  1. Svetlin said,

    January 24, 2011 @ 4:45 pm

    Great news Eric! Can’t wait to compare notes, as I recently completed a project of this nature for a big insurance company :-).

  2. Eric White said,

    January 24, 2011 @ 5:14 pm

    Cool – I am really interested in comparing notes on this – see how you and others approach this problem.

  3. Patrick Durusau said,

    January 29, 2011 @ 1:21 pm

    The same use case comes up for ODF (Open Document Format, ISO 26300) so I hope you don’t mind my lurking for good ideas!

    +1! on XSLT 2.0

    Not sure “….short and sweet….” Short code not a problem. But my experience has been that “short code” = “undocumented semantics” for the code. Job security I suppose but really bad practice. Not what you meant but what I hear when that phrase is used about code.

    Looking forward to the series!

    Hope you think imitation is the highest praise because I will be adapting some of your ideas for ODF documents.

    Patrick Durusau
    ODF – ISO 26300 Editor

    PS: You might want to touch base with Ken Holman, Crane Softwrights, http://www.cranesoftwrights.com/, who has been in the lead for generating UBL documents in XML tool chains. Would be nice to see MS/OOXML provide direct support for UBL authoring/generation.

  4. Eric White said,

    January 29, 2011 @ 7:37 pm

    Hi Patrick,

    Regarding short code – my premise is that this can be a simple example program, not an engineered product. If it is a fairly simple example and the example is explained properly, the theory is that anyone can modify the example to fit specific scenarios, or reimplement with a different language but with similar ideas.

    Yes, I agree, +1 on XSLT 2.0. It feels like a much more complete language than 1.0.

    Yes, I’ll be flattered if you adapt ideas! 🙂 I’m not building a product or anything – I just want to contribute to the conversation around document generation. Question for you: I have much to learn about ODF – what is the correct approach to delineate / tag content in an ODF document?

    I’m not at all certain that this approach (embedding C# code inside a content control) is the way to go. There are definite downsides when compared to other approaches, such as putting XML inside content controls. It is just the first of several experiments around doc gen.

    UBL looks pretty interesting. That might make a great Open XML proof-of-concept. I will follow up, but it may be a month or two.

    Glad to have your comments/participation!

    -Eric

  5. spevilgenius said,

    February 4, 2011 @ 2:18 pm

    Wicked! I actually have just started looking into OpenXML because we have a need to create reports with SharePoint data. I tried to follow your other article on that, but just could not wrap my brain around some of the things that I need to do. It would be great if we could build some sort of a way for the users to build a template for what they want and then upload this template to a document library. Then there would either be a webpart or some other page that would allow them to create a report using this template. I would like to have templates for both Excel and Word and some sort of way to get the SharePoint data including charts/graphs. I think it would be neat if you could also gracefully paginate report rows or groups so that they would not get truncated, but that might be a bit much to ask!! Any more insights are appreciated!

  6. Eric White said,

    February 9, 2011 @ 12:36 pm

    Generating reports from SharePoint data is one of the main scenarios that I’m targeting. With this first version – C# in content controls – this should be very doable. Pagination is probably a bit down the road, but certainly I want to enhance this until it meets real-world needs.

  7. thwelly said,

    February 10, 2011 @ 7:56 am

    Any idea how to merge two docx files, including customXML?
    I tried “altChunk” to insert a docx, but it does not look like the right way. In the inserted document, the customXML data does not show up in Word.

    What I ultimately have to produce is a set of minutes, which includes different sections such as information, tasks, requests, decisions, …
    I plan to build these sections as separate docx templates with customXML, and then merge them together into the minutes, so the user can decide what to include in the report.

  8. Eric White said,

    February 11, 2011 @ 3:56 pm

    Hi, the only way to merge two documents that use content controls that are bound to XML in a custom XML part is to do it manually. Have you taken a look at DocumentBuilder? I haven’t used DocumentBuilder to merge documents that contain content controls that are bound to custom XML parts. It probably requires some modification, but probably would not be too hard. Also see How to Control Sections when using OpenXml.PowerTools.DocumentBuilder.

  9. Mike Brennan said,

    March 11, 2011 @ 9:41 pm

    Hi Eric,

    Do you have any goals for the usability of the documents that are generated?

    It seems that documents that retain content controls and that have mapping to a custom XML part would provide the most end usable generated product. Would you see much additional code or effort needed to produce this kind of generated document?

    – Mike

  10. Eric White said,

    March 11, 2011 @ 11:59 pm

    Having the documents retain content controls and generating a custom XML part is certainly doable. Not a lot of code would be involved. For instance, for the Value content control, this would be a change to the projection that starts at line 36 in ProcessTemplate.cs. With tables, it would involve creating a bound content control for every cell in the table. In general, I would approach this by discarding the configuration content controls, and creating new content controls that are bound to the data.

    This is not a lot of work. I’ll add it to the list.

    In this project, the next step I’m contemplating is an approach where the template document designer places XPath expressions in content controls (instead of C# code). In this iteration, I’ll consider your proposal. It is a good idea.

    -Eric

  11. Nicholas said,

    June 14, 2012 @ 12:08 am

    Hi, I have been recently investigating the possibility of allowing for template based document generation for a new website I am designing. I have a Business Systems Analyst background so not highly technical – hence why I am here.
    I have investigated ActiveDocs and Aspose and the possibility of SharePoint integration.
    The requirement is for the website interface to prompt users with a questionnaire (e.g. text, logos, booleans, date etc) and then have the answers merged into pre-defined templates via a content placeholder technique. The ability to add/remove paragraphs based on boolean question responses would also be desirable.
    I may be on the wrong forum for this but if anyone can suggest/recommend what current technology(s) would be best (i.e. the simpler the better) to achieve this, I would be very appreciative.
    Thanks

  12. Eric White said,

    December 5, 2012 @ 4:13 pm

    Hi Nicholas,

    I don’t have anything specific to recommend. My current set of solutions mainly are around rolling your own document generation solution using code. Outside of that, I don’t have any suggestions at this point.

    -Eric

  13. AJ said,

    December 5, 2012 @ 3:25 pm

    Hello,

    Using your article(s), I created an application that transforms the output of SSRS reports (more than one) into an assembled Word document. The test documents generated so far open perfectly in MS Word 2007 and 2010. Everything works great! Thank you so much.

    Now for the not-so-good part of this story. Some or most of these documents fail to open in MS Office 2002 or older versions (with the latest compatibility packs). But if the generated documents are opened and saved with MS Office 2007/2010 and then opened in MS Office 2002 or XP, they open without any problem. To achieve this (opening documents using MS Office 2010 and saving them), I have created a dropbox application that listens to a shared folder (dropbox) for new documents, opens them, and then saves them to a delivery location that the user can access. This application uses “Microsoft.Office.Interop.Word” to launch Word and open the document. The application works for documents that don’t have “altchunk”, and it fails for the documents that my document assembly application generates. Can you please shed some light on how this can be troubleshot or resolved? Any help and pointers are appreciated.

    regards,
    AJ

  14. Eric White said,

    December 5, 2012 @ 4:18 pm

    Hi AJ,

    I am wondering whether the content controls are left in the document after generation. I can’t recall whether they are or not. But as you can see, there is some feature in the generated documents that is not supported by earlier versions of Word that use the compat packs.

    The best way to figure out what is causing the incompatibility with earlier versions of Word is to use the Open XML package editor PowerTool and comment out large chunks of content in the main document part. Cut out nearly all of document.xml until the document opens in the old version of Word, and then incrementally add content back into the document until it fails to open. You then know which feature of Open XML is causing earlier versions to not open the documents.

    I wish I could research this, but I don’t have any working installations of old versions of Word.

    If you do this experiment, please let me know what you find out.

    Best, Eric

  15. Maria said,

    February 15, 2013 @ 4:53 pm

    Hi Eric, I am working on a DocumentBuilder implementation, but I have a problem. I have an SRM order in the portal, and when I generate the document with DocumentBuilder it only brings in the first item and not the rest. Do I need to add some parameterization to make the table dynamic?

    Many thanks

    Regards

  16. liuyi said,

    June 3, 2019 @ 2:49 am

    hi eric:
    I want to create a table of contents after the cover. The cover will occupy 3 pages, which means that I should create the table of contents on page 4. How can I do that?
    thanks
    liuyi
