New Version of DocumentBuilder
At long last, I have completed and posted a much-requested and sorely needed update to DocumentBuilder. DocumentBuilder is part of the Open-Xml-PowerTools project. It handles the issues of interrelated markup, enabling you to generate new documents from existing documents in a variety of ways. For example, you can assemble a new document from just the first few paragraphs of an existing document. You can also effectively delete a portion of a document by importing the content before the part you want deleted, and then importing the content after it. The resulting document contains merged styles, comments, fonts, bookmarks, endnotes and footnotes, and so on. If an image is not imported from a source document, it is not included in the generated document, which is appropriate. There are a number of ways to pick specific portions of content of WordprocessingML documents and assemble them into a new document.
To download the new version of DocumentBuilder, clone or fork the repo at Open-Xml-PowerTools.
The primary difference is that this version is MUCH more robust. It handles many, many cases that the original DocumentBuilder did not, including images in headings, smart art everywhere, images in smart art, external relationships, and on and on. It works properly if you import the same content more than once. I completely ripped the old version apart and reassembled it using a recursive approach that handles many cases of related parts in a much more generic way.
In the near future, I’ll be providing new documentation about DocumentBuilder here on OpenXMLDeveloper.org. I’ll post lots of information about the various use-cases, as well as a number of examples and sample documents that show the various ways that you can use DocumentBuilder to assemble documents.
Broken Backwards Compatibility
One key point to mention here is that I made a small adjustment to the DocumentBuilder programming interface that breaks existing programs. The new API is very similar to the old one, and it is slightly easier to use. However, ease of use is not the reason I broke backwards compatibility. I did it because it was absolutely necessary. Here is why:
In the programming interface to the old DocumentBuilder class, you created a List<Source> that held the sources for the document to be built. Each Source object contained an open WordprocessingDocument object, which is of course the class in the Open XML SDK that you use to access and manipulate word-processing documents. Therein lies the rub. There is no way, using the Open XML SDK, to clone an open WordprocessingDocument object. However, to enable the case where you want to import content from a single document more than once, it was highly desirable to be able to clone the document for each import. Further, WordprocessingDocument implements IDisposable, which really complicates the code. You must either use the using construct of C#, or explicitly dispose of those objects when done with them, which raises the possibility of bugs where documents are not disposed of properly.
Instead of using open WordprocessingDocument objects to specify sources, I decided that it was much more convenient to simply pass byte arrays around. It is super-easy to open a WordprocessingDocument from a byte array. It is also easy to clone, easy to serialize to disk, and easy to serialize to a SharePoint document library. To make it even easier to work with these byte arrays, I have defined a small class, WmlDocument, which encapsulates the small bits of functionality that you want around these byte arrays. The entire definition of the WmlDocument class looks like this:
    public class WmlDocument
    {
        public byte[] RawDocument { get; set; }

        public WordprocessingDocument GetWordprocessingDocument()
        {
            MemoryStream mem = new MemoryStream();
            mem.Write(RawDocument, 0, RawDocument.Length);
            WordprocessingDocument doc = WordprocessingDocument.Open(mem, true);
            return doc;
        }

        public WmlDocument(WmlDocument original)
        {
            RawDocument = new byte[original.RawDocument.Length];
            Array.Copy(original.RawDocument, RawDocument, original.RawDocument.Length);
        }

        public WmlDocument(string fileName)
        {
            RawDocument = File.ReadAllBytes(fileName);
        }

        public WmlDocument(byte[] byteArray)
        {
            RawDocument = new byte[byteArray.Length];
            Array.Copy(byteArray, RawDocument, byteArray.Length);
        }

        public void Save(string fileName)
        {
            File.WriteAllBytes(fileName, RawDocument);
        }
    }
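As a rough sketch of how this class can be used on its own, the following example counts the top-level paragraphs in a document and in a copy of it. It assumes the WmlDocument definition shown above (in the OpenXmlPowerTools namespace) and a reference to the Open XML SDK. ParagraphCounter and CountParagraphs are hypothetical names made up for illustration; they are not part of PowerTools.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using OpenXmlPowerTools;

// Hypothetical helper for illustration only; not part of PowerTools.
public static class ParagraphCounter
{
    // Counts the paragraphs that are direct children of the document body.
    public static int CountParagraphs(WmlDocument wml)
    {
        // Each call opens a fresh in-memory package over the same bytes,
        // so the WordprocessingDocument can be disposed right here.
        using (WordprocessingDocument doc = wml.GetWordprocessingDocument())
        {
            return doc.MainDocumentPart.Document.Body.Elements<Paragraph>().Count();
        }
    }

    public static void Main()
    {
        // Path taken from the sample code in this post.
        WmlDocument source = new WmlDocument("../../Source1.docx");

        // The copy constructor duplicates the byte array, so the two
        // objects do not share any state.
        WmlDocument copy = new WmlDocument(source);

        Console.WriteLine(CountParagraphs(source));
        Console.WriteLine(CountParagraphs(copy));
    }
}
```

Because the only long-lived state is the byte array, the IDisposable object exists only inside the using block, and the same source can be opened as many times as needed.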
Now that you’ve seen the WmlDocument class, you can see the new code to use DocumentBuilder. The following example code (taken from the example that is delivered with the new DocumentBuilder class) shows five cases:
- Generate a new document that contains just a sub-document consisting of 10 paragraphs, starting at paragraph 5.
- ‘Delete’ a range of a document by importing the same WmlDocument twice, specifying two ranges that leave out a chunk of content between them.
- Concatenate two documents, using the section info (headers and footers) from the first document.
- Concatenate two documents, using the section info from the second document.
- Generate a new document from the first five paragraphs of one document, as well as the first five paragraphs of the second document. In this case, DocumentBuilder.BuildDocument returns a WmlDocument, which you can serialize to wherever you need to.
You can see that the code to set up a list of sources and then fire off the DocumentBuilder is very similar to code for V1 of this class.
    string source1 = "../../Source1.docx";
    string source2 = "../../Source2.docx";
    string source3 = "../../Source3.docx";
    List<Source> sources = null;

    // Create new document from 10 paragraphs starting at paragraph 5 of Source1.docx
    sources = new List<Source>()
    {
        new Source(new WmlDocument(source1), 5, 10, true),
    };
    DocumentBuilder.BuildDocument(sources, "Out1.docx");

    // Create new document from paragraph 1, and paragraphs 5 through end of Source3.docx.
    // This effectively 'deletes' paragraphs 2-4
    sources = new List<Source>()
    {
        new Source(new WmlDocument(source3), 0, 1, false),
        new Source(new WmlDocument(source3), 4, false),
    };
    DocumentBuilder.BuildDocument(sources, "Out2.docx");

    // Create a new document that consists of the entirety of Source1.docx and Source2.docx. Use
    // the section information (headers and footers) from source1.
    sources = new List<Source>()
    {
        new Source(new WmlDocument(source1), true),
        new Source(new WmlDocument(source2), false),
    };
    DocumentBuilder.BuildDocument(sources, "Out3.docx");

    // Create a new document that consists of the entirety of Source1.docx and Source2.docx. Use
    // the section information (headers and footers) from source2.
    sources = new List<Source>()
    {
        new Source(new WmlDocument(source1), false),
        new Source(new WmlDocument(source2), true),
    };
    DocumentBuilder.BuildDocument(sources, "Out4.docx");

    // Create a new document that consists of the first five paragraphs of Source1.docx and the first
    // five paragraphs of Source2.docx. This example returns a new WmlDocument, which you can then
    // serialize to a SharePoint document library, or use in some other interesting scenario.
    sources = new List<Source>()
    {
        new Source(new WmlDocument(source1), 0, 5, false),
        new Source(new WmlDocument(source2), 0, 5, true),
    };
    WmlDocument out5 = DocumentBuilder.BuildDocument(sources);
    out5.Save("Out5.docx");  // Save it to the file system, but we could just as easily
                             // have done something else with it.
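The last case above returns a WmlDocument instead of writing a file. One possible way to send that result somewhere other than the file system is to write its byte array to any Stream. This is only a sketch: it assumes the WmlDocument definition and the Source and DocumentBuilder.BuildDocument signatures shown in this post, and BuildToStream and WriteTo are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using OpenXmlPowerTools;

// Hypothetical helper for illustration only; not part of PowerTools.
public static class BuildToStream
{
    // Writes the complete document package to any writable stream.
    public static void WriteTo(WmlDocument document, Stream target)
    {
        target.Write(document.RawDocument, 0, document.RawDocument.Length);
    }

    public static void Main()
    {
        // Same sources as the fifth case in the sample above.
        List<Source> sources = new List<Source>()
        {
            new Source(new WmlDocument("../../Source1.docx"), 0, 5, false),
            new Source(new WmlDocument("../../Source2.docx"), 0, 5, true),
        };
        WmlDocument result = DocumentBuilder.BuildDocument(sources);

        // A MemoryStream stands in here for whatever destination you need,
        // such as a network stream or an upload request body.
        using (MemoryStream stream = new MemoryStream())
        {
            WriteTo(result, stream);
        }
    }
}
```

Since the result is just bytes, no WordprocessingDocument needs to stay open while the document is being written out.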
Change in the Namespace
In the original version of DocumentBuilder, I placed it in the OpenXml.PowerTools namespace (notice the period between OpenXml and PowerTools). Subsequently, when releasing the RevisionAccepter class and the MarkupSimplifier class, I placed them in the OpenXmlPowerTools namespace. This was an oversight. In any case, the best namespace is OpenXmlPowerTools (without the period between OpenXml and PowerTools), so now all modules in the PowerTools for Open XML have the namespace OpenXmlPowerTools.
Moving into the Future
One of my main goals over the next few months is to rationalize the various pieces of code in the PowerTools for Open XML. I want to make the C# code be more consistent. I want to make it more robust, and I think that the new approach of using a byte array helps with this goal. And if the winds are favorable in our direction, we’ll have a new release of PowerTools for Open XML sometime in the near future.
Please continue to give me feedback on this new version. Your feedback on the original version of DocumentBuilder was instrumental in helping me decide the direction to take.
-Eric