Merging Manipulated Word Docs

This topic contains 3 replies, has 2 voices, and was last updated by  Eric White 8 years, 7 months ago.

Viewing 4 posts - 1 through 4 (of 4 total)
  • #2547

    AlanSMac
    Participant

    Hi,

    I am opening Word documents from templates and substituting values for variables, then I want to combine all the docs into one output Word document at the end.

    I’ve found articles like https://blogs.technet.microsoft.com/ptsblog/2011/01/24/open-xml-sdk-merging-documents/ and it looks like DocumentBuilder is the cleanest way to go. For readability alone, I much prefer it to the chunk approach described here: http://stackoverflow.com/questions/18351829/merge-multiple-word-documents-into-one-open-xml

    I open Word documents as follows:

    WordprocessingDocument wordDoc = WordprocessingDocument.Open(ConfigurationManager.AppSettings["MailingLabelTemplatePath"], true);

    wordDocs.Add(wordDoc);

    Each doc will be based on the same template file. The template holds 30 mailing labels, so whenever I need more, a new doc is created from the same template.

    I then process the docs in memory to substitute the variables so I need to walk through each label section to set the next label name and address.

    Now I want to combine the results into one doc at the end. This is in a service processing hundreds of docs so I really want to just take these docs from memory and combine them.

    My problem is that DocumentBuilder seems to be based on Source objects, which are based on WmlDocuments. The only thing I can find telling me what a WmlDocument is, is this old post: http://openxmldeveloper.org/home2/bm8qcmjy/public_html/blog/b/openxmldeveloper/archive/2011/06/20/new-version-of-documentbuilder-available-in-powertools-for-open-xml.aspx

    As far as I can tell, it used to let you construct a WordprocessingDocument from a WmlDocument, but not vice versa, and now it no longer does even that. I’m using the current version installed from NuGet (v4.1.3). I can only construct a WmlDocument from a file path or a byte array, which means I will have to save all these temp files to disk just to re-open them immediately for merging.

    The WmlDocument abstraction seems pretty inconvenient, because parts of the API like Source are based on it but the rest of the SDK is based on WordprocessingDocument etc., so you need to convert between them whenever you are actually changing things. Or am I missing something?

    In a nutshell, how do I construct a WmlDocument from a WordprocessingDocument without saving to disk?

    Thanks

    #2556

    Eric White
    Keymaster

    Hi,

    The WmlDocument is an abstraction for an unopened Open XML document. It is a thin wrapper over a byte array. If you want to process a document successively by the Open-Xml-Sdk, and then by DocumentBuilder, the best way is:

    1. Get the document as a byte array, perhaps by calling File.ReadAllBytes, or by serializing a stream to a byte array.
    2. Create a new MemoryStream
    3. Write the byte array to the memory stream. Note that you do not want to use the MemoryStream constructor that takes a byte array as an argument, as that creates a non-resizable memory stream, which means you can’t make changes to the document. Instead, you new up a MemoryStream using the default constructor, and then write the byte array into the MemoryStream as the first line in your using block.
    4. Open the MemoryStream using the Open-Xml-Sdk. Make modifications to the document.
    5. Once you drop out of the using block for the WordprocessingDocument, the memory stream will be updated. You can get the byte array by calling ToArray() on the MemoryStream.
    6. You can then new up the WmlDocument using the byte array, and do DocumentBuilder operations on it.
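
    The warning in step 3 about the byte-array constructor can be checked with the base class library alone. The following is a minimal sketch, not part of the original reply; the class and method names (MemoryStreamResizeDemo, CanGrow, CreateResizable) are made up for illustration.

```csharp
using System;
using System.IO;

public static class MemoryStreamResizeDemo
{
    // Returns true if the stream accepts a write that grows it past its current length.
    public static bool CanGrow(MemoryStream stream)
    {
        try
        {
            stream.Seek(0, SeekOrigin.End);
            stream.WriteByte(0xFF); // needs one byte more than the stream currently holds
            return true;
        }
        catch (NotSupportedException)
        {
            // Thrown by streams created with new MemoryStream(byte[]): fixed capacity.
            return false;
        }
    }

    // Step 3 as described above: default constructor first, then write the bytes in.
    public static MemoryStream CreateResizable(byte[] bytes)
    {
        var stream = new MemoryStream();
        stream.Write(bytes, 0, bytes.Length);
        return stream;
    }
}
```

    A stream from new MemoryStream(bytes) should report false here, while one from CreateResizable(bytes) should report true, which is why the second form is the one to hand to the Open-Xml-Sdk when the document will be modified.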

    This sounds more complicated than it is. At any point in time, you have:

    • A byte array.
    • A memory stream.
    • An opened WordprocessingDocument
    • A WmlDocument created from the byte array, or retrieved from an Open-Xml-PowerTools function. You can get the byte array from a public property in this class.
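
    The steps and the four forms listed above can be sketched as one helper. This is an illustration under assumptions rather than code from the original reply: the names WmlRoundTrip and EditInMemory and the "in-memory.docx" label are hypothetical, and the edit callback stands in for whatever Open-Xml-Sdk changes you need to make.

```csharp
using System;
using System.IO;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

public static class WmlRoundTrip
{
    // Takes a document as bytes, edits it with the Open-Xml-Sdk in memory,
    // and returns the result as a WmlDocument ready for DocumentBuilder.
    public static WmlDocument EditInMemory(byte[] documentBytes, Action<WordprocessingDocument> edit)
    {
        using (var stream = new MemoryStream())               // step 2: resizable stream
        {
            stream.Write(documentBytes, 0, documentBytes.Length); // step 3

            using (var wordDoc = WordprocessingDocument.Open(stream, true)) // step 4
            {
                edit(wordDoc);
            } // step 5: leaving this block writes the changes back to the stream

            // step 6: the file name is only a label here; nothing is written to disk
            return new WmlDocument("in-memory.docx", stream.ToArray());
        }
    }
}
```

    Usage might look like WmlRoundTrip.EditInMemory(File.ReadAllBytes(templatePath), doc => { /* set label text */ }), where templatePath is whatever path your configuration supplies. The bytes of the result are available from its DocumentByteArray property.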

    I agree, it isn’t ideal. The root of this division is a difference in programming models: LINQ to XML is a better programming model than some of the classes in the Open-Xml-Sdk for doing certain types of transforms, so all of Open-Xml-PowerTools is written using LINQ to XML. Further, in order to make DocumentBuilder as robust and useful as possible, it was easier to write DocumentBuilder such that it was responsible for doing the actual opening of the documents in the source list. In particular, if a source list references the same document twice, DocumentBuilder opens it twice. It is therefore important to pass unopened documents as input to DocumentBuilder.
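
    For the merging step itself, a minimal sketch of passing unopened documents to DocumentBuilder might look like the following. The LabelMerger class is hypothetical, and passing true for keepSections is an assumption that each label sheet should keep its own section and page layout; adjust it to your template.

```csharp
using System.Collections.Generic;
using System.Linq;
using OpenXmlPowerTools;

public static class LabelMerger
{
    // Combines already-processed documents, in order, into one WmlDocument.
    // Every input stays unopened: DocumentBuilder opens each source itself.
    public static WmlDocument Merge(IEnumerable<WmlDocument> documents)
    {
        List<Source> sources = documents
            .Select(d => new Source(d, true)) // true = keepSections (assumed desirable here)
            .ToList();

        return DocumentBuilder.BuildDocument(sources);
    }
}
```

    The merged result is again just a wrapper over bytes, so it can be written out once at the end, for example with File.WriteAllBytes(outputPath, merged.DocumentByteArray), without any intermediate temp files.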

    I have had an idea of doing a much tighter integration of Open-Xml-PowerTools and the Open-Xml-Sdk, making your job easier, but this is not currently in my plans. But we never know, plans change.

    #3270

    AlanSMac
    Participant

    Thanks Eric.

    Bizarrely I replied to this the day after you posted and it never showed. I tried to post again immediately and the server said it detected a duplicate post and it never ever showed up!

    Just wanted to say your response was really useful and much appreciated. I ended up creating a wrapper so I can conveniently switch between the formats. The only catch is that the caller is responsible for disposing the Word doc etc. to get the bytes to update, as you mentioned:

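    using System;
    using System.IO;
    using DocumentFormat.OpenXml.Packaging;
    using OpenXmlPowerTools;

    // Holds a document in a resizable MemoryStream so it can be handed out either as a
    // WordprocessingDocument (for editing) or as a WmlDocument (for DocumentBuilder).
    public class InterchangeableWordProcessingDocument : IDisposable
    {
        public MemoryStream memoryStream { get; private set; }

        public InterchangeableWordProcessingDocument(string path)
        {
            var bytes = File.ReadAllBytes(path);
            CreateMemoryStream(bytes);
        }

        private MemoryStream CreateMemoryStream(byte[] bytes)
        {
            // Do not use the byte array constructor: that stream is not resizable, i.e. it does not handle change.
            memoryStream = new MemoryStream();
            memoryStream.Write(bytes, 0, bytes.Length);
            return memoryStream;
        }

        // The caller must dispose the returned document before calling GetAsWmlDocument;
        // until then the edits have not been written back to the stream.
        public WordprocessingDocument GetAsWordProcessingDocument()
        {
            return WordprocessingDocument.Open(memoryStream, true);
        }

        public WmlDocument GetAsWmlDocument()
        {
            // "dummy" is a placeholder file name; the content comes from the byte array.
            return new WmlDocument("dummy", memoryStream.ToArray());
        }

        public void Dispose()
        {
            memoryStream.Dispose();
        }
    }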

    While I’m posting: for my particular task, it would have been great if the TextReplacer class exposed the method that does all the hard work on an individual element, rather than the only public method being one that replaces all instances in a whole document. Sorry, I forget the method name, but it looked like it would just be a case of changing the accessor. I ended up having to write code to manually handle replacing values like <<myvariable>>, because the << and >> would sometimes be broken into 2 or 3 elements, but I didn’t want to replace all instances in my particular scenario (different parts of the document were for different people, so the variables had different values depending on the person). I got the impression that TextReplacer had taken a lot of work and pain to handle these types of things.

    Thanks again for your speedy and useful response.

    #3291

    Eric White
    Keymaster

    Also take a look at OpenXmlRegex. It is a superset of TextReplacer.

    http://www.ericwhite.com/blog/blog/openxmlregex-developer-center/
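
    As a rough sketch of how OpenXmlRegex could cover the per-person scenario above: it operates on a chosen set of LINQ to XML elements, so the replacement can be limited to selected paragraphs. The ScopedReplacer class and its parameters are hypothetical, and the Replace overload used here (content, regex, replacement, callback) is assumed from Open-Xml-PowerTools 4.x, so check it against the version you have installed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

public static class ScopedReplacer
{
    // Replaces <<variableName>> with value, but only inside paragraphs accepted by
    // paragraphFilter, so different parts of the document can receive different values.
    public static void ReplaceVariable(
        WordprocessingDocument wordDoc,
        Func<XElement, bool> paragraphFilter,
        string variableName,
        string value)
    {
        // PowerTools extension method: the main part as a LINQ to XML tree.
        XDocument mainXml = wordDoc.MainDocumentPart.GetXDocument();

        List<XElement> paragraphs = mainXml
            .Descendants(W.p)
            .Where(paragraphFilter)
            .ToList();

        // Escape the delimiters so << and >> are matched literally.
        var regex = new Regex(Regex.Escape("<<" + variableName + ">>"));

        // The callback lets each match be accepted or rejected; here all are accepted.
        OpenXmlRegex.Replace(paragraphs, regex, value, (element, match) => true);

        // Write the modified tree back into the part.
        wordDoc.MainDocumentPart.PutXDocument();
    }
}
```

    The filter is where the "which person does this label belong to" logic would go, for example by selecting the paragraphs of one label cell at a time.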
