SimplifyMarkup – Saved file corrupted

This topic contains 3 replies, has 2 voices, and was last updated by  FRCMNS0 8 years ago.

Viewing 4 posts - 1 through 4 (of 4 total)
    #3980

    FRCMNS0
    Participant

    Hello,
    I am running into a problem when I call MarkupSimplifier.SimplifyMarkup and then save the document. This is the code:

    using (var docMaster = WordprocessingDocument.Open("PORTA_copy.docx", true))
    {
        SimplifyMarkupSettings settings = new SimplifyMarkupSettings
        {
            NormalizeXml = true, // Merges runs in a paragraph with similar formatting

            // Additional settings if required
            RemoveBookmarks = true,
            RemoveComments = true,
            RemoveGoBackBookmark = true,
            RemoveWebHidden = true,
            RemoveContentControls = true,
            RemoveEndAndFootNotes = true,
            //RemoveFieldCodes = true,
            RemoveLastRenderedPageBreak = true,
            RemovePermissions = true,
            RemoveProof = true,
            RemoveRsidInfo = true,
            RemoveSmartTags = true,
            RemoveSoftHyphens = true,
        };

        MarkupSimplifier.SimplifyMarkup(docMaster, settings);

        docMaster.Save();
    }

    The PORTA_copy.docx file (created in Word 2016) contains only the word “PORTA”, which is split across several text elements in the internal document.xml:

    <w:r w:rsidRPr="00F74B85">
        <w:rPr>
            <w:color w:val="FF0000"/>
        </w:rPr>
        <w:t>PO</w:t>
        <w:t>R</w:t>
        <w:t>TA</w:t>
    </w:r>
    

    My intention is to group the word together using SimplifyMarkup.

    After the above code runs, the new document.xml section of the word looks like this:

    <w:r><w:rPr><w:color w:val="FF0000" /></w:rPr><w:t>PORTA</w:t></w:r>
    

    OK, that’s what I wanted. However, when I try to open the .docx in Word (again the 2016 version), it shows this error:

    The XML data is invalid according to the schema
    Location: Part: /word/styles.xml, Line: 0, Column: 0

    It shows an option to repair, but it’s clear that something is wrong.

    What is the problem with this code?

    Here is a sample project with the test docx and a minimal console application.

    #3981

    FRCMNS0
    Participant

    As an addendum, the other SimplifyMarkupSettings options (everything besides NormalizeXml) don’t cause errors; the problem occurs only when NormalizeXml is set.

    #3982

    Eric White
    Keymaster

    Hi,

    I think that there is something else causing this problem, not MarkupSimplifier.

    The failure is in parsing the XML in the /word/styles.xml part, not in the main document part, which is what NormalizeXml operates on. It looks as though your styles.xml file may be empty, which could have any of a variety of causes, but MarkupSimplifier is probably not one of them. That’s not to say that MarkupSimplifier doesn’t modify styles.xml (it might, I can’t recall), but it is not the first place I’d look for this bug. I’d look for what is writing to styles.xml and see why the XML parser is failing on it.

    You can also manually examine the styles.xml file using the Open XML Package Editor Add-In for Visual Studio. That may provide a clue as to why the parser is failing on reading the styles.xml part.
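
    If the add-in is not at hand, the same inspection can be sketched in code, since a .docx file is an ordinary ZIP package. This is only a minimal sketch and not part of Open-Xml-PowerTools: the class name StylesPartInspector and its Describe method are hypothetical, and it assumes the styles part sits at word/styles.xml, the location named in Word's error message.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Open-Xml-PowerTools.
public static class StylesPartInspector
{
    // Returns a one-line report on word/styles.xml inside the given .docx stream.
    public static string Describe(Stream docx)
    {
        using (var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the styles part uses the conventional path.
            ZipArchiveEntry entry = archive.GetEntry("word/styles.xml");
            if (entry == null)
                return "styles.xml: part not found";
            if (entry.Length == 0)
                return "styles.xml: part is empty";

            try
            {
                using (Stream part = entry.Open())
                {
                    // Throws XmlException if the part is not well-formed XML.
                    XDocument xml = XDocument.Load(part);
                    int nsCount = xml.Root.Attributes().Count(a => a.IsNamespaceDeclaration);
                    return string.Format(
                        "styles.xml: well-formed, root <{0}>, {1} namespace declarations",
                        xml.Root.Name.LocalName, nsCount);
                }
            }
            catch (XmlException ex)
            {
                return "styles.xml: not well-formed XML: " + ex.Message;
            }
        }
    }
}
```

    It could be called with something like StylesPartInspector.Describe(File.OpenRead("PORTA_copy.docx")). Note that it checks only whether the part is present, non-empty, and well-formed; it does not validate the part against the schema, which is what Word's message refers to.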

    Best, Eric

    #3983

    FRCMNS0
    Participant

    In this case, I am sure that MarkupSimplifier is modifying the styles.xml file. Here is the entire sample code used:

    using DocumentFormat.OpenXml.Packaging;
    using OpenXmlPowerTools;
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    
    namespace DocxTest
    {
        class Program
        {
            static void Main(string[] args)
            {
                try
                {
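                    // Overwrite the copy left behind by a previous run; without the third argument a re-run throws IOException.
                    File.Copy("PORTA.docx", "PORTA_copy.docx", true);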
                    using (var docMaster = WordprocessingDocument.Open("PORTA_copy.docx", true))
                    {
                        SimplifyMarkupSettings settings = new SimplifyMarkupSettings
                        {
                            NormalizeXml = true, // Merges runs in a paragraph with similar formatting
    
                            // Additional settings if required
                            RemoveBookmarks = true,
                            RemoveComments = true,
                            RemoveGoBackBookmark = true,
                            RemoveWebHidden = true,
                            RemoveContentControls = true,
                            RemoveEndAndFootNotes = true,
                            //RemoveFieldCodes = true,
                            RemoveLastRenderedPageBreak = true,
                            RemovePermissions = true,
                            RemoveProof = true,
                            RemoveRsidInfo = true,
                            RemoveSmartTags = true,
                            RemoveSoftHyphens = true,
                        };
    
                        MarkupSimplifier.SimplifyMarkup(docMaster, settings);
    
                        docMaster.Save();
                    }
    
                    Console.WriteLine("Done.");
                }
                catch(Exception ex)
                {
                    Console.WriteLine("Error: {0}", ex.ToString());
                }
                
                Console.ReadLine();
            }
        }
    }
    

    There’s nothing else being done to the document.
    Most of the differences are an extra space before the end of a tag or reordered attributes.

    The major change is right at the beginning of the file, mostly additional namespace declarations.

    Here is a WinMerge report with the differences highlighted:
    https://drive.google.com/file/d/0B0ZNalzpb4uFRjdndWFidTduME0/view?usp=sharing
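
    For readers without access to the report, the namespace-declaration difference described above could also be listed in code. This is a rough sketch under stated assumptions: the class NamespaceDiff and its method are hypothetical names, and the two XDocument arguments are assumed to hold styles.xml as extracted from PORTA.docx and PORTA_copy.docx.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Open-Xml-PowerTools.
public static class NamespaceDiff
{
    // Lists "prefix=uri" for each namespace declaration on the root element
    // of 'after' that is not declared on the root element of 'before'.
    public static List<string> AddedDeclarations(XDocument before, XDocument after)
    {
        var known = new HashSet<string>(Declarations(before.Root));
        return Declarations(after.Root).Where(d => !known.Contains(d)).ToList();
    }

    private static IEnumerable<string> Declarations(XElement root)
    {
        // xmlns="..." has no attribute namespace; xmlns:p="..." has local name "p".
        return root.Attributes()
            .Where(a => a.IsNamespaceDeclaration)
            .Select(a => (a.Name.Namespace == XNamespace.None ? "(default)" : a.Name.LocalName)
                         + "=" + a.Value);
    }
}
```

    It looks only at declarations on the root element, which is where the change described above sits; it says nothing about whether those extra declarations are what Word objects to.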
