Extract all charts and SmartArts from the Word Document.
This topic contains 6 replies, has 2 voices, and was last updated by princrai 8 years, 8 months ago.
March 15, 2016 at 10:04 pm #2553
Hi Eric,
I have a scenario where I want to extract tables, SmartArt, charts, and images from a document and create a new document from each of these objects. For tables, I was able to extract them using xDoc.Descendants(W.tbl), but I am not sure how to import each table into a new document. Can you suggest how I can save these tables into new documents? And how should I extract the SmartArt, charts, or images from the document?
Thanks,
Prince
March 16, 2016 at 4:11 am #2554
Hi Prince,
What you need is the DocumentBuilder module in Open-Xml-PowerTools. Please see the screen-casts at the following link:
DocumentBuilder Developer Center
In particular, first watch this:
Short Introduction to DocumentBuilder
Cheers, Eric
March 16, 2016 at 6:41 am #2560
Thanks Eric for sharing the video links.
But I have a slightly different scenario. Suppose I have a paragraph that contains a chart and a SmartArt (both are basically drawing nodes in Open XML), and I want to shred these two objects individually to create two Word documents, one containing the chart and the other containing the SmartArt. I went through the DocumentBuilder code, and it looks like it does most of its processing at the paragraph level, so the chart and SmartArt end up in the same document. I wanted to check whether there is any way to shred these drawings and create a well-formed document containing each drawing individually.
Thanks,
Prince
March 16, 2016 at 12:36 pm #2564
Hi Prince,
You are correct – DocumentBuilder works at the granularity of a paragraph. It doesn’t have facilities to break out a run in a paragraph, and do something with it.
It is possible to write the code to directly do this, but it isn’t trivial.
Unfortunately, I don’t know of any samples or documentation that I can point you to. In general, you can take the approach:
- Take a copy of the document before you have deleted the chart or smartArt
- Take another copy, open in Word, modify the content by deleting one or the other, save
- Use the Open XML SDK Productivity Tool to compare the two, and make detailed notes on all the changes you need to make.
- Write your code to make the same changes. Validate your code by comparing with the second copy in the above procedure.
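As a rough first pass before the detailed comparison in the Productivity Tool, it can help to see which parts of the package changed at all between the two copies. The sketch below is an assumption-laden illustration in Python using only the standard library (not the Open XML SDK): it treats each .docx as a zip archive and reports parts that were removed, added, or changed. The function names are hypothetical, and it only says which parts differ, not what changed inside them.

```python
import hashlib
import io
import zipfile


def part_digests(docx_bytes):
    """Map each part name in the package (a .docx is a zip) to a SHA-256 digest."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return {name: hashlib.sha256(package.read(name)).hexdigest()
                for name in package.namelist()}


def compare_packages(before_bytes, after_bytes):
    """Return (removed, added, changed) part names between two packages."""
    before = part_digests(before_bytes)
    after = part_digests(after_bytes)
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(name for name in set(before) & set(after)
                     if before[name] != after[name])
    return removed, added, changed
```

To use it, read both copies as bytes (for example with `open(path, "rb").read()`) and pass them to `compare_packages`. Parts that show up as removed or changed are the ones worth inspecting closely in the Productivity Tool.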
March 16, 2016 at 11:25 pm #2571
Hi Eric,
I was able to extract the tables, SmartArt, charts, and images from the Word document by making some modifications to DocumentBuilder. What I am doing is something like this:
1. Get all the paragraphs by filtering on W:p.
2. From each paragraph in step 1, get all the runs (W:r).
3. Create a new paragraph XElement (W:p), add the runs to it one by one, and create the document.
Here are a few things I am assuming are correct:
1. Charts, SmartArt, and images are always put in a W:drawing inside a run (W:r), and a run that contains a drawing will not contain text or any other objects.
2. For extracting all the charts, SmartArt, and images, parsing W:p is sufficient. Can drawings be present inside any other nodes, such as w:sdt, w:sdtContent, or w:ins? I ask because I saw something like this:
https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx
<w:p/>
<w:ins w:id="16" w:author="Eric White" w:date="2009-08-29T06:47:00Z">
  <w:p>
    <w:r>
      <w:t>This is another inserted paragraph.</w:t>
    </w:r>
  </w:p>
</w:ins>
<w:p/>
March 17, 2016 at 12:07 am #2573
Question #1: I have never seen Word put a chart, smartArt, or image into a run with other content. The Open XML standard does not prohibit this, and Word will process a run just fine if it contains both an image and text. However, I have never seen Word write this markup. It probably is an OK assumption for your program.
Question #2: Are you anticipating processing documents that contain tracked revisions? One option is to use the RevisionAccepter module to first accept tracked revisions, and then process the document. In general, are you making use of content controls? You might find content controls in a document, such as to contain a TOC or a page number in a header/footer, but these probably are not the paragraphs that will contain the charts / smartArt / images that you are interested in. If you are not using content controls for introducing metadata into your document, then you can probably ignore them. You probably would want to check to see if there are any content controls in the document (other than expected CCs, such as for TOC) before processing it.
It is worthwhile to scan the w:p and w:r elements in the standard, and note all of the child elements of each.
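One way to check these assumptions against real documents is to survey where the w:drawing elements sit and what each one holds. The following is a minimal sketch in Python using only the standard library, not the Open XML SDK or PowerTools. It assumes you have already read word/document.xml out of the package as a string; the function name and the set of wrapper elements it reports are illustrative choices, not an exhaustive list. It classifies each drawing by the uri attribute on a:graphicData and lists any enclosing revision or content-control elements.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"

# a:graphicData uri values that identify what a w:drawing holds.
KINDS = {
    "http://schemas.openxmlformats.org/drawingml/2006/chart": "chart",
    "http://schemas.openxmlformats.org/drawingml/2006/diagram": "smartart",
    "http://schemas.openxmlformats.org/drawingml/2006/picture": "picture",
}

# Ancestors worth flagging (illustrative, not exhaustive).
WRAPPERS = {f"{{{W}}}{name}" for name in ("ins", "del", "sdt", "sdtContent")}


def survey_drawings(document_xml):
    """Return a list of (kind, enclosing_wrapper_names) for every w:drawing."""
    root = ET.fromstring(document_xml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for drawing in root.iter(f"{{{W}}}drawing"):
        data = drawing.find(f".//{{{A}}}graphicData")
        uri = data.get("uri") if data is not None else None
        kind = KINDS.get(uri, "other")
        wrappers = []  # innermost wrapper first
        node = parent_of.get(drawing)
        while node is not None:
            if node.tag in WRAPPERS:
                wrappers.append(node.tag.split("}")[1])
            node = parent_of.get(node)
        results.append((kind, wrappers))
    return results
```

If every entry comes back with an empty wrapper list, parsing plain paragraphs is probably enough for that document. Any entry that lists ins, del, or sdt suggests accepting revisions first or handling content controls explicitly.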
March 17, 2016 at 12:21 am #2575
Since my service will just be extracting objects from the document, I don’t think I will get a document that contains tracked revisions.
Basically, I have to make sure I get a paragraph with a run containing a drawing (for charts, SmartArt, images). To be on the safe side, what I can do is remove everything from the run except the run properties (w:rPr) and the drawing node (w:drawing). I think this should work correctly.
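The stripping step described above could look roughly like the sketch below. It is written in Python with the standard library only, as an illustration of the markup transformation rather than the modified DocumentBuilder code from this thread, and the function name is hypothetical. For each run that contains a w:drawing, it builds a fresh w:p holding a new run with only a copy of w:rPr (if present) and that one drawing. It does not copy the chart, diagram, or image parts that the drawing references by relationship ID; those still have to be brought into the new package for the result to open correctly.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def isolate_drawing_runs(paragraph):
    """Return one new w:p element per w:drawing found in the paragraph's runs.

    Each new paragraph holds a single run containing only a copy of the
    original run's w:rPr (if any) and one w:drawing. Text and other run
    content are dropped. Assumes drawings do not themselves contain runs
    with drawings (text boxes inside shapes are not handled).
    """
    new_paragraphs = []
    for run in paragraph.iter(f"{{{W}}}r"):
        drawings = run.findall(f"{{{W}}}drawing")
        if not drawings:
            continue  # plain text run, nothing to extract
        run_props = run.find(f"{{{W}}}rPr")
        for drawing in drawings:
            new_p = ET.Element(f"{{{W}}}p")
            new_r = ET.SubElement(new_p, f"{{{W}}}r")
            if run_props is not None:
                new_r.append(copy.deepcopy(run_props))
            new_r.append(copy.deepcopy(drawing))
            new_paragraphs.append(new_p)
    return new_paragraphs
```

Deep copies are used so the source tree stays untouched and the same paragraph can be processed more than once. Because `iter` walks all descendants, runs wrapped in w:ins or a content control inside the paragraph are picked up as well.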