Screen-Cast Series: Recursive Pure Functional Transforms of XML

One of the tough problems to solve in XML processing is that of document-centric transforms.  Transforming Open XML WordprocessingML to HTML, or transforming one form of Open XML to another form of Open XML, requires that we write a transform that can deal with document-centric XML, as opposed to data-centric XML.

Today, I am embarking on a fairly ambitious project.  Through the medium of screen-casts, I am going to teach you how to write document-centric transforms of XML using recursive pure functional programming techniques.  I am going to start with the requisite topics in functional programming, and then take you from A to Z of the larger topic of how to write recursive pure functional transforms.  At the end of this screen-cast series, you will be able to customize the Open XML transforms that I have published in PowerTools for Open XML, and elsewhere on my blog.  Further, you will be able to write your own interesting document-centric transforms.

See A Series of Screen-Casts on Recursive Pure Functional Transforms for a complete list of the screen-casts in this series.

Some time ago, I gave a talk on Open XML development at one of the Microsoft technical conferences, and one attendee gave me the (anonymous) feedback: “You are holding out on us.  Why don’t you tell us really how you write the code that you have published in PowerTools.”

Actually, it would not be possible to cover recursive pure functional transforms in a one-hour session at TechEd.  However, in this screen-cast series, I am certainly not limited to one hour, and in fact, this screen-cast series is exactly what that attendee suggested.  I am going to expose my thought process, tools, and approach to writing interesting transforms of Open XML.

While I have been writing principally in C# over the last few years, it is certainly possible to write recursive pure functional transforms in JavaScript.  Throughout the screen-cast series, I will be presenting every example in both C# and JavaScript.  There are a lot of benefits that we gain by using JavaScript.  I won’t go into those benefits in this blog-post / screen-cast, but suffice it to say, we will certainly see those benefits before the end of this screen-cast series.

I am going to go pretty fast in this series.  I have already recorded the first six screen-casts, and am recording a new one about every other day.  I’ll be posting these screen-casts as fast as I can edit them.

To make it clear what I mean by document-centric transforms, first let’s consider what data-centric XML is, and what document-centric XML is.

Data-Centric XML

A data-centric XML document contains regular repeating elements.  Child elements of a given element might all have the same tag name, or they might not.  Typically, child element order doesn’t matter.  There are lots of examples of this – many types of transforms of a relational database to XML result in data-centric XML.  RSS feeds are another.

Here’s a data-centric XML document:

<Customers>
  <Customer>
    <Name>Bob</Name>
    <Age>45</Age>
  </Customer>
  <Customer>
    <Name>Jill</Name>
    <Age>37</Age>
  </Customer>
</Customers>
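
To make the contrast concrete, here is a minimal JavaScript sketch of why data-centric XML is easy to process.  It is not code from PowerTools or from the screen-casts; the plain-object representation of elements (a name plus an array of children) and the helper names el and childText are assumptions made only for illustration.  Because every Customer has the same shape and child order doesn’t matter, one flat pass that looks up children by name is enough, with no recursion.

```javascript
// Hypothetical in-memory form of an element: { name, children },
// where each child is either another element or a text string.
const el = (name, ...children) => ({ name, children });

// The Customers document shown above, built by hand for this sketch.
const customers = el("Customers",
  el("Customer", el("Name", "Bob"), el("Age", "45")),
  el("Customer", el("Name", "Jill"), el("Age", "37")));

// Look up a child element by tag name and return its text.
// The position of the child within its parent is irrelevant.
const childText = (element, tag) => {
  const child = element.children.find(c => c.name === tag);
  return child ? child.children.join("") : null;
};

// One flat, non-recursive pass over the repeating elements.
const records = customers.children.map(c => ({
  name: childText(c, "Name"),
  age: Number(childText(c, "Age")),
}));

console.log(records);
// [ { name: 'Bob', age: 45 }, { name: 'Jill', age: 37 } ]
```

Swapping the order of Name and Age inside a Customer would not change the result, which is exactly the property that document-centric XML does not have.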

Document-Centric XML

Document-centric XML documents have the characteristic that the child elements of a given element are much less bounded – you might have many child elements of a given name, or you might have none.  You might have ‘recursion’ in the hierarchy – element A is a child of element B, which is itself a child of a different element A.  A number of examples: Open XML word processing markup, XHTML, and XPS.
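
As a small illustration of that kind of recursion, here is a hedged JavaScript sketch that measures how deeply an element name nests inside itself.  The { name, children } object shape and the names el and maxNesting are assumptions for this example only, and the nested-table document is made up.  The point is simply that a structure that recurses calls for a function that recurses.

```javascript
// Hypothetical in-memory form of an element: { name, children },
// where each child is either another element or a text string.
const el = (name, ...children) => ({ name, children });

// Pure recursive function: the largest number of `tag` elements found
// on any path from `node` down to a leaf.  It reads the tree and
// returns a number; it never modifies the tree.
const maxNesting = (node, tag) => {
  if (typeof node === "string") return 0; // text nodes contain no elements
  const below = node.children.reduce(
    (max, child) => Math.max(max, maxNesting(child, tag)), 0);
  return below + (node.name === tag ? 1 : 0);
};

// A made-up document: a table whose only cell contains another table.
const doc =
  el("table",
    el("tr",
      el("td",
        el("p", "outer"),
        el("table",
          el("tr",
            el("td", el("p", "inner")))))));

console.log(maxNesting(doc, "table")); // 2
console.log(maxNesting(doc, "p"));     // 1
```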

I further divide document-centric XML documents into two camps – those that contain mixed content, and those that don’t.  Mixed content is a variety of XML where significant text nodes and elements are interspersed.  Insignificant text nodes are the white space that provides indenting when formatting XML.  Open XML word processing markup doesn’t contain mixed content, whereas XHTML does:

An Open XML paragraph that contains some bold text:

<w:p>
  <w:r>
    <w:t>abc</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>def</w:t>
  </w:r>
  <w:r>
    <w:t>ghi</w:t>
  </w:r>
</w:p>

An XHTML document that contains significant text nodes interspersed with element start and end tags:

<html>
  <head></head>
  <body>
    <p>abc<b>def</b>ghi</p>
  </body>
</html>
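
To give a first taste of where this is heading, here is a minimal JavaScript sketch of a recursive pure functional transform that turns the Open XML paragraph above into the mixed-content XHTML paragraph.  It is only a sketch under stated assumptions, not the code from PowerTools or from the screen-casts: elements are modeled as plain { name, children } objects rather than parsed XML, the names el, isBold, transform, and serialize are hypothetical, and only the four elements in the example (w:p, w:r, w:rPr, w:t) are handled.

```javascript
// Hypothetical in-memory form of an element: { name, children },
// where each child is either another element or a text string.
const el = (name, ...children) => ({ name, children });

// The Open XML paragraph shown above: no mixed content, text lives in w:t.
const paragraph = el("w:p",
  el("w:r", el("w:t", "abc")),
  el("w:r", el("w:rPr", el("w:b")), el("w:t", "def")),
  el("w:r", el("w:t", "ghi")));

// A run is bold when its w:rPr child contains a w:b element.
const isBold = run => run.children.some(
  c => c.name === "w:rPr" && c.children.some(p => p.name === "w:b"));

// Pure recursive transform: returns an array of NEW nodes for each
// source node and never mutates the source tree.
const transform = node => {
  if (typeof node === "string") return [node];
  const kids = () => node.children.flatMap(transform);
  switch (node.name) {
    case "w:p": return [el("p", ...kids())];
    case "w:r": return isBold(node) ? [el("b", ...kids())] : kids();
    case "w:t": return kids();   // unwrap: the text becomes mixed content
    default:    return [];       // anything else (here w:rPr) is dropped
  }
};

// Serialize the result tree, escaping the characters that matter in text.
const escapeText = s => s.replace(/&/g, "&amp;").replace(/</g, "&lt;");
const serialize = node => typeof node === "string"
  ? escapeText(node)
  : `<${node.name}>${node.children.map(serialize).join("")}</${node.name}>`;

const html = transform(paragraph).map(serialize).join("");
console.log(html); // <p>abc<b>def</b>ghi</p>
```

The design choice worth noticing is that transform builds a new tree instead of editing the old one, so each case can be read and reasoned about in isolation.  A real transform has to handle far more elements than this.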

Some time ago, I wrote a detailed comparison of data-centric and document-centric transforms.  See the blog post Document-Centric Transforms using LINQ to XML for more information.

Recently, I saw a question on Twitter: what is a tough XML processing problem?

Here is a tough problem: you need to transform this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:o="urn:schemas-microsoft-com:office:office"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
            xmlns:w10="urn:schemas-microsoft-com:office:word"
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
  <w:body>
    <w:p w:rsidR="000C4AEB" w:rsidRDefault="00656E26">
      <w:r>
        <w:t>This is a test document to transform from WordprocessingML to XHtml.</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="000C4AEB" w:rsidRDefault="00656E26">
      <w:r>
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:drawing>
          <wp:inline distT="0" distB="0" distL="0" distR="0">
            <wp:extent cx="457200" cy="441960"/>
            <wp:effectExtent l="19050" t="0" r="0" b="0"/>
            <wp:docPr id="1" name="Picture 1"/>
            <wp:cNvGraphicFramePr>
              <a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
                                   noChangeAspect="1"/>
            </wp:cNvGraphicFramePr>
            <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
              <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
                <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
                  <pic:blipFill>
                    <a:blip r:embed="rId5" cstate="print"/>
                  </pic:blipFill>
                  <!-- some xml elided -->
                </pic:pic>
              </a:graphicData>
            </a:graphic>
          </wp:inline>
        </w:drawing>
      </w:r>
    </w:p>
    <w:p w:rsidR="000C4AEB" w:rsidRDefault="00656E26">
      <w:r>
        <w:t>Find out more about</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:hyperlink r:id="rId6" w:history="1">
        <w:proofErr w:type="spellStart"/>
        <w:r>
          <w:rPr>
            <w:rStyle w:val="Hyperlink"/>
          </w:rPr>
          <w:t>PowerTools</w:t>
        </w:r>
        <w:proofErr w:type="spellEnd"/>
        <w:r>
          <w:rPr>
            <w:rStyle w:val="Hyperlink"/>
          </w:rPr>
          <w:t xml:space="preserve"> for Open XML</w:t>
        </w:r>
      </w:hyperlink>
      <w:r>
        <w:t>.</w:t>
      </w:r>
    </w:p>
    <w:tbl>
      <w:tblPr>
        <w:tblStyle w:val="TableGrid"/>
        <w:tblW w:w="0" w:type="auto"/>
        <w:tblLook w:val="04A0"/>
      </w:tblPr>
      <w:tblGrid>
        <w:gridCol w:w="3192"/>
        <w:gridCol w:w="3192"/>
        <w:gridCol w:w="3192"/>
      </w:tblGrid>
      <w:tr w:rsidR="000C4AEB">
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="3192" w:type="dxa"/>
          </w:tcPr>
          <w:p w:rsidR="000C4AEB" w:rsidRDefault="00656E26">
            <w:r>
              <w:t>Vehicles</w:t>
            </w:r>
          </w:p>
        </w:tc>
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="3192" w:type="dxa"/>
          </w:tcPr>
          <w:p w:rsidR="000C4AEB" w:rsidRDefault="00656E26">
            <w:r>
              <w:t>Weight</w:t>
            </w:r>
          </w:p>
        </w:tc>
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="3192" w:type="dxa"/>
          </w:tcPr>
          <w:p w:rsidR="000C4AEB" w:rsidRDefault="00656E26">
            <w:r>
              <w:t>Cylinders</w:t>
            </w:r>
          </w:p>
        </w:tc>
      </w:tr>
      <!-- some xml elided -->
    </w:tbl>
    <w:p w:rsidR="000C4AEB" w:rsidRDefault="00656E26">
      <w:pPr>
        <w:pStyle w:val="Heading1"/>
      </w:pPr>
      <w:r>
        <w:t>Text styled as Heading 1</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="000C4AEB" w:rsidRDefault="00656E26">
      <w:pPr>
        <w:pStyle w:val="Heading2"/>
      </w:pPr>
      <w:r>
        <w:t>Subheading</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="000C4AEB" w:rsidRDefault="00656E26">
      <w:r>
        <w:t>Here is a bulleted list:</w:t>
      </w:r>
    </w:p>
    <!-- some xml elided -->
    <w:p w:rsidR="000C4AEB" w:rsidRDefault="00656E26">
      <w:pPr>
        <w:pStyle w:val="ListParagraph"/>
        <w:numPr>
          <w:ilvl w:val="2"/>
          <w:numId w:val="3"/>
        </w:numPr>
      </w:pPr>
      <w:r>
        <w:t>Test-C</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="000C4AEB" w:rsidRDefault="000C4AEB"/>
    <w:sectPr w:rsidR="000C4AEB" w:rsidSect="000C4AEB">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440"
               w:right="1440"
               w:bottom="1440"
               w:left="1440"
               w:header="720"
               w:footer="720"
               w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

Into this:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
    <meta name="Generator" content="PowerTools for Open XML" />
    <title>Test Title</title>
  </head>
  <body>
    <p>This is a test document to transform from WordprocessingML to XHtml.</p>
    <p>
      <img
        src="C:\Users\Public\Documents\A-000-OpenXmlPowerTools\OpenXmlPowerTools\ExampleHtmlConverter03Images/Test_files/image1.jpeg"
        style="width: 0.5in; height: 0.4833333in"
        alt="Picture 1" />
    </p>
    <p>Find out more about <A href="http://www.codeplex.com/powertools">PowerTools for Open XML</A>.</p>
    <table border="1">
      <tr>
        <td>
          <p>Vehicles</p>
        </td>
        <td>
          <p>Weight</p>
        </td>
        <td>
          <p>Cylinders</p>
        </td>
      </tr>
      <tr>
        <td>
          <p>Car</p>
        </td>
        <td>
          <p>3.25 tons</p>
        </td>
        <td>
          <p>6</p>
        </td>
      </tr>
      <tr>
        <td>
          <p>Truck</p>
        </td>
        <td>
          <p>5.5 tons</p>
        </td>
        <td>
          <p>8</p>
        </td>
      </tr>
    </table>
    <h1>Text styled as Heading 1</h1>
    <h2>Subheading</h2>
    <p>Here is a bulleted list:</p>
    <p>&bull; One</p>
    <p>&bull; Two</p>
    <p>&bull; Three</p>
    <p>Here is a simple numbered list:</p>
    <p>1. Car</p>
    <p>2. Truck</p>
    <p>Here is a little more elaborate list:</p>
    <p>1. Outer level</p>
    <p>1.1. Next indent</p>
    <p>1.1.1. Test1</p>
    <p>1.1.2. Test2</p>
    <p>1.1.3. Test3</p>
    <p>1.2. Next again</p>
    <p>1.2.1. Test-A</p>
    <p>1.2.2. Test-B</p>
    <p>1.2.3. Test-C</p>
    <p />
  </body>
</html>

In this screen-cast series, I am going to teach you how to write this transform.