Screen-Cast: Introducing the FormattingAssembler Module
When transforming a WordprocessingML document into HTML formatted with CSS, the key question is: what is the exact formatting of the tables, paragraphs, and runs in the document? If a run is formatted with a particular font and size, e.g. Tahoma 12pt or Calibri 11pt, and if it is bold or italic, we want to know that, and we want to carry that formatting over to the HTML/CSS.
But this question is not an easy one to answer. There are a number of aspects of the markup that add complexity. When the run itself is formatted with bold or italic, it is easy to find out. We just need to examine the w:rPr element of the run. However, the run might be formatted with a particular character style, and in order to find out whether the run is bold or not, we need to look at the style. Further, styles in WordprocessingML can derive from base styles, so in addition to looking at the style of the run, we need to examine the base style of the run’s style. And if no character style in the style chain defines some aspect of formatting, we may find it defined in the global run properties.
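To make the lookup order concrete, here is a minimal sketch in Python (standard library only) of resolving the run properties for a single run. It is an illustration under stated assumptions, not the FormattingAssembler implementation: the function names and the sample style IDs are hypothetical, it treats every property as a simple override, and it ignores the special rules for toggle properties as well as any run properties contributed by paragraph styles.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def w(name):
    """Return the namespace-qualified name that ElementTree expects."""
    return f"{{{W}}}{name}"

# Hypothetical styles part: document defaults plus a two-level style chain.
STYLES_XML = f"""
<w:styles xmlns:w="{W}">
  <w:docDefaults>
    <w:rPrDefault><w:rPr><w:sz w:val="22"/></w:rPr></w:rPrDefault>
  </w:docDefaults>
  <w:style w:type="character" w:styleId="Base">
    <w:rPr><w:b/></w:rPr>
  </w:style>
  <w:style w:type="character" w:styleId="Derived">
    <w:basedOn w:val="Base"/>
    <w:rPr><w:i/></w:rPr>
  </w:style>
</w:styles>
"""

def props_of(rpr):
    """Map each child of a w:rPr element to its w:val (True if no w:val)."""
    if rpr is None:
        return {}
    return {child.tag.split("}")[1]: child.get(w("val"), True) for child in rpr}

def resolve_run_props(styles_root, run):
    """Merge global defaults, the style chain, and direct run formatting."""
    by_id = {s.get(w("styleId")): s for s in styles_root.iter(w("style"))}
    defaults_path = f"{w('docDefaults')}/{w('rPrDefault')}/{w('rPr')}"
    # Lowest precedence: the global run properties.
    layers = [props_of(styles_root.find(defaults_path))]

    # Walk from the run's style up through its base styles.
    chain = []
    rstyle = run.find(f"{w('rPr')}/{w('rStyle')}")
    style_id = rstyle.get(w("val")) if rstyle is not None else None
    seen = set()  # guards against a circular basedOn chain
    while style_id is not None and style_id in by_id and style_id not in seen:
        seen.add(style_id)
        style = by_id[style_id]
        chain.append(props_of(style.find(w("rPr"))))
        based_on = style.find(w("basedOn"))
        style_id = based_on.get(w("val")) if based_on is not None else None
    layers.extend(reversed(chain))  # root base style first, run's style last

    # Highest precedence: formatting applied directly to the run.
    direct = props_of(run.find(w("rPr")))
    direct.pop("rStyle", None)  # the style reference is not a formatting property
    layers.append(direct)

    merged = {}
    for layer in layers:
        merged.update(layer)  # later layers override earlier ones
    return merged

styles = ET.fromstring(STYLES_XML)
run = ET.fromstring(
    f'<w:r xmlns:w="{W}"><w:rPr><w:rStyle w:val="Derived"/>'
    f'<w:color w:val="FF0000"/></w:rPr></w:r>'
)
print(resolve_run_props(styles, run))
# prints {'sz': '22', 'b': True, 'i': True, 'color': 'FF0000'}
```

The run in this example never mentions bold or a font size, yet both end up in the result: bold comes from the base style, and the size comes from the global run properties.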
We also, of course, want to find all aspects of paragraph styling, including indentation, and the white space before and after the paragraph. All of the same issues apply. Paragraph styles inherit from other paragraph styles. We also need to take global paragraph properties into account.
Tables are even more interesting. Table styles define formatting for various components of a table, including the first row, the last row, the first column, the last column, odd rows, even rows, odd columns, even columns, and the four cells in the corners of the table. Further, the styling for these components is applied in a specific order, and, even more interesting, each component of a table style may be conditionally applied for a given table. And if that were not enough, table styles can also inherit from other table styles.
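The following Python sketch shows the shape of that logic for a single cell: decide which table style components apply, in order, and then merge their properties so that later components override earlier ones. Treat the details as assumptions rather than as the behavior of FormattingAssembler or of Word: the ordering (whole table, column bands, row bands, first and last row, first and last column, then corners) reflects my reading of the specification, band size is fixed at one, and banding is simply skipped for cells in an enabled first or last row or column. The function names and the sample style are hypothetical; the component names are modeled on the w:tblStylePr types, and the flags on the attributes of w:tblLook.

```python
def applicable_components(row, col, n_rows, n_cols, look):
    """List the table style components for one cell, in application order."""
    in_first_row = look.get("firstRow", False) and row == 0
    in_last_row = look.get("lastRow", False) and row == n_rows - 1
    in_first_col = look.get("firstColumn", False) and col == 0
    in_last_col = look.get("lastColumn", False) and col == n_cols - 1

    parts = ["wholeTable"]
    # Column banding, counted from the first column that is not a header.
    if not look.get("noVBand", False) and not (in_first_col or in_last_col):
        band = col - (1 if look.get("firstColumn", False) else 0)
        parts.append("band1Vert" if band % 2 == 0 else "band2Vert")
    # Row banding, counted from the first row that is not a header.
    if not look.get("noHBand", False) and not (in_first_row or in_last_row):
        band = row - (1 if look.get("firstRow", False) else 0)
        parts.append("band1Horz" if band % 2 == 0 else "band2Horz")
    if in_first_row:
        parts.append("firstRow")
    if in_last_row:
        parts.append("lastRow")
    if in_first_col:
        parts.append("firstCol")
    if in_last_col:
        parts.append("lastCol")
    # Corner cells apply last, so they override everything else.
    if in_first_row and in_first_col:
        parts.append("nwCell")
    if in_first_row and in_last_col:
        parts.append("neCell")
    if in_last_row and in_first_col:
        parts.append("swCell")
    if in_last_row and in_last_col:
        parts.append("seCell")
    return parts

def cell_props(style, parts):
    """Merge component properties; later components override earlier ones."""
    merged = {}
    for part in parts:
        merged.update(style.get(part, {}))
    return merged

# Hypothetical table style and conditional-formatting flags.
style = {
    "wholeTable": {"shd": "FFFFFF"},
    "band1Horz": {"shd": "DCE6F1"},
    "firstRow": {"shd": "4F81BD", "b": True},
}
look = {"firstRow": True, "firstColumn": True, "noVBand": True}

header = applicable_components(0, 1, 3, 3, look)
body = applicable_components(1, 1, 3, 3, look)
print(header, cell_props(style, header))
print(body, cell_props(style, body))
```

With these flags, the header cell picks up the first-row shading and bold, while the body cell gets the shading of the first row band, which is the kind of per-cell answer a transform needs before it can emit CSS.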
Numbered and bulleted lists add their own special challenges. We can determine the text of the list item using the ListItemRetriever class from PowerTools for Open XML. The run properties of the paragraph define the list item formatting, but number formats can override this. Further, a number format can refer to a character style, and the list item will take on the character formatting of that character style. And of course, that character style can inherit from another character style.
As I said, it takes some work to determine the exact formatting of tables, paragraphs, and runs. When writing a high-fidelity transform to HTML, we really want to determine that formatting accurately. But combining this fairly involved logic with the transform to HTML isn’t a good idea; it puts too much functionality into a single transform.
There is an easier way.
I’ve written a new module, FormattingAssembler, which transforms a WordprocessingML document that contains paragraph, character, and table styles into a new document that contains no styles. The FormattingAssembler module goes through all of the applicable styles, and assembles the appropriate and exact paragraph and run properties for every paragraph and run in the document. These assembled paragraph and run properties are directly applied to the paragraphs and runs of the new document.
Ditto for tables. In the new document, tables do not use table styles. Every cell in the table has cell properties that consist of the appropriate cell properties taken from the original table style, its base style, and so on. The FormattingAssembler module takes conditional formatting into account, of course. The resulting table, although it does not use a table style, looks identical to the original table that uses a table style.
Once we have transformed a document that contains references to styles into a new document that contains no references to styles, then writing the transform to HTML/CSS is a far easier proposition. The resulting HTML document will look very much like the original WordprocessingML document.
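To illustrate why the second step becomes simpler, here is a minimal Python sketch that maps the run properties of an already flattened run to a CSS declaration string. It assumes that every property it needs is present directly in the w:rPr element, so it never consults a style. The function names are hypothetical, and only a handful of properties are handled; the one conversion worth noting is that w:sz is expressed in half-points.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def w(name):
    """Return the namespace-qualified name that ElementTree expects."""
    return f"{{{W}}}{name}"

def is_on(rpr, name):
    """True if a toggle element such as w:b is present and not switched off."""
    el = rpr.find(w(name))
    return el is not None and el.get(w("val"), "true") not in ("0", "false", "off")

def run_css(rpr):
    """Build a CSS declaration string from directly applied run properties."""
    css = []
    fonts = rpr.find(w("rFonts"))
    if fonts is not None and fonts.get(w("ascii")):
        family = fonts.get(w("ascii"))
        css.append(f"font-family: '{family}'")
    size = rpr.find(w("sz"))
    if size is not None:
        points = int(size.get(w("val"))) / 2  # w:sz is in half-points
        css.append(f"font-size: {points:g}pt")
    if is_on(rpr, "b"):
        css.append("font-weight: bold")
    if is_on(rpr, "i"):
        css.append("font-style: italic")
    color = rpr.find(w("color"))
    if color is not None and color.get(w("val"), "auto") != "auto":
        css.append(f"color: #{color.get(w('val'))}")
    return "; ".join(css)

rpr = ET.fromstring(
    f'<w:rPr xmlns:w="{W}"><w:rFonts w:ascii="Tahoma"/><w:b/>'
    f'<w:i w:val="0"/><w:sz w:val="24"/></w:rPr>'
)
print(run_css(rpr))
# prints font-family: 'Tahoma'; font-size: 12pt; font-weight: bold
```

Because the input carries all of its formatting directly, the mapping is a flat, property-by-property translation with no style lookups at all.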
The following screen-cast demonstrates the FormattingAssembler module. It also explains some of the nuances of how it works.
Installation instructions: Open Xml Installation Center