Transform DOCX to HTML/CSS with High-Fidelity using PowerTools for Open XML
Today I am happy to announce the release of HtmlConverter version 2.06.00, which is a high fidelity conversion from DOCX to HTML/CSS. HtmlConverter is a module in the Open-Xml-PowerTools project.
Here is a short video that demonstrates HtmlConverter (aka WmlToHtmlConverter) in action:
Developers often ask for guidance on how to transform DOCX to rich HTML/CSS on the forums at OpenXmlDeveloper.org, the MSDN support forums, and StackOverflow. In the past, many developers have put in a lot of effort and have written this transformation with just enough features to support their particular scenario. The HtmlConverter class provides a great head-start for putting together a customized transform from DOCX to HTML.
PowerTools for Open XML is licensed under the Microsoft Public License (Ms-PL), which gives you wide latitude in how you use the code, including its use in commercial products and open source projects.
HtmlConverter.cs 2.06.00 supports:
- Paragraph styles, character styles, and table styles, including styles that are based on other styles.
- Table style support includes conditional table style options (header row, total row, banded rows, first column, last column, and banded columns).
- Fonts, including font styles such as bold, italic, underline, strikethrough, foreground and background colors, shading, subscript, superscript, and more. HtmlConverter is, in effect, guidance on how to correctly determine the font and formatting for each paragraph and text run in a document.
- Numbered and bulleted lists. Current support is only for en-US and fr-FR; however, HtmlConverter is factored and parameterized so that you can support other languages without altering the source code. In the near future, I’ll be publishing guidance and instructions on how to support additional languages. I’ll also be asking for volunteers to write and contribute the bits of code that generate cardinal (one, two, three) and ordinal (first, second, third) list item text for your native language, as well as for the various Asian and RTL numbering systems.
- Tabs, including left tabs, right tabs, centered tabs, and decimal tabs. HtmlConverter takes the approach of using font metrics to calculate the exact width of the various pieces of text in a line, and inserts <span> elements with precisely calculated widths.
- High fidelity support for vertical white space and horizontal white space, including indented text, hanging indents, centered text, right justified text, and justified text.
- Borders around paragraphs, and high fidelity for borders of tables.
- Horizontally and vertically merged cells in tables.
- External hyperlinks, and internal hyperlinks to bookmarks within the document.
- You have much more control over the conversion than with other approaches to converting to HTML. There are already a number of parameters that enable you to control the transformation, and in the future I’ll be adding many more knobs and levers to fine-tune the conversion. And of course, you have the source code, so you can customize the conversion for your scenario.
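To make the tab approach in the list above concrete, here is a minimal Python sketch of the idea: measure the text on each side of a tab, then emit a <span> whose width pushes the following text to the tab stop. This is not HtmlConverter's code. The function names, the (position, kind) tuple layout, and the fixed 6pt character width are assumptions made for illustration; a real implementation would measure text with actual font metrics.

```python
from html import escape

CHAR_WIDTH_PT = 6.0      # assumption: stand-in for real font metrics
DEFAULT_TAB_PT = 36.0    # fallback interval (half an inch) when no custom stop remains


def text_width(text):
    """Hypothetical measure function: every character is CHAR_WIDTH_PT wide."""
    return len(text) * CHAR_WIDTH_PT


def layout_tabbed_line(line, tab_stops):
    """Replace each tab in `line` with a fixed-width <span>.

    tab_stops is a list of (position_pt, kind) tuples sorted by position,
    where kind is "left", "right", "center", or "decimal".
    """
    pieces = line.split("\t")
    html = [escape(pieces[0])]
    x = text_width(pieces[0])  # current horizontal position in points
    for piece in pieces[1:]:
        # Pick the first tab stop strictly to the right of the current position.
        stop = next((s for s in tab_stops if s[0] > x), None)
        if stop is None:
            stop = ((x // DEFAULT_TAB_PT + 1) * DEFAULT_TAB_PT, "left")
        pos, kind = stop
        width = text_width(piece)
        if kind == "right":
            start = pos - width        # text ends at the stop
        elif kind == "center":
            start = pos - width / 2    # text is centered on the stop
        elif kind == "decimal":
            # Align the decimal point (or the end of the text) on the stop.
            start = pos - text_width(piece.partition(".")[0])
        else:
            start = pos                # left tab: text starts at the stop
        gap = max(start - x, 0.0)
        html.append(
            f'<span style="display: inline-block; width: {gap:.1f}pt"></span>'
        )
        html.append(escape(piece))
        x += gap + width
    return "".join(html)


print(layout_tabbed_line("Item\t12.50", [(120.0, "right")]))
```

With a right tab at 120pt, "Item" occupies 24pt and "12.50" occupies 30pt under the assumed metrics, so the span is 66pt wide and the number ends exactly at the stop.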
Here are a couple of implementation notes:
- This transform does not attempt to support down-level browsers. It does not generate HTML attributes that are more properly supported by CSS.
- The projects that are in the OpenXmlPowerTools-2-06-00.zip file are compatible with Visual Studio 2012 and Visual Studio 2013. PowerTools for Open XML will work with Visual Studio 2010, but you will need to put together your own Visual Studio project.
There are some key features that I haven’t implemented yet:
- Full support for Asian and RTL languages. They partially work, but my goal is for this transform to be accurate for all languages that Word supports.
- Support for Asian, RTL, and other numbering systems, such as:
  - aiueo (AIUEO Order Half-Width Katakana)
  - arabicAbjad (Arabic Abjad Numerals)
  - chineseCountingThousand (Chinese Counting Thousand System)
  (When I organize the implementation of all of these numbering systems, I will make a list and ask for volunteers for specific languages, and track those languages that are in the process of being implemented. We don’t want to duplicate efforts!)
- Support for numbering in Latin languages other than English and French. I had to implement two languages just to make sure that my infrastructure for parameterizing list item generation was implemented properly, so I dusted off my high-school French and wrote an implementation. Again, I’ll ask for volunteers for languages other than French and English.
- Floating and anchored text boxes – this is an important feature, and I want to make sure that it is implemented in as accurate a fashion as possible.
- Display of comments. My desired design is to have an indicator of the position of each comment, and when you hover over that location, the comment is displayed in an appropriately sized tooltip. Alternatively, I may display comments in a bubble to the right, as Word does. I think that with HTML5/CSS3, I can make a pretty slick implementation.
- Display of content controls – this isn’t trivial, but it also isn’t too awfully hard, and enables some interesting scenarios. You may want to display some contextual information when the mouse hovers over a content control.
- Sections, multi-column layout, and pagination – I would not attempt to match the pagination of Word, but it might be an interesting and pleasing way to view documents. What do you think?
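For anyone thinking about contributing a language, the following Python sketch shows roughly what a per-language list item implementation involves: one function for cardinal text, one for ordinal text, and a lookup keyed by language and numbering format. The format names cardinalText and ordinalText come from WordprocessingML; everything else here (the registry dictionary, the function names, lower-case output, the 1 to 99 range) is a hypothetical illustration, not HtmlConverter's actual extension point.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}


def english_cardinal(n):
    """Return 'one', 'two', ..., 'ninety-nine' (this sketch covers 1 to 99 only)."""
    if not 1 <= n <= 99:
        raise ValueError("sketch only covers 1..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def english_ordinal(n):
    """Return 'first', 'second', ..., by converting the last cardinal word."""
    head, sep, last = english_cardinal(n).rpartition("-")
    if last in IRREGULAR_ORDINALS:
        last = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"   # twenty -> twentieth
    else:
        last += "th"                # four -> fourth
    return head + sep + last


# Hypothetical registry: language tag -> numbering format -> implementation.
LIST_ITEM_IMPLEMENTATIONS = {
    "en-US": {"cardinalText": english_cardinal, "ordinalText": english_ordinal},
}


def list_item_text(language, number_format, n):
    """Look up the implementation; fall back to plain digits if none exists."""
    try:
        implementation = LIST_ITEM_IMPLEMENTATIONS[language][number_format]
    except KeyError:
        return str(n)
    return implementation(n)


print(list_item_text("en-US", "ordinalText", 21))
```

Supporting another language in this sketch means adding one more entry to the registry, which mirrors the goal of adding languages without touching the converter itself.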
If you have opinions about the order of implementation of these features, or if you know of other features that should be on this list, please file an issue on GitHub at OfficeDev/Open-Xml-PowerTools.
There are some features that I currently don’t plan to support:
- Smart Art – this would require writing a complete rendering engine for Smart Art – high effort, low ROI, in my opinion.
- Charts – this would require writing a charting module – high effort, but this is a highly used feature, so it would be interesting to do.
- Embedded spreadsheets – supporting an embedded spreadsheet would require writing a transform from SpreadsheetML to HTML, something that is on my radar, but probably not this year.
- Embedded WordprocessingML documents – this could be doable, but I personally prefer not to use this feature. Let me know if this is important to you.
- MathML – would require writing a rendering engine, probably not for some time.
I am certain that I have missed some aspects of this conversion, and I’ll be fine-tuning it for some time to come. If you find aspects that do not convert correctly, please log the issue, along with a sample document, at https://github.com/OfficeDev/Open-Xml-PowerTools.