Archive for LINQ to XML

LINQ to XML for JavaScript – Gaining Perf thru Atomization

LINQ to XML for JavaScript gets good performance the same way LINQ to XML for .NET does: through atomization. Read more: http://openxmldeveloper.org/home2/bm8qcmjy/public_html/blog/b/openxmldeveloper/archive/2012/10/24/linq-to-xml-for-javascript-gaining-performance-through-atomization.aspx
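
To make the idea of atomization concrete, here is a minimal C# sketch on the .NET side. In LINQ to XML for .NET, XName and XNamespace objects are atomized, so two names with the same namespace and local name are the same object, and comparing names is a cheap reference comparison. The namespace URI and the class and method names below are made up for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class AtomizationDemo
{
    // Returns true when two separately obtained XName values are the same object.
    public static bool NamesShareInstance(string ns, string localName)
    {
        XName first = XName.Get(localName, ns);
        XName second = XNamespace.Get(ns) + localName;
        return object.ReferenceEquals(first, second);
    }

    public static void Main()
    {
        // "http://example.com/ns" is a hypothetical namespace used only for this demo.
        Console.WriteLine(NamesShareInstance("http://example.com/ns", "Paragraph"));

        // Names created by the constructor and by the parser are also shared.
        XElement a = new XElement("Run");
        XElement b = XElement.Parse("<Run/>");
        Console.WriteLine(object.ReferenceEquals(a.Name, b.Name));
    }
}
```

Because equal names share one instance, a query such as Elements("Run") can compare references instead of comparing strings character by character.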

Introducing a new class for PowerTools for Open XML: TextReplacer

Recently I wrote some code that implemented search-and-replace for Open XML WordprocessingML documents.  I wrote that code for an Open XML developer who needed to implement that functionality using XML DOM, although in a language other than C#.  Because XML DOM is standardized, translating the code to another language and another implementation of XML DOM is relatively straightforward.

I want to introduce search-and-replace functionality in a cmdlet in PowerTools for Open XML, but I have been moving PowerTools code away from XmlDocument, so I rewrote the search-and-replace code in LINQ to XML as a functional transform.  It was an interesting and fun project.  The video below introduces the TextReplacer class and compares it to the code I presented that uses XmlDocument.  It is an interesting comparison of imperative code (using XmlDocument) and functional code (using LINQ to XML).
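
For readers who have not seen the functional-transform style, here is a minimal sketch of its shape. This is not the TextReplacer class; ReplaceText is a hypothetical helper written for this illustration. It does not attempt the harder part of the real problem, which is matching text that is split across several runs. It only shows how a recursive function can build a new tree and leave the source tree untouched.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class FunctionalTransformDemo
{
    // Hypothetical helper (not the TextReplacer API): returns a new node in which
    // every text node has 'search' replaced by 'replace'. The source is not modified.
    public static XNode ReplaceText(XNode node, string search, string replace)
    {
        XElement element = node as XElement;
        if (element != null)
            return new XElement(element.Name,
                element.Attributes(),
                element.Nodes().Select(n => ReplaceText(n, search, replace)));

        // XCData derives from XText, so check for it first to keep CDATA as CDATA.
        XCData cdata = node as XCData;
        if (cdata != null)
            return new XCData(cdata.Value.Replace(search, replace));

        XText text = node as XText;
        if (text != null)
            return new XText(text.Value.Replace(search, replace));

        // Comments, processing instructions, etc. are cloned when added to the new tree.
        return node;
    }

    public static void Main()
    {
        XElement source = XElement.Parse("<p><r>Hello world</r><r>world peace</r></p>");
        XNode result = ReplaceText(source, "world", "there");
        Console.WriteLine(result);
        Console.WriteLine(source); // the original tree is unchanged
    }
}
```

The imperative XmlDocument approach would instead walk the tree and modify nodes in place.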

You can download the TextReplacer class from this blog post (in an attachment at the bottom).

Introduces TextReplacer, which is LINQ to XML code that replaces text in WordprocessingML documents.

Custom Formatting of XML using LINQ to XML

On StackOverflow, there is a question (posted by Otaku, an online friend of mine for some time) about how to serialize multiple XML elements on the same line.  It is a very interesting question.  After going down a couple of dead ends, I realized that it is pretty easy to iterate through an XML tree and write everything to an XmlWriter explicitly, bypassing LINQ to XML’s own logic for serializing through an XmlWriter.  This lets us do just about anything we want with the indentation of the XML, while still letting the XmlWriter class do all of the serializing of the XML itself.  Some folks at StackOverflow suggested post-processing the XML, but I know from hard experience that it is very difficult to post-process XML and really get it right, including handling CDATA sections and the like.  By letting the XmlWriter class do all of the output of XML, while injecting just a bit of white space in the right places, we can be confident that the XML is well formed.

He has XML that looks like this:


<Canvas>
  <Grid>
    <TextBlock>
      <Run Text="r"/>
      <Run Text="u"/>
      <Run Text="n"/>
    </TextBlock>
    <TextBlock>
      <Run Text="far a"/>
      <Run Text="way"/>
      <Run Text=" from me"/>
    </TextBlock>
  </Grid>
  <Grid>
    <TextBlock>
      <Run Text="I"/>
      <Run Text=" "/>
      <Run Text="want"/>
      <LineBreak/>
    </TextBlock>
    <TextBlock>
      <LineBreak/>
      <Run Text="...thi"/>
      <Run Text="s to"/>
      <LineBreak/>
      <Run Text=" work"/>
    </TextBlock>
  </Grid>
</Canvas>

He wants to format it so that it looks like this:


<Canvas>
  <Grid>
    <TextBlock>
      <Run Text="r"/><Run Text="u"/><Run Text="n"/>
    </TextBlock>
    <TextBlock>
      <Run Text="far a"/><Run Text="way"/><Run Text=" from me"/>
    </TextBlock>
  </Grid>
  <Grid>
    <TextBlock>
      <Run Text="I"/><Run Text=" "/><Run Text="want"/>
      <LineBreak/>
    </TextBlock>
    <TextBlock>
      <LineBreak/>
      <Run Text="...thi"/><Run Text="s to"/>
      <LineBreak/>
      <Run Text=" work"/>
    </TextBlock>
  </Grid>
</Canvas>

He wants to do this because of some fairly obscure semantics of XAML for Silverlight 3.  Read his question on StackOverflow for more detail.

I posted code on StackOverflow that shows how to do that specialized serialization using VB.NET.  I actually wrote the code in C# first and, after getting it all working, translated it to VB.NET.  This post presents the C# code.

The key to solving this problem is to write a recursive function that iterates through the XML tree, writing the various elements and attributes to specially created XmlWriter objects.  There is an ‘outer’ XmlWriter object that writes indented XML, and an ‘inner’ XmlWriter object that writes non-indented XML.

The recursive function initially uses the ‘outer’ XmlWriter, writing indented XML, until it sees the TextBlock element (an element that triggers a desired change in the indenting behavior).  When it encounters the TextBlock element, it creates the ‘inner’ XmlWriter object, writing the child elements of the TextBlock element to it.  It also writes custom white space to the ‘inner’ XmlWriter.

When the ‘inner’ XmlWriter object is finished with writing the TextBlock element, the text that the ‘inner’ writer wrote is written to the ‘outer’ XmlWriter using the WriteRaw method.
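
The hand-off between the two writers can be sketched in isolation. This is a stripped-down illustration, not the full solution: the element names are taken from the example XML, the class and method names are made up, and the indentation is hard-coded rather than computed from the element's depth.

```csharp
using System;
using System.Text;
using System.Xml;

public static class InnerOuterWriterDemo
{
    public static string Build()
    {
        // 'Inner' writer: no indentation, fragment conformance, output captured in a StringBuilder.
        StringBuilder innerText = new StringBuilder();
        XmlWriterSettings innerSettings = new XmlWriterSettings();
        innerSettings.Indent = false;
        innerSettings.OmitXmlDeclaration = true;
        innerSettings.ConformanceLevel = ConformanceLevel.Fragment;
        using (XmlWriter inner = XmlWriter.Create(innerText, innerSettings))
        {
            inner.WriteStartElement("Run");
            inner.WriteAttributeString("Text", "r");
            inner.WriteEndElement();
            inner.WriteStartElement("Run");
            inner.WriteAttributeString("Text", "u");
            inner.WriteEndElement();
        }

        // 'Outer' writer: indents normally, and receives the inner text verbatim via WriteRaw.
        StringBuilder outerText = new StringBuilder();
        XmlWriterSettings outerSettings = new XmlWriterSettings();
        outerSettings.Indent = true;
        outerSettings.OmitXmlDeclaration = true;
        using (XmlWriter outer = XmlWriter.Create(outerText, outerSettings))
        {
            outer.WriteStartElement("TextBlock");
            outer.WriteRaw(Environment.NewLine + "  ");   // custom white space before the runs
            outer.WriteRaw(innerText.ToString());          // runs stay on a single line
            outer.WriteRaw(Environment.NewLine);
            outer.WriteEndElement();
        }
        return outerText.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build());
    }
}
```

Note that WriteRaw does no escaping or checking, so it is only safe here because the raw text was itself produced by an XmlWriter.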

As I mentioned, the advantage of this approach is that there is no post-processing of the XML.  It is extremely difficult to post-process XML and be certain that you have properly handled all cases, including arbitrary text in CDATA sections.  All of the XML is written using only the XmlWriter class, which ensures that the output is always well-formed XML.  The only exception is the specially crafted white space that is written using the WriteRaw method, which achieves the desired indenting behavior.

One key point is that the ‘inner’ XmlWriter object’s conformance level is set to ConformanceLevel.Fragment, because the ‘inner’ XmlWriter needs to write XML that does not have a root element.
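
A small experiment shows why the conformance level matters. The sketch below (class and method names are made up for this illustration) writes two sibling elements with no root. With the default ConformanceLevel.Document the writer refuses the second top-level element by throwing InvalidOperationException; with ConformanceLevel.Fragment it accepts both.

```csharp
using System;
using System.Text;
using System.Xml;

public static class FragmentConformanceDemo
{
    // Tries to write two sibling top-level elements with the given conformance level.
    // Returns the text written, or null if the writer rejected the second element.
    public static string WriteTwoTopLevelElements(ConformanceLevel level)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;
        settings.ConformanceLevel = level;
        StringBuilder sb = new StringBuilder();
        try
        {
            using (XmlWriter writer = XmlWriter.Create(sb, settings))
            {
                writer.WriteStartElement("Run");
                writer.WriteEndElement();
                writer.WriteStartElement("LineBreak"); // second top-level element
                writer.WriteEndElement();
            }
        }
        catch (InvalidOperationException)
        {
            return null;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(WriteTwoTopLevelElements(ConformanceLevel.Document) ?? "(rejected)");
        Console.WriteLine(WriteTwoTopLevelElements(ConformanceLevel.Fragment) ?? "(rejected)");
    }
}
```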

To achieve the desired formatting of Run elements (i.e. adjacent Run elements have no insignificant white space between them), the code uses the GroupAdjacent extension method.  Some time ago, I wrote a blog post on the GroupAdjacent extension method.
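
To see how adjacent grouping differs from the standard GroupBy, here is a simplified stand-in. GroupRuns is a hypothetical helper written for this illustration, not the GroupAdjacent method itself. It starts a new group every time the key changes, so items with the same key that are separated by something else end up in different groups.

```csharp
using System;
using System.Collections.Generic;

public static class GroupAdjacentDemo
{
    // Simplified stand-in for GroupAdjacent: each entry pairs a key with the
    // run of consecutive items that produced that key.
    public static List<KeyValuePair<TKey, List<T>>> GroupRuns<T, TKey>(
        IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        EqualityComparer<TKey> comparer = EqualityComparer<TKey>.Default;
        var result = new List<KeyValuePair<TKey, List<T>>>();
        foreach (T item in source)
        {
            TKey key = keySelector(item);
            // Start a new group on the first item or whenever the key changes.
            if (result.Count == 0 || !comparer.Equals(result[result.Count - 1].Key, key))
                result.Add(new KeyValuePair<TKey, List<T>>(key, new List<T>()));
            result[result.Count - 1].Value.Add(item);
        }
        return result;
    }

    public static void Main()
    {
        string[] names = { "Run", "Run", "LineBreak", "Run" };
        foreach (var group in GroupRuns(names, n => n == "Run"))
            Console.WriteLine(group.Key + ": " + string.Join(",", group.Value));
    }
}
```

For this input the sketch yields three groups (two runs, the line break, one run), whereas GroupBy would merge all three Run items into one group and lose their position relative to the line break.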

Here is the C# code to do the specialized formatting:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class Extensions
{
    public static IEnumerable<IGrouping<TKey, TSource>> GroupAdjacent<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        TKey last = default(TKey);
        bool haveLast = false;
        List<TSource> list = new List<TSource>();

        foreach (TSource s in source)
        {
            TKey k = keySelector(s);
            if (haveLast)
            {
                if (!k.Equals(last))
                {
                    yield return new GroupOfAdjacent<TSource, TKey>(list, last);
                    list = new List<TSource>();
                    list.Add(s);
                    last = k;
                }
                else
                {
                    list.Add(s);
                    last = k;
                }
            }
            else
            {
                list.Add(s);
                last = k;
                haveLast = true;
            }
        }
        if (haveLast)
            yield return new GroupOfAdjacent<TSource, TKey>(list, last);
    }
}

public class GroupOfAdjacent<TSource, TKey> : IEnumerable<TSource>, IGrouping<TKey, TSource>
{
    public TKey Key { get; set; }
    private List<TSource> GroupList { get; set; }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return ((System.Collections.Generic.IEnumerable<TSource>)this).GetEnumerator();
    }

    System.Collections.Generic.IEnumerator<TSource>
        System.Collections.Generic.IEnumerable<TSource>.GetEnumerator()
    {
        foreach (var s in GroupList)
            yield return s;
    }

    public GroupOfAdjacent(List<TSource> source, TKey key)
    {
        GroupList = source;
        Key = key;
    }
}

class Program
{
    static void WriteStartElement(XmlWriter writer, XElement e)
    {
        XNamespace ns = e.Name.Namespace;
        writer.WriteStartElement(e.GetPrefixOfNamespace(ns),
            e.Name.LocalName, ns.NamespaceName);
        foreach (var a in e.Attributes())
        {
            ns = a.Name.Namespace;
            string localName = a.Name.LocalName;
            string namespaceName = ns.NamespaceName;
            writer.WriteAttributeString(
                e.GetPrefixOfNamespace(ns),
                localName,
                namespaceName.Length == 0 && localName == "xmlns" ?
                    XNamespace.Xmlns.NamespaceName :
                    namespaceName,
                a.Value);
        }
    }

    public static void WriteElement(XmlWriter writer, XElement e)
    {
        if (e.Name == "TextBlock")
        {
            WriteStartElement(writer, e);
            writer.WriteRaw(Environment.NewLine);

            // Create an XML writer that outputs no insignificant white space so that we can
            // write to it and explicitly control white space.
            XmlWriterSettings settings = new XmlWriterSettings();
            settings.Indent = false;
            settings.OmitXmlDeclaration = true;
            settings.ConformanceLevel = ConformanceLevel.Fragment;
            StringBuilder sb = new StringBuilder();
            using (XmlWriter newXmlWriter = XmlWriter.Create(sb, settings))
            {
                // Group adjacent runs so that they can be output with no whitespace between them
                var groupedRuns = e.Nodes().GroupAdjacent(n =>
                {
                    XElement element = n as XElement;
                    if (element != null && element.Name == "Run")
                        return true;
                    return false;
                });
                foreach (var g in groupedRuns)
                {
                    if (g.Key == true)
                    {
                        // Write white space so that the line of Run elements is properly indented.
                        newXmlWriter.WriteRaw("".PadRight((e.Ancestors().Count() + 1) * 2));
                        foreach (var run in g)
                            run.WriteTo(newXmlWriter);
                        newXmlWriter.WriteRaw(Environment.NewLine);
                    }
                    else
                    {
                        foreach (var g2 in g)
                        {
                            // Write some white space so that each child element is properly indented.
                            newXmlWriter.WriteRaw("".PadRight((e.Ancestors().Count() + 1) * 2));
                            g2.WriteTo(newXmlWriter);
                            newXmlWriter.WriteRaw(Environment.NewLine);
                        }
                    }
                }
            }
            writer.WriteRaw(sb.ToString());
            writer.WriteRaw("".PadRight(e.Ancestors().Count() * 2));
            writer.WriteEndElement();
        }
        else
        {
            WriteStartElement(writer, e);
            foreach (var n in e.Nodes())
            {
                XElement element = n as XElement;
                if (element != null)
                {
                    WriteElement(writer, element);
                    continue;
                }
                n.WriteTo(writer);
            }
            writer.WriteEndElement();
        }
    }

    static string ToStringWithCustomWhiteSpace(XElement element)
    {
        // Create XmlWriter that indents.
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        settings.OmitXmlDeclaration = true;
        StringBuilder sb = new StringBuilder();
        using (XmlWriter xmlWriter = XmlWriter.Create(sb, settings))
            WriteElement(xmlWriter, element);
        return sb.ToString();
    }

    static void Main(string[] args)
    {
        XElement root = XElement.Parse(
@"<Canvas a='1'>
  <Grid>
    <TextBlock>
      <Run Text='r'/>
      <Run Text='u'/>
      <Run Text='n'/>
    </TextBlock>
    <TextBlock>
      <Run Text='far a'/>
      <Run Text='way'/>
      <Run Text=' from me'/>
    </TextBlock>
  </Grid>
  <Grid>
    <TextBlock>
      <Run Text='I'/>
      <Run Text=' '/>
      <Run Text='want'/>
      <LineBreak/>
    </TextBlock>
    <TextBlock>
      <LineBreak/>
      <Run Text='...thi'/>
      <Run Text='s to'/>
      <LineBreak/>
      <Run Text=' work'/>
    </TextBlock>
  </Grid>
</Canvas>");
        Console.WriteLine(ToStringWithCustomWhiteSpace(root));
    }
}

And for completeness, here is the VB code:


Imports System.Collections.Generic
Imports System.Linq
Imports System.Text
Imports System.Xml
Imports System.Xml.Linq

Public Class GroupOfAdjacent(Of TElement, TKey)
    Implements IEnumerable(Of TElement)

    Private _key As TKey
    Private _groupList As List(Of TElement)

    Public Property GroupList() As List(Of TElement)
        Get
            Return _groupList
        End Get
        Set(ByVal value As List(Of TElement))
            _groupList = value
        End Set
    End Property

    Public ReadOnly Property Key() As TKey
        Get
            Return _key
        End Get
    End Property

    Public Function GetEnumerator() As System.Collections.Generic.IEnumerator(Of TElement) _
            Implements System.Collections.Generic.IEnumerable(Of TElement).GetEnumerator
        Return _groupList.GetEnumerator
    End Function

    Public Function GetEnumerator1() As System.Collections.IEnumerator _
            Implements System.Collections.IEnumerable.GetEnumerator
        Return _groupList.GetEnumerator
    End Function

    Public Sub New(ByVal key As TKey)
        _key = key
        _groupList = New List(Of TElement)
    End Sub
End Class

Module Module1
    <System.Runtime.CompilerServices.Extension()> _
    Public Function GroupAdjacent(Of TElement, TKey)(ByVal source As IEnumerable(Of TElement), _
                ByVal keySelector As Func(Of TElement, TKey)) As List(Of GroupOfAdjacent(Of TElement, TKey))
        Dim lastKey As TKey = Nothing
        Dim currentGroup As GroupOfAdjacent(Of TElement, TKey) = Nothing
        Dim allGroups As List(Of GroupOfAdjacent(Of TElement, TKey)) = New List(Of GroupOfAdjacent(Of TElement, TKey))()
        For Each item In source
            Dim thisKey As TKey = keySelector(item)
            If lastKey IsNot Nothing And Not thisKey.Equals(lastKey) Then
                allGroups.Add(currentGroup)
            End If
            If Not thisKey.Equals(lastKey) Then
                currentGroup = New GroupOfAdjacent(Of TElement, TKey)(keySelector(item))
            End If
            currentGroup.GroupList.Add(item)
            lastKey = thisKey
        Next
        If lastKey IsNot Nothing Then
            allGroups.Add(currentGroup)
        End If
        Return allGroups
    End Function

    Public Sub WriteStartElement(ByVal writer As XmlWriter, ByVal e As XElement)
        Dim ns As XNamespace = e.Name.Namespace
        writer.WriteStartElement(e.GetPrefixOfNamespace(ns), _
            e.Name.LocalName, ns.NamespaceName)
        For Each a In e.Attributes
            ns = a.Name.Namespace
            Dim localName As String = a.Name.LocalName
            Dim namespaceName As String = ns.NamespaceName
            writer.WriteAttributeString( _
                e.GetPrefixOfNamespace(ns), _
                localName, _
                If(namespaceName.Length = 0 AndAlso localName = "xmlns", _
                    XNamespace.Xmlns.NamespaceName, namespaceName), _
                a.Value)
        Next
    End Sub

    Public Sub WriteElement(ByVal writer As XmlWriter, ByVal e As XElement)
        If (e.Name = "TextBlock") Then
            WriteStartElement(writer, e)
            writer.WriteRaw(Environment.NewLine)

            ' Create an XML writer that outputs no insignificant white space so that we can
            ' write to it and explicitly control white space.
            Dim settings As XmlWriterSettings = New XmlWriterSettings()
            settings.Indent = False
            settings.OmitXmlDeclaration = True
            settings.ConformanceLevel = ConformanceLevel.Fragment
            Dim sb As StringBuilder = New StringBuilder()
            Using newXmlWriter As XmlWriter = XmlWriter.Create(sb, settings)
                ' Group adjacent runs so that they can be output with no whitespace between them
                Dim groupedRuns = e.Nodes().GroupAdjacent( _
                    Function(n) As Boolean?
                        If TypeOf n Is XElement Then
                            Dim element As XElement = n
                            If element.Name = "Run" Then
                                Return True
                            End If
                            Return False
                        End If
                        Return False
                    End Function)
                For Each g In groupedRuns
                    If g.Key = True Then
                        ' Write white space so that the line of Run elements is properly indented.
                        newXmlWriter.WriteRaw("".PadRight((e.Ancestors().Count() + 1) * 2))
                        For Each run In g
                            run.WriteTo(newXmlWriter)
                        Next
                        newXmlWriter.WriteRaw(Environment.NewLine)
                    Else
                        For Each g2 In g
                            ' Write some white space so that each child element is properly indented.
                            newXmlWriter.WriteRaw("".PadRight((e.Ancestors().Count() + 1) * 2))
                            g2.WriteTo(newXmlWriter)
                            newXmlWriter.WriteRaw(Environment.NewLine)
                        Next
                    End If
                Next
            End Using
            writer.WriteRaw(sb.ToString())
            writer.WriteRaw("".PadRight(e.Ancestors().Count() * 2))
            writer.WriteEndElement()
        Else
            WriteStartElement(writer, e)
            For Each n In e.Nodes
                If TypeOf n Is XElement Then
                    Dim element As XElement = DirectCast(n, XElement)
                    WriteElement(writer, element)
                    Continue For
                End If
                n.WriteTo(writer)
            Next
            writer.WriteEndElement()
        End If
    End Sub

    Function ToStringWithCustomWhiteSpace(ByVal element As XElement) As String
        ' Create XmlWriter that indents.
        Dim settings As XmlWriterSettings = New XmlWriterSettings()
        settings.Indent = True
        settings.OmitXmlDeclaration = True
        Dim sb As StringBuilder = New StringBuilder()
        ' Name the variable differently from the XmlWriter type; VB is case-insensitive.
        Using outerWriter As XmlWriter = XmlWriter.Create(sb, settings)
            WriteElement(outerWriter, element)
        End Using
        Return sb.ToString()
    End Function

    Sub Main()
        Dim myXML As XElement = _
            <Canvas>
                <Grid>
                    <TextBlock>
                        <Run Text='r'/>
                        <Run Text='u'/>
                        <Run Text='n'/>
                    </TextBlock>
                    <TextBlock>
                        <Run Text='far a'/>
                        <Run Text='way'/>
                        <Run Text=' from me'/>
                    </TextBlock>
                </Grid>
                <Grid>
                    <TextBlock>
                        <Run Text='I'/>
                        <Run Text=' '/>
                        <Run Text='want'/>
                        <LineBreak/>
                    </TextBlock>
                    <TextBlock>
                        <LineBreak/>
                        <Run Text='...thi'/>
                        <Run Text='s to'/>
                        <LineBreak/>
                        <Run Text=' work'/>
                    </TextBlock>
                </Grid>
            </Canvas>
        Console.Write(ToStringWithCustomWhiteSpace(myXML))
        Console.ReadLine()
    End Sub

End Module

Align Attributes when Formatting XML using LINQ to XML

A few years ago, I wrote a blog post that showed how to align attributes when formatting XML using LINQ to XML. Here is an extension method that uses that technique.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication1
{
    public static class Extensions
    {
        public static string ToStringAlignAttributes(this XContainer xContainer)
        {
            XmlWriterSettings settings = new XmlWriterSettings();
            settings.Indent = true;
            settings.OmitXmlDeclaration = true;
            settings.NewLineOnAttributes = true;
            StringBuilder sb = new StringBuilder();
            using (XmlWriter xmlWriter = XmlWriter.Create(sb, settings))
                xContainer.WriteTo(xmlWriter);
            return sb.ToString();
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            XDocument doc = new XDocument(
                new XElement("Root",
                    new XAttribute("att1", 1),
                    new XAttribute("att2", 2),
                    new XAttribute("att3", 3),
                    new XElement("Child",
                        new XAttribute("att1", 1),
                        new XAttribute("att2", 2),
                        new XAttribute("att3", 3))));
            Console.WriteLine(doc.ToStringAlignAttributes());

            XElement el = new XElement("Root",
                new XAttribute("att1", 1),
                new XAttribute("att2", 2),
                new XAttribute("att3", 3),
                new XElement("Child",
                    new XAttribute("att1", 1),
                    new XAttribute("att2", 2),
                    new XAttribute("att3", 3)));
            Console.WriteLine(el.ToStringAlignAttributes());
        }
    }
}

Update: May 5, 2011 – I initially wrote a fancier version of this, but as it turns out, I got it wrong: it didn’t properly indent some XML documents, so I am reverting the code. When I get a chance, I’ll work out the issues with the code that implements the fancier alignment.
