A reoccurring question around Open XML is how to search and replace text in a word-processing document. There have been several attempts at presenting example code to do this, however, until now I have not seen any examples that correctly implement this. This post presents some example code that implements a correct algorithm to search and replace text.

The first challenge is handle the case when the text you are searching for spans runs with different formatting. A simple example will demonstrate the problem. You want to replace ‘Hello World’ with ‘Hi World’. If, in the document, the word ‘World’ is bolded, then the markup will look something like this:

<w:p>
  <w:r>
    <w:t xml:space="preserve">Hello </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b />
    </w:rPr>
    <w:t>World</w:t>
  </w:r>
</w:p>

Even though the search text spans runs, the algorithm should find the text and replace it. The next challenge is to define exactly the semantics of searching and replacing text if the text that you are searching for spans runs with different formatting. In short, the replaced text takes on the run formatting of the run that contains the first character of the search string. An example makes this clear. In the following sentence, the first four characters of the word ‘include’ are bolded:

On the Insert tab, the galleries include items.

If you replace ‘include’ with ‘do not include’, then the sentence should be formatted like this:

On the Insert tab, the galleries do not include items.

The replaced text takes on the formatting of the ‘i’ character of include, which was bolded.

Here is a short screen-cast that walks through the algorithm and the code.

Search and Replace Algorithm

It certainly would be possible to carefully define an algorithm to search for text that spans runs, noting where the searched text intersects bookmarks, comments, and the like. However, this algorithm would be pretty complicated, and to be done properly, a test team would need to write extensive test specs, and supply a plethora of sample documents that exercise all edge cases. It is a non-trivial project.

However, there is another approach that we can take that is pretty simple, easy to test, and yields the correct results in all cases. The algorithm consists of:

  • Concatenate all text in a paragraph into a single string, and search for the search string in the concatenated text. If the search text is found, then continue with the following steps.
  • Iterate through all runs in the paragraph, and break all runs into runs of a single character. There are a variety of special characters, such as carriage return, hard tab, break, and the non-breaking hyphen character. Normally, these special characters will coexist in runs with text elements. When breaking runs into runs of a single character, these special characters should also be placed into their own run. At the end of this process, no run will contain more than a single character, whether it is a character of text, or one of the special characters that is represented by an XML element.
  • After breaking runs of text into multiple runs of single characters, it is then pretty easy to iterate through the runs looking for a string of runs that match the characters in the search string.
  • If the algorithm finds a string of runs that match the search string, then it inserts a new run into the document. This new run contains the run properties of the first run in the string of runs that match the search string. In addition, the algorithm deletes the set of single-character runs that matched the search string. This process is repeated until no strings of runs are found that match the search string.
  • After the algorithm replaces the single-character runs with a new run containing the replacement text, then the algorithm coalesces adjacent runs with the same formatting into a single run.

Algorithm Walk-through

It will be helpful to walk through an example, and examine the markup at each step in the process. The following paragraph contains the text, “See this markup.” The letters ‘th’ in the word ‘this’ is bolded. We want to change the word ‘this’ to the word ‘the’.

<w:p>
  <w:r>
    <w:t xml:space="preserve">See </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b />
    </w:rPr>
    <w:t>th</w:t>
  </w:r>
  <w:r>
    <w:t>is markup.</w:t>
  </w:r>
</w:p>

After splitting all runs into multiple runs of a single character each, the markup looks like this:

<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r>
    <w:t>S</w:t>
  </w:r>
  <w:r>
    <w:t>e</w:t>
  </w:r>
  <w:r>
    <w:t>e</w:t>
  </w:r>
  <w:r>
    <w:t
      xml:space="preserve"> </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b />
    </w:rPr>
    <w:t>t</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b />
    </w:rPr>
    <w:t>h</w:t>
  </w:r>
  <w:r>
    <w:t>i</w:t>
  </w:r>
  <w:r>
    <w:t>s</w:t>
  </w:r>
  <w:r>
    <w:t
      xml:space="preserve"> </w:t>
  </w:r>
  <w:r>
    <w:t>m</w:t>
  </w:r>
  <w:r>
    <w:t>a</w:t>
  </w:r>
  <w:r>
    <w:t>r</w:t>
  </w:r>
  <w:r>
    <w:t>k</w:t>
  </w:r>
  <w:r>
    <w:t>u</w:t>
  </w:r>
  <w:r>
    <w:t>p</w:t>
  </w:r>
  <w:r>
    <w:t>.</w:t>
  </w:r>
</w:p>

The algorithm can then iterate through the runs, finding the series of runs where the text of the runs matches ‘t’, ‘h’, ‘I’, ‘s’. The algorithm then inserts a new run containing the replace text, taking the run properties from the run that contained the ‘t’ in the search string, which indicates that the run is bolded. It also removes the single character runs that match the search string. The adjusted markup looks like this.

<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r>
    <w:t>S</w:t>
  </w:r>
  <w:r>
    <w:t>e</w:t>
  </w:r>
  <w:r>
    <w:t>e</w:t>
  </w:r>
  <w:r>
    <w:t
      xml:space="preserve"> </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b />
    </w:rPr>
    <w:t>the</w:t>
  </w:r>
  <w:r>
    <w:t
      xml:space="preserve"> </w:t>
  </w:r>
  <w:r>
    <w:t>m</w:t>
  </w:r>
  <w:r>
    <w:t>a</w:t>
  </w:r>
  <w:r>
    <w:t>r</w:t>
  </w:r>
  <w:r>
    <w:t>k</w:t>
  </w:r>
  <w:r>
    <w:t>u</w:t>
  </w:r>
  <w:r>
    <w:t>p</w:t>
  </w:r>
  <w:r>
    <w:t>.</w:t>
  </w:r>
</w:p>

Finally, the algorithm iterates through the runs, coalescing adjacent runs with identical formatting.

<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r>
    <w:t
      xml:space="preserve">See </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b />
    </w:rPr>
    <w:t>the</w:t>
  </w:r>
  <w:r>
    <w:t
      xml:space="preserve"> markup.</w:t>
  </w:r>
</w:p>

Additional Notes

There are a few additional notes worth mentioning about this algorithm.

  • This algorithm only works for paragraphs that do not contain tracked revisions. While it is certainly possible to write this functionality for content that contains tracked revisions, it significantly complicates the algorithm. The code as written checks for the existence of tracked revisions (using the code presented in Using XML DOM to Detect Tracked Revisions in Open XML WordprocessingML Documents), and throws an exception if they exist.
  • If revision tracking is turned on for a document, the correct functionality would be to create the revision tracking markup, which is beyond the scope of this example. If revision tracking is turned on, the example code throws an exception.
  • While my favorite way to write these types of algorithms is to use LINQ to XML, to make this code more widely applicable, I used System.Xml.XmlDocument, which is an implementation of XML DOM. This makes it easier to translate this code to a variety of other platforms, such as PHP or Java.
  • The code searches and replaces text in the main document part, all headers, all footers, the endnote part, and the footnote part.