Formula Processing in SpreadsheetML

For the last couple of months, I have been exploring the contents of formulas in SpreadsheetML. Formulas can include functions, cell ranges, external references, references to other sheets in the same workbook, and a variety of operators. In fact, formulas even support union and intersection operators for ranges. The entire specification of formulas can be found in a document called “Extensions to the Office Open XML SpreadsheetML Format.” In the first part of this post, I will describe how to use the sample code. In the second part, I will show most of how the sample code works. The actual parsing is more complex and I may cover that in later post, depending on interest. You can find the code for these examples in version 2.2.7 or later of the PowerTools for OpenXML.

Two Parsing Examples

// Change sheet name in formulas
using (OpenXmlMemoryStreamDocument streamDoc = new OpenXmlMemoryStreamDocument(
   
SmlDocument.FromFileName("../../Formulas.xlsx")))
{
   
using (SpreadsheetDocument doc = streamDoc.GetSpreadsheetDocument())
   
{
       
WorksheetAccessor.FormulaReplaceSheetName(doc, "Source", "'Source 2'");
   
}
    streamDoc
.GetModifiedSmlDocument().SaveAs("../../FormulasUpdated.xlsx");
}

All sheet references are replaced. For example, this formula:

=SUM(Source!$A$1:$A$5)+SUM(Source!$B$1:$B$5)+Source!$D$1+Source!$D$3

Becomes:

=SUM(‘Source 2’!$A$1:$A$5)+SUM(‘Source 2′!$B$1:$B$5)+’Source 2′!$D$1+’Source 2’!$D$3

The second example is very similar, but uses a different method in order to copy a range of cells. After the same using statements, I get the WorksheetPart containing the cells I wish to copy. Then the CopyCellRange method copies everything in the cells and also updates any relative cell references. For example, the formula “=A1+A2” becomes “=H4+H5” after being copied.

// Change sheet name in formulas
using (OpenXmlMemoryStreamDocument streamDoc = new OpenXmlMemoryStreamDocument(
   
SmlDocument.FromFileName("../../Formulas.xlsx")))
{
   
using (SpreadsheetDocument doc = streamDoc.GetSpreadsheetDocument())
   
{
       
WorksheetPart sheet = WorksheetAccessor.GetWorksheet(doc, "References");
       
WorksheetAccessor.CopyCellRange(doc, sheet, 1, 1, 7, 5, 4, 8);
   
}
    streamDoc
.GetModifiedSmlDocument().SaveAs("../../FormulasCopied.xlsx");
}

Please note that only cell formulas are updated in these examples. There are other formulas that could also be adjusted, like conditional formatting.

Sheet Substitution

The core of this process is the use of the ParseFormula class. It has a method, ReplaceSheetName, that returns a new formula string based on the original parsed formula, but with the new sheet name. The rest of the code is retrieving all of the formulas and processing them one by one. There is also some change detection so that no extra work is done if there is
no change in the formula.

public static void FormulaReplaceSheetName(SpreadsheetDocument document, string oldName, string newName)
{
   
foreach (WorksheetPart sheetPart in document.WorkbookPart.WorksheetParts)
   
{
       
XDocument sheetDoc = sheetPart.GetXDocument();
       
bool changed = false;
       
foreach (XElement formula in sheetDoc.Descendants(S.f))
       
{
           
ParseFormula parser = new ParseFormula(formula.Value);
           
string newFormula = parser.ReplaceSheetName(oldName, newName);
           
if (newFormula != formula.Value)
           
{
                formula
.SetValue(newFormula);
                changed
= true;
           
}
       
}
       
if (changed)
       
{
            sheetPart
.PutXDocument();
           
ForceCalculateOnLoad(document);
       
}
   
}
}

There is also a call to the ForceCalculateOnLoad method so that values will be updated after the formulas have been changed. If that is not done, the values will not be updated from the new sheet when the spreadsheet is first opened.

Cell Range Copy

This method also depends heavily on the ParseFormula class. There are two loops for rows and columns in the range. Each of the original cells is retrieved from the worksheet and then a new cell is created that is identical to the old. In order to make the new cell valid in its new location, the “r” attribute must be set to the new cell name (e.g. H4). Then, if the cell
has a formula, that formula is parsed and modified using the ReplaceRelativeCell method to update any relative cell references.

// Copy all cells in the specified range to a new location
public static void CopyCellRange(SpreadsheetDocument document, WorksheetPart worksheet, int startRow, int startColumn, int endRow, int endColumn,
   
int toRow, int toColumn)
{
   
int rowOffset = toRow - startRow;
   
int columnOffset = toColumn - startColumn;
   
XDocument worksheetXDocument = worksheet.GetXDocument();
   
for (int row = startRow; row <= endRow; row++)
       
for (int column = startColumn; column <= endColumn; column++)
       
{
           
XElement oldCell = GetCell(worksheetXDocument, column, row);
           
if (oldCell != null)
           
{
               
XElement newCell = new XElement(oldCell);
                newCell
.SetAttributeValue(NoNamespace.r, GetColumnId(column + columnOffset) + (row + rowOffset).ToString());
               
XElement formula = newCell.Element(S.f);
               
if (formula != null)
               
{
                   
ParseFormula parser = new ParseFormula(formula.Value);
                    formula
.SetValue(parser.ReplaceRelativeCell(rowOffset, columnOffset));
               
}
               
SetCell(worksheetXDocument, newCell);
           
}
       
}
    worksheet
.PutXDocument();
   
ForceCalculateOnLoad(document);
}

The last step is to put the new cell element into the spreadsheet using the SetCell method. That method will find the proper “r” element, or add one if needed, and insert or replace
the cell in that element.

The Formula Parsing Class

There are two main functional parts of this class. The first is the parsing of the formula and the second is one of the methods that replaces particular elements of the formula. The parsing is done in the constructor of the class. It uses another class, ExcelFormula, to do the actual
parsing. The code in that class was generated from a PEG (Parsing Expression Grammar) that describes Excel formulas. The resulting code parses the formula and creates a tree of nodes that identify the elements of the formula. That tree can then be used to find and replace those elements.

The tree is made up of nodes, each of which may have a set of child nodes. There is one root node that holds the first set of child nodes, but does not refer to a particular element. Each node also has a starting position and length of where that element appears in the original formula text. For example, the formula “SUM(Source!$A$1:$A$5)” has a tree that looks like
this:

The ReplaceSheetName method, shown below, uses a recursive method to replace all occurrences of the “SheetName” node with a new name, but only if the current value of the node matches the old name. The recursive method is a simple way to be sure that the last occurrence is replaced first and to easily traverse the children of the tree. If the substitution was done left to right, all of the later position values would have to be adjusted. This simple substitution could easily be done for other node IDs just by calling ReplaceNode with a different ID.

public string ReplaceSheetName(string oldName, string newName)
{
   
StringBuilder text = new StringBuilder(parser.GetSource());
   
ReplaceNode(parser.GetRoot(), (int)EExcelFormula.SheetName, oldName, newName, text);
   
return text.ToString();
}

// Recursive function that will replace values from last to first
private void ReplaceNode(Peg.Base.PegNode node, int id, string oldName, string newName, StringBuilder text)
{
   
if (node.next_ != null)
       
ReplaceNode(node.next_, id, oldName, newName, text);
   
if (node.id_ == id && parser.GetSource().Substring(node.match_.posBeg_, node.match_.Length) == oldName)
   
{
        text
.Remove(node.match_.posBeg_, node.match_.Length);
        text
.Insert(node.match_.posBeg_, newName);
   
}
   
else if (node.child_ != null)
       
ReplaceNode(node.child_, id, oldName, newName, text);
}

The relative cell adjustment is a little more complex, but not much. A recursive function is still used to do the actual substitution of text. In this case, there are two possible substitutions, one for the A1Row and one for the A1Column. In both cases, the row number or column number is retrieved and adjusted by the offset values. Note that there is additional code for the columns to convert the alphabetic column name to an integer and then back after
the offset adjustment.

public string ReplaceRelativeCell(int rowOffset, int colOffset)
{
   
StringBuilder text = new StringBuilder(parser.GetSource());
   
ReplaceRelativeCell(parser.GetRoot(), rowOffset, colOffset, text);
   
return text.ToString();
}

// Recursive function that will adjust relative cells from last to first
private void ReplaceRelativeCell(Peg.Base.PegNode node, int rowOffset, int colOffset, StringBuilder text)
{
   
if (node.next_ != null)
       
ReplaceRelativeCell(node.next_, rowOffset, colOffset, text);
   
if (node.id_ == (int)EExcelFormula.A1Row && parser.GetSource().Substring(node.match_.posBeg_, 1) != "$")
   
{
       
int rowNumber = Convert.ToInt32(parser.GetSource().Substring(node.match_.posBeg_, node.match_.Length));
        text
.Remove(node.match_.posBeg_, node.match_.Length);
        text
.Insert(node.match_.posBeg_, Convert.ToString(rowNumber + rowOffset));
   
}
   
else if (node.id_ == (int)EExcelFormula.A1Column && parser.GetSource().Substring(node.match_.posBeg_, 1) != "$")
   
{
       
int colNumber = GetColumnNumber(parser.GetSource().Substring(node.match_.posBeg_, node.match_.Length));
        text
.Remove(node.match_.posBeg_, node.match_.Length);
        text
.Insert(node.match_.posBeg_, GetColumnId(colNumber + colOffset));
   
}
   
else if (node.child_ != null)
       
ReplaceRelativeCell(node.child_, rowOffset, colOffset, text);
}

The tree node structure makes it pretty simple to make these substitutions. The details of the parsing grammar and the operation of the PEG style of parser are beyond the scope of this post. If you would like me to post more information about PEG and the specific grammar for formulas, just let me know.