Advice: When Generating Spreadsheets, Use Inline Strings, not Shared Strings

If you create a spreadsheet using Excel, place a single string “Abcde” in cell A1, and save it, in the worksheet part you will see markup that looks something like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
           xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
           xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
           mc:Ignorable="x14ac"
           xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
  <dimension ref="A1"/>
  <sheetViews>
    <sheetView tabSelected="1"
               workbookViewId="0"/>
  </sheetViews>
  <sheetFormatPr defaultRowHeight="15"/>
  <sheetData>
    <row r="1"
         spans="1:1">
      <c r="A1"
         t="s">
        <v>0</v>
      </c>
    </row>
  </sheetData>
</worksheet>

The value in cell A1 is ‘0’ and the data type (in the ‘t’ attribute) is ‘s’.  This is because this markup is using the shared string table.  The ‘0’ is a zero-based index into the shared string table.  The shared string part is where we find the string “Abcde”:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
     count="1"
     uniqueCount="1">
  <si>
    <t>Abcde</t>
  </si>
</sst>

When you use the code generator in the Open XML SDK Productivity Toolkit to generate a spreadsheet, the code you generate reflects this structure.  You can see that the code that is generated is generating the shared string table, and then generates references to the shared string table when generating the markup for cells.

When you reuse the code that the Productivity Toolkit generates, you need to be aware of this, and alter the code to take this into account.

But there is a better way – don’t use the shared string table, and instead use inline strings.  After altering the above worksheet markup to use inline strings, the markup looks like this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<worksheet xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
             xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
             xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac"
             mc:Ignorable="x14ac"
             xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <dimension ref="A1" />
  <sheetViews>
    <sheetView tabSelected="1"
                 workbookViewId="0" />
  </sheetViews>
  <sheetFormatPr defaultRowHeight="15"/>
  <sheetData>
    <row r="1"
           spans="1:1">
      <c r="A1"
           t="inlineStr">
        <is>
          <t>Abcde</t>
        </is>
      </c>
    </row>
  </sheetData>
</worksheet>

But what if we want to use the Open XML SDK Productivity Tool to generate code to generate such markup?  There is a super easy answer – we write a small program that iterates through all worksheets in a workbook, and then iterates through all cells in each worksheet, and if the cell uses data from the shared string table, we convert the cell to use the inline string markup.  Here is the program to do this, in its entirety:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

class Program
{
    static void Main(string[] args)
    {
        var source = new FileInfo("../../Test.xlsx");
        var dest = new FileInfo("../../Test-Out.xlsx");
        if (dest.Exists)
            dest.Delete();
        File.Copy(source.FullName, dest.FullName);
        using (var sDoc = SpreadsheetDocument.Open(dest.FullName, true))
        {
            var sharedStringPart = sDoc.WorkbookPart.SharedStringTablePart;
            foreach (var worksheet in sDoc.WorkbookPart.WorksheetParts)
            {
                foreach (var cell in worksheet.RootElement.Descendants<Cell>())
                {
                    if (cell.DataType == "s")
                    {
                        // this is the zero based index
                        // Console.WriteLine(cell.CellValue.Text);
                        int zeroBasedIdx = Int32.Parse(cell.CellValue.Text);
                        var valueFromSharedStringTable = sharedStringPart
                            .RootElement
                            .Elements<SharedStringItem>()
                            .ElementAt(zeroBasedIdx);
                        // Console.WriteLine(valueFromSharedStringTable.Text.Text);
                        cell.RemoveAllChildren();
                        cell.DataType = CellValues.InlineString;
                        cell.Append(
                            new InlineString(
                                new Text(valueFromSharedStringTable.Text.Text)));
                    }
                }
            }
        }
    }
}

We run this program on our workbook before we use the Open XML SDK Productivity Tool to generate our code.  Be careful not to open and save your workbook in Excel between the time when you run this code, and when you generate the code, because Excel will alter the workbook to again use the shared string table.

If we do this conversion before we generate our code, the resulting code will be easier to understand and therefore, when we alter the code per our specific scenario, we won’t introduce any interesting bugs.

Update Dec 19, 2013: I’ve written another blog post on this subject:

When Writing Spreadsheets, A Comparison of using the Shared String Table to using In-Line Strings

Cheers, Eric