Get Highlighted Text from .docx


This topic contains 3 replies, has 2 voices, and was last updated by Eric White 8 years, 5 months ago.

Viewing 4 posts - 1 through 4 (of 4 total)
  • #3442

    HardCorps88
    Participant

    I have been trying to get all highlighted text from a .docx, but my code fails to find any. It is as follows:
        Dim htext As IEnumerable(Of Highlight) = wordDocument.MainDocumentPart.Document.Descendants(Of Highlight)().Where(Function(h) h.Val = "Yellow").ToList()
    This returns a collection of items, but InnerText is "" for each of them, so my next statement returns nothing:
        For Each e In htext
            Dim docHighText As New ParagraphText()
            Dim highText As String = ""
            highText = e.InnerText
            docHighText.FieldText = highText
            If e.InnerText <> "" Then
                paratext.Add(docHighText)
            End If
        Next

    Can you help me out?

    #3468

    Eric White
    Keymaster

    It is quite a bit more complicated than the approach you are taking. You are selecting the descendant w:highlight elements, but this is not where the text is stored. The text is in the w:t element. That w:t element is a child of a w:r element (the run), and the same w:r element also contains the w:rPr element (the run properties), which in turn contains the w:highlight element.

    <w:p>
      <w:r>
        <w:rPr>
          <w:highlight w:val="yellow"/>
        </w:rPr>
        <w:t>Test</w:t>
      </w:r>
    </w:p>

    You have to first select the runs that have the w:rPr elements that contain the w:highlight element with your desired value. Then after selecting those runs, you have to select the child w:t elements (and there may be multiple) that contain the actual text. To complicate matters further, that highlight element may be in the run properties in a style, so you would have to look at the style part, find the style, and see if the w:highlight element is in the run props for a style. Also, that character style may itself derive from another character style, where the w:highlight element is defined.
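    As a rough sketch of the first two steps only (select the runs, then collect their w:t children), something along these lines could work with the strongly typed classes in the Open XML SDK. The names HighlightSketch and GetDirectlyHighlightedText are made up for illustration, and the code assumes the highlight is applied directly in the run's own w:rPr. It does not look at styles at all.

```vbnet
Imports System.Collections.Generic
Imports System.Linq
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing

Module HighlightSketch
    ' Hypothetical helper: returns the text of every run whose OWN w:rPr
    ' contains a w:highlight with the requested color.
    ' Assumption: highlighting inherited from styles is ignored here.
    Public Function GetDirectlyHighlightedText(doc As WordprocessingDocument,
                                               color As HighlightColorValues) As List(Of String)
        Dim result As New List(Of String)()
        For Each r As Run In doc.MainDocumentPart.Document.Body.Descendants(Of Run)()
            ' RunProperties is Nothing when the run has no w:rPr child.
            Dim hl As Highlight = r.RunProperties?.Highlight
            If hl Is Nothing OrElse hl.Val Is Nothing Then Continue For
            If Not hl.Val.Value.Equals(color) Then Continue For

            ' A run may hold several w:t children, so join them all.
            Dim runText As String = String.Concat(r.Elements(Of Text)().Select(Function(t) t.Text))
            If runText.Length > 0 Then result.Add(runText)
        Next
        Return result
    End Function
End Module
```

    Calling GetDirectlyHighlightedText(wordDocument, HighlightColorValues.Yellow) would return one string per highlighted run. Adjacent runs are not merged, so a sentence that Word has split into several runs comes back in several pieces.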

    Document formats are complicated, and for good reason – the structure of the documents themselves is complicated.

    I recommend that you watch the Introduction to Open XML screen-cast series. After you have watched those screen-casts, then watch the Introduction to WordprocessingML screen-cast series.

    #3471

    HardCorps88
    Participant

    OK, so I got an answer on Stack Overflow with some code.
    There is still an issue: it finds my run properties as Nothing. When I open my document in the Open XML SDK Productivity Tool, I see this:
    <w:p w:rsidRPr="00AA7ABD" w:rsidR="00710260" w:rsidP="00710260" w:rsidRDefault="006B1119">
      <w:pPr>
        <w:spacing w:line="240" w:lineRule="auto" />
        <w:ind w:firstLine="720" />
        <w:rPr>
          <w:highlight w:val="yellow" />
        </w:rPr>
      </w:pPr>
      <w:proofErr w:type="gramStart" />
      <w:r w:rsidRPr="00AA7ABD">
        <w:rPr>
          <w:highlight w:val="yellow" />
        </w:rPr>
        <w:t>Zz5</w:t>
      </w:r>
      <w:r w:rsidRPr="00AA7ABD" w:rsidR="00710260">
        <w:rPr>
          <w:highlight w:val="yellow" />
        </w:rPr>
        <w:t>TT-1.</w:t>
      </w:r>
      <w:proofErr w:type="gramEnd" />
      <w:r w:rsidRPr="00AA7ABD" w:rsidR="00710260">
        <w:rPr>
          <w:highlight w:val="yellow" />
        </w:rPr>
        <w:t xml:space="preserve"> This is </w:t>
      </w:r>
      <w:proofErr w:type="spellStart" />
      <w:r w:rsidRPr="00AA7ABD" w:rsidR="00CC1B4F">
        <w:rPr>
          <w:highlight w:val="yellow" />
        </w:rPr>
        <w:t>ttttt</w:t>
      </w:r>
      <w:r w:rsidRPr="00AA7ABD" w:rsidR="00710260">
        <w:rPr>
          <w:highlight w:val="yellow" />
        </w:rPr>
        <w:t>my</w:t>
      </w:r>
      <w:proofErr w:type="spellEnd" />
      <w:r w:rsidRPr="00AA7ABD" w:rsidR="00710260">
        <w:rPr>
          <w:highlight w:val="yellow" />
        </w:rPr>
        <w:t xml:space="preserve"> test paragraph test paragraph </w:t>
      </w:r>
      <w:proofErr w:type="gramStart" />
      <w:r w:rsidRPr="00AA7ABD" w:rsidR="00710260">
        <w:rPr>
          <w:highlight w:val="yellow" />
        </w:rPr>
        <w:t>This</w:t>
      </w:r>
      <w:proofErr w:type="gramEnd" />
      <w:r w:rsidRPr="00AA7ABD" w:rsidR="00710260">
        <w:rPr>
          <w:highlight w:val="yellow" />
        </w:rPr>
        <w:t xml:space="preserve"> is my test paragraph test paragraph.</w:t>
      </w:r>
    </w:p>

    I included the snippet of code below.
    Private Function GetListOfHighlightedString(ByVal Docx As WordprocessingDocument) As List(Of String)
        Dim lstOfHighlightedString As List(Of String) = New List(Of String)()
        Try
            For Each EachRun In Docx.MainDocumentPart.Document.Body.Descendants(Of Run)()
                ' RunProperties is Nothing for a run that has no w:rPr child
                If EachRun.RunProperties IsNot Nothing Then
                    For Each EachPrpChild In EachRun.RunProperties.ChildElements
                        If TypeOf EachPrpChild Is Highlight Then
                            Dim highlightVal As Highlight = TryCast(EachPrpChild, Highlight)
                            If highlightVal.Val.Equals(HighlightColorValues.Yellow) Then
                                lstOfHighlightedString.Add(EachRun.InnerText)
                            End If
                        End If
                    Next EachPrpChild
                End If
            Next EachRun
        Catch e1 As Exception
            Throw
        End Try
        Return lstOfHighlightedString
    End Function

    #3487

    Eric White
    Keymaster

    Yes, sometimes there are no run properties for a run, in which case it uses the run properties from the style, and then from the global defaults. This is valid Open XML, and your code should be prepared to handle this.
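    One way to be prepared for this, sketched very loosely: check the run's own properties first, and if there is no highlight there, follow the run's character style and its basedOn chain in the styles part. ResolveHighlight and StyleHighlightSketch are hypothetical names, not part of the SDK. The sketch assumes only character styles matter. It does not consult the paragraph style, the paragraph mark's w:rPr, or the document defaults, so treat it as a starting point rather than a complete resolver.

```vbnet
Imports System.Collections.Generic
Imports System.Linq
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing

Module StyleHighlightSketch
    ' Hypothetical helper: returns the w:highlight that applies to a run,
    ' or Nothing if none is found. Looks at the run's own w:rPr first, then
    ' walks the character style chain (w:rStyle -> w:basedOn -> ...).
    ' Assumption: paragraph styles and document defaults are NOT consulted.
    Public Function ResolveHighlight(r As Run, styles As Styles) As Highlight
        Dim direct As Highlight = r.RunProperties?.Highlight
        If direct IsNot Nothing Then Return direct
        If styles Is Nothing Then Return Nothing

        Dim styleId As String = r.RunProperties?.RunStyle?.Val?.Value
        Dim visited As New HashSet(Of String)()   ' stops on a circular basedOn chain
        While styleId IsNot Nothing AndAlso visited.Add(styleId)
            Dim id As String = styleId
            Dim style As Style = styles.Elements(Of Style)().
                FirstOrDefault(Function(s) s.StyleId IsNot Nothing AndAlso s.StyleId.Value = id)
            If style Is Nothing Then Exit While

            ' The style's w:rPr may carry the highlight instead of the run.
            Dim fromStyle As Highlight = style.StyleRunProperties?.GetFirstChild(Of Highlight)()
            If fromStyle IsNot Nothing Then Return fromStyle

            ' Not defined here, so move up to the parent style, if any.
            styleId = style.BasedOn?.Val?.Value
        End While
        Return Nothing
    End Function
End Module
```

    The styles argument would typically come from wordDocument.MainDocumentPart.StyleDefinitionsPart?.Styles, which is Nothing when the document has no styles part.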

    -Eric
