Hoyle paper SUGI 31 Reading Microsoft Word XML files with SAS® Larry Hoyle, Policy Research Institute, University of Kansas
Three Scenarios: extracting text and attributes; extracting data from tables; extracting drawing object parameters.
XML Syntax: A document must begin with the XML prolog tag. Tags are paired and case sensitive, and there must be exactly one root tag. A pair of tags together with its content is called an "element". Tags can be qualified by attributes. Elements can be nested, but each must start and end within the same parent. Example element contents: "Some content", "Other content".
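The syntax rules above can be checked with any XML parser. The following is a minimal sketch in Python (not SAS), using only the standard library; the tag name `item` and the attribute `kind` are invented for illustration and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: prolog, one root element, paired tags, attributes.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<root>
  <item kind="first">Some content</item>
  <item kind="second">Other content</item>
</root>"""

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False

root = ET.fromstring(SAMPLE.encode("utf-8"))

# Each child element: (tag, attribute value, text content).
contents = [(e.tag, e.get("kind"), e.text) for e in root]
```

Inputs such as `<a></A>` (mismatched case), `<a/><b/>` (two root tags), or `<a><b></a></b>` (an element ending outside its parent) are rejected by `is_well_formed`.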
Word XML
Extracting Text and Properties
What Does SAS Need? The SAS XML engine needs an XMLMAP file. XML Mapper can be used to generate the XMLMAP. It only needs to be generated once for each type of extract.
Example Document: Styles and Colors Have Meaning. I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.
Style and Color: Style is “Treated”, a statement about treatment. Color is “Red”, which represents negative affect.
Example Document as XML: I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys. Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr; run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
Rows: The XMLMap has to describe a path that delineates rows. In this case it is each text element in a run (in a paragraph…): /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
Columns – the Text: The XMLMap has to describe a path that delineates each column. The text itself is: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
Columns – the Text Element Number: A sequential number for the text element is: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
Columns – the Paragraph Number: A sequential number for the paragraph is: /w:wordDocument/w:body/wx:sect/w:p
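To make the row and counter paths above concrete, here is a minimal sketch in Python (not SAS, and not the XMLMap mechanism itself) that walks the same paths and emits one row per `w:t` element with a paragraph number and a text element number. The sample document is a hand-written fragment, and the namespace URIs are assumed to be those of the Word 2003 XML format.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for Word 2003 XML (WordprocessingML).
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hand-written sample fragment, not the presenter's document.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p><w:r><w:t>What a pleasant experience.</w:t></w:r>
         <w:r><w:t>I love you guys.</w:t></w:r></w:p>
    <w:p><w:r><w:t>I have nothing to say.</w:t></w:r></w:p>
  </wx:sect></w:body>
</w:wordDocument>"""

def text_rows(xml_text):
    """One row per w:t element, with paragraph and text element counters."""
    root = ET.fromstring(xml_text)  # root is w:wordDocument
    rows = []
    text_num = 0
    paragraphs = root.findall("w:body/wx:sect/w:p", NS)
    for para_num, para in enumerate(paragraphs, start=1):
        for t in para.findall("w:r/w:t", NS):
            text_num += 1  # counts every text element in document order
            rows.append({"para_num": para_num,
                         "text_num": text_num,
                         "text": t.text})
    return rows

rows = text_rows(SAMPLE)
```

The paragraph counter advances once per `w:p`, while the text counter advances once per `w:t`, so two runs in the same paragraph share a paragraph number but get distinct text element numbers.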
Columns – Paragraph Color: /w:wordDocument/w:body/wx:sect/w:p/
Columns – Run Color: /w:wordDocument/w:body/wx:sect/w:
Columns – Run Style: /w:wordDocument/w:body/wx:sect/w:p/w:r/ (type: character; datatype: string; length: 11)
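The color and style paths on the slides are cut off, so the following Python sketch (not SAS) rests on an assumption: that a run's style and color are stored as `w:rStyle` and `w:color` elements inside `w:rPr`, each with a `w:val` attribute. The sample fragment and the color value are hand-written for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}
VAL = f"{{{W}}}val"  # fully qualified name of the w:val attribute

# Hand-written sample; element names under w:rPr are an assumption.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
  <w:body><wx:sect><w:p>
    <w:r>
      <w:rPr><w:rStyle w:val="Treated"/><w:color w:val="FF0000"/></w:rPr>
      <w:t>That was very rude treatment.</w:t>
    </w:r>
    <w:r><w:t>I have nothing to say.</w:t></w:r>
  </w:p></wx:sect></w:body>
</w:wordDocument>"""

def attr_or_none(run, path):
    """Return the w:val attribute of the element at path, or None."""
    node = run.find(path, NS)
    return node.get(VAL) if node is not None else None

def run_rows(xml_text):
    """One row per run: its style, its color, and its text."""
    root = ET.fromstring(xml_text)
    rows = []
    for run in root.findall("w:body/wx:sect/w:p/w:r", NS):
        rows.append({
            "run_style": attr_or_none(run, "w:rPr/w:rStyle"),
            "run_color": attr_or_none(run, "w:rPr/w:color"),
            "text": run.findtext("w:t", default="", namespaces=NS),
        })
    return rows

rows = run_rows(SAMPLE)
```

Runs without a `w:rPr` element simply yield `None` for both properties, which plays the role of a missing value in the resulting table.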
The Data as Read into SAS
Tables
Our Sample Tables: Read all data from all tables into one dataset. Add variables to indicate table, row, and column.
The Tables Dataset
Word XML – Tables: Absolute path: /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t; relative path: w:tc/w:p/w:r/w:t
Count Table Beginnings: w:tbl
Count Table Endings: w:tbl
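The goal of the table extract (one dataset holding every cell, tagged with table, row, and column numbers) can be sketched in Python as follows. This is not the SAS approach from the slides, which counts `w:tbl` beginnings and endings; the sketch swaps in nested loops over `w:tbl`, `w:tr`, and `w:tc` to produce the same three indices. The sample tables and the helper `cell` are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}

def cell(text):
    """Hypothetical helper: build one table cell holding a single text run."""
    return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"

# Two hand-written tables: a 2x2 table followed by a 1x1 table.
SAMPLE = (
    f'<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}"><w:body><wx:sect>'
    f"<w:tbl><w:tr>{cell('a')}{cell('b')}</w:tr>"
    f"<w:tr>{cell('c')}{cell('d')}</w:tr></w:tbl>"
    f"<w:tbl><w:tr>{cell('e')}</w:tr></w:tbl>"
    "</wx:sect></w:body></w:wordDocument>"
)

def table_rows(xml_text):
    """Return (table, row, column, text) tuples for every table cell."""
    root = ET.fromstring(xml_text)
    out = []
    tables = root.findall("w:body/wx:sect/w:tbl", NS)
    for t_num, tbl in enumerate(tables, start=1):
        for r_num, tr in enumerate(tbl.findall("w:tr", NS), start=1):
            for c_num, tc in enumerate(tr.findall("w:tc", NS), start=1):
                # Join all text elements found along the relative path.
                text = "".join(t.text or "" for t in tc.findall("w:p/w:r/w:t", NS))
                out.append((t_num, r_num, c_num, text))
    return out

cells = table_rows(SAMPLE)
```

Row and column numbers restart within each table, so the triple (table, row, column) identifies a cell uniquely across the whole document.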
Graphics
Drawing Object Parameters: VML is the Vector Markup Language. This example reads only lines (they are the easiest). Other drawing objects have different XML elements.
Our Example Drawing
Word XML – Drawn Lines
One Row for Each Line Element: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line
Columns: Parameters as Attributes: /w:wordDocument/w:body/wx:sect/w:p/w:r/
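As an illustration of reading one row per `v:line` with its parameters taken from attributes, here is a short Python sketch (not SAS). The column path on the slide is cut off, so the attribute names used here (`from`, `to`, `strokecolor`, `strokeweight`, `style`) are an assumption based on VML line elements, and the sample drawing is hand-written.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
V = "urn:schemas-microsoft-com:vml"
NS = {"w": W, "wx": WX, "v": V}

# Hand-written sample drawing with two lines in one group.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}" xmlns:v="{V}">
  <w:body><wx:sect><w:p><w:r><w:pict><v:group>
    <v:line from="0,0" to="100,50" strokecolor="red" strokeweight="2pt"/>
    <v:line from="10,20" to="30,40" style="flip:y"/>
  </v:group></w:pict></w:r></w:p></wx:sect></w:body>
</w:wordDocument>"""

LINE_PATH = "w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line"
COLUMNS = ("from", "to", "strokecolor", "strokeweight", "style")

def line_rows(xml_text):
    """One row per v:line; each column is an attribute (None if absent)."""
    root = ET.fromstring(xml_text)
    return [{name: line.get(name) for name in COLUMNS}
            for line in root.findall(LINE_PATH, NS)]

lines = line_rows(SAMPLE)
```

Attributes that a particular line does not carry come back as `None`, so every row has the same set of columns.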
The Dataset
Example Code in Paper: convert colors; parse stroke weight (e.g. 2pt); detect the keyword “flip” and flip coordinates.
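The paper's SAS code is not reproduced on the slides. The three post-processing steps could look roughly like the Python sketch below, which is an illustration under stated assumptions rather than the author's method: colors are assumed to arrive as `#rrggbb` or as one of a few names in a hypothetical lookup table and are converted to the SAS `CXrrggbb` form; stroke weights are assumed to be in points; and a flip is assumed to appear as a `flip` declaration in the line's style string, handled by swapping the endpoint coordinates on the named axes.

```python
import re

# Hypothetical lookup for a few named colors (not an exhaustive list).
NAMED_COLORS = {"red": "FF0000", "green": "008000", "blue": "0000FF",
                "black": "000000"}

def to_sas_color(color):
    """Convert '#rrggbb' or a known color name to the form 'CXRRGGBB'."""
    c = color.strip().lower()
    if re.fullmatch(r"#[0-9a-f]{6}", c):
        return "CX" + c[1:].upper()
    if c in NAMED_COLORS:
        return "CX" + NAMED_COLORS[c]
    raise ValueError(f"unrecognized color: {color!r}")

def parse_stroke_weight(weight):
    """Parse a weight such as '2pt' or '.75pt' into a float (points)."""
    m = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*pt\s*", weight)
    if m is None:
        raise ValueError(f"unrecognized stroke weight: {weight!r}")
    return float(m.group(1))

def flip_axes(style):
    """Return the set of axes named in a 'flip' style declaration."""
    for decl in (style or "").split(";"):
        name, _, value = decl.partition(":")
        if name.strip().lower() == "flip":
            return set(value.lower().split())
    return set()

def apply_flip(start, end, axes):
    """Swap endpoint coordinates on each flipped axis (an assumed rule)."""
    (x1, y1), (x2, y2) = start, end
    if "x" in axes:
        x1, x2 = x2, x1
    if "y" in axes:
        y1, y2 = y2, y1
    return (x1, y1), (x2, y2)
```

For example, a line from (10, 20) to (30, 40) whose style contains `flip:y` would be drawn from (10, 40) to (30, 20) under this rule.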
As Drawn by SAS
Contact Information: Larry Hoyle, Policy Research Institute, University of Kansas