Reading Microsoft Word XML files with SAS
August 25, 2005
Larry Hoyle, Policy Research Institute, University of Kansas
revised 8/18/2005
3 scenarios
- Extracting text along with associated properties (styles and attributes)
- Extracting all data from tables
- Extracting coordinates of objects in drawings
XML - syntax
- A document must begin with the XML prolog tag
- Tags are paired, and there must be exactly 1 root tag
- Tag names are case sensitive
- Empty tags end with />
- A pair of tags together with its content is called an "element"
- Tags can be qualified by attributes
- Elements can be nested, but must start and end in the same parent
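The rules above can be seen in one small document. This is only an illustrative sketch: the element and attribute names (survey, comment, id, skipped, respondent, name) are hypothetical and do not come from Word XML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<survey>
  <!-- survey is the single root element; all names here are hypothetical -->
  <!-- start tags qualified by an attribute, each paired with an end tag -->
  <comment id="1">Some content</comment>
  <comment id="2">Other content</comment>
  <!-- an empty tag ends with /> -->
  <skipped/>
  <!-- a nested element starts and ends inside the same parent -->
  <respondent>
    <name>Example</name>
  </respondent>
</survey>
```

Because names are case sensitive, closing a comment element with a tag spelled Comment would make the document ill-formed.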
Word XML
Extracting text and properties
- Uses the SAS XML Engine
- The engine needs an XMLMAP file
- XML Mapper can be used to generate the XMLMAP
- The XMLMAP only needs to be generated once for each type of extract
Example Document I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.
XML - Example Document
Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr
Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
Rows
The XMLMap has to describe a path that delineates rows. In this case it's each text element in a run (in a paragraph…):
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
Columns – the text
The XMLMap has to describe a path that delineates each column. The text itself is:
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
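A minimal XMLMap for the row path and the text column might look like the sketch below. It assumes XMLMap syntax version 1.2; the map name, table name, column name and the length of 200 are illustrative choices, not values from the slides.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<SXLEMAP version="1.2" name="WordText">
  <TABLE name="text">
    <!-- one dataset row for each w:t element -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
    <!-- the column holding the text itself; name and length are assumptions -->
    <COLUMN name="text">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

A map like this is normally named in the XMLMAP= option of a LIBNAME statement that uses the XML engine.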
Columns – the text element number
A sequential number for the text element is:
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
Columns – the paragraph number
A sequential number for the paragraph is:
/w:wordDocument/w:body/wx:sect/w:p
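Sequential numbers like these are typically expressed as ordinal (counter) columns that increment each time a path is encountered. The fragment below is a sketch assuming XMLMap syntax version 1.2; it is meant to sit inside a TABLE element of a map, and the column names textNum and paraNum are hypothetical.

```xml
<!-- counter that increments at each w:t element -->
<COLUMN name="textNum" ordinal="YES">
  <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
  <TYPE>numeric</TYPE>
  <DATATYPE>integer</DATATYPE>
</COLUMN>
<!-- counter that increments at each w:p element -->
<COLUMN name="paraNum" ordinal="YES">
  <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
  <TYPE>numeric</TYPE>
  <DATATYPE>integer</DATATYPE>
</COLUMN>
```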
Columns – paragraph color /w:wordDocument/w:body/w
Columns – run color /w:wordDocument/w:body/w
Our dataset
Tables
All Tables Into One Dataset
Tables – Word XML
Tables – Dataset Rows /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t
Tables – Table Number /w:wordDocument/w:body/wx:sect/w:tbl
Tables – Row Number /w:wordDocument/w:body/wx:sect/w:tbl/w:tr
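Put together, the three table paths could be combined into one map along the following lines. This is a sketch assuming XMLMap syntax version 1.2; the map, table and column names and the text length are illustrative assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<SXLEMAP version="1.2" name="WordTables">
  <TABLE name="cells">
    <!-- one dataset row for each text element inside a table cell -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
    <!-- table number: increments at each w:tbl -->
    <COLUMN name="tableNum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>
    <!-- row number: increments at each w:tr -->
    <COLUMN name="rowNum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>
    <!-- the cell text; name and length are assumptions -->
    <COLUMN name="text">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```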
We Could Add Properties if Needed
Nested tables
Nested Tables – Absolute Path for Rows /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t
Nested Tables – Rootless Path for Rows w:tbl/w:tr/w:tc/w:p/w:r/w:t
Drawing Objects
- VML – Vector Markup Language
- Drawings in Word also get stored as XML
- We'll just look at lines
VML – Vector Markup Language
Dataset – One Row for Each Line /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line
Dataset – Column: From /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line
Dataset – Column: To /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line
Dataset – Column: StrokeColor /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line
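One way to express this dataset is the map sketched below. It assumes XMLMap syntax version 1.2, and it assumes that the From, To and StrokeColor columns are read from the from, to and strokecolor attributes of v:line using the @attribute path form; the slides give only the element path, so the attribute suffixes, the names and the lengths here are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<SXLEMAP version="1.2" name="WordLines">
  <TABLE name="lines">
    <!-- one dataset row for each v:line element -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>
    <!-- assumption: start point comes from the from attribute -->
    <COLUMN name="from">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>32</LENGTH>
    </COLUMN>
    <!-- assumption: end point comes from the to attribute -->
    <COLUMN name="to">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>32</LENGTH>
    </COLUMN>
    <!-- assumption: line color comes from the strokecolor attribute -->
    <COLUMN name="strokecolor">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>32</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

The from and to values arrive as character strings, which is why the usage example parses them with a regular expression before plotting.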
The Dataset
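Usage Example: Annotate dataset
/* xyPattern is a compiled regular expression whose capture buffers 1 and 2 hold x and y */
if prxmatch(xyPattern, from) then do;
   function = 'move';
   x = input(prxposn(xyPattern, 1, from), 10.);
   /* y is negated; when the style contains flip:y it is taken from "to" instead of "from" */
   if prxmatch('/flip:y/', style) then
      y = -1 * input(prxposn(xyPattern, 2, to), 10.);
   else
      y = -1 * input(prxposn(xyPattern, 2, from), 10.);
   output;
end;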
Plotted in SAS
Contact Information Larry Hoyle Policy Research Institute, University of Kansas