Reading Microsoft Word XML files with SAS August 25, 2005 Larry Hoyle -- Policy Research Institute University of Kansas revised 8/18/2005.


1 Reading Microsoft Word XML files with SAS. August 25, 2005. Larry Hoyle, Policy Research Institute, University of Kansas (revised 8/18/2005)

2 Three scenarios: (1) extracting text along with associated properties (styles and attributes); (2) extracting all data from tables; (3) extracting coordinates of objects in drawings.

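3 XML - syntax: An XML file must begin with the prolog tag. Tags come in pairs and are case sensitive, and the document must have exactly one root tag. Empty tags end with />. A tag pair together with its content is called an "element". Tags can be qualified by attributes. Elements can be nested, but each must start and end in the same parent. The slide's sample elements contain the text "Some content" and "Other content".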
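The syntax rules on this slide can be illustrated with a small well-formed document. This is only a sketch: the element and attribute names (survey, comment, id, separator, group) are invented for illustration, and only the two content strings come from the slide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The line above is the prolog; it has to come first. -->
<!-- survey is the single root element (hypothetical name). -->
<survey>
  <!-- Paired tags; id is an attribute qualifying the tag. -->
  <comment id="1">Some content</comment>
  <comment id="2">Other content</comment>
  <!-- An empty tag ends with a slash and a closing bracket. -->
  <separator/>
  <!-- A nested element starts and ends inside the same parent. -->
  <group>
    <comment id="3">Nested content</comment>
  </group>
</survey>
```

Because tags are case sensitive, closing a comment element with a tag spelled Comment would make the document ill-formed.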

4 Word XML

5

6 Extracting text and properties: The SAS XML engine needs an XMLMAP file. The XML Mapper can be used to generate the XMLMAP, which only needs to be generated once for each type of extract.

7 Example Document I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.

8 XML - Example Document: I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys. Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr

9 Rows: The XMLMap has to describe a path that delineates rows. In this case it's each text element in a run (in a paragraph…): /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t

10 Columns – the text: The XMLMap has to describe a path that delineates each column. The text itself is: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t
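The row path and the text column described on slides 9 and 10 could be written in an XMLMap file roughly as follows. This is a minimal sketch, not the author's map: the element names follow SAS XMLMap version 1.2 syntax, while the table name, the column name, and the length of 200 are placeholders chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical XMLMap sketch: one dataset row per w:t element. -->
<SXLEMAP version="1.2">
  <TABLE name="text">
    <!-- Row path from slide 9: each text element in a run -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
    <!-- Column holding the text itself; name and length are assumptions -->
    <COLUMN name="text">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

A map like this is normally attached through the XMLMAP= option of a LIBNAME statement that uses the XML engine.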

11 Columns – the text element number: A sequential number for the text element is incremented at each: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t

12 Columns – the paragraph number: A sequential number for the paragraph is incremented at each: /w:wordDocument/w:body/wx:sect/w:p
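The two sequential numbers from slides 11 and 12 can be sketched as ordinal (counter) columns, each incremented whenever its path is encountered. The column names textnum and paranum are hypothetical, and the map is an illustration in XMLMap 1.2 syntax rather than the file used in the presentation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical XMLMap sketch: counter columns for text elements and paragraphs. -->
<SXLEMAP version="1.2">
  <TABLE name="text">
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
    <!-- Incremented at every w:t element (slide 11) -->
    <COLUMN name="textnum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>
    <!-- Incremented at every w:p element (slide 12) -->
    <COLUMN name="paranum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

With both counters in place, rows that share a paranum value belong to the same paragraph, which makes it possible to reassemble a paragraph from its runs.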

13 Columns – paragraph color: /w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val

14 Columns – run color: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val
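The two color columns from slides 13 and 14 might be mapped as attribute paths like this. The column names, lengths, and the use of retain="YES" on the paragraph color are assumptions made for this sketch; the slides give only the paths.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical XMLMap sketch: paragraph and run color as columns. -->
<SXLEMAP version="1.2">
  <TABLE name="text">
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
    <!-- Paragraph color (slide 13). retain="YES" is an assumption: it keeps
         the last value read so it can appear on each following text row. -->
    <COLUMN name="pcolor" retain="YES">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>10</LENGTH>
    </COLUMN>
    <!-- Run color (slide 14); not retained in this sketch -->
    <COLUMN name="rcolor">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>10</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

Whether a column should be retained depends on how its value ought to carry across rows, so that choice is worth checking against the actual document.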

15 Our dataset

16 Tables

17 All Tables Into One Dataset

18 Tables – Word XML

19 Tables - DataSet Rows: /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t

20 Tables – Table Number: /w:wordDocument/w:body/wx:sect/w:tbl

21 Tables – Row Number: /w:wordDocument/w:body/wx:sect/w:tbl/w:tr
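Slides 19 through 21 can be combined into one map that reads every table into a single dataset. The following is a sketch under stated assumptions: the table and column names (cells, tablenum, rownum, text) and the length are hypothetical, and the map does nothing to restart the row counter for each new table.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical XMLMap sketch: all Word tables into one dataset. -->
<SXLEMAP version="1.2">
  <TABLE name="cells">
    <!-- One row per text element inside a table cell (slide 19) -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
    <!-- Table number: incremented at each w:tbl (slide 20) -->
    <COLUMN name="tablenum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>
    <!-- Row number: incremented at each w:tr (slide 21) -->
    <COLUMN name="rownum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>
    <!-- Cell text; the length is an assumption -->
    <COLUMN name="text">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

A cell number could be added in the same way with a counter on the w:tc path.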

22 We Could Add Properties if Needed

23 Nested tables

24 Nested Tables – Absolute Path for Rows: /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t

25 Nested Tables – Rootless Path for Rows: w:tbl/w:tr/w:tc/w:p/w:r/w:t
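The reason for the rootless path is presumably that text in a nested table sits under a longer path (a w:tbl inside a w:tc), so the absolute path from slide 24 reaches only the outer table, while a path with no leading slash can match at any depth. A sketch of a map using the rootless form follows; the names and length are hypothetical, and writing the column path in the same rootless form is an assumption, not something the slides confirm.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical XMLMap sketch: rootless row path so nested tables are included. -->
<SXLEMAP version="1.2">
  <TABLE name="cells">
    <!-- Rootless row path from slide 25 (no leading slash) -->
    <TABLE-PATH syntax="XPath">w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
    <!-- Assumption: the column path may be written the same way -->
    <COLUMN name="text">
      <PATH syntax="XPath">w:tbl/w:tr/w:tc/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```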

26 Drawing Objects: VML – Vector Markup Language. Drawings in Word also get stored as XML. We'll just look at lines.

27 VML – Vector Markup Language

28 Dataset – One Row for Each Line: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line

29 Dataset – Column: From: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from

30 Dataset – Column: To: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to

31 Dataset – Column: StrokeColor: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor
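Slides 28 through 31 together describe a dataset with one row per VML line and one column per attribute. A possible map is sketched below; the table name and the column lengths are assumptions, and the column names simply echo the attribute names from the slides.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical XMLMap sketch: one row per v:line in a drawing. -->
<SXLEMAP version="1.2">
  <TABLE name="lines">
    <!-- Row path from slide 28 -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>
    <!-- Start point of the line (slide 29) -->
    <COLUMN name="from">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>32</LENGTH>
    </COLUMN>
    <!-- End point of the line (slide 30) -->
    <COLUMN name="to">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>32</LENGTH>
    </COLUMN>
    <!-- Stroke color (slide 31) -->
    <COLUMN name="strokecolor">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>32</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

The coordinates arrive as character strings, so they still have to be parsed into numbers before plotting.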

32 The Dataset

33 Usage Example: Annotate dataset

   if prxmatch(xyPattern, from) then do;
      function = 'move';
      x = input(prxposn(xyPattern, 1, from), 10.);
      if prxmatch('/flip:y/', style) then
         y = -1 * input(prxposn(xyPattern, 2, to), 10.);
      else
         y = -1 * input(prxposn(xyPattern, 2, from), 10.);
      output;

34 Plotted in SAS

35 Contact Information: Larry Hoyle, Policy Research Institute, University of Kansas, LarryHoyle@ku.edu, http://www.ku.edu/pri/ksdata/sashttp/sugi31

