Hoyle paper 019-31 SUGI 31 Reading Microsoft Word XML files with SAS® Larry Hoyle, Policy Research Institute, University of Kansas.


1 Hoyle paper 019-31 SUGI 31 Reading Microsoft Word XML files with SAS® Larry Hoyle, Policy Research Institute, University of Kansas

2 Three Scenarios: extracting text and attributes; extracting data from tables; extracting drawing object parameters.

3 XML - Syntax. Sample content: "Some content", "Other content". A document must begin with the XML prolog tag. Tags are paired and case sensitive, and there must be exactly one root tag. A tag pair and its content are together called an "element". Tags can be qualified by attributes. Elements can be nested, but each must start and end within the same parent.
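The syntax rules on this slide can be checked with any XML parser. The following is a minimal sketch in Python (not SAS, and not taken from the paper); the tag names `root`, `item`, and `note` and the `kind` attribute are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: prolog, one root tag, paired tags,
# an attribute on each item, and one nested element.
GOOD = """<?xml version="1.0"?>
<root>
  <item kind="first">Some content</item>
  <item kind="second">Other content<note>nested</note></item>
</root>"""

# The closing tag differs in case, so the pair does not match.
BAD_CASE = "<root><item>Some content</Item></root>"

# Element <b> starts inside <a> but ends outside it.
BAD_NESTING = "<root><a><b>Other content</a></b></root>"


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(GOOD)
kinds = [item.get("kind") for item in root.findall("item")]
print(root.tag, kinds)
print(is_well_formed(BAD_CASE), is_well_formed(BAD_NESTING))
```

Both malformed strings are rejected by the parser, which is the behavior the case-sensitivity and nesting rules describe.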

4 Word XML

5 Word XML

6 Extracting Text and Properties

7 What Does SAS Need? The SAS XML engine needs an XMLMAP file. XML Mapper can be used to generate the XMLMAP. It only needs to be generated once for each type of extract.

8 Example Document: Styles and Colors Have Meaning. I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.

9 Style and Color. Style is “Treated” – a statement about treatment. Color is “Red” – represents negative affect.

10 Example Document as XML. I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys. Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr; run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
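To make the two property paths concrete, here is a small Python sketch (an illustration only, not the SAS approach of the paper) that locates the `w:pPr` and `w:rPr` elements in a hand-written fragment. The fragment is a made-up, heavily reduced stand-in for a real Word 2003 XML file; the namespace URIs are the ones Word 2003 uses for the `w` and `wx` prefixes.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hypothetical one-paragraph document, reduced to the elements on the slide.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p>
      <w:pPr><w:rPr><w:color w:val="FF0000"/></w:rPr></w:pPr>
      <w:r>
        <w:rPr><w:rStyle w:val="Treated"/><w:color w:val="FF0000"/></w:rPr>
        <w:t>That was very rude treatment.</w:t>
      </w:r>
    </w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def local(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(SAMPLE)  # root is the w:wordDocument element

# Paths are relative to the root, so the leading /w:wordDocument is implied.
para_props = root.findall("w:body/wx:sect/w:p/w:pPr", NS)
run_props = root.findall("w:body/wx:sect/w:p/w:r/w:rPr", NS)

para_children = [[local(c.tag) for c in pp] for pp in para_props]
run_children = [[local(c.tag) for c in rp] for rp in run_props]
print(para_children, run_children)
```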

11 Rows. The XMLMap has to describe a path that delineates rows. In this case it’s each text element in a run (in a paragraph…): /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t

12 Columns – the Text. The XMLMap has to describe a path that delineates each column. The text itself is: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t

13 Columns – the Text Element Number. A sequential number for the text element is: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t

14 Columns – the Paragraph Number. A sequential number for the paragraph is: /w:wordDocument/w:body/wx:sect/w:p

15 Columns – Paragraph Color: /w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val

16 Columns – Run Color: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val

17 Columns – Run Style: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:rStyle/@val (type character, datatype string, length 11)
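The row and column paths above amount to "one row per `w:t`, with counters and attribute lookups as columns". The following Python sketch imitates that layout under stated assumptions: it is not the XMLMap from the paper, the sample document is made up, and the column names (`text_no`, `para_no`, and so on) are hypothetical. In the file itself the `val` attribute is namespace-qualified (`w:val`), so the sketch looks it up by its qualified name.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}
VAL = f"{{{W}}}val"  # qualified name of the w:val attribute

# Hypothetical two-paragraph document; only the first has color and style.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
  <w:body><wx:sect>
    <w:p>
      <w:pPr><w:rPr><w:color w:val="FF0000"/></w:rPr></w:pPr>
      <w:r>
        <w:rPr><w:rStyle w:val="Treated"/><w:color w:val="FF0000"/></w:rPr>
        <w:t>That was very rude treatment.</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r><w:t>I have nothing to say.</w:t></w:r>
    </w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def attr_val(parent, path):
    """Return the w:val attribute of the element at path, or None if absent."""
    node = parent.find(path, NS)
    return None if node is None else node.get(VAL)


def text_rows(xml_text):
    """Build one row (a dict) per w:t element, in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    text_no = 0  # sequential number of the text element
    paragraphs = root.findall("w:body/wx:sect/w:p", NS)
    for para_no, p in enumerate(paragraphs, start=1):
        para_color = attr_val(p, "w:pPr/w:rPr/w:color")
        for r in p.findall("w:r", NS):
            run_color = attr_val(r, "w:rPr/w:color")
            run_style = attr_val(r, "w:rPr/w:rStyle")
            for t in r.findall("w:t", NS):
                text_no += 1
                rows.append({
                    "text_no": text_no,
                    "para_no": para_no,
                    "para_color": para_color,
                    "run_color": run_color,
                    "run_style": run_style,
                    "text": t.text,
                })
    return rows


rows = text_rows(SAMPLE)
for row in rows:
    print(row)
```

Missing properties come back as `None`, which plays the role of a missing value in the resulting table.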

18 The Data as Read into SAS

19 Tables

20 Our Sample Tables. Read all data from all tables into one dataset. Add variables to indicate table, row, column.

21 The Tables Dataset

22 The Tables Dataset

23 Word XML – Tables. Absolute path: /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t Relative path: w:tc/w:p/w:r/w:t

24 Count Table Beginnings: w:tbl

25 Count Table Endings: w:tbl
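One way to read the two counting slides is that a text element belongs to a table whenever more `w:tbl` elements have begun than have ended. The Python sketch below follows that reading with an event-based parser; this interpretation, the sample document, and the output layout are assumptions for illustration, not the XMLMap from the paper. It also assumes tables are not nested.

```python
import io
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"


def q(name):
    """Qualified ElementTree tag name for an element in the w namespace."""
    return f"{{{W}}}{name}"


def cell(text):
    """Hypothetical helper: markup for one table cell holding one text run."""
    return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


# Made-up document: a 2x2 table, a plain paragraph, then a 1x1 table.
SAMPLE = (
    f'<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}"><w:body><wx:sect>'
    f"<w:tbl><w:tr>{cell('A')}{cell('B')}</w:tr>"
    f"<w:tr>{cell('C')}{cell('D')}</w:tr></w:tbl>"
    "<w:p><w:r><w:t>Text between tables</w:t></w:r></w:p>"
    f"<w:tbl><w:tr>{cell('E')}</w:tr></w:tbl>"
    "</wx:sect></w:body></w:wordDocument>"
)


def table_cells(xml_text):
    """Return (table, row, column, text) tuples for text inside tables."""
    out = []
    tables_begun = tables_ended = 0
    row_no = col_no = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == q("tbl"):
                tables_begun += 1  # count table beginnings
                row_no = 0
            elif elem.tag == q("tr"):
                row_no += 1
                col_no = 0
            elif elem.tag == q("tc"):
                col_no += 1
        else:
            if elem.tag == q("tbl"):
                tables_ended += 1  # count table endings
            elif elem.tag == q("t") and tables_begun > tables_ended:
                # Inside a table: more tables have begun than have ended.
                out.append((tables_begun, row_no, col_no, elem.text))
    return out


cells = table_cells(SAMPLE)
for row in cells:
    print(row)
```

The paragraph between the two tables is skipped because, at that point, the beginning and ending counts are equal.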

26 Graphics

27 Drawing Object Parameters. VML – Vector Markup Language. This example will only read lines (they’re easiest). Other drawing objects have different XML elements.

28 Our Example Drawing

29 Word XML – Drawn Lines

30 One Row for Each Line Element: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line

31 Columns – Parameters as Attributes: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from
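As a rough illustration of "one row per `v:line`, one column per attribute", the Python sketch below reads line attributes from a made-up fragment. Only `@from` appears on the slide; reading `to`, `strokecolor`, and `strokeweight` as well is an assumption based on the standard VML line attributes, and the sample values are invented.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
    "v": "urn:schemas-microsoft-com:vml",
}

# Hypothetical drawing with two lines in one group.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:v="urn:schemas-microsoft-com:vml">
  <w:body><wx:sect><w:p><w:r><w:pict><v:group>
    <v:line from="0,0" to="100,50" strokecolor="red" strokeweight="2pt"/>
    <v:line from="10,20" to="30,40"/>
  </v:group></w:pict></w:r></w:p></wx:sect></w:body>
</w:wordDocument>"""

# Attribute names to turn into columns (unqualified on v:line).
COLUMNS = ("from", "to", "strokecolor", "strokeweight")


def line_rows(xml_text):
    """Return one dict per v:line element, with None for absent attributes."""
    root = ET.fromstring(xml_text)
    path = "w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line"
    return [
        {name: line.get(name) for name in COLUMNS}
        for line in root.findall(path, NS)
    ]


lines = line_rows(SAMPLE)
for row in lines:
    print(row)
```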

32 The Dataset

33 Example Code in Paper. Convert colors. Parse stroke weight (e.g. 2pt). Detect the keyword “flip” and flip coordinates.
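The paper's own code for these three steps is in SAS and is not reproduced in the slides. The Python sketch below only illustrates the kind of post-processing involved; the function names are hypothetical, the color conversion handles only `#rrggbb` values (producing the SAS `CXrrggbb` form), and the flip handling assumes a VML style string such as `flip:x`, `flip:y`, or `flip:x y`.

```python
import re


def to_sas_color(value):
    """Convert '#rrggbb' to the SAS 'CXrrggbb' form; pass other values through."""
    if re.fullmatch(r"#[0-9a-fA-F]{6}", value):
        return "CX" + value[1:].upper()
    return value  # named colors would need a lookup table (not shown)


def parse_stroke_weight(value):
    """Split a value such as '2pt' into a number and a unit: (2.0, 'pt')."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z]*)\s*", value)
    if match is None:
        raise ValueError(f"unrecognized stroke weight: {value!r}")
    return float(match.group(1)), match.group(2)


def apply_flip(start, end, style):
    """Swap x and/or y between the endpoints if style contains 'flip'."""
    (x1, y1), (x2, y2) = start, end
    match = re.search(r"flip\s*:\s*([^;]*)", style or "")
    if match:
        axes = match.group(1).lower()
        if "x" in axes:
            x1, x2 = x2, x1
        if "y" in axes:
            y1, y2 = y2, y1
    return (x1, y1), (x2, y2)


print(to_sas_color("#ff0000"))
print(parse_stroke_weight("2pt"))
print(apply_flip((0, 0), (10, 5), "position:absolute;flip:y"))
```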

34 As Drawn by SAS

35 Contact Information. Larry Hoyle, Policy Research Institute, University of Kansas. LarryHoyle@ku.edu http://www.ku.edu/pri/ksdata/sashttp/sugi31

