Reading Microsoft Word XML files with SAS August 25, 2005 Larry Hoyle -- Policy Research Institute University of Kansas revised 8/18/2005.



3 scenarios: (1) extracting text along with associated properties (styles and attributes); (2) extracting all data from tables; (3) extracting coordinates of objects in drawings.

XML - syntax: A document must begin with the XML prolog tag. Tags are paired and case sensitive, and there must be exactly 1 root tag. Empty tags end with />. A pair of tags together with its content is called an "element". Tags can be qualified by attributes. Elements can be nested, but each must start and end in the same parent.
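A small made-up document can make these syntax rules concrete. The element and attribute names below are hypothetical and are not taken from the original slide; only the rules they illustrate come from it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The prolog above comes first; exactly one root element follows -->
<survey>
  <!-- Paired tags plus their content form an element; id is an attribute -->
  <comment id="1">Some content</comment>
  <comment id="2">Other content</comment>
  <!-- An empty tag ends with /> -->
  <skipped id="3"/>
  <!-- A nested element starts and ends inside the same parent -->
  <respondent>
    <name>A. Person</name>
  </respondent>
</survey>
```

Because tags are case sensitive, closing `<comment>` with `</Comment>` would make the document ill-formed.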

Word XML

Extracting text and properties: The SAS XML engine needs an XMLMAP file. XML Mapper can be used to generate the XMLMAP. The map only needs to be generated once for each type of extract.
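As a minimal sketch of how an XMLMAP is attached to the XML engine, the program below assumes a Word document saved as XML and a map file that has already been generated. The file names, the fileref names, and the table name TEXT are all hypothetical.

```sas
/* Hypothetical file names; substitute your own paths. */
filename worddoc 'comments.xml';    /* Word document saved as XML        */
filename textmap 'wordtext.map';    /* XMLMAP generated once, then reused */

/* The libref has the same name as the fileref, so the XML engine
   reads that file and applies the map named in XMLMAP=.            */
libname worddoc xml xmlmap=textmap access=readonly;

/* TEXT is whatever table name the map defines (assumed here). */
proc print data=worddoc.text (obs=10);
run;
```

Since the map describes a type of extract rather than one document, pointing the first FILENAME statement at another Word XML file should be the only change needed to read a different document.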

Example Document I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.

XML - Example Document: I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys. Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr; run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr

Rows: The XMLMap has to describe a path that delineates rows. In this case it's each text element in a run (in a paragraph…): /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t

Columns – the text: The XMLMap has to describe a path that delineates each column. The text itself is: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t

Columns – the text element number: A sequential number for the text element is: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t

Columns – the paragraph number: A sequential number for the paragraph is: /w:wordDocument/w:body/wx:sect/w:p

Columns – paragraph color /w:wordDocument/w:body/w

Columns – run color /w:wordDocument/w:body/w
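The row and column paths above could be written down in an XMLMAP roughly as follows. This is a sketch, not the map used in the talk: the table name, column names, and text length are hypothetical, and the `version` value should match whatever your release of XML Mapper writes. The two color columns are left out because their full paths are not given here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: table name, column names, and LENGTH are assumptions -->
<SXLEMAP version="1.2">
  <TABLE name="TEXT">
    <!-- One row for each text element in a run -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

    <!-- Sequential number that advances at each paragraph -->
    <COLUMN name="parnum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>

    <!-- Sequential number that advances at each text element -->
    <COLUMN name="textnum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>

    <!-- The text itself -->
    <COLUMN name="text">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

The counter columns use INCREMENT-PATH rather than PATH because they count occurrences of an element instead of reading its content.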

Our dataset

Tables

All Tables Into One Dataset

Tables – Word XML

Tables - DataSet Rows /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t

Tables – Table Number /w:wordDocument/w:body/wx:sect/w:tbl

Tables – Row Number /w:wordDocument/w:body/wx:sect/w:tbl/w:tr
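Put together, the table paths above suggest a map along the following lines. This is a sketch under assumptions: the table and column names and the text length are hypothetical, and as written the row counter runs cumulatively across all tables; whether the original map restarted it for each table is not stated.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: names and LENGTH are assumptions -->
<SXLEMAP version="1.2">
  <TABLE name="CELLTEXT">
    <!-- One dataset row for each text element inside a table cell -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

    <!-- Advances at each Word table -->
    <COLUMN name="tablenum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>

    <!-- Advances at each table row (cumulative across tables here) -->
    <COLUMN name="rownum" ordinal="YES">
      <INCREMENT-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>

    <!-- The cell text -->
    <COLUMN name="text">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

With the table number carried on every row, all tables in the document can land in one dataset and still be separated later with a WHERE clause or BY-group processing.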

We Could Add Properties if Needed

Nested tables

Nested Tables – Absolute Path for Rows /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t

Nested Tables – Rootless Path for Rows w:tbl/w:tr/w:tc/w:p/w:r/w:t

Drawing Objects: VML – Vector Markup Language. Drawings in Word get stored as XML also. We'll just look at lines.

VML – Vector Markup Language

Dataset – One Row for Each Line /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line

Dataset – Column: From /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line

Dataset – Column: To /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line

Dataset – Column: StrokeColor /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line
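A map for the line dataset might look like the sketch below. The slides give only the element path for each column; the `@from`, `@to`, `@strokecolor`, and `@style` suffixes are an assumption based on the standard VML attribute names for a line, and the table name and lengths are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: attribute suffixes, names, and LENGTHs are assumptions -->
<SXLEMAP version="1.2">
  <TABLE name="LINES">
    <!-- One row for each line in a drawing group -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>

    <COLUMN name="from">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>40</LENGTH>
    </COLUMN>

    <COLUMN name="to">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>40</LENGTH>
    </COLUMN>

    <COLUMN name="strokecolor">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>20</LENGTH>
    </COLUMN>

    <!-- style carries positioning hints such as flip:y -->
    <COLUMN name="style">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@style</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```

The coordinates arrive as character strings, so they still have to be parsed into numbers before they can be plotted.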

The Dataset

Usage Example: Annotate dataset
if prxmatch(xyPattern, from) then do;
  function = 'move';
  x = input(prxposn(xyPattern, 1, from), 10.);
  if prxmatch('/flip:y/', style) then
    y = -1 * input(prxposn(xyPattern, 2, to), 10.);
  else
    y = -1 * input(prxposn(xyPattern, 2, from), 10.);
  output;
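The fragment above shows only the 'move' step. A self-contained DATA step along the same lines might look like the sketch below. It assumes a dataset WORK.LINES with character columns from, to, and style, and coordinates written as "x,y" integer pairs. The regular expression, the dataset names, and the 'draw' step are assumptions, not taken from the slide. Unlike the fragment, it matches from and to separately, so each pair of numbers is read from its own string.

```sas
/* Sketch only: WORK.LINES, the regex, and the draw step are assumptions. */
data anno;
  length function $8 xsys ysys $1;
  retain xsys ysys '2' xyPattern;    /* '2' = absolute data coordinates */
  set lines;

  /* Compile the "x,y" pattern once and reuse it on every row */
  if _n_ = 1 then xyPattern = prxparse('/(-?\d+)\s*,\s*(-?\d+)/');

  /* Start point of the line */
  if prxmatch(xyPattern, from) then do;
    x1 = input(prxposn(xyPattern, 1, from), 10.);
    y1 = input(prxposn(xyPattern, 2, from), 10.);
  end;
  else delete;

  /* End point of the line */
  if prxmatch(xyPattern, to) then do;
    x2 = input(prxposn(xyPattern, 1, to), 10.);
    y2 = input(prxposn(xyPattern, 2, to), 10.);
  end;
  else delete;

  /* flip:y means the line runs the other way vertically: swap the y values */
  if prxmatch('/flip:y/', style) then do;
    ytemp = y1;
    y1 = y2;
    y2 = ytemp;
  end;

  /* y is negated, as in the fragment above */
  function = 'move'; x = x1; y = -1 * y1; output;
  function = 'draw'; x = x2; y = -1 * y2; output;

  keep function xsys ysys x y;
run;
```

The resulting dataset has two observations per line and can be passed to a SAS/GRAPH procedure through its ANNOTATE= option.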

Plotted in SAS

Contact Information Larry Hoyle Policy Research Institute, University of Kansas