1 Introduction to XML Yanlei Diao UMass Amherst April 19, 2007 Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.

2 Structure in Data Representation
- Relational data is highly structured
  - structure is defined by the schema
  - good for system design
  - good for precise query semantics / answers
- Structure can be limiting
  - data exchange is hard: integration of different schemas
  - authoring is constrained: schema-first
  - querying is constrained: must know the schema
  - changes to structure are not easy

3 Data Integration
[Figure: sites in the US, Europe, Asia and Australia connected over the Internet]
1. Find all departments whose total employee salaries exceed 1% of the budget of the company.
2. Find names of employees with the top sales record last month.

4 WWW
- Structured data: databases
- Unstructured text: documents
- Semistructured data: integration of text and structured data

5 Need for a New Data Model
Loose (and rich) structure:
- Integration of structured, but heterogeneous data sources
- Evolving, unknown, or irregular structure
- Textual data with tags and links
- Combination of data models

6 XML: Universal Data Exchange Format
- XML is the confluence of many factors:
  - Databases needed a more flexible interchange format.
  - Data needed to be generated and consumed by applications.
  - The Web needed a more declarative format for data.
  - Documents needed a mechanism for extended tags.
- XML was originally proposed for online publishing and is becoming the wire format for data exchange.
- W3C Recommendation: http://www.w3.org/TR/REC-xml/

7 From HTML to XML
HTML describes the presentation.

8 HTML
Bibliography
- Foundations of Databases. Abiteboul, Hull, Vianu. Addison Wesley, 1995
- Data on the Web. Abiteboul, Buneman, Suciu. Morgan Kaufmann, 1999

9 XML
Foundations… Abiteboul Hull Vianu Addison Wesley 1995 …
XML describes the content!

10 XML: Syntax & Typing

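11 XML Syntax
- Tags: book, title, author, …
  - start tag: <book>
  - end tag: </book>
- Elements: <book>…</book>, <author>…</author>
  - elements are nested
  - empty element: a start tag immediately followed by its end tag, e.g. <book></book>, abbreviated <book/>
- An XML document has a single root element
An XML document is well formed if it has matching tags.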
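The well-formedness rules on this slide can be checked mechanically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` parser; the helper name `is_well_formed` and the sample documents are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as an XML document with a single root element."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents built from the slide's tag names.
good = "<book><title>Foundations of Databases</title><author>Abiteboul</author></book>"
bad_nesting = "<book><title>Foundations of Databases</book></title>"  # tags do not match
two_roots = "<book/><book/>"                                          # more than one root

print(is_well_formed(good), is_well_formed(bad_nesting), is_well_formed(two_roots))
```

The parser only checks well-formedness here; it says nothing about whether the document conforms to a DTD.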

12 XML Syntax
Foundations of Databases Abiteboul … 1995
Attributes are alternative ways to represent data.
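To illustrate that attributes and child elements are interchangeable ways of carrying the same value, here is a small sketch. The markup of the original example was lost in this transcript, so the `year` element and attribute below are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The same fact (publication year) encoded two ways; tag/attribute names are hypothetical.
as_elements = ET.fromstring(
    "<book><title>Foundations of Databases</title><year>1995</year></book>")
as_attributes = ET.fromstring(
    '<book year="1995"><title>Foundations of Databases</title></book>')

year_from_element = as_elements.findtext("year")   # text of the <year> child
year_from_attribute = as_attributes.get("year")    # value of the year attribute

print(year_from_element, year_from_attribute)
```

One practical difference: an attribute can occur at most once per element and holds only a string, while a child element can repeat and can have structure of its own.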

13 XML Syntax
Jane Mary John
Oids and references in XML are just syntax.

14 XML Semantics: a Tree!
[Figure: the document drawn as a tree rooted at the element node data. Each person is an element node with child element nodes name, address (with street, no, city) and phone; one person carries an id attribute node with value o555. Text nodes hold the values Mary, Maple, 345, Seattle, John, Thailand and 23456.]
Order matters! IDREF will turn it into a graph.
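The tree view can be made concrete by walking a parsed document and listing its element, attribute and text nodes. This is a sketch only: the markup below is a plausible reconstruction from the values and node labels on the slide (the original tags were stripped from the transcript), and `walk` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed markup, rebuilt from the labels that survive on the slide.
doc = """<data>
  <person id="o555">
    <name>Mary</name>
    <address><street>Maple</street><no>345</no><city>Seattle</city></address>
  </person>
  <person>
    <name>John</name>
    <address>Thailand</address>
    <phone>23456</phone>
  </person>
</data>"""

def walk(elem, depth=0):
    """Yield (depth, kind, label) for element, attribute and text nodes in document order.

    Text that follows a child element (mixed content) is ignored in this sketch.
    """
    yield depth, "element", elem.tag
    for key, value in elem.attrib.items():
        yield depth + 1, "attribute", f"{key}={value}"
    if elem.text and elem.text.strip():
        yield depth + 1, "text", elem.text.strip()
    for child in elem:
        yield from walk(child, depth + 1)

nodes = list(walk(ET.fromstring(doc)))
for depth, kind, label in nodes:
    print("  " * depth + f"{kind}: {label}")
```

Because children are visited in document order, the output reflects the slide's point that order matters in the XML tree.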

15 XML Data
- XML is self-describing
- Schema elements become part of the data
  – Relational schema: persons(name,phone)
  – In XML, <persons>, <name>, <phone> are part of the data, and are repeated many times
- Consequence: XML is much more flexible
Some real data: http://www.cs.washington.edu/research/xmldatasets/

16 Relational Data as XML
Relational table person:
name   phone
John   3634
Sue    6343
Dick   6363
XML: a person element with one row child per tuple; each row has a name child and a phone child whose text nodes hold the tuple's values.
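The table-to-tree mapping on this slide can be sketched in a few lines. The function name `table_to_xml` is an assumption; the element names person, row, name and phone follow the labels in the slide's tree.

```python
import xml.etree.ElementTree as ET

rows = [("John", "3634"), ("Sue", "6343"), ("Dick", "6363")]

def table_to_xml(table_name, columns, rows):
    """Build <table><row><col>value</col>...</row>...</table> from relational tuples."""
    root = ET.Element(table_name)
    for values in rows:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            ET.SubElement(row, column).text = value
    return root

person = table_to_xml("person", ["name", "phone"], rows)
xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)
```

Note how the column names, which appear once in the relational schema, are repeated for every tuple in the XML form.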

17 XML is Semi-structured Data
- Missing attributes: John has a phone (1234), Joe has no phone
- Could represent in a table with nulls:
name   phone
John   1234
Joe    -

18 XML is Semi-structured Data
- Repeated attributes: Mary has two phones (2345 and 3456)
- Impossible in tables: nested collections (non-1NF)
name   phone
Mary   2345   3456   ???
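Missing and repeated attributes both fall out naturally when each person's phones are read as a list. The sketch below assumes person, name and phone as element names (the slide markup was stripped) and combines the examples from this slide and the previous one.

```python
import xml.etree.ElementTree as ET

# Assumed markup: John has one phone, Joe has none, Mary has two.
doc = ET.fromstring("""<persons>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</persons>""")

# findall returns every matching child, so zero, one or many phones need no special casing.
phones = {p.findtext("name"): [ph.text for ph in p.findall("phone")]
          for p in doc.findall("person")}
print(phones)
```

A flat relational table would need a null for Joe and either a second table or a non-1NF column for Mary.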

19 XML is Semi-structured Data
- Attributes with different types in different objects
  - John Smith, 1234: structured name
  - M. Carey, 3456: unstructured name
- Mixed content: an element contains child elements of different kinds
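Code that consumes such data has to cope with the same element having different shapes in different objects. A minimal sketch, assuming the structured name is split into hypothetical first and last child elements:

```python
import xml.etree.ElementTree as ET

# Assumed markup for the slide's two persons.
doc = ET.fromstring(
    "<persons>"
    "<person><name><first>John</first><last>Smith</last></name><phone>1234</phone></person>"
    "<person><name>M. Carey</name><phone>3456</phone></person>"
    "</persons>")

def display_name(person):
    """Return a printable name whether <name> is structured or plain text."""
    name = person.find("name")
    first, last = name.findtext("first"), name.findtext("last")
    if first is not None and last is not None:
        return f"{first} {last}"          # structured name
    return (name.text or "").strip()      # unstructured name

names = [display_name(p) for p in doc.findall("person")]
print(names)
```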

20 Data Typing in XML
- Data typing in the relational model: schema
- Data typing in XML
  – Much more complex
  – Typing restricts the valid trees that can occur (theoretical foundation: tree languages)
  – Practical methods: DTD (Document Type Definition), XML Schema

21 Document Type Definitions (DTD)
- Part of the original XML specification
- To be replaced by XML Schema
  – Much more complex
- An XML document may have a DTD
- XML document:
  - well-formed = tags are correctly closed
  - valid = it has a DTD and conforms to it
- Validation is useful in data exchange

22 DTD Example
<!DOCTYPE company [ ]>

23 DTD Example
Example of valid XML document:
123456789 John B432 1234 987654321 Jim B123...

24 DTD: The Content Model
Content model:
– Complex = a regular expression over other elements
– Text-only = #PCDATA
– Empty = EMPTY
– Any = ANY
– Mixed content = (#PCDATA | A | B | C)*

25 DTD: Regular Expressions
- sequence: <!ELEMENT name (firstName, lastName)>
- optional: <!ELEMENT name (firstName?, lastName)>
- Kleene star: <!ELEMENT person (name, phone*)>
- alternation: <!ELEMENT person (name, (phone|email))>
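Since a complex content model is a regular expression over child element names, it can be checked with an ordinary regex engine. The sketch below is an illustration of that idea, not how a real validating parser works: it handles element-only content models (no #PCDATA, EMPTY or ANY), and the function names are assumptions.

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r"[A-Za-z_][\w.-]*")

def content_model_to_regex(model):
    """Translate a DTD element-content model into a regex over 'tag ' tokens."""
    compact = re.sub(r"\s+", "", model)                       # "(name,phone*)"
    # Each element name becomes a group that matches the name plus a separator space.
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group()) + " )", compact)
    # Sequence (",") is plain concatenation in regex syntax; ?, *, + and | carry over.
    return re.compile(pattern.replace(",", ""))

def children_match(elem, model):
    """True if the sequence of child tags of `elem` matches the content model."""
    child_tags = "".join(child.tag + " " for child in elem)
    return content_model_to_regex(model).fullmatch(child_tags) is not None

person = ET.fromstring("<person><name/><phone/><phone/></person>")
print(children_match(person, "(name, phone*)"))          # Kleene star allows two phones
print(children_match(person, "(name, (phone|email))"))   # alternation allows exactly one
```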

26 Attributes in DTDs

27 Attributes in DTDs
<!ATTLIST person age     CDATA  #REQUIRED
                 id      ID     #REQUIRED
                 manager IDREF  #REQUIRED
                 manages IDREFS #REQUIRED >
<person age="25" id="p29432" manager="p48293" manages="p34982 p423234"> .......

28 Attributes in DTDs
Types:
- CDATA = string
- ID = key
- IDREF = foreign key
- IDREFS = foreign keys separated by spaces
- (Monday | Wednesday | Friday) = enumeration
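The key / foreign-key reading of ID, IDREF and IDREFS can be sketched as a referential-integrity check. `xml.etree.ElementTree` does not read DTDs, so the attribute roles are passed in by hand here; the function name, its parameters and the sample ids are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def dangling_references(root, id_attr="id", idref_attrs=("manager",), idrefs_attrs=("manages",)):
    """Return every IDREF/IDREFS value that does not match some element's ID."""
    ids = {e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None}
    dangling = []
    for e in root.iter():
        refs = [e.get(a) for a in idref_attrs if e.get(a) is not None]
        for a in idrefs_attrs:
            refs.extend((e.get(a) or "").split())   # IDREFS: space-separated list
        dangling.extend(r for r in refs if r not in ids)
    return dangling

company = ET.fromstring("""<company>
  <person id="p1" manages="p2 p3"/>
  <person id="p2" manager="p1"/>
  <person id="p3" manager="p9"/>
</company>""")

print(dangling_references(company))
```

A validating parser performs this check itself when the DTD declares the attributes as ID, IDREF and IDREFS.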

29 Attributes in DTDs
Kind:
- #REQUIRED
- #IMPLIED = optional
- value = default value
- #FIXED value = the only value allowed

30 Using DTDs
- Must include in the XML document
- Either include the entire DTD:
  – <!DOCTYPE rootElement [ ....... ]>
- Or include a reference to it:
  – <!DOCTYPE rootElement SYSTEM "URI of the DTD">
- Or mix the two... (e.g. to override the external definition)

31 XML Schema
- DTDs capture grammatical structure, but have some drawbacks:
  - Not themselves in XML, so it is inconvenient to build tools
  - Don’t capture database datatypes’ domains
  - No way of defining OO-like inheritance…
- XML Schema addresses the shortcomings of DTDs:
  - XML syntax
  - Subclassing
  - Domains and built-in datatypes
  - min. and max. number of occurrences of elements
- http://www.w3.org/XML/Schema

32 Basics of XML Schema
- Need to use the XML Schema namespace (generally named xsd)
- simpleTypes are a way of restricting domains on scalars
  - Can define a simpleType based on integer, with values within a particular range
- complexTypes are a way of defining element structures
  - Basically equivalent to !ELEMENT, but more powerful
  - Specify sequence, choice between child elements
  - Specify minOccurs and maxOccurs (default 1)
- Must associate an element/attribute with a simpleType, or an element with a complexType
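Because an XML Schema is itself XML, ordinary XML tools can read it. The sketch below embeds a small hypothetical schema (the `ageType` range and the person structure are assumptions, not the deck's own example) and inspects it with the standard library. It does not validate documents, since `xml.etree.ElementTree` has no schema validator.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
ns = {"xsd": XSD_NS}

# Hypothetical schema: an integer simpleType restricted to a range, and a
# complexType with a sequence of children and occurrence bounds.
schema = ET.fromstring(f"""
<xsd:schema xmlns:xsd="{XSD_NS}">
  <xsd:simpleType name="ageType">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="0"/>
      <xsd:maxInclusive value="150"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:element name="person">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="name" type="xsd:string"/>
        <xsd:element name="phone" type="xsd:string" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
      <xsd:attribute name="age" type="ageType"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>""")

def occurrence_bounds(schema, element_name):
    """Return (minOccurs, maxOccurs) for a child element, applying the default of 1."""
    decl = schema.find(f".//xsd:sequence/xsd:element[@name='{element_name}']", ns)
    return decl.get("minOccurs", "1"), decl.get("maxOccurs", "1")

restriction = schema.find("xsd:simpleType[@name='ageType']/xsd:restriction", ns)
facets = {child.tag.split("}")[1]: child.get("value") for child in restriction}

print(occurrence_bounds(schema, "name"), occurrence_bounds(schema, "phone"), facets)
```

The phone declaration plays the role of `phone*` in a DTD content model, while the range restriction has no DTD counterpart.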

33 Simple Schema Example

34 Questions

35 How the Web was Yesterday
- HTML documents
  - often generated by applications
  - consumed by humans only
  - easy access: across platforms, across organizations
- No application interoperability:
  - HTML not understood by applications
  - Database technology: client-server

36 Application Interoperability
[Figure: Amazon and Supplier1, Supplier2, Supplier3 exchange a purchase order over the Internet]

37 Semi-structured data
Structure may be:
- irregular
- implicit
- partial
- unknown

38 Examples
- Bibtex file
- Web data
- Integrated data sources
The integration of structured data sources can result in semi-structured data.

39 New Universal Data Exchange Format: XML
A recommendation from the W3C
- XML = data
- XML generated by applications
- XML consumed by applications
- Easy access: across platforms, organizations

40 XML
- A W3C standard to complement HTML
- Origins:
  - Structured text: SGML
  - Large-scale electronic publishing
  - Data exchange on the web
- Motivation:
  - HTML describes presentation
  - XML describes content
- http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)

41 Paradigm Shift on the Web
- From documents (HTML) to data (XML)
- From information retrieval to data management
- For databases, also a paradigm shift:
  - from relational model to XML model
  - from data processing to data/query translation
  - from storage to transport

42 Database Issues
- How are we going to model XML? (graphs) Compared to the relational model:
  - XML is hierarchical
  - XML allows missing or additional attributes
  - XML allows multiple instances of an attribute (set-valued)
  - XML allows different types in different objects
  - XML integrates structure and text data
  - …
- How are we going to query XML? (XQuery)
- How are we going to store XML? (in a relational database? object-oriented? native?)
- How are we going to process XML efficiently? (many interesting research questions!)

43 Designing an XML Schema/DTD
- Not as formalized as relational data design
  - We can still use ER diagrams to break the data into entity and relationship sets
  - Note that often we already have our data in relations and need to design the XML schema to export them!
- Generally orient the XML tree around the “central” objects
- Big decision: element vs. attribute
  - Element if it has its own properties, or if you *might* have more than one of them
  - Attribute if it is a single property – or perhaps not!

