Managing XML and Semistructured Data


1 Managing XML and Semistructured Data
Part 2: Modelling XML Data

2 In this section…
More XML syntax [XML glossary by Sun] [XML Tutorials]
XML DTD and XML Schema
XML Query data model
Comparison of XML with semistructured data
Papers:
XML, Java, and the future of the Web, by Jon Bosak, Sun Microsystems.
W3C XML Query Data Model, Mary Fernandez, Jonathan Robie.
Extracting Schema from Semistructured Data, Nestorov, Abiteboul, Motwani. SIGMOD 98.
Data on the Web, Abiteboul, Buneman, Suciu: Section 3.3

3 More XML Syntax: Attributes
<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  <year> 1995 </year>
</book>
Attributes are an alternative way to represent data (single-valued, unordered).
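As a minimal sketch of how a parser exposes this, the following Python uses the standard library's xml.etree.ElementTree (a choice made for this illustration, not something the slides prescribe). Attributes come back as a name-to-value mapping, while child elements come back as an ordered list.

```python
import xml.etree.ElementTree as ET

# The book element from the slide, written with straight quotes.
BOOK_XML = """
<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  <year> 1995 </year>
</book>
"""

book = ET.fromstring(BOOK_XML)

# Attributes: one value per name, accessed by name rather than by position.
attributes = dict(book.attrib)

# Child elements: an ordered sequence; strip() removes the padding spaces.
children = [(child.tag, child.text.strip()) for child in book]

print(attributes)
print(children)
```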

4 More XML: Oids and References
<person id="o555"> <name> Jane </name> </person>
<person id="o456"> <name> Mary </name> <children idrefs="o123 o555"/> </person>
<person id="o123" mother="o456"> <name> John </name> </person>
Oids and references in XML are just syntax (ID, IDREF).
The value of an IDREF attribute must match the value of some ID attribute in the document.
The value of an IDREFS attribute can contain several references to elements with an ID attribute, separated by whitespace.
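Because ID and IDREF are only syntax, an application has to follow the references itself. The sketch below shows one way to do that in Python with xml.etree.ElementTree. The wrapping <people> root and the helper name_of are assumptions added for the example, and the parser here does not know which attributes are declared as ID or IDREFS, so the code names them explicitly.

```python
import xml.etree.ElementTree as ET

# The three persons from the slide, wrapped in a hypothetical <people> root
# so that the text forms a single well-formed document.
PEOPLE_XML = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idrefs="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

root = ET.fromstring(PEOPLE_XML)

# Index every person by its id attribute (the "oid").
by_id = {person.get("id"): person for person in root.iter("person")}


def name_of(oid):
    """Follow a reference and return the name of the person it points to."""
    return by_id[oid].findtext("name")


# An IDREFS-style value is a whitespace-separated list of ids.
child_ids = by_id["o456"].find("children").get("idrefs").split()
child_names = [name_of(oid) for oid in child_ids]

# An IDREF-style value is a single id.
mother_name = name_of(by_id["o123"].get("mother"))

print(child_names, mother_name)
```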

5 XML Semantics: a Tree!
<data>
  <person id="o555">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address> Thailand </address>
    <phone> 23456 </phone>
  </person>
</data>
[Figure: the document drawn as a tree. Element nodes: data, person, name, address, street, no, city, phone. Attribute node: id (o555). Text nodes: Mary, Maple, 345, Seattle, John, Thailand, 23456.]
Order matters!
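A small illustration of "order matters", again as a sketch with xml.etree.ElementTree: two documents that differ only in attribute order have equal attribute sets, while two documents that differ in child order are different trees. The age attribute and the short address value are invented for the example.

```python
import xml.etree.ElementTree as ET

# Same attributes and same children, but both written in a different order.
first = ET.fromstring(
    '<person id="o555" age="30"><name>Mary</name><address>Maple</address></person>'
)
second = ET.fromstring(
    '<person age="30" id="o555"><address>Maple</address><name>Mary</name></person>'
)


def child_tags(element):
    """Return the tags of the direct children in document order."""
    return [child.tag for child in element]


# Attributes are unordered: the two mappings compare equal.
same_attributes = first.attrib == second.attrib

# Children are ordered: the two sequences differ.
same_child_order = child_tags(first) == child_tags(second)

print(same_attributes, same_child_order)
```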

6 More XML: CDATA Section
Syntax: <![CDATA[ .....any text here...]]> Example: <![CDATA[ <slide>..A sample slide..</slide> ]]> which displays as: <slide>..A sample slide.. </slide>
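To see what a CDATA section means to a parser, the sketch below wraps the slide's example in a hypothetical <note> element and reads it with xml.etree.ElementTree. The markup inside the section arrives as plain character data, not as a child element.

```python
import xml.etree.ElementTree as ET

# The CDATA example from the slide inside a hypothetical <note> element.
doc = ET.fromstring("<note><![CDATA[ <slide>..A sample slide..</slide> ]]></note>")

# The content is text: <note> has no child elements at all.
text = doc.text
child_count = len(doc)

# When serialized again, the same text is written with escaped angle brackets.
serialized = ET.tostring(doc, encoding="unicode")

print(repr(text), child_count)
print(serialized)
```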

7 More XML: Entity References
Entity references replace characters that are illegal in XML text (escape characters).
Syntax: &entityname; (a form of macro)
Example (what happens if we simply write < ?):
<element> this is less than &lt; </element>
Some entities:
&lt; is <
&gt; is >
&amp; is &
&apos; is '
&quot; is "
&#38; is a numeric reference to a Unicode character (here &)
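The question on the slide can be answered by trying it. The sketch below uses escape from xml.sax.saxutils together with xml.etree.ElementTree, both from the Python standard library. The sample sentence is an assumption made for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "this is less than < and this is an ampersand &"

# escape() rewrites &, < and > as &amp;, &lt; and &gt;.
escaped = escape(raw)

# The escaped text parses, and the parser hands back the original characters.
round_trip = ET.fromstring(f"<element>{escaped}</element>").text

# The unescaped text is not well-formed XML, so parsing it fails.
try:
    ET.fromstring(f"<element>{raw}</element>")
    raw_parses = True
except ET.ParseError:
    raw_parses = False

print(escaped)
print(round_trip == raw, raw_parses)
```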

8 More XML: Processing Instructions
Syntax: <?target argument?>
Example 1:
<product>
  <name> Alarm Clock </name>
  <?ringBell 20?>
  <price> </price>
</product>
Here ringBell is the target application and 20 is the data for processing.
Example 2:
<?wilfred.lecture.Program QUERY="MSc,PhD,all"?>
<slide type="all"> <title>COMP630H</title> </slide>
Note: <?xml version="1.0"?> is the XML declaration, not a PI.
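Processing instructions are visible to an application through the DOM. As a sketch, the following reads Example 1 with xml.dom.minidom from the Python standard library and collects each instruction's target and data.

```python
from xml.dom import Node, minidom

# Example 1 from the slide.
PRODUCT_XML = (
    "<product><name> Alarm Clock </name><?ringBell 20?><price> </price></product>"
)

document = minidom.parseString(PRODUCT_XML)

# Collect (target, data) for every processing instruction under <product>.
instructions = [
    (node.target, node.data)
    for node in document.documentElement.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```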

9 More XML: Comments
Syntax: <!-- ... comment text ... -->
Yes, they are part of the data model!

10 XML Namespaces http://www.w3.org/TR/REC-xml-names (1/99)
name ::= [prefix:]localpart
<book xmlns:isbn="…">
  <title> … </title>
  <number> 15 </number>
  <isbn:number> …. </isbn:number>
</book>

11 XML Namespaces
syntactic: <number>, <isbn:number>
semantic: provide a URL for the schema
a namespace declaration applies within the content of the element that declares it
multiple namespace prefixes can be declared
<tag xmlns:mystyle="…">
  <mystyle:title> … </mystyle:title>
  <mystyle:number> … </mystyle:number>
</tag>
(mystyle:title and mystyle:number belong to this namespace)
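A short sketch of how a namespace-aware parser tells <number> and <isbn:number> apart, using xml.etree.ElementTree. The namespace URI and the element values are placeholders invented for the example, since the slides do not show them.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for this illustration.
ISBN_NS = "http://example.org/isbn"

doc = ET.fromstring(
    f'<book xmlns:isbn="{ISBN_NS}">'
    "<title>Some Title</title>"
    "<number>15</number>"
    "<isbn:number>placeholder-isbn</isbn:number>"
    "</book>"
)

# ElementTree expands each prefix to its URI: {uri}localpart.
tags = [child.tag for child in doc]

# The unqualified and the qualified number are two different names.
plain_number = doc.find("number").text
isbn_number = doc.find("isbn:number", {"isbn": ISBN_NS}).text

print(tags)
print(plain_number, isbn_number)
```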

12 XML Data Models
Several competing models:
Document Object Model (DOM) (2/2001):
  class hierarchy (node, element, attribute, …)
  objects have behavior
  defines an API to inspect/modify the document
XPath data model
XML Query data model
Infoset (the set of information items of an XML document)
PSV (post-schema validation)

13 XML Data vs. E/R, ODL, Relational
Q: Is XML better or worse?
A: It serves different purposes.
E/R, ODL, relational models: for centralized processing, when we control the data.
XML: for data sharing between different systems, when we do not have control over the entire data (e.g. on the Web).
Data-centric vs. document-centric documents.
Do NOT use XML to model your data! Use E/R, ODL, or relational instead. Use XML to exchange data.

14 XLink
Generalizes HTML's href.
Many types: simple, extended, locator, ...
We discuss only simple links. A simple link associates exactly two resources, one local and one remote, with an arc going from the former to the latter. Thus, a simple link is always an outbound link.
<person xmlns:xlink="…"
        xlink:type="simple"
        xlink:href="…"
        xlink:title="The Homepage"
        xlink:show="replace"
        xlink:actuate="onRequest">
</person>
[Slide annotations: required attributes; optional attributes]

15 XLink
show attribute (specifies the desired presentation) can be:
  "new" (new window)
  "replace" (same window)
  "embed"
  "other"
actuate attribute (specifies the desired timing of traversal) can be:
  "onLoad" (immediate loading)
  "onRequest" (post-loading, event triggered)
  "none"

16 XLink
href attribute: a URI or an XPointer (next)
More about XLink can be found in: […]

17 XPointer
An extension of XPath (next week)
Usage: href="…"
An XPointer expression points to:
  a point
  a range
Reference: […]

18 XPointer
Pointing to a point (= XML element or character):
  Full form: e.g. #xpointer(id("3652"))
  Bare name: e.g. #3652
  Child sequence: e.g. #xpointer( /1/3/2/5), #xpointer( /bib/book[3])
Pointing to a range: e.g. #xpointer(id(3652 to 44))
Most interesting examples use XPath.

19 XML vs. Semistructured Data
SSD integrates heterogeneous sources with non-rigid structure, e.g. biological data, Web data.
{ lecture: { title: "XML",
             date: "1-Jan-2005",
             instructor: { name: "Wilfred", department: "CS" } } }
Both are best described by a graph.
Both are schema-less and self-describing.

20 Similarities and Differences
XML:
<person id="o123">
  <name> Alan </name>
  <age> 42 </age>
  < > </ >
</person>
SSD:
{ person: &o123 { name: "Alan", age: 42, } }
XML:
<person father="o123"> … </person>
SSD:
{ person: { father: &o123 … } }
[Figure: a person node with children name (Alan) and age (42), and a father edge pointing to it]
Similar on trees, different on graphs.
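The tree case can be made concrete with a small sketch: a nested Python dict stands in for a semistructured value (the lecture example from the previous slide), and a hypothetical helper to_element turns each label into an element. This covers trees only. Shared nodes such as &o123 would still need ID/IDREF attributes, which is where the two models differ.

```python
import xml.etree.ElementTree as ET

# A semistructured value written as a nested Python dict (labels are the keys).
lecture = {
    "lecture": {
        "title": "XML",
        "date": "1-Jan-2005",
        "instructor": {"name": "Wilfred", "department": "CS"},
    }
}


def to_element(label, value):
    """Turn one labelled value into an element; nested dicts become children."""
    element = ET.Element(label)
    if isinstance(value, dict):
        for child_label, child_value in value.items():
            element.append(to_element(child_label, child_value))
    else:
        element.text = str(value)
    return element


# The outermost dict has a single label, which becomes the root element.
((root_label, root_value),) = lecture.items()
xml_text = ET.tostring(to_element(root_label, root_value), encoding="unicode")

print(xml_text)
```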

21 More Differences
XML is ordered, SSD is not.
XML can mix text and elements:
<talk> Teaching XML is horrible <speaker> Wilfred Ng </speaker> </talk>
XML has lots of other stuff: entities, processing instructions, comments!
These differences make XML data management harder.

22 Document Type Definitions (DTD)
Part of the original XML specification.
An XML document may have a DTD.
An XML document is:
  well-formed = if tags are correctly closed
  valid = if it has a DTD and conforms to it
Validation is useful in data exchange.

23 Very Simple DTD
<!DOCTYPE company [
  <!ELEMENT company ((person|product)*)>
  <!ELEMENT person (ssn, name, office, phone?)>
  <!ELEMENT ssn (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT office (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT product (pid, name, description?)>
  <!ELEMENT pid (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>

24 Very Simple DTD
Example of a valid XML document:
<company>
  <person>
    <ssn> </ssn>
    <name> John </name>
    <office> B432 </office>
    <phone> 1234 </phone>
  </person>
  <person>
    <ssn> </ssn>
    <name> Jim </name>
    <office> B123 </office>
  </person>
  <product> ... </product>
  ...
</company>

25 DTD: The Content Model
<!ELEMENT tag (CONTENT)>
Content model:
  Complex = a regular expression over other elements
  Text-only = #PCDATA
  Empty = EMPTY
  Any = ANY
  Mixed content = (#PCDATA | A | B | C)*

26 DTD: Regular Expressions
sequence:
<!ELEMENT name (firstName, lastName)>
<name> <firstName> </firstName> <lastName> </lastName> </name>
optional:
<!ELEMENT name (firstName?, lastName)>
Kleene star:
<!ELEMENT person (name, phone*)>
<person> <name> </name> <phone> </phone> </person>
alternation:
<!ELEMENT person (name, (phone| ))>
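Since a content model is a regular expression over child element names, it can be checked with an ordinary regex engine. The sketch below hand-translates two of the models above into Python regular expressions over a space-terminated list of child tags. The dictionary MODELS and the helper matches_model are names invented for the example, not part of any DTD tool.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of two content models; every child tag is followed by a space.
MODELS = {
    "name": r"(firstName )?lastName ",  # (firstName?, lastName)
    "person": r"name (phone )*",        # (name, phone*)
}


def matches_model(element):
    """Check the element's direct children against the model for its tag."""
    child_sequence = "".join(child.tag + " " for child in element)
    return re.fullmatch(MODELS[element.tag], child_sequence) is not None


two_phones = ET.fromstring("<person><name/><phone/><phone/></person>")
wrong_order = ET.fromstring("<person><phone/><name/></person>")
last_only = ET.fromstring("<name><lastName/></name>")
two_first = ET.fromstring("<name><firstName/><firstName/><lastName/></name>")

results = [matches_model(e) for e in (two_phones, wrong_order, last_only, two_first)]
print(results)
```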

27 Attributes in DTDs
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age CDATA #REQUIRED>
<person age="25">
  <name> .... </name>
  ...
</person>

28 Attributes in DTDs
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age     CDATA  #REQUIRED
                 id      ID     #REQUIRED
                 manager IDREF  #REQUIRED
                 manages IDREFS #REQUIRED>
<person age="25" id="p29432" manager="p48293" manages="p34982 p423234">
  <name> .... </name>
  ...
</person>

29 Attributes in DTDs
Types:
  CDATA = string
  ID = key
  IDREF = foreign key
  IDREFS = foreign keys separated by space
  (Monday | Wednesday | Friday) = enumeration
  NMTOKEN = must be a valid XML name
  NMTOKENS = multiple valid XML names
  ENTITY = you don't want to know this
Kind:
  #REQUIRED
  #IMPLIED = optional
  value = default value
  value #FIXED = the only value allowed

30 Using DTDs
The DTD must be included in the XML document.
Either include the entire DTD: <!DOCTYPE rootElement [ ]>
Or include a reference to it: <!DOCTYPE rootElement SYSTEM "…">
Or mix the two... (e.g. to override the external definition)

31 DTDs as Grammars
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>
<paper>
  <section> <text> </text> </section>
  <section>
    <title> </title>
    <section> … </section>
    <section> … </section>
  </section>
</paper>
A DTD = a grammar
A valid XML document = a parse tree for that grammar
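Reading the DTD as a grammar suggests a recursive check: each element must match the production for its tag, and so must all of its children. The following is a minimal sketch of that idea in Python for the paper DTD only. GRAMMAR and is_valid are illustrative names, the productions are translated by hand, and this is not a general DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# One production per element; None marks #PCDATA (text only, no children).
GRAMMAR = {
    "paper": r"(section )*",
    "section": r"(title (section )*|text )",
    "title": None,
    "text": None,
}


def is_valid(element):
    """Return True if the subtree is a parse tree for GRAMMAR."""
    if element.tag not in GRAMMAR:
        return False
    production = GRAMMAR[element.tag]
    if production is None:
        return len(element) == 0
    child_sequence = "".join(child.tag + " " for child in element)
    if re.fullmatch(production, child_sequence) is None:
        return False
    return all(is_valid(child) for child in element)


good = ET.fromstring(
    "<paper><section><text>a</text></section>"
    "<section><title>t</title><section><text>b</text></section></section></paper>"
)
bad = ET.fromstring("<paper><section><text>a</text><title>t</title></section></paper>")

print(is_valid(good), is_valid(bad))
```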

32 DTDs as Schemas
Not so well suited:
  impose unwanted constraints on order: <!ELEMENT person (name,phone)>
  references cannot be constrained
  can be too vague: <!ELEMENT person ((name|phone| )*)>

33 XML Schemas
generalizes DTDs
uses XML syntax
two documents: structure and datatypes
Topics: Elements vs. Types; Regular expressions; Expressive power
XML-Schema is very complex:
  often criticized
  some alternative proposals

34 XML Schemas
<xs:element name="paper" type="papertype"/>
<xs:complexType name="papertype">
  <xs:sequence>
    <xs:element name="title" type="xs:string"/>
    <xs:element name="author" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="year"/>
    <xs:choice>
      <xs:element name="journal"/>
      <xs:element name="conference"/>
    </xs:choice>
  </xs:sequence>
</xs:complexType>
DTD: <!ELEMENT paper (title,author*,year, (journal|conference))>

35 Elements vs. Types in XML Schema
With an anonymous (inline) type:
<xs:element name="person">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="address" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
With a named type:
<xs:element name="person" type="ttt"/>
<xs:complexType name="ttt">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
    <xs:element name="address" type="xs:string"/>
  </xs:sequence>
</xs:complexType>
DTD: <!ELEMENT person (name,address)>

36 Elements vs. Types in XML Schema
Simple types (integers, strings, ...)
Complex types (regular expressions, like in DTDs)
Element-type-element alternation:
  The root element has a complex type.
  That type is a regular expression of elements.
  Those elements have their complex types...
  ...
  On the leaf nodes we have simple types.

37 Simple Types
string, token
byte, unsignedByte
integer, positiveInteger
int (32-bit; a restriction of integer, which is unbounded), unsignedInt
long, short
...
time, dateTime, duration, date
ID, IDREF, IDREFS

38 Facets of Simple Types
Facets = additional properties restricting a simple type.
XML Schema 1.0 defines 12 constraining facets:
  length, minLength, maxLength
  pattern
  enumeration
  whiteSpace
  maxInclusive, maxExclusive, minInclusive, minExclusive
  totalDigits, fractionDigits

39 Facets of Simple Types Can further restrict a simple type by changing some facets Restriction = subset

40 Not so Simple Types
List types:
<xs:simpleType name="listOfMyIntType">
  <xs:list itemType="myInteger"/>
</xs:simpleType>
<listOfMyInt> </listOfMyInt>
Union types
Restriction types

41 Local and Global Types in XML Schema
Local type:
<xs:element name="person">
  [define locally the person's type]
</xs:element>
Global type:
<xs:element name="person" type="ttt"/>
<xs:complexType name="ttt">
  [define here the type ttt]
</xs:complexType>
Global types can be reused in other elements.

42 Local vs. Global Elements in XML Schema
Local element:
<xs:complexType name="ttt">
  <xs:sequence>
    <xs:element name="address" type="..."/>
  </xs:sequence>
</xs:complexType>
Global element:
<xs:element name="address" type="ttt"/>
<xs:complexType name="ttt">
  <xs:sequence>
    <xs:element ref="address"/>
  </xs:sequence>
</xs:complexType>
Global elements: like in DTDs.

43 Regular Expressions in XML Schema
Recall the element-type-element alternation:
<xs:complexType name="....">
  [regular expression on elements]
</xs:complexType>
Regular expressions:
<xs:sequence> A B C </...> = A B C
<xs:choice> A B C </...> = A | B | C
<xs:group> A B C </...> = (A B C)
<xs:... minOccurs="0" maxOccurs="unbounded"> .. </...> = (...)*
<xs:... minOccurs="0" maxOccurs="1"> .. </...> = (...)?
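The last two rows generalize: any minOccurs/maxOccurs pair corresponds to a regular-expression quantifier. The helper below, occurrence_suffix, is a hypothetical name used only to sketch that mapping in Python. It relies on the XML Schema defaults, where both minOccurs and maxOccurs are 1 when omitted.

```python
def occurrence_suffix(min_occurs=1, max_occurs=1):
    """Map a minOccurs/maxOccurs pair to the equivalent regex quantifier."""
    if max_occurs == "unbounded":
        if min_occurs == 0:
            return "*"
        if min_occurs == 1:
            return "+"
        return "{%d,}" % min_occurs
    if (min_occurs, max_occurs) == (1, 1):
        return ""  # exactly once: no quantifier needed
    if (min_occurs, max_occurs) == (0, 1):
        return "?"
    return "{%d,%d}" % (min_occurs, max_occurs)


# The two rows from the slide, plus the default and a bounded range.
star = occurrence_suffix(0, "unbounded")
optional = occurrence_suffix(0, 1)
default = occurrence_suffix()
bounded = occurrence_suffix(2, 5)

print(repr(star), repr(optional), repr(default), repr(bounded))
```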

44 Local Names in XML-Schema
<xs:element name="person">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="firstname" type="xs:string"/>
            <xs:element name="lastname" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
<xs:element name="product">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
name has different meanings in person and in product.

45 Subtle Use of Local Names
<xs:element name="A" type="oneB"/>
<xs:complexType name="onlyAs">
  <xs:choice>
    <xs:sequence>
      <xs:element name="A" type="onlyAs"/>
      <xs:element name="A" type="onlyAs"/>
    </xs:sequence>
    <xs:element name="A" type="xs:string"/>
  </xs:choice>
</xs:complexType>
<xs:complexType name="oneB">
  <xs:choice>
    <xs:element name="B" type="xs:string"/>
    <xs:sequence>
      <xs:element name="A" type="onlyAs"/>
      <xs:element name="A" type="oneB"/>
    </xs:sequence>
    <xs:sequence>
      <xs:element name="A" type="oneB"/>
      <xs:element name="A" type="onlyAs"/>
    </xs:sequence>
  </xs:choice>
</xs:complexType>
Arbitrarily deep binary tree with A elements, and a single B element.
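The point of this schema is that the same element name A carries two different types depending on context. One way to convince oneself of what it accepts is to hand-translate the two types into mutually recursive checks. The Python functions is_only_as and is_one_b below are such a sketch, written for this example and not produced by a schema processor. They ignore text content and only look at element structure.

```python
import xml.etree.ElementTree as ET


def is_only_as(element):
    """Type onlyAs: either two A children of type onlyAs, or one string-typed A."""
    tags = [child.tag for child in element]
    if tags == ["A"]:
        return len(element[0]) == 0  # the xs:string A has no element children
    if tags == ["A", "A"]:
        return all(is_only_as(child) for child in element)
    return False


def is_one_b(element):
    """Type oneB: a single B, or two A children of which exactly one is oneB."""
    tags = [child.tag for child in element]
    if tags == ["B"]:
        return len(element[0]) == 0  # the xs:string B has no element children
    if tags == ["A", "A"]:
        left, right = element
        return (is_only_as(left) and is_one_b(right)) or (
            is_one_b(left) and is_only_as(right)
        )
    return False


one_b = ET.fromstring("<A><A><A>x</A></A><A><B>y</B></A></A>")
two_bs = ET.fromstring("<A><A><B>y</B></A><A><B>z</B></A></A>")
no_b = ET.fromstring("<A><A><A>x</A></A><A><A>y</A></A></A>")

print(is_one_b(one_b), is_one_b(two_bs), is_one_b(no_b))
```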

46 Attributes in XML Schema
<xs:element name="paper" type="papertype"/>
<xs:complexType name="papertype">
  <xs:sequence>
    <xs:element name="title" type="xs:string"/>
  </xs:sequence>
  <xs:attribute name="language" type="xs:NMTOKEN" fixed="English"/>
</xs:complexType>
Attributes are associated with the type, not with the element.
Only with complex types; more trouble if we want to add attributes to simple types.

47 “Mixed” Content, “Any” Type
<xs:complexType mixed="true">
Better than in DTDs: can still enforce the type, but now may have text between any elements.
<xs:element name="anything" type="xs:anyType"/>
Means anything is permitted there.

48 "All" Group
<xs:complexType name="PurchaseOrderType">
  <xs:all>
    <xs:element name="shipTo" type="USAddress"/>
    <xs:element name="billTo" type="USAddress"/>
    <xs:element ref="comment" minOccurs="0"/>
    <xs:element name="items" type="Items"/>
  </xs:all>
  <xs:attribute name="orderDate" type="xs:date"/>
</xs:complexType>
A restricted form of & in SGML.
Restrictions:
  Only at top level
  Has only elements
  Each element occurs at most once
E.g. "comment" occurs 0 or 1 times.

49 Derived Types by Extensions
<complexType name="Address">
  <sequence>
    <element name="street" type="string"/>
    <element name="city" type="string"/>
  </sequence>
</complexType>
<complexType name="USAddress">
  <complexContent>
    <extension base="ipo:Address">
      <sequence>
        <element name="state" type="ipo:USState"/>
        <element name="zip" type="positiveInteger"/>
      </sequence>
    </extension>
  </complexContent>
</complexType>
Corresponds to inheritance.

50 Derived Types by Restrictions
<complexContent>
  <restriction base="ipo:Items">
    … [rewrite the entire content, with restrictions (*)]
  </restriction>
</complexContent>
(*): may restrict cardinalities, e.g. (0,infty) to (1,1); may restrict choices; other restrictions…
Corresponds to set inclusion.

