Managing XML and Semistructured Data

Managing XML and Semistructured Data Part 2: Modelling XML Data

In this section…
  More XML syntax [XML glossary – by Sun] [XML Tutorials]
  XML DTD and XML Schema
  XML Query data model
  Comparison of XML with semistructured data
Papers:
  XML, Java, and the future of the Web, by Jon Bosak, Sun Microsystems.
  W3C XML Query Data Model, Mary Fernandez, Jonathan Robie.
  Extracting Schema from Semistructured Data, Nestorov, Abiteboul, Motwani. SIGMOD 98
  Data on the Web, Abiteboul, Buneman, Suciu: Section 3.3

More XML Syntax: Attributes
  <book price="55" currency="USD">
    <title> Foundations of Databases </title>
    <author> Abiteboul </author>
    …
    <year> 1995 </year>
  </book>
Attributes are an alternative way to represent data (single-valued, unordered).

More XML: Oids and References
  <person id="o555"> <name> Jane </name> </person>
  <person id="o456"> <name> Mary </name> <children idrefs="o123 o555"/> </person>
  <person id="o123" mother="o456"> <name> John </name> </person>
Oids and references in XML are just syntax (ID, IDREF).
The value of an IDREF attribute must match the value of some ID attribute in the document.
The value of an IDREFS attribute can contain several references to elements with an ID attribute, separated by whitespace.
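The ID/IDREF mechanism above can be sketched in a few lines of Python. This is only an illustration: the `<people>` wrapper element and the helper names `index_by_id` and `resolve` are assumptions introduced here, and the standard-library parser does not itself know which attributes are of type ID (that would come from a DTD), so the attribute name `id` is hard-coded.

```python
import xml.etree.ElementTree as ET

# Sample data modelled on the slide; the <people> root is an assumption added
# so that the fragment is a single well-formed document.
DOC = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idrefs="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

def index_by_id(root):
    """Map every id attribute value to the element that carries it."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}

def resolve(ids, refs):
    """Resolve a whitespace-separated reference list.

    Raises KeyError for a dangling reference, mirroring the rule that an
    IDREF must match some ID in the document.
    """
    return [ids[ref] for ref in refs.split()]

root = ET.fromstring(DOC)
ids = index_by_id(root)

# IDREFS: several references in one attribute value.
children = resolve(ids, ids["o456"].find("children").get("idrefs"))
child_names = [child.findtext("name") for child in children]

# IDREF: a single reference.
mother_of_john = resolve(ids, ids["o123"].get("mother"))[0].findtext("name")
```

The references are followed by an explicit dictionary lookup, which is the sense in which they are "just syntax": the tree itself contains only attribute strings.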

XML Semantics: a Tree!
  <data>
    <person id="o555">
      <name> Mary </name>
      <address> <street> Maple </street> <no> 345 </no> <city> Seattle </city> </address>
    </person>
    <person>
      <name> John </name>
      <address> Thailand </address>
      <phone> 23456 </phone>
    </person>
  </data>
[Figure: the document drawn as a tree. The root element node data has two person children; the first carries an attribute node id (o555) and the element nodes name (Mary) and address (street Maple, no 345, city Seattle); the second has name (John), address (Thailand) and phone (23456). Legend: element node, text node, attribute node.]
Order matters!!!
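To make the tree view concrete, the following sketch parses the slide's document with Python's standard library and prints an indented outline in document order. The helper name `outline` and the `@name=value` notation for attribute nodes are assumptions chosen for this illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<data>
  <person id="o555">
    <name>Mary</name>
    <address><street>Maple</street><no>345</no><city>Seattle</city></address>
  </person>
  <person>
    <name>John</name>
    <address>Thailand</address>
    <phone>23456</phone>
  </person>
</data>"""

def outline(el, depth=0):
    """Return one line per element node, children listed in document order."""
    text = (el.text or "").strip()
    attrs = "".join(f" @{name}={value}" for name, value in el.attrib.items())
    lines = ["  " * depth + el.tag + attrs + (f": {text}" if text else "")]
    for child in el:  # iteration preserves the order of the children
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(DOC))
print("\n".join(tree_lines))
```

Swapping the two `person` elements in `DOC` changes `tree_lines`, which is what "order matters" means for the data model.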

More XML: CDATA Section Syntax: <![CDATA[ .....any text here...]]> Example: <![CDATA[ <slide>..A sample slide..</slide> ]]> which displays as: <slide>..A sample slide.. </slide>

More XML: Entity References
Entity references replace characters that are illegal in XML text (escape characters).
Syntax: &entityname; (a form of macro)
Example (what happens if we simply use < ?):
  <element> this is less than &lt; </element>
Some entities:
  &lt;    <
  &gt;    >
  &amp;   &
  &apos;  '
  &quot;  "
  &#38;   Unicode char (numeric character reference)
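A short Python sketch of the question on the slide (what happens if we simply use `<`): the unescaped text is rejected by the parser, while the escaped form round-trips. It uses `xml.sax.saxutils.escape`, which by default replaces `&`, `<` and `>`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "this is less than <"

# Embedding the raw text yields a document that is not well-formed.
try:
    ET.fromstring(f"<element>{raw}</element>")
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# Escaping replaces the illegal character with an entity reference.
escaped = escape(raw)
parsed = ET.fromstring(f"<element>{escaped}</element>")

print(raw_parses)   # the unescaped document is rejected
print(escaped)      # text with the entity reference in place of <
print(parsed.text)  # the parser expands the reference again
```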

More XML: Processing Instructions
Syntax: <?target argument?>  (target = target application, argument = data for processing)
Example 1:
  <product>
    <name> Alarm Clock </name>
    <?ringBell 20?>
    <price> 19.99 </price>
  </product>
Example 2:
  <?wilfred.lecture.Program QUERY="MSc,PhD,all"?>
  <slide type="all"> <title>COMP630H</title> </slide>
Note: <?xml version="1.0"?> is not a PI

More XML: Comments Syntax <!-- .... Comment text... --> Yes, they are part of the data model !!!

XML Namespaces http://www.w3.org/TR/REC-xml-names (1/99) name ::= [prefix:]localpart <book xmlns:isbn=“www.isbn-org.org/def”> <title> … </title> <number> 15 </number> <isbn:number> …. </isbn:number> </book>

XML Namespaces
syntactic: <number>, <isbn:number>
semantic: provide URL for schema
A namespace declaration applies within the content of the specified element; multiple namespace prefixes can be declared.
  <tag xmlns:mystyle="http://…">
    …
    <mystyle:title> … </mystyle:title>   (belongs to this namespace)
    <mystyle:number> …                   (belongs to this namespace)
  </tag>
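As a rough illustration of how a parser treats prefixes, the sketch below parses the book example with Python's ElementTree, which expands each prefixed name to `{namespace-URI}localpart`. The element contents are placeholders, and the prefix `i` used in the lookup is deliberately different from the one in the document to show that only the URI matters.

```python
import xml.etree.ElementTree as ET

NS = "www.isbn-org.org/def"  # namespace name taken from the slide

# Element contents are placeholder values for this sketch.
DOC = f"""<book xmlns:isbn="{NS}">
  <title>some title</title>
  <number>15</number>
  <isbn:number>some-isbn</isbn:number>
</book>"""

root = ET.fromstring(DOC)

# The parser replaces the prefix by the namespace name it is bound to.
tags = [child.tag for child in root]

# Unqualified and qualified <number> are different names.
plain_number = root.find("number").text
isbn_number = root.find("i:number", {"i": NS}).text
```

Here `tags` ends with `{www.isbn-org.org/def}number`, so `<number>` and `<isbn:number>` never collide.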

XML Data Models Several competing models: Document Object Model (DOM): http://www.w3.org/TR/2001/WD-DOM-Level-3-CMLS-20010209/ (2/2001) class hierarchy (node, element, attribute,…) objects have behavior defines API to inspect/modify the document XPath data model XML Query data model Infoset (a set of information items of an XML document) PSV (post schema validation) http://www.w3.org/TR/xml-infoset/
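For a feel of the DOM style (a class hierarchy of nodes plus an API to inspect and modify the document), here is a minimal sketch using Python's `xml.dom.minidom`, a small DOM implementation in the standard library. The book fragment is adapted from the attributes slide; minidom covers only part of the DOM specifications listed above.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<book price="55" currency="USD"><title>Foundations of Databases</title></book>'
)
book = doc.documentElement

# Inspect: attributes and children are reached through node objects.
price = book.getAttribute("price")
title_node = book.getElementsByTagName("title")[0]
title_text = title_node.firstChild.data  # the text node under <title>

# Modify: create nodes through the document and attach them to the tree.
year = doc.createElement("year")
year.appendChild(doc.createTextNode("1995"))
book.appendChild(year)
book.setAttribute("price", "60")

serialized = book.toxml()
```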

XML Data v.s. E/R, ODL, Relational
Q: is XML better or worse?
A: it serves different purposes
  E/R, ODL, relational models: for centralized processing, when we control the data
  XML: data sharing between different systems; we do not have control over the entire data on the Web
Data-centric vs. document-centric documents
Do NOT use XML to model your data! Use E/R, ODL, or relational instead. Use XML to exchange data.

XLink
Generalizes HTML's href
Many types: simple, extended, locator, ...
We discuss only simple links: a simple link associates exactly two resources, one local and one remote, with an arc going from the former to the latter. Thus, a simple link is always an outbound link.
  <person xmlns:xlink="http://www.w3.org/1999/xlink"
          xlink:type="simple"
          xlink:href="http://a.b.c/myhomepage.html"
          xlink:title="The Homepage"
          xlink:show="replace"
          xlink:actuate="onRequest">
    .....
  </person>
[Slide callouts: required attributes, optional attributes]

XLink
show attribute (specifies the desired presentation) can be:
  "new" (new window)
  "replace" (same window)
  "embed"
  "other"
actuate attribute (specifies the desired timing of traversal) can be:
  "onLoad" (immediate loading)
  "onRequest" (post-loading, event triggered)
  "none"

XLink
href attribute: a URI or an XPointer (next)
More about XLink can be found in: [http://www.w3.org/TR/xlink/]

XPointer An extension of XPath (next week) Usage: href=“www.a.b.c/document.xml#xpointerExpr” An XPointer expression points to: A point A range Reference [http://www.w3.org/TR/2001/CR-xptr-20010911/]

XPointer
Pointing to a point (= XML element or character)
  Full form: e.g. #xpointer(id("3652"))
  Bare name: e.g. #3652
  Child sequence: e.g. #xpointer( /1/3/2/5), #xpointer( /bib/book[3])
Pointing to a range: e.g. #xpointer(id(3652 to 44))
Most interesting examples use XPath

XML v.s. Semistructured Data
SSD integrates heterogeneous sources with non-rigid structure, e.g. biological data, Web data
  {lecture: {title: "XML", date: "1-Jan-2005", instructor: {name: "Wilfred", department: "CS"} } }
both are best described by a graph
both are schema-less, self-describing

Similarities and Differences
SSD:
  { person: &o123 { name: "Alan", age: 42, email: "ab@com" } }
  { person: { father: &o123 … } }
XML:
  <person id="o123"> <name> Alan </name> <age> 42 </age> <email> ab@com </email> </person>
  <person father="o123"> … </person>
[Figure: graph with the nodes person, name (Alan), age (42), email (ab@com) and an edge labelled father.]
Similar on trees, different on graphs
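The "similar on trees" half of this slide can be sketched as a direct translation from a tree-shaped SSD value to XML. The sketch models the SSD object as a nested Python `dict`, which is an assumption with two known gaps: a `dict` cannot repeat a label, and object references such as `&o123` are not handled (they are where the two models differ).

```python
import xml.etree.ElementTree as ET

def ssd_to_xml(label, value):
    """Turn one labelled SSD edge into an XML element (tree-shaped data only)."""
    el = ET.Element(label)
    if isinstance(value, dict):
        # Complex value: one child element per outgoing labelled edge.
        for child_label, child_value in value.items():
            el.append(ssd_to_xml(child_label, child_value))
    else:
        # Atomic value: becomes the text content of the element.
        el.text = str(value)
    return el

person = {"name": "Alan", "age": 42, "email": "ab@com"}
xml_text = ET.tostring(ssd_to_xml("person", person), encoding="unicode")
print(xml_text)
```

The XML output fixes an order among `name`, `age` and `email`, whereas the SSD value is unordered.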

More Differences XML is ordered, SSD is not XML can mix text and elements: <talk> Teaching XML is horrible <speaker> Wilfred Ng </speaker> </talk> XML has lots of other stuff: entities, processing instructions, comments ! these differences make XML data management harder
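Mixed content is one place where the extra difficulty shows up in code. In Python's ElementTree, text that precedes the first child is stored in `.text` and text that follows a child is stored in that child's `.tail`. The helper `mixed_parts` and the trailing word "today" are assumptions added for this sketch.

```python
import xml.etree.ElementTree as ET

# The trailing " today" is not on the slide; it is added to show text after a child.
talk = ET.fromstring(
    "<talk>Teaching XML is horrible <speaker>Wilfred Ng</speaker> today</talk>"
)

def mixed_parts(el):
    """List the text runs and child elements of el in document order."""
    parts = []
    if el.text:
        parts.append(("text", el.text))
    for child in el:
        parts.append(("element", child.tag))
        if child.tail:
            parts.append(("text", child.tail))
    return parts

parts = mixed_parts(talk)
print(parts)
```

A label-to-value mapping, as in SSD, has no natural place for the two text runs around `speaker`.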

Document Type Definitions DTD part of the original XML specification an XML document may have a DTD XML document: well-formed = if tags are correctly closed Valid = if it has a DTD and conforms to it validation is useful in data exchange

Very Simple DTD <!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)> ]>

Very Simple DTD
Example of a valid XML document:
  <company>
    <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person>
    <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person>
    <product> ... </product>
    ...
  </company>

DTD: The Content Model
  <!ELEMENT tag (CONTENT)>     (CONTENT is the content model)
Content model:
  Complex = a regular expression over other elements
  Text-only = #PCDATA
  Empty = EMPTY
  Any = ANY
  Mixed content = (#PCDATA | A | B | C)*

DTD: Regular Expressions
sequence:
  DTD: <!ELEMENT name (firstName, lastName)>
  XML: <name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName> </name>
optional:
  DTD: <!ELEMENT name (firstName?, lastName)>
Kleene star:
  DTD: <!ELEMENT person (name, phone*)>
  XML: <person> <name> . . . . . </name> <phone> . . . . . </phone> . . . . . . </person>
alternation:
  DTD: <!ELEMENT person (name, (phone|email))>
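Because a content model is a regular expression over child element names, validity of an element's children can be checked with an ordinary regex engine. The sketch below hand-translates the content models of the company DTD from the earlier slide; the encoding (each child tag followed by a space) and the names `CONTENT_MODELS`, `children_match` and `is_valid` are assumptions of this illustration. It ignores attributes and text, and is not a real DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# DTD content models rewritten as Python regexes over "tag tag tag " strings.
CONTENT_MODELS = {
    "company": r"(person |product )*",     # ((person|product)*)
    "person": r"ssn name office (phone )?",   # (ssn, name, office, phone?)
    "product": r"pid name (description )?",   # (pid, name, description?)
}

def children_match(el):
    """Check the sequence of child tags of el against its content model."""
    model = CONTENT_MODELS.get(el.tag)
    if model is None:
        # Every other element is #PCDATA in this DTD: no child elements.
        return len(el) == 0
    word = "".join(child.tag + " " for child in el)
    return re.fullmatch(model, word) is not None

def is_valid(el):
    """An element is valid if its children match and each child is valid."""
    return children_match(el) and all(is_valid(child) for child in el)

GOOD = ET.fromstring(
    "<company>"
    "<person><ssn>1</ssn><name>John</name><office>B432</office><phone>1234</phone></person>"
    "<person><ssn>2</ssn><name>Jim</name><office>B123</office></person>"
    "<product><pid>7</pid><name>Clock</name></product>"
    "</company>"
)
BAD_ORDER = ET.fromstring(
    "<company><person><name>John</name><ssn>1</ssn><office>B432</office></person></company>"
)

print(is_valid(GOOD), is_valid(BAD_ORDER))
```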

Attributes in DTDs
  <!ELEMENT person (ssn, name, office, phone?)>
  <!ATTLIST person age CDATA #REQUIRED>
  <person age="25"> <name> ....</name> ... </person>

Attributes in DTDs
  <!ELEMENT person (ssn, name, office, phone?)>
  <!ATTLIST person age     CDATA  #REQUIRED
                   id      ID     #REQUIRED
                   manager IDREF  #REQUIRED
                   manages IDREFS #REQUIRED>
  <person age="25" id="p29432" manager="p48293" manages="p34982 p423234"> <name> ....</name> ... </person>

Attributes in DTDs
Types:
  CDATA = string
  ID = key
  IDREF = foreign key
  IDREFS = foreign keys separated by space
  (Monday | Wednesday | Friday) = enumeration
  NMTOKEN = must be a valid XML name
  NMTOKENS = multiple valid XML names
  ENTITY = you don't want to know this
Kind:
  #REQUIRED
  #IMPLIED = optional
  value = default value
  #FIXED value = the only value allowed

Using DTDs Must include in the XML document Either include the entire DTD: <!DOCTYPE rootElement [ ....... ]> Or include a reference to it: <!DOCTYPE rootElement SYSTEM “http://www.mydtd.org”> Or mix the two... (e.g. to override the external definition)

DTDs as Grammars <!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title,section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)> ]> <paper> <section> <text> </text> </section> <section> <title> </title> <section> … </section> <section> … </section> </section> </paper> A DTD = a grammar A valid XML document = a parse tree for that grammar

DTDs as Schemas Not so well suited: impose unwanted constraints on order <!ELEMENT person (name,phone)> references cannot be constrained can be too vague: <!ELEMENT person ((name|phone|email)*)>

XML Schemas
generalizes DTDs
uses XML syntax
two documents: structure and datatypes
  www.w3.org/TR/2001/REC-xmlschema-1-20010502
  www.w3.org/TR/2001/REC-xmlschema-2-20010502
Outline: XML Schemas; Elements v. Types; Regular expressions; Expressive power
XML-Schema is very complex: often criticized, some alternative proposals

XML Schemas
  <xs:element name="paper" type="papertype"/>
  <xs:complexType name="papertype">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="author" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element name="year"/>
      <xs:choice>
        <xs:element name="journal"/>
        <xs:element name="conference"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>
DTD: <!ELEMENT paper (title,author*,year, (journal|conference))>

Elements v.s. Types in XML Schema
Element with an anonymous type:
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="address" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
Element with a named type:
  <xs:element name="person" type="ttt"/>
  <xs:complexType name="ttt">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="address" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
DTD: <!ELEMENT person (name,address)>

Elements v.s. Types in XML Schema Simple types (integers, strings, ...) Complex types (regular expressions, like in DTDs) Element-type-element alternation: Root element has a complex type That type is a regular expression of elements Those elements have their complex types... ... On the leaf nodes we have simple types

Simple Types
string, token, byte, unsignedByte, integer, positiveInteger, int (a 32-bit restriction of integer), unsignedInt, long, short, ..., time, dateTime, duration, date, ID, IDREF, IDREFS

Facets of Simple Types
Facets = additional properties restricting a simple type
15 facets defined by XML Schema
Examples: length, minLength, maxLength, pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive, totalDigits, fractionDigits

Facets of Simple Types Can further restrict a simple type by changing some facets Restriction = subset
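A minimal sketch of "restriction = subset", modelling a handful of string facets as a Python dataclass. The class `StringFacets` and its field names are hypothetical, and Python's `re` syntax stands in for the XML Schema pattern dialect, which differs in details.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class StringFacets:
    """A string type described by a few facets (None means unconstrained)."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    pattern: Optional[str] = None
    enumeration: Optional[Tuple[str, ...]] = None

    def accepts(self, value: str) -> bool:
        if self.min_length is not None and len(value) < self.min_length:
            return False
        if self.max_length is not None and len(value) > self.max_length:
            return False
        if self.pattern is not None and re.fullmatch(self.pattern, value) is None:
            return False
        if self.enumeration is not None and value not in self.enumeration:
            return False
        return True

# A base type, and a restriction of it obtained by adding one more facet.
base = StringFacets(max_length=10)
restricted = StringFacets(max_length=10, pattern=r"\d{5}")

samples = ["98195", "9819", "hello", "a" * 11]
accepted_by_base = [s for s in samples if base.accepts(s)]
accepted_by_restricted = [s for s in samples if restricted.accepts(s)]
```

Every value in `accepted_by_restricted` also appears in `accepted_by_base`: tightening a facet can only shrink the set of legal values.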

Not so Simple Types
List types:
  <xs:simpleType name="listOfMyIntType"> <xs:list itemType="myInteger"/> </xs:simpleType>
  <listOfMyInt>20003 15037 95977 95945</listOfMyInt>
Union types
Restriction types
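In code, a list type amounts to splitting the lexical value on whitespace and parsing each item with the item type, and a union type amounts to trying each member type in turn. The sketch below assumes that `myInteger` behaves like a plain integer (the slide does not define it); the function names are illustrative.

```python
def parse_list(lexical, parse_item=int):
    """Parse a whitespace-separated list value, item by item."""
    return [parse_item(token) for token in lexical.split()]

def parse_union(token, member_parsers=(int, str)):
    """Return the value produced by the first member type that accepts token."""
    for parse in member_parsers:
        try:
            return parse(token)
        except ValueError:
            continue
    raise ValueError(f"{token!r} matches no member type")

values = parse_list("20003 15037 95977 95945")
print(values)
```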

Local and Global Types in XML Schema Local type: <xs:element name=“person”> [define locally the person’s type] </xs:element> Global type: <xs:element name=“person” type=“ttt”/> <xs:complexType name=“ttt”> [define here the type ttt] </xs:complexType> Global types: can be reused in other elements

Local v.s. Global Elements in XML Schema Local element: <xs:complexType name=“ttt”> <xs:sequence> <xs:element name=“address” type=“...”/>... </xs:sequence> </xs:complexType> Global element: <xs:element name=“address” type=“ttt”/> <xs:complexType name=“ttt”> <xs:sequence> <xs:element ref=“address”/> ... </xs:sequence> </xs:complexType> Global elements: like in DTDs

Regular Expressions in XML Schema Recall the element-type-element alternation: <xs:complexType name=“....”> [regular expression on elements] </xs:complexType> Regular expressions: <xs:sequence> A B C </...> = A B C <xs:choice> A B C </...> = A | B | C <xs:group> A B C </...> = (A B C) <xs:... minOccurs=“0” maxOccurs=“unbounded”> ..</...> = (...)* <xs:... minOccurs=“0” maxOccurs=“1”> ..</...> = (...)?
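The correspondence above can be run in reverse: the sketch below reads a schema fragment with Python's ElementTree and prints the DTD-style regular expression it denotes. Since a schema is itself an XML document, no special parser is needed. The function `to_dtd` is hypothetical and handles only `xs:element`, `xs:sequence` and `xs:choice` with the four common cardinalities; any other `minOccurs`/`maxOccurs` pair raises `KeyError`.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="paper" type="papertype"/>
  <xs:complexType name="papertype">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="author" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element name="year"/>
      <xs:choice>
        <xs:element name="journal"/>
        <xs:element name="conference"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

# (minOccurs, maxOccurs) -> DTD occurrence indicator
SUFFIX = {("1", "1"): "", ("0", "1"): "?", ("0", "unbounded"): "*", ("1", "unbounded"): "+"}

def to_dtd(node):
    """Render an xs:element, xs:sequence or xs:choice as a DTD-style expression."""
    local = node.tag.split("}")[1]  # strip the {namespace} part
    if local == "element":
        body = node.get("name")
    else:
        separator = "," if local == "sequence" else "|"
        body = "(" + separator.join(to_dtd(child) for child in node) + ")"
    occurs = (node.get("minOccurs", "1"), node.get("maxOccurs", "1"))
    return body + SUFFIX[occurs]

root = ET.fromstring(SCHEMA)
complex_type = root.find(f"{{{XS}}}complexType")
content_model = to_dtd(complex_type[0])
print(content_model)
```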

Local Names in XML-Schema
  <xs:element name="person">
    <xs:complexType>
      . . . . .
      <xs:element name="name">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="firstname" type="xs:string"/>
            <xs:element name="lastname" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      . . . .
    </xs:complexType>
  </xs:element>
  <xs:element name="product">
    <xs:complexType>
      . . . . .
      <xs:element name="name" type="xs:string"/>
    </xs:complexType>
  </xs:element>
name has different meanings in person and in product

Subtle Use of Local Names
  <xs:element name="A" type="oneB"/>
  <xs:complexType name="onlyAs">
    <xs:choice>
      <xs:sequence>
        <xs:element name="A" type="onlyAs"/>
        <xs:element name="A" type="onlyAs"/>
      </xs:sequence>
      <xs:element name="A" type="xs:string"/>
    </xs:choice>
  </xs:complexType>
  <xs:complexType name="oneB">
    <xs:choice>
      <xs:element name="B" type="xs:string"/>
      <xs:sequence>
        <xs:element name="A" type="onlyAs"/>
        <xs:element name="A" type="oneB"/>
      </xs:sequence>
      <xs:sequence>
        <xs:element name="A" type="oneB"/>
        <xs:element name="A" type="onlyAs"/>
      </xs:sequence>
    </xs:choice>
  </xs:complexType>
Arbitrarily deep binary tree with A elements, and a single B element
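To see what this schema accepts, the two types can be transcribed as two mutually recursive Python predicates, one per complex type. This is a hand translation for illustration (the function names are made up, and text content is not checked), not the output of a schema processor. The point is that the same element name A is checked against different types depending on where it occurs, which a DTD cannot express.

```python
import xml.etree.ElementTree as ET

def is_only_as(el):
    """Type onlyAs: either two A children of type onlyAs, or one A leaf (xs:string)."""
    kids = list(el)
    tags = [kid.tag for kid in kids]
    if tags == ["A"]:
        return len(kids[0]) == 0
    if tags == ["A", "A"]:
        return is_only_as(kids[0]) and is_only_as(kids[1])
    return False

def is_one_b(el):
    """Type oneB: a single B leaf, or two A children of which exactly one is oneB."""
    kids = list(el)
    tags = [kid.tag for kid in kids]
    if tags == ["B"]:
        return len(kids[0]) == 0
    if tags == ["A", "A"]:
        left, right = kids
        return (is_only_as(left) and is_one_b(right)) or (is_one_b(left) and is_only_as(right))
    return False

one_b = ET.fromstring("<A><A><A>x</A></A><A><B>y</B></A></A>")
no_b = ET.fromstring("<A><A><A>x</A></A><A><A>y</A></A></A>")
two_b = ET.fromstring("<A><A><B>x</B></A><A><B>y</B></A></A>")

# The root element A is declared with type oneB.
results = [is_one_b(doc) for doc in (one_b, no_b, two_b)]
```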

Attributes in XML Schema
  <xs:element name="paper" type="papertype"/>
  <xs:complexType name="papertype">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      . . . . . .
    </xs:sequence>
    <xs:attribute name="language" type="xs:NMTOKEN" fixed="English"/>
  </xs:complexType>
Attributes are associated with the type, not with the element.
Only with complex types; more trouble if we want to add attributes to simple types.

"Mixed" Content, "Any" Type
  <xs:complexType mixed="true"> . . . .
Better than in DTDs: can still enforce the type, but now may have text between any elements
  <xs:element name="anything" type="xs:anyType"/> . . . .
Means anything is permitted there

"All" Group
  <xs:complexType name="PurchaseOrderType">
    <xs:all>
      <xs:element name="shipTo" type="USAddress"/>
      <xs:element name="billTo" type="USAddress"/>
      <xs:element ref="comment" minOccurs="0"/>
      <xs:element name="items" type="Items"/>
    </xs:all>
    <xs:attribute name="orderDate" type="xs:date"/>
  </xs:complexType>
A restricted form of & in SGML
Restrictions:
  Only at top level
  Has only elements
  Each element occurs at most once
E.g. "comment" occurs 0 or 1 times
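The semantics of the all group (any order, each element at most once) can be sketched as a simple counting check. `ALL_GROUP` maps each element name from the PurchaseOrderType example to its `minOccurs`; the dictionary and the function `satisfies_all` are assumptions of this illustration.

```python
from collections import Counter

# Element name -> minOccurs; maxOccurs is 1 for every member of an all group here.
ALL_GROUP = {"shipTo": 1, "billTo": 1, "comment": 0, "items": 1}

def satisfies_all(child_tags, group=ALL_GROUP):
    """Check child element names against an all group, ignoring their order."""
    counts = Counter(child_tags)
    if set(counts) - set(group):
        return False  # an element that is not part of the group
    return all(group[name] <= counts[name] <= 1 for name in group)

in_any_order = satisfies_all(["items", "billTo", "shipTo"])
comment_twice = satisfies_all(["shipTo", "billTo", "comment", "comment", "items"])
missing_items = satisfies_all(["shipTo", "billTo"])
```

Expressing the same constraint as a DTD regular expression would require listing every permutation of the four elements.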

Derived Types by Extensions
  <complexType name="Address">
    <sequence>
      <element name="street" type="string"/>
      <element name="city" type="string"/>
    </sequence>
  </complexType>
  <complexType name="USAddress">
    <complexContent>
      <extension base="ipo:Address">
        <sequence>
          <element name="state" type="ipo:USState"/>
          <element name="zip" type="positiveInteger"/>
        </sequence>
      </extension>
    </complexContent>
  </complexType>
Corresponds to inheritance

Derived Types by Restrictions
  <complexContent>
    <restriction base="ipo:Items">
      … [rewrite the entire content, with restrictions (*)] ...
    </restriction>
  </complexContent>
(*): may restrict cardinalities, e.g. (0,infty) to (1,1); may restrict choices; other restrictions…
Corresponds to set inclusion