Fall 2001 CSE330 XML and Beyond: Parts I and II


1 XML and Beyond: Parts I and II http://db.cis.upenn.edu http://www.w3c.org

2 Outline
– Background: documents (SGML/HTML) and databases (structured and semistructured data)
– XML basics and Document Type Descriptors
– XML APIs: Document Object Model (DOM), SAX (not covered in this course)
– XML query languages: XML-QL, XSL, Quilt

3 Part I: Background. What’s the difference between the world of documents and information retrieval and the world of databases and query interfaces?

4 Documents vs Databases
Document world
> plenty of small documents
> usually static
> implicit structure: section, paragraph, toc
> tagging
> human friendly
> content: form/layout, annotation
> paradigms: “Save as”
> meta-data: author name, date, subject
Database world
> a few large databases
> usually dynamic
> explicit structure (schema)
> records
> machine friendly
> content: schema, data, methods
> paradigms: Atomicity, Consistency, Isolation, Durability
> meta-data: schema description

5 What to do with them
Documents: editing, printing, spell-checking, counting words, retrieving (IR), searching
Databases: updating, cleaning, querying, composing/transforming

6 HTML. Lingua franca for publishing hypertext on the World Wide Web. Designed to describe how a Web browser should arrange text, images and push-buttons on a page. Easy to learn, but does not convey structure. Fixed tag set.
[Figure: HTML page “Welcome to the XML course” with heading “Introduction”, annotated with: opening tag, text (PCDATA), closing tag, “bachelor” tag, attribute name, attribute value]

7 Thin red line. The line between the document world and the database world is not clear. In some cases, both approaches are legitimate. An interesting middle ground is data formats, of which XML is an example. Example: a personal address book.

8 Personal address book over 20 years
1977
N Achison, Malcolm
F Dr. M.P. Achison
A Dept. of Computer Science
A University of Edinburgh
A Kings Buildings
A Edinburgh E12 8QQ
A Scotland
T 031-123-8855 ext. 4359 (work)
T 031-345-7570 (home)
N Albani, Paolo
F Prof. Paolo Albani
A Dip. Informatica e Sistemistica
A Universita di Roma La Sapienza...
1980
N Achison, Malcolm
F Dr. M.P. Achison
A Dept. of Computer Science....
T 031-667-7570 (home)
C mpa@uk.ac.ed.cs
1990
N Achison, Malcolm
F Prof. M.P. Achison
A Dept. of Computing Science
A University of Glasgow
A Lilybank Gardens
A Glasgow G12 8QQ
A Scotland
T 041-339-8855 ext. 4359
T 041-357-3787 (private)
T 031-667-7570 (home)
X 041-339-0090
C mpa@uk.ac.gla.cs
N Achison, Malcolm
F Prof. M.P. Achison
A 34 Inverness Place
A Edinburgh, EH3 8UV
1997
N Achison, Malcolm
F Prof. M.P. Achison
A Department of Computing Science...
T 031-667-7570 (home)
X 041-339-0090
C mpa@dcs.gla.ac.uk
W http://www.dcs.gla.ac.uk/mpa
2000
?

9 The Structure of XML. XML consists of tags and text. Tags come in pairs: <a> ... </a>. They must be properly nested:
<a> ... <b> ... </b> ... </a> --- good
<a> ... <b> ... </a> ... </b> --- bad
(You can’t overlap tags like this in HTML.)
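
The nesting rule can be checked mechanically. The following is a minimal Python sketch that uses the standard library parser; the tag names `date`, `day` and `month` are hypothetical and only the nesting pattern matters.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical tag names, chosen only to show the two nesting patterns.
good = "<date><day>25</day><month>8</month></date>"   # properly nested
bad = "<date><day>25</date></day>"                    # tags overlap

print(is_well_formed(good))
print(is_well_formed(bad))
```

The parser rejects the second string because the closing tag does not match the most recently opened element.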

10 XML text. XML has only one “basic” type: text. It is bounded by tags, e.g. the tagged values The Big Sleep and 1935 --- 1935 is still text. XML text is called PCDATA (for parsed character data). It is Unicode text, and any character can be written as a numeric reference, e.g. &#x05DE; for the Hebrew letter Mem.
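
A short Python sketch of both points, assuming hypothetical tag names `movie`, `title`, `year` and `letter` for the slide's values. The parser hands back every value as a string, and it resolves the numeric character reference &#x05DE; (U+05DE, the Hebrew letter Mem) to the character itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names wrapped around the slide's example values.
doc = ET.fromstring(
    "<movie><title>The Big Sleep</title><year>1935</year></movie>"
)
year = doc.find("year").text
print(type(year).__name__)   # the value is text, not a number
print(int(year) + 1)         # any other type needs an explicit conversion

# A numeric character reference is resolved by the parser.
mem = ET.fromstring("<letter>&#x05DE;</letter>").text
print(mem == "\u05de")
```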

11 XML structure. Nesting tags can be used to express various structures. E.g. a tuple (record): <person> <name> Malcolm Atchison </name> <tel> (215) 898 4321 </tel> <email> mp@dcs.gla.ac.sc </email> </person>

12 XML structure. We can represent a list by using the same tag repeatedly: ...

13 Terminology. The segment of an XML document between an opening and a corresponding closing tag is called an element.
[Figure: the record Malcolm Atchison, (215) 898 4321, mp@dcs.gla.ac.sc with three segments marked: an element; not an element; an element that is a sub-element of the enclosing element]

14 XML is tree-like.
[Figure: tree with root person and children name, tel and email, whose leaves are Malcolm Atchison, (215) 898 4321 and mp@dcs.gla.ac.sc]
Semistructured data models typically put the labels on the edges.
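
The tree view can be made concrete with a small Python sketch. It assumes the person record is tagged with `person`, `name`, `tel` and `email` (the labels shown in the tree) and lists each parent-to-child edge together with the leaf text.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring(
    "<person>"
    "<name>Malcolm Atchison</name>"
    "<tel>(215) 898 4321</tel>"
    "<email>mp@dcs.gla.ac.sc</email>"
    "</person>"
)

def edges(node):
    """Yield (parent label, child label, leaf text) for every edge in the tree."""
    for child in node:
        yield node.tag, child.tag, (child.text or "").strip()
        yield from edges(child)

tree_edges = list(edges(person))
for edge in tree_edges:
    print(edge)
```

In an edge-labelled (semistructured) model the child label would sit on the edge itself rather than on the node, but the traversal is the same.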

15 Mixed Content. An element may contain a mixture of sub-elements and PCDATA. Example text: British Airways, World’s favorite airline. Data of this form is not typically generated from databases. It is needed for consistency with HTML.

16 A Complete XML Document. <?xml version="1.0"?> <person> <name> Malcolm Atchison </name> <tel> (215) 898 4321 </tel> <email> mp@dcs.gla.ac.sc </email> </person>

17 Representing relational DBs: Two ways
projects: title, budget, managedBy
employees: name, ssn, age

18 Project and Employee relations in XML. Projects and employees are intermixed:
project: Pattern recognition, 10000, Joe
employee: Joe, 344556, 34
employee: Sandra, 2234, 35
project: Auto guided vehicle, 70000, Sandra
:

19 Project and Employee relations in XML (cont’d). Employees follow projects:
projects
project: Pattern recognition, 10000, Joe
project: Auto guided vehicles, 70000, Sandra
:
employees
employee: Joe, 344556, 34
employee: Sandra, 2234, 35
:

20 Project and Employee relations in XML (cont’d). Or without “separator” tags:
project: Pattern recognition, 10000, Joe
project: Auto guided vehicles, 70000, Sandra
:
employee: Joe, 344556, 34
employee: Sandra, 2234, 35
:
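
The layouts with and without separator tags can be generated from the same rows. This is a minimal Python sketch: the field names come from the relation headers (title, budget, managedBy; name, ssn, age), while the wrapper tags `db`, `projects`, `project`, `employees` and `employee` are assumptions, since the original markup is not shown.

```python
import xml.etree.ElementTree as ET

projects = [("Pattern recognition", "10000", "Joe"),
            ("Auto guided vehicles", "70000", "Sandra")]
employees = [("Joe", "344556", "34"), ("Sandra", "2234", "35")]

def add_row(parent, tag, fields, values):
    """Append one tuple as an element with one sub-element per field."""
    row = ET.SubElement(parent, tag)
    for field, value in zip(fields, values):
        ET.SubElement(row, field).text = value
    return row

def build(separators: bool):
    """Build the db element, with or without the grouping ("separator") tags."""
    db = ET.Element("db")
    project_parent = ET.SubElement(db, "projects") if separators else db
    for row in projects:
        add_row(project_parent, "project", ("title", "budget", "managedBy"), row)
    employee_parent = ET.SubElement(db, "employees") if separators else db
    for row in employees:
        add_row(employee_parent, "employee", ("name", "ssn", "age"), row)
    return db

grouped = build(separators=True)
flat = build(separators=False)
print([child.tag for child in grouped])
print([child.tag for child in flat])
```

Both documents carry the same tuples; they differ only in whether a query has to step through the grouping elements.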

21 Attributes. An (opening) tag may contain attributes. These are typically used to describe the content of an element. Example: the words cheese, fromage and branza (the same word in different languages, which an attribute can record), with the meaning “A food made …”

22 Attributes (cont’d). Another common use for attributes is to express dimension or type, e.g. for the values 2400 and 96, or for encoded data such as M05-.+C$@02!G96YE<FEC... A document that obeys the “nested tags” rule and does not repeat an attribute within a tag is said to be well-formed.
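
The "no repeated attribute" half of well-formedness can be checked in the same way as nesting. A minimal Python sketch follows; the `height` element and `dimension` attribute are taken from the ATTLIST example later in the deck, and the unit values `cm` and `in` are made up.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Attribute values are hypothetical; only the repetition matters.
single = '<height dimension="cm">2400</height>'
repeated = '<height dimension="cm" dimension="in">2400</height>'

print(is_well_formed(single))
print(is_well_formed(repeated))
```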

23 When to use attributes. It’s not always clear when to use attributes: the value 123 45 6789 can be an attribute of the element holding F. MacNiel and fmacn@dcs.barra.ac.sc..., OR a sub-element alongside them: 123 45 6789, F. MacNiel, fmacn@dcs.barra.ac.sc...

24 Using IDs. Example: person elements for Jane Doe, John Doe, Mary Doe and Jack Doe, each identified by an ID attribute.

25 An object-oriented schema
class Movie ( extent Movies, key title ) {
  attribute string title;
  attribute string director;
  relationship set<Actor> casts inverse Actor::acted_In;
  attribute int budget;
};
class Actor ( extent Actors, key name ) {
  attribute string name;
  relationship set<Movie> acted_In inverse Movie::casts;
  attribute int age;
  attribute set directed;
};

26 An example
Movies (title, director, budget):
Waking Ned Divine, Kirk Jones III, 100,000
Dragonheart, Rob Cohen, 110,000
Moondance, Dagmar Hirtz, 90,000
:
Actors (name, age where given):
David Kelly
Sean Connery, 68
Ian Bannen
:

27 Part II: Document Type Descriptors. Imposing structure on XML documents.

28 Document Type Descriptors. Document Type Descriptors (DTDs; the XML specification calls them document type definitions) impose structure on an XML document. There is some relationship between a DTD and a schema, but it is not close: there is still a need for additional “typing” systems. The DTD is a syntactic specification.

29 Example: An Address Book
A person entry: MacNiel, John; Dr. John MacNiel; 1234 Huron Street; Rome, OH 98765; (321) 786 2543; jm@abc.com
– Exactly one name
– At most one greeting
– As many address lines as needed (in order)
– Mixed telephones and faxes
– As many emails as needed

30 Specifying the structure
name: to specify a name element
greet?: to specify an optional (0 or 1) greet element
name, greet?: to specify a name followed by an optional greet

31 Specifying the structure (cont)
addr*: to specify 0 or more address lines
tel | fax: a tel or a fax element
(tel | fax)*: 0 or more repeats of tel or fax
email*: 0 or more email elements

32 Specifying the structure (cont). So the whole structure of a person entry is specified by
name, greet?, addr*, (tel | fax)*, email*
This is known as a regular expression. Why is it important?

33 Regular Expressions. Each regular expression determines a corresponding finite state automaton. Let’s start with a simpler example: name, addr*, email. This suggests a simple parsing program.
[Figure: automaton with transitions labelled name, addr, email]
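
The "simple parsing program" for name, addr*, email can be sketched as a table-driven automaton in Python. The state numbering is an assumption made for this illustration; the deck only shows the transition labels.

```python
# States: 0 = start, 1 = name seen (loops on addr), 2 = email seen (accepting).
TRANSITIONS = {
    (0, "name"): 1,
    (1, "addr"): 1,
    (1, "email"): 2,
}
ACCEPTING = {2}

def accepts(tags):
    """Run the automaton over a sequence of child tag names."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:          # no transition: the sequence is rejected
            return False
    return state in ACCEPTING

print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "addr"]))
```

Each child element is examined once, so checking a content model takes time linear in the number of children.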

34 Another example: name, address*, (tel | fax)*, email*
[Figure: automaton with transitions labelled name, address, tel, fax, email]
Adding in the optional greet further complicates things.

35 A DTD for the address book
<!DOCTYPE addressbook [
<!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)>
]>
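
Python's standard library does not validate against DTDs, but the content model of person can be imitated with an ordinary regular expression over the sequence of child tags. This is a sketch of the idea rather than a DTD validator; the sample document reuses the address book values and assumes the tags `name`, `greet`, `address`, `tel` and `email`.

```python
import re
import xml.etree.ElementTree as ET

# Content model from the DTD: name, greet?, address*, (fax | tel)*, email*
# Each child tag is written with a trailing comma so tag names cannot run together.
PERSON_MODEL = re.compile(r"name,(greet,)?(address,)*((fax|tel),)*(email,)*")

def person_is_valid(person) -> bool:
    """Check the order and multiplicity of a person's children against the model."""
    children = "".join(child.tag + "," for child in person)
    return PERSON_MODEL.fullmatch(children) is not None

good = ET.fromstring(
    "<person><name>MacNiel, John</name><greet>Dr. John MacNiel</greet>"
    "<address>1234 Huron Street</address><address>Rome, OH 98765</address>"
    "<tel>(321) 786 2543</tel><email>jm@abc.com</email></person>"
)
bad = ET.fromstring(
    "<person><greet>Dr. John MacNiel</greet><name>MacNiel, John</name></person>"
)

print(person_is_valid(good))
print(person_is_valid(bad))
```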

36 Two DTDs for the relational DB
<!DOCTYPE db [... ]>
<!DOCTYPE db [... ]>

37 Recursive DTDs
<!DOCTYPE genealogy [
<!ELEMENT person ( name, dateOfBirth,
                   person,      -- mother
                   person )>    -- father
...
]>
What is the problem with this?

38 Recursive DTDs cont’d.
<!DOCTYPE genealogy [
<!ELEMENT person ( name, dateOfBirth,
                   person?,      -- mother
                   person? )>    -- father
...
]>
What is now the problem with this?

39 Some things are hard to specify. Each employee element is to contain name, age and ssn elements in some order.
<!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) |... )>
Suppose there were many more fields!
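
A small Python sketch shows why this gets out of hand: an "any order" content model needs one alternative per permutation, so n fields need n! alternatives. The helper name `any_order_model` is made up for this illustration.

```python
from itertools import permutations
from math import factorial

def any_order_model(fields):
    """Build a DTD content model that allows the fields in every possible order."""
    alternatives = ("(" + ", ".join(order) + ")" for order in permutations(fields))
    return "( " + " | ".join(alternatives) + " )"

model = any_order_model(["name", "age", "ssn"])
print("<!ELEMENT employee " + model + ">")

# Number of alternatives needed for 3 fields and for 10 fields.
print(factorial(3), factorial(10))
```

Three fields already need six alternatives; ten fields would need 3,628,800.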

40 Summary of XML regular expressions
A          The tag A occurs
e1,e2      The expression e1 followed by e2
e*         0 or more occurrences of e
e?         Optional: 0 or 1 occurrences
e+         1 or more occurrences
e1 | e2    either e1 or e2
(e)        grouping

41 It’s easy to get confused… Cited from oagis_segments.dtd (one of the files in the Novell Developer Kit http://developer.novell.com/ndk/indexexe.htm). Example data: Ben Franklin. Q. Which NAME is it?

42 Specifying attributes in the DTD
<!ATTLIST height
    dimension CDATA #REQUIRED
    accuracy  CDATA #IMPLIED >
The dimension attribute is required; the accuracy attribute is optional. CDATA is the “type” of the attribute: it means string.
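
The effect of #REQUIRED and #IMPLIED can be imitated in Python with a small table of declared attributes. This is a hand-written check under that assumption, not a DTD validator, and the attribute values (`cm`, `2`, `x`) and the extra `unit` attribute are invented for the example.

```python
import xml.etree.ElementTree as ET

# Mirrors the ATTLIST above: attribute name -> is it #REQUIRED?
HEIGHT_ATTRS = {"dimension": True, "accuracy": False}

def check_attributes(elem, declared):
    """Return a list of problems: missing required or undeclared attributes."""
    problems = [f"missing {name}" for name, required in declared.items()
                if required and name not in elem.attrib]
    problems += [f"undeclared {name}" for name in elem.attrib
                 if name not in declared]
    return problems

ok = ET.fromstring('<height dimension="cm">2400</height>')
missing = ET.fromstring('<height accuracy="2">96</height>')
extra = ET.fromstring('<height dimension="cm" unit="x">96</height>')

print(check_attributes(ok, HEIGHT_ATTRS))
print(check_attributes(missing, HEIGHT_ATTRS))
print(check_attributes(extra, HEIGHT_ATTRS))
```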

43 Specifying ID and IDREF attributes
<!DOCTYPE family [
<!ATTLIST person
    id       ID     #REQUIRED
    mother   IDREF  #IMPLIED
    father   IDREF  #IMPLIED
    children IDREFS #IMPLIED>
]>

44 Some conforming data: person elements for Jane Doe, John Doe, Mary Doe and Jack Doe, linked through their id, mother, father and children attributes.

45 Consistency of ID and IDREF attribute values
If an attribute is declared as ID
– the associated values must all be distinct (no confusion)
If an attribute is declared as IDREF
– the associated value must exist as the value of some ID attribute (no dangling “pointers”)
Similarly for all the values of an IDREFS attribute.
ID and IDREF attributes are not typed.
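
Both consistency rules can be checked in a few lines of Python. This is a minimal sketch that assumes the family DTD's attributes (id, mother, father, children) on person elements; the identifier values such as `jane` and `zoe` are invented.

```python
import xml.etree.ElementTree as ET

def check_references(root):
    """Report duplicate ID values and IDREF/IDREFS values that match no ID."""
    ids, problems = set(), []
    for person in root.iter("person"):
        pid = person.get("id")
        if pid in ids:
            problems.append(f"duplicate id {pid}")
        ids.add(pid)
    for person in root.iter("person"):
        refs = [person.get("mother"), person.get("father")]
        refs += (person.get("children") or "").split()   # IDREFS: space-separated
        for ref in refs:
            if ref is not None and ref not in ids:
                problems.append(f"dangling reference {ref}")
    return problems

family = ET.fromstring(
    '<family>'
    '<person id="jane" children="mary jack">Jane Doe</person>'
    '<person id="john" children="mary jack">John Doe</person>'
    '<person id="mary" mother="jane" father="john">Mary Doe</person>'
    '<person id="jack" mother="jane" father="john">Jack Doe</person>'
    '</family>'
)
broken = ET.fromstring(
    '<family><person id="a" mother="zoe"/><person id="a"/></family>'
)

print(check_references(family))
print(check_references(broken))
```

Because IDs are untyped, the check cannot tell whether a mother reference points at a person or at some other kind of element; it only confirms that the target exists.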

46 An alternative specification
<!DOCTYPE family [ ]>

47 The revised data: Jane Doe John Doe...

48 A useful abbreviation. When an element has empty content we can use <tag/> for <tag></tag>. For example: Jane Doe...
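
A parser treats the abbreviated and the long form of an empty element as the same thing. A short Python sketch, assuming a hypothetical person element with an id attribute:

```python
import xml.etree.ElementTree as ET

# Hypothetical element; the two strings differ only in how the empty element is written.
long_form = ET.fromstring('<person id="jane"></person>')
short_form = ET.fromstring('<person id="jane"/>')

def same_element(a, b) -> bool:
    """Compare tag, attributes, text and number of children."""
    return (a.tag == b.tag and a.attrib == b.attrib
            and (a.text or "") == (b.text or "") and len(a) == len(b))

print(same_element(long_form, short_form))
```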

49 Back to the object-oriented schema
class Movie ( extent Movies, key title ) {
  attribute string title;
  attribute string director;
  relationship set<Actor> casts inverse Actor::acted_In;
  attribute int budget;
};
class Actor ( extent Actors, key name ) {
  attribute string name;
  relationship set<Movie> acted_In inverse Movie::casts;
  attribute int age;
  attribute set directed;
};

50 Schema.dtd
<!DOCTYPE db [

51 Schema.dtd (cont’d)
]>

52 More on ODL and DTD. Last year (May 2000), the Object Data Management Group (ODMG) suggested OIFML, an XML document type for the Object Interchange Format. http://www.odmg.org/library/readingroom/oifml.pdf

53 Constraints on IDs and IDREFs
– ID stands for identifier. No two ID attributes in a document may have the same value.
– IDREF stands for identifier reference. Every value associated with an IDREF attribute must exist as an ID attribute value.
– IDREFS specifies a whitespace-separated list of identifier references.

54 Connecting the document with its DTD
In line: <!DOCTYPE db [ … ]> ...
Another file: <!DOCTYPE db SYSTEM "schema.dtd">
A URL: <!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">

55 Well-formed and Valid Documents
Well-formed applies to any document (with or without a DTD): proper nesting of tags and unique attributes.
Valid specifies that the document conforms to the DTD: it conforms to the regular expression grammar, the types of its attributes are correct, and the constraints on references are satisfied.

56 DTDs vs. Schemas (or Types). By database (or programming language) standards DTDs are rather weak specifications.
– Only one base type: PCDATA
– No useful “abstractions”, e.g., sets
– IDREFs are untyped. You point to something, but you don’t know what!
– No constraints, e.g., child is inverse of parent
– No methods
– Tag definitions are global
Some of the XML extensions impose something like a schema or type on an XML document. We’ll see these later.

57 Lots of possibilities for schemas
– XML Schema (under W3C’s spotlight)
– XDR (Microsoft’s BizTalk)
– SOX (Schema for Object-Oriented XML)
– Schematron
– DSD (AT&T Labs and BRICS)
– and more.

58 Some tools
– XML Authority http://www.extensibility.com/tibco/solutions/xml_authority/index.htm
– XML Spy http://www.xmlspy.com/download.html

59 Summary. XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema). DTDs provide some useful syntactic constraints on documents. As schemas they are weak. Next slides: XML programming, XML querying.

