1 XML – Extensible Markup Language DBI – Representation and Management of Data on the Internet.

1 1 XML – Extensible Markup Language DBI – Representation and Management of Data on the Internet

2 2 Part I: Background What’s the difference between –The world of documents and information retrieval, and –Databases and query interfaces?

3 3 Documents vs. Databases Document World Plenty of small documents Usually static Implicit structure: section, paragraph, table of contents Tagging Database World A few large databases Usually dynamic Explicit structure: schema Records

4 4 Documents vs. Databases (cont’d) Document World Human friendly Content: form/layout, annotation Paradigms: “Save as”, WYSIWYG Meta-data: author name, date, subject Database World Machine friendly Content: schema, data, methods Paradigms: Atomicity, Consistency, Isolation, Durability Meta-data: schema description

5 5 What can be Done with Them Documents: editing, printing, spell-checking, counting words, retrieving (IR), searching, clustering Databases: updating, cleaning, querying, adjusting, transforming

6 6 HTML Hypertext Markup Language Used for publishing hypertext on the World-Wide Web Designed to describe how a Web browser should arrange text, images and push-buttons on a page Easy to learn, but does not convey structure Fixed tag set

7 7 HTML Example Page text: Welcome to the DBI course; Introduction Callouts: opening tag, text (PCDATA), closing tag, “bachelor” tag (a tag with no closing partner), attribute name, attribute value

8 8 HTML The World-Wide Web is constructed from HTML documents We can apply information-retrieval techniques to a set of documents –For example, clustering as Google does How can we apply database techniques to the Web?

9 9 HTML Pages We can –Edit (and put on the Web) –Print (or view with a browser) –Spell-check –Count words –Retrieve (again, with a browser) –Search (with a search engine, for example) –Cluster

10 10 How can we Ask Queries? How can we automatically find the cheapest flight from Israel to Micronesia, knowing the Web sites of all airlines that have flights to Micronesia? How can we automatically find the phone numbers of people who advertised on the Web that they want to sell a car for a price that is not greater than 30,000 IS? It can be useful to query data as we do in databases

11 11 Thin Red Line The line between the document world and the database world is not clear In some cases, both approaches are legitimate An interesting middle ground is data formats – of which XML is an example

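12 12 The Structure of XML XML consists of tags and text Tags come in pairs: an opening tag and a matching closing tag They must be properly nested –good: an inner element is closed before the element that contains it –bad: an element is closed while an element opened inside it is still open (Such overlapping tags are not allowed in HTML either)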
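A minimal sketch of the nesting rule, using Python's standard xml.etree.ElementTree parser. The tag names date, day and month are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_nested(text: str) -> bool:
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical tag names, chosen only for illustration.
good = "<date><day>12</day><month>May</month></date>"
bad = "<date><day>12</date></day>"   # <day> is still open when </date> arrives

print(is_well_nested(good))
print(is_well_nested(bad))
```

The parser rejects the second string because the closing tag does not match the most recently opened element.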

13 13 XML Text XML has only one “basic” type – text It is bounded by tags, e.g., The Big Sleep 1935 – 1935 is still text XML text is called PCDATA –(for parsed character data) It uses Unicode; any character can be written as a numeric character reference, e.g., &#x05DE; for the Hebrew letter Mem
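The following sketch illustrates both points: element content comes back as text even when it looks numeric, and a numeric character reference names a Unicode code point. The element names movie, title, year and c are assumptions; U+05DE is the Unicode code point of the Hebrew letter Mem.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical element names; the slide's own tags are not shown in the transcript.
movie = ET.fromstring("<movie><title>The Big Sleep</title><year>1935</year></movie>")
year = movie.find("year").text
print(type(year).__name__, year)   # the parser hands back a string, not a number

# A numeric character reference gives a Unicode code point in hexadecimal.
letter = ET.fromstring("<c>&#x05DE;</c>").text
print(unicodedata.name(letter))
```

Converting "1935" to an integer is left to the application, since XML itself attaches no numeric type to it.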

14 14 XML Structure Nesting tags can be used to express various structures, e.g., a tuple (record): Lisa Simpson 02-828-1234 054-470-777 lisa@cs.huji.ac.il

15 15 XML Structure (cont’d) We can represent a list by using the same tag repeatedly: …

16 16 XML Structure (cont’d) Donald Duck 04-828-1345 donald@cs.technion.ac.il Miki Mouse 03-426-1142 miki@yahoo.com
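As a rough illustration of a list of tuples, the sketch below wraps the two entries from the slide in assumed tags (addresses, person, name, tel, email) and reads them back as Python tuples.

```python
import xml.etree.ElementTree as ET

# Tag names (addresses, person, name, tel, email) are assumed for illustration.
doc = """
<addresses>
  <person><name>Donald Duck</name><tel>04-828-1345</tel><email>donald@cs.technion.ac.il</email></person>
  <person><name>Miki Mouse</name><tel>03-426-1142</tel><email>miki@yahoo.com</email></person>
</addresses>
"""
root = ET.fromstring(doc)

# Each repeated <person> element is one list entry; each entry is a tuple of fields.
people = [(p.findtext("name"), p.findtext("tel"), p.findtext("email"))
          for p in root.findall("person")]
for entry in people:
    print(entry)
```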

17 17 Terminology The segment of an XML document between an opening and a corresponding closing tag is called an element Bart Simpson 02 – 444 7777 051 – 011 022 bart@tau.ac.il element element, a sub-element of not an element

18 18 XML Document is a Tree person name email tel Bart Simpson 02 – 444 7777 051 – 011 022 bart@tau.ac.il Semistructured data models typically put the labels on the edges
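One way to see the tree is to walk it recursively. The sketch below assumes the tag names person, name, tel and email that label the tree on the slide; note that ElementTree puts labels on the nodes rather than on the edges.

```python
import xml.etree.ElementTree as ET

def outline(elem, depth=0):
    """Return one line per node: element names as inner nodes, text as leaves."""
    lines = ["  " * depth + elem.tag]
    if elem.text and elem.text.strip():
        lines.append("  " * (depth + 1) + repr(elem.text.strip()))
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

person = ET.fromstring(
    "<person><name>Bart Simpson</name><tel>02 - 444 7777</tel>"
    "<tel>051 - 011 022</tel><email>bart@tau.ac.il</email></person>"
)
print("\n".join(outline(person)))
```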

19 19 Mixed Content An element may contain a mixture of sub- elements and PCDATA British Airways World’s favorite airline

20 20 Needs for Mixed Content Mixed-content data is not typically generated from databases It is needed for consistency with HTML For example: Why can’t you find dragons in a restaurant? Because smoking is not allowed
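A small sketch of how mixed content looks to a program, with hypothetical tags airline and name around the British Airways text from the previous slide. In ElementTree, text that follows a sub-element is stored in that sub-element's tail.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup around the slide's text.
airline = ET.fromstring(
    "<airline><name>British Airways</name> World's favorite airline</airline>"
)
name = airline.find("name")
print(name.text)                      # PCDATA inside the sub-element
print(name.tail)                      # PCDATA after the sub-element, still inside <airline>
print("".join(airline.itertext()))    # all text in document order
```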

21 21 A Complete XML Document Lisa Simpson 02-828-1234 054-470-777 lisa@cs.huji.ac.il

22 22 The Header Tag You can leave out the encoding attribute and the processor will use the UTF-8 default
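The header in question is the XML declaration at the top of the file. The sketch below, with a hypothetical city element, parses the same text once with an explicit encoding attribute and once without, in which case the bytes are read as UTF-8.

```python
import xml.etree.ElementTree as ET

# With an explicit encoding attribute, the bytes are decoded as declared.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><city>Zürich</city>'.encode("latin-1")
# Without one, the parser assumes UTF-8.
utf8_doc = '<?xml version="1.0"?><city>Zürich</city>'.encode("utf-8")

a = ET.fromstring(latin1_doc).text
b = ET.fromstring(utf8_doc).text
print(a, b)
```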

23 23 Processing Instructions Hello, world!

24 24 Two Ways of Representing a Relational Database in XML projects: title budget managedBy employees: name ssn age

25 25 Project and Employee relations in XML Pattern recognition 10000 Joe Joe 344556 34 Sandra 2234 35 Auto guided vehicle 70000 Sandra : Projects and employees are intermixed

26 26 Pattern recognition 10000 Joe Auto guided vehicles 70000 Sandra : Joe 344556 34 Sandra 2234 35 : Employees follow projects Projects Employees

27 27 Pattern recognition 10000 Joe Auto guided vehicles 70000 Sandra : Joe 344556 34 Sandra 2234 35 : Or without “separator” tags … Can be done if it is clear where each employee and each project starts
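The "employees follow projects" layout can be sketched by generating XML from the two relations. The rows come from the slides; the element names db, projects, project, employees and employee are assumptions, while the column names follow the relation schemas.

```python
import xml.etree.ElementTree as ET

# Rows taken from the slides; element names follow the relation schemas.
projects = [("Pattern recognition", "10000", "Joe"),
            ("Auto guided vehicle", "70000", "Sandra")]
employees = [("Joe", "344556", "34"), ("Sandra", "2234", "35")]

def row(tag, fields, values):
    """Turn one tuple into an element with one sub-element per column."""
    elem = ET.Element(tag)
    for field, value in zip(fields, values):
        ET.SubElement(elem, field).text = value
    return elem

db = ET.Element("db")
projects_elem = ET.SubElement(db, "projects")    # "separator" element for the relation
for p in projects:
    projects_elem.append(row("project", ("title", "budget", "managedBy"), p))
employees_elem = ET.SubElement(db, "employees")
for e in employees:
    employees_elem.append(row("employee", ("name", "ssn", "age"), e))

xml_text = ET.tostring(db, encoding="unicode")
print(xml_text)
```

Dropping the two separator elements and appending every row directly to db would give the variant without separators.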

28 28 Attributes An (opening) tag may contain attributes These are typically used to describe the contents of an element cheese fromage branza A food made …
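A brief sketch of attributes describing element content. The tags entry, word and meaning and the attribute name language are assumptions; the words themselves come from the slide.

```python
import xml.etree.ElementTree as ET

# Assumed markup: a "language" attribute describes the content of each <word>.
entry = ET.fromstring(
    '<entry>'
    '<word language="en">cheese</word>'
    '<word language="fr">fromage</word>'
    '<word language="ro">branza</word>'
    '<meaning>A food made ...</meaning>'
    '</entry>'
)
by_language = {w.get("language"): w.text for w in entry.findall("word")}
print(by_language)
```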

29 29 Attributes (cont’d) Another common use for attributes is to express dimension or type 2400 96 M05-.+C$@02!G96YE<FEC...

30 30 Well-Formed Documents A document that –obeys the “nested-tags” rule, and –does not repeat an attribute within a tag is said to be well-formed
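A parser that only checks well-formedness will also reject a repeated attribute. A minimal sketch, with a hypothetical tel element and type attribute:

```python
import xml.etree.ElementTree as ET

def first_error(text):
    """Return None for a well-formed document, else the parser's message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

# Hypothetical element and attribute names.
ok = '<tel type="home">02-828-1234</tel>'
repeated = '<tel type="home" type="work">02-828-1234</tel>'

print(first_error(ok))
print(first_error(repeated))
```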

31 31 Jeff Cohen 04-828-1345 054-470-778 jeffc@cs.technion.ac.il Irma Levy 03-426-1142 irmal@yourmail.com Using Attributes

32 32 When to Use Attributes It’s not always clear when to use attributes L. Simpson lisa@cs.huji.ac.il... 123 4589 L. Simpson lisa@cs.huji.ac.il...

33 33 Using IDs Jeff Cohen 04-828-1345 054-470-778 jeffc@cs.technion.ac.il Irma Levy 03-426-1142 irmal@yourmail.com ID attributes

34 34 Using IDs Lisa Simpson Bart Simpson Marge Simpson Homer Simpson

35 35 ODL Schema class Movie ( extent Movies, key title ) { attribute string title; attribute string director; relationship set casts inverse Actor::acted_In; attribute int budget; } ; class Actor ( extent Actors, key name ) { attribute string name; relationship set acted_In inverse Movie::casts; attribute int age; attribute set directed; } ;

36 36 Waking Ned Divine Kirk Jones III 100,000 Dragonheart Rob Cohen 110,000 Moondance Dagmar Hirtz 90,000 : class Movie ( extent Movies, key title ) { attribute string title; attribute string director; relationship set casts inverse Actor::acted_In; attribute int budget; } ;

37 37 class Actor ( extent Actors, key name ) { attribute string name; relationship set acted_In inverse Movie::casts; attribute int age; attribute set directed; } ; : David Kelly Sean Connery 68 Ian Bannen :

38 38 Waking Ned Divine Kirk Jones III 100,000 Dragonheart Rob Cohen 110,000 Moondance Dagmar Hirtz 90,000 : David Kelly Sean Connery 68 Ian Bannen :

39 39 Part II: Document Type Descriptors Imposing Structure on XML Documents

40 40 Document Type Descriptors Document Type Descriptors (DTDs) impose structure on an XML document There is some relationship between a DTD and a schema, but it is not close – hence the need for additional “typing” systems (XML schemas) The DTD is a syntactic specification

41 41 Example: An Address Book Homer Simpson Dr. H. Simpson 1234 Springwater Road Springfield USA, 98765 (321) 786 2543 (321) 786 2544 homer@math.springfield.edu Mixed telephones and faxes As many as needed As many address lines as needed (in order) At most one greeting Exactly one name

42 42 Specifying the Structure name to specify a name element greet? to specify an optional (0 or 1) greet element name, greet? to specify a name followed by an optional greet

43 43 Specifying the Structure (cont’d) addr* to specify 0 or more address lines tel | fax to specify a tel or a fax element (tel | fax)* to specify 0 or more repeats of tel or fax email* to specify 0 or more email elements

44 44 Specifying the Structure (cont’d) So the whole structure of a person entry is specified by name, greet?, addr*, (tel | fax)*, email* This is known as a regular expression Why is it important?
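Because the content model is a regular expression, it can be checked with an ordinary regex engine. The sketch below is not how a validating parser is implemented; it simply rewrites the model over space-terminated tag names and matches the sequence of child tags against it.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*  rewritten over "tag " tokens.
PERSON = re.compile(r"name (greet )?(addr )*(tel |fax )*(email )*")

def conforms(person):
    """Check the sequence of child tags against the content model."""
    children = "".join(child.tag + " " for child in person)
    return PERSON.fullmatch(children) is not None

valid = ET.fromstring("<person><name/><addr/><addr/><tel/><fax/><tel/><email/></person>")
no_name_first = ET.fromstring("<person><greet/><name/></person>")
email_too_early = ET.fromstring("<person><name/><email/><tel/></person>")

print(conforms(valid), conforms(no_name_first), conforms(email_too_early))
```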

45 45 Regular Expressions Each regular expression determines a corresponding finite state automaton Let’s start with a simpler example: name, addr*, email name addr email This suggests a simple parsing program
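The simple parsing program suggested here can be sketched as a table-driven automaton for name, addr*, email. The state numbering is an assumption made for this example.

```python
# States: 0 = start, 1 = after name (addr may repeat), 2 = after email (accepting).
TRANSITIONS = {(0, "name"): 1, (1, "addr"): 1, (1, "email"): 2}
ACCEPTING = {2}

def accepts(tags):
    """Run the automaton over a sequence of child-element names."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:          # no transition: the sequence is rejected
            return False
    return state in ACCEPTING

print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "addr"]))
```

Each child element is consumed with a single table lookup, so checking a sequence takes time linear in its length.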

46 46 Another Example name,address*,(tel | fax)*,email* name address tel fax email Adding in the optional greet further complicates things email

47 47 Internal DTD For the Address Book <!DOCTYPE addressbook [ <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)> ]> The name of the DTD is addressbook “Internal” means that the DTD and the XML Document are in the same file

48 48 Rest of the Address Book Jeff Cohen Dr. Cohen jc@penny.com

49 49 Our Relational DB Revisited projects: title budget managedBy employees: name ssn age

50 50 Two DTDs for the Relational DB <!DOCTYPE db [... ]> <!DOCTYPE db [... ]>

51 51 Recursive DTDs <!DOCTYPE genealogy [ <!ELEMENT person (name, dateOfBirth, person, person)> <!-- first person: mother, second person: father --> ... ]> What is the problem with this? A parser does not notice it! Each person should have a father and a mother. This leads to either infinite data or a person that is a descendant of himself.

52 52 Recursive DTDs (cont’d) <!DOCTYPE genealogy [ <!ELEMENT person (name, dateOfBirth, person?, person?)> <!-- first person: mother, second person: father --> ... ]> What is now the problem with this? If a person has only a father, how can you tell that he has a father and does not have a mother?

53 53 Some Things are Hard to Specify Each employee element is to contain name, age and ssn elements in some order <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) |... )> Suppose there were many more fields!

54 54 Some Things are Hard to Specify (cont’d) <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) |... )> Suppose there were many more fields! There are n! different orders of n elements It is not even polynomial
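To make the blow-up concrete, the sketch below generates the "any order" content model for a list of fields and counts the alternatives. The helper name any_order_model is made up for this example.

```python
from itertools import permutations
from math import factorial

def any_order_model(fields):
    """Build the DTD content model that allows the fields in every order."""
    alternatives = ["(" + ", ".join(order) + ")" for order in permutations(fields)]
    return "( " + " | ".join(alternatives) + " )"

model = any_order_model(["name", "age", "ssn"])
print(model)
print(factorial(10))   # number of alternatives needed for ten fields
```

Three fields already need six alternatives; ten fields would need 3,628,800.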

55 55 General Definitions of Elements ANY – the element can have any content EMPTY – the element has no content

56 56 Summary of XML regular expressions A: the tag A occurs; e1,e2: the expression e1 followed by e2; e*: 0 or more occurrences of e; e?: optional, 0 or 1 occurrences of e; e+: 1 or more occurrences of e; e1 | e2: either e1 or e2; (e): grouping

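57 57 Deterministic Requirement Element-type declarations are required to be deterministic, which makes validation easier: the parser never has to look ahead to decide which part of the content model an element matches Formally, the Glushkov automaton of the regular expression is deterministic The states of this automaton are the positions of the regular expression (semantic actions) The transitions are based on the “follows set”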

58 58 Deterministic Requirement (cont’d) The associated automata are succinct A regular language may not have an associated deterministic grammar, e.g., <!ELEMENT ndeter ((movie|director)*,movie,(movie|director))>
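The determinism test can be sketched as follows. The first and follow sets below were derived by hand for this example (an assumption, not computed from the expression): positions are the symbol occurrences numbered left to right, and an expression is deterministic when no successor set contains two positions carrying the same symbol.

```python
def is_deterministic(symbol, first, follow):
    """True if no successor set holds two positions with the same symbol."""
    for successors in [first, *follow.values()]:
        symbols = [symbol[p] for p in successors]
        if len(symbols) != len(set(symbols)):
            return False
    return True

# Positions of ((movie|director)*, movie, (movie|director)), numbered left to right.
ndeter_symbol = {1: "movie", 2: "director", 3: "movie", 4: "movie", 5: "director"}
ndeter_first = {1, 2, 3}
ndeter_follow = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4, 5}, 4: set(), 5: set()}

# Positions of (name, addr*, email) for comparison.
simple_symbol = {1: "name", 2: "addr", 3: "email"}
simple_first = {1}
simple_follow = {1: {2, 3}, 2: {2, 3}, 3: set()}

print(is_deterministic(ndeter_symbol, ndeter_first, ndeter_follow))
print(is_deterministic(simple_symbol, simple_first, simple_follow))
```

In the ndeter model an initial movie element could match position 1 or position 3, and the parser cannot tell which without looking ahead.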

59 59 Specifying Attributes in the DTD <!ATTLIST height dimension CDATA #REQUIRED accuracy CDATA #IMPLIED > The dimension attribute is required The accuracy attribute is optional CDATA is the “type” of the attribute – it means string, and may take any literal string as a value

60 60 Specifying ID and IDREF Attributes <!DOCTYPE family [ <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> ]>

61 61 Specifying ID and IDREF Attributes (cont’d) The attributes mother and father are references to IDs of other elements However, those are not necessarily person elements! The mother attribute is not necessarily a reference to a female person References to IDs have no type

62 62 Some Conforming Data Lisa Simpson Bart Simpson Marge Simpson Homer Simpson

63 63 Consistency of ID and IDREF Attribute Values If an attribute is declared as ID –the associated values must all be distinct (no confusion) If an attribute is declared as IDREF –the associated value must exist as the value of some ID attribute (no dangling “pointers”) Similarly for all the values of an IDREFS attribute ID and IDREF attributes are not typed
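The two consistency rules can be sketched as a small checker. Python's standard parser does not validate against a DTD, so the function is told which attributes play the ID, IDREF and IDREFS roles; the defaults follow the family DTD above, while the name sub-element, the id values and the broken document are assumptions.

```python
import xml.etree.ElementTree as ET

def check_references(root, id_attr="id", idref_attrs=("mother", "father"),
                     idrefs_attrs=("children",)):
    """Return a list of violations of the ID / IDREF constraints."""
    problems, seen = [], set()
    for elem in root.iter():                      # pass 1: IDs must be distinct
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID {value!r}")
            seen.add(value)
    for elem in root.iter():                      # pass 2: references must resolve
        refs = [elem.get(a) for a in idref_attrs if elem.get(a) is not None]
        for a in idrefs_attrs:
            refs.extend(elem.get(a, "").split())
        problems.extend(f"dangling reference {r!r}" for r in refs if r not in seen)
    return problems

family = ET.fromstring("""
<family>
  <person id="lisa" mother="marge" father="homer"><name>Lisa Simpson</name></person>
  <person id="bart" mother="marge" father="homer"><name>Bart Simpson</name></person>
  <person id="marge" children="bart lisa"><name>Marge Simpson</name></person>
  <person id="homer" children="bart lisa"><name>Homer Simpson</name></person>
</family>
""")
broken = ET.fromstring('<family><person id="bart" father="abe"/><person id="bart"/></family>')

print(check_references(family))
print(check_references(broken))
```

The checker never looks at the tag of the referenced element, which mirrors the point that ID and IDREF attributes are not typed.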

64 64 A Useful Abbreviation When an element has empty content we can use <tag/> as shorthand for <tag></tag> For example: Lisa Simpson...

65 65 An Alternative Specification <!DOCTYPE family [ ]>

66 66 The Revised Data Marge Simpson Homer Simpson Bart Simpson <mother idref="marge"/> <father idref="homer"/> Lisa Simpson

67 67 ODL Schema class Movie ( extent Movies, key title ) { attribute string title; attribute string director; relationship set cast inverse Actor::acted_In; attribute int budget; } ; class Actor ( extent Actors, key name ) { attribute string name; relationship set acted_In inverse Movie::cast; attribute int age; attribute set directed; } ;

68 68 Schema.dtd <!DOCTYPE db [ The DTD continues in the next slide

69 69 Schema.dtd (cont’d) ]>

70 70 Data Oh God! Woody Allen $2M George Burns

71 71 Constraints on IDs and IDREFs ID stands for identifier –No two ID attributes may have the same value (each value must be a valid XML name) IDREF stands for identifier reference –Every value associated with an IDREF attribute must exist as an ID attribute value IDREFS specifies a whitespace-separated list of one or more identifiers

72 72 Adding a DTD to the Document A DTD can be internal –The DTD is part of the document file or external –The DTD and the document are in separate files –An external DTD may reside In the local file system (where the document is) In a remote file system

73 73 Connecting a Document with its DTD An internal DTD: … ]>... A DTD from the local file system: A DTD from a remote file system:

74 74 Well-formed and Valid Documents A document (with or without a DTD) is well- formed if it has –proper nesting of tags and unique attributes A valid document conforms to the DTD, i.e., –the document conforms to the regular-expression grammar, –types of attributes are correct, and –constraints on references are satisfied

75 75 DTDs vs. Schemas (or Types) DTDs are rather weak specifications by DB & programming-language standards –Only one base type – PCDATA –No useful “abstractions”, e.g., sets –IDREFs are untyped – the type of the object being referenced is not known –No constraints, e.g., child is inverse of parent –No methods –Tag definitions are global Some extensions of XML impose a schema or types on an XML document We may see these later

76 76 Part III: Entities To Take Storage into Account

77 77 What are Entities? An entity is a shortcut to a set of information You might think of an entity as being a bit like a macro Entities allow a document to be split across several storage units, such as separate files

78 78 Why Use Entities Entities allow sharing data between documents Entities save typing Entities can reduce errors Entities are easy to update Entities can act as placeholders for TBD (to be determined) information

79 79 Defining Entities Entities can be defined –in the local document as part of the DOCTYPE definition –with a link to external files that contain the entity data (this, too, is done through the DOCTYPE definition) –in an external DTD Define locally when the entity is being used only in one particular document Define by a link to an external file when the entity is being used in many documents

80 80 Kinds of Entities There are two kinds of entities: General entities –For usage in documents Parameter entities –For usage in declarations

81 81 General entities A general entity is defined in the DTD with a declaration of the form <!ENTITY Name "replacement text"> It is used in the document by writing &Name;
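A short sketch of an internal general entity being expanded by the parser. The root element note is hypothetical; the entity name and its text are borrowed, in shortened form, from the copyright example later in the deck.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares the entity; the document body refers to it.
doc = """<!DOCTYPE note [
  <!ENTITY copyright "Copyright 2000, As The World Spins Corp.">
]>
<note>&copyright; All rights reserved.</note>"""

text = ET.fromstring(doc).text
print(text)
```

The application sees only the expanded text, so changing the declaration updates every place where the entity is used.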

82 82 Example <!DOCTYPE mdb [ ]> Oh God! Woody Allen $2M

83 83 Browser View

84 84 Unparsed Entities <!DOCTYPE mdb [ <!ATTLIST movie id ID #REQUIRED opinion CDATA #IMPLIED starimage ENTITY #IMPLIED> ]> Entities are defined Types are defined

85 85 Data Oh God! Woody Allen $2M

86 86 Parameter Entities Parameter entities are used only within DTDs They carry information for use in the markup declaration –Internal entities – references are within the DTD –External entities – references draw information from outside files Parameter Entity declaration: <!ENTITY % Name "replacement text">, referenced in the DTD as %Name;

87 87 Parameter Entity Example <!ATTLIST person friend (yes | no) #IMPLIED id ID #REQUIRED knows IDREFS #IMPLIED>

88 88 Defining Entities Local Definition: <!DOCTYPE [ <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All rights reserved. Please do not copy or use without authorization. For authorization contact legal@worldspins.com."> ]> Global Definition: <!DOCTYPE [ <!ENTITY copyright SYSTEM "http://www.worldspins.com/legal/copyright.xml" > ]>

89 89 Example <!DOCTYPE [ <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All rights reserved. Please do not copy or use without authorization. For authorization contact legal@worldspins.com."> ]>

90 90 Example (cont’d) Mini-globe revolutionizes keychain industry Today As The World Spins introduces a new approach to key chains. With the new MINI-GLOBE keys can be kept inside a chain, called for upon demand, and stored safely. Never more will consumers lose a key or stand at a door flipping through a stack of keys seeking the right one. &trademark;&copyright;

91 91 Name Spaces Namespaces qualify element and attribute names so that vocabularies defined in different places can be mixed without name clashes More than one namespace can be used in the same XML document –Different elements of a given document may belong to different namespaces Declaring the namespaces –Each namespace is identified by a URI

92 92 Example Defining the used namespace Using a tag from the namespace This is a text of an element A according to dbi’s definition Using a tag not from the namespace This will probably be understood as an anchor
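A sketch of how a namespace-aware parser distinguishes the two A elements. The prefix dbi comes from the slide; the URI and the doc wrapper element are placeholders, not a real vocabulary.

```python
import xml.etree.ElementTree as ET

# The URI below is a placeholder that merely identifies the namespace.
doc = """<doc xmlns:dbi="http://example.org/dbi">
  <dbi:A>text of an element A as defined by dbi</dbi:A>
  <A>probably understood as an anchor</A>
</doc>"""

root = ET.fromstring(doc)
tags = [child.tag for child in root]
print(tags)

# Lookups take a prefix-to-URI mapping supplied by the caller.
dbi_a = root.find("dbi:A", {"dbi": "http://example.org/dbi"})
print(dbi_a.text)
```

ElementTree expands each prefixed name to the form {URI}local-name, so the two A elements never collide.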

93 93 DTD’s

94 94 The Data File

95 95 The Data File: shorthands

96 96 Godzila Jeff Cohen A Suitable Boy 22.95

97 97 Using CDATA Entering a Kennel Club Member Enter the member by the name on his or her papers. Use the NAME tag. The NAME tag has two attributes. Common (all in lowercase, please!) is the dog's call name. Breed (also in all lowercase) is the dog's breed. Please see the breed reference guide for acceptable breeds. Your entry should look something like this: Sir Fredrick of Ledyard's End ]]> We want to see the text as is, even though it includes tags
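A minimal sketch of a CDATA section. The instructions wrapper and the attribute values freddy and springer-spaniel are assumptions; the NAME tag, its two attributes and the dog's name come from the slide.

```python
import xml.etree.ElementTree as ET

# Everything between <![CDATA[ and ]]> is passed through as literal text.
doc = """<instructions>Your entry should look something like this:
<![CDATA[<NAME common="freddy" breed="springer-spaniel">Sir Fredrick of Ledyard's End</NAME>]]>
</instructions>"""

root = ET.fromstring(doc)
print(len(root))     # number of child elements
print(root.text)     # the tags inside the CDATA section appear as plain text
```

The NAME tags arrive as characters in the text rather than as a child element, which is the effect the slide asks for.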


99 99 Summary XML is a new data format. Its main virtues: –widespread acceptance –the (important) ability to handle semistructured data (data without schema) DTDs provide some useful syntactic constraints on documents. As schemas they are weak How to store large XML documents? How to query them? How to map between XML and other representations?

