XML – eXtensible Markup Language. The World Wide Web and What We Would Like to Do with It XML has a lot of hype surrounding it This week we discuss: –Why.


1 XML – eXtensible Markup Language

2 The World Wide Web and What We Would Like to Do with It XML has a lot of hype surrounding it This week we discuss: –Why XML is needed –Basic technologies used together with XML In the upcoming weeks: –Storage –Compression –Query processing –etc

3 XML in One Slide Basically, XML looks like HTML. However, in XML, you can use any tag names that you want Example: Lisa Simpson 02-828-1234 054-470-777 lisa@cs.huji.ac.il Is that all? Big Deal?!
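The markup of the slide's example did not survive in this transcript, so here is a minimal sketch of what such a record could look like and how a program might read it. The tag names person, name, tel and email are assumptions (they match the labels used on the later tree slide), not the original markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the text content of the slide's example is known,
# so the tag names below are assumed for illustration.
doc = """
<person>
  <name>Lisa Simpson</name>
  <tel>02-828-1234</tel>
  <tel>054-470-777</tel>
  <email>lisa@cs.huji.ac.il</email>
</person>
"""

person = ET.fromstring(doc)

# Because the tags name the data, a program can pick out each field directly.
name = person.findtext("name")
phones = [tel.text for tel in person.findall("tel")]
email = person.findtext("email")

print(name, phones, email)
```

The point of the sketch is that the program never has to guess which string is the email address; the tag says so.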

4 Motivation (1): The Semantic Web

5 Example 1: A Homepage on the Web Tom Sawyer's Homepage Tom's Friends Tom's Hobbies: Boating on the Mississippi River Chewing Gum Painting the Fence

6 Web Pages are Written in HTML HTML is a markup language An HTML page consists of tags with attributes and data HTML describes the style of the page (e.g., color, font type, etc.)

7 Tom Sawyer's Homepage Hi'ya all. Did you know that my best friend is Huckleberry Finn ? Sometimes, I like Becky Thatcher ? Here are some of my hobbies: Boating on the Mississippi River Chewing gum Painting the fence If you want to discuss common interests, contact me at tom@mark.twain

8 Automatically Using Information Tom Sawyer has a homepage. So do a lot of other people. It would be nice to be able to do the following things automatically (via a computer program) –Querying the Page: Find Tom Sawyer's email address and the names of his friends –Querying Similar Pages: Find people who have interests in common with Tom Sawyer

9 Automatically Using Information Site Personalization: Tom Sawyer's interests should be automatically recognized by sites –When Tom Sawyer enters Amazon, he should get "book recommendations" that match his interests –When Tom Sawyer enters a site that sells food, he should be told about sales on gum –This should all happen without Tom having to tell every site about his interests

10 Can we Automatically use the Information? In order to perform the tasks described before, we have to: –Find web pages that describe people –Extract the relevant information Problems: –How can we know if a page describes a person? –How can we know what to extract? (Everyone has their own style for their homepage...) –How can we "understand" the extracted information (What parts of the page describe which information?)

11 Example 2: Weather Forecasting National Weather Service: Weather Forecasting and Weather Alerts Flood Alerts in Mississippi

12 Wouldn't it be great if… Wouldn't it be great if Tom could get automatic updates of weather problems in Mississippi? It is dangerous to go boating if there are floods…

13 Example 3: News Alerts Yahoo News Traffic Jam in the Mississippi River

14 Wouldn't it be great if… Wouldn't it be great if Tom could get automatic updates of important news related to Mississippi? He might want to choose a different river to go boating…

15 Can these things be done? Once again, we need to FIND the relevant pages and EXTRACT the relevant data HTML pages are constantly changing How can we figure out what data is relevant and what the data is talking about automatically? (even when the page changes) HTML describes only style and not meaning (or semantics) It is difficult (perhaps impossible) to perform these tasks

16 Two Basic Approaches If the information on the Web were neatly organized in a huge database, these problems could be solved. But it's not. What should we do? AI, NLP Approach: Use smart techniques to recognize information, e.g., recognize patterns in how things are written DB Approach: Turn the Web into a “database” by writing it in XML

17 The Semantic Web The Semantic Web is a machine-understandable Web The meaning of data (i.e., the semantics of data) should be encoded together with the data Tim Berners-Lee, the inventor of the Web (by putting together the ideas of hyper-text, TCP/IP, DNS) is one of the main people behind the Semantic Web

18 Main Technologies Needed XML: The syntax for marking up text with meaning RDF: Defines objects and relationships between them OWL: Defines ontologies which connect different concepts (e.g., a car is an automobile, a car is a type of locomotive) Web Services: Allow services given online to be accessed programmatically Here is a simplified version of how it could work

19 Thomas Sawyer Male English Huckleberry Finn Simplified version of the FOAF standard

20 Is there XML on the Web? (1) The weather forecasting site exports its forecasts as RSS (a standard for marking up news) - this data can easily be used by a program

21 Is there XML on the Web? (2) Yahoo News (seen before) exports its news as RSS - this data can easily be used by a program

22 Motivation (2): Data Exchange

23 Exchanging Data Problem: Many data sources, each of a different type (different vendor), with a different schema. –How can the data be combined and used together? –How can different companies collaborate on their data? –What (proprietary?) format should be used to exchange the data?

24 Usage Scenario: Company Collaboration Several companies want to collaborate Need to share data Each company has a different type of database system with a different schema Solution: Agree on an XML schema for exchange. Import to and export from this schema

25 Motivation (3): Separating Content From Style

26 Web Site Development Web sites develop over time Important to separate style from data in order to allow changes to the site structure and appearance CSS separates style from data only in a limited way – HTML will still have tables, lists, etc Using XML, we can store data alone Using XSL, this data can be translated into HTML The data can be translated differently as the site develops

27 Write Once Use Everywhere XML Stock Data, translated by XSL into: WML (hand-held devices), HTML (web browser), TEXT (Excel)

28 XML Syntax

29 HTML Used for publishing hypertext on the World-Wide Web Designed to describe how a Web browser should arrange text, images and push-buttons on a page Easy to learn, but does not convey structure Fixed tag set

30 HTML Example Welcome to the DBI course Introduction Opening tag Closing tag Text (PCDATA) “Bachelor” tag Attribute name Attribute value

31 XML Vs. HTML XML and HTML are “brothers”. They are both special cases of SGML. HTML has specific tag and attribute names. These are associated with a specific meaning XML can have any tag and attribute name. These are not associated with any meaning HTML is used to specify visual style XML is used to specify meaning HTML XML SGML

32 Terminology The segment of an XML document between an opening and a corresponding closing tag is called an element Bart Simpson 02 – 444 7777 051 – 011 022 bart@tau.ac.il element element, a sub-element of not an element

33 XML Document is a Tree XML documents are abstractly modeled as rooted, ordered, labeled trees, as reflected by their nesting Sometimes, XML documents are graphs (by using IDs and IDREFs) person name email tel Bart Simpson 02 – 444 7777 051 – 011 022 bart@tau.ac.il
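To make the tree view concrete, the sketch below walks a small document and records each node with its depth. The markup is an assumption reconstructed from the labels on the slide (person, name, tel, email), and outline is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed markup built from the labels shown on the slide.
doc = (
    "<person>"
    "<name>Bart Simpson</name>"
    "<tel>02-444 7777</tel>"
    "<tel>051-011 022</tel>"
    "<email>bart@tau.ac.il</email>"
    "</person>"
)

def outline(element, depth=0):
    """Return (depth, tag, text) triples in document order."""
    text = (element.text or "").strip()
    rows = [(depth, element.tag, text)]
    for child in element:  # children are visited in their document order
        rows.extend(outline(child, depth + 1))
    return rows

rows = outline(ET.fromstring(doc))
for depth, tag, text in rows:
    print("  " * depth + tag, text)
```

The root sits at depth 0 and its sub-elements at depth 1; the order of the rows reflects that the tree is ordered.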

34 Example XML Fragment Donald Duck 04-828-1345 04-828-1374 donald@cs.technion.ac.il Miki Mouse 03-426-1142

35 Another Example An element may contain a mixture of sub-elements and PCDATA British Airways World’s favorite airline

36 A Complete XML Document Lisa Simpson 02-828-1234 054-470-777 lisa@cs.huji.ac.il Required Optional

37 Attributes An opening tag may contain attributes These are typically used to describe the contents of an element cheese fromage branza A food made …

38 Rules for XML (1) XML is order sensitive, i.e. the following are different: XML is case-sensitive, i.e., the following are different:,, cheese fromage fromage cheese

39 Rules for XML (2) Tags come in pairs... They must be properly nested. Which of the following are good? –......... –...... There is a special shortcut for tags that have no text in between them (bachelor tags) –

40 Rules for XML (3) There should be exactly one top-level element. This element is also called the root element Which of the following is legal? Is this legal? Is this legal? You tell me.
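The three rules (matching case, proper nesting, a single root) are what a parser checks when it decides whether a document is well-formed. A minimal sketch, using placeholder tag names a and b that are not from the original slides:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Placeholder inputs, one per rule discussed on the slides.
samples = [
    "<a><b>text</b></a>",   # properly nested pairs
    "<a><b>text</a></b>",   # closing tags in the wrong order
    "<a><b/></a>",          # shortcut for a tag with no content
    "<a></A>",              # case mismatch between opening and closing tag
    "<a/><b/>",             # two top-level elements
]

results = {sample: is_well_formed(sample) for sample in samples}
for sample, ok in results.items():
    print(sample, "->", "well-formed" if ok else "rejected")
```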

41 Tables Versus XML Can you easily represent the contents of a table in XML? –Example: Projects(title, budget, managedBy), Employees(name, age, ssn) Can you easily represent the contents of an XML document in a table? –Example: Remember the phone book
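The first direction (table to XML) is mechanical: one element per row and one sub-element per column. The sketch below shows one possible encoding; the row data, the row element name project and the helper table_to_xml are all made up for illustration.

```python
import xml.etree.ElementTree as ET

def table_to_xml(table_name, row_name, columns, rows):
    """Encode a table as one element per row with one child per column."""
    root = ET.Element(table_name)
    for row in rows:
        row_element = ET.SubElement(root, row_name)
        for column, value in zip(columns, row):
            ET.SubElement(row_element, column).text = str(value)
    return root

# Hypothetical rows for the Projects(title, budget, managedBy) table.
projects = table_to_xml(
    "Projects",
    "project",
    ["title", "budget", "managedBy"],
    [("Alpha", 10000, "Joe"), ("Beta", 25000, "Ann")],
)
xml_text = ET.tostring(projects, encoding="unicode")
print(xml_text)
```

The reverse direction is harder because an XML element may repeat or omit children (several phone numbers, no fax), which a single flat table cannot hold without extra tables or empty cells.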

42 Document Type Definitions (DTDs): Imposing Structure on XML Documents

43 Document Type Definitions Document Type Definitions (DTDs) impose structure on an XML document Using DTDs, we can specify what a “valid” document should contain. These specifications require more than just being well-formed, e.g., what elements are legal, what nesting is allowed DTDs do not have very great expressive power, e.g., cannot specify types

44 What is it good for? DTDs can be used to define special languages of XML, i.e., restricted XML for special needs Examples: –FOAF –SVG (scalable vector graphics) –WML (a kind of HTML for wireless devices) –SOAP (for web services) –XHTML (well-formed version of HTML) Standards for data exchange can be defined using DTDs, and special applications can be written for them

45 Address Book DTD Suppose we want to create a DTD that describes legal address book entries This DTD will be used to exchange address book information between programs How should it be written? (What is a legal address?) We discuss both element definitions and attribute definitions

46 Element Definitions

47 Example: An Address Book Homer Simpson Dr. H. Simpson 1234 Springwater Road Springfield USA, 98765 (321) 786 2543 (321) 786 2544 homer@math.springfield.edu Annotations: exactly one name; at most one greeting; as many address lines as needed; mixed telephones and faxes; at least one email

48 Specifying the Structure How do we specify exactly what must appear in a person element? In a DTD, we can specify the permitted content for each element. The permitted content is specified as a regular expression We show the general syntax, and then an example

49
a                      Element a
e1?                    0 or 1 occurrences of expression e1
e1*                    0 or more occurrences of expression e1
e1+                    1 or more occurrences of expression e1
e1,e2                  Expression e2 after expression e1
e1|e2                  Either expression e1 or expression e2 (but not both!)
(e)                    Grouping
#PCDATA                Parsed character data (i.e., text)
EMPTY                  No content
ANY                    Any content
(#PCDATA|a1|..|an)*    Mixed content

50 What’s in a person Element? The expression is: –name, greet?, addr*, (tel | fax)*, email+ We discuss what each part of this means –name = there must be a name element –greet? = there is an optional greet element (i.e., 0 or 1 greet elements) –name, greet? = the name element is followed by an optional greet element

51 What’s in a person Element? (cont.) –addr* = there are 0 or more address elements –tel | fax = there is a tel or a fax element –(tel | fax)* = there are 0 or more repeats of tel or fax –email+ = there are 1 or more email elements The full expression: name, greet?, addr*, (tel | fax)*, email+
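Since a content model is a regular expression over child element names, it can be mimicked with an ordinary regex engine. The sketch below is not how a validating parser works internally; it simply writes each child name followed by a space and matches the resulting string. PERSON_MODEL and matches_person are hypothetical names.

```python
import re

# name, greet?, addr*, (tel | fax)*, email+
# Each child name is followed by one space so names cannot run together.
PERSON_MODEL = re.compile(r"(name )(greet )?(addr )*(tel |fax )*(email )+")

def matches_person(children):
    """Check a sequence of child element names against the content model."""
    encoded = "".join(child + " " for child in children)
    return PERSON_MODEL.fullmatch(encoded) is not None

print(matches_person(["name", "email"]))
print(matches_person(["name", "greet", "addr", "addr", "tel", "fax", "tel", "email", "email"]))
print(matches_person(["name", "greet", "greet", "email"]))  # two greetings
print(matches_person(["name", "tel"]))                      # no email
print(matches_person(["greet", "name", "email"]))           # wrong order
```

The first two sequences satisfy the model; the last three violate "at most one greeting", "at least one email" and the required order, respectively.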

52 What’s in a person Element? (cont.) Does the expression name, greet?, addr*, (tel | fax)*, email+ differ from: –name, greet?, addr*, tel*, fax*, email+ –name, greet?, addr*, (fax | tel)*, email+ –name, greet?, addr*, (fax | tel)*, email, email* –name, greet?, addr*, (fax | tel)*, email*, email

53 DTD For the Address Book <!DOCTYPE addressbook [ <!ELEMENT person (name, greet?, address*, (fax | tel)*, email+)> ]>

54 Example Requirements: –Every country must have a name as the first node. –Every country must have a capital city as the following node. –A country may have a king. –A country may have a queen. What is wrong with the following: –

55 Unambiguity A DTD must be 1-unambiguous, i.e., it must be clear at any moment when parsing a document which point we are at in the regular expression Which of the following is 1-unambiguous? –(a,b)|(a,c) –a,(b|c) We now formalize these ideas…

56 Languages An element definition defines a language, i.e., the set of all legal series of children Example: Which of the following are in the language defined by a*,(b|c),a+ –aba –abca –aab –aaacaaa

57 Automata Languages can also be defined using an automaton An automaton is: –a set of states Q –an alphabet Σ –a transition function δ, which associates a pair (q,a) with a state q’ –an initial state q0 –a set of accepting states F A word a1…an is in the language defined by an automaton if there is a path from q0 to a state in F with edges labeled a1,…,an
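The definition translates almost directly into code. The sketch below simulates a deterministic automaton given as a transition table; the particular automaton is a hand-built assumption for the expression a*,(b|c),a+ from the previous slide, not one taken from the deck.

```python
def accepts(delta, start, accepting, word):
    """Follow the edges labeled by word; accept if the walk ends in F."""
    state = start
    for letter in word:
        if (state, letter) not in delta:
            return False  # no edge for this letter, so the word is rejected
        state = delta[(state, letter)]
    return state in accepting

# Hypothetical automaton for a*,(b|c),a+ :
#   q0 loops on a, moves to q1 on b or c; q1 moves to q2 on a; q2 loops on a.
delta = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q0", "c"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q2",
}
start = "q0"
accepting = {"q2"}

for word in ["aba", "abca", "aab", "aaacaaa"]:
    print(word, accepts(delta, start, accepting, word))
```

Using a dictionary keyed by (state, letter) makes the automaton deterministic by construction, since each pair can map to only one target state.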

58 Automata Example: What Language Does this Define? [Figure: automaton with states q0, q1, q2, q3 and edges labeled a, a, b, b, c]

59 Non-Deterministic Automata An automaton is non-deterministic if there is a state q and a letter a such that there are at least two transitions from q via edges labeled with a –What words are in the language of a non-deterministic automaton? We now show how to create a Glushkov automaton from a regular expression

60 Creating an automaton from an element definition Expression: a*,(b|c),a+ Step 1: Normalize the expression by replacing any occurrence of an expression e+ with e,e* Result: a*,(b|c),a,a* Step 2: Use subscripts to number each occurrence of each letter Result: a1*,(b1|c1),a2,a3*

61 Creating an automaton from an element definition Expression: a1*,(b1|c1),a2,a3* Step 3: Create a state for each subscripted letter, and a state q0 States: q0, a1, b1, c1, a2, a3 Step 4: Choose as accepting states all subscripted letters with which it is possible to end a word

62 Creating an automaton from an element definition Expression: a1*,(b1|c1),a2,a3* Step 5: Create a transition from a state li to a state kj if there is a word in which kj follows li. Label the transition with k States: q0, a1, b1, c1, a2, a3 You fill in the transitions!

63 1-unambiguous A language is 1-unambiguous if its Glushkov automaton is deterministic. –otherwise it is 1-ambiguous –element definitions in a DTD must be 1-unambiguous! Examples: Create a Glushkov automaton for the following and check whether the corresponding languages are 1-unambiguous –(a,b)|(a,c) –a,(b|c) –a?, d+, b*, d*, (c|b)+
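The construction and the determinism test can be sketched in a few lines. This is one possible implementation under stated assumptions: expressions are written as nested tuples (an encoding invented here), e+ is handled directly instead of being rewritten as e,e*, and instead of building the automaton explicitly the code checks that no state has two outgoing edges with the same letter. The transitions out of q0 go to the "first" positions, and the transitions out of a position go to its "follow" positions.

```python
from itertools import count

def glushkov_sets(expr):
    """Return (first, follow) for a content model given as nested tuples.

    Node shapes: ("sym", name), ("seq", e1, e2), ("alt", e1, e2),
    ("star", e), ("opt", e), ("plus", e). Every symbol occurrence becomes
    a position (name, index), like the subscripted letters of Step 2.
    """
    numbering = count(1)
    follow = {}

    def walk(node):
        # Returns (nullable, first positions, last positions) of node.
        kind = node[0]
        if kind == "sym":
            position = (node[1], next(numbering))
            follow[position] = set()
            return False, {position}, {position}
        if kind in ("star", "opt", "plus"):
            nullable, first, last = walk(node[1])
            if kind != "opt":  # repetition: the end may loop back to the start
                for position in last:
                    follow[position] |= first
            return nullable or kind != "plus", first, last
        n1, f1, l1 = walk(node[1])
        n2, f2, l2 = walk(node[2])
        if kind == "seq":
            for position in l1:
                follow[position] |= f2
            first = f1 | f2 if n1 else f1
            last = l1 | l2 if n2 else l2
            return n1 and n2, first, last
        return n1 or n2, f1 | f2, l1 | l2  # kind == "alt"

    _, first, _ = walk(expr)
    return first, follow

def is_one_unambiguous(expr):
    """True if no state has two outgoing edges carrying the same letter."""
    first, follow = glushkov_sets(expr)
    for targets in [first, *follow.values()]:
        letters = [letter for letter, _ in targets]
        if len(letters) != len(set(letters)):
            return False
    return True

def chain(kind, *parts):
    """Fold several parts into nested binary nodes of the given kind."""
    node = parts[0]
    for part in parts[1:]:
        node = (kind, node, part)
    return node

a, b, c, d = (("sym", letter) for letter in "abcd")
example1 = ("alt", ("seq", a, b), ("seq", a, c))          # (a,b)|(a,c)
example2 = ("seq", a, ("alt", b, c))                      # a,(b|c)
example3 = chain("seq", ("opt", a), ("plus", d), ("star", b),
                 ("star", d), ("plus", ("alt", c, b)))    # a?, d+, b*, d*, (c|b)+

for label, expr in [("(a,b)|(a,c)", example1), ("a,(b|c)", example2),
                    ("a?, d+, b*, d*, (c|b)+", example3)]:
    print(label, is_one_unambiguous(expr))
```

Running the sketch on the three exercise expressions is a quick way to check a hand-drawn automaton.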

64 Ambiguous Example Replace the following with a 1-unambiguous equivalent expression <!ELEMENT country (president | king | (king,queen) | queen)>

65 Another Example Customers may pay with a combination of credit cards and cash. If cards and cash are both used, the cards must come first. There may be more than one card. There may be no more than one cash element. At least one method of payment must be used. Find a 1-unambiguous definition for the element payment, using the elements card and cash

66 Can you do this? Show that the following regular expression is not 1-unambiguous: –(b|c)?, b, (b|c) Find an equivalent regular expression that is 1-unambiguous

67 Attribute Definitions

68 More DTD Syntax XML documents can have elements, which can have attributes. How are they defined? General Syntax: <!ATTLIST element-name attribute-name1 type1 default-value1 attribute-name2 type2 default-value2 … attribute-namen typen default-valuen> Example:

69 <!ATTLIST element-name attribute-name1 type1 default-value1 attribute-name2 type2 default-value2 … attribute-namen typen default-valuen> type is one of the following (there are additional possibilities that we don’t discuss)
CDATA           character data
(en1|en2|..)    value must be one from the given list
ID              value is a unique id
IDREF           value is the id of another element
IDREFS          value is a list of other ids

70 <!ATTLIST element-name attribute-name1 type1 default-value1 attribute-name2 type2 default-value2 … attribute-namen typen default-valuen> default-value is one of the following
value        The default value of the attribute
#REQUIRED    The attribute value must be included in the element
#IMPLIED     The attribute does not have to be included

71 Examples <!ATTLIST height dimension (cm | in) #REQUIRED accuracy CDATA #IMPLIED resizable CDATA "yes" >

72 Specifying ID and IDREF Attributes <!DOCTYPE family [ <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> ]>

73 Specifying ID and IDREF Attributes (cont.) The attributes mother and father are references to IDs of other elements However, those are not necessarily person elements! The mother attribute is not necessarily a reference to a female person References to IDs have no type

74 Some Conforming Data Lisa Simpson Bart Simpson Marge Simpson Homer Simpson

75 Consistency of ID and IDREF Attribute Values If an attribute is declared as ID –the associated values must all be distinct (no confusion) –In other words, no two ID attributes can have the same value If an attribute is declared as IDREF –the associated value must exist as the value of some ID attribute (no dangling “pointers”) Similarly for all the values of an IDREFS attribute
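Both consistency rules are easy to check by hand in code. The sketch below assumes the attribute names of the family DTD shown earlier (id, mother, father, children); the id values and the helper check_references are invented for illustration. Note that Python's xml.etree.ElementTree does not validate against a DTD, so the check is written out explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical conforming data; the id values are made up.
doc = """
<family>
  <person id="lisa" mother="marge" father="homer"><name>Lisa Simpson</name></person>
  <person id="bart" mother="marge" father="homer"><name>Bart Simpson</name></person>
  <person id="marge" children="bart lisa"><name>Marge Simpson</name></person>
  <person id="homer" children="bart lisa"><name>Homer Simpson</name></person>
</family>
"""

def check_references(root):
    """Return (duplicate ids, dangling references) found in the document."""
    seen, duplicates = set(), set()
    for element in root.iter():
        value = element.get("id")
        if value is None:
            continue
        if value in seen:
            duplicates.add(value)  # an ID value must be unique
        seen.add(value)

    dangling = set()
    for element in root.iter():
        # IDREF attributes hold one id; the IDREFS attribute holds a list.
        refs = [element.get(attr) for attr in ("mother", "father") if element.get(attr)]
        refs += element.get("children", "").split()
        dangling |= {ref for ref in refs if ref not in seen}
    return duplicates, dangling

duplicates, dangling = check_references(ET.fromstring(doc))
print("duplicate ids:", duplicates, "dangling references:", dangling)
```

As the slides point out, the check only asks that each referenced id exists somewhere; it cannot require that a mother reference points to a person element.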

76 Valid Documents A document with a DTD is valid if it conforms to the DTD, i.e., –the document conforms to the regular-expression grammar, –types of attributes are correct, and –constraints on references are satisfied

77 DTD Issues

78 DTD Problems (1) DTDs are rather weak specifications by DB & programming-language standards –Only one base type – PCDATA –No useful “abstractions”, e.g., sets –IDREFs are untyped –No constraints, e.g., child is inverse of parent –Tag definitions are global –Not easily parsed (since they are not XML) Some extensions of XML impose a schema or types on an XML document, e.g., XML Schema

79 DTD Problems (2) How would you say that element a has exactly the children c, d, e in any order? In general, can such definitions be written efficiently?
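One way to see why the answer is "not efficiently" is to generate such a content model. The sketch below enumerates every ordering, factored so that each alternative starts with a different element (which keeps the expression 1-unambiguous). any_order_model is a hypothetical helper, and other encodings are possible.

```python
def any_order_model(names):
    """Build a content model that accepts the given children in any order."""
    if len(names) == 1:
        return names[0]
    branches = []
    for index, head in enumerate(names):
        rest = names[:index] + names[index + 1:]
        # Choose the first child, then allow the remaining ones in any order.
        branches.append(f"({head}, {any_order_model(rest)})")
    return "(" + " | ".join(branches) + ")"

model = any_order_model(["c", "d", "e"])
print("<!ELEMENT a " + model + ">")

# The number of element names in the model grows at least as fast as n!.
for n in range(1, 7):
    names = [f"x{i}" for i in range(n)]
    print(n, any_order_model(names).count("x"))
```

Three children already need fifteen name occurrences, and each additional child multiplies the size again.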

80 Be Careful (1) <!DOCTYPE genealogy [ <!ELEMENT person (name, dateOfBirth, person, person)> ... ]> (the first person child is the mother, the second is the father) What is the problem with this?

81 Be Careful (2) <!DOCTYPE genealogy [ <!ELEMENT person (name, dateOfBirth, person?, person?)> ... ]> (the first person child is the mother, the second is the father) What is now the problem with this?

82 XPath: Extracting Data from XML

83 Extracting Data from XML Data stored in an XML document must be extracted to use with various applications Data can be extracted programmatically We will discuss XPath: a declarative language for extracting data from XML XPath is used extensively in other specifications, e.g., XQuery, XSL, XPointer

84 An XML document: Dark Side of the Moon, Pink Floyd, 10.90; Space Oddity, David Bowie, 9.90; Aretha: Lady Soul, Aretha Franklin, 9.90

85 catalog.xml: The XML document as a tree [Figure: root element catalog with three cd children; each cd has title, artist and price children, and a country attribute (values UK and USA are shown)]

86 Main Element of Syntax: Path Expressions / at the beginning of an XPath expression represents the root of the document / between element names represents a parent-child relationship // represents an ancestor-descendant relationship @ marks an attribute [condition] specifies a condition

87 catalog.xml: /catalog Getting the root element of the document

88 catalog.xml: /catalog/cd Finding child nodes

89 catalog.xml: /catalog/cd/price Finding descendant nodes

90 catalog.xml: /catalog/cd[price<10] Condition on elements

91 catalog.xml: //title and /catalog//title // represents any directed path in the document

92 catalog.xml: /catalog/cd/* * represents any element name in the document

93 catalog.xml: /*/* * represents any element name in the document What will the following expressions return? //* //*[price=9.90] //*[price=9.90]/*

94 catalog.xml: /catalog/cd[1] and /catalog/cd[last()] Position based condition

95 catalog.xml: //title | //price Use | for union

96 catalog.xml: /catalog/cd[@country="UK"] @ marks attributes

97 catalog.xml: /catalog/cd/@country @ marks attributes
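Several of the path expressions above can be tried with Python's standard library, which supports a limited subset of XPath. The sketch assumes a reconstruction of catalog.xml (the assignment of countries to CDs is a guess), writes paths relative to the root element rather than starting with /, and falls back to plain Python where the subset lacks a feature such as numeric comparison.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of catalog.xml; which cd carries which country is a guess.
catalog = ET.fromstring("""
<catalog>
  <cd country="UK"><title>Dark Side of the Moon</title><artist>Pink Floyd</artist><price>10.90</price></cd>
  <cd country="UK"><title>Space Oddity</title><artist>David Bowie</artist><price>9.90</price></cd>
  <cd country="USA"><title>Aretha: Lady Soul</title><artist>Aretha Franklin</artist><price>9.90</price></cd>
</catalog>
""")

# catalog is already the root element, so /catalog/cd becomes the relative path "cd".
all_cds = catalog.findall("cd")                                   # /catalog/cd
prices = [p.text for p in catalog.findall("cd/price")]            # /catalog/cd/price
titles = [t.text for t in catalog.findall(".//title")]            # //title
uk_titles = [cd.findtext("title")
             for cd in catalog.findall("cd[@country='UK']")]      # /catalog/cd[@country="UK"]
first_title = catalog.find("cd[1]").findtext("title")             # /catalog/cd[1]
last_title = catalog.find("cd[last()]").findtext("title")         # /catalog/cd[last()]
countries = [cd.get("country") for cd in all_cds]                 # /catalog/cd/@country

# ElementTree has no numeric predicates, so [price<10] is filtered in Python.
cheap = [cd.findtext("title") for cd in all_cds
         if float(cd.findtext("price")) < 10]                     # /catalog/cd[price<10]

print(prices, titles, uk_titles, first_title, last_title, countries, cheap)
```

A full XPath engine would accept the expressions exactly as written on the slides, including the union operator.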

98 catalog.xml: How would you write: the price of the CDs whose artist is David Bowie?

99 XPath Axes We have discussed the following axes: –Child (/) –Descendant (//) –Attribute (@) These symbols are actually a shorthand, e.g., /cd//price selects the same nodes as /child::cd/descendant::price There are additional shorthands, e.g., –Self (.) –Parent (..)

100 Additional Axes
ancestor              Contains all ancestors (parent, grandparent, etc.) of the current node
ancestor-or-self      Contains the current node plus all its ancestors (parent, grandparent, etc.)
descendant-or-self    Contains the current node plus all its descendants (children, grandchildren, etc.)
following             Contains everything in the document after the closing tag of the current node
following-sibling     Contains all siblings after the current node
preceding             Contains everything in the document that is before the starting tag of the current node
preceding-sibling     Contains all siblings before the current node

101 Think About it We say that XPath expression p1 is contained in XPath expression p2 if for all documents D, the set of nodes returned by applying p1 to D is contained in the set of nodes returned by applying p2 to D Example: –//a/a is contained in //a –//*[a] is not contained in //a, nor is //a contained in //*[a]
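Containment is a statement about all documents, so a single document can only refute it, never prove it. The sketch below evaluates the three expressions on one hand-made document (an assumption chosen so that the results differ) and compares the result sets; result_set is a hypothetical helper and the paths are written relative to the root element, as Python's standard library requires.

```python
import xml.etree.ElementTree as ET

# Hand-made document: the outer a has an a child, and b has an a child too.
root = ET.fromstring("<r><x><a><a/></a><b><a/></b></x></r>")

def result_set(path):
    """Evaluate a path from the root and return the matched elements as a set."""
    return set(root.findall(path))

all_a = result_set(".//a")          # //a
nested_a = result_set(".//a/a")     # //a/a
has_a_child = result_set(".//*[a]") # //*[a]

# Consistent with //a/a being contained in //a (not a proof).
print(nested_a <= all_a)
# Counterexamples: neither of //*[a] and //a contains the other.
print(has_a_child <= all_a, all_a <= has_a_child)
```

Here b is returned by //*[a] but not by //a, and the two innermost a elements are returned by //a but not by //*[a], which is exactly the kind of witness the non-containment claims need.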

102 Think About it State all containments that hold between the following expressions: –//*/* –/*//* –//* –/* Explain what the following expression returns. –//*[//a]//a[/a][a]

