1 Semistructured Data and XML

2 How the Web Is Today
HTML documents are often generated by applications but consumed by humans only. They offer easy access across platforms and across organizations, yet carry only layout and no semantic information. There is no application interoperability: HTML is not understood by applications, and screen scraping is brittle. Database technology is client-server and still vendor-specific.

3 XML Data Exchange Format
A standard from the W3C (World Wide Web Consortium, http://www.w3.org). The mission of the W3C: "... developing common protocols that promote its evolution and ensure its interoperability ...". Basic ideas: XML = data; XML is generated by applications; XML is consumed by applications; easy access across platforms and organizations.

4 Paradigm Shift on the Web
For web search engines: from documents (HTML) to data (XML); from document management to document understanding (e.g., question answering); from information retrieval to data management.
For database systems: from the relational (structured) model to semistructured data; from data processing to data/query translation; from storage to transport.

5 The Semistructured Data Model
[Figure: an Object Exchange Model (OEM) graph. The arc Bib leads to the root object &o1, whose children &o12, &o24 and &o29 are reached by arcs labeled paper, book and paper. Further arcs (author, title, year, http, publisher, references, page, firstname, lastname, first, last) lead to complex objects or to atomic objects such as "Serge", "Abiteboul", 1997, "Victor", "Vianu", 122 and 133.]

6 The Semistructured Data Model
Data is self-describing, i.e. the data description is integrated with the data itself rather than kept in a separate schema. The database is a collection of nodes and arcs (a directed graph). Leaf nodes represent data of some atomic type (atomic objects, such as numbers or strings). Interior nodes represent complex objects consisting of components (child nodes), connected by arcs to this node. Arcs are directed and connect two nodes.

7 The Semistructured Data Model
Arc labels indicate the relationship between the two corresponding nodes. The root node is the only interior node without in-arcs and represents the entire database. All database objects are descendants of the root node, so every node must be reachable from the root. A general graph structure is possible, i.e. the graph need not be a tree.
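
The graph model described on these slides can be sketched in a few lines of Python. This is only an illustration: the object ids, labels and values below are made up, and the dictionary encoding is one possible representation, not the one used in the lecture.

```python
# Hypothetical OEM instance: each object id maps either to an atomic value
# (leaf node) or to a list of (label, child id) arcs (interior node).
oem = {
    "o1": [("paper", "o12"), ("book", "o24"), ("paper", "o29")],
    "o12": [("title", "o40")],
    "o24": [("title", "o41")],
    "o29": [("author", "o52"), ("references", "o12"), ("references", "o24")],
    "o40": "A paper",
    "o41": "A book",
    "o52": "Abiteboul",
}
ROOT = "o1"

def is_atomic(oid):
    """Leaf nodes hold atomic values; interior nodes hold a list of arcs."""
    return not isinstance(oem[oid], list)

def reachable(root):
    """Return the set of object ids reachable from root by following arcs."""
    seen, stack = set(), [root]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        if not is_atomic(oid):
            stack.extend(child for _label, child in oem[oid])
    return seen

# Every object must be reachable from the root.
all_reachable = reachable(ROOT) == set(oem)
```

Object o12 has two incoming arcs (from o1 and from o29), so this instance is a general graph rather than a tree.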

8 Syntax for Semistructured Data
Bib: &o1 { paper: &o12 { … },
           book: &o24 { … },
           paper: &o29 { author: &o52 "Abiteboul",
                         author: &o96 { firstname: &o243 "Victor", lastname: &o206 "Vianu" },
                         title: &o93 "Regular path queries with constraints",
                         references: &o12,
                         references: &o24,
                         pages: &o25 { first: &o64 122, last: &o92 133 } } }
Observe: nested tuples, set values, oids!

9 Syntax for Semistructured Data
May omit oids:
{ paper: { author: "Abiteboul",
           author: { firstname: "Victor", lastname: "Vianu" },
           title: "Regular path queries …",
           page: { first: 122, last: 133 } } }

10 Vs. Relational Model
Characteristics of semistructured data: missing attributes; additional attributes; multiple attribute values (set-valued attributes); objects as attribute values; no global schema. Only the first characteristic is supported by the relational model; all the others are not.

11 Vs. Relational Model
Semistructured data: self-describing, irregular data, no a-priori structure.
Relational DB: separate schema, regular data, a-priori structure.

12 XML

13 Important XML Standards
XSL/XSLT: presentation and transformation standards. RDF: Resource Description Framework (meta-information such as ratings, categorizations, etc.). XPath/XPointer/XLink: standards for linking to documents and to elements within them. Namespaces: for resolving name clashes. DOM: Document Object Model for manipulating XML documents. SAX: Simple API for XML parsing. XQuery: query language.

14 XML
A W3C standard to complement HTML. Origins: structured text (SGML), large-scale electronic publishing, data exchange on the web. Motivation: HTML describes presentation, XML describes content. Specification: http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, second edition, 10/2000).

15 From HTML to XML
HTML describes the presentation.

16 HTML
Bibliography
Foundations of Databases. Abiteboul, Hull, Vianu. Addison Wesley, 1995.
Data on the Web. Abiteboul, Buneman, Suciu. Morgan Kaufmann, 1999.
HTML describes the presentation.

17 XML
Foundations… Abiteboul Hull Vianu Addison Wesley 1995 …
XML describes the content.

18 Why Are We DB'ers Interested?
It's data. That's us. Database issues: How are we going to model XML? (Graphs.) How are we going to query XML? (XQuery.) How are we going to store XML? (In a relational database? Object-oriented? Native?) How are we going to process XML efficiently? (Many interesting research questions!)

19 Elements
Tags: book, title, author, … A start tag has the form <book>, an end tag the form </book>. Tags are defined by the user / programmer (different from HTML!).
Elements: <book>…</book>, <author>…</author>. An element consists of a matching start and end tag and the enclosed content. Elements can be nested, i.e. the content of one element can consist of a sequence of other elements.

20 Attributes
Attributes can be associated with any element. They provide additional information about elements. An attribute can have only one value.
Example: Foundations of Databases Abiteboul … 1995
Attributes can also be used to connect elements.
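
To make elements and attributes concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The document is an assumption modeled on the bibliography used throughout the slides; in particular, the element names and the bookID attribute are illustrative, not taken from the original example.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography document (names and values are illustrative).
doc = """
<bibliography>
  <book bookID="b100">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>
"""
root = ET.fromstring(doc)
book = root.find("book")

# Nested elements: the content of <book> is a sequence of child elements.
child_tags = [child.tag for child in book]
authors = [a.text for a in book.findall("author")]

# Attributes: each attribute name carries a single value.
book_id = book.get("bookID")
missing = book.get("price")  # absent attributes simply yield None
```

An element can repeat (three author children), whereas an attribute name can appear only once per element.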

21 Non-Tree-Like XML
So far: only tree-like XML documents, i.e. each element is nested within at most one other element. Attributes can also be used to create non-tree XML documents. Attributes with a domain of ID serve as primary keys of elements. Attributes with a domain of IDREF serve as foreign keys referencing the ID of another element.

22 Non-Tree-Like XML
Example of a non-tree structure: Jane Mary John
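
The key / foreign-key role of ID and IDREF attributes can be sketched as follows. The document, the attribute names id and children, and their values are assumptions made for illustration. ElementTree does not read a DTD, so treating these attributes as ID and IDREFS is a convention of this sketch, not something the parser enforces.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "id" plays the role of an ID attribute,
# "children" the role of an IDREFS attribute (space-separated references).
doc = """
<family>
  <person id="p1"><name>Jane</name></person>
  <person id="p2" children="p1 p3"><name>Mary</name></person>
  <person id="p3"><name>John</name></person>
</family>
"""
root = ET.fromstring(doc)

# Index elements by their ID, like a primary-key lookup table.
by_id = {p.get("id"): p for p in root.iter("person")}

def children_of(person):
    """Follow the references stored in the 'children' attribute."""
    refs = person.get("children", "").split()
    return [by_id[ref].findtext("name") for ref in refs]

mary_children = children_of(by_id["p2"])
```

The references turn the element tree into a graph: Jane and John are nested under family but are also reachable from Mary.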

23 Namespaces
An XML document can involve tags that come from multiple sources. One and the same tag can appear in more than one source.
Apples Bananas
African Coffee Table 80 120

24 Namespaces
Name conflicts can be resolved by prefixing tag names according to their source.
Apples Bananas
African Coffee Table 80 120
When using prefixes in XML, a namespace for the prefix must be defined. The namespace must be referenced (via a URI) in the start tag of an enclosing element.
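
A minimal sketch of prefixed tags, again with Python's xml.etree.ElementTree. The prefixes h and f, the example.org URIs and the element names are all assumptions chosen to match the fruit-table versus furniture-table content of the slide.

```python
import xml.etree.ElementTree as ET

# Two different "table" elements, disambiguated by namespace prefixes.
# The namespace URIs are hypothetical; they only need to be distinct.
doc = """
<root xmlns:h="http://example.org/html" xmlns:f="http://example.org/furniture">
  <h:table><h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr></h:table>
  <f:table><f:name>African Coffee Table</f:name><f:width>80</f:width><f:length>120</f:length></f:table>
</root>
"""
ns = {"h": "http://example.org/html", "f": "http://example.org/furniture"}
root = ET.fromstring(doc)

# The prefix-to-URI mapping is passed explicitly when searching.
cells = [td.text for td in root.findall("h:table/h:tr/h:td", ns)]
furniture_name = root.findtext("f:table/f:name", namespaces=ns)

# Internally, ElementTree expands each prefix to "{URI}localname".
expanded_tag = root[0].tag
```

What identifies a tag is the namespace URI, not the prefix: a query may use different prefixes from the document as long as they map to the same URIs.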

25 Well-Formed XML
A well-formed XML document satisfies the following conditions: it begins with a declaration that it is XML; it has a single root element that encloses the whole document; it consists of properly nested elements, i.e. the start and end tag of an element are within the same enclosing element.
standalone="yes" indicates that the document does not rely on an external DTD. In this mode, you can invent your own tags, as in the semistructured data model.

26 Well-Formed XML
Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … …

27 Well-Formed XML
HTML browsers will display documents with errors (like missing end tags). The W3C XML specification states that a program should stop processing an XML document if it finds an error. The main reason is that XML is consumed by programs rather than by humans (as HTML is). W3C provides a validator that checks whether an XML document is well-formed.
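
The "stop on error" behavior can be observed with any conforming parser. The sketch below uses Python's xml.etree.ElementTree, which raises ParseError for input that is not well-formed; the helper name is_well_formed and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<book><title>Foundations</title></book>"
missing_end_tag = "<book><title>Foundations</book>"   # <title> is never closed
two_roots = "<book/><book/>"                          # more than one root element
```

Unlike a forgiving HTML browser, the parser does not try to repair the last two inputs; it refuses them.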

28 Valid XML
The validator can also check whether an XML document is valid, i.e. conforms to a Document Type Definition (DTD). A DTD specifies the allowable tags and how they can be nested. XML with a DTD is no longer semistructured (self-describing). However, a DTD is less rigid than the schema of a relational DB. E.g., a DTD allows missing and multiple attributes / elements.

29 DTD

30 Document Type Definitions
Document Type Definition (DTD): a set of rules (a grammar) specifying elements, attributes and all other aspects of XML documents. For each element, specify its name and content type. The content type can be, e.g., #PCDATA (character string), other elements, or a regular expression made of the above content types, using these operators:
* = zero or more occurrences
? = zero or one occurrence
+ = one or more occurrences
, = sequence of elements
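
Because a DTD content model is a regular expression over child elements, its effect can be imitated with an ordinary regex. The following sketch is not a DTD validator (Python's standard XML modules do not validate against DTDs); it only mirrors one hypothetical content model, (title, author+, publisher?, year), to show what *, ?, + and , mean.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model for <book>: (title, author+, publisher?, year).
# Each child tag is encoded as "tag;" so the model becomes a plain regex.
BOOK_MODEL = re.compile(r"title;(author;)+(publisher;)?year;")

def matches_model(element, model):
    """Check the sequence of child tags of element against the model."""
    child_sequence = "".join(child.tag + ";" for child in element)
    return model.fullmatch(child_sequence) is not None

ok = ET.fromstring(
    "<book><title>T</title><author>A</author><author>B</author><year>1995</year></book>")
no_author = ET.fromstring("<book><title>T</title><year>1995</year></book>")
wrong_order = ET.fromstring(
    "<book><author>A</author><title>T</title><year>1995</year></book>")
```

The first book is accepted although it has no publisher (?) and two authors (+); the other two violate the model.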

31 Document Type Descriptors
Sort of like a schema, but not really. Inherited from SGML (the DTD standard). A BNF grammar establishing constraints on element structure and content. Definitions of entities.

32 Example DTD: Product Catalog
<!DOCTYPE CATALOG [
<!ATTLIST PRODUCT
  NAME CDATA #IMPLIED
  CATEGORY (HandTool|Table|Shop-Professional) "HandTool"
  PARTNUM CDATA #IMPLIED
  PLANT (Pittsburgh|Milwaukee|Chicago) "Chicago"
  INVENTORY (InStock|Backordered|Discontinued) "InStock">
<!ATTLIST SPECIFICATIONS
  WEIGHT CDATA #IMPLIED
  POWER CDATA #IMPLIED>
<!ATTLIST OPTIONS
  FINISH (Metal|Polished|Matte) "Matte"
  ADAPTER (Included|Optional|NotApplicable) "Included"
  CASE (HardShell|Soft|NotApplicable) "HardShell">
<!ATTLIST PRICE
  MSRP CDATA #IMPLIED
  WHOLESALE CDATA #IMPLIED
  STREET CDATA #IMPLIED
  SHIPPING CDATA #IMPLIED>
]>

33 Shortcomings of DTDs
Useful for documents, but not so good for data:
Element name and type are associated globally. No support for structural re-use: object-oriented-like structures aren't supported. No support for data types, so data validation is not possible. An element can have a single key item (ID), but there is no support for multi-attribute keys, no support for foreign keys (references to other keys), and no constraints on IDREFs (e.g., that an IDREF may reference only a Section).

34 XML Schema

35 XML Schema
The successor of DTDs for specifying a schema for XML documents. A W3C standard. Includes and extends the functionality of DTDs. In particular, XML Schemas support data types. This makes it easier to validate the correctness of data and to work with data from a database. XML Schemas are written in XML: you don't have to learn a new language and can use your XML parser to parse your Schema files.

36 Example XML Schema
…

37 Simple Elements
Simple elements contain only text. They can have one of the built-in datatypes: xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time.

38 Simple Elements
Restrictions allow you to further constrain the content of simple elements.

39 Attributes
Attributes can be specified using the attribute element. Attribute elements are nested within the definition of the element with which they are associated. By default, attributes are optional. To make an attribute mandatory, set use="required" on the attribute element. Attributes can have the same built-in datatypes as simple elements.

40 Complex Elements
Complex elements can contain other elements and can have attributes. Nested elements need to occur in the order specified. The number of repetitions of an element is controlled by the attributes minOccurs and maxOccurs. The default is one repetition. A complex element with an attribute:

41 Complex Elements
A complex element containing a sequence of nested (simple) elements:

42 Complex Elements
If you name the complex element, other elements can reference and include it:

43 Example XML Schema
…

44 XML vs. Semistructured Data
Both are best described by a graph. Both are schema-less and self-describing (XML without DTD / XML Schema). XML is ordered, semistructured data is not. XML can mix text and elements:
Making Java easier to type and easier to type Phil Wadler
XML has lots of other stuff: attributes, entities, processing instructions, comments.

45 XML-Path = XPath

46 Query Languages for XML
XPath is a simple query language based on describing similar paths in XML documents. XQuery extends XPath in a style similar to SQL, introducing iterations, subqueries, etc. XPath and XQuery expressions are applied to an XML document and return a sequence of qualifying items. Items can be primitive values or nodes (elements, attributes, documents). The items returned do not need to be of the same type.

47 XPath
A path expression returns the sequence of all qualifying items that are reachable from the input item by following the specified path. A path expression is a sequence consisting of tags or attributes and special characters such as slashes ("/"). Absolute path expressions are applied to some XML document and return all elements that are reachable from the document's root element by following the specified path. Relative path expressions are applied to an arbitrary node.

48 XPath
Foundations… Abiteboul Hull Vianu Addison Wesley 1995 …
Applied to the above document, the XPath expression /bibliography/book/author returns the sequence Abiteboul Hull Vianu ...

49 Attributes
If we do not want to return the qualifying elements but the value of one of their attributes, we end the path expression with @attribute.
Foundations… Abiteboul Hull Vianu Addison Wesley 1995
Applied to this document, the XPath expression /bibliography/book/@bookID returns the sequence "b100" ...

50 Wildcards
We can use wildcards instead of actual tags and attributes: * means any tag, and @* means any attribute. Examples:
/bibliography/*/author returns the sequence Abiteboul Hull.
/bibliography//author/@* returns the sequence "IBM" "a739".
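
The path expressions from the last few slides can be tried out with Python's xml.etree.ElementTree, which implements a limited subset of XPath. Two caveats apply to this sketch: ElementTree paths are relative to the element they are called on (here the bibliography root), and it cannot select attributes with @name or @*, so attribute values are read with Python code instead. The document, including the paper element, the author "Doe" and the attribute names, is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only "b100", "IBM" and "a739" echo values on the slides.
doc = """
<bibliography>
  <book bookID="b100">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
  </book>
  <paper>
    <author affiliation="IBM" authorID="a739">Doe</author>
  </paper>
</bibliography>
"""
root = ET.fromstring(doc)  # root is the <bibliography> element

# /bibliography/book/author
book_authors = [a.text for a in root.findall("book/author")]

# /bibliography/book/@bookID  (attribute read in Python, not in the path)
book_ids = [b.get("bookID") for b in root.findall("book")]

# /bibliography/*/author  (* matches any child tag)
any_child_authors = [a.text for a in root.findall("*/author")]

# /bibliography//author/@*  (// matches at any depth; @* emulated in Python)
author_attrs = [v for a in root.findall(".//author") for v in a.attrib.values()]
```

A full XPath engine would evaluate the commented expressions directly; the Python around each findall call fills in what the subset lacks.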

51 Path Expressions
Examples: Bib.paper, Bib.book.publisher, Bib.paper.author.lastname.
Given an OEM instance, the value of a path expression p is a set of objects.

52 Path Expressions
Examples:
[Figure: the OEM instance DB. The arc Bib leads to the root &o1, whose children &o12, &o24 and &o29 are reached by arcs labeled paper, book and paper; further arcs (author, title, year, http, publisher, references, page, firstname, lastname, first, last) lead to objects such as &o43 to &o52, &o70, &o71, &96, &243, &206 and &25, and to atomic values such as "Serge", "Abiteboul", 1997, "Victor", "Vianu", 122 and 133.]
Bib.paper = {&o12, &o29}
Bib.book.publisher = {&o51}
Bib.paper.author.lastname = {&o71, &206}
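
Evaluating such a path expression amounts to following arc labels step by step, starting from the root. The sketch below assumes a dictionary encoding of an OEM graph; the instance is a guess loosely modeled on the figure (the exact arcs are not recoverable from the transcript), chosen so that the three example paths give the results listed above.

```python
# Hypothetical OEM instance: oid -> list of (label, child oid) arcs,
# or an atomic value for leaf objects.
oem = {
    "o1": [("paper", "o12"), ("book", "o24"), ("paper", "o29")],
    "o12": [("author", "o43")],
    "o24": [("publisher", "o51")],
    "o29": [("author", "o52"), ("author", "o96")],
    "o43": [("lastname", "o71")],
    "o96": [("firstname", "o243"), ("lastname", "o206")],
    "o51": "Addison Wesley",
    "o52": "Abiteboul",
    "o71": "Abiteboul",
    "o243": "Victor",
    "o206": "Vianu",
}

def eval_path(root, path):
    """Return the set of object ids reached from root via the dotted path."""
    current = {root}
    for label in path.split("."):
        reached = set()
        for oid in current:
            arcs = oem[oid]
            if isinstance(arcs, list):  # atomic objects have no out-arcs
                reached.update(child for lbl, child in arcs if lbl == label)
        current = reached
    return current

# "Bib" names the root object o1, so Bib.paper becomes eval_path("o1", "paper").
papers = eval_path("o1", "paper")
publishers = eval_path("o1", "book.publisher")
lastnames = eval_path("o1", "paper.author.lastname")
```

The atomic author o52 contributes nothing to the last result because it has no lastname arc; irregular structure simply yields fewer matches, not an error.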

53 XML-Query = XQuery

54 XQuery
Summary: FOR-LET-WHERE-ORDERBY-RETURN = FLWOR.
The FOR/LET clauses produce a list of tuples; the WHERE clause filters it, again yielding a list of tuples; the ORDERBY/RETURN clauses turn it into an instance of the XQuery data model.

55 XQuery
FLWOR expressions are similar to SQL select ... from ... where ... queries. XQuery allows zero, one or more for and let clauses. The where clause is optional. There is one optional order-by clause. Finally, there is exactly one return clause. XQuery is case-sensitive. XQuery (and XPath) is a W3C standard.

56 XQuery Clauses
for $x in expr
Defines node variable $x. The expression expr evaluates to a sequence of items. The variable $x is assigned to each item in turn, and the body of the for clause is executed once for each assignment.
let $x := expr
Defines collection variable $x. The expression expr evaluates to a sequence of items. The variable is bound to the entire sequence of items. Useful for common subexpressions and for aggregations.

57 XQuery Clauses
where condition
The condition is a boolean expression. The clause is applied to some item. If and only if the condition evaluates to true, the following return clause is executed for that item.
return expression
The result of a FLWOR expression is a sequence of items. Expression defines the result format for the current (qualifying) item. The sequence of items produced by expression is appended to the sequence of items produced so far.

58 Interpretation as XQuery
XQuery expressions can be used wherever an XML expression of any kind is permitted. Any text string is acceptable as the content of a tag or the value of an attribute. If a string contains an XQuery expression that should be evaluated, this substring must be surrounded by curly brackets {}.
Example: for $b in doc("bib.xml")/bibliography/book return {$b/title}

59 For vs. Let
Find all books:
for $x in doc("bib.xml")/bib/book
return $x
Returns: ...
let $x := doc("bib.xml")/bib/book
return $x
Returns: ...

60 XQuery
Find all book titles published after 1995:
for $x in doc("bib.xml")/bib/book
where $x/year > 1995
return $x/title
Result: abc def ghi
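
For readers more familiar with general-purpose languages, the for / let / where / return clauses map closely onto a list comprehension. The following Python sketch is an analogy, not an XQuery implementation; the bib document and its values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of bib.xml.
doc = """
<bib>
  <book><title>abc</title><year>1997</year></book>
  <book><title>def</title><year>1993</year></book>
  <book><title>ghi</title><year>2001</year></book>
</bib>
"""
bib = ET.fromstring(doc)

# for $x in /bib/book  where $x/year > 1995  return $x/title
titles_after_1995 = [
    x.findtext("title")                   # return: evaluated once per binding
    for x in bib.findall("book")          # for: $x is bound to each book in turn
    if int(x.findtext("year")) > 1995     # where: keeps only qualifying bindings
]

# let $x := /bib/book  return count($x)
all_books = bib.findall("book")           # let: one binding to the whole sequence
book_count = len(all_books)               # so the return is evaluated only once
```

The difference between for and let shows up in how often the return part runs: three times in the first query (before filtering), once in the second.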

61 Ordering the Query Result
The order by clause allows you to order the results of an XQuery expression.
order by list-of-expressions
The sort order is based on the value of the first expression. Ties are broken based on the value of the second (if necessary third, etc.) expression. By default, the order is ascending. A descending sort order can be specified using descending.
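
The tie-breaking rule corresponds to sorting on a tuple of keys. A small Python sketch of "order by publisher, price descending" follows; the records are invented, and negating the price is just one simple way to get a descending numeric key.

```python
# Hypothetical book records (already extracted from XML for brevity).
books = [
    {"title": "def", "publisher": "Morgan Kaufmann", "price": 50.0},
    {"title": "abc", "publisher": "Addison Wesley", "price": 30.0},
    {"title": "ghi", "publisher": "Morgan Kaufmann", "price": 65.0},
]

# order by $b/publisher, $b/price descending
# First key: publisher (ascending). Ties are broken by the second key:
# price, negated so that higher prices sort first.
ordered = sorted(books, key=lambda b: (b["publisher"], -b["price"]))
ordered_titles = [b["title"] for b in ordered]
```

The two Morgan Kaufmann books tie on the first key, so the second key decides their order.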

62 Elimination of Duplicates
The built-in function distinct-values eliminates duplicates from a sequence of result items. In principle, it applies only to primitive (atomic) types. It can also be applied to elements, but then it removes their tags and returns their content as quoted strings.
Example: if return $b/title produces aaa bbb aaa, then applying distinct-values to $b/title produces "aaa" "bbb".
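
The two steps involved, turning elements into their atomic content and then dropping duplicates, can be sketched in Python as follows. The wrapper element r is hypothetical. This version keeps the first occurrence of each value; XQuery itself does not promise a particular order for the result of distinct-values.

```python
import xml.etree.ElementTree as ET

titles = ET.fromstring("<r><title>aaa</title><title>bbb</title><title>aaa</title></r>")

# Step 1: atomize, i.e. replace each element by its text content (tags are lost).
values = [t.text for t in titles.findall("title")]

# Step 2: drop duplicates; dict.fromkeys keeps the first occurrence of each key.
distinct = list(dict.fromkeys(values))
```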

63 XQuery
For each author of a book by Morgan Kaufmann, list all books she published:
for $a in distinct-values(doc("bib.xml")/bib/book[publisher = "Morgan Kaufmann"]/author)
return ($a,
        for $t in doc("bib.xml")/bib/book[author = $a]/title
        return $t)
distinct-values = a function that eliminates duplicates
Result: Jones abc def Smith ghi

64 Joins
We can join two or more documents by using one variable for each of the documents. We let a variable range over the elements of the corresponding document, within a for clause. We need to be careful when comparing elements for equality, since their equality is by element identity, not by element content. Typically, we want to compare the element content. The built-in function data(E) returns the content of an element E.
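
The identity-versus-content distinction has a direct counterpart in Python, where two parsed elements are different objects even if they look the same. The sketch below joins two hypothetical documents on the text content of their title elements (the analogue of comparing data(E) values); the document shapes are assumptions.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents to be joined on the book title.
books = ET.fromstring(
    "<bib><book><title>abc</title></book><book><title>def</title></book></bib>")
reviews = ET.fromstring(
    "<reviews><entry><title>abc</title><rating>5</rating></entry></reviews>")

# One variable per document, as in nested for clauses; the join condition
# compares element content, not the element objects themselves.
joined = [
    (b.findtext("title"), r.findtext("rating"))
    for b in books.findall("book")
    for r in reviews.findall("entry")
    if b.findtext("title") == r.findtext("title")
]

# Identity comparison: the two <title>abc</title> elements are distinct nodes.
same_node = books.find("book/title") is reviews.find("entry/title")
```

Comparing the element objects directly would never find a match across documents, which is why the content is extracted first.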

65 XQuery
Find books whose price is larger than average:
let $a := avg(doc("bib.xml")/bib/book/price)
for $b in doc("bib.xml")/bib/book
where $b/price > $a
return $b

66 Sorting in XQuery
for $p in distinct-values(doc("bib.xml")//publisher)
order by $p
return ($p,
        for $b in doc("bib.xml")//book[publisher = $p]
        order by $b/price descending
        return ($b/title, $b/price))

67 If-Then-Else
for $h in //holding
order by $h/title
return ($h/title,
        if ($h/@type = "Journal") then $h/editor else $h/author)

68 Existential Quantifiers
for $b in //book
where some $p in $b//para satisfies
      contains($p, "sailing") and contains($p, "windsurfing")
return $b/title

69 Quantification
XQuery supports the existential and the universal quantifier.
Universal quantifier: every $v in expression1 satisfies expression2
Existential quantifier: some $v in expression1 satisfies expression2
Expression1 evaluates to a sequence of items; expression2 is a boolean expression.
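
The two quantifiers correspond to Python's built-in any and all. The sketch below replays the "sailing and windsurfing" query on an invented document and adds a universal variant; substring tests with `in` stand in for the contains function.

```python
import xml.etree.ElementTree as ET

# Hypothetical books with paragraphs.
doc = """
<books>
  <book><title>Water Sports</title>
    <para>Notes on sailing and windsurfing.</para>
    <para>Notes on rowing.</para></book>
  <book><title>Boats</title>
    <para>All about sailing.</para>
    <para>More about sailing.</para></book>
</books>
"""
root = ET.fromstring(doc)

def text_of(element):
    """Concatenate all text inside an element (its string value)."""
    return "".join(element.itertext())

# some $p in $b//para satisfies contains($p, "sailing") and contains($p, "windsurfing")
some_titles = [
    b.findtext("title") for b in root.iter("book")
    if any("sailing" in text_of(p) and "windsurfing" in text_of(p)
           for p in b.iter("para"))
]

# every $p in $b//para satisfies contains($p, "sailing")
every_titles = [
    b.findtext("title") for b in root.iter("book")
    if all("sailing" in text_of(p) for p in b.iter("para"))
]
```

Note that the existential query requires both terms in the same paragraph: a book that mentions them only in different paragraphs would not qualify.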

70 Aggregation
XQuery provides built-in functions for the standard aggregations, such as sum, min, count and avg. They can be applied to any XQuery expression, i.e. to any sequence of items.
Example:
avg(doc("bib.xml")/bibliography/book/price)
count(doc("bib.xml")/bibliography/book/price)
These compute the average book price and the number of books (more precisely, of price elements), respectively.

71 XQuery Examples
Find books whose price is larger than the average price. Uses an aggregate operator (avg), applied to the result of a path expression.
let $a := avg(doc("bib.xml")/bibliography/book/price)
for $b in doc("bib.xml")/bibliography/book
where $b/price > $a
return $b
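
The same computation in Python, as a rough analogue: the aggregate is computed once over the whole sequence of prices (the role of let), then each book is compared against it (the role of for and where). The document and its prices are invented for this sketch.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical contents of bib.xml.
doc = """
<bibliography>
  <book><title>abc</title><price>30</price></book>
  <book><title>def</title><price>50</price></book>
  <book><title>ghi</title><price>40</price></book>
</bibliography>
"""
root = ET.fromstring(doc)

# let $a := avg(/bibliography/book/price)
prices = [float(p.text) for p in root.findall("book/price")]
average = mean(prices)
count = len(prices)   # count(/bibliography/book/price)

# for $b in /bibliography/book  where $b/price > $a  return $b
above_average = [
    b.findtext("title") for b in root.findall("book")
    if float(b.findtext("price")) > average
]
```

A book priced exactly at the average (ghi here) is not returned, since the comparison is strict.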

72 XQuery Examples
Find the titles of books with a paragraph containing the terms "sailing" and "windsurfing". Uses an existential quantifier (some) and string matching (contains).
for $b in doc("bib.xml")//book
where some $p in $b//para satisfies
      contains($p, "sailing") and contains($p, "windsurfing")
return $b/title

