Presentation is loading. Please wait.

Presentation is loading. Please wait.

Querying XML Documents and Data CBU Summer School 13.8. - 20.8.2007 (2 ECTS) Prof. Pekka Kilpeläinen Univ of Kuopio, Dept of Computer Science

Similar presentations


Presentation on theme: "Querying XML Documents and Data CBU Summer School 13.8. - 20.8.2007 (2 ECTS) Prof. Pekka Kilpeläinen Univ of Kuopio, Dept of Computer Science"— Presentation transcript:

1 Querying XML Documents and Data CBU Summer School 13.8. - 20.8.2007 (2 ECTS) Prof. Pekka Kilpeläinen Univ of Kuopio, Dept of Computer Science Pekka.Kilpelainen@cs.uku.fi

2 CBU Summerschool '07Querying XML: Introduction2 Introduction & Motivation n XML appears everywhere n How to query it? XML Internet order invoice

3 CBU Summerschool '07Querying XML: Introduction3 Main Topic: Two XML Query Models n Region algebra –for retrieval of structuded text –"lightweight" »reduced language; for ad-hoc files; efficient free implementation n XQuery –for general querying/manipulation of XML –"heavy" »comprehensive and complex language; for (data viewed as) XML only; production-use implementations?

4 CBU Summerschool '07Querying XML: Introduction4 Course Outline Intro and Arrangements; Structured documents 1 Review of XML Basics 1.1 XML and XML docs; 1.2 Document grammars 1.3. XML DTDs; 1.4 XML Namespaces 1.5 XML Schema 2 Region Algebra and sgrep 3 W3C XQuery, and XPath 2.0 (Apologies for potential dis-organization!)

5 CBU Summerschool '07Querying XML: Introduction5 Arrangements  Background:  W3C Recommendations (XML, XQuery)  Reports (Region Algebra, sgrep)  Earlier courses; own research and experiments  Some material (to be posted) at http://www.cs.uku.fi/~kilpelai/CBU07/  Plan: Lectures 12 h; hands-on exercises 8 h

6 CBU Summerschool '07Querying XML: Introduction6 Structured Documents n Document: –a structured representation of information on some medium (  message) –normally for a human reader »memos, manuals, articles, books, … –also application-to-application messages »e.g., btw client and server in Web Services –"prose-oriented XML" vs "data-oriented XML" –can be treated as a single unit »(a web page vs a web site)

7 CBU Summerschool '07Querying XML: Introduction7 Presentation vs Structure n Presentation informs the human reader about the meaning of text and the role of its parts n Markup indicates the presentation or the meaning of different parts of text »originally hand-written annotations for the typesetter –nowadays primarily codes embedded in digital documents; –nowadays primarily codes embedded in digital documents;

8 CBU Summerschool '07Querying XML: Introduction8 Markup and Markup Language n Procedural markup –commands (start boldface, produce empty line, indent 5 mm,...) –proprietary word processor formats, nroff, TeX,... n Descriptive or generic markup –indicates conceptual structures using chosen names –LaTeX: \begin{abstract}... \end{abstract} –HTML:... –HTML:... n Markup language –a fixed set of markup notations (e.g. nroff, TeX, HTML, SVG, …)

9 CBU Summerschool '07Querying XML: Introduction9 Structure in Documents n Hierarchy or nesting is ubiquitous –Sections w. subsections etc –(Also overlapping hierarchies!) n Linear order essential in prose documents –less important in documents representing data objects n Hypertext and cross-references n XML: proper hierarchies, tree-like structures, with cross-references via attribute values

10 CBU Summerschool '07Querying XML: Introduction10 1 Document Instances and Grammars Overview of fundamentals, and some details, of XML 1.1 XML and XML documents 1.2 Basics of document grammars 1.3 Basics of XML DTDs 1.4 XML Namespaces 1.5 XML Schema 1.5 XML Schema

11 CBU Summerschool '07Querying XML: Introduction11 2.1 XML and XML documents n XML - Extensible Markup Language, W3C Recommendation, February 1998 –not an official standard, but a stable industry standard –2 nd Ed 2000, 3 rd Ed 2004, 4 th Ed 2006 »editorial revisions, not new versions of XML 1.0 n a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987 –valid XML documents are also SGML documents

12 CBU Summerschool '07Querying XML: Introduction12 What is XML? n Extensible Markup Language is not a markup language! –does not fix a tag set nor its semantics (like markup languages like HTML do) n XML documents have no inherent (processing or presentation) semantics –even though many think that XML is semantic or self- describing; See next

13 CBU Summerschool '07Querying XML: Introduction13 Semantics of XML Markup n Meaning of this XML fragment? –The application has to “understand” the tags –But better off with the tags, though!

14 CBU Summerschool '07Querying XML: Introduction14 What is XML (2)? n XML is –a way to use markup to represent information –a metalanguage »supports definition of specific markup languages through XML DTDs (Document Type Definitions) or Schemas »E.g. XHTML a reformulation of HTML using XML n Often “XML”  XML + XML technology

15 CBU Summerschool '07Querying XML: Introduction15 How does it look? Pekka Kilpeläinen Pekka Kilpeläinen kilpelai@cs.uku.fi kilpelai@cs.uku.fi </client> XML Handbook XSLT Programmer’s Ref XML Handbook XSLT Programmer’s Ref </invoice>

16 CBU Summerschool '07Querying XML: Introduction16 Essential Features of XML n Overview of XML essentials –many details skipped –Learn to consult original sources (specifications, documentation etc) for details! »The XML specification is easy to browse n First of all, XML is a textual or character-based way to represent data

17 CBU Summerschool '07Querying XML: Introduction17 XML Document Characters n XML documents are made of ISO-10646 (32-bit) characters; in practice of their 16-bit Unicode subset (used, e.g., in Java) –Unicode 2.0 defines almost 39,000 distinct characters n Characters have three different aspects: –their identification as numeric code points –their representation by bytes –their visual presentation

18 CBU Summerschool '07Querying XML: Introduction18 External Aspects of Characters n Documents are stored/transmitted as a sequence of bytes (of 8 bits). An encoding determines how characters are represented by bytes. –UTF-8 (  7-bit ASCII) is the XML default encoding –encoding="KOI8R" should be OK for Cyrillic texts »(I cannot comment on parser support) n A font determines the visual presentation of characters

19 CBU Summerschool '07Querying XML: Introduction19 XML Encoding of Structure 1 n XML is, essentially, a textual encoding scheme of labelled, ordered and attributed trees: –internal nodes are elements labelled by type names –leaves are text nodes labelled by string values, or empty element nodes –the left-to-right order of children of a node matters –element nodes may carry attributes = (name, string-value) pairs n This view is shared by many XML techniques (DOM, XPath, XSLT, XQuery,...)

20 CBU Summerschool '07Querying XML: Introduction20 XML Encoding of Structure 2 n XML encoding of a tree –corresponds to a pre-order walk –start of an element node with type name A denoted by a start tag, and its end denoted by end tag –start of an element node with type name A denoted by a start tag, and its end denoted by end tag –possible attributes written within the start tag: –possible attributes written within the start tag: »Names attr 1,…,attr n must be distinct – text nodes written as their string value

21 CBU Summerschool '07Querying XML: Introduction21 XML Encoding of Structure: Example <S> S E <W> <W> </W> Hello world! W Hello W world! </W> </S> A=1

22 CBU Summerschool '07Querying XML: Introduction22 XML: Logical Document Structure n Elements –indicated by matching (case-sensitive!) tags … –indicated by matching (case-sensitive!) tags … –can contain text and/or subelements –can be empty: or (e.g. in XHTML) –unique root element  document a single tree

23 CBU Summerschool '07Querying XML: Introduction23 Logical document structure (2) n Attributes –name-value pairs attached to elements –in start-tag after the element type name … … –forms "... " and '... ' are interchangeable n Also: – –

24 CBU Summerschool '07Querying XML: Introduction24 CDATA Sections n “CDATA Sections” to include XML markup characters as textual content <![CDATA[ Here we can easily include markup characters and, for example, code fragments: if (Count 0) if (Count 0) ]]>

25 CBU Summerschool '07Querying XML: Introduction25 Two levels of correctness (1) n Well-formed documents – roughly: follows the syntax of XML, markup correct (elements properly nested, tag names match, attributes of an element have unique names,...) –violation is a fatal error n Valid documents –(in addition to being well-formed) obey an associated grammar (DTD/Schema)

26 CBU Summerschool '07Querying XML: Introduction26 XML docs and valid XML docs XML docs and valid XML docs XML documents = well-formed XML documents DTD-valid documents Schema-valid documents

27 CBU Summerschool '07Querying XML: Introduction27 An XML Processor (Parser) n Reads XML documents and reports their contents to an application –relieves the application from details of markup –XML Recommendation specifies: –recognition of characters as markup or data; what information to pass to applications; how to check the correctness of documents; –validation based on comparing document against its grammar Next: Basics of document grammars Next: Basics of document grammars

28 CBU Summerschool '07Querying XML: Introduction28 1.2 Basics of document grammars n DTDs are variations of context-free grammars (CFGs), which are widely used to syntax specification (programming languages, XML, …) and to parser/compiler generation (e.g. YACC/GNU Bison) –No knowledge of them is necessary, but connections with CFGs may be informative for those that know about them

29 CBU Summerschool '07Querying XML: Introduction29 DTD/CFG Correspondence DTD-------------------------------- XML document element type element type declaration #PCDATA CFG ------------------------ parse/syntax tree nonterminal production terminal

30 CBU Summerschool '07Querying XML: Introduction30 Example: Three Authors of a Ref Ref AuthorAuthorAuthorTitle... PublData AhoHopcroftUllman The Design and Analysis... Ref  Author* Title PublData  P, Author Author Author Title PublData  L(Author* Title PublData)

31 CBU Summerschool '07Querying XML: Introduction31 Extended Productions n Notice the regular expressions in productions –to describe (potentially infinite) sequences n That is, we are using extended CFGs –content models (of a DTD) correspond to regular expressions (in an ECFG production) –> number of element’s children generally unlimited

32 CBU Summerschool '07Querying XML: Introduction32 1.3 Basics of XML DTDs n A Document Type Declaration provides a grammar (document type definition, DTD) for a class of documents [Defined in XML Rec] Syntax (in the prolog of a document instance): [ ] > Syntax (in the prolog of a document instance): [ ] > n DTD = union of the external and internal subset –internal has preference for attribute and entity decls

33 CBU Summerschool '07Querying XML: Introduction33 Markup Declarations n DTD consists of markup declarations –element type declarations »≈ productions of ECFGs –attribute-list declarations »for declared element types –entity declarations »for physical structures –notation declarations logical structures

34 CBU Summerschool '07Querying XML: Introduction34 How do Declarations Look Like? <!ATTLIST item price NMTOKEN #REQUIRED unit (FIM | EUR) ”EUR” >

35 CBU Summerschool '07Querying XML: Introduction35 Element Type Declarations General form: where E is a content model General form: where E is a content model  regular expression of element names n Content model operators: E | F : choiceE, F: concatenation E? : optionalE* : zero or more E+ : one or more(E) : grouping n Must group: (A,B)|C or A,(B|C), but A,B|C forbidden

36 CBU Summerschool '07Querying XML: Introduction36 Attribute-List Declarations n Can declare attributes for elements: –Name, data type and possible default value Example: Example: n Semantics mainly up to the application –processor checks that ID attributes are unique and that targets of IDREF attributes exist

37 CBU Summerschool '07Querying XML: Introduction37 Mixed, Empty and Arbitrary Content Mixed content: Mixed content: –may contain text and elements Empty content: Empty content: Unrestricted content: ANY (= (#PCDATA | choice-of-all-declared-element-types )* ) Unrestricted content: ANY (= (#PCDATA | choice-of-all-declared-element-types )* )

38 CBU Summerschool '07Querying XML: Introduction38 Entities (1) n Named storage units of XML documents n Multiple uses: –character entities: »< < and < all expand to ‘ < ‘ (treated as data, not as start-of-markup) »other predefined entities: & > &apos; &quote; expand to &, >, ' and " –general entities are shorthand notations: –general entities are shorthand notations:

39 CBU Summerschool '07Querying XML: Introduction39 Entities (2) n physical storage units comprising a document –parsed entities –parsed entities –document entity is the starting point of processing –entities and elements must nest properly: <!DOCTYPE doc [ <!ENTITY chap1 ( … as above …) > ]> <doc>&chap1;</doc> …</sec> …</sec>

40 CBU Summerschool '07Querying XML: Introduction40 Unparsed Entities and Parameter Entities n Unparsed entities allow XML documents refer to external binary objects like graphics files –XML processor handles only text –I've rarely used these n Parameter entities are used in DTDs –useful for modularizing declarations n We skip these

41 CBU Summerschool '07Querying XML: Introduction41 1.4 XML Namespaces n Documents often comprise parts processed by different applications (and/or defined by different grammars) –for example, in XSLT scripts: –for example, in XSLT scripts: </xsl:template> –How to manage multiple sets of names? HTMLelements XSLT elements/ instructions

42 CBU Summerschool '07Querying XML: Introduction42 XML Namespaces (2/5) n Solution: –By introducing (arbitrary) local name prefixes, and binding them to (fixed) globally unique URIs –For example, the local prefix “ xsl: ” conventionally used in XSLT scripts

43 CBU Summerschool '07Querying XML: Introduction43 XML Namespaces briefly (3/5) Namespace identified by a URI (through the associated local prexif) e.g. http://www.w3.org/ 1999/XSL/Transform for XSLT Namespace identified by a URI (through the associated local prexif) e.g. http://www.w3.org/ 1999/XSL/Transform for XSLT –conventional but not required to use URLs –the identifier has to be unique, but no need to be an address n Association inherited to sub-elements –see the next example (of an XSLT script)

44 CBU Summerschool '07Querying XML: Introduction44 XML Namespaces (4/5) </xsl:template></xsl:stylesheet>

45 CBU Summerschool '07Querying XML: Introduction45 XML Namespaces briefly (5/5) n Mechanism built on top of basic XML –overloads attribute syntax ( xmlns: ) to introduce namespaces –does not affect validation »namespace attributes have to be declared for DTD- validity »all element type names have to be declared (with their initial prefixes!) –> Other schema languages (XML Schema, Relax NG) better for validating documents with Namespaces

46 CBU Summerschool '07Querying XML: Introduction46 1.5 XML Schemas n A quick look at XML Schema –W3C Recommendation, 1 st Ed. May, 2001; 2 nd Ed. Oct, 2004: »XML Schema Part 0: Primer (readable non- normative introduction; Recommended) »XML Schema Part 1: Structures »XML Schema Part 2: Datatypes –W3C Draft (didn't lead anywhere?): »Formal Description, 9/2001

47 CBU Summerschool '07Querying XML: Introduction47 Advantages of XML Schema (1) n XML syntax –easier to manipulate by programs (than DTDs) n Compatibility with namespaces –can validate against declarations from multiple sources n Content datatypes –44 built-in datatypes (including primitive Java datatypes, datatypes of SQL, and XML attribute types) –mechanisms to derive user-defined datatypes –used as types of XQuery

48 CBU Summerschool '07Querying XML: Introduction48 XSDL built-in types (Part 2, Chap. 3) NB: all simple values in documents strings *  CDATA * * * * * * * * *: XML attribute types

49 CBU Summerschool '07Querying XML: Introduction49 Advantages of XML Schema (2) n Element names and content types independent; Compare with –For example, could define titles »of people as “Mr.”/”Mrs.”/”Ms.”, and »of chapters as strings –> extend the power of CFGs/DTDs »where non-terminal / tag-name alone determines its allowed content –(Is this relevant in practice?)

50 CBU Summerschool '07Querying XML: Introduction50 Advantages of XML Schema (3) n Ability to specify uniqueness and keys within selected parts of the document –for example, that title s of chapters should be unique; or key attributes of relations –uses XPath n Support for schema documentation –element annotation with sub-elements documentation (for human readers) and appInfo (for applications) –Only these contain text (#PCDATA)

51 CBU Summerschool '07Querying XML: Introduction51 Disadvantages of XML Schema n Complexity (esp. Rec Part 1!) vs. added power –> a long learning curve –> slow adoption by users n Immaturity of implementations (?) –W3C web site mentions ~ 60 tools/processors –Apache Xerces claims full XSDL support –Some features difficult to implement efficiently n Alternative schema languages have been suggested, too –Relax NG –Schematron –...

52 CBU Summerschool '07Querying XML: Introduction52 XSDL through Example – Next: walk-through of an XML schema example –from Chapter 2 of the XML Schema Primer –Consider modelling purchase orders like below: Alice Smith 123 Maple Street Mill Valley CA 90952 –Consider modelling purchase orders like below: Alice Smith 123 Maple Street Mill Valley CA 90952

53 CBU Summerschool '07Querying XML: Introduction53 purchaseOrder instance continues Robert Smith 8 Oak Avenue Old Town PA 95819 Hurry, my lawn is wild! Robert Smith 8 Oak Avenue Old Town PA 95819 Hurry, my lawn is wild! Lawnmower Lawnmower 1 1 148.95 148.95 Only if electric Only if electric

54 CBU Summerschool '07Querying XML: Introduction54 End of the example instance Baby Phone Baby Phone 1 1 39.98 39.98 1999-05-21 1999-05-21 </purchaseOrder> n Next: A schema for such purchase orders

55 CBU Summerschool '07Querying XML: Introduction55 The Purchase Order Schema (1/5)

56 CBU Summerschool '07Querying XML: Introduction56 The Purchase Order Schema (2/5)

57 CBU Summerschool '07Querying XML: Introduction57 The Purchase Order Schema (3/5) anonymous type for item anon. type for quantity

58 CBU Summerschool '07Querying XML: Introduction58 The Purchase Order Schema (4/5)

59 CBU Summerschool '07Querying XML: Introduction59 The Purchase Order Schema (5/5) </xs:schema>

60 CBU Summerschool '07Querying XML: Introduction60 XML Schema: Summary n XSDL: an XML-based grammar formalism –W3C Recommendation; Alternative to DTDs »support for namespaces »richer content and attribute datatypes n Well accepted(?) in XML industry –e.g., to describe messages btw clients and servers in Web services; (See, e.g., Web Services Description Language, Vers. 2.0, W3C Draft 3/07) –for typing of XQuery


Download ppt "Querying XML Documents and Data CBU Summer School 13.8. - 20.8.2007 (2 ECTS) Prof. Pekka Kilpeläinen Univ of Kuopio, Dept of Computer Science"

Similar presentations


Ads by Google