CS331: Advanced Database Systems: Semistructured Data Management Norman Paton University of Manchester

npaton@manchester.ac.uk

Semistructured Data Two views of data: Databases: structured, modelled, queried, programmed. Documents: partially structured, authored, read, navigated. Semistructured data management is at the confluence of these two views. XML is the principal data representation notation for semistructured data. XML can be seen as: An extensible markup language for documents. A data model for hierarchical data. A notation for communicating data with its structure. See also: COMP30352 – IR, Hypermedia and the Web

XML Language Space XML (Extensible Markup Language) is just that: a markup language with an extensible collection of tags. XML is associated with many related standards within the W3C (World Wide Web Consortium): http://www.w3.org/. XML Related Standards: XPath: navigation. XQuery: queries. XSLT: transformations. XML Schema: document description. DOM: modelling documents as objects. XML also underpins: Web Services. The Semantic Web.

Markup Markup is the inclusion of symbols with special meaning in a text document. Languages with markup: LaTeX. HTML. RTF. XML.
HTML example:
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    The Silver Pigs Rome
LaTeX example:
    \documentclass{llncs}
    \begin{document}
    \title{An Experimental Performance Evaluation of Join Algorithms for Parallel Object Databases}
    \author{Sandra Sampaio\inst{1} \and Jim Smith\inst{2} \and Norman W. Paton\inst{1} \and Paul Watson\inst{2}}
    ...

XML Markup In XML, content is structured using tags. Tags are distinguished by the characters ‘<’ and ‘>’. Tags often come in pairs, round some content, as start and end tags. [Figure: the Silver Pigs / Rome example with its start tag and end tag labelled.]

Elements An element is a meaningful unit of content enclosed by tags. An application may be able to interpret an element. Elements may be ordered or nested. Context matters, especially given nesting. Oxford Road Manchester

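Element Hierarchies An XML document essentially represents hierarchical data. The elements in a well-formed XML document will match and respect the hierarchy. [Figure: the country document drawn as a tree. The root country has children name and people; people has children pop, maleLE and femaleLE. Values shown: United Kingdom, 60094648, 75.74, 80.7.]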
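To make "match and respect the hierarchy" concrete, the sketch below asks Python's standard parser whether two documents are well formed. The element names follow the country example; both documents are hypothetical, and the helper name is_well_formed is an illustration rather than a library function.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; element names follow the country example.
well_formed = ("<country><name>United Kingdom</name>"
               "<people><pop>60094648</pop></people></country>")
# Here <people> opens inside <name> but closes outside it, so the tags
# overlap instead of nesting.
not_well_formed = ("<country><name>United Kingdom<people></name>"
                   "</people></country>")

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

first = is_well_formed(well_formed)
second = is_well_formed(not_well_formed)
print(first, second)
```

A well-formedness check says nothing about which tags are allowed; that is the job of a DTD or an XML Schema.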

Attributes Attributes provide auxiliary information about elements. Attributes are embedded within start tags, and have the form "name = value".
    <station updatedBy = "Fred Bloggs" validUntil = "22/06/2005"> Oxford Road Manchester
    <country source="http://www.cia.gov"> United Kingdom 60094648 75.74 80.7
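As a small illustration of the difference between attributes and sub-elements, the sketch below parses a station element. The attribute names and values come from the slide; the name and city sub-elements are assumptions, since the original tags are not shown.

```python
import xml.etree.ElementTree as ET

# Attribute names and values are from the slide; the <name> and <city>
# sub-elements are hypothetical.
doc = '''<station updatedBy="Fred Bloggs" validUntil="22/06/2005">
  <name>Oxford Road</name>
  <city>Manchester</city>
</station>'''

station = ET.fromstring(doc)
attributes = dict(station.attrib)            # attributes live on the start tag
children = [child.tag for child in station]  # sub-elements keep document order
print(attributes)
print(children)
```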

Models An XML file may be able to contain any old tags, in any order or combination (while remaining well-formed). Restrictions on the legal tags and values a document can contain may be specified using a DTD or an XML Schema. A DTD (Document Type Definition) provides a concise syntax for modelling documents (but is on the way out). An XML Schema definition is itself an XML document, which provides a wide range of modelling constructs for constraining other XML documents.

Trains in XML Hierarchical models can capture most cycle-free data fairly naturally. Hierarchical models, however, promote some concepts and demote others. The relational model, by contrast, treats all concepts as (broadly) equal. Example: train 3107101 from Edinburgh to London, visiting Edinburgh at 06:00, York at 08:00 and London at 10:00.

Tools XML is widely used, and many software systems can read/write XML formats. Generic tools have also been developed for designing/editing XML. XML Spy showing a valid XML file as text.

XML Spy XML Spy supports: Editing. Data modelling. Validation. Transformation. … XML Spy showing an XML schema document as a tree.

Oxygen Oxygen supports: Editing. Data modelling. Validation. Transformation. Querying. … Oxygen showing an XML schema document as a Tree and text.

XML Databases Native XML Databases: Store XML in the database directly (“native”). Make XML Schema the optional schema definition language. Query the database using XML query languages (XPath/XQuery). Program database data as XML data structures (XML:DB, DOM,...). An XPath query and result in eXist

XML Databases Native XML databases: Tamino - Software AG: http://www.softwareag.com/tamino/. eXist - Open Source: http://exist.sourceforge.net/. Standard APIs: XML:DB Initiative: http://www.xmldb.org/; both Tamino and eXist provide XML:DB APIs. XQJ: XQuery API for Java; Java community standard.

XML and Relational Databases Storage options: Decomposed: store an XML document in relational tables, and reconstruct on retrieval. Composed: store an XML document as an attribute of a relational table. Retrieval options: Represent relational tables as XML (e.g. Java WebRowSet). Relational vendors: Tend to support both composed and decomposed storage models. Provide APIs that accommodate XML for data transport or display (e.g., in Web Services or for Web interface generation).
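The two storage options can be sketched with an in-memory SQLite database standing in for a relational system. The table and column names are hypothetical, and the document is a cut-down train with no visits; this is an illustration of the idea, not of any vendor's implementation.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical train document; element names follow the trains example.
doc = ("<train><tno>3107101</tno><source>Edinburgh</source>"
       "<destination>London</destination></train>")

db = sqlite3.connect(":memory:")

# Composed: the whole document is stored as one column value.
db.execute("CREATE TABLE train_doc (tno TEXT PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO train_doc VALUES (?, ?)", ("3107101", doc))

# Decomposed: each element becomes a relational column.
db.execute("CREATE TABLE train (tno TEXT PRIMARY KEY, source TEXT, destination TEXT)")
root = ET.fromstring(doc)
db.execute("INSERT INTO train VALUES (?, ?, ?)",
           (root.findtext("tno"), root.findtext("source"),
            root.findtext("destination")))

# Decomposed storage has to reconstruct the document on retrieval.
tno, source, destination = db.execute(
    "SELECT tno, source, destination FROM train").fetchone()
rebuilt = ET.Element("train")
for tag, value in (("tno", tno), ("source", source), ("destination", destination)):
    ET.SubElement(rebuilt, tag).text = value
rebuilt_text = ET.tostring(rebuilt, encoding="unicode")

# Composed storage simply returns the stored text.
composed_text = db.execute("SELECT body FROM train_doc").fetchone()[0]
print(rebuilt_text == composed_text)
```

The decomposed form can be queried with ordinary SQL but pays for reconstruction; the composed form returns the document unchanged but hides its structure from the relational engine.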

Summary XML is becoming increasingly ubiquitous for data representation for: files, transport, storage, metadata. Data management systems must support storage, querying and communication using XML. Soon everything will be stored using XML? Don’t believe it!

Further Reading S. Abiteboul, P. Buneman, D. Suciu, Data on the Web, Morgan Kaufmann, 1999. N. Bradley, The XML Companion (3rd Edition), Addison-Wesley, 2002.

Data Modelling in XML

XML Schema XML Schema is a W3C standard for modelling using XML. An XML Schema definition is itself an XML document; there is an XML Schema for XML Schema! XML Schema files have a .xsd suffix; XML data files have a .xml suffix. An XML Schema can specify: Which elements are mandatory/optional. Which attributes are mandatory/optional. Element/attribute types. Cardinalities. Relative ordering.

Role of XML Schema Unlike in relational/object databases: An XML database need not have a schema. An XML schema may not be very prescriptive in terms of what can or cannot be stored.

Train Model [Figure: schema diagram of the train model, with one part annotated as a sequence and another as a recurring sequence.]

Train ComplexType </xs:complexType

Elements Elements are defined using xs:element. Attributes associated with elements: type: specifies the kind of content that an element with no attributes or sub-elements can have. Default imposes no constraints. minOccurs, maxOccurs: the number of times an element can occur. A value of unbounded for maxOccurs allows open-ended cardinality. Default is once and only once.
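Because a schema is itself an XML document, its element declarations can be read with an ordinary XML parser. The sketch below does that for a hypothetical TrainType fragment (the element names follow the trains example, and VisitType is assumed to be declared elsewhere). It only reads the declarations; it does not validate anything.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment; VisitType is assumed to be defined elsewhere.
schema = f'''<xs:schema xmlns:xs="{XS}">
  <xs:complexType name="TrainType">
    <xs:sequence>
      <xs:element name="tno" type="xs:string"/>
      <xs:element name="source" type="xs:string"/>
      <xs:element name="destination" type="xs:string"/>
      <xs:element name="visit" type="VisitType" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>'''

root = ET.fromstring(schema)
cardinalities = {}
for decl in root.iter(f"{{{XS}}}element"):
    # minOccurs and maxOccurs both default to 1 when omitted.
    cardinalities[decl.get("name")] = (decl.get("minOccurs", "1"),
                                       decl.get("maxOccurs", "1"))
print(cardinalities)
```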

Built-in Types There are many built-in types: string. integer, positiveInteger, negativeInteger. short, long. date, dateTime, time. ID, IDREF. anyURI.

Complex Elements Any element with sub-elements or attributes is declared to have a complex type. Sequence: the sub-elements must appear in the given order. Choice: a selection is made from the sub-elements. Both sequence and choice can have minOccurs/maxOccurs.

Attributes Attributes can be defined within complex types. Attributes are optional unless use="required". The resulting data file can populate the attribute.... <train... xsi:type="TrainType" engine="125"> 3107101 Edinburgh London...

Building on Existing Types New types can be constructed from existing types by: Extension: for complex types, this means that new types can be defined that add attributes or elements to the type on which they are based. Restriction: for complex types, this means that new types can be defined with fewer attributes or elements, reduced cardinalities, etc.

Type Extensions Example

Type Extensions in Use London main York district London models

Cross References Hierarchical models do not naturally support shared components. In documents, cross-references and hyperlinks are very common. XML has several cross-referencing schemes (e.g., ID/IDREF, XPointer). Within an XML document: A value of type ID must be unique. A value of type IDREF must match some ID within the document.
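The two ID/IDREF rules can be checked by hand, as in the sketch below. A validating parser would normally enforce them from the schema; Python's ElementTree does not validate, so the checks are written out. The attribute names id and ref, and the document itself, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the id and ref attribute names are assumptions.
doc = '''<network>
  <station id="s1"><name>Edinburgh</name></station>
  <station id="s2"><name>York</name></station>
  <train><tno>3107101</tno>
    <visit ref="s1"/><visit ref="s2"/><visit ref="s9"/>
  </train>
</network>'''

root = ET.fromstring(doc)
ids = [s.get("id") for s in root.iter("station")]
refs = [v.get("ref") for v in root.iter("visit")]

# Rule 1: a value of type ID must be unique within the document.
ids_unique = len(ids) == len(set(ids))
# Rule 2: a value of type IDREF must match some ID within the document.
dangling = [r for r in refs if r not in set(ids)]
print(ids_unique, dangling)
```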

ID and IDREF for Trains [Figure: train document in which an ID is used to identify each station and an IDREF is used to reference a station.]

XML Schema for ID/IDREF...

Example: What is the schema? John Smith Ola Nordmann Langgt 23 4000 Stavanger Norway Empire Burlesque 1 10.90 …

Summary XML data can be parsed, transmitted and queried in the absence of any formal description of its structure. Many applications need to be able to make assumptions about the structure of documents they process. XML Schema provides a wide range of modelling facilities for defining XML documents.

Further Reading The W3Schools tutorial is short but informative: http://www.w3schools.com/schema/ D. Fallside, XML Schema Part 0: Primer, 2001: http://www.w3.org/TR/xmlschema-0/ N. Bradley, The XML Companion (3rd Edition), Addison-Wesley, 2002.

Querying XML Documents using XPath and XQuery

XPath and XQuery XPath is a W3C standard for addressing parts of an XML document. XPath is widely used in XML languages and tools, including XQuery and XSLT. XPath is not especially expressive. XQuery is a W3C standard for accessing and restructuring XML documents. XQuery is supported by several XML databases, but is less widely deployed than XPath. XPath is used within XQuery.

Trains Model The following diagram shows the XML schema of the data queried in the following example queries.

XPath XPath uses path expressions to describe routes through documents. These expressions address locations in a hierarchy in a way that is familiar from file systems (/books/chapters/chapter1). XPath also includes a function library (shared with XQuery) for manipulating numerical, string, date, node and sequence values. Standardisation: XPath 1.0 has been a W3C standard since 1999; XPath 2.0 became a W3C standard in January 2007.

XPath Terminology Nodes include elements, documents (root elements) and attributes; nodes can be addressed using XPath. XPath includes constructs for exploring relationships between nodes, such as parent, child, ancestor and descendant.

XPath Syntax
Expression   Description
nodename     Selects child nodes of the current node.
/            Selects the root node.
//           Selects any node in the document.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.

Simple Paths - 1 What are the numbers of the trains? /trains/train/tno //tno 22403101 22407101 22403102 22446301 …

Simple Paths - 2 Where do trains start? /trains/train/source aberdeen …

Duplicates Where do trains start (no duplicates)? (or tags) distinct-values(/trains/train/source) aberdeen banchory edinburgh aberdour …

Predicates What are the destinations of trains that start in Aberdeen? /trains/train[source="aberdeen"]/destination edinburgh inverness inverness …
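The simple paths, duplicate elimination and predicate above can be tried with Python's ElementTree, which supports a limited subset of XPath. This is a sketch over hypothetical data in the shape the queries assume. Paths are written relative to the root trains element, and distinct-values is emulated in Python because ElementTree has no function library.

```python
import xml.etree.ElementTree as ET

# Hypothetical data in the shape the queries assume.
doc = '''<trains>
  <train><tno>22403101</tno><source>aberdeen</source><destination>edinburgh</destination></train>
  <train><tno>22407101</tno><source>aberdeen</source><destination>inverness</destination></train>
  <train><tno>46303101</tno><source>edinburgh</source><destination>london</destination></train>
</trains>'''

root = ET.fromstring(doc)   # root is the <trains> element

# /trains/train/tno, written relative to the root element
tnos = [e.text for e in root.findall("./train/tno")]

# distinct-values(/trains/train/source), keeping first-seen order
sources = list(dict.fromkeys(e.text for e in root.findall("./train/source")))

# /trains/train[source="aberdeen"]/destination/text()
destinations = [e.text for e in
                root.findall("./train[source='aberdeen']/destination")]

print(tnos, sources, destinations)
```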

Extracting Node Values What are the destinations of trains that start in Aberdeen (no tags)? /trains/train[source="aberdeen"] /destination/text() edinburgh inverness …

Aggregates How many trains start from aberdeen? count(/trains/train[source="aberdeen"]) 9

Parents What are the sources of trains that visit York? /trains/train[visit/name="york"]/source /trains/train/visit[name="york"]/../source edinburgh london …

Positions in Sequences Where does the last train (in the document) from aberdeen visit? /trains/train[source="aberdeen"][last()]/visit banchory 18:20:00 …
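The aggregate, parent and position queries can be sketched the same way. ElementTree's XPath subset is narrower than XPath proper: a predicate cannot contain a path such as visit/name, and last() is evaluated against all same-named siblings rather than the filtered sequence, so those two queries are partly written in Python. The data is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; station names and times are illustrative.
doc = '''<trains>
  <train><tno>3107101</tno><source>edinburgh</source>
    <visit><name>york</name><time>08:00:00</time></visit>
    <visit><name>london</name><time>10:00:00</time></visit></train>
  <train><tno>22403101</tno><source>aberdeen</source>
    <visit><name>stonehaven</name><time>09:10:00</time></visit></train>
  <train><tno>22446301</tno><source>aberdeen</source>
    <visit><name>banchory</name><time>18:20:00</time></visit></train>
</trains>'''
root = ET.fromstring(doc)

# count(/trains/train[source="aberdeen"])
from_aberdeen = root.findall("./train[source='aberdeen']")
aberdeen_count = len(from_aberdeen)

# /trains/train/visit[name="york"]/../source : step down, filter, step back up
via_parent = [e.text for e in
              root.findall("./train/visit[name='york']/../source")]

# /trains/train[visit/name="york"]/source : the nested path in the predicate
# is tested in Python instead
via_predicate = [t.findtext("source") for t in root.findall("train")
                 if any(name.text == "york" for name in t.findall("visit/name"))]

# /trains/train[source="aberdeen"][last()]/visit : filter first, then take
# the final item of the filtered sequence
last_visits = [v.findtext("name") for v in from_aberdeen[-1].findall("visit")]

print(aberdeen_count, via_parent, via_predicate, last_visits)
```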

XQuery XQuery is intended to play the same sort of role for XML data as SQL plays for relational data. XQuery is a functional language, and thus is declarative and compositional. XQuery can be used to restructure XML data, as well as to ask questions about it. XQuery became a W3C standard in January 2007.

XQuery Syntax XQuery is based on FLWOR (pronounced “flower”) expressions, named after the five clauses for, let, where, order by and return: for clauses bind variables to values in streams of tuples; let clauses bind variables to the complete result of an expression; where clauses filter values in tuple streams; order by clauses sort tuple streams; and return clauses construct results.

Iteration and Filtering What are the destinations of trains that start in Aberdeen? for $i in /trains/train where $i/source = "aberdeen" return $i/destination edinburgh edinburgh …

Identifying Documents What are the destinations of trains that start in Aberdeen? for $i in doc("/db/trains/Trains.xml")/trains/train where $i/source = "aberdeen" return $i/destination banchory 18:20:00 …

Constructing Results How many visits are made by each train? for $t in /trains/train let $v := $t/visit return {$t/tno} {count($v)} 22403101 2 …

Ordering Results How many visits are made by each train, ordered by the number of visits? for $t in /trains/train let $c := count($t/visit) order by $c descending return {$t/tno} {$c} 46303101 7 …
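For readers more used to imperative code, the sketch below mirrors the clauses of the ordering query in Python, one clause per commented line. It is an analogy rather than an XQuery implementation; the data and the minimum parameter (which plays the part of a where clause) are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical data with 2, 1 and 3 visits respectively.
doc = '''<trains>
  <train><tno>22403101</tno>
    <visit><name>stonehaven</name></visit><visit><name>edinburgh</name></visit></train>
  <train><tno>22407101</tno>
    <visit><name>inverness</name></visit></train>
  <train><tno>46303101</tno>
    <visit><name>york</name></visit><visit><name>london</name></visit>
    <visit><name>edinburgh</name></visit></train>
</trains>'''
root = ET.fromstring(doc)

def visit_counts(trains, minimum=0):
    """Return (tno, number of visits) pairs, most visits first."""
    rows = []
    for t in trains.findall("train"):            # for $t in /trains/train
        c = len(t.findall("visit"))              # let $c := count($t/visit)
        if c >= minimum:                         # where $c >= minimum (hypothetical)
            rows.append((t.findtext("tno"), c))  # return the constructed result
    rows.sort(key=lambda row: row[1], reverse=True)  # order by $c descending
    return rows

all_rows = visit_counts(root)
busy_rows = visit_counts(root, minimum=2)
print(all_rows)
```

In XQuery the order by clause is applied before return; the Python version sorts afterwards, which gives the same result here because each result row carries its own sort key.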

Quantifiers Which trains visit York? for $t in /trains/train where some $v in $t//visit satisfies $v/name = "york" return $t/tno 3107101 46307101 22407101 …

Implicit Quantification Which trains visit York? for $t in /trains/train where $t//visit/name = "york" return $t/tno 3107101 46307101 22407101 …

Reorganisation Reorganise the document to nest the trains inside the stations. for $s in distinct-values(//visit/name) return {$s} {for $t in /trains/train, $v in $t//visit where $v/name = $s return {($t/tno, $v/time)} } </station edinburgh 22403101 09:50:00 …
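The same restructuring can be sketched in Python: iterate over the distinct station names, and for each one collect the trains that call there. The wrapper element stations and the nested element name train are assumptions, as is the sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; times are illustrative.
doc = '''<trains>
  <train><tno>22403101</tno>
    <visit><name>edinburgh</name><time>09:50:00</time></visit>
    <visit><name>york</name><time>12:00:00</time></visit></train>
  <train><tno>3107101</tno>
    <visit><name>edinburgh</name><time>06:00:00</time></visit></train>
</trains>'''
root = ET.fromstring(doc)

stations = ET.Element("stations")   # hypothetical wrapper element
# distinct-values(//visit/name), in first-seen order
names = list(dict.fromkeys(n.text for n in root.findall(".//visit/name")))
for s in names:
    station = ET.SubElement(stations, "station")
    ET.SubElement(station, "name").text = s
    for t in root.findall("train"):
        for v in t.findall("visit"):
            if v.findtext("name") == s:
                call = ET.SubElement(station, "train")  # hypothetical element name
                ET.SubElement(call, "tno").text = t.findtext("tno")
                ET.SubElement(call, "time").text = v.findtext("time")

summary = [(st.findtext("name"), [c.findtext("tno") for c in st.findall("train")])
           for st in stations]
print(summary)
```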

Examples Write XQueries that: Return the number of visits paid to the most frequently visited station. Return the names of the stations that received the most visits.

Further Reading The W3Schools tutorials are short but informative; you should read these: http://www.w3schools.com/xpath/ http://www.w3schools.com/xquery/ There are many books on querying XML documents, e.g.: Howard Katz (ed), XQuery from the Experts, Addison-Wesley, 2004.

XML Databases

Native XML Databases Native XML databases support collections of XML documents. Native XML databases provide lightweight support for management of large document collections. Such databases typically support: XPath and XQuery for querying. XUpdate for manipulating documents. The XML:DB API for programmatic access.

Accessing eXist The top level of an eXist database is a collection; collections in eXist are not typed. Like a file system, a collection can contain other collections or documents. The eXist client supports management of collections, uploading of documents, etc.

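XML:DB XML:DB provides a Java API similar in concept to JDBC.
    import org.xmldb.api.DatabaseManager;
    import org.xmldb.api.base.*;
    import org.xmldb.api.modules.XPathQueryService;

    // Load and register the eXist driver, much as a JDBC driver is registered.
    Class<?> cl = Class.forName("org.exist.xmldb.DatabaseImpl");
    Database database = (Database) cl.newInstance();
    DatabaseManager.registerDatabase(database);
    // Open the root collection of a local eXist server.
    Collection col = DatabaseManager.getCollection(
        "xmldb:exist://localhost:8080/exist/xmlrpc/db");
    // Run the XPath query given as the first command-line argument.
    XPathQueryService service =
        (XPathQueryService) col.getService("XPathQueryService", "1.0");
    ResourceSet result = service.query(args[0]);
    // Print each matching resource.
    ResourceIterator i = result.getIterator();
    while (i.hasMoreResources()) {
        Resource r = i.nextResource();
        System.out.println((String) r.getContent());
    }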

XML In Oracle 10g Oracle was not designed to store XML data. Extensions to support XML occur: In the server, so that Oracle can store XML data using tables or built-in types. Around the server, so that XML data can be transferred to and from Oracle applications.

XML Type Server datatype: Stored in the database. Configurable storage options (CLOBS or by mapping to object relational structures). Accessible through SQL, PL/SQL, Java. Operations to: Create XML Type values. Map to/from XML Type values. Test properties of XML Type values. Query the content of XML Type values using XPath.

Storing XMLType Values XMLType values can be stored as column attributes. Stored either through a mapping onto tables or as Character Large Objects (CLOBs).
    create table franchise (
        company   varchar(40) not null,
        franchise varchar(40) not null,
        trains    SYS.XMLTYPE not null,
        primary key (company, franchise))

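Operations on XMLTypes [Figure: conversions to and from XMLType. createXML builds an XMLType from a String; extract returns an XMLType; getStringVal() returns a String; getNumberVal() returns a Number; SYS_XMLGEN builds an XMLType from a relational construct.]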

Populating XMLTypes Standard table and column modification operations apply to complete values of XMLType: Insert. Delete. Update. The updateXML operation is provided for making changes within XMLType values. insert into franchise values ( 'GNER', 'East Coast Main Line', sys.XMLType.createXML( ' 3107101 Edinburgh London … ') );

XMLType Methods existsNode() returns 0 if the XPath expression does not match the document, else 1: existsNode(xpath IN varchar2) RETURN number. extract() returns the fragment that matches an XPath expression as an XMLType: extract(xpath IN varchar2) RETURN sys.XMLType.
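The behaviour of the two methods can be imitated outside Oracle to see what they return. The sketch below is a Python emulation under stated assumptions, not Oracle's implementation: the function names exists_node and extract are hypothetical, the document is made up, and ElementTree paths are written relative to the root (".//source" rather than "//source").

```python
import xml.etree.ElementTree as ET

def exists_node(document, path):
    """Return 1 if the path matches at least one element, else 0."""
    return 1 if ET.fromstring(document).find(path) is not None else 0

def extract(document, path):
    """Return the matching elements serialised and concatenated, or None."""
    matches = ET.fromstring(document).findall(path)
    return "".join(ET.tostring(m, encoding="unicode") for m in matches) or None

# Hypothetical document with no whitespace between elements.
doc = ("<trains><train><tno>3107101</tno><source>Edinburgh</source></train>"
       "<train><tno>4630710</tno><source>York</source></train></trains>")

has_source = exists_node(doc, ".//source")
has_visit = exists_node(doc, ".//visit")
fragment = extract(doc, ".//source")
print(has_source, has_visit, fragment)
```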

Querying XML Documents Classically, documents can be searched using regular expressions, which know nothing about their structure. XML documents, however, have an explicit structure that can be exploited by query languages. The two main XML query languages are: XPath: supports navigation through a document structure. XQuery: supports SQL-style queries over documents. XPath is currently most widely used, often in conjunction with other languages (XSLT, SQL, …).

XPath An XPath expression matches nodes in a document. An XPath expression may match a document 0, 1 or many times. XPath is a W3C standard. A single step path matches an element name that is a child of the current element, e.g. visit. A multiple step path matches nested elements, e.g. visit/name.

XPath Expressions A wildcard matches indirectly nested elements, e.g. train/*/name. An absolute path makes the root of the document the current element: /trains/train/source. An element anywhere in a document can be matched by ‘//’ at the start of an expression: //visit. //visit/name.

Selecting XML From SQL The function sys.XMLType.getStringVal converts its XMLType parameter into a string. Retrieve details of GNER trains:
    select f.franchise, sys.XMLType.getStringVal(f.trains)
    from franchise f
    where f.company='GNER'
Result: East Coast Main Line...

Extracting Elements The extract() method on XMLType can be used to access individual elements. Retrieve the src and dst of GNER trains: select f.franchise, f.trains.extract('//source').getStringVal() as src, f.trains.extract('//destination').getStringVal() as dst from franchise f where f.company='GNER' East Coast Main Line Edinburgh York London London

Extracting Element Values The text() function can extract element contents. select f.franchise, f.trains.extract('//source/text()').getStringVal() as src, f.trains.extract('//destination/text()').getStringVal() as dst from franchise f where f.company='GNER' East Coast Main Line EdinburghYork LondonLondon

Extracting for Comparison Comparisons can be carried out either in XPath or in SQL.
Comparison in SQL:
    select f.company
    from franchise f
    where f.trains.extract('//visit/name/text()').getStringVal() like '%York%'
Result: GNER
Comparison in XPath:
    select f.company
    from franchise f
    where f.trains.existsnode('//visit[name="York"]') = 1

Testing Before Retrieval To avoid including empty values in results, the existsNode() method can test for documents that match an XPath expression.
    select f.franchise, f.trains.extract('//train/tno').getStringVal() as tno
    from franchise f
    where f.trains.existsnode('//train/tno') = 1
Result: East Coast Main Line 3107101 4630710

XML In Relational Databases XML is increasingly ubiquitous: Data for Web display. Data in domain-specific standards. Data for communication between web services. Relational vendors want: Web pages to be generated from databases. Their storage managers to support all forms of data management. Database applications to be exposed as Web Services. This means full support for XML input, output and storage.

Further Reading Oracle 10g, XML DB Developers Guide [Chapter 3: Using Oracle XML DB]. Full documentation for eXist is available along with the software from: http://exist.sourceforge.net/

Extra Slides

XML Schema for Train <xs:schema targetNamespace="http://my-train.com/namespace" xmlns="http://my-train.com/namespace" xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified"> Root element Element name and Type. Namespace definitions

Namespaces Namespaces support simultaneous use of definitions from multiple files. Elements and attributes from different namespaces are distinguished using the syntax prefix:name, to give a qualified name. The name is the normal element/attribute name. The prefix is a name given to a namespace. A namespace is a URL acting as a unique string.

Train Schema Namespaces The targetNamespace is the namespace of items defined in this file: targetNamespace="http://my-train.com/namespace". The xmlns is the default namespace for unqualified items referenced in this file: xmlns="http://my-train.com/namespace". The names associated with XML Schema will be qualified using the prefix xs: xmlns:xs="http://www.w3.org/2001/XMLSchema".

Train XML Namespaces Default as before, and xsi is defined as prefix for XMLSchemaInstance schema. The schemaLocation attribute associates a xsd file with the default namespace. <train xmlns="http://my-train.com/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://my-train.com/namespace P:\unix\XML\Trains\Train.xsd" xsi:type="TrainType">
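The effect of the default namespace shows up as soon as the document is parsed: every unprefixed element acquires the namespace, so unqualified lookups stop matching. The sketch below uses the namespace URIs from the slides with a hypothetical tno child; the prefix t in the lookup is a local choice and need not match any prefix in the document.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are from the slides; the <tno> child is hypothetical.
doc = '''<train xmlns="http://my-train.com/namespace"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:type="TrainType">
  <tno>3107101</tno>
</train>'''
root = ET.fromstring(doc)

NS = {"t": "http://my-train.com/namespace"}  # the prefix "t" is a local choice

root_tag = root.tag                          # qualified name in {uri}name form
tno = root.findtext("t:tno", namespaces=NS)  # lookup with a qualified name
unqualified = root.find("tno")               # no namespace, so no match
xsi_type = root.get("{http://www.w3.org/2001/XMLSchema-instance}type")

print(root_tag, tno, unqualified, xsi_type)
```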

XQuery Problem Solution
Number of visits paid to the most frequently visited station:
    let $stations := //visit/name
    return max(for $s in distinct-values($stations)
               return count(for $v in $stations where $v = $s return $v))
Names of the stations that received the most visits:
    let $stations := //visit/name
    let $mostvisits := max(for $s in distinct-values($stations)
                           return count(for $v in $stations where $v = $s return $v))
    for $station in distinct-values($stations)
    where count(for $visit in $stations where $visit = $station return $visit) = $mostvisits
    return $station
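The logic of both solutions can be cross-checked in Python, where a Counter plays the role of the nested count expressions. This is a sketch over hypothetical data, not a translation that an XQuery engine would produce.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical data: edinburgh and york are each visited twice, london once.
doc = '''<trains>
  <train><tno>3107101</tno>
    <visit><name>edinburgh</name></visit>
    <visit><name>york</name></visit>
    <visit><name>london</name></visit></train>
  <train><tno>22407101</tno>
    <visit><name>york</name></visit>
    <visit><name>edinburgh</name></visit></train>
</trains>'''
root = ET.fromstring(doc)

# let $stations := //visit/name
counts = Counter(n.text for n in root.findall(".//visit/name"))

# Problem 1: the number of visits paid to the most frequently visited station
most_visits = max(counts.values())

# Problem 2: the names of the stations that received that many visits
busiest = sorted(s for s, c in counts.items() if c == most_visits)

print(most_visits, busiest)
```

As in the XQuery version, ties are kept: every station that reaches the maximum is returned.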
