Web Data Management XML Data Model.

Semi-structured Data Model
A data model, based on graphs, for representing both regular and irregular data. Basic ideas:
- Self-describing data. The content comes with its own description; contrast with the relational model, where schema and content are represented separately.
- Flexible typing. Data may be typed (i.e., “such nodes are integer values” or “this part of the graph complies with this description”); often there is no typing, or a very flexible one.
- Serialized form. The graph representation is associated with a serialized form, convenient for exchanges in a heterogeneous environment.

Self-describing data
Starting point: association lists, i.e., records of label-value pairs: {name: "Alan", tel: }
Natural extension: values may themselves be other structures: {name: {first: "Alan", last: "Black"}, tel: }
Further extension: allow duplicate labels: {name: "Alan", tel: , tel: }

Tree-based representation
Data can be graphically represented as trees: label structure can be captured by tree edges, and values reside at leaves.

Tree-based representation: labels as nodes
Another choice is to represent both labels and values as vertices. The XML data model adopts this latter representation.

Representation of regular data
The syntax makes it easy to describe sets of tuples, so relational data can be represented. For regular data, however, the semi-structured representation is highly redundant, because every tuple repeats its labels. { person: {name: "alan", phone: }, person: {name: "sara", phone: }, person: {name: "fred", phone: } }

Representation of irregular data
Many variations in the structure are possible: missing values, duplicates, changes, etc. Nodes can be identified, and referred to by their identity. Cycles and object models can be described as well. { person: {name: "alan", phone: }, person: &314 { name: {first: "Sara", last: "Green"}, phone: , spouse: &443 }, person: &443 { name: "fred", Phone: , Height: 183, spouse: &314 } }

XML represents Semistructured Data
- Do not care about the type of the data.
- Serialize the data by annotating each data item explicitly with its description (e.g., name, phone, etc.). Such data is called self-describing.
- Serialization: convert data into a byte stream that can be easily transmitted and reconstructed at the receiver.
- Self-describing data wastes space but provides interoperability (required on the Web).
- Semistructured data models: XML, JSON (JavaScript Object Notation).

XML as a Semi-structured Data Explained
Missing attributes: <person> <name>John</name> <phone>1234</phone> </person> <person> <name>Joe</name> </person> (no phone!)
Repeated attributes: <person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone> </person>

XML as a Semistructured Data Explained
Attributes with different types in different objects <person> <name> <first> John </first> <last> Smith </last> </name> complex name ! <phone>1234</phone> </person> Nested collections (no 1NF) Heterogeneous collections: <db> contains both <book>s and <publisher>s

XML in brief
- XML is the World Wide Web Consortium (W3C) standard for Web data exchange.
- XML documents can be serialized in a normalized encoding (typically iso , or utf-8), and safely transmitted on the Internet.
- XML is a generic format, which can be specialized into “dialects” for specific domains (e.g., XHTML).
- The W3C promotes companion standards: DOM (object model), XML Schema (typing), XPath (path expressions), XSLT (restructuring), XQuery (query language), and many others.
- XML is a simplified version of SGML, a long-established language for technical documents.

XML documents An XML document is a labeled, unranked, ordered tree:
Labeled means that some annotation, the label, is attached to each node. Unranked means that there is no a priori bound on the number of children of a node. Ordered means that there is an order between the children of each node. XML specifies nothing more than a syntax: no meaning is attached to the labels. A dialect, on the other hand, associates a meaning to labels (e.g., title in XHTML).

XML documents are trees
[Figure: tree with root person and three row children, each with a name leaf and a phone leaf: “John” 3634, “Sue” 6343, “Dick” 6363]
Serialized representation: <person> <row> <name>John</name> <phone>3634</phone> </row> <row> <name>Sue</name> <phone>6343</phone> </row> <row> <name>Dick</name> <phone>6363</phone> </row> </person>

XML and Semistructured Data: Similarities and Differences
XML: <person id="o123"> <name> Alan </name> <age> 42 </age> < > </ > </person>
Semistructured data: { person: &o123 { name: "Alan", age: 42, } }
XML: <person father="o123"> … </person>
Semistructured data: { person: { father: &o123 … } }
[Figure: tree with root person and children name (Alan), age (42) and father]
Similar on trees, different on graphs.

XML describes structured content
Applications cannot interpret unstructured content: The book “Foundations of Databases”, written by Serge Abiteboul, Rick Hull and Victor Vianu, published in 1995 by Addison-Wesley.
XML provides a means to structure this content: <bibliography> <book> <title> Foundations of Databases </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> <book>...</book> </bibliography>
Now, an application can access the XML tree, extract some parts, rename the labels, reorganize the content into another structure, etc.

Applications associate semantics to XML docs
Letter document: <letter> <header> <author>...</author> <date>...</date> <recipient>...</recipient> <cc>...</cc> </header> <body> <text>...</text> <signature>...</signature> </body> </letter>

Applications associate semantics to XML docs
Letter style sheet: if letter then ... if header then ... if author then ... if date then ... if recipient then ... if cc then ... if body then ... if text then ... if signature then ...
Some software then applies the style sheet and produces the actual letter to send.

Serialized and tree-based forms
The serialized form is a textual, linear representation of the tree; it complies with a (sometimes complicated) syntax. Tree-based forms implement the abstract tree representation in a specific context, e.g., an object-oriented model (the Document Object Model). Typically, an application gets a document in serialized form, parses it into tree form, and serializes it back at the end.
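The parse, modify, serialize cycle described above can be sketched with the XML APIs bundled in the JDK. This is only an illustration: the class name RoundTrip, the helper addYear and the sample document are assumptions made for the example, not part of the slides.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class RoundTrip {
    /** Serialized form -> tree form (DOM). */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Tree form -> serialized form, without the XML declaration. */
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Hypothetical edit on the tree: append a <year> child to the root element. */
    static String addYear(String xml, String year) throws Exception {
        Document doc = parse(xml);
        Element y = doc.createElement("year");
        y.setTextContent(year);
        doc.getDocumentElement().appendChild(y);
        return serialize(doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            addYear("<book><title>Foundations of Databases</title></book>", "1995"));
    }
}
```

The application only touches the tree; the textual syntax is handled entirely by the parser and the serializer.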

Serialized and tree-based forms: text and elements
The basic components of an XML document are elements and text. Here is an element whose content is text: <elt_name> Textual content </elt_name> The tree form of the document, modeled in DOM: each node has a type, here either Element or Text.

Serialized and tree-based forms: nesting elements
Serialized form: The content of an element is the part between the opening and ending tags Tree-based form: the subtree rooted at the corresponding Element node (in DOM) <elt1> Textual content <elt2> Another content </elt2> </elt1>

Serialized and tree-based forms: attributes
Serialized form: Attributes are name/value pairs attached to an element. The content of an attribute is always atomic text (no nesting). Attributes are not ordered, and there cannot be two attributes with the same name in an element. <elt1 att1='12' att2='fr'> Textual content </elt1>
Tree-based form: Attributes are special child nodes of the Element node (in DOM).
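A small sketch of the attribute rules above, using the JDK's DOM parser. The class AttributeDemo and its helper names are assumptions made for this illustration. It reads attributes by name (their order in the serialized form does not matter) and shows that a repeated attribute name makes the document ill-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class AttributeDemo {
    /** Parses the text and returns the root element of the resulting tree. */
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    /** Returns true if the parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) throws Exception {
        try {
            parseRoot(xml);
            return true;
        } catch (SAXException e) {   // raised on well-formedness errors
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        Element a = parseRoot("<elt1 att1='12' att2='fr'>Textual content</elt1>");
        Element b = parseRoot("<elt1 att2='fr' att1='12'>Textual content</elt1>");
        // Attributes are looked up by name, so both orders give the same values.
        System.out.println(a.getAttribute("att1").equals(b.getAttribute("att1")));
        // Two attributes with the same name are rejected by the parser.
        System.out.println(isWellFormed("<elt1 att1='12' att1='13'/>"));
    }
}
```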

Serialized and tree-based forms: the document root
Document content must always be enclosed in a single opening/ending tag pair, called the element root. If there is a prologue, it must be the first line of the serialized form: <?xml version="1.0" encoding="utf-8"?>
A document with its prologue and element root: <?xml version="1.0" encoding="utf-8"?> <elt> Document content. </elt>
In the DOM representation, the document as a whole appears as a Document node, called the root node; the element root is its child.

Web Data Management with XML
Publishing: an XML document can easily be converted to another XML document (same content, but another structure). Web publishing is the process of transforming XML documents to XHTML.
Integration: XML documents from many sources can be transformed into a common dialect, and constitute a collection. Search engines, or portals, provide browsing and searching services on collections of XML documents.
Distributed data processing: many software tools can be adapted to consume/produce XML-represented data. Web services provide remote services for XML data processing.

Web Publishing: restructuring to XHTML
The Web application produces some XML content, structured in some application-dependent dialect, on the server. In a second phase, the XML content is transformed into an XHTML document that can be displayed to humans. The transformation is typically expressed in XSLT, and can be processed either on the server or on the client.

Web publishing: content + presentation instructions
XML content: <bibliography> <book> <title> Foundations of Databases </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> <book>...</book> </bibliography>
Document in XHTML: <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br/> Addison Wesley, 1995 </p> <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br/> Morgan Kaufmann, 1999 </p>
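The restructuring shown above can be sketched as an XSLT 1.0 stylesheet applied with the JDK's javax.xml.transform API. The stylesheet below is an assumption written for this illustration (the slides do not give one), and the output is an XHTML fragment wrapped in a div rather than a complete page.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class BibliographyToXhtml {
    // Hypothetical stylesheet: one <p> per book, title in italics, authors comma-separated.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/bibliography'>"
      + "<div><h1>Bibliography</h1><xsl:apply-templates select='book'/></div>"
      + "</xsl:template>"
      + "<xsl:template match='book'>"
      + "<p><i><xsl:value-of select='title'/></i><xsl:text> </xsl:text>"
      + "<xsl:for-each select='author'>"
      + "<xsl:if test='position() != 1'>, </xsl:if><xsl:value-of select='.'/>"
      + "</xsl:for-each>"
      + "<br/><xsl:value-of select='publisher'/>, <xsl:value-of select='year'/></p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML content and returns the XHTML fragment. */
    static String publish(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String bib = "<bibliography><book><title>Foundations of Databases</title>"
                   + "<author>Abiteboul</author><author>Hull</author><author>Vianu</author>"
                   + "<publisher>Addison Wesley</publisher><year>1995</year></book></bibliography>";
        System.out.println(publish(bib));
    }
}
```

The same XML content could be fed to a different stylesheet to target another output dialect, which is the point of keeping content and presentation separate.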

Web publishing
The same content may be published using different means:
- Web publishing: XML → XHTML
- WAP (Wireless Application Protocol): XML → WML

Web publishing Data obtained from a relational database and from XML files XSLT restructures the XML data Produces XHTML pages for a browser

Web Integration: gluing together heterogeneous sources
The portal receives (possibly continuously) XML-structured content, each source using its own dialect. Each feed provides some content, extracted with XSLT or XQuery, or any convenient XML processing tool (e.g., SAX).

Data integration

Distributed Data Management with XML
XML encoding is used to exchange information between applications. A specific dialect, WSDL, is used to describe Web service interfaces.

Exploiting XML documents

XML dialects Dialects define specialized structures, constraints and vocabularies to construct ad hoc XML contents that can be used and exchanged in a specific application area RSS is an XML dialect for describing content updates that is heavily used for blog entries, news headlines or podcasts. WML (Wireless Mark-up Language) is used in Web sites by wireless applications based on the Wireless Application Protocol (WAP). MathML (Mathematical Mark-up Language) is an XML dialect for describing mathematical notation and capturing both its structure and content.

XML dialects
XLink (XML Linking Language) is an XML dialect for defining hyperlinks between XML documents. These links are expressed in XML and may be introduced inside XML documents.
SVG (Scalable Vector Graphics) is an XML dialect for describing two-dimensional vector graphics, both static and animated. With SVG, images may contain outbound hyperlinks expressed in XLink.

XML dialects – SVG example
<?xml version="1.0" encoding="UTF-8" ?> <svg xmlns="…"> <polygon points="0,0 50,0 25,50" style="stroke:#660000; fill:#cc3333;"/> <text x="20" y="40">Some SVG text</text> </svg>
[Figure: rendering of the polygon with the label “Some SVG text”]

XML standards SAX (Simple API for XML) sees an XML document as a sequence of tokens (its serialization). DOM (Document Object Model) is an object model for representing (HTML and) XML document independently of the programming language. XPath (XML Path Language) that we will study, is a language for addressing portions of an XML document.

XML standards XQuery (that we will study) is a flexible query language for extracting information from collections of XML documents. XSLT (Extensible Stylesheet Language Transformations), that we will study, is a language for specifying the transformation of XML documents into other XML documents. Web services provide interoperability between machines based on Web protocols.

Processing an XML Document with SAX and DOM
A SAX parser transforms an XML document into a flow of events. Examples of events: the start/end of a document, the start/end of an element, a text token, a comment, etc.
Example: load data in XML format into a relational database
1. when document start is received, connect to the database;
2. when a Movie open tag is received, create a new Movie record;
   (a) when a text node is received, assign its content to X;
   (b) when a Title close tag is received, assign X to Movie.Title;
   (c) when a Year close tag is received, assign X to Movie.Year, etc.
3. when a Movie close tag is received, insert the Movie record in the database (and commit the transaction);
4. when document end is received, close the database connection.
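The steps above can be sketched as a SAX handler. To keep the example self-contained, an in-memory list stands in for the relational database (so steps 1 and 4 are omitted); the class MovieLoader, the Movie record type and the sample data are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MovieLoader extends DefaultHandler {
    /** Hypothetical record type standing in for a database row. */
    static class Movie { String title; String year; }

    final List<Movie> table = new ArrayList<>();          // stands in for the database table
    private Movie current;                                // record being built (step 2)
    private final StringBuilder x = new StringBuilder();  // the buffer called X in the steps

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        x.setLength(0);
        if (qName.equals("Movie")) current = new Movie();  // step 2
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        x.append(ch, start, length);                       // step 2(a); text may arrive in chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) return;
        switch (qName) {
            case "Title": current.title = x.toString().trim(); break;  // step 2(b)
            case "Year":  current.year = x.toString().trim();  break;  // step 2(c)
            case "Movie": table.add(current); current = null;  break;  // step 3 ("insert")
            default: break;
        }
    }

    /** Runs the parser over the document and returns the loaded records. */
    static List<Movie> load(String xml) throws Exception {
        MovieLoader handler = new MovieLoader();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.table;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Movies><Movie><Title>Vertigo</Title><Year>1958</Year></Movie></Movies>";
        for (Movie m : load(xml)) System.out.println(m.title + " (" + m.year + ")");
    }
}
```

The handler keeps only the record under construction in memory, which is why this style suits documents that are scanned once.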

Sax SAX is a good choice when the content of a document needs to be examined once SAX handler written in Java It features methods that handle SAX events: opening and closing tags; character data See next slide

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Prints every SAX event it receives. It extends DefaultHandler (which
 * implements ContentHandler with empty methods) so that only the events
 * of interest need to be overridden.
 */
public class SaxHandler extends DefaultHandler {

    /** Handlers for the beginning and end of the document */
    @Override
    public void startDocument() throws SAXException {
        System.out.println("Start the parsing of document");
    }

    @Override
    public void endDocument() throws SAXException {
        System.out.println("End the parsing of document");
    }

    /** Opening tag handler (localName is only set by namespace-aware parsers) */
    @Override
    public void startElement(String nameSpaceURI, String localName,
                             String rawName, Attributes attributes) throws SAXException {
        System.out.println("Opening tag: " + localName);
        // Show the attributes, if any
        if (attributes.getLength() > 0) {
            System.out.println(" Attributes: ");
            for (int index = 0; index < attributes.getLength(); index++) {
                System.out.println(" - " + attributes.getLocalName(index)
                        + " = " + attributes.getValue(index));
            }
        }
    }

    /** Closing tag handler */
    @Override
    public void endElement(String nameSpaceURI, String localName,
                           String rawName) throws SAXException {
        System.out.println("Closing tag : " + localName);
    }

    /** Character data handling: the third argument is a length, not an end index */
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        System.out.println("#PCDATA: " + new String(ch, start, length));
    }
}

DOM (Document Object Model)
A DOM parser transforms an XML document into a tree and offers an object API for that tree. A partial view of the Class hierarchy of DOM

Parsing XML Documents
- A parser checks XML documents for well-formedness or validates them against a schema
- So, what is parsing? In computer terminology: ‘to analyze or separate (input, for example) into more easily processed components’
- XML parsers load XML documents and provide access to their contents in the form of objects
Point 1: Using XML to solve integration and implementation problems is not as easy as defining a data model and creating instance documents. What will you do with these documents after you create them? How will you read them into your application or database for processing? Part of using XML as a solution involves an XML processor, which is the application responsible for processing these documents. One function of this processor, and the focus of this chapter, is parsing documents. The parser is responsible for parsing the XML document and verifying it by checking for well-formedness or by validating it against a schema. If these tasks are performed successfully, the data contained within the document is exposed in a form that makes it available for other manipulations.
Point 2 & 3: Parsing is an essential task for any application that uses language-based data or code as input. XML processors, which rely heavily on parsers, provide a standard mechanism for navigating and manipulating XML documents. If you have an XML document and need to get data out of it, change the data, or modify the XML document structure, you don't need to write code to load the XML file, validate it for specific characters and elements, and process this information accordingly. You can use an XML parser instead, which will load the document and give you access to its contents in the form of objects.

Parsing XML Documents
- A Validating Parser: can use a DTD or schema to verify that a document is properly constructed
- A Non-Validating Parser: only requires the document to be well-formed; many free parsers on the Web are non-validating
- Stream-Based Parsers: read through the document and signal the application every time a new component appears
- Tree-Based Parsers: read the entire document and give the application a tree structure corresponding to the element structure of the document
Point 1: A validating parser can use a DTD or schema to verify that a document is properly constructed according to the rules for the XML application it's an instance of, and it is supposed to complain loudly if the rules aren't followed. A DTD can also specify default values for the attributes of various elements, and a validating parser can fill them in when it encounters elements with no attributes listed. This capability can be important when you're processing XML documents you've received from the outside world.
Point 2: A non-validating parser only requires that the document be well-formed. Because of the design of XML, it's possible to parse well-formed documents without referring to a DTD or XSD schema. Non-validating parsers are simpler, and many of the free parsers available over the Web are non-validating. They are usually adequate for processing XML documents generated within the same organization or documents whose validity constraints are so complex that they can't be expressed by a DTD and need to be verified by application logic instead.
Point 3: A stream-based or event-driven parser makes the components of an XML document known to an application by reading through the document and signaling the application every time a new component appears.
Point 4: A tree-based parser reads the entire document and gives the application a tree structure corresponding to the element structure of the document.
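The difference between validating and non-validating parsing can be sketched with the JDK's DocumentBuilderFactory. The document below is an assumption made for this illustration: it carries a small internal DTD requiring a name and a phone, but the person has no phone, so it is well-formed yet invalid. The class name ValidationDemo is also hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidationDemo {
    // Well-formed, but violates its own DTD: <phone> is missing.
    static final String DOC =
        "<?xml version='1.0'?>"
      + "<!DOCTYPE person ["
      + " <!ELEMENT person (name, phone)>"
      + " <!ELEMENT name (#PCDATA)>"
      + " <!ELEMENT phone (#PCDATA)>"
      + "]>"
      + "<person><name>Joe</name></person>";

    /** Parses the text and returns the validity problems reported by the parser. */
    static List<String> parse(String xml, boolean validating) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating);          // DTD validation on or off
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { problems.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("non-validating: " + parse(DOC, false).size() + " problem(s)");
        System.out.println("validating:     " + parse(DOC, true).size() + " problem(s)");
    }
}
```

Validity errors are reported through the error handler and parsing continues, whereas well-formedness errors are fatal for both kinds of parser.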

Parsing XML Documents DOM API: language & platform independent interfaces for accessing & manipulating info. in XML documents Document checked to see if it’s well formed and valid Parser then converts information into a tree of nodes A tree starts at one root node; in DOM terms, called a document object instance You can modify, delete and create leaves and branches on the tree using interfaces in API Point 1, 2, 3, 4 & 5: We look at tree-based parsing with DOM first. The DOM API defines a minimal set of language and platform-independent interfaces for accessing and manipulating the content and structure of information stored in XML documents. In tree-based parsing with the DOM the document is checked to see if it is well-formed and valid, depending on the type of parser. The parser then converts the document's information into a tree of nodes. The entire document, no matter how simple or complex, is converted into a tree that starts from one root node, which, in DOM terms, is called a document object instance (hence Document Object Model). Once a document object tree is created, access to the elements allows you to modify, delete, and create leaves and branches by using the interfaces in the API.

Parsing XML Documents Titles.xml; a sample XML document
<?xml version="1.0" encoding="UTF-8"?>
<BookList>
  <Book>
    <book_id>BU1111</book_id>
    <title>Cooking with Computers: Surreptitious Balance Sheets</title>
    <type>business</type>
    <pub_id>1389</pub_id>
    <price>11.95</price>
    <advance>5000</advance>
    <royalty>10</royalty>
    <ytd_sales>3876</ytd_sales>
    <notes>Helpful hints on how to use your electronic...</notes>
    <pubdate> T05:00:00</pubdate>
  </Book>
  <Book>
    <book_id>BU7832</book_id>
    <title>Straight Talk About Computers</title>
    <price>19.99</price>
    <ytd_sales>4095</ytd_sales>
    <notes>Annotated analysis of what computers can do for you</notes>
    <pubdate> T05:00:00</pubdate>
  </Book>
</BookList>
Point 1: We will use Titles.xml as the example XML file during our discussion. This file (shown on slide) presents a collection of books based on the sample pubs database that comes with Microsoft SQL Server.

Parsing XML Documents Point 1:
Here is a DOM hierarchy representation of Titles.xml Everything is a node in the Document object tree. These nodes might have child nodes or hold information like its tag name (nodeName) and value (nodeValue). This hierarchical organization of information is similar to a file system, where folders might contain files or other folders, except everything descends from one root folder.

DOM DOM provides interfaces in it’s hierarchy of Node objects
Node: an XML Document object created after a DOM parser reads an XML file Inheritance relationship between important interfaces Point 1: The DOM provides interfaces in its hierarchy of Node objects. The interfaces either have child nodes that contain other nodes or are leaf nodes that do not contain anything after them in the document structure. Some types of child or leaf nodes are Node, Element, and NodeList, all of which are interfaces in the DOM. Point 2: An XML Document object created after a DOM parser reads an XML file often contains a tree-like representation of Node objects instances, while other interfaces are provided to create a more object-oriented environment. You can manipulate all the information in the DOM by using the Node interface. Even though the DOM Recommendation specifically states that it isn't necessarily a tree, for the purposes of our discussion on this topic , we will focus on examples with tree-like representations. Point 3: Figure (on slide) shows the inheritance relationships between some of the important interfaces.

DOM A sample document object tree Point 1:
Because the Document object is a subclass of Node, the root Node object of the tree is also a Document object. Every DOM object must have a root. Figure (on slide) illustrates a sample XML Document object tree and describes some of the Node objects that it contains.

Methods of Node Object
Method: Description
- hasChildNodes(): finds out if a Node has children; takes no parameters and returns a Boolean
- getNodeType(): returns the type of a particular Node. The type is a constant integer used to identify different types of Nodes
- appendChild(): adds a new child object, which is passed to the method, to the end of the current Node's children
- cloneNode(): returns a duplicate of the Node
- hasAttributes(): returns a Boolean true if the Node has any attributes. This method was added in DOM Level 2
- insertBefore(): takes a new child Node and a reference child Node and inserts the new child Node before the reference Node
- isSupported(): tests whether or not this implementation of the DOM supports a specific feature. This method was added in DOM Level 2 and takes a feature and a version number as parameters
- normalize(): puts all text nodes in the full depth of the sub-tree underneath this Node into a normal form, with no adjacent and no empty text nodes
- removeChild(): removes the specified child
- replaceChild(): replaces the specified child with the new child passed
Point 1: Some important methods of the Node object.
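A few of the Node methods listed above can be tried out with the org.w3c.dom interfaces shipped in the JDK. This is a minimal sketch; the class NodeMethodsDemo and the helper buildBook are assumptions made for the illustration, and the element names are borrowed from Titles.xml.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeMethodsDemo {
    /** Builds a Book element with a title and a price child, using only Node methods. */
    static Element buildBook() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element book = doc.createElement("Book");
        doc.appendChild(book);

        Element price = doc.createElement("price");
        book.appendChild(price);                    // appendChild(): added as last child

        Element title = doc.createElement("title");
        book.insertBefore(title, price);            // insertBefore(): title now precedes price

        // Two adjacent Text nodes, as can happen after programmatic edits.
        title.appendChild(doc.createTextNode("Straight Talk "));
        title.appendChild(doc.createTextNode("About Computers"));
        return book;
    }

    public static void main(String[] args) throws Exception {
        Element book = buildBook();
        Node title = book.getFirstChild();
        System.out.println("text nodes before normalize(): " + title.getChildNodes().getLength());
        book.normalize();                           // merges adjacent Text nodes in the subtree
        System.out.println("text nodes after normalize():  " + title.getChildNodes().getLength());

        Node copy = book.cloneNode(true);           // deep copy of the subtree
        book.removeChild(book.getLastChild());      // removes the price element
        System.out.println("original still has children: " + book.hasChildNodes());
        System.out.println("children in the copy:        " + copy.getChildNodes().getLength());
        System.out.println("is element node: " + (book.getNodeType() == Node.ELEMENT_NODE));
    }
}
```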

Methods of Element Object
- getAttribute(): retrieves the value of the specified attribute.
- getAttributeNS(): retrieves the value of the specified attribute by local name and namespace. This method was added in Level 2.
- getAttributeNode(): retrieves an Attr node by name.
- getAttributeNodeNS(): retrieves an Attr node by local name and namespace. This method was added in Level 2.
- getElementsByTagName(): returns a NodeList of all descendant elements with a given tag name, in the order in which they are encountered.
- getElementsByTagNameNS(): returns a NodeList of all descendant elements with a given local name and namespace, in the order in which they are encountered. This method was added in Level 2.
- hasAttribute(): returns a Boolean true if the specified attribute is present, and false otherwise.
- hasAttributeNS(): returns a Boolean true if the specified attribute, by local name and namespace, is present, and false otherwise. This method was added in Level 2.
- removeAttribute(): removes the specified attribute.
- removeAttributeNS(): removes the attribute specified by local name and namespace. This method was added in Level 2.
Point 1: The Element interface, which is a subclass of Node, is another important interface. It can be used to access the elements in a DOM Document object tree, which allows you to read in attributes and their values, as well as change, delete, or add to them. Table (on slide) contains the list of methods of the Element object. The table continues on next slide.

- removeAttributeNode(): removes the specified Attr node.
- setAttribute(): adds a new attribute. If an attribute of the same name exists, its value is changed to the specified value.
- setAttributeNS(): adds a new attribute. If an attribute of the same local name and namespace exists, its value is changed to the specified value. This method was added in Level 2.
- setAttributeNode(): adds a new Attr node. If an Attr node of the same name exists, it is replaced by the new one.
- setAttributeNodeNS(): adds a new Attr node. If an Attr node of the same local name and namespace exists, it is replaced by the new one. This method was added in Level 2.
Point 1: Table continued (on slide) contains the list of methods of the Element object.

DOM Node Interface: NodeList is an iterator for a Nodes list
A DOM NodeList object
NodeList has only a single method, item(); it returns the Node at the indexed position passed to the method. The number of nodes is exposed as the length attribute (getLength() in the Java binding).
Point 1: Some methods of the Node interface allow traversal of a Node tree. The getChildNodes() method is useful for gathering all the nodes directly inside a Node. This method returns all child Nodes, if they exist, in a container for Node objects. NodeList is an ordered collection used to iterate over a list of Nodes.
Point 2: Unlike Node and Element, NodeList has only a single method, item(). This method returns the Node located at the indexed position passed to the method. For instance, if you want to retrieve the first Node, you call item(0).
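The traversal described above can be sketched as follows, on a shortened version of Titles.xml. The class NodeListDemo and the helper childElementNames are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeListDemo {
    /** Collects the tag names of the element children of a node, in document order. */
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);                   // item(0) is the first child
            if (child.getNodeType() == Node.ELEMENT_NODE) {  // skip whitespace text nodes
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<BookList>\n"
          + "  <Book>\n"
          + "    <book_id>BU1111</book_id>\n"
          + "    <title>Cooking with Computers: Surreptitious Balance Sheets</title>\n"
          + "    <price>11.95</price>\n"
          + "  </Book>\n"
          + "</BookList>");
        Node firstBook = doc.getElementsByTagName("Book").item(0);
        System.out.println(childElementNames(firstBook));
    }
}
```

The node type test matters because the whitespace between tags is also returned as Text nodes in the list.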

SAX SAX Parser: stream in documents according to specific events
- A SAX parser doesn't have a default object model
- A SAX parser reads in an XML document and fires events based on the following:
  - open or start of elements
  - closing or end of elements
  - #PCDATA and CDATA sections
  - processing instructions, comments & entity declarations
- 3 steps to using SAX in your applications:
  - creating a custom object model, like a Book class
  - creating a SAX parser
  - creating a document handler to turn your document into instances of your custom object model
Point 1: One of the major disadvantages of the DOM is how it processes large files. Because the DOM requires the entire file to be read in by the parser, memory can constrain the performance of your applications, if not render them useless. SAX parsers, which are stream-based, solve this problem by streaming in the document according to specific events. In this section we cover the behavior of a SAX parser and how to use one.
Point 2: Unlike the DOM, which creates a tree-based representation, SAX doesn't have a default object model. When you use a SAX parser and read in a document, you will not be given a default object model.
Point 3: These parsers only read in your XML document and fire events based on the following:
- Open or start of elements
- Closing or end of elements
- #PCDATA and CDATA sections
- Processing instructions, comments, and entity declarations
Point 4: The three steps to using SAX in your applications are:
- Creating a custom object model, like a Book class
- Creating a SAX parser
- Creating a document handler to turn your document into instances of your custom object model
Because SAX does not come with a default object model representation for the data in your XML document, you need to create your own the first time you use this method. The model could be something as simple as creating a Book class if your XML document is an address book.

SAX Document Handler: is a listener for the various events fired by the SAX parser Events are fired based on all registered document event listeners and translated into method calls A SAX event order SAX parser exposes the document as a series of events that are translated into method calls Point 1: After your custom model is created to hold your data in your application, the next step is creating a "document handler" to initialize instances of your object models from the document. This document handler is a listener for the various events we listed that are fired by the SAX parser. Most of the work involved in using SAX is in creating these document handlers. Point 2: As the SAX parser reads a document, events are fired based on all the "registered" document event listeners and translated into method calls on your document handler implementation. The document handler must then do something useful with these method calls. Point 3: Figure (on slide) shows the sequence of method calls the SAX parser makes on your document handler implementation. You can see from this picture how the SAX parser exposes the document as a series of events that are translated into method calls in your document handler implementation.
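The order in which the parser calls the document handler can be made visible with a handler that simply records each call. This is a sketch using the JDK's SAX parser; the class EventTrace and its trace helper are assumptions made for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EventTrace extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument"); }
    @Override public void endDocument()   { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement(" + qName + ")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement(" + qName + ")");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("characters(" + text + ")");  // skip pure whitespace
    }

    /** Parses the document and returns the sequence of handler calls. */
    static List<String> trace(String xml) throws Exception {
        EventTrace handler = new EventTrace();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : trace("<Book><title>T</title></Book>")) {
            System.out.println(event);
        }
    }
}
```

Element events are always properly nested, so a handler can rebuild the structure with a stack if it needs to.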

DOM Vs SAX The following are DOM benefits you should focus on:
it allows random access to the document complex searches can be easily implemented the DTD or schema is available the DOM is read/write The following list contains some of the most useful benefits of SAX: it can parse files of any size it is a fast processing method you can build your own data structure you can access only a small subset of info. if desired Which Parsing Method to choose You should choose your parser depending on the nature of the processing and the size of the XML documents. A tree-based parser usually needs to load the entire document into memory, so it can be impractical because of physical constraints on memory when processing documents like dictionaries or large databases. With a stream-based parser you can skip over elements that you aren't interested in (for example, when looking up a particular word in a dictionary). If your application needs to process certain elements in relation to other elements, however, a tree-based parser is much easier to work with. It's worth noting that a tree-based parser can be built on top of a stream-based parser and that the output of a tree-based parser can be "walked" to provide a stream-based interface to an application. Here is a list of the advantages of both methods Point 1: DOM implementations are currently biased toward in-memory storage of the document, but this may change as Persistent DOM (PDOM) implementations become more popular. Even with memory limitations, however, DOM certainly has a place because of features that help it access and manipulate documents. The following are DOM benefits you should focus on: - It allows random access to the document. - Complex searches can be easily implemented. - The DTD or schema is available. - The DOM is read/write. The first two benefits are the ability to randomly access the document and create complex searches. These provide a means for searching for elements and retrieving information, such as data and attributes, on these elements. 
The DOM can also be bound to an XML DTD or schema, which means it can be checked to make sure the data contained in the document is valid according to the rules of the DTD or schema. Finally, it provides the ability to read data out of a document and write data to it. The DOM's simplicity, powerful access to the document, and well-defined specification make it a popular parsing method. It also pairs well with XSLT and other document-transformation solutions you might require. Therefore, if your project is small and you need to complete it quickly, using a DOM-based method is a great choice. However, if you are going to process large files and have the time to write a more robust application, you should look into a SAX-based implementation.
Point 2: If you need to parse and process huge XML documents, SAX implementations offer some benefits over DOM-based ones. You should first ask yourself, however, if an improved design would remove the need for large documents. For example, prefiltering in a database that can stream XML might suit your needs. By going with SAX, however, you can enforce options for document manipulation by using XSLT and requiring your team to write code to internally manage, store, and rewrite the document. Like the DOM, SAX has a particular set of benefits. The following list contains some of the most useful:
- It can parse files of any size.
- You can build your own data structure.
- You can access only a small subset of the information if you desire.
- It is fast.
The biggest advantage of SAX is, arguably, its ability to process files of any size: the parser streams the data through instead of holding the whole document in memory. SAX is also useful when you want to build your own data structure, and it allows you to grab only subsets of the information in a given document. Finally, it can be a fast method of processing documents, especially when parsing large files.
SAX is best suited to sequential-scan applications, where you want to go through the XML document quickly from start to finish.
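
The dictionary lookup mentioned above can be sketched with both styles of parser. This is a minimal illustration using Python's standard xml.sax and xml.dom.minidom modules; the sample document and the names WordLookup, sax_lookup and dom_lookup are assumptions made for the example, not part of the original slides.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample data: a tiny "dictionary" document.
DICTIONARY_XML = """<dictionary>
  <entry><word>apple</word><meaning>a fruit</meaning></entry>
  <entry><word>parser</word><meaning>a program that analyses input</meaning></entry>
  <entry><word>zebra</word><meaning>a striped animal</meaning></entry>
</dictionary>"""


class WordLookup(xml.sax.ContentHandler):
    """Stream-based lookup: remembers only the meaning of one target word."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.current_tag = None
        self.current_word = ""
        self.meaning = None
        self._buffer = []

    def startElement(self, name, attrs):
        self.current_tag = name
        self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it in a buffer.
        if self.current_tag in ("word", "meaning"):
            self._buffer.append(content)

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        if name == "word":
            self.current_word = text
        elif name == "meaning" and self.current_word == self.target:
            self.meaning = text
        self.current_tag = None
        self._buffer = []


def sax_lookup(xml_text, target):
    # The document is scanned once; no tree is built.
    handler = WordLookup(target)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.meaning


def dom_lookup(xml_text, target):
    # The whole document is loaded into a tree first, then accessed at random.
    doc = minidom.parseString(xml_text)
    for entry in doc.getElementsByTagName("entry"):
        word = entry.getElementsByTagName("word")[0].firstChild.data
        if word == target:
            return entry.getElementsByTagName("meaning")[0].firstChild.data
    return None
```

Both functions return the same meaning for a given word. The SAX version keeps only a few strings in memory regardless of document size, while the DOM version holds the full tree, which is what makes later random access and modification straightforward.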

Example Parsers
- MSXML: the first parser that can perform both SAX and DOM-based parsing, from Microsoft
- Xerces: available in three languages, from the Apache Group
- IBM's XML for Java: formula/xml
- Microstar's Ælfred
- Sun's Java API for XML
- Oracle's XML Parser for Java

We now take a look at some of the parsers that are available.

Point 1: The first parser, which can perform both SAX and DOM-based parsing, is Microsoft's MSXML. This parser, which is currently at version 3.0, supports several standards and can handle most of your parsing needs.

Point 2: Another popular parser is available for download from the Apache Group, an open source movement that made its name with a Web server. The Xerces implementation, like MSXML, comes as a library, but it is available in three languages: a C++ library, a set of Java classes, and a COM and Perl binding/wrapper for the C++ implementation. Xerces supports DOM Level 1 and 2 and SAX2. While it does not support some of the additional standards that MSXML does, if you need a common parser across various platforms and environments, Xerces might be the choice for you.

Points 3, 4, 5 and 6: Some other DOM and SAX parsers written in Java are IBM's XML for Java, Microstar's Ælfred, Sun's Java API for XML, and Oracle's XML Parser for Java.

XPath Language for expressing paths in an XML document
- Navigation: child, descendant, parent, ancestor
- Tests on the nature of the node
- More complex selection predicates
- Means to specify portions of a document
- Basic tool for other XML languages: XLink, XSLT, XQuery
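
The navigation axes and predicates listed above can be tried out with Python's xml.etree.ElementTree, which implements only a limited subset of XPath (no ancestor axis, for instance). The sample document and its titles are hypothetical data invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data for illustration only.
BIB_XML = """<bib>
  <book year="1999"><title>Book A</title><publisher>Publisher One</publisher></book>
  <book year="2001"><title>Book B</title><publisher>Publisher Two</publisher></book>
  <article year="2001"><title>Article C</title></article>
</bib>"""

root = ET.fromstring(BIB_XML)

# Child axis: <book> elements that are direct children of the root.
child_books = root.findall("book")

# Descendant axis: every <title> anywhere below the root.
all_titles = [t.text for t in root.findall(".//title")]

# Selection predicate on an attribute value.
books_2001 = [b.findtext("title") for b in root.findall("book[@year='2001']")]

# Parent axis: the parents of all <publisher> elements.
publisher_parents = [p.tag for p in root.findall(".//publisher/..")]
```

A full XPath engine would be needed for the ancestor axis or for node-type tests; the subset shown here is only what the standard library supports.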

XQuery Query language: “SQL for XML”
Like SQL: select portions of the data and reconstruct a result.
Query structure: FLWR (pronounced "flower"), for FOR, LET, WHERE, RETURN:
- $p: scans the sequence of publishers
- $b: bound to the sequence of books for a publisher
- WHERE filters out some publishers
- RETURN constructs the result

    FOR $p IN document("bib.xml")//publisher
    LET $b := document("bib.xml")//book[publisher = $p]
    WHERE count($b) > 100
    RETURN $p
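
To make the roles of the four clauses concrete, the same logic can be written as a plain loop. This is a rough Python sketch, not an XQuery implementation: the function name publishers_with_many_books, the threshold parameter and the sample data are assumptions, and unlike the query above it iterates over distinct publisher names rather than over every publisher element.

```python
import xml.etree.ElementTree as ET


def publishers_with_many_books(bib_root, threshold):
    # FOR $p: scan the publishers (deduplicated by name in this sketch).
    publishers = []
    for p in bib_root.iter("publisher"):
        if p.text not in publishers:
            publishers.append(p.text)

    result = []
    for p in publishers:
        # LET $b: bind the sequence of books published by p.
        books = [b for b in bib_root.iter("book")
                 if b.findtext("publisher") == p]
        # WHERE: keep only publishers with more than `threshold` books.
        if len(books) > threshold:
            # RETURN: construct the result.
            result.append(p)
    return result


# Hypothetical sample data.
SAMPLE = ET.fromstring("""<bib>
  <book><title>A</title><publisher>P1</publisher></book>
  <book><title>B</title><publisher>P1</publisher></book>
  <book><title>C</title><publisher>P2</publisher></book>
</bib>""")

many = publishers_with_many_books(SAMPLE, threshold=1)
```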

XSLT Transformation language: “Perl for XML”
An XSLT style sheet includes a set of transformation rules, each a pattern/template pair.
- Pattern: based on XPath expressions; it specifies a structural context in the tree.
- Template: specifies what should be produced.
Principle: when a pattern is matched in the source document, the corresponding template produces some data.
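
The pattern/template principle can be imitated in a few lines of ordinary code. The following Python sketch is not XSLT: patterns are reduced to bare tag names instead of XPath expressions, and the names RULES, bib_template, book_template and apply_templates are hypothetical. It only shows how matching a node selects the template that produces output.

```python
import xml.etree.ElementTree as ET


def bib_template(node, apply):
    # Template for <bib>: wrap the output produced for the children in a list.
    return "<ul>" + "".join(apply(child) for child in node) + "</ul>"


def book_template(node, apply):
    # Template for <book>: emit the title as a list item.
    return "<li>" + node.findtext("title", default="") + "</li>"


# Each rule pairs a pattern (here simply a tag name) with a template.
RULES = {"bib": bib_template, "book": book_template}


def apply_templates(node):
    template = RULES.get(node.tag)
    if template is None:
        # Fallback in this sketch: no output for the node, recurse into children.
        return "".join(apply_templates(child) for child in node)
    return template(node, apply_templates)


html = apply_templates(ET.fromstring(
    "<bib><book><title>A</title></book><book><title>B</title></book></bib>"))
```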

XLink XML Linking Language: advanced hypertext primitives
- Allows inserting in XML documents descriptions of links to external Web resources
- Simple mono-directional links, à la HREF in HTML
- Multidirectional links
- XLink relies on XPath for addressing portions of XML documents

XLink Generalizes HTML’s href
XLink links resources, which include documents, audio, video, database data, etc.
Many types: simple, extended, locator, arc, resource, or title.
In the example, the required attributes come first, followed by the optional attributes:

    <person xmlns:xlink="..."
            xlink:type="simple"
            xlink:href="..."
            xlink:role="..."
            xlink:title="The Homepage"
            xlink:show="replace"
            xlink:actuate="onRequest">
    </person>

XLink
- Local resource: the linking element (i.e., person) that references document2.
- Remote resource: the resource that is referenced.
- Traversal: getting from resource A to resource B by following the defined link, i.e., the path between the local and remote resources.
- Arc: how to traverse a pair of resources, including the direction of traversal and possibly application behavior information as well.
- Outbound arc: an arc that has a local starting resource and a remote ending resource.
- Inbound arc: an arc whose ending resource is local but whose starting resource is remote.
- Third-party arc: an arc in which neither the starting resource nor the ending resource is local.

XLink
The show attribute specifies how to display a resource when it is loaded, and can be:
- "new": display in a new window
- "replace": replace the current resource with the linked resource
- "embed": replace the current element with the linked resource
- "other": the XLink-aware application can decide how to display it
The actuate attribute specifies when the resource should be retrieved, and can be:
- "onLoad": retrieve the resource as soon as the document is loaded
- "onRequest": retrieve the resource when the user clicks on the link
- "other": the XLink-aware application can decide when to load it
- "none": no information is given about when to load the resource

XLink
- The use of XLink elements and attributes requires declaration of the XLink namespace, e.g. <person xmlns:xlink="...">.
- The href attribute defines the remote resource's URI.
- The role attribute is a URI that references a resource that describes the link.
- The title attribute is a descriptive title for the link.
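
An application that consumes such a link has to read the attributes through the namespace rather than through the xlink prefix. The following is a minimal Python sketch; the namespace URI is the one defined by the W3C XLink recommendation, while the href value, the element content and the helper name xlink_attributes are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace URI defined by the W3C XLink recommendation.
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical simple link; the href is a placeholder, not taken from the slides.
SIMPLE_LINK = f"""<person xmlns:xlink="{XLINK_NS}"
    xlink:type="simple"
    xlink:href="http://example.org/home.html"
    xlink:title="The Homepage"
    xlink:show="replace"
    xlink:actuate="onRequest">Alan</person>"""


def xlink_attributes(element):
    """Return the element's XLink attributes keyed by local name."""
    prefix = "{" + XLINK_NS + "}"
    return {name[len(prefix):]: value
            for name, value in element.attrib.items()
            if name.startswith(prefix)}


attrs = xlink_attributes(ET.fromstring(SIMPLE_LINK))
```

ElementTree expands every prefixed attribute name to the form {namespace}localname, so the lookup works whatever prefix the document author chose.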

XLink Extended links are used for linking multiple combinations of local and remote resources. The figure shows two unidirectional links. With XLink, we can create multidirectional links for traversing between resources.

XLink Multidirectional links are not limited to just two resources, but can link any number of resources. The links need not be traversed sequentially.

A label's value is used to link one resource to another
    <?xml version = "1.0"?>
    <books xmlns:xlink = "..."
           xlink:type = "extended"
           xlink:title = "Book Inventory">
       <author xlink:label = "authorDeitel"
               xlink:type = "locator"
               xlink:href = "/authors/deitel.xml"
               xlink:role = "..."
               xlink:title = "Deitel &amp; Associates, Inc.">
          <persons id = "authors">
             <person>Deitel, Harvey</person>
             <person>Deitel, Paul</person>
          </persons>
       </author>
       <publisher xlink:label = "publisherPrenticeHall"
                  xlink:type = "locator"
                  xlink:href = "/publisher/prenticehall.xml"
                  xlink:role = "..."
                  xlink:title = "Prentice Hall"/>
       <warehouse xlink:label = "warehouseXYZ"
                  xlink:type = "locator"
                  xlink:href = "/warehouse/xyz.xml"
                  xlink:role = "..."
                  xlink:title = "X.Y.Z. Books"/>

- books is the root element.
- The author element is a link to the book's authors, with information located at /authors/deitel.xml.
- A label's value is used to link one resource to another.
- The author element has type locator, which specifies a remote resource.
- When either Deitel, Harvey or Deitel, Paul is selected, the document deitel.xml will be retrieved.

XLink The book, a local resource, can be linked to (or from) an author or publisher
       <book xlink:label = "JavaBook"
             xlink:type = "resource"
             xlink:role = "..."
             xlink:title = "Textbook on Java">
          Java How to Program: Third edition
       </book>
       <arcElement xlink:type = "arc"
                   xlink:from = "JavaBook"
                   xlink:arcrole = "..."
                   xlink:to = "authorDeitel"
                   xlink:show = "new"
                   xlink:actuate = "onRequest"
                   xlink:title = "About the author"/>
       <arcElement xlink:type = "arc"
                   xlink:from = "JavaBook"
                   xlink:arcrole = "..."
                   xlink:to = "publisherPrenticeHall"
                   xlink:show = "new"
                   xlink:actuate = "onRequest"
                   xlink:title = "About the publisher"/>
       <arcElement xlink:type = "arc"
                   xlink:from = "warehouseXYZ"
                   xlink:arcrole = "..."
                   xlink:to = "JavaBook"
                   xlink:show = "new"
                   xlink:actuate = "onRequest"
                   xlink:title = "Information about this book"/>

- book is a local resource; it can be linked to (or from) an author or publisher.
- The first arc is an outbound arc between the book local resource and the author remote resource; its arcrole provides information about the book's author.
- The second arc is an outbound arc between the book local resource and the publisher remote resource.
- The third arc is an inbound arc: its starting resource is remote (warehouseXYZ) and its ending resource is local (JavaBook).

XLink
       <arcElement xlink:type = "arc"
                   xlink:from = "publisherPrenticeHall"
                   xlink:arcrole = "..."
                   xlink:to = "warehouseXYZ"
                   xlink:show = "embed"
                   xlink:actuate = "onLoad"
                   xlink:title = "Publisher's inventory"/>
    </books>

- This is a third-party arc: its starting and ending resources are both remote.
- Attribute show has value embed, which indicates that the ending resource should replace the starting resource when the link is traversed.
- Attribute actuate has value onLoad, so the link is traversed upon loading the XML document.
- Because we consider the relationship between the publisher and the warehouse to be different from the previous three arcs, we provide a different arcrole value for this link.
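
The outbound, inbound and third-party labels follow mechanically from whether each end of an arc is a local resource (type resource) or a remote one (type locator). The Python sketch below classifies the four arcs of the example; it uses an abbreviated copy of the document with the standard XLink namespace URI filled in, assumes each label names exactly one resource, and the function name classify_arcs is hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK = "{http://www.w3.org/1999/xlink}"

# Abbreviated version of the extended link above (roles and titles omitted).
EXTENDED_LINK = """<books xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended">
  <author xlink:label="authorDeitel" xlink:type="locator" xlink:href="/authors/deitel.xml"/>
  <publisher xlink:label="publisherPrenticeHall" xlink:type="locator" xlink:href="/publisher/prenticehall.xml"/>
  <warehouse xlink:label="warehouseXYZ" xlink:type="locator" xlink:href="/warehouse/xyz.xml"/>
  <book xlink:label="JavaBook" xlink:type="resource">Java How to Program: Third edition</book>
  <arcElement xlink:type="arc" xlink:from="JavaBook" xlink:to="authorDeitel"/>
  <arcElement xlink:type="arc" xlink:from="JavaBook" xlink:to="publisherPrenticeHall"/>
  <arcElement xlink:type="arc" xlink:from="warehouseXYZ" xlink:to="JavaBook"/>
  <arcElement xlink:type="arc" xlink:from="publisherPrenticeHall" xlink:to="warehouseXYZ"/>
</books>"""


def classify_arcs(link):
    # Map each label to True (local, type "resource") or False (remote, type "locator").
    is_local = {}
    for child in link:
        kind = child.get(XLINK + "type")
        if kind in ("resource", "locator"):
            is_local[child.get(XLINK + "label")] = (kind == "resource")

    classified = []
    for arc in link:
        if arc.get(XLINK + "type") != "arc":
            continue
        start, end = arc.get(XLINK + "from"), arc.get(XLINK + "to")
        if is_local[start] and not is_local[end]:
            kind = "outbound"
        elif not is_local[start] and is_local[end]:
            kind = "inbound"
        elif not is_local[start] and not is_local[end]:
            kind = "third-party"
        else:
            kind = "local-to-local"
        classified.append((start, end, kind))
    return classified


arcs = classify_arcs(ET.fromstring(EXTENDED_LINK))
```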

XPointer An extension of XPath
Usage: in the fragment part of an href value. XPointer can link to specific locations (i.e., nodes in an XPath tree), or even ranges of locations, in an XML document. XPointer also adds the ability to search XML documents by using string matching.

XPointer Pointing to a point (=XML element or character)
In the full XPointer form, the name xpointer is called a scheme. The examples refer to the following document, contacts.xml:

    <?xml version = "1.0"?>
    <contacts>
       <contact id = "author01">Deitel, Harvey</contact>
       <contact id = "author02">Deitel, Paul</contact>
       <contact id = "author03">Nieto, Tem</contact>
    </contacts>

XPointer
- Full form: xlink:href = "/contacts.xml#xpointer(id('author01'))"
- If a document's unique identifier is referenced in an expression such as the one above, a bare-name XPointer address can be used as a simplified expression: xlink:href = "/contacts.xml#author01"
- Child sequence: e.g. #xpointer(/1/3/2/5), #xpointer(/bib/book[3])
- Pointing to a range: e.g. #xpointer(id(3652 to 44))
- Most interesting examples use XPath.
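
The equivalence between the full id() form and the bare-name form can be sketched as a tiny resolver. This Python example handles only those two forms and is not an XPointer implementation; it also treats any attribute named id as an identifier, whereas a real processor would rely on the DTD or schema to know which attributes have type ID. The function name resolve_fragment is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

CONTACTS_XML = """<contacts>
  <contact id="author01">Deitel, Harvey</contact>
  <contact id="author02">Deitel, Paul</contact>
  <contact id="author03">Nieto, Tem</contact>
</contacts>"""


def resolve_fragment(root, fragment):
    """Resolve "xpointer(id('X'))" or the bare name "X" to an element, or None."""
    match = re.fullmatch(r"xpointer\(id\('([^']+)'\)\)", fragment)
    target_id = match.group(1) if match else fragment
    for element in root.iter():
        # Assumption: the attribute named "id" plays the role of the unique identifier.
        if element.get("id") == target_id:
            return element
    return None


contacts = ET.fromstring(CONTACTS_XML)
full_form = resolve_fragment(contacts, "xpointer(id('author01'))")
bare_name = resolve_fragment(contacts, "author01")
```

Both calls return the same contact element, which is the point of the bare-name shorthand.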
