Introduction to XML V7 march 2010

Introduction to XML V7 march 2010
31284 Web Services Development Web services technologies 31242/32549/AJPCPE Advanced Internet Programming Introduction to XML V7 march 2010 (c) University of Technology, Sydney

next weeks... Agenda XML Introduction XML Namespaces DTD/XML Schema
XML programming XSLT/XPATH next weeks... (c) University of Technology, Sydney

XML Introduction Meta-language (literally a language about languages)
XML supports user-defined languages that add meaning to data Human readable Extensible & Vendor-independent Widespread Industry Support Form the basis for Web Services Buzzword-compliant! Widespread Industry Support (e.g. major relational databases have native capability to read and generate XML data) Large Family of technologies available for the processing and transformation of XML (e.g. Web page display and report generation) XML is rapidly evolving field – existing XML standards (and the software that is based on them) is constantly changing and new standards are being proposed regularly. This lecture do not by any means cover all the technologies in the XML family but it provides a good starting point to begin to understand the scope and power of XML and how it can be applied to a variety of problems. For annotating documents with metadata to make them easier to interpret (c) University of Technology, Sydney

Key Uses Of XML “Document centric” “Data centric”
represents semi-structured information eg: markup language - XHTML eg: Document publishing (separate presentation and content) “Data centric” represent highly structured information eg: Data storage eg: Data exchange eg: Application configuration (c) University of Technology, Sydney

Document centric XML <?xml version="1.0" encoding="UTF-8"?>
<html xmlns=" <head> <title> hello, World</title> </ head> <body> <h1> Hello, World</h1> <p> This is a sample of XHTML, geez it looks just like HTML? </p> </body> </html> XML stands for Extensible Markup Language. XML is designed as a way of representing structured information. XML represents information using a tag-based markup language. There are six kinds of XML markup: Elements Attributes Entity References Comments Processing Instructions CDATA Sections Elements are the main structuring element in an XML document. Each element consists of a name and a value. The element name describes the type of information, and in most cases consists of a starting and ending tag. The element value usually consists of PCDATA (Parsed Character Data). The fact that it is “parsed” character data means that the value of one element may contain other elements, i.e. elements may be nested. (c) University of Technology, Sydney

personal business Data centric XML Ms Smith 555-5555 Mr Suit 555-6666
<?xml version="1.0" encoding="UTF-8"?> <addressBook> <entry list="personal"> <name>Ms Smith</name> <address>1 Central Rd, Sydney </address> <phone> </phone> </entry> <entry list="business"> <name>Mr Suit</name> <address>1 George St, Sydney </address> <phone> </phone> </addressBook> personal Ms Smith 1 Central Rd, Sydney business Mr Suit 1 George St., Sydney XML stands for Extensible Markup Language. XML is designed as a way of representing structured information. XML represents information using a tag-based markup language. There are six kinds of XML markup: Elements Attributes Entity References Comments Processing Instructions CDATA Sections Elements are the main structuring element in an XML document. Each element consists of a name and a value. The element name describes the type of information, and in most cases consists of a starting and ending tag. The element value usually consists of PCDATA (Parsed Character Data). The fact that it is “parsed” character data means that the value of one element may contain other elements, i.e. elements may be nested. (c) University of Technology, Sydney

XML - Introduction XML stands for Extensible Markup Language
XML is a tag based markup language, which appears superficially similar to other markup languages, such as HTML Unlike HTML, which has a fixed set of tags defined that you can use, XML allows you to use whatever tags you like – it is extensible (c) University of Technology, Sydney

Family of XML Technologies
XML derives its strength from a variety of supporting technologies for Presentation, Structure, and Transformation Address many of the issues associated with bringing XML into mainstream computing (Frank P. Coyle 2002) (c) University of Technology, Sydney

XML Technologies XML Document XML Representations XPath
(address parts of XML doc) XML Document Object Model DOM Bits of XML doc Text XSL: XML Transformation (uses XPath) in memory XML XML Streaming interface e.g. SAX HTML XML Document file - Unicode XLink XML doc linking XML XQuery XML Query XML XSD XML Schema Validate XML (c) University of Technology, Sydney

What is XML? XML documents are plain text documents
The basic rules of XML are quite simple.  Any document which obeys the basic rules of XML is said to be “well formed” A “well formed” XML document obeys rules of XML  every tag closed no tags nested out of order  syntactically correct according to XML specification (look at (c) University of Technology, Sydney

XML – Document Structure
An XML document is a structured collection of text format markup tags Each tag either defines: some information used to describe how the document is to be interpreted some data contained within the document Tags in XML documents fall into the following categories: (1) Comments: XML, like Java and HTML, has a means of adding comments to a document  (c) University of Technology, Sydney

XML – Document Structure - 2
(2) XML declaration: Identifies the document as an XML document. Includes XML version (currently 1.0), encoding format, dependencies Must be at top of the XML document <? xml version=“1.0” encoding=“UTF-8”?> (3) DTD declarations: DTD (Document Type Definition) specifies the semantic constraints <!DOCTYPE addressBook SYSTEM "D:\projects\examples\xml\addressBook.dtd"> (c) University of Technology, Sydney

(4) Elements: Contain data values or other XML elements. Elements are defined between “<“ and “>” characters <entry> <name>Mr Smith</name> </entry> (5) Attributes: Attributes are name/value pairs associated with elements <entry list=“personal”> (c) University of Technology, Sydney

(5) Entity references: Special identifiers that refer to a value Defined between "&" and ";" characters Some standard reserved entities eg: < represents "<" character You can make your own eg: <!ENTITY au Australia”> Eg: <country>&au;</country> is the same as <country>Australia</country> (c) University of Technology, Sydney

(6) Processing instructions: Tell applications to perform some operation. Defined between "<?" and "?>" sequence <?OrderValidation creditCheck checkCustomer?> (7) Unparsed character data: Tells XML processor not to parse quoted string Defined between <![CDATA[ and ]]> sequences <![CDATA[if (x > y) or (y < z)]]> (c) University of Technology, Sydney

Agenda XML Introduction  XML Namespaces DTD/XML Schema
XML programming XSLT/XPATH (c) University of Technology, Sydney

title = “chris” ?? Which title is updated?? XML – Namespaces
XML elements can belong to an XML Namespace XML Namespaces help avoid clashes in element names when documents of different types are combined class Author { String name; String title; } class Book { Author author; Solution: Qualify the name! e.g.: Book’s title is book.title Author’s title is: book.author.title Much like java packages!! title = “chris” ?? Which title is updated?? (c) University of Technology, Sydney

tag is now <xs:schema>
XML Namespaces Namespaces identify that elements in the document come from different sources you can mix different namespaces in the same document namespace appears as prefix on element/attribute At the top of each document, you need to declare the namespaces used by that document <xs:schema xmlns:xs=" ... </xs:schema> Namespaces are used in XML to identify where a particular element came from (e.g. who defined it). In a single, isolated XML document, every element name is going to be unique within that document, because the document was written by a single author. However imagine that you later merge two XML documents together – what if both authors chose to use an element called "address", but with different structures. If each element name is prefixed with a namespace that indicates its origin, then there is no confusion and you can have two different "address" elements in the same document because they will have different namespace prefixes. Elements (and attributes) that have a namespace have a prefix indicating the namespace. For example, <xs:schema> is an element called "schema" in a namespace called "xs". Namespaces need to be declared before you use them. You declare a namespace at the top-most level at which it is used – often the root element of the XML document. When declaring the namespace, you must provide a unique identifier that serves to distinguish this namespace from any other namespace. By convention, a URL is used as the unique identifier. Any two namespaces that have the same unique identifier (e.g. the same URL) are treated as though they are the same namespace, even if one document chooses to call it "xs" and another document chooses to call it "xyz" – if the URL is the same, they are referring to the same namespace. By convention, when a URL is used as the namespace identifier, it is common to put a web page at that URL describing the namespace and its contents (its XML Schema). Notice that when writing XML Schemas, items defining structure (e.g. "xs:complexType") have the namespace prefix on them. This indicates that the element named "complexType" is part of the namespace identified by the URL " In this particular document, the namespace identified by " is referred to by the prefix "xs". In another document, the same namespace could be referred to by a different prefix, but still be the same namespace (e.g. you will see some XML Schema documents that use the prefix "xsi" instead of just "xs" – you can use any prefix you like). tag is now <xs:schema> (c) University of Technology, Sydney

XML – Namespaces Elements in an XML document can be marked with a namespace. The namespace is a prefix to the element name – the format for the element tag is: <namespace : elementName> data </namespace : elementName> Namespaces are first declared at the root element containing the elements belonging to the name space. eg: <Address:addressBook xmlns:Address=“ (c) University of Technology, Sydney

Example – XML Namespace
<?xml version="1.0" encoding="UTF-8"?> <Address:addressBook xmlns:Address=“ <Address:addressBook> <Address:entry list="personal"> <Address:name>Ms Smith</Address:name> <Address:address>1 Central Rd, Sydney</Address:address> <Address:phone> </Address:phone> </Address:entry> <Address:entry list="business"> <Address:name>Mr Suit</Address:name> <Address:address>1 George St, Sydney</Address:address> <Address:phone> </Address:phone> </Address:addressBook> (c) University of Technology, Sydney

Trees and XML A common way of processing XML documents is to read them into memory in a tree structure arbitrary tree, not binary One way to process the in-memory XML document is to use a traversal algorithm While XML is good for tree-structured information, it is not so useful for representing the more general concept of graphs ... Okay, let’s get back to XML. A lot of the material in the preceding slides was about the properties of trees in general, and not specific to XML document structures. However, as was discussed earlier, XML documents do represent tree-structured information. XML tree structures are rooted trees of arbitrary shape (they are not binary trees, and they are certainly not ordered). One way of processing an XML document is to write a program that reads the XML file from the disk and builds a tree data structure in memory. Then when your application wants to examine data from the XML file, it can do so by traversing the tree in memory. An XML parser does the job of reading the XML document from disk and building the tree structure (a “DOM-style parser”, that we will see later). Once you have an XML document in a tree structure in memory, a common way to use the document is to perform a preorder traversal, and visit each node in turn. This will process the elements of the XML document in the same order in which they appear in the file. We will see more about XML parsing and preorder traversals later. For now, we are going to look at an information structure that XML is not very good at representing: graphs. (c) University of Technology, Sydney

XML and trees XML represents a tree structure addressBook entry name
phone XML documents are a representation of tree-structured information. Trees represent information in a hierarchy. The tree shown in the slide is an abstract representation of the XML document shown on the previous slide. Actually, it is a simplified representation. The tree shown above only includes the elements in the XML document. As we will see later, when XML documents are actually converted into tree structure for processing, the attributes and the element values also become parts of the tree. (c) University of Technology, Sydney

Trees - terminology Parent Child Siblings Ancestors Descendants Leaf
Internal vertices Subtree addressBook entry name address phone The easiest way to understand the definitions of the words shown on the slide is by examples. Parent – The parent of “name” is “entry”. The parent of “entry” is “addressBook”. In a tree, every vertex has either zero or one parents. Child – The vertex “addressBook” has two children: “entry” and “entry”. The vertex “entry” has three children: “name”, “address” and “phone”. Siblings – The vertices “name”, “address” and “phone” are all siblings (i.e. they have the same parent). Ancestors – The vertex “name” has two ancestors: “entry” and “addressBook”. A vertex’s ancestors include its parent and its grandparent and its great-grandparent, etc. Descendants – The vertex “addressBook” has eight descendants: “entry”, “entry”, “name”, “name”, “address”, “address”, “phone”, “phone”. A vertex’s descendants include its children and its grandchildren and its great-grandchildren, etc. Leaf – The vertices “name”, “address”, and “phone” are leaf vertices. A leaf vertex has no children. Internal vertices – The vertices “entry” and “entry” are internal vertices. An internal vertex has both a parent and one or more children. Subtree – The set of vertices and edges consisting of the left-most vertex “entry” and its children form a subtree. i.e. the subtree consists of the vertices the left-most “entry”, “name”, “address” and “phone” and the edges that connect them. (c) University of Technology, Sydney

XML tree structure parsed..
An abstract syntax representation of an XML document Document Name: #document Element Name: addressBook Attribute Name: list Value: "personal" Element Name: entry Element Name: name Element Name: address Element Name: phone Another example is to look at the tree created when parsing an XML file. In this case, the original XML file would look like: <addressBook> <entry list=“personal”> <name> Ms Smith </name> <address> 1 Central ... </address> <phone> </phone> </entry> ...... </addressBook> This is not quite the same as the abstract syntax trees in the previous examples. In this case, each vertex in the tree has a type (e.g. “Document”, “Element”, “Attribute”) as well as a name and sometimes a value. Although this is more complex than the formal definition of an abstract syntax tree as considered in formal language theory, this tree is still showing the abstract syntax of the language. The tree structure shown in the slide is the tree structure formed when parsing an XML file using the DOM parser (more on this later). Text Name: #text Value: "Ms Smith" Text Name: #text Value: "1 Central..." Text Name: #text Value: " " (c) University of Technology, Sydney

Tree Traversals Obviously when using XML, we need to systematically visit each vertex of a tree Three main ways to traverse: preorder inorder postorder Which traversal to use depends on the trade-off of efficiency versus speed. We’ll talk about the different programming models later... As we have seen, trees can be used for searching for information that has been stored in a tree structure. That is one application of trees, but often when trees are used as a data structure for storing information, what is really needed is to extract all the information from the tree in a particular order. For example, in an ordered binary tree, you want to be able to extract the data from the tree in sorted order. This facilitates the use of trees for sorting – data is added to the tree in unsorted order, but when you extract the data from the tree, it is retrieved in sorted order. Retrieving data from a tree is accomplished by traversing the tree. Traversing a tree means visiting each vertex of the tree in some well-known order, such that every vertex is visited exactly once. There are three commonly used algorithms for traversing a tree: Preorder traversal Inorder traversal Postorder traversal Note that these algorithms are not just for binary trees – they work with n-ary trees (i.e. where each vertex may have any number of children). (c) University of Technology, Sydney

Preorder Traversal a, b, e, j, k, n, o, p, f, c, d, g, l, m, h, i
In a preorder traversal, you visit the “current vertex” FIRST. Note that we are working with n-ary trees (not binary), and that the tree is not an ordered tree in this case. The traversal algorithm is based upon: the current vertex (r) a set of subtrees, T1, T2, ..., Tn , where each subtree is rooted at a child of the current vertex (illustrated by the circles in the slide) Step 1: Visit the current vertex, r Step 2: Visit the left-most subtree, T1 using a preorder traversal of the subtree Step 3: Visit the second-to-left subtree, T2 using a preorder traversal etc. for the remainder of the subtrees of children of vertex r A preorder traversal will always begin with the root vertex of the tree. (c) University of Technology, Sydney

Inorder Traversal j, e, n, k, o, p, b, f, a, c, l, g, m, d, h, i
In an inorder traversal, you visit the “current vertex” IN THE MIDDLE. i.e. you visit the left child subtree first, then the current vertex, and then the right child subtree(s). Step 1: Visit the left-most subtree, T1 using an inorder traversal of the subtree Step 2: Visit the current vertex, r Step 3: Visit the second-to-left subtree, T2 using an inorder traversal etc. for the remainder of the subtrees of children of vertex r An inorder traversal will always begin with the leftmost leaf node of the tree. If you have an ordered tree (e.g. a binary search tree), performing an inorder traversal will retrieve the keys from the tree in sorted order. One technique for sorting a list of values is to use the binary search tree insertion algorithm to construct a binary search tree and then use an inorder traversal to retrieve the values in sorted order. (c) University of Technology, Sydney

Postorder Traversal j, n, o, p, k, e, f, b, c, l, m, g, h, i, d, a
In a postorder traversal, you visit the “current vertex” LAST. i.e. you visit all of the child subtrees first, and then visit the current vertex last. Step 1: Visit the left-most subtree, T1 using an inorder traversal of the subtree Step 2: Visit the second-to-left subtree, T2 using an inorder traversal etc. for the remainder of the subtrees of children of vertex r Step n+1: Visit the current vertex, r A postorder traversal will always end with the root vertex of the tree (and will start with the leftmost leaf node). (c) University of Technology, Sydney

Exercise (c) University of Technology, Sydney

Motivation XML is a highly structured language
As a human, it's easy for you to see the structure: Elements, Attributes, Entity references, Comments, Processing instructions, CDATA sections But to be any real use, XML has to be read by a computer application not just read it as a single chunk of ASCII text need to parse XML document to recognise the structure Human beings are quite good at discovering the meaning in written text. For example, consider the following two statements: Give to x the value 5 Give the value 5 to x A human being can understand that both of these statements are saying the same thing, even though they are written differently. They use the same words, but in a different order. However generally computers are not so good at interpreting statements like these. Computers prefer unambiguous statements. For example, to make a similar statement in a programming language, you might say: x = 5; That’s an example using syntax from a programming language. But even if you create a computer program that is going to read and process some information from files, it is much easier if the information in the file is in a well-structured format. XML is just a standard way of representing information. You can create an XML file in a text editor, read it, and then perhaps save it on a disk, or it to a friend who can read it too. That’s not very useful – what we really need is for computer applications to be able to read (and to a certain extent, understand) the XML, not just human beings. XML is a well-structured format that can easily be parsed by computer programs. This lecture talks more about the kinds of languages that can be parsed, and in particular, XML parsing. (c) University of Technology, Sydney

Agenda XML Introduction XML Namespaces  DTD/XML Schema
XML programming XSLT/XPATH (c) University of Technology, Sydney

XML document structure
Often you need to add a grammar (Rules) to structure a XML document Want to define allowed tags, elements, attributes and their order and data types Use “Document Type Definitions” (DTD) or an XML Schema This allows validation of XML documents to check its conformity to the grammar expressed by the schema (c) University of Technology, Sydney

Well-formed vs. Valid Well-formed:
XML document follows the basic syntax rules of XML Valid (or more precisely, schema-valid): XML document follows the basic syntax rules of XML and also follows the rules in its associated DTD or XML Schema Two words that are commonly used in XML literature are "well-formed" and "valid". These describe the "correctness" of an XML document. A document that is not well-formed contains a very basic syntax error. An XML parser cannot read in a document that is not well-formed. For example, the following XML is not well-formed because it is missing a ">" character. <name Fred </name> A document that is well-formed, but is not valid follows all the basic XML syntax rules, but does not conform to the extra rules contained in the associated XML Schema file. A document that is both well-formed and valid follows all the basic XML syntax rules, and also all of the rules in the associated XML Schema file. Note that by definition, every valid document is also well-formed (it is not possible to have a document that is valid but not well-formed). Actually this is a loose usage of the term "valid". To be precise, a valid XML document is one that follows the basic syntax rules of XML and also conforms to an associated DTD (Document Type Definition). DTDs are the predecessor to XML Schema (more later). To be precise, documents that conform to an XML Schema are said to be "schema-valid", to distinguish this case from when a document conforms to just a DTD. It is possible for a document to conform to both a DTD and an XML Schema, in which case it is both "valid" and "schema-valid". However it is also possible for a document to conform to a DTD but not to an XML Schema, in which case it is "valid" but not "schema-valid". Therefore there is some use in distinguishing these two cases. (c) University of Technology, Sydney

DTDs DTD specifies the format
(e.g. each book can have many authors, but only one publisher but Don’t have support for data-types (everything is text) Don’t look like XML Don’t have good support for namespaces (c) University of Technology, Sydney

Comment: Ancient History?
Why doesn’t DTD look like XML? Answer: XML was derived from SGML (designed in the 1960’s to represent structured documents for publishing) DTD’s are still commonly used!!! Roughly, DTDs are old-style XML Schema (in fact, XML Schema is quite new, so DTDs aren't such ancient history) SGML is a term that is often used in literature relating to XML. SGML stands for Standard Generalized Markup Language, and was a predecessor of XML. SGML's history began in the late 1960's (known then as GML), when it was developed as a language for generic markup of documents, primarily for publishing applications. The goal was that all kinds of documents (books, articles, newsletters, etc) could be marked up using SGML. The main adopters of the standard was in the publishing industry, and software vendors for the publishing industry (in particular, IBM). The problem with SGML is that it is too complex. It had all the markup that XML has now (elements, attributes, entity references, comments, CDATA sections, processing instructions), but it had additional styles of markup and introduced additional complexity even to the now-familiar notions of elements and attributes. Creating SGML documents (and document types) was something that required significant SGML expertise to do. One of the design goals of XML is that it is simple enough for almost anyone to be able to create basic documents. XML's simplicity is one of the reasons for its popularity. DTD stands for "Document Type Definition". A DTD is like an XML Schema – it defines a set of rules for what can be contained in an XML document instance. DTDs are still quite widely used, and many documents on XML will still refer to DTDs as the primary way of defining the format of XML documents, rather than XML Schema. However, XML Schema is more powerful, i.e. it is able to express more constraints on the XML language than DTDs can, and therefore XML Schema is now the preferred choice when describing "document types". (c) University of Technology, Sydney

DTDs Example <!– fax1_0.dtd file -->
<!ELEMENT fax (to, from, faxnumber, heading?, body?)> <!ELEMENT to (#PCDATA) #REQUIRED> <!ELEMENT from (#PCDATA) #REQUIRED> <!ELEMENT faxnumber (#PCDATA) #REQUIRED> <!ELEMENT heading (#PCDATA)> <!ELEMENT body (#PCDATA)> #PCDATA means any PC character string #REQUIRED means a mandatory element <!DOCTYPE note SYSTEM "note.dtd"> for external dtd file (c) University of Technology, Sydney

DTDs Example <?xml version="1.0"?> <!DOCTYPE web-app PUBLIC "-//UTS//note DTD 1.0//EN" " <fax> <to>Robert</to> <from>Chris</from> <faxnumber> </faxnumber> <heading>32525 DSP</heading> <body>Give everyone HD</body> </fax> Note: You can also use <!DOCTYPE note SYSTEM "note.dtd"> for external DTD files <!DOCTYPE note SYSTEM "note.dtd"> for external dtd file (c) University of Technology, Sydney

XML Schema XML Schema defines a sub-language of XML
 ie: Like a DTD, adds more control over the syntax of your XML document Often represented as a separate XML document itself (usually with an extension of .xsd) XML Schema defines a sub-language of XML – it restricts the set of elements and attributes and character data (text) that you are allowed to use in your XML documents to a subset that makes sense for some particular application. In a sense, XML Schema defines a grammar that imposes some additional structure on the basic syntax of an XML document. XML Schema allows you to create specific subsets of XML – it allows you to define a sub-language of XML, similar to the way Gram was defined as a sub-language of Dict. Note that this still isn't the same as semantics – even using XML Schema to constrain the set of elements, attributes and character data, you can still potentially create XML documents that contain nonsense – but it will be nonsense that follows a particular set of rules as defined in the Schema. Typically you will have a different XML Schema defined for each type of XML document you work with. If you have an address book XML document, it will have an address book Schema that defines the structure of an address book. If you have a purchase order XML document, it will have a purchase order Schema. XML Schema is especially important for the interchange of XML documents between different parties. Without some form of structuring, in principle any application could create XML documents with any tags inside. XML Schema specifies which tags are legal, or "valid", for that type of document. If two parties are exchanging documents, and both parties agree upon the XML Schema that is being used, then they both know the format of the document – they both know what to expect. As we will see in forthcoming examples, an XML Schema is an XML document itself. This means that XML Schemas can be processed using the same tools that are used to process "real" XML documents. And yes, there is an XML Schema that defines what can appear in an XML Schema (i.e. XML Schema can be defined in terms of itself). (c) University of Technology, Sydney

XML – XML Schema (cont) The schema definition itself allows the structure of elements and attributes to be defined, along with any constraints on these elements and attributes An element definition looks like: <element name=“elementName” type=“elementType” minOccurs=“min times allowed to occur” maxOccurs=“max times allowed to occur” /> The XML schema specification defines some standard elementTypes  You can also define your own element types (c) University of Technology, Sydney

XML – XML Schema (cont) A complex type definition is of the following form: <complexType name=“typeName”> <[element definition]> </complexType> An attribute definition is of the following form: <attribute name=“attributeName” type=“attributeType” [attribute options]> For further details about the full range of options available, refer to the XML Schema specification (c) University of Technology, Sydney

Sample XML Schema fax.xsd <?xml version="1.0"?>
<xsd:schema xmlns:xsd=" targetNamespace=" xmlns=" elementFormDefault="qualified"> <xsd:element name="fax"> <xsd:complexType> <xsd:sequence> <xsd:element name="from" type="xsd:string"/> <xsd:element name="to" type="xsd:string"/> <xsd:element name="faxnumber" type="xsd:string"/> <xsd:element name="heading" type="xsd:string"/> <xsd:element name="body" type="xsd:string"/> </xsd:sequence> </xsd:complexType> </xsd:element> </xsd:schema> This slide shows an example of an XML Schema document. It also shows an example of using namespaces in XML, but for now the XML Schema structure is more important. An XML document instance that uses this schema may (should!) contain an element named "fax". "fax" is a complex type. This essentially means that it can contain nested (child) elements and attributes. In this case, the internal structure of the "fax" complex type is that it contains a "sequence" of elements (these are its child elements). Then the list of child elements is shown. In this case, each one of them is a simple type (no children of its own), although it is possible to use complex types here as well (to create children of the children). Also notice that in this example, the type of all of these child elements is "xs:string". This is one of the built-in data types for simple types. As well as string, there is also "decimal", "integer", "double", "boolean", "time", "date" and many more. (c) University of Technology, Sydney

XML Schema data types XML Schemas have predefined data types eg:
string boolean integer, positiveInteger, negativeInteger time, date ,dateTime, duration base64Binary, hexBinary Name decimal QName anyURI ID, IDREF (c) University of Technology, Sydney

Using XML Schema (1) <?xml version="1.0"?>
<fax xmlns=" xmlns:xsi=" xsi:schemaLocation=" fax.xsd"> <from>Wayne</from> <to>Andrew</to> <faxnumber> </faxnumber> <heading>Reminder</heading> <body>Don't forget the lecture!</body> </fax> This is an example of an XML document instance that refers to the XML Schema on the previous slide (assumed to be in a file called "fax.xsd"). XML Schema files have the file extension ".xsd" by convention. Until now, all of the XML documents we have looked at were standalone documents. They did not refer to an XML Schema, and therefore there were no rules or restrictions on the elements that could be used in those documents. Essentially we could use any set of elements we liked. But in most XML documents, particularly in commercial applications and/or where the XML document is going to be passed to another user or application, the document will need to conform to some published XML Schema. Once you declare that an XML document instance uses a particular XML Schema, the contents of that XML document must follow the rules in the Schema. In this case, we have restricted the set of elements that may appear in the document, and specified the order (sequence) in which those elements must appear. If the XML document instance does not follow the rules, it is said to be not "valid". (c) University of Technology, Sydney

Using XML Schema (2) <?xml version="1.0"?>
<fax xmlns=" xmlns:xsi=" xsi:schemaLocation=" fax.xsd"> <from>f3rj934#$%R</from> <to>%TR$E#D</to> <faxnumber>go figure</faxnumber> <heading>Reminder</heading> <body>Don't forget the lecture!</body> </fax> This is a second example of an XML document that uses the XML Schema defined earlier. So far with the XML Schema we have succeeded in restricting which elements may appear, and the order in which they appear. But as the example in the slide shows, this still allows us to create documents that contain meaningless information. Note that we are still not talking about semantics – just trying to provide additional structural (syntax) information to help make the document more meaningful. For example: "to" and "from" elements only contain a proper name (a single word beginning with an uppercase letter) "faxnumber" contains an eight-digit number of the form " " These are artificial examples, but they illustrate the kinds of restrictions we might want to place on character data (CDATA) in an XML document. (c) University of Technology, Sydney

XML Schemas & validity Example shown allows string elements.
But this doesn’t necessarily make sense eg: faxnumber = “go figure” We can define our own data types that make sense to your schema simpleType derivative of base predefined data types complexType compound data type, like Javabean or C Structs (c) University of Technology, Sydney

Defining custom data types
Use regular expressions (patterns): <xsd:simpleType name="nameType"> <xsd:restriction base="xsd:string"> <xsd:pattern value="[A-Z][a-z]*"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="telnoType"> <xsd:pattern value="[0-9]{4}-[0-9]{4}"/> Chris What we are trying to do is to restrict the possible values that may be used as CDATA in an XML element. Effectively we are creating our own new data type. XML Schema supports a wide variety of built-in data types (string, integer, date, etc.), but also allows schema developers to create their own data types. Defining a custom data type involves including a type declaration in an XML Schema. Custom data types can be either simple types or complex types – in this example they are simple types. A simple type is one that is based around a set of data values (as opposed to complex types that have child elements/attributes rather than simple data values). In both of the examples shown, we are creating a new data type by restricting the set of legal values in an existing data type. There are several different types of "restriction" that are possible, e.g. if the base type was "integer", you could restrict the legal values to a range of numbers bounded by a minimum and maximum value. In this example we are restricting a string using a pattern. The pattern is in fact a regular expression. The first pattern shown defines a data type that consists of strings that start with one upper case letter ("[A-Z]") followed by zero or more lower case letters ("[a-z]*"). The second pattern shown defines a data type that consists of strings that consist of 4 digits ("[0-9]{4}") followed by a hyphen ("-") followed by another 4 digits ("[0-9]{4}"). (c) University of Technology, Sydney

Refined XML Schema 
<?xml version="1.0"?> <xsd:schema xmlns:xsd=" targetNamespace=" xmlns=" elementFormDefault="qualified">  <xsd:element name="fax"> <xsd:complexType> <xsd:sequence> <xsd:element name="from" type="nameType"/> <xsd:element name="to" type="nameType"/> <xsd:element name="faxnumber" type="telnoType"/> <xsd:element name="heading" type="xsd:string"/> <xsd:element name="body" type="xsd:string"/> </xsd:sequence> </xsd:complexType> </xsd:element> </xsd:schema> This slide shows the revised XML Schema ("fax.xsd") that uses the new data types. Most of the file is the same. The first change is that we need to incorporate the definitions of the two simple types from the previous slide. They should be inserted in the place shown by the comment on the slide. The definitions could also be inserted near the bottom of the file if you prefer (after the last </xs:element> shown but before the final </xs:schema>). Some people prefer to put the type declarations first, some prefer to put them last. The second change is to change the type for the "from", "to" and "faxnumber" elements to refer to the newly defined simple types, called "firstnameType" and "telnoType". (c) University of Technology, Sydney

Comment: regular expressions
Most string pattern matching in file managers is based on regular expressions, e.g. "*.txt" means "all files ending with .txt" String pattern matching in some text search tools are based on regular expressions, e.g. Unix grep, search-and-replace in Microsoft Word Syntax: string matches the exact string “string” [ABC] matches A or B or C only [A-Z] matches A, B, C ... Z . Matches any character * means zero or more of the preceding pattern {2} means exactly 2 of the preceding pattern Regular expressions appear in a wide variety of places in computing. Often a restricted form of regular expressions is used known as "wildcards". Wildcards only allow simple pattern matching – commonly the "*" character in a wildcard means "match zero or more characters" (from the alphabet) and the "?" character in a wildcard means "match any single character (from the alphabet). Wildcards like these are used in file managers like Windows Explorer – if you are searching for a filename, you might search for "*.txt" to return all files whose name starts with any sequence of characters, but ends with ".txt". Similarly if you are using the "dir" or "ls" command in MS-DOS or Unix respectively. The same wildcards are used in SQL (the language used to query and update databases), but with different notation. In SQL, the "%" character matches zero or more characters and the "_" character matches any single character. For example, you might use the following SQL statement to match names like "jonson", "jensen", "jonston", etc: SELECT * FROM people WHERE surname LIKE 'j_ns%n' Search-and-replace functions often include support for either simple wildcards or full regular expressions. Search-and-replace functions appear in editors (like Microsoft Word) and in command-line tools (like Unix "grep" which does searching). Programming languages often include support for using regular expressions when doing text processing, e.g. the Perl language has regular expressions built in, most other languages have regular expression libraries available, including Java (the java.util.regex package). And as we have seen, regular expressions are also used in XML Schema. (c) University of Technology, Sydney

DTDs versus XML Schema DTDs are not as expressive as XML Schema XML Schema allows a language designer to express more constraints than DTDs allow, e.g. "data types" associated with CDATA (xs:string, xs:integer, etc.) user-defined data types (simple types, complex types) regular expressions over CDATA (DTDs permit regular expressions over sequencing of elements, but not over CDATA content) Some of the advantages of XML Schema as compared with DTDs include: XML Schema allows character data (CDATA) to be assigned specific data types. In DTDs, there is no concept of data types and all CDATA is treated the same. For example, with a DTD, you cannot specify that a particular element value can only be a number. XML Schema supports user-defined data types. As well as having the built-in data types (string, decimal, integer, date, time, etc.), XML Schema allows a language designer to create their own data types and use them anywhere that a built-in type can be used. User-defined data types can be simple types (e.g. restriction on a set of values) or complex types (e.g. details of structures like addresses can be abstracted away as part of a complex type that can be reused in several places in a schema). XML Schema supports regular expressions to limit the character data that is possible, e.g. the telephone number example ("[0-9]{4}-[0-9]{4}"). In DTDs you can use regular expressions to control the sequencing and cardinality (number of occurrences) of elements (e.g. a fax may have one-or-more <to> elements), but not to restrict CDATA. XML Schema also allows you to specify sequencing and cardinality of elements, but does it in a different way – you can specify the minimum and maximum number of types an element can occur, which is more powerful than DTDs because it allows you to specify constraints like: "a fax may have between one and five recipients". It is also worth noting that an advantage of XML Schema is that it schemas are XML documents themselves. This means that they can be processed using normal XML tools like XML parsers, and tools that apply various transformations to XML documents. DTDs have their own unique syntax that is not XML. (c) University of Technology, Sydney

Next week… Agenda XML Introduction XML Namespaces DTD/XML Schema
 XML programming XSLT/XPATH Next week… (c) University of Technology, Sydney

XML Programming One of the benefits often stated for XML is that it is a language that is both: human readable (more or less) machine parseable XML is a context-free language Prior to XML, programmers would invent their own custom data representation languages e.g. for saving info in configuration files programmer must write a parser for their language a benefit of XML is that the parsers already exist When you consider what an XML document really is, it doesn’t seem very impressive. At its heart, it is really just a syntax for defining a set of elements where each element may have some values associated with it. However one of the main benefits of XML is that it is a standard syntax for describing structured information. Prior to XML, whenever programmers needed to store or exchange data, they would invent their own languages for doing so. For example, if your application wanted to store the number of users allowed to access the application at once, you might use: a comma-separated value (CSV) format: maxusers,5 a tab-separated value format: maxusers 5 a name=value format: maxusers=5 or even some binary format that is not human-readable If a programmer chose one of these formats, he/she would also have to write a custom parser for this particular language format (obviously the example languages shown would be easy to parse, but it would still require work on behalf of the programmer). XML has the advantage that it is both human readable and machine parseable (because it is a context-free language). Moreover, a benefit of using XML is that there are existing parsers available that programmers can reuse in their applications, rather than have to write their own parsers from scratch. (c) University of Technology, Sydney

XML parsers There are 3 three fundamental parser types.
Document model parser (DOM) Reads entire document into data structure in memory. Uses significantly more memory Convenient as whole document can be accessed as data structure Push parser (SAX) Event driven. As it parses document, notifies a listener object. Not intuitive programming, cannot navigate document Pull parser (sTax/XMLPull) Reads a little bit of a document at once. Your application drives the parser by requesting the next piece. As already mentioned, there are existing parsers available for XML documents. This means that whenever your application needs to read in an XML document and extract some information from it, rather than writing your own parser, you can reuse an existing one. There are two different kinds of XML parser available: DOM and SAX. Note that these are names of general “kinds” of parsers – there are implementations of DOM parsers and SAX parsers available for many different programming languages. The DOM and SAX parsers work quite differently, as described in the slide. However, in principle, either kind of parser can be used for any XML document – they can both be used to accomplish the same tasks. However in particular situations, one kind of parser might be preferred over the other, either for ease of coding, or efficiency of processing. The forthcoming slides show examples of code used for using DOM and SAX parsers in Java. (c) University of Technology, Sydney

XML parsers These are language-independent API specs
There are DOM & SAX implementations for virtually every language e.g. Java, Javascript, .NET etc E.g: Javascript DOM parser built into all modern browsers  how we get WEB 2.0 applications using AJAX (Asynchronous Javascript and XML) The Java API is called JAXP – Java API for XML Processing javax.xml.* packages (c) University of Technology, Sydney

Java and XML JAX Pax (Java XML Pack) is an all-in-one download of Java technologies for XML: JAXP – Java API for Parsers JAXM – Java API for Messaging JAXB – Java API for Binding JAXR – Java API for Registries JAX-RPC Java API for XML RPC JAX-WS – Java API for Web Services It is a bunch of APIs for developing and managing XML-based applications Allows production of Java classes from XML schemas JAXP is a lightweight API library for parsing and transforming XML documents. JAXM is the Java Technology for sending and receiving SOAP messages JAXB creates XML-to-Java binding schema (XJS) which maps XML elements to Java objects JAXR API for accessing various Business Service Registries JAX-RPC API to build and invoke Web Services (c) University of Technology, Sydney

JAXP APIs for accessing XML documents
SAX: Lightweight, event driven parser “when you see element x do action y” DOM: parses entire document into an object tree in memory In-memory Representation DOM Parser XML Document (c) University of Technology, Sydney

Document Object Model (DOM)
DOM treats XML document as a tree Every tree node contains one of the components from XML structure (element node, text node, attribute node etc.) Parser creates a (tree) data structure in memory We get what we need by traversing the tree Allows to create and modify XML documents DOM - a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents (c) University of Technology, Sydney

DOM Tree (c) University of Technology, Sydney
Element - represents an element in an XML document, may have associated attributes or text nodes Attr- represents an attribute in an ELEMENT object Text - represents the textual content of an ELEMENT or ATTR (c) University of Technology, Sydney

Sample DOM tree Document Name: #document Element Name: addressBook
Attribute Name: list Value: "personal" Element Name: entry Element Name: name Element Name: address Element Name: phone This slide shows a sample DOM tree that might be in memory. Note the following points. The top of the tree is always a node whose node type is “Document”, and it represents the overall document. There will always be exactly one child of the “Document” node, and it will be a node of type “Element” – this is the root element of the XML document. Attributes appear as children of a particular element, and can be processed as children. Notice how text is represented – if you have an element called <phone> and the value of that element is “ ”, then they are represented as two different nodes in the tree – an Element node that contains the element name, and a Text node that contains the value. Text Name: #text Value: "Ms Smith" Text Name: #text Value: "1 Central..." Text Name: #text Value: " " (c) University of Technology, Sydney

Use DocumentBuilderFactory to create a DOM parser
Java DOM Use DocumentBuilderFactory to create a DOM parser DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); Use DocumentBuilder to read the XML document DocumentBuilder builder = factory.newDocumentBuilder(); Document document = builder.parse(…); (c) University of Technology, Sydney

Simple DOM Example public class EchoDOM{ static Document document;
public static void main(String argv[]){ DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); try { DocumentBuilder builder = factory.newDocumentBuilder(); document = builder.parse( new File("test.xml")); } catch (Exception e){ System.err.println("Sorry, an error: " + e); } if(document!=null){ scanDOMTree(document); - Obtain a DOM parser object (c) University of Technology, Sydney

Java DOM processing Once we parse() a document this is now in memory
We need to TRAVERSE the tree to read the document One technique is to use Recursion i.e. The method will call itself We then make decisions based on what the current node type is eg: DOCUMENT, ELEMENT, TEXT etc (c) University of Technology, Sydney

Scan(Document) Traverse DOM tree - 1 Document Name: #document
Element Name: addressBook Attribute Name: list Value: "personal" Element Name: entry Element Name: name Element Name: address Element Name: phone Text Name: #text Value: "Ms Smith" Text Name: #text Value: "1 Central..." Text Name: #text Value: " " (c) University of Technology, Sydney

Scan(addressBook) Traverse DOM tree - 2 Document Name: #document

Scan(entry) Traverse DOM tree - 3 Document Name: #document

Scan(name) Traverse DOM tree - 4 Document Name: #document

Recursive Tree processing
private static void scanDOMTree(Node node){ int type = node.getNodeType(); switch (type){ case Node.ELEMENT_NODE:{ System.out.print("<"+node.getNodeName()+">"); NamedNodeMap attrs = node.getAttributes(); for (int i = 0; i < attrs.getLength(); i++){ Node attr = attrs.item(i); System.out.print(" "+attr.getNodeName() "=\"" + attr.getNodeValue() "\""); } NodeList children = node.getChildNodes(); if (children != null) { int len = children.getLength(); for (int i = 0; i < len; i++) scanDOMTree(children.item(i)); System.out.println("</"+node.getNodeName()+">"); break; (c) University of Technology, Sydney

A "node" can be an element, attribute etc
DOM nodes A "node" can be an element, attribute etc // Check the type of node if (n.getNodeType() == Node.ELEMENT_NODE) { // process an ELEMENT String elementName = n.getNodeName(); } else if (n.getNodeType() == Node.ATTRIBUTE_NODE) { // process an ATTRIBUTE String attributeName = n.getNodeName(); String attributeValue = n.getNodeValue(); } else if (n.getNodeType() == Node.TEXT_NODE) { // process a TEXT node String textValue = n.getNodeValue(); } // etc. Your application should know what to expect may not need to enumerate all cases Note that the code on the previous slide refers to “Node” quite a lot. With the DOM parser, “Node” is the generic term to refer to any kind of XML markup – a Node may in fact be an “Element” or an “Attribute” or “Text”, etc. When the code gets into the detail of processing a single node in the tree, with DOM it will often use a nested “if” statement as shown in the slide, to determine which kind of node is actually being processed. Different node types may have names and/or values, e.g. element nodes always have a name, but do not have a value; attribute nodes have both a name and a value; text nodes have a value, but do not have a name. Note that a common mistake with DOM is to assume that an element node has a value associated with it. For example, if you consider the XML: <phone> </phone> you might think that this is represented in DOM as a single element, where the element name is “phone” and the element value is “ ”. In fact this is not the case. This information is represented by an element with the name “phone” and that element has a child node (which is a text node) and this child node has the value “ ”. See more on the next slide. Also note that if your application knows the format of the XML document it is reading, often you will not need to list all of the possible cases for a node type. For example, if you are looking to find the value of a single element within the whole XML file, your code might be tailored to search the tree for just this element and therefore can be simpler than if your code was trying to use the entire document. (c) University of Technology, Sydney

DOM Node Types (c) University of Technology, Sydney
This slide shows a table of all of the different DOM node types. The column headed “Interface” is the name of the node type. “nodeName” and “nodeValue” show whether that particular node type has a name and/or a value associated with it. The column headed “attributes” shows whether the node type can have attributes associated with it – in XML, only entities can have attributes, as shown in the table. This table is taken from the Java API documentation for the interface “Node”: (c) University of Technology, Sydney

Common DOM Methods Node.getNodeType() - the type of the underlying object, e.g. Node.ELEMENT_NODE Node.getNodeName() - value of this node, depending on its type, e.g. for elements it‘s tag name, for text nodes always string #text Node.getFirstChild() and getLastChild()- the first or last child of a given node Node.getNextSibling() and getPreviousSibling()- the next or previous sibling of a given node Node.getAttributes()- collection containing the attributes of this node (if it is an element node) or null (c) University of Technology, Sydney

Common DOM methods (2) Node.getNodeValue()- value of this node, depending on its type, e.g. value of an attribute but null in case of an element node Node.getChildNodes()- collection that contains all children of this node Node.getParentNode()-parent of this node Element.getAttribute(name)- an attribute value by name Element.getTagName()-name of the element Element.getElementsByTagName()-collection of all descendant Elements with a given tag name (c) University of Technology, Sydney

Common DOM methods (3) Element.setAttribute(name,value)-adds a new attribute, if an attribute with that name is already present in the element, its value is changed Attr.getValue()-the value of the attribute Attr.getName()-the name of this attribute Document.getDocumentElement()-allows direct access to the child node that is the root element of the document Document.createElement(tagName)-creates an element of the type specified (c) University of Technology, Sydney

Searching in DOM You can search for element tags:
NodeList Document.getElementsByTagName() abook = mydoc.getElementsByTagName("addressbook") addr = abook.getElementsByTagName("address") Also can search by XML id attribute: Element person = mydoc.getElementById("1") would return the ONE element with the id="1" e.g. <person id="1"> See also the javax.xml.xpath API (later…) (c) University of Technology, Sydney

Create Elements/Attributes
Create element; then append to parent // Create empty XML Document Document docXMLDoc = domBuilder.newDocument(); // Create a person element Element elmPerson = docXMLDoc.createElement("person"); // Create name attribute and set its value to “Jeff” elmPerson.setAttribute("name", "Jeff"); // Attach person element to the XML Doc docXMLDoc.appendChild(elmPerson ); Resulting XML: <person name="Jeff" /> (c) University of Technology, Sydney

SAX basics SAX is a very different kind of parser to DOM
It relies heavily upon a content handler class that the programmer (you!) must define the content handler class needs to implement certain methods, e.g. startDocument() and endDocument() startElement() and endElement() characters() ... others ... as the SAX parser is reading in the XML document, when it encounters various kinds of markup, it will call your methods, defined in your content handler class Based on the material we have covered so far about trees and parsers, DOM is a fairly natural progression. SAX on the other hand is a very different kind of parser – for a start, it is not based on trees at all. SAX is based on the concept of an “event handler”. As the XML document is being read in from the disk, various “events” are generated. Each time an event is triggered, some piece of code is executed to “handle” the event. The event-handler concept is not unique to XML parsing – it appears in various kinds of applications, most notably graphical (GUI) applications where events are triggered when the user clicks buttons and moves the mouse. In SAX, the events are triggered during the parsing process. In SAX, the event handlers are defined by the programmer – you. They are contained in a special class that you must define, known as the content handler class. Inside the content handler class, you are required to implement methods that correspond to the different kinds of events that might be triggered, such as: when the parser encounters the start of the document (or the end of the document); when the parser encounters the start of a new element (or the end of an element); when the parser encounters some text data. With DOM, the entire XML document is read in by the parser before your code can get access to it. However with SAX, your code to process the XML document executes while the parsing is taking place. (c) University of Technology, Sydney

Simple API for XML (SAX)
Event driven processing of XML documents Parser sends events to programmer‘s code Programmer decides what to do with every event SAX parser doesn't create any objects at all, it simply delivers events But impossible to move backward in XML data Impossible to modify document structure Fastest and least memory intensive way of working with XML (c) University of Technology, Sydney

Parsing XML document with SAX
All examples for this part based on: Import necessary classes: import org.xml.sax.*; import org.xml.sax.helpers.DefaultHandler; import javax.xml.parsers.SAXParserFactory; import javax.xml.parsers.SAXParser; (c) University of Technology, Sydney

Create SAX Parser SAXParserFactory -creates an instance of the parser factory SAXParserFactory factory = SAXParserFactory.newInstance(); SAXParser - defines several kinds of parse() methods. Every parsing method expects an XML data source (file, URI, stream) and a DefaultHandler object (or object of any class derived from DefaultHandler) SAXParser saxParser = factory.newSAXParser(); saxParser.parse( new File("test.xml"), handler); (c) University of Technology, Sydney

Handle SAX events Among others:
startDocument – receives notification of the beginning of a document endDocument – receives notification of the end of a document startElement – gives the name of the tag and any attributes it might have endElement – receives notification of the end of an element characters – parser will call this method to report each chunk of character data (c) University of Technology, Sydney

! SAX simple example startDocument() startElement(addressBook)
<?xml version="1.0" > <addressBook> <entry list="personal"> <name> Ms Smith </name> … & so on </addressBook> startDocument() startElement(addressBook) startElement(entry) Attributes[list] startElement(name) Characters(Ms Smith) endElement(name) endDocument() ! Caution: your code MAY get extra characters() events due to line feeds, blank spaces etc (c) University of Technology, Sydney

DefaultHandler class { public void startDocument() throws SAXException
public class EchoSAX extends DefaultHandler { public void startDocument() throws SAXException {//override necessary methods} public void endDocument() throws SAXException {...} public void startElement(...) throws SAXException{...} public void endElement(...) } (c) University of Technology, Sydney

Overriding of methods public void startDocument() throws SAXException{
System.out.println("DOCUMENT:"); } public void endDocument() throws SAXException{ System.out.println("END OF DOCUMENT:"); public void endElement(String namespaceURI, String sName, String qName) throws SAXException { System.out.println("END OF ELEMENT: "+sName); (c) University of Technology, Sydney

Parsing of the XML document
DefaultHandler handler = new EchoSAX(); // Use the default (non-validating) parser SAXParserFactory factory = SAXParserFactory.newInstance(); //Set validation on factory.setValidating(true); // Parse the input SAXParser saxParser = factory.newSAXParser(); saxParser.parse( new File("test.xml"), handler); (c) University of Technology, Sydney

Notes on SAX SAX doesn't keep track of the document structure
That's up to the programmer e.g. all textual data will be processed by the characters() method There is nothing attached to the character data to indicate which element it belongs to You have to keep track of "where you're up to" See the sample reading for one way to do this in a SAX content handler, using a stack Philipp K. Janert, Simple XML Parsing with SAX and DOM, ONJava.com [Internet]. The DOM parser builds a tree – the structure of the document is maintained in the structure of the tree. However the SAX parser does not store any information in memory at all – it simply reads the document and triggers events. If the application needs to know about the document structure, then it is up to the programmer to write some code that will store the document information in a suitable structure. For example, if your application was reading in an XML document that represents an address book, containing several entries, you might build your own data structure in memory that consists of an array of records, where each record is one entry in the address book. Each “entry” might contain a name, an address and a phone number. With SAX, it is up to the programmer to create such a data structure in memory, and to fill it with data from the XML document. With SAX, you can invent your own data structures that best represent the kind of data you need to store (unlike DOM which reads every XML document in the same way – into a tree structure). (c) University of Technology, Sydney

StAX JSR173 "Streaming API for XML" (StAX)
Similar to SAX, but the PROGRAMMER "pulls" each XML component and acts accordingly 2 versions – cursor API and iterative API Cursor API: XMLStreamReader Treats XML inputstream as set of tokens You parse the tokens per element, attribute etc. Iterative API: XMLEventReader Parser treats XML stream as list of "XML Events". You then check each event (c) University of Technology, Sydney

Using sTAX Cursor API Using Cursor API create a parser object
create an input stream for the parser loop through each returned XML element using the next() method retrieve the tag name and value by using the getLocalName() and getText() methods retrieve attributes by using the getAttributeName() and getAttributeValue() methods (c) University of Technology, Sydney

Example <?xml version="1.0"?> <catalog> </catalog>
START_DOCUMENT <?xml version="1.0"?> <catalog> <title id=“1”> <name>J2ME </name> </title> </catalog> START_TAG catalog START_TAG title attribute: id = 1 START_TAG name TEXT J2ME END_TAG name END_TAG title END_TAG catalog END_DOCUMENT (c) University of Technology, Sydney

Sample code Implement as a simple loop:
while (reader.hasNext()) { switch (reader.next()) { case START_ELEMENT: // <name> println(reader.getLocalName()); // "name" println(reader.getText()); // "J2ME" case END_ELEMENT: // </name> & so on… } (c) University of Technology, Sydney

Creating XML with Stax Similar to XMLStreamReader, use XMLStreamWriter
Just use methods such as writeStartElement("name") writeEndElement("name") writeCharacters("J2ME") Finish off with flush() & close() (c) University of Technology, Sydney

Which parser to use? DOM SAX StAX Want to read whole document AT ONCE
Interested in part of document only Controlled by parser library Programmer control CPU/Memory Hog lean minimal. Can be used on mobile phones Easy to code – single object Hard to code & debug – passed events to handle Medium to code. Intuitive alternative to DOM/SAX Can generate XML Read-only So ... if there are three types of parser that can be used, the obvious question is: which one is better? The answer of course is: neither. That’s why there’s 3 of them. DOM is better for some kinds of applications, while SAX/StAX is better for other kinds of applications. DOM is useful for: cases where you need to process the whole document anyway the documents you are processing are reasonably small in size SAX/StAX is useful for: cases where you are only interested in part of the document you have a very large document, and processing with DOM would use too much time/memory Generally, SAX/StAX is considered to be more powerful and flexible than DOM, but the price is that SAX is also considered to be harder to program. DOM is easier for beginners to use, but if you eventually start writing applications that need to process large documents quickly, DOM’s shortcomings will become apparent. StAX is generally easier to code than SAX – you are in control of pulling the XML lines in – much like using inputStreamReaders to read in plaintext documents!! (c) University of Technology, Sydney

JAXB Alternative to using a parser is to use JAXB (Java Architecture for XML Binding) JSR31 for JAXB 1.0. JSR222 for JAXB 2.0 Uses XML schema to generate Java classes Use xjc to create java classes from XML schema Use schemagen to create XML schema from Java (need tags) You will see this being used later in web services tutorial Alternatives include xmlBeans (apache Axis) (c) University of Technology, Sydney

JAXB 2.0 JAXB applications need to runtime framework too…. schemagen
xjc (from Sun javaee 5 tutorial) (c) University of Technology, Sydney

JAXB binding runtime Your application accesses JAXB objects like javabeans. Use getter/setters to access elements, attributes etc Use ObjectFactory to create the JAXB object Use UnMarshal class to read XML document (from Sun javaee 5 tutorial) (c) University of Technology, Sydney

JAXB Code sample From /pub/dca/lab05/sample2 directory…
xjc sample2.xsd  AddressList.java & ObjectFactory.java JAXBContext jc = JAXBContext.newInstance(ObjectFactory.class); Unmarshaller u = jc.createUnmarshaller(); AddressList a = (AddressList) u.unmarshal( new FileInputStream("sample2.xml")); // now you have a javabean containing address elements… List<AddressList.Address> al = a.getAddress(); for (Address address : al) { System.out.println(address.getName()); } (c) University of Technology, Sydney

Agenda XML Introduction XML Namespaces DTD/XML Schema XML programming
 XSLT/XPATH (c) University of Technology, Sydney

XSLT & XPATH Often we need to transform XML to different formats
eg: from one Schema to another eg: from XML to XHTML eg: from XML to text  Use Extensible Stylesheet Language Transformations (XSLT) This also requires using XPATH expressions to match elements and attributes (c) University of Technology, Sydney

XML Data File: CD collection
<?xml version="1.0"?> <catalog> <cd> <title>Empire Burlesque</title> <artist>Bob Dylan</artist> <country>USA</country> <company>Columbia</company> <price>10.90</price> <year>1985</year> </cd> … </catalog> XML Formatting Problem As we have seen, XML is a language that is primarily about representing or describing structured information. XML documents do not generally have built-in information about how the information should be displayed, unlike HTML for example where the HTML tags correspond to particular display styles. XML separates content from presentation (i.e. how the content is displayed). This was one of the reasons that XML was developed – HTML documents generally mix up the structure of the information with its presentation. XML was introduced to make this distinction clear. XML is usually used "behind the scenes", e.g. when two applications need to exchange information, or when your application needs to store information in a format that can be easily retrieved later. However some XML applications do involve presenting the information to a user, for example, displaying the data from an XML document in a web browser. For these applications, XML alone is not sufficient – there needs to be some way of describing how the XML should be displayed on the screen. XSLT fulfils this requirement. The slide shows an extract of the XML document used in this lecture. The examples used here are based upon the XSL tutorial at: (c) University of Technology, Sydney

Formatted XML Data (c) University of Technology, Sydney
What we are trying to achieve with this example is to take an XML document (as shown on the left in the slide) and convert it into a HTML representation (as shown on the right on the slide). In this example, we wish to display the title and artist of each CD in an HTML table, with a heading at the top. We are also choosing to ignore some information, namely the country, company, price and year elements. XSL (Extensible Stylesheet Language) is the programming language we will use to do the conversion. XSL can be used to convert information from XML into other formats (here it is HTML). XSL can be used to add presentation information to an XML document, e.g. here we have added a heading, "My CD Collection", and column headings in the table, "Title" and "Artist" with a green background. XSL can be used to filter the information that we wish to display, e.g. here we are only displaying the title and artist information. From the same XML document, we could create many different presentation styles, by using different XSL stylesheets. For example, here we have used a table, but with a different XSL stylesheet we could produce a bullet-list of just titles, or perhaps put the full details of each CD in a box/table of its own. We could add different headers or footers. All from the same basic XML document. Note that XSL is not limited to producing HTML. XSL can be used to produce almost any output format desired. From the same XML document, you can produce a HTML version, a PDF version and a version in Microsoft Word or Excel. Or you could produce one version in HTML for the web, and another version for a mobile phone display using WAP standards. (c) University of Technology, Sydney

XSLT Formatting XSL – Extensible Stylesheet Language XSL XML or XHTML
XSLT processor XML or XHTML XSL XSL – Extensible Stylesheet Language XSL An XSLT processor is the piece of software that does the conversion from XML into HTML in our example. The XSLT processor takes two input: an XML document an XSL stylesheet and produces another XML document as output (or in the case we are interested in, an XHTML document (HTML that is XML-compliant). XSLT = XSL Transformations XSLT is a subset of XSL – XSL includes not only the transformation aspect, but also a vocabulary for describing formatting (fonts, line spacing, borders, etc). Here we are mainly interested in XSLT, not the rest of the XSL standard. There are a range of XSLT processors available. The best known one is Xalan from the Apache group. Others include XT and Saxon. (c) University of Technology, Sydney

XSLT Tree conversion XSLT converts a source tree into a result tree
What an XSLT processor actually does is to convert a source tree into a result tree. The result tree is constructed by using the rules contained in the XSL stylesheet file that is used as input. The XSLT processor reads in the original XML document and builds a tree structure (e.g. using the DOM parser). It then applies the transformation rules produce the result tree. And finally it serializes the result tree to produce the output file. Serialization is a "flattening" process, whereby the result tree is traversed using a pre-order traversal and each node written out sequentially. In understanding how XSLT processors work, think of them as converting one tree into another tree, not converting one file into another file. Diagram source: Paul Grosso and Norman Walsh. XSL Concepts and Practical Use. Presentation at XML Europe 2000, Paris, France, 2000, [Internet] (c) University of Technology, Sydney

XSL Structure <?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl=" <xsl:template match="..."> </xsl:template> ... </xsl:stylesheet> Every XSL document starts with a <stylesheet> tag, belonging to the XSL namespace (i.e. the namespace identified by " In this example we use the namespace prefix "xsl:", which is common, but of course you could use any prefix you like as long as it maps to the right namespace identifier (the URL). So the overall XSL document consists of information between: <xsl:stylesheet> ... </xsl:stylesheet> The basic content inside an XSL stylesheet is a set of templates. Each template describes one formatting rule, e.g. corresponding to one element. For example, there may be a template that specifies what transformation to apply to the <cd> elements in the input. There might be another template that specifies how to transform <title> elements, etc. Each template has a "match" attribute that specifies which part of the input tree the template applies to. In the simplest case, it might just apply to all elements with a particular name (e.g. all <title> elements). Or the template could match a more complex scenario, e.g. "all <title> elements that have a <cd> element as its immediate parent". Later we will look at the different kinds of matching rules. Also note that an XSL stylesheet is itself an XML document. (c) University of Technology, Sydney

Modelling XML Documents
<title>Empire Burlesque</title> title “Empire Burlesque” Element Node Text Node Element Name: title Value: null Text Name: #text Value: "Empire Burlesque" XSL Processing Now we will consider how an XSLT processor works, including an example. But first, recall the tree representation of an XML document. In particular, recall that when you have an element that contains some text, it is represented as two different nodes in the tree: an element node, that specifies the name of the element (and has no value associated with it); a text node, that stores the value of the text contained in the element. (c) University of Technology, Sydney

XML Document Tree “USA” “Bob Dylan” “10.90” “1985” year catalog cd
“Empire Burlesque” title artist country “Columbia” company price root This slide shows a tree representation of the input XML document used in the example. There is a root node of the tree (the "Document" node), and under that the structure of the elements and text nodes in the document. An XSLT processor begins by reading the input XML document into memory and constructing a tree structure as shown. [Aside: the XSLT processor may in fact use the SAX parser and not read the whole document into memory at once. But regardless, for the purposes of understanding how XSLT processors work, it is important to think of a logical tree structure in memory.] (c) University of Technology, Sydney

Tree Walking If the current node has children, then the first child is chosen If the current node has following siblings, then the next sibling is chosen If the current node does not have a following sibling, then the following sibling with following siblings is chosen If no such ancestors exist, then the process is complete After reading the XML input document into a tree structure in memory, an XSLT processor will then begin to visit each node in the tree by a process it calls "tree walking". It starts at the root node of the tree, and treats the root node as the "current node". It checks to see if there is an XSL template that matches this node (the root node). If there is a matching template, it applies the transformation shown in the template. After dealing with the current node, the tree walking process then proceeds to choose which node it will visit next. It uses the rules on the slide for choosing which node will come next. In fact, this is a pre-order traversal of the tree. As it visits each node, it searches the XSL stylesheet for matching templates. If there is a matching template, it applies the specified transformation before moving to the next node. (c) University of Technology, Sydney

XLST Tree Walking 1 root catalog 2 3 16 cd cd title 4 artist 6 country
8 company 10 price 12 year 14 title 17 artist 19 “Empire Burlesque” “Empire Burlesque” “Bob Dylan” “Bob Dylan” 5 7 9 11 13 15 18 20 “USA” “Columbia” “10.90” “1985” This slide illustrates the order in which the nodes are visited by the tree walking algorithm used by an XSLT processor. A pre-order traversal is conducted, visiting the nodes in the order shown by the numbers on the slide. Note that when there is an element containing some text, e.g. <title>Empire Burlesque<title> ... the element node and the text node are visited separately. (c) University of Technology, Sydney

XSL Templates The “root” element <?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl=" <xsl:template match="/"> Some Output </xsl:template> </xsl:stylesheet> The “root” element Now we begin to look at some XSL templates and how they work. Over the next few slides we will work on building up a complete example of an XSL stylesheet that can be used to transform our original XML document (CD collection) into a HTML table format as shown earlier. <xsl:template match="/"> Some Output </xsl:template> To begin with, we look at a very basic template. This template will match the root node of the document, identified by the pattern "/". The root node is always the one that will be encountered first, in the pre-order traversal tree-walking process used by an XSLT processor. In this case, when we match the root node of the document, we simply want to place some static text into the result tree – the words "Some Output". If we used this stylesheet, we could supply any XML document as input and the output would always be the same – just the words "Some Output". (c) University of Technology, Sydney

XSL Text Output <?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl=" <xsl:template match="/"> <html><body> <h2>My CD Collection</h2> <table border="1"> <tr bgcolor="#9acd32"> <th>Title</th> <th>Artist</th> </tr> </table> </body></html> </xsl:template> </xsl:stylesheet> Now we have seen that we can produce static text as output. If we are trying to convert our input XML document into HTML, naturally we want to be generating HTML as output. This slide is very similar to the previous one – all the output we are generating is static (i.e. it does not depend upon the input document). The difference is that now the output is HTML. In terms of tree operations, what this XSL template will do is begin to construct a structured result tree as follows (excluding attributes): (c) University of Technology, Sydney root html body h2 "My CD Collection" table tr td "Artist" "Title"

Invoking another template (1)
<?xml version="1.0"?> <xsl:stylesheet version="1.0" xmlns:xsl=" <xsl:template match="/"> <html><body> <h2>My CD Collection</h2> <table border="1"> <tr bgcolor="#9acd32"> <th>Title</th> <th>Artist</th> </tr> <xsl:apply-templates/> </table> </body></html> </xsl:template> </xsl:stylesheet> So far, the tree walking algorithm has never left the root node. In the previous examples, the tree walking algorithm visited the root node, generated some output, and stopped. To be useful, the tree walking algorithm needs to visit the whole tree, or at least some subset that the XSLT programmer wants it to visit. This slide illustrates how to make the tree walking algorithm visit the "next" node in the tree, using the XSLT element: <xsl:apply-templates/> "apply-templates" instructs the XSLT processor to look for other templates to apply, and to apply them. It is one of the most common instructions found in XSL stylesheets. Also notice the position of where we placed the <xsl:apply-templates/> tag. It wasn't placed at the very end of our root template – it was placed somewhere in the middle (after the first row of the table). If any of the other templates, when applied, produce output, this is the position that the output will be placed. In our example, the other templates are going to produce additional table rows, so therefore we have used <xsl:apply-templates/> to mark the position of where these additional table rows will go in the overall output (result tree). Notice that right now, although we are telling the XSLT processor to apply further templates, we have not defined any further templates to apply. But the output generated will contain all the text from the original XML input document. This is because of the default behaviour of the XSLT processor – it will traverse the tree, and whenever it encounters an element node, it will not generate any output but will continue traversing. However if it encounters a text node, it will display the text and continue traversing. (c) University of Technology, Sydney

<?xml version="1.0"?> <xsl:stylesheet version="1.0" xmlns:xsl=" <xsl:template match="/"> <html><body> <h2>My CD Collection</h2> <table border="1"> <tr bgcolor="#9acd32"> <th>Title</th> <th>Artist</th> </tr> <xsl:apply-templates/> </table> </body></html> </xsl:template> <xsl:template match="cd"> <tr><xsl:apply-templates/></tr> </xsl:stylesheet> Now we introduce a second template. This template will match any occurrence of the <cd> element in the input XML document. Because we want each CD to be displayed on its own row in the resulting HTML table, when we encounter the <cd> element in the input, we want to produce a new table row in the output (using the <tr>...</tr> HTML tags). Notice also the use of <xsl:apply-templates/> again, because after processing the <cd> element, we want the tree walking algorithm to continue processing its children, and to add any extra output generated into the table row we have just created. In the output generated, now we will have a row created for each CD, but the rows will not have anything in them as we haven't yet created the templates to define what goes in each row. (c) University of Technology, Sydney

... <xsl:apply-templates/> </table> </body></html> </xsl:template> <xsl:template match="cd"> <tr><xsl:apply-templates/></tr> <xsl:template match="title"> <td><xsl:apply-templates/></td> <xsl:template match="artist"> </xsl:stylesheet> Here we further extend the example by creating templates for the <title> and <author> elements in the original XML document. All we are doing is creating one table cell (<td>...</td>) for the title, and another for the artist. Again notice that we are using <xsl:apply-templates/> to continue the tree walking. But in the original XML document, there weren't any children of <title> and <artist> were there? Yes! Remember that text nodes are treated as separate nodes. In this case, when we specify <xsl:apply-templates/>, what will be "applied" is the template for handling text nodes. Remember that the default behaviour of the XSLT processor for handling text nodes is to just print the text out. In the output generated, now we will have a table that contains each artist and title displayed on a separate row. However there will still be extra information in the output that we don't want – the country, company, price and year. Before we are finished, we need to stop this information from being printed out. (c) University of Technology, Sydney

Hiding content <xsl:template match="country"/>
... <xsl:apply-templates/> </table> </body></html> </xsl:template> <xsl:template match="cd"> <tr><xsl:apply-templates/></tr> <xsl:template match="title"> <td><xsl:apply-templates/></td> <xsl:template match="artist"> <xsl:template match="country"/> <xsl:template match="company"/> <xsl:template match="price"/> <xsl:template match="year"/> </xsl:stylesheet> Sometimes with XSLT, you don't want all of the information in the original input document to appear in the output – i.e. you want to hide some of the content. You might want to hide some individual elements (as we are doing in this example), or you might even want to hide an entire subtree of the input XML document. The mechanism for hiding content is to stop the XSLT processor from visiting the nodes you don't want it to visit. i.e. in the tree walking algorithm, you stop the processor from visiting children of the current node. The way to do this is simply NOT to use <xsl:apply-templates/> inside your template. So in the example shown, we have defined some empty templates, e.g. <xsl:template match="country"/> which of course is equivalent to: <xsl:template match="country"> </xsl:template> If we did call <xsl:apply-templates/> in the middle, then the tree walking would continue, the text node underneath <country> would be visited, and the country would be printed out. By not calling <xsl:apply-templates/>, we are instructing the XSLT processor not to visit the children of the <country> element – i.e. not to visit the text node, and therefore not to print it out. If you create an XSL template for a particular element, and don't call <xsl:apply-templates/>, then none of the children of that element will be visited, and no output will be generated from that section of the input XML document. (c) University of Technology, Sydney

XSL hiding content <xsl:apply-templates/>
1 10 <xsl:template match="cd"> <tr><xsl:apply-templates/></tr> </xsl:template> 9 8 7 6 5 4 3 2 <xsl:template match="title"> <td><xsl:apply-templates/></td> </xsl:template> <xsl:template match="artist"> <xsl:template match="country"/> <xsl:template match="company"/> <xsl:template match="price"/> <xsl:template match="year"/> Now we have seen all of the templates used in our example, we can examine the overall execution process. The operation of the XSLT processor is driven entirely by the structure of the input XML document. The elements in the input, and the order of those elements, will determine the order in which the XSL stylesheet templates will be "invoked". Tree walking always begins at the root node of the document. If there is an XML template that matches the root ("/"), then it will be applied first. This template should call <xsl:apply-templates/> so that the tree walking can continue. In our example, the next element to be encountered is <catalog>. The XSLT processor looks to see if there is a template that matches the <catalog> element. In this example, there is no template that matches the catalog element, so the default behaviour is for the XSLT process to just continue the tree walking (and not generate any output). The next element encountered is <cd>. The XSLT processor searches for a template that matches the <cd> element. It finds one, and applies it. The template for "cd" includes a call to <xsl:apply-templates/>, so the tree-walking continues and the children of <cd> are visited. The next element encountered is <title>. Again the process is repeated: search for matching template, apply template, continue walking. The same is true for the <artist> element. When the XSLT processor encounters the <country> element, it begins as usual for searching for a template that matches the <country> element. It finds one, and it applies it (generating no output). However now there is no call to <xsl:apply-templates/>, so the children of the <country> element are not visited. The same process happens for <company>, <price> and <year>. Notice carefully that the order the templates are invoked depends upon the tree walking of the input XML document. The templates are not invoked in the sequential order in which they appear in the XSL file – the XSL file does not contain a sequence of steps to follow. This illustrates why using XSLT is an example of declarative programming – you specify what you want (for each template). You don't specify how to achieve that, i.e. you don't provide a step-by-step sequence of instructions to solve the problem. (c) University of Technology, Sydney

Extracting content You can extract the value of an element by using the <xsl:value-of select=“expression” <xsl:template match=“title"> <td>artist=<xsl:value-of select="/catalog/cd/artist"/> <xsl:apply-templates/></td> </xsl:template> Expression can be any Xpath expression. to return attribute values (c) University of Technology, Sydney

XSLT logic elements Like programming languages, XSLT has some control logic elements <xsl:for-each select=“expression"> repeating text </xsl:for-each> <xsl:if test=“expression”> content </xsl:if> <xsl:choose> <xsl:when test=“expression1">content1</xsl:when> … <xsl:otherwise>content </xsl:otherwise> </xsl:choose> (c) University of Technology, Sydney

'Match' attribute The match attribute in the template element is an expression as defined by the XPath specification The patterns used in the match attribute are XPath expressions: <xsl:template match="/catalog/cd/*"/> So far, in the match attribute for a template, we have just put the names of elements. That works okay in simple cases, but what if you want to match only very specific cases. For example, consider the XML document: <cd> <title>Triple J Hottest 100 Volume 10</title> <track> <title>Chemical Heart</title> </track> </cd> How do you distinguish the CD title from the track title? XPath defines a language that can be used to address specific parts of an XML document. With XPath, you can identify a single node in the XML document, or can address a group of nodes (a nodelist). In XPath, you would identify the CD title and track title in the above example as "/cd/title" and "/cd/track/title" respectively. The notation is similar to directory paths in a filesystem. Absolute XPath expressions start with a "/" (the root node of the XML document), while relative XPath expressions do not start with a "/". Relative expressions are evaluated relative to the current node being processed. There are some basic symbols: / = root node = current node * = all child nodes of the current node On the forthcoming slides, we will look at examples of different kinds of XPath expressions you can use to create XSL templates that match specific parts of an XML document. (c) University of Technology, Sydney

XML – XPath XPath is a language for addressing parts of an XML document Uses a compact, non-XML syntax So we can use XPath within URIs and XML attribute values Operates on the abstract, logical structure of an XML document Gets its name from its use of a path notation for navigating through the hierarchical structure of an XML document (c) University of Technology, Sydney

XML – XPath (cont) XPath models an XML document as a tree of nodes.
There are different types of nodes, including element nodes, attribute nodes and text nodes. It defines a way to compute a string-value for each type of node. Some types of nodes also have names XPath fully supports XML Namespaces. Name of a node is modeled as a pair, called a Data Model Consists of (local part, (possibly null) namespace URI) (c) University of Technology, Sydney

Example: /catalog/cd/company “USA” “Bob Dylan” “10.90” “1985” year
“Empire Burlesque” title artist country “Columbia” company price root (c) University of Technology, Sydney

XML – XPath (cont) Main part of XPath is the expression.
An expression is evaluated to return an object of type: node-set (an unordered collection of nodes without duplicates) boolean (true or false) number (a floating-point number) string (a sequence of UCS characters) Some examples of locations paths: child::para selects the para element children of the context node child::* selects all element children of the context node child::text() selects all text node children of the context node (c) University of Technology, Sydney

XPATH examples match="X | Y | Z“ “match the elements X, Y and Z only”
Select alternative elements: match="X | Y | Z“ “match the elements X, Y and Z only” <A> <X > <Y > <B > <Z > This style of XPath expression allows you to create a template that matches an arbitrary set of nodes in the tree. The example shown could be represented in XML as: <A> <X>...</X> <Y>...</Y> <B>...</B> <Z>...</Z> </A> In this case we are only interested in matching the elements X, Y and Z, but not B. (c) University of Technology, Sydney

XPATH examples Select alternative elements example:
match="name | foreign" <para>The <foreign>de facto</foreign> stylesheet language is <name>XSLT</name>. </para> Result: The de facto stylesheet language is XSLT For example, we might want to create XSL templates to make all proper names italic and also to make all foreign words italic. We could do this with two XSL templates: <xsl:template match="name"> <i><xsl:apply-templates/></i> </xsl:template> <xsl:template match="foreign"> But an abbreviated way of doing it would be to use a single template, with an "alternative elements" XPath expression: <xsl:template match="name|foreign"> (c) University of Technology, Sydney

XPath examples Specific parent: match="P/X“
“match X if it has a parent of P” <P> <A > <X> <X> <Y > <Y > <Z > <Z > This style of XPath expression allows you to specify that we want to match certain nodes only if they have a specific parent node. In the example on the slide, we only want to match node "X" if it has a parent of "P". In the example on the right-hand side, node "X" has a parent of "A", and therefore is not matched. Note that this is a relative XPath expression (i.e. does not start with a "/"), so the particular parent-child relationship being matched may occur anywhere in the document. Also note that the node being matched is X, not P. (c) University of Technology, Sydney

XPath Examples Specific Parent example: match="warning/para"
<para>A normal paragraph.</para> <warning> <para>Warning paragraph one.</para> <para>Warning paragraph two.</para> </warning> <para>Another normal paragraph.</para> A normal paragraph. WARNING: Warning paragraph one. Warning paragraph two. Another normal paragraph. For example, we might want to match all <para> elements, but only if they have a parent of <warning>. Any <para> elements without this parent are not matched. We could use an XSL template as follows: <xsl:template match="warning/para"> <i><xsl:apply-templates/></i> </xsl:template> This would make warning paragraphs italic, but non-warning paragraphs would not be affected. (c) University of Technology, Sydney

XPath Examples Specific ancestor: match="A/P/X“
“match X, but only when X has a parent of P, and P has a parent of A” <A > <B > <A > <P > <P > <Q > <X > <X > <X > This style of XPath expression follows on from the specific parent example. Here we extend the concept to allow as many ancestors as necessary to be specified. In the example on the slide, we want to match occurrences of the X element, but only when X has a parent of P, and P has a parent of A. This is a relative XPath expression – the particular combination of ancestors may occur anywhere in the document (but the ancestors must be directly linked in the tree – no intermediate nodes). If you put a leading "/" on the XPath expression, it makes it an absolute expression, meaning that it starts from the root. So for example if we wrote: match="/A/P/X" - notice the initial "/" – it would mean that A would have to be the first element node of the document (underneath the root), and P and X would have to be underneath A. Although you can specify absolute paths in this way, relative paths are preferred if they provide sufficient detail for a match to occur. (c) University of Technology, Sydney

XPath Examples Specific ancestor example: match="intro/warning/para"
<intro> <para>An introduction paragraph.</para> <warning> <para>Introduction warning.</para> </warning> </intro> An introduction paragraph. Introduction warning. For example, here we want to match <para> elements that have a <warning> element as a parent, and where <warning> has an <intro> element as its parent. <xsl:template match="intro/warning/para"> <i><xsl:apply-templates/></i> </xsl:template> (c) University of Technology, Sydney

XPath Examples Unknown Ancestry match="A/*/X" match="A/P/X"
match="A/Q/X" match="A/R/X“ “match all X elements where A is their grandparent” match="A/*/X" <A> <A> <A> <P> <Q> <R> <X> <X> <X> In this XPath expression, we want to match all X elements where A is their grandparent. This is a shorthand for listing out all the possible combinations – or in some XML documents there may be too many combinations to list. Note that this particular expression only allows for A to be a grandparent of X, i.e. there must be exactly one level of depth in the tree in between them. (c) University of Technology, Sydney

XPath Examples Unknown Ancestry example: match="intro/*/para"
<intro> <para>An introduction paragraph.</para> <warning> <para>Introduction warning.</para> </warning> <note> <para>Introduction note.</para> </note> </intro> An introduction paragraph. Introduction warning. Introduction note. For example, we might want to match all <para> elements in the <intro> section that have one level of depth in between them. In this example, they could be para's inside a <warning> element or para's inside a <note> element. <xsl:template match="intro/*/para"> <i><xsl:apply-templates/></i> </xsl:template> (c) University of Technology, Sydney

XPath Examples Unknown ancestry examples: match="intro//para"
The previous example of unknown ancestry only allowed for exactly one level of depth between the ancestor (grandparent) and the child (grandchild). It is usually more common to need to generalise that to any level of depth in between the two nodes. In this example, we want to match all <para> elements that are descendants of the <intro> element, regardless of how many levels of depth there are between <intro> and <para>. (c) University of Technology, Sydney

XPath Examples Specific Children – Predicate (search) Filter
match="X[C]“ “match X when at least one of its children is C” <X> <X> <C> <E> <D> <D> So far we have looked at nodes requiring specific parents or ancestors. Now we reverse direction and look at nodes that have specific children. In the example shown on the slide, we want to match all occurrences of element X when at least one of its child elements is C. It is important to note that here the element we want to match is X. Using the previous examples, we might want to match all <warning> elements that contain at least one <para>. To do that, we could use: <xsl:template match="warning[para]"> WARNING: <xsl:apply-templates/> </xsl:template> Again, carefully note the distinction between: warning/para – match a para element that has warning as its parent warning[para] – match a warning element that has para as its child The notation of using square brackets [ ] in XSL is called a predicate filter. If we have an expression like X[...], it indicates that we want to match node X, but we want to apply some kind of extra condition on X. In this example, the condition is that it has a specific child. (c) University of Technology, Sydney

< XPath Examples Specific Siblings match="P[S]/X“
“matches all elements named X that have a sibling named S” <P> <P> <X> <X> < <S> <T> Combining the "specific parent" and "specific child" expressions, we can come up with an expression that matches specific siblings of the current node. In this example, "P[S]/X" can be broken down into its two parts: ????/X – match elements named X, where X has a specific parent P[S] – elements named P that have S as a child Putting these two together, "P[S]/X" matches all elements named X that have a sibling named S. Another way of phrasing this is that the specific parent of X must have a specific child named S. (c) University of Technology, Sydney

XPath Examples Attribute Context match="X[@A]“
“matches X that have an attribute A” eg: <X A=“something”/> <X A=“z’> <X B=“z”> <X > So far we have been matching patterns of elements. Now we include attributes. The example on the slide matches all elements named X that have an attribute A. It does not matter what the value of the attribute is, as long as the attribute exists. If an "X" element does not have any attributes, or it doesn't have one named "A", then there is no match. For example, if we had an XML document like: <para status="visible"> ... </para> <para align="center"> ... </para> <para> ... </para> and an XSL template like: <xsl:template ... </xsl:template> This template would match the <para> element that has a status attribute (the first one shown). Note that this is another example of the use of a predicate filter. says we want to match node X, but only if the condition inside the square brackets (the predicate) is satisfied. In this case, the predicate says that we require the element to have a specific attribute. (c) University of Technology, Sydney

XPath Examples Attribute Value Context match="X[@A='abc']"
“matches X that have an attribute A with a value of abc” eg: <X A=“abc”/> <X A=“abc”> <X A=“xyz”> This case extends the previous one – now we are extending the predicate filter to require not only a specific attribute name to be present, but also that it has a specified value. In the example shown on the slide, we want to match all occurrences of the X element where it has an attribute whose attribute name is "A" and whose attribute value is "abc". For example, if we had an XML document like: <para status="secret"> ... </para> <para status="visible"> ... </para> and an XSL template like: <xsl:template ... </xsl:template> This template would match the <para> element where the status attribute has the value "secret" (the first one shown). (c) University of Technology, Sydney

XPath Examples Specific Child NOT present: match="X[not(C)]“
“match X that do not have a child C” <X> <X > <D> <C> <E> <D> We can also use the keyword "not" inside a predicate filter to indicate when a particular condition should not be true. In this example, the predicate filter says that we want to match all occurrences of element X that do not have a child C. not() is an example of a boolean function that can be used in XPath expressions. There are some more examples of functions shortly. (c) University of Technology, Sydney

XPath Examples Specific Sibling NOT present match="P[not(S)]/X“
“matches X that do not have a sibling S” <P> <P> <P> <X> <X> <T> <Z> <S> <X> This style of XPath expression matches all occurrences of element X that do not have a sibling S. (c) University of Technology, Sydney

XPath Examples Specific Attribute NOT present match="X[not(@A)]“
“matches X that do not have an attribute named A” <X B=“q”> <X A=“q”> <X> This style of XPath expression matches all nodes named X that do not have an attribute named A. Note in this example that there are two cases that match: the element X has an attribute, but that attribute is not named A the element X has no attributes at all (c) University of Technology, Sydney

Priorities <template match="intro//para"> ... <template match="warning//para"> ... With priorities: <template match="intro//para" priority="1"> ... <template match="warning//para" priority="2"> ... <intro> <para>Intro paragraph.</para> <warning> <para>Intro warning.</para> </warning> </intro> ? Recall that an XSLT processor, as part of visiting the current node that it is processing in the tree, will search for XSL templates that match the current node. Until now we have only considered cases where exactly one XSL template matches a given node of the tree. However, because we can create quite general XPath expressions that will match a range of different nodes, it is possible to create two different XSL templates that will match the same node in an XML document. An XSLT processor will always apply only one template. So therefore there has to be some rules to specify which XSL template gets priority, in the event that more than one template matches a given element. In the example shown in the slide, we have two templates that will match the second (bold) <para> element in the document shown. Without any specified priorities (the default), it may be difficult to determine which of the two templates will actually be applied. As an XSL author, if you know that it is likely that two different XSL templates will match a particular node, you should explicitly specify the priority of each template, so it is clear which template should be applied by the processor. The template with the higher numbered priority will be applied (in this example, "warning//para"). (c) University of Technology, Sydney

Default Priorities PRIORITY ‘0.5’ <template match="warning/para"> PRIORITY ‘0’ <template match="para"> <template match="x:para"> (ie: name space specific) PRIORITY ‘-0.25’ <template match="x:*"> PRIORITY ‘-0.5’ <template match="*"> <template match="node()"> This slide illustrates the default priorities for various XPath expressions. "para" or "x:para" – if you specify the name of an element, with no specific ancestry, then the default priority is zero. "x:*" – if you specify all nodes that belong to a specific namespace, then the default priority is "*" – if you specify all elements in the document, the priority is Similarly if you specify all nodes in the document ("node()"). The difference between these two is that "*" only matches element nodes, while "node()" matches all types of nodes (including text nodes). "warning/para" – if you specify some ancestry information, e.g. specific parent, then the priority increases to 0.5. For example, if you had three XSL templates as follows: <xsl:template match="para"> </xsl:template> <xsl:template match="warning/para"> ... </xsl:template> <xsl:template match="*"> </xsl:template> All three of these could match a particular element. In this case, "warning/para" would have highest priority (and would execute). The next highest priority would be "para", and the lowest priority would be "*". (c) University of Technology, Sydney

However, not all XSLT processors might follow this standard behaviour
Conflict Resolution When two or more templates remain applicable after lower priority matches, the template closest to the end of the stylesheet should be chosen However, not all XSLT processors might follow this standard behaviour Every template in an XSL stylesheet has some specified priority, either one that is explicitly set by a programmer, or a default priority. If two templates still have the same priority, after taking into account both specified and default priorities, then you are a bad XSL programmer. You should always make it clear in the XSL stylesheets that you write which templates should have priority when there are multiple possible matches for a given node. If you don't, then the default behaviour is to apply the template that is closest to the end of the stylesheet (i.e. the latest one that appears in the file). However not all XSLT processors necessarily follow this behaviour, which is why it is important that as an XSL programmer, you use priorities to make it clear which template should be applied for each node. The order in which you place templates in an XSL file should not affect the way a document is processed – you should be able to put XSL templates in any order in the XSL file. (c) University of Technology, Sydney

XPath/XSLT functions You can define variables with XSLT.
<xsl:variable name=“x” select=“expression” /> You then refer to the variable as <xsl:variable name=“x”> Or from within any <xsl:> element as {$x} XPath also provides many functions: see see (c) University of Technology, Sydney

XPath/XSLT Functions <xsl:template match="intro">
There are <xsl:value-of select="count(para)"/> intro paragraphs. <xsl:apply-templates/> </xsl:template> There are 1 intro paragraphs. <intro> <para>Intro paragraph.</para> <warning> <para>Intro warning.</para> </warning> </intro> The final aspect of XPath that we will look at is functions. In fact, we have already seen one example of a function – the not() function used in predicate filters to negate various conditions. The XPath standard defines a core set of functions: Node set functions that operate on nodes or sets of nodes, e.g. last(), position(), count(), etc. String functions, e.g. concat(), substring(), string-length(), etc. Boolean functions, e.g. true(), false(), not() Number functions, e.g. sum(), floor(), ceiling(), round() XSLT also defines a few extra functions in addition to the XPath ones, e.g. format-number() to allow you to do specific number formatting (like currency values). In the example shown on the slide, we are using two new types of syntax: <xsl:value-of select="..."/> This allows you to insert into the output the value of a particular XPath expression. Here we are using it to print the result of a function (but you could also use it to print the value of a particular node). count(para) This is a node set function that counts the number of occurrences of the <para> element that are immediate children of the current node. Note that the current node is "//intro" (as specified by the template match). Note that it only says there is 1 paragraph – this is because there is only one <para> element that is a direct child of <intro>. If we wanted to count all <para> elements that occurred underneath the current node, at any level of depth, we could use the following, i.e. all <para> nodes where the ancestor, at any level of depth, is the current node ("."). <xsl:value-of select="count(.//para)"/> (c) University of Technology, Sydney

XPath/XSLT Functions <intro> <para>Intro paragraph.</para> <warning> <para>Intro warning.</para> </warning> </intro> <xsl:template match="intro//para/text()[1]"> <b><xsl:value-of select="substring(.,1,1)"/></b> <xsl:value-of select="substring(.,2)"/> </xsl:template> <b>I</b>ntro paragraph. <b>I</b>ntro warning. This is a more complex example using functions. What we are trying to do is to make the first letter of each paragraph in the introduction bold using the HTML <b> tag. The first challenge is to match "the first block of text in every paragraph in the intro". For this we use the XPath expression: intro//para/text()[1] text() is an XPath function that matches a text node (any text node). the [1] is used to specify which child element to use, if there are many children. In this case, we want the first text node of the paragraph (in this example, each paragraph only has one text node as a child, but it would be possible for a paragraph to have more than one). intro//para is a notation we have seen before for "unknown ancestry", i.e. para nodes that have intro as an ancestor, at any level of depth. Now that we have matched the first text node of every paragraph in the introduction, the next challenge is to separate the first character. For this we can use the XPath substring() function. substring(.,1,1) matches the current node (".") and takes a substring starting at the first character (1), and the length is 1 characters. substring(.,2) matches the current node ("." and takes a substring starting at the second character (2), with no specified length (i.e. the remainder of the string). (c) University of Technology, Sydney

Using XSLT XSLT is a declarative language
ie: it doesn’t execute, isn’t compiled Instead, we call programs/methods which use an XSLT template to do the transformation We can link XSLT directly to XML file Most browsers will automatically transform it! In Java, we can use javax.xml.transform package (c) University of Technology, Sydney

Linking XSLT with XML Data
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="cdcatalog1.xsl"?> <catalog> <cd> <title>Empire Burlesque</title> <artist>Bob Dylan</artist> <country>USA</country> <company>Columbia</company> <price>10.90</price> <year>1985</year> </cd> ... </catalog> Most browsers will use cdcatalog1.xsl to transform this XML file An XSL stylesheet does not have to be linked with an XML document. Often an XSLT processor will read the XML document and the XSLT document as two separate inputs and both are specified by the user or application. One good reason for not linking a stylesheet with an XML document is because often there will be many different stylesheets that can apply to the same XML document, e.g. for different output formats. However, web browsers are one application where it can be useful to link an XSL stylesheet with an XML document. If you load an XML document in a web browser (Internet Explorer 6 or Netscape 7), the XML will be displayed in "raw" format. However if you provide a link to an XSL stylesheet as shown in the slide, then the web browser will apply the stylesheet transformations to the XML document before displaying it. <?xml-stylesheet type="text/xsl" href="cdcatalog1.xsl"?> Note that this is an XML processing instruction. Also note that this only works properly in Internet Explorer 6 and Netscape 7. Earlier versions of these browsers had some support, but often with bugs (e.g. IE5). Because of the limited browser support, using XML/XSL in this way is currently not the way that web sites are built. Consequently, linking XSL with XML documents in this way is not often done. (c) University of Technology, Sydney

Using Java to transform XML
try { // Load StreamSource objects with XML and XSLT files StreamSource xmlSource = new StreamSource( new File("input.xml") ); StreamSource xsltSource = new StreamSource( new File("format.xslt) ); // Create a StreamResult pointing to the output file StreamResult fileResult = new StreamResult(new FileOutputStream("output.xml")); // Load a Transformer object and perform the transformation TransformerFactory tfFactory = TransformerFactory.newInstance(); Transformer tf = tfFactory.newTransformer(xsltSource); tf.transform(xmlSource, fileResult); } catch (Exception e) { e.printStackTrace(); } HINT: Use import javax.xml.transform.*; import javax.xml.transform.dom.*; import javax.xml.transform.stream.*; import java.io.*; (c) University of Technology, Sydney

Using the XPATH API Java 5 also has javax.xml.xpath package:
XPath xpath = XPathFactory.newInstance().newXPath(); InputSource in_src = new InputSource("address.xml"); NodeList nodes = (NodeList) xpath.evaluate("//address", input, XPathConstants.NODESET); Or to use with DOM document, just replace in_src with document Use the 3rd parameter to set the return value type NODESET  return NodeList STRING BOOLEAN NUMBER (c) University of Technology, Sydney

Using Java utility to transform..
Apache Xalan XML parser has a utility to transform XML using XSL (weblogic, websphere etc have this built in) java org.apache.xalan.xslt.Process –IN addressbook.xml –XSL address2html.xsl –OUT address.html Alternatives: msxsl.exe on windows xsltproc on linux (from libxslt) (c) University of Technology, Sydney

Servlet/ JSP + EJB/ Java/ Transform Web service XML/XSLT on J2EE HTML
XML + XSL allows us to separate layout and content HTML Servlet/ JSP XSL + Transform EJB/ Java/ Web service XML (c) University of Technology, Sydney

Programming XPATH Use java.xml.xpath package
Use PathFactory to create an Xpath object Use evaluate() method to filter results evaluate() can return String or NodeList Eg: if we have a DOM document, you can use: XPath xpath = PathFactory.newInstance().newXPath(); NodeList nodes = (NodeList) xpath.evaluate(“//cd”, document, XpathConstants.NODESET); (c) University of Technology, Sydney

Summary XML is a complex and rapidly growing field of technology
XML is widely used already, and is becoming ubiquitous within e-commerce developments, particularly for B2B systems This lesson has introduced to a wide range of current XML specifications and technologies – you will need to do further study to investigate these areas in more detail (c) University of Technology, Sydney

Debugging xsl:message("Now at somewhere")
(c) University of Technology, Sydney

Introduction to XML V7 march 2010

Similar presentations

Presentation on theme: "Introduction to XML V7 march 2010"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Introduction to XML V7 march 2010

Similar presentations

Presentation on theme: "Introduction to XML V7 march 2010"— Presentation transcript:

Similar presentations

About project

Feedback