The Extensible Markup Language (XML): Towards a World-Wide-Database. A presentation by Prashanth L. Narayanan and Abhishek Agarwal.

Course Contents: Day 1
- Brief background and origins of XML
- The need for and advantages of XML
- A simple XML document and its properties
- Well-formed and valid documents
- The Document Type Definition (DTD)
- Comparison with HTML

Course Contents: Day 2
- The Document Object Model (DOM)
- The Simple API for XML (SAX)
- DOM or SAX?
- Namespaces
- XML Parsers
- IBM’s XML4J Parser
- A Homepage Building Application
- The Extensible Stylesheet Language (XSL)
- Summary and concluding remarks

Brief background and origins XML was developed by an XML Working Group formed under the auspices of the W3C in 1996. XML is a simplified subset of the Standard Generalized Markup Language (SGML), which was standardized in 1986 and is based on the Generalized Markup Language invented at IBM in 1969. XML was simplified for more general use on the Web and as a data interchange format. The simplifications don't detract from XML’s extensibility, but make it easier for anyone to write valid XML. It has also been simplified so that a parser can easily and quickly verify that documents are well-formed and valid.

The design goals for XML
- XML shall be straightforwardly usable over the Internet.
- XML shall support a wide variety of applications.
- XML shall be compatible with SGML.
- It shall be easy to write programs which process XML documents.
- The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
- XML documents should be human-legible and reasonably clear.
- The XML design should be prepared quickly.
- The design of XML shall be formal and concise.
- XML documents shall be easy to create.
- Terseness in XML markup is of minimal importance.

XML in the global arena As a technology, XML is in the unique position of being embraced by all of the leaders in the computer industry. Also, many vertical industries are embracing XML for its ability to expedite the availability of their domain-specific information for internal and external use.

So what is XML anyway? As a formal definition, the W3C XML 1.0 specification describes XML as follows: The Extensible Markup Language describes a class of data objects called XML documents and partially describes the behaviour of computer programs which process them. XML documents are made up of storage units called entities, which contain either parsed or unparsed data; parsed data consists of characters, some of which form character data and some of which form markup. XML also provides a mechanism to impose constraints on the storage layout and logical structure.

XML for the layperson For the layperson XML can be said to be simply the assertion “Let's use tags to format data." In the platform-centric world, the meaning of data was encoded by the position of the bytes in the data records in memory, on disk or in the message. But XML changes that. It uses human-readable text tags to bracket the data and mark-up the meaning.

Data in HTML This is an HTML source code example, illustrating a product catalog entry as presented in a table on a corporate Web site:
Product ID | Description | Price
12345678-Q | Thinkpad 2000D | $999.99

Display in HTML Your HTML browser renders the example above like so:

Data in XML XML uses markup tags as well, but, unlike HTML, XML tags describe the content, rather than the presentation of that content. So, in the example above, instead of using <table>, we would define our own tag called <product>. Now we can find a specific product in all documents that follow this markup convention, because we can distinguish between products and the various other data that can be presented as HTML tables.

Data in XML In XML, you can also define your own attributes for tags. So the same example above could be coded in XML as: 12345678-Q Thinkpad 2000D $999.99 Ironically, by avoiding formatting tags in the data, and instead marking the meaning of the data itself with descriptive tags, we actually make it easier for a client to search various individual servers for a product and receive a product list tailored to the preferences of the user.
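As a rough sketch of the idea on this slide, the Python snippet below (standard library xml.etree.ElementTree) marks up the catalog entry with descriptive tags and then searches on them. The element and attribute names (catalog, product, id, description, price, currency) are assumptions made for illustration, not the markup from the original slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag and attribute names are illustrative assumptions.
CATALOG_XML = """
<catalog>
  <product id="12345678-Q">
    <description>Thinkpad 2000D</description>
    <price currency="USD">999.99</price>
  </product>
</catalog>
"""

def find_products(xml_text, max_price):
    """Return (id, description, price) for every product priced at or below max_price."""
    root = ET.fromstring(xml_text)
    matches = []
    for product in root.iter("product"):
        price = float(product.findtext("price"))
        if price <= max_price:
            matches.append((product.get("id"), product.findtext("description"), price))
    return matches
```

Because the tags say what the data is, the search needs no knowledge of how the entry happens to be laid out on a page.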

Data accessible from any device

The crux of the issue In the previous figure, note how meaningful searches can be applied to XML data, and the result can be rendered differently, depending on the destination device. Note also that the XML processor can exist on the server, the client, or both. Using XML tags to define what your data means (using the natural vocabulary of your data's domain) is the key motivation for XML's invention and the basis of its usefulness.

So why is XML better? Why not just use HTML? Using HTML involves the removal of the framework of meaning from the data, whereas XML allows the preservation of the framework of meaning with the data. In other words, building a web page involves the author assessing the data and then representing his/her understanding of the meaning by choosing levels of heading, strengths of emphasis, layers of bullets and so on. While the meaning may be clear to a human reader, a computer can't easily work out what meaning was intended just from the formatting. As a consequence, the world-wide-web is not a source of data that computers can interpret.

From World Wide Web to World Wide Database XML leaves the judgment on the meaning to the final consumer. It turns the world-wide-web into a world-wide-database. It follows that it's not just the human content consumer who will benefit from XML. In fact, IBM believes the most significant use of XML will be in the area of computer-computer activity: transactions, messages and databases.

A Simple XML Document Rose Axl axlrose@gnr.com

Compare these HTML and XML snippets...
- HTML allows improper nesting.
- HTML allows start tags without end tags, like the <br> tag.
- HTML allows attribute values without quotes.
- XML requires proper nesting.
- XML requires empty tags to be identified with a trailing slash, as in <br/>.
- XML requires quoted attribute values.

Well-formed XML Documents A well-formed XML document is simply one that follows the previously described rules, i.e.:
1. All tags should be properly nested.
2. All empty tags should be identified with a trailing slash (e.g. <br/>).
3. All attribute values should be within quotes.
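The three rules can be tried out with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree, which rejects input that is not well-formed; the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Illustrative inputs, one per rule.
GOOD = '<person gender="female"><name>Rose</name><photo/></person>'
BAD_NESTING = "<b><i>text</b></i>"             # rule 1: tags overlap
BAD_EMPTY_TAG = "<p>line one<br>line two</p>"  # rule 2: <br> is never closed
BAD_QUOTES = "<person gender=female/>"         # rule 3: unquoted attribute value
```

Note that this only checks well-formedness; it says nothing about whether the document obeys a DTD.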

Valid XML Documents Valid XML documents are well-formed documents that also obey the rules laid down in a Document Type Definition (DTD). The DTD can be in a separate file or in the same file as the XML document it constrains.

The Document Type Declaration The XML Document Type Declaration contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition, or DTD. The document type declaration can point to an external subset (a special kind of external entity) containing markup declarations, or can contain the markup declarations directly in an internal subset, or can do both. The DTD for a document consists of both subsets taken together.

Elements In EBNF (Extended Backus-Naur Form): person ::= (name e-mail*). In a DTD, the same rule is written as an element declaration for person. It means that a person must have a name, followed by zero or more e-mail addresses.

Element Definitions

The complete DTD for addressBook In plain English, this DTD says that our addressBook is composed of one or more persons, where each person has a name and, optionally, e-mail addresses. The name is composed of a family name and a given name. The content of each of these is parsed character data (text, UTF-8 encoded by default).
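Python's standard library has no DTD validator, so the sketch below hand-codes the content model described in plain English above, just to make "valid" concrete. It is not how a validating parser works internally, and the element names (addressBook, person, name, family, given, email) are assumptions inferred from the description.

```python
import xml.etree.ElementTree as ET

# Element names are assumptions inferred from the plain-English description.
ADDRESS_BOOK = """
<addressBook>
  <person>
    <name><family>Rose</family><given>Axl</given></name>
    <email>axlrose@gnr.com</email>
  </person>
</addressBook>
"""

def child_tags(element):
    """List the tag names of an element's direct children, in order."""
    return [child.tag for child in element]

def is_valid_address_book(xml_text):
    """Hand-rolled check of the address book content model (not a DTD validator)."""
    root = ET.fromstring(xml_text)
    if root.tag != "addressBook" or len(root) == 0:
        return False                      # one or more persons required
    for person in root:
        tags = child_tags(person)
        if person.tag != "person" or tags[:1] != ["name"]:
            return False                  # each person starts with a name
        if any(tag != "email" for tag in tags[1:]):
            return False                  # then zero or more e-mail addresses
        if child_tags(person[0]) != ["family", "given"]:
            return False                  # name = family name, then given name
    return True
```

A document can pass is_well_formed-style parsing and still fail this check, which is exactly the difference between well-formed and valid.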

DTD as a separate document In this case a URI to the DTD must be specified in the XML document as shown : <!DOCTYPE addressBook SYSTEM "http://www.inf.com/xml/ab0.dtd"> Rose Axl axlrose@gnr.com

DTD as part of the XML In this case the declaration for the DTD is done as follows : <!DOCTYPE addressBook [ ]>...

Attributes Attributes are used to associate name-value pairs with elements. Attribute specifications may appear only within start tags and empty element tags. Attribute-list declarations may be used: 1. To define the set of attributes pertaining to a given element type. 2. To establish type constraints for these attributes. 3. To provide default values for attributes. Thus, Attribute-list declarations specify the name, data type, and default value (if any) of each attribute associated with a given element type:

The ‘gender’ attribute In the addressBook DTD, we may have the following declaration: <!ATTLIST person gender (male|female) #IMPLIED>
What this means:
<!ATTLIST : a definition of an attribute list starts.
person gender : ‘gender’ is an attribute of ‘person’.
(male|female) : the two values that ‘gender’ may assume.
#IMPLIED : this attribute is optional.
> : end of the attribute list.
Thus, the element ‘person’ has an attribute called ‘gender’ (an enumerated type), which can take the values ‘male’ or ‘female’ and which is optional.

Default attribute values Suppose we want the gender attribute to default to ‘unknown’. We could express this by adding ‘unknown’ to the enumerated list, and appending the "unknown" literal in the DTD rule like so: <!ATTLIST person gender (male|female|unknown) "unknown"> This specifies that the gender can be male, female or unknown, and that the gender attribute defaults to ‘unknown’ when it is omitted. Note that the default value must be enclosed in straight quotes.
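The effect of a default can be observed with Python's bundled expat parser. It does not validate, but it reads the internal DTD subset and, by default, reports attribute values derived from declarations there. This is a minimal sketch; the sample document is an assumption built around the declaration above.

```python
import xml.parsers.expat

# Sample document (illustrative): the second person omits the gender attribute.
DOC = """<?xml version="1.0"?>
<!DOCTYPE addressBook [
  <!ATTLIST person gender (male|female|unknown) "unknown">
]>
<addressBook>
  <person gender="female"/>
  <person/>
</addressBook>
"""

def genders(xml_text):
    """Collect the gender attribute of each person as reported by the parser."""
    found = []

    def on_start(name, attrs):
        if name == "person":
            found.append(attrs.get("gender"))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return found
```

Calling genders(DOC) lets you compare the explicitly written value with the one the parser supplies from the DTD.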

Required attribute values If the gender attribute is to be made compulsory, we can use the keyword ‘#REQUIRED’ as shown below: <!ATTLIST person gender (male|female) #REQUIRED>

Fixed attribute values If the gender attribute is to have a fixed value, we can use the keyword ‘#FIXED’ followed by that value. An example from the Amazonia company, which has only female employees: <!ATTLIST person gender (male|female) #FIXED "female">

Other Attribute Types: CDATA Other than ‘enumerated’ types, two other types of attributes are the ‘CDATA’ and the ‘tokenised’ types. Use the CDATA type when you want to allow any character data. However, certain markup symbols like "<" are not allowed within attribute values. So, if you declared an attribute called method for an element form, you could ensure its value was always 'POST', like so: <!ATTLIST form method CDATA #FIXED "POST">

Other Attribute Types : Tokenised The tokenised attribute types are used to represent a fixed set of keyword types with special meanings. Often, we want to uniquely identify instances of a certain element, so it has an attribute with a value that must be unique. In our Address Book example, it would be helpful to be able to uniquely identify and refer to a person, even people with the same name and similar data. This is done using a tokenised attribute type called ID.

Tokenised attribute example The ID attribute type in the following example, plus the #REQUIRED keyword, ensures that every person must have an id attribute whose value is unique within the document: <!ATTLIST person id ID #REQUIRED> Now that we can ensure the uniqueness of element attribute values, we can refer to them. The next tokenised type, called IDREF, takes care of this.

The IDREF attribute Each IDREF attribute is required to match an ID attribute on some element in the XML document. Similarly, attribute values of type IDREFS must contain whitespace-delimited ID values that occur in the document. So, let's define the ability for a person to link to his or her manager and/or subordinates: <!ELEMENT link EMPTY> <!ATTLIST link manager IDREF #IMPLIED subordinates IDREFS #IMPLIED> Note that we declared an EMPTY link element, which has no content of its own but can carry attributes that refer to other people.

The complete DTD <!ATTLIST link manager IDREF #IMPLIED subordinates IDREFS #IMPLIED>

An XML Document for our DTD Person Junior junior@inf.com Person Senior senior@inf.com
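A validating parser enforces the ID and IDREF rules automatically. The sketch below re-implements those two checks by hand in Python to show what they amount to. The document is a guess at the shape of the slide's example: the id values, element names and the placement of link are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: ids, element names and structure are illustrative assumptions.
STAFF = """
<addressBook>
  <person id="p1">
    <name><family>Person</family><given>Junior</given></name>
    <email>junior@inf.com</email>
    <link manager="p2"/>
  </person>
  <person id="p2">
    <name><family>Person</family><given>Senior</given></name>
    <email>senior@inf.com</email>
    <link subordinates="p1"/>
  </person>
</addressBook>
"""

def check_references(xml_text):
    """Return a list of ID/IDREF problems; an empty list means all links resolve."""
    root = ET.fromstring(xml_text)
    problems = []
    ids = set()
    for person in root.iter("person"):
        pid = person.get("id")
        if pid is None:
            problems.append("person without id")
        elif pid in ids:
            problems.append("duplicate id: " + pid)
        else:
            ids.add(pid)
    for link in root.iter("link"):
        # manager holds one IDREF; subordinates holds whitespace-delimited IDREFS.
        refs = link.get("manager", "").split() + link.get("subordinates", "").split()
        problems.extend("dangling reference: " + ref for ref in refs if ref not in ids)
    return problems
```

IDs are collected in a first pass so that a link may refer to a person who appears later in the document.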

Comparing XML and HTML
- XML tags must be properly nested (the well-formedness criterion).
- XML identifies empty tags with a trailing slash (e.g. <br/>).
- All XML documents have a single root element, which surrounds all others.
- In XML, attribute values must be quoted.
- XML markup tags are case sensitive.
- Whitespace between start and end tags is significant.
- XML is extensible and uses a Document Type Definition (DTD) to define the allowable grammar rules for tags and attributes.

Still to come …
- Advanced topics in XML, including the Document Object Model (DOM), the Simple API for XML (SAX) and Namespaces
- XML parsers, and IBM's XML4J in particular
- An introduction to XSL
- A sample application to build and search among homepages

Course Contents: Day 2
- The Document Object Model (DOM)
- The Simple API for XML (SAX)
- DOM or SAX?
- Namespaces
- XML Parsers
- IBM’s XML4J Parser
- A Homepage Building Application
- The Extensible Stylesheet Language (XSL)
- Summary and concluding remarks

The Document Object Model (DOM) The Document Object Model (DOM) is an application programming interface (API) for HTML and XML documents. It defines the logical structure of documents and the way a document is accessed and manipulated. In the DOM specification, the term "document" is used in the broad sense - increasingly, XML is being used as a way of representing many different kinds of information that may be stored in diverse systems, and much of this would traditionally be seen as data rather than as documents. Nevertheless, XML presents this data as documents, and the DOM may be used to manage this data.

Objective of DOM As a W3C specification, one important objective for the Document Object Model is to provide a standard programming interface that can be used in a wide variety of environments and applications. The DOM is designed to be used with any programming language.

DOM Representation of HTML The DOM is a programming API for documents. It closely resembles the structure of the documents it models. For instance, consider this table, taken from an HTML document: <TABLE> <TBODY> <TR> <TD>Shady Grove</TD> <TD>Aeolian</TD> </TR> <TR> <TD>Over the River, Charlie</TD> <TD>Dorian</TD> </TR> </TBODY> </TABLE>

The DOM represents this table like this:
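To see the tree for yourself, here is a minimal sketch using Python's xml.dom.minidom, one DOM implementation among many. It parses the table (which happens to be well-formed, so an XML parser accepts it) and produces an indented outline of the nodes. The outline helper is a made-up name for this example.

```python
from xml.dom import minidom

TABLE = (
    "<TABLE><TBODY>"
    "<TR><TD>Shady Grove</TD><TD>Aeolian</TD></TR>"
    "<TR><TD>Over the River, Charlie</TD><TD>Dorian</TD></TR>"
    "</TBODY></TABLE>"
)

def outline(node, depth=0):
    """Return one indented line per node, showing the tree the DOM exposes."""
    indent = "  " * depth
    if node.nodeType == node.TEXT_NODE:
        return [indent + repr(node.data)]   # text nodes are the leaves
    lines = [indent + node.nodeName]
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines

document = minidom.parseString(TABLE)
tree_lines = outline(document.documentElement)
```

Printing "\n".join(tree_lines) shows TABLE at the root, TBODY beneath it, two TR rows, and each TD holding a single text node.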

The DOM is not implementation specific In the DOM, documents have a logical structure which is very much like a tree; to be more precise, it is like a "forest" or "grove", which can contain more than one tree. However, the important point to note here is that the DOM does not specify that documents must be implemented as a tree or a grove, nor does it specify how the relationships among objects should be implemented. The DOM is a logical model that may be implemented in any convenient manner. The DOM specification uses the term structure model to describe the tree-like representation of a document; it specifically avoids terms like "tree" or "grove" in order to avoid implying a particular implementation. One important property of DOM structure models is structural isomorphism: if any two Document Object Model implementations are used to create a representation of the same document, they will create the same structure model, with precisely the same objects and relationships.

The Simple API for XML (SAX) History: John Tigue, of DataChannel, was the first to attempt to develop an XML API collaboratively on the XML-DEV mailing list, with his earlier XAPI-J. The process of developing SAX itself started on Saturday 13 December 1997, mainly as a result of the persistence of Peter Murray-Rust. Peter is the author of the free Java-based XML browser JUMBO, and after going through the headaches of supporting three different XML parsers with their own proprietary APIs, he insisted that parser writers should all support a common Java event-based API, which he code-named YAXPAPI (for Yet Another XML Parser API).

SAX is a part of xml.org Peter initiated a discussion with Tim Bray (the author of the Lark XML parser and one of the editors of the XML specification) and David Megginson (the author of Microstar’s Aelfred XML parser) about coming up with a single, standard event-based API for XML parsers. In the end, Jon Bosak, the founder of XML, allowed SAX to use his xml.org domain for the Java package name org.xml.sax.

SAX: How does it work? There are two major types of XML (or SGML) APIs:
1. tree-based APIs; and
2. event-based APIs.
A tree-based API compiles an XML document into an internal tree structure, then allows an application to navigate that tree. The Document Object Model (DOM) working group at the World Wide Web Consortium has developed a standard tree-based API for XML and HTML documents. An event-based API, on the other hand, reports parsing events (such as the start and end of elements) directly to the application through callbacks, and does not usually build an internal tree. The application implements handlers to deal with the different events, much like handling events in a graphical user interface.

How does it work? To understand how an event-based API can work, consider the following sample document: <?xml version="1.0"?> <doc> <para>Hello, world!</para> </doc>
An event-based interface will break the structure of this document down into a series of linear events:
- start document
- start element: doc
- start element: para
- characters: Hello, world!
- end element: para
- end element: doc
- end document
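The event sequence above can be reproduced with Python's xml.sax module, which follows the SAX callback design. This is a minimal sketch: EventLogger and log_events are names invented for the example, and the sample is written without whitespace between tags so that no extra character events appear.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record each parsing event as a line of text."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._text = []

    def _flush_text(self):
        # A parser may deliver one run of text in several chunks; join them first.
        if self._text:
            self.events.append("characters: " + "".join(self._text))
            self._text = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self._flush_text()
        self.events.append("start element: " + name)

    def endElement(self, name):
        self._flush_text()
        self.events.append("end element: " + name)

    def characters(self, content):
        self._text.append(content)

def log_events(xml_bytes):
    """Parse xml_bytes and return the list of events in document order."""
    handler = EventLogger()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events

SAMPLE = b'<?xml version="1.0"?><doc><para>Hello, world!</para></doc>'
```

The handler never sees the document as a whole; it only reacts to each callback as the parser reaches it.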

DOM SAX An application handles these events just as it would handle events from a graphical user interface: there is no need to cache the entire document in memory or secondary storage. Finally, it is important to remember that it is possible to construct a parse tree using an event-based API, and it is possible to use an event-based API to traverse an in-memory tree.

DOM or SAX? Tree-based APIs are useful for a wide range of applications, but they often put a great strain on system resources, especially if the document is large (under very controlled circumstances, it is possible to construct the tree in a lazy fashion to avoid some of this problem). Furthermore, some applications need to build their own, different data trees, and it is very inefficient to build a tree of parse nodes, only to map it onto a new tree. In both of these cases, an event-based API provides a simpler, lower-level access to an XML document: you can parse documents much larger than your available system memory, and you can construct your own data structures using your callback event handlers.

DOM for power, SAX for saving Consider, for example, the following task: locate the record element containing the word "Ottawa". If your XML document were 20 MB in size (or even just 2 MB), it would be very inefficient to construct and traverse an in-memory parse tree just to locate this one piece of contextual information; an event-based interface would allow you to find it in a single pass using very little memory. But if your application requires frequent searches and changes, DOM is the better fit, because the tree stays in memory and can be revisited and modified. This is why we preferred DOM in the Homepage building application demonstrated later in this presentation.
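The "Ottawa" task can be sketched with an event handler that keeps only the record currently being read. This is an illustration in Python's xml.sax; the handler name and the elements inside record (city, country) are assumptions, and only the record element itself comes from the task description.

```python
import xml.sax

class RecordFinder(xml.sax.ContentHandler):
    """Collect the text of every <record> element that mentions a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.matches = []
        self._depth = 0      # greater than 0 while inside a <record>
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "record" or self._depth:
            self._depth += 1

    def characters(self, content):
        if self._depth:
            self._buffer.append(content)

    def endElement(self, name):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:            # the record just closed
                text = "".join(self._buffer)
                if self.keyword in text:
                    self.matches.append(" ".join(text.split()))
                self._buffer = []           # drop it before reading the next one

def find_records(xml_bytes, keyword):
    """Scan the document once and return the matching records' text."""
    finder = RecordFinder(keyword)
    xml.sax.parseString(xml_bytes, finder)
    return finder.matches

# Hypothetical data for illustration.
CITIES = b"""<records>
  <record><city>Ottawa</city> <country>Canada</country></record>
  <record><city>Chennai</city> <country>India</country></record>
</records>"""
```

Memory use is bounded by the largest single record rather than by the size of the whole document, which is the saving the slide refers to.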

Namespaces: What and Why The whole point of XML is to enable users to create unique tags that identify their information in more meaningful ways than simply applying the basic set of HTML tags to all documents. While this gives users great flexibility, it poses problems for interchange and software integration. What happens when two documents make use of the same tag names in different contexts? For example, a <part> tag in an illustrated parts catalog identifies something quite different than a part in a dramatic play. Within a single document, the term "title" may refer to the document itself, the name of a book, and the formal appellation associated with its author (e.g., "Dr."). The problem is not just for element names; it extends to attributes as well.

A partial solution The XML namespaces spec addresses this issue by allowing tags to have a context. That context is the tag's or attribute's XML namespace, which is simply a Web address. Because Web addresses are unique, they’re a handy way to establish unique contexts. For example, you could create a namespace called EDI, linked to a URL; declare that namespace at the beginning of your XML document; and then add "EDI:" as a prefix to any element name in the document. The use of the declared prefix provides a way for software to treat tags with EDI prefixes differently than tags with different prefixes. You can also declare a default namespace at the start of your document; any tags without prefixes are assumed to be in the default namespace.

Namespaces by example Consider this scenario: suppose XML.com wanted to start publishing reviews of XML books. We'd want to mark the info up with XML, of course, but we'd also like to use HTML to help beautify the display. A tiny sample of what we might do follows.

The code... <h:html xmlns:xdc="http://www.xml.com/books" xmlns:h="http://www.w3.org/HTML/1998/html4"> Book Review XML: A Primer Author Price Pages Date Simon St. Laurent 31.98 352 1998/01

… and its explanation In this example, the elements prefixed with xdc are associated with a namespace whose name is http://www.xml.com/books, while those prefixed with h are associated with a namespace whose name is http://www.w3.org/HTML/1998/html4. The prefixes are linked to the full names using the attributes on the top element whose names begin with xmlns:. The prefixes don't mean anything at all; they are just shorthand placeholders for the full names. Those full names, you will have noticed, are URLs, i.e. Web addresses. We'll get back to why that is, and what those are the addresses of, a bit further on.

Why Namespaces? But first, an obvious question: why do we need these things? They are there to help computer software do its job. For example, suppose you're a programmer working for XML.com and you want to write a program to look up the books at Amazon.com and make sure the prices are correct. Such lookups are quite easy, once you know the author and the title. The problem, of course, is that this document has XML.com's book-review tags and HTML tags all mixed up together, and you need to be sure that you're finding the book titles, not the HTML page titles. The way you do this is to write your software to process the contents of <title> tags, but only when they're in the http://www.xml.com/books namespace. This is safe, because programmers who are not working for XML.com are not likely to be using that namespace.
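The lookup described here can be sketched in a few lines of Python with xml.etree.ElementTree. The document is a simplified reconstruction of the book review: the two namespace names come from the slides, while the element layout (bookreview, title, author, price and the HTML head/body wrapper) is an assumption.

```python
import xml.etree.ElementTree as ET

# Simplified, partly assumed reconstruction of the book-review document.
REVIEW = """
<h:html xmlns:xdc="http://www.xml.com/books"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:head><h:title>Book Review</h:title></h:head>
  <h:body>
    <xdc:bookreview>
      <xdc:title>XML: A Primer</xdc:title>
      <xdc:author>Simon St. Laurent</xdc:author>
      <xdc:price>31.98</xdc:price>
    </xdc:bookreview>
  </h:body>
</h:html>
"""

BOOKS_NS = "http://www.xml.com/books"
HTML_NS = "http://www.w3.org/HTML/1998/html4"

def titles_in(xml_text, namespace):
    """Return the text of every title element that belongs to the given namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a qualified name as {namespace-uri}local-name.
    return [element.text for element in root.iter("{%s}title" % namespace)]
```

The program matches on the full namespace name, never on the prefix, so the document's author is free to pick any prefix they like.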

Namespaces are not URLs One of the confusing things about all this is that namespace names are URLs; it's easy to assume that since they're Web addresses, they must be the address of something. But they're not; these are URLs, but the namespace draft doesn't care what (if anything) they point at. Think about the example of the XML.com programmer looking for book titles; that works fine without the namespace name pointing at anything. The reason that the W3C decided to use URLs as namespace names is that they contain domain names (e.g. www.xml.com), which work globally across the Internet.

Namespaces in conclusion That's more or less all there is to it. The only purpose of namespaces is to give programmers a helping hand, enabling them to process the tags and attributes they care about and ignore those that don't matter to them. Quite a few people, after reading earlier drafts of the Namespace Recommendation, decided that namespaces were actually a facility for modular DTDs, or were trying to duplicate the function of SGML's "Architectural Forms". None of these theories are true. The only reason namespaces exist, once again, is to give elements and attributes programmer-friendly names that will be unique across the whole Internet. Namespaces are a simple, straightforward, unglamorous piece of syntax. But they are crucial for the future of XML programming.

XML Parsers An XML parser is a tool that reads an XML document and exposes its contents to an application through one of the existing object models for XML, such as DOM or SAX. Some examples of parsers:
1. XML4J
2. JUMBO
3. Lark
4. Aelfred

What should a parser do? Irrespective of what object model a parser conforms to, it should at least ensure the following:
1. Allow a user to parse a document that he or she has created from any source.
2. Ensure that the XML is well-formed.
3. Allow the validity of the XML document to be checked against a DTD that may or may not be in the same file.

IBM’s XML4J XML4J is IBM’s XML Parser and is today considered one of the best. The Parser can support both the DOM and SAX object models and the parser required for your needs can be created from the parser factory.

Packages of XML4J The main packages of XML4J version 2.0.4 are:
- com.ibm.xml.framework
- org.w3c.dom
- org.xml.sax
- org.xml.sax.helpers

To construct a parser in XML4J In XML4J version 2, the DOM API is implemented using the SAX API. XML4J version 2 has a modular architecture and comes pre-bundled with 4 configurations of the parser (all in the com.ibm.xml.parsers package). These are:
- SAX parser, non-validating (com.ibm.xml.parsers.SAXParser)
- SAX parser, validating (com.ibm.xml.parsers.ValidatingSAXParser)
- DOM parser, non-validating (com.ibm.xml.parsers.NonValidatingDOMParser)
- DOM parser, validating (com.ibm.xml.parsers.DOMParser)

PageX: A Homepage Building Application PageX is an application we developed that accepts data from a user and builds a simple homepage for them. The homepage simply consists of a heading, a set of links and the corresponding targets. The user's data is first saved as an XML file. Once the XML file is created, the application uses a DOM parser to move from node to node, pick out the values and plug them into an HTML file, so that the homepage can be viewed with a browser.
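The flow described above (XML file, DOM walk, HTML output) might look roughly like the following. This is not the PageX source: it is a Python sketch using xml.dom.minidom, and the input format (homepage, heading, link with a target attribute) is entirely hypothetical.

```python
from html import escape
from xml.dom import minidom

# Hypothetical input format: element and attribute names are assumptions.
HOMEPAGE_XML = """
<homepage>
  <heading>Axl Rose</heading>
  <link target="http://www.w3.org/XML/">XML at the W3C</link>
  <link target="http://www.xml.com/">XML.com</link>
</homepage>
"""

def text_of(element):
    """Concatenate the text nodes directly under a DOM element."""
    return "".join(n.data for n in element.childNodes if n.nodeType == n.TEXT_NODE)

def build_homepage(xml_text):
    """Walk the DOM node by node and plug the values into an HTML page."""
    dom = minidom.parseString(xml_text)
    heading = text_of(dom.getElementsByTagName("heading")[0])
    items = [
        '<li><a href="%s">%s</a></li>'
        % (escape(link.getAttribute("target")), escape(text_of(link)))
        for link in dom.getElementsByTagName("link")
    ]
    return "<html><body><h1>%s</h1><ul>%s</ul></body></html>" % (
        escape(heading),
        "".join(items),
    )
```

A tree-based API suits this job because the same in-memory document can be searched and edited repeatedly before the page is regenerated.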

PageX A demo of the Homepage building application. Download the source code here.

But I cannot read XML ! XML is fine, but humans are not altogether comfortable with reading from tags. There must be some method to read from an XML file and show it in a form that is as comfortable to read as HTML.

Enter the Extensible Stylesheet Language (XSL) XSL is a transformation language: it transforms a document written in one language (XML) into a document of another language (e.g., HTML). It works through rules such as:
- “XSL processor, when you encounter the root element, do [action1]”
- “XSL processor, when you encounter a particular element, do [action2]”
- “XSL processor, when you encounter another element, do [action3]”
- And so on...
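Python's standard library does not include an XSL processor, so the sketch below is a stand-in rather than XSL itself: a small rule table in Python that mimics the "when you encounter element X, do this" style of a stylesheet. The element names reuse the address book example and are assumptions, as are the render and transform helper names.

```python
import xml.etree.ElementTree as ET
from html import escape

def render(element, rules):
    """Apply the rule registered for this element, defaulting to its children."""
    rule = rules.get(element.tag, render_children)
    return rule(element, rules)

def render_children(element, rules):
    """Render each child in document order and join the results."""
    return "".join(render(child, rules) for child in element)

# Each entry plays the role of one "when you encounter ..., do ..." rule.
RULES = {
    "addressBook": lambda e, r: "<html><body><ul>%s</ul></body></html>" % render_children(e, r),
    "person": lambda e, r: "<li>%s</li>" % render_children(e, r),
    "name": lambda e, r: "<b>%s %s</b>" % (escape(e.findtext("given")), escape(e.findtext("family"))),
    "email": lambda e, r: " (%s)" % escape(e.text),
}

def transform(xml_text):
    """Turn an address book document into a small HTML page."""
    return render(ET.fromstring(xml_text), RULES)
```

Changing the rule table changes the presentation without touching the XML, which is the separation of content and presentation that XSL is meant to provide.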

What an XSL Processor does [Figure: XML (content) and XSL (presentation) are fed into the XSL processor, which produces HTML.]

XML + XSL = World Wide Database The equation says it all! The future of the web lies in storing data as data and not as how that data should be presented. The web needs XSL parsers and XSL enabled tools to understand data for what they are worth; and not browsers that can only understand the how's rather than the what's.

Summary
- XML derives from SGML. Like HTML it uses tags, but it enforces well-formedness and lets users define their own tags to describe data.
- The emphasis is on data rather than presentation.
- XML leaves the judgment on the meaning to the final consumer. It turns the world-wide-web into a world-wide-database.
- The most significant use of XML will be in the area of computer-computer activity: transactions, messages and databases.
- XSL can be used to read XML and convert it into a more human-readable form.

Thank You! Comments, cribs, more info?
