Published by Susan Harrington
1
X-Informatics: I-400 and I-590 An Introduction to XML
Spring Semester, MW 6:00 pm – 7:15 pm Indiana Time
Geoffrey Fox and Bryan Carpenter
PTLIU Laboratory for Community Grids
Informatics (Computer Science, Physics), Indiana University, Bloomington IN 47404
11/22/2018 xmlintrospring02
2
Outline of Introduction to XML
The two drivers for XML
  Original: a better way of specifying documents that is more powerful than HTML and easier to understand than SGML
  Current: XML as an object structure for totally general entities
Basic XML: well-formed and valid XML
Examples of use of XML
XML syntax (this presentation)
Further presentations on
  XML Schema
  XML-based Document Object Model and style sheets
  Transforming XML documents: XSLT
  Searching XML documents
  Applications of XML: Dublin Core, RDF, SVG, SOAP
3
An Information Nugget
4
Nugget In XML
5
Essential Issues
We have a world of objects
Objects are instances of “classes of object”
  This Sony VAIO is an instance of the general Sony VAIO laptop class
  The Sony VAIO laptop class is a subclass of the laptop class
Object classes are made up of smaller things, which have “types”
  Color is a special type of variable with a 24-bit (or some other) specification
  String (characters) is a simple type; integers are a type, and so on
  Date is also a simple type in XML because it is so common
XML has Schema to specify classes; these are made up of simple types and complex (complicated) types
Further, one Schema can be built from one or more other Schemas
6
An Example of RDF and Dublin Core
<rdf:RDF xmlns:rdf=" xmlns:dc="
  <rdf:Description about="
    <dc:Title>D-Lib Program - Research in Digital Libraries</dc:Title>
    <dc:Description>The D-Lib program supports the community of people with research interests in digital libraries and electronic publishing.</dc:Description>
    <dc:Publisher>Corporation For National Research Initiatives</dc:Publisher>
    <dc:Date> </dc:Date>
    <dc:Subject>
      <rdf:Bag>
        <rdf:li>Research; statistical methods</rdf:li>
        <rdf:li>Education, research, related topics</rdf:li>
        <rdf:li>Library use Studies</rdf:li>
      </rdf:Bag>
    </dc:Subject>
    <dc:Type>World Wide Web Home Page</dc:Type>
    <dc:Format>text/html</dc:Format>
    <dc:Language>en</dc:Language>
  </rdf:Description>
</rdf:RDF>
7
XML Example for SOAP
This is a way to use XML to send the command ls (list files) from one machine to another
[Figure labels: First argument; HTTP Header; SOAP Envelope with body; Specify ls as]
8
XML Example II
And this is the result of ls sent back to the client in SOAP over HTTP
[Figure labels: HTTP Header; SOAP Envelope and body]
9
Overview of HTML
HTML = Hypertext Markup Language, the lingua franca of the World Wide Web
HTML is a simple language well suited for hypertext, multimedia and the display of small and reasonably simple documents
HTML 2.0 spec completed in Nov 95
HTML+ and HTML 3.0 never released
HTML 3.2 (Jan 97) added tables, applets, and other capabilities (approximately 70 tags); this is what most people are familiar with today
HTML 4.0 spec released in Dec 97
XHTML (XML version of HTML 4.0) released January 2000 as a W3C Recommendation
10
W3C Process
The Web Consortium has a highly effective process for initiating and refining standards for the web
The agreed standards for protocols and APIs are as critical to the success of the web as the technologies are
The process of defining a standard moves from Working Draft to Last Call Working Draft to Candidate Recommendation to Proposed Recommendation and finally to Recommendation
The standards discussed here are quite recent
  XML Schema became a Recommendation in May
  SVG (2D vector graphics done in XML, relevant for animated web pages) became a Recommendation in September
  XQuery (a proposed way of searching XML data structures/documents) is currently a Working Draft dated December
11
XML in the HTML world
XML = eXtensible Markup Language (the name suggests documents, not objects)
XML was originally designed as a subset of SGML (Standard Generalized Markup Language), but unlike the latter, XML was specifically designed for the web and for comparative simplicity
The specification comes from the W3C: XML 1.0 in February 98, with continuing refinements
How XML fits into the browser world:
  XML with an application-specific Schema describes the logical structure of the document
  CSS (Cascading Style Sheets) or another style language describes the visual presentation of the document
  The DOM (Document Object Model) allows scripting languages such as JavaScript to access document objects
  DHTML (Dynamic HTML) allows a dynamic presentation of the document
XHTML is XML syntax for specifying text display, so HTML just does DISPLAY; XML does “Knowledge” and DISPLAY
12
Informatics View of Architecture
[Figure labels: Raw Data; Resource; XML for Data; (Virtual) XML Interface; Processing Server; Information/Knowledge; (Virtual) XML Interface; XML for Knowledge; Rendering to XML syntax Display Format (XHTML and SVG are examples); Clients]
Note that the Server Tier uses lots of subsystems that are themselves separated by XML interfaces
13
The original Motivation for XML as an enhancement of HTML separating Display and Knowledge
Limitations of HTML:
  Extensibility: HTML does not allow users to specify their own tags or attributes in order to parameterize or otherwise semantically qualify their data
  Structure: HTML does not support the specification of deep structures needed to represent database schema or object-oriented hierarchies
  Validation: HTML does not support the kind of language specification that allows applications to check data for structural validity when it is imported
14
Logical vs. Visual Design
This is XML used as the interface between Knowledge and Rendering
The logical design of a document (content) should be separate from its visual design (presentation)
Separation of logical and visual (rendering) design
  promotes sound typography
  encourages better writing
  is more flexible
  allows the same “knowledge/information” (defined in XML) to be displayed on PCs, PDAs, Braille devices etc.
XML can be used to define the logical design, while XSL (Extensible Style Language) is used to define the visual design (usually by mapping XML into HTML, but better by mapping XML for Knowledge into XHTML or SVG or ….)
15
What is SGML?
SGML = Standard Generalized Markup Language, defined as an ISO (not W3C) standard (ISO 8879) in 1986
An SGML document carries with it a grammar called a Document Type Definition (DTD)
  The DTD defines the tags and the meaning of those tags
  DTD syntax is not very nice
Presentation is governed by a style sheet written in the Document Style Semantics and Specification Language (DSSSL)
Note that HTML is a fixed SGML application, a hard-wired set of about 70 tags and 50 attributes, and does not need to have a DTD for each HTML instance
16
SGML Example
A simple SGML document with embedded DTD:
<!DOCTYPE DOCUMENT [
<!ELEMENT DOCUMENT O O (p*,BIGP*)>
<!ELEMENT p - O (#PCDATA)>
<!ELEMENT BIGP - O (#PCDATA)>
]>
<DOCUMENT>
<p>Welcome to
<BIGP>XML Style!
</DOCUMENT>
17
SGML Example (cont’d)
A corresponding DSSSL style sheet:
<!DOCTYPE style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">
(root (make simple-page-sequence))
(element p (make paragraph))
(element BIGP (make paragraph font-size: 24pt space-before: 12pt))
XSL simplifies DSSSL just as XML simplifies SGML
18
XML as a simple SGML
XML is a simplified subset of SGML; since XML is extensible (XML can be considered a metalanguage), a document that is to be validated must be accompanied by its DTD (or Schema), while a merely well-formed document can omit it
XML is a compromise between the non-extensible, limited capabilities of HTML and the full power and complexity of SGML
XML offers “80% of the benefits of SGML for 20% of its complexity”
XML designers tried to leave out all the SGML that would be rarely used on the web
Note that the XML specification is 30 pages and the SGML specification is 500 pages
XML allows you to define your own tags and to describe nested hierarchies of information
19
Some Global Concepts
We are defining objects, possibly just documents as in SGML
We need to define “object templates”, that is, their structure
  This is the class for Java
  This is the DTD for SGML and XML, or the Schema for XML
We have instances of objects
  XML files or Java objects, with some way (optional for XML) of specifying the template they follow
We need to transform objects
  We can do this with “real software”, i.e. read the object into a program, interpret it and write it out in a different form
  We can use a specialized transformation language with some control data: this is DSSSL plus stylesheet for SGML; XSLT plus stylesheet for XML; browser plus CSS stylesheet for HTML
20
XML Design Goals
1) XML shall be usable over the Internet
2) XML shall support a variety of applications
3) XML shall be compatible with SGML
4) It shall be easy to write programs that process XML documents
5) Optional features in XML shall be kept to the absolute minimum, ideally zero
6) XML documents should be human-legible and reasonably clear
7) Design of XML should be prepared quickly
8) Design of XML shall be formal and concise
9) XML documents shall be easy to create
10) Terseness in XML markup is of minimal importance
21
Features of XML I
The documents are stored in plain text and thus can be transferred and processed anywhere
Inline reusability: documents can be composed of many pieces
Unifying principles make it easily acceptable
  “everything is a tree”
  Unicode for different languages
XML documents enable several types of uses
  traditional data processing: XML documents can be the data interchange medium
  document-driven programming
  archiving
22
A Tree
<root>
  <onedown val="abc">
    <twodown>
      <threedown anotherval="123">stuff</threedown>
    </twodown>
  </onedown>
  <nextone> </nextone>
</root>
[Figure: the same document drawn as a tree. Labels: root; onedown (val); twodown; threedown (anotherval, content “stuff”); nextone]
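The “everything is a tree” idea can be made concrete with a few lines of Python. This is only a sketch using the standard-library xml.etree.ElementTree parser; the helper name outline is made up for illustration, and the slide's curly quotes have been retyped as straight quotes because only straight quotes delimit attribute values.

```python
import xml.etree.ElementTree as ET

# The slide's document, retyped with straight quotes.
DOC = """<root>
  <onedown val="abc">
    <twodown>
      <threedown anotherval="123">stuff</threedown>
    </twodown>
  </onedown>
  <nextone></nextone>
</root>"""


def outline(element, depth=0):
    """Return one (depth, tag, attributes, text) tuple per element, in document order."""
    text = (element.text or "").strip()
    rows = [(depth, element.tag, dict(element.attrib), text)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


rows = outline(ET.fromstring(DOC))
for depth, tag, attrs, text in rows:
    # Indentation mirrors the depth of each node in the tree.
    print("  " * depth + tag, attrs, repr(text))
```

Attributes hang off the element they belong to, while nested elements become child nodes, which is the distinction the slide's drawing makes between val/anotherval and twodown/threedown.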
23
Features of XML II
It is important to remember that XML is a markup language, not a programming language
XSL can be viewed as a way of programming data whose structure is defined in XML
Except this isn’t really correct: you can build a programming language with XML syntax
  <myshell program="cat" args="grades" />
  <myshell program="ls" args="-l" />
The M in XML is Markup, reflecting its origin in the publication community, where markup specifies the layout of a document, fonts to use, etc.
XML’s most important use is not this original one but specifying abstract data structures, equivalent to structures in C++, classes in Java or entity relationships in the database world
24
Origins of XML
First draft of XML spec released by W3C in Nov 96 (four other drafts published in 1997)
The first XML parser (written in Java) released by Microsoft in July 97
Microsoft released version 1.8 of its XML parser (which supports XML 1.0) in Jan 98
W3C finalized the XML 1.0 spec in Feb 98
First XML-aware beta versions of Netscape and IE5.0 released in June 98
Sun announced Java Standard Extension for XML (XML API) in March 99
W3C effort is ongoing, as discussed
25
HTML becomes XHTML
This shows how near HTML is to XML, but also the differences!
26
XHTML II
You must have a body and a head section
You canNOT use capital letters in element or attribute names
27
XHTML III
28
XHTML IV
29
XHTML V
30
XHTML VI
31
XHTML VII
32
XHTML VIII
33
XHTML IX
Converting HTML to XHTML requires …
34
Homework 2
Go to http://www.w3schools.com/xhtml/default.asp
Take the XHTML course
Take the quiz, returning either a screen dump or the saved HTML of the results page
Build a valid XHTML file as your course home page
It should NOT have frames
35
Homework 2 Continued
Validate your XHTML file and send to
36
“Hello World!” in XML
An XML document with external DTD:
<?xml version="1.0"?>
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting>Hello World!</greeting>
An XML document with embedded DTD:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE greeting [
<!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello World!</greeting>
Current XHTML has a DTD but not a Schema
The next version of XHTML, with modules, is Schema based
You don’t need to understand DTDs to use XHTML
37
XML and Related Acronyms
Document Type Definition (DTD), which defines the tags and their relationships; to be replaced (IMHO) by XML Schema
Extensible Style Language (XSL) style sheets, which specify the presentation of the document
Cascading Style Sheets (CSS), a less powerful presentation technology without tag mapping capability
XPath, which specifies a location in a document
XLink and XPointer, which define link-handling details
Resource Description Framework (RDF), document metadata
Document Object Model (DOM), an API for converting the document to a tree object in your program for processing and updating
Simple API for XML (SAX), a “serial access”, fast-to-execute protocol for processing a document on the fly
XML Namespaces, for an environment of multiple sets of XML tags
XHTML, a definition of HTML tags for XML documents (which are then just HTML documents)
XML Schema, which offers a better alternative to DTD
38
Document Type Definition
The DTD specifies the logical structure of the document; it is a formal grammar describing document syntax and semantics
The DTD does not describe the physical layout of the document; this is left to the style sheets and the scripts
It is no mean task to write a DTD, so most users will adopt predefined DTDs (or write an XML document without a DTD)
  DTDs can be written in separate files to facilitate re-use
Content providers, industries and other groups can collaborate to define sets of tags: the essence of “any” field (physics, music …) is captured in a domain-specific DTD/Schema
XML documents are valid if they are consistent with a specified DTD or Schema
We will NOT discuss DTDs significantly in this presentation
39
XML Editors
There are several XML editors at various prices and capabilities
One list of available editors is at
We have good experience with XML Spy, which costs money, but renewable 30-day licenses are available
The capabilities of editors depend on how well they support Schemas
As XML gets more complicated, expect a new generation of “processing tools” that accept XML with multiple Schemas as input and produce some sort of output for people and/or computers
Microsoft XML Notepad is simple, free and dated
has a set of good XML Schema links which inter alia discuss XML
40
XML must be “well-formed”
For the data contained in an XML document to be parsed correctly, its markup must be well-formed, meaning in part that properly nested and non-abbreviated start and end tags are used
This well-formedness provides a well-defined encapsulation mechanism, allowing designated sections of the data to be accessed programmatically
Current HTML browsers tolerate rule violations, but XML is strict, which is essential for many (robust) applications
  If XML were used just to render, sloppiness would be allowable; but as XML is aimed at capturing object structure or information, we cannot have errors interpreted unpredictably by parsers
Well-formed is less restrictive than valid
XML documents must be well-formed; the user can decide whether they also need to be valid
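A quick way to see this strictness is to hand a few fragments to a parser. The sketch below uses Python's standard-library xml.etree.ElementTree, which checks well-formedness only (it does not validate against a DTD or Schema); the function name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document, else False."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "nested and closed": "<p>XML is <b>strict</b></p>",
    "missing end tag": "<p>XML is <b>strict</p>",
    "abbreviated end tag": "<p>XML is strict</>",
    "two root elements": "<p>one</p><p>two</p>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, "->", "well-formed" if ok else "rejected")
```

Unlike an HTML browser, the parser does not try to guess what the author meant: any violation stops the parse, which is the predictable behaviour the slide argues for.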
41
Character Data in XML: CDATA and PCDATA
XML documents are made up of markup and character data (CDATA)
PCDATA (parsed character data) is the text obtained by parsing the document and processing markup as necessary
“Markup” includes tags and attributes (ALL that is important), entity references, character references, comments, CDATA section delimiters, DTD declarations and processing instructions
XML allows you to specify chunks of text which may contain “reserved characters/strings” with an ugly syntax
<![CDATA[ <ignored>Anything </ignored> ]]>
Maybe (hopefully) this will be replaced by alternatives based on ideas like mail attachments – see
42
Characters in XML
We can choose the character encoding, such as UTF-8 (a variable-length encoding in which ASCII characters take 8 bits), UTF-16 (16-bit code units, as used by Java) or even UCS-4, which uses 32 bits for each character; UTF-8 and UTF-16 are the defaults that every XML processor must support
The encoding is specified in the XML declaration in the document prolog
You can use character reference markup
  &#x03C0; is the hexadecimal character reference for π: the Unicode (ISO/IEC 10646) code point wrapped in &#x .. ; syntax
  &#960; is the same character using the decimal form
One can use the five built-in entity references
  &amp; for &
  &apos; for '
  &gt; for >
  &lt; for <
  &quot; for "
In the DTD approach (which we are ignoring), one can define arbitrary entity references
&#x----; Hexadecimal (base 16, digits 0-9 and A-F)
&#----; Decimal (base 10)
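To check how character and entity references resolve, one can parse a small element and look at its text. This sketch assumes Python's standard-library xml.etree.ElementTree; the element name sample is arbitrary, and the last lines simply show how many bytes π occupies in three encodings.

```python
import xml.etree.ElementTree as ET

# Hexadecimal and decimal references to the same character (U+03C0, Greek small letter pi),
# followed by the five built-in entity references.
DOC = "<sample>&#x03C0; &#960; &amp; &apos; &gt; &lt; &quot;</sample>"

text = ET.fromstring(DOC).text
print(text)

# The same character costs a different number of bytes in each encoding.
sizes = {name: len("\u03c0".encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}
print(sizes)
```

Both reference forms yield the same character, so the choice between hexadecimal and decimal is purely one of notation.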
43
White Space in XML
By default XML applications treat spaces, tabs, line feeds and carriage returns “just” as white space
Thus
<greeting>Hello World!</greeting>
and
<greeting>Hello     World!</greeting>
are normally treated as equivalent. This is similar to HTML.
One can overrule this using the attribute xml:space with syntax
<greeting xml:space="preserve">Hello World!</greeting>
For a validated document this attribute must be declared in the DTD:
<!ATTLIST greeting xml:space (default|preserve) 'preserve'>
This defines element greeting to allow an attribute xml:space which can take values default or preserve, with the latter as the default
If you specify xml:space, then it holds not only for the given element but for all those contained within it
44
XML Example
Another example, which could be used for URL exchanges between network-capable applications:
<LINK>
  <TITLE>XML Recommendation</TITLE>
  <URL> </URL>
  <DESCRIPTION>The official XML spec from W3C</DESCRIPTION>
</LINK>
45
XML Example (cont’d)
A document may have many such links:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/css" href="fred.css" ?>
<DOCUMENT>
  <LINKS>
    <LINK>…</LINK>
    <LINK>…</LINK>
    …
  </LINKS>
</DOCUMENT>
Here we have also added prolog processing instructions.
46
XML Prolog and Processing Instructions
Every XML file starts with the prolog, giving information about the document
The minimal prolog identifies it as an XML document:
<?xml version="1.0"?>
The prolog may also include the encoding and whether it is a standalone document:
<?xml version="1.0" encoding="ISO " standalone="yes" ?>
If it is not standalone, it may specify external “entities”, which may be named in the document, or an external DTD
An XML file may also contain more general processing instructions for the application processing the document:
<?target instructions ?>
where target is the name of the application
Only <?xml … ?> is understood by all XML processors
Specification of a stylesheet by <?xml-stylesheet .. ?> is common
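An application can look for processing instructions aimed at it by walking the top-level nodes of the parsed document. The sketch below assumes Python's standard-library xml.dom.minidom, which keeps processing instructions as nodes; the document is a cut-down version of the links example, and fred.css is just the placeholder name used on the slides.

```python
from xml.dom import minidom

DOC = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/css" href="fred.css"?>
<DOCUMENT><LINKS/></DOCUMENT>"""

# Bytes are passed so that the parser reads the encoding from the declaration itself.
dom = minidom.parseString(DOC.encode("utf-8"))

# Collect (target, instructions) pairs for every processing instruction in the prolog.
instructions = [
    (node.target, node.data)
    for node in dom.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
print(dom.documentElement.tagName)
```

The `<?xml … ?>` line does not show up in the list: the parser consumes it itself, and only the instruction addressed to another target (here xml-stylesheet) is handed on.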
47
XML Prolog and Comments
The prolog can contain:
  Processing instructions
  DTD specifications (we have illustrated these but will not discuss them further)
  Comments
Comments have the same form anywhere in the XML document and are just like comments in HTML
  <!--This is the Prolog and <tag> Lousy Course</tag> is not treated as a tag-->
You cannot have -- inside comments, but <tag> </tag> is not treated as markup
48
Processing XML
In the beginning (1999), it was not clear how XML would be used
One of the original (major?) goals was specifying the content of web pages, and this implied processing of XML with “style-sheets” that specified the mapping of XML into HTML
  Obviously this is some sort of “processing”
XML was so popular that lots of other applications, with lots of totally different processing, were invented
<?target instructions ?> was insufficient to specify how processing should be done, and not very useful, as it is always better to be modular and NOT associate details of processing with data
So it is best to ignore XML processing instructions unless they are used in a very conventional way, such as for style sheets
  Modern web page technology tends not to use even this, but rather has a separate “configuration file” matching XML and style-sheets
49
Comments in XML
The <!-- --> syntax represents a comment on the “file”, as it is ignored by the XML parser
This is sometimes useful, but more valuable is a comment that is preserved by the parser, as this can be either thrown away or kept as you please
  Do this with some sort of tag like <yourcomment>This is a comment</yourcomment>
Parsers read XML, check whether it is well-formed/valid and return some sort of answer; in the simplest model this is a modified file
[Figure labels: OLD MODEL; XML File; XML Parser; Output; Use processing instructions to control parsing; <!-- --> ignored]
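The difference between the two kinds of comment shows up as soon as a document is parsed. This sketch assumes Python's standard-library xml.etree.ElementTree, whose default parser discards `<!-- -->` comments; the opt-in shown at the end (TreeBuilder with insert_comments=True) needs Python 3.8 or later. The element names note, yourcomment and body are made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<note>
  <!-- dropped by the default parser -->
  <yourcomment>This is a comment</yourcomment>
  <body>Hello</body>
</note>"""

# Default behaviour: the <!-- --> comment never reaches the application.
root = ET.fromstring(DOC)
children = [(child.tag, child.text) for child in root]
print(children)

# Opt-in behaviour (Python 3.8+): ask the tree builder to keep comment nodes.
keeping = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root_with_comments = ET.fromstring(DOC, parser=keeping)
kept = [child.text for child in root_with_comments if child.tag is ET.Comment]
print(kept)
```

A comment written as an ordinary element is always delivered, so the application rather than the parser decides whether to keep it.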
50
Role of Parsers
The new model disassociates data and action on the data
XML parsers are critical technology: editors are built on top of them, and parsers are the basis of all use of XML in web services
[Figure labels: XML File; XML Parser; Output; “Business Logic” Web Service; Specify Web Service in XML (WSDL does this)]
51
XML tag structure
In XML terminology, a pair of start and end tags, together with the content between them, is an element
XML documents must have a strict hierarchical structure
All start tags must have an end tag
Every element must be properly nested within another
  <LI> XML requires <B><I>proper nesting</I></B>.</LI> is well formed
  <LI> XML requires <B><I>proper nesting</I></LI>.</B> would be rejected by an XML parser
Empty tags (no content except perhaps attributes) are allowed as elements in XML documents
  An empty tag is a start and end tag together and is identified by a trailing / before the closing >
  So in XHTML one uses <br/> for the empty break tag (so empty tags with no attributes are “flags”)
  A start tag and end tag with nothing in between can also be considered an empty tag: <IMG SRC="face.gif"></IMG>
XML tags are case-sensitive (<H1> is not the same as <h1>)
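Two of these rules are easy to confirm with a parser: the two spellings of an empty element are equivalent, and tag names are compared case-sensitively. A small sketch, again assuming Python's standard-library xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# An empty-element tag and a start/end pair with nothing between them parse to the same thing.
short_form = ET.fromstring('<IMG SRC="face.gif"/>')
long_form = ET.fromstring('<IMG SRC="face.gif"></IMG>')
same_element = (
    short_form.tag == long_form.tag
    and short_form.attrib == long_form.attrib
    and not short_form.text
    and not long_form.text
    and len(short_form) == len(long_form) == 0
)
print("empty forms equivalent:", same_element)

# Tag names are case-sensitive, so <H1> cannot be closed by </h1>.
try:
    ET.fromstring("<H1>Title</h1>")
    case_mismatch_accepted = True
except ET.ParseError:
    case_mismatch_accepted = False
print("case mismatch accepted:", case_mismatch_accepted)
```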
52
Document is a Single Tree
XML documents allow only one root element, so it must be
<?xml version="1.0" ?>
<rootoftree> ……… </rootoftree>
and not
<?xml version="1.0" ?>
<rootoftree> ……… </rootoftree>
<rootoftree> ……… </rootoftree>
So there is only one tree in each document
53
XML Attributes I
Tags can have any number of attributes (which, for a valid document, must be declared in the DTD or Schema)
All attribute values must be within single or double quotes
  <FONT COLOR="#FF00CC"> quoted attribute </FONT>
If you have a double quote inside an attribute value, then either
  Use &quot; for the inside quote, as in quote="&quot;"
  Enclose the attribute value in single quotes, as in quote='"'
Each attribute can only appear once in a given element
One can choose (a matter of taste) between
  <person name="Fox" role="teacher"></person>
and
  <person><name>Fox</name><role>teacher</role></person>
Note you can repeat elements but you cannot repeat attributes to represent multiple occurrences
54
XML Attributes II
Note that with DTD (this changes with Schema), all element and attribute values are text, not numbers, and so must be “converted” by the application to the intended form
So
  <item>weekdays<quantity>5</quantity></item>
or
  <item quantity="5">weekdays</item>
returns the string “5”, not the number 5, for quantity
xml:lang is a useful attribute (in the xml Namespace) which can be used (as always, if declared in DTD or Schema as an allowed attribute) to specify language
  <text xml:lang="en">Good English</text>
  <text xml:lang="x-youth">Coolio, Wax On, Wax Off, Dude</text>
xml:lang can take values from an official vocabulary (such as en above, which is ISO 639) or your private code starting with x-
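The “conversion by the application” step looks like this in practice. The sketch assumes Python's standard-library xml.etree.ElementTree, which reports namespaced attributes such as xml:lang under the full namespace URI; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

as_attribute = ET.fromstring('<item quantity="5">weekdays</item>')
as_element = ET.fromstring("<item>weekdays<quantity>5</quantity></item>")

raw = as_attribute.get("quantity")  # the parser hands back a string
from_attribute = int(raw)           # the application converts it
from_element = int(as_element.findtext("quantity"))
print(repr(raw), from_attribute, from_element)

# xml:lang is looked up by namespace URI plus local name.
text = ET.fromstring('<text xml:lang="en">Good English</text>')
language = text.get("{%s}lang" % XML_NS)
print(language)
```

Either representation of quantity ends up as the same integer; the choice between attribute and element really is a matter of taste as far as the data is concerned.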
55
XML Names and NMTOKEN
Name Characters are letters, digits, hyphens, underscores, colons or full stops
An NMTOKEN is any sequence of one or more Name Characters
NMTOKENS is any list of NMTOKENs separated by white space (space, tab, newline etc.)
Case is significant: PERSON and person are distinct names
Attribute and element names must be NMTOKENs with further restrictions
  Names cannot begin with a digit, hyphen or full stop
  Names cannot begin with xml (or any variant obtained by case changes); the system reserves this prefix
  Colons are ONLY to be used with Namespaces; currently this is an informal rule only
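These naming rules can be approximated with two regular expressions. The sketch below is deliberately simplified: it only accepts ASCII letters, whereas real XML names may use a much wider range of Unicode letters, and the function names is_nmtoken and is_user_name are made up for illustration.

```python
import re

# ASCII-only approximation of the Name Character set described above.
NMTOKEN = re.compile(r"[A-Za-z0-9._:-]+")
# A name must start with a letter, underscore or colon (not a digit, hyphen or full stop).
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")


def is_nmtoken(candidate):
    """True if candidate consists only of (ASCII) name characters."""
    return NMTOKEN.fullmatch(candidate) is not None


def is_user_name(candidate):
    """True if candidate is a legal name that also avoids the reserved xml prefix."""
    if NAME.fullmatch(candidate) is None:
        return False
    return not candidate.lower().startswith("xml")


checks = {c: (is_nmtoken(c), is_user_name(c)) for c in ["person", "PERSON", "123", "-x", "XmLfoo"]}
for candidate, (token_ok, name_ok) in checks.items():
    print(candidate, token_ok, name_ok)
```

Every name is an NMTOKEN, but not the other way round: 123 and -x are fine as tokens and illegal as element or attribute names.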
56
CDATA Sections
CDATA sections allow you to include unparsed characters in a document
<![CDATA[ <ignored>Anything </ignored> ]]>
In this example the ignored tag is not processed by the XML parser
Unfortunately you must guarantee that there is no ]]> string in the text between <![CDATA[ and ]]>
<script language="JavaScript">
<![CDATA[
var fred = 0;
if( fred < 10) { document.writeln("> and < here are NOT parsed"); }
]]>
</script>
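The effect of a CDATA section can be seen by parsing the script example with and without it. A sketch assuming Python's standard-library xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

DOC = """<script language="JavaScript"><![CDATA[
var fred = 0;
if (fred < 10) { document.writeln("> and < here are NOT parsed"); }
]]></script>"""

# Inside the CDATA section the < and > characters arrive as plain text.
script = ET.fromstring(DOC)
body = script.text
print(body)

# The same code without the CDATA wrapper is not well-formed, because "<" starts markup.
try:
    ET.fromstring("<script>if (fred < 10) {}</script>")
    unwrapped_parses = True
except ET.ParseError:
    unwrapped_parses = False
print("parses without CDATA:", unwrapped_parses)
```

The delimiters themselves do not appear in the text the application receives; only the characters between them do.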
57
XML Namespaces I
This is an extension to XML adopted January 1999 at
Namespaces address the problem in DTDs that labels (element and attribute names) cannot be repeated; more fundamentally (and for Schemas especially) they provide a subroutine or library capability for XML
Suppose you had an XML file with <student> and <teacher> and you wanted to write
  <student><name>you</name></student>
  <teacher><name>me<special>Prof</special></name></teacher>
This is invalid unless <name> is identical in structure for both teacher and student, as each element name in the tree must have a unique structure
We can get round it by using <studentname> and <teachername>, but this is not so satisfactory, especially if you get this conflict by joining two different sets of tags together
This is seen in XHTML when you add MathML, SMIL, SVG tags ….
58
XML Namespaces II
So we use new syntax xmlns=" to define an XML Namespace
The value of xmlns is hopefully a useful URL/URI telling you about the tags; however, this is not required
Microsoft in its cunning way uses in Office web export:
<xml xmlns:v="urn:schemas-microsoft-com:vml"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:p="urn:schemas-microsoft-com:office:powerpoint">
and teaches Internet Explorer to understand these obscure “uniform resource names” for the VML, Office and PowerPoint Namespaces respectively
xmlns is an attribute which can be used in any element (depending on the parser, you may need to declare this as an allowed attribute in the DTD/Schema)
<student xmlns="studentschema"><name> ….
59
XML Namespaces III
And when we come to teacher, use
<bigboss:teacher xmlns:bigboss="teacherschema"><bigboss:name> ….
In the above, we made the student elements the default
We can more symmetrically write
<university xmlns:bigboss="teacherschema" xmlns:downtrodden="studentschema">
  <downtrodden:student><downtrodden:name>you</downtrodden:name></downtrodden:student>
  ……..
  <bigboss:teacher><bigboss:name>me</bigboss:name></bigboss:teacher>
</university>
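A namespace-aware parser shows how the two kinds of name are kept apart. This sketch assumes Python's standard-library xml.etree.ElementTree, which rewrites each prefixed tag as {namespace}localname; the short prefixes t and s in the lookup table are chosen here and need not match the ones in the document.

```python
import xml.etree.ElementTree as ET

DOC = """<university xmlns:bigboss="teacherschema"
            xmlns:downtrodden="studentschema">
  <downtrodden:student><downtrodden:name>you</downtrodden:name></downtrodden:student>
  <bigboss:teacher><bigboss:name>me</bigboss:name></bigboss:teacher>
</university>"""

root = ET.fromstring(DOC)

# Each tag is reported with its namespace, so the two name elements are distinct.
tags = [element.tag for element in root.iter()]
print(tags)

# Queries bind their own prefixes to the namespace values.
prefixes = {"t": "teacherschema", "s": "studentschema"}
teacher_name = root.findtext("t:teacher/t:name", namespaces=prefixes)
student_name = root.findtext("s:student/s:name", namespaces=prefixes)
print(teacher_name, student_name)
```

What identifies an element is the namespace value plus the local name; the prefix (bigboss, downtrodden, t or s) is only a local abbreviation.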