CS 502 Computing Methods for Digital Libraries, Cornell University – Computer Science, Herbert Van de Sompel


1. CS 502 Computing Methods for Digital Libraries
Cornell University – Computer Science
Herbert Van de Sompel, herbertv@cs.cornell.edu
Lecture 9: Markup languages – XML

2. Problem
The richness of text:
- elements: letters, scripts, symbols
- structure: words, sentences, paragraphs, headings, tables
- appearance: fonts, layout, design, materials
- special: mathematics, music
Digital libraries must represent every variant!

3. Markup and page description languages
- Markup languages represent the structure of text, e.g., SGML, XML. The markup must be combined with a style sheet for rendering.
- Page description languages represent the appearance of text, e.g., PostScript, PDF.

4. Markup and style sheets
[Diagram: a marked-up document (document content & structure) and a style sheet (rendering instructions) are fed to rendering software, which produces a formatted document.]

5. Multiple renderings from the same marked-up document
[Diagram: one marked-up document (document content & structure) is fed to rendering software together with different style sheets (style sheet 1, style sheet 2), producing different outputs (PC display, print).]
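The idea of rendering one marked-up document several ways can be sketched in a few lines of Python. This is only an illustration: the element names, the two "style sheets" (plain dictionaries of format templates) and the `render` helper are assumptions made for the example, not part of the lecture or of any real style sheet language.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up document: element names are illustrative only.
DOC = "<entry><headword>markup</headword><definition>annotation of text</definition></entry>"

# Two hypothetical "style sheets": each maps an element name to a format template.
SCREEN_STYLE = {"headword": "** {} **", "definition": "  {}"}
PRINT_STYLE = {"headword": "{}.", "definition": "{}"}


def render(xml_text, style, separator):
    """Apply one style sheet to the document and return the formatted text."""
    root = ET.fromstring(xml_text)
    parts = [style[child.tag].format(child.text) for child in root]
    return separator.join(parts)


screen = render(DOC, SCREEN_STYLE, "\n")   # one element per line
printed = render(DOC, PRINT_STYLE, " ")    # running text
print(screen)
print(printed)
```

The document itself never changes; only the style mapping passed to `render` does.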

6. Example: Oxford English Dictionary
- The typography of the printed text represented semantic information.
- Keyboard the text, capturing all typographic information.
- Automatic parser to extract semantics (e.g., date, quotation, phonetics).
- Markup in SGML to tag semantic information.
- Separate style sheets for various editions: print, CD-ROM, online.
Before the web, yet used with the web.

7. XML - general
- Extensible Markup Language
- simplified SGML
- meta-language that allows defining markup languages for documents
- may replace HTML
- HTML can be seen as a markup language defined in XML => XHTML

8. XML – basic terminology
An XML application will usually process 3 types of documents:
- XML instance document: the document that contains the text in marked-up form
- style sheet: the document that contains the formatting instructions to be applied to an instance document
- Document Type Definition (DTD): the document that defines the grammar with which instance documents comply (elements, attributes, character set, required elements, optional elements, …)
XML Schema: similar to a DTD, but more powerful

9. XML documents – basic building blocks
- an XML document consists of one or more elements; each element has an opening tag, element content (PCDATA text or other elements) and a closing tag. Example content: Paul Smith
- an element can have attributes, specifying properties of the element; each attribute has an attribute name (name) and a quoted attribute value ("value")
- an empty element has no content and carries attributes only
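A small sketch of these building blocks, using Python's standard `xml.etree.ElementTree` parser. The element names `name` and `image` and the attributes `type` and `source` are assumptions chosen for the example; only the content "Paul Smith" comes from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element and attribute names are assumptions.
person = ET.fromstring('<name type="author">Paul Smith</name>')
print(person.tag)      # element name taken from the opening tag
print(person.attrib)   # attributes as a name -> value mapping
print(person.text)     # element content (PCDATA)

# An empty element has no content; here it carries one attribute.
empty = ET.fromstring('<image source="cover.png"/>')
print(empty.attrib, empty.text)
```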

10. XML – sample instance document (standalone)
0743204794
Kevin Davies
Cracking the Genome
20.00
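The markup of the sample did not survive in this transcript, so the following is only a plausible reconstruction: the element names `book`, `isbn`, `author`, `title` and `price` are assumptions, and only the four values are taken from the slide. The sketch shows a standalone instance document being parsed into a simple record.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction: only the four values come from the slide,
# the element names are hypothetical.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- a hypothetical book record -->
<book>
  <isbn>0743204794</isbn>
  <author>Kevin Davies</author>
  <title>Cracking the Genome</title>
  <price>20.00</price>
</book>
"""

book = ET.fromstring(SAMPLE)
# Map each child element name to its text content.
record = {child.tag: child.text for child in book}
print(record)
```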

11. XML – XML declaration
XML processing instructions:
- XML version
- character encoding used in the text
- standalone: is a DTD required to interpret this document?
attribute order is significant
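That the order inside the declaration matters can be checked with a parser. The sketch below uses Python's `xml.etree.ElementTree` (built on the expat parser); the helper name `parses` and the two test documents are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# version, then encoding, then standalone: the required order.
good = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><doc/>'
# Same pseudo-attributes, but encoding is placed before version.
bad = b'<?xml encoding="UTF-8" version="1.0" standalone="yes"?><doc/>'

print(parses(good), parses(bad))
```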

12. XML – comment line
comment line:
- will be ignored by the XML processor
- can not appear before the XML declaration
- can not reside inside an element tag
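The placement rules for comments can be tried out the same way. This is a sketch with made-up documents and a hypothetical helper `accepted`, again relying on Python's standard `xml.etree.ElementTree` parser.

```python
import xml.etree.ElementTree as ET


def accepted(document):
    """Return True if the document parses, False if the parser rejects it."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Comments after the declaration and inside element content are fine.
ok = b'<?xml version="1.0"?><!-- a comment --><doc>text<!-- another --></doc>'
# A comment in front of the XML declaration is rejected.
before_declaration = b'<!-- too early --><?xml version="1.0"?><doc/>'
# A comment inside an element tag is rejected.
inside_tag = b'<?xml version="1.0"?><doc <!-- not here -->></doc>'

for label, doc in [("ok", ok), ("before declaration", before_declaration), ("inside tag", inside_tag)]:
    print(label, accepted(doc))
```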

13. XML – the elements
0743204794 Kevin Davies Cracking the Genome 20.00
elements:
- hold no special significance for the XML processor, except for the document and style rules that are defined for them
- relationships: parent, child, ancestor, descendant
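The parent, child, ancestor and descendant relationships can be made concrete with a short sketch. The element names (`catalog`, `book`, `isbn`, `title`) are assumptions; `ElementTree` only stores children, so the parent lookup table is built by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; only the values appear on the slide.
root = ET.fromstring(
    "<catalog><book><isbn>0743204794</isbn>"
    "<title>Cracking the Genome</title></book></catalog>"
)

# ElementTree stores children only, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

book = root.find("book")
title = book.find("title")

children_of_book = [child.tag for child in book]
descendants_of_root = [el.tag for el in root.iter() if el is not root]

# Walk upward from <title> to collect its ancestors, nearest first.
ancestors_of_title = []
node = title
while node in parent_of:
    node = parent_of[node]
    ancestors_of_title.append(node.tag)

print(children_of_book, descendants_of_root, ancestors_of_title)
```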

14. XML – well formed-ness
XML is not nearly as forgiving as HTML.
An HTML browser may accept something like this: This is a paragraph. And this is another one. And yet another one.
Not so with XML. XML is picky => well-formed XML
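The tags of the HTML example are missing from this transcript. Assuming it showed paragraphs that are opened with `<p>` but never closed (a guess, as is the `<body>` wrapper), the following sketch shows how an XML parser reacts to such input compared with a properly closed version.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the transcript lost the tags; unclosed <p> elements are a guess.
sloppy = (
    "<body><p>This is a paragraph."
    "<p>And this is another one."
    "<p>And yet another one.</body>"
)
strict = (
    "<body><p>This is a paragraph.</p>"
    "<p>And this is another one.</p>"
    "<p>And yet another one.</p></body>"
)

try:
    ET.fromstring(sloppy)
    sloppy_ok = True
except ET.ParseError as error:
    sloppy_ok = False
    print("rejected:", error)

# The well-formed version parses into three paragraph elements.
paragraphs = [p.text for p in ET.fromstring(strict)]
print(paragraphs)
```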

15. XML – well formed-ness
- download the file http://www.cs.cornell.edu/courses/cs502/2001sp/slides/xmltest.xml
- open it in Notepad

16. XML – well formed-ness
- Every XML document should begin with an XML declaration (recommended, though optional, in XML 1.0).
- Every opening tag must have a closing tag.
- Tags can not overlap (they must be well-nested).
- XML documents can only have 1 root element.
- Attribute values must be in quotation marks (single or double).
- Only one value per attribute.
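Several of these rules can be checked mechanically. The sketch below feeds a handful of made-up documents to Python's standard parser; the helper `is_well_formed` and the test cases are assumptions for illustration, and the last case (the same attribute given twice) is a related rule rather than a literal reading of the slide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Made-up documents, one per rule.
CASES = {
    "well-formed": '<a id="1"><b>text</b></a>',
    "missing closing tag": "<a><b>text</a>",
    "overlapping tags": "<a><b><c>text</b></c></a>",
    "two root elements": "<a/><b/>",
    "unquoted attribute value": "<a id=1/>",
    "attribute given twice": '<a id="1" id="2"/>',
}

results = {label: is_well_formed(doc) for label, doc in CASES.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```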

17. XML – well formed-ness
The characters < and & and the sequence ]]> can not be used literally in text.
Encode "sanity characters":
- < as &lt;
- & as &amp;
- ]]> as ]]&gt;
- > as &gt;
- " as &quot;
- ' as &apos;
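As a sketch of how this encoding works in practice, Python's standard `xml.sax.saxutils.escape` function replaces `&`, `<` and `>`; the two quote characters are passed in explicitly. The sample text is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'if a < b && c > d then say "hi" & \'bye\''

# escape() handles &, < and > itself; the quote characters are added explicitly.
encoded = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(encoded)

# The parser turns the references back into the original characters.
decoded = ET.fromstring("<text>" + encoded + "</text>").text
print(decoded == raw)
```

Escaping the quote characters only matters inside attribute values; in element content they may appear as-is.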

18. XML – well formed-ness
element names must obey XML naming conventions:
- start with a letter or underscore
- can contain letters, numbers, hyphens, periods, underscores
- no spaces in names!
- no leading space after <
- a colon can only be used to separate the namespace prefix of the element from the element name
- case-sensitive
- can not start with xml, XML, xML, …
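A rough check of these naming rules can be written as a regular expression. The sketch below is an ASCII-only simplification (real XML names allow many more Unicode letters), it handles unprefixed names only, and the function name `is_valid_name` is made up for the example.

```python
import re

# ASCII-only simplification of the naming rules; real XML names
# allow many more Unicode characters.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_name(name):
    """Check an unprefixed element name against the rules listed above."""
    if name.lower().startswith("xml"):
        return False  # names starting with xml (in any case) are reserved
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["book", "_note", "chapter-1.2", "1st", "my name", "XmlThing"]:
    print(candidate, is_valid_name(candidate))
```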

19. XML – well formed-ness
white spaces: space, tab, line feed, carriage return
- in HTML: must explicitly write white spaces as &nbsp; because HTML processors collapse runs of white space
- not so in XML:
  - space in PCDATA stays
  - tab in PCDATA stays
  - line ends are normalized: a carriage return followed by a line feed is passed on as a single line feed
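The white space behaviour can be observed directly with a parser. This is a small sketch using Python's standard `xml.etree.ElementTree`; the element names `line` and `lines` are made up.

```python
import xml.etree.ElementTree as ET

# Spaces and tabs inside PCDATA are kept exactly as written.
element = ET.fromstring("<line>one   two\tthree</line>")
print(repr(element.text))

# A carriage return + line feed pair reaches the application as one line feed.
windows_style = ET.fromstring("<lines>first\r\nsecond</lines>")
print(repr(windows_style.text))
```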

20. XML – character references
- refer to a character by its Unicode code point: &#169; == &#xA9; == ©
- go to http://www.hclrss.demon.co.uk/unicode/
- pick some character references and include them in an XML doc
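A short sketch of character references at work: the decimal and hexadecimal forms both resolve to the same character once parsed. The element name `notice` and the helper `char_reference` are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same code point (U+00A9).
element = ET.fromstring("<notice>&#169; &#xA9;</notice>")
decimal_ref, hex_ref = element.text.split(" ")
print(decimal_ref, hex_ref, decimal_ref == hex_ref)


def char_reference(character):
    """Build the decimal character reference for a single character."""
    return "&#{};".format(ord(character))


print(char_reference("\u00a9"))
```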

