Structure: URIs (we will come back to them in lecture 3); XML; Sofix, an XML example.
Literature: Castro, Elizabeth (2001), “XML for the World Wide Web”, Peachpit Press; RFC 2396; http://openlib.org/home/krichel/lis900gp02i
Uniform Resource Identifiers (URI): A Uniform Resource Identifier (URI) is a compact string of characters for identifying an abstract or physical resource. URIs provide a simple and extensible means for identifying a resource.
Universal concept of “resource”: A resource can be anything that has identity. Not all resources are network “retrievable”. The resource identifier identifies a resource, not necessarily the state the resource is in at a particular point in time.
Benefits of uniformity: It allows different types of resource identifiers to be used in the same context, even when the mechanisms used to access those resources may differ. It allows uniform semantic interpretation of common syntactic conventions across different types of resource identifiers.
Benefits of extensibility: It allows the introduction of new types of resource identifiers without interfering with the way that existing identifiers are used. It allows the identifiers to be reused in many different contexts, thus permitting new applications or protocols to leverage a pre-existing, large, and widely used set of resource identifiers.
Transcribability: The URI syntax was designed with global transcribability as one of its main concerns. A URI is a sequence of characters, not a sequence of bytes. A URI may be transcribed from a non-network source, and thus should consist of characters that are most likely to be able to be typed into a computer. A URI often needs to be remembered by people, and it is easier for people to remember a URI when it consists of meaningful components. Therefore URIs use a restricted set of characters, drawn only from US-ASCII.
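A small sketch of what the US-ASCII restriction means in practice. It uses Python's standard `urllib.parse` module; the input string is an illustrative assumption, and the choice of UTF-8 before percent-escaping is Python's default rather than something this slide prescribes.

```python
from urllib.parse import quote, unquote

# Illustrative input: a space and two non-ASCII letters.
raw = "Dvořák symphony"

# quote() encodes the text as UTF-8 and percent-escapes every byte
# that is not an unreserved US-ASCII character.
encoded = quote(raw)
print(encoded)

# The escaped form can be typed or copied using US-ASCII alone,
# and the original characters can be recovered from it.
print(encoded.isascii())
print(unquote(encoded) == raw)
```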
XML: Stands for eXtensible Markup Language. It is a recommendation of the World Wide Web Consortium (W3C). It is a new (1998) markup language that is expected to carry a lot of content over the Internet in the future. In terms of complexity, it sits between HTML and SGML.
Importance of XML: XML will be, for the information industry, what the container is for international shipping. It is a uniform syntactic convention for encoding any piece of information expressed as textual data (i.e. as characters). The default character encoding is UTF-8, an encoding of Unicode.
HTML and XML: HTML comes with predefined tags such as HTML, HEAD, TITLE, BODY, H1, H2, P, UL, LI, IMG, A, EM, B etc. XML allows you to use any tags. XML has not yet replaced HTML. It lacks native support for images and links.
XML and SGML: SGML is the Standard Generalized Markup Language, an ISO standard (ISO 8879). It is very complicated, to the extent that no software fully implementing it has ever been written. The XML specs were written by SGML aficionados who were aware of its problems.
Original design goals: XML shall be straightforwardly usable over the Internet. XML shall support a wide variety of applications. XML shall be compatible with SGML. It shall be easy to write programs which process XML documents. The number of optional features in XML is to be kept to the absolute minimum, ideally zero. XML documents should be human-legible and reasonably clear. The XML design should be prepared quickly. The design of XML shall be formal and concise. XML documents shall be easy to create. Terseness in XML markup is of minimal importance.
Well-formed & valid XML: Every piece of data that wants to be XML has to obey a set of rules. Otherwise it is just not XML. These rules ensure that the document is “well-formed”. In addition, the XML document may obey further rules; in that case it is called “valid”.
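One way to test well-formedness is to hand the text to a parser and see whether it complains. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the element name `note` are made up for illustration. This parser checks well-formedness only; it does not check validity.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<note>hello</note>"))  # closing tag present
print(is_well_formed("<note>hello"))         # closing tag missing
```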
XML element: Syntax: <name>contents</name>, where name is the name of the element and contents is the contents of the element. <name> is called the opening tag; </name> is called the closing tag. Examples – F – Once upon a time there was…. Element names are case-sensitive. They must start with a letter or “_”. Element names must not start with “xml” in any capitalization.
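The element rules can be tried out with a parser. In this sketch the element name `story` is a hypothetical choice; it shows how name and contents are read back, and that a closing tag differing only in case is rejected.

```python
import xml.etree.ElementTree as ET

# "story" is a hypothetical element name chosen for illustration.
element = ET.fromstring("<story>Once upon a time</story>")
print(element.tag)   # the element name
print(element.text)  # the element contents

# Names are case-sensitive: </Story> does not close <story>.
try:
    ET.fromstring("<story>Once upon a time</Story>")
    case_mismatch_accepted = True
except ET.ParseError:
    case_mismatch_accepted = False
print(case_mismatch_accepted)
```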
Attributes to XML elements: Attributes are name/value pairs that further qualify element contents. Syntax: <name attribute_name="attribute_value">contents</name>. Example – 64 – con. Attribute names have to obey the same rules as element names. Attribute values must be surrounded by single or double quotes.
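A short sketch of how attributes look to a program. The names `height` and `unit` are hypothetical, picked only to have something to parse; the second half shows that an unquoted attribute value is not accepted.

```python
import xml.etree.ElementTree as ET

# "height" and "unit" are hypothetical names used for illustration.
element = ET.fromstring("<height unit='inch'>64</height>")
print(element.attrib)       # all attributes as a dictionary
print(element.get("unit"))  # one attribute value by name

# Attribute values must be quoted; an unquoted value is not well-formed.
try:
    ET.fromstring("<height unit=inch>64</height>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
print(unquoted_accepted)
```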
Empty elements: Elements that are empty may be written as <name/>. This is a shorthand for <name></name>. Empty elements may have attributes. Example: –
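To see that the short and the long form of an empty element mean the same thing, one can parse both and compare. The element name `cover` and the attribute `side` are hypothetical.

```python
import xml.etree.ElementTree as ET

# "cover" and "side" are hypothetical names used for illustration.
short_form = ET.fromstring('<cover side="front"/>')
long_form = ET.fromstring('<cover side="front"></cover>')

# Both forms yield the same name and attributes, no text, no children.
equivalent = (
    short_form.tag == long_form.tag
    and short_form.attrib == long_form.attrib
    and short_form.text == long_form.text
    and len(short_form) == len(long_form) == 0
)
print(equivalent)
```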
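Processing instructions: They are instructions to the software reading the XML. General syntax is <?target data?>, where target names the software addressed and data is the instruction passed to it.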
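A sketch of how a processing instruction reaches a program, using the standard `xml.dom.minidom` module. The target `formatter` and its data are invented for illustration; no real software is assumed to understand them.

```python
from xml.dom import minidom

# "formatter" is a hypothetical target; 'columns="2"' is its data.
doc = minidom.parseString('<?formatter columns="2"?><page/>')

# The instruction sits in the prolog, before the root element.
pi = doc.firstChild
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)
print(pi.target)  # which software the instruction is addressed to
print(pi.data)    # what is passed to that software
```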
Comments: Start with <!-- and end with -->. May not contain a double hyphen (--). Comments may not be nested, i.e. no comments inside other comments.
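The comment rules can be checked against a parser. In this sketch the helper `parses` and the element names are hypothetical; the second document breaks the double-hyphen rule.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# A well-placed comment is accepted (this parser skips it by default).
good = parses("<cd><!-- checked by hand --><title>Requiem</title></cd>")

# A double hyphen inside the comment is rejected.
bad = parses("<cd><!-- checked -- by hand --><title>Requiem</title></cd>")
print(good, bad)
```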
Nesting elements: Elements are allowed to contain other elements. Elements that contain other elements are called parent elements. Elements that are contained in another element are children of that element. Elements must be properly nested, i.e. a child element's closing tag must appear before its parent element's closing tag.
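A minimal sketch of nesting as a parser sees it. The element names `parent`, `first` and `second` are placeholders; the second document closes the parent before the child and is rejected.

```python
import xml.etree.ElementTree as ET

# Placeholder names: one parent element with two children.
parent = ET.fromstring("<parent><first>a</first><second>b</second></parent>")
child_names = [child.tag for child in parent]
print(child_names)

# Overlapping tags: the parent is closed before its child.
try:
    ET.fromstring("<parent><first>a</parent></first>")
    overlap_accepted = True
except ET.ParseError:
    overlap_accepted = False
print(overlap_accepted)
```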
Root and prolog: There must be exactly one root element that contains all other elements in the document. The prolog is what appears before the root element. The prolog may contain the XML declaration.
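The single-root rule, sketched with placeholder element names. The first document has a prolog (here just an XML declaration) followed by one root; the second has two top-level elements and is rejected.

```python
import xml.etree.ElementTree as ET

# A prolog followed by a single root element.
one_root = ET.fromstring('<?xml version="1.0"?><root><a/><b/></root>')
print(one_root.tag)

# Two top-level elements: there is no single root.
try:
    ET.fromstring("<a/><b/>")
    two_roots_accepted = True
except ET.ParseError:
    two_roots_accepted = False
print(two_roots_accepted)
```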
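XML declaration: The XML declaration is written like a processing instruction, as <?xml version="1.0"?>. If the XML declaration is there, it must be the very first thing in the document. You can declare your character set in the XML declaration, like <?xml version="1.0" encoding="charset"?>, where charset is the name of the character set.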
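A sketch of the declaration at work. The choice of ISO-8859-1 and the element name `composer` are assumptions made for the example: the bytes are not UTF-8, so the parser relies on the declared character set to read the accented letter. The last part shows that a declaration preceded by a space is rejected.

```python
import xml.etree.ElementTree as ET

# Encode the document in ISO-8859-1 and say so in the declaration.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><composer>Fauré, Gabriel</composer>'
data = text.encode("iso-8859-1")

element = ET.fromstring(data)
print(element.text)

# The declaration must come first: a leading space is an error.
try:
    ET.fromstring(b" " + data)
    misplaced_accepted = True
except ET.ParseError:
    misplaced_accepted = False
print(misplaced_accepted)
```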
Quote special symbols: & is written as &amp;, < is written as &lt;, > is written as &gt;, " is written as &quot;, ' is written as &apos;. Example
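Quoting by hand is error-prone, so libraries usually do it. The sketch below uses `escape` from Python's standard `xml.sax.saxutils`; the sample text and the element name `title` are made up. By default `escape` handles `&`, `<` and `>`; the two quote characters are passed in explicitly.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Made-up text containing all five special symbols.
text = "Tom & Jerry <\"live\"> it's"

# &, < and > are escaped by default; quotes are added via the mapping.
quoted = escape(text, {'"': "&quot;", "'": "&apos;"})
print(quoted)

# A parser turns the quoted form back into the original characters.
round_trip = ET.fromstring("<title>" + quoted + "</title>").text
print(round_trip == text)
```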
Document Type Definition (DTD): DTDs are a legacy SGML tool to further define and refine the contents of an XML document. XML can be defined by an SGML DTD. Still in use, mostly by those who have not moved on to newer tools. Not covered here, because there are more powerful replacements.
Example application: Sofix. Sofix is an XML-based cataloging format for classical music CDs. It is named after Sophie C. Rigny. It is a creation of Thomas Krichel. Used for teaching purposes only.
Key concepts in Sofix Item: an individual CD or a collection of CDs kept physically together (i.e. sold together) Work: a piece of music as recorded on a CD. For simplicity, we do not distinguish between composition and recording of that composition. Track: semantics associated with physical separation of tracks on the disk
Sofix general rules: Record all titles in English. If no English title is provided, use a translation if it is obvious. If the translation is not obvious, use the original language. All personal names are given as Lastname, Firstname. Translatable names are given in English.
Contents of an item: name of label; number of the CD (followed by the works on the CD).
Contents of a work: title of the work; year when the work was composed; year when the recording was made; name of contributor (possibly many contributors); followed by a series of tracks.
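To make the item and work contents concrete, here is a sketch that builds one record with Python's `xml.etree.ElementTree`. Every element name in it (`item`, `label`, `number`, `work`, `title`, `composed`, `recorded`, `contributor`, `track`) is a guess made for illustration, not the actual Sofix vocabulary, and all values are placeholders rather than a real CD.

```python
import xml.etree.ElementTree as ET

# All element names and values below are hypothetical placeholders.
item = ET.Element("item")
ET.SubElement(item, "label").text = "Example Label"
ET.SubElement(item, "number").text = "EX-0001"

# The works on the CD follow the label and number.
work = ET.SubElement(item, "work")
ET.SubElement(work, "title").text = "Example Symphony"
ET.SubElement(work, "composed").text = "1900"
ET.SubElement(work, "recorded").text = "2000"
# Personal names are written as Lastname, Firstname.
ET.SubElement(work, "contributor").text = "Lastname, Firstname"
# A work ends with a series of tracks.
ET.SubElement(work, "track").text = "1"
ET.SubElement(work, "track").text = "2"

record = ET.tostring(item, encoding="unicode")
print(record)
```

Parsing `record` again with `ET.fromstring` gives back the same tree, which is a quick check that what was written is well-formed.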