Introduction to XML and TEI for Digital Archives

Mark McDayter Department of English, Western University

Links for this Workshop
Links for the TEI, the TEILite Schema, the oXygen XML Editor (if you have not already downloaded this), and three rich text files for practice editing can be found here:

What Is Markup Language?
Computers have not yet developed the capacity to read “natural language” – the kind of language that humans write, read, and comprehend – in very sophisticated ways. To permit computers to “read,” and so process, texts in the various useful ways that we’d like them to, markup languages have been developed. These flag textual structures, words, and meanings in ways that computers can be programmed to understand. With the aid of markup languages, computers can display, parse, search, and manipulate texts.

Presentational vs. Semantic Markup
By far the most common markup language today is HTML, or HyperText Markup Language, the language behind the vast majority of web pages. HTML is (at the moment) a presentational markup language: its primary job is to determine how text looks on your computer screen. Semantic markup language, on the other hand, attaches particular meanings to text. It “interprets” text for computer software. The most important semantic markup language today is a sort of “cousin” of HTML, and is called XML or eXtensible Markup Language.

XML – Nearly as Cool as It Sounds
XML is actually a “metalanguage”: it provides a set of structures and rules that allows one to create one’s own specialized markup languages. Like HTML, XML uses “tags” or “elements” that label parts of the text. These tags may also have “attributes” that provide further information. Text that one wishes to mark is nestled within opening and closing tags.
“An empty taxi drew up at the <place name="House of Commons">Commons</place> and <person name="Clement Attlee">Mr Attlee</person> got out.”
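To see what software can do with tags and attributes, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It parses a well-formed version of the taxi sentence and reports each element's name, attributes, and marked-up text. The <s> wrapper element and the describe helper are assumptions added for illustration; they are not part of the workshop materials.

```python
import xml.etree.ElementTree as ET

# A well-formed version of the taxi sentence. The <s> root element is a
# hypothetical wrapper: every XML document needs a single root.
sentence = (
    '<s>An empty taxi drew up at the '
    '<place name="House of Commons">Commons</place> and '
    '<person name="Clement Attlee">Mr Attlee</person> got out.</s>'
)

root = ET.fromstring(sentence)


def describe(element):
    """Return (element name, attributes, marked-up text) for one element."""
    return element.tag, dict(element.attrib), element.text


# One entry per tagged span, in document order.
tagged = [describe(child) for child in root]

for name, attributes, text in tagged:
    print(name, attributes, text)
```

The attribute carries the full, normalized name, while the marked-up text keeps the wording of the source, so a search for "Clement Attlee" can find a passage that only says "Mr Attlee".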

Anatomy of an Element
<date when="1532" calendar="Gregorian">1532</date>
Element name: date
Attributes: when, calendar
Attribute values: "1532", "Gregorian"
Marked-up text: 1532
Closing tag: </date>

Healthy XML: Keeping “well formed”
The basic rules of XML are fairly simple and straightforward.
A root element is needed: all elements in an XML document must be contained within a single “root” element.
All elements need a closing tag: XML elements require both an opening and a closing tag. “Empty” elements are signified by placing a forward slash at the end of the element name: <page/>
Elements must be nested within each other: each element (except the root) must be neatly and completely tucked away within a parent element.
Good: <poem>. . .<stanza>. . .</stanza>. . .</poem>
Bad: <poem>. . .<stanza>. . .</poem>. . .</stanza>
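These rules can be tested mechanically. The sketch below, again using Python's standard xml.etree.ElementTree module, asks a parser whether a few small samples are well formed. The is_well_formed helper and the sample strings are illustrative assumptions, not part of any TEI tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "properly nested": "<poem><stanza>...</stanza></poem>",
    "overlapping tags": "<poem><stanza>...</poem></stanza>",
    "two root elements": "<poem/><poem/>",
    "empty element": "<poem><page/></poem>",
    "missing closing tag": "<poem><stanza>...</poem>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(label, "->", "well formed" if ok else "NOT well formed")
```

A parser stops at the first well-formedness error, so a single stray tag is enough to make a whole file unreadable.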

Healthy XML II
Case is important: you may use upper or lower case, or a combination of both, for your elements, but you must be consistent. <Name>, <NAME>, and <name> will all be read as different elements.
Attribute values must be enclosed in quotation marks: single or double straight quotation marks may be used, but not typographic (“curly”) ones. <name type="person">. . .</name>
Entity references must be declared: in effect, this means that named special characters (except the five built-in ones: &lt; &gt; &amp; &apos; &quot;) must be declared, normally in the DTD that accompanies the particular flavour of XML.
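A short sketch of how a parser treats case, attribute quoting, and undeclared entities, using Python's standard xml.etree.ElementTree module. The parse_error helper and the sample strings are assumptions made for illustration, and the exact wording of the error messages depends on the parser version.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as error:
        return str(error)


# Case matters: these are three different element names to the parser.
doc = ET.fromstring('<names><Name/><NAME/><name type="person"/></names>')
distinct_names = {child.tag for child in doc}

checks = {
    "mismatched case": parse_error("<Name>x</name>"),
    "unquoted attribute": parse_error("<name type=person>x</name>"),
    "single quotes": parse_error("<name type='person'>x</name>"),
    "built-in entity": parse_error("<p>Tom &amp; Jerry</p>"),
    "undeclared entity": parse_error("<p>caf&eacute;</p>"),
}

for label, message in checks.items():
    print(label, "->", message or "ok")
```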

Additional Bits and Pieces
Processing Instructions: These are not elements, per se, but instructions to the computer on how to parse or render the marked-up text. The XML declaration that opens most files uses the same syntax: <?xml version="1.0" encoding="UTF-8"?>
Entities: These are most usually special characters (not in the ASCII set). Named entities must be “declared,” and appear within the text between an opening “&” and a closing “;”. Characters can also be given by number, which needs no declaration: for instance, &#x17F; = ſ
CDATA: This is text that you don’t want the parser to interpret as markup. Usually, it contains characters that would otherwise be “illegal,” such as the < and & that might appear in a script. <![CDATA[ if (a < b && a < 0) then ]]>
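The following sketch shows what a parser hands back for each of these constructs, using Python's standard xml.etree.ElementTree module. The element names (note, script, word, pair) are invented for the example, and the numeric reference &#x17F; for the long s is used because it needs no declaration.

```python
import xml.etree.ElementTree as ET

# Bytes are used so that the encoding named in the declaration is honoured.
snippet = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<note>'
    '<script><![CDATA[ if (a < b && a < 0) then ]]></script>'
    '<word>&#x17F;ong</word>'
    '<pair>Tom &amp; Jerry</pair>'
    '</note>'
).encode("utf-8")

root = ET.fromstring(snippet)

# The CDATA wrapper is gone; its contents arrive untouched, "<" and "&" included.
script_text = root.find("script").text

# The numeric character reference has been replaced by the long s itself.
word_text = root.find("word").text

# The built-in entity &amp; has been replaced by a plain ampersand.
pair_text = root.find("pair").text

print(repr(script_text))
print(word_text)
print(pair_text)
```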

“Valid” vs. “Well Formed”
An XML file is “well formed” so long as it complies with the basic rules for all XML files, as given above. It becomes “valid,” on the other hand, when it additionally complies with the specific rules for any given form or “flavour” of XML. A file must be “well formed” to be parsed at all; it must also be “valid” if software is to rely on it following the rules of a particular flavour.
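The difference can be made concrete with a toy example. Python's standard library parses XML but does not validate it against a DTD or schema, so the sketch below stands in for a real validator with a hand-written table of rules. The ALLOWED_CHILDREN table, the element names, and the check function are all hypothetical; in practice a DTD or RELAX NG schema and a validating tool do this job.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for a toy "flavour": the required root element and the
# child elements each element is allowed to contain.
ROOT = "poem"
ALLOWED_CHILDREN = {
    "poem": {"title", "stanza"},
    "stanza": {"line"},
    "title": set(),
    "line": set(),
}


def check(xml_text):
    """Return 'not well formed', 'well formed but invalid', or 'valid'."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "not well formed"
    if root.tag != ROOT:
        return "well formed but invalid"
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:  # an element the toy flavour does not define
            return "well formed but invalid"
        if any(child.tag not in allowed for child in parent):
            return "well formed but invalid"
    return "valid"


valid_doc = "<poem><title>T</title><stanza><line>a</line></stanza></poem>"
invalid_doc = "<poem><line>a</line></poem>"   # a line outside any stanza
broken_doc = "<poem><stanza></poem>"          # stanza never closed

for doc in (valid_doc, invalid_doc, broken_doc):
    print(check(doc))
```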

Flavours of XML and TEI
As noted, XML is actually a “metalanguage,” rather than a language itself: it provides the rules and structures from which specific semantic markup languages are built. Each of these “flavours” is designed to serve a particular need. The XML markup language for text that has become the international scholarly standard for the humanities is the Text Encoding Initiative, or TEI. The most recent version of the TEI tagset is called “P5,” and dates from 2007.

The TEI Guidelines
The “rules” for TEI XML are laid out for human consumption in a very long, and rather complicated, document called the TEI Guidelines. Those same “rules” are also expressed in three machine-readable formats that, with the assistance of a “validator,” ensure that any given TEI-compliant text is following the rules laid out in the Guidelines. Any one of these may be used to validate a document:
DTD (Document Type Definition)
W3C Schema
RELAX NG Schema
The last of these is now most often used for TEI documents.

TEI Modules and Customizations
The full TEI has over 500 individual tags, but is designed as a modular system. Users can choose from 21 modules to add to the “base” tag sets. These include tagsets for marking up poetry, prose, and drama, as well as special tags for manuscripts, textual criticism, and linguistics. Users can also, within limits, modify the tag set to suit their own needs. Customizations are defined in an ODD, or “One Document Does it all” file, which is a specialized TEI document that describes the customization in both natural and machine-readable languages. Processing the ODD through the TEI’s “Roma” application generates a DTD or Schema that applies the customization to the documents it is intended to describe.

Modeling Text
Before deciding on a particular customization of TEI, it is important to give some consideration to what aspects of the text you wish to highlight and structure. When a text is marked up, it is changed, even if not a single word of the original is altered. Marking up privileges particular elements, meanings, and structures in a text. And it is always interpretational to some degree or another. Marking up a text according to one customization or another should therefore represent a conscious decision to structure it in a particular way so as to enable certain kinds of operations upon it.

Considerations for Modeling
Modeling your marked-up text involves the consideration of a number of factors. Don’t feel too constrained by “common practice”: you are the information architect for the project. What sorts of information do you want to flag? Who will be using it, and for what purposes? What is the nature of the material you are transcribing? How do you want to structure it? What sorts of renditional features do you want to capture? Through what sort of interface will this material be made available?

What TEI XML Is Not Good At.
It is also important to bear in mind the things that TEI (and XML in general) is not very well suited to doing. One of these is showing parallel or overlapping structures. XML is intensely hierarchical, and, while there are some workarounds built into TEI intended to facilitate the representation of parallel structures, these are a bit clumsy and imperfect. TEI (and XML in general) is also not very good at representing ambiguity. Semantic markup is best at identifying clearly quantifiable and verifiable data, but it can be very difficult to represent “fuzzy” or uncertain meanings, or even “double” ones. Again, TEI provides a workaround (through the <choice> element), but this is still a far-from-perfect solution.
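As a rough illustration of the <choice> workaround, the sketch below encodes one word in two ways, an original spelling in <orig> and a regularized one in <reg>, and lets software produce either reading. It uses Python's standard xml.etree.ElementTree module; the sample line, its old spelling, and the reading helper are invented for the example, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# One verse line in which a single word carries two alternative encodings.
line = ET.fromstring(
    "<l>Which <choice><orig>nobodie</orig><reg>nobody</reg></choice> can deny.</l>"
)


def reading(element, prefer):
    """Flatten element to plain text, keeping only the preferred child of each <choice>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "choice":
            chosen = child.find(prefer)
            if chosen is not None:
                parts.append(reading(chosen, prefer))
        else:
            parts.append(reading(child, prefer))
        # Text that follows the child's closing tag belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


original = reading(line, "orig")
regularized = reading(line, "reg")

print(original)
print(regularized)
```

Both readings live in the same file, but the software still has to pick one at a time, which is part of why the solution is far from perfect for genuinely ambiguous text.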

Thank God for TEILite
TEILite is a simplified customization of the TEI Guidelines designed to serve the needs of “90% of TEI users 90% of the time.” It is probably the most popular customization of the TEI. The DTD and Schemas for TEILite do not need to be customized, built, or generated: they are available in a ready-to-go form online, through the TEI web site. TEILite also comes preloaded in the oXygen XML Editor. When it is loaded in this editor, the editor automatically inserts all tags that are obligatory parts of the schema.

MetaBoring: The TEI Header
All TEI-compliant files have two main sections: the “text” portion, which includes your actual marked-up electronic text, and the “teiHeader,” which contains the metadata associated with your new electronic text. The header must be present, and can be used to provide a variety of information about your text, including:
Information on your original source
An outline of your editorial and markup procedures
Responsibility statements
Revision history
Keywords relating to your text
Catalogue information

A Simple Header
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A Broad-side against Coffee: An Electronic Edition</title>
    </titleStmt>
    <publicationStmt>
      <p>Published as an example for the header module of TEILite.</p>
    </publicationStmt>
    <sourceDesc>
      <p>Anonymous Broadside. London: Printed for J.L.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
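Because the header is structured, software can pull metadata out of it directly. The sketch below reads the title and source description from the simple header using Python's standard xml.etree.ElementTree module. It assumes the TEI P5 namespace, which a complete file normally declares on its root <TEI> element; declaring it on <teiHeader> here is a shortcut to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used in the search paths below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

header = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>A Broad-side against Coffee: An Electronic Edition</title>
    </titleStmt>
    <publicationStmt>
      <p>Published as an example for the header module of TEILite.</p>
    </publicationStmt>
    <sourceDesc>
      <p>Anonymous Broadside. London: Printed for J.L.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>"""

root = ET.fromstring(header)

# Each path walks down from <teiHeader> to the element holding the metadata.
title = root.findtext("tei:fileDesc/tei:titleStmt/tei:title", namespaces=TEI_NS)
source = root.findtext("tei:fileDesc/tei:sourceDesc/tei:p", namespaces=TEI_NS)

print("Title: ", title)
print("Source:", source)
```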

A Simple Body
<text>
  <front>
    <div type="contents">
      <head>Table of Contents</head>
      <list>
        <item>I. The Decision</item>
        [. . .]
      </list>
    </div>
  </front>
  <body>
    <p>[. . .]</p>
  </body>
  <back>
    <div type="colophon">[. . .]</div>
  </back>
</text>

Simple Text Structure: Prose
<body>
  <head>Spectacularly Interesting Heading</head>
  <head>Witty Subheading</head>
  <div type="section" n="3">
    <head>Section Heading Sure to Tantalize</head>
    <div type="subsection" n="3.1">
      <head>Subsection Heading</head>
      <p>. . .</p>
    </div>
    <div type="subsection" n="3.2">
      <head>Another Exciting Heading</head>
    </div>
  </div>
</body>

Simple Text Structure: Verse
<body>
  <lg type="poem" n="1">
    <head>Poem’s Title.</head>
    <lg type="stanza" n="1">
      <l n="1">First line of poem . . .</l>
      <l n="2">Second line of poem . . .</l>
      <l n="3">Third line of poem . . .</l>
      <lg type="refrain">
        <l n="4">Which nobody can deny.</l>
      </lg>
    </lg>
  </lg>
</body>
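Once verse is marked up this way, its structure can be queried. The sketch below uses Python's standard xml.etree.ElementTree module to list every line number in the poem and to pull out only the refrain. It assumes the refrain is nested inside the stanza and omits the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET

poem = ET.fromstring("""<lg type="poem" n="1">
  <head>Poem's Title.</head>
  <lg type="stanza" n="1">
    <l n="1">First line of poem . . .</l>
    <l n="2">Second line of poem . . .</l>
    <l n="3">Third line of poem . . .</l>
    <lg type="refrain">
      <l n="4">Which nobody can deny.</l>
    </lg>
  </lg>
</lg>""")

# Every <l> anywhere inside the poem, in document order.
line_numbers = [line.get("n") for line in poem.iter("l")]

# Only the lines that sit directly inside a line group typed as a refrain.
refrain = [line.text for line in poem.findall(".//lg[@type='refrain']/l")]

print(line_numbers)
print(refrain)
```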

Simple Text Structure: Drama
<div type="act">
  <head>ACT I</head>
  <div type="scene">
    <head>SCENE 1: Humlet Says Hi.</head>
    <stage type="setting">[description of stage]</stage>
    <stage type="entrance">[stage directions]</stage>
    <sp>
      <speaker>Humlet</speaker>
      <p>Hello Danish World!</p>
    </sp>
  </div>
</div>
