Chapter 3 - Modeling Information
Learning XML by Erik T. Ray
Slides were developed by Jack Davis, College of Information Science and Technology, Radford University. August 2006.

Designing a Markup Language
Markup language design should start with questions about the requirements for the language:
- How detailed do you need it to be?
- How will it be generated?
- Is it flexible enough to handle every expected situation?
- Is it generic enough to support different formatting options and modes?
Answering these questions will lead to a design for representing information with XML. This chapter shows how different kinds of information are modeled using XML.

Simple Data Storage
XML can be used like an extremely basic database. Data has been stored in files as tables for a long time. The following is an excerpt from the /etc/passwd file found on Unix systems:
    nobody:*:-2:-2:UnprivilegedUser:/nohome:/noshell
    root:*:0:0:System Administrator:/var/root:/bin/tcsh
    daemon:*:1:1:System Services:/var/root:/noshell
    smmsp:*:25:25:Sendmail User:/private/etc/mail:/noshell
Data like this can be parsed, but it has problems. Certain characters aren't allowed in a value (a colon, for example, would be read as a field separator). Each record lives on a separate line, so data can't span lines. A syntax error is easy to create and may be difficult to locate. XML's explicit markup gives it natural immunity to these types of problems.
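
To make the contrast concrete, here is a minimal Python sketch that wraps one colon-delimited record in elements. The element names (user, name, gecos, and so on) and the helper passwd_line_to_element are illustrative assumptions, not taken from the book or the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names for the seven passwd fields.
FIELDS = ["name", "password", "uid", "gid", "gecos", "home", "shell"]

def passwd_line_to_element(line):
    """Convert one colon-delimited passwd record into a <user> element."""
    values = line.rstrip("\n").split(":")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    user = ET.Element("user")
    for field, value in zip(FIELDS, values):
        ET.SubElement(user, field).text = value
    return user

record = passwd_line_to_element("root:*:0:0:System Administrator:/var/root:/bin/tcsh")

# A colon inside a value would corrupt the flat file, but it is harmless
# once the value is delimited by start and end tags.
record.find("gecos").text = "Admin: System"
xml_text = ET.tostring(record, encoding="unicode")
restored = ET.fromstring(xml_text)
```

After the round trip, restored.find("gecos").text is still "Admin: System", whereas the same value written into the flat file would split into eight fields instead of seven.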

Dictionaries
A dictionary is a simple one-to-one mapping of properties to values. A property has a name, or key, which is a unique identifier. It's just a table with two columns, but it's an easy way to serialize data. Apple selected XML as its format for preference files (called property lists). Here's a chess program property list:
    BothSides
    Level 1
    PlayerHasWhite
    SpeechRecog
The data is stored in tabular form. Each "row" is a pair of elements, key and value. Values have different types: boolean, integer, etc.
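
As a rough illustration of how such a key/value file is read, the sketch below parses a property list with Python's standard plistlib module. The four keys come from the slide; the boolean values and the plist version attribute are assumptions made for the example, since only Level's value of 1 is given above.

```python
import plistlib

# Keys are from the slide; the true/false values are assumed for illustration.
PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>BothSides</key><false/>
  <key>Level</key><integer>1</integer>
  <key>PlayerHasWhite</key><true/>
  <key>SpeechRecog</key><false/>
</dict>
</plist>
"""

# Each key/value pair of elements becomes one entry in a Python dict,
# and each value keeps its type (bool, int, ...).
prefs = plistlib.loads(PLIST)
level = prefs["Level"]
plays_white = prefs["PlayerHasWhite"]
```

The typed value elements are what let the reader get an int for Level and a bool for PlayerHasWhite without any extra conversion code.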

Records
A database consists of records which follow a consistent format (each has the same fields) with keys. A personnel database would have a record for each employee. This example is a simple record-style XML document used for expense tracking (checkbook XML document).
You can fairly quickly write a program that will calculate the ending balance. This example shows how XML makes reading and accessing data easy for the programmer. What's more, the XML is flexible enough to allow you to restructure the data without rewriting the program. Adding new fields, such as an ID attribute or time element, wouldn't affect the program a bit. With an ad hoc solution like the colon-delimited /etc/passwd file, you would not have that kind of flexibility.
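
A balance calculation of this kind might look like the following Python sketch. The checkbook document is not reproduced in this transcript, so the element and attribute names below (checkbook, balance-start, deposit, payment, amount) and all the figures are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical checkbook; one payment carries an extra id attribute and
# a time element to show that unknown fields are simply ignored.
CHECKBOOK = """
<checkbook balance-start="2460.62">
  <deposit type="direct-deposit">
    <payor>Employer</payor>
    <amount>1000.00</amount>
    <date>21-6-00</date>
  </deposit>
  <payment type="check" number="980" id="p1">
    <payee>Landlord</payee>
    <amount>600.00</amount>
    <date>23-6-00</date>
    <time>09:30</time>
  </payment>
  <payment type="atm">
    <amount>40.00</amount>
    <date>24-6-00</date>
  </payment>
</checkbook>
"""

def ending_balance(xml_text):
    """Add deposits to and subtract payments from the starting balance."""
    root = ET.fromstring(xml_text)
    balance = Decimal(root.get("balance-start"))
    for entry in root:
        if entry.tag == "deposit":
            balance += Decimal(entry.findtext("amount"))
        elif entry.tag == "payment":
            balance -= Decimal(entry.findtext("amount"))
    return balance

final = ending_balance(CHECKBOOK)
```

Because the function looks fields up by name rather than by position, the added id attribute and time element have no effect on the result.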

XML & Databases
XML is very good at modeling simple data structures. XML is easier to modify than flat files, with minimal impact on processing software, so you can add or remove fields as you like. Writing programs to process the data is easy, since much of the parsing work has been abstracted out, and plenty of interfaces are available.
The downside is that XML is not optimized for rapid, repetitive access. An XML parser generally has to read the document from the start to pick out even a single detail, which is a huge overhead. Databases provide faster access, but structural changes are harder, and incompatibilities between database systems cause problems when combining data. Many applications now store XML in databases, which provides both speed and flexibility.

Narrative Documents
A narrative document contains text meant to be read by people. Web pages, books, journals, articles, and essays are all narrative documents. These documents have some common traits:
- Order: the order of elements is inviolate. Text runs in a single path called a flow (try reading from back to front).
- Sections are elements that break up the document into parts like chapters, subsections, etc.
- Blocks are rectangular regions like titles and paragraphs.
- Inlines are the strings inside blocks.

Narrative Documents - Flow
Typically a narrative document contains one primary flow. There usually are some short tangential flows like sidebars, notes, tips, warnings, footnotes, etc. The main flow is formatted as a column, while other flows are in boxes or moved to the side or the very end, with some kind of link (e.g., a footnote). XHTML does not support more than one flow. Others, like DocBook, do support multiple flows, encapsulating them as elements inside the main flow.
Sections are coded in two common ways. Here's the first way:
    A major section
    Small section
    Some text…
    Another little section
    Some text…

Flow (cont.)
Here's the second way:
    Major section
        Small section
            Text…
        Second subsection
            Text…
The first method is called a flat structure. It relies on presentation details to divine where parts of the document begin and end. In this case, a bigger head means a larger section is starting and a smaller head indicates a subsection is starting. The second method is a hierarchical structure, in which each section element contains its own subsections. It's harder to write programs to recognize the details of flat structures than of hierarchical structures.
Is XHTML flat or hierarchical?
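
The difference in processing effort can be sketched in Python. The markup below is a guess at what the two stripped examples looked like: h1, h2, and p for the flat form, and nested section and title elements for the hierarchical form are assumptions, as are both function names.

```python
import xml.etree.ElementTree as ET

FLAT = """<doc>
  <h1>A major section</h1>
  <h2>Small section</h2>
  <p>Some text</p>
  <h2>Another little section</h2>
  <p>Some text</p>
</doc>"""

NESTED = """<doc>
  <section><title>Major section</title>
    <section><title>Small section</title><p>Text</p></section>
    <section><title>Second subsection</title><p>Text</p></section>
  </section>
</doc>"""

def paragraphs_by_subsection_flat(xml_text):
    """Infer section boundaries from head elements while scanning in order."""
    groups = {}
    current = None
    for child in ET.fromstring(xml_text):
        if child.tag == "h1":
            current = None              # a major head closes any open subsection
        elif child.tag == "h2":
            current = child.text        # a small head opens a new subsection
            groups[current] = []
        elif child.tag == "p" and current is not None:
            groups[current].append(child.text)
    return groups

def paragraphs_by_subsection_nested(xml_text):
    """The containment is explicit, so a path expression is enough."""
    root = ET.fromstring(xml_text)
    return {s.findtext("title"): [p.text for p in s.findall("p")]
            for s in root.findall("./section/section")}

flat_groups = paragraphs_by_subsection_flat(FLAT)
nested_groups = paragraphs_by_subsection_nested(NESTED)
```

The flat version has to carry state from one element to the next to work out where a subsection ends, while the hierarchical version reads the boundaries directly from the element nesting.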

Blocks & Inlines
A block is an element that contains a segment of a flow and is usually formatted as a rectangular region, separated from other blocks by space above and below. Blocks hold mixed content, both character data and elements. Paragraphs, section heads, and list items are examples of blocks.
Elements inside blocks are called inline elements because they follow the line of text. They begin and end within the lines, scanning from left to right. Inlines are used to mark words or phrases for formatting that differs from the surrounding text. Examples include emphasis, glossary terms, and important names.
    R. Buckminster Fuller once said, When people learned to do more with less, it was their lever to industrial success.

Blocks & Inlines (cont.)
What kind of element is para? What kind of element is person?
There are different reasons to use inlines. One is to control how the text formats. In this case, a formatter will probably replace the quote start and end tags with quotation marks. For emphasis elements, it might render the contents in italic, underline, or bold.
Another role for inlines is to mark text for special processing. The person element may have no special treatment by the formatter, but could be useful in other ways. Marking items as "person," "place," "definition," or whatever makes it possible to mine data from the document to generate indexes, glossaries, search tables, and more.
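
Both roles of inlines can be shown in a short Python sketch. The tags were stripped from the Fuller example in this transcript, so the markup below is an assumed reconstruction using the para, person, quote, and emphasis elements the slides name; the toy render function is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the example: one block (para) with inlines inside.
PARA = ("<para><person>R. Buckminster Fuller</person> once said, "
        "<quote>When people learned to do <emphasis>more</emphasis> with "
        "<emphasis>less</emphasis>, it was their lever to industrial success."
        "</quote></para>")

para = ET.fromstring(PARA)

# Special processing: mine the names out of the text, e.g. for an index.
people = [p.text for p in para.iter("person")]

def render(elem):
    """Toy formatter: quotation marks for quote, asterisks for emphasis."""
    before, after = {"quote": ('"', '"'), "emphasis": ("*", "*")}.get(elem.tag, ("", ""))
    body = (elem.text or "") + "".join(render(c) + (c.tail or "") for c in elem)
    return before + body + after

formatted = render(para)
```

Here person gets no special formatting at all, yet it is the element that makes the name list possible, which is the point the slide makes about marking text for processing rather than for appearance.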

Complex Structures
Not all structures found in narrative documents can be so readily classified as blocks or inlines. A table is not really a block, but an array of blocks. An illustration has no character data, so it can't be considered a block. Lists also have their own rules, with indentation, autonumbering or bullets, and nesting.
These structures usually remain inside the flow, interrupting the surrounding text briefly. Structures like figures and tables may float within the flow, meaning that the formatter has some leeway in placing the objects to produce the best page layout. Some objects have captions with references in the text like, "the data is summarized in Table 3." A simple attribute (float="yes") may be sufficient to represent this capability in the markup.
Complex objects behave a little like blocks in that they are usually separated vertically from each other and the surrounding text. The formatting details are handled via styles.

Metadata
Metadata is information about the document that is not part of the flow. It's useful to keep with the rest of the document, but it's not formatted or rendered. Examples include author name, copyright date, publisher, revision history, ISBN, and catalog number.
In XHTML, the head element is reserved for non-renderable information, including metatags. The head carries the title, descriptive terms for search engines, links to stylesheets, etc.

Linked Objects
Linked objects are elements that act as bookmarks in a document. A cross reference is an element that refers to a section or object somewhere else in the document. When formatted, it may be replaced with generated text, such as the section number or title of the referenced object. It may also be turned into a hyperlink.
An invisible marker is another linked object in narrative documents. It has no overt function in the flow other than to mark a location so that later, when generating an index, you can calculate a page number or create a hyperlink. Index items often span a range of pages, so an item might be captured with two markers, one at the beginning and one at the end.
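
The replacement of a cross reference with generated text could be sketched as follows in Python. The xref element with a linkend attribute follows DocBook's naming, but the sample document, the "Section N, Title" label format, and the resolve_xrefs helper are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
  <section id="install"><title>Installation</title>
    <para>Before you begin, read <xref linkend="trouble"/>.</para>
  </section>
  <section id="trouble"><title>Troubleshooting</title>
    <para>See <xref linkend="install"/> for setup steps.</para>
  </section>
</doc>"""

def resolve_xrefs(xml_text):
    """Fill each empty xref with the number and title of its target section."""
    root = ET.fromstring(xml_text)
    labels = {}
    # Number the sections in document order and remember a label per id.
    for number, section in enumerate(root.iter("section"), start=1):
        labels[section.get("id")] = f"Section {number}, {section.findtext('title')}"
    for xref in list(root.iter("xref")):
        xref.text = labels[xref.get("linkend")]
    # Return the text of each paragraph as a formatter would see it.
    return ["".join(p.itertext()) for p in root.iter("para")]

paragraphs = resolve_xrefs(DOC)
```

Because the generated text is computed from the target at formatting time, renumbering or retitling a section updates every reference to it without touching the paragraphs.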

XHTML
Simplicity is what has made HTML so popular. It's used by millions. It's good enough to model almost any simple document as long as you don't mind its limitations: single-column format, flat structure, and lack of page-oriented features.
Graphic designers need better page layout capability and stylesheet granularity. Web developers want better structure and navigation. Librarians and researchers want more detailed metadata and search capability. Users with special needs want more localization and customization.
The best feature of HTML is hypertext: text that spans documents. Documents are typically small, with many nonlinear flows. It's easy to get lost, so navigation aids are critical. Still, the basic block and inline tag structures are present. See Example 3-4, a Unix manual page.

Unix Manual Page
It has a flat document structure. No elements were used to contain and divide sections. Although div is used in HTML, normally it's used to apply styles, not for document division.
The inlines are tt and i. The names are abbreviations for presentational terms, "teletype" and "italic." So we're forced to mark up with tags that are associated with how the rendering is done, not with what the content is.
Blocks have been forced into generic roles. The paragraph under the head "SYNOPSIS" isn't really a paragraph. It would be better to use an element strictly for synopses or code listings, but one isn't available in HTML.
Using HTML for this example has its good and bad points. HTML is easy to use, so it's quick. However, the document is fit for only one purpose: display in a browser. A printout will probably look bad, and the document can't readily be indexed or searched.

DocBook
DocBook is a markup language designed specifically for technical documentation, modeling everything from one-page user manuals to thousand-page tomes. Like HTML, it predates XML and was first an SGML application.
Here's the Unix manual page done in DocBook. Note how the element types are much more specific. Also, note the section elements. Open Example 3-5. What is the root tag in the document?
DocBook is very closely bound to the type of document you're authoring. A book document would look much different in DocBook. DocBook is more complex than HTML, but much more flexible in terms of formatting and processing. Example 3-6 is a DocBook book example.

Complex Data
XML makes it easy to build a tag set that will represent complex data. Multimedia formats like Scalable Vector Graphics (SVG) and Synchronized Multimedia Integration Language (SMIL) map pictures and movies into XML markup. Complex ideas in the scientific realm are also coded as XML. Consider MathML, which is used to represent complex math equations. XML uses elements and attributes to model even very complex structured documents.
See Example 3-7, SVG in XML:
- being an XML application, it can be tested for well-formedness
- it can be edited in a generic editor
Molecular Dynamics Language (MoDL) encodes molecules in XML. See Example 3-8.
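
A well-formedness test needs nothing SVG-specific, which a small Python sketch can show. The two tiny SVG strings below are made up for the example (they are not Example 3-7), and is_well_formed is a hypothetical helper built on the standard library parser.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def is_well_formed(xml_text):
    """Return True if a generic XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

GOOD_SVG = ('<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
            '<circle cx="50" cy="50" r="40" fill="red"/></svg>')
# The circle start tag is never closed, so this is not well-formed XML.
BAD_SVG = ('<svg xmlns="http://www.w3.org/2000/svg">'
           '<circle cx="50" cy="50" r="40"></svg>')

good = is_well_formed(GOOD_SVG)
bad = is_well_formed(BAD_SVG)

# The same generic tools can also inspect the picture's structure.
circle_count = len(list(ET.fromstring(GOOD_SVG).iter(SVG_NS + "circle")))
```

This only checks well-formedness; whether the elements and attributes form a valid SVG picture is a separate question that a generic parser does not answer.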

RSS
RSS (Rich Site Summary or Really Simple Syndication) was created by Netscape to describe content on web sites. They wanted a portal that was customizable, allowing readers to subscribe to particular subject areas or channels. Each time readers returned to the site they would see updates on their favorite topics, saving them time hunting for this news on their own.
There are different models of publishing with RSS. In the pull model, a content aggregator checks an RSS file periodically to see if anything has been updated, pulling in new articles as they appear. In the push model, also called publish and subscribe, the information source informs the content aggregator when it has something new. Example 3-11 is a sample RSS file.
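
The pull model can be sketched in a few lines of Python. The feeds below are invented (they are not Example 3-11) and use example.com links; new_items is a hypothetical helper, and a real aggregator would fetch the file over HTTP on a timer rather than read it from a string.

```python
import xml.etree.ElementTree as ET

def new_items(rss_text, seen_links):
    """Return (title, link) pairs not seen on earlier polls; updates seen_links."""
    fresh = []
    for item in ET.fromstring(rss_text).iter("item"):
        link = item.findtext("link")
        if link not in seen_links:
            fresh.append((item.findtext("title"), link))
            seen_links.add(link)
    return fresh

# Two snapshots of the same hypothetical feed, one day apart.
FEED_MONDAY = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>First story</title><link>http://example.com/1</link></item>
  <item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>"""
FEED_TUESDAY = FEED_MONDAY.replace(
    "</channel>",
    "<item><title>Third story</title><link>http://example.com/3</link></item></channel>")

seen = set()
first_poll = new_items(FEED_MONDAY, seen)    # everything is new
second_poll = new_items(FEED_MONDAY, seen)   # nothing has changed
third_poll = new_items(FEED_TUESDAY, seen)   # only the added story
```

Using the link as the identity of an item is a simplifying assumption; the aggregator keeps the set of seen links between polls so that only additions are pulled in.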