Chapter Five: Markup and Metadata: Elements of Organization
How to Build a Digital Library
Ian H. Witten and David Bainbridge


Digital Library Elements
- Basic building blocks
  - Documents
- Basic elements of organization
  - Markup
    - Controls structure
    - Controls appearance
  - Metadata
    - Expedites access

Structural Markup
- Identifies and maintains the document structure:
  - Section divisions
  - Headings
  - Subsection structure
  - Lists
  - Quotations
  - Tabular information
- Structural markup items become metadata

Presentation Markup
- Specifies how the document will appear typographically by formatting the document:
  - Page size
  - Headers and footers
  - Font
  - Line spacing
  - Section headers
  - Figures

Document Style Sheet
- The design of the document relates structure and appearance
- Structure and appearance should aid the reader's comprehension
- A style sheet is a catalog of how each structural item should be presented

What is Metadata?
- "Data about data" (glibly)
- Structural information about a particular information resource
- Information is "structured" if it can be meaningfully manipulated without understanding its content
- Where does "data" end and "metadata" begin?

Kinds of Metadata
- Assist in navigating a document
  - Structural markup
- Resource discovery
  - Metadata to assist in finding documents through searching and browsing
  - Value of digital libraries depends on how easily information can be located
- Policy
  - Define rights, restrictions, and rules that govern who can do what with digital resources
- Administration and preservation
  - Information necessary to preserve the integrity and functionality of a digital resource over the long term

Explicit versus Extracted Metadata
- Explicit metadata
  - Requires careful analysis of a document
  - Takes 1-2 hours to create a traditional library catalog entry
- Extracted metadata
  - "Text mining"
  - Automatically obtained from the contents of a document
  - Cheaper, but less reliable

HTML
- Hypertext Markup Language
- Document format of the World Wide Web
- Original vision: separate document structure from presentation
- Inconsistent ways of expressing formatting and metadata in HTML may discourage automatic processing of document collections

Basic HTML
- Angle brackets enclose words to form tags
- Example content: My Story
- Tag names are not case sensitive

HTML Tags
- Paragraph (<p>)
- Table row (<tr>)
- Table cell (<td>)
- Special characters
- List item (<li>)
- Images (<img>)
- Italics (<i>)
- Unordered (bulleted) list (<ul>..</ul>)
- Link anchor (<a>..</a>)

HTML
- Opening tags
  - Attributes
  - Special markers
- Header
  - Gives global information
  - Title, encoding scheme, metadata
- Body
  - ASCII / UTF-8 Unicode
- Local link anchors
  - Navigation within a single document
- Forms
  - Collect data from the user
- Frames
  - An HTML document can be tiled into smaller, independent segments (each an HTML page)
  - A frameset (a set of frames) can be displayed simultaneously, which is useful for navigation bars

HTML in Digital Libraries
- Many source documents are presented in HTML form
- Explicit specification of metadata using tags
- Extract text
  - The plain-text browser "lynx" extracts text from HTML documents
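
The two ideas on this slide (metadata given explicitly in tags, and text pulled out of the markup) can be sketched with Python's standard `html.parser` module. This is only an illustration, not how lynx or any particular digital library system works; the class name, the sample page, and the author value are invented for the example.

```python
from html.parser import HTMLParser


class TextAndMetaExtractor(HTMLParser):
    """Collects visible text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.metadata = {}
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            if "name" in attr and "content" in attr:
                self.metadata[attr["name"].lower()] = attr["content"]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script or style block.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (plain text, metadata dict) for an HTML string."""
    parser = TextAndMetaExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.metadata


# Hypothetical sample page.
sample = """<html><head><title>My Story</title>
<meta name="Author" content="A. Writer"></head>
<body><h1>My Story</h1><p>Once upon a <i>time</i>.</p></body></html>"""

text, meta = extract(sample)
print(text)
print(meta)
```

Because authors use `<meta>` tags inconsistently, a real collection would probably need per-source rules on top of something like this.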

XML
- Extensible Markup Language
- Flexible way to characterize document structure and metadata
- Well suited to digital libraries
- Widespread use

XML Document Type Definition
- DTD = Document Type Definition
- Tag syntax
  - Keywords in block capitals
  - Square brackets [...] indicate that the DTD appears in-line
  - Otherwise, the DTD can be in an external file
    - Referred to by a URL
    - Desirable
- New elements
  - Keyword ELEMENT
  - Tag name
  - Description of what the element may contain
- A leaf
  - An element that is plain text, with no markup
  - Declared as #PCDATA (parsed character data)
- Special characters
  - Encoded as in HTML (&lt;, &amp;, etc.)

XML Regular Expressions
- The content of an element is described by a regular expression
- Comma indicates an ordered sequence
- Vertical bar indicates a choice of one element from the alternatives
- Asterisk indicates zero or more
- Plus indicates one or more
- Question mark indicates zero or one
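
To see how these operators combine, a DTD content model can be hand-translated into an ordinary regular expression over the sequence of child tag names. The sketch below does this in Python for one hypothetical declaration (the `article` element and its children are invented); it is not a DTD validator, and it assumes tag names contain no spaces.

```python
import re

# Hypothetical DTD declaration being modelled:
#   <!ELEMENT article (title, author+, abstract?, (section | appendix)*)>
CONTENT_MODEL = re.compile(
    r"title "                  # comma: title comes first, exactly once
    r"(author )+"              # plus: one or more authors
    r"(abstract )?"            # question mark: zero or one abstract
    r"((section|appendix) )*"  # bar: a choice; asterisk: zero or more times
)


def children_match(child_tags):
    """True if the list of child tag names satisfies the content model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return CONTENT_MODEL.fullmatch(sequence) is not None


print(children_match(["title", "author", "author", "section", "appendix"]))
print(children_match(["title", "section"]))  # no author, so it fails
```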

XML Attributes
- Attributes
  - Give a set of possible values
  - No nesting
- Keyword ATTLIST
  - Element to which it applies
  - Attribute name
  - Attribute type
  - Appearance restrictions (optional)

XML Entities
- Predefined entities: &lt; &amp; &gt; &apos; &quot;
- New entities can be added in the DTD
- Syntax: the keyword ENTITY, then the name, then the "value"
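
The five predefined entities can be exercised with the standard-library helpers in `xml.sax.saxutils`. This is a minimal sketch with an invented sample string. By default `escape` handles only `&`, `<` and `>`, so the two quote entities are passed in explicitly.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > by default; quotes are supplied as extras.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}

raw = 'Tom & Jerry say "1 < 2"'
encoded = escape(raw, QUOTES)
decoded = unescape(encoded, UNQUOTES)

print(encoded)
print(decoded == raw)
```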

XML Parameter Entity
- Useful when several elements share the same attributes
- A parameter entity is a special type of entity
- Marked with a percent symbol

Well Formed and Valid XML
- Well formed
  - A document that conforms to XML syntax; it need not supply a DTD (Document Type Definition)
- Valid
  - A document that conforms to XML syntax and does supply a DTD
  - The content follows the syntactic constraints defined in the DTD
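
Well-formedness is easy to test in practice: a parser either accepts the document or reports a syntax error. The sketch below uses Python's `xml.etree.ElementTree`, which is a non-validating parser, so it checks well-formedness only and never checks a document against a DTD. The function name and sample strings are invented.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text parses as XML; says nothing about DTD validity."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<doc><title>My Story</title></doc>"))  # properly nested
print(is_well_formed("<doc><title>My Story</doc>"))          # unclosed <title>
print(is_well_formed("<a></a><b></b>"))                      # two root elements
```

Checking validity would need a validating parser, which the Python standard library does not provide.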

Parsing XML
- Parsing indicates whether the document conforms to the general rules of XML (or the specific DTD, when applicable)
- Parsing produces a parse tree
  - Begins with a root node
  - Root node has descendants
  - Descendants reflect text content and nested tags
- Programming interface
  - Lets the user traverse the tree and retrieve the data
  - "API": Application Program Interface
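
A parse tree and its traversal might look like the following sketch, which uses Python's `xml.etree.ElementTree` as one example of such a programming interface. The element names (`library`, `book`, and so on) are invented for illustration; only the book title and authors come from this chapter.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the tag names are illustrative only.
doc = """<library>
  <book id="b1"><title>How to Build a Digital Library</title>
    <author>Ian H. Witten</author><author>David Bainbridge</author></book>
</library>"""


def walk(node, depth=0):
    """Return one indented line per node, visiting the tree depth-first."""
    text = (node.text or "").strip()
    lines = ["  " * depth + node.tag + (": " + text if text else "")]
    for child in node:  # iterating an element yields its direct children
        lines.extend(walk(child, depth + 1))
    return lines


root = ET.fromstring(doc)  # the root node of the parse tree
outline = walk(root)
print("\n".join(outline))
```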

XML DOM
- Document Object Model
- Application Program Interface (API)
  - Cross-platform
  - Cross-language
- Allows programs to be written that access and modify the document's:
  - Content
  - Structure
  - Style
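
As a small illustration of accessing and modifying a document through the DOM, the sketch below uses `xml.dom.minidom`, the lightweight DOM implementation in Python's standard library. The document and the values written into it are invented.

```python
from xml.dom.minidom import parseString

dom = parseString("<book><title>Draft</title></book>")

# Modify content: change the text node inside <title>.
title = dom.getElementsByTagName("title")[0]
title.firstChild.data = "My Story"

# Modify structure: create an <author> element and attach it to the root.
author = dom.createElement("author")
author.appendChild(dom.createTextNode("A. Writer"))
dom.documentElement.appendChild(author)

result = dom.documentElement.toxml()
print(result)
```

The method names (`getElementsByTagName`, `createElement`, `appendChild`) are defined by the DOM interface itself, which is why equivalent code looks much the same in other languages.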

XML and Digital Libraries
- XML is powerful
- XML allows file formats within a digital library to be shared
  - Structure explanations are put in a DTD (Document Type Definition)
- XML provides syntax for expressing structural information (metadata)
- XML goes further by combining with other standards:
  - Support document restructuring, querying, information extraction and formatting
  - Can have display capabilities similar to HTML

Style Sheets
- Control the presentation of marked-up documents
- Two kinds of style sheets:
  - Cascading Style Sheets (CSS)
    - Work with HTML and XML
  - Extensible Stylesheet Language (XSL)
    - Works with XML
    - Powerful
    - Allows document structure to be altered dynamically

Bibliographic Metadata
- Two standards for representing document metadata:
  - Machine-Readable Cataloging (MARC)
    - Used by professional catalogers in libraries
  - The Dublin Core
    - Minimal standard used by people who are not trained in library cataloging
- Two metadata formats used by document authors in scientific and technical fields:
  - BibTeX
  - Refer

MARC
- Machine-Readable Cataloging
- Internally stored as a collection of tagged fields
- Format covers:
  - Bibliographic records
  - Authority records: standardized forms that are part of the librarian's controlled vocabulary
- Governed by AACR2R
  - Anglo-American Cataloging Rules
  - Detailed set of rules and guidelines
  - Two parts
    - Part 1: Description of documents
    - Part 2: Description of works

Dublin Core
- Set of metadata elements
- Simple: designed for non-specialists
- Intended for electronic materials that will not receive a full MARC catalog entry
- Named after Dublin, Ohio
  - The first meeting was held there in 1995
- Approved by ANSI (American National Standards Institute) in 2001

Dublin Core
- Fifteen metadata elements form the core element set
  - May be refined through qualifiers
  - May be augmented by additional elements for local purposes
- Resource
  - "Anything that has identity"
  - Similar to "entity" (objectives of a bibliographic system)
- Does not impose any kind of vocabulary control or authority files
  - Two people might generate very different descriptions of the same resource

Dublin Core Metadata Standard
- Title
- Creator
- Subject
- Description
- Publisher
- Contributor
- Date
- Type
- Format
- Identifier
- Source
- Language
- Relation
- Coverage
- Rights
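
A record using a few of these elements could be written out as XML roughly as follows. This is a sketch only: the wrapping `metadata` element and the helper function are invented, while the namespace URI is the one published for the fifteen-element set. Because Dublin Core imposes no vocabulary control, the values are free text, and any element may be repeated.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a record from {element name: [values]}; elements may repeat."""
    record = ET.Element("metadata")  # hypothetical wrapper element
    for name, values in fields.items():
        for value in values:
            element = ET.SubElement(record, "{%s}%s" % (DC_NS, name))
            element.text = value
    return record


record = dublin_core_record({
    "title": ["How to Build a Digital Library"],
    "creator": ["Ian H. Witten", "David Bainbridge"],
    "language": ["en"],
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```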

BibTeX
- Manages bibliographic data and references within documents
- TeX
  - Generalized document-processing system
  - Scientific, mathematical and technical purposes
- LaTeX
  - Customized version of TeX
  - Freely available
- BibTeX
  - Subsystem of LaTeX

Refer
- Similar to BibTeX
- Designed by computer scientists for use by scientific and technical researchers
- Basis of EndNote
  - Bibliographic tool that augments Microsoft Word

Metadata for Images and Multimedia
- Metadata is not confined to text
- Most image files include data about resolution
- PNG can store text strings
- Image metadata is usually kept separate from the image file

Metadata for Images and Multimedia
- Two metadata formats:
  - TIFF
    - Tagged Image File Format
    - Associates metadata with image files
    - Widespread use for over a decade
    - How images are stored in digital libraries
      - Normal images
      - Document images
  - MPEG-7
    - Multimedia Content Description Interface
    - Scheme to define and store metadata associated with any multimedia information
    - General, extensible, and still being standardized

TIFF: Tagged Image File Format
- Public-domain file format for raster images
  - A raster image is a rectangular array of regularly sampled values, known as pixels
- Incorporates extensive abilities for descriptive metadata
- Not tied to specific input or output devices
- Byte-oriented format like Unicode
  - Compatible with big-endian and little-endian
- A single TIFF file can include several images
- Images are characterized by sets of tags
- Most digital library projects with images use the TIFF format to store and archive the original captured images
  - May convert to other formats for display
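
The big-endian and little-endian compatibility is visible in the first eight bytes of a classic TIFF file: a two-byte order mark (`II` for little-endian, `MM` for big-endian), the number 42, and the offset of the first tag directory. The sketch below reads such a header from a byte string. The function name is invented, and the sample bytes are hand-built headers rather than real image files.

```python
import struct


def read_tiff_header(data):
    """Return (byte order, offset of first image file directory)."""
    mark = data[:2]
    if mark == b"II":
        endian = "<"  # little-endian ("Intel" order)
    elif mark == b"MM":
        endian = ">"  # big-endian ("Motorola" order)
    else:
        raise ValueError("not a TIFF header")
    # 16-bit magic number followed by a 32-bit offset, in the declared order.
    magic, first_ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    return ("little" if endian == "<" else "big"), first_ifd_offset


print(read_tiff_header(b"II*\x00\x08\x00\x00\x00"))
print(read_tiff_header(b"MM\x00*\x00\x00\x00\x08"))
```

The sets of tags that characterize each image live in the directory that the offset points to; a file holding several images chains several such directories.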

MPEG-7
- Multimedia Content Description Interface
- Wide scope
- Still under development
- Multimedia presentations may include:
  - Still pictures
  - 3D models
  - Audio
  - Speech
  - Video
- Information can be streamed from an online real-time source

Extracting Metadata
- Text mining
  - Automatic extraction of information from text
- Plain text documents
  - Require text comprehension skills
  - Computer techniques for text analysis
  - Good results in constrained domains
- XML and other structured markup languages
  - Make key aspects of documents available to computers and people
  - Encoded information can easily be extracted by parsing the document structure
  - Few documents contain explicitly encoded metadata

General Techniques
- Extracting document metadata
- Generic entities
- Bibliography entries

Techniques Available in Greenstone Software
- Language identification
- Extracting acronyms
- Extracting key phrases
- Generating phrase hierarchies

These go beyond what is normally meant by "metadata" by extracting useful information for use in digital libraries.

Extracting Document Metadata
- Basic metadata is usually present on the first page
  - Title, author, publisher, date of publication, keywords, abstract
- Fairly uniform presentation
- Easy for human extraction
- Automatic extraction depends too much on the format of the documents and the uniformity of the collection

Generic Entities
- Fixed-syntax information is easy to extract automatically from plain text documents
  - Email addresses
  - Web URLs
- Artificial entities that are slightly less reliable
  - Money
  - Time
  - Dates
- Semi-structured data
  - Names (begin with capital letters)

Generic Entities
- Generic entity extraction
  - The task of identifying entities such as times, dates, sums of money, and different kinds of names in running text
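
For the fixed-syntax end of this range, a few regular expressions go a long way. The sketch below pulls email addresses, Web URLs and dollar amounts out of running text. The patterns are deliberately loose approximations invented for this example, not complete grammars, and names or dates would need considerably more than this.

```python
import re

# Simplified patterns: good enough for illustration, not exhaustive.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL = re.compile(r"https?://[^\s<>\"']+")
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d{2})?")


def extract_entities(text):
    """Return the entities found in the text, grouped by kind."""
    return {
        "email": EMAIL.findall(text),
        # Drop sentence punctuation that the loose URL pattern swallows.
        "url": [u.rstrip(".,;:") for u in URL.findall(text)],
        "money": MONEY.findall(text),
    }


sample = ("Write to info@example.org or see http://www.example.org/docs. "
          "Fees are $1,250.00.")
entities = extract_entities(sample)
print(entities)
```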

Bibliography Entries
- Traditional citation indexes
  - Identify the citations that a document makes and link them with the cited works
- Advantages:
  - Navigation forward in time through listings that cite the current one
  - Navigation backward through the list of cited articles
  - Locating related literature, placing articles in context, analyzing current research trends, etc.
- The structure of references makes this easier than generic entity extraction
- The power of a citation index depends on
  - Ability to identify the article that is being referenced
  - Ability to recognize different references to the same article

Language Identification
- Easily derived from a document's content:
  - Language it is written in
  - Encoding scheme
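
One very simple way to guess a language from content is to count common function words. The sketch below is a toy stand-in with tiny hand-picked word lists for three languages; it is not the method Greenstone uses, and practical identifiers typically rely on character n-gram statistics over much larger profiles.

```python
# Tiny illustrative stopword profiles; real profiles are far larger.
STOPWORDS = {
    "English": {"the", "and", "of", "to", "is"},
    "French": {"le", "la", "et", "les", "des", "est"},
    "German": {"der", "die", "und", "das", "ist"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = text.lower().split()
    scores = {
        language: sum(word in stopwords for word in words)
        for language, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The library is full of books and papers"))
print(guess_language("Die Bibliothek ist voll und das Buch ist alt"))
```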

Extracting Acronyms
- Acronym
  - A word formed from the first (or first few) letters of a series of words
- Acronyms are used extensively in technical, commercial and political documents
- Acronyms can be identified using heuristics
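
One such heuristic, sketched here as an assumption rather than as the rule any particular system applies: treat a run of capitals in parentheses as a candidate acronym, and accept it if the initials of the words just before it spell the same letters. It handles only the "first letter of each word" case.

```python
import re


def find_acronyms(text):
    """Map each parenthesized acronym to the phrase whose initials match it."""
    found = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # Words that precede the parenthesis, in order.
        words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            found[acronym] = " ".join(candidate)
    return found


sample = ("The Text Encoding Initiative (TEI) and the Document Object "
          "Model (DOM) are widely used.")
acronyms = find_acronyms(sample)
print(acronyms)
```

Acronyms that take several letters from one word, or that skip small words such as "of", would need extra rules.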

Extracting Key Phrases
- Key words and key phrases are often attached to documents to provide a brief synopsis of what they are about

Key Phrases are Useful Metadata
- Documents are condensed into a few phrases that can be interpreted individually and independently
- Good for information-retrieval tasks
  - They describe the documents returned by a query
  - They can be used as the basis for search indexes
  - They can be used to browse an information collection
  - Document clustering technique
- Help the user get a feel for the content of an information collection
- Provide a way of measuring similarity among documents
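
As one possible reading of "measuring similarity among documents", two documents can be compared by the overlap between their key-phrase sets. The sketch below uses the Jaccard coefficient (shared phrases divided by all distinct phrases); the choice of measure and the sample phrase sets are assumptions made for illustration.

```python
def jaccard(phrases_a, phrases_b):
    """Overlap of two key-phrase collections, from 0.0 (none) to 1.0 (same)."""
    a, b = set(phrases_a), set(phrases_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical key phrases for two documents.
doc1 = {"digital library", "metadata", "xml"}
doc2 = {"metadata", "xml", "dublin core"}

similarity = jaccard(doc1, doc2)
print(similarity)
```

A score like this could also drive the clustering mentioned above, by grouping documents whose pairwise similarity exceeds some threshold.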

Key Phrase Metadata
- Key-phrase metadata can successfully be obtained automatically from documents
- Two different approaches:
  - Key-phrase assignment
  - Key-phrase extraction

Generating Phrase Hierarchies
- Key phrases consist of a few well-chosen words that characterize the document
- It is useful to extract a structure that contains ALL the phrases in the documents
- A hierarchical structure of phrases can support browsing around a digital library collection