Chapter Five Markup and Metadata: Elements of Organization How to Build a Digital Library Ian H. Witten and David Bainbridge.


1 Chapter Five Markup and Metadata: Elements of Organization How to Build a Digital Library Ian H. Witten and David Bainbridge

2 Digital Library Elements
- Basic Building Blocks
  - Documents
- Basic Elements of Organization
  - Markup
    - Controls structure
    - Controls appearance
  - Metadata
    - Expedites access

3 Structural Markup
- Identify and maintain the document structure:
  - Section divisions
  - Headings
  - Subsection structure
  - Lists
  - Quotations
  - Tabular information
- Structural markup items become metadata

4 Presentation Markup
- Specify how the document will appear typographically by formatting the document:
  - Page size
  - Headers and footers
  - Font
  - Line spacing
  - Section headers
  - Figures

5 Document Style Sheet
- The design of the document relates structure and appearance
- Structure and appearance should aid the reader’s comprehension
- A style sheet is a catalog of how each structural item should be presented

6 What is Metadata?
- “Data about data” (glibly)
- Structural information about a particular information resource
- Information is “structured” if it can be meaningfully manipulated without understanding its content
- Where does “data” end and “metadata” begin?

7 Kinds of Metadata
- Assist in navigating a document
  - Structural markup
- Resource discovery
  - Metadata to assist in finding documents through searching and browsing
  - Value of digital libraries depends on how easily information can be located
- Policy
  - Define rights, restrictions, and rules that govern who can do what with digital resources
- Administration and Preservation
  - Information necessary to preserve the integrity and functionality of a digital resource long term

8 Explicit versus Extracted Metadata
- Explicit Metadata
  - Requires careful analysis of a document
  - Takes 1-2 hours to create a traditional library catalog entry
- Extracted Metadata
  - “Text Mining”
  - Automatically obtained from the contents of a document
  - Cheaper, but less reliable

9 HTML
- Hypertext Markup Language
- Document format of the World Wide Web
- Original vision: separate document structure from presentation
- Inconsistent formatting and metadata conventions in HTML can hinder automatic processing of document collections

10 Basic HTML
- Angle brackets enclose words
  - My Story
- Tag names are not case sensitive

11 HTML Tags
- Paragraph
- Table Row
- Table Cell
- Special characters, list item
- Images
- Italics
- Unordered List, Bulleted List
- Link Anchor

12 HTML
- Opening Tags
  - Attributes
  - Special Markers
- Header
  - Gives global information
  - Title, encoding scheme, metadata
- Body
  - ASCII / UTF-8 Unicode
- Local link anchors
  - Navigation within a single document
- Forms
  - Collect data from user
- Frames
  - HTML document can be tiled into smaller, independent segments (each an HTML page)
  - Frameset: a set of frames that can be displayed simultaneously (useful for navigation bars)

13 HTML in Digital Libraries
- Many source documents are presented in HTML form
- Explicit specification of metadata using tags
- Extract text
  - Plain text browser “lynx” extracts text from HTML documents
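As a rough illustration of the two ideas on this slide (metadata given explicitly in tags, and plain text pulled out of HTML), here is a minimal sketch using Python's standard `html.parser`. It is not how lynx or any particular digital library system works. The class name `TextAndMetaExtractor`, the sample page, and the author value are made up for the example.

```python
from html.parser import HTMLParser


class TextAndMetaExtractor(HTMLParser):
    """Collect text content and <meta name=... content=...> pairs."""

    SKIP = {"script", "style"}  # elements whose contents are not document text

    def __init__(self):
        super().__init__()
        self.metadata = {}     # explicit metadata found in <meta> tags
        self.chunks = []       # pieces of text, in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs and "content" in attrs:
            self.metadata[attrs["name"]] = attrs["content"]
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)


# Hypothetical sample page; the author value is invented for illustration.
SAMPLE = """<html><head><title>My Story</title>
<meta name="author" content="A. N. Author">
<style>p { color: black }</style></head>
<body><h1>My Story</h1><p>Once upon a <i>time</i>.</p></body></html>"""

extractor = TextAndMetaExtractor()
extractor.feed(SAMPLE)
extractor.close()
print(extractor.metadata)
print(extractor.text())
```

The sketch keeps the title text as part of the extracted text and makes no attempt to reproduce layout; a real collection would need to decide both points per source.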

14 XML
- Extensible Markup Language
- Flexible way to characterize document structure and metadata
- Well suited to digital libraries
- Widespread use

15 XML Document Type Definition
- DTD = Document Type Definition
- Tag Syntax
  - Keywords in block capitals
  - Square brackets […] indicate that the DTD appears in-line
  - Otherwise, the DTD can be in an external file
    - Referred to by a URL
    - Desirable
- New elements
  - Keyword ELEMENT
  - Tag name
  - Description of what the element may contain
- A Leaf
  - An element that is plain text, with no markup
  - Declared as #PCDATA (parsed character data)
- Special Characters
  - Encoded as in HTML (&lt;, &amp;, etc.)

16 XML Regular Expressions
- Regular expression
  - Comma indicates an ordered sequence
  - Vertical bar indicates a choice of one element from the listed alternatives
  - Asterisk indicates zero or more
  - Plus indicates one or more
  - Question mark indicates zero or one
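The operators listed on this slide behave like ordinary regular-expression operators applied to the sequence of child elements. The sketch below makes that analogy concrete. The `article` content model and its element names are hypothetical, and the check is only an illustration of the notation, not a DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model, as it might appear in a DTD:
#   <!ELEMENT article (title, author+, (abstract | summary)?, section*)>
# Rewritten as a regular expression over space-terminated child tag names.
ARTICLE_MODEL = re.compile(
    r"title "                   # comma: title comes first, exactly once
    r"(?:author )+"             # plus: one or more authors
    r"(?:abstract |summary )?"  # bar and question mark: at most one of the two
    r"(?:section )*"            # asterisk: zero or more sections
)


def matches_model(xml_text):
    """Return True if the root's children follow ARTICLE_MODEL in order."""
    root = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in root)
    return ARTICLE_MODEL.fullmatch(child_names) is not None


good = ("<article><title>T</title><author>A</author>"
        "<author>B</author><summary>S</summary></article>")
wrong_order = "<article><author>A</author><title>T</title></article>"
print(matches_model(good), matches_model(wrong_order))
```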

17 XML Attributes
- Attributes
  - Give set of possible values
  - No nesting
- Keyword ATTLIST
  - Element to which it applies
  - Attribute name
  - Attribute type
  - Appearance restrictions (optional)

18 XML Entities
- Entities:
  - &lt;, &amp;, &gt;, &apos;, &quot;
- New entities can be added in the DTD
- Use syntax
  - ENTITY
  - Name
  - “value”
- Example:
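A small sketch of both kinds of entity, predefined ones and one declared in an in-line DTD, parsed with Python's standard `xml.etree.ElementTree`. The entity name `dl` and the sample sentence are invented for illustration and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# In-line DTD (inside the square brackets) declaring one new entity, "dl".
DOC = """<!DOCTYPE note [
  <!ENTITY dl "digital library">
]>
<note>A &dl; stores text &amp; images; 3 &lt; 5.</note>"""

root = ET.fromstring(DOC)
# The parser replaces &dl; with its declared value and decodes &amp; and &lt;.
print(root.text)
```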

19 XML Parameter Entity
- Several elements share the same attributes
- Parameter Entity
  - Special type of entity
  - Percent symbol

20 Well Formed and Valid XML
- Well Formed
  - A document that conforms to XML syntax; it need not supply a DTD (Document Type Definition)
- Valid
  - A well-formed document that does supply a DTD
  - The content follows the syntactic constraints defined in the DTD
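Well-formedness can be checked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which reports syntax errors but does not validate against a DTD, so it covers only the first half of this slide. The helper name `is_well_formed` is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b>x</b></a>"))   # properly nested tags
print(is_well_formed("<a><b>x</a></b>"))   # tags closed in the wrong order
```

Checking validity would need a validating parser from outside the standard library.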

21 Parsing XML
- Parsing indicates whether the document conforms to the general rules of XML (or the specific DTD, when applicable)
- Parsing produces a parse tree
  - Begins with a root node
  - Root node has descendants
  - Descendants reflect text content and nested tags
- Programming Interface
  - Lets user traverse the tree and retrieve the data
  - “API”: Application Program Interface
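To make the parse tree concrete, here is a minimal sketch that parses a small document and walks the tree from the root down, printing one indented line per node. The sample document and the `outline` helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Yield one indented line per node: tag name plus any direct text."""
    text = (element.text or "").strip()
    yield "  " * depth + element.tag + (": " + text if text else "")
    for child in element:  # descendants reflect the nested tags
        yield from outline(child, depth + 1)


DOC = ("<book><title>My Story</title>"
       "<chapter><para>Once upon a time.</para></chapter></book>")

lines = list(outline(ET.fromstring(DOC)))
print("\n".join(lines))
```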

22 XML DOM
- Document Object Model
- Application Program Interface (API)
  - Cross-platform
  - Cross-language
- Allows programs to be written that access and modify the document’s:
  - Content
  - Structure
  - Style
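Python's standard library includes a small DOM implementation, `xml.dom.minidom`. The sketch below uses it to read a document's content, change that content, and alter the structure by adding an element. The sample document and the author value are invented for illustration.

```python
from xml.dom import minidom

dom = minidom.parseString("<book><title>My Story</title></book>")

# Access content: the text node inside <title>.
title = dom.getElementsByTagName("title")[0]
old_title = title.firstChild.data

# Modify content.
title.firstChild.data = "Our Story"

# Modify structure: append a new <author> element to the root.
author = dom.createElement("author")
author.appendChild(dom.createTextNode("A. N. Author"))
dom.documentElement.appendChild(author)

result = dom.documentElement.toxml()
print(result)
```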

23 XML and Digital Libraries
- XML is powerful
- XML allows file formats within a digital library to be shared
  - Structure explanations are put in a DTD (Document Type Definition)
- XML provides syntax for expressing structural information (metadata)
- XML goes further by combining with other standards:
  - Support document restructuring, querying, information extraction and formatting
  - Can have display capabilities similar to HTML

24 Style Sheets
- Control the presentation of marked-up documents
- Two Kinds of Style Sheets:
  - Cascading Style Sheets
    - Work with HTML and XML
  - Extensible Stylesheet Language (XSL)
    - Works with XML
    - Powerful
    - Allows document structure to be altered dynamically

25 Bibliographic Metadata
- Two Standards for Representing Document Metadata:
  - Machine-Readable Cataloging (MARC)
    - Used by professional catalogers in libraries
  - The Dublin Core
    - Minimal standard used by people who are not trained in library cataloging
- Two metadata formats used by document authors in scientific and technical fields:
  - BibTeX
  - Refer

26 MARC
- Machine-Readable Cataloging
- Internally stored as collection of tagged fields
- Format covers:
  - Bibliographic records
  - Authority records: standardized forms that are part of the librarian’s controlled vocabulary
- Governed by AACR2R
  - Anglo-American Cataloging Rules
  - Detailed set of rules and guidelines
  - Two Parts
    - Part 1: Description of Documents
    - Part 2: Description of Works

27 Dublin Core
- Set of metadata elements
- Simple: designed for non-specialists
- Intended for electronic materials that will not receive a full MARC catalog entry
- Named after Dublin, Ohio
  - The first meeting was held there in 1995
- Approved by ANSI (American National Standards Institute) in 2001

28 Dublin Core
- Fifteen metadata elements form the core element set
  - May be refined through qualifiers
  - May be augmented by additional elements for local purposes
- Resource
  - “Anything that has identity”
  - Similar to “entity” (objectives of bibliographic system)
- Does not impose any kind of vocabulary control or authority files
  - Two people might generate very different descriptions of the same resource

29 Dublin Core Metadata Standard
- Title
- Creator
- Subject
- Description
- Publisher
- Contributor
- Date
- Type
- Format
- Identifier
- Source
- Language
- Relation
- Coverage
- Rights
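One common way to write down a Dublin Core description is as XML elements in the `http://purl.org/dc/elements/1.1/` namespace. The sketch below builds such a record with Python's standard library. The `record` wrapper element and the `dublin_core_record` helper are assumptions for the example, not part of the standard; the title and creators are taken from this presentation.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_record(fields):
    """Build a <record> element (hypothetical wrapper) of dc:* children.

    fields maps an element name to a string or a list of strings,
    since elements such as creator may repeat.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError("not one of the fifteen core elements: " + name)
        if isinstance(values, str):
            values = [values]
        for value in values:
            ET.SubElement(record, "{%s}%s" % (DC_NS, name)).text = value
    return record


record = dublin_core_record({
    "title": "How to Build a Digital Library",
    "creator": ["Ian H. Witten", "David Bainbridge"],
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because Dublin Core imposes no vocabulary control, the sketch checks only element names and leaves the values unconstrained.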

30 BibTeX
- Manages bibliographic data and references within documents
- TeX
  - Generalized document-processing system
  - Scientific, Mathematical and Technical Purposes
- LaTeX
  - Customized Version of TeX
  - Freely available
- BibTeX
  - Subsystem of LaTeX

31 Refer
- Similar to BibTeX
- Designed by computer scientists for use by scientific and technical researchers
- Basis of EndNote
  - Bibliographic tool which augments Microsoft Word
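Refer records are plain text: each field is a line starting with a percent sign and a one-letter key (for example `%A` for an author and `%T` for a title), and a blank line ends the record. The sketch below is a minimal reader under those assumptions. The `parse_refer` helper is hypothetical, covers only a handful of keys, and ignores continuation lines.

```python
# A few common Refer keys; unknown keys are kept under their letter.
REFER_KEYS = {"A": "author", "T": "title", "D": "date",
              "I": "publisher", "J": "journal"}


def parse_refer(text):
    """Split Refer-style text into records: dicts of field name -> list of values."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # a blank line closes the record
            if current:
                records.append(current)
                current = {}
            continue
        if line.startswith("%") and len(line) > 2:
            key = REFER_KEYS.get(line[1], line[1])
            current.setdefault(key, []).append(line[2:].strip())
    if current:
        records.append(current)
    return records


SAMPLE = """%A Ian H. Witten
%A David Bainbridge
%T How to Build a Digital Library
"""

records = parse_refer(SAMPLE)
print(records)
```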

32 Metadata for Images and Multimedia
- Metadata is not confined to text
- Most image files include data about resolution
- PNG can store text strings
- Image metadata is usually kept separate from the image file

33 Metadata for Images and Multimedia
- Two Metadata Formats:
  - TIFF
    - Tagged Image File Format
    - Associates metadata with image files
    - Widespread use for over a decade
    - How images are stored in digital libraries
      - Normal images
      - Document images
  - MPEG-7
    - Multimedia Content Description Interface
    - Scheme to define and store metadata associated with any multimedia information
    - General, extensible, and still being standardized

34 TIFF Tagged Image File Format
- Public-domain file format for raster images
  - A raster image is a rectangular array of regularly sampled values, known as pixels
- Incorporates extensive abilities for descriptive metadata
- Not tied to specific input or output devices
- Byte-oriented format like Unicode
  - Compatible with big-endian and little-endian
- Single TIFF file can include several images
- Images are characterized by sets of tags
- Most digital library projects with images use the TIFF format to store and archive the original captured images
  - May convert to other formats for display
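The big-endian and little-endian point can be seen in the first eight bytes of a TIFF file: two bytes giving the byte order (`II` for little-endian, `MM` for big-endian), the number 42, and the offset of the first directory of tags. A minimal sketch of reading that header follows; the function name `read_tiff_header` is made up, and the sample bytes are hand-built headers rather than real images.

```python
import struct


def read_tiff_header(data):
    """Return (byte_order, first_ifd_offset) from the first 8 bytes of a TIFF."""
    if len(data) < 8:
        raise ValueError("a TIFF header is 8 bytes long")
    order = data[:2]
    if order == b"II":
        endian = "<"   # little-endian
    elif order == b"MM":
        endian = ">"   # big-endian
    else:
        raise ValueError("not a TIFF byte-order mark")
    # 2-byte magic number, then 4-byte offset of the first tag directory (IFD).
    magic, first_ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("missing TIFF magic number 42")
    return ("little" if endian == "<" else "big"), first_ifd_offset


little = read_tiff_header(b"II*\x00\x08\x00\x00\x00")
big = read_tiff_header(b"MM\x00*\x00\x00\x00\x08")
print(little, big)
```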

35 MPEG-7
- Multimedia Content Description Interface
- Wide scope
- Still under development
- Multimedia Presentations may include:
  - Still pictures
  - 3D models
  - Audio
  - Speech
  - Video
- Information can be streamed from an online real-time source

36 Extracting Metadata
- Text Mining
  - Automatic extraction of information from text
- Plain text documents
  - Require text comprehension skills
  - Computer techniques for text analysis
  - Good results in constrained domains
- XML and other Structured Markup Languages
  - Make key aspects of documents available to computers and people
  - Encoded information can easily be extracted by parsing the document structure
  - Few documents contain explicitly encoded metadata

37 General Techniques
- Extracting Document Metadata
- Generic Entities
- Bibliography Entries

38 Techniques Available in Greenstone Software
- Language Identification
- Extracting Acronyms
- Extracting Key Phrases
- Generating Phrase Hierarchies
- These go beyond what is normally meant by “metadata” by extracting useful information for use in digital libraries

39 Extracting Document Metadata
- Basic metadata is usually present on the first page
  - Title, author, publisher, date of publication, keywords, abstract
- Fairly uniform presentation
- Easy for human extraction
- Automatic extraction depends heavily on the format of the documents and the uniformity of the collection

40 Generic Entities
- Fixed syntax information is easy to extract automatically from plain text documents
  - Email address
  - Web URLs
- Artificial entities that are slightly less reliable
  - Money
  - Time
  - Dates
- Semi-structured Data
  - Names (begin with capital letters)
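Fixed-syntax entities lend themselves to pattern matching. The sketch below uses deliberately simple regular expressions for email addresses, web URLs, and dollar amounts. The patterns are illustrative assumptions, not complete grammars, and the sample sentence is invented.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL = re.compile(r"https?://[^\s<>\"']+")
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d{2})?")


def extract_entities(text):
    """Return the email addresses, URLs, and dollar amounts found in text."""
    return {
        "email": EMAIL.findall(text),
        # Trailing sentence punctuation is not part of the URL.
        "url": [u.rstrip(".,;:)") for u in URL.findall(text)],
        "money": MONEY.findall(text),
    }


SAMPLE = ("Write to info@example.org or see "
          "http://www.example.org/docs. The fee is $1,250.00.")

entities = extract_entities(SAMPLE)
print(entities)
```

Names, times, and dates have far looser syntax, which is why the slide rates them as less reliable; patterns like these would miss or over-match many of them.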

41 Generic Entities
- Generic Entity Extraction
  - The task of identifying entities such as times, dates, sums of money, and different kinds of names in running text

42 Bibliography Entries
- Traditional Citation Indexes
  - Identify the citations that a document makes and link them with the cited works
- Advantages:
  - Navigation forward in time through listings that cite the current one
  - Navigation backward through the list of cited articles
  - Locating related literature, placing articles in context, analyzing current research trends, etc.
- Structure of references makes it easier than generic entity extraction
- Power of a citation index depends on
  - Ability to identify the article that is being referenced
  - Ability to recognize different references to the same article
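The forward and backward navigation described here amounts to keeping two mappings: each article's reference list, and the inverse of that list. The sketch below shows the idea with placeholder article labels. The `normalize` step is a crude stand-in for recognizing different references to the same article, and both helper names are made up for the example.

```python
import re
from collections import defaultdict


def normalize(title):
    """Crude key for matching variant references: lowercase, no punctuation."""
    words = re.sub(r"[^a-z0-9 ]", " ", title.lower()).split()
    return " ".join(words)


def build_citation_index(cites):
    """Invert article -> cited titles into normalized title -> citing articles."""
    cited_by = defaultdict(set)
    for article, references in cites.items():
        for reference in references:
            cited_by[normalize(reference)].add(article)
    return cited_by


# Placeholder data: article labels mapped to the titles they cite.
cites = {
    "Article A": ["Text Compression.", "Phrase Browsing"],
    "Article B": ["text  compression"],
}

cited_by = build_citation_index(cites)
backward = cites["Article A"]                      # what Article A cites
forward = cited_by[normalize("Text Compression")]  # who cites that work
print(backward, sorted(forward))
```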

43 Language Identification
- Easily derived from a document’s content
  - Language it is written in
  - Encoding scheme

44 Extracting Acronyms
- Acronym
  - A word formed from the first (or first few) letters of a series of words
- Acronyms are used extensively in technical, commercial and political documents
- Acronyms can be identified using heuristics
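One simple heuristic of this kind: when a run of capital letters appears in parentheses, check whether the initials of the words just before it spell the same letters. The sketch below implements only that heuristic and is not the method used in Greenstone. It misses acronyms built from more than one letter per word or from hyphenated words.

```python
import re


def find_acronyms(text):
    """Map each parenthesized acronym to the preceding words whose initials match."""
    found = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        preceding = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = preceding[-len(acronym):]   # one word per acronym letter
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            found[acronym] = " ".join(candidate)
    return found


SAMPLE = ("The Document Object Model (DOM) is an "
          "Application Program Interface (API).")

acronyms = find_acronyms(SAMPLE)
print(acronyms)
```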

45 Extracting Key Phrases
- Key words and key phrases are often attached to documents to provide a brief synopsis of what they are about

46 Key Phrases are Useful Metadata
- Documents are condensed into a few phrases that can be interpreted individually and independently
- Good for information-retrieval tasks
  - They describe the documents returned by a query
  - They can be used as the basis for search indexes
  - They can be used to browse an information collection
  - Document clustering technique
- Help user get a feel for the content of an information collection
- Provide a way of measuring similarity among documents

47 Key Phrase Metadata
- Key-phrase metadata can successfully be obtained automatically from documents
- Two Different Approaches:
  - Key-Phrase Assignment
  - Key-Phrase Extraction

48 Generating Phrase Hierarchies
- Key phrases consist of a few well-chosen words that characterize the document
- It is useful to extract a structure that contains ALL the phrases in the documents
- Hierarchical structure of phrases can support browsing around a digital library collection

