Markup and Metadata How to Build a Digital Library Ian H. Witten and David Bainbridge.


1 Markup and Metadata How to Build a Digital Library Ian H. Witten and David Bainbridge

2 Digital Library Elements
- Basic elements of organization
  - Markup: controls structure and appearance
  - Metadata: expedites access

3 Structural Markup
- Identify and maintain the document structure:
  - Section divisions
  - Headings
  - Subsection structure
  - Lists
  - Quotations
  - Tabular information
- Structural markup items become metadata

4 Presentation Markup
- Specify how the document will appear typographically by formatting the document:
  - Page size
  - Headers and footers
  - Font
  - Line spacing
  - Section headers
  - Figures

5 Kinds of Metadata
- Assist navigation
  - Structural markup
- Resource discovery
  - Metadata to assist in finding documents through searching and browsing
  - Value of digital libraries depends on how easily information can be located
- Policy
  - Define rights, restrictions, and rules that govern who can do what with digital resources
- Administration and Preservation
  - Information necessary to preserve the integrity and functionality of a digital resource long term

6 Explicit versus Extracted Metadata
- Explicit Metadata
  - Requires careful analysis of a document
  - Takes 1-2 hours to create a traditional library catalog entry (or 5 minutes, depending on number of fields!)
- Extracted Metadata
  - “Text Mining”
  - Automatically obtained from the contents of a document
  - Cheaper, but less reliable

7 HTML
- Hypertext Markup Language
- Document format of the World Wide Web
- Original vision: separate document structure from presentation
- Inconsistent ways of formatting and metadata in HTML may discourage automatic processing of document collections

8 Basic HTML
- Angle brackets enclose words
  - My Story
- Tag names are not case sensitive
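The case-insensitivity point can be checked with a small sketch. It assumes Python's standard `html.parser` module and uses a hypothetical heading element as the markup; the slide's own example tag did not survive in the transcript.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        self.tags.append(tag)


def start_tags(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Hypothetical markup: the same element written in three different cases.
upper = start_tags("<H1>My Story</H1>")
mixed = start_tags("<h1>My Story</H1>")
lower = start_tags("<h1>My Story</h1>")
```

All three calls report the same tag name, which is what "not case sensitive" means in practice for a program that processes HTML.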

9 HTML Tags
- Paragraph
- Table Row
- Table Cell
- Special characters, list item
- Images
- Italics
- Unordered List, Bulleted List
- Link Anchor

10 HTML
- Opening Tags
  - Attributes
  - Special Markers
- Header
  - Gives global information
  - Title, encoding scheme, metadata
- Body
  - ASCII / UTF-8 Unicode
- Local link anchors
  - Navigation within a single document
- Forms
  - Collect data from user
- Frames
  - HTML document can be tiled into smaller, independent segments (each an HTML page)
  - Frameset – a set of frames – can be displayed simultaneously (useful for navigation bars)

11 HTML in Digital Libraries
- Many source documents are presented in HTML form
- Explicit specification of metadata using tags
- Extract text
  - Plain text browser “lynx” extracts text from HTML documents
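A rough sketch of both ideas on this slide, reading explicit metadata from tags and pulling out plain text, is given below. It uses Python's standard `html.parser` rather than lynx, and the page content and meta names are hypothetical.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs, the title, and body text."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.title = ""
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.metadata[attributes["name"]] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            # Keep only non-blank runs of text, with the markup discarded.
            self.text_parts.append(data.strip())


# Hypothetical page; the meta names are illustrative only.
page = """<html><head>
<title>My Story</title>
<meta name="author" content="A. Writer">
<meta name="keywords" content="markup, metadata">
</head><body><h1>Chapter 1</h1><p>Once upon a <i>time</i></p></body></html>"""

extractor = MetadataExtractor()
extractor.feed(page)
extractor.close()
plain_text = " ".join(extractor.text_parts)
```

The metadata ends up in `extractor.metadata` and `extractor.title`, while `plain_text` holds the words of the body with the tags removed.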

12 XML
- Extensible Markup Language
- Flexible way to characterize document structure and metadata
- Well suited to digital libraries
- Widespread use

13 XML Document Type Definition
- DTD = Document Type Definition
- Tag Syntax
  - Keywords in block capitals
  - Square brackets […] indicate that the DTD will appear in-line
  - Otherwise, DTD can be in an external file
    - Referred to by a URL
    - Desirable
- New elements
  - Keyword ELEMENT
  - Tag name
  - Description of what element may contain
- A Leaf
  - An element that is plain text, with no markup
  - Declared as #PCDATA (parsed character data)
- Special Characters
  - Encoded as in HTML (&lt;, &amp;, etc.)
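As an illustration of the in-line form, the sketch below embeds a small DTD in square brackets and reads the document with Python's standard `xml.etree.ElementTree`. The element names (`library`, `book`, `title`, `author`) are hypothetical, and the standard-library parser is non-validating, so it reads past the declarations without enforcing them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document type: a "library" holding "book" elements.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE library [
  <!ELEMENT library (book*)>
  <!ELEMENT book (title, author)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
]>
<library>
  <book>
    <title>Markup &amp; Metadata</title>
    <author>A. Writer</author>
  </book>
</library>"""

# ElementTree accepts the in-line DTD but does not validate against it;
# checking the declarations would need a validating parser.
root = ET.fromstring(DOCUMENT)
titles = [book.findtext("title") for book in root.findall("book")]
```

`title` and `author` are the leaves here: both are declared as `#PCDATA`, and the ampersand in the title is written as `&amp;`.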

14 XML Regular Expressions
- Regular expression
  - Comma indicates an ordered sequence
  - Vertical bar indicates a choice of one element from the alternatives
  - Asterisk indicates zero or more
  - Plus indicates one or more
  - Question mark indicates zero or one
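The operators above map closely onto ordinary regular expressions. The sketch below hand-translates one hypothetical content model into a Python `re` pattern over the sequence of child tag names; it is an illustration of the notation, not how an XML parser checks a DTD.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model in DTD notation:
#   (title, author+, (abstract | summary)?, section*)
# translated by hand into a Python regular expression over child tag names,
# where each name is followed by a single space.
CONTENT_MODEL = re.compile(
    r"title (author )+((abstract |summary ))?(section )*"
)


def children_match(element):
    """Return True if the element's children follow CONTENT_MODEL."""
    child_names = "".join(child.tag + " " for child in element)
    return CONTENT_MODEL.fullmatch(child_names) is not None


good = ET.fromstring(
    "<doc><title/><author/><author/><summary/><section/><section/></doc>"
)
bad_order = ET.fromstring("<doc><author/><title/></doc>")  # breaks the sequence
no_author = ET.fromstring("<doc><title/></doc>")           # "+" needs at least one
```

`children_match(good)` holds, while the other two documents fail because they violate the comma (ordering) and plus (one or more) rules respectively.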

15 XML Attributes
- Attributes
  - Give set of possible values
  - No nesting
- Keyword ATTLIST
  - Element to which it applies
  - Attribute name
  - Attribute type
  - Appearance restrictions (optional)
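To make the parts of an ATTLIST declaration concrete, the sketch below imitates one hypothetical declaration in plain Python: an enumerated `format` attribute that is required and a free-text `language` attribute that is optional. The attribute names and values are assumptions for illustration, and the check is written by hand because Python's standard XML parsers do not validate.

```python
import xml.etree.ElementTree as ET

# Hypothetical declaration this sketch imitates:
#   <!ATTLIST book format (print | electronic) #REQUIRED
#                  language CDATA #IMPLIED>
ALLOWED = {"format": {"print", "electronic"}, "language": None}  # None = any text
REQUIRED = {"format"}


def attribute_errors(element):
    """List the ways an element's attributes break the rules above."""
    errors = []
    for name in sorted(REQUIRED - set(element.attrib)):
        errors.append(f"missing required attribute {name!r}")
    for name, value in element.attrib.items():
        if name not in ALLOWED:
            errors.append(f"undeclared attribute {name!r}")
        elif ALLOWED[name] is not None and value not in ALLOWED[name]:
            errors.append(f"{name}={value!r} is not an allowed value")
    return errors


ok = ET.fromstring('<book format="print" language="en"/>')
wrong_value = ET.fromstring('<book format="papyrus"/>')
missing = ET.fromstring("<book/>")
```

Each entry in `ALLOWED` corresponds to an attribute name and type, and `REQUIRED` plays the role of the appearance restriction.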

16 XML Entities
- Entities: &lt; &amp; &gt; &apos; &quot;
- New entities can be added in the DTD
- Use syntax
  - ENTITY
  - Name
  - “value”
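A short sketch of both kinds of entity follows: the predefined ones and a new one declared in an in-line DTD. The entity name `dl` and the sentence are hypothetical; the parser is Python's standard `xml.etree.ElementTree`, which expands entities declared in the internal DTD subset.

```python
import xml.etree.ElementTree as ET

# "dl" is a hypothetical entity name chosen for this sketch.
DOCUMENT = """<!DOCTYPE note [
  <!ENTITY dl "digital library">
]>
<note>A &dl; stores &lt;title&gt; &amp; &quot;author&quot; fields.</note>"""

note = ET.fromstring(DOCUMENT)
expanded = note.text  # every entity reference replaced by its value
```

After parsing, `expanded` contains the literal characters and the phrase "digital library" in place of the references.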

17 XML Parameter Entity
- Several elements share the same attributes
- Parameter Entity
  - Special type of entity
  - Percent symbol

18 Well Formed and Valid XML
- Well Formed
  - A document that conforms to XML syntax but does not supply a DTD (Document Type Definition)
- Valid
  - A document that conforms to XML syntax and does supply a DTD
  - The content follows the syntactic constraints defined in the DTD
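Well-formedness is the part that can be checked without any DTD. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` (a non-validating parser, so it says nothing about validity) and hypothetical sample documents:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


nested = "<book><title>My Story</title></book>"
overlapping = "<book><title>My Story</book></title>"  # tags overlap
unclosed = "<book><title>My Story</title>"            # root never closed
```

Only `nested` passes. Testing validity as well would need a validating parser and a DTD to check against.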

19 Parsing XML
- Parsing indicates whether the document conforms to the general rules of XML (or the specific DTD, when applicable)
- Parsing produces a parse tree
  - Begins with a root node
  - Root node has descendants
  - Descendants reflect text content and nested tags
- Programming Interface
  - Lets user traverse the tree and retrieve the data
  - “API”: Application Program Interface
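The parse tree and its traversal can be sketched as follows, using Python's standard `xml.etree.ElementTree` as one possible programming interface. The document and the `outline` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def outline(node, depth=0):
    """Yield one indented line per element, depth first from the root."""
    text = (node.text or "").strip()
    yield "  " * depth + node.tag + (f": {text}" if text else "")
    for child in node:
        yield from outline(child, depth + 1)


# Hypothetical document used only to show the shape of the tree.
root = ET.fromstring(
    "<book><title>My Story</title>"
    "<chapter><heading>One</heading><para>Once upon a time</para></chapter>"
    "</book>"
)
lines = list(outline(root))
```

`lines` starts at the root (`book`) and lists each descendant with indentation that mirrors how the tags are nested.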

20 XML DOM
- Document Object Model
- Application Program Interface (API)
  - Cross-platform
  - Cross-language
- Allows programs to be written that access and modify the document’s:
  - Content
  - Structure
  - Style
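A small sketch of DOM-style access, assuming Python's standard `xml.dom.minidom` (a lightweight DOM implementation) and a hypothetical document. It covers content and structure only; style is not handled by this module.

```python
from xml.dom import minidom

document = minidom.parseString("<book><title>My Story</title></book>")
book = document.documentElement

# Content: read and change the text inside <title>.
title = book.getElementsByTagName("title")[0]
original_title = title.firstChild.data
title.firstChild.data = "My Other Story"

# Structure: create a new <author> element and attach it to <book>.
author = document.createElement("author")
author.appendChild(document.createTextNode("A. Writer"))
book.appendChild(author)

result = book.toxml()
```

The method names (`getElementsByTagName`, `createElement`, `appendChild`) come from the DOM interface itself, which is why similar code can be written in other languages.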

21 XML and Digital Libraries
- XML is powerful
- XML allows file formats within a digital library to be shared
  - Structure explanations are put in a DTD (Document Type Definition)
- XML provides syntax for expressing structural information – metadata
- XML goes further by combining with other standards:
  - Support document restructuring, querying, information extraction and formatting
  - Can have display capabilities similar to HTML

22 Style Sheets
- Control the presentation of marked-up documents
- Two Kinds of Style Sheets:
  - Cascading Style Sheets
    - Work with HTML and XML
  - Extensible Stylesheet Language – XSL
    - Works with XML
    - Powerful
    - Allows document structure to be altered dynamically

23 Bibliographic Metadata
- Two Standards for Representing Document Metadata:
  - Machine-Readable Cataloging (MARC)
    - Used by professional catalogers for use in libraries
  - The Dublin Core
    - Minimal standard used by people who are not trained in library cataloging
- Two metadata formats used by document authors in scientific and technical fields:
  - BibTeX
  - Refer

24 MARC
- Machine-Readable Cataloging
- Internally stored as collection of tagged fields
- Format covers:
  - Bibliographic records
  - Authority records – standardized forms that are part of the librarian’s controlled vocabulary
- Governed by AACR2R
  - Anglo-American Cataloging Rules
  - Detailed set of rules and guidelines
  - Two Parts
    - Part 1: Description of Documents
    - Part 2: Description of Works

25 Dublin Core
- Set of metadata elements
- Simple - designed for non-specialists
- Intended for electronic materials that will not receive a full MARC catalog entry
- Named after Dublin, Ohio
  - The first meeting was held there in 1995
- Approved by ANSI (American National Standards Institute) in 2001

26 Dublin Core
- Fifteen metadata elements form the core element set
  - May be refined through qualifiers
  - May be augmented by additional elements for local purposes
- Resource
  - “Anything that has identity”
  - Similar to “entity” (objectives of bibliographic system)
- Does not impose any kind of vocabulary control or authority files
  - Two people might generate very different descriptions of the same resource

27 Dublin Core Metadata Standard
- Title
- Creator
- Subject
- Description
- Publisher
- Contributor
- Date
- Type
- Format
- Identifier
- Source
- Language
- Relation
- Coverage
- Rights
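One common way to write down Dublin Core elements is as XML in the `http://purl.org/dc/elements/1.1/` namespace. The sketch below builds such a record with Python's standard `xml.etree.ElementTree`. The wrapper element `record` and the language value are assumptions made for illustration; the title and creators are taken from the title slide.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NAMESPACE)

# Title and creators come from the title slide; the language value and the
# "record" wrapper element are assumptions for this sketch.
description = {
    "title": "Markup and Metadata",
    "creator": ["Ian H. Witten", "David Bainbridge"],
    "language": "en",
}

record = ET.Element("record")
for element_name, values in description.items():
    if isinstance(values, str):
        values = [values]
    for value in values:
        # Elements may repeat, e.g. one dc:creator per author.
        field = ET.SubElement(record, f"{{{DC_NAMESPACE}}}{element_name}")
        field.text = value

xml_text = ET.tostring(record, encoding="unicode")
creators = [c.text for c in record.findall(f"{{{DC_NAMESPACE}}}creator")]
```

Only three of the fifteen elements are filled in, which is normal: every Dublin Core element is optional and repeatable.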

28 BibTeX
- Manages bibliographic data and references within documents
- TeX
  - Generalized document-processing system
  - Scientific, Mathematical and Technical Purposes
- LaTeX
  - Customized Version of TeX
  - Freely available
- BibTeX
  - Subsystem of LaTeX
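For readers who have not seen the format, the sketch below holds one BibTeX-style entry in a string and reads it with two regular expressions. The citation key `witten-dl` is invented, only the fields stated on the title slide are filled in, and the parser handles just flat `{braced}` values, so it is far from a full BibTeX reader.

```python
import re

# Hypothetical entry; the key "witten-dl" is invented, and only fields
# stated on the title slide are filled in.
ENTRY = """@book{witten-dl,
  author = {Ian H. Witten and David Bainbridge},
  title  = {How to Build a Digital Library}
}"""

HEADER = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
FIELD = re.compile(r"(\w+)\s*=\s*\{([^{}]*)\}")


def parse_entry(text):
    """Parse one simple BibTeX entry whose values are flat {braced} strings."""
    entry_type, key = HEADER.search(text).groups()
    fields = {name.lower(): value for name, value in FIELD.findall(text)}
    return {"type": entry_type.lower(), "key": key, "fields": fields}


parsed = parse_entry(ENTRY)
# BibTeX separates multiple authors with the word "and".
authors = parsed["fields"]["author"].split(" and ")
```

The entry type (`book`), the key, and the named fields are the metadata; a document cites the entry by its key.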

29 Refer
- Similar to BibTeX
- Designed by computer scientists for use by scientific and technical researchers
- Basis of EndNote
  - Bibliographic tool which augments Microsoft Word
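A Refer record is a block of lines, each beginning with a percent sign and a one-letter field code such as `%A` (author) or `%T` (title). The sketch below parses a hypothetical record built from the title slide; only the two codes it uses are mapped to names.

```python
# Hypothetical record in Refer style: each line starts with a percent sign
# and a one-letter field code (%A author, %T title).
RECORD = """%A Ian H. Witten
%A David Bainbridge
%T How to Build a Digital Library"""

FIELD_NAMES = {"A": "author", "T": "title"}  # only the codes used above


def parse_refer(text):
    """Group the lines of one Refer record by field; repeats accumulate."""
    fields = {}
    for line in text.splitlines():
        if not line.startswith("%") or len(line) < 3:
            continue
        code, value = line[1], line[2:].strip()
        fields.setdefault(FIELD_NAMES.get(code, code), []).append(value)
    return fields


refer_fields = parse_refer(RECORD)
```

Repeating `%A` once per author is how the format expresses multiple authors, where BibTeX would join them with "and".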

30 Metadata for Images and Multimedia
- Metadata is not confined to text
- Most image files include data about resolution
- PNG can store text strings
- Image metadata is usually kept separate from the image file
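The remark that PNG can store text strings refers to its `tEXt` chunks, each holding a keyword and a value. The sketch below assembles a tiny PNG in memory and reads the text back using only Python's standard `struct` and `zlib` modules. The helper names and the stored title are hypothetical, and the reader does not verify CRCs or handle the compressed or international text chunk types.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(chunk_type + data)
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def read_text_chunks(png_bytes):
    """Return {keyword: text} for every tEXt chunk in a PNG byte string."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    texts, position = {}, len(PNG_SIGNATURE)
    while position < len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[position:position + 4])
        chunk_type = png_bytes[position + 4:position + 8]
        data = png_bytes[position + 8:position + 8 + length]
        if chunk_type == b"tEXt":
            # tEXt data is a keyword, a zero byte, then Latin-1 text.
            keyword, _, text = data.partition(b"\x00")
            texts[keyword.decode("latin-1")] = text.decode("latin-1")
        position += 12 + length  # 4 length + 4 type + data + 4 CRC
    return texts


# A 1x1 grayscale image assembled by hand so the sketch needs no image file.
header = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
pixels = zlib.compress(b"\x00\x00")  # filter byte 0, then one black pixel
png = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", header)
    + make_chunk(b"tEXt", b"Title\x00My Story")
    + make_chunk(b"IDAT", pixels)
    + make_chunk(b"IEND", b"")
)
text_metadata = read_text_chunks(png)
```

Here the metadata travels inside the image file, in contrast to the more usual arrangement noted on the slide where it is kept separately.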

31 Metadata for Images and Multimedia
- Two Metadata Formats:
  - TIFF
    - Tagged Image File Format
    - Associates metadata with image files
    - Widespread use for over a decade
    - How images are stored in digital libraries
      - Normal images
      - Document images
  - MPEG-7
    - Multimedia Content Description Interface
    - Scheme to define and store metadata associated with any multimedia information
    - General, extensible, and still being standardized

32 Extracting Metadata
- Text Mining
  - Automatic extraction of information from text
- Plain text documents
  - Require text comprehension skills
  - Computer techniques for text analysis
  - Good results in constrained domains
- XML and other Structured Markup Languages
  - Make key aspects of documents available to computers and people
  - Encoded information can easily be extracted by parsing the document structure
  - Few documents contain explicitly encoded metadata

33 General Techniques
- Extracting Document Metadata
  - Title, Author, Publisher, Date, etc.
- Generic Entities
  - Email, URLs, Dates, Time, Money
- Bibliography Entries
  - Citation analysis
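Generic entities are the easiest of these to extract because they follow recognizable surface patterns. The sketch below uses deliberately simple regular expressions for four of them; the patterns and the sample sentence are assumptions for illustration and would miss many real-world variants.

```python
import re

# Deliberately simple patterns; real text needs more careful rules.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s<>\"]+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def extract_entities(text):
    """Return {kind: [matches]} for each pattern, in order of appearance."""
    return {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}


# Hypothetical sample text.
sample = (
    "Contact librarian@example.org before 2003-05-01. "
    "Details: http://example.org/fees (membership costs $1,250.00)."
)
entities = extract_entities(sample)
```

Title, author, and citation extraction are harder because they depend on layout and context rather than on a fixed pattern.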

34 Key Phrase Metadata
- Key-phrase metadata can successfully be obtained automatically from documents
- Two Different Approaches:
  - Key-Phrase Assignment
  - Key-Phrase Extraction
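Of the two approaches, extraction picks phrases that occur in the document itself. The sketch below is a naive frequency-based version and is not the method the authors describe: the stopword list, the two-word limit, the ranking rule, and the sample text are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "with", "are", "on"}


def candidate_phrases(text, max_words=2):
    """Yield phrases of up to max_words words that contain no stopword."""
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        words = re.findall(r"[a-z]+", fragment)
        for size in range(1, max_words + 1):
            for start in range(len(words) - size + 1):
                phrase = words[start:start + size]
                if not any(word in STOPWORDS for word in phrase):
                    yield " ".join(phrase)


def key_phrases(text, count=3):
    """Rank candidates by frequency, preferring longer phrases on ties."""
    frequencies = Counter(candidate_phrases(text))
    ranked = sorted(frequencies, key=lambda p: (-frequencies[p], -len(p.split()), p))
    selected = []
    for phrase in ranked:
        # Skip a phrase already covered by a chosen one (crude substring test).
        if not any(phrase in chosen for chosen in selected):
            selected.append(phrase)
        if len(selected) == count:
            break
    return selected


sample = (
    "A digital library organizes documents. Metadata in a digital library "
    "helps readers find documents. Good metadata makes a digital library useful."
)
top = key_phrases(sample)
```

Key-phrase assignment would instead choose phrases from a fixed controlled vocabulary, whether or not they appear in the text.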

35 Generating Phrase Hierarchies
- Key phrases consist of a few well-chosen words that characterize the document
- It is useful to extract a structure that contains ALL the phrases in the documents
- Hierarchical structure of phrases can support browsing around a digital library collection

