Published by Clarence Bell. Modified over 8 years ago.
1
Markup and Metadata
How to Build a Digital Library
Ian H. Witten and David Bainbridge
2
Digital Library Elements
- Basic Elements of Organization
  - Markup
    - Controls structure and appearance
  - Metadata
    - Expedites access
3
Structural Markup
- Identify and maintain the document structure:
  - Section divisions
  - Headings
  - Subsection structure
  - Lists
  - Quotations
  - Tabular information
- Structural markup items become metadata
4
Presentation Markup
- Specify how the document will appear typographically by formatting the document:
  - Page size
  - Headers and footers
  - Font
  - Line spacing
  - Section headers
  - Figures
5
Kinds of Metadata
- Assist navigation
  - Structural markup
- Resource discovery
  - Metadata to assist in finding documents through searching and browsing
  - Value of digital libraries depends on how easily information can be located
- Policy
  - Define rights, restrictions, and rules that govern who can do what with digital resources
- Administration and Preservation
  - Information necessary to preserve the integrity and functionality of a digital resource long term
6
Explicit versus Extracted Metadata
- Explicit Metadata
  - Requires careful analysis of a document
  - Takes 1-2 hours to create a traditional library catalog entry (or 5 minutes, depending on number of fields!)
- Extracted Metadata
  - “Text Mining”
  - Automatically obtained from the contents of a document
  - Cheaper, but less reliable
7
HTML
- Hypertext Markup Language
- Document format of the World Wide Web
- Original vision: separate document structure from presentation
- Inconsistent formatting and metadata conventions in HTML can hinder automatic processing of document collections
8
Basic HTML
- Angle brackets enclose words
  - My Story
- Tag names are not case sensitive
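As a small illustration of case-insensitive tag names, the sketch below feeds the same element to Python's standard `html.parser` module in upper and lower case. The choice of an `h1` element around "My Story" is an assumption made for the example; the slide's original markup is not visible in this transcript.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every opening tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag names already converted to lower case.
        self.tags.append(tag)


def opening_tags(html_text):
    """Return the opening tag names found in html_text, in document order."""
    collector = TagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tags


# Hypothetical snippet: the same heading written with different capitalisation.
upper = opening_tags("<H1>My Story</H1>")
lower = opening_tags("<h1>My Story</h1>")
print(upper == lower)
```

Because the parser normalises names before reporting them, code that processes a collection can compare tags without worrying about how each author capitalised them.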
9
HTML Tags
- <p>: Paragraph
- <tr>: Table Row
- <td>: Table Cell
- Special characters
- <li>: List item
- <img>: Images
- <i>: Italics
- <ul>: Unordered List, Bulleted List
- <a>..</a>: Link Anchor
10
HTML
- Opening Tags
  - Attributes
  - Special Markers
- Header
  - Gives global information
  - Title, encoding scheme, metadata
- Body
  - ASCII / UTF-8 Unicode
- Local link anchors
  - Navigation within a single document
- Forms
  - Collect data from user
- Frames
  - HTML document can be tiled into smaller, independent segments (each an HTML page)
  - Frameset (a set of frames) can be displayed simultaneously (useful for navigation bars)
11
HTML in Digital Libraries
- Many source documents are presented in HTML form
- Explicit specification of metadata using tags
- Extract text
  - Plain text browser “lynx” extracts text from HTML documents
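The two tasks on this slide, reading explicitly tagged metadata and pulling out plain text, can be sketched with Python's standard `html.parser`. This is a minimal illustration rather than how lynx or any digital library system works; it assumes the metadata is carried in `<meta name="..." content="...">` tags, and the sample page, author name, and keywords are invented.

```python
from html.parser import HTMLParser


class MetadataAndText(HTMLParser):
    """Collect <meta name=... content=...> pairs and the visible text."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.metadata[name] = content

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty runs of text that are not script or style content.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_metadata_and_text(html_text):
    """Return (metadata dict, plain text) for one HTML document."""
    parser = MetadataAndText()
    parser.feed(html_text)
    parser.close()
    return parser.metadata, " ".join(parser.text_parts)


# Hypothetical page used only for illustration.
SAMPLE = """<html><head>
<meta name="author" content="A. Writer">
<meta name="keywords" content="digital libraries, metadata">
</head><body><h1>My Story</h1>
<p>Once upon a <i>time</i> there was a library.</p></body></html>"""

metadata, text = extract_metadata_and_text(SAMPLE)
print(metadata)
print(text)
```

Joining text runs with single spaces is a simplification; a real text extractor would also respect block boundaries such as paragraphs and table cells.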
12
XML
- Extensible Markup Language
- Flexible way to characterize document structure and metadata
- Well suited to digital libraries
- Widespread use
13
XML Document Type Definition
- DTD = Document Type Definition
- Tag Syntax
  - Keywords in block capitals
  - Square brackets [...] indicate that the DTD appears in-line
  - Otherwise, the DTD can be in an external file
    - Referred to by a URL
    - Desirable
- New elements
  - Keyword ELEMENT
  - Tag name
  - Description of what the element may contain
- A Leaf
  - An element that is plain text, with no markup
  - Declared as #PCDATA (parsed character data)
- Special Characters
  - Encoded as in HTML (&lt;, &amp;, etc.)
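The pieces listed above can be seen together in a small in-line DTD. The sketch below is an assumed example, not taken from the slides: the element names `book`, `title`, and `author` are invented, and the external-file form in the comment uses a placeholder URL. Python's standard `xml.etree.ElementTree` reads the document but does not validate it against the DTD.

```python
import xml.etree.ElementTree as ET

# An in-line DTD sits between square brackets inside the DOCTYPE declaration.
# The external alternative would look like (placeholder URL):
#   <!DOCTYPE book SYSTEM "http://example.org/book.dtd">
DOCUMENT = """<!DOCTYPE book [
  <!ELEMENT book   (title, author+)>
  <!ELEMENT title  (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
]>
<book>
  <title>How to Build a Digital Library</title>
  <author>Ian H. Witten</author>
  <author>David Bainbridge</author>
</book>"""

# ElementTree checks XML syntax only; the ELEMENT declarations are parsed
# but not enforced.
root = ET.fromstring(DOCUMENT)
title = root.findtext("title")
authors = [author.text for author in root.findall("author")]
print(title, authors)
```

Here `title` and `author` are leaves declared as `#PCDATA`, while `book` is described by the elements it may contain.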
14
XML Regular Expressions
- Regular expression
  - Comma indicates an ordered sequence
  - Vertical bar indicates a choice of one element from the alternatives
  - Asterisk indicates zero or more
  - Plus indicates one or more
  - Question mark indicates zero or one
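To make the operators concrete, the sketch below hand-translates one hypothetical content model into an ordinary Python regular expression and tests sequences of child tags against it. The content model and tag names are assumptions chosen for illustration; a real validating parser does this work from the DTD itself.

```python
import re

# Hypothetical content model:
#   (title, author+, (abstract | summary)?, section*)
# Each child tag is written followed by one space, so a list of children
# becomes a plain string that a regular expression can test.
CONTENT_MODEL = re.compile(
    r"title "                   # comma: an ordered sequence, title first
    r"(author )+"               # plus: one or more
    r"((abstract |summary ))?"  # bar: a choice; question mark: zero or one
    r"(section )*"              # asterisk: zero or more
)


def matches_model(child_tags):
    """Return True if the sequence of child tag names fits the content model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return CONTENT_MODEL.fullmatch(sequence) is not None


print(matches_model(["title", "author", "author", "summary", "section"]))
print(matches_model(["author", "title"]))
```

The first call fits the model; the second does not, because the comma operator fixes the order of `title` and `author`.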
15
XML Attributes
- Attributes
  - Give set of possible values
  - No nesting
- Keyword ATTLIST
  - Element to which it applies
  - Attribute name
  - Attribute type
  - Appearance restrictions (optional)
16
XML Entities
- Entities: &lt;, &amp;, &gt;, &apos;, &quot;
- New entities can be added in the DTD
- Use syntax
  - ENTITY
  - Name
  - “value”
- Example:
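As a hypothetical illustration of the ENTITY syntax (the entity name `dl` and the sample text are assumptions, not the slide's own example), the sketch below declares a new entity in an in-line DTD and lets Python's standard XML parser expand it alongside the predefined ones. It also shows `xml.sax.saxutils.escape`, which goes the other way and encodes special characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# A new entity is declared with the keyword ENTITY, a name, and a quoted value.
DOCUMENT = """<!DOCTYPE note [
  <!ENTITY dl "digital library">
]>
<note>Building a &dl; with &lt;tags&gt; &amp; metadata.</note>"""

note = ET.fromstring(DOCUMENT)
# The parser replaces &dl; and the predefined entities with their values.
print(note.text)

# escape() encodes &, < and > by default; the extra mapping adds the quotes.
escaped = escape('AT&T <"quoted">', {'"': "&quot;", "'": "&apos;"})
print(escaped)
print(unescape("&lt;b&gt;"))
```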
17
XML Parameter Entity
- Several elements share the same attributes
- Parameter Entity
  - Special type of entity
  - Percent symbol
18
Well Formed and Valid XML
- Well Formed
  - A document that conforms to XML syntax; it need not supply a DTD (Document Type Definition)
- Valid
  - A document that conforms to XML syntax and does supply a DTD
  - The content follows the syntactic constraints defined in the DTD
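The well-formedness half of this distinction can be checked with Python's standard library, as in the minimal sketch below. The helper name `is_well_formed` is an assumption for the example. Checking validity needs a validating parser, which `xml.etree.ElementTree` is not, so that half is left out.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text obeys general XML syntax (no DTD check)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b>text</b></a>"))   # properly nested
print(is_well_formed("<a><b>text</a></b>"))   # tags overlap
print(is_well_formed("<a>one</a><a>two</a>")) # two root elements
```

Only the first document is well formed: the second closes its tags in the wrong order and the third has more than one root element.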
19
Parsing XML
- Parsing indicates whether the document conforms to the general rules of XML (or the specific DTD, when applicable)
- Parsing produces a parse tree
  - Begins with a root node
  - Root node has descendants
  - Descendants reflect text content and nested tags
- Programming Interface
  - Lets user traverse the tree and retrieve the data
  - “API”: Application Program Interface
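A parse tree and its traversal can be sketched with `xml.etree.ElementTree`, one of several tree APIs available in Python. The sample document and the helper name `outline` are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return one indented line per node, visiting the tree depth first."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + (": " + text if text else "")
    lines = [line]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical document: the root node is <library>.
root = ET.fromstring(
    "<library><book>"
    "<title>How to Build a Digital Library</title>"
    "<author>Ian H. Witten</author>"
    "</book></library>"
)
print("\n".join(outline(root)))
```

Each level of indentation in the output corresponds to one level of tag nesting below the root node.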
20
XML DOM
- Document Object Model
- Application Program Interface (API)
  - Cross-platform
  - Cross-language
- Allows programs to be written that access and modify the document’s:
  - Content
  - Structure
  - Style
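Python ships a lightweight DOM implementation, `xml.dom.minidom`, which is enough to sketch how a program changes a document's content and structure through the DOM interface. The sample document is invented, and style manipulation is not covered by this example.

```python
from xml.dom.minidom import parseString

dom = parseString("<book><title>Draft title</title></book>")
book = dom.documentElement

# Content: replace the text node inside <title>.
title = book.getElementsByTagName("title")[0]
title.firstChild.data = "How to Build a Digital Library"

# Structure: create a new <author> element and attach it to <book>.
author = dom.createElement("author")
author.appendChild(dom.createTextNode("Ian H. Witten"))
book.appendChild(author)

result = book.toxml()
print(result)
```

The method names used here (`getElementsByTagName`, `createElement`, `createTextNode`, `appendChild`) come from the DOM specification, which is why similar code can be written in other languages.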
21
XML and Digital Libraries
- XML is powerful
  - XML allows file formats within a digital library to be shared
  - Structure explanations are put in a DTD (Document Type Definition)
  - XML provides syntax for expressing structural information (metadata)
- XML goes further by combining with other standards:
  - Support document restructuring, querying, information extraction and formatting
  - Can have display capabilities similar to HTML
22
Style Sheets
- Control the presentation of marked-up documents
- Two Kinds of Style Sheets:
  - Cascading Style Sheets
    - Work with HTML and XML
  - Extensible Stylesheet Language (XSL)
    - Works with XML
    - Powerful
    - Allows document structure to be altered dynamically
23
Bibliographic Metadata
- Two Standards for Representing Document Metadata:
  - Machine-Readable Cataloging (MARC)
    - Used by professional catalogers in libraries
  - The Dublin Core
    - Minimal standard used by people who are not trained in library cataloging
- Two metadata formats used by document authors in scientific and technical fields:
  - BibTeX
  - Refer
24
MARC
- Machine-Readable Cataloging
- Internally stored as collection of tagged fields
- Format covers:
  - Bibliographic records
  - Authority records: standardized forms that are part of the librarian’s controlled vocabulary
- Governed by AACR2R
  - Anglo-American Cataloging Rules
  - Detailed set of rules and guidelines
  - Two Parts
    - Part 1: Description of Documents
    - Part 2: Description of Works
25
Dublin Core
- Set of metadata elements
- Simple: designed for non-specialists
- Intended for electronic materials that will not receive a full MARC catalog entry
- Named after Dublin, Ohio
  - The first meeting was held there in 1995
- Approved by ANSI (American National Standards Institute) in 2001
26
Dublin Core
- Fifteen metadata elements form the core element set
  - May be refined through qualifiers
  - May be augmented by additional elements for local purposes
- Resource
  - “Anything that has identity”
  - Similar to “entity” (objectives of bibliographic system)
- Does not impose any kind of vocabulary control or authority files
  - Two people might generate very different descriptions of the same resource
27
Dublin Core Metadata Standard
- Title
- Creator
- Subject
- Description
- Publisher
- Contributor
- Date
- Type
- Format
- Identifier
- Source
- Language
- Relation
- Coverage
- Rights
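One common way to write these fifteen elements down is as XML in the Dublin Core element namespace. The sketch below builds such a record with Python's standard library; the wrapper element `<metadata>` and the helper name `dublin_core_record` are assumptions for the example, not part of the standard.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_record(fields):
    """Build a record from {element name: string or list of strings}."""
    ET.register_namespace("dc", DC_NAMESPACE)
    record = ET.Element("metadata")  # hypothetical wrapper element
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError("not a Dublin Core element: " + name)
        if isinstance(values, str):
            values = [values]
        for value in values:  # every element may be repeated
            child = ET.SubElement(record, "{%s}%s" % (DC_NAMESPACE, name))
            child.text = value
    return record


record = dublin_core_record({
    "title": "How to Build a Digital Library",
    "creator": ["Ian H. Witten", "David Bainbridge"],
})
print(ET.tostring(record, encoding="unicode"))
```

The function checks only element names. It imposes no controlled vocabulary on the values, which matches the Dublin Core's deliberately loose design.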
28
BibTeX
- Manages bibliographic data and references within documents
- TeX
  - Generalized document-processing system
  - Scientific, Mathematical and Technical Purposes
- LaTeX
  - Customized Version of TeX
  - Freely available
- BibTeX
  - Subsystem of LaTeX
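For readers who have not seen the format, the sketch below holds a small BibTeX entry and reads it with a deliberately naive parser. The citation key `witten-bainbridge` is invented, the entry carries only the author and title given on the title slide, and the parser handles just the simple case of brace-delimited values without nested braces.

```python
import re

# A BibTeX entry: entry type, citation key, then name = {value} fields.
ENTRY = """@book{witten-bainbridge,
  author = {Ian H. Witten and David Bainbridge},
  title  = {How to Build a Digital Library}
}"""


def parse_bibtex_entry(text):
    """Parse one simple entry into its type, key, and fields."""
    header = re.match(r"\s*@(\w+)\s*\{\s*([^,\s]+)\s*,", text)
    if header is None:
        raise ValueError("not a BibTeX entry")
    fields = {
        name.lower(): value
        for name, value in re.findall(r"(\w+)\s*=\s*\{([^{}]*)\}", text)
    }
    return {
        "type": header.group(1).lower(),
        "key": header.group(2),
        "fields": fields,
    }


entry = parse_bibtex_entry(ENTRY)
# BibTeX separates multiple authors with the word "and".
authors = entry["fields"]["author"].split(" and ")
print(entry["type"], entry["key"], authors)
```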
29
Refer
- Similar to BibTeX
- Designed by computer scientists for use by scientific and technical researchers
- Basis of EndNote
  - Bibliographic tool which augments Microsoft Word
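A Refer record is a run of lines, each starting with a percent sign and a one-letter field code such as `%A` for an author or `%T` for the title. The sketch below reads such a record into a dictionary; the helper name `parse_refer_record` is an assumption, and the record holds only the details given on the title slide.

```python
# One Refer record: each line is "%<code> <value>"; %A may repeat.
RECORD = """%A Ian H. Witten
%A David Bainbridge
%T How to Build a Digital Library"""


def parse_refer_record(text):
    """Map each one-letter field code to the list of values given for it."""
    record = {}
    for line in text.splitlines():
        if line.startswith("%") and len(line) > 2:
            code, value = line[1], line[3:].strip()
            record.setdefault(code, []).append(value)
    return record


parsed = parse_refer_record(RECORD)
print(parsed["A"], parsed["T"])
```

Keeping a list per field code preserves repeated fields, which is how Refer represents multiple authors.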
30
Metadata for Images and Multimedia
- Metadata is not confined to text
- Most image files include data about resolution
- PNG can store text strings
- Image metadata is usually kept separate from the image file
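To show what "PNG can store text strings" means in practice, the sketch below builds a tiny PNG in memory with one `tEXt` chunk and then reads the text back, using only the standard library. The helper names are assumptions, the 1x1 image exists only to make the file complete, and the reader ignores the compressed and international text chunk types.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """A PNG chunk is: 4-byte length, 4-byte type, data, 4-byte CRC."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def build_png_with_text(keyword, text):
    """Build a 1x1 greyscale PNG that carries one tEXt chunk."""
    # Width, height, bit depth, colour type, compression, filter, interlace.
    header = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    pixels = zlib.compress(b"\x00\x00")  # filter byte plus one black pixel
    text_data = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return (
        PNG_SIGNATURE
        + make_chunk(b"IHDR", header)
        + make_chunk(b"tEXt", text_data)
        + make_chunk(b"IDAT", pixels)
        + make_chunk(b"IEND", b"")
    )


def read_text_chunks(png_bytes):
    """Return {keyword: text} for every tEXt chunk in the file."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found = {}
    offset = len(PNG_SIGNATURE)
    while offset < len(png_bytes):
        length, chunk_type = struct.unpack(">I4s", png_bytes[offset:offset + 8])
        data = png_bytes[offset + 8:offset + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, value = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = value.decode("latin-1")
        offset += 12 + length  # length field + type + data + CRC
    return found


png = build_png_with_text("Title", "My Story")
print(read_text_chunks(png))
```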
31
Metadata for Images and Multimedia
- Two Metadata Formats:
  - TIFF
    - Tagged Image File Format
    - Associates metadata with image files
    - Widespread use for over a decade
    - How images are stored in digital libraries
      - Normal images
      - Document images
  - MPEG-7
    - Multimedia Content Description Interface
    - Scheme to define and store metadata associated with any multimedia information
    - General, extensible, and still being standardized
32
Extracting Metadata
- Text Mining
  - Automatic extraction of information from text
- Plain text documents
  - Require text comprehension skills
  - Computer techniques for text analysis
  - Good results in constrained domains
- XML and other Structured Markup Languages
  - Make key aspects of documents available to computers and people
  - Encoded information can easily be extracted by parsing the document structure
  - Few documents contain explicitly encoded metadata
33
General Techniques
- Extracting Document Metadata
  - Title, Author, Publisher, Date, etc.
- Generic Entities
  - Email, URLs, Dates, Time, Money
- Bibliography Entries
  - Citation analysis
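Generic entities are often a good fit for pattern matching because their surface form is regular. The sketch below uses four simplified regular expressions; the patterns, the ISO-style date format, and the sample sentence are assumptions for illustration and would miss many real-world variants. Times are left out for brevity.

```python
import re

# Deliberately simple patterns; production extractors need many more cases.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"[$€£]\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def extract_entities(text):
    """Return every match of each pattern, keyed by entity kind."""
    return {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}


# Hypothetical sentence used only for illustration.
SAMPLE = (
    "Contact librarian@example.org or see https://example.org/catalog "
    "for details. The workshop on 2001-09-10 costs $1,250.00."
)
entities = extract_entities(SAMPLE)
print(entities)
```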
34
Key Phrase Metadata
- Key-phrase metadata can successfully be obtained automatically from documents
- Two Different Approaches:
  - Key-Phrase Assignment
  - Key-Phrase Extraction
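The difference between the two approaches can be sketched as follows, on the understanding that assignment picks phrases from a fixed, controlled vocabulary while extraction picks phrases that occur in the text itself. Both functions are toy stand-ins built on simple counting and word overlap, not the methods used in the book; the stopword list, vocabulary, and sample text are invented.

```python
import re
from collections import Counter

STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the to with".split()
)


def words(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def extract_key_phrases(text, max_words=2, top=3):
    """Extraction: candidates are word sequences taken from the text itself."""
    tokens = words(text)
    counts = Counter()
    for size in range(1, max_words + 1):
        for start in range(len(tokens) - size + 1):
            candidate = tokens[start:start + size]
            # Skip phrases that begin or end with a stopword.
            if candidate[0] in STOPWORDS or candidate[-1] in STOPWORDS:
                continue
            counts[" ".join(candidate)] += 1
    # Most frequent first; longer phrases win ties, then alphabetical order.
    ranked = sorted(
        counts.items(),
        key=lambda item: (-item[1], -len(item[0].split()), item[0]),
    )
    return [phrase for phrase, _ in ranked[:top]]


def assign_key_phrases(text, vocabulary):
    """Assignment: choose phrases from a fixed vocabulary via cue words."""
    text_words = set(words(text))
    return [phrase for phrase, cues in vocabulary.items() if text_words & set(cues)]


TEXT = "Digital libraries need metadata. Metadata helps readers search digital libraries."
VOCABULARY = {  # hypothetical controlled vocabulary with cue words
    "Information retrieval": ["search", "query"],
    "Cataloging": ["metadata", "catalog"],
    "Typography": ["font"],
}
extracted = extract_key_phrases(TEXT)
assigned = assign_key_phrases(TEXT, VOCABULARY)
print(extracted)
print(assigned)
```

Assigned phrases such as "Information retrieval" need not appear anywhere in the document, whereas extracted phrases always do.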
35
Generating Phrase Hierarchies
- Key phrases consist of a few well-chosen words that characterize the document
- It is useful to extract a structure that contains ALL the phrases in the documents
- Hierarchical structure of phrases can support browsing around a digital library collection