These ain’t “Old News”! Creating access to historic newspapers Christine Guenther OCLC Product Manager, Digital Services Preservation Service Centers Bethlehem, PA CALIFA – January 9, 2009
Objective Learn about the workflow to turn historical newspapers into a searchable collection online - starting from preservation microfilm or original paper. Prepare for the key decisions that lead to success and help you define your vision and expectations.
Outline Scanning workflow Metadata decisions Access options Your digital newspaper project
Getting Started Develop a vision & plan (Goal, Scope, Budget/Funding, Stakeholders…) Select content (titles, date range, page count, quality/completeness, copyright, …) Select format to digitize: Film or Original? Assess film quality (imaging, collation, film generation) No film available? Consider analog preservation as part of digital project
Film generations Archive Master 1st generation Print Master 2nd generation Service Copies 3rd generation Best choice
Line that separates columns Heavy scratches Example: Heavily scratched Service Copy
Section 2: Content Conversion Content Conversion is a major intersection – and it’s tied to your vision for access (presentation system) Determine what digital building blocks are needed for the planned presentation system: METADATA CREATION/COLLECTION (incl. text recognition - OCR) JPEG/JPEG2000 XML (METS/ALTO or other) PDF
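To make the XML building block more concrete, here is a minimal sketch of how a presentation system might read OCR words and their page coordinates from an ALTO file. The sample fragment, its coordinates, and the function names `local_name` and `extract_words` are invented for illustration; real ALTO files from a vendor are much larger and may use a different schema version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ALTO fragment (illustration only).
# The words and coordinates are made up, not taken from a real scan.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="6000" HEIGHT="8000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Ambler" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="60"/>
            <SP/>
            <String CONTENT="Gazette" HPOS="420" VPOS="200" WIDTH="320" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return each recognized word with its bounding box on the page image."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append({
                "text": element.get("CONTENT", ""),
                "x": int(float(element.get("HPOS", 0))),
                "y": int(float(element.get("VPOS", 0))),
                "width": int(float(element.get("WIDTH", 0))),
                "height": int(float(element.get("HEIGHT", 0))),
            })
    return words


words = extract_words(ALTO_SAMPLE)
full_text = " ".join(word["text"] for word in words)
print(full_text)
```

The word coordinates are what allow a viewer to highlight search hits directly on the page image, which is one reason the XML output is more than just a text dump.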
OCR - Optical Character Recognition simple OCR (uncorrected) vs. enhancements (Headline/byline correction, article classification, text correction)
OCR – the rocky road to “99%” (?) Input: “photo” of the page Zoning: Columns & reading order Analyze characters/words – Recognition All CAPS fonts (major headlines) yield low accuracy OCR is a cost-effective tool to gain “full-text” searchability.
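One way to put a number on OCR accuracy is to compare the OCR output against a hand-keyed transcription of the same text and count character-level edits. The sketch below is one possible measure, not necessarily the one a given vendor uses; the function names and the sample headline are made up for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(truth, ocr):
    """Share of ground-truth characters recognized correctly (0.0 to 1.0)."""
    if not truth:
        return 1.0 if not ocr else 0.0
    errors = levenshtein(truth, ocr)
    return max(0.0, 1.0 - errors / len(truth))


# Made-up all-caps headline with three typical OCR confusions (I/1, I/l, O/0).
truth = "CITY COUNCIL APPROVES NEW BRIDGE"
ocr = "C1TY COUNClL APPR0VES NEW BRIDGE"
print(f"character accuracy: {character_accuracy(truth, ocr):.1%}")
```

Note that character accuracy overstates search quality: in this made-up headline three of the five words are misspelled, so a search for "COUNCIL" would miss it even though most characters are right.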
Main Choices for Content Conversion Image Only approach (aka digital microfilm) vs. PDF based vs. integrated model where page images and metadata are integrated via a presentation system.
PDF based presentation PRO: Common format; OCR; Multi-page; Free Reader; Printing CON: Slow; Not suitable for 8bit; Secondary searches; Not scalable; Hidden searchable text
Integrated Presentation: Page level Page Level Access Example: CONTENTdm FEATURES: Bitonal or gray Search across collections Primary hits in JPEG 2000 Clipping tool Rich metadata, not only from OCR, but also Dublin Core
Integrated Presentation: Article level With article segmentation
Section 3: Presentation Digital Newspaper Collections go live! Page Level Access in CONTENTdm: AccessPA group license: www.accesspadigital.org Lycoming College, PA Wissahickon Valley PL, PA – Ambler Gazette Article Level Access in CONTENTdm: Seattle Spectator
Outlook – The Challenges Analog preservation (film) vs. electronic preservation: File sizes, costs of storage; scanning with digital preservation in mind creates loads of data “If you give the mouse a cookie….” (aka setting expectations) Regaining full-text logic from a photograph of a page; Newspapers are oversize, portrait format, screen is landscape. Zooming will improve legibility, but will not show full page at same time. Access without DAM is not practical, but has costs associated
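To see why scanning with digital preservation in mind "creates loads of data", a rough back-of-the-envelope estimate helps. Every number in this sketch is an assumption chosen for illustration (a 15 x 22 inch page, 400 dpi, 10,000 pages); substitute the values from your own project specification.

```python
def uncompressed_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size of one uncompressed page image in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def collection_gib(page_count, page_bytes, compression_ratio=1.0):
    """Total storage in GiB; compression_ratio=1.0 means uncompressed."""
    return page_count * page_bytes / compression_ratio / 1024 ** 3


# Assumed values, for illustration only.
PAGE_WIDTH_IN = 15
PAGE_HEIGHT_IN = 22
DPI = 400
PAGES = 10_000

gray_page = uncompressed_page_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, 8)
bitonal_page = uncompressed_page_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, 1)

print(f"8-bit gray page:  {gray_page / 1e6:.1f} MB uncompressed")
print(f"bitonal page:     {bitonal_page / 1e6:.1f} MB uncompressed")
print(f"{PAGES} gray pages: {collection_gib(PAGES, gray_page):.0f} GiB uncompressed")
```

Under these assumptions a grayscale master is eight times the size of a bitonal one, which is why the gray-versus-bitonal decision and the choice of compression belong in the budget discussion early on.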
Resources National Digital Newspaper Program (NDNP) http://www.loc.gov/ndnp/ (partnership of the Library of Congress and the National Endowment for the Humanities)
Questions? Today: Break-out sessions “Tomorrow”: Contact Christine Guenther firstname.lastname@example.org OCLC Preservation Service Center Bethlehem, PA 1-800-773 7222 Thank you!