PDF (Portable Document Format) for Digital Preservation and Delivery John Laurie Digital Initiatives Librarian The University of Auckland Library National.

Presentation transcript:

PDF (Portable Document Format) for Digital Preservation and Delivery
John Laurie, Digital Initiatives Librarian, The University of Auckland Library
National Digital Forum 2012

Issues
- Is PDF good enough?
- What is the maximum file size?
- PDF/A or simple PDF?
- Searchable text or ClearScan?
- How dirty is our OCR?
- Can we attach metadata to PDF files?
- Should we be using METS-ALTO instead?

Local PDF collections at the University of Auckland
- Exam papers (image-only): DigiTool
- JPS, NZJH, Early NZ Statutes, The Bookshelf: B-engine
- Theses, working papers: DSpace
- Course materials (mainly chapters from books): linked from the Catalogue

Advantages and Disadvantages
- "PDF and PDF/A broadly acceptable for long term digital archiving" (Seadle, Michael. Library Hi Tech 27.4 (2009))
- Widely used, constantly improving, search engine friendly
- Open standard since 2008
- Read out loud, print
- But simple? A morass of variables in my experience
- Image PDF files are large and slow to load
- Editing a problem: crowdsourcing proofreading
- Difficult to repurpose as HTML etc.
- Metadata only at the item level

Scanning for PDF
- Condition of originals
- Target outcomes: searchable text or ClearScan
- 300dpi for clear modern fonts
- 400dpi for older documents and very small fonts
- Adobe Acrobat or FineReader
- Different settings needed for photos and text-only pages
- Black-and-white scans don't work for historical texts and old newspapers
- Splitting born-digital PDFs

Optical Character Recognition (OCR)
- Accuracy depends on document and font; getting better all the time
- FineReader is better than Adobe Acrobat but doesn't offer a ClearScan option
- ClearScan vs searchable image: dirty OCR hidden behind the image
- FineReader offers spell-checking, find-and-replace editing, proofreading
- Tables, HTML versions, rekeying
- pdftotext and other text extractors for indexing

ABBYY FineReader 11 Spellchecking options

PDF text behind image

HTML showing actual text

File Sizes, Optimising Files
- Compromise between image quality and overlarge files
- What size is too big?
- Text behind image: I'm saving at 300dpi, 40% quality, about 200K per page for simple text
- Breaking up into smaller sections
- Batch optimising
- Preservation masters: simple text, saved as 5-6MB TIFF as part of FineReader files
- Reduce File Size is the best method, but the result often can't be saved as PDF/A afterwards
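The file-size trade-off can be put into rough numbers. The sketch below is only back-of-envelope arithmetic: the 200K-per-page figure comes from the slide, while the A4 page dimensions, the 10MB download cap and the function names are assumptions made for illustration.

```python
import math


def raw_scan_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Uncompressed size of one scanned page, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def pdf_size_mb(pages, kb_per_page=200):
    """Estimated PDF size in MB at a given average size per page."""
    return pages * kb_per_page / 1024


def parts_needed(pages, limit_mb, kb_per_page=200):
    """How many sections a document must be split into to stay under limit_mb."""
    pages_per_part = int(limit_mb * 1024 // kb_per_page)
    return math.ceil(pages / pages_per_part)


# Assumed example: an A4 page (8.27 x 11.69 in) scanned in 8-bit greyscale.
print(raw_scan_bytes(8.27, 11.69, 300, 1))  # bytes before any compression
print(pdf_size_mb(300))                     # a 300-page book at 200K per page
print(parts_needed(300, limit_mb=10))       # sections needed under an assumed 10MB cap
```

At 200K per page, a 300-page book comes to nearly 60MB, which is why breaking up into smaller sections appears on the same slide.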

PDF/A, PDF/A-1a, PDF/A-1b
- "PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for the digital preservation of electronic documents"
- A-1a is stricter than A-1b
- Many PDF files can't be saved as PDF/A after "Reduce File Size" because it substitutes non-embedded fonts
- Many fonts not allowed to be embedded?
- Preflight identifies errors
- Medline wants a PDF/A copy of each article
- PDFs downloaded from EBSCO, Springer and ProQuest are not PDF/A compliant
- Will the smarter computers of the future need embedded fonts?
- "As we all get smarter and technology improves the acute concerns about format obsolescence may diminish" (Butch Lazorchak, The Signal, Library of Congress)

ClearScan vs Searchable Image
- ClearScan files are just over half the size, and are sharper and clearer
- No ClearScan option from FineReader (which offers spellcheck, find-and-replace editing, TIFF master copies)
- ClearScan substitutes a new font that matches the letter shape, not the OCRed text; unlike a text-only PDF, it can't guarantee 100% accuracy
- But pretty good, especially on clean text

Adobe ClearScan example Text behind image says AkaroQ

Adobe Searchable Image Version Text behind image says AkaroQ

FineReader Text over the image (FR reads Akaroa correctly from the same TIFF file)
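One way to put a number on "how dirty is our OCR?" is a character error rate: the edit distance between the OCR output and a proofread reference, divided by the reference length. The sketch below is a generic illustration, not a method described in the talk, and the function names are made up for this example. It uses the AkaroQ/Akaroa pair from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance from reference to OCR output, per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# One wrong character out of six in the reference word.
print(character_error_rate("AkaroQ", "Akaroa"))
```

The measure needs a proofread reference, so in practice it would be run on a small checked sample, not a whole collection.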

Problems with text extraction for indexing using the pdftotext utility
- Search for "t h e"
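Letter-spaced output such as "t h e" can be repaired before indexing with a simple heuristic. This sketch assumes the extractor inserted single spaces between the letters of a word. The three-letter minimum and the function name are choices made for this example, not something taken from the talk.

```python
import re

# Three or more single letters, each separated by exactly one space,
# bounded by whitespace or the start/end of the text.
_SPACED_WORD = re.compile(r"(?<!\S)(?:[^\W\d_] ){2,}[^\W\d_](?!\S)")


def collapse_letter_spacing(text):
    """Join runs of single spaced-out letters back into one token."""
    return _SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)


print(collapse_letter_spacing("Search for t h e word"))
```

The heuristic cannot tell where one spaced-out word ends and the next begins, so two adjacent spaced words would be merged into one token. It is a clean-up for occasional glitches, not a substitute for better extraction.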

And diacritics
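Diacritics often break search because the same letter can be stored as one precomposed character or as a base letter plus a combining mark. The slide does not say which characters caused trouble, so the macron example below is an assumption. It shows two common indexing steps: normalising to NFC, and folding marks away for a diacritic-insensitive search key.

```python
import unicodedata


def normalize_nfc(text):
    """Compose base letters and combining marks into single code points."""
    return unicodedata.normalize("NFC", text)


def fold_diacritics(text):
    """Strip combining marks, for use as a diacritic-insensitive search key."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


decomposed_form = "Ma\u0304ori"   # 'a' followed by a combining macron
precomposed_form = "M\u0101ori"   # single 'a with macron' character

print(normalize_nfc(decomposed_form) == precomposed_form)
print(fold_diacritics(precomposed_form))
```

Keeping the NFC form for display and the folded form for matching lets a search with or without macrons find the same document.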

PDF XMP metadata
- Attaching Dublin Core metadata to PDF documents
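XMP stores metadata as RDF/XML, with Dublin Core properties in their own namespace. The sketch below builds only the XML for a few Dublin Core fields using the Python standard library. Embedding the result in a PDF needs a PDF tool and is not shown. The function names and sample values are illustrative, not taken from the talk.

```python
import xml.etree.ElementTree as ET

NS_X = "adobe:ns:meta/"
NS_RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NS_DC = "http://purl.org/dc/elements/1.1/"
NS_XML = "http://www.w3.org/XML/1998/namespace"

for prefix, uri in (("x", NS_X), ("rdf", NS_RDF), ("dc", NS_DC)):
    ET.register_namespace(prefix, uri)


def _add_property(parent, dc_name, container, values, lang=None):
    """Add one dc:* property wrapped in an rdf:Alt, rdf:Seq or rdf:Bag."""
    prop = ET.SubElement(parent, f"{{{NS_DC}}}{dc_name}")
    box = ET.SubElement(prop, f"{{{NS_RDF}}}{container}")
    for value in values:
        item = ET.SubElement(box, f"{{{NS_RDF}}}li")
        if lang:
            item.set(f"{{{NS_XML}}}lang", lang)
        item.text = value


def build_dc_xmp(title, creators, subjects):
    """Return an x:xmpmeta element, as a string, holding Dublin Core fields."""
    root = ET.Element(f"{{{NS_X}}}xmpmeta")
    rdf = ET.SubElement(root, f"{{{NS_RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{NS_RDF}}}Description",
                         {f"{{{NS_RDF}}}about": ""})
    _add_property(desc, "title", "Alt", [title], lang="x-default")
    _add_property(desc, "creator", "Seq", creators)   # ordered list
    _add_property(desc, "subject", "Bag", subjects)   # unordered keywords
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample values.
print(build_dc_xmp("Sample thesis title", ["A. Author"], ["digitisation", "OCR"]))
```

This describes the PDF file as a whole, which matches the earlier point that PDF metadata sits only at the item level.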

PDF files

PDF vs METS-ALTO
- Papers Past and other newspaper projects use METS-ALTO
- METS (Metadata Encoding and Transmission Standard) links the hierarchy of pages, sections, articles, issues and volumes, and provides for descriptive and other metadata at each level (structural metadata)
- ALTO (Analyzed Layout and Text Object) stores layout information and OCR text, enabling page views and article views for newspapers
- CCS (Content Conversion Specialists) have created docWORKS/METAe, which automates creation of METS-ALTO files and metadata for sections
- Should we all be using METS-ALTO?
- Derivatives (PDF, text, TEI, HTML), complex document structures, metadata at any level
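To make the ALTO side concrete, the sketch below reads a tiny hand-written ALTO fragment and pulls out the OCR text line by line, plus the bounding box of a chosen word. The sample XML is invented for this example and assumes the ALTO version 2 namespace. A real file produced by a digitisation workflow would carry much more detail.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Invented sample: one text block with two lines.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Akaroa" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Harbour" HPOS="300" VPOS="200" WIDTH="210" HEIGHT="40"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Banks" HPOS="100" VPOS="260" WIDTH="150" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Peninsula" HPOS="270" VPOS="260" WIDTH="260" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def alto_lines(xml_text):
    """Return the OCR text of each TextLine as a plain string."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS)]
        lines.append(" ".join(words))
    return lines


def word_boxes(xml_text, word):
    """Return (HPOS, VPOS, WIDTH, HEIGHT) for every String matching word."""
    root = ET.fromstring(xml_text)
    return [tuple(float(s.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            for s in root.iterfind(".//alto:String", ALTO_NS)
            if s.get("CONTENT") == word]


print(alto_lines(SAMPLE_ALTO))
print(word_boxes(SAMPLE_ALTO, "Akaroa"))
```

Because each word carries its own coordinates, a viewer can highlight search hits on the page image, and the text can be corrected without touching the scan.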

Websites
- New Zealand Journal of History
- ResearchSpace Doctoral Theses
- Early New Zealand Statutes
- Early New Zealand Statistics: test with PDF and HTML