International Atomic Energy Agency
Digital Preservation Session, Tue, 4 Nov
34th INIS Liaison Officers' Meeting (ILO Meeting), 3-5 Nov 2008, Vienna, Austria
S. Rieder, G. St-Pierre, Y. Reynaud-Pulido, T. Kalapurackal
Database Production and Imaging Group, INIS Unit, INIS & NKM Section

Digital Preservation at INIS
INIS Mission:
- preservation of nuclear knowledge
- serving as a reservoir of nuclear information
- provision of quality information services
- promotion of a culture of "information and knowledge sharing"

Digital Preservation at INIS
- INIS Non-Conventional Literature (NCL)
- Production of the INIS electronic Full Text Database
- Digital Preservation Activities
- Digitization projects
  - at IAEA
  - at Member States

Digital Preservation at INIS
Objectives:
- Consistent, high level of image quality
- Interoperability and accessibility of digitized resources
- Long-term preservation of digital resources for future generations
Member States and IAEA: develop good practices for digital preservation
'Overview of INIS Digital Preservation Practices': INIS Information Letter No. 253 & Attachment ( )

INIS Digital Preservation Principles
- INIS principles and workflow are based on Cornell University's digital imaging tutorial
- The tutorial is available in English, French and Spanish

INIS Workflow
- Document Benchmarking
- Document Preparation
- Scanning
- Quality Control
- Image Enhancement
- Metadata Creation/Validation
- Export including Compression
- Completeness Check
- Back-up
- Post-processing
- Storage and dissemination

Benchmarking & Document Preparation
Benchmarking:
- How to adequately capture the 'original' content in digital form?
- Do physical format & condition meet the digitizing requirements?
- What is the type of material to be digitized?
- Which resolution? At which bit depth?
- Which compression parameters?
- Estimated accuracy level for OCR?
- Other considerations?
Preparation:
- Physically (unbind, remove staples/clips, etc.)
- Structurally (add/remove barcodes, separate chapters, parts, etc.)
- Characteristics of paper (e.g. size, thickness, glossy/matt, condition)

Scanning - Capture Modes & Optical Resolution
Capture modes: depend on the physical form of the original
- Bitonal: 1 bit/pixel, black & white (printed text)
- Greyscale: 8 bits/pixel, 256 grey shades (black & white photographs)
- Colour: 24 bits/pixel, about 16.7 million colours & grey shades (continuous tone & colour)
Optical resolution: "dots per inch" (DPI) or "pixels per inch" (PPI)
- Higher resolution: finer detail, but larger file size
Bit depth: amount of information captured per pixel
- Greater bit depth: more accurate representation
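
The trade-off between resolution, bit depth and file size can be made concrete with a little arithmetic. The sketch below is an illustration only: the function name and the choice of an A4 page are assumptions, not part of the INIS workflow. It estimates the uncompressed size of one page for each capture mode.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw (uncompressed) size in bytes of a page scanned at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# A4 page (210 mm x 297 mm) converted to inches; chosen only as an example.
A4_IN = (210 / 25.4, 297 / 25.4)

for mode, bpp in [("bitonal", 1), ("greyscale", 8), ("colour", 24)]:
    size = uncompressed_size_bytes(*A4_IN, dpi=300, bits_per_pixel=bpp)
    print(f"{mode:9s} {bpp:2d} bits/pixel at 300 dpi: {size / 1e6:.1f} MB")
```

At 300 dpi a colour scan holds 24 times the raw data of a bitonal scan of the same page, before any compression is applied.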

Scanning at INIS - Capture & Optical Resolution
INIS practice:
- Standard scanner settings (for plain b/w text): bitonal (black & white), 300 dpi
- Special cases (colour, pictures): greyscale and colour, 200 - 300 dpi with 8 bit depth (256 colours/tones)
IMPORTANT:
- post-processing image compression is needed to reduce file size
- NEVER use colour settings to scan B/W documents

Quality Control - QC
Retain: value, utility and integrity of resources
Verify: quality, accuracy, consistency
INIS verifies:
- accuracy & completeness (e.g. same number of pages?)
- data integrity
- correctness of metadata form and validity
- correct matching of metadata and image files
- 'checksum' algorithm (authenticity & integrity of digitized files)
- number & order of bytes (e.g. after move, copy, transfer, burn)
- visual inspection: resolution, colour, tone, appearance (attention: changeable light & monitors)
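
A checksum check of this kind can be sketched in a few lines. The slide does not name the algorithm INIS uses, so SHA-256 is an assumption here, as are the function names and the manifest layout (a mapping of file name to expected digest).

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hex digest of a file, read in chunks so large scans fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_manifest(manifest, base_dir):
    """Compare files under base_dir with expected digests.

    manifest: dict mapping file name -> expected hex digest (hypothetical layout).
    Returns a list of (file name, problem) tuples; an empty list means all files match.
    """
    problems = []
    for name, expected in manifest.items():
        path = Path(base_dir) / name
        if not path.is_file():
            problems.append((name, "missing"))
        elif file_checksum(path) != expected:
            problems.append((name, "checksum mismatch"))
    return problems
```

Recording the digests at export time and re-running `verify_against_manifest` after every move, copy, transfer or burn detects any change in the number or order of bytes.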

Image Enhancement
Definition: any process applied to the raw scan to improve the quality or legibility of the resource
Image enhancement at INIS:
- despeckling
- deskewing
- noise reduction
- black border removal
- colour and tone adjustment, etc.
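
To illustrate what despeckling means, here is a deliberately simple sketch for a bitonal image held as a list of rows (1 = black, 0 = white). It removes black pixels that have no black neighbour. This is an assumption made for illustration; the slides do not describe the filter that the INIS production software applies, and real tools use more careful rules so that small punctuation marks survive.

```python
def despeckle(image):
    """Return a copy of a bitonal image with isolated black pixels removed.

    A black pixel (1) is kept only if at least one of its eight
    neighbours is also black; the input image is not modified.
    """
    rows, cols = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1:
                continue
            has_black_neighbour = any(
                image[rr][cc] == 1
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            if not has_black_neighbour:
                cleaned[r][c] = 0
    return cleaned


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],  # two touching pixels: part of a stroke, kept
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],  # single stray dot: removed
]
cleaned_page = despeckle(page)
```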

Quality Control and Image Enhancement (1)
- Skewed?

Quality Control and Image Enhancement (2)
- Noisy (e.g. unnecessary dots)?

Quality Control and Image Enhancement (3)
- Black border?

Quality Control and Image Enhancement (4)
IMPORTANT:
- Paper size must match the document hard copy: A4 ≠ Letter size
- Text cut = RESCAN
- If noticed during QC of an incoming PDF, INIS will request the Input Centre to resend the page
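
A paper-size mismatch can be caught automatically from the pixel dimensions and the scan resolution. The following is a minimal sketch; the function names and the 3 mm tolerance are assumptions for illustration, not INIS settings.

```python
# Nominal sizes in millimetres (short side, long side).
PAPER_SIZES_MM = {"A4": (210.0, 297.0), "Letter": (215.9, 279.4)}


def page_size_mm(width_px, height_px, dpi):
    """Physical page size implied by pixel dimensions and resolution."""
    return (width_px / dpi * 25.4, height_px / dpi * 25.4)


def classify_paper(width_px, height_px, dpi, tolerance_mm=3.0):
    """Return 'A4', 'Letter', or None if neither size fits within the tolerance."""
    # Sorting makes the check independent of portrait/landscape orientation.
    short_side, long_side = sorted(page_size_mm(width_px, height_px, dpi))
    for name, (nominal_short, nominal_long) in PAPER_SIZES_MM.items():
        if (abs(short_side - nominal_short) <= tolerance_mm
                and abs(long_side - nominal_long) <= tolerance_mm):
            return name
    return None
```

For example, a 2480 x 3508 pixel image at 300 dpi is classified as A4, while 2550 x 3300 pixels at 300 dpi is Letter. A Letter-sized image of an A4 original is a candidate for cut text and a rescan.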

File Formats
Very important: prefer 'non-proprietary' formats
Several standard file formats exist, with different resolution, bit depth, colour capabilities, etc.
INIS Digital Collection:
1. From 'Paper' or 'Microfiche' to 'Digital':
   - Master images in TIFF Group IV (b/w), in JPEG (colour)
   - Majority: full-text searchable PDF
2. Digital files received from INIS National Centres
   - PDF compression: JBIG2 (b/w), JPEG (colour)

Preservation Formats
PDF: open standard, official ISO :2008
PDF/A: long-term archiving of electronic documents
- creation of PDF documents whose visual appearance will remain the same over the course of time
- official ISO standard: ISO :2005
- further development ongoing
INIS:
- considering adoption of PDF/A for efficient preservation and long-term archival of the Agency's and Member States' nuclear information resources
- pilot project in 2009

OCR - Optical Character Recognition
Makes printed text searchable as electronic text
Primary objective for INIS digitization projects: creation of 'searchable full text'
INIS:
- major tool for mass production: ABBYY FineReader 8
- ~98% accuracy for printed text in Latin & Cyrillic characters
Satisfactory testing with script and Arabic characters:
- Adobe Acrobat Professional 8.0: Chinese (Simplified), Japanese, Korean
- ABBYY FineReader Pro 9: Hebrew, Thai
- VERUS™ Professional: Arabic
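
The slide does not say how the ~98% figure was measured. One common way to express OCR accuracy is character accuracy: one minus the edit distance between a manually checked reference text and the OCR output, divided by the reference length. The sketch below assumes that definition; the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters recognized correctly (0.0 to 1.0)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this definition, 98% accuracy means about 2 wrong characters in every 100, which on a typical page still leaves a noticeable number of words that a full-text search would miss.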

Various OCR types
- Typewritten
- Hand print and cursive
- Fraktur
- Music scores
- MICR (Magnetic Ink Character Recognition)

OCR process 1 (no or wrong dictionary)

OCR process 2 (proper dictionary)

OCR - value added
Image layer:
- scanned (raster) image
- visual representation of the original document
Hidden text (may contain errors):
- enables full-text search
- extra information for search engines

Storage of Digital Files
Mandatory: reliable & controlled environment
Storage of master files: high-quality, industry-standard devices, e.g. CD-R, DVD, or other contemporary reliable media
Backup of master files: regularly, off-site, in a secure location
RAID: Redundant Array of Independent Disks
- several drives act collectively as a single storage system
- consider RAID for a large production environment
INIS: THECUS N5200B PRO, 5 x 3.5" SATA disks of 1 TB each in RAID 5, configured as local network data storage
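
RAID 5 spends the equivalent of one disk on parity, so an array of n disks offers n - 1 disks of usable space and survives the loss of any single disk. The sketch below assumes that all five 1 TB bays are populated and shows the XOR parity principle on toy data; the block contents are made up for illustration.

```python
from functools import reduce


def raid5_usable_tb(disks, disk_tb):
    """Usable capacity of a RAID 5 array: one disk's worth goes to parity."""
    if disks < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return (disks - 1) * disk_tb


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (used for parity and for rebuild)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# One stripe across 5 disks: 4 data blocks plus 1 parity block (toy data).
data = [b"INIS", b"NCL1", b"PDF2", b"TIFF"]
parity = xor_blocks(data)

# Simulate losing the disk that held data[2] and rebuild it from the rest.
recovered = xor_blocks([data[0], data[1], data[3], parity])

print(raid5_usable_tb(5, 1), "TB usable;", "rebuilt block:", recovered)
```

RAID protects against a single drive failure only. It does not replace the regular off-site backup of master files.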

Back-up and Off-Site Storage
Create: regular back-ups of master files
Store: remote from the original source, in a secure location
INIS:
- 1970 to 1997: 'microfiche'
  - NCL full text: paper to microfiche; safe, long-term storage
  - INIS National Centres
  - full set of NCL microfiche: Austrian Central Library of Physics
- From 1997: 'digital'
  - NCL on CD: INIS Document Delivery Centres (National Centres)
  - secure "off site" & back-up: Austrian Central Library of Physics
- 2008: microfiche to PDF
  - Austrian Central Library of Physics
  - INIS National Centres
  - INIS Online Database

Preservation Planning
Contents of digital files must remain 'meaningful'
Different processes:
1. Refreshing:
   - copy files from one storage medium to another
   - verify authenticity & integrity of the files (e.g. checksum)
2. Migration:
   - transfer files from one HW & SW environment to another, or from one computer generation to the next
   - format-based: move files from an 'obsolete' format to a 'new' format
3. Emulation:
   - re-create the technical environment
   - maintain information about HW & SW so that the system can be re-engineered
INIS:
- Refreshing: CD to DVD (until 2007); from 2008: copy to Thecus storage device
- When PDF/A is implemented: 'migration'
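
Refreshing, as defined above, is a copy followed by an integrity check. A minimal sketch is shown here; the function names are hypothetical and SHA-256 is an assumed choice, since the slides only say "checksum".

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_file(source, target_dir):
    """Copy one file to a new storage location and verify the copy.

    Returns (target path, digest). Raises OSError and removes the copy
    if the digest of the copy differs from that of the source.
    """
    source = Path(source)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name

    digest_before = sha256_of(source)
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    digest_after = sha256_of(target)

    if digest_before != digest_after:
        target.unlink()
        raise OSError(f"integrity check failed for {source.name}")
    return target, digest_after
```

Calling `refresh_file` for each file on an old medium, with the new storage device as `target_dir`, gives a copy whose integrity has been confirmed and a digest that can be stored for later checks.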

Metadata
Key role for digital resources: describe, process, manage, track, access, preserve
INIS: comprehensive 'bibliographic' metadata
- describe the intellectual content of the full text
- bibliographic elements to identify & retrieve resources
INIS Database: digital resources with bibliographic metadata
Technical metadata for digital resources: created automatically with the PDF files
Future: more sophisticated approach with the implementation of PDF/A

Thank you for your attention!
Your INIS Digital Preservation Team