Digital archival storage for the University of Michigan Library collections.

Presentation transcript:

digital archival storage for the University of Michigan Library collections

Project Overview
- Project partnership with Google publicly announced in December
- Bound print collection, about 7 million volumes, to be scanned over estimated four to six years.
- Direct scanning costs are borne by Google.

Project Overview
- UM receives a copy of all digital files, including OCR and metadata, which we may use to build services.
- UM may share files with other research libraries under formal agreements.
- UM may not redistribute content en masse to other commercial services or the public.
- All uses are subject to copyright.

Project Scale
- At about 320 pages per volume and 2.01 files per page, we’ll have 2.2 billion files.
- At about 6000 pages per GB or 54.6 MB per volume, we’ll have 380 TB of data.
- Production at full volume can scan about 35K volumes (1867 GB) per week, which averages to a sustained 3.16 MB per second for four years.
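
The scale figures on this slide can be re-derived from the stated inputs. The sketch below is illustrative only: the constant and function names are made up for this example, and it assumes 1 GB = 1024 MB, which is what reproduces the 54.6 MB per volume and 3.16 MB per second figures. With the inputs as stated, the page count comes to about 2.2 billion, and multiplying by 2.01 files per page would give about 4.5 billion files, so the slide's file total may rest on a different per-page figure.

```python
# Back-of-the-envelope check of the Project Scale slide.
# All names are illustrative; the inputs are the figures quoted on the slide.
VOLUMES = 7_000_000
PAGES_PER_VOLUME = 320
FILES_PER_PAGE = 2.01
PAGES_PER_GB = 6000
VOLUMES_PER_WEEK = 35_000
SECONDS_PER_WEEK = 7 * 24 * 3600


def project_scale():
    """Return the derived scale figures as a dict (assumes 1 GB = 1024 MB)."""
    pages = VOLUMES * PAGES_PER_VOLUME
    mb_per_volume = PAGES_PER_VOLUME / PAGES_PER_GB * 1024
    gb_per_week = VOLUMES_PER_WEEK * PAGES_PER_VOLUME / PAGES_PER_GB
    return {
        "pages": pages,
        "files": pages * FILES_PER_PAGE,
        "mb_per_volume": mb_per_volume,
        "total_gb": pages / PAGES_PER_GB,
        "gb_per_week": gb_per_week,
        "mb_per_second": gb_per_week * 1024 / SECONDS_PER_WEEK,
        "weeks_to_finish": VOLUMES / VOLUMES_PER_WEEK,
    }


if __name__ == "__main__":
    for name, value in project_scale().items():
        print(f"{name}: {value:,.2f}")
```

At 35K volumes per week the full collection takes 200 weeks, which is consistent with the four-year figure on the slide.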

Not too many libraries do this!

Characteristics of the Data
- Extremely well-defined data conventions: image files are TIFF or JPEG 2000, OCR files and metadata are UTF-8 text.
- A true archival system; indefinite retention requires its own set of best practices.
- Files are largely static.
- Much material is in-copyright (security is paramount).

Application Requirements
- MBooks (web server farm/NAS)
- Periodic fixity check (checksum validation)
- Full-text search? (how?!)
- Textual analysis or other research?
- Anything beyond MBooks is likely to be either compute- or IO-intensive, or both.
- This is how you annoy storage vendors!
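
A periodic fixity check, as mentioned above, amounts to recomputing each file's checksum and comparing it with a stored value. The following is a minimal sketch under stated assumptions: the slides do not say which hash algorithm or manifest format is used, so SHA-256 and a simple path-to-digest dictionary are chosen here purely for illustration, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large image files never load fully into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_check(manifest):
    """Return the paths whose current checksum differs from the stored one.

    `manifest` maps a file path to its expected hex digest (assumed format).
    Missing files are reported as failures too.
    """
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file() or file_checksum(path) != expected:
            failures.append(path)
    return failures
```

Because every pass reads every byte, a check like this is IO-intensive at the scale described, which is one reason it appears among the application requirements.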

Overall Approach
- Engagement with Office of the Provost from the beginning; a University project housed in the Library
- Our Library IT environment has unusual depth due to our mature digital library.
- Consulting relationship with academic computing and campus storage experts
- RFI provided vendor landscape
- RFP (very few Yes/No questions!)

Cost Model from RFI Responses
- Model includes various ramp-up patterns, hardware replacement periods, starting cost, and rate of cost decrease.
- Cost per GB from selected RFI responses: average = median = $7
- Too fast means initial investment is huge, no benefit from Moore’s Law.
- Too slow means simultaneous growth and replacement, costs peak at replacement interval.
- Four years is plenty fast, thank you!
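
The slides do not show the cost model itself, so the sketch below is only one plausible shape for it, built from the four inputs named above: a ramp-up pattern, a hardware replacement period, a starting cost, and a rate of cost decrease. The $7 per GB starting cost comes from the slide; the 30% annual decline, the four-year replacement period, and the function name are assumptions made for illustration.

```python
def yearly_storage_cost(capacity_added_gb, start_cost_per_gb=7.0,
                        annual_cost_decline=0.3, replacement_years=4,
                        horizon_years=None):
    """Return the hardware spend for each year of the planning horizon.

    capacity_added_gb: GB of new capacity bought in each ramp-up year.
    annual_cost_decline and replacement_years are assumed values, not
    figures from the presentation.
    """
    horizon = horizon_years or len(capacity_added_gb) + replacement_years
    costs = []
    for year in range(horizon):
        # Price per GB falls by a fixed fraction every year.
        price = start_cost_per_gb * (1 - annual_cost_decline) ** year
        purchased = capacity_added_gb[year] if year < len(capacity_added_gb) else 0.0
        # Capacity bought a whole number of replacement periods ago is re-bought.
        replaced = sum(
            gb for bought, gb in enumerate(capacity_added_gb)
            if year > bought and (year - bought) % replacement_years == 0
        )
        costs.append((purchased + replaced) * price)
    return costs


if __name__ == "__main__":
    all_at_once = yearly_storage_cost([380_000])
    four_year_ramp = yearly_storage_cost([95_000] * 4)
    print("all at once:   ", [round(c) for c in all_at_once])
    print("four-year ramp:", [round(c) for c in four_year_ramp])
```

Comparing the two printed patterns shows the trade-off described above: buying everything in year one pays the full starting price, while a ramp longer than the replacement period makes growth and replacement purchases land in the same years.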

Potential Funding Sources
- Development of CIC shared digital repository: multiple redundant sites and some staff funded by pay-to-play model
- Again, engagement with Office of the Provost from the beginning

Considerations
- “Future-proof” higher-cost investment with proven vendor and incremental upgrades?
- “Throwaway” lower-cost solution with cutting-edge vendor and forklift upgrade?
- Temporary solution (Linux NAS server and commodity SCSI/SATA arrays) has allowed project to proceed and further inform us on the decisions we’ll make.

Best Architecture?
- Must have simultaneous access from potentially many front-end servers (cluster), so almost certainly a NAS component.
- NAS? NAS gateway to SAN? NAS/SAN hybrid?
- Probably most promising in the flexibility department are the clustered NAS systems with SAS or SATA back ends.
- Keep our options open; the right vendor could make all the difference.

Highlights of the RFP
- Does not ask about compliance with exact specifications, but asks for detailed explanations of system architecture: all of the usual, and…
- Recommended upgrade path given our estimated growth pattern and project timeline
- Description of how load balancing and service are impacted as system is scaled and maintained
- How virtualization is implemented
- Security provisions
- Contact me if you’d like to have a copy.

Proposal Evaluation Criteria
- Scalability of capacity, performance, and interconnect fabric
- Proven models/methods for growth
- Flexibility in application
- Maintenance ease

Near-term Work
- RFP responses due (Monday!)
- Space, support, backup
- Work in CIC on governance and funding model for shared digital repository
- Continued development of MBooks functionality and integration with existing digital library resources

Access
- MBooks: http://
- Cory Snavely