Beyond the Google Book: the Future of the Digital Library. Cory Snavely, Library IT Core Services manager, University of Michigan. April 20, 2010.
www.hathitrust.org HathiTrust project profile: launched October 2008; 26 member institutions and growing; 99% Google-scanned materials; 5.6 million volumes, 350 pages average; 210 terabytes; 2 US sites.
Founding principles: long-term digital preservation; access to materials; digital formats allow simultaneous preservation and access; pioneer the concept of a universal, non-commercial, collaborative digital library.
Preservation models: The Open Archival Information System (OAIS) provides formal guidelines for object storage and retrieval, and we incorporate these principles. The Trusted Repository Audit Checklist (TRAC) provides an auditing framework for assuring reliability: policy management, documentation, succession. Our TRAC audit was conducted recently. Still evolving!
Preservation architecture (diagram labels): content; Google: multi-purpose index and advertising engine; HathiTrust: preservation-oriented service architecture; managed repository; API; index service.
Access services: bibliographic catalog for traditional discovery (based on the open-source VuFind system); full-text search (based on the open-source Solr system); page turner for online reading (based on our older open-source digital library system; only public-domain books according to US copyright law); APIs for building new services.
Screenshot callouts: search full text of this item; save item to new or existing collection; http://babel.hathitrust.org/cgi/mb; image, text, or PDF views.
Funding and Governance: major financial support from UM and Indiana University; cost recovery based on content deposited. Executive committee (deans and CIOs of founding institutions, executive director): budget and major initiatives. Strategic advisory board (representatives of member institutions): development priorities, policy development.
Staffing and server infrastructure: significant developer staff contributed by UM; one sysadmin of three in Core Services funded by HT. Basic infrastructure cost: $3.86/GB, with a linear cost increment for adding new storage (one 420 TB Isilon storage cluster per site; tape backup; 2 web, 1 database, and 4 search servers at both sites; 5 ingest and 2 index-building servers in Michigan; shared development environment).
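A rough worked example of the linear cost model, combining the $3.86/GB figure with the project profile (5.6 million volumes, 210 terabytes). This is only a sketch: it assumes decimal units (1 TB = 1000 GB) and that the per-GB rate applies to stored content, neither of which the slides state. The function names are illustrative.

```python
# Figures quoted on the slides; unit conventions below are assumptions.
COST_PER_GB_USD = 3.86      # "Basic infrastructure cost: $3.86/GB"
TOTAL_VOLUMES = 5_600_000   # "5.6 million volumes"
TOTAL_TERABYTES = 210       # "210 terabytes"
GB_PER_TB = 1000            # assumption: decimal terabytes


def average_volume_gb(total_tb=TOTAL_TERABYTES, volumes=TOTAL_VOLUMES):
    """Average stored size of one volume, in GB."""
    return total_tb * GB_PER_TB / volumes


def infrastructure_cost_usd(gigabytes, rate=COST_PER_GB_USD):
    """Linear model: cost grows in direct proportion to gigabytes stored."""
    return gigabytes * rate


if __name__ == "__main__":
    per_volume_gb = average_volume_gb()
    print(f"average volume size: {per_volume_gb * 1000:.1f} MB")
    print(f"cost per average volume: ${infrastructure_cost_usd(per_volume_gb):.2f}")
```

Under these assumptions the average volume comes to about 37.5 MB, which is close to the 36 MB average Zip size quoted later in the talk, and the infrastructure cost per average volume is roughly 14 cents.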
Material and Data Flow (diagram labels, as extracted): ingest; web; sync; Google or other scanning project; storage @UM; storage @IU; network or media delivery; catalog; rights database; web; index.
Automated Data Ingest: handles per-volume logistics; rigorously validates identifier, object files, and object completeness; generates METS (XML) object inventory; determines copyright by date and place of publication. 500K+ volumes/month!
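To make the "generate METS (XML) object inventory" step concrete, here is a minimal sketch of building a file inventory with checksums. It is not HathiTrust's ingest code: the function name `build_inventory`, the `FILE…` ID scheme, the `USE="content"` group, and the choice of SHA-256 are all assumptions for illustration, and the output omits the structural map that a schema-valid METS document requires.

```python
import hashlib
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_inventory(volume_id, files):
    """Return a minimal METS-style inventory as an XML string.

    `files` maps each file name in the volume to its raw bytes.
    Hypothetical helper: the layout is illustrative, not HathiTrust's.
    """
    root = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": volume_id})
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "content"})

    # Sort by name so the inventory is deterministic across runs.
    for index, (name, data) in enumerate(sorted(files.items()), start=1):
        entry = ET.SubElement(group, f"{{{METS_NS}}}file", {
            "ID": f"FILE{index:08d}",
            "SIZE": str(len(data)),
            "CHECKSUM": hashlib.sha256(data).hexdigest(),
            "CHECKSUMTYPE": "SHA-256",
        })
        # Record where the file lives relative to the volume's archive.
        ET.SubElement(entry, f"{{{METS_NS}}}FLocat", {
            "LOCTYPE": "OTHER",
            "OTHERLOCTYPE": "SYSTEM",
            f"{{{XLINK_NS}}}href": name,
        })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = {"00000001.jp2": b"\x00\x01", "00000001.txt": b"page one"}
    print(build_inventory("39015123456789", sample))
```

Recording a size and checksum per file is what lets a later validation pass confirm object completeness without reopening the source scans.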
Data Characteristics: 1 METS (XML) file and 1 Zip archive (JPEG 2000 and TIFF images and OCR text) per book; 36 MB average Zip file size (compressed); layout uses pairtree, an IETF Internet-Draft developed at California Digital Library. Example: 39015123456789 is stored at pairtree_root/39/01/51/23/45/67/89/39015123456789
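The pairtree mapping shown on the slide can be sketched in a few lines: split the identifier into two-character segments, use them as nested directories, and end with a directory named after the full identifier. This sketch deliberately handles only plain ASCII alphanumeric identifiers like the one in the example; the pairtree draft also defines character escaping for other identifiers, which is omitted here. The function name `pairtree_path` is an assumption.

```python
from pathlib import PurePosixPath


def pairtree_path(identifier, root="pairtree_root"):
    """Map an identifier to its pairtree directory path.

    Simplified sketch: only ASCII letters and digits are accepted, so the
    character-escaping rules of the pairtree draft are not needed.
    """
    if not (identifier.isascii() and identifier.isalnum()):
        raise ValueError("this sketch only supports ASCII alphanumeric identifiers")

    # Split into two-character segments; an odd-length identifier
    # leaves a final one-character segment.
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]

    # The object's own directory, named after the full identifier,
    # sits at the end of the pair path.
    return str(PurePosixPath(root, *pairs, identifier))


if __name__ == "__main__":
    print(pairtree_path("39015123456789"))
```

Spreading objects across short directory names keeps any single directory from holding millions of entries, which matters at the scale of several million volumes.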