HATHI TRUST A Shared Digital Repository HathiTrust 101 John Wilkin and Jeremy York August 27, 2010.

1 HATHI TRUST A Shared Digital Repository HathiTrust 101 John Wilkin and Jeremy York August 27, 2010

2 Outline About HathiTrust – Mission & Goals Governance Content What we do (services) Partnership & Resources Technology Future Directions

3 Current Partners – Columbia University – New York Public Library – University of California system – CIC (Committee on Institutional Cooperation): University of Chicago, University of Illinois, Indiana University, University of Iowa, University of Michigan, Michigan State University, University of Minnesota, Northwestern University, Ohio State University, Pennsylvania State University, Purdue University, University of Wisconsin-Madison – Triangle Research Library Network – University of Virginia – Yale University

4 Universal Digital Library Common Goal Single Entity, Many Partners HathiTrust

5 Governance HathiTrust Executive Committee Strategic Advisory Board Budget/Finances Decision-making Guidance on Policy, Planning

6 Executive Committee
Paul Courant, University Librarian and Dean of Libraries, UM
Laine Farley, Executive Director, CDL
John King, Vice Provost for Academic Information, UM
Paula Kaufman, University Librarian and Dean of Libraries, UI
Brian Schottlaender, University Librarian, UCSD
Ed Van Gemert, Deputy Director of Libraries, UW – Madison (ex officio)
Brenda Johnson, Dean of Libraries, IU
Brad Wheeler, Chief Information Officer, IU
John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM

7 Strategic Advisory Board
Ed Van Gemert (Chair), Deputy Director of Libraries, UW - Madison
John Butler, Associate University Librarian for Information Technology, U Minn
Patricia Cruse, Director, Preservation, CDL
Bernie Hurley, Director, Library Technologies, UC Berkeley
R. Bruce Miller, University Librarian, UC - Merced
Sarah Pritchard, University Librarian, Northwestern
Paul Soderdahl, Director, LIT, U Iowa
John Wilkin, Executive Director, HathiTrust (ex officio)
Robert Wolven, Columbia University

8 Content Distribution 6,331,718 – Total 1,215,210 – Public Domain * As of July 25, 2010

9 Language Distribution (1) * As of July 25, 2010

10 Language Distribution (2) The next 40 languages make up ~13% of total * As of July 25, 2010

11 Dates * As of July 25, 2010

12 Originating Institution * As of July 25, 2010

13 Content over time * As of July 25, 2010

14 Content Growth

15 Services
– Long-term preservation: Bit-level preservation and migration
– Content Access: Viewing; Redistribution; Print disabilities; Section 108
– Rights management: Rights database; Copyright review
– Bibliographic search: Temporary catalog; Version 1 permanent catalog (Summer 2010)
– Full-text search (November 2009)
– Print on Demand: UM public domain; UM Press
– Publish virtual collections: Collection Builder
– Availability of data: Metadata files; Bib API; Data API
– Inbound validation; Fixity checks; Google and IA ingest
– Full-PDF download; Collection Builder; Shibboleth
– Supporting partner development; Development Environment
– Computational Research: Datasets; Protocol; Research Center
– Beyond Books and Journals: Born digital; Images/maps; Audio

16 Focus on users Preservation…with Access Brings concerns of research libraries to bear on the way the scholarly record is cared for and made available – Scholarly Resource – Bibliographic Search – Full-text search – Collections – Full-PDF download of public domain

30 Cost Model 1 Reasonable costs of sustaining the archive, includes cost of replacement, capital fund

31 Cost Model 1 Economies of scale keep costs low – $0.145/volume/year for Google-digitized – about $0.45/volume/year for IA-digitized Advantages not fully known until you jump in

32 Cost Model 2 – For public domain volumes: (PD*X*C)/N – For a given in-copyright volume: IC = (C*X)/H – Share in costs of curation – Share in uses of relevant materials – Voice in future directions – Free riders?
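The two formulas on slide 32 can be worked through numerically. The slide does not define its symbols, so this sketch assumes PD is the number of public domain volumes, C the annual cost per volume, X a cost multiplier, N the number of partners, and H the number of partners holding a given in-copyright volume in print. Adding the two shares into one partner total is also an assumption, and the function names and figures are hypothetical.

```python
def public_domain_share(pd_volumes, multiplier, cost_per_volume, num_partners):
    """Per-partner share of public domain costs: (PD * X * C) / N."""
    if num_partners <= 0:
        raise ValueError("num_partners must be positive")
    return pd_volumes * multiplier * cost_per_volume / num_partners


def in_copyright_share(cost_per_volume, multiplier, num_holders):
    """One partner's share of a single in-copyright volume: IC = (C * X) / H."""
    if num_holders <= 0:
        raise ValueError("num_holders must be positive")
    return cost_per_volume * multiplier / num_holders


def partner_annual_cost(pd_volumes, num_partners, holders_per_volume,
                        cost_per_volume, multiplier=1.0):
    """Assumed total: public domain share plus one IC share per held volume.

    holders_per_volume lists, for each in-copyright volume this partner
    holds in print, how many partners in total hold that volume (H).
    """
    pd_part = public_domain_share(pd_volumes, multiplier,
                                  cost_per_volume, num_partners)
    ic_part = sum(in_copyright_share(cost_per_volume, multiplier, holders)
                  for holders in holders_per_volume)
    return pd_part + ic_part


if __name__ == "__main__":
    # Illustrative figures only; none of them come from the presentation.
    total = partner_annual_cost(pd_volumes=1_000_000, num_partners=25,
                                holders_per_volume=[1, 2, 5],
                                cost_per_volume=0.25)
    print(f"Estimated annual cost for this partner: ${total:,.2f}")
```

Under these assumptions, a volume held by only one partner costs that partner the full C*X, while widely held volumes become cheaper for each holder. This is why the model needs the overlap calculations and print holdings database mentioned on slide 34.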

33 Sustaining common resource – Costs go down – Quality of services increases – Realize in an aggregated collection something you don't get through distributed search or federation

34 Cost Model 2: Timeline & Requirements Timeline: – Implement in 2013 – Accept new partners now with costs based on overlap calculations Requirements: – Print holdings database – Update mechanisms – Manual remediation

35 Print Holdings Database Print holdings database will also benefit – De-duplication (duplication compromises user experience, obscures collection development needs) – Management of print volumes (information to withdraw volumes, e.g. journals) – Legal uses of copyright materials (Section 108, 121, ADA uses will depend on knowledge of which institutions own(ed) which materials)

36 Staff Staff/Expertise – highly integrated – Project managers, IT and communications staff, copyright experts, administrators – Working groups Shared development space

37 HathiTrust Functional Framework
– e-Commerce: Print on Demand
– Content Ingest: Transformation; Validation
– Content Access: PageTurner; Collection Builder; Large-scale Search; Bibliographic Catalog; Research Center; APIs
– Quality Assurance: Quality Review; Content Certification
– User Services: Usability; User support (helpdesk)
– Outreach: Project website; Monthly newsletter; Papers and presentations; Communication with potential partners; Surveys, general inquiries; Repository evaluation and audit (e.g., DRAMBORA, TRAC)
– Legal: Risk management (use of materials); Partner agreements; Advocacy
– Governance: Budget, Finances; Decision-making; Policy; Planning
– Enterprise Management: Communication and Coordination with partner institutions; Project management
– Repository Administration: Hardware configuration and maintenance; Web and application server configuration and maintenance; Security; Permissions; Logging
– Repository Administration: Data management (content storage, backup, integrity checks, deletion); Hardware selection and replacement; Content and Metadata specifications; Disaster Recovery; Processes for ensuring content integrity
– Rights Management: Copyright determination; Copyright review; Copyright information management (database); Rightsholder permissions
– Bibliographic Data Management: Entity description (record-level); Object identification (item-level); Data availability
– Collection Development: Digital Expansion beyond books and journals (born-digital, images and maps, audio); Selection of content (for non-Google volume ingest and pilot projects); Print Cloud Library (effect of digital on print); Financial contributions of partners

38 Collection Development Digitization/Collaboration with other initiatives Public domain determinations Duplicate volumes Citation Building Collections Quality

39 A global change in the library environment June 2010 Median duplication: 31% June 2009 Median duplication: 19% Academic print book collection already substantially duplicated in mass digitized book corpus

40 Digitized Books in Shared Repositories ~75% of mass digitized corpus is backed up in one or more shared print repositories ~3.5M titles ~2.5M

41 Technology - OAIS (diagram labels): GRIN; Internal Data Loading; Google; [OCA]; In-house Conversion; MARC record extensions (Aleph); Rights DB; Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; [Solr]; METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums; METS object; PNG; OCR; PDF; Isilon Site Replication; TSM; MD5 checksum validation; GROOVE (JHOVE)
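Slide 41 lists MD5 checksums stored with each object and a separate MD5 checksum validation step. A generic fixity check of that kind can be sketched as follows. This is not HathiTrust's implementation: the function names and file layout are hypothetical, and only the standard-library `hashlib` call is real.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_md5):
    """Compare the file's current digest with the one recorded at ingest."""
    return md5_of_file(path) == recorded_md5.strip().lower()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        page = os.path.join(workdir, "00000001.txt")  # hypothetical page file
        with open(page, "wb") as handle:
            handle.write(b"hello")

        recorded = md5_of_file(page)      # digest stored at ingest time
        print(fixity_ok(page, recorded))  # file unchanged since ingest

        with open(page, "ab") as handle:  # simulate corruption
            handle.write(b"!")
        print(fixity_ok(page, recorded))  # digest no longer matches
```

MD5 is used here only because the slide names it. It detects accidental corruption, but it is not collision-resistant against deliberate tampering.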

42 Future Directions Locally-digitized partner content Usage reporting Coordinate digital and print resources (holdings database) Computational Research Quality Strategies for openness Collaborative Development Extending Services through Shibboleth Non-book, non-journal content Born-digital content New Bibliographic Management Compliance with TRAC Grant projects OCLC Catalog 3-year review Improvements to Large-scale Search Improvements to PageTurner Ingest Reporting

43 How can HathiTrust make a difference? Digital Curation – Drive costs down – Reduce bibliographic indeterminacy – Make meaningful decisions about formats and quality – Increase discoverability – Consolidate development talent – Improve strength of archiving Print Curation – Means to associate our print holdings – Coordinated record-keeping Subsidiary benefits – Improve description – Quantify problems – Collective attention to solving shared problems
