HATHITRUST A Shared Digital Repository The HathiTrust Digital Repository: Under the hood SI 625 April 20, 2015 Jeremy York, Assistant Director, HathiTrust.

1 HATHITRUST A Shared Digital Repository The HathiTrust Digital Repository: Under the hood SI 625 April 20, 2015 Jeremy York, Assistant Director, HathiTrust Unless otherwise noted, these slides and their contents are licensed under a Creative Commons Attribution Unported License.

2 Outline Introduction Underlying Ideas Repository and Services

3 Introduction

4 HathiTrust Members Allegheny College; American University of Beirut; Arizona State University; Auburn University; Baylor University; Boston College; Boston University; Brandeis University; Brown University; California Digital Library; Carnegie Mellon University; Case Western Reserve; Colby College; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Getty Research Institute; Georgetown University; Georgia Tech; Harvard University Library; Indiana University; Iowa State University; Johns Hopkins University; Kansas State University; Lafayette College; Library of Congress; Massachusetts Institute of Technology; McGill University; Michigan State University; Montana State University; Mount Holyoke College; New York Public Library; New York University; North Carolina Central University; North Carolina State University; Northeastern University; Northwestern University; The Ohio State University; Oklahoma State University; Penn State; Princeton University; Purdue University; Rutgers University; Stanford University; State University System of Florida; Swarthmore College; Syracuse University; Temple University; Texas A&M University; Texas Tech; Tufts University; Universidad Complutense de Madrid; University of Alabama; University of Alberta; University of Arizona; University of British Columbia; University of Calgary; University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz); The University of Chicago; University of Connecticut; University of Delaware; University of Houston; University of Illinois; University of Illinois at Chicago; The University of Iowa; University of Kansas; University of Maine; University of Maryland; University of Massachusetts, Amherst; University of Miami; University of Michigan; University of Minnesota; University of Missouri; University of Nebraska-Lincoln; University of New Mexico; The University of North Carolina at Chapel Hill; University of Notre Dame; University of Oklahoma; University of Pennsylvania; University of Pittsburgh; University of Queensland; University of Tennessee, Knoxville; University of Texas; University of Utah; University of Vermont; University of Virginia; University of Washington; University of Wisconsin-Madison; Utah State University; Vanderbilt University; Virginia Tech; Wake Forest University; Washington University; Yale University Library

5 Digital Repository Launched 2008 Initial focus on digitized book and journal content – 13.3 million total volumes – 6.7 million book titles – 350,000 serial titles – 5 million public domain (~38%)

6 The Name The meaning behind the name – Hathi (hah-tee)--Hindi for elephant – Big, strong – Never forgets, wise – Secure – Trustworthy

7 Mission To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge

8 Universal Library Common Goal Single Entity, Many Partners HathiTrust

9 Collections and Collaboration Comprehensive collection – Preservation…with Access Shared strategies – Copyright – Collection management, development – Preservation – Discovery / Use – Bibliographic Indeterminacy – Efficient user services Public Good

10 Content

11 1. Michigan 4,712,752; 2. California 3,612,596; 3. Harvard 838,115; 4. Wisconsin 561,094; 5. Indiana 529,601; 6. Cornell 510,286; 7. Penn State 388,713; 8. Illinois 329,136; 9. NYPL 294,883; 10. Princeton 252,837; 11. Minnesota 193,124; 12. Madrid 117,291; 13. Library of Congress 108,892; 14. Keio University 90,112

12 Dates

13 Language Distribution (1) The top 10 languages make up ~87% of all content

14 Language Distribution (2) The next 40 languages make up ~12% of total

15 HathiTrust and other e-databases

16 Content Distribution

17 Underlying Ideas

18 Underlying ideas Community Scale Access and Preservation Openness

19 Community

20

21 OAIS TRAC METS and PREMIS Repository Practices – Content – Reference – Fixity

22 Scale Mission – To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge Strategy – “Co-owned and managed”

23 Preservation and Access We engage in preservation for purposes of access “Light” archive benefits – Access to materials – Checks on integrity – Best chance for content to be used and valued, preserved

24 Openness Repository centralized...open Formats Software Organizational structure

25 Underlying ideas

26 Experience

27 What’s Missing? What should be included in the AIP? What should be validated? How should content be identified? How to operate at scale – managing preservation information (PREMIS; access information in rational way at scale)...

28 Repository Philosophy/Design OAIS/TRAC Consistency Standardization Simplicity (in design, not function) Practicality Sustainability

29 [Diagram: repository architecture (TDR). Source: Bibliographic Data, Content Package. Ingest. Data Management: Bib Data, Rights Data, Holdings Data. Storage: Michigan, Indiana. Access: Catalog, Full-text Search, PageTurner, APIs, Collections, Datasets.]

30 Building the Digital Repository Shared infrastructure – Centralized Administration: Ingest, validation, content integrity Functionality: full-text search, viewing print on demand – Geographically distributed In terms of location, coding, service development, digitization, content preparation

32 Content Selection of content for digitization and preservation Types of materials Technology – Largely uniform in technical characteristics – 3 formats ITU G4 TIFF JPEG2000 Unicode (with and without coordinates)

33 Content Package images Source METS text HT METS Zip

34 Ingest (Source: Bibliographic Data, Content Package) Rigorous validation to ensure conformance with specifications: – Resolution, image metadata – Barcode – Fixity – Consistency – Well-formedness Prepare archival package
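
Two of the checks named on this slide, well-formedness and consistency, can be sketched as follows. This is only an illustration under stated assumptions: the package layout (each page image NNNNNNNN.tif or NNNNNNNN.jp2 paired with an NNNNNNNN.txt OCR file) and both function names are hypothetical, and this is not HathiTrust's actual ingest tooling.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Assumed image extensions for the TIFF and JPEG2000 formats (hypothetical layout).
IMAGE_EXTS = {".tif", ".jp2"}


def is_well_formed(xml_text: str) -> bool:
    """Return True if the XML text parses, i.e. it is well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def pages_missing_text(filenames):
    """Return sorted page stems that have an image but no matching .txt file."""
    images, texts = set(), set()
    for name in filenames:
        path = PurePosixPath(name)
        suffix = path.suffix.lower()
        if suffix in IMAGE_EXTS:
            images.add(path.stem)
        elif suffix == ".txt":
            texts.add(path.stem)
    return sorted(images - texts)
```

A package would pass the consistency check when `pages_missing_text` returns an empty list; well-formedness says nothing about validity against the METS schema, which would be a separate step.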

35 Ingest (Source: Bibliographic Data, Content Package) More about ingest – New Digitization, Existing Digitization: http://www.hathitrust.org/ingest – Ingest checklist (Deposit Forms, Bibliographic metadata specifications): http://www.hathitrust.org/ingest_checklist – Ingest tools (Tools for validating, remediating, packaging; Detailed content specifications): http://www.hathitrust.org/ingest_tools – Deposit Guidelines (Policies): http://www.hathitrust.org/deposit_guidelines – Example METS files and METS profile: http://www.hathitrust.org/digital_object_specifications

37 Data Management (Bib Data, Rights Data, Holdings Data) Bibliographic Data – Inventory – Loading and updating records – Duplicate detection and collation – Source of information for VuFind catalog, APIs – Rights determination (automated and support for manual review)

38 Data Management (Bib Data, Rights Data, Holdings Data)
namespace  id  attr  reason  source  user  time  note
Inu  30000000078026  2  1  1  Jhovater  2009-10-15 23:30:23  NULL

39 Rights Attributes
id  name  type  dscr
1  pd  copyright  public domain
2  ic  copyright  in-copyright
3  opb  copyright  out-of-print and brittle (implies in-copyright)
4  orph  copyright  copyright-orphaned (implies in-copyright)
5  und  copyright  undetermined copyright status
6  umall  access  available to UM affiliates and walk-in patrons (all campuses)
7  world  access  available to everyone in the world
8  nobody  access  available to nobody; blocked for all users
9  pdus  copyright  public domain only when viewed in the US
10  cc-by  copyright  Creative Commons Attribution
11  cc-by-nd  copyright  Creative Commons Attribution-NoDerivatives
12  cc-by-nc-nd  copyright  Creative Commons Attribution-NonCommercial-NoDerivatives
13  cc-by-nc  copyright  Creative Commons Attribution-NonCommercial
14  cc-by-nc-sa  copyright  Creative Commons Attribution-NonCommercial-ShareAlike
15  cc-by-sa  copyright  Creative Commons Attribution-ShareAlike
16  orphcand  copyright  orphan candidate - in 90-day holding period (implies in-copyright)
17  cc-zero  copyright  Creative Commons Zero license (implies pd)
18  und-world  copyright  Undetermined copyright status and permitted as world-viewable by the depositor
19  ic-us  copyright  In copyright in the US

40 Rights Determination Reason Codes
id  name  dscr
1  bib  bibliographically-derived by automatic processes
2  ncn  no printed copyright notice
3  con  contractual agreement with copyright holder on file
4  ddd  due diligence documentation on file
5  man  manual access control override; see note for details
6  pvt  private personal information visible
7  ren  copyright renewal research was conducted
8  nfi  needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9  cdpp  title page or verso contain copyright date and/or place of publication information not in bib record
10  cip  condition review and in-print status research was conducted
11  unp  unpublished work
12  gfv  Google viewability set at VIEW_FULL
13  crms  derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
14  add  author death date research was conducted or notification was received from authoritative source
15  exp  expiration of copyright term for non-US work with corporate author
16  del  Deleted from repository; see note for details
17  gatt  Non-US public domain work restored to in-copyright in the US by GATT
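
The rights row on slide 38 stores numeric codes that the attribute and reason tables on slides 39 and 40 resolve to names. A small sketch of that lookup, assuming the row's fused digits split into a 14-digit id followed by attr 2, reason 1, and source 1 (that split is an inference, not something the slide states). Only a few table entries are copied here, and the function name is hypothetical.

```python
# Partial copies of the lookup tables from slides 39 and 40.
ATTRIBUTES = {1: "pd", 2: "ic", 5: "und", 9: "pdus"}
REASONS = {1: "bib", 7: "ren", 13: "crms"}


def describe_rights(attr: int, reason: int) -> str:
    """Render numeric attr/reason codes as 'name/name', flagging unknown codes."""
    attr_name = ATTRIBUTES.get(attr, f"unknown({attr})")
    reason_name = REASONS.get(reason, f"unknown({reason})")
    return f"{attr_name}/{reason_name}"


# Under the assumed split, the slide 38 row reads as in-copyright,
# bibliographically derived.
print(describe_rights(2, 1))
```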

41 Access Determinations Automated Manual

42 Automatic Rights Determination Conducted on all works at time of ingest and when records are modified – Public domain worldwide: US works published before 1923, US federal government publications, non-US works published prior to 1873 – Public domain in the United States: non-US works published prior to 1923
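
The date rules on this slide can be expressed as a short decision function. This is a sketch of the slide's rules only: the function name and parameters are hypothetical, the real process works from bibliographic records rather than three tidy arguments, and the slide does not say what is assigned when no rule applies, so the sketch returns None in that case.

```python
from typing import Optional


def automatic_rights(pub_year: int, us_published: bool,
                     us_federal_document: bool = False) -> Optional[str]:
    """Return 'pd', 'pdus', or None when the slide's rules do not decide."""
    if us_federal_document:
        return "pd"      # US federal government publications
    if us_published and pub_year < 1923:
        return "pd"      # US works published before 1923
    if not us_published and pub_year < 1873:
        return "pd"      # non-US works published prior to 1873
    if not us_published and pub_year < 1923:
        return "pdus"    # non-US works 1873-1922: public domain in the US only
    return None          # not covered by the slide's rules
```

Works that fall through to None are the ones that manual review, described on the next slide, is meant to address.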

43 Manual Rights Determination IMLS-funded CRMS project – CRMS-US 2008: US-published works 1923-1963 Staff at 4 partner institutions – CRMS-World 2011: Expanded to non-US works Staff at 16 partner institutions – Double review with additional expert review for conflicts – Compliance with copyright formalities – As of March 2015 511,520 reviewed, 270,979 opened Rights Holder Permissions

44 System of Precedence Rights Database Bibliographic (automatic) Manual

45 Data Management (Bib Data, Rights Data, Holdings Data) Holdings Data – Single-part monographs: OCLC #; Local system ID; Timestamp; Holding Status; Condition – Multi-part monographs: include enumeration and chronology – Serials: OCLC #; Local system ID; Timestamp; ISSN

47 Storage Michigan Indiana Reliability – ensure integrity Redundancy – in single and multiple sites Scalability – including ease of management Accessibility – for repository processes and services Platform-independence – for data/object management

48 Storage Michigan Indiana EMC Isilon storage Disk-based Load-balancing and fail-over Internal redundancy (N+3) Efficient, reliable replication (daily) Scalable (single file system up to 5 petabytes)

49 Storage Michigan Indiana Object integrity Continual checks on data integrity Detection and repair of corrupt disk sectors Fixity checks on ingest Periodic checks on fixity of all objects
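
A fixity check compares a freshly computed checksum with the one recorded for the object. A minimal sketch, assuming MD5 (the checksum named on the PREMIS event slide later in the deck) and hypothetical function names; how the repository schedules or stores these checks is not shown here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_md5):
    """Return True if the file's current MD5 matches the recorded value."""
    return md5_of_file(path) == recorded_md5.strip().lower()
```

Reading in chunks keeps memory use flat for large zip archives; a periodic check would call `fixity_ok` for every stored object and report any mismatch for repair from the replica.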

50 Architecture & Management [Diagram labels: images, text, Source METS, bib data, HT METS; storage path ../uc1/pairtree_root/b3/54/34/86/b34543486 containing b34543486.zip and b34543486.mets.xml]

53 Architecture & Management (diagram as on slide 50) Example ids: wu.89094366434; mdp.39015037375253; uc2.ark:/1390/t26973133; miua.aaj0523.1950.001
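
The storage path on these slides follows the Pairtree convention: the part of the id after the namespace is cleaned of filesystem-unfriendly characters and then split into two-character directory names. A sketch under stated assumptions: the function names are hypothetical, and using the cleaned id as the final directory name is assumed from the path shown on the slide.

```python
def pairtree_clean(identifier: str) -> str:
    """Apply Pairtree character cleaning to an identifier."""
    # Step 1: hex-encode characters Pairtree reserves, plus non-visible ASCII.
    hex_encode = set('"*+,<=>?\\^|')
    out = []
    for ch in identifier:
        if ch in hex_encode or not (0x21 <= ord(ch) <= 0x7E):
            out.extend(f"^{byte:02x}" for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    # Step 2: single-character substitutions: / -> =, : -> +, . -> ,
    return "".join(out).translate(str.maketrans("/:.", "=+,"))


def object_dir(htid: str) -> str:
    """Map an id such as 'mdp.39015037375253' to its object directory."""
    namespace, local_id = htid.split(".", 1)
    cleaned = pairtree_clean(local_id)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([namespace, "pairtree_root", *pairs, cleaned])


print(object_dir("mdp.39015037375253"))
print(object_dir("uc2.ark:/1390/t26973133"))
```

Splitting the eight-character string b3543486 yields the b3/54/34/86 directories shown on the slide; the nine-character b34543486 printed as the leaf name would split differently, so one of the two is presumably a transcription slip in the slide.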

54 Architecture & Management Reference – Ability to locate objects definitively and reliably over time among other objects (Task Force on Archiving of Digital Information, 1996) – Identification of objects – Structure of the repository – Embedding of identifiers – Permanent URLs – Version dates

56 What is METS? Metadata Encoding and Transmission Standard Administrative (including preservation), Technical, and Structural metadata

57 Why METS? Can serve as Archival Information Package and a Dissemination Information Package Designed to record the relationship between pieces of complex digital objects Can be created automatically as texts are loaded or reloaded Preservation actions (PREMIS)

58 Metadata Framework Details and specifications at repository level – Object specifications / Validation criteria – Page-tagging Variations at object level – Files missing – Non-valid files – Incorrect file checksums http://www.hathitrust.org/digital_object_specifications

60 PREMIS Metadata
Object Entity: identifier dul1.ark:/13960/t13n2vj0t; file count 960; page count 320
Event Entity: UUID 9af6a994-f6fe-3a61-ac0e-be793d347edb; package inspection; 2011-10-25T20:37:51Z; Inspection of download package for missing files; warning, files missing: islandoradventur00whit_scanfactors.xml
Agents: MARC21 Code MiU (Executor); tool feedd.pl 0.9.17 (software)
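
The event on this slide could be serialized as a PREMIS 2 event element along these lines. This is a sketch, not HathiTrust's actual METS output: the mapping of the slide's values onto PREMIS element names is an assumption, the helper names are hypothetical, and agent links are omitted for brevity.

```python
import xml.etree.ElementTree as ET

PREMIS_NS = "info:lc/xmlns/premis-v2"  # PREMIS version 2 namespace
ET.register_namespace("premis", PREMIS_NS)


def _el(parent, tag, text=None):
    """Append a namespaced PREMIS child element, optionally with text."""
    child = ET.SubElement(parent, f"{{{PREMIS_NS}}}{tag}")
    if text is not None:
        child.text = text
    return child


def build_event(uuid, event_type, when, detail, outcome):
    """Build a premis:event element from the values shown on the slide."""
    event = ET.Element(f"{{{PREMIS_NS}}}event")
    ident = _el(event, "eventIdentifier")
    _el(ident, "eventIdentifierType", "UUID")
    _el(ident, "eventIdentifierValue", uuid)
    _el(event, "eventType", event_type)
    _el(event, "eventDateTime", when)
    _el(event, "eventDetail", detail)
    info = _el(event, "eventOutcomeInformation")
    _el(info, "eventOutcome", outcome)
    return event


event = build_event(
    "9af6a994-f6fe-3a61-ac0e-be793d347edb",
    "package inspection",
    "2011-10-25T20:37:51Z",
    "Inspection of download package for missing files",
    "warning",
)
xml_text = ET.tostring(event, encoding="unicode")
```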

61
capture: Initial capture (digitization) of item
file rename: File renaming to HathiTrust conventions
image modification: Replace boilerplate images with blank images
image compression: Conversion of raw scans to compressed TIFF and JPEG2000
image header modification: Modification of image headers to meet HathiTrust conventions
ingestion: Ingestion of object package into the repository
message digest calculation: Calculation of page-level MD5 checksums (refers to checksum calculations performed prior to content submission to HathiTrust when these checksums are available)
validation: Validation of technical characteristics of image and OCR files
ocr split: Detail is package type specific, e.g.: a) Extraction of plain-text OCR from ALTO XML b) Split OCR into one plain text OCR file per page c) Splitting of IA XML OCR into one plain text OCR file and one XML file (with coordinates) per page
package inspection: Inspection of download package for missing files
page feature mapping: Mapping of original page feature tags to HathiTrust tags
fixity check: Validation of MD5 checksums of content files
zip archive creation: Compression of content files and source METS into zip archive
zip file message digest calculation: Calculation of MD5 checksum for zip archive
source mets creation: Creation of source METS file

62 Provenance Strategies – Original source – Agent of digitization – Administrative metadata (provenance and preservation)

63 Security Data Integrity – Checksum validation, digital object provenance Physical security – Biometric door systems, locked racks Network security – Firewalling, vulnerability scanning Application security – Developer best practices, input validation Access control

64 Authentication Shibboleth – Login with organization – Attributes released to Service Provider – Authorize access – http://www.hathitrust.org/shibboleth

66 APIs Bibliographic API – Volume and rights information – MARC records – http://www.hathitrust.org/bib_api OAI – http://www.hathitrust.org/data “Hathifiles” – http://www.hathitrust.org/hathifiles Data API – Volume and rights information – Page images – OCR – http://www.hathitrust.org/data_api
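
A Bibliographic API lookup is a plain HTTP GET against a URL built from an identifier type and value. A sketch of the URL construction only, assuming the pattern /api/volumes/{brief|full}/{id type}/{value}.json on catalog.hathitrust.org; the base URL, the list of identifier types, and the function name are assumptions to verify against the Bib API page linked on the slide.

```python
# Assumed base URL and identifier types; verify against the Bib API documentation.
BASE_URL = "https://catalog.hathitrust.org/api/volumes"
ID_TYPES = {"oclc", "lccn", "issn", "isbn", "htid", "recordnumber"}


def bib_api_url(id_type: str, value, detail: str = "brief") -> str:
    """Build a Bib API request URL for a single identifier."""
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    if id_type not in ID_TYPES:
        raise ValueError(f"unsupported identifier type: {id_type}")
    return f"{BASE_URL}/{detail}/{id_type}/{value}.json"


print(bib_api_url("oclc", 424023))
```

The sketch stops short of making the request so that it runs offline; identifiers containing characters such as ":" or "/" may need escaping, which is not handled here.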

67 Computational Access Distribution of datasets – http://www.hathitrust.org/datasets Non-Google-digitized Dataset (540,000+) – PD, PDUS, Open Access – Signed researcher statement Google-digitized (4.8 million+) – PD, PDUS, Open Access – Agreement between institution and Google – Brief proposal Characterize texts Provide ids (custom sets possible) Research, results, use of results – Signed researcher statement

68 HTRC http://www.hathitrust.org/htrc HathiTrust Research Center – Developed collaboratively by Indiana University and University of Illinois; launched July 2011 – Enables computational access to public domain and open access materials; working to support in-copyright materials as well – Secure Environment – bring researchers to the data – Build services and tools that facilitate research by digital humanities and informatics communities – Advanced Collaborative Support RFP: http://www.hathitrust.org/htrc/acs-rfp Awards: http://www.hathitrust.org/htrc_acs_awards_spring2015

69 How to find out more About: http://www.hathitrust.org/about Twitter: http://twitter.com/hathitrust Facebook: http://www.facebook.com/hathitrust Monthly newsletter: – http://www.hathitrust.org/updates – RSS http://www.hathitrust.org/updates_rss Contact us: feedback@issues.hathitrust.org Blogs: http://www.hathitrust.org/blogs – Large-scale Search – Perspectives from HathiTrust Resources – A Preservation Infrastructure Built to Last: Preservation, Community, and HathiTrust http://www.hathitrust.org/documents/york-MemoftheWorld-201209.pdf – PREMIS 2.0 Implementation: http://bit.ly/1O8Fokz

70 Thank you!

