The HathiTrust Research Center: Building Shared Computational Resources to Mine the Largest Academic Digital Library Corpus
Robert H. McDonald – Indiana University
Beth Sandore Namachchivaya – University of Illinois
John Unsworth – Brandeis University
Educause Annual Meeting, Anaheim, CA, October 16, 2013
Tweet Us: #HTRC #SESS037 #EDU13
HathiTrust Partnership
Allegheny College; Arizona State University; Baylor University; Boston College; Boston University; California Digital Library; Carnegie Mellon University; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Florida State University; Getty Research Institute; Harvard University Library; Indiana University; Iowa State University; Johns Hopkins University; Lafayette College; Library of Congress; Massachusetts Institute of Technology; McGill University; Michigan State University; New York Public Library; New York University; North Carolina Central University; North Carolina State University; Northwestern University; The Ohio State University; The Pennsylvania State University; Princeton University; Purdue University; Stanford University; Syracuse University; Texas A&M University; Tufts University; Universidad Complutense de Madrid; University of Alabama; University of Arizona; University of Calgary; University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz); The University of Chicago; University of Connecticut; University of Delaware; University of Florida; University of Illinois at Chicago; The University of Iowa; University of Maryland; University of Miami; University of Michigan; University of Minnesota; University of Missouri; University of Nebraska-Lincoln; The University of North Carolina at Chapel Hill; University of Notre Dame; University of Oklahoma; University of Pennsylvania; University of Pittsburgh; University of Utah; University of Virginia; University of Washington; University of Wisconsin-Madison; Utah State University; Virginia Tech; Wake Forest University; Washington University; Yale University Library
HathiTrust Mission
To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge
HathiTrust Services
Long-term preservation
– Bit-level and migration
Bibliographic search
Full-text search
Reading and download capabilities
Print on demand
Collections
Datasets
HathiTrust Research Center
HathiTrust “Wow” Numbers
10,819,596 total volumes
5,672,046 book titles
281,890 serial titles
3,786,858,600 pages
485 terabytes
128 miles
8,791 tons
3,469,225 volumes (~32% of total) in the public domain
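The public-domain share quoted on this slide can be checked from the slide's own figures. The short Python sketch below does that arithmetic and also derives the average page count per volume implied by the totals; the variable names are illustrative and do not come from the presentation.

```python
# Figures copied from the "Wow" Numbers slide.
total_volumes = 10_819_596
public_domain_volumes = 3_469_225
total_pages = 3_786_858_600

# Share of volumes in the public domain, as a percentage.
public_domain_share = 100 * public_domain_volumes / total_volumes

# Average pages per volume implied by the page and volume totals.
pages_per_volume = total_pages / total_volumes

print(f"{public_domain_share:.1f}% of volumes are in the public domain")
print(f"{pages_per_volume:.0f} pages per volume on average")
```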
Discovery and Use
Search, collections, online access
APIs and data feeds
– Data API
– Bibliographic API
– “Hathifiles” inventory files
– OAI
Computational Research
– Distribution of datasets
– Protocol-based access
– Research Center
Research Center in Context
Goals for HTRC
Provide a persistent and sustainable structure to enable scholars to ask and answer new questions.
– Leverage data storage and computational infrastructure at Indiana & Illinois
– Stimulate community development of new functionality and tools
– Use tools to enable discoveries that would not be possible without the HTRC
Enable scholars to fully utilize content of HathiTrust Library while preventing intellectual property misuse within U.S. copyright law.
– Provide a secure computational and data environment for scholars to perform research using HathiTrust Digital Library.
Organization chart (labels): Board of Governors; Executive Committee; Executive Director; HathiTrust; University of Illinois; Indiana University; HathiTrust Research Center; University of Michigan; Data Copy #1; Data Copy #2
HTRC Governance
Reports to the HathiTrust Board of Governors
HTRC Executive Committee
– J. Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois GSLIS
– Beth Plale (Co-director and Chair), Director, Data to Insight Center and Professor in the School of Informatics and Computing at Indiana University
– Robert H. McDonald, Associate Dean of Libraries/Deputy Director, Data to Insight Center at Indiana University
– Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the University of Illinois
– John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis University
HTRC Advisory Board (see members on next slide)
Google Public Domain agreement – in place for IU and UIUC
HTRC Advisory Board
Cathy Blake, University of Illinois, Urbana-Champaign
Beth Cate, Indiana University
Greg Crane, Tufts University
Laine Farley, California Digital Library
Brian Geiger, University of California at Riverside
David Greenbaum, University of California at Berkeley
Fotis Jannidis, University of Würzburg, Germany
Matthew Jockers, Stanford University
Jim Neal, Columbia University
Bill Newman, Indiana University
Bethany Nowviskie, University of Virginia
Andrey Rzhetsky, University of Chicago
Pat Steele, University of Maryland
Craig Stewart, Indiana University
David Theo Goldberg, University of California at Irvine
John Towns, National Center for Supercomputing Applications
Madelyn Wessel, University of Virginia
Hathifiles
Tab-delimited inventory files
Aggregated monthly
Daily incremental files
Contain
– Identifiers
– Limited bibliographic information
– Rights, language, gov docs status information
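Because the Hathifiles are tab-delimited, a standard CSV reader is enough to filter them. The Python sketch below is only an illustration: the column names, their order, and the sample rows are assumptions, since the slide lists the kinds of information carried but not the actual file layout.

```python
import csv
import io

# Hypothetical column layout; the slide only names the categories of data
# (identifiers, limited bibliographic info, rights, language, gov docs status).
COLUMNS = ["volume_id", "title", "rights", "language", "us_gov_doc"]

# Made-up sample rows standing in for a real inventory file.
SAMPLE = (
    "vol.001\tAn Example Title\tpd\teng\t0\n"
    "vol.002\tAnother Example\tic\tfre\t0\n"
    "vol.003\tA Federal Report\tpd\teng\t1\n"
)

def read_inventory(handle):
    """Yield one dict per tab-delimited inventory line."""
    reader = csv.DictReader(
        handle, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    yield from reader

def public_domain_ids(handle):
    """Return the identifiers of rows whose rights code is 'pd'."""
    return [row["volume_id"] for row in read_inventory(handle) if row["rights"] == "pd"]

pd_ids = public_domain_ids(io.StringIO(SAMPLE))
print(pd_ids)
```

For a file on disk, the same functions could be given a handle opened with `newline=""` and an explicit encoding instead of the in-memory sample.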
Content Distribution
Language Distribution
The top 10 languages make up ~86% of all content
Diagram labels: Source Bibliographic Data Content Package Indiana Michigan Bib Data Data Management Rights Data Storage Access Ingest Catalog Full-text Search PageTurner APIs Collections Holdings Data Datasets
How is it available?
Web interfaces
APIs
– Data API
– Bib API
Data feeds and distribution
– Hathifiles
– OAI
– Datasets
Soon: Virtual Machines
Copyright
Strongly bound to US copyright issues with constant vigilance of the international scene
Status determinations via:
– Bibliographic metadata
– Automatic and manual rights determination
Automatic Rights Determination
Conducted on all works at time of ingest and when records are modified
– Public domain worldwide: US works published before 1923, US federal government publications, non-US works published prior to 1872
– Public domain in the United States: non-US works published prior to 1923
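The date rules on this slide can be written as a small decision function. The Python sketch below is a reading of the slide, not HathiTrust's actual implementation: the function name and parameters are hypothetical, and mapping the two outcomes onto the `pd` and `pdus` rights codes is an assumption. Works that match none of the listed rules get `None` here, because the slide does not say how they are classified.

```python
from typing import Optional

def automatic_rights(pub_year: int, us_work: bool,
                     us_federal_doc: bool = False) -> Optional[str]:
    """Apply the slide's date rules.

    Returns "pd" (public domain worldwide), "pdus" (public domain in the
    United States only), or None when no listed rule applies. The mapping
    onto these codes is an assumption of this sketch.
    """
    if us_federal_doc:
        return "pd"                      # US federal government publication
    if us_work:
        return "pd" if pub_year < 1923 else None   # US work before 1923
    if pub_year < 1872:
        return "pd"                      # non-US work prior to 1872
    if pub_year < 1923:
        return "pdus"                    # non-US work prior to 1923
    return None

examples = [
    automatic_rights(1900, us_work=True),
    automatic_rights(1950, us_work=True, us_federal_doc=True),
    automatic_rights(1860, us_work=False),
    automatic_rights(1900, us_work=False),
    automatic_rights(1950, us_work=False),
]
print(examples)
```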
Manual Rights Determination
IMLS-funded CRMS project
– US-published works
– Conformance with formalities
– Expanding to non-US works
– Double-blind review with expert review for conflicts
– Staff at 4 HathiTrust partner institutions (15 will take part in non-US)
– As of February 2012 ~190,000 reviewed, more than 100,000 opened
Rights Holder Permissions
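The "double-blind review with expert review for conflicts" step can be pictured as follows: two reviewers judge a volume independently, matching judgments are accepted, and disagreements go to an expert. The Python sketch below only illustrates that flow; the function names and the stand-in expert are hypothetical and are not part of CRMS.

```python
from typing import Callable

def resolve_reviews(first: str, second: str,
                    expert: Callable[[str, str], str]) -> str:
    """Accept two matching independent reviews; send conflicts to an expert."""
    if first == second:
        return first
    return expert(first, second)

# Hypothetical expert who settles every conflict as undetermined ("und").
def cautious_expert(first: str, second: str) -> str:
    return "und"

agreed = resolve_reviews("pd", "pd", cautious_expert)
conflict = resolve_reviews("pd", "ic", cautious_expert)
print(agreed, conflict)
```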
Rights Attributes
id  name         type       dscr
1   pd           copyright  public domain
2   ic           copyright  in-copyright
3   opb          copyright  out-of-print and brittle (implies in-copyright)
4   orph         copyright  copyright-orphaned (implies in-copyright)
5   und          copyright  undetermined copyright status
6   umall        access     available to UM affiliates and walk-in patrons (all campuses)
7   world        access     available to everyone in the world
8   nobody       access     available to nobody; blocked for all users
9   pdus         copyright  public domain only when viewed in the US
10  cc-by        copyright  Creative Commons Attribution
11  cc-by-nd     copyright  Creative Commons Attribution-NoDerivatives
12  cc-by-nc-nd  copyright  Creative Commons Attribution-NonCommercial-NoDerivatives
13  cc-by-nc     copyright  Creative Commons Attribution-NonCommercial
14  cc-by-nc-sa  copyright  Creative Commons Attribution-NonCommercial-ShareAlike
15  cc-by-sa     copyright  Creative Commons Attribution-ShareAlike
16  orphcand     copyright  orphan candidate - in 90-day holding period (implies in-copyright)
17  cc-zero      copyright  Creative Commons Zero license (implies pd)
18  und-world    copyright  Undetermined copyright status and permitted as world-viewable by the depositor
19  ic-us        copyright  In copyright in the US
Rights Determination Reason Codes
id  name  dscr
1   bib   bibliographically-derived by automatic processes
2   ncn   no printed copyright notice
3   con   contractual agreement with copyright holder on file
4   ddd   due diligence documentation on file
5   man   manual access control override; see note for details
6   pvt   private personal information visible
7   ren   copyright renewal research was conducted
8   nfi   needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9   cdpp  title page or verso contain copyright date and/or place of publication information not in bib record
10  cip   condition review and in-print status research was conducted
11  unp   unpublished work
12  gfv   Google viewability set at VIEW_FULL
13  crms  derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
14  add   author death date research was conducted or notification was received from authoritative source
15  exp   expiration of copyright term for non-US work with corporate author
16  del   Deleted from repository; see note for details
17  gatt  Non-US public domain work restored to in-copyright in the US by GATT
Columns: Type of work; Searchable (bibliographic and full-text); Viewable*; Full-PDF download (Data API); Print on Demand; Print disabilities*; Preservation uses (Section 108)*
Public domain worldwide
– Worldwide
– Partners only if scanned by Google, if not, worldwide.
– Worldwide
– Partners worldwide
– N/A
Public domain (US) – Non-US works published between 1872 and
– Worldwide
– When accessed from within the United States
– Partners in the US if scanned by Google, if not, anyone US
– Available within the United States
– Partners in the US; partners worldwide where similar laws in effect
– N/A
Works that rights holders have opened access to in HathiTrust
– Worldwide
– Worldwide (if digitized by Google, full-PDF only available if opened with CC license)
– Worldwide with permission
– Partners worldwide
– N/A
Works that are in-copyright or of undetermined status
– Worldwide
– Not available
– Partners in the US; partners worldwide where similar laws in effect
– Partners in the US; partners worldwide where similar laws in effect
Orphan works
– Worldwide
– To participating partners
– Not available
– Partners in the US
– Partners in the US; partners worldwide where similar laws in effect
* Note: Access to in-copyright works is subject to conditions on Terms of Access slide.
HTRC Research Paradigm
Bring the COMPUTATION to the DATA!
Web services architecture and protocols
Registry of services and algorithms
Solr full text indexes
noSQL store as volume store
openID authentication
Portal front-end, programmatic access
Data mining algorithms
Architecture diagram labels: Agent framework Page/volume tree (file system) Volume store (Cassandra) SEASR analytics service Task deployment WSO2 registry services, collections, data capsule images Solr index HathiTrust corpus rsync HTRC Data API v0.1 NCSA local resources Programmatic access e.g., WSO2 Identity Server University of Michigan Meandre Orchestration Agent instance Non-consumptive Data capsules Big Red II/IU Quarry 33 Blacklight Volume store (Cassandra) Volume store (Cassandra) NSF XSEDE Portal
HTRC Complexity hiding interface All the complexity Tabular info Statistical plots Spatial plots Request
Complexity hiding interface Other data (dictionaries, wiki data) Subsets of corpus HTRC Text mining algorithms
Data capsule diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Virtual Cloud; SSH; Non-consumptive Output Storage; Researcher; HTRC Research Access; Request for VM
1 Select volumes for analysis
2 Select algorithm
3 View/download results
Named Entities; Word frequencies; Topic models
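The three steps on this slide (select volumes, select an algorithm, view results) can be illustrated with word frequencies, one of the result types it lists. The Python sketch below is a toy stand-in, not the HTRC portal's code: the volume identifiers, texts, and function names are all made up for illustration.

```python
import re
from collections import Counter

# Made-up stand-ins for volumes; a real workset would hold full texts.
VOLUMES = {
    "vol.001": "The origin of species and the descent of man",
    "vol.002": "Mental evolution in animals and mental evolution in man",
    "vol.003": "A college course catalog",
}

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def select_volumes(volumes, keyword):
    """Step 1: keep volumes whose text contains the keyword as a whole word."""
    return {vid: text for vid, text in volumes.items()
            if keyword.lower() in tokenize(text)}

def word_frequencies(texts):
    """Step 2: one possible algorithm, a case-insensitive word count."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

workset = select_volumes(VOLUMES, "man")
frequencies = word_frequencies(workset.values())

# Step 3: view the most frequent words.
print(frequencies.most_common(3))
```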
1315 volumes selected using a keyword search for 'Darwin', 'Romanes', 'anthropomorphism', and 'comparative psychology'. This set contains lots of books that are not of particular interest, e.g., books on theology and college course catalogs.
Challenge: Find the philosophical arguments in a haystack of sentences
Colin Allen, Professor, Cognitive Science, Indiana University
Digging into Data 2011
https://inpho.cogs.indiana.edu/
Yearly values of ratio between two wordlists in three different genres.
4,275 volumes
Ted Underwood, Dept of English, UIUC
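A yearly ratio between two wordlists can be computed by counting, for each publication year, how many words fall in each list and dividing one count by the other. The Python sketch below is a minimal illustration under stated assumptions: the wordlists and sample texts are invented, the genre dimension mentioned on the slide is left out, and nothing here reproduces the actual study's method.

```python
import re
from collections import defaultdict

def yearly_ratio(volumes, list_a, list_b):
    """Map each year to (count of list_a words) / (count of list_b words)."""
    hits_a = defaultdict(int)
    hits_b = defaultdict(int)
    for year, text in volumes:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in list_a:
                hits_a[year] += 1
            elif word in list_b:
                hits_b[year] += 1
    # Years without any list_b hits are skipped to avoid dividing by zero.
    return {year: hits_a[year] / hits_b[year]
            for year in sorted(hits_b) if hits_b[year]}

# Hypothetical wordlists and (year, text) pairs, for illustration only.
LIST_A = {"house", "home"}
LIST_B = {"residence"}
SAMPLE_VOLUMES = [
    (1800, "the house and the home"),
    (1800, "a residence of comfort"),
    (1850, "home residence residence"),
]

ratios = yearly_ratio(SAMPLE_VOLUMES, LIST_A, LIST_B)
print(ratios)
```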
Phenotypes implemented at level of genes
General study: understanding of how phenotypes, such as human healthy diversity and maladies, are implemented at the level of genes.
Why HTRC: capture properties of language automatically, for text transformations and information extraction. Generalize grammatical and idiomatic patterns as related to systems biology.
Andrey Rzhetsky, Professor, Department of Medicine, University of Chicago
Other Grants and Proposals involving HTRC
– Zdenek Zdrahal, “DiscoveryCORE, Discovering Hidden Relationships in Semantically Connected Resources”, NEH Digging Into Data Challenge.
– Matthew Wilkens, Notre Dame, “Literary Geography at Scale”, American Council of Learned Societies (ACLS).
– Ichiro Fujinaga, “Single Interface for Music Score Searching and Analysis (SIMSSA)” to SSHRC, Canada. Pending.
– Andrew Piper, Text Mining the Novel: Establishing the Foundations of a New Discipline, SSHRC, Canada.
– Robert Iliffe, University of Sussex, Textual Genomics Project (TTGP), United Kingdom Arts and Humanities Research Council.
– Edie Rasmussen. From Indexer’s Legacy to Scholar’s Desktop.
– Adam Farquhar, The British Library. IRIS, Arts and Humanities Research Council grant.
Workset Creation for Scholarly Analysis
Funded at $493,000 by the Andrew W. Mellon Foundation; Co-PIs: J. Stephen Downie, Tim Cole, Beth Plale; 1 July June
Goals:
1) enriching the metadata in the HathiTrust corpus
2) augmenting string-based metadata with URIs to leverage discovery and sharing through external services, and
3) formalizing the notion of collections and worksets in the context of the HathiTrust Research Center.
Includes an open, competitive Request for Proposals in November 2013, with the intent to fund four prototyping projects that will build tools for enriching and augmenting metadata for the HathiTrust corpus.
HTRC Sloan Cloud for Secure Text-Mining at Scale
Funded at $606,000 by The Alfred P. Sloan Foundation; Beth Plale, Indiana University, PI; Atul Prakash, University of Michigan, Co-PI; Fall Spring
Goal: Prototype a system that enables secure text mining to be carried out at scale using public cloud resources, including:
1. a software cloud infrastructure based on OpenStack
2. mechanisms for managing a secure virtual machine
The Sloan Cloud will provide users with dedicated virtual machines that are pre-configured with appropriate tools and provide secure access to remote data that cannot be funneled through the VM to outside filesystems.
Thank You
This presentation was made possible with content provided by many HTRC colleagues: John Unsworth, J. Stephen Downie, Beth Plale, Robert H. McDonald, Beth Sandore, Yiming Sun, Miao Chen, Guangchen Ruan, Loretta Auvil, Kirk Hess, and many others…
The HTRC Non-Consumptive Research Grant is graciously funded by the Alfred P. Sloan Foundation
IU D2I-PTI is graciously funded by The Lilly Endowment, Inc.
HTRC; IU D2I Center; UIUC GSLIS
Contact Information
Speakers:
– Robert H. McDonald, Indiana University
– Beth Sandore Namachchivaya, University of Illinois
– John Unsworth, Brandeis University
Requests for assistance: Miao Chen, HTRC Education and Outreach