HathiTrust And Its Research Center

Presentation transcript:

HathiTrust And Its Research Center
John Unsworth, University of Virginia
September 27, 2018

“Hathi” means elephant, in Hindi
The elephant is a symbol of both massiveness and memory. The motto of the HathiTrust is “There’s an elephant in the library.” This photo is of an actual elephant in an actual library (in mid-20th-century Edinburgh, Scotland, as part of a PR campaign to remind patrons to return their library books).

HathiTrust Mission
The mission of HathiTrust is to contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge.

Big Data: The HathiTrust
16,744,704 total volumes
8,126,978 book titles
449,098 serial titles
5,860,646,400 pages
751 terabytes
198 miles (shelf-wise)
13,605 tons (book-wise)
6,277,976 volumes (~37% of total) in the public domain, so ~63% in copyright
Stats updated daily at: https://www.hathitrust.org/about
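The proportions on this slide can be rechecked from its own counts. The sketch below is only an arithmetic check: the constant and function names are ours, and the numbers are a snapshot copied from the slide, not live statistics.

```python
# Snapshot figures copied from the slide (assumed static; the live numbers change daily).
TOTAL_VOLUMES = 16_744_704
PUBLIC_DOMAIN_VOLUMES = 6_277_976
TOTAL_PAGES = 5_860_646_400


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


public_domain_pct = share(PUBLIC_DOMAIN_VOLUMES, TOTAL_VOLUMES)
not_public_domain_pct = 100.0 - public_domain_pct
pages_per_volume = TOTAL_PAGES / TOTAL_VOLUMES

print(f"public domain:     {public_domain_pct:.1f}%")
print(f"not public domain: {not_public_domain_pct:.1f}%")
print(f"pages per volume:  {pages_per_volume:.0f}")
```

The public-domain share comes out near 37.5 percent. The page total divides to exactly 350 pages per volume, which suggests the page figure is an estimate rather than a tally.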

Domains of Knowledge
Call numbers from A to Z
50% in English
A long tail of languages: over 450 different languages, ancient and modern, from Aleut to Zulu
Publication dates from the 16th century to the present

HathiTrust Manages Complex Copyright Conditions For Access
Columns: Type of work | Searchable (bibliographic and full-text) | Viewable* | Data API | Print on Demand | Print disabilities* | Preservation uses (Section 108)*
Public domain worldwide: Worldwide Partners only if scanned by Google, if not, worldwide. Partners worldwide N/A
Public domain (US) – Non-US works published between 1872 and 1923: When accessed from within the United States Partners in the US if scanned by Google, if not, anyone US Available within the United States Partners in the US; partners worldwide where similar laws in effect
Works that rights holders have opened access to in HathiTrust: Worldwide (if digitized by Google, full-PDF only available if opened with CC license) Worldwide with permission
Works that are in-copyright or of undetermined status: Not available Partners in the US; partners worldwide where similar laws in effect
HathiTrust recognizes and implements a complex set of copyright conditions. For the most part, HathiTrust texts are search-and-snippet only. The exceptions: out-of-copyright works, works provided for students with documented print disabilities, and texts used for non-consumptive computational analysis.

HathiTrust Research Center
The research arm of the HathiTrust, meant to provide computational access to the entire corpus, including copyrighted materials.
A collaborative effort based at the University of Illinois and Indiana University, supported by co-investments from these universities and the University of Michigan, through the HathiTrust.
Initiated in 2008, formally established in 2011.
HTRC operates under a grant from the University of Michigan to Indiana University and the University of Illinois Urbana-Champaign, with significant financial contributions from IU and UIUC.

HathiTrust Research Center Mission
The mission of the HathiTrust Research Center is to provide infrastructure and tools enabling and supporting computational research on the more than 16 million volumes of the HathiTrust collection. HTRC enables scholars to make full use of HathiTrust content under fair use, while preventing violations of U.S. copyright law.

Non-Consumptive Research Paradigm
Non-consumptive research excludes research in which a researcher reads or displays substantial portions of an in-copyright or rights-restricted volume to understand the expressive content presented within that volume.
Non-consumptive analytics includes such computational tasks as text extraction, textual analysis and information extraction, linguistic analysis, automated translation, image analysis, file manipulation, OCR correction, and indexing and search.
More here: https://www.hathitrust.org/htrc_ncup

Non-Consumptive Research Paradigm
Bring the COMPUTATION to the DATA!
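One way to picture "bring the computation to the data" is a host that accepts an analysis function, runs it next to the protected text, and releases only aggregate results. The sketch below is a toy illustration under that assumption; `run_at_data`, `word_counts`, and the two-volume corpus are hypothetical and do not describe how HTRC is actually built.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Hypothetical stand-in for a protected corpus; the text itself is never returned to the caller.
_PROTECTED_CORPUS: Dict[str, str] = {
    "vol-1": "the elephant remembers the library",
    "vol-2": "an elephant in the library",
}


def run_at_data(analysis: Callable[[Iterable[str]], Counter]) -> Dict[str, int]:
    """Run `analysis` where the data lives and release only word -> count pairs."""
    result = analysis(_PROTECTED_CORPUS.values())
    return dict(result)


def word_counts(texts: Iterable[str]) -> Counter:
    """Count lowercase, whitespace-separated tokens across all texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


summary = run_at_data(word_counts)
print(summary["elephant"])
```

The sketch does not itself verify that an analysis returns only aggregates; a real service would need output review or other controls for that.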

HTRC Analytics
Since 2011, the HathiTrust Research Center has been developing services and tools that allow researchers to employ text and data mining methodologies using the HathiTrust collection. To date, this service has been available only on the portion of the collection that is out of copyright.

HTRC Analytics
With the development of a landmark HathiTrust policy and an updated release of HTRC Analytics, HTRC now (as of 9/24/2018) provides access to the text of the complete 16.7-million-item HathiTrust corpus for non-consumptive research, such as data mining and computational analysis, including items protected by copyright.

Three Approaches
HTRC Analytics (for pre-determined web-based analyses, including Bookworm)
Feature Extraction Services (including downloadable data sets)
Secure Data Capsule access
“Features” here are page-level derived statistical data, including unique words per page, number of occurrences of each word per page, part-of-speech information, etc.
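To make the notion of page-level features concrete, here is a minimal sketch that derives a token count, a unique-word count, and per-word occurrence counts for each page. The field names and the whitespace tokenizer are our assumptions, not the actual Extracted Features schema, and part-of-speech tagging is left out.

```python
from collections import Counter
from typing import Dict, List


def page_features(page_text: str) -> Dict[str, object]:
    """Derive simple statistics for one page (hypothetical field names)."""
    tokens = page_text.lower().split()  # naive whitespace tokenization
    counts = Counter(tokens)
    return {
        "token_count": len(tokens),
        "unique_words": len(counts),
        "word_counts": dict(counts),
    }


def volume_features(pages: List[str]) -> List[Dict[str, object]]:
    """Apply page_features to every page of a volume, in order."""
    return [page_features(page) for page in pages]


pages = ["The elephant and the library", "An elephant never forgets"]
features = volume_features(pages)
for number, feats in enumerate(features, start=1):
    print(number, feats["token_count"], feats["unique_words"])
```

Because only counts are kept, the original word order of each page cannot be read back from the features.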

HTRC Analytics for All
HTRC Algorithms: web-based, click-and-run tools to perform computational text analysis on shared public worksets or those you have created, including copyrighted items, for ALL USERS.
Extracted Features Dataset: allows non-consumptive analysis on specific features extracted from the full text of the HathiTrust corpus, including copyrighted items, for ALL USERS.
HathiTrust+Bookworm: a tool for visualizing and analyzing word usage trends in the HathiTrust corpus, including copyrighted items, for ALL USERS.
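A word usage trend of the kind a Bookworm-style tool plots can be approximated from per-volume word counts alone. The sketch below computes a word's relative frequency per publication year; the function name and the toy data are hypothetical, and it makes no claim about how HathiTrust+Bookworm is implemented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each volume is (publication_year, word -> count); the data below is made up.
Volume = Tuple[int, Dict[str, int]]


def relative_frequency_by_year(volumes: List[Volume], word: str) -> Dict[int, float]:
    """Return, per year, occurrences of `word` divided by all tokens from that year."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for year, counts in volumes:
        hits[year] += counts.get(word, 0)
        totals[year] += sum(counts.values())
    # Skip years with no tokens to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


volumes = [
    (1900, {"elephant": 2, "library": 8}),
    (1900, {"library": 10}),
    (1950, {"elephant": 5, "library": 5}),
]
trend = relative_frequency_by_year(volumes, "elephant")
print(trend)
```

Normalizing by the yearly token total matters because the number of digitized volumes varies widely from year to year.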

HTRC Analytics for Members
HTRC Data Capsule: a secure computing environment for text analysis on the HathiTrust corpus, using the researcher’s tools of choice.
Access to copyrighted items using an HTRC Data Capsule is available ONLY to HathiTrust member-affiliated researchers, because we anticipate significant demand for this service and HTRC has finite resources to support it.

HathiTrust Member Institutions

How Is This Possible?
HathiTrust exists to enable lawful research and educational uses of its collection. In recent years, US courts have recognized that there is a legal basis for non-consumptive research on copyrighted materials. In 2016, HathiTrust established a Non-Consumptive Use Research Policy to ensure the responsible research use of copyrighted items. That policy is now embodied in the HTRC Analytics services, which allow researchers to conduct computational text analysis on copyrighted items, under the fair use provisions of US copyright law.

What’s Next?
In collaboration with HTRC, JSTOR, and Portico staff, some of my staff at UVA and I are exploring distributed text-mining as a way to enable TDM across both HTRC’s book materials and JSTOR and Portico’s journal materials. We’re starting with sample data sets in biology from HTRC and from 10 Portico publishers who agreed to be part of this pilot. We have developed interoperable Extracted Features Datasets as our first proof-of-concept, demonstrating that we have harmonized our metadata (no mean feat).

Distributed Text-Mining
If text-mining services are only available on a per-publisher basis, there will be no real competition on the merits of the service, and no practical way for researchers to work across publisher collections. Distributed text-mining across HTRC-Portico-JSTOR materials could help to establish metadata interchange guidelines, APIs, and text-mining techniques that will make it possible for researchers to work with even broader collections of copyrighted content that can’t be aggregated and indexed in one place.
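The role of harmonized metadata can be illustrated with a small sketch: each provider's feature records are mapped onto one shared schema, after which counts can be combined across collections. The provider names, field names, and records below are all hypothetical and are not drawn from the HTRC, JSTOR, or Portico formats.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical per-provider field names, mapped onto one shared schema.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "books": {"id": "volume_id", "year": "pub_date", "counts": "tokens"},
    "journals": {"id": "article_id", "year": "year", "counts": "word_counts"},
}


def harmonize(provider: str, record: dict) -> dict:
    """Rename a provider-specific record's fields to the shared schema."""
    mapping = FIELD_MAPS[provider]
    return {
        "source": provider,
        "id": record[mapping["id"]],
        "year": int(record[mapping["year"]]),  # providers may store years as text
        "counts": Counter(record[mapping["counts"]]),
    }


def combined_counts(records: List[dict]) -> Counter:
    """Sum word counts across harmonized records from any provider."""
    total: Counter = Counter()
    for rec in records:
        total.update(rec["counts"])
    return total


book = {"volume_id": "b1", "pub_date": "1901", "tokens": {"cell": 3, "leaf": 1}}
article = {"article_id": "a1", "year": 2005, "word_counts": {"cell": 4, "gene": 2}}
records = [harmonize("books", book), harmonize("journals", article)]
totals = combined_counts(records)
print(totals.most_common(1))
```

In a distributed setting each provider could run the harmonizing step on its own systems and exchange only the derived counts, so the underlying text would not need to be aggregated in one place.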

Acknowledgements
At Indiana University, HTRC is affiliated with and supported by the IU Pervasive Technology Institute, the School of Informatics, Computing, and Engineering, and the IU Bloomington Libraries. Additional financial support comes from the Office of the Vice Provost for Research. Computational resources are provided by the Pervasive Technology Institute.
At the University of Illinois Urbana-Champaign, HTRC is hosted and supported by the School of Information Sciences in collaboration with the University of Illinois Library. Financial support is provided by the Office of the Provost and the Office of the Vice-Chancellor for Research. Additional resources to advance the mission of HTRC are supplied by the National Center for Supercomputing Applications.