Computational Research and Copyright John Unsworth BNN Future of the Academy Speaker Series MIT Faculty Club May 25, 2012.

Presentation transcript:

1 Computational Research and Copyright John Unsworth BNN Future of the Academy Speaker Series MIT Faculty Club May 25, 2012

2 HathiTrust: A Shared Digital Repository. HathiTrust Research Center

3 Goals of the HTRC
Maintain a repository of text mining algorithms, retrieval tools, derived data sets, and indices available for human and programmatic discovery.
Be a user-driven resource, with an active advisory board and a community model that allows users to share tools and results.
Support interoperability across collections and institutions through use of InCommon SAML identity.
See also: http://www.ideals.illinois.edu/handle/2142/29936, a report prepared by the Illinois Center for Informatics Research in Science and Scholarship on the experience of Google Digital Humanities grant recipients.

4 The HathiTrust Research Center The HathiTrust Research Center (HTRC) enables computational access for nonprofit and educational users to published works in the public domain. In the future, it will offer computational access to in-copyright works from the HathiTrust as well. The center will break new ground in the areas of text mining and non-consumptive research, allowing scholars to fully utilize content of the HathiTrust Library while observing the requirements of current U.S. copyright law.

5 HTRC Partners The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library and Google. The HTRC will help researchers meet the technical challenges of working with massive digital collections, by developing tools and cyberinfrastructure that enable advanced computational access to those collections.

6 Memoranda of Understanding
Completed: IU/UIUC MOU; HT/IU/UIUC MOU; Google/UIUC MOU; Google/IU MOU
To be developed: HTRC-Researcher/Center MOU

7 Executive Committee
The HathiTrust Research Center is led by an Executive Management Team that includes:
Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois Graduate School of Library and Information Science
Beth Plale (Co-director and chair), Data To Insight Center director and professor in the School of Informatics and Computing at Indiana University
Scott Poole, I-CHASS director and professor in the Department of Communication at the University of Illinois
Robert McDonald, Indiana University Associate Dean of Libraries
John Unsworth, Vice-Provost and CIO, Brandeis University

8 Advisory Board
Cathy Blake, University of Illinois at Urbana-Champaign
Beth Cate, Indiana University
Greg Crane, Tufts University
Laine Farley, California Digital Library
Brian Geiger, University of California at Riverside
David Greenbaum, University of California at Berkeley
Fotis Jannidis, University of Würzburg, Germany
Matthew Jockers, Stanford University
Jim Neal, Columbia University
Bill Newman, Indiana University
Bethany Nowviskie, University of Virginia
Andrey Rzhetsky, University of Chicago
Pat Steele, University of Maryland
Craig Stewart, Indiana University
David Theo Goldberg, University of California at Irvine
John Towns, National Center for Supercomputing Applications
Madelyn Wessel, University of Virginia

9 Timeline: Phase 1 The primary areas of work in Phase 1 include architecting the core cyberinfrastructure for data analysis, deploying some general-purpose analytical tools, and prototyping end-user services, including an access portal, support center capabilities, and facilities for sharing and storing derived research data. In Phase 1, only the public domain works in the HathiTrust will be available to researchers, since the security framework and policies for working with copyrighted material will still be under development. The HTRC will deliver a demonstration system in June 2012.

10 Timeline: Phase 2 This phase, which will require significant funding, will involve development of an operational research center that will provide ongoing and up-to-date access to the HTRC research corpus and associated tools. Phase 2 will commence during the 18th month of the project, and its launch will depend on garnering resources during Phase 1 and on the sustainability plan that will be developed in Phase 1.

11 Current Collections HTRC currently has a 250,000-volume collection of non-Google-digitized content and a 50,000-volume collection of content digitized by the IU Libraries. These collections reside in a cluster of three 4-core machines with 16 GB of RAM each. About 2.8 million volumes of Google-produced public domain material will shortly be added to the HTRC collections, now that the Google MOUs have been signed.

12 HTRC Access and Use
Users will be able to access the HTRC through a portal or programmatically, through a Data API.
The Data API cannot be used to download volumes, but it can be used to move data to a location where computation takes place. It can also be used to search Solr indexes and pass volume IDs to other services for access and computation.
The target audience of the HTRC is non-profit and educational researchers.
Authentication will depend on InCommon, a Shibboleth implementation that most HathiTrust institutions already support.
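To make the "programmatic, not download" distinction concrete, here is a minimal sketch of how a researcher's script might address volumes through a Data API. The base URL, path, and parameter names below are purely illustrative assumptions, not the actual HTRC endpoints; the point is that the client requests content by volume ID for computation at a designated location rather than fetching whole volumes for local storage.

```python
from urllib.parse import urlencode

# Hypothetical Data API endpoint (illustrative only; not the real HTRC URL).
BASE_URL = "https://example-htrc.org/data-api"

def build_volume_request(volume_ids, concat=False):
    """Build a request URL asking the Data API for a set of volumes by ID.

    The API streams page text to the place where computation happens;
    it does not offer whole-volume downloads.
    """
    params = {"volumeIDs": "|".join(volume_ids), "concat": str(concat).lower()}
    return f"{BASE_URL}/volumes?{urlencode(params)}"

# HathiTrust-style volume IDs (namespace.identifier), passed together.
url = build_volume_request(["mdp.39015012345678", "uc1.b0000123456"])
```

The same ID-based addressing is what lets a Solr search return volume IDs that are then handed to other services for access and computation.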

13

14 Architecture
Solr Indexes: The HathiTrust and the HTRC both use Apache Solr to index the materials in their collections. The Solr index is accessed through the Data API layer, which limits some access and does auditing, but otherwise is a pass-through to the Solr API.
Volume Store: HTRC uses Apache Cassandra, a NoSQL data store cluster, to hold the volumes of digitized text. Volume- and page-level access to HTRC data is provided through the HTRC Data API. Each machine has 500 GB of disk, and the volumes are partitioned and replicated across the three Cassandra instances.
Registry: IU is running a version of WSO2 Governance Registry, where applications are registered prior to running in the non-consumptive framework. The registry is also used as temporary storage for returned results.
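As a toy illustration of "partitioned and replicated across the three Cassandra instances," the sketch below hashes a volume ID to a primary node and places a replica on the next node around the ring, in the spirit of Cassandra's SimpleStrategy. The node names and replication factor are assumptions for illustration; Cassandra's real partitioners and replication strategies are more sophisticated.

```python
import hashlib

# Illustrative cluster matching the slide's description: three instances.
NODES = ["cassandra-0", "cassandra-1", "cassandra-2"]
REPLICATION_FACTOR = 2  # assumed for the sketch; not stated in the slide

def placement(volume_id, nodes=NODES, rf=REPLICATION_FACTOR):
    """Return the nodes holding replicas of a volume (primary first)."""
    h = int(hashlib.md5(volume_id.encode()).hexdigest(), 16)
    primary = h % len(nodes)
    # Replicas go to successive nodes around the ring, so losing one
    # machine never loses a volume.
    return [nodes[(primary + i) % len(nodes)] for i in range(rf)]

replicas = placement("mdp.39015012345678")
```

Partitioning spreads the corpus across the cluster's disks; replication is what lets page-level reads survive the failure of any single machine.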

15 “Research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book.”

16 Non-consumptive Research One of HTRC’s unique challenges is supporting non-consumptive research. This will entail bringing algorithms to the data and exporting results, and/or providing researchers with secure computational environments in which they can work with copyrighted materials without exporting them. Why is this worth doing? Because it enables a new art of information that can be used to make new kinds of arguments (and possibly to settle some old ones).

17 Non-Consumptive Research HTRC received funding from the Alfred P. Sloan Foundation for development of secure infrastructure on which to carry out execution of large-scale parallel tasks on copyrighted data using public compute resources such as FutureGrid or resources at NCSA. The high-level design uses a pool of VM images that run in a secure-capsule mode and are deployed onto compute resources. The team is working on a proof of concept deployment process with an OpenStack platform using Sigiri.

18 Blacklight Developed at the University of Virginia, Blacklight is an open-source discovery interface: http://projectblacklight.org/ Blacklight supports faceted searches, a known need of researchers. We expect Blacklight to be a significant component of the public face of the HTRC. Blacklight is designed to support data that is both full-text and bibliographic. Blacklight is built on Solr, the same technology that we already use to index the HTRC data.
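Faceted search of the kind Blacklight exposes maps directly onto Solr's facet parameters. The sketch below builds the kind of query string a Blacklight front end issues to Solr; the field names (pub_date, language) are illustrative assumptions, not the actual HTRC schema.

```python
from urllib.parse import urlencode

def facet_query(q, facet_fields):
    """Build a Solr select URL that requests facet counts alongside results.

    Each facet.field entry asks Solr to return value counts for that
    field, which the discovery interface renders as clickable filters.
    """
    params = [("q", q), ("wt", "json"), ("facet", "true")]
    params += [("facet.field", f) for f in facet_fields]
    return "/solr/select?" + urlencode(params)

url = facet_query("melville", ["pub_date", "language"])
```

Because the HTRC data is already indexed in Solr, adding a faceted front end is largely a matter of configuring which fields to facet on.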

19 Google DH study Google Digital Humanities Awards Recipient Interviews Report, prepared for the HathiTrust Research Center by Virgil E. Varvel Jr. and Andrea Thomer at the Center for Informatics Research in Science and Scholarship, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, in Fall 2011.

20 Scope of the report
22 researchers who had received Google Digital Humanities grants were invited to debrief on their experience, in order to provide input to the design of the HTRC.
Interviews were conducted by phone, in person, or by Skype, using a semi-structured interview protocol.

21 Findings of the report: OCR
OCR quality is a significant issue; steps should be taken to improve OCR output where possible.
OCR quality should be indicated in volume-level metadata.
Scalable access to scanned page images is necessary for human correction of OCR errors.

22 Other Findings of the Report
Researchers would like better metadata about the languages included in texts, particularly in multi-lingual documents. Better metadata about language by sections within volumes would be helpful.
Automatic language identification functions would be helpful, but human-created metadata is preferred, particularly for documents with low OCR quality.
For one researcher, the primary issue was retrieving the bibliographic records in usable form. It took 10 months to design the queries and get the data.

23 Matt Jockers, “The Nineteenth-Century Literary Genome” via Digital Humanities Specialist (aka Elijah Meeks) http://dhs.stanford.edu

24 Arguing with Data
Data enables arguments that are quantitative and/or empirical.
Data still requires interpretation: you can still make better and worse interpretations, and more or less compelling arguments.
In addition to new kinds of arguments, you can make new kinds of mistakes, especially mistakes based on incomplete data or on an incomplete understanding of the data.

25 Mistakes based on incomplete data

26

27 New kinds of arguments Ted Underwood is exploring the changing etymological basis of diction in English, over a 200-year period, especially the shift from words derived from German, to words derived from Latin, and back again. http://tedunderwood.wordpress.com/

28 Etymology and Style Ted Underwood, 2011
o English professors have a long, lively history of drawing specious conclusions from the “Latinate” or “Germanic” character of a particular writer’s style.
o There is nevertheless good evidence that older words do predominate in informal, and especially spoken, English. [Laly Bar-Ilan and Ruth A. Berman, “Developing register differentiation: the Latinate-Germanic divide in English,” Linguistics 45 (2007): 1-35.]
o Can we use this fact to trace broad changes of register in the history of written English?

29 The fundamental distinction is not Latinate/Germanic, but date of entry. French was the written language of England for 200 years; words that entered English before that point had to be used in the spoken language to survive. This includes “Latinate” words like “street” and “wall.” http://bit.ly/h8cJem
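The date-of-entry measure described above can be sketched in a few lines: score a passage by the share of its words that entered English before a cutoff. The tiny entry-date lexicon and the 1150 cutoff below are illustrative assumptions; an actual study of this kind would draw dates from a full etymological dictionary.

```python
# Toy entry-date lexicon (years are illustrative approximations).
ENTRY_DATE = {
    "street": 900,    # pre-Conquest borrowing from Latin
    "wall": 900,      # likewise Latinate, but early enough to count as "old"
    "house": 900,
    "edifice": 1400,  # later Latinate borrowing, written register
    "habitation": 1400,
}

def old_word_ratio(text, cutoff=1150):
    """Fraction of recognized words that entered English before `cutoff`."""
    words = [w for w in text.lower().split() if w in ENTRY_DATE]
    if not words:
        return 0.0
    return sum(ENTRY_DATE[w] < cutoff for w in words) / len(words)

ratio = old_word_ratio("the street and the wall of the edifice")
```

Note that "street" and "wall" count as old despite being Latinate, which is exactly why date of entry separates spoken from written register better than a Latinate/Germanic split does.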

30 To understand the significance of the result, it needs to be broken down by genre. Initial results suggest that fiction and nonfiction prose both become more formal (less like speech) in the 18c. Drama and poetry change little, although older, less formal, “speechlike” words always predominate in drama.

31 The Value of HTRC Ted’s investigation concerns historical trends: as such, it is reasonable to think that it might be interesting to extend it beyond 1900. Can he do that? Only if he is given the data. Will researchers have this kind of computational access to copyrighted data? Only through some institutional affordance like the HTRC. Institutions are risk-averse: in some sense, the most important infrastructure in the HTRC is the MOU.

