Idiosyncrasy at Scale: Data Curation and the Digital Humanities. John Unsworth, December 7, 2010, IDCC.

Presentation transcript:

Idiosyncrasy at Scale: Data Curation and the Digital Humanities. John Unsworth, December 7, 2010, IDCC. Man walks around Dublin. We follow every little detail of his day. He’s probably overtweeting.

our cultural commonwealth “Science and engineering have made great strides in using information technology to understand and shape the world around us. This report is focused on how these same technologies could help advance the study and interpretation of the vastly more messy and idiosyncratic realm of human experience.” (14)

Messy …. Claus Huitfeldt, “Electronic Textual Editing: Philosophy Case Study”

and idiosyncratic: “An obvious requirement for a normalised reproduction is that orthographic errors are corrected. In this example one will have to mark up the misspelling ‘Vather’ so that it can be rendered correctly as ‘Vater’ in the normalised version. Admittedly, orthographic rules are not always clear, and texts are frequently written according to idiosyncratic or inconsistent orthographies. Further complications are that orthographic variation is quite often a literary means of expression, and that orthography may in itself be an object of study. In electronic texts spelling affects not only readability, but also retrievability. Therefore, standardisation is much more important in electronic than in traditional editions. While there may be a need to retain the original orthography, as in diplomatic transcription, there is also a need to standardise orthography to some set of uniform spelling rules.”
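The point about marking up ‘Vather’ so that it can be rendered as ‘Vater’ can be illustrated with a small sketch. It assumes a TEI-style `<choice>` element that pairs the original reading (`<sic>`) with the correction (`<corr>`); the quoted passage does not say which encoding scheme was used, and the sample sentence and the `render` function are made up for illustration. Namespaces, which real TEI files carry, are left out to keep the example short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one misspelling recorded alongside its correction.
SAMPLE = "<p>Mein <choice><sic>Vather</sic><corr>Vater</corr></choice> schrieb.</p>"


def render(xml_text, normalised):
    """Return the text with either the corrected or the original readings."""
    root = ET.fromstring(xml_text)
    # A normalised view drops <sic>; a diplomatic view drops <corr>.
    skip = "sic" if normalised else "corr"

    def walk(elem):
        if elem.tag == skip:
            return ""
        parts = [elem.text or ""]
        for child in elem:
            parts.append(walk(child))
            parts.append(child.tail or "")  # text following the child element
        return "".join(parts)

    return walk(root)


print(render(SAMPLE, normalised=True))   # Mein Vater schrieb.
print(render(SAMPLE, normalised=False))  # Mein Vather schrieb.
```

Because both readings live in the same file, one source can serve the diplomatic transcription and the standardised, more retrievable version.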

six laws to give us pause:
1. Scholars are interested in particular texts.
2. Analytical tools are only useful if they can be applied to texts that are of interest.
3. No single collection has all texts.
4. No two collections will be identical in format.
5. No one collection will be internally consistent in format.
6. Analytical tools need comparable texts in order to provide meaningful results.

humanities data narratives include normalization. Lots of normalization, at very granular levels. For example, with textual data, normalization might apply to: spelling; vocabulary; punctuation (esp. hyphenation); chunking (paragraphs, pages, arbitrary chunks); markup; metadata. Therefore….
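A minimal sketch of what several of these normalization levels could look like in practice, assuming plain-text input. The spelling table, the sample sentence, and every function name here are hypothetical; none of it comes from a particular project's tooling.

```python
import re
import unicodedata

# Hypothetical variant-to-standard table; a real project would curate its own.
SPELLING = {"publick": "public", "to-day": "today"}


def clean(text):
    """Normalize encoding, line-break hyphenation, quotation marks, and spacing."""
    text = unicodedata.normalize("NFC", text)
    # Rejoin words split across a line break: "nor-\nmal" becomes "normal".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Replace curly quotation marks with straight ones.
    text = text.translate(str.maketrans("“”‘’", "\"\"''"))
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split into words (keeping internal hyphens) and punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


def standardize(tokens):
    """Map known spelling variants to a standard form; leave the rest alone."""
    return [SPELLING.get(t.lower(), t) for t in tokens]


def chunk(tokens, size):
    """Cut the token stream into arbitrary fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


raw = "The publick good was nor-\nmal to-day."
chunks = chunk(standardize(tokenize(clean(raw))), 3)
print(chunks)
```

Every step embeds a judgment call. The hyphen rule, for instance, would also fuse a genuine compound that happens to break at a line end, which is one reason normalization choices need to be recorded rather than applied silently.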

a text-mining example: the MONK Project includes approximately 525 works of American literature from the 18th and 19th centuries, and 37 plays and 5 works of poetry by William Shakespeare, provided by the scholars and libraries at Northwestern University, Indiana University, the University of North Carolina at Chapel Hill, and the University of Virginia. These texts are available to all users, regardless of institutional affiliation. Staff, students, or faculty at one of the Big Ten schools may use MONK with an augmented data set that includes all the texts in the public data set plus about a thousand works of British literature from the 16th through the 19th century, provided by The Text Creation Partnership (EEBO and ECCO) and ProQuest (Chadwyck-Healey Nineteenth-Century Fiction). To use MONK with this data set you will be required to authenticate using the CIC implementation of Shibboleth.

MONK Ingest:
1. TEI source files (from various collections, with various idiosyncrasies) go through Abbot, a series of XSL routines that transform the input format into TEI-Analytics (TEI-A for short), with some curatorial interaction.
2. “Unadorned” TEI-A files go through Morphadorner, a trainable part-of-speech tagger that tokenizes the texts into sentences, words and punctuation, assigns ids to the words and punctuation marks, and adorns the words with morphological tagging data (lemma, part of speech, and standard spelling).
3. Adorned TEI-A files go through Acolyte, a script that adds curator-prepared bibliographic data.
4. Bibadorned files are processed by Prior, using a pair of files defining the parts of speech and word classes, to produce tab-delimited text files in MySQL import format, one file for each table in the MySQL database.
5. cdb.csh creates a Monk MySQL database and imports the tab-delimited text files.
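The middle of this pipeline (tokenize, assign ids, adorn, emit tab-delimited rows) can be sketched in a few lines. This is a toy stand-in, not Morphadorner or Prior: the lexicon, the id scheme, the column order, and the function names are all assumptions made for illustration, and a dictionary lookup takes the place of a trained tagger.

```python
import csv
import io
import re

# Hypothetical lexicon: token -> (lemma, part of speech, standard spelling).
LEXICON = {
    "she": ("she", "pronoun", "she"),
    "lov'd": ("love", "verb", "loved"),
    "loved": ("love", "verb", "loved"),
}


def tokenize(text):
    """Split text into words (apostrophes kept) and punctuation marks."""
    return re.findall(r"[\w']+|[^\w\s]", text)


def adorn(work_id, text):
    """Give every token an id and attach lemma, part of speech, and spelling."""
    rows = []
    for n, tok in enumerate(tokenize(text), start=1):
        lemma, pos, std = LEXICON.get(tok.lower(), (tok.lower(), "unknown", tok))
        if not any(ch.isalnum() for ch in tok):
            pos = "punct"  # punctuation marks get ids too
        rows.append((f"{work_id}-{n}", tok, lemma, pos, std))
    return rows


def to_tsv(rows):
    """Serialize rows as tab-delimited text, one line per token."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


rows = adorn("w1", "She lov'd.")
print(to_tsv(rows), end="")
```

A file in this shape is the kind of thing a database bulk-import step can load directly, which is why the last two ingest stages are comparatively mechanical.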

Digitized as of 12/6/2010: 7,483,715 total volumes; 4,227,702 book titles; 182,782 serial titles; 2,619,300,250 pages; 278 terabytes; 89 miles; 6,081 tons; 1,826,616 volumes (~24% of total) in the public domain.
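A quick arithmetic check of these figures, as a sketch. The variable names are arbitrary, and the per-volume size assumes decimal terabytes (1 TB = 1,000,000 MB), which the slide does not specify.

```python
volumes = 7_483_715
pages = 2_619_300_250
public_domain = 1_826_616
terabytes = 278

# Share of volumes in the public domain.
public_share = public_domain / volumes

# Average pages per volume.
pages_per_volume = pages / volumes

# Average storage per volume, assuming decimal terabytes.
mb_per_volume = terabytes * 1_000_000 / volumes

print(f"{public_share:.1%}")       # 24.4%
print(pages_per_volume)            # 350.0
print(f"{mb_per_volume:.1f} MB")
```

The page total works out to exactly 350 pages per volume, which suggests it may be an estimate derived from the volume count rather than a direct count of pages.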

HathiTrust ≠ Single Source. HathiTrust materials will come from Google Books (from digitizing material contributed by the libraries to which the digitized data is returned), but they will also come from other past, current, and future digitization efforts, including the Open Content Alliance and various in-house efforts. So, expect continued collection-level idiosyncrasies. “Tributaries,” Ron Gonsalves

beyond normalization: correction of texts by users (e.g., distributed proofreading); correction or enriching of metadata by users; sharing of representations (e.g., indices, databases, visualizations); publishing interpretations backed by data.

HathiTrust Research Center: If approved and funded, it would have the task of supporting research with HT materials. Data curation would be a key element of its activities. Keeping one person’s solution from becoming another person’s problem will be a challenge.

key htrc attributes: Serves its users, giving priority to those from HT member institutions. Recognizes humanities as a key constituency, but also recognizes key constituencies in social science, computer science, and other areas. Knows the recipe for Stone Soup. Can manage the copyright issues it can’t avoid. In the absence of a GB settlement, and with institutional budgets under stress, it may take some time to build this and make it sustainable, but if the community drives it, it will go.

dh answers: on data curation
In December 2010, the Outreach Committee of the ACH (Association for Computers and the Humanities) will draft an open letter to the funders of digital humanities projects, advocating for the establishment of data curation or management plans and open source / open access guidelines for grant-funded DH. What should we be sure to address in such a letter?
Please do not ask them to mandate the use of any form of embedded markup for data curation. What should be archived for others to freely repurpose and reuse is not text embedded with the subjective judgements of individuals, or the programming instructions of today that will mean nothing tomorrow. The only mandate for format (if any) should be plain Unicode text. --Desmond
In response to the above, I'd like also to propose that you not ask them to exclude embedded markup. It is doubtless useful to be able to access Unicode text versions of document data, but this is trivial to retrieve from any markup system; in the case of most marked-up documents, the bulk of the actual value of the document (in terms of scholarly work invested in creating the transcription) inheres in the markup itself. I think it is good practice to provide easy access to Unicode text versions of marked-up texts as part of an application that hosts and delivers them, and if the application itself is being curated, then this functionality needs to be supported by the curation system. --Martin
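Martin's claim that plain Unicode text is easy to retrieve from a marked-up document can be illustrated with the Python standard library. This is a sketch under simple assumptions: the sample document, its element names, and the `plain_text` function are hypothetical, and the input is well-formed XML without namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up fragment.
SAMPLE = """<text>
  <head>Chapter <num>1</num></head>
  <p>Call me <name type="person">Ishmael</name>.</p>
</text>"""


def plain_text(xml_text):
    """Drop all tags and attributes, keeping only the character data."""
    root = ET.fromstring(xml_text)
    # itertext() yields every text node in document order.
    joined = "".join(root.itertext())
    # Collapse the whitespace that was only there to indent the markup.
    return " ".join(joined.split())


print(plain_text(SAMPLE))  # Chapter 1 Call me Ishmael.
```

The reverse does not hold: once the tags are gone, the information that "Ishmael" is a person's name cannot be recovered from the plain text, which is the substance of the disagreement in the exchange above.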

Humanities Data is a lot bigger than books “The National Archives and Records Administration (NARA) is the Government agency that preserves and provides access to the U.S. Government's collection of documents recording the important events in American history. Our archival holdings number more than 10 billion pages of unique documents, many of them handwritten, and include formats such as maps, charts, aerial and still photographs, artifacts, and motion picture, sound, and video recordings.” --National Archives and Records Administration, Strategy for Digitizing Archival Materials for Public Access,