
1 Tools for Textual Data
John Unsworth, May 20, 2009
http://monkproject.org/

2 MONK: a case study
- Texts as data
- Texts from multiple sources
- Texts reprocessed into a new representation
- Different tools using the same data
- Interaction between tools and data
- Interaction between and among tools
- Interaction between users and data
- Questions for discussion

3 Texts as Data (1)
“The computer has no understanding of what a word is, but it follows instructions to 'count as' a word any string of alphanumerical characters that is not interrupted by non-alphabetical characters, notably blank space, but also punctuation marks, and some other symbols. 'Tokenization' is the name for the fundamental procedure in which the text is reduced to an inventory of its 'tokens' or character strings that count as words. This is an extraordinarily reductive procedure. It is very important to have a grasp of just how reductive it is in order to understand what kinds of inquiry are disabled and enabled by it.”
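The reduction described here can be made concrete in a few lines of code. The following is a minimal sketch of tokenization, not MONK's actual tokenizer (Morphadorner); the regular expression and the sample sentence are illustrative assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Count as a word any uninterrupted run of alphanumeric characters;
    blank space, punctuation, and other symbols act only as separators."""
    return re.findall(r"[A-Za-z0-9]+", text)

text = "Hee louyd hir depely; hee louyd hir well."
tokens = tokenize(text)

print(tokens)
# ['Hee', 'louyd', 'hir', 'depely', 'hee', 'louyd', 'hir', 'well']
print(Counter(t.lower() for t in tokens))
# Everything else about the text (lineation, emphasis, context) is discarded.
```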

4 Texts as Data (2)
“A word token is the spelling or surface form of a word. MONK performs a variety of operations that supply each token with additional 'metadata'. Take something like 'hee louyd hir depely'. This comes to exist in the MONK textbase as something like

hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep

Because the textbase 'knows' that the surface 'louyd' is the past tense of the verb 'love', the individual token can be seen as an instance of several types: the spelling, the part of speech, and the lemma or dictionary entry form of a word.” (Martin Mueller)
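A hedged sketch of what that adornment amounts to in data terms: each surface token carries a part-of-speech tag and a lemma, so the same token can be counted under any of those types. The data structure below is illustrative, not Morphadorner's actual output format; the tag strings are simply copied from the example above.

```python
from dataclasses import dataclass

@dataclass
class Token:
    spelling: str   # the surface form as it appears in the text
    pos: str        # part-of-speech tag
    lemma: str      # dictionary entry form

# The adorned example from the slide, 'hee louyd hir depely',
# expressed as structured data rather than underscore-joined strings.
tokens = [
    Token("hee",    "pns31", "he"),
    Token("louyd",  "vvd",   "love"),
    Token("hir",    "pno31", "she"),
    Token("depely", "av-j",  "deep"),
]

# Each token can now be counted as an instance of several types:
for t in tokens:
    print(f"spelling={t.spelling!r}  pos={t.pos!r}  lemma={t.lemma!r}")
```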

5 Texts as Data (3)
- Texts represent language, which changes over time (spellings)
- Comparison of texts as data requires some normalization (lemma)
- Counting as a means of comparison requires units to count (tokens)
- Treating texts as data will usually entail a new representation of those texts, to make them comparable and to make their features countable (a minimal sketch follows below).
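A minimal sketch of why normalization matters for comparison: raw spellings from texts of different periods do not line up, but counts keyed on the lemma do. The tiny spelling-to-lemma table here is an illustrative assumption, standing in for the morphological adornment step.

```python
from collections import Counter

# Hypothetical spelling-to-lemma table; a real one comes from the
# morphological adornment step, not from a hand-written dictionary.
LEMMA = {"louyd": "love", "loved": "love", "hee": "he", "he": "he",
         "hir": "she", "her": "she", "depely": "deep", "deeply": "deep"}

early = ["hee", "louyd", "hir", "depely"]
modern = ["he", "loved", "her", "deeply"]

print(Counter(early) == Counter(modern))            # False: spellings differ
print(Counter(LEMMA[w] for w in early) ==
      Counter(LEMMA[w] for w in modern))            # True: lemmas align
```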

6 Texts from Multiple Sources
Five aphorisms about textual data (causing tool-builders to weep):
- Scholars are interested in texts first, data second
- Tools are only useful if they can be applied to texts that are of interest
- No single collection has all texts
- No two collections will be identical in format
- No one collection will be internally consistent in format

7 Public MONK Texts
- Documenting the American South from UNC-Chapel Hill (1.5 Gb, 8.5 M words)
- Early American Fiction from the University of Virginia (930 Mb, 5.2 M words)
- Wright American Fiction from Indiana University (4 Gb, 23 M words)
- Shakespeare from Northwestern University (170 Mb, 850 K words)
- About 7 Gb, 38 M words in total

8 Restricted MONK Texts
- Eighteenth-Century Collections Online (ECCO) from the Text Creation Partnership (6 Gb, 34 M words)
- Early English Books Online (EEBO) from the Text Creation Partnership (7 Gb, 39 M words)
- Nineteenth-Century Fiction (NCF) from Chadwyck-Healey (7 Gb, 39 M words)
- About 20 Gb, 112 M words in total

9 Texts reprocessed into a new representation (1)
MONK ingest process:
1. TEI source files (from various collections, with various idiosyncrasies) go through Abbot, a series of XSL routines that transform the input format into TEI-Analytics (TEI-A for short), with some curatorial interaction.
2. “Unadorned” TEI-A files go through Morphadorner, a trainable part-of-speech tagger that tokenizes the texts into sentences, words, and punctuation, assigns IDs to the words and punctuation marks, and adorns the words with morphological tagging data (lemma, part of speech, and standard spelling).

10 Texts reprocessed into a new representation (2)
MONK ingest process (cont.):
3. Adorned TEI-A files go through Acolyte, a script that adds curator-prepared bibliographic data.
4. Bibadorned files are processed by Prior, using a pair of files defining the parts of speech and word classes, to produce tab-delimited text files in MySQL import format, one file for each table in the MySQL database.
5. cdb.csh creates a MONK MySQL database and imports the tab-delimited text files (a sketch of steps 4 and 5 follows below).
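A hedged sketch of the last two steps, under stated assumptions: Prior's actual table layout is not documented here, so the column names, file name, and table definition below are illustrative. The point is simply that each tab-delimited file maps to one MySQL table and is bulk-loaded.

```python
import csv

# Step 4 (sketch): write one tab-delimited file per database table.
rows = [
    ("w-0001", "hee",   "pns31", "he"),
    ("w-0002", "louyd", "vvd",   "love"),
]
with open("tokens.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Step 5 (sketch): a script like cdb.csh would create the database
# and run a bulk import along these lines for each table.
print("""
CREATE TABLE tokens (
  word_id  VARCHAR(16),
  spelling VARCHAR(64),
  pos      VARCHAR(16),
  lemma    VARCHAR(64)
);
LOAD DATA LOCAL INFILE 'tokens.tsv' INTO TABLE tokens
  FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n';
""")
```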

11 Texts reprocessed into a new representation (3)
Example source text: “ENTERED according to Act of Congress, in the year 1867, by A. SIMPSON & CO., in the Clerk's Office of the District Court of the United States for the Southern District of New York.”
[The slide shows the same passage again in its adorned representation.]
The new representation is about 10x the original in size, so 150 Mb becomes 1.5 Gb (roughly 90% metadata).
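A rough sketch of where the tenfold growth comes from: wrapping each word in an element that carries an ID, part of speech, lemma, and standard spelling multiplies the character count even before any bibliographic metadata is added. The element and attribute names below are illustrative assumptions, not the exact TEI-A markup.

```python
def adorn(word_id, spelling, pos, lemma, reg):
    # Illustrative adorned form; element and attribute names are assumptions.
    return (f'<w xml:id="{word_id}" pos="{pos}" lemma="{lemma}" '
            f'reg="{reg}">{spelling}</w>')

plain = "ENTERED"
adorned = adorn("d1e100", "ENTERED", "vvn", "enter", "entered")

print(adorned)
print(len(adorned) / len(plain))   # roughly a tenfold expansion for this token
```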

12 Problems Arising
“In the MONK project we used texts from TCP EEBO and ECCO, Wright American Fiction, Early American Fiction, and DocSouth -- all of them archives that proclaimed various degrees of adherence to the earlier Guidelines. Our overriding impression was that each of these archives made perfectly sensible decisions about this or that within its own domain, and none of them paid any attention to how its texts might be mixed and matched with other texts. That was reasonable ten years ago. But now we live in a world where you can [put] multiple copies of all these archives on the hard drive of a single laptop, and people will want to mix and match.”

13 ...and Aris-ing
“Soft hyphens at the end of a line or page were the greatest sinners in terms of unnecessary variance across projects, and they caused no end of trouble.... The texts differed widely in what they did with EOL phenomena. The DocSouth people were the most consistent and intelligent: they moved the whole word to the previous line.... DocSouth texts also observe linebreaks but don't encode them explicitly. The EAF texts were better at that and encoded line breaks explicitly. The TCP texts were the worst: they didn't observe line breaks unless there was a soft hyphen or a missing hyphen, and then they had squirrelly solutions for them. The Wright archive used an odd procedure that, from the perspective of subsequent tokenization, would make the trailing word part a distinct token.”
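A minimal sketch of the end-of-line problem described here: if a word is split with a soft hyphen at a line break and the halves are not rejoined before tokenization, the trailing part becomes a spurious token. The regular expression below is an illustrative fix, not the procedure any of these projects actually used.

```python
import re

raw = "the gentle read-\ner will forgive us"

# Naive tokenization treats 'read' and 'er' as two separate tokens.
print(re.findall(r"[A-Za-z]+", raw))
# ['the', 'gentle', 'read', 'er', 'will', 'forgive', 'us']

# Rejoin a word broken across a line end before tokenizing.
joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
print(re.findall(r"[A-Za-z]+", joined))
# ['the', 'gentle', 'reader', 'will', 'forgive', 'us']
```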

14 Different tools using the same data
- MONK Datastore
- Flamenco Faceted Browsing
- MONK extension for Zotero
- TeksTale Clustering and Word Clouds
- FeatureLens
- SEASR
- The MONK Workbench (Public)
- The MONK Workbench (Restricted)
Each of these is (at least one) separate application; some are actually several.

15 Workbench Architecture (1)
The MONK Workbench is a browser-based application written in Ext JS, a JavaScript library for building richly interactive web applications using techniques such as AJAX, DHTML, and DOM. The Workbench has components, like the component for creating a workset, and components often have a workflow: a notion of events that need to occur in a certain order.

16 Workbench Architecture (2)
The Workbench communicates over HTTP with middleware: Java code that interprets events occurring in components and translates those events into terms that the datastore or the analytics engine (SEASR) can understand. The MONK middleware also translates in the other direction, taking output from queries to the datastore, or from analytics operations, and handing the results back to components in the Workbench.
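The MONK middleware itself is Java; the following is a small Python sketch, under stated assumptions, of the translation role it plays: a Workbench component posts an event over HTTP, and the middleware turns it into a datastore query and hands the result back. The event names, SQL, and response shape are all illustrative, not MONK's actual protocol.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_datastore_query(sql, params):
    """Placeholder for the datastore call (e.g., a MySQL query)."""
    return [{"lemma": "love", "count": 42}]  # illustrative result only

class MiddlewareHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the component event sent by the browser-based Workbench.
        length = int(self.headers["Content-Length"])
        event = json.loads(self.rfile.read(length))

        # Translate the event into terms the datastore understands.
        if event.get("type") == "lemmaCounts":
            rows = run_datastore_query(
                "SELECT lemma, COUNT(*) FROM tokens "
                "WHERE workset = %s GROUP BY lemma",
                (event.get("workset"),),
            )
        else:
            rows = []

        # Hand the result back to the Workbench component as JSON.
        body = json.dumps(rows).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), MiddlewareHandler).serve_forever()
```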

17 Interaction between tools and data
- Tools can't operate on features unless those features are made available: for example, in order to find an author's favorite adjectives, you need an interface for asking that question and you need data that can answer it (a sketch follows below).
- In order to find patterns, the data and the interface have to support pattern-finding.
- In order to find all the fiction by women in a collection, your data has to include information about genre and gender, and your interface has to allow you to select those facets.
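As the first point above suggests, finding an author's favorite adjectives is only possible because the adorned data exposes part of speech and lemma as queryable features. In this sketch the token list and the adjective tag are illustrative assumptions; in MONK the tokens would come from the datastore and the tag set would be NUPOS.

```python
from collections import Counter

# Adorned tokens as (spelling, pos, lemma) triples; the data and the
# simple 'j' adjective tag here are illustrative assumptions.
tokens = [
    ("depely", "av-j", "deep"),   ("gentle", "j", "gentle"),
    ("sweet",  "j",    "sweet"),  ("sweete", "j", "sweet"),
    ("gentle", "j",    "gentle"), ("olde",   "j", "old"),
]

# Counting on the lemma collapses variant spellings like 'sweet'/'sweete'.
adjectives = Counter(lemma for _, pos, lemma in tokens if pos == "j")
print(adjectives.most_common(3))
# [('gentle', 2), ('sweet', 2), ('old', 1)]
```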

18 Interaction between/among tools
- Flamenco requires a slightly different data source from the other tools in MONK, partly because it is meant to feed Zotero, so it needs COinS metadata.
- The HTML interface to the MONK datastore uses the same data source that is used by the MONK Workbench and by TeksTale.
- FeatureLens needs a unique index, and it needs one index per collection.

19 Interaction between users and data
- Users like simple interfaces, but simple interfaces limit complex operations
- Users may want to operate on features that are not available in the data representation
- Users create data by using tools, not only as an end result but all along the way: state information, for example, or information about the series of operations performed in order to produce a result
- Users may also want to correct or improve data

20 Questions for Discussion
- If different tools require different data representations, how should those representations be related, derived, maintained?
- What might be the characteristics of a “lowest-common-denominator” format for data that will need to be reprocessed into other representations?
- What principles would allow you to answer the following questions in particular cases?

21 Questions for Discussion
- How much manual/curatorial intervention is acceptable, and what options do you have if what's acceptable is less than what's necessary?
- Under what circumstances could tools have a normative impact on the practices of people who build and maintain collections?
- Under what circumstances could data have a normative impact on the practices of people who build and maintain tools?

22 Questions for Discussion
- Should users be allowed to change, correct, or improve data? If so, under what constraints or conditions?
- Should those who provide collections also host the computational tools that will be used on them? Why or why not?
- Should those who provide collections also collect the results of work done on their collections? Why or why not?
- What is the purpose of data curation?

23 Workbench Screen Captures
- Classification
- Comparison
http://monkproject.org/

