Download presentation
Presentation is loading. Please wait.
Published byAlexia Franklin Modified over 6 years ago
1
The HathiTrust Research Center’s Tools for Text Analysis with Digitized Text from the HathiTrust Digital Library
2
Community Notes tinyurl.com/dlf-htrc
3
About us Eleanor Dickson HTRC Digital Humanities Specialist
Scholarly Commons, University Library University of Illinois at Urbana-Champaign Sayan Bhattacharyya Postdoctoral Research Associate Graduate School of Library and Information Science
4
HathiTrust ecosystem
5
HathiTrust ecosystem
6
About the HathiTrust Digital Library
Repository 13+ million volumes | 3+ billion pages 50% of volumes are in English Material from the 15th C. on | 20th C. concentration 70% in copyright or undetermined | 30% open Interface Search and read books in the public domain
7
HathiTrust ecosystem
8
HathiTrust Research Center
HTRC for Text Analysis at scale from HTDL Digitized text Scanned & OCR-ed HathiTrust Research Center Computational methods provided tools and services E.g. Word counts, classification, topic modeling Analysis
9
About the HTRC Content provider (digitized text)
Digital Public Library of America Internet Archive And more… Tool provider (computational methods) Voyant MALLET
10
HT Points of Access HathiTrust Digital Library HTRC datasets
HT data request HathiFiles Bib and Data APIs HTRC datasets Extracted Features dataset Genre in English Language Literature (Underwood) HTRC Data Capsule HTRC Portal and Workset Builder
11
Portal & Workset Builder
A set of tools for assembling collections of digitized text and performing text analysis on them
12
The HTRC Portal Create account with institutional ID
From non-profit institution of higher-ed Access existing worksets (collections) Create new worksets in Workset Builder Run algorithms for text analysis Also: Generate script to create an Extracted Features dataset for any given workset Access HTRC Data Capsule
13
Worksets A workset is a user-created collection of text from the HathiTrust Digital Library. Textual datasets Built through searching OCR data and/or metadata In full text and titles By author, time period, subject classification, etc. Currently contains non-copyrighted material NOT digitized by Google
14
Why worksets? Worksets allow a researcher to curate subcorpora of text relevant to her Building research collections is familiar activity for humanists Worksets can be shared and cited
15
HTRC Algorithms Functions for text analysis
Extract, refine, analyze, and visualize textual data Built-in to Portal Some customization/optimization available, but mostly static tools For users who don’t require, or don’t know how to write, personalized algorithms
16
HTRC Algorithms Extracted Features Rsync Script Generator
MARC Downloader Meandre Classification Naïve Bayes Meandre Dunning Log-Likelihood to Tag Cloud Meandre OpenNLP Date Entities to Simile Meandre OpenNLP Entities List Meandre OpenNLP Report per Volume Meandre Tag Cloud Meandre Tag Cloud with Cleaning Meandre Topic Modeling Simple Deployable Word Count
17
Basic Portal Workflow
18
Go to Portal
19
HTRC Portal
20
Sign in / Sign up
21
Sign in
22
Sign in
23
Choose to create workset
If you choose to create a workset, you will switch into the Workset Builder, which is a search interface for creating a sub-collection of volumes in the HTDL. You can see there are also options to see “My Worksets” (ones you’ve made before), “All Worksets” (your worksets plus ones other users created and made public), or “Upload Workset” (which is an advanced feature for users who have HTDL volume IDs already and want to add metadata).
24
Log in to access Workset Builder
The first time you cross the internal bridge to the Workset Builder, you will be asked to reauthenticate yourself. This is an artifact of the system. You can check the box that says “Don’t show this again” if you don’t want to see the message again.
25
Build a workset NB: Click “More options” for advanced search options
26
Workset Builder: Search Results
An author search for “Austen, Jane”. Select all or some of the returned search items for your workset. Filtering results by “Austen, Jane, ” reduces results from 3,148 to a more manageable 390.
27
Workset Builder: Search Results
Once texts are selected, click “Selected Items”, located in upper right-hand menu
28
Workset Builder: Create Workset
When you’re done selecting items, review your choices, and select “Create/Update Workset”
29
Workset Builder: Workset Metadata
You can also alter or remove existing worksets by choosing “Manage Worksets” NB: you have the option to add selected items to an existing workset or to create a new workset.
30
Return to Portal
31
Questions?
32
Using the Portal Algorithms
After you’ve created a workset, you’ll be ready to analyze it. Return to the Portal and click on algorithms to see the list of the systems available algorithms.
33
Analysis in the HTRC Portal
You’ll see the list of possible algorithms. Clicking execute will allow you to run your chosen algorithm on a specified workset.
34
Prepare to run an algorithm
35
Prepare to run an algorithm
Make sure to mention that the user should enter a name of their choosing in the blank field for “input job name”, and that this is the same name that will show up later as “Job Title” when looking at results.
36
Prepare to run an algorithm
You will have to choose one or more worksets to run your algorithm on. You can use the radio buttons to see just your own worksets or a list of all worksets.
37
Choose workset(s) for analysis
Choose a workset from the list.
38
Run the analysis… Once you hit “submit” to start your job, you’ll be taken to a screen with your job history. You’ll see active jobs at the top. Notice that the status changes when you refresh the screen. And you’ll see completed jobs below.
39
View Results
40
Questions?
41
HathiTrust + Bookworm A tool for visualizing and analyzing word usage trends in the HathiTrust Digital Library The HTRC team includes personnel who combine an interest in humanistic research based on texts and their interpretation with openness towards the possibility of scientific and statistical analyses of corpora.
42
HT+BW Team Members: Funded by an NEH Implementation Grant (2014-2016)
Current J. Stephen Downie, University of Illinois, Urbana-Champaign Erez Lieberman Aiden, Baylor College of Medicine Benjamin Schmidt, Northeastern University Robert McDonald, Indiana University Loretta Auvil, University of Illinois, Urbana-Champaign Sayan Bhattacharyya, University of Illinois, Urbana-Champaign Muhammad Shamim, Baylor College of Medicine Peter Organisciak, University of Illinois, Urbana-Champaign Past Colleen Fallaw, University of Illinois, Urbana-Champaign Matt Nicklay, Baylor College of Medicine Funded by an NEH Implementation Grant ( )
43
The HathiTrust Digital Library (HT)
HT+BW Components The HathiTrust Digital Library (HT) Text data (OCR-ed) Metadata (classification, genre, dates, format, etc.) + The Bookworm tool was created at the Cultural Observatory (originally at Harvard, now at Rice), which previously collaborated with Google to create the Google Ngram Viewer; the lead developer of Bookworm, Benjamin Schmidt, is now on the faculty at Northeastern. The HathiTrust and Bookworm have similar interests but complementary expertise: the HathiTrust creates corpora, which Bookworm is designed to analyze. Bookworm (BW) (A tool that can visualize language usage trends within any repository of digitized text)
44
Bookworm originally developed by Erez Lieberman Aiden
HT+Bookworm helps answers questions like: How does language change over time? Bookworm originally developed by Erez Lieberman Aiden Co-creator of Google Books Ngram Viewer Partner in the HathiTrust+Bookworm project Erez was interested in how irregular verbs tend to become regular over time Originally studied by hand (Nature, 2007) Later, computationally (Science, 2010)
45
Change of past tense in verbs: an example
This is a simple demonstration of what Lieberman Aiden studied. Words like “burnt” that were irregular in the past tense have gradually been replaced by the more regular –ed past tense form. ‘burned’ (blue graph ) and ‘burnt’ (orange graph)
46
Change of past tense in verbs: an example — “burned” vs. “burnt” (contd.)
This is a simple demonstration of what Lieberman Aiden studied. Words like “burnt” that were irregular in the past tense have gradually been replaced by the more regular –ed past tense form. To understand in detail what is happening, need to facet the query!
47
Plotting lexical usage trends for “lady” versus “woman”
A hypothesis about ‘woman’ and ‘lady’: With the rise of relatively more freedom and independence for women, and more democratization and decrease of class hierarchies, we should expect to see: some decline in normalized counts of occurrences of "lady”; and some uptick in normalized counts of occurrences of "woman" in fiction. Hypothesis seems to be corroborated: As we reach modern times, occurrences of "lady" start declining in fiction occurrences of "woman" increase.
48
“lady” versus “woman” from 1780 to 1922
This is a simple demonstration of what Lieberman Aiden studied. Words like “burnt” that were irregular in the past tense have gradually been replaced by the more regular –ed past tense form. ‘lady’ (blue graph ) and ‘woman’ (orange graph)
49
Explore bookworm! bookworm.htrc.illinois.edu
Search for a word relevant to your research or personal interests Use only 1 word Search another word in an additional field Experiment with faceted searching Analyze your results Discuss with person sitting next to you
50
Future Development of HT+BW
Integration of HTRC worksets Plot trends within custom worksets Create worksets from plotted trends Hierarchical metadata Hierarchical faceting APIs Will likely make integration of a bookworm with many digitized collections possible
51
New bookworms being developed by our project partners with new and more interesting visualizations
Streamgraph version of Bookworm exposing Library of Congress categories helping to understand: How sociological changes have consequences for fields a word is used in How words pick up additional metaphorical senses over time
52
New bookworms being developed by our project partners with new and more interesting visualizations (contd.) The word ‘translation’ seems to start becoming important in the category ‘law’ starting in the 1980s:
53
New bookworms being developed by our project partners with new and more interesting visualizations (contd.) Metaphorical senses of the word “depression”:
54
Extracted Features Dataset
A function for downloading selected data and metadata elements from a workset
55
Extracted Features (EF) dataset defined
Downloadable page- and volume-level features extracted from a workset Currently only pre-1923, public domain volumes available Downloaded as JSON files Provides non-consumptive access at scale to volumes otherwise restricted by copyright
56
Extracted Features (EF) dataset: uses
EF dataset includes part-of-speech-tagged words and their frequencies Any algorithm that works with "bags of words" will work with the EF dataset Ex. topic modeling, Dunning’s log-likelihood estimation for “compare-and-contrast”,… etc.
57
Extracted Features (EF) Dataset: location
Available at:
58
Partial view of an EF data file json (“basic”)
"body":{ "tokenCount":205, "lineCount":35, "emptyLineCount":9, "sentenceCount":6, "tokenPosCount":{ "striking":{"JJ":1}, "London":{"NNP":1}, "four":{"CD":1}, ".":{".":7}, "dramatic":{"JJ":2}, "stands":{"VBZ":1}, ... "growth":{"NN":1} } }, "footer":{ "features":{ … "header":{ "tokenCount":7, "lineCount":3, "tokenPosCount":{ "I.":{"NN":1}, "THE":{"DT":1}, "INTRODUCTION":{"NN":1}, …. "AND":{"CC":1}}},
59
Extracted Features (EF) data has utility even for out-of-copyright material
Page-level extracted features (EF) provide “a sense of the content” of each workset at the level of individual works. EF dataset allows the flexibility of developing/running one’s own queries and algorithms against the text data, without depending on provided algorithms
60
A use case for “Basic” EF data
Charles Dickens’ novels Little Dorrit and Bleak House are a bit similar… …in that they both speak about the injustices of the 19th century British judicial system. However, Little Dorrit and Bleak House are also different. Little Dorrit: emphasis is on civil law rather than prisons. Bleak House: prisons and incarceration play an important role We can corroborate this using EF data for the two novels: Count (using the ‘grep -o’ command at the shell command-line) the number of pages in which the word ‘prison’ occurs at least once, for: the workset LittleDorrit1; and for the workset BleakHouse1 Compare these two counts
61
Occurrences of “prison” in Bleak House and Little Dorrit
Count (separately for Little Dorrit and Bleak House) the number of pages in which the word ‘prison’ occurs at least once: Downloaded EF dataset corresponding to the LittleDorrit1 workset: hvd basic.json grep –o prison hvd basic.json | wc –l Returns: Downloaded EF dataset corresponding to the BleakHouse1 workset: miun.aca8482,0001,001.basic.json grep –o prison miun.aca8482,0001,001.basic.json | wc –l Returns: Compare these two counts: 173 >> 29 (‘Prison’ is way more prevalent in Bleak House than in Little Dorrit !) (Q.E.D.)
62
Occurrences of “prison” in Bleak House and Little Dorrit (caveats)
We simply counted the occurrences of the word “prison” within each json. This does give us a rough estimate, but to make a precise, accurate computation, you should actually parse the json to count the frequency of occurrences of “prison”. Another simplification we made was to simply count occurrences of the exact word “prison”. For a large sized sample, this ought not matter so much. However, as an exercise, you may want to redo the activity by considering any expression that contains the string prison — e.g. “imprisonment”, “prisoner”, “prisons”, “prisoners”, etc.
63
Provides workset-level derived metadata and word usage statistics
Showcase: Community contribution to enhance Extracted Features Dataset: Eric Morgan Lease’s Workset Browser Eric Lease Morgan, Digital Initiatives Librarian at the University of Notre Dame, has recently contributed a ‘Workset Browser’ toolkit Makes use of EF data for a workset to provide analysis of content and metadata information for that workset. Provides workset-level derived metadata and word usage statistics Provides visualizations (as well as text lists) of “’big’ names” and “’great’ ideas” occurring in the workset Walkthrough of a 32-item ‘Thoreau’ workset as example of what is possible.
64
Page-level maps of genre from English-language monographs in the HathiTrust Digital Library, (Ted Underwood’s work)
65
Goal: What genres have been extracted? To show how:
Page-level maps of genre from English-language monographs in the HathiTrust Digital Library, (Ted Underwood’s work) Goal: To show how: literary scholars can use machine learning for selecting genre-specific collections from digital libraries What genres have been extracted? “Imaginative literature” Prose fiction, drama, poetry Assumed to be non-overlapping, discrete categories
66
Choice of genres Why these genres?
Prose fiction, drama, poetry, non-fiction prose, paratext Assumed to be non-overlapping, discrete categories Why these genres? Broad categories like these easier to map As we descend from broad categories to subgenres (e.g. “detective fiction”, “verse tragedy”, etc.), taxonomies become more like “folksonomies” less sharp categories
67
HathiTrust and Genre Dataset
Genre Mapping project (Ted Underwood) and Feature Extraction project (HathiTrust) both mapped volumes at page level Created opportunity for collating results to produce genre-specific feature datasets The provided genre datasets aggregate pages at the volume level E.g.: The provided fiction dataset: contains only volumes that were identified as containing fiction aggregates the wordcounts for only those pages that were identified as fiction
68
Assumptions behind how genre was mapped
Genre as “empirical social problem” Not as constituted by “formal characteristics” Different strategies adopted for different aspects of the problem Overall strategy is supervised machine learning, mapping each page to a genre Training set created by group of five human readers Performed closed readings of training set and assigned categories 1 graduate student and 4 undergraduate students Agreed with each other only 76% of the time But agreed 94.5% of time about broad categories
69
Assumptions behind how genre was mapped (contd.)
Predictive rather than explanatory model of genre Explanatory models: Try to identify key factors behind a phenomenon Identify deep structure Capture causal logic Predictive models: Do not try to capture causal logic Establish mapping between predictor variables and response variables Do not require identification of key variables Do not require structured data Employ a “skeptical epistemology” Appropriate for the humanities as the humanities grapple with concepts that are not well-understood
70
Overview of how genre was mapped (contd.)
Different combinations of methods and features were tried Best method was found to be: regularized logistic regression “regularization” in a supervised learning algorithm is a technique to prevent overfitting Best features were selected to be 1062 features maximizing predictive accuracy (by optimizing a weighted average of precision and recall) Consisted of: 1034 words 36 “structural features” (other page-level or volume-level information) With this best methods and best feature set, the regularized logistical regression model was trained for each genre: using a one-versus-all training (each genre contrasted to all other genres collectively) page-sequence information about genre was used to train a Hidden Markov Model that smoothed page-level predictions
71
How the training data was selected
To ensure representativeness of sampling of training data: a previous round of volume-level genre classification using Naïve Bayes classification served as guide to dividing corpus into volumes for subsequent random sampling Training data labeled by hand to match specific goal of project: Separating the data into a few broad categories that literary critics distinguish between in practice: Prose fiction, Drama, Poetry Everything else categorized as prose non-fiction Historical specificity was traded off for increased training data volume: Rather than train different models for each century separately, as much training data as possible was used: One training set covering the years (to make predictions about volumes in this range) Another training set covering the years (to make predictions about volumes in the range )
72
Empowering users to tune tradeoff between precision and recall
Users were allowed to trade off precision against recall as needed: This means: Allowing users to make the genre definitions more inclusive (higher recall but lower precision); or less inclusive (higher precision but lower recall) This capability was provided at volume level, not at page level as most users are likely to treat texts as volumes. Volume-level confidence metrics were computed and recorded: on the basis of: page-level confidence metrics; and additional volume-level evidence
73
Empowering users to tune tradeoff between precision and recall (contd
Users can define their own corpus of fiction, poetry or drama, either by: setting whatever confidence threshold they desire; and pulling out volumes that meet that threshold or by: choosing from some special pre-packaged “filtered” collections for these three genres (which have been provided by identifying, for each genre, confidence thresholds that increase precision for relatively little loss of recall)
74
limited to various thresholds of confidence.
Empowering users to tune tradeoff between precision and recall (contd.) Precision-recall graph for fiction datasets (left) and poetry datasets (right), limited to various thresholds of confidence. These curves are “incorporated” into the JSON metadata for all volumes where any pages of poetry, drama or fiction are predicted. So, a researcher concerned (for example) that the provided dataset excludes too much poetry, can recreate his/her own filtered dataset with a confidence threshold that creates his/her desired precision/recall trade-off.
75
Feature engineering Best features were selected to be 1062 features maximizing predictive accuracy Selected by grouping pages into categories corpus was planned to be classified into Top 500 words selected from each category Words from each category grouped into a master list limited to the top N most frequent words Consisted of: 1034 words 26 “structural features” (other page-level or volume-level information): per-page features: wordsPerLine, totalWords (highly predictive for fiction), capRatio (highly predictive for poetry), etc. volume-level features: metaFiction (information in metadata indicating fiction), etc. 10 categories: e.g. personalname, placename, propernoun, romannumeral, arabic1digit, etc.
76
Performance of model On broad categories:
Prose fiction, drama, poetry, non-fiction prose, paratext: Achieved 93.5% accuracy overall Compared to 94.5% for the 5 human readers Weak but slight correlation between pages where model failed, and pages where human readers disagreed Performance always has a tradeoff between precision and recall Precision: Better precision means minimization of the proportion of false positives to true positives Recall: Better recall means minimization of the proportion of false negatives to true negatives
77
Plotting trends within a specific genre dataset (e.g. fiction)
Chris Forster (Syracuse University English professor) recently wrote some code in R that helps in exploring the genre dataset. One such script plots normalized frequencies for any word in a given genre E.g. A plot of normalized frequencies for the word ‘america’ in the fiction dataset: mentions of "america" remained fairly constant until the 1830s or so: then started slowly and steadily rising:
78
Plotting trends within the fiction dataset (contd
Plotting trends within the fiction dataset (contd.) “lady” versus “woman” The plot of normalized occurrences of "lady" has an inverted 'v'-shaped curve: slowly and steadily rising until about 1810, and slowly and steadily falling subsequently. Plot of normalized occurrences of 'woman' remain more or less constant until about 1840, subsequently keeps rising slowly and steadily. Hypothesis seems borne out: As we reach modern times, occurrences of "lady" start declining in fiction occurrences of "woman" increase.
79
Plotting trends within the fiction dataset (contd
Plotting trends within the fiction dataset (contd.) “lady” (top plot) versus “woman” (bottom plot)
80
An environment for performing non-consumptive text analysis
HTRC Data Capsule An environment for performing non-consumptive text analysis
81
Introduction to the Data Capsule
A proof-of-concept for allowing non-consumptive research on protected texts in a secure environment Restricts flow of data in and out of environment.
82
Data Capsule Simply put:
It’s a user-accessible virtual machine that can be locked-down
83
Data Capsule Researcher is assigned a virtual machine (VM)
Can configure the VM as if one’s own computer download custom tools/algorithms to it Can load VM with HTRC textual data Only when it’s in "secure mode” Network and other data channels are restricted After analysis of the textual data, results ed to the user, text not
84
How it works Photo from Mindat.org
85
Breakout sessions Teaching Campus resource Research
How could the HTRC—or other text analysis tools—support scholarship and learning in your community?
86
Breakout sessions – use cases
Research A graduate student with some technical proficiency is interested in conducting a stylometric (changes in style of writing) analysis of government publications across cities and time periods. How would you help her get started? Teaching A faculty member on your campus is interested in introducing digital methods to his undergraduate literature class, but he doesn’t expect the students to learn computer programming. What tools could you recommend? Campus resource A digital humanities professor would like textual data for her students to use in a class assignment. Where could you send her to bulk download text?
87
Advanced Collaborative Support (ACS)
HTRC supports computational research by external researchers on material in the HathiTrust ACS calls for proposals are floated by HTRC approximately once a year Next ACS CFP likely to be some time in 2016 Selectees allocated dedicated support/time for one-on-one consultation with HTRC developers
88
Advanced Collaborative Support (ACS): Some of 2015’s projects
HTRC teams began providing support in Spring 2015 Projects required access at scale, using half a million to a million volumes each A few examples: Literary Geography at Scale Matthew Wilkens, University of Notre Dame Off the Books Michelle Alexopoulos, University of Toronto Taxonomizing the Texts: Towards Cultural-scale models of full text Colin Allen & Jaimie Murdock, Indiana University
89
Advanced Collaborative Support (ACS): Some of 2015’s projects (contd.)
Literary Geography at Scale Matthew Wilkens, University of Notre Dame Geolocation of world literature Will extract and geocode place names from eleven million volumes in HathiTrust (including 20th and 21st century texts) Pilot analysis completed on 10,000 randomly completed volumes Next step: Processing of the entire corpus
90
Advanced Collaborative Support (ACS): Some of 2015’s projects (contd.)
Off the Books Michelle Alexopoulos, University of Toronto Creation of new measures of technological innovation and diffusion Analyzes diffusion of technology-related terms in the corpus Will lead to mapping of waves of innovation over time and space Will aid in understanding how technological change produces economic change
91
Advanced Collaborative Support (ACS): Some of 2015’s projects (contd.)
Taxonomizing the Texts: Towards Cultural-Scale Models of Full Text Colin Allen & Jaimie Murdock, Indiana University Tests and visualizes topic models leveraging the HTRC Data Capsule Shows the relationship between a topic model based on a random sample of volumes and the entire category from which it is drawn Proof-of-concept models Library of Congress subject headings and visualizes them using their Topic Explorer tool
92
Jaimie Murdock and Colin Allen’s Topic Explorer
Visualization for document-document relations and topic-document relations Interactive Supports rapid experimentation for interpretive hypothesis On clicking a topic: documents will reorder according to that topic’s weight in each document topic bars will reorder according to the topic weights in the highest weighted document Keeps the following two elements visible: comparative topic distribution document composition Supports indirect comparison of different topic models (with different numbers of topics) Uses multiple windows, each with a model with a different number of topics
93
Jaimie Murdock and Colin Allen’s Topic Explorer (interface)
94
Jaimie Murdock and Colin Allen’s Topic Explorer (interface) [contd.]
95
Word similarity tool – David Mimno
Cornell University project with computing similarity based on HTRC Extracted Features Dataset Word similarity tool – David Mimno
96
Acknowledgements: HTRC Team
Illinois (GSLIS and the University Library): Indiana University Beth Plale Robert McDonald Angela Courtney Nicholae Cline Miao Chen Samitha Liyanage Leena Unnikrishnan Milinda Pathirage Leanne Mobley Stephen Downie Ryan Dubnicek Beth Sandore Namachichivaya Harriett Green Peter Organisciak Tim Cole Megan Senseney Loretta Auvil Sayan Bhattacharyya Boris Capitanu Eleanor Dickson
97
Get Involved! HTRC Announcements: htrc-announce-l @ list.indiana.edu
HTRC User Group: list.indiana.edu
98
Questions? Eleanor Dickson Visiting HTRC Digital Humanities Specialist
Sayan Bhattacharyya Postdoctoral Research Associate, GSLIS
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.