Research in the Humanities: Content Analysis

Willow J Benavides 397C April 10, 2012

Agenda
Explanation of content analysis
Overview of the Feinberg reading/review of class comments
Paper 1: Visualizing Repetition in Text
Paper 2: Literature Fingerprinting: A New Method for Visual Literary Analysis
Questions

Content vs. Discourse Analysis
Research in the humanities often involves data that is subjective and qualitative, not as easily quantified and studied as other forms of data.
Two approaches presented in the readings: content analysis and discourse analysis.
Content analysis looks at different ways of approaching the raw data of a given text and drawing meaningful information from it, either in terms of prevalence or placement in the text.
“Content analysis is considered a scholarly methodology in the humanities by which texts are studied as to authorship, authenticity, or meaning. This latter subject include philology, hermeneutics, and semiotics” (Dr. Farooq Joubish)
Discourse analysis looks at the social context and discourse in which the data was constructed, and at the ultimate meaning of the data (e.g., data as capta).

“Information Studies, the Humanities, and Design Research: Interdisciplinary Opportunities”
Melanie Feinberg
Multidisciplinary approach to research.
Research in the humanities and in design is open-ended, not seeking one definitive answer.
Iterative process: synthesizing information from different sources to present one perspective or possible interpretation of data.

Visualizing Repetition in Text
Stan Ruecker, Milena Radzikowska, Piotr Michura, Carlos Fiorentino, Tanya Clement
Computing in the Humanities Working Papers A.46, July 2008
Survey of different interfaces designed to help humanities researchers conduct text mining focused on repetition.

Authors

Dr. Stan Ruecker
Associate Professor of Humanities Computing in the Department of English and Film Studies at the University of Alberta.
PhD research: the affordances of prospect for computer interfaces to large, interpretively-tagged text collections.
Research interests: browsing interfaces for electronic documents, computer-human interfaces, humanities visualization, and information design.

Milena Radzikowska
Associate Professor in Information Design, Faculty of Communication Studies, Mount Royal University.
MDes degree in visual communication design.
Research interests: visual communication, interface and information design, HCI, usability, text visualization, and digital humanities.
“design wizard and lover of die hard”

Visual Interface Design for Digital Cultural Heritage: A Guide to Rich-Prospect Browsing, by Stan Ruecker, Milena Radzikowska, and Stefan Sinclair

Dr. Tanya Clement
Full-time faculty at UT Austin iSchool.
PhD in English Literature and Language, 2009, University of Maryland, College Park.
Research interests: digital humanities, digital libraries, humanities data curation, literary study, modernism, scholarly information infrastructure, scholarly publishing and communication.

Carlos Fiorentino
Visual Communication Design Program in the Department of Art and Design at the University of Alberta.
MDes in visual communication design, University of Alberta, Canada, 2008.
Research interests: introducing DfS in design programs, digital humanities research studio (DHRS) at the University of Alberta, information design and visualizations for decision support, and interface design for several academic and industrial projects.

Piotr Michura
Lecturer at Industrial Design Faculty, Academy of Fine Arts in Krakow, Poland.
M.F.A. in Graphic Design.
PhD candidate in the Department of Typography and Graphic Communication at the University of Reading.
Research interests: visual communication design, typography in the context of text visualization.

MONK Project: Metadata Offer New Knowledge
Digital environment designed to help humanities scholars discover and analyze patterns in the texts they study.
Based on the work of WordHoard (at Northwestern University) and NORA (Non-obvious relationship awareness, AKA No One Remembers Acronyms).
Supported by a Mellon grant; publicly available resource.
“Over the last decade, many millions of dollars have been invested in creating digital library collections: at this point, terabytes of full-text humanities resources are publicly available on the web. Those collections, dispersed across many different institutions (not only libraries but also publishers) are large enough and rich enough to provide an excellent opportunity for text-mining, and we believe that web-based text-mining tools will make those collections significantly more useful, more informative, and more rewarding for research and teaching.”

Scholarly Primitives: common basic activities of humanities research
Discovering, Annotating, Comparing, Referencing, Sampling, Illustrating, Representing, Deconstructing, Contextualizing

Text Mining for Repetition
N-grams: the typical unit of search.
Grams might be words, lemmas, or parts of speech.
Example: matching “trouble is brewing” with “trouble was brewing”.
Pattern: “trouble [verb] brewing” or “trouble [verb] lemma:brew”.
Choosing which items should be counted as repetitive is a task that requires consideration of factors such as the psychology of writing and reading, as well as what is possible technically.
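The pattern idea above can be sketched in a few lines of Python. This is only an illustration, not the matching logic of any of the surveyed tools: the tiny `LEXICON` of lemmas and parts of speech is hand-written and hypothetical, standing in for a real tagger and lemmatizer, and the function names are assumptions.

```python
# Hypothetical hand-written annotations: word -> (lemma, part of speech).
# A real system would obtain these from a tagger and lemmatizer.
LEXICON = {
    "trouble": ("trouble", "NOUN"),
    "is": ("be", "VERB"),
    "was": ("be", "VERB"),
    "brewing": ("brew", "VERB"),
}


def annotate(words, lexicon):
    """Turn each word into a (word, lemma, pos) triple; unknown words get pos OTHER."""
    return [(w, *lexicon.get(w.lower(), (w.lower(), "OTHER"))) for w in words]


def gram_matches(token, spec):
    """Check one gram against one pattern element.

    spec is a literal word, a part of speech in brackets ("[verb]"),
    or a lemma written as "lemma:brew".
    """
    word, lemma, pos = token
    if spec.startswith("[") and spec.endswith("]"):
        return pos == spec[1:-1].upper()
    if spec.startswith("lemma:"):
        return lemma == spec[len("lemma:"):]
    return word.lower() == spec.lower()


def find_pattern(tokens, pattern):
    """Return the start index of every n-gram that matches the pattern."""
    specs = pattern.split()
    n = len(specs)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(gram_matches(t, s) for t, s in zip(tokens[i:i + n], specs))
    ]


tokens = annotate("Trouble is brewing and trouble was brewing".split(), LEXICON)
print(find_pattern(tokens, "trouble [verb] brewing"))
print(find_pattern(tokens, "trouble [verb] lemma:brew"))
```

Both patterns find the two phrasings, while a purely literal search for “trouble is brewing” finds only the first.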

Ideal Interfaces for Text Mining
Must automatically generate lists of n-grams.
Reading view that allows the user to see n-grams in context.
User can search for particular words, phrases, or patterns.
User has access to the complete set of repeated phrases.
Tools for manipulating the results.
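As a rough sketch of the first two requirements (a generated list of repeated n-grams and a view of each one in context), the following assumes whitespace-separated words and case-insensitive matching. The function names and the `width` parameter are illustrative assumptions, not part of any interface described in the paper.

```python
from collections import defaultdict


def repeated_ngrams(words, n):
    """Map every n-gram that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(words) - n + 1):
        gram = tuple(w.lower() for w in words[i:i + n])
        positions[gram].append(i)
    return {gram: starts for gram, starts in positions.items() if len(starts) > 1}


def in_context(words, start, n, width=3):
    """Return the n-gram at `start` with up to `width` words on either side."""
    left = max(0, start - width)
    return " ".join(words[left:start + n + width])


words = "the sea was calm and the sea was cold".split()
for gram, starts in repeated_ngrams(words, 3).items():
    for start in starts:
        print(" ".join(gram), "|", in_context(words, start, 3, width=2))
```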

FeatureLens: HCI at U of Maryland
4 distinct viewing aspects:
Frequent patterns
Sections overview
Collection overview
Document overview (text view)
User has the ability to choose trends of repetition across the document.

dialR
Assumes a PC environment, with a regular monitor and user interface.
Text appears as a volume of space and the goal is experimentation.
The dialR interface has a series of radar screens across the bottom and a set of transparent sheets in the main area.
The user provides a word or phrase and the system scans for passages that include that word or phrase, highlighting results in the transparent sheets.
A reading view is available on the right and can be configured to contain either continuous prose or excerpted sections.

Repetition Graph
Removes the conventions of the codex and analyzes the “rudimentary building blocks of text”: words in sequence.
Plots the words as a string in space, providing a visual thumbprint of repetition in the text.
“In order to consider genuine visual alternatives, it is useful to strip away as many page conventions as possible and start from scratch in the new medium. By starting with a radically reduced representation of the document, we have the opportunity to consider configurations of text that can be developed through two-dimensional and three-dimensional manipulations of the text string.”

3 Underlying Paradigms

Object Manipulation Model: Bradley (2005)
Using computers to augment researchers’ everyday practices through the flexibility of digital text.

Direct Manipulation Paradigm: Shneiderman (1997)
Entities, operations/actions, relations, and results of interest to the user are represented in a continuous manner.
Meaningful physical actions are utilized (such as drag/drop or button pressing).
Results of appropriately rapid, gradual/incremental, and reversible actions are visible immediately.

Principles of Diagrammatic Typography
Mapping the results of analysis directly onto the appearance of text through variations in its typographic features.

Figure 6: By connecting the ends of the lines in Figure 5, we create a 3D version of the repetition graph, with the repeated word or phrase running down the spine and the length of text between repetitions indicated by the size of the loops.
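The quantity that drives the loop sizes in this figure, the length of text between successive repetitions, is simple to compute. The sketch below is an assumption about how one might derive it, not the authors' implementation: it counts the words lying between the end of one occurrence and the start of the next, and it assumes occurrences do not overlap.

```python
def loop_sizes(words, phrase):
    """Number of words between consecutive occurrences of `phrase`.

    Matching is case-insensitive. Occurrences are assumed not to overlap;
    overlapping matches would produce negative gaps.
    """
    target = [w.lower() for w in phrase.split()]
    n = len(target)
    lowered = [w.lower() for w in words]
    starts = [i for i in range(len(lowered) - n + 1) if lowered[i:i + n] == target]
    # Gap = start of the next occurrence minus the end of the previous one.
    return [nxt - (prev + n) for prev, nxt in zip(starts, starts[1:])]


words = "the sea was calm and the sea was cold but the sea endured".split()
print(loop_sizes(words, "the sea"))
```

Each number in the result would correspond to one loop hanging off the spine; a larger number means a longer stretch of text before the phrase recurs.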

Figure 7: Multiple loops can be used to show different kinds of information, as selected by the user. The loops on one side, for instance, could consist of reported dialogue, while the other is for narrative prose. Another possible split is active versus passive voice.

Walls of text: the novel as slot machine
Predicated on the availability of high-resolution wall-sized displays.
Presents the entire document in a single view consisting of micro-text columns, one for each repetition of a repeated n-gram, aligned in relation to a reading slot in the middle of the pane.

Figure 8: The wall-sized display allows the system to present a full copy of the novel for each repetition. Although the columns are microtext, they are aligned along a reading slot that magnifies the repeated phrase and its immediate context.

Conclusions: Everything is still in the (iterative) design stage.
Ruecker has a protocol for studying interfaces prior to prototyping them, designed to address the following user issues:
What affordances, if any, emerge from the use of the different interfaces?
What techniques can be used to prevent information overload?
Which functions do humanities scholars perceive in the various interfaces as being of potential benefit in conducting their research?
How do users interact with the various interfaces?

Literature Fingerprinting: A New Method for Visual Literary Analysis
Daniel Keim, Daniela Oelke
Proceedings of the 2007 IEEE Symposium on Visual Analytics Science and Technology
Most textual analysis only considers one feature at a time. The authors wanted to consider several features in concert, to get a particular signature or “fingerprint” for a specific author.

Authors

Dr. Daniel Keim
Professor of Computer Science and Information Science at the University of Konstanz, Germany.
Research interests: data analysis and visualization, GIS, data mining.

Dr. Daniela Oelke
Assistant Professor of Computer Science at the University of Konstanz, Germany.
Research interests: visual document analysis.

Computer-based Literary Analysis
Levels of analysis: lexical, syntactic, semantic.
Text classification:
Topic-oriented classification: often uses TF-IDF.
Genre classification: grammatical structure, layout of text, TF-IDF.
Authorship attribution: extracted features should not be consciously controllable by the writer, to prevent the method from being misdirected by a forged text.
Literary criticism: sequence analysis, translation criticism.
While computer analysis has made great progress in lexical and syntactic analysis, it is not as helpful in the field of semantics at this point, as words are often context dependent.
TF-IDF = term frequency, inverse document frequency.
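For readers unfamiliar with TF-IDF, here is a minimal sketch of one common, unsmoothed variant: term frequency is a term's share of the tokens in a document, and the inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the term. The slides do not say which variant the cited work uses, so the exact weighting here is an assumption.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    documents: list of token lists.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["whale", "sea", "sea"], ["river", "sea"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that appears in every document ("sea" here) receives weight zero, which is why TF-IDF favors words that are characteristic of a topic.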

Variables for Literary Analysis
Statistical Measures: word length, syllables per word, sentence length, proportions of parts of speech
Vocabulary Measures: frequency of specific words, categories of words, Simpson’s Index, Hapax Legomena
Syntax Measures: syntax tree of sentences

Vocabulary Measures
N = number of tokens (i.e., word occurrences that form the sample text)
V = number of types (lexical units that form the vocabulary in the sample)
Vr = number of lexical units that occur exactly r times
Type-token ratio (R): R = V / N, the number of types divided by the number of tokens.
Simpson’s Index (D): the probability that two arbitrarily chosen words will belong to the same type. D is calculated by dividing the total number of identical pairs by the number of all possible pairs.
Hapax Legomena (V1): the number of words in a text that occur exactly once.
Honoré measure: tests the tendency of an author to choose between a previously used word or a new one. Based on Hapax Legomena (V1). Stable for texts with N > 1300.
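The measures above can be computed directly from token counts. In this sketch the type-token ratio and Simpson's Index follow the definitions given on the slide. The Honoré formula, 100 · log N / (1 − V1/V), is the commonly cited form and is an assumption here, since the slide does not state it. The function name and the returned dictionary keys are illustrative.

```python
import math
from collections import Counter


def vocabulary_measures(tokens):
    """Compute vocabulary richness measures for a list of tokens (needs N >= 2)."""
    N = len(tokens)                      # number of tokens
    type_counts = Counter(tokens)
    V = len(type_counts)                 # number of types
    V1 = sum(1 for c in type_counts.values() if c == 1)  # hapax legomena

    R = V / N                            # type-token ratio

    # Simpson's Index: ordered identical pairs over all ordered pairs,
    # which equals identical pairs divided by all possible pairs.
    D = sum(c * (c - 1) for c in type_counts.values()) / (N * (N - 1))

    # Honoré measure in its commonly cited form (assumed, not from the slide).
    # Undefined when every type occurs exactly once, so report infinity.
    honore = 100 * math.log(N) / (1 - V1 / V) if V1 < V else float("inf")

    return {"N": N, "V": V, "V1": V1, "R": R, "D": D, "honore": honore}


print(vocabulary_measures("a a b b c".split()))
```

On very short samples these values are not meaningful; the slide's note that the Honoré measure is stable only for N > 1300 applies.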

Authorship Attribution Case Study
The fingerprinting method could be useful for determining authorship when it is unknown or in dispute.
Took 6 Jack London novels and 10 Mark Twain novels (freely available through Project Gutenberg).
Removed Gutenberg-specific information and chapter titles, then split each document into text blocks of approx. 10,000 words per block.
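The block-splitting step might look like the sketch below. It assumes whitespace tokenization and assumes that a short trailing block is merged into the previous one so that no block is much smaller than the target size. The slides do not say how the authors handled the remainder, so that rule and the function name are illustrative choices.

```python
def split_into_blocks(text, block_size=10_000):
    """Split text into consecutive blocks of roughly `block_size` words."""
    words = text.split()
    blocks = [words[i:i + block_size] for i in range(0, len(words), block_size)]
    # Assumption: fold a trailing block shorter than half the target size
    # into its predecessor instead of keeping a very small final block.
    if len(blocks) > 1 and len(blocks[-1]) < block_size // 2:
        tail = blocks.pop()
        blocks[-1].extend(tail)
    return blocks


sample = " ".join(["word"] * 23)
print([len(block) for block in split_into_blocks(sample, block_size=10)])
```

Each resulting block can then be scored with a measure such as average sentence length or vocabulary richness, giving one value per block rather than one per book.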

Visual Layout
Each text block is shown as a colored square, aligned left to right, top to bottom.
“Perception of a trend is easiest when displayed on a closed area with no borders visible.”
Values are modulated by color: if a measure is able to discriminate the two authors, the books of one will be mainly blue, the other mainly red.
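A minimal sketch of this layout: blocks are placed in reading order on a grid, and each block's value is mapped onto a blue-to-red scale. The linear two-color interpolation is an assumption made for illustration; the slides do not specify the color map the authors used, and the function names are hypothetical.

```python
def grid_positions(n_blocks, columns):
    """(row, column) for each block, filled left to right, top to bottom."""
    return [(i // columns, i % columns) for i in range(n_blocks)]


def blue_red(value, low, high):
    """Map a value to an (r, g, b) color: blue at `low`, red at `high`.

    Values outside the range are clamped. A simple linear scale, assumed
    here for illustration only.
    """
    if high == low:
        t = 0.5
    else:
        t = min(1.0, max(0.0, (value - low) / (high - low)))
    return (round(255 * t), 0, round(255 * (1 - t)))


block_values = [12.0, 14.5, 21.0, 19.5, 13.0]  # hypothetical per-block scores
low, high = min(block_values), max(block_values)
for (row, col), value in zip(grid_positions(len(block_values), 3), block_values):
    print(row, col, blue_red(value, low, high))
```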

Visualization of the two novels “The Iron Heel” and “Children of the Frost” by Jack London. Color is mapped to vocabulary richness. It can easily be seen that the structure of the two novels is very different with respect to this measure. This would be camouflaged if only a single value for each book would be calculated.

Anomalous Findings
Huckleberry Finn by Mark Twain stood out as being very different from the rest of Twain’s work.
Authors speculated that it could have been because of the “language peculiarities” of the southern dialect, or even the presence of a ghost writer.

Comparing two novels whose sentence lengths are roughly the same: Jerry of the Islands by Jack London and Following the Equator by Mark Twain.
The Mark Twain work contained passages of text that were quoted, not the original work of the author, and the fingerprinting method worked to identify them.

Fig. 5: Visual Fingerprint of the Bible. More detailed view on the Bible in which each pixel represents a single verse and verses are grouped to chapters. Color is again mapped to verse length. The detailed view reveals some interesting patterns that are camouflaged in the averaged version of Fig. 4.

The authors performed the visualization technique on their own paper to optimize its readability by identifying the longest sentences and modifying them.

Conclusion
The authors used several literature analysis measures in conjunction to create visual fingerprints of a variety of texts.
They presented a case study of the approach on the works of Mark Twain and Jack London, and demonstrated that the method can be used to distinguish one author from another.
“Encouraged by these results, we plan to extend our ideas to other areas of literature analysis such as language development and translation criticism.”

Questions
