
1 Digital methods for Literary Criticism
Fabio Ciotti, University of Rome Tor Vergata

2 Methodological Intersections…
This lecture aims to present a critical overview of the principal methods adopted in computational literary analysis. We are really talking about methodological intersections:
Literary Criticism and History of Literature
Theory of literature
Computational Linguistics
Statistics and Probability studies
Computer science
Machine learning
Data mining
…

3 The wider context: Digital literary studies
A recent coinage, but more and more successful: the application of computational methods and the use of digital tools to study literary texts and related phenomena. In fact it is one of the fundamental assets of DH since its very origin:
Digital Scholarly Editing and Digital Philology
Text Encoding and digital annotation
Text analysis and Computational criticism
Quantitative sociology of literature
Hypertext and new media studies
Electronic Literature
…

4 What we will talk about…
Methodological issues in computational literary criticism
Traditional quantitative approaches
Distant reading approaches
A critical stance towards distant reading approaches
Annotation and ontological modeling in literary analysis

5 Methodological issues in computational literary criticism
The role of modeling and the methodological foundation of DH
Distant reading vs close reading
The exploratory approach and distant reading

6 Modeling and DH methodology
Since its very origin, when it was still called Humanities Computing, the DH domain has been characterized by the strong relevance of methodological issues:
“At its core, then, digital humanities is more akin to a common methodological outlook than an investment in any one specific set of texts or even technologies.” [Matthew G. Kirschenbaum, What Is Digital Humanities and What’s It Doing in English Departments?]
The central terms in this theoretical and methodological debate have been model and modeling…
… again quite difficult to define!

7 Modeling
The most thorough treatment of the concept is due to Willard McCarty:
“By ‘modeling’ I mean the heuristic process of constructing and manipulating models; a ‘model’ I take to be either a representation of something for purposes of study, or a design for realizing something new. These two senses follow Clifford Geertz’s analytic distinction between a denotative ‘model of’, such as a grammar describing the features of a language, and an exemplary ‘model for’, such as an architectural plan.”
This distinction is not completely defined, and it “also reaches its vanishing point in the convergent purposes of modelling; the model of exists to tell us what we do not know, the model for to give us what we do not yet have. Models realize.”
W. McCarty, Modeling: A Study in Words and Meanings

8 Modeling
“We use the term ‘model’ in the following sense: To an observer B, an object A* is a model of an object A to the extent that B can use A* to answer questions that interest him about A. The model relation is inherently ternary. Any attempt to suppress the role of the intentions of the investigator B leads to circular definitions or to ambiguities about ‘essential features’ and the like. It is understood that B’s use of a model entails the use of encodings for input and output, both for A and for A*. If A is the world, questions for A are experiments. A* is a good model of A, in B’s view, to the extent that A*’s answers agree with those of A, on the whole, with respect to the questions important to B.”
Marvin L. Minsky, Matter, Mind and Models
The model must be determined, isomorphic to the domain, and at the same time dependent on the perspective of the community that has responsibility for it.

9 Formal Modeling
I prefer to adopt a notion of modeling strongly connected with that of formalization. Formalization is to be understood as a set of semiotic and representational methods that generates a representation of a phenomenon/object (or a set of them) that is accessible and (at least partially) algorithmically computable.
Formal models:
Mathematical (physics theories)
Logical (axiomatizations)
Statistical
Computational (data structures, programs, simulations…)

10 Modeling: a functional taxonomy
Representational/descriptive modeling: aimed at summarizing or representing the domain and its data structure in a formal, compact manner. Unlike explanatory modeling, in descriptive modeling the reliance on an underlying causal theory is absent or incorporated in a less formal way, although we can say that modeling always encompasses a theory of the domain to be modeled. Example: text encoding.
Explanatory modeling: explaining is providing a causal explanation, and explanatory modeling is the use of formal models for testing causal explanations.
Predictive modeling: the process of applying a formal model or computational algorithm to data for the purpose of predicting new or future observations.
Cf. Galit Shmueli, To Explain or to Predict?

11 Distant reading
Understanding literary phenomena by analyzing (computationally) massive amounts of textual data.
The groundbreaking steps in this direction are due to the Stanford Literary Lab, founded and directed by Franco Moretti and Matthew Jockers.
Moretti himself has attempted to give a literary-theoretical rationale to these experimentations, introducing the notion of “distant reading”.

12 Distant reading
The basic idea is that there are synchronic or diachronic literary and cultural facts and phenomena that are undetectable by the usual deep reading and local interpretation methods and that require the scrutiny of hundreds or thousands of texts and documents (and millions of lexical tokens). In this way we can gain access to otherwise unknowable information that plays a significant explanatory role in understanding literary and cultural facts and history/evolution.
“… the trouble with close reading (in all of its incarnations, from the new criticism to deconstruction) is that it necessarily depends on an extremely small canon. This may have become an unconscious and invisible premise by now, but it is an iron one nonetheless: you invest so much in individual texts only if you think that very few of them really matter. [...] At bottom, it’s a theological exercise— very solemn treatment of very few texts taken very seriously— whereas what we really need is a little pact with the devil: we know how to read texts, now let’s learn how not to read them. Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes— or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, Less is more.” [Franco Moretti]

13 Close versus Distant Reading
Close reading is the act of analyzing one work (or a small set of works) based upon deep reading and interpretation of local features and aspects of its formal structure or content.
For example, one could analyze Goethe’s Faust based on the usage of metaphors, the meanings of individual words, and comparisons drawn with preceding (or following) works based upon the Faust myth. Another example would be to compare and contrast two characters from two different works based upon the content of their interactions with other characters or situations (e.g. Ulysses/Bloom).
The notion of close reading is attributed to the theories of Richards and of the American New Critics, but we can say that in general it is the main methodological approach of most of 20th-century literary scholarship, from Russian Formalism to Structuralism, Semiotics and even Poststructuralism (with some exceptions in the sociology of literature and in British Cultural Studies).

14 Close versus Distant Reading
Distant reading focuses on analyzing big or huge sets of works, usually adopting quantitative methods to examine a determined set of quantifiable textual features, in order to investigate and explain literary and cultural macro-phenomena such as:
the evolution of genres
the affirmation of a style and its reception
the presence of a recurrent content/theme in a given time span of literary history
the notion of influence and intertextuality
the sociological facts and aspects of literature
Moretti’s central idea is that this quantitative formalism is the only way to study literature without restricting attention to the Canon of the “great works”.

15 Distant reading: methods and tools
Data mining/machine learning heuristics and social network analysis are the preferred methods for distant reading, in that they make it possible to search for implicit recurring patterns and regular schemes inside wide amounts of unstructured or poorly structured data, usually not visible to the naked eye:
topic modeling: the search for patterns of lexical tokens co-occurring with a noticeable frequency inside a text or a corpus
text clustering: the use of statistical clustering algorithms applied to specific textual features to automatically classify texts into significant categories
sentiment analysis: giving a quantitative value to the emotional valence of sentences and texts by means of an emotional metric attributed to a set of lexical items
network analysis: a set of methods and strategies for studying the relationships inside groups of n individuals, based on graph theory

16 The exploratory approach and distant reading
Doing research without prior formal modeling and theorizing is a tacit assumption of many recent works based on Data Analytics and Distant Reading.
The general underlying idea is the application of data mining and machine learning heuristics to search for implicit recurring patterns and regular schemes inside wide amounts of unstructured (or poorly structured) data, usually not visible to the naked eye.
In most of these applications, the phase of technical analysis precedes theoretical modeling in the research process:
Problem => Data => Analysis => (Model) => Explanation

17 Big Data and distant reading
The problem with this exploratory way of doing research on humanities objects is that a lot of interpretation and theory is already involved at the level of data building. So we have a double modeling phase, the former occurring before (and governing how) we build the data set, and the latter occurring before (and governing) the analysis:
Problem => Data Model => Data => Model => Analysis => Explanation
Humanities objects are intentional objects, and it is very difficult to find anything relevant without a previous hypothetical model of what we are looking for.
Starting from (presumed) raw data we can draw many different conclusions without having any acceptable criteria for deciding which is the best one, the one that best explains our phenomenal data.

18 Before we start: the data
To use Big Data methods you need big textual data collections:
Existing scholarly collections (TCP, DTA, WWP, OTA, BibIt…): high quality (on average), medium to low size data sets, usually in XML format
Existing non-scholarly collections (Gutenberg, Liber Liber…): medium to low quality, medium to big data sets, usually in plain-text format (Unicode, hopefully)
Huge collections (HathiTrust, Internet Archive, Google Books): low quality (uncleaned OCR), huge data sets, plain-text format
Do it yourself: scanning and OCR. Commercial or open source OCR works well with modern printed books, but you have to do a lot of cleaning and correction; some of it can be automated using regular expressions (see “Cleaning OCR’d text with Regular Expressions”), as in the sketch below.
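To make the regex idea concrete, here is a minimal Python sketch of such a clean-up pass; the page-number and running-header patterns are invented examples and must be adapted to the layout of the actual scanned edition, and "scanned_novel.txt" is a placeholder file name.

```python
import re

def clean_ocr_page(text):
    """Very small OCR clean-up pass; every pattern here is illustrative
    and must be adapted to the layout of the actual scanned edition."""
    # Drop lines containing only a page number
    text = re.sub(r"^\s*\d{1,4}\s*$", "", text, flags=re.MULTILINE)
    # Drop a (hypothetical) running header repeated on every page
    text = re.sub(r"^\s*THE HISTORY OF TOM JONES\s*$", "", text,
                  flags=re.MULTILINE | re.IGNORECASE)
    # Re-join words hyphenated across line breaks: "exam-\nple" -> "example"
    text = re.sub(r"-\n(\w)", r"\1", text)
    # Collapse the blank lines left over from the removals above
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text

print(clean_ocr_page(open("scanned_novel.txt", encoding="utf-8").read()))
```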

19 Cleaning the data…
In most cases the data set must undergo a cleaning/preprocessing phase to reduce the dimensionality (complexity) of the data set and to eliminate data that could invalidate the analysis: running headers, page numbers, grammatical morphemes (???!)
Filtering: remove words that bear little or no content information, like articles, conjunctions, prepositions, etc.
Lemmatization: methods that try to map verb forms to the infinitive and nouns to the singular form
Stemming: methods that try to reduce word sets to common base forms (stems)
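A minimal preprocessing sketch showing the three operations side by side, assuming NLTK is installed and its "punkt", "stopwords" and "wordnet" data packages have been downloaded; the sample sentence is invented.

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

text = "The roses were blooming in the gardens of the old house."
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

filtered = [t for t in tokens if t not in stop_words]   # filtering: stop words out
lemmas = [lemmatizer.lemmatize(t) for t in filtered]    # lemmatization: "gardens" -> "garden"
stems = [stemmer.stem(t) for t in filtered]             # stemming: "blooming" -> "bloom"

print(filtered, lemmas, stems, sep="\n")
```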

20 Traditional quantitative text analysis
The atomic datum is usually the word or orthographic form (intended as the maximal sequence of coded characters that is meaningful in a textual sequence).
“There is some evidence that letter sequences and information about parts of speech sometimes work better than words for authorship attribution, but words have the advantage of being meaningful in themselves and in their significance to larger issues like theme, characterization, plot, gender, race, and ideology” (Hoover)
Full text search (or contextual full text search)
Basic statistical analysis:
Frequency lists
Collocates
Text comparison
Concordances (KWIC lists)

21 Frequency list
A sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus or document.
Sort order can be alphabetic (ascending or descending), by frequency, by z-score or by tf-idf score.
The position in the list occupied by a single word is called its rank.
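A minimal frequency-list sketch in plain Python; "novel.txt" is a placeholder for any plain-text document, and the whitespace/word-character tokenization is deliberately crude.

```python
from collections import Counter
import re

text = open("novel.txt", encoding="utf-8").read()   # any plain-text document
tokens = re.findall(r"\w+", text.lower())            # crude tokenization

freq = Counter(tokens)
# frequency list sorted by descending frequency; rank = position in this list
for rank, (word, count) in enumerate(freq.most_common(20), start=1):
    print(rank, word, count)
```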

22 Collocates
Collocates: the list of words that occur more frequently near a given word within a given context.
http://wordhoard.northwestern.edu/userman/analysis-collocates.html
Useful to study linguistic phenomena like grammatical concordance; it can also give insight into thematic aspects or semantic clusters that characterize a text.
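A minimal collocate-counting sketch (not the WordHoard implementation): it simply counts the words occurring within a fixed window around a node word. The node word "love", the window size and the file name are placeholders.

```python
from collections import Counter
import re

def collocates(tokens, node, window=4):
    """Count words occurring within +/- `window` tokens of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

tokens = re.findall(r"\w+", open("novel.txt", encoding="utf-8").read().lower())
print(collocates(tokens, "love").most_common(15))
```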

23 Text comparison
“Instead of using features to sort documents into categories, you start with two categories of documents and contrast them to identify distinctive features” [Ted Underwood]
Knowing that individual word forms in one text occur more or less often than in another text may help characterize some generic differences between those texts.
The log-likelihood ratio (introduced into computational linguistics by Dunning) is a common statistic for assessing the size and significance of the difference in a word’s frequency of use between the two texts. The log-likelihood ratio measures the discrepancy of the observed word frequencies from the values which we would expect to see if the word frequencies (by percentage) were the same in the two texts. The larger the discrepancy, the larger the value of the statistic, and the more statistically significant the difference between the word frequencies in the texts. Simply put, the log-likelihood value tells us how much more likely it is that the frequencies are different than that they are the same.
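A minimal sketch of the standard two-corpus log-likelihood calculation for a single word (the usual 2x2 contingency formulation of Dunning's statistic); the counts in the example call are invented.

```python
from math import log

def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) for one word: a, b = occurrences of the word in
    corpus 1 and corpus 2; c, d = total tokens in corpus 1 and corpus 2."""
    e1 = c * (a + b) / (c + d)   # expected count in corpus 1
    e2 = d * (a + b) / (c + d)   # expected count in corpus 2
    g2 = 0.0
    if a > 0:
        g2 += a * log(a / e1)
    if b > 0:
        g2 += b * log(b / e2)
    return 2 * g2

# e.g. a word occurring 120 times in a 500,000-token corpus
# versus 40 times in a 400,000-token corpus (figures are invented)
print(round(log_likelihood(120, 40, 500_000, 400_000), 2))
```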

24 Concordance
A concordance is a list of the words (types) used in a text or a corpus, listing every instance of each word with its immediate linguistic context.
Historically concordances have taken two forms (but computational systems have opted for the first one):
KWIC (Key Word In Context)
KWOC (Key Word Out of Context)
A concordance is a bridge between qualitative and quantitative analysis of a text: since it gives access to the actual segment of text containing the word, its output remains a linguistic unit amenable to human interpretation.
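A minimal KWIC sketch in plain Python; the keyword and the file name are placeholders.

```python
import re

def kwic(text, keyword, width=40):
    """A minimal Key Word In Context display: every occurrence of `keyword`
    with `width` characters of context on each side."""
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + width].replace("\n", " ")
        print(f"{left:>{width}} [{m.group()}] {right}")

kwic(open("novel.txt", encoding="utf-8").read(), "rose")
```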

25 Tools for text analysis
Local standalone tools: AntConc, Concordance, MonoConc, WordSmith, TXM
Web applications and services: WordHoard, PhiloLogic, Voyant, TextGrid

26 Text mining
Text mining refers generally to the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents.
The notion of “unstructured” must be taken with caution, since no digital information set can really be unstructured: we had better talk of levels or degrees of structuring of the data.
Overall methods:
Supervised methods: text classification
Unsupervised methods: clustering and topic modelling

27 Supervised: classification
The categories are known a priori.
A program can learn to correctly distinguish texts by a given author, or learn (with a bit more difficulty) to distinguish poetry from prose, tragedies from history plays, or “gothic novels” from “sensation novels”.
The researcher provides examples of the different categories (the training set), but doesn’t have to specify how to make the distinction: algorithms can learn to recognize a combination of features that is the “fingerprint” of a given category.
After the training the algorithm can be applied to a wider, non-categorized data set.
Many kinds: Naïve Bayes, decision trees / random forests, support vector machines, neural networks, etc. There is no single “best” one: performance is domain- and dataset-specific.
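A minimal supervised-classification sketch with scikit-learn (TF-IDF features plus a Naïve Bayes classifier); the training passages and labels are toy placeholders standing in for a real labelled corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training set: a few labelled passages per author
train_texts = ["passage known to be by Author A ...",
               "another passage by Author A ...",
               "passage known to be by Author B ...",
               "another passage by Author B ..."]
train_labels = ["A", "A", "B", "B"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)        # learn the "fingerprint" of each class

# Apply the trained model to a wider, non-categorized set
print(model.predict(["an unattributed passage ..."]))
```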

28 Unsupervised: clustering and topic modelling
A program can subdivide a group of documents using general measures of similarity instead of predetermined categories. This may reveal patterns you don’t expect.
Two kinds of unsupervised learning:
Single-membership clustering: each document is assigned to one category -> clustering
Mixed-membership clustering: a document may be assigned to multiple categories, each with a different proportion -> topic modeling
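A minimal single-membership clustering sketch with scikit-learn (TF-IDF plus k-means); the documents are placeholders and the number of clusters is chosen arbitrarily.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["text of document one ...", "text of document two ...",
        "text of document three ...", "text of document four ..."]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # one cluster id per document (single membership)
```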

29 The (meta)data model: text as multidimensional vector or “bag of words”
Here are two simple text documents:
(1) Rose is a rose is a rose is rose.
(2) A rose is a rose is a rose is an onion.
Based on these two text documents, a list is constructed:
[ “rose”, “is”, “a”, “an”, “onion” ]
which has 5 distinct words. Using the indexes of the list, each document is represented by a 5-entry vector:
(1) [4, 3, 2, 0, 0]
(2) [3, 3, 3, 1, 1]
Each entry of the vectors refers to the count of the corresponding entry in the list. For example, in the first vector (which represents document 1), the first two entries are “4, 3”. The first entry corresponds to the word “rose”, which is the first word in the list, and its value is “4” because “rose” appears in the first document 4 times. Similarly, the second entry corresponds to the word “is”, which is the second word in the list, and its value is “3” because it appears in the first document 3 times.
This vector representation does not preserve the order of the words in the original sentences.
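The same toy example, reproduced with a hand-rolled bag-of-words count in plain Python.

```python
import re
from collections import Counter

vocab = ["rose", "is", "a", "an", "onion"]
docs = ["Rose is a rose is a rose is rose.",
        "A rose is a rose is a rose is an onion."]

for doc in docs:
    counts = Counter(re.findall(r"\w+", doc.lower()))   # case-folded token counts
    print([counts[w] for w in vocab])
# -> [4, 3, 2, 0, 0]
# -> [3, 3, 3, 1, 1]
```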

30 Topic modelling
The hype of the moment!!!
Topic models are algorithms for discovering the main lexical clusters (themes??) that characterize a large collection of documents. Topic models can organize the collection according to the discovered topics.
Topic modeling algorithms can be adapted to many kinds of data. Among other applications, they have been used to find patterns in genetic data, images, social networks…
Each document is modeled as a mixture of categories or topics:
a document is a probability distribution over topics
a topic is a probability distribution over words

31 Various algorithms for topic modelling
Latent Semantic Analysis: based on a TF-IDF word-document matrix and a linear algebra computation. Basically, the more often words are used together within a document, the more related they are to one another.
Latent Dirichlet Allocation: based on a Bayesian probabilistic approach; the most used now!

32 LDA rationales
A very simplistic generative model for text:
a document is a bag of topics
a topic is a bag of words
See “LDA Buffet” by Matt Jockers.

33 LDA rationales
If I can generate a document using this model, I can also reverse the process and infer, given any new document and a topic model I’ve already generated, which topics the new document draws from. But what if we start from a bunch of texts with no previously defined topics? Here is the trick:
Step 1: You tell the algorithm how many topics you think there are. You can either use an informed estimate (e.g. results from a previous analysis), or simply trial and error. In trying different estimates, you may pick the one that generates topics at your desired level of interpretability, or the one yielding the highest statistical certainty (i.e. log likelihood).
Step 2: The algorithm assigns every word to a temporary topic. Topic assignments are temporary, as they will be updated in Step 3. Temporary topics are assigned to each word in a semi-random manner (according to a Dirichlet distribution, to be exact). This also means that if a word appears twice, each occurrence may be assigned to a different topic.
Step 3 (iterative): The algorithm checks and updates topic assignments, looping through each word in every document. For each word, its topic assignment is updated based on two criteria: how prevalent is that word across topics? how prevalent are the topics in the document?
A minimal sketch of this workflow with an off-the-shelf library follows below.
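A minimal topic-modelling sketch with Gensim's LDA implementation (one of the tools listed later in this lecture). The toy documents are pre-tokenised placeholders, stopword removal is assumed to have been done already, and the number of topics corresponds to the Step 1 choice above.

```python
from gensim import corpora, models

texts = [["rose", "garden", "bloom", "petal"],
         ["ship", "sea", "storm", "sailor"],
         ["rose", "petal", "sea", "storm"]]          # toy pre-tokenised documents

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words vectors

# Step 1: you choose the number of topics; Steps 2-3 (random initialisation
# and iterative re-assignment) happen inside the training call.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=50, random_state=1)

print(lda.print_topics())                            # each topic = distribution over words
print(lda.get_document_topics(corpus[0]))            # each document = distribution over topics
```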

34 What’s in a topic?
But what is a topic discovered by LDA topic modeling? This is something that the researcher must decide by means of… interpretation!
A theme (in the literary meaning)? A discourse (Underwood)? A sparse semantic cluster?
Are apparently incoherent topics interesting, or are they a demonstration of the method’s failure to give insights for literary explanation?
http://www.lisarhody.com/some-assembly-required/
An open debate…

35 Sentiment analysis
Giving a quantitative value to the emotional valence of sentences and texts by means of an emotional metric attributed to a set of lexemes.
Matt Jockers’ application to plot analysis: the syuzhet controversy.
“In the field of natural language processing there is an area of research known as sentiment analysis or, sometimes, opinion mining. And when our colleagues engage in this kind of work, they very often focus their study on a highly stylized genre of non-fiction: the review, specifically movie reviews and product reviews. The idea behind this work is to develop computational methods for detecting what we, literary folk, might call mood, or tone, or sentiment, or perhaps even refer to as affect. The psychologists prefer the word valence, and valence seems most appropriate to this research of mine because the psychologists also like to measure degrees of positive and negative valence.”
“I discovered that fluctuations in sentiment can serve as a rather natural proxy for fluctuations in plot movement.”
http://www.matthewjockers.net/2015/02/02/syuzhet/
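A toy illustration of the underlying idea (not the syuzhet package itself): score each sentence with a small valence lexicon and smooth the resulting sequence to obtain an emotional trajectory across the text. The lexicon and the sample text are invented for the example.

```python
import re

valence = {"love": 1.0, "joy": 1.0, "happy": 0.8, "hope": 0.5,
           "dark": -0.5, "fear": -0.8, "death": -1.0, "grief": -1.0}

def sentence_valence(sentence):
    words = re.findall(r"\w+", sentence.lower())
    return sum(valence.get(w, 0.0) for w in words)

def trajectory(text, window=2):
    """Sentence-by-sentence valence, smoothed with a simple moving average
    (a crude stand-in for syuzhet-style smoothing)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    scores = [sentence_valence(s) for s in sentences]
    return [sum(scores[max(0, i - window):i + window + 1]) /
            len(scores[max(0, i - window):i + window + 1])
            for i in range(len(scores))]

sample = ("They were happy and full of hope. The garden was full of joy. "
          "Then the dark days came. Grief and death followed. "
          "But love returned at last.")
print(trajectory(sample))
```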

36 Network analysis
Network analysis: a set of methods and strategies for studying the relationships inside groups of n individuals, based on graph theory.
Each individual constitutes a node and each relation an edge (or arc) connecting two nodes; the resulting network is a formal and highly abstract model of the group’s internal relational structure.
Some mathematical properties of the network can be computed and used as proxies for qualitative aspects of the domain. Network analysis is also very appealing because it can easily be turned into very attractive and (often) explanatory graphic visualizations.
https://dhs.stanford.edu/algorithmic-literacy/topic-networks-in-proust/

37 Network analysis
“A network is made of vertices and edges; a plot, of characters and actions: characters will be the vertices of the network, interactions the edges, and here is what the Hamlet network looks like: […] once you make a network of a play, you stop working on the play proper, and work on a model instead. You reduce the text to characters and interactions, abstract them from everything else, and this process of reduction and abstraction makes the model obviously much less than the original object […] but also, in another sense, much more than it, because a model allows you to see the underlying structures of a complex object.” [Franco Moretti, Distant Reading]
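A minimal character-network sketch with NetworkX: characters are nodes and co-presence is an edge. The short list of character pairs is hand-made for illustration only, not derived from the actual play.

```python
import networkx as nx

G = nx.Graph()
interactions = [("Hamlet", "Horatio"), ("Hamlet", "Claudius"),
                ("Hamlet", "Gertrude"), ("Claudius", "Gertrude"),
                ("Hamlet", "Ophelia"), ("Ophelia", "Polonius"),
                ("Claudius", "Polonius")]
G.add_edges_from(interactions)   # characters = vertices, interactions = edges

# Graph-theoretic measures used as proxies for a character's structural role
print(nx.degree_centrality(G))
print(nx.betweenness_centrality(G))
```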

38 Tools for distant reading
MALLET
Topic Modeling Tool (a Java GUI for MALLET)
The R language with R packages (from Matt Jockers’ web site)
Stanford Topic Modeling Toolbox
Gensim (Python library)

39 A critical stance towards Distant reading
Some critical and methodological reflections on the weaknesses of massive quantitative methods.
The proxy fallacy: one cannot simply use one measurement as a proxy for something else when the effectiveness of that proxy is assumed rather than actually explored or tested in any way. “There are clear layers between the written word and its intended meaning, and those layers often depend on context and prior knowledge. Further, regardless of the intended meaning of the author, how her words are interpreted in the larger world can vary wildly.” [Scott Weingart, The Myth of Text Analytics and Unobtrusive Measurement]

40 A critical stance towards Distant reading
Data mining algorithms in general are independent of context (they can be applied indifferently to stock exchange transactions or to very large textual corpora). They identify similarities and recurring patterns independently of the semantics of the data. Humanities and literary data, however, are heavily contextualized.
Text mining methods are agnostic toward the granularity of the data to which they are applied. Texts are only sequences of n-grams, and the probabilistic rules adopted to calculate the relevance of a given set of n-grams are completely independent of whether the units of analysis are individual coded characters or linguistic tokens of greater extension.
If a very large textual set is composed of documents spread over a long period of time, diachronic variation in the form and usage of the language (both on the syntactic and the semantic level) can invalidate purely quantitative and statistical measures.

41 A critical stance towards Distant reading
Data in literary studies do not precede formal modeling; on the contrary, they are the product of modeling. It is very dubious to innocently assume a data set as the starting point of a meaningful analysis.
Meaning in literary texts is multi-layered, and some layers do not have a direct lexicalization, or they have a very complex and dispersed one (think of aspects of a narrative text at different abstraction levels, like anaphora, themes, plot and fabula, actants). Purely quantitative analysis applies only to the textual “degré zéro”, on which the secondary modeling systems of literature build their significance.
Texts are essentially intentional objects: the meaning of a word, the usage of a metaphor, the choice of a metric or rhythmic solution in a poetic text are determined by the attribution of sense and meaning by the author and by the reader. Intentional phenomena do not follow regular patterns and are hardly (if ever) detectable by statistical methods.

42 The intentional nature of literary phenomena
One of the underlying assumptions of the distant reading approach is quite analogous to the reductionist stance in the cognitive sciences: interesting literary phenomena can be reduced without residue to material linguistic phenomena, which in turn are completely accessible to purely quantitative and statistical/probabilistic methods.
We can say that a purely quantitative approach to literary objects is eliminativist towards the intentional concepts of critical discourse. Interpretation is based on the production and application of a set of intentional notions and terms to explain what the text means and how.
Semiotic and structuralist criticism has tried to explain or to reformulate them in more formal and abstract concepts that preserve the intentional nature of text and interpretation.

43 Semantic technologies and ontologies
The semantics-oriented approach is based on modeling complex human interpretations and annotations of the data through formal languages: creating and processing Rich Data.
It is based on the concepts, frameworks and languages of the Semantic Web and Linked Data [Tim Berners-Lee].
The convergence between semiotic/structuralist theories and methods and contemporary ontology- and linked-data-oriented practices represents a big chance for the future development of Digital Literary Criticism.
Building Rich Data for humanities research can enhance the efficacy of text mining technologies.

44 Formal ontologies
“In the context of computer and information sciences, an ontology defines a set of representational primitives with which to model a domain of knowledge or discourse” [Gruber]
A formal ontology is a formalized account of a conceptual description of (a portion of) the world.
The relevance of formal ontologies for the digital processing of literary and cultural objects is both theoretical and operational.

45 Why ontologies matter for (digital) humanists
Creating formal models based on explicit conceptualization and logical foundations ensures that all the discourses are firmly grounded in a common “setting” of the domain: we all (try to) speak of the same thing.
Formal ontologies permit the application of computational inference and reasoning methods to express explanations and make predictions. Their grounding in description logic has made possible the development of efficient automatic reasoners and inference engines.
Semantic Web modeling provides methods to compare and eventually merge different ontologies; the Open World Assumption ensures the functionality of the model even if it is incomplete or conceived as a work in progress.

46 Why ontologies matter for (digital) humanists
In the Humanities and Literary Studies, conceptual formalization must face the problems of the indeterminacy of theories, the vagueness of terms, and the intrinsic ambiguity of the domain. To use computing we need to reduce the implicit and formalize, with the awareness that formal modeling sits inside the hermeneutic process.
Making ontologies and linking them to digital cultural artifacts builds knowledge:
it asks for making explicit the tacit knowledge which is a major part of Humanities work
it asks for finding the data-level correlatives of the abstract and theoretical notions that populate theories, once they are formalized as ontologies
An ontology, in the end, is an account of what the community knows as much as of how it knows what it knows, to recall Willard McCarty.
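A minimal sketch of what such ontology-driven annotation can look like in practice, using rdflib: a tiny, invented character vocabulary linked to a passage of a digital text and serialised as Linked Data. The URIs, class and property names are hypothetical and do not correspond to any published ontology.

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS

CHAR = Namespace("http://example.org/character-ontology#")   # invented vocabulary
TEXT = Namespace("http://example.org/texts/")                 # invented text URIs

g = Graph()
g.bind("char", CHAR)

ulysses = URIRef(TEXT["odyssey/ulysses"])
g.add((ulysses, RDF.type, CHAR.Character))                    # a narrative character
g.add((ulysses, RDFS.label, Literal("Ulysses")))
g.add((ulysses, CHAR.hasRole, CHAR.Protagonist))              # an interpretive claim
g.add((ulysses, CHAR.appearsIn, URIRef(TEXT["odyssey/book1"])))

print(g.serialize(format="turtle"))                           # shareable as linked data
```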

47 Annotation and ontological modeling: examples
Free-form annotation: http://www.annotating-literature.org/, CATMA
Ontology-driven geo-annotation: GEOLAT
Ontologies for narrative texts: Zöllner-Weber’s Noctua literaria ontology (http://www.figurenontologie.de); an ontology for narrative characters (Ciotti)

48 Toward an Hermeneutic Machine
A digital environment and infrastructure incorporating semantic methods and practices of digital interaction and cooperation already available and tested in the Digital Humanities community.
A networked infrastructure of resources, tools and services.
Multiple ontological modelings can be connected with the same text (or passage of text), thus uncovering its complexity.
Such stratified texts can be re-used in different fruition contexts, from “professional scholars” to culturally curious users who are attracted by the potential text mash-ups.

49 Toward an Hermeneutic Machine
Main components:
high-quality document archives belonging to different linguistic traditions/cultures, in standard encoding formats
a set of methods and computational tools for distributed and cooperative annotation of digital resources
a set of domain-specific shared ontologies, organized in a multilayer design, to model particular aspects of the intra-, extra- and inter-textual structure: real places and spaces (chronologically adapted); real persons (including authors); works and literary-history categories; historical events; fictional places and worlds; fictional characters and entities; themes and motifs; rhetorical figures; genres and stylistic features
tools to visualize and process the semantic levels of digital information and share knowledge as linked data

50 Toward an infrastructure for a Literary Semantic Web
Building such an infrastructure is a demanding task, but many of the building blocks are already there. The history and evolution of the Web has shown that it is possible to build complex systems through an incremental and cooperative process.
The infrastructure we are envisioning is cooperative, but it cannot be based on a crowdsourcing approach: we can rather call it a project driven by a “competent and motivated community”.
The representation of beliefs and interpretations made by a scholar depends on assumptions shared with a particular interpretive community, which shares methodologies, disciplinary practices and criteria of rational acceptability; the community of experts licenses the correct interpretations and, by way of the ontological modeling, shapes the frames in which interpretations occur.

51 Readings…
… other than the list of references I have already suggested:
Ted Underwood’s blog
Matt Jockers’ blog
Scott Weingart’s blog
The Programming Historian
Andrew Piper’s blog
… and all the references you can find from these! Go on, explore!

52 Thank you!!! fabio.Ciotti@uniroma2.it https://www.facebook.com/Ciotti.Fabio http://www.aiucd.it

