The Dangers and Delights of Data Mining Glenn Roe Summer School July 3 2012.

The Dangers and Delights of Data Mining Glenn Roe Digital.Humanities@Oxford Summer School July 3 2012

Some opening thoughts.... Machine Learning (ML) and Data Mining (DM) techniques will drive future humanistic research as a central component of future digital libraries. Old Digital Humanities (DH) tools were transparent. ML/DM are opaque. General impact of ML on all humanities research: categorize, link, organize, direct attention to some texts rather than others automatically. Examine three areas of possible critical assessment. DH is uniquely well-suited to critique the application of machine learning techniques in the humanities.

Emerging Digital Libraries. Scale of digital collections requires machine assistance to: categorize and organize; propose intertextual relations; evaluate and rank queries; facilitate discovery and navigation. "There are only about 30,000 days in a human life -- at a book a day, it would take 30 lifetimes to read a million books and our research libraries contain more than ten times that number. Only machines can read through the 400,000 books already publicly available for free download from the Open Content Alliance." -- Gregory Crane. Only machines will read all the books.

And 5 million books? We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of culturomics focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities. www.sciencexpress.org / 16 December 2010

Culturomics…

Reading from afar… (or not at all). "Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, less is more. If we want to understand the system in its entirety, we must accept losing something. We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it's precisely this poverty that makes it possible to handle them, and therefore to know. This is why less is actually more." Franco Moretti, Conjectures on World Literature (2000) http://www.newleftreview.org/A2094

Not Reading has a long history. L'Histoire du livre: dépôt légal; after-death inventories; library holdings/circulation records; archives of publishers; vocabulary of titles (Furet); censorship records… Martin, Furet, Darnton, Chartier, etc.

From Not Reading to Text Mining By not reading we examine: concordances, frequency tables, feature lists, classification accuracies, collocation tables, statistical models, etc… We track: Literary topoi (E.R. Curtius), concepts (R. Koselleck, Begriffsgeschichte), and other semantic patterns: over time, between categories, across genres. So that distant reading and text mining can provide larger contexts for close reading.

Text Mining as Pattern Detection Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Of course, there will be problems. Many patterns will be banal and uninteresting. Others will be spurious, contingent on accidental coincidences in the particular dataset used. And real data is imperfect: some parts are garbled, some missing. Anything that is discovered will be inexact: there will be exceptions to every rule and cases not covered by any rule. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful. -- Ian Witten, Data Mining: Practical Machine Learning Tools and Techniques, xvix.

Transparency of traditional DH approaches. PhiloLogic: A few choice words... Open Source: http://philologic.uchicago.edu/ Advantages: Fast, robust, many search and reporting features. Collocation tables, sortable KWICs, etc. Handles various encoding schemes and object types. Known to work with most languages. Limitations: User-initiated search for a small number of words. Limited order of generalization. How to address larger issues (gender or genre)? What to do with 150,000 (or more) hits?

Transparency of traditional DH approaches PhiloLogic searches return what you asked for in the order in which you asked. Example: search for various forms of moderni.* 1850-99 You get 82 hits. Results can be sorted and organized. Requires user selection. The user sifts through results and analyzes effectively raw output data.

Machine Learning is opaque... ML systems depend on many assumptions and selections that are not readily available to end users. The hunt for Google's infamous secret sauce: open competition to find the over 250 ingredients in the Google search/sauce algorithms. A black-box industry: analyzing the secret sauce for profit. Many commercial organizations examine Web mining extensively, e.g., Search Engine Watch www.searchenginewatch.com

Two ways of using DM in the humanities 1) Tool approach: PhiloMine, MONK, etc. allows direct manipulation of data mining materials. 2) Embedded approach: results of machine learning or text mining become part of general systems. - Google and other WWW search engines - Dedicated library systems (AquaBrowser) Most humanities scholars will use embedded machine learning systems.

Embedded Machine Learning Systems Humanists are already using machine learning and data mining in general applications: spam filters movie recommendations (Netflix) related book/article suggestions (Amazon) Adwords (monetizing the noun) etc... And coming soon to a library near you: LENS....

Embedded Machine Learning Systems

Building Data Mining Tools: Three types of data/text mining* 1. Predictive Classification: learn categories from labeled data, predict on unknown instances. 2. Comparative Classification: learn categories from labeled data to find accuracy rate, errors, and most important features. 3. Similarity: measure document/part similarities, looking for meaningful connections. *Distinction is arbitrary and does not cover all text mining tasks.

Predictive Classification Widely used: spam filters, recommendation systems, etc. Computer reads text, identifies the words (features) most associated with each class (author, class of knowledge). Humanities applications: extract classes or labels from contemporary documents. Use contemporary classification system rather than modern system to predict classes. *Problem: information space can be noisy, incoherent.

Predictive Classification: Text Mining the Digital Encyclopédie. 74,131 articles in the current database; 13,272 articles without classification (18%). We trained our classifiers on the 60K classified articles (comprising 2,899 individual classes) to generate a model, which is then used to classify the unknown instances and then reclassify all 74K articles. The resulting ontology was optimized to 360 classes; this is a typical result of machine classification.

Predictive Classification Classifying the unclassified: DISCOURS PRELIMINAIRE DES EDITEURS, Class=Philosophy DEMI-PARABOLE, Class=Algebra Bois de chauffage, Class=Commerce Canard, Class=Natural history; Ornithology Chartre de Champagne, Class=Jurisprudence Chartre de commune, Class=Jurisprudence Chartre aux Normands, Class=Jurisprudence Chartre au roi Philippe, Class=Ecclesiastical history Chartre au roi Philippe fut donnée par Philippe Auguste vers la fin de l'an 1208, ou au commencement de l'an 1209, pour régler les formalités nouvelles que l'on devoit observer en Normandie dans les contestations qui survenoient pour raison des patronnages d'église, entre des patrons laiques & des patrons ecclésiastiques. Cette chartre se trouve employée dans l'ancien coûtumier de Normandie, après le titre de patronnage d'église; & lorsqu'on relut en 1585 le cahier de la nouvelle coûtume, il fut ordonné qu' à la fin de ce cahier l'on inséreroit la chartre au roi Philippe & la chartre Normande. Quelques - uns ont attribué la premiere de ces deux chartres à Philippe III. dit le Hardi; mais elle est de Philippe Auguste, ainsi que l'a prouvé M. de Lauriere au I. volume des ordonnances de la troisieme race, page 26. Voyez aussi à ce sujet le recueil d' arrêts de M. Froland, partie I. chap. vij.

Comparative Classification: Comparative Categorical Feature Analysis. Use classifiers as a form of hypothesis testing. Train a classifier on a set of categories (gender of author, class of knowledge). Run the trained model on the same data to find: accuracy of classification; most salient features; errors or misclassified instances. *Classification errors can be rich sources of inquiry for humanists.
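The train-then-rerun procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the classifier is a bare multinomial Naive Bayes with add-one smoothing, the function names are ours, and the four toy "articles" and their labels are invented rather than taken from the Encyclopédie. It does not reproduce PhiloMine's implementation, and it reports accuracy and errors but not salient features.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns (class_counts, word_counts, vocabulary)."""
    class_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def predict(model, tokens):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in tokens:
            if w in vocabulary:
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def evaluate_on_training_data(labeled_docs):
    """Train, then re-classify the same documents; return accuracy and the errors."""
    model = train_naive_bayes(labeled_docs)
    errors = []
    for tokens, label in labeled_docs:
        predicted = predict(model, tokens)
        if predicted != label:
            errors.append((tokens, label, predicted))
    return 1 - len(errors) / len(labeled_docs), errors

# Invented toy data: the last article carries legal vocabulary but an
# ecclesiastical label, so the model "disagrees" with its original class.
toy_articles = [
    (["loi", "juge", "coutume"], "Jurisprudence"),
    (["loi", "roi", "charte"], "Jurisprudence"),
    (["eglise", "roi", "eveque"], "Ecclesiastical history"),
    (["loi", "juge", "charte"], "Ecclesiastical history"),
]
accuracy, errors = evaluate_on_training_data(toy_articles)
```

The single error here is the interesting case for a humanist: the prediction is "wrong" against the original label yet plausible given the article's vocabulary.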

Comparative Classification: Text Mining the Digital Encyclopédie. Original # of classes: 2,899. New # of classes: 360. 73.3% of the previously classified articles were assigned to their original class, a number that is amazing given the complexity of the ontology. This means that 26.7% of those articles received a different class. Of the 74,131 articles: 44,628 classified correctly; 16,231 classified incorrectly; 13,272 previously unclassified articles were classified.
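The percentages on this slide can be checked from the three counts it gives. A quick sketch (variable names are ours) suggesting that the 73.3% figure is computed over the articles that already had a class, not over all 74,131:

```python
# Counts as given on the slide.
correct = 44_628        # reassigned to their original class
incorrect = 16_231      # assigned a different class
unclassified = 13_272   # had no original class

previously_classified = correct + incorrect    # the "60K classified articles"
total = previously_classified + unclassified   # all articles in the database

accuracy = correct / previously_classified
share_unclassified = unclassified / total

print(f"{total} articles, {accuracy:.1%} kept their class, "
      f"{share_unclassified:.0%} had no class")
```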

Comparative Classification Accrues: original classification too specific Tepidarium: reclassification seems more logical Achées: incorrect prediction although appropriate given vocabulary Text Mining the Digital Encyclopédie

Comparative Classification Predict classifications in other texts: Classification of Diderot's Éléments de physiologie by chapter. Most chapters classed as anatomy, medicine, physiology. "Avertissement": literature Chapter "Des Etres": metaphysics Chapter "Entendement": metaphysics and grammar Chapter "Volonté": ethics Leverage a contemporary classification system as way to support search and result filtering.

Clusters of Knowledge Top: History, Geography, Literature, Grammar, etc. Middle : Physical Sciences, Physics, Chemistry, etc. Lower: Biological Sciences & Natural History

Similarity: Documents. Comparative and Predictive Classification are one way to find meaningful patterns by abstracting data from the text. They typically build abstract models of a knowledge space based on identified characteristics of documents (supervised learning). Document similarity: unsupervised learning based on statistical characteristics of the contents of texts. Many applications: clustering, topic modeling, kNN classifiers, etc.

Vector Space Similarity (VSM). Documents are bags of words (no word order). Each bag can be viewed as a vector. Vector dimensionality corresponds to the number of words in our vocabulary. Value at each dimension is the number of occurrences of the associated word in the given document:
amour  ancien  livre  propre
1      0       3      0
All document vectors taken together comprise a document-term matrix. *Used for many applications, from information retrieval to topic segmentation.
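A minimal sketch of how such a document-term matrix could be built with raw counts. The function name and the two toy documents are ours, chosen so that the first row reproduces the example vector on the slide; tokenization is a plain whitespace split, which is a simplifying assumption.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

# Toy documents, invented for illustration.
docs = ["amour livre livre livre", "ancien livre propre"]
vocab, dtm = document_term_matrix(docs)

for row in dtm:
    print(dict(zip(vocab, row)))
```

Word order is discarded as soon as the text goes through `Counter`, which is exactly the "bag of words" limitation noted later in the talk.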

Identification of similar articles. Vectors: $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$ and $q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$. Similarity: cosine of the angle between the two vectors in t-dimensional space, where the dimensionality t is equal to the number of words in the vocabulary.
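The cosine measure is the dot product of the two vectors divided by the product of their lengths. A small sketch, assuming both vectors are plain lists of term weights over the same vocabulary (the function name is ours):

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length term vectors."""
    if len(d) != len(q):
        raise ValueError("vectors must share the same vocabulary")
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_d * norm_q)

# Two toy count vectors over the vocabulary (amour, ancien, livre, propre).
score = cosine_similarity([1, 0, 3, 0], [0, 1, 1, 1])
```

Because the score is normalized by vector length, it lies between 0 and 1 for count vectors, though very short and very long articles can still produce surprising neighbours.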

Identification of similar articles. Vector Space can be used to identify similar articles. Size matters: some unexpected results. Articles most similar to GLOIRE, GLORIEUX, GLORIEUSEMENT (Voltaire):
VANITÉ, NA, [Ethics] [0.539]
VOLUPTÉ, NA, [Ethics] [0.514]
FLATEUR, Jaucourt, [Ethics] [0.513]
GOUVERNANTE d'enfans, Lefebvre, [0.511]
CHRISTIANISME, NA, [Theology | Political science] [0.502]
PAU, Jaucourt, [Modern geography] [0.493]
PAU: birthplace of Henri IV.

VSM: Strengths/Limitations. Strengths: Well understood. Standard and robust. Many applications: kNN classifiers, clustering, topic segmentation. Assigns a numeric score which can be used with other measures (e.g., edit distance of headword). Numerous extensions and modifications: Latent Semantic Analysis, etc. Limitations: Bag of words: no notion of text order. Requires identification of documents or blocks (e.g., articles). Not suitable for running text. Cannot identify smaller borrowings in longer texts. Similarity can reflect topic, subject, or theme, unrelated to borrowing or reuse.

Topic Modeling and LDA Topic modeling is a probabilistic method to classify text using distributions over words. In statistics, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. This method of analyzing text was first demonstrated by David Blei, Andrew Ng and Michael Jordan in 2002. Johann Peter Gustav Lejeune Dirichlet

What does LDA do? LDA is an unsupervised word clusterer and classifier. Preliminary assumption : each text is a combination of several topics. Each document is given a classification with a ranking of the most important topics. LDA generates distributions over words, or topics, from the text and classifies the corpus accordingly. dieu ame monde etre nature matiere esprit chose homme substance principe corps univers philosophe systeme idee intelligence eternite rien divine existence creature

Prior research on LDA David Blei ran a series of experiments on the journal Science from the year 1880-2002. Topic : energy molecule atoms matter atomic molecular theory (1900-1910) "The Atomic Theory from the Chemical Standpoint" "On Kathode Rays and Some Related Phenomena" "The Electrical Theory of Gravitation" "On Kathode Rays and Some Related Phenomena" "A Determination of the Nature and Velocity of Gravitation" "Experiments of J. J. Thomson on the Structure of the Atom"

Future research with LDA Text Segmentation : identify topic shifts within a document by classifying paragraphs. Dynamic topic modeling : understand how discourse evolves over time. Example from David Blei on epidemiology : 1880 : disease, cholera, found, fever, organisms 1910 : disease, fund, fungus, spores, cultures 1940 : cultures, virus, culture, strain, strains 1970 : mice, strain, strains, host, bacteria 2000 : bacteria, strain, strains, resistance, bacterial

Strengths/Weaknesses of LDA. LDA is a powerful tool to classify unclassified data sets. A lot of research is being done on topic modeling by computer scientists: our challenge is to take their findings and apply them to text analysis. LDA is just one aspect of the wider goal of having machines contextualize text, identify coherent segments, and ultimately ease the processing of very large corpora.

A Critical Approach to Data Mining. Critique is a fundamental humanistic activity which is not necessarily limited to texts (e.g., reading the body). Machine learning will be a necessary component of future humanities research, and Digital Humanities is uniquely situated and suited to critique ML tools and their applicability moving forward. I will touch on three primary areas of critique drawn from our own experiments with machine learning: 1) algorithms, features, and parameters; 2) classification and ontologies; 3) intertextual relations.

Opening the Black Box: PhiloMine. Open Source: http://code.google.com/p/philomine/ PhiloLogic extension uses existing services. Permits moving to particular texts or features. WWW-based form submission with defined tasks. Many classifiers (Support Vector Machine, etc.). Many features (words, n-grams, lemmas, etc.). Many feature selection and normalization options.

Opening the Black Box: PhiloMine

Algorithms, Features, & Parameters Algorithms = classifiers, segmenters, similarities, aligners Features = salient to task, elements of texts which can be computed (words, lemmas, n-grams, etc.) Parameters = many which have significant impact on results The devil is in the combination of details at all levels...

Features and Parameters Matter. Parameter selection includes: type of features, such as words, n-grams, and lemmas; range of features, such as limiting to features that appear in a minimum number of instances; statistical normalization of features; thresholds for various functions. Algorithm and parameter selection are task and data dependent. Selection of algorithms and adjustment of parameters can radically alter results. For example...

Mining the Encyclopédie: Vector Space Similarity

Similarity - Unexpected Links. Gnomonique is similar to Wolstrope. Why? Gnomonique describes various types of cadrans, or sundials, that depend on the movement of celestial bodies. Wolstrope (modern geography) is the birthplace of Isaac Newton. Other most similar articles include Saturn, Planet, Clock making, and Tylehurst, the birthplace of William Lloyd, with a long exposition of his work and the history of the calendar by Newton. *Gnomonics: the art or science of constructing dials, as sundials, which show the time of day by the shadow of the gnomon (γνώμων), a pin or triangle raised above the surface of the dial.

Mining the Encyclopédie: Vector Space Similarity Same Vector Space Similarity problem as before with TF-IDF values rather than raw counts for features. TF-IDF normalizes word frequencies across articles. The weight increases proportionally to the number of times a word occurs in a document but is offset by the frequency of the word in the entire corpus. Produces rather different results.
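To make the weighting concrete, here is a sketch of one common TF-IDF variant: raw term count multiplied by log(N / document frequency). The slide does not say which variant PhiloMine uses, so the formula and the function name are assumptions for illustration.

```python
import math

def tf_idf(matrix):
    """Reweight a document-term count matrix: count * log(n_docs / doc_frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0
         for j in range(n_terms)]
        for row in matrix
    ]

# Toy counts over the vocabulary (amour, ancien, livre, propre).
counts = [[1, 0, 3, 0],
          [0, 1, 1, 1]]
weighted = tf_idf(counts)
```

In this toy case "livre" occurs in every document, so its weight drops to zero even though it is the most frequent word in the first one. That is why similarity rankings can change substantially once raw counts are replaced by TF-IDF values.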

Why Parameters Matter Note differences in articles identified as most similar and different rankings of the same articles (Wolstrope) when using different parameters. On inspection, both lists are reasonable and interesting. Ombre (shadow) is related to sundials. Experimentation, selection, and evaluation required. Similarity using counts Similarity using TF-IDF

Why Features Matter. Feature reduction is a critical function in machine learning tasks. Figure 1: features in more than 3% of articles (2,830). Figure 2: features in more than 1% of articles (7,500). Note the differences in most similar articles and rankings. Feature selection is critical and has an impact on all types of tasks. [Figures: 1. Similarity (TF-IDF), 2,830 features; 2. Similarity (TF-IDF), 7,500 features]
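Filtering features by the share of articles they appear in could look roughly like the sketch below. The function name and the toy documents are ours; the talk's actual thresholds (3% and 1%) would be passed as `min_fraction=0.03` or `0.01`.

```python
from collections import Counter

def select_features(documents, min_fraction):
    """Keep words that occur in more than min_fraction of the documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    return sorted(w for w, c in doc_freq.items() if c / n_docs > min_fraction)

# Invented toy corpus of tokenized articles.
toy_docs = [
    ["cadran", "ombre"],
    ["cadran", "soleil"],
    ["cadran", "ombre", "heure"],
    ["planete"],
]
strict = select_features(toy_docs, 0.5)    # fewer, more widespread features
loose = select_features(toy_docs, 0.25)    # lower threshold admits more features
```

Lowering the threshold enlarges the feature set, which mirrors the move from 2,830 to 7,500 features on the slide.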

Why Features Matter. Bi-grams: sequences of two words (in this case lemmas, or root forms) with function words removed. From the article Gnomonique (with frequencies): académie_royal 1, afin_empêcher 1, aller_voir 1, an_avant 1, an_fondation 2, an_jusque 1, ancien_géometres 1, ancien_historien 1, angle_devoir 1, appelloient_autrefois 1, apprendre_facilement 1, art_écrire 1, attribuer_invention 1, autant_petit 1, avant_alexandre 1, avant_appliquer 1, avant_époque 1, avril_septembre 1, beaucoup_aisé 1, beaucoup_haut 1, beaucoup_plûtôt 1, bout_duquel 1, cadran_cadran 1, cadran_horisontal 2, cadran_solaire 3, cadran_vertical 1, caracteres_suivans 1, cause_position 1, certain_déterminer 1, certain_jour 1, chacun_moi 1, chap_xxxviij 1, chaque_moi 1, chez_juif 1, chez_nation 1, circonférence_cercle 1. PhiloMine generates lemmas and n-grams. Currently using TreeTagger for English and French lemmatizing and part-of-speech identification.
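A sketch of how bigram features of this shape might be produced once a text has been lemmatized. The input is assumed to be a list of lemmas already (the lemmatization step itself, done with TreeTagger in PhiloMine, is not shown), and the short function-word list is a hypothetical stand-in for a real stoplist.

```python
from collections import Counter

# Hypothetical, deliberately tiny list of French function words.
FUNCTION_WORDS = {"le", "la", "les", "de", "des", "un", "une", "et", "à", "du"}

def bigram_counts(lemmas, stopwords=FUNCTION_WORDS):
    """Drop function words, then count adjacent pairs of the remaining lemmas."""
    content = [lemma for lemma in lemmas if lemma not in stopwords]
    return Counter(f"{a}_{b}" for a, b in zip(content, content[1:]))

# Invented lemma sequence, not the actual text of the article.
lemmas = ["le", "cadran", "solaire", "et", "le", "cadran", "horisontal"]
bigrams = bigram_counts(lemmas)
```

Note that removing function words before pairing joins words that were not adjacent in the source text ("solaire" and "cadran" above), which is one reason bigram lists of this kind contain some odd-looking entries.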

Why Features Matter. [Figures: 1. Similarity (TF-IDF), 2,830 features; 2. Similarity (TF-IDF), 7,500 features; 3. Similarity (TF-IDF), 19,000 bi-lemmas] Note again differences in identified articles and rank. Similarity scores are much lower, reflecting the different distributions of n-grams. Similar matches may be based on very small numbers of common features. PhiloMine can filter by a threshold score.

Choosing Parameters and Features... Feature and parameter selection have similar effects on other kinds of machine learning algorithms, such as classifiers. Open question: How do you choose features and parameters? Do you simply rerun tasks until you find results you like? What does this do to hypothesis testing in the humanities? BLDR: what does finding 86% accuracy of nationality of author really mean when we select among so many options?

Classifiers and Ontologies Numerous kinds of classifiers: Naive Bayes (MNB) Support Vector Machines (SVM) Decision Tree Nearest Neighbor (kNN) and many others. Suitability to task: SVM primarily binary classifier; MNB fast but simple; kNN slow, better on humanistic information spaces?

Different Classifiers, Different Results. Classify chapters of Montesquieu, De l'esprit des lois, using Encyclopédie classifications, or ontology:
Chapter: Opérations sur les monnoies du temps des empereurs.
kNN best category = Money; kNN all categories = Money, Numismatics, Roman History
MNB best category = Jurisprudence; MNB all categories = Jurisprudence
Chapter: Des moeurs relatives aux combats.
kNN best category = Ethics; kNN all categories = Ethics, History of Chivalry, French Language
MNB best category = Literature; MNB all categories = Literature, Grammar

Ontologies are historical artifacts Previous comparison based on the ontology of the Encyclopédie. Humanists know that ontologies (classification systems) are temporal, cultural, domain-specific artifacts. Ontologies encode perspectives, worldviews, and power relations. Classification systems in general [...] reflect the conflicting, contradictory motives of the sociotechnical situations that gave rise to them. -- Bowker and Star, Sorting Things Out If ontologies are contingent, how do we choose between them?

Ephraim Chambers, Cyclopaedia, 1728

Système figuré des connaissances humaines. Encyclopédie, 1751

Dewey Classification, 1876

Generated Ontologies: Graphing the relationship of the Encyclopédie classes using centroids.

Multiplication of Ontologies As shown on Michael K. Bergman's AI 3 site: http://www.mkbergman.com/?p=374 An unlimited number of ontologies. Which ones will machine learning tools in the humanities use?

Intertextuality and Directed Reading So, if algorithms, features, parameters, classifications, and ontologies are all contingent... Where does that leave us? We could: use machine learning and data mining tools for directed reading, i.e., approaches that aid in the discovery of intertextual relations over thousands/millions of books...

Intertextuality and Directed Reading We are working on systems to propose intertextual connections, linking related passages or citations between documents. Humans will then follow these machine generated/proposed links. This type of directed reading will have impact on what gets consulted. But, what happens to texts that fall outside of the results of ML?

PhiloLine: Sequence Alignment. Open source: http://code.google.com/p/text-pair/ Investigation of intertextual relationships begins with the identification of related passages using sequence alignment, a technique to identify regions of similarity shared by two strings or sequences, known in computer science as the longest common subsequence (LCS) problem. Applications in many domains, including: bioinformatics (detection of similar DNA sequences); plagiarism detection in text and computer code; collation of texts or manuscript traditions, i.e., genetic criticism.
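For readers unfamiliar with the LCS problem, the textbook dynamic-programming solution over word sequences is sketched below. This illustrates the general problem only; it is not PhiloLine's matching code, and the function name is ours. The two example lines are the variant readings of "Feed where thou wilt..." quoted later in the talk, lowercased.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of two token lists."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover the shared tokens.
    result = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]

shakespeare = "feed where thou wilt on mountain or in dale".split()
markham = "feed where thou wilt in mountaine or on dale".split()
shared = longest_common_subsequence(shakespeare, markham)
```

Exact token matching treats "mountain" and "mountaine" as different words, which hints at why flexible matching parameters matter for early modern spelling and OCR noise.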

PhiloLine: Sequence Alignment. Look for sequences of common words or n-grams; only use n-grams of content words, filtering out function words; adjust parameters to allow for more flexible matching, e.g., related but not identical passages. Example: L'homme est né libre, et partout il est dans les fers. Tel se croit le maître des autres, qui ne laisse pas d'être plus esclave qu'eux.
trigram                  sequence   bytes
homme_libre_partout      208-213    5084-31
libre_partout_fers       211-218    5098-38
partout_fers_croit       213-221    5108-46
fers_croit_maitre        218-223    5132-33
croit_maitre_laisse      221-228    5149-42
maitre_laisse_esclave    223-233    5158-58
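A rough sketch of the n-gram step: build overlapping trigrams from a list of content words, then report which trigrams two passages share and where. The content-word list for the Rousseau sentence is taken from the trigrams on the slide; the second passage and all function names are invented for illustration, and sequence and byte offsets are replaced by simple list positions.

```python
def ngrams(tokens, n=3):
    """Overlapping n-grams of a token list, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def shared_ngrams(tokens_a, tokens_b, n=3):
    """Return (ngram, position_in_a, position_in_b) for every n-gram in both lists."""
    index_b = {}
    for pos, gram in enumerate(ngrams(tokens_b, n)):
        index_b.setdefault(gram, []).append(pos)
    return [
        (gram, pos_a, pos_b)
        for pos_a, gram in enumerate(ngrams(tokens_a, n))
        for pos_b in index_b.get(gram, [])
    ]

# Content words of the Rousseau sentence, as implied by the slide's trigrams.
rousseau = ["homme", "libre", "partout", "fers",
            "croit", "maitre", "laisse", "esclave"]
# Invented passage that reuses part of the sentence.
borrowing = ["dit", "homme", "libre", "partout", "fers", "toujours"]

matches = shared_ngrams(rousseau, borrowing)
```

Runs of consecutive shared n-grams, like the two matches here, are the raw material from which an aligned passage can be assembled; how many gaps to tolerate between them is one of the adjustable parameters mentioned above.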

PhiloLine: Sequence Alignment Locke, John, [1783], Du gouvernement civil (GALE-ECCO): Que fi le pouvtoir légiilatif a été donné par le plus grand nombre, à une personne ou à plufieurs, teulement à vie, ou pour un tems autrement limité; quand ce tems-là est fini,. le pouvoir souverain retourne à la fociété; & quand il y ef retourné de cette manière, la fociété en peut disposer comme il lui plaît, & le remettre entre les mains de ceux qu'elle trouve bon, & ainfi établir une nouvelle forme de gouvernement. CHAPITRE X. De l'étendue du Pouvoir législatif. IL. PAR une communauté ou un état, il ne faut donc point entendre, ni une démocratie, ni aucune autre forme pré- cife de gouvernement,t Encyclopédie, GOUVERNEMENT, Jaucourt Si le pouvoir législatif a été donné par un peuple à une personne, ou à plusieurs à vie, ou pour un tems limité, quand ce tems - là est fini, le pouvoir souverain retourne à la société dont il émane. Dès qu'il y est retourné, la societé en peut de nouveau disposer comme il lui plait, le remettre entre les mains de ceux qu'elle trouve bon, de la maniere qu'elle juge à - propos, & ainsi ériger une nouvelle forme de gouvernement. Que Puffendorff qualifie tant qu'il voudra toutes les sortes de gouvernemens mixtes du nom d'irréguliers, la véritable régularité sera toujours celle qui sera le plus conforme au bien des sociétés civiles.

PhiloLine: Sequence Alignment She locks her lily fingers one in one. Fondling, she saith,since I have hemmed thee here Within the circuit of this ivory pale, I'll be a park, and thou shalt be my deer; Feed where thou wilt, on mountain or in dale: Graze on my lips; and if those hills be dry, Stray lower, where the pleasant fountains lie. Within this limit is relief enough.... Shakespeare, Venus and Adonis [1593] Pre. Fondling, said he, since I haue hem'd thee heere, VVithin the circuit of this Iuory pale. Dra. I pray you sir help vs to the speech of your master. Pre. Ile be a parke, and thou shalt be my Deere: He is very busie in his study. Feed where thou wilt, in mountaine or on dale. Stay a while he will come out anon. Graze on my lips, and when those mounts are drie, Stray lower where the pleasant fountaines lie. Go thy way thou best booke in the world. Ve. I pray you sir, what booke doe you read? Markham, The dumbe knight. [1608]

Distant vs. Directed Reading: What do we lose? What do we gain? "Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, less is more. If we want to understand the system in its entirety, we must accept losing something. We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it's precisely this poverty that makes it possible to handle them, and therefore to know. This is why less is actually more." Franco Moretti, Conjectures on World Literature (2000) http://www.newleftreview.org/A2094

Conclusions... Machine learning and data mining approaches will be necessary for future humanities research and will direct researchers to materials. These techniques may not be well suited, however, to finding the oddities, exceptions, and other outliers that humanists love. Your critique is central here. Humanists understand the conditions of knowledge. Digital humanities can thus bring both technical sophistication and humanistic perspective to the critical analysis of machine learning and data mining techniques.
