
1 Text Analytics National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

2 Outline Overview of Text Analytics Dunning Loglikelihood Comparison Text Clustering Frequent Patterns Analysis Entity Extraction Text Application: MONK Workbench Hands-On

3 SEASR @ Work – MONK Executes flows for each analysis requested –Predictive modeling using Naïve Bayes –Predictive modeling using Support Vector Machines (SVM) –Feature Comparison (Dunning Loglikelihood)

4 Dunning Loglikelihood Tag Cloud Words that are under-represented in writings by Victorian women as compared to Victorian men. Results are loaded into Wordle for the tag cloud —Sara Steger

5 SEASR @ Work – Dunning Loglikelihood Feature comparison of tokens: specify an analysis document/collection, specify a reference document/collection, and perform a statistical comparison using Dunning Loglikelihood. Example showing over-represented tokens. Analysis Set: The Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens. Reference Set: The Project Gutenberg EBook of Great Expectations, by Charles Dickens.
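Below is a minimal Python sketch of the Dunning log-likelihood (G²) statistic the slide describes, computed from the 2×2 contingency table of a token's counts in the two corpora. The two toy corpora are invented stand-ins for the Dickens texts.

```python
import math
from collections import Counter

def dunning_g2(a, total_a, b, total_b):
    """G^2 for one token: count a (of total_a tokens) in the analysis
    set vs. count b (of total_b tokens) in the reference set."""
    c, d = total_a - a, total_b - b          # counts of all other tokens
    n = total_a + total_b

    def term(obs, exp):
        return obs * math.log(obs / exp) if obs > 0 else 0.0

    # Expected cell counts under the null hypothesis of equal rates.
    e_a = total_a * (a + b) / n
    e_b = total_b * (a + b) / n
    e_c = total_a * (c + d) / n
    e_d = total_b * (c + d) / n
    return 2.0 * (term(a, e_a) + term(b, e_b) + term(c, e_c) + term(d, e_d))

# Toy corpora standing in for the two novels.
analysis = Counter("it was the best of times it was the worst of times".split())
reference = Counter("great expectations were the making of pip".split())
n_a, n_b = sum(analysis.values()), sum(reference.values())
for tok in sorted(set(analysis) | set(reference)):
    print(tok, round(dunning_g2(analysis[tok], n_a, reference[tok], n_b), 3))
```

High G² scores flag tokens whose rate differs most between the two sets; comparing observed and expected counts then tells you whether a token is over- or under-represented in the analysis set.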

6 Feature Lens “The discussion of the children introduces each of the short internal narratives. This champions the view that her method of repetition was patterned: controlled, intended, and a measured means to an end. It would have been impossible to discern through traditional reading.”

7 SEASR @ Work – DISCUS On-demand usage of analytics while surfing –While navigating, request analytics to be performed on the page –Text extraction and cleaning Summarization and keyword extraction –List the important terms on the page being analyzed –Provide relevant short summaries Visual maps –Provide a visual representation of the key concepts –Show the graph of relations between concepts

8 SEASR @ Work – Entity Mash-up Entity Extraction with OpenNLP or Stanford NER Locations viewed on Google Map Dates viewed on Simile Timeline
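The slide names OpenNLP and Stanford NER; the sketch below uses spaCy as a stand-in to show the same mash-up idea, grouping entities by type so place names could feed a map view and dates a timeline view. It assumes the en_core_web_sm model has been installed (python -m spacy download en_core_web_sm).

```python
# Stand-in sketch using spaCy rather than OpenNLP/Stanford NER,
# the tools the slide actually names.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Charles Dickens finished A Tale of Two Cities in London in 1859.")

# Group entities by type: locations -> map layer, dates -> timeline.
by_type = {}
for ent in doc.ents:
    by_type.setdefault(ent.label_, []).append(ent.text)

print(by_type.get("GPE", []))   # place names, e.g. ['London']
print(by_type.get("DATE", []))  # dates, e.g. ['1859']
```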

9 SEASR Text Analytics Goals Address scholarly text analytics needs by: Efficiently managing distributed literary and historical textual assets Structuring extracted information to facilitate knowledge discovery Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich and efficient for analysis Devising a representation for the extracted information Devising algorithms for question answering and inference Developing UIs for effective visual knowledge discovery and data exploration that separate query logic from application logic Leveraging existing machine learning approaches for text Enabling text analytics through SEASR components

10 Text Analytics Definition Many definitions in the literature: The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data The exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge

11 Text Analytics: General Application Areas Information Retrieval –Indexing and retrieval of textual documents –Finding a set of (ranked) documents that are relevant to the query Information Extraction –Extraction of partial knowledge in the text Web Mining –Indexing and retrieval of textual documents and extraction of partial knowledge using the web Classification –Predict a class for each text document Clustering –Generating collections of similar text documents Question Answering

12 Text Analytics Process

13 Text Preprocessing –Syntactic Text Analysis –Semantic Text Analysis Features Generation –Bag of Words –N-grams Feature Selection –Simple Counting –Statistics –Selection based on POS Text/Data Analytics –Classification: Supervised Learning –Clustering: Unsupervised Learning –Information Extraction Analyzing Results –Visual Exploration, Discovery and Knowledge Extraction –Query-based question answering

14 Text Representation Many machine learning algorithms need numerical data, so text must be transformed. Determining this representation can be challenging.
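As a concrete illustration of one such representation, here is a bag-of-words sketch in plain Python: each document becomes a vector of counts over a fixed vocabulary, so word order is discarded but learners get the numerical input they need.

```python
from collections import Counter

def bag_of_words(doc, vocab):
    """Represent a document as a count vector over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

docs = ["the cat sat on the mat", "the dog sat on the log"]
vocab = sorted({w for d in docs for w in d.lower().split()})
print(vocab)
print([bag_of_words(d, vocab) for d in docs])
# Each document is now a point in a len(vocab)-dimensional space.
```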

15 Text Characteristics (1) Large textual databases –Enormous wealth of textual information on the Web –Publications are electronic High dimensionality –Consider each word/phrase as a dimension Noisy data –Spelling mistakes –Abbreviations –Acronyms Text sources are very dynamic –Web pages are constantly being generated (and removed) –Web pages are generated from database queries Not well-structured text –Email/chat rooms: “r u available ?” “Hey whazzzzzz up” –Speech

16 Text Characteristics (2) Dependency –Relevant information is a complex conjunction of words/phrases –Order of words in the query matters: “hot dog stand in the amusement park” vs. “hot amusement stand in the dog park” Ambiguity –Word ambiguity Pronouns (he, she, …) Synonyms (buy, purchase) Multiple meanings (bat – related to baseball or to the mammal) –Semantic ambiguity The king saw the monkey with his glasses. (multiple meanings) Authority of the source –IBM is more likely to be an authoritative source than my distant second cousin

17 Text Preprocessing Syntactic analysis –Tokenization –Lemmatization –Part Of Speech (POS) tagging –Shallow parsing –Custom literary tagging Semantic analysis –Information Extraction Named Entity tagging Unnamed Entity tagging –Co-reference resolution –Ontological association (WordNet, VerbNet) –Semantic Role analysis –Concept-Relation extraction

18 Feature Selection Reduce Dimensionality –Learners have difficulty addressing tasks with high dimensionality Irrelevant Features –Not all features help! –Remove features that occur in only a few documents –Reduce features that occur in too many documents
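A minimal sketch of the document-frequency filtering described above: drop features seen in too few documents and features seen in too many. The min_df and max_df_ratio thresholds are illustrative choices, not prescribed values.

```python
def filter_by_df(docs, min_df=2, max_df_ratio=0.9):
    """Keep features seen in at least min_df documents but in no more
    than max_df_ratio of all documents."""
    n = len(docs)
    df = {}
    for doc in docs:
        for w in set(doc.lower().split()):  # document frequency, not term frequency
            df[w] = df.get(w, 0) + 1
    return {w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = ["the cat sat", "the cat ran", "the dog barked", "a lone typo"]
print(filter_by_df(docs, max_df_ratio=0.5))
# keeps only 'cat': 'the' is too common, the rest occur once
```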

19 Syntactic Analysis Tokenization –Text document is represented by the words it contains (and their occurrences) –e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”} –Highly efficient –Makes learning far simpler and easier –Order of words is not that important for certain applications Lemmatization/Stemming –Involves the reduction of corpus words to their respective headwords (i.e., lemmas) –Means removing suffixes, prefixes and infixes to reach the root –Reduces dimensionality –Identifies a word by its root –e.g., flying, flew → fly Bigrams and trigrams –Retain semantic content
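A toy sketch of tokenization, lemmatization, and bigram extraction. The IRREGULAR table and the crude suffix rule are invented stand-ins for a real lemmatizer (e.g., one backed by WordNet); they exist only to make the slide's flying/flew → fly example run.

```python
import re

def tokenize(text):
    # Lowercase and split on anything that is not a letter.
    return re.findall(r"[a-z]+", text.lower())

# Toy lemmatizer: a table of irregular forms plus one crude suffix rule.
IRREGULAR = {"flew": "fly", "flying": "fly"}

def lemma(token):
    if token in IRREGULAR:
        return IRREGULAR[token]
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

tokens = tokenize("Lord of the rings")
print(tokens)                         # ['lord', 'of', 'the', 'rings']
print([lemma(t) for t in tokens])     # ['lord', 'of', 'the', 'ring']
print(list(zip(tokens, tokens[1:])))  # bigrams retain some local context
```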

20 Syntactic Analysis Stop words –Identifies the most common words that are unlikely to help with text analytics, e.g., “the”, “a”, “an”, “you” –Identifies context-dependent words to be removed, e.g., “computer” from a collection of computer science documents Scaling words –Important words should be scaled upwards, and vice versa –TF-IDF stands for the product of Term Frequency and Inverse Document Frequency Parsing / Part of Speech (POS) tagging –Generates a parse tree (graph) for each sentence –Each sentence is a stand-alone graph –Find the corresponding POS for each word –e.g., John (noun) gave (verb) the (det) ball (noun) –Shallow Parsing: analysis of a sentence which identifies the constituents (noun groups, verbs, ...), but does not specify their internal structure, nor their role in the main sentence
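A short sketch of TF-IDF weighting, using the common raw-count-times-log-IDF variant (several variants exist, so treat the exact formula as one reasonable choice). Note how a stop word that appears in every document gets weight zero automatically.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term: frequent in the document, rare in the corpus."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return out

docs = [d.lower().split() for d in
        ["the cat sat", "the dog sat", "the cat ran home"]]
for weights in tf_idf(docs):
    print(weights)  # 'the' weighs 0.0 everywhere: it is in every document
```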

21 Text Analytics: Supervised vs. Unsupervised Supervised learning (Classification) –Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations –Split into training data and test data for the model building process –New data is classified based on the model built with the training data –Techniques: Bayesian classification, Decision trees, Neural networks, Instance-Based Methods, Support Vector Machines Unsupervised learning (Clustering) –Class labels of training data are unknown –Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

22 Text Analytics: Classification Given: Collection of labeled records –Each record contains a set of features (attributes) and the true class (label) –Create a training set to build the model –Create a testing set to test the model Find: Model for the class as a function of the values of the features Goal: Assign a class (as accurately as possible) to previously unseen records Evaluation: What Is Good Classification? –Correct classification: the known label of a test example is identical to the predicted class from the model –Accuracy ratio: percent of test set examples that are correctly classified by the model –A distance measure between classes can be used, e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”
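A minimal supervised-learning sketch with scikit-learn's CountVectorizer and MultinomialNB (one of the Bayesian classifiers the slide lists). The training documents and labels are invented toys, not data from MONK.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled records.
train_docs = ["the quarterback threw a pass", "a touchdown in the final quarter",
              "the striker scored a goal", "a penalty kick won the match"]
train_labels = ["football", "football", "soccer", "soccer"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)             # bag-of-words feature matrix
model = MultinomialNB().fit(X, train_labels)  # build the model on training data

# Assign a class to a previously unseen record.
print(model.predict(vec.transform(["he kicked the winning goal"])))
# likely ['soccer'], given the shared vocabulary
```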

23 Text Analytics: Clustering Given: Set of documents and a similarity measure among documents Find: Clusters such that –Documents in one cluster are more similar to one another –Documents in separate clusters are less similar to one another Similarity Measures: –Euclidean distance if attributes are continuous –Other problem-specific measures, e.g., how many words are common in these documents Evaluation: What Is Good Clustering? –Produce high-quality clusters with high intra-class similarity and low inter-class similarity –Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
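A clustering sketch pairing TF-IDF vectors with k-means from scikit-learn; the documents are toys, and this is one plausible setup rather than the pipeline MONK or SEASR actually uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat", "a cat and a kitten purred",
        "stocks fell on wall street", "the stock market rallied today"]

X = TfidfVectorizer().fit_transform(docs)  # length-normalized term vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # the cat documents and the market documents should separate
```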

24 Text Analytics: Frequent Patterns Given: Set of documents Find: Frequent patterns, i.e., common word patterns used in the collection Evaluation: What Are Good Patterns? Results: 1060 patterns discovered. 322: Lincoln 147: Abe 117: man 100: Mr. 100: time 98: Lincoln Abe 91: father 85: Lincoln Mr. 85: Lincoln man 75: day 70: Abraham 70: President 68: boy 67: Lincoln time 65: Lincoln Abraham 65: life 63: Lincoln father 57: men 57: work 52: Lincoln day …
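A tiny Apriori-style sketch of the idea: count single tokens and unordered token pairs that occur in at least min_support documents. The Lincoln sentences are invented toys, not the corpus behind the results above.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(docs, min_support=2):
    """Tokens and unordered token pairs occurring in at least
    min_support documents."""
    counts = Counter()
    for doc in docs:
        tokens = set(doc.lower().split())
        counts.update(tokens)  # single-token patterns
        counts.update(frozenset(p) for p in combinations(sorted(tokens), 2))
    return {p: c for p, c in counts.items() if c >= min_support}

docs = ["Abe Lincoln was a boy", "Abe Lincoln the man", "Lincoln as President"]
for pattern, support in sorted(frequent_patterns(docs).items(),
                               key=lambda item: -item[1]):
    print(support, pattern)
# 3 lincoln / 2 abe / 2 frozenset({'abe', 'lincoln'})
```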

25 Text Analytics: DISCUS Given: Set of documents Find: Top sentences and top tokens –Top sentences contain top tokens –Top tokens exist in top sentences Results:

26 Semantic Analysis Deep Parsing –More sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations) Extract the relevant information and ignore non-relevant information (important!) Link related information and output it in a predetermined format

27 Information Extraction: state of the art accuracy by information type
Entities (an object of interest such as a person or organization): 90-98%
Attributes (a property of an entity such as its name, alias, descriptor, or type): 80%
Facts (a relationship held between two or more entities, such as the position of a person in a company): 60-70%
Events (an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction): 50-60%
“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL

28 Information Extraction Approaches Terminology (name) lists –This works very well if the list of names and name expressions is stable and available Tokenization and morphology –This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas) Use of characteristic patterns –This works fairly well for novel entities –Rules can be created by hand or learned via machine learning or statistical algorithms –Rules capture local patterns that characterize entities from instances of annotated training data
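A sketch of the second and third approaches: dates recognized purely by internal format (the DD/MM/YY example from the slide) and a hand-written characteristic pattern for titled names. The TITLED rule is an illustrative toy; production rule sets are larger and are hand-built or learned from annotated data.

```python
import re

# Dates recognizable from internal format alone (DD/MM/YY, per the slide).
DATE = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")

# A toy characteristic pattern: a title followed by one or two
# capitalized name tokens.
TITLED = re.compile(r"\b(?:Mr|Mrs|Dr|Mayor)\.? [A-Z][a-z]+(?: [A-Z][a-z]+)?")

text = "Mayor Rex Luthor opened the lab on 03/05/99, said Dr. Boynton."
print(DATE.findall(text))    # [('03', '05', '99')]
print(TITLED.findall(text))  # ['Mayor Rex Luthor', 'Dr. Boynton']
```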

29 Relation (Event) Extraction Identify (and tag) the relation between two entities –A person is_located_at a location (news) –A gene codes_for a protein (biology) Relations require more information –Identification of two entities and their relationship –Predicted relation accuracy: Pr(E1) × Pr(E2) × Pr(R) ≈ 0.93 × 0.93 × 0.93 ≈ 0.80 Information in relations is less local –Contextual information is a problem: the right word may not be explicitly present in the sentence –Events involve more relations and are even harder

30 Semantic Analysis – Named Entity (NE) Tagging “Mayor Rex Luthor [NE:Person] announced today [NE:Time] the establishment of a new research facility in Alderwood [NE:Location]. It will be known as Boynton Laboratory [NE:Organization].”

31 Semantic Analysis – Semantic Category (unnamed entity, UNE) Tagging “Mayor Rex Luthor announced today the establishment of a new research facility [UNE:Organization] in Alderwood. It will be known as Boynton Laboratory.”

32 Semantic Analysis – Co-reference Resolution for entities and unnamed entities “Mayor Rex Luthor announced today the establishment of a new research facility [UNE:Organization] in Alderwood. It [= the research facility] will be known as Boynton Laboratory.”

33 Semantic Analysis Semantic Role Analysis

34 Semantic Analysis Concept-Relation Extraction

35 Information Extraction: Template Extraction (c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson ……. “The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States. The mosque's chief cleric, Abu Hamza al-Masri, lost two hands fighting the Soviet Union in Afghanistan, and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited.” … [Figure: extracted template linking the Finsbury Park Mosque, its chief cleric Abu Hamza al-Masri, the locations England, France, Belgium, the United States and London, the 1999 arrest, and the alleged Yemen bomb plot]

36 Streaming Text: Knowledge Extraction Leveraging some earlier work on information extraction from text streams Information extraction: the process of using advanced automated machine learning approaches to identify entities in text documents and to extract this information along with the relationships these entities may have in the text The visualization above demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.
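A sketch of the co-occurrence rule the slide describes: two entities are linked whenever they appear within a fixed window of tokens, with edge weight equal to the number of such co-occurrences. Entity detection itself is assumed to have happened upstream; here the entity set is simply given.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(tokens, entities, window=10):
    """Link two entities whenever they co-occur within `window` tokens;
    edge weight = number of co-occurrences."""
    edges = Counter()
    positions = [(i, t) for i, t in enumerate(tokens) if t in entities]
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i <= window:
            edges[tuple(sorted((a, b)))] += 1
    return edges

tokens = "Tom met Anna in Chicago before Tom flew to Boston".split()
print(cooccurrence_edges(tokens, {"Tom", "Anna", "Chicago", "Boston"}))
```

Feeding these weighted edges into any graph layout yields a social-network view like the Tom-in-red result on the next slide.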

37 Results: Social Network (Tom in Red)

38 Simile Timeline Constructed by Hand

39 Simile Timeline in SEASR Dates are automatically extracted with their sentences

40 SEASR @ Work: MONK MONK: a case study Texts as data Texts from multiple sources Texts reprocessed into a new representation Different tools using the same data Slides from John Unsworth, “Tools for Textual Data”, May 20, 2009

41 MONK Project MONK provides: 1400 works of literature in English from the 16th to 19th century = 108 million words, POS-tagged, TEI-tagged, in a MySQL database. Several different open-source interfaces for working with this data A public API to the datastore SEASR under the hood, for analytics Slides from John Unsworth, “Tools for Textual Data”, May 20, 2009

42 MONK “A word token is the spelling or surface form of a word. MONK performs a variety of operations that supply each token with additional ‘metadata’. –Take something like ‘hee louyd hir depely’. –This comes to exist in the MONK textbase as something like hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep Because the textbase ‘knows’ that the surface ‘louyd’ is the past tense of the verb ‘love’, the individual token can be seen as an instance of several types: the spelling, the part of speech, and the lemma or dictionary entry form of a word.” (Martin Mueller) Slides from John Unsworth, “Tools for Textual Data”, May 20, 2009
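A sketch of reading adorned tokens in the underscore form shown in the quotation (spelling_pos_lemma). The format here follows the slide's example, not a documented MONK file layout.

```python
def parse_adorned(adorned):
    """Split one adorned token into its three types."""
    spelling, pos, lemma = adorned.split("_")
    return {"spelling": spelling, "pos": pos, "lemma": lemma}

line = "hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep"
for token in line.split():
    print(parse_adorned(token))
# Counting by 'lemma' instead of 'spelling' lets 'louyd' and 'loved'
# fall together as instances of the verb 'love'.
```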

43 Text Data Texts represent language, which changes over time (spellings) Comparison of texts as data requires some normalization (lemma) Counting as a means of comparison requires units to count (tokens) Treating texts as data will usually entail a new representation of those texts, to make them comparable and to make their features countable. Slides from John Unsworth, “Tools for Textual Data”, May 20, 2009

44 Text from Multiple Sources Five aphorisms about textual data (causing tool-builders to weep): Scholars are interested in texts first, data second Tools are only useful if they can be applied to texts that are of interest No single collection has all texts No two collections will be identical in format No one collection will be internally consistent in format Slides from John Unsworth, “Tools for Textual Data”, May 20, 2009

45 Public MONK Texts Documenting the American South from UNC-Chapel Hill –(1.5 Gb, 8.5 M words) Early American Fiction from the University of Virginia –(930 Mb, 5.2 M words) Wright American Fiction from Indiana University –(4 Gb, 23 M words) Shakespeare from Northwestern University –(170 Mb, 850 K words) About 7 Gigabytes, 38 M words Slides from John Unsworth, “Tools for Textual Data”, May 20, 2009

46 Restricted MONK Texts Eighteenth-Century Collection Online (ECCO) from the Text Creation Partnership –(6 Gb, 34 M words) Early English Books Online (EEBO) from the Text Creation Partnership –(7 Gb, 39 M words) Nineteenth-Century Fiction (NCF) from Chadwyck-Healey –(7 Gb, 39 M words) About 20 Gb, 112 M words Slides from John Unsworth, “Tools for Textual Data”, May 20, 2009

47 MONK Ingest Process Texts reprocessed into a new representation TEI source files (from various collections, with various idiosyncrasies) go through Abbot, a series of XSL routines that transform the input format into TEI-Analytics (TEI-A for short), with some curatorial interaction. “Unadorned” TEI-A files go through Morphadorner, a trainable part-of-speech tagger that tokenizes the texts into sentences, words and punctuation, assigns ids to the words and punctuation marks, and adorns the words with morphological tagging data (lemma, part of speech, and standard spelling). Slides from John Unsworth, “Tools for Textual Data”, May 20, 2009

48 MONK Ingest Process Adorned TEI-A files go through Acolyte, a script that adds curator-prepared bibliographic data. Bibadorned files are processed by Prior, using a pair of files defining the parts of speech and word classes, to produce tab-delimited text files in MySQL import format, one file for each table in the MySQL database. cdb.csh creates a MONK MySQL database and imports the tab-delimited text files. Slides from John Unsworth, “Tools for Textual Data”, May 20, 2009

49 MONK Tools MONK Datastore Flamenco Faceted Browsing MONK extension for Zotero TeksTale Clustering and Word Clouds FeatureLens SEASR The MONK Workbench (Public) The MONK Workbench (Restricted) Slides from John Unsworth, “Tools for Textual Data”, May 20, 2009

50 SEASR @ Work – MONK Workbench Executes flows for each analysis requested –Predictive modeling using Naïve Bayes –Predictive modeling using Support Vector Machines (SVM) –Feature Comparison (Dunning Loglikelihood)

51 Feature Lens “The discussion of the children introduces each of the short internal narratives. This champions the view that her method of repetition was patterned: controlled, intended, and a measured means to an end. It would have been impossible to discern through traditional reading.”

52 SEASR @ Work – Dunning Loglikelihood Feature comparison of tokens: specify an analysis document/collection, specify a reference document/collection, and perform a statistical comparison using Dunning Loglikelihood. Example showing over-represented tokens. Analysis Set: The Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens. Reference Set: The Project Gutenberg EBook of Great Expectations, by Charles Dickens.

53 Demonstration MONK Workbench –Feature comparisons with Dunning Loglikelihood –Find more documents like this with tagging and using predictive modeling

54 Learning Exercises Open the MONK Workbench (http://monkproject.org)
1. Create a project by clicking "new" and giving the project a name and description; click ok
2. Click continue, located at the bottom right
3. Select "define workset", click continue
4. Define two worksets, one for Early American Fiction women and the other for Early American Fiction men: select collection Early American Fiction; select author gender female; click "Search for works"; click the "select all" checkbox and then click Save -> as new workset, giving "early am fict women" as the name; repeat for men and save as "early am fict men"

55 Learning Exercises
5. Go back to the screen from step 3 by clicking on the project name under the MONK icon
6. Select "Compare Worksets" and click continue –Select the men workset and women workset from the drop-down menus –analysis: Dunning; feature: lemma; min freq: 10; feature class: all –click compare

56 Attendee Project Plan Review project plan Identify data set Modify and develop the project plan over the week Present and discuss project plan and results on Friday

57 Attendee Project Plan Study/Project Title Team Members and their Affiliation Procedural Outline of Study/Project –Research Question/Purpose of Study –Data Sources –Analysis Tools Activity Timeline or Milestones Report or Project Outcome(s) Ideas on what your team needs from SEASR staff to help you achieve your goal. Identify Data Set

58 Discussion Questions Identify and discuss three other text tools that could be useful in the Humanities. What are the obstacles to using this technology for text analysis, and what will your colleagues say?

