Presentation transcript:

Text Analytics National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

Outline
– Overview of Text Analytics
– Dunning Loglikelihood Comparison
– Text Clustering / Topic Modeling
– Frequent Patterns Analysis
– Entity Extraction
– SEASR Community Hub: Text Analytics Flow
– Hands-On

Work – MONK
Executes flows for each analysis requested:
– Predictive modeling using Naïve Bayes
– Predictive modeling using Support Vector Machines (SVM)
– Feature Comparison (Dunning Loglikelihood)

Dunning Loglikelihood Tag Cloud
Words that are under-represented in writings by Victorian women as compared to Victorian men. Results are loaded into Wordle to produce the tag cloud. —Sara Steger

Work – Dunning Loglikelihood
Feature comparison of tokens:
– Specify an analysis document/collection
– Specify a reference document/collection
– Perform a statistical comparison using Dunning's log-likelihood
Example showing over-represented words:
– Analysis set: The Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens
– Reference set: The Project Gutenberg EBook of Great Expectations, by Charles Dickens
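For concreteness, here is a minimal Python sketch of the Dunning log-likelihood (G²) statistic in the Rayson/Garside form commonly used for corpus comparison. The two toy word lists stand in for the Dickens collections, and the function name is ours, not SEASR's.

```python
import math
from collections import Counter

def dunning_g2(a, b, c, d):
    """G2 log-likelihood for one word.
    a, b: the word's counts in the analysis and reference corpora;
    c, d: total token counts of the two corpora."""
    e1 = c * (a + b) / (c + d)  # expected count in the analysis corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Toy stand-ins for the analysis and reference collections.
analysis = Counter("it was the best of times it was the worst of times".split())
reference = Counter("it was a town of red brick or of brick that would have been red".split())
c, d = sum(analysis.values()), sum(reference.values())
for word, count in analysis.most_common():
    print(word, round(dunning_g2(count, reference[word], c, d), 2))
```

Higher scores flag words whose frequency differs most between the two collections; whether a word is over- or under-represented follows from comparing its observed and expected counts.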

FeatureLens
"The discussion of the children introduces each of the short internal narratives. This champions the view that her method of repetition was patterned: controlled, intended, and a measured means to an end. It would have been impossible to discern through traditional reading."

Work – DISCUS
On-demand use of analytics while surfing:
– While navigating, request analytics to be performed on the current page
– Text extraction and cleaning
Summarization and keyword extraction:
– List the important terms on the page being analyzed
– Provide relevant short summaries
Visual maps:
– Provide a visual representation of the key concepts
– Show the graph of relations between concepts

Work – Entity Mash-up
– Entity extraction with OpenNLP or Stanford NER
– Locations viewed on a Google Map
– Dates viewed on a Simile Timeline
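The slide names OpenNLP and Stanford NER; as an illustrative stand-in (an assumed swap, not the SEASR component), the sketch below performs the same entity extraction step with spaCy, after which location entities could be geocoded for a map and date entities placed on a timeline.

```python
import spacy  # stand-in for the slide's OpenNLP / Stanford NER

nlp = spacy.load("en_core_web_sm")  # small English model, installed separately
doc = nlp("Mayor Rex Luthor announced today a new research facility in Alderwood.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. 'Rex Luthor PERSON', 'today DATE', 'Alderwood GPE'
# DATE entities could feed a Simile Timeline; GPE/LOC entities a Google Map.
```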

SEASR Text Analytics Goals
Address scholarly text analytics needs by:
– Efficiently managing distributed literary and historical textual assets
– Structuring extracted information to facilitate knowledge discovery
– Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich and efficient for analysis
– Devising a representation for the extracted information
– Devising algorithms for question answering and inference
– Developing UIs for effective visual knowledge discovery and data exploration that separate query logic from application logic
– Leveraging existing machine learning approaches for text
– Enabling text analytics through SEASR components

Text Analytics Definition
Many definitions appear in the literature:
– The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data
– The exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge

Text Analytics: General Application Areas
Information Retrieval
– Indexing and retrieval of textual documents
– Finding a set of (ranked) documents that are relevant to a query
Information Extraction
– Extraction of partial knowledge from the text
Web Mining
– Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Classification
– Predict a class for each text document
Clustering
– Generating collections of similar text documents
Question Answering

Text Analytics Process

Text Preprocessing
– Syntactic text analysis
– Semantic text analysis
Feature Generation
– Bag of words
– N-grams
Feature Selection
– Simple counting
– Statistics
– Selection based on POS
Text/Data Analytics
– Classification: supervised learning
– Clustering: unsupervised learning
– Information extraction
Analyzing Results
– Visual exploration, discovery, and knowledge extraction
– Query-based question answering

Text Representation
Many machine learning algorithms need numerical data, so text must be transformed into a numerical representation. Determining this representation can be challenging.
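A common baseline representation is the bag-of-words count matrix. Below is a minimal sketch with scikit-learn (our choice of library, not prescribed by the slides):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the lord of the rings", "the king saw the monkey"]
vec = CountVectorizer()        # tokenizes and builds the vocabulary
X = vec.fit_transform(docs)    # sparse document-term count matrix
print(vec.get_feature_names_out())
print(X.toarray())             # one row per document, one column per word
```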

Text Characteristics (1)
Large textual databases
– Enormous wealth of textual information on the web
– Publications are electronic
High dimensionality
– Consider each word/phrase as a dimension
Noisy data
– Spelling mistakes
– Abbreviations
– Acronyms
Text sources are very dynamic
– Web pages are constantly being generated (and removed)
– Web pages are generated from database queries
Not well-structured text
– Email/chat rooms: "r u available ?", "Hey whazzzzzz up"
– Speech

Text Characteristics (2)
Dependency
– Relevant information is a complex conjunction of words/phrases
– Order of words in the query matters: "hot dog stand in the amusement park" vs. "hot amusement stand in the dog park"
Ambiguity
– Word ambiguity: pronouns (he, she, …), synonyms (buy, purchase), multiple meanings (bat: the baseball equipment or the mammal)
– Semantic ambiguity: "The king saw the monkey with his glasses." (multiple readings)
Authority of the source
– IBM is more likely to be an authoritative source than a distant cousin

Text Preprocessing
Syntactic analysis
– Tokenization
– Lemmatization
– Part-of-speech (POS) tagging
– Shallow parsing
– Custom literary tagging
Semantic analysis
– Information extraction: named entity tagging, unnamed entity tagging
– Co-reference resolution
– Ontological association (WordNet, VerbNet)
– Semantic role analysis
– Concept-relation extraction

Feature Selection
Reduce dimensionality
– Learners have difficulty addressing tasks with high dimensionality
Irrelevant features
– Not all features help!
– Remove features that occur in only a few documents
– Reduce features that occur in too many documents
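Document-frequency thresholds are the simplest version of this pruning. A sketch using scikit-learn's min_df/max_df parameters, with illustrative values:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the cat ran", "the dog ran", "a lone word"]
# Drop terms in fewer than 2 documents (too rare to learn from) and
# terms in more than 70% of documents (too common to discriminate).
vec = CountVectorizer(min_df=2, max_df=0.7)
X = vec.fit_transform(corpus)
print(vec.get_feature_names_out())  # 'the' and one-document terms are gone
```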

Syntactic Analysis
Tokenization
– A text document is represented by the words it contains (and their occurrences)
– e.g., "Lord of the rings" → {"the", "Lord", "rings", "of"}
– Highly efficient; makes learning far simpler and easier
– Order of words is not that important for certain applications
Lemmatization/Stemming
– Reduces corpus words to their respective headwords (i.e., lemmas)
– Removes suffixes, prefixes, and infixes to identify a word by its root
– Reduces dimensionality
– e.g., flying, flew → fly
Bigrams and trigrams
– Retain semantic content that isolated tokens lose
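A hypothetical NLTK sketch of these steps; note that a stemmer is cruder than the lemmatization described above (the Porter stemmer will not map "flew" to "fly"; that requires a lemmatizer with morphological knowledge). On recent NLTK versions the tokenizer resource may be named "punkt_tab" rather than "punkt".

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.util import bigrams

nltk.download("punkt", quiet=True)  # tokenizer model, first run only
text = "The Lord of the Rings; he was flying, then he flew."
tokens = nltk.word_tokenize(text.lower())
stems = [PorterStemmer().stem(t) for t in tokens]
print(stems)                  # 'flying' -> 'fli'; 'flew' stays 'flew'
print(list(bigrams(tokens)))  # adjacent word pairs retain some word order
```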

Syntactic Analysis
Stop words
– Identify the most common words that are unlikely to help with text analytics, e.g., "the", "a", "an", "you"
– Identify context-dependent words to be removed, e.g., "computer" from a collection of computer science documents
Scaling words
– Important words should be scaled upwards, and vice versa
– TF-IDF: the product of Term Frequency and Inverse Document Frequency
Parsing / part-of-speech (POS) tagging
– Generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph
– Finds the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun)
– Shallow parsing: analysis of a sentence that identifies the constituents (noun groups, verbs, ...) but does not specify their internal structure or their role in the main sentence
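A short TF-IDF sketch combining stop-word removal and term scaling. The textbook form is tf-idf(t, d) = tf(t, d) × log(N / df(t)); scikit-learn's variant adds smoothing and l2-normalizes each document vector, so exact values differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs make good pets"]
vec = TfidfVectorizer(stop_words="english")  # also drops common stop words
X = vec.fit_transform(docs)
# Weights for the first document: rare terms score higher than shared ones.
print(dict(zip(vec.get_feature_names_out(), X.toarray()[0].round(2))))
```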

Text Analytics: Supervised vs. Unsupervised
Supervised learning (classification)
– Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– Split into training data and test data for the model-building process
– New data is classified based on the model built with the training data
– Techniques: Bayesian classification, decision trees, neural networks, instance-based methods, support vector machines
Unsupervised learning (clustering)
– Class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Text Analytics: Classification
Given: a collection of labeled records
– Each record contains a set of features (attributes) and the true class (label)
– Create a training set to build the model
– Create a testing set to test the model
Find: a model for the class as a function of the values of the features
Goal: assign a class (as accurately as possible) to previously unseen records
Evaluation: what is good classification?
– Correct classification: the known label of a test example is identical to the predicted class from the model
– Accuracy ratio: percent of test set examples that are correctly classified by the model
– A distance measure between classes can be used, e.g., classifying a "football" document as "basketball" is not as bad as classifying it as "crime"
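A minimal supervised text classification sketch using Naïve Bayes, one of the techniques listed above; the repeated four-document corpus and its labels are toy stand-ins for a real labeled collection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

docs = ["the quarterback threw the ball", "the striker scored a goal",
        "the jury reached a verdict", "police arrested the suspect"] * 5
labels = ["sports", "sports", "crime", "crime"] * 5
X_tr, X_te, y_tr, y_te = train_test_split(docs, labels, test_size=0.25,
                                          random_state=0, stratify=labels)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_tr, y_tr)                              # build the model on training data
print(accuracy_score(y_te, model.predict(X_te)))   # accuracy on held-out test data
```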

Text Analytics: Topic Modeling
Given: a set of documents
Find: the semantic topics latent in the collection
Usage: MALLET topic modeling tools
Output:
– The percentage of relevance of each document to each topic
– The key words and their counts for each topic
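MALLET is the tool named on the slide; for an in-Python illustration of the same idea (an assumed swap, not the SEASR flow), here is LDA with gensim, which likewise reports per-document topic percentages and per-topic key words. The four tokenized toy documents are ours.

```python
from gensim import corpora, models

texts = [["lincoln", "president", "war", "speech"],
         ["whale", "ship", "sea", "captain"],
         ["lincoln", "war", "union", "speech"],
         ["ship", "sea", "harpoon", "captain"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words per document
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=20)
print(lda.show_topics(num_topics=2))              # key words per topic
print(lda.get_document_topics(corpus[0]))         # per-document topic percentages
```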

Text Analytics: Clustering
Given: a set of documents and a similarity measure among documents
Find: clusters such that
– Documents in one cluster are more similar to one another
– Documents in separate clusters are less similar to one another
Similarity measures:
– Euclidean distance if attributes are continuous
– Other problem-specific measures, e.g., how many words the documents have in common
Evaluation: what is good clustering?
– Produce high-quality clusters with high intra-class similarity and low inter-class similarity
– The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
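A minimal k-means clustering sketch over TF-IDF vectors; the algorithm and four toy documents are our illustrative choices, not prescribed by the slide.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat", "cats chase mice all day",
        "stocks fell on wall street", "markets rallied after the report"]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
# KMeans uses Euclidean distance; on l2-normalized tf-idf vectors this
# orders document pairs the same way cosine similarity does.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment per document
```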

Text Analytics: Frequent Patterns
Given: a set of documents
Find: frequent patterns, i.e., common word patterns used across the collection
Evaluation: what are good patterns?
Results: 1060 patterns discovered, e.g.:
322: Lincoln; 147: Abe; 117: man; 100: Mr.; 100: time; 98: Lincoln Abe; 91: father; 85: Lincoln Mr.; 85: Lincoln man; 75: day; 70: Abraham; 70: President; 68: boy; 67: Lincoln time; 65: Lincoln Abraham; 65: life; 63: Lincoln father; 57: men; 57: work; 52: Lincoln day; …
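A brute-force sketch of frequent pattern counting over tokenized sentences, in the spirit of the Lincoln example above; real pattern miners (Apriori, FP-growth) prune the candidate space rather than enumerating it, and the sentences here are toy stand-ins.

```python
from collections import Counter
from itertools import combinations

sentences = [["lincoln", "president", "war", "man"],
             ["lincoln", "abe", "boy", "father"],
             ["lincoln", "abe", "president", "man"]]
min_support = 2  # a pattern must occur in at least this many sentences
words = Counter(w for s in sentences for w in set(s))
pairs = Counter(p for s in sentences for p in combinations(sorted(set(s)), 2))
for pattern, count in (words + pairs).most_common():
    if count >= min_support:
        print(count, pattern)  # e.g. 3 lincoln / 2 ('abe', 'lincoln')
```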

Text Analytics: DISCUS
Given: a set of documents
Find: top sentences and top tokens, where
– Top sentences contain top tokens
– Top tokens occur in top sentences

Semantic Analysis
Deep parsing
– More sophisticated syntactic, semantic, and contextual processing must be performed to extract or construct the answer
Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
– Extract the relevant information and ignore non-relevant information (important!)
– Link related information and output it in a predetermined format

Information Extraction Information TypeState of the art (Accuracy) Entities an object of interest such as a person or organization % Attributes a property of an entity such as its name, alias, descriptor, or type. 80% Facts a relationship held between two or more entities such as Position of a Person in a Company % Events an activity involving several entities such as a terrorist act, airline crash, management change, new product introduction % “Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL

Information Extraction Approaches
Terminology (name) lists
– Works very well if the list of names and name expressions is stable and available
Tokenization and morphology
– Works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
Use of characteristic patterns
– Works fairly well for novel entities
– Rules can be created by hand or learned via machine learning or statistical algorithms
– Rules capture local patterns that characterize entities from instances of annotated training data
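The internal-format case reduces to pattern matching; a small regex sketch for DD/MM/YY-style dates (with email addresses as a hypothetical extra pattern, not one the slide names):

```python
import re

text = "Filed 12/05/99; follow-up due 01/06/00. Contact curator@example.org."
dates = re.findall(r"\b\d{2}/\d{2}/\d{2}\b", text)        # DD/MM/YY internal format
emails = re.findall(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", text)  # simple email pattern
print(dates, emails)  # ['12/05/99', '01/06/00'] ['curator@example.org']
```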

Relation (Event) Extraction
Identify (and tag) the relation between two entities
– A person is_located_at a location (news)
– A gene codes_for a protein (biology)
Relations require more information
– Identification of two entities and their relationship
– Errors compound: predicted relation accuracy ≈ Pr(E1) × Pr(E2) × Pr(R) ≈ 0.93 × 0.93 × 0.93 ≈ 0.80
Information in relations is less local
– Contextual information is a problem: the right word may not be explicitly present in the sentence
– Events involve more relations and are even harder

Semantic Analysis – Named Entity (NE) Tagging
"Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory."
– Rex Luthor: NE:Person; today: NE:Time; Alderwood: NE:Location; Boynton Laboratory: NE:Organization

Semantic Analysis – Semantic Category (Unnamed Entity, UNE) Tagging
"Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory."
– a new research facility: UNE:Organization

Semantic Analysis – Co-reference Resolution for Entities and Unnamed Entities
"Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory."
– "It" and "Boynton Laboratory" co-refer with the unnamed entity "a new research facility" (UNE:Organization)

Semantic Analysis – Semantic Role Analysis

Semantic Analysis – Concept-Relation Extraction

Information Extraction: Template Extraction
Source text ((c) 2001, Chicago Tribune; distributed by Knight Ridder/Tribune Information Services; by Stephen J. Hedges and Cam Simpson):
"… The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States. The mosque's chief cleric, Abu Hamza al-Masri, lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited. …"
Extracted template slots (flattened from the slide's table): Finsbury Park Mosque; Abu Hamza al-Masri, chief cleric, Finsbury Park Mosque; England, France, United States, Belgium, London; 1999; "his alleged involvement in a Yemen bomb plot"

Streaming Text: Knowledge Extraction
Leverages earlier work on information extraction from text streams. Information extraction here is the process of using advanced automated machine learning approaches to identify entities in text documents and to extract that information along with the relationships the entities may have in the text. The visualization on the slide demonstrates extraction of names, places, and organizations from real-time news feeds: as news articles arrive, the information is extracted and displayed, and relationships are defined when entities co-occur within a specific window of words.
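A sketch of the windowed co-occurrence rule described above, assuming an upstream tagger has already produced (token position, entity) pairs; the function name and example entities are illustrative, not the SEASR implementation.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(entities, window=10):
    """entities: (token_position, entity_text) pairs from an upstream tagger.
    Counts unordered entity pairs whose mentions fall within `window` tokens."""
    pairs = Counter()
    for (p1, e1), (p2, e2) in combinations(sorted(entities), 2):
        if p2 - p1 <= window and e1 != e2:
            pairs[tuple(sorted((e1, e2)))] += 1
    return pairs

ents = [(0, "Rex Luthor"), (13, "Alderwood"), (19, "Boynton Laboratory"),
        (42, "Rex Luthor"), (45, "Alderwood")]
print(cooccurrences(ents))  # edge weights for the social-network graph
```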

Results: Social Network (Tom in Red)

Simile Timeline Constructed by Hand

Simile Timeline in SEASR Dates are automatically extracted with their sentences

Work: MONK
MONK: a case study
– Texts as data
– Texts from multiple sources
– Texts reprocessed into a new representation
– Different tools using the same data
Slides from John Unsworth, "Tools for Textual Data," May 20, 2009

MONK Project
MONK provides:
– 1400 works of literature in English from the 16th to 19th centuries = 108 million words, POS-tagged and TEI-tagged, in a MySQL database
– Several different open-source interfaces for working with this data
– A public API to the datastore
– SEASR under the hood, for analytics
Slides from John Unsworth, "Tools for Textual Data," May 20, 2009

MONK
"A word token is the spelling or surface form of a word. MONK performs a variety of operations that supply each token with additional 'metadata'.
– Take something like 'hee louyd hir depely'.
– This comes to exist in the MONK textbase as something like: hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep
Because the textbase 'knows' that the surface 'louyd' is the past tense of the verb 'love', the individual token can be seen as an instance of several types: the spelling, the part of speech, and the lemma or dictionary entry form of a word." (Martin Mueller)
Slides from John Unsworth, "Tools for Textual Data," May 20, 2009
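The adorned format is easy to unpack; a tiny sketch splitting each spelling_pos_lemma token (underscore-delimited, as in Mueller's example):

```python
adorned = "hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep"
for token in adorned.split():
    spelling, pos, lemma = token.split("_")  # the three 'types' of each token
    print(f"{spelling:8} pos={pos:7} lemma={lemma}")
```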

Text Data
– Texts represent language, which changes over time (spellings)
– Comparison of texts as data requires some normalization (lemmas)
– Counting as a means of comparison requires units to count (tokens)
– Treating texts as data will usually entail a new representation of those texts, to make them comparable and to make their features countable
Slides from John Unsworth, "Tools for Textual Data," May 20, 2009

Text from Multiple Sources
Five aphorisms about textual data (causing tool-builders to weep):
– Scholars are interested in texts first, data second
– Tools are only useful if they can be applied to texts that are of interest
– No single collection has all texts
– No two collections will be identical in format
– No one collection will be internally consistent in format
Slides from John Unsworth, "Tools for Textual Data," May 20, 2009

Public MONK Texts
– Documenting the American South, from UNC-Chapel Hill (1.5 Gb, 8.5 M words)
– Early American Fiction, from the University of Virginia (930 Mb, 5.2 M words)
– Wright American Fiction, from Indiana University (4 Gb, 23 M words)
– Shakespeare, from Northwestern University (170 Mb, 850 K words)
About 7 gigabytes, 38 M words in total
Slides from John Unsworth, "Tools for Textual Data," May 20, 2009

Restricted MONK Texts
– Eighteenth Century Collections Online (ECCO), from the Text Creation Partnership (6 Gb, 34 M words)
– Early English Books Online (EEBO), from the Text Creation Partnership (7 Gb, 39 M words)
– Nineteenth-Century Fiction (NCF), from Chadwyck-Healey (7 Gb, 39 M words)
About 20 Gb, 112 M words in total
Slides from John Unsworth, "Tools for Textual Data," May 20, 2009

MONK Ingest Process
Texts are reprocessed into a new representation:
– TEI source files (from various collections, with various idiosyncrasies) go through Abbot, a series of XSL routines that transform the input format into TEI-Analytics (TEI-A for short), with some curatorial interaction.
– "Unadorned" TEI-A files go through Morphadorner, a trainable part-of-speech tagger that tokenizes the texts into sentences, words, and punctuation; assigns ids to the words and punctuation marks; and adorns the words with morphological tagging data (lemma, part of speech, and standard spelling).
Slides from John Unsworth, "Tools for Textual Data," May 20, 2009

MONK Ingest Process
– Adorned TEI-A files go through Acolyte, a script that adds curator-prepared bibliographic data.
– Bibadorned files are processed by Prior, using a pair of files defining the parts of speech and word classes, to produce tab-delimited text files in MySQL import format, one file for each table in the MySQL database.
– cdb.csh creates a MONK MySQL database and imports the tab-delimited text files.
Slides from John Unsworth, "Tools for Textual Data," May 20, 2009

MONK Tools
– MONK Datastore
– Flamenco faceted browsing
– MONK extension for Zotero
– TeksTale clustering and word clouds
– FeatureLens
– SEASR
– The MONK Workbench (public)
– The MONK Workbench (restricted)
Slides from John Unsworth, "Tools for Textual Data," May 20, 2009

Work – MONK Workbench
Executes flows for each analysis requested:
– Predictive modeling using Naïve Bayes
– Predictive modeling using Support Vector Machines (SVM)
– Feature Comparison (Dunning Loglikelihood)

Demonstration: NLP flows in the Community Hub

Learning Exercises

Attendee Project Plan
– Review the project plan
– Identify a data set
– Modify and develop the project plan over the week
– Present and discuss the project plan and results on Friday

Attendee Project Plan
– Study/project title
– Team members and their affiliations
– Procedural outline of the study/project: research question/purpose of study, data sources, analysis tools
– Activity timeline or milestones
– Report or project outcome(s)
– Ideas on what your team needs from SEASR staff to help you achieve your goal
– Identify a data set

Discussion Questions
– What are three other text tools that could be useful in the humanities? Identify and discuss them.
– What are the obstacles to using this technology for text analysis? What will your colleagues say?