CSC 594 Topics in AI – Text Mining and Analytics


CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 1. Introduction

Unstructured Data
- “80% of business-relevant information originates in unstructured form, primarily text.” (a quote from 2008)
- “Based on the industry’s current estimations, unstructured data will occupy 90% of the data by volume in the entire digital space over the next decade.” (a quote from 2010)
- “The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. For example, it is much more difficult to graphically display textual content than quantitative data.” (Marti Hearst, UC Berkeley, 2007)

IBM Watson Content Analytics: Discover Hidden Value in Your Unstructured Data

Text Mining and Analytics
We use the terms text analytics, text data mining, and text mining almost synonymously in this course. Text analytics uses algorithms to turn free-form text (unstructured data) into data that can be analyzed (structured data) by applying statistical and machine learning methods, as well as Natural Language Processing (NLP) techniques. Once structured data is obtained, the same mining and analytic techniques used for other structured data can be applied. So the most significant part of text mining/analytics is how to convert texts into structured data.
This course covers SAS Enterprise Miner and SAS Text Miner. Other courses cover Teragram products such as SAS Content Categorization, SAS Sentiment Analysis, and SAS Ontology Management.
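
As a rough illustration of what "converting texts into structured data" can mean, the sketch below builds a small document-term count matrix using only the Python standard library. The three example comments and the function name `document_term_matrix` are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is deliberately naive; this is not how SAS Text Miner does it.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    # Naive tokenization: lowercase, then split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # One column per distinct term, in alphabetical order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical free-text comments (not taken from the course data).
docs = [
    "the staff was friendly",
    "the clinic was fast",
    "friendly staff fast service",
]
vocab, matrix = document_term_matrix(docs)

print(vocab)
for row in matrix:
    print(row)
```

Each row is now an ordinary numeric record, so the usual clustering or predictive modeling techniques can take it as input.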

Converting Text into Structured Data
A huge amount of preprocessing is required to convert text.
- Cleaning up ‘dirty’ texts
  - Remove mark-up tags from web documents, encoded symbols such as emoticons/emoji, and extraneous strings such as “AHHHHHHHHHHHHHHHHHHHHH”
  - Correct misspelled words
- Tokenization
  - Remove punctuation, normalize upper/lower case, etc.
- Sentence splitting
- Identifying multi-word expressions (e.g. “as well as”, “radio wave”) and Named Entities (e.g. “Allied Waste”, “Super Mario Bros.”)
- Adding other linguistic information
  - Parts of speech (e.g. noun, verb, adjective, adverb, preposition)
- Filtering non-significant/irrelevant words to reduce dimensions
  - Filtering non-content words using a stop-list (e.g. “the”, “a”, “an”, “and”)
  - Combining tokens by stemming/lemmatizing or using synonyms
- Other NLP features/techniques, e.g. n-grams, syntax trees
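
Several of these steps can be sketched in a few lines of standard-library Python. The sketch strips mark-up tags, collapses long runs of a repeated character, tokenizes, applies the slide's example stop-list, and combines tokens with a very crude suffix-stripping stemmer. All function names are hypothetical, collapsing repeated characters is only one possible way to handle extraneous strings, and the stemmer is a stand-in for a real stemming or lemmatizing algorithm.

```python
import re

# Stop-list taken from the slide's examples; real stop-lists are much longer.
STOP_WORDS = {"the", "a", "an", "and"}


def clean(text):
    """Remove mark-up tags and collapse runs of 4+ identical characters."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop mark-up tags
    text = re.sub(r"(.)\1{3,}", r"\1", text)   # "AHHHHHHH" -> "AH"
    return text


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs, which drops punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Clean, tokenize, filter stop words, then stem the remaining tokens."""
    tokens = tokenize(clean(text))
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("<p>The doctors and the staff were AHHHHHHH amazing!</p>"))
```

The stemmer turns "amazing" into "amaz", which shows why practical systems use a proper stemmer or lemmatizer rather than blind suffix stripping.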

Text Mining Process Pipeline
The process is essentially a linear pipeline. Feedback from the results of text mining might affect earlier preprocessing steps (parsing, or even data collection).

Text Mining Paradigm

Data Mining – Two Broad Areas
- Pattern Discovery/Exploratory Analysis (Unsupervised Learning)
  There is no target variable, and some form of analysis is performed to do the following:
  - identify or define homogeneous groups, clusters, or segments
  - find links or associations between entities, as in market basket analysis
- Prediction (Supervised Learning)
  A target variable is used, and some form of predictive or classification model is developed. Input variables are associated with values of a target variable, and the model produces a predicted target value for a given set of inputs.
Classification refers to identifying “natural” groups, such as identifying different breeds within a species. Anthropology and other sciences try to find clear boundaries between groups to help define natural classification schemes. On the other hand, the same algorithms that can identify natural groups can be applied to any data set, even if no natural grouping exists.

Text Mining Applications – Unsupervised
- Information retrieval (IR)
  - finding documents with relevant content of interest
  - used for researching medical, scientific, legal, and news documents such as books and journal articles
- Document categorization for organizing
  - clustering documents into naturally occurring groups
  - extracting themes or concepts
- Anomaly detection
  - identifying unusual documents that might be associated with cases requiring special handling, such as unhappy customers, fraudulent activity, and so on
Relate the major application areas to the two abstract goals.

Text Mining Applications – Unsupervised
- Text clustering
  Cluster No. | Comment | Key Words
  1           | 1, 3, 4 | doctor, staff, friendly, helpful
  2           | 5, 6, 8 | treatment, results, time, schedule
  3           | 2, 7    | service, clinic, fast
- Trend analysis
  [Figure: Trend for the term “text mining” from Google Trends]

Text Mining Applications – Supervised
Many typical predictive modeling or classification applications can be enhanced by incorporating textual data in addition to traditional input variables.
- churn propensity models that include customer center notes, website forms, e-mails, and Twitter messages
- hospital admission prediction models incorporating medical records notes as a new source of information
- insurance fraud modeling using adjuster notes
- sentiment categorization (next page)
- stylometry or forensic applications that identify the author of a particular writing sample
Relate the major application areas to the two abstract goals.

Sentiment Analysis
The field of sentiment analysis deals with categorization (or classification) of opinions expressed in textual documents.
[Figure: Green represents positive tone, red represents negative tone, and product features and model names are highlighted in blue and brown, respectively.]
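
One very simple way to categorize the tone of a document is to count words from positive and negative word lists. The sketch below does that; it is a lexicon-based stand-in, not the supervised classification approach the course covers, and both word lists and the function name `sentiment` are hypothetical.

```python
# Tiny hypothetical sentiment lexicons; real ones contain thousands of entries.
POSITIVE = {"friendly", "helpful", "fast", "great"}
NEGATIVE = {"slow", "rude", "broken", "poor"}


def sentiment(text):
    """Label text by comparing counts of positive and negative lexicon words."""
    # Split on whitespace, strip surrounding punctuation, and lowercase.
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("The staff was friendly and helpful."))
print(sentiment("Slow service, rude staff!"))
```

A word-counting approach like this ignores negation ("not friendly") and context, which is one reason trained classifiers are generally preferred.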

Structured + Text Data in Predictive Models
Use of both types of data in building predictive models.
Anomaly detection includes examining internet traffic looking for terrorist sources. The course notes discuss how text mining references present general application areas in different ways.
[Figure: ROC Chart of Models With and Without Textual Comments]

Discussion
What sort of pattern discovery or predictive modeling application do you have in mind that can incorporate text data?

Typical Text Pre-processing Steps
Given a raw text (in a corpus), we typically pre-process the text by applying the following tasks in order:
1. Part-Of-Speech (POS) tagging: assign a POS to every word in a sentence in the text
2. Named Entity Recognition (NER): identify named entities (proper nouns and some common nouns which are relevant in the domain of the text)
3. Shallow Parsing: identify the phrases (mostly verb phrases) which involve named entities
4. Information Extraction (IE): identify relations between phrases, and extract the relevant/significant “information” described in the text
Source: Andrew McCallum, UMass Amherst

1. Part-Of-Speech (POS) Tagging
POS tagging is a process of assigning a POS or lexical class marker to each word in a sentence (and all sentences in a corpus).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
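
The input/output format above can be reproduced with a toy dictionary-lookup tagger. This is only a sketch: the lexicon covers just the slide's example sentence, the names `LEXICON`, `tag`, and `format_tagged` are hypothetical, and a lookup table cannot resolve ambiguous words (for instance, "lead" can be a noun or a verb), which is what real taggers use context for.

```python
# Hand-written lexicon covering only the example sentence (assumption).
LEXICON = {"the": "Det", "lead": "N", "paint": "N", "is": "V", "unsafe": "Adj"}


def tag(sentence, lexicon=LEXICON, unknown="UNK"):
    """Assign each whitespace-separated word the tag listed in the lexicon."""
    return [(word, lexicon.get(word.lower(), unknown)) for word in sentence.split()]


def format_tagged(pairs):
    """Render (word, tag) pairs in the word/Tag notation used on the slide."""
    return " ".join(f"{word}/{pos}" for word, pos in pairs)


print(format_tagged(tag("the lead paint is unsafe")))
```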

2. Named Entity Recognition (NER)
NER processes a text and identifies the named entities in each sentence, e.g. “U.N. official Ekeus heads for Baghdad.”
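
A minimal way to illustrate the task is a gazetteer (name list) lookup over the example sentence. The entity labels ORG, PER, and LOC assigned here are assumptions, as are the names `GAZETTEER` and `find_entities`; practical NER systems rely on statistical models and context rather than a fixed list.

```python
# Hypothetical gazetteer mapping known names to assumed entity types.
GAZETTEER = {"U.N.": "ORG", "Ekeus": "PER", "Baghdad": "LOC"}


def find_entities(sentence, gazetteer=GAZETTEER):
    """Return (token, label) pairs for tokens found in the gazetteer."""
    entities = []
    for raw in sentence.split():
        # Keep tokens such as "U.N." intact; otherwise strip trailing punctuation.
        token = raw if raw in gazetteer else raw.rstrip(".,;:!?")
        if token in gazetteer:
            entities.append((token, gazetteer[token]))
    return entities


print(find_entities("U.N. official Ekeus heads for Baghdad."))
```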

3. Shallow Parsing
Shallow (or partial) parsing identifies the (base) syntactic phrases in a sentence, e.g. [NP He] [V saw] [NP the big dog].
After NEs are identified, dependency parsing is often applied to extract the syntactic/dependency relations between the NEs, e.g. [PER Bill Gates] founded [ORG Microsoft].
Dependency relations, written as relation(head, dependent) with the verb in its base form “found”: nsubj(found, Bill Gates), dobj(found, Microsoft)
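
The dependency relations for the Bill Gates example can be held in a small data structure and turned into a subject-predicate-object fact. In this sketch each relation is stored head-first, as relation(head, dependent); the relations are entered by hand rather than produced by a parser, and the names `Dependency` and `to_fact` are hypothetical.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    relation: str   # e.g. "nsubj" (subject) or "dobj" (direct object)
    head: str       # the governing word, here the verb
    dependent: str  # the word or entity attached to the head


# Relations for "Bill Gates founded Microsoft", entered by hand.
deps = [
    Dependency("nsubj", "found", "Bill Gates"),
    Dependency("dobj", "found", "Microsoft"),
]


def to_fact(dependencies, predicate):
    """Collect the subject and object of one predicate into a triple."""
    args = {d.relation: d.dependent for d in dependencies if d.head == predicate}
    return (args.get("nsubj"), predicate, args.get("dobj"))


print(to_fact(deps, "found"))
```

Triples of this kind are the raw material for the information extraction step that follows.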

4. Information Extraction (IE)
- Identify specific pieces of information (data) in an unstructured or semi-structured text
- Transform unstructured information in a corpus of texts or web pages into a structured database (or templates)
- Applied to various types of text, e.g.
  - Newspaper articles
  - Scientific articles
  - Web pages
  - etc.
Source: J. Choi, CSE842, MSU

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.
template filling:
TIE-UP-1
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$20000000
ACTIVITY-1
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start Date: DURING: January 1990
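
To show what the filled template looks like as structured data, the sketch below represents it with Python dataclasses and flattens it into a nested dictionary, such as might be stored as a database record. The class and field names are hypothetical, the values are typed in by hand rather than extracted automatically, and the amount is taken from the passage's "20 million new Taiwan dollars".

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Activity:
    activity: str
    company: str
    product: str
    start_date: str


@dataclass
class TieUp:
    relationship: str
    entities: List[str]
    joint_venture_company: str
    activity: Activity
    amount_ntd: int  # capitalization in new Taiwan dollars


# Template filled by hand from the Bridgestone passage.
production = Activity(
    activity="PRODUCTION",
    company="Bridgestone Sports Taiwan Co.",
    product="iron and 'metal wood' clubs",
    start_date="DURING: January 1990",
)
tie_up = TieUp(
    relationship="TIE-UP",
    entities=["Bridgestone Sports Co.", "a local concern", "a Japanese trading house"],
    joint_venture_company="Bridgestone Sports Taiwan Co.",
    activity=production,
    amount_ntd=20_000_000,
)

# asdict() converts the nested dataclasses into plain dictionaries and lists.
record = asdict(tie_up)
print(record["activity"]["product"])
```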