Text Data Mining: Introduction Hao Chen School of Information Systems University of California at Berkeley

The KDD Process for Extracting Useful Knowledge from Volumes of Data
- Large databases have become ubiquitous
  - grocery store checkout registers
  - credit card authorization
- Computer technology allows efficient and inexpensive data storage and access
- But our ability to analyze and understand large datasets lags far behind.

Manual Data Analysis Impractical
- Slow, expensive, and highly subjective
- Becomes impractical as data volumes grow
  - N: number of records (10^9)
  - D: number of fields
- Need computer technology to automate the bookkeeping.
- First KDD Workshop in 1989

Definitions of KDD
- Knowledge Discovery from Data
- The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

KDD Process: Selection
- Learning the application domain
- Creating a target dataset

KDD Process: Preprocessing
- Data cleaning & preprocessing
  - remove noise
  - handle missing data fields
  - time sequence information
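A minimal sketch of the cleaning step, assuming records are dictionaries with one numeric field. The field name `amount`, the median fill for missing values, and the outlier rule (more than `max_dev` median absolute deviations from the median) are illustrative assumptions, not choices made in the slides.

```python
from statistics import median

def clean(records, field, max_dev=3.0):
    """Fill missing values of `field` with the median and drop outliers.

    A value counts as noise when it lies more than `max_dev` median
    absolute deviations from the median (an assumed, simple rule).
    """
    present = [r[field] for r in records if r.get(field) is not None]
    med = median(present)
    # Fall back to 1.0 if all deviations are zero, to avoid dropping everything.
    mad = median(abs(v - med) for v in present) or 1.0
    cleaned = []
    for r in records:
        value = r.get(field)
        if value is None:
            value = med  # handle a missing data field
        elif abs(value - med) > max_dev * mad:
            continue  # treat the record as noise and drop it
        cleaned.append({**r, field: value})
    return cleaned

# Hypothetical records: one missing value and one obvious outlier.
records = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 12.0},
    {"id": 4, "amount": 11.0},
    {"id": 5, "amount": 900.0},
]
result = clean(records, "amount")
```

Here record 2 receives the median of the observed values and record 5 is dropped; a real project would pick the fill and noise rules from domain knowledge gathered in the selection step.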

KDD Process: Transformation
- Data reduction & projection
  - feature extraction
  - dimensionality reduction
  - invariant representation

KDD Process: Data Mining
- Choosing the function of data mining
- Choosing data mining algorithms
- Data mining: searching for patterns of interest

KDD Process: Interpretation / Evaluation
- Interpretation
- Using discovered knowledge

What is Data Mining?
- Fitting models to or determining patterns from very large datasets.
- A “regime” which enables people to interact effectively with massive data stores.
- Deriving new information from data.
  - finding patterns across large datasets
  - discovering heretofore unknown information

What is Data Mining?
- Potential point of confusion:
  - The metaphor of extracting ore from rock does not really apply to the practice of data mining
  - If it did, then standard database queries would fit under the rubric of data mining
    - Find all employee records in which the employee earns $300/month less than their manager
- In practice, DM refers to:
  - finding patterns across large datasets
  - discovering heretofore unknown information
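The employee query on this slide can be sketched as an ordinary self-join, which shows why it is retrieval rather than mining. The table layout, the column names, the sample rows, and the reading of the condition as "at least $300/month less" are all assumptions made for illustration.

```python
import sqlite3

# In-memory database with a hypothetical employee table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        name TEXT,
        monthly_salary INTEGER,
        manager_id INTEGER REFERENCES employee(id)
    );
    INSERT INTO employee VALUES
        (1, 'Ada',  5000, NULL),
        (2, 'Ben',  4800, 1),
        (3, 'Cleo', 4600, 1),
        (4, 'Dev',  4400, 3);
""")

# Join each employee to their manager and compare salaries.
rows = conn.execute("""
    SELECT e.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.monthly_salary - e.monthly_salary >= 300
    ORDER BY e.name
""").fetchall()
names = [name for (name,) in rows]
conn.close()
```

The result is only a subset of rows that were already stored; nothing new is inferred, which is the distinction the slide draws.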

Another Definition of DM
- What SQL currently cannot do.
- A standard query does not infer new information
  - It retrieves a subset of what is already present and known.
  - SQL was originally intended for business apps
- DM requires sophisticated aggregate queries

DM Touchstone Applications
- Finding patterns across data sets:
  - Reports on changes in retail sales
    - to improve sales
  - Patterns of sizes of TV audiences
    - for marketing
  - Patterns in NBA play
    - to alter, and so improve, performance
  - Deviations in standard phone calling behavior
    - to detect fraud
    - for marketing

DM Touchstone Applications
- Separating signal from noise:
  - Classifying faint astronomical objects
  - Finding genes within DNA sequences
  - Discovering novel tectonic activity

Components of Data Mining
- The model
  - function of the model
    - classification
    - clustering
  - representational form of the model
    - linear function of multiple variables
    - Gaussian probability density function
- The preference criterion
  - goodness of fit
  - avoiding overfitting
- The search algorithm

Model Function
- Classification
- Regression
- Clustering
- Summarization
- Dependency modeling
- Link analysis
- Sequence analysis

Model Representation
- Decision tree
- Linear model
- Nonlinear model (e.g. Neural Network)
- Example-based method (e.g. Nearest Neighbor)
- Probabilistic graphical dependency model (e.g. Bayesian Network)
- Relational attribute model
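As one concrete instance of an example-based representation, here is a minimal 1-nearest-neighbor classifier. The Euclidean distance, the two-dimensional feature vectors, and the labels are assumptions chosen for illustration.

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training point closest to `query`.

    `train` is a list of (feature_vector, label) pairs and the
    distance is Euclidean.
    """
    _, label = min(train, key=lambda pair: math.dist(pair[0], query))
    return label

# Hypothetical training examples: the stored data is the model.
train = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((5.0, 5.0), "high"),
    ((5.5, 4.5), "high"),
]
prediction = nearest_neighbor(train, (4.8, 5.1))
```

There are no fitted parameters here: the "model" is the set of stored examples plus a distance function, which is what distinguishes this family from the linear and nonlinear models in the list.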

Search Algorithm
- Parameter search, given a model
- Model search over model space
  - predictive
  - descriptive

What’s New Here?
- Sounds like statistical modeling or machine learning.
- Main difference: scale and availability
  - Datasets too large for classical analysis
  - Increased opportunity for access
    - end user is often not a statistician
  - New issues in sampling

Statistician’s Viewpoint
- What’s new about DM?
  - Returns statisticians to their empirical roots
    - exploration rather than modeling
  - Hypothesis testing may be irrelevant
    - given the large data sizes, everything is significant
  - Data was collected for some purpose other than the one it is now being analyzed for
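The remark that "everything is significant" can be illustrated with a two-sided z-test. This is a sketch under assumed conditions: a known standard deviation, and an observed mean shift of 0.01 standard deviations that stays fixed while the number of records grows.

```python
import math

def two_sided_p(effect, sigma, n):
    """Two-sided z-test p-value for an observed mean shift `effect`,
    given known standard deviation `sigma` and `n` records."""
    z = abs(effect) * math.sqrt(n) / sigma
    return math.erfc(z / math.sqrt(2))

# The same tiny observed shift at two sample sizes.
p_small = two_sided_p(0.01, 1.0, 100)      # z = 0.1
p_large = two_sided_p(0.01, 1.0, 10**6)    # z = 10
```

With 100 records the shift is indistinguishable from chance; with a million records the same shift gives a vanishingly small p-value, although its practical importance has not changed.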

The Statistician’s Viewpoint (David Hand 97)
Statistics vs. Machine Learning
- Statistics
  - conservative
  - rigorous
  - abstract
  - idealized
- Machine Learning
  - adventurous
  - engineering
  - practical
  - real solutions

Research Challenges
- Massive datasets & high dimensionality
- User interaction & prior knowledge
- Overfitting & assessing statistical significance
- Missing data
- Understandability of patterns
- Managing changing data and knowledge
- Integration
- Nonstandard, multimedia, object-oriented data

A Database Perspective on Knowledge Discovery
- Concept of data mining as a querying process
- First steps toward efficient development of knowledge discovery applications

New Research Frontier
- Short term: efficient algorithms implementing machine learning tools on top of large databases
- Long term: building optimizing compilers for ad hoc queries and embedding queries in application programming interfaces

KDDMS
- KDD objects
  - a rule
  - a classifier
  - a clustering
- KDD queries
  - a predicate returning a set of KDD or DB objects

Examples of KDD Queries
- Generate a classifier
- Generate the strongest rule
- Generate all rules with consequent attribute values computed by an SQL query
- Find tuples that belong to the largest cluster

Future Directions
- KDD applications need development support
  - query KDD objects
  - data mining operations
    - nearest neighbors
    - clustering
- Development of querying tools is a big challenge
- Let developers build applications using a KDD query language

Text Data Mining
- People’s first thought:
  - Make it easier to find things on the Web.
  - But this is information retrieval!
- The metaphor of extracting ore from rock:
  - Does make sense for extracting documents of interest from a huge pile.
  - But does not reflect notions of DM in practice:
    - finding patterns across large collections
    - discovering heretofore unknown information

Real Text DM
- What would finding a pattern across a large text collection really look like?

Bill Gates + MS-DOS in the Bible!
(William Gates, agitator, leader)
From: “The Internet Diary of the man who cracked the Bible Code,” Brendan McKay, Yahoo Internet Life

Real Text DM
- The point:
  - Discovering heretofore unknown information is not what we usually do with text.
  - (If it weren’t known, it could not have been written by someone!)
- However:
  - There is a field whose goal is to learn about patterns in text for its own sake...

Observation
- Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.

TDM using Metadata (instead of Text)
- Data:
  - Reuters newswire (22,000 articles, late 1980s)
  - Categories: commodities, time, countries, people, and topic
- Goals:
  - distributions of categories across time (trends)
  - distributions of categories between collections
  - category co-occurrence (e.g., topic|country)
- Interactive Interface:
  - lists, pie charts, 2D line plots
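Category co-occurrence such as topic|country can be sketched as a conditional rate over article metadata. The article records and category values below are made up for illustration and are not taken from the Reuters collection; the sketch also assumes each label appears at most once per article.

```python
from collections import Counter

def cooccurrence_rates(articles, given, target):
    """For each (target, given) label pair, return the fraction of
    articles carrying the `given` label that also carry the `target` label."""
    pair_counts = Counter()
    given_counts = Counter()
    for article in articles:
        for g in article[given]:
            given_counts[g] += 1
            for t in article[target]:
                pair_counts[(t, g)] += 1
    return {(t, g): c / given_counts[g] for (t, g), c in pair_counts.items()}

# Hypothetical metadata: each article lists its country and topic labels.
articles = [
    {"country": ["brazil"], "topic": ["coffee"]},
    {"country": ["brazil"], "topic": ["coffee", "trade"]},
    {"country": ["japan"], "topic": ["trade"]},
    {"country": ["brazil"], "topic": ["sugar"]},
]
rates = cooccurrence_rates(articles, given="country", target="topic")
```

Here two of the three Brazil articles are tagged coffee, so the rate for (coffee, brazil) is 2/3. Computing the same table per time period would give the trend view listed under the goals.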

Combining Text with Metadata (images, hyperlinks)
- Examples
  - Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
  - Usage + Time + Links to study the evolution of the web and of information use (Pitkow et al. at PARC)
  - Images + Text to improve image search
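The "authority pages" idea can be sketched with a link-only hubs-and-authorities iteration in the spirit of Kleinberg's work. The text component is omitted, and the tiny link graph, page names, and iteration count are assumptions for illustration.

```python
def hits(links, iterations=50):
    """Hub and authority scores for a dict mapping page -> pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph: three pages point at "c", one also at "e".
links = {"a": ["c"], "b": ["c"], "d": ["c", "e"], "c": []}
hub, auth = hits(links)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this graph "c" comes out as the top authority because every hub points to it, and "d" as the top hub because it points to both authorities.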

True Text Data Mining: Don Swanson’s Medical Work
- Given
  - medical titles and abstracts
  - a problem (incurable rare disease)
  - some medical expertise
- find causal links among titles
  - symptoms
  - drugs
  - results

Swanson Example (1991)
- Problem: Migraine headaches (M)
  - stress is associated with M
  - stress leads to loss of magnesium
  - calcium channel blockers prevent some M
  - magnesium is a natural calcium channel blocker
  - spreading cortical depression (SCD) is implicated in M
  - high levels of magnesium inhibit SCD
  - M patients have high platelet aggregability
  - magnesium can suppress platelet aggregability
- All extracted from medical journal titles
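The chain of associations in this example can be sketched as a search for terms connected to the problem only through an intermediate term. This is not Swanson's actual procedure, which the next slide describes as only partially automated; the pair list below is a hand-coded, undirected simplification of the statements on this slide.

```python
from collections import defaultdict

def indirect_links(relations, start):
    """Map each term reachable from `start` only via an intermediate
    term to the set of intermediates that connect them."""
    neighbors = defaultdict(set)
    for a, b in relations:
        neighbors[a].add(b)
        neighbors[b].add(a)
    direct = neighbors[start]
    bridges = defaultdict(set)
    for middle in direct:
        for end in neighbors[middle]:
            # Keep only terms that are not already directly linked.
            if end != start and end not in direct:
                bridges[end].add(middle)
    return dict(bridges)

# Each pair stands for an association asserted in some journal title.
relations = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]
found = indirect_links(relations, "migraine")
```

No single pair mentions migraine and magnesium together, yet four independent intermediates connect them, which is the kind of hypothesis a domain expert would then examine.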

Swanson’s TDM
- Two of his hypotheses have received some experimental verification.
- His technique
  - Only partially automated
  - Required medical expertise
- Few people are working on this.

Conclusions
- Currently, what might be construed as Text Data Mining is really Computational Linguistics
  - Text is tricky to process, but rich and abundant (now)
  - There are many CL tools available
- Data Mining directly from text
  - tells us about language
  - produces meta-information that may be useful for information access

Conclusions
- Information Access != Text Data Mining
  - IA = finding needle in haystack
  - TDM = finding patterns or new information
- However, Information Access may potentially be served by Text Data Mining techniques:
  - automated metadata assignment
  - collection overviews
- The synthesis of ideas from TDM and IA:
  - Perhaps a new field of exploratory data analysis over text!

Promising Research Directions
- Text Data Mining Problems:
  - Patterns within sets of documents:
    - What is the latest in this field?
    - How is this field related to that field?
  - Chains of evidence embedded in text:
    - What drugs have been tested for this symptom?
    - What effects did this funding have on that field?
  - Human use of information over time
    - How does information diffuse across the web?

Needed from Systems
- Support for linking chains of associations
- Support for combined structured and unstructured data
- Support for combining disparate collections

Statistical Themes & Lessons for Data Mining
- Statistical themes
- Statistical lessons
- Cooperation between statistical and computational communities

Overview of Statistical Science
- Probability distributions
- Estimation, consistency, uncertainty, assumptions, robustness, and model averaging
- Hypothesis testing
- Model scoring
- Markov Chain Monte Carlo
- Generalized model classes

Overview of Statistical Sciences
- Rational decision making and planning
- Inference to causes
- Prediction

Important Themes of Statistics to Data Mining
- Clarity about goals
- Use of models that are a reliable means to the goal, and understandable and plausible to users
- Sense of the uncertainties of models and predictions

Lessons
- Data can lie
- Sometimes it’s not what’s in the data that matters
- Perversity of the pervasive P-value
- Intervention and prediction