Text Data Mining: Introduction Hao Chen School of Information Systems University of California at Berkeley


1 Text Data Mining: Introduction Hao Chen School of Information Systems University of California at Berkeley hchen@sims.berkeley.edu

2 The KDD Process for Extracting Useful Knowledge from Volumes of Data. Large databases have become ubiquitous: a grocery store’s checkout registry; credit card authorization. Computer technology allows efficient and inexpensive data storage and access. But our ability to analyze and understand large datasets lags far behind.

3 Manual Data Analysis Impractical. Slow, expensive, and highly subjective. Becomes impractical as data volumes grow: N, the number of records (10^9); D, the number of fields (10^2 to 10^3). Need computer technology to automate the bookkeeping. First KDD Workshop in 1989.

4 Definitions of KDD. Knowledge Discovery from Data: the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

5 KDD Process: Selection. Learning the application domain; creating a target dataset.

6 KDD Process: Preprocessing. Data cleaning & preprocessing: remove noise; handle missing data fields; time sequence information.
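
One way to "handle missing data fields" is mean imputation. The sketch below is only an illustration: the record layout, the field name `amount`, and the helper `impute_missing` are assumptions, not something the slides specify, and other strategies (dropping records, model-based imputation) may fit a given dataset better.

```python
from statistics import mean

def impute_missing(records, field):
    """Return copies of `records` with None in `field` replaced by the observed mean."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)  # assumes at least one observed value exists
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

# Hypothetical checkout records; the second one has a missing amount.
records = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 20.0},
]
cleaned = impute_missing(records, "amount")
print(cleaned[1])  # {'id': 2, 'amount': 15.0}
```

The original records are left untouched, so the imputed and raw versions can be compared later in the process.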

7 KDD Process: Transformation. Data reduction & projection: feature extraction; dimensionality reduction; invariant representation.

8 KDD Process: Data Mining. Choosing the function of data mining. Choosing data mining algorithms. Data mining: searching for patterns of interest.

9 KDD Process: Interpretation / Evaluation. Interpretation; using discovered knowledge.

10 What is Data Mining? Fitting models to or determining patterns from very large datasets. A “regime” which enables people to interact effectively with massive data stores. Deriving new information from data: finding patterns across large datasets; discovering heretofore unknown information.

11 What is Data Mining? Potential point of confusion: the metaphor of extracting ore from rock does not really apply to the practice of data mining. If it did, then standard database queries would fit under the rubric of data mining (for example: find all employee records in which the employee earns $300/month less than their manager). In practice, DM refers to: finding patterns across large datasets; discovering heretofore unknown information.

12 Another Definition of DM. What SQL currently cannot do. A standard query does not infer new information; it retrieves a subset of what is already present and known. SQL was originally intended for business apps. DM requires sophisticated aggregate queries.
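
To make the point about standard queries concrete, here is a small sketch of the employee/manager query using Python's built-in `sqlite3` module. The table name, columns, and rows are hypothetical, and "earns $300/month less" is read here as "at least $300 less", which is an assumption. The query only returns rows that were already stored; nothing new is inferred.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, manager_id INTEGER)"
)
# Hypothetical monthly salaries; Ada manages Bo and Cy.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Ada", 5000, None), (2, "Bo", 4700, 1), (3, "Cy", 4900, 1)],
)

# Self-join: pair each employee with their manager and filter on the salary gap.
rows = conn.execute(
    """
    SELECT e.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.salary - e.salary >= 300
    ORDER BY e.name
    """
).fetchall()
print(rows)  # [('Bo',)]
```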

13 DM Touchstone Applications. Finding patterns across data sets: reports on changes in retail sales (to improve sales); patterns of sizes of TV audiences (for marketing); patterns in NBA play (to alter, and so improve, performance); deviations in standard phone calling behavior (to detect fraud, for marketing).

14 DM Touchstone Applications. Separating signal from noise: classifying faint astronomical objects; finding genes within DNA sequences; discovering novel tectonic activity.

15 Components of Data Mining. The model: function of the model (classification, clustering); representational form of the model (linear function of multiple variables, Gaussian probability density function). The preference criterion: goodness of fit; avoiding overfitting. The search algorithm.
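
The three components can be seen side by side in a toy example. This is a minimal sketch under stated assumptions: the model is a linear function of a single variable, the preference criterion is the sum of squared errors (goodness of fit only; nothing here guards against overfitting), and the search algorithm is an exhaustive grid search. The function names and the data are invented for illustration.

```python
def predict(params, x):
    """The model: a linear function y = slope * x + intercept."""
    slope, intercept = params
    return slope * x + intercept

def sse(params, data):
    """The preference criterion: sum of squared errors (lower is better)."""
    return sum((y - predict(params, x)) ** 2 for x, y in data)

def grid_search(data, slopes, intercepts):
    """The search algorithm: try every candidate and keep the best-scoring one."""
    candidates = [(s, i) for s in slopes for i in intercepts]
    return min(candidates, key=lambda p: sse(p, data))

# Hypothetical data generated from y = 2x + 1.
data = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0)]
grid = [k / 2 for k in range(-10, 11)]  # -5.0 to 5.0 in steps of 0.5
best = grid_search(data, grid, grid)
print(best)  # (2.0, 1.0)
```

Swapping any one component (a different model form, a penalized criterion, a gradient-based search) leaves the other two unchanged, which is the reason for treating them separately.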

16 Model Function. Classification; regression; clustering; summarization; dependency modeling; link analysis; sequence analysis.

17 Model Representation. Decision tree; linear model; nonlinear model (e.g. Neural Network); example-based method (e.g. Nearest Neighbor); probabilistic graphical dependency model (e.g. Bayesian Network); relational attribute model.
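
As an illustration of an example-based representation, the sketch below classifies a query point by the label of its single nearest training example. The training points, labels, and the choice of Euclidean distance are assumptions made for the example.

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training example closest to `query` (Euclidean distance)."""
    _, label = min(train, key=lambda example: math.dist(example[0], query))
    return label

# Hypothetical labeled points in two dimensions.
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((5.0, 5.0), "b"),
    ((4.8, 5.3), "b"),
]
print(nearest_neighbor(train, (1.1, 1.0)))  # a
print(nearest_neighbor(train, (5.1, 4.9)))  # b
```

Here the "model" is simply the stored examples, in contrast to a decision tree or linear model, which summarize the data in a separate structure.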

18 Search Algorithm. Parameter search, given a model. Model search over model space. Predictive; descriptive.

19 What’s New Here? Sounds like statistical modeling or machine learning. Main difference: scale and availability. Datasets too large for classical analysis. Increased opportunity for access (the end user is often not a statistician). New issues in sampling.

20 Statistician’s Viewpoint. What’s new about DM? Returns statisticians to their empirical roots (exploration rather than modeling). Hypothesis testing may be irrelevant: given the large data sizes, everything is significant. Data was collected for some purpose other than the one it is now being analyzed for.
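
The remark that "everything is significant" can be checked with a short calculation. The sketch below assumes a one-sample z-test with known standard deviation and a fixed, practically negligible effect; the specific numbers are illustrative. The p-value depends on the sample size through z = effect / (sigma / sqrt(n)), so it shrinks as n grows even though the effect does not change.

```python
import math

def two_sided_p(effect, sigma, n):
    """Two-sided p-value of a one-sample z-test for a mean shift of `effect`."""
    z = effect / (sigma / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny effect (1% of a standard deviation) at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(0.01, 1.0, n))
```

At n = 100 the effect is nowhere near significant; at n = 1,000,000 it is overwhelmingly so, which is why significance alone says little at data mining scale.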

21 The Statistician’s Viewpoint (David Hand 97). Statistics vs. Machine Learning. Statistics: conservative, rigorous, abstract, idealized. Machine Learning: adventurous, engineering, practical, real solutions.

22 Research Challenges. Massive datasets & high dimensionality; user interaction & prior knowledge; overfitting & assessing statistical significance; missing data; understandability of patterns; managing changing data and knowledge; integration; nonstandard, multimedia, object-oriented data.

23 A Database Perspective on Knowledge Discovery. Concept of data mining as a querying process. First steps toward efficient development of knowledge discovery applications.

24 New Research Frontier. Short term: efficient algorithms implementing machine learning tools on top of large databases. Long term: building optimizing compilers for ad hoc queries and embedding queries in application programming interfaces.

25 KDDMS. KDD objects: a rule; a classifier; a clustering. KDD queries: a predicate returning a set of KDD or DB objects.

26 Examples of KDD Query. Generate a classifier. Generate the strongest rule. Generate all rules with consequent attribute values computed by SQL query. Find tuples that belong to the largest cluster.
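
The slides do not define "strongest rule". As one possible reading, the sketch below treats a rule as "if item A is in a transaction, then item B is too" and ranks rules by confidence, the fraction of transactions containing A that also contain B. The function `strongest_rule` and the transactions are hypothetical; a real KDD query language would expose this differently.

```python
from itertools import permutations

def strongest_rule(transactions):
    """Return (antecedent, consequent, confidence) for the highest-confidence single-item rule."""
    items = sorted(set().union(*transactions))
    best = None
    for a, b in permutations(items, 2):
        n_a = sum(1 for t in transactions if a in t)
        n_ab = sum(1 for t in transactions if a in t and b in t)
        confidence = n_ab / n_a  # n_a > 0 because every item occurs somewhere
        if best is None or confidence > best[2]:
            best = (a, b, confidence)
    return best

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
]
print(strongest_rule(transactions))  # ('butter', 'bread', 1.0)
```

The result is itself a KDD object (a rule), not a subset of the stored tuples, which is what distinguishes this kind of query from a standard database query.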

27 Future Directions. KDD applications need development support: query KDD objects; data mining operations (nearest neighbors, clustering). Development of querying tools is a big challenge. Let developers build applications using a KDD query language.

28 Text Data Mining. People’s first thought: make it easier to find things on the Web. But this is information retrieval! The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile, but does not reflect notions of DM in practice: finding patterns across large collections; discovering heretofore unknown information.

29 Real Text DM. What would finding a pattern across a large text collection really look like?

30 From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader) Bill Gates + MS-DOS in the Bible!

31 From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

32 Real Text DM. The point: discovering heretofore unknown information is not what we usually do with text. (If it weren’t known, it could not have been written by someone!) However: there is a field whose goal is to learn about patterns in text for its own sake...

33 Observation Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.

34 TDM using Metadata (instead of Text). Data: Reuters newswire (22,000 articles, late 1980s); categories: commodities, time, countries, people, and topic. Goals: distributions of categories across time (trends); distributions of categories between collections; category co-occurrence (e.g., topic|country). Interactive interface: lists, pie charts, 2D line plots.
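
Category co-occurrence such as topic|country can be estimated by simple counting over article metadata. The sketch below is an assumption-laden illustration: the dictionary layout and the four articles are made up (they are not Reuters data), and the conditional share is computed as the fraction of articles tagged with a country that also carry a topic.

```python
from collections import Counter

def topic_given_country(articles):
    """Map (topic, country) to the share of the country's articles that carry the topic."""
    pair_counts = Counter()
    country_counts = Counter()
    for article in articles:
        for country in set(article["countries"]):
            country_counts[country] += 1
            for topic in set(article["topics"]):
                pair_counts[(topic, country)] += 1
    return {(t, c): n / country_counts[c] for (t, c), n in pair_counts.items()}

# Hypothetical metadata records.
articles = [
    {"countries": ["usa"], "topics": ["grain"]},
    {"countries": ["usa", "japan"], "topics": ["trade"]},
    {"countries": ["usa"], "topics": ["grain", "trade"]},
    {"countries": ["japan"], "topics": ["trade"]},
]
shares = topic_given_country(articles)
print(shares[("trade", "japan")])  # 1.0
```

Grouping the same counts by a time field instead of by country would give the trend distributions mentioned on the slide.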

35 Combining Text with Metadata (images, hyperlinks). Examples: Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford); Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC); Images + Text to improve image search.
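
The link side of finding "authority pages" can be sketched with a hubs-and-authorities iteration in the style of Kleinberg's HITS algorithm. This is a simplified illustration, not the published method in full: it uses links only (no text), runs on a tiny hypothetical graph, and uses a fixed iteration count rather than a convergence test.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) scores for a graph given as {page: [pages it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of the pages linking to it.
        auth = {n: sum(hub[s] for s, targets in links.items() if n in targets) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # A page's hub score is the sum of the authority scores of the pages it links to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Hypothetical link graph: a, b, and d all link to c; d also links to e.
links = {"a": ["c"], "b": ["c"], "d": ["c", "e"]}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # c
print(max(hub, key=hub.get))    # d
```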

36 True Text Data Mining: Don Swanson’s Medical Work. Given: medical titles and abstracts; a problem (incurable rare disease); some medical expertise. Find causal links among titles: symptoms; drugs; results.

37 Swanson Example (1991). Problem: migraine headaches (M). Stress is associated with M; stress leads to loss of magnesium. Calcium channel blockers prevent some M; magnesium is a natural calcium channel blocker. Spreading cortical depression (SCD) is implicated in M; high levels of magnesium inhibit SCD. M patients have high platelet aggregability; magnesium can suppress platelet aggregability. All extracted from medical journal titles.
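
The bookkeeping behind this example can be sketched as a search for intermediate terms that are associated with both the problem and a candidate remedy. The pairs below are hand-coded from the statements on this slide and treated as undirected associations; extracting them from titles and judging their medical plausibility is the hard part and is not attempted here. The function name `linking_terms` is hypothetical.

```python
def linking_terms(associations, source, target):
    """Return the terms associated with both `source` and `target`, sorted by name."""
    neighbors = {}
    for a, b in associations:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return sorted(neighbors.get(source, set()) & neighbors.get(target, set()))

# Associations taken from the slide's statements, one pair per statement.
associations = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]
print(linking_terms(associations, "migraine", "magnesium"))
# ['calcium channel blockers', 'platelet aggregability',
#  'spreading cortical depression', 'stress']
```

No pair connects migraine and magnesium directly; the hypothesis arises only from the four intermediate terms.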

38 Swanson’s TDM. Two of his hypotheses have received some experimental verification. His technique: only partially automated; required medical expertise. Few people are working on this.

39 Conclusions. Currently, what might be construed as Text Data Mining is really Computational Linguistics. Text is tricky to process, but rich and abundant (now). There are many CL tools available. Data Mining directly from text: tells us about language; produces meta-information that may be useful for information access.

40 Conclusions. Information Access != Text Data Mining. IA = finding needle in haystack. TDM = finding patterns or new information. However, Information Access may potentially be served by Text Data Mining techniques: automated metadata assignment; collection overviews. The synthesis of ideas from TDM and IA: perhaps a new field of exploratory data analysis over text!

41 Promising Research Directions. Text Data Mining problems. Patterns within sets of documents: What is the latest in this field? How is this field related to that field? Chains of evidence embedded in text: What drugs have been tested for this symptom? What effects did this funding have on that field? Human use of information over time: How does information diffuse across the web?

42 Needed from Systems. Support for linking chains of associations. Support for combined structured and unstructured data. Support for combining disparate collections.

43 Statistical Themes & Lessons for Data Mining. Statistical themes. Statistical lessons. Cooperation between statistical and computational communities.

44 Overview of Statistical Science. Probability distributions. Estimation, consistency, uncertainty, assumptions, robustness, and model averaging. Hypothesis testing. Model scoring. Markov Chain Monte Carlo. Generalized model classes.

45 Overview of Statistical Sciences. Rational decision making and planning. Inference to causes. Prediction.

46 Important Themes of Statistics to Data Mining. Clarity about goals. Use of models that are reliable means to the goal, and understandable and plausible to users. Sense of the uncertainties of models and predictions.

47 Lessons. Data can lie. Sometimes it’s not what’s in the data that matters. Perversity of the pervasive P-value. Intervention and prediction.

