An Overview and Example of Data Mining


1 An Overview and Example of Data Mining
University of Rhode Island, Department of Computer Science and Statistics
March 30, 2007
Daniel T. Larose, Ph.D.
Professor of Statistics
Director, Data Editor, Wiley Series on Methods and Applications in Data Mining
URI Dept of Computer Science and Statistics - An Overview and Example of Data Mining - Larose

2 Overview
Part One: A Brief Overview of Data Mining
Part Two: An Example of Data Mining: Modeling Response to Direct Mail Marketing
But first, a shameless plug …

3 Master of Science in DM at CCSU: Faculty
Dr. Roger Bilisoly (from Ohio State Univ., Statistics): Text Mining, Intro to Data Mining
Dr. Darius Dziuda (from Warsaw Polytechnic Univ., CS): Data Mining for Genomics and Proteomics, Biomarker Discovery
Dr. Zdravko Markov (from Sofia Univ., CS): Data Mining (CS perspective), Machine Learning
Dr. Daniel Miller (from UConn, Statistics): Applied Multivariate Analysis, Mathematical Statistics II, Intro to Data Mining
Dr. Krishna Saha (from Univ. of Windsor, Statistics): Intro to Data Mining using R
Dr. Daniel Larose, Program Director (from UConn, Statistics): Intro to Data Mining, Data Mining Methods, Applied Data Mining, Web Mining

4 Master of Science in DM at CCSU: Program (36 credits)
Core Courses (27 credits). All available online.
Stat 521 Introduction to Data Mining (4 cr)
Stat 522 Data Mining Methods (4 cr)
Stat 523 Applied Data Mining (4 cr)
Stat 525 Web Mining
Stat 526 Data Mining for Genomics and Proteomics
Stat 527 Text Mining
Stat 416 Mathematical Statistics II
Stat 570 Applied Multivariate Analysis
Electives (6 credits; choose two)
CS 570 Topics in Artificial Intelligence: Machine Learning
CS 580 Topics in Advanced Database: Data Mining
Stat 455 Experimental Design
Stat 551 Applied Stochastic Processes
Stat 567 Linear Models
Stat 575 Mathematical Statistics III
Stat 529 Current Issues in Data Mining
Capstone Requirement: Stat 599 Thesis (3 credits)

5 Master of Science in DM at CCSU
Only MS in DM that is entirely online. Some courses available on campus.
Student must come to CCSU to present thesis.
We reach students in about 30 US states and a dozen foreign countries.
Half of our students already have master’s degrees.
About 15% already have Ph.D.’s.
Typical student is a mid-career professional.
Backgrounds are diverse: Computer Science, Engineering, Finance, Chemistry, Database Admin, Statistics, etc.

6 Graduate Certificate in Data Mining
18 credits.
Required Courses (12 credits)
Stat 521 Introduction to Data Mining
Stat 522 Data Mining Methods and Models
Stat 523 Applied Data Mining
Elective Courses (6 credits; choose two)
Stat 525 Web Mining
Stat 526 Data Mining for Genomics and Proteomics
Stat 527 Text Mining
Stat 529 Current Issues in Data Mining
Some other graduate-level data mining or statistics course, with approval of advisor.
No Mathematical Statistics requirement.

7 Material for Part I Drawn From: Discovering Knowledge in Data: An Introduction to Data Mining (Wiley, 2005)
Chapter 1. An Introduction to Data Mining
Chapter 2. Data Preprocessing
Chapter 3. Exploratory Data Analysis
Chapter 4. Statistical Approaches to Estimation and Prediction
Chapter 5. K-Nearest Neighbor
Chapter 6. Decision Trees
Chapter 7. Neural Networks
Chapter 8. Hierarchical and K-Means Clustering
Chapter 9. Kohonen Networks
Chapter 10. Association Rules
Chapter 11. Model Evaluation Techniques

8 Material for Part II Drawn From: Data Mining Methods and Models (Wiley, 2006)
Chapter 1. Dimension Reduction Methods
Chapter 2. Regression Modeling
Chapter 3. Multiple Regression and Model Building
Chapter 4. Logistic Regression
Chapter 5. Naïve Bayes Classification and Bayesian Networks
Chapter 6. Genetic Algorithms
Chapter 7. Case Study: Modeling Response to Direct-Mail Marketing

9 No Material Drawn From: Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage (Wiley, April 2007)
Part One: Web Structure Mining. Information Retrieval and Web Search; Hyperlink-Based Ranking.
Part Two: Web Content Mining. Clustering; Evaluating Clustering; Classification.
Part Three: Web Usage Mining. Data Preprocessing, Exploratory Data Analysis, Association Rules, Clustering, and Classification for Web Usage Mining.
With Dr. Zdravko Markov, Computer Science, CCSU.

10 Call for Book Proposals Wiley Series on Methods and Applications in Data Mining
Suggested topics:
Data Mining in Bioinformatics
Emerging Techniques in Data Mining (e.g., SVM)
Data Mining with Evolutionary Algorithms
Drug Discovery Using Data Mining
Mining Data Streams
Visual Analysis in Data Mining
Books in press:
Data Mining for Genomics and Proteomics, by Darius Dziuda
Practical Text Mining Using Perl, by Roger Bilisoly
Contact Series Editor at

11 What is Data Mining?
“Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”
David Hand, Heikki Mannila & Padhraic Smyth, Principles of Data Mining, MIT Press, 2001

12 Why Data Mining?
“We are drowning in information but starved for knowledge.”
John Naisbitt, Megatrends, 1984.
“The problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge, and thence up the taxonomy tree into wisdom.”
Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2005.

13 Need for Human Direction
Automation is no substitute for human supervision and input.
Humans need to be actively involved at every phase of the data mining process.
“Rather than asking where humans fit into data mining, we should instead inquire about how we may design data mining into the very human process of problem solving.”
Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2005.

14 “Data Mining is Easy to Do Badly”
Black-box software with powerful, “easy-to-use” data mining algorithms makes their misuse dangerous.
It is too easy to point and click your way to disaster.
What is needed:
An understanding of the underlying algorithmic and statistical model structures.
An understanding of which algorithms are most appropriate in which situations and for which types of data.

15 CRISP-DM: Cross-Industry Standard Process for Data Mining

16 CRISP: DM as a Process
Business / Research Understanding Phase: Enunciate your objectives.
Data Understanding Phase: EDA.
Data Preparation Phase: Preprocessing.
Modeling Phase: Fun and interesting!
Evaluation Phase: Confluence of results? Objectives met?
Deployment Phase: Use results to solve problem.
If desired: Use lessons learned to reformulate business / research objective.

17 What About Data Dredging?
“A sufficiently exhaustive search will certainly throw up patterns of some kind. Many of these patterns will simply be a product of random fluctuations, and will not represent any underlying structure.”
David J. Hand, Data Mining: Statistics and More? The American Statistician, May 1998.

18 Guarding Against Data Dredging: Cross-Validation is the Key
Partition the data into a training set and a test set.
If the pattern shows up in both data sets, this decreases the probability that it represents noise.
More generally, one may use n-fold cross-validation.
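
The partitioning idea can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the talk: the function names `k_fold_indices` and `train_test_splits` are hypothetical, and the records are represented simply by their index positions.

```python
import random


def k_fold_indices(n_records, k, seed=0):
    """Split record indices 0..n_records-1 into k disjoint folds of near-equal size."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_records, k, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set exactly once."""
    folds = k_fold_indices(n_records, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical usage: 10 records, 5 folds.
splits = list(train_test_splits(n_records=10, k=5))
```

A pattern that holds on the training indices of most splits and also on the corresponding held-out test indices is less likely to be a product of random fluctuation.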

19 Inference and Huge Data Sets
Hypothesis testing becomes sensitive at the huge sample sizes prevalent in data mining applications.
Even very tiny effects will be found significant.
So, data mining tends to de-emphasize inference.
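
A small worked example may make the sample-size effect concrete. The sketch below assumes a one-sample two-sided z-test with known standard deviation; the effect size of 0.01 standard deviations and the function name `two_sided_p_value` are illustrative choices, not values from the talk.

```python
import math


def two_sided_p_value(effect, sd, n):
    """Two-sided p-value of a one-sample z-test for an observed mean shift `effect`."""
    z = effect * math.sqrt(n) / sd
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# The same tiny effect, tested at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(effect=0.01, sd=1.0, n=n))
```

The shift is identical in every run; only the number of records changes. With a hundred records it is nowhere near significant, while with a million records it is overwhelmingly so, even though its practical importance has not changed.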

20 Need for Transparency and Interpretability
Data mining models should be transparent.
Results should be interpretable by humans.
Decision trees are transparent.
Neural networks tend to be opaque.
If a customer complains about why he/she was turned down for credit, we should be able to explain why, without saying “Our neural net said so.”

21 Part Two: Modeling Response to Direct Mail Marketing
Business Understanding Phase:
Clothing store purchase data.
Results of a direct mail marketing campaign.
Task: Construct a classification model for classifying customers as either responders or non-responders to the marketing campaign, to reduce costs and increase return on investment.

22 Data Understanding: The Clothing Store dataset
List of fields in the dataset (28,799 customers, 51 fields)

23 Data Preparation and EDA Phase
Not covered in this presentation.

24 Modeling Strategy
Apply principal components analysis to address multicollinearity.
Apply cluster analysis. Briefly profile clusters.
Balance the training data set.
Establish baseline model performance in terms of expected profit per customer contacted.
Apply classification algorithms to the training data set: CART, C5.0 (C4.5), neural networks, logistic regression.

25 Modeling Strategy continued
Evaluate each model using the test data set.
Apply misclassification costs in line with the cost-benefit table.
Apply overbalancing as a surrogate for misclassification costs. Find the best overbalancing proportion.
Combine predictions from the four models: using model voting, and using mean response probabilities.

26 Principal Components Analysis (PCA)
Multicollinearity does not degrade prediction accuracy, but it muddles individual predictor coefficients.
Interested in predictor characteristics, customer profiling, etc.? Then PCA is required.
But if interested solely in classification (prediction, estimation), PCA is not strictly required.

27 Report Two Model Sets
Model Set A: Includes principal components. All-purpose model set.
Model Set B: Includes correlated predictors, not principal components. Use restricted to classification.

28 Principal Components Analysis (PCA)
Seven correlated variables.
Two components extracted, accounting for 87% of variability.

29 Principal Components Analysis (PCA)
Principal Component 1: Purchasing Habits
Customer general purchasing habits.
Expect component to be strongly indicative of response.
Principal Component 2: Promotion Contacts
Unclear whether component will be associated with response.
Components validated by test data set.

30 BIRCH Clustering Algorithm
Requires only one pass through the data set.
Scalable for large data sets.
Benefit: Analyst need not pre-specify the number of clusters.
Drawback: Sensitive to initial records encountered. This leads to widely variable cluster solutions and requires an “outer loop” to find a consistent cluster solution.
Zhang, Ramakrishnan and Livny, BIRCH: A New Data Clustering Algorithm and Its Applications, Data Mining and Knowledge Discovery 1, 1997.

31 BIRCH Clusters
Cluster 3 shows:
Higher response for flag predictors
Higher averages for numeric predictors

32 BIRCH Clusters
Cluster 3 has highest response rate (red).

33 Balancing the Data
For “rare” classes, balancing provides a more equitable distribution.
Drawback: Loss of data.
Here, 40% of non-responders are randomly omitted; all responders are retained.
Proportion of responders increases from 16.58% to 24.76%.
Test data set should never be balanced.
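
The arithmetic behind balancing can be sketched as follows. The function names are hypothetical, and the calculation assumes that exactly the stated fraction of non-responders is dropped while every responder is kept.

```python
def responder_share_after_drop(responder_share, drop_fraction):
    """Responder proportion after randomly dropping a fraction of non-responders."""
    kept_non_responders = (1.0 - responder_share) * (1.0 - drop_fraction)
    return responder_share / (responder_share + kept_non_responders)


def drop_fraction_for_target(responder_share, target_share):
    """Fraction of non-responders to drop so that responders reach `target_share`."""
    # Solve responder_share / (responder_share + kept) = target_share for kept.
    kept_non_responders = responder_share * (1.0 - target_share) / target_share
    return 1.0 - kept_non_responders / (1.0 - responder_share)


balanced_share = responder_share_after_drop(responder_share=0.1658, drop_fraction=0.40)
drop_for_half = drop_fraction_for_target(responder_share=0.1658, target_share=0.50)
```

With the slide's inputs, the first function gives a responder share of roughly 24.9%, in the neighbourhood of the 24.76% reported. An exact match is not expected, since the slide's percentages come from the actual partitioned data rather than from this idealized formula.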

34 False Positive vs. False Negative: Which is Worse?
For direct mail marketing, a false negative error is probably worse than a false positive.
Generate misclassification costs based on the observed data.
Construct a cost-benefit table.

35 Decision Cost / Benefit Analysis
Columns: Outcome; Classified; Actual; Cost; Rationale
True Negative; classified No; actual No; cost $0; no contact made, no revenue lost
True Positive; classified Yes; actual Yes; cost -$26.40; (anticipated revenue) - (cost of contact)
False Negative; classified No; actual Yes; cost $28.40; loss of anticipated revenue
False Positive; classified Yes; actual No; cost $2.00; cost of contact
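
The cost figures above can be turned into a profit-per-customer measure. The sketch below is an assumption about how such a figure might be computed, treating profit as the negative of average cost; the function name and the customer counts (1,000 customers, 160 responders) are hypothetical and are not taken from the case study.

```python
# Costs per outcome, as quoted on the slide (negative cost = profit).
COSTS = {"TN": 0.00, "TP": -26.40, "FN": 28.40, "FP": 2.00}


def profit_per_customer(counts, costs=COSTS):
    """Average profit per customer, taken as the negative of average cost."""
    total_customers = sum(counts.values())
    total_cost = sum(costs[outcome] * n for outcome, n in counts.items())
    return -total_cost / total_customers


# Hypothetical data set: 1,000 customers, of whom 160 would respond.
send_to_everyone = {"TP": 160, "FP": 840}  # every customer is classified "Yes"
send_to_no_one = {"FN": 160, "TN": 840}    # every customer is classified "No"

profit_everyone = profit_per_customer(send_to_everyone)
profit_no_one = profit_per_customer(send_to_no_one)
```

Under these made-up counts, mailing everyone yields a positive profit per customer while mailing no one yields a loss, because each missed responder is charged the full false negative cost.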

36 Establish Baseline Model Performance
Benchmarks:
“Don’t Send a Marketing Promotion to Anyone” model
“Send a Marketing Promotion to Everyone” model
Will compare candidate models against this baseline error rate.

37 Model Set A (With 50% Balancing)
No model beats the benchmark of $2.63 profit per customer. Misclassification costs had not been applied.
Now define FN cost = $28.40, FP cost = $2.
Outperformed baseline “Send to everyone” model.

38 Model Set A: Effect of Misclassification Costs
For the 447 highlighted records:
Only 20.8% responded, but the model predicts positive response.
This is due to the high false negative misclassification cost.

39 Model Set A: PCA Component 1 is Best Predictor
First principal component ($F-PCA-1), Purchasing Habits, represents both the root node split and the secondary split.
Most important factor for predicting response.

40 Over-Balancing as a Surrogate for Misclassification Costs
Software limitation: Neural network and logistic regression models in Clementine lack methods for applying misclassification costs.
Over-balancing is an alternate method which can achieve similar results.
It starves the classifier of instances of non-response.

41 Over-Balancing as a Surrogate for Misclassification Costs
Neural network model results:
Three over-balanced models outperform baseline.
Properly applied, over-balancing can be used as a surrogate for misclassification costs.

42 Over-Balancing as a Surrogate for Misclassification Costs
Apply 80% - 20% over-balancing to the other models.

43 Combination Models: Voting
Smooths out strengths and weaknesses of each model.
Each model supplies a prediction for each record.
Count the votes for each record.
Disadvantage of combination models: Lack of easy interpretability.
Four competing combination models…

44 Combination Models: Voting
Mail a promotion only if:
All four models predict response. Protects against false positives: all four classification algorithms must agree on a positive prediction.
At least three models predict response.
At least two models predict response.
Any model predicts response. Protects against false negatives.
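
The four voting rules above differ only in how many positive votes they require, which can be sketched as a single function. The name `mail_by_vote` and the example votes are hypothetical.

```python
def mail_by_vote(predictions, min_votes):
    """Mail a promotion if at least `min_votes` models predict response."""
    return sum(predictions) >= min_votes


# Hypothetical predictions for one customer from the four models
# (CART, C5.0, neural network, logistic regression): True = predicts response.
votes = [True, False, True, True]

# min_votes = 4 is the unanimous rule; min_votes = 1 is the "any model" rule.
decisions = {k: mail_by_vote(votes, k) for k in (4, 3, 2, 1)}
```

Raising `min_votes` makes the combined model more conservative (fewer false positives); lowering it makes the model more liberal (fewer false negatives).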

45 Combination Models: Voting
None beat the logistic regression model: $2.96 profit per customer.
Perhaps combination models will do better with Model Collection B…

46 Model Collection B: Non-PCA Models
Models retain correlated variables.
Use restricted to prediction only.
Since the correlated variables are highly predictive, expect that Collection B will outperform the PCA models.

47 Model Collection B: CART and C5.0
Using misclassification costs and 50% balancing.
Both models outperform the best PCA model.

48 Model Collection B: Over-Balancing
Apply over-balancing as a surrogate for misclassification costs for all models.
Best performance thus far.

49 Combination Models: Voting
Combine the four models via voting and 80%-20% over-balancing.
Synergy: Combination model outperforms any individual model.

50 Combining Models Using Mean Response Probabilities
Combine the confidences that each model reports for its decisions.
Allows finer tuning of the decision space.
Derive a new variable, Mean Response Probability (MRP): the average of the response confidences of the four models.
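
A minimal sketch of the averaging step follows. It assumes each model's output has already been converted into a probability of response between 0 and 1; how the case study performs that conversion is not shown here. The function names and the four example probabilities are hypothetical.

```python
def mean_response_probability(response_probs):
    """Average the response probabilities reported by the individual models."""
    return sum(response_probs) / len(response_probs)


def should_mail(response_probs, cutoff=0.5):
    """Mail a promotion if the mean response probability reaches the cutoff."""
    return mean_response_probability(response_probs) >= cutoff


# Hypothetical response probabilities for one customer from four models.
probs = [0.9, 0.4, 0.6, 0.3]
mrp = mean_response_probability(probs)
votes_for_response = sum(p >= 0.5 for p in probs)
```

In this made-up case only two of the four models vote for response, so an "at least three" voting rule would not mail, yet the mean response probability is above one half. Averaging keeps the strength of each model's opinion, which is what allows the finer tuning of the decision space.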

51 Combining Models Using Mean Response Probabilities
The multi-modality of the MRP distribution is due to the discontinuity of the transformation used in the derivation of MRP.

52 Combining Models Using Mean Response Probabilities
Where shall we define response vs. non-response?
Recall that FN is 14.2 times worse than FP.
Set partitions on the low side => fewer FN decisions are made.
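
One way to search for a partition is to score each candidate cutoff by the profit it would produce on held-out data. The sketch below assumes that approach; the function names, the scores, and the response flags are all hypothetical, and only the three cost figures come from the cost table quoted in the slides.

```python
FN_COST, FP_COST, TP_COST = 28.40, 2.00, -26.40  # cost figures from the slides


def profit_at_cutoff(scores, responded, cutoff):
    """Profit per customer when mailing everyone whose score is >= cutoff."""
    total_cost = 0.0
    for score, actual in zip(scores, responded):
        mailed = score >= cutoff
        if mailed and actual:
            total_cost += TP_COST   # true positive
        elif mailed:
            total_cost += FP_COST   # false positive
        elif actual:
            total_cost += FN_COST   # false negative
        # True negatives cost nothing.
    return -total_cost / len(scores)


def best_cutoff(scores, responded, candidates):
    """Return the candidate cutoff with the highest profit per customer."""
    return max(candidates, key=lambda c: profit_at_cutoff(scores, responded, c))


# Hypothetical mean response probabilities and actual outcomes for six customers.
scores = [0.9, 0.7, 0.55, 0.4, 0.2, 0.1]
responded = [True, False, True, False, False, False]
chosen = best_cutoff(scores, responded, candidates=[0.3, 0.5, 0.8])
```

The ratio `FN_COST / FP_COST` is the 14.2 quoted on the slide. In this toy data the highest cutoff misses a responder and is penalized heavily for it, which illustrates why partitions set too high are costly.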

53 Combining Models Using Mean Response Probabilities
Optimal partition: near 50%.
Mail a promotion to a prospective customer only if the mean response probability is at least 50%.
Best model in case study.
MRP = 0.51 $ profit “send to everyone” $2.62 profit 20.7% profit enhancement (54.44 cents)

54 Summary
For more on this case study, see Data Mining Methods and Models (Wiley, 2006).
So, the best part about all this is: Data mining is fun!
If you love to play with data, and you love to construct and evaluate models, then data mining is for you.

