Data Mining Methodology

Why have a Methodology
 Don't want to learn things that aren't true
 What is learned may not represent any underlying reality
○ Spurious correlation (see the sketch below)
○ May not be statistically significant, or may be statistically significant but coincidental
○ Because data mining makes fewer assumptions about the data and searches a richer hypothesis space, this is a big issue
 Model overfitting is an issue
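As a minimal sketch of the spurious-correlation risk (assuming NumPy and SciPy are available; all values are made up for illustration), scanning many unrelated random features against a target will "discover" some statistically significant correlations by chance alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
target = rng.normal(size=100)

# Scan 200 unrelated random "features"; roughly 5% will correlate
# with the target at p < 0.05 by pure chance -- exactly the risk of
# searching a rich hypothesis space.
spurious = 0
for _ in range(200):
    feature = rng.normal(size=100)
    r, p = stats.pearsonr(feature, target)
    if p < 0.05:
        spurious += 1

print(f"{spurious} of 200 random features look 'significant' at p < 0.05")
```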

Why have a Methodology II
 Data may not reflect the relevant population
○ Data mining normally assumes the training data matches the test and score data
 Quick overview of how data is used for DM (see the split sketch below)
○ Training set: used to build the model
○ Validation set: used to tune the model or select among alternative models
○ Test set: used to evaluate the model and report its quality; for prediction tasks, the test set must have the "answer"
○ The model is eventually applied to the score set, which, for predictive tasks, does not have the answer
○ Evaluation must always occur on data not used to build, tune, or select the model
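A minimal sketch of carving labeled data into training, validation, and test sets, assuming scikit-learn and using synthetic data as a stand-in for a real labeled set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real labeled data.
X, y = make_classification(n_samples=1000, random_state=42)

# Hold out the test set first, then split the remainder into
# training and validation (60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```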

Why have a Methodology III
 Do not want to learn things that are not useful
○ May be already known
○ May not be actionable

Hypothesis Testing
 Data mining is not usually used for hypothesis testing
○ B&L do not really say this
○ The typical assumption is that the data is already collected and you have little influence on the process
○ Data may be in a data warehouse
○ You usually do not modify the scenarios for collecting the data or the parameters
○ Experimental design is not part of data mining
○ Active learning is related to this, where you carefully select the data to learn from (see the sketch below)
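A minimal uncertainty-sampling sketch of the active-learning idea, assuming scikit-learn; the seed size, pool, and query budget are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train on a small labeled seed, then pick the unlabeled points
# the model is least sure about as the next ones to label.
X, y = make_classification(n_samples=500, random_state=0)
labeled = np.arange(20)        # pretend only 20 points are labeled
pool = np.arange(20, 500)      # the rest form the unlabeled pool

model = LogisticRegression().fit(X[labeled], y[labeled])
proba = model.predict_proba(X[pool])[:, 1]
uncertainty = np.abs(proba - 0.5)            # near 0.5 = most uncertain
query = pool[np.argsort(uncertainty)[:10]]   # next 10 points to label
print("Query these indices for labels:", query)
```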

The Methodology (Fayyad)
 According to the article by Fayyad et al., the main steps are:
○ Data Selection
○ Preprocessing
○ Transformation
○ Data Mining
○ Interpretation/Evaluation

The Methodology (B & L)
 According to Berry & Linoff, the main steps are:
○ Translate the business problem into a DM problem
○ Select data
○ Get to know the data
○ Create a model set
○ Fix problems with the data ("preprocess")
○ Transform the data
○ Build models ("data mining")
○ Assess models ("interpret/evaluate")
○ Deploy models
○ Assess results, then start over

Steps in the Process: Selection
 Many of the steps are not very complex, so here are some selective comments.
 Selection:
○ DM usually tries to use all available data
○ This may not be necessary; you can generate learning curves to see how performance varies with increasing amounts of data (see the sketch below)
○ Data mining is not afraid of using lots of variables (unlike statistics), but some data mining methods (especially statistical ones) do have problems with many variables
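A minimal learning-curve sketch with scikit-learn; the estimator and training-size grid are arbitrary choices for illustration. If the curve has flattened, collecting more data is unlikely to help:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Cross-validated accuracy at increasing training-set sizes.
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} examples -> cv accuracy {score:.3f}")
```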

Steps in the Process: Know the Data
 Getting to know the data is always useful and also helps make sure you understand the problem
○ Data visualization can help (see the sketch below)
 Data mining is not really a black box where the computer does all of the work
○ Having or generating good features (variables) is critical, and data visualization can help here too
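A minimal first-look sketch, assuming pandas and matplotlib; the file name customers.csv is hypothetical and any tabular data set would do:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")  # hypothetical file

print(df.describe())      # ranges, means, obvious outliers
print(df.isna().sum())    # missing values per column
df.hist(figsize=(10, 8))  # distribution of each numeric feature
plt.tight_layout()
plt.show()
```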

Steps in the Process: Create Model Set
 Creating a model (training) set
○ Sometimes you may want to form the training set other than by random sampling
○ It is often recommended to balance the classes if they are highly unbalanced; this is not really a good idea or needed, since you can use cost-sensitive learning instead (addressed later, and sketched below)
○ You may want to focus on harder problems; active learning skews the training data, but the purpose is to save effort in manually labeling the training data
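One common form of cost-sensitive learning is to weight errors on the rare class more heavily rather than resampling the training set; a minimal sketch using scikit-learn's class_weight option on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 95/5 class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# The weighted model predicts the minority class far more often.
print("minority predictions, plain:   ", plain.predict(X).sum())
print("minority predictions, weighted:", weighted.predict(X).sum())
```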

Steps in the Process: Create Model Set
 Data sets relevant to data mining
○ Training set: used to build the initial model
○ Validation set: used to either tune the model (e.g., pruning) or select among multiple models
○ Test set: used to evaluate the goodness of the model; for predictive tasks, it must have class labels
○ Score set: the data the model is ultimately built for; for predictive tasks, class labels are not available
 Note that training, validation, and test data all come from the labeled data
 Cross-validation can maximize the use of the labeled data
○ 10-fold cross-validation uses 90% for training and 10% for testing; it entails 10 runs (see the sketch below)
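A minimal 10-fold cross-validation sketch with scikit-learn (the classifier choice is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each of the 10 runs trains on 90% and tests on the remaining 10%,
# so every labeled example is tested exactly once.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(f"10 runs, mean accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```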

Steps in the Process: Fix Data
 Many data mining methods don't need as much variable "fixing" as statistical methods
 Types of fixing (see the sketch below):
○ Missing values: many ways to fix
○ Too many categorical values: reduce via binning, etc.
○ Skewed numerical values: take the log, etc.
 Data preprocessing (Fayyad) may just alter the representation
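A minimal pandas sketch of the three fixes above, on a made-up frame (column names and thresholds are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40_000, None, 85_000, 2_500_000],  # missing + skewed
    "state":  ["NY", "NJ", "CA", "MT"],           # many categories
})

df["income"] = df["income"].fillna(df["income"].median())  # missing values
df["log_income"] = np.log1p(df["income"])                  # un-skew

# Bin rare categorical levels into a catch-all value.
top = df["state"].value_counts().nlargest(2).index
df["state"] = df["state"].where(df["state"].isin(top), "OTHER")
print(df)
```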

Steps in the Process: Transform
 Aggregate data to a higher level
○ Time-series data often must be converted into examples for classification algorithms
○ Example: phone-call data aggregated from the call level to describe the activity associated with a phone number/user (see the sketch below)
 Construction of new features is part of this step, and feature construction can be critical
○ The area of a plot is more useful for estimating the value of a home than its length and width separately
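A minimal pandas sketch of both ideas: aggregating hypothetical call-level records to one row per phone number, and constructing an area feature from length and width (all column names and values are made up):

```python
import pandas as pd

# Call-level records become one example per phone number.
calls = pd.DataFrame({
    "phone":    ["555-0001", "555-0001", "555-0002", "555-0002"],
    "duration": [3.2, 10.5, 1.1, 0.4],
    "intl":     [0, 1, 0, 0],
})
profile = calls.groupby("phone").agg(
    n_calls=("duration", "size"),
    avg_duration=("duration", "mean"),
    intl_fraction=("intl", "mean"),
)
print(profile)

# Constructed feature: area is often more predictive than its parts.
homes = pd.DataFrame({"length": [30, 50], "width": [20, 40]})
homes["area"] = homes["length"] * homes["width"]
print(homes)
```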

Steps in the Process: Assess Model
 Predictive models are assessed based on the correctness of their predictions
○ Accuracy is the simplest measure, but it is often not very useful since not all errors are equal; we will learn more about this later
○ Lift curves are discussed in B&L (p. 81): lift ratio = P(class | sample) / P(class | population)
○ Lift only makes sense when we can be selective, as in direct marketing, where we don't have to judge every response (see the sketch below)
 Descriptive models can be hard to evaluate since there may not be objective criteria
○ How do you tell if a clustering is meaningful?
○ More on assessment methods later
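A minimal sketch of computing top-decile lift from model scores, on synthetic data and scored on the training set purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)
model = LogisticRegression().fit(X, y)

# Lift in the top decile: response rate among the 10% of cases the
# model scores highest, divided by the overall response rate.
scores = model.predict_proba(X)[:, 1]
top = np.argsort(scores)[-len(y) // 10:]
lift = y[top].mean() / y.mean()
print(f"Top-decile lift: {lift:.2f}")
```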

Steps in the Process: Deploy
 Research models are fine to run offline, whenever we want
 In a business, you must deal with real-world issues
○ In the WISDM project, we want to classify activities in real time; this is also needed for many fraud-detection models
○ Must be able to execute the model and do it quickly, possibly on different hardware; some tools allow you to export the model as code (see the sketch below)
○ Even in offline evaluation, you may need to handle huge amounts of data
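One common deployment pattern is to persist the fitted model so a separate scoring process (possibly on different hardware) can load it and score new records quickly; a minimal sketch with joblib, where the file name is hypothetical:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression().fit(X, y)

# Persist the fitted model for a separate scoring process.
joblib.dump(model, "churn_model.joblib")  # hypothetical file name

scorer = joblib.load("churn_model.joblib")
print(scorer.predict(X[:5]))
```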

Steps in the Process: Assess Results
 True assessment is not just of the model; it includes the business context
○ Takes into account all costs and benefits
○ This may include costs that are very hard to quantify: how much does a false negative medical test cost if it causes the patient to die of a preventable disease? (see the sketch below)
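A minimal sketch of folding an entirely illustrative cost matrix into evaluation, so that false negatives are penalized far more heavily than false positives:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative cost matrix: rows = actual, cols = predicted.
# A false negative (missed disease) is assumed far costlier
# than a false positive (unnecessary follow-up test).
costs = np.array([[0,    100],   # actual negative: TN, FP
                  [5000, 0]])    # actual positive: FN, TP

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # [[TN, FP], [FN, TP]]
total_cost = (cm * costs).sum()
print(f"Total cost of these predictions: {total_cost}")
```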

Steps in the Process: Iterate
 Data mining is an iterative process; iteration can occur between most of the steps
○ Example: you don't like the overall results, so you add another feature, then assess its impact to see if you should keep it
○ Example: you realize that the assessment of your model does not make sense and is missing some costs, so you incorporate these costs into the model