An Improved Categorization of Classifiers Sensitivity on Sample Selection Bias Wei Fan Ian Davidson Bianca Zadrozny Philip S. Yu.

Slides:



Advertisements
Similar presentations
Wei Fan Ed Greengrass Joe McCloskey Philip S. Yu Kevin Drummey
Advertisements

Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions Kun Zhang,
A Fully Distributed Framework for Cost-sensitive Data Mining Wei Fan, Haixun Wang, and Philip S. Yu IBM T.J.Watson, Hawthorne, New York Salvatore J. Stolfo.
A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees Wei Fan, IBM T.J.Watson Joe McCloskey, US Department.
On Sample Selection Bias and Its Efficient Correction via Model Averaging and Unlabeled Examples Wei Fan Ian Davidson.
A Framework for Scalable Cost- sensitive Learning Based on Combining Probabilities and Benefits Wei Fan, Haixun Wang, and Philip S. Yu IBM T.J.Watson Salvatore.
Pruning and Dynamic Scheduling of Cost-sensitive Ensembles Wei Fan, Haixun Wang, and Philip S. Yu IBM T.J.Watson, Hawthorne, New York Fang Chu UCLA, Los.
On the Optimality of Probability Estimation by Random Decision Trees Wei Fan IBM T.J.Watson.
Actively Transfer Domain Knowledge Xiaoxiao Shi Wei Fan Jiangtao Ren Sun Yat-sen University IBM T. J. Watson Research Center Transfer when you can, otherwise.
Systematic Data Selection to Mine Concept Drifting Data Streams Wei Fan IBM T.J.Watson.
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois.
Is Random Model Better? -On its accuracy and efficiency-
Making Data Mining Models Useful to Model Non-paying Customers of Exchange Carriers Wei Fan, IBM T.J.Watson Janek Mathuria, and Chang-tien Lu Virginia.
Decision Tree Evolution using Limited number of Labeled Data Items from Drifting Data Streams Wei Fan 1, Yi-an Huang 2, and Philip S. Yu 1 1 IBM T.J.Watson.
ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias Wei Fan IBM T.J.Watson Ian Davidson SUNY Albany.
Type Independent Correction of Sample Selection Bias via Structural Discovery and Re-balancing Jiangtao Ren Xiaoxiao Shi Wei Fan Philip S. Yu.
Type Independent Correction of Sample Selection Bias via Structural Discovery and Re-balancing Jiangtao Ren 1 Xiaoxiao Shi 1 Wei Fan 2 Philip S. Yu 2 1.
When Efficient Model Averaging Out-Perform Bagging and Boosting Ian Davidson, SUNY Albany Wei Fan, IBM T.J.Watson.
The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 4: An Introduction To Data Mining (II) Johannes Gehrke
Entropy Estimation and Applications to Decision Trees.
Data Mining Lecture 9.
Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions Presentor:
Decision Tree Approach in Data Mining
Learning Algorithm Evaluation
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Other Classification Techniques 1.Nearest Neighbor Classifiers 2.Support Vector Machines.
Ensemble Methods An ensemble method constructs a set of base classifiers from the training data Ensemble or Classifier Combination Predict class label.
CMPUT 466/551 Principal Source: CMU
Evaluation.
Assuming normally distributed data! Naïve Bayes Classifier.
On Appropriate Assumptions to Mine Data Streams: Analyses and Solutions Jing Gao† Wei Fan‡ Jiawei Han† †University of Illinois at Urbana-Champaign ‡IBM.
1 Confidence Interval for a Mean. 2 Given A random sample of size n from a Normal population or a non Normal population where n is sufficiently large.
Basic Data Mining Techniques Chapter Decision Trees.
Evaluation.
Sample Selection Bias Lei Tang Feb. 20th, Classical ML vs. Reality  Training data and Test data share the same distribution (In classical Machine.
Data mining and statistical learning - lecture 13 Separating hyperplane.
1 Machine Learning: Lecture 5 Experimental Evaluation of Learning Algorithms (Based on Chapter 5 of Mitchell T.., Machine Learning, 1997)
CSCI 347 / CS 4206: Data Mining Module 06: Evaluation Topic 01: Training, Testing, and Tuning Datasets.
Basic Data Mining Techniques
Inductive learning Simplest form: learn a function from examples
by B. Zadrozny and C. Elkan
Introduction to machine learning and data mining 1 iCSC2014, Juan López González, University of Oviedo Introduction to machine learning Juan López González.
Data Mining: Classification & Predication Hosam Al-Samarraie, PhD. Centre for Instructional Technology & Multimedia Universiti Sains Malaysia.
1 CS 391L: Machine Learning: Experimental Evaluation Raymond J. Mooney University of Texas at Austin.
Combining multiple learners Usman Roshan. Bagging Randomly sample training data Determine classifier C i on sampled data Goto step 1 and repeat m times.
Ensembles. Ensemble Methods l Construct a set of classifiers from training data l Predict class label of previously unseen records by aggregating predictions.
ISQS 6347, Data & Text Mining1 Ensemble Methods. ISQS 6347, Data & Text Mining 2 Ensemble Methods Construct a set of classifiers from the training data.
Chapter 6 Classification and Prediction Dr. Bernard Chen Ph.D. University of Central Arkansas.
What is a Confidence Interval?. Sampling Distribution of the Sample Mean The statistic estimates the population mean We want the sampling distribution.
COP5992 – DATA MINING TERM PROJECT RANDOM SUBSPACE METHOD + CO-TRAINING by SELIM KALAYCI.
Weka Just do it Free and Open Source ML Suite Ian Witten & Eibe Frank University of Waikato New Zealand.
Classification Ensemble Methods 1
Data Mining and Decision Support
Classification and Prediction: Ensemble Methods Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
Competition II: Springleaf Sha Li (Team leader) Xiaoyan Chong, Minglu Ma, Yue Wang CAMCOS Fall 2015 San Jose State University.
1 Systematic Data Selection to Mine Concept-Drifting Data Streams Wei Fan Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery.
Chapter 18 Section 1 – 3 Learning from Observations.
Learning From Observations Inductive Learning Decision Trees Ensembles.
Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Chapter 3 Basic Data Mining Techniques Jason C. H. Chen, Ph.D. Professor of MIS School of Business.
PEP-PMMA Training Session Statistical inference Lima, Peru Abdelkrim Araar / Jean-Yves Duclos 9-10 June 2007.
DATA MINING TECHNIQUES (DECISION TREES ) Presented by: Shweta Ghate MIT College OF Engineering.
Predicting Mortgage Pre-payment Risk. Introduction Definition Borrower pays off the loan before the contracted term loan length. Lender loses future part.
Classification and Prediction
In-Class Exercise: Discrete Distributions
الأستاذ المساعد بقسم المناهج وطرق التدريس
Introduction to Data Mining, 2nd Edition
iSRD Spam Review Detection with Imbalanced Data Distributions
Classification and Prediction
IDSL, Intelligent Database System Lab
Decision Trees for Mining Data Streams
Learning from Data Streams
Presentation transcript:

An Improved Categorization of Classifiers Sensitivity on Sample Selection Bias Wei Fan Ian Davidson Bianca Zadrozny Philip S. Yu

What is sample selection bias? Inductive learning: training data (x,y) is sampled from the universe of examples. In many applications: training data (x,y) is not sampled randomly. Insurance and mortgage data: you only know those people you give a policy. School data: self-select There are different possibilities of how (x,y) is selected (Zadrozny04) S=1 denotes (x,y) is chosen. S is independent from x and y. Total random sample. S is dependent on y not x. Class bias S is dependent on x not on y. Feature bias. S is dependent on both x and y. Both class and feature.

Important Problem It is very hard to guarantee random sample for many real-world applications. Heckman received Nobel Prize for his two- step approach on regression methods. Many recent related work such as Bianca Zadrozny04 Andrew Smith and Charles Elkan04. etc

Feature Bias P(s=1|x,y) = P(s=1|x) Bias conditional on x But not directly conditional on y. Example: Survey data Loan approval. Question: Given two modeling techniques M1 and M2 Which one is more sensitive on feature bias? Sensitive: constructed model and accuracy changes significantly as a result of feature bias.

Our paper shows this Most classifier algorithm can be sensitive or insensitive to feature bias. P(y|x) is the true probability distribution, which is unknown for most problems P(y|x,M) is the estimated probability by model M. The dependency on M is none-trivial. Insensitive if the model is the correct model or asymptotically P(y|x,M) = P(y|x) Sensitive if the model is the incorrect model or P(y|x,M) != P(y|x)

Correct and Incorrect Model

Correct Model

Incorrect/Correct Models

Result on Decision Tree

Practical Implication Given a realistic dataset, you most likely will never know its true model either before or after data mining. Given a modeling technique, you will most likely not know if it will be or will not be the true model. Reality is: you dont know if it will be sensitive or insensitive to sample selection bias. Long paper on request.