An Improved Categorization of Classifiers Sensitivity on Sample Selection Bias
Wei Fan, Ian Davidson, Bianca Zadrozny, Philip S. Yu
What is sample selection bias?
Inductive learning: training data (x,y) is sampled from the universe of examples. In many applications, the training data (x,y) is not sampled randomly.
Insurance and mortgage data: you only observe the people to whom you gave a policy.
School data: students self-select.
There are different possibilities for how (x,y) is selected (Zadrozny04). S=1 denotes that (x,y) is chosen.
S is independent of x and y: totally random sample.
S depends on y but not on x: class bias.
S depends on x but not on y: feature bias.
S depends on both x and y: both class and feature bias.
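The four selection mechanisms can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the universe (x uniform on [0,1] with P(y=1|x) = x), the selection probabilities, and the names `selection_probability`, `draw_sample`, and `positive_rate` are all hypothetical and do not come from the slides.

```python
import random

def selection_probability(x, y, mechanism):
    """Hypothetical P(s=1 | x, y) for each of the four mechanisms."""
    if mechanism == "random":    # s independent of x and y
        return 0.5
    if mechanism == "class":     # s depends on y only
        return 0.8 if y == 1 else 0.2
    if mechanism == "feature":   # s depends on x only
        return 0.8 if x > 0.5 else 0.2
    if mechanism == "both":      # s depends on x and y
        return 0.8 if (x > 0.5 and y == 1) else 0.2
    raise ValueError(f"unknown mechanism: {mechanism}")

def draw_sample(universe, mechanism, rng):
    """Keep each (x, y) with the probability given by the mechanism."""
    return [(x, y) for x, y in universe
            if rng.random() < selection_probability(x, y, mechanism)]

def positive_rate(examples):
    """Fraction of examples with y = 1."""
    return sum(y for _, y in examples) / len(examples)

rng = random.Random(0)

# Assumed universe: x uniform on [0, 1], and P(y=1 | x) = x.
universe = []
for _ in range(50_000):
    x = rng.random()
    y = 1 if rng.random() < x else 0
    universe.append((x, y))

rates = {m: positive_rate(draw_sample(universe, m, rng))
         for m in ("random", "class", "feature", "both")}

print("universe:", round(positive_rate(universe), 3))
for mechanism, rate in rates.items():
    print(mechanism, round(rate, 3))
```

In this toy setup the random sample should keep roughly the class balance of the universe, while the class-biased sample over-represents y=1.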
Important Problem
It is very hard to guarantee a random sample in many real-world applications. Heckman received the Nobel Prize for his two-step approach for regression methods. There is much recent related work, such as Bianca Zadrozny04 and Andrew Smith and Charles Elkan04.
Feature Bias
P(s=1|x,y) = P(s=1|x): the bias is conditional on x, but not directly conditional on y.
Examples: survey data, loan approval.
Question: given two modeling techniques M1 and M2, which one is more sensitive to feature bias?
Sensitive: the constructed model and its accuracy change significantly as a result of feature bias.
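One consequence of the feature-bias condition can be worked out with Bayes' rule. The derivation below is a short sketch added for illustration and is not taken from the slides. It assumes P(s=1|x) > 0 for the x of interest.

```latex
\begin{align*}
P(y \mid x, s=1)
  &= \frac{P(s=1 \mid x, y)\, P(y \mid x)}{P(s=1 \mid x)} \\
  &= \frac{P(s=1 \mid x)\, P(y \mid x)}{P(s=1 \mid x)}
     && \text{(feature bias: } P(s=1 \mid x, y) = P(s=1 \mid x)\text{)} \\
  &= P(y \mid x).
\end{align*}
```

So the conditional distribution of y given x is the same in the selected sample as in the universe. What changes is the distribution of x itself, since in general P(x|s=1) differs from P(x).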
Our paper shows this
Most classifier algorithms can be either sensitive or insensitive to feature bias.
P(y|x) is the true probability distribution, which is unknown for most problems. P(y|x,M) is the probability estimated by model M. The dependency on M is non-trivial.
Insensitive: the model is the correct model, or asymptotically P(y|x,M) = P(y|x).
Sensitive: the model is the incorrect model, or P(y|x,M) != P(y|x).
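The distinction between a correct and an incorrect model can be sketched with a toy experiment. Everything here is an assumption made for illustration, not the paper's experiment: a least-squares line fitted to the 0/1 labels stands in for model M, the two true distributions (P(y=1|x) = 0.2 + 0.6x and P(y=1|x) = x^2) are invented, and so are the selection rule and the names `fit_line` and `slope_shift`. The line is the correct model for the first distribution and an incorrect model for the second.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def slope_shift(true_prob, n=200_000, seed=0):
    """Slope fitted on a feature-biased sample minus slope on the full data."""
    rng = random.Random(seed)
    universe, biased = [], []
    for _ in range(n):
        x = rng.random()                      # x uniform on [0, 1]
        y = 1 if rng.random() < true_prob(x) else 0
        universe.append((x, y))
        keep = 0.9 if x > 0.5 else 0.1        # selection depends on x only
        if rng.random() < keep:
            biased.append((x, y))
    _, slope_full = fit_line(universe)
    _, slope_biased = fit_line(biased)
    return slope_biased - slope_full

# Linear truth: the fitted line is the correct model.
shift_correct = slope_shift(lambda x: 0.2 + 0.6 * x)
# Quadratic truth: the fitted line is an incorrect model.
shift_incorrect = slope_shift(lambda x: x * x)

print("slope shift, correct model:  ", round(shift_correct, 3))
print("slope shift, incorrect model:", round(shift_incorrect, 3))
```

Under these assumptions the fitted slope should barely move for the linear truth and move noticeably for the quadratic truth, which matches the insensitive and sensitive cases described above.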
Correct and Incorrect Model
Result on Decision Tree
Practical Implication
Given a realistic dataset, you will most likely never know its true model, either before or after data mining. Given a modeling technique, you will most likely not know whether it is the true model. The reality is that you don't know whether it will be sensitive or insensitive to sample selection bias.
Long paper on request.