# An Improved Categorization of Classifiers Sensitivity on Sample Selection Bias

Wei Fan, Ian Davidson, Bianca Zadrozny, Philip S. Yu



What is sample selection bias?

Inductive learning assumes that the training data (x,y) is sampled randomly from the universe of examples. In many applications, the training data (x,y) is not sampled randomly. Insurance and mortgage data: you only know about the people to whom you gave a policy. School data: students self-select.

There are different possibilities for how (x,y) is selected (Zadrozny04). S=1 denotes that (x,y) is chosen.

- S is independent of x and y: total random sample.
- S depends on y but not on x: class bias.
- S depends on x but not on y: feature bias.
- S depends on both x and y: both class and feature bias.
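The four selection cases can be sketched as different forms of P(S=1|x,y). This is a minimal illustration only: the function names `selection_probability` and `draw_sample`, the 0.5 threshold on x, and every numeric probability are assumptions made for the example, not values from the presentation.

```python
import random

def selection_probability(x, y, kind):
    """Hypothetical P(S=1 | x, y) for each of the four selection cases."""
    if kind == "random":    # S independent of x and y
        return 0.5
    if kind == "class":     # S depends on y, not on x
        return 0.8 if y == 1 else 0.2
    if kind == "feature":   # S depends on x, not on y
        return 0.8 if x > 0.5 else 0.2
    if kind == "both":      # S depends on both x and y
        return 0.9 if (x > 0.5 and y == 1) else 0.1
    raise ValueError(f"unknown selection kind: {kind}")

def draw_sample(universe, kind, rng):
    """Keep each (x, y) pair with probability P(S=1 | x, y)."""
    return [(x, y) for x, y in universe
            if rng.random() < selection_probability(x, y, kind)]

# Usage: a made-up universe with a single feature x in [0, 1] and a binary label y.
rng = random.Random(0)
universe = [(rng.random(), rng.randint(0, 1)) for _ in range(1000)]
feature_biased = draw_sample(universe, "feature", rng)
```

Under the "feature" case the sample over-represents examples with large x, yet for any fixed x the chance of being selected is the same for both labels.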

Important Problem

It is very hard to guarantee a random sample in many real-world applications. Heckman received the Nobel Prize for his two-step approach to regression methods. Recent related work includes Zadrozny04 and Smith and Elkan04, among others.

Feature Bias

P(s=1|x,y) = P(s=1|x): the bias is conditional on x but not directly conditional on y. Examples: survey data, loan approval.

Question: given two modeling techniques M1 and M2, which one is more sensitive to feature bias? Sensitive means that the constructed model and its accuracy change significantly as a result of feature bias.
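One consequence of the definition P(s=1|x,y) = P(s=1|x) is that, for a fixed x, Bayes' rule leaves the class distribution of the selected examples equal to P(y|x). The sketch below checks this with exact fractions. The function name `p_y_given_x_selected` and all probabilities are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

def p_y_given_x_selected(p_y_given_x, p_s_given_xy):
    """P(y | x, s=1) for one fixed x, computed with Bayes' rule.

    p_y_given_x:  dict mapping y -> P(y | x)
    p_s_given_xy: dict mapping y -> P(s=1 | x, y)
    """
    joint = {y: p_s_given_xy[y] * p for y, p in p_y_given_x.items()}
    total = sum(joint.values())  # this is P(s=1 | x)
    return {y: v / total for y, v in joint.items()}

# Made-up true distribution at some fixed x.
true_p = {0: Fraction(3, 10), 1: Fraction(7, 10)}

# Feature bias: selection does not depend on y once x is fixed.
feature = p_y_given_x_selected(true_p, {0: Fraction(1, 5), 1: Fraction(1, 5)})

# Class bias, for contrast: selection depends on y.
class_biased = p_y_given_x_selected(true_p, {0: Fraction(1, 5), 1: Fraction(4, 5)})
```

Here `feature` equals `true_p`, while `class_biased` does not. Feature bias therefore changes which x values appear in the training data but not the label distribution at each x.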

Our paper shows this

Most classifier algorithms can be either sensitive or insensitive to feature bias.

- P(y|x) is the true probability distribution, which is unknown for most problems.
- P(y|x,M) is the probability estimated by model M. The dependency on M is non-trivial.
- Insensitive if the model is the correct model, that is, asymptotically P(y|x,M) = P(y|x).
- Sensitive if the model is the incorrect model, that is, P(y|x,M) != P(y|x).
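The distinction between a correct and an incorrect model can be illustrated with a toy calculation of what each model would estimate given unlimited data. Everything here is an assumption made for illustration: the two-region universe, the selection probabilities, and the two model families (`region_model`, which conditions on the region of x, and `constant_model`, which ignores x). It is not the experiment from the paper.

```python
from fractions import Fraction as F

# Hypothetical universe: two equally likely regions of x with different P(y=1|x).
P_X = {"A": F(1, 2), "B": F(1, 2)}
P_Y1_GIVEN_X = {"A": F(1, 5), "B": F(4, 5)}

def selected_p_x(p_s_given_x):
    """P(x | s=1) under feature bias with selection probabilities P(s=1|x)."""
    joint = {x: P_X[x] * p_s_given_x[x] for x in P_X}
    total = sum(joint.values())
    return {x: v / total for x, v in joint.items()}

def region_model(p_s_given_x):
    """Asymptotic P(y=1 | x, M) for a model that conditions on the region.

    Assumes P(s=1|x) > 0 for every region.
    """
    return {x: (P_X[x] * p_s_given_x[x] * P_Y1_GIVEN_X[x]) / (P_X[x] * p_s_given_x[x])
            for x in P_X}

def constant_model(p_s_given_x):
    """Asymptotic P(y=1 | M) for a model that ignores x entirely."""
    p_x = selected_p_x(p_s_given_x)
    return sum(p_x[x] * P_Y1_GIVEN_X[x] for x in P_X)

unbiased = {"A": F(1, 2), "B": F(1, 2)}   # same selection rate everywhere
biased = {"A": F(9, 10), "B": F(1, 10)}   # region A heavily over-sampled

print(region_model(unbiased) == region_model(biased))     # True
print(constant_model(unbiased), constant_model(biased))   # 1/2 13/50
```

In this toy setting the region model has the same form as the true P(y|x), so its estimate is unchanged by the bias. The constant model cannot represent P(y|x), and its estimate moves from 1/2 to 13/50 once region A is over-sampled.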

Correct and Incorrect Model

Correct Model

Incorrect/Correct Models

Result on Decision Tree

Practical Implication

Given a realistic dataset, you will most likely never know its true model, either before or after data mining. Given a modeling technique, you will most likely not know whether it is the true model. The reality is that you don't know whether it will be sensitive or insensitive to sample selection bias.

Long paper on request.
