
Slide 1: Handling Imbalanced Datasets in Multistage Classification
Great Workshop, La Palma - June 2011
Mauro López (mauro@cab.inta-csic.es)
Centro de Astrobiología - Madrid (ex-LAEFF)

Slide 2: Problem
● Real-world classification problems deal with imbalanced datasets
● Classifiers are usually biased towards the majority class

Slide 3: Problem: Misclassification Cost
● Most of the literature assumes that the minority class is more important
● Misclassification cost is usually considered less important for the majority class
● E.g., breast cancer detection

Slide 4: Problem: Astronomy
● But in star classification, misclassification costs are the same for every class
● A class with very few instances can be very well represented

Slide 5: Problem: Not Only the Classifiers
● Feature selection, discretization, and other preprocessing filters suffer from the same problem

Slide 6: Multistage Classifier
● Several advantages:
  ● Specialized classifiers
  ● Better selection of relevant features
  ● Combination of classification methods
● But there is a drawback:
  ● It worsens the imbalance problem (see the sketch below)
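
The talk's multistage nodes use Weka's J48 (see slides 21-24); as a hypothetical illustration of the cascade idea, here is a minimal two-node sketch in Python, with scikit-learn's DecisionTreeClassifier standing in for J48 (the class structure is illustrative, not the talk's actual hierarchy):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier  # stand-in for Weka's J48

    class TwoStageCascade:
        """Node 1 separates first_class from 'Other'; node 2 refines 'Other'."""

        def __init__(self, first_class):
            self.first_class = first_class
            self.node1 = DecisionTreeClassifier()   # specialized per-node classifier
            self.node2 = DecisionTreeClassifier()

        def fit(self, X, y):
            # Stage 1: one class vs. everything else lumped into "Other".
            y1 = np.where(y == self.first_class, self.first_class, "Other")
            self.node1.fit(X, y1)
            # Stage 2: trained only on the remaining classes.
            rest = y != self.first_class
            self.node2.fit(X[rest], y[rest])
            return self

        def predict(self, X):
            pred = self.node1.predict(X).astype(object)
            other = pred == "Other"
            if other.any():                          # route "Other" instances downward
                pred[other] = self.node2.predict(X[other])
            return pred

Because each node lumps every remaining class into "Other", the per-node imbalance ratio can be far worse than the global one, which is the drawback noted above.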

Slide 7: Evaluation

Slide 8: Evaluation
● Most used measure in classification: accuracy
● Accuracy = (TP + TN) / (TP + TN + FP + FN)
● We cannot say a classifier is good just by looking at the accuracy
● Example: when classifying a training set composed of 1000 instances labeled as A and 1 instance labeled as B, it is easy to get an "outstanding" 99.9%
● It can still be useful for comparing classifiers
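
A minimal sketch of exactly that pitfall (the 1000:1 example above), using scikit-learn's accuracy_score:

    from sklearn.metrics import accuracy_score

    y_true = ["A"] * 1000 + ["B"]           # 1000 instances of A, 1 of B
    y_pred = ["A"] * 1001                   # always predict the majority class

    print(accuracy_score(y_true, y_pred))   # 0.999 -> an "outstanding" 99.9%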

Slide 9: Evaluation
● Summarize performance over a range of tradeoffs between true-positive and false-positive error rates
● Useful if FN and FP errors have different costs
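
Assuming the tradeoff summary meant here is a ROC curve (the slide does not name it explicitly), a minimal scikit-learn sketch on made-up scores:

    from sklearn.metrics import roc_curve, roc_auc_score

    y_true   = [0, 0, 0, 0, 1, 1]                 # 1 = positive (minority) class
    y_scores = [0.1, 0.3, 0.35, 0.8, 0.6, 0.9]    # illustrative classifier scores

    # Sweep the decision threshold to get one (FPR, TPR) point per cutoff.
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    print(roc_auc_score(y_true, y_scores))        # area under the ROC curve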

Slide 10: Evaluation
● The main goal for imbalanced datasets is to improve the recall without decreasing the precision
● F-value combines both measures (β is usually set to 1)
● Precision = TP / (TP + FP)
● Recall = TP / (TP + FN)
● F = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)
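
The same measures computed with scikit-learn, on toy labels chosen so the counts are easy to check by hand:

    from sklearn.metrics import precision_score, recall_score, fbeta_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # one miss (FN) and one false alarm (FP)

    print(precision_score(y_true, y_pred))       # TP / (TP + FP) = 2/3
    print(recall_score(y_true, y_pred))          # TP / (TP + FN) = 2/3
    print(fbeta_score(y_true, y_pred, beta=1))   # F-value with beta = 1 (the F1 score)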

Slide 11: Solutions: Undersampling
● (Random) removal of instances belonging to the majority class
● Problem: we can lose important instances
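
A minimal sketch of random undersampling with NumPy (random_undersample is a hypothetical helper written for this note, not a library function):

    import numpy as np

    def random_undersample(X, y, majority_label, seed=0):
        """Randomly drop majority-class instances until the classes are balanced.

        Hypothetical helper for illustration; the dropped instances are chosen
        blindly, which is exactly how important ones can be lost.
        """
        rng = np.random.default_rng(seed)
        maj = np.flatnonzero(y == majority_label)
        mino = np.flatnonzero(y != majority_label)
        keep = rng.choice(maj, size=mino.size, replace=False)  # random removal
        idx = np.concatenate([keep, mino])
        return X[idx], y[idx]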

Slide 12: Solutions: Oversampling
● Instances belonging to the minority class are replicated
● Problems: possible overfitting; does not increase the decision region for the class
● Advantage: fast
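
The mirror-image sketch for random oversampling (again a hypothetical helper, not a library function):

    import numpy as np

    def random_oversample(X, y, minority_label, seed=0):
        """Replicate minority-class instances (with replacement) until balanced."""
        rng = np.random.default_rng(seed)
        mino = np.flatnonzero(y == minority_label)
        maj = np.flatnonzero(y != minority_label)
        extra = rng.choice(mino, size=maj.size - mino.size, replace=True)
        idx = np.concatenate([np.arange(y.size), extra])  # originals + exact copies
        return X[idx], y[idx]

The added instances are exact duplicates, which is where both the overfitting risk and the unchanged decision region come from.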

Slide 13: Solutions: SMOTE
● Synthetic Minority Oversampling Technique
● Generates new instances by combining existing ones
● No overfitting
● Forces the minority class to be more general (broader decision region)
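
A sketch using the imbalanced-learn library (not mentioned in the talk); each synthetic point is an interpolation x + u · (neighbor - x) between a minority instance and one of its nearest minority neighbors, with u drawn from [0, 1]:

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Toy imbalanced dataset: roughly 95% majority, 5% minority.
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

    # Interpolated (not duplicated) minority instances broaden the decision region.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)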

Slide 14: SMOTE - Warning
● "Real stupidity beats artificial intelligence every time." — Terry Pratchett (Hogfather)
● RV vs. ALL
● Extreme imbalance ratio: 331.2
● Can it really be so good?

Slide 15: RV vs. ALL

Slide 16: RV Smotified

Slide 17: Solutions: Adding Weights
● Does not remove important examples
● Does not overfit
● But needs algorithms prepared to manage weights
● 10-fold CV can be tricky
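
A sketch of the weighting approach with scikit-learn: class_weight="balanced" reweights classes inversely to their frequencies, and stratified folds (which preserve the class ratio in every fold) are one way to keep 10-fold CV honest on imbalanced data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

    # No instances are removed or replicated, but the algorithm must support weights.
    clf = DecisionTreeClassifier(class_weight="balanced")

    # Stratified folds keep the 9:1 ratio in each of the 10 folds.
    print(cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10)))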

Slide 18: Solutions: Boosting
● Creates an ensemble of weak classifiers, reweighting hard-to-classify instances at each round
● Maintains accuracy over the entire dataset
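
The talk does not name a specific algorithm; assuming classic AdaBoost, a scikit-learn sketch (the default weak learner is a depth-1 decision tree, and each round upweights the instances the previous round got wrong):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

    # 50 weak learners; hard (often minority) instances gain weight each round.
    clf = AdaBoostClassifier(n_estimators=50).fit(X, y)
    print(clf.score(X, y))   # accuracy over the entire (training) dataset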

Slide 19: Experiment
● Hipparcos dataset
● 1661 instances
● 47 attributes + class
● 23 classes

Slide 20: Multistage Hierarchy
● Imbalance ratio

Slide 21: Experiment - J48
● Node 1: LPV vs. Other
● Imbalance ratio: 4.3
● Good classification in spite of the imbalance
● Low margin for improvement

Slide 22: Experiment - J48
● Node 3: Eclipsing vs. Other
● Imbalance ratio: 1.33
● When the dataset is balanced, adding new instances does not improve the classification

Slide 23: Experiment
● Node 5: GDOR vs. Other
● Imbalance ratio: 28.07

Slide 24: Experiment
● Node 11: SPB+ACV vs. Other
● Imbalance ratio: 3.8

Slide 25: Results
● Using a balanced dataset improves the classification by about 10%
● Feature selection is especially affected by the imbalance

Slide 26: Thank You
● Time to wake up

