Intelligent Database Systems Lab, National Yunlin University of Science and Technology. Advisor: Dr. Hsu. Presenter: Chien-Shing Chen. Authors: Gustavo E. A. P. A. Batista, Ronaldo C. Prati, Maria Carolina Monard. "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data," SIGKDD Explorations, 2004.
Outline: Motivation, Objective, Introduction, k-NN, 10 Methods, Experimental Results, Conclusions, Personal Opinion
Motivation: Class imbalance causes significant losses of performance in standard classifiers.
Objective: A broad experimental evaluation of 10 methods for dealing with the class imbalance problem, using 13 UCI data sets.
Introduction: When there is a large imbalance between the majority class and the minority class, combined with some degree of class overlapping, a classifier may incorrectly classify many cases from the minority class because the nearest neighbors of these cases are examples belonging to the majority class.
Introduction: For instance, it is straightforward to create a classifier having an accuracy of 99% in a domain where the majority class proportion corresponds to 99% of the examples, simply by forecasting every new example as belonging to the majority class.
Introduction: The area under the ROC curve (AUC) represents the expected performance as a single scalar; it is equivalent to the Wilcoxon rank test.
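Because AUC is equivalent to the Wilcoxon rank statistic, it can be computed directly as the fraction of (positive, negative) pairs in which the positive example receives the higher score. A minimal sketch (the function name and example scores are illustrative, not from the paper):

```python
def auc_rank(pos_scores, neg_scores):
    """AUC as the Wilcoxon/Mann-Whitney statistic: the probability that a
    randomly chosen positive example is ranked above a randomly chosen
    negative one; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A perfect ranker gets AUC = 1.0; a random one averages 0.5.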
Methods: Implement the k-NN algorithm using the Heterogeneous Value Difference Metric (HVDM) distance function: Euclidean distance for quantitative attributes, VDM distance for qualitative attributes.
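The HVDM combination above can be sketched as follows. This is a simplified version under stated assumptions (normalizing numeric differences by four standard deviations, q = 2 for VDM, and all helper names are illustrative, not the authors' exact implementation):

```python
import math
from collections import Counter

def vdm(attr_values, labels, u, v, q=2):
    """Simplified VDM distance between two qualitative values u and v:
    sum over classes of |P(class | u) - P(class | v)|^q."""
    count_u = Counter(c for a, c in zip(attr_values, labels) if a == u)
    count_v = Counter(c for a, c in zip(attr_values, labels) if a == v)
    n_u = sum(count_u.values()) or 1
    n_v = sum(count_v.values()) or 1
    return sum(abs(count_u[c] / n_u - count_v[c] / n_v) ** q
               for c in set(labels))

def hvdm(x, y, numeric, attr_columns, labels, sds):
    """Heterogeneous distance: a normalized Euclidean term for quantitative
    attributes, a VDM term for qualitative ones."""
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if numeric[a]:
            # normalize numeric differences by 4 standard deviations
            total += ((xa - ya) / (4 * sds[a])) ** 2
        else:
            total += vdm(attr_columns[a], labels, xa, ya)
    return math.sqrt(total)
```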
Method (1) Random over-sampling: random replication of minority class examples.
Method (2) Random under-sampling: random elimination of majority class examples.
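Methods (1) and (2) amount to a few lines each; a minimal sketch (function names and the choice to balance classes to equal size are assumptions, not fixed by the paper):

```python
import random

def random_oversample(majority, minority, rng=None):
    """Method (1): replicate randomly drawn minority examples until the
    classes have equal size."""
    rng = rng or random.Random(0)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, rng=None):
    """Method (2): randomly discard majority examples until the classes
    have equal size."""
    rng = rng or random.Random(0)
    return rng.sample(majority, len(minority)), minority
```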
Method (3) Tomek links: Given two examples E_i and E_j belonging to different classes, with d(E_i, E_j) the distance between them, the pair (E_i, E_j) is called a Tomek link if there is no example E_l such that d(E_i, E_l) < d(E_i, E_j) or d(E_j, E_l) < d(E_i, E_j). Examples participating in a Tomek link are either (1) borderline or (2) noise. As an under-sampling method, only the majority class example is eliminated; as a data cleaning method, examples of both classes are eliminated.
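The condition above says that E_i and E_j are each other's nearest neighbors while belonging to different classes. A brute-force O(n²) sketch (`dist` is any distance function, such as the HVDM mentioned earlier; names are illustrative):

```python
def tomek_links(X, y, dist):
    """Return index pairs (i, j) with i < j that form Tomek links:
    different classes, and each example is the other's nearest neighbor."""
    n = len(X)

    def nearest(i):
        return min((j for j in range(n) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    links = []
    for i in range(n):
        j = nearest(i)
        if y[i] != y[j] and nearest(j) == i and i < j:
            links.append((i, j))
    return links
```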
Method (4) Condensed Nearest Neighbor Rule (CNN): finds a consistent subset of examples, eliminating the majority class examples that are distant from the decision border. A subset Ê of E is consistent with E if a 1-NN classifier over Ê correctly classifies the examples in E. As an under-sampling method: (1) randomly draw one majority class example and all examples from the minority class and put them in Ê; (2) use a 1-NN over the examples in Ê to classify the examples in E; (3) move every misclassified example from E to Ê.
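The three steps above can be sketched as a single pass (a simplification: Hart's original CNN iterates until consistency, whereas the under-sampling variant described here makes one pass; function and parameter names are assumptions):

```python
def cnn_undersample(X, y, minority_label, dist, seed_index=0):
    """CNN sketch: start the subset S with all minority examples plus one
    majority example, then move into S every example that the 1-NN rule
    over S misclassifies."""
    majority = [i for i, c in enumerate(y) if c != minority_label]
    S = [i for i, c in enumerate(y) if c == minority_label]
    S.append(majority[seed_index])

    def predict(i):
        # 1-NN over the current subset S
        j = min(S, key=lambda k: dist(X[i], X[k]))
        return y[j]

    for i in range(len(X)):
        if i not in S and predict(i) != y[i]:
            S.append(i)
    return sorted(S)
```

Majority examples far from the decision border are classified correctly by the growing subset and therefore dropped.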
Method (5) One-sided selection (OSS): an under-sampling method resulting from the application of Tomek links followed by the application of CNN, removing noisy and borderline majority class examples.
Method (6) CNN + Tomek links: similar to OSS, but the method to find the consistent subset is applied before the Tomek links. Since finding Tomek links is computationally demanding, searching for them in the reduced data set makes this method computationally cheaper.
Method (7) Neighborhood Cleaning Rule (NCL): uses Wilson's Edited Nearest Neighbor (ENN) rule, which removes examples whose class label differs from the class of the majority of their k nearest neighbors.
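Wilson's ENN rule referenced above can be sketched as follows (brute-force neighbor search, k = 3 as in the usual formulation; names are illustrative):

```python
from collections import Counter

def enn_clean(X, y, dist, k=3):
    """ENN sketch: keep only examples whose class agrees with the majority
    class among their k nearest neighbors."""
    keep = []
    for i in range(len(X)):
        neighbors = sorted((j for j in range(len(X)) if j != i),
                           key=lambda j: dist(X[i], X[j]))[:k]
        majority_class = Counter(y[j] for j in neighbors).most_common(1)[0][0]
        if y[i] == majority_class:
            keep.append(i)
    return keep
```

In the example below, the lone class-1 point at 0.15 sits inside the class-0 cluster, so ENN removes it.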
Method (8) Synthetic Minority Over-sampling Technique (Smote): its main idea is to form new minority class examples by interpolating between minority class examples that lie close together, causing the decision boundaries for the minority class to spread further into the majority class space.
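The interpolation idea can be sketched as follows (Euclidean nearest neighbors and k = 5 as in the original Smote; the number of synthetic examples and all names are assumptions):

```python
import random

def smote(minority, n_new, k=5, rng=None):
    """Smote sketch: create synthetic minority examples by interpolating
    between a minority example and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((m for m in minority if m is not x),
                           key=lambda m: sum((a - b) ** 2
                                             for a, b in zip(x, m)))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # random point on the segment between x and nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Each synthetic point lies on the line segment between two real minority examples, which is what pushes the minority decision boundary outward.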
Method (9) Smote + Tomek links: applies Smote to over-sample the minority class and then removes Tomek links from the over-sampled training set as a data cleaning step.
Method (10) Smote + ENN: ENN removes more examples than Tomek links do, so it is expected to provide a more in-depth data cleaning.
Experimental Evaluation: C4.5, a symbolic learning algorithm, is used to induce decision trees on 13 UCI data sets.
Experimental Evaluation: Unpruned decision trees obtained better results.
Conclusion: Among the 10 methods compared for dealing with imbalance, Smote + Tomek and Smote + ENN might be applied to data sets with a small number of positive instances. For data sets with a large number of positive examples, the Random over-sampling method, which is less expensive than the other methods, would produce meaningful results. Performance is evaluated with ROC curves (AUC).
Personal Opinion: Drawback: the analysis should go deeper and be applied carefully. Application: what alternative methods are there? Future Work: the methods are easy to implement.