
1 Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. Advisor: Dr. Hsu. Presenter: Chien-Shing Chen. Authors: Gustavo E. A. P. A. Batista, Ronaldo C. Prati, Maria Carolina Monard, "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data," SIGKDD Explorations, 2004.

2 Outline: Motivation; Objective; Introduction; k-NN; 10 methods; Experimental Results; Conclusions; Personal Opinion.

3 Motivation: class imbalance causes significant losses of performance in standard classifiers.

4 Objective: a broad experimental evaluation of 10 methods for dealing with the class imbalance problem, over 13 UCI data sets.

5 Introduction: when there is a large imbalance between the majority class and the minority class, and the classes present some degree of overlapping, a classifier may incorrectly classify many cases from the minority class because the nearest neighbors of these cases are examples belonging to the majority class.

6 Introduction: for instance, it is straightforward to create a classifier having an accuracy of 98% in a domain where the majority class proportion corresponds to 98% of the examples, by simply forecasting every new example as belonging to the majority class.
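The arithmetic above can be checked in a few lines; a minimal sketch, assuming a toy data set with 98 majority and 2 minority examples (labels and counts are illustrative, not from the paper):

```python
# A degenerate classifier that always predicts the majority class
# scores an accuracy equal to the majority class proportion, here 98%,
# while never finding a single minority example.
labels = ["maj"] * 98 + ["min"] * 2
predictions = ["maj"] * len(labels)  # forecast every example as majority

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
minority_hits = sum(p == y for p, y in zip(predictions, labels) if y == "min")

print(accuracy)       # 0.98
print(minority_hits)  # 0
```

This is exactly why plain accuracy is misleading on imbalanced data and why the slides turn to ROC/AUC instead.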

7 Introduction (slide figure not transcribed).

8 Introduction: the area under the ROC curve (AUC) represents the expected performance as a single scalar; it is equivalent to the Wilcoxon test of ranks.
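The equivalence the slide mentions can be made concrete: AUC equals the probability that a randomly drawn positive example receives a higher score than a randomly drawn negative one, which is the normalised Mann-Whitney/Wilcoxon rank-sum statistic. A minimal sketch with illustrative scores (not from the paper):

```python
import itertools

# Count pairwise "wins" of positive over negative scores; ties count
# one half. The normalised count is the AUC.
pos_scores = [0.9, 0.8, 0.4]  # classifier scores for positive examples
neg_scores = [0.7, 0.3, 0.2]  # classifier scores for negative examples

wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in itertools.product(pos_scores, neg_scores)
)
auc = wins / (len(pos_scores) * len(neg_scores))
print(auc)  # 8 of 9 pairs ranked correctly -> 0.888...
```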

9 Methods: implement the k-NN algorithm.

10 Methods: implement the k-NN algorithm using the Heterogeneous Value Difference Metric (HVDM) distance function: Euclidean distance for quantitative attributes and VDM distance for qualitative attributes.
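A minimal sketch of an HVDM-style distance under the slide's description: range-normalised Euclidean terms for quantitative attributes and VDM terms for qualitative ones. Function names, the toy data, and the simplifications (e.g. passing attribute ranges explicitly) are ours, not the paper's:

```python
import math
from collections import Counter

def vdm(column, classes, v1, v2, q=2):
    """Value Difference Metric between two qualitative values: compare
    the class distributions conditioned on each value."""
    count = Counter(column)
    joint = Counter(zip(column, classes))
    return sum(
        abs(joint[(v1, c)] / count[v1] - joint[(v2, c)] / count[v2]) ** q
        for c in set(classes)
    )

def hvdm(x, y, data, classes, nominal, ranges):
    """Distance between mixed-attribute examples x and y."""
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if j in nominal:
            column = [row[j] for row in data]
            total += vdm(column, classes, a, b)
        else:
            total += ((a - b) / ranges[j]) ** 2  # range-normalised term
    return math.sqrt(total)

data = [[1.0, "a"], [2.0, "a"], [3.0, "b"], [4.0, "b"]]
classes = [1, 0, 1, 1]
print(hvdm(data[0], data[2], data, classes, nominal={1}, ranges={0: 3.0}))
```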

11 Method (1): Random over-sampling. Random replication of minority class examples.

12 Method (2): Random under-sampling. Random elimination of majority class examples.
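The two random baselines from slides 11 and 12 can be sketched in a few lines (function names and the toy data are ours): over-sampling replicates randomly chosen minority examples, under-sampling randomly discards majority examples, until the classes are balanced.

```python
import random
from collections import Counter

def random_oversample(majority, minority, seed=0):
    """Replicate random minority examples until the classes balance."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random majority subset of the minority class's size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

majority = [("maj", i) for i in range(10)]
minority = [("min", i) for i in range(2)]
over = random_oversample(majority, minority)
under = random_undersample(majority, minority)
print(Counter(label for label, _ in over))   # 10 of each class
print(Counter(label for label, _ in under))  # 2 of each class
```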

13 Method (3): Tomek links. Given two examples Ei and Ej belonging to different classes, with d(Ei, Ej) the distance between them, the pair (Ei, Ej) is called a Tomek link if there is no example El such that d(Ei, El) < d(Ei, Ej) or d(Ej, El) < d(Ei, Ej). An example participating in a Tomek link is either (1) borderline or (2) noise. Used as an under-sampling method, only the majority class example of each link is eliminated; used as a data cleaning method, examples of both classes are eliminated.
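The definition above transcribes directly into a predicate; a minimal sketch on toy 1-D data (data and function name are ours):

```python
# (Ei, Ej) is a Tomek link when the two examples have different classes
# and no third example El is closer to either of them than they are to
# each other.
def is_tomek_link(i, j, points, labels, dist):
    if labels[i] == labels[j]:
        return False
    d_ij = dist(points[i], points[j])
    return not any(
        l not in (i, j)
        and (dist(points[i], points[l]) < d_ij or dist(points[j], points[l]) < d_ij)
        for l in range(len(points))
    )

# The 0/1 pair at 0.0 and 1.0 forms a link; the pair at 1.0 and 5.0
# does not, because 0.0 is closer to 1.0 than 5.0 is.
points = [0.0, 1.0, 5.0]
labels = [0, 1, 0]
dist = lambda a, b: abs(a - b)
print(is_tomek_link(0, 1, points, labels, dist))  # True
print(is_tomek_link(1, 2, points, labels, dist))  # False
```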

14 Method (4): Condensed Nearest Neighbor Rule (CNN). Finds a consistent subset of examples, eliminating the examples from the majority class that are distant from the decision border. A subset E' is consistent with E if a 1-NN classifier over E' correctly classifies the examples in E. As an under-sampling method, the algorithm creates the subset E' from E as follows: 1. Randomly draw one majority class example and all examples from the minority class, and put these examples in E'. 2. Use a 1-NN over the examples in E' to classify the examples in E. 3. Move every misclassified example from E to E'.
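The three steps can be sketched as a single pass over the data (a simplified sketch: names and the toy data are ours, and Hart's full algorithm iterates until no example moves, whereas the slide's description is one pass):

```python
import random

def cnn_subset(points, labels, minority_label, dist, seed=0):
    rng = random.Random(seed)
    # Step 1: all minority examples plus one random majority example.
    subset = {i for i, y in enumerate(labels) if y == minority_label}
    subset.add(rng.choice([i for i, y in enumerate(labels) if y != minority_label]))
    # Steps 2-3: classify E with a 1-NN over E'; move every
    # misclassified example into E'.
    for i in range(len(points)):
        if i not in subset:
            nearest = min(subset, key=lambda j: dist(points[i], points[j]))
            if labels[nearest] != labels[i]:
                subset.add(i)
    return subset

# Toy 1-D data: two minority points and a redundant majority cluster.
points = [0.0, 1.0, 10.0, 10.5, 11.0, 20.0]
labels = ["min", "min", "maj", "maj", "maj", "maj"]
subset = cnn_subset(points, labels, "min", lambda a, b: abs(a - b))
print(sorted(subset))  # keeps all minority examples, drops redundant majority ones
```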

15 Method (5): One-sided selection (OSS). An under-sampling method resulting from the application of Tomek links followed by the application of CNN; it removes noisy and borderline majority class examples.

16 Method (6): CNN + Tomek links. Similar to OSS, but the method to find the consistent subset is applied before the Tomek links. As finding Tomek links is computationally demanding, running it on the smaller condensed data set is computationally cheaper.

17 Method (7): Neighborhood Cleaning Rule (NCL). Uses Wilson's Edited Nearest Neighbor (ENN) rule, which removes every example whose class differs from the class of the majority of its nearest neighbors.
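A minimal sketch of the ENN editing step that NCL builds on, assuming the usual k = 3 (function name and toy data are ours):

```python
from collections import Counter

def enn(points, labels, dist, k=3):
    """Keep only examples whose class agrees with the majority vote of
    their k nearest neighbours."""
    kept = []
    for i in range(len(points)):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )[:k]
        vote = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        if vote == labels[i]:
            kept.append(i)
    return kept

# Toy 1-D data: the lone "b" example at 0.15 sits inside the "a"
# cluster and is edited out.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 0.15]
labels = ["a", "a", "a", "b", "b", "b", "b"]
print(enn(points, labels, lambda a, b: abs(a - b)))  # [0, 1, 2, 3, 4, 5]
```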

18 Method (8): Synthetic Minority Over-sampling Technique (Smote). Its main idea is to form new minority class examples by interpolating between several minority class examples that lie close together, causing the decision boundaries for the minority class to spread further into the majority class space.
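The interpolation idea can be sketched as follows (an illustrative sketch of a single synthetic sample, not the full Smote algorithm; names, k, and the toy data are ours):

```python
import random

def smote_sample(minority, k=2, seed=0):
    """Place a synthetic example at a random point on the segment
    between a minority example and one of its k nearest minority
    neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    neighbours = sorted(
        (p for p in minority if p != base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (8.0, 8.0)]
synthetic = smote_sample(minority)
print(synthetic)  # lies on a segment between two nearby minority points
```

Because the new point sits between two genuine minority examples, repeated sampling fills out the minority region rather than merely duplicating points, which is what pushes the minority decision boundary outward.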

19 Method (9): Smote + Tomek links. After over-sampling with Smote, Tomek links are identified and removed as a data cleaning step applied to examples of both classes.

20 Method (10): Smote + ENN. ENN removes more examples than Tomek links do, so it is expected to provide a more in-depth data cleaning.

21 Experimental Evaluation: the C4.5 symbolic learning algorithm is used to induce decision trees, over 15 UCI data sets.

22-24 Experimental Evaluation (slide tables/figures not transcribed).

25 Experimental Evaluation: unpruned decision trees obtained better results.

26-33 Experimental Evaluation (slide tables/figures not transcribed).
34 Conclusions: 10 methods for dealing with class imbalance were compared. Smote + Tomek or Smote + ENN might be applied to data sets with a small number of positive instances. With a large number of positive examples, the Random over-sampling method, which is less expensive than the other methods, would produce meaningful results. Performance was assessed with ROC curves (AUC).

35 Personal Opinion. Drawback: the analysis could go deeper, and the results should be interpreted carefully. Application: what alternative methods are there? Future work: the methods are easy to implement.

