1 Extreme Re-balancing for SVMs: a case study
Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學)
Advisor: Dr. Hsu
Reporter: Wen-Hsiang Hu
Authors: Bhavani Raskutti and Adam Kowalczyk
Source: SIGKDD Explorations

2 Outline
• Motivation
• Objective
• Related Research
• Support Vector Machines
• Re-balancing of the Data
  ─ Sample Balancing
  ─ Weight Balancing
• Experiments
• Discussion
• Conclusion
• Personal Opinion

3 Motivation
• A standard recipe for two-class discrimination is to take examples from both classes and then generate a model that discriminates between them. However, there are many applications where obtaining examples of the second class is difficult.
  ─ e.g. classifying sites of “interest” to a web surfer
• In other situations the data contain heavily unbalanced numbers of representatives of the two classes of interest.
  ─ e.g. fraud detection and information filtering

4 Objective
• Obtain better performance by using one-class learners

5 Related Research (1/2)
• Many solutions have been proposed to address the imbalance problem, including sampling and weighting examples.
  ─ Typically, these methods focus on cases where the ratio of minority to majority class is around 10:90.
• This paper focuses on extreme imbalance in very high-dimensional input spaces, where at the learning stage the minority class makes up around 1-3% of the data.

6 Related Research (2/2)
• In earlier work on both image retrieval and document classification
  ─ one-class models are much worse than two-class models.
• This paper shows that for certain problems, such as the gene knock-out experiments for understanding the AHR (aryl hydrocarbon receptor) signalling pathway,
  ─ minority one-class SVMs significantly outperform models learnt using examples from both classes.

7 Support Vector Machines (1/4)
• Given a training sequence (x_i, y_i) of binary n-vectors x_i and bipolar labels y_i (each either -1 or +1)
• Our aim is to find a “good” discriminating function
• kernel machine:

8 Support Vector Machines (2/4)

9 Support Vector Machines (3/4)

10 Support Vector Machines (4/4)
• If the kernel k satisfies the assumptions of the Mercer theorem [7; 24; 25], then the minimiser of (2) takes the form [equation not preserved]
• We shall be using the popular polynomial kernel
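The kernel formula itself did not survive in this transcript. As a minimal sketch, assuming the commonly used polynomial kernel k(x, z) = (x · z + 1)^d and a kernel machine written as a weighted sum of kernel evaluations plus a bias, the two pieces can be illustrated in Python as follows. The function names, the offset of 1, and the signed coefficients are illustrative assumptions, not the paper's notation.

```python
def polynomial_kernel(x, z, degree=2, offset=1.0):
    """Assumed form: k(x, z) = (x . z + offset) ** degree."""
    if len(x) != len(z):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + offset) ** degree


def kernel_machine(x, training_points, coefficients, bias=0.0, degree=2):
    """Evaluate f(x) = bias + sum_i c_i * k(x_i, x).

    Each coefficient c_i is assumed to already carry the sign of the
    label y_i; a point with c_i == 0 does not influence the output.
    """
    return bias + sum(
        c * polynomial_kernel(x_i, x, degree)
        for c, x_i in zip(coefficients, training_points)
    )


# Binary feature vectors, as on the slides.
score = kernel_machine(
    [1, 0],
    training_points=[[1, 0], [0, 1]],
    coefficients=[0.5, -0.5],
    degree=1,
)
```

With degree=1 the kernel is linear (up to the constant offset); higher degrees give a non-linear decision function.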

11 Re-balancing of the Data - Sample Balancing
[Figure omitted; the only surviving label is 0:1]

12 Re-balancing of the Data - Weight Balancing
• B is a parameter called the balance factor.
• B = 0 gives the case of “balanced proportions”.
• B = +1 represents learning from positive examples only.
• B = -1 represents learning from negative examples only.
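The weighting equation is missing from this transcript, so the following is only a sketch of one parameterisation that reproduces the three cases listed above. It assumes the per-example regularization cost is C(1 + B) / (2 n_pos) for positive examples and C(1 - B) / (2 n_neg) for negative examples; the function name and this exact form are assumptions, not the paper's Equation (4).

```python
def class_costs(C, B, n_pos, n_neg):
    """Return (cost per positive example, cost per negative example).

    Assumed parameterisation (not taken from the slides):
      B = 0  -> each class receives the same total cost C / 2
      B = +1 -> negative examples get zero cost (positive one-class)
      B = -1 -> positive examples get zero cost (negative one-class)
    """
    if not -1.0 <= B <= 1.0:
        raise ValueError("balance factor B must lie in [-1, 1]")
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one example")
    cost_pos = C * (1.0 + B) / (2.0 * n_pos)
    cost_neg = C * (1.0 - B) / (2.0 * n_neg)
    return cost_pos, cost_neg


# 1% minority class: 10 positives against 990 negatives.
balanced = class_costs(C=10.0, B=0.0, n_pos=10, n_neg=990)
positive_only = class_costs(C=10.0, B=1.0, n_pos=10, n_neg=990)
```

Under this assumption, intermediate values of B move smoothly between two-class and one-class learning without discarding any training data.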

13 Experiments - Real World Data Collections
• AHR data set used for task 2 of KDD Cup 2002
  ─ aryl hydrocarbon receptor data set
  ─ for cancer research
  ─ three classes: change, control, nc
• Reuters data
  ─ 12,902 documents

14 Performance Measures
• We have used AROC, the area under the Receiver Operating Characteristic (ROC) curve, as our main performance measure.
• AROC can be read as the probability that the model scores x_i lower than x_j, where x_i is drawn from the negative class and x_j from the positive class.
• The trivial uniform random predictor has an AROC of 0.5, while a perfect predictor has an AROC of 1.
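The pairwise reading of AROC can be sketched directly: count the fraction of (negative, positive) pairs in which the positive example receives the higher score. Counting ties as half a pair is a common convention and is an assumption here; the function name is illustrative.

```python
def aroc(neg_scores, pos_scores):
    """Area under the ROC curve via pairwise comparison.

    Returns the fraction of (negative, positive) pairs in which the
    negative score is strictly below the positive score, with ties
    counted as half. Runs in O(len(neg) * len(pos)) time.
    """
    if not neg_scores or not pos_scores:
        raise ValueError("need at least one score from each class")
    wins = 0.0
    for s_neg in neg_scores:
        for s_pos in pos_scores:
            if s_neg < s_pos:
                wins += 1.0
            elif s_neg == s_pos:
                wins += 0.5
    return wins / (len(neg_scores) * len(pos_scores))


# Three of the four pairs are ordered correctly.
value = aroc([0.1, 0.4], [0.3, 0.9])
```

Because the measure depends only on how examples are ranked, it is unaffected by the class proportions in the test set, which suits heavily unbalanced data.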

15 Experiments with Real World Data
• The training:test splits of the data were
  ─ 50%:50% for the Reuters data
  ─ 70%:30% for the AHR data

16 Impact of Regularization Constant
[Figure omitted. Curves shown: positive 1-class, balanced 2-class, un-balanced 2-class, negative 1-class]

17 Experiments with Sample Balancing

18 Impact of feature selection (1/2)
• Feature selection methods:
  ─ DocFreq (document frequency thresholding)
  ─ ChiSqua (χ²): measures the lack of independence between a feature and a class of interest
  ─ MutInfo (mutual information)
  ─ InfGain (information gain): a term-goodness measure
• We have used all of the minority cases and sampled the majority cases at different mixture ratios (MajorityOnly sample balancing).
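MajorityOnly sample balancing, as described above, keeps every minority case and subsamples the majority class. A minimal sketch follows, assuming sampling without replacement and a mixture ratio expressed as majority cases per minority case; the function name, the ratio convention, and the fixed seed are illustrative assumptions.

```python
import random


def majority_only_sample(minority, majority, majority_per_minority, seed=0):
    """Keep all minority cases and subsample the majority class.

    majority_per_minority is the desired number of majority cases per
    minority case (0 keeps minority cases only). The sample is drawn
    without replacement and capped at the size of the majority class.
    """
    if majority_per_minority < 0:
        raise ValueError("ratio must be non-negative")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    target = min(len(majority), round(majority_per_minority * len(minority)))
    return list(minority), rng.sample(list(majority), target)


minority_ids = list(range(10))        # 10 minority cases
majority_ids = list(range(100, 1090))  # 990 majority cases
kept_min, kept_maj = majority_only_sample(minority_ids, majority_ids, 1.0)
```

Setting the ratio to 0 corresponds to training on the minority class alone, while a large ratio recovers the original unbalanced sample.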

19 Impact of feature selection (2/2)

20 Experiments with Weight Balancing
• To understand whether the impact of negative examples can be reduced using the balance factor B in Equation (4), two sets of tests were run:
  ─ Tests on AHR data
  ─ Tests on Reuters

21 Tests on AHR data
• B = 0: balanced 2-class
• B = +1: positive 1-class
• B = -1: negative 1-class

22 Tests on Reuters
[Figure omitted. Curves shown: balanced 2-class, positive 1-class]

23 Experiments with Synthetic Data
• S_1: n_inf = 1; n_noise = 999
• S_2: n_inf = 10; n_noise = 990
• S_3: n_inf = 1; n_noise = 19
• Two polynomial kernels: a linear kernel and a non-linear kernel

24 Discussion

25 Conclusion
• The Reuters data set
  ─ One-class learning gives quite good results, but using both classes always produces better results.
• The AHR data set
  ─ Positive one-class learners perform significantly better than two-class learners.
• One-class learning from positive-class examples can be a very robust classification technique when dealing with very unbalanced data and a high-dimensional, noisy feature space.

26 Personal Opinion
• Strength
  ─ many experiments
• Weakness
  ─ equations are not clear
• Application
  ─ SVM document classification
  ─ image retrieval

