
1 Intelligent Database Systems Lab N.Y.U.S.T. I. M.
BNS Feature Scaling: An Improved Representation over TF·IDF for SVM Text Classification
Presenter: Lin, Shu-Han
Author: George Forman (Hewlett-Packard Labs)
Conference on Information and Knowledge Management (CIKM) (2009)

2 Outline
- Motivation
- Objective
- Methodology
- Experiments
- Conclusion
- Comments

3 Motivation
- Multi-class classification: a 1-of-n problem, e.g., choosing a topic category among A, B, C, and D.
- Binary classification: a 1-of-2 problem.
- Every multi-class problem can therefore be decomposed into several binary classification problems, each a positive/negative problem.
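The decomposition into positive/negative problems can be sketched as follows. The slide does not say which decomposition is used; this sketch assumes the common one-vs-rest scheme, and the function name and sample labels are hypothetical.

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class label list as positive (1) vs. negative (0)."""
    return [1 if label == positive_class else 0 for label in labels]


# Hypothetical documents labelled with the four categories from the slide.
labels = ["A", "B", "C", "D", "A"]

# One binary problem per category: that category is positive, the rest negative.
binary_problems = {
    category: one_vs_rest_labels(labels, category)
    for category in sorted(set(labels))
}
```

Each entry of `binary_problems` can then be handed to a separate binary classifier.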

4 Motivation
- Feature "scaling" = "weighting" = "scoring".
- TF·IDF representation: IDF is oblivious to the class labels, so it scales some features inappropriately.
- Example: Y receives almost twice the weight of X, although X occurs in ten times as many positive documents.

Feature  Positive (100)  Negative (900)  IDF
X        80 (80%)        0 (0%)          log(1000/80) = 1.1
Y        8 (8%)          0 (0%)          log(1000/8) = 2.1
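The IDF values in this example can be checked with a few lines of code. This is a minimal sketch that assumes a base-10 logarithm, which is what the slide's figures of 1.1 and 2.1 are consistent with; the function name is hypothetical.

```python
import math


def idf(total_docs, docs_with_term):
    """Inverse document frequency, assuming a base-10 logarithm."""
    return math.log10(total_docs / docs_with_term)


# 1000 documents in total (100 positive + 900 negative).
idf_x = idf(1000, 80)  # X occurs in 80 documents, all positive
idf_y = idf(1000, 8)   # Y occurs in 8 documents, all positive
```

Because the class labels never enter the computation, the rarer feature Y ends up with the larger weight.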

5 Objectives
- Maximize classification "performance" through:
  - Feature selection
  - Feature scaling: give more predictive features a greater numeric range
- A feature is most predictive when it occurs in:
  - 100% of positive and 0% of negative documents, or
  - 0% of positive and 100% of negative documents

6 Methodology – Feature Scoring Metrics
$F^{-1}$: the inverse normal cumulative distribution function
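The inverse normal CDF is the building block of Bi-Normal Separation (BNS). The slide's formulas did not survive extraction, so the sketch below assumes the commonly cited definition, BNS = |F^-1(tpr) - F^-1(fpr)|, with both rates clipped to [0.0005, 0.9995] so that F^-1 stays finite. The clipping constant and the function names are assumptions, and the sketch is not guaranteed to reproduce the BNS column on the later slides, whose exact smoothing is not stated.

```python
from statistics import NormalDist

# Inverse CDF of the standard normal distribution (F^-1 on the slide).
_inverse_normal_cdf = NormalDist().inv_cdf

# Assumed clipping bound; keeps F^-1 finite at rates of 0% and 100%.
EPSILON = 0.0005


def _clip(rate):
    return min(max(rate, EPSILON), 1.0 - EPSILON)


def bns(tp, pos, fp, neg):
    """BNS score of a feature found in tp of pos positive and fp of neg negative documents."""
    tpr = _clip(tp / pos)  # rate of the feature in the positive class
    fpr = _clip(fp / neg)  # rate of the feature in the negative class
    return abs(_inverse_normal_cdf(tpr) - _inverse_normal_cdf(fpr))
```

Under this definition a feature that occurs at the same rate in both classes scores 0, and a positive-only feature scores the same as a negative-only one.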

7
Feature  Positive (30)  Negative (400)  BNS   IDF   LOR    IG
Italy    30 (100%)      0               3.29  1.16  4.68   0.37
x        2 (7%)         0               0.14  2.33  1.76   0.02
patient  30 (100%)      400 (100%)      0.00
cost     0              400 (100%)      3.29  0.03  -4.68  0.37
y        15 (50%)       200 (50%)       0.00  0.30  0.00
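The IG column of this table can be approximated with a short information-gain routine. This is a sketch that assumes IG is measured in bits (base-2 entropy) over the presence or absence of the feature; the function names are hypothetical. Under these assumptions the results round to the 0.37 shown for Italy and cost and the 0.02 shown for x.

```python
import math


def entropy(pos, neg):
    """Class entropy in bits for pos positive and neg negative documents."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # an empty class contributes nothing
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(tp, fp, pos=30, neg=400):
    """IG of a feature present in tp positive and fp negative documents."""
    n = pos + neg
    present = tp + fp
    absent = n - present
    # Class entropy remaining after splitting on feature presence.
    remaining = (present / n) * entropy(tp, fp) \
        + (absent / n) * entropy(pos - tp, neg - fp)
    return entropy(pos, neg) - remaining
```

For example, `information_gain(30, 0)` scores the Italy row and `information_gain(15, 200)` scores the y row.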

8 Methodology – Feature Scoring Metrics
Feature  Positive (30)  Negative (400)  BNS   IDF   LOR   IG
Italy    30 (100%)      0               3.29  1.16  4.68  0.37
x        3 (10%)        0               0.36  2.16  1.95  0.03
Positive (+): 0% to 100%; negative (-): 0%

9 Methodology – Feature Scoring Metrics
Feature  Positive (30)  Negative (400)  BNS   IDF   LOR    IG
Italy    0              40 (10%)        0.36  1.03  -0.82  0.01
x        0              400 (100%)      3.29  0.03  -4.68  0.37
Positive (+): 0%; negative (-): 0% to 100%

10 Methodology – Feature Scoring Metrics
Feature  Positive (30)  Negative (400)  BNS   IDF   LOR   IG
patient  30 (100%)      400 (100%)      0.00
y        15 (50%)       200 (50%)       0.00  0.30  0.00
Positive (+): 0% to 100%; negative (-): 0% to 100%

11 Methodology – Feature Scoring Metrics
Feature  Positive (30)  Negative (400)  BNS   IDF   LOR    IG
Italy    30 (100%)      0               3.29  1.16  4.68   0.37
cost     0              400 (100%)      3.29  0.03  -4.68  0.37
Positive (+): 0% to 100%; negative (-): 100% to 0%

12
Feature  Positive (30)  Negative (400)  BNS   IDF   LOR    IG
Italy    30 (100%)      0               3.29  1.16  4.68   0.37
x        2 (7%)         0               0.14  2.33  1.76   0.02
patient  30 (100%)      400 (100%)      0.00
cost     0              400 (100%)      3.29  0.03  -4.68  0.37
y        15 (50%)       200 (50%)       0.00  0.30  0.00

13 Experiments – Accuracy & F-measure

14 Experiments – Precision vs. Recall

15 Experiments – The effect of class distribution

16 Experiments – Comparison with other scoring metrics

17 Experiments – Feature selection + feature scaling

18 Conclusions
- BNS: the difference between a feature's rate in the + class and in the - class
- Use IG selection + BNS scaling
- No need for feature selection: better to use all features for the best performance
- Better to simply use all binary features
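One reading of "binary features" with BNS scaling is that each word contributes its BNS score when it is present in a document and 0 otherwise. The sketch below follows that reading as an assumption; the function name and the tiny vocabulary are hypothetical, and the scores are copied from the table on the earlier slides.

```python
def bns_scaled_vector(doc_tokens, vocabulary, bns_scores):
    """Binary presence of each vocabulary word, scaled by its BNS score."""
    present = set(doc_tokens)  # binary: repeated words count once
    return [bns_scores[word] if word in present else 0.0 for word in vocabulary]


# Hypothetical vocabulary; BNS scores taken from the slide table.
vocabulary = ["italy", "patient", "cost"]
bns_scores = {"italy": 3.29, "patient": 0.00, "cost": 3.29}

vector = bns_scaled_vector(["italy", "patient", "italy"], vocabulary, bns_scores)
```

A word such as "patient", which occurs equally often in both classes, contributes nothing to the vector even when it is present.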

19 Comments
- Advantage: the idea is clear, since it considers the class distribution
- Drawbacks:
  - Restricted to the 2-class problem
  - Using all features takes time
- Application: use in place of IDF

