
1 GA-Based Feature Selection and Parameter Optimization for Support Vector Machine Cheng-Lung Huang, Chieh-Jen Wang Expert Systems with Applications, Volume 31, Issue 2, pp. 231-240, 2006.

2 Outline INTRODUCTION BACKGROUND KNOWLEDGE GA-BASED FEATURE SELECTION AND PARAMETER OPTIMIZATION NUMERICAL ILLUSTRATIONS CONCLUSION

3 Introduction Support vector machines (SVM) were first suggested by Vapnik (1995) for classification. SVM classifies data with different class labels by determining a set of support vectors that outline a hyperplane in the feature space. A kernel function transforms the training data vectors into the feature space. SVM has been used in a range of problems including pattern recognition (Pontil and Verri 1998), bioinformatics (Brown et al. 1999), and text categorization (Joachims 1998).

4 Problems While using SVM, we confront two problems: How to set the best parameters for SVM? How to choose the input attributes (features) for SVM?

5 Feature Selection Feature selection is used to identify a strongly predictive subset of fields within the database and to reduce the number of fields presented to the mining process. It affects several aspects of pattern classification: 1. The accuracy of the learned classification algorithm 2. The time needed for learning a classification function 3. The number of examples needed for learning 4. The cost associated with the features

6 SVM Parameter Setting Proper parameter setting can improve the classification accuracy of SVM. The parameters that should be optimized include the penalty parameter C and the kernel function parameters (e.g., gamma for the RBF kernel). Grid search is a common way to find the best C and gamma, but it is time-consuming and does not always perform well.
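As a reference point, a minimal sketch of the grid-search baseline described above, assuming scikit-learn and an RBF-kernel SVM; the dataset and parameter ranges are placeholders, not the paper's exact setup:

```python
# Illustrative grid-search baseline for (C, gamma) of an RBF-kernel SVM.
# The exponentially spaced ranges below are conventional, not the paper's.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "C": [2**k for k in range(-5, 16, 2)],
    "gamma": [2**k for k in range(-15, 4, 2)],
}

# Exhaustively evaluates every (C, gamma) pair with cross-validation,
# which is why grid search becomes expensive as the grid grows.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```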

7 Research Purposes The objective of this research is to optimize the parameters and the feature subset simultaneously, without degrading the classification accuracy of SVM. Genetic algorithms (GA) can be used to generate both the feature subset and the SVM parameters at the same time.

8 An Overview of This Paper

9 Support Vector Machine (SVM) Support vector machine (SVM) is a data classification technique first suggested by Vapnik in 1995. SVM uses a separating hyperplane to distinguish data belonging to two or more different classes, addressing the data-mining problem of classification.

10 Separating Hyperplane

11

12 Slack Variable

13 Penalty Parameter Slack variables account for the cost of overlapping (misclassification) errors. Consequently, the objective function must be revised using the penalty parameter C, as follows.
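The slide's formula image is not reproduced in the transcript; the standard soft-margin objective it refers to, with slack variables ξ_i and penalty parameter C, is:

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w \cdot x_i + b \right) \ge 1 - \xi_i,\quad \xi_i \ge 0 .
```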

14 Non-Linear Classifier

15 Kernel Function Common kernels: polynomial, RBF, and sigmoid (formulas below).
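The kernel formulas appear as images in the original slide; the standard forms, with kernel parameters γ, r, and d, are:

```latex
\text{Polynomial:}\quad K(x_i, x_j) = \left( \gamma\, x_i \cdot x_j + r \right)^{d}
\qquad
\text{RBF:}\quad K(x_i, x_j) = \exp\!\left( -\gamma \lVert x_i - x_j \rVert^{2} \right)
\qquad
\text{Sigmoid:}\quad K(x_i, x_j) = \tanh\!\left( \gamma\, x_i \cdot x_j + r \right)
```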

16 Genetic Algorithm Genetic algorithms (GA) are a general adaptive optimization search methodology based on a direct analogy to Darwinian natural selection.

17 Wrapper Model of Feature Selection

18 Chromosome Design The chromosome comprises three parts: one bit string represents the value of parameter C, one bit string represents the value of parameter γ, and one bit string represents the selected features (one bit per feature).

19 Genotype to Phenotype The bit strings for parameters C and γ are genotypes that must be transformed into phenotypes. Value: the phenotype value. P_min: the minimum value of the parameter (user-defined). P_max: the maximum value of the parameter (user-defined). D: the decimal value of the bit string. L: the length of the bit string.
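The conversion formula is an image in the original slide; from the definitions above, the standard linear decoding is:

```latex
\text{value} = P_{\min} + \frac{P_{\max} - P_{\min}}{2^{L} - 1} \times D
```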

20 Fitness Function Design W_A: SVM classification accuracy weight. SVM_accuracy: SVM classification accuracy. W_F: weight of the features. C_i: cost of feature i. F_i: "1" if feature i is selected; "0" if feature i is not selected.
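The fitness formula itself is an image in the original slide. A form consistent with the definitions above, rewarding accuracy and penalizing the total cost of selected features, is the following; the exact combination should be checked against the paper:

```latex
\text{fitness} = W_A \times \text{SVM\_accuracy}
  + W_F \times \left( \sum_{i=1}^{n_f} C_i \times F_i \right)^{-1}
```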

21 System Flows for GA-based SVM (1) Data preprocess: scaling (2) Converting genotype to phenotype (3) Feature subset (4) Fitness evaluation (5) Termination criteria (6) Genetic operation
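A minimal runnable sketch of this flow, assuming an RBF kernel, an accuracy-only fitness (the feature-cost term is omitted for brevity), and a small population for speed; the dataset, parameter ranges, and GA settings are illustrative rather than the paper's:

```python
# Sketch of steps (1)-(6) above: scaling, genotype decoding, feature
# subsetting, fitness evaluation, termination, and genetic operations.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)          # (1) data preprocess: scaling
n_feat, L = X.shape[1], 12                   # L bits each for C and gamma

def decode(bits, p_min, p_max):
    # (2) genotype -> phenotype: linear decoding of a bit string
    d = int("".join(map(str, bits)), 2)
    return p_min + (p_max - p_min) * d / (2**len(bits) - 1)

def fitness(chrom):
    c = decode(chrom[:L], 0.01, 1000.0)      # illustrative search ranges
    g = decode(chrom[L:2*L], 1e-4, 10.0)
    mask = chrom[2*L:].astype(bool)          # (3) feature subset
    if not mask.any():
        return 0.0
    clf = SVC(C=c, gamma=g, kernel="rbf")
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()  # (4) fitness

pop = rng.integers(0, 2, size=(20, 2*L + n_feat))
for gen in range(10):                        # (5) fixed-generation stopping rule
    scores = np.array([fitness(ch) for ch in pop])
    elite = pop[scores.argmax()].copy()      # elitism: best survives
    # (6) roulette-wheel selection, one-point crossover, bit-flip mutation
    probs = scores / scores.sum()
    pop = pop[rng.choice(len(pop), size=len(pop), p=probs)]
    for i in range(0, len(pop) - 1, 2):
        if rng.random() < 0.7:               # crossover rate
            cut = rng.integers(1, pop.shape[1])
            pop[i, cut:], pop[i+1, cut:] = (pop[i+1, cut:].copy(),
                                            pop[i, cut:].copy())
    flip = rng.random(pop.shape) < 0.02      # mutation rate
    pop[flip] ^= 1
    pop[0] = elite
print("best fitness:", max(fitness(ch) for ch in pop))
```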

22 Figure of System Flows

23 Experimental Datasets

No. | Name                            | #Classes | #Instances | Nominal features | Numeric features | Total features
1   | German (credit card)            | 2        | 1000       | 0                | 24               | 24
2   | Australian (credit card)        | 2        | 690        | 6                | 8                | 14
3   | Pima-Indian diabetes            | 2        | 760        | 0                | 8                | 8
4   | Heart disease (Statlog Project) | 2        | 270        | 7                | 6                | 13
5   | Breast cancer (Wisconsin)       | 2        | 699        | 0                | 10               | 10
6   | Contraceptive Method Choice     | 3        | 1473       | 7                | 2                | 9
7   | Ionosphere                      | 2        | 351        | 0                | 34               | 34
8   | Iris                            | 3        | 150        | 0                | 4                | 4
9   | Sonar                           | 2        | 208        | 0                | 60               | 60
10  | Statlog project: vehicle        | 4        | 940        | 0                | 18               | 18
11  | Vowel                           | 11       | 990        | 3                | 10               | 13

24 Experiments Description To guarantee that the present results are valid and can be generalized for making predictions regarding new data, this study used k-fold cross-validation with k = 10: all of the data are divided into ten parts, each of which takes a turn as the testing data set.
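A minimal sketch of this 10-fold cross-validation protocol; the dataset and SVM settings are placeholders, not the paper's:

```python
# 10-fold cross-validation: each fold serves once as the test set,
# and the reported accuracy is the average over the ten folds.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv)
print(scores.mean(), scores.std())  # average hit rate over the 10 folds
```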

25 Accuracy Calculation For the binary-target datasets, accuracy is demonstrated by the positive hit rate (sensitivity), the negative hit rate (specificity), and the overall hit rate. For the multiple-class datasets, accuracy is demonstrated only by the average hit rate.

26 Accuracy Calculation Sensitivity is the proportion of cases with the positive class that are classified as positive: P(T+|D+) = TP / (TP + FN). Specificity is the proportion of cases with the negative class that are classified as negative: P(T-|D-) = TN / (TN + FP). The overall hit rate is the overall accuracy, calculated as (TP + TN) / (TP + FP + FN + TN).

                         Target (or Disease) +    Target (or Disease) -
Predicted (or Test) +    True Positive (TP)       False Positive (FP)
Predicted (or Test) -    False Negative (FN)      True Negative (TN)
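A short sketch of these hit-rate calculations from a binary confusion matrix; the labels and predictions are toy values:

```python
# Sensitivity, specificity, and overall hit rate from a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)              # positive hit rate, P(T+|D+)
specificity = tn / (tn + fp)              # negative hit rate, P(T-|D-)
overall = (tp + tn) / (tp + fp + fn + tn)
print(sensitivity, specificity, overall)
# Per the next slide, the fitness uses sensitivity * specificity for
# binary datasets and the overall hit rate for multiclass datasets.
```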

27 Accuracy Calculation The SVM_accuracy term of the fitness function is measured by Sensitivity × Specificity for the datasets with two classes (positive or negative), and by the overall hit rate for the datasets with multiple classes.

28 GA Parameter Settings Chromosome: represented using binary code. Population size: 500. Crossover rate: 0.7, one-point crossover. Mutation rate: 0.02. Roulette-wheel selection. Elitism replacement.
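A standalone sketch of the selection scheme named above, roulette-wheel selection with elitism; the chromosomes and fitness values are toy data:

```python
# Roulette-wheel selection: parents are drawn with probability
# proportional to fitness; elitism keeps the best chromosome unchanged.
import numpy as np

def roulette_select(population, fitnesses, rng):
    probs = np.asarray(fitnesses, dtype=float)
    probs = probs / probs.sum()
    idx = rng.choice(len(population), size=len(population), p=probs)
    return [population[i] for i in idx]

rng = np.random.default_rng(1)
pop = ["chrom_a", "chrom_b", "chrom_c", "chrom_d"]
fit = [0.9, 0.5, 0.7, 0.1]
parents = roulette_select(pop, fit, rng)
parents[0] = pop[int(np.argmax(fit))]  # elitism replacement
print(parents)
```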

29 Weights W_A and W_F Weights W_A and W_F influence the experimental result through the fitness function: the higher W_A is, the higher the classification accuracy; the higher W_F is, the smaller the number of selected features.

30 Fold #4 of German Dataset: Curve Diagram

31 Experimental Results for German Dataset

32 Results Summary (GA-based approach vs. grid search)

Names         | Original features | Selected features | GA Avg. Pos. Hit | GA Avg. Neg. Hit | GA Avg. Overall Hit % | Grid Avg. Pos. Hit | Grid Avg. Neg. Hit | Grid Avg. Overall Hit % | Wilcoxon p-value
German        | 24 | 13±1.83  | 0.89608 | 0.76621 | 85.6±1.96  | 0.888271 | 0.462476 | 76±4.06    | 0.005*
Australian    | 14 | 3±2.45   | 0.8472  | 0.92182 | 88.1±2.25  | 0.885714 | 0.823529 | 84.7±4.74  | 0.028*
Diabetes      | 8  | 3.7±0.95 | 0.78346 | 0.87035 | 81.5±7.13  | 0.592593 | 0.88     | 77.3±3.03  | 0.139
Heart disease | 13 | 5.4±1.85 | 0.94467 | 0.95108 | 94.8±3.32  | 0.75     | 0.909091 | 83.7±6.34  | 0.005*
Breast cancer | 10 | 1±0      | 0.9878  | 0.8996  | 96.19±1.24 | 0.98     | 0.944444 | 95.3±2.28  | 0.435
Contraceptive | 9  | 5.4±0.53 | N/A     | N/A     | 71.22±4.15 | N/A      | N/A      | 53.53±2.43 | 0.005*
Ionosphere    | 34 | 6±0      | 0.9963  | 0.9876  | 98.56±2.03 | 0.94     | 0.9      | 89.44±3.58 | 0.005*
Iris          | 4  | 1±0      | N/A     | N/A     | 100±0      | N/A      | N/A      | 97.37±3.46 | 0.046*
Sonar         | 60 | 15±1.1   | 0.9863  | 0.9842  | 98±3.5     | 0.65555  | 0.9      | 87±4.22    | 0.004*
Vehicle       | 18 | 9.2±1.4  | N/A     | N/A     | 84.06±3.54 | N/A      | N/A      | 83.33±2.74 | 0.944
Vowel         | 13 | 7.8±1    | N/A     | N/A     | 99.3±0.82  | N/A      | N/A      | 95.95±2.91 | 0.02*

33 ROC curve for fold #4 of German Credit Dataset

34 Average AUC for Datasets

Dataset       | GA-based approach | Grid algorithm
German        | 0.8424 | 0.7886
Australian    | 0.9019 | 0.8729
Diabetes      | 0.8298 | 0.7647
Heart disease | 0.9458 | 0.8331
Breast cancer | 0.9423 | 0.9078
Contraceptive | 0.7701 | 0.6078
Ionosphere    | 0.9661 | 0.8709
Iris          | 0.9756 | 0.9572
Sonar         | 0.9522 | 0.8898
Vehicle       | 0.8587 | 0.8311
Vowel         | 0.9513 | 0.9205

35 Conclusion We proposed a GA-based strategy to select the feature subset and to set the parameters for SVM classification. We conducted two experiments to evaluate the classification accuracy of the proposed GA-based approach with the RBF kernel against the grid search method on 11 real-world datasets from the UCI repository. Generally, compared with the grid search approach, the proposed GA-based approach achieves good accuracy performance with fewer features.

36 Thank You Q & A

