Support vector machine approach for protein subcellular localization prediction (SubLoc). Kim Hye Jin, Intelligent Multimedia Lab., 2001.09.07.

1 Support vector machine approach for protein subcellular localization prediction (SubLoc)
Kim Hye Jin, Intelligent Multimedia Lab.

2 Contents
Introduction
Materials and Methods
– Support vector machine
– Design and implementation of the prediction system
– Prediction system assessment
Result
Discussion and Conclusion

3 Introduction (1)
Motivation
– Subcellular localization is a key functional characteristic of potential gene products such as proteins
Traditional methods
– Protein N-terminal sorting signals: Nielsen et al. (1999), von Heijne et al. (1997)
– Amino acid composition: Nakashima and Nishikawa (1994), Nakai (2000), Andrade et al. (1998), Cedano et al. (1997), Reinhardt and Hubbard (1998)

4 Materials and Methods (1)
Dataset: SWISSPROT release
– Sequences with complete and reliable localization annotations
– No transmembrane proteins (Rost et al., 1996; Hirokawa et al., 1998; Lio and Vannucci, 2000)
– Redundancy reduction
– Effectiveness test by Reinhardt and Hubbard (1998)

6 Support vector machine (1)
A quadratic optimization problem with boundary constraints and one linear equality constraint
Basically for two-class classification problems
– Input vector x = (x_1, ..., x_20), where x_i is the composition of amino acid (aa) i
– Output y ∈ {-1, 1}
Idea
– Map the input vectors into a high-dimensional feature space H
– Construct the optimal separating hyperplane (OSH) in the space H
– Maximize the margin: the distance between the hyperplane and the nearest data points of each class
– The mapping is done by a kernel function K(x_i, x_j)
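The 20-component input vector can be illustrated with a minimal sketch. It assumes each component is the fraction of one standard amino acid in the sequence; the function name `composition`, the ordering of `AMINO_ACIDS`, and the example sequence are illustrative choices, not taken from the slides.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (this ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the 20-dimensional amino acid composition of a protein sequence.

    Characters outside the 20 standard letters are ignored (an assumption).
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Made-up 10-residue sequence, used only to show the shape of the output.
x = composition("MKTAYIAKQR")
```

Because the vector holds fractions, its components sum to 1 regardless of sequence length.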

8 Support vector machine (2)
Decision function
– The coefficients are obtained by solving a convex quadratic programming problem
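The equations on this slide did not survive in the transcript. For orientation, the standard textbook soft-margin SVM is given here, on the assumption that the slide showed this form. The symbols \alpha_i, b and N are the usual ones and do not appear in the transcript.

```latex
% Decision function for a two-class SVM with kernel K
f(\mathbf{x}) = \operatorname{sgn}\!\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b \right)

% Dual quadratic program that yields the coefficients \alpha_i
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0
```

The box constraints 0 \le \alpha_i \le C and the single equality constraint correspond to the "boundary constraints and one linear equality constraint" mentioned on slide 6.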

9 Support vector machine (3)
Constraints
– In Eq. (2), C is the regularization parameter, which controls the trade-off between margin and misclassification error
Typical kernel functions
– Eq. (3): polynomial kernel with parameter d
– Eq. (4): radial basis function (RBF) kernel with parameter γ
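Eq. (3) and Eq. (4) are missing from the transcript. The sketch below uses the common textbook forms of the two kernels, (u·v + 1)^d and exp(-gamma·||u - v||^2). Whether the slides used exactly these forms is an assumption, and the function names are illustrative.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, d):
    """Polynomial kernel of degree d, assumed form (u.v + 1)^d."""
    return (dot(u, v) + 1.0) ** d


def rbf_kernel(u, v, gamma):
    """RBF kernel, assumed form exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```

With the RBF kernel, identical vectors always score 1 and the value decays toward 0 as the vectors move apart; a larger gamma makes that decay faster.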

10 Support vector machine (4)
Benefits of SVM
– Global optimization
– Handles large feature spaces
– Effectively avoids over-fitting by controlling the margin
– Automatically identifies a small subset of informative points (the support vectors)

11 Design and implementation of the prediction system
Problem: multi-class classification
– Prokaryotic sequences: 3 classes
– Eukaryotic sequences: 4 classes
Solution
– Reduce the multi-class classification to binary classifications
– 1-v-r SVM (one versus rest)
QP problem
– LOQO algorithm (Vanderbei, 1994)
– SVMlight
Speed
– Less than 10 min on a PC running at 500 MHz
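The one-versus-rest reduction can be sketched in two steps: relabel the training set once per class, then pick the class whose binary SVM gives the highest output. The helper names, class labels and scores below are hypothetical, and the SVM training itself is left out.

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class training set for one binary SVM: +1 for the class, -1 for the rest."""
    return [1 if label == positive_class else -1 for label in labels]


def predict_one_vs_rest(scores):
    """Pick the class whose binary SVM gives the highest output.

    `scores` maps a class label to the real-valued output of that
    class's "this class versus the rest" SVM for one protein.
    """
    return max(scores, key=scores.get)


# Hypothetical outputs of three 1-v-r SVMs for one prokaryotic sequence
# (class names and values are made up for illustration).
example_scores = {"cytoplasmic": 0.8, "periplasmic": -0.3, "extracellular": -1.1}
predicted = predict_one_vs_rest(example_scores)
```

With k classes this needs k binary SVMs, so 3 for the prokaryotic case and 4 for the eukaryotic case.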

12 Prediction system assessment
Prediction quality test by jackknife test
– Each protein is singled out in turn as a test protein, with the remaining proteins used to train the SVM
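The jackknife (leave-one-out) loop can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM. That substitution, the function names and the toy data are assumptions made only for illustration.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave each sample out in turn, train on the rest, and test on the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Stand-in classifier (NOT an SVM): one centroid per class.
def train_centroids(xs, ys):
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the label of the centroid closest to x (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: sq_dist(model[y]))


# Toy two-class data, made up for illustration.
toy_x = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
toy_y = ["a", "a", "a", "b", "b", "b"]
accuracy = jackknife_accuracy(toy_x, toy_y, train_centroids, predict_centroid)
```

The test needs one training run per protein, which is why it becomes expensive for large datasets.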

13 Results (1)
SubLoc prediction accuracy by jackknife test
– Prokaryotic sequence case
  d = 1 and d = 9 for the polynomial kernel
  γ = 5.0 for the RBF kernel
  C = 1000 for the SVM constraints
– Eukaryotic sequence case
  d = 9 for the polynomial kernel
  γ = 16.0 for the RBF kernel
  C = 500 for each SVM
Test: 5-fold cross-validation (because of limited computational power)
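A 5-fold split can be sketched as below. It partitions sample indices into interleaved folds and yields one train/test pair per fold. The slides do not say how the folds were formed, so the interleaving and the function names are assumptions. A real run would usually shuffle or stratify by class first.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k interleaved folds of nearly equal size."""
    return [list(range(fold, n_samples, k)) for fold in range(k)]


def train_test_splits(n_samples, k):
    """Yield (train_indices, test_indices), holding out each fold once."""
    folds = k_fold_indices(n_samples, k)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# Ten hypothetical samples split into five folds.
splits = list(train_test_splits(10, 5))
```

Compared with the jackknife test, this needs only k training runs instead of one per protein.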

16 Comparison
Based on amino acid composition
– Neural network: Reinhardt and Hubbard, 1998
– Covariant discriminant algorithm: Chou and Elrod, 1999
Based on the full sequence information in genome sequence
– Markov model: Yuan, 1999

19 Assigning a reliability index
RI (reliability index)
– Based on the difference between the highest and the second-highest output values of the 1-v-r SVMs
– 78% of all sequences have RI ≥ 3, and 95.9% of these are correctly predicted
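The quantity behind the reliability index can be sketched as the gap between the two largest 1-v-r outputs. The slides do not give the rule that maps this gap to an integer RI, so the sketch stops at the gap. The function name and the example scores are hypothetical.

```python
def output_gap(scores):
    """Difference between the highest and second-highest 1-v-r SVM outputs."""
    if len(scores) < 2:
        raise ValueError("need outputs from at least two classifiers")
    top, second = sorted(scores.values(), reverse=True)[:2]
    return top - second


# Made-up outputs of three 1-v-r SVMs for one sequence.
gap = output_gap({"class_a": 1.5, "class_b": 0.25, "class_c": -1.0})
```

A large gap means one classifier clearly dominates, so the prediction is treated as more reliable.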

20 Robustness to errors in the N-terminal sequence

22 Discussion and Conclusion
SVM information condensation
– The number of SVs is quite small
– The ratio of SVs to all training samples is 13-30%

23 SVM parameter selection
Little influence on the classification performance
– Table 8 shows little difference between kernel functions
– Robust characteristic of the dataset, by Vapnik (1995)

24 Improvement of the performance
Combining with other methods
– Sorting-signal-based method and amino acid composition
  Signal: sensitive to errors in the N-terminal
  Composition: weakness with similar amino acids
Incorporate other informative features
– Bayesian system integrating whole-genome expression data
– Fluorescence microscope images

