
Protein Fold Recognition as a Data Mining Coursework Project Badri Adhikari Department of Computer Science University of Missouri-Columbia.


1 Protein Fold Recognition as a Data Mining Coursework Project Badri Adhikari Department of Computer Science University of Missouri-Columbia

2 Overview
– Problem Description
– Data Description and Specification
– Data Preprocessing
– Methodology
– Discussion of Results

3 The Problem
Identify if two proteins belong to the same fold [Binary Classification]
Protein [Structure] Database: Protein 1, Protein 2, Protein 3

Protein Pair        Feature 1   Feature 2   Feature 3   Same Fold?
Protein1-Protein2   0.005       0.065       0.79        Y
Protein2-Protein3   0.034       0.152                   N

4 The Problem
Identify if two proteins belong to the same fold [Binary Classification]

Protein fold recognition
Protein Pair        Feature 1   Feature 2   Feature 3   Same Fold?
Protein1-Protein2   0.005       0.065       0.79        Y
Protein2-Protein3   0.034       0.152                   N

Recognizing Potential New Customers
Customer Identification   Feature 1   Feature 2   Feature 3   Potential Customer
Customer 0001             0.005       0.065       0.79        Y
Customer 0002             0.034       0.152                   Y
Customer 0003             0.005       0.065       0.79        N

5 Data Specification
– File size: 1.5 GB
– Examples count: 951600
– Positive (+1) labels: 7438
– Negative (-1) labels: 944162
– Number of features: 84

6 Data Description
#119l-d119l 1alo-d1alo_1 -1 1:1.62 2:1.13 3:0.913193847041375 4:0.592859664812671 5:0.900250016135814 6:0.373223645145762 7:0.1888487926
#119l-d119l 1chka-d1chka +1 1:1.62 2:2.38 3:0.940068662681886 4:0.750255408177719 5:0.915009668297068 6:0.42843991595408 7:0.21728602097
#119l-d119l 1fcdc-d1fcdc1 -1 1:1.62 2:0.8 3:0.817167643633653 4:0.232414817929789 5:0.856456855019105 6:0.218979560580544 7:0.03189285886
#119l-d119l 1gal-d1gal_1 -1 1:1.62 2:3.22 3:0.917750359046243 4:0.658669958746871 5:0.901110932746401 6:0.415795355014216 7:0.1805356684
#119l-d119l 1gbs-d1gbs +1 1:1.62 2:1.85 3:0.933021070719294 4:0.703993132134445 5:0.911155732036236 6:0.358955863036017 7:0.1302872472
#119l-d119l 1iov-d2dln_2 -1 1:1.62 2:2.1 3:0.893588888139836 4:0.603748017846831 5:0.886157496493048 6:0.374372299060332 7:0.15693565952
#119l-d119l 1kte-d1kte -1 1:1.62 2:1.05 3:0.884958967781326 4:0.527597954487768 5:0.883935799166059 6:0.238710961583473 7:0.0332240052
#119l-d119l 1sly-d1sly_2 +1 1:1.62 2:1.68 3:0.914572798274142 4:0.607628373774305 5:0.900757073008603 6:0.294119634790149 7:0.0624425214
#119l-d119l 2baa-d2baa +1 1:1.62 2:2.43 3:0.856257831918934 4:0.450623745727366 5:0.869561412073741 6:0.378932373372537 7:0.1569895990
#119l-d119l 6lyt-d193l +1 1:1.62 2:1.29 3:0.904095075521014 4:0.602608563805903 5:0.893625220106668 6:0.337268512070013 7:0.1408736485

7 Data Description
Protein query-target pair as example ID: the first two fields of each row, e.g. #119l-d119l 1alo-d1alo_1

8 Data Description
Labels: the third field of each row, either +1 or -1

9 Data Description
Feature values for each example: the index:value fields that follow the label, e.g. 1:1.62 2:1.13 3:0.913193847041375
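The row layout described on the data slides can be read with a few lines of Python. This is only a sketch that follows the sample rows shown (query ID, target ID, label, then index:value pairs); the function name parse_example and the returned tuple layout are assumptions made for illustration, not part of the original project.

```python
def parse_example(line):
    """Parse one row: '#<query> <target> <label> <index>:<value> ...'."""
    fields = line.split()
    query_id = fields[0].lstrip("#")   # e.g. '119l-d119l'
    target_id = fields[1]              # e.g. '1chka-d1chka'
    label = int(fields[2])             # int('+1') == 1, int('-1') == -1
    features = {}
    for token in fields[3:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return query_id, target_id, label, features


# One of the rows from the slide, shortened to three features.
sample = "#119l-d119l 1chka-d1chka +1 1:1.62 2:2.38 3:0.940068662681886"
query_id, target_id, label, features = parse_example(sample)
```

Keeping the query ID as a separate field is convenient later, because the grouping step in preprocessing needs it.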

10 Preprocessing Task 1: Group related data rows
Problem: The records are not all independent of each other.
Solution: Group records with the same query template together, so that they all end up in either the test data set or the training data set.
#119l-d119l 1alo-d1alo_1 -1 1:1.62 2:1.13 3:0.913193847041375 4:0.592859664812671 5:0.900250016135814 6:0.373223645145762 7:0.1888487926
#119l-d119l 1chka-d1chka +1 1:1.62 2:2.38 3:0.940068662681886 4:0.750255408177719 5:0.915009668297068 6:0.42843991595408 7:0.21728602097
#1aab-d1aab 1cyx-d1cyx -1 1:0.83 2:1.58 3:0.771248274633033 4:0.362259086646671 5:0.824117832571076 6:0.248160387073783
#119l-d119l 1fcdc-d1fcdc1 -1 1:1.62 2:0.8 3:0.817167643633653 4:0.232414817929789 5:0.856456855019105 6:0.218979560580544 7:0.03189285886
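One way to keep all rows of a query together is to split on query IDs rather than on individual rows. The sketch below assumes each example is a tuple whose first element is the query ID; split_by_query, test_fraction, and seed are hypothetical names, and the slides do not say how the original split was drawn.

```python
import random
from collections import defaultdict


def split_by_query(examples, test_fraction=0.1, seed=0):
    """Split examples so that each query ID lands wholly in train or in test."""
    groups = defaultdict(list)
    for example in examples:
        groups[example[0]].append(example)   # example[0] is the query ID

    queries = sorted(groups)
    random.Random(seed).shuffle(queries)     # reproducible shuffle of the groups
    n_test = max(1, int(len(queries) * test_fraction))
    test_queries = set(queries[:n_test])

    train = [ex for q in queries if q not in test_queries for ex in groups[q]]
    test = [ex for q in queries if q in test_queries for ex in groups[q]]
    return train, test
```

Because the shuffle acts on query IDs, two rows that share a query can never be separated between the training and test sets.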

11 Preprocessing Task 2: Balance the positive and negative data
Problem: The dataset has just 0.78% positive examples.
Balancing the number of positive and negative examples:
– All examples are divided into all +ve examples and all -ve examples.
– Balanced examples: all +ve examples plus a random selection of -ve examples (equal in number to the +ve examples).
– Remaining -ve examples: used only for testing.
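The balancing step amounts to random undersampling of the negative class. A minimal sketch, assuming each example is a tuple with the label (+1 or -1) at index 2; the function name balance and the seed parameter are illustrative only.

```python
import random


def balance(examples, seed=0):
    """Keep all positives plus an equal-sized random sample of negatives.

    Returns (balanced_examples, leftover_negatives); per the slide, the
    leftover negatives would be used only for testing.
    """
    positives = [ex for ex in examples if ex[2] == 1]
    negatives = [ex for ex in examples if ex[2] == -1]
    random.Random(seed).shuffle(negatives)
    kept = negatives[:len(positives)]
    leftover = negatives[len(positives):]
    return positives + kept, leftover


# Share of positives in the full data set, from the counts on slide 5.
positive_fraction = 7438 / 951600
```

With the counts from the data specification, positive_fraction comes out at roughly 0.0078, which is the 0.78% quoted on the slide.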

12 Methodology
SVM light as the mining tool
SVM light is an implementation of Support Vector Machines (SVMs) in C.
$ svm_learn example1/train.dat example1/model
$ svm_classify example1/test.dat example1/model example1/prediction

13 Methodology Process for deciding the Kernel Function
Many different kernels: linear, polynomial, radial basis function (RBF), or user defined.
Consider the RBF kernel K(x, y) = exp(-γ ||x - y||²)
Parameters to consider:
-m  memory size of cache for kernel evaluations
-g  gamma value for RBF kernel
-c  trade-off between training error and margin
Use cross-validation to find the best parameters C and γ.
Use the best parameters C and γ to train on the whole training set.
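For reference, the RBF kernel is usually written K(x, y) = exp(-γ ||x - y||²), where γ is the value passed through the -g option. The short sketch below evaluates it for two feature vectors given as plain Python lists; rbf_kernel is an illustrative helper, not a function of SVM light.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


identical = rbf_kernel([1.62, 2.38], [1.62, 2.38], gamma=0.1)   # distance 0
different = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)       # squared distance 2
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart; a larger γ makes that decay faster.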

14 Methodology Parameter Determination
Determining gamma parameter:
– Ran training and testing for 100 gamma values between 0 and 1
– Found gamma = 0.15 as the best value
– Ran again to find a more precise gamma for 120 values from 0 to 0.3
– Found best value of gamma as 0.1
Used default C value of 0
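The two-pass search over gamma can be sketched as a coarse grid followed by a finer grid over a narrower range. In the sketch below, toy_score is a made-up placeholder that peaks at 0.1; in the actual project the score would come from training and testing SVM light at each gamma. The helper names and the choice to exclude gamma = 0 from the grid are assumptions.

```python
def grid(start, stop, count):
    """Return `count` evenly spaced values in (start, stop], excluding start."""
    step = (stop - start) / count
    return [start + step * i for i in range(1, count + 1)]


def best_gamma(candidates, score):
    """Pick the candidate gamma with the highest score."""
    return max(candidates, key=score)


def toy_score(gamma):
    # Placeholder only: stands in for a cross-validated quality measure.
    return -(gamma - 0.1) ** 2


coarse = best_gamma(grid(0.0, 1.0, 100), toy_score)   # 100 values up to 1
fine = best_gamma(grid(0.0, 0.3, 120), toy_score)     # 120 values up to 0.3
```

The second pass costs about as many runs as the first but covers a narrower range, so the step between candidates shrinks from 0.01 to 0.0025.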

15 Evaluation with 10-fold cross-validation
ROC curve: sensitivity plotted against 1 - specificity as the threshold varies.
For different values of the threshold, average sensitivity and specificity were computed from the values in each fold.
For threshold = -1.02:
Threshold   Sensitivity   Specificity   FPR     Accuracy   Precision
-1.02       0.786         0.769         0.231   0.769      0.027
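The columns reported for a threshold can all be derived from the four confusion-matrix counts. The sketch below assumes that an example is predicted positive when its SVM output is at or above the threshold; that convention, the function name, and the small data set are illustrative and not taken from the slides.

```python
def metrics_at_threshold(scores, labels, threshold):
    """Compute ROC-style metrics, predicting +1 when score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Guard against empty classes in tiny samples.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "fpr": ratio(fp, fp + tn),           # 1 - specificity
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
    }


# Five made-up SVM outputs with their true labels.
result = metrics_at_threshold(
    scores=[0.9, 0.2, -0.5, -1.5, -2.0],
    labels=[1, -1, 1, -1, -1],
    threshold=-1.02,
)
```

Sweeping the threshold and plotting sensitivity against fpr traces out the ROC curve. The low precision in the table is what one would expect if the test data keep roughly the original 0.78% positive rate: even a false positive rate of 0.231 then yields far more false positives than true positives.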

16 References
– A machine learning information retrieval approach to protein fold recognition, by Jianlin Cheng and Pierre Baldi
– A Practical Guide to Support Vector Classification, by Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, available at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
– Cross-Validation, by Payam Refaeilzadeh, Lei Tang, and Huan Liu, available at http://www.public.asu.edu/~ltang9/papers/ency-cross-validation.pdf
– Classroom slides at http://people.cs.missouri.edu/~chengji/datamining2012/Chapter4_Classification_Prediction.ppt

17 Thank you for your time Questions and comments are welcome.

