CZ5226: Advanced Bioinformatics Lecture 7: Statistical Learning Methods Prof. Chen Yu Zong Tel: 6874-6877


CZ5226: Advanced Bioinformatics, Lecture 7: Statistical Learning Methods. Prof. Chen Yu Zong, Tel: 6874-6877, Room 07-24, level 7, SOC1, National University of Singapore

2 Classification of Drugs or Proteins by SVM
A drug or a protein is classified as either belonging (+) or not belonging (-) to a class.
Examples of drug classes: inhibitor of a protein, BBB (blood-brain barrier) penetrating, genotoxic.
Examples of protein classes: enzyme EC3.4 family, DNA-binding.
By screening against all classes, the property of a drug or the function of a protein can be identified.
[Diagram: a drug or protein is passed to the Class-1 SVM, Class-2 SVM, and Class-3 SVM; each decides whether the drug or protein belongs to that family.]

3 Classification of Drugs or Proteins by SVM
What is SVM? Support vector machines are a machine learning (statistical learning) method that learns from examples and classifies objects into one of two classes.
Advantages of SVM:
– Diversity of class members: members of a class need not resemble one another.
– Use of structure-derived physico-chemical features as the basis for drug or protein classification (no structure similarity or sequence similarity is required by the algorithm).

4 SVM References
– C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
– R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
– S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
– Online lecture notes.
Publications of SVM drug prediction:
– J. Chem. Inf. Comput. Sci. 44, 1630 (2004)
– J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
– Toxicol. Sci. 79, 170 (2004)

5 SVM References
Publications of SVM protein function prediction:
– Bioinformatics 2002; 18, 147
– Nucleic Acids Res 2003; 31, 3692
– Proteins 2004; 55, 66
– RNA 2004; 10, 355
– J Biol Chem 2004; 279,
– Nucleic Acids Res. 2004; 32(21):
– Virology 2005; 331(1):
Publications of SVM peptide-binder prediction:
– BMC Bioinformatics Sep 11;3(1):25
– Bioinformatics Oct 12;19(15):
– Protein Sci Mar;13(3):
– Genome Inform Ser Workshop Genome Inform. 2004;15(1):


8 Machine Learning Method
Inductive learning: example-based learning.
[Figure: descriptors of positive examples and negative examples.]

9 Machine Learning Method
Feature vectors: A=(1, 1, 1), B=(0, 1, 1), C=(1, 1, 1), D=(0, 1, 1), E=(0, 0, 0), F=(1, 0, 1)
[Figure: the descriptors of each positive and negative example are converted into a feature vector.]

10 SVM Method
Feature vectors in input space: A=(1, 1, 1), B=(0, 1, 1), C=(1, 1, 1), D=(0, 1, 1), E=(0, 0, 0), F=(1, 0, 1)
[Figure: feature vectors A, B, E, and F plotted in the input space with axes X, Y, and Z.]

11 SVM Method
Project to a higher dimensional space.
[Figure: protein family members and nonmembers with the border in the input space, and the same points with the new border after projection.]
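A toy illustration of why projecting to a higher dimensional space helps. The five one-dimensional points, their labels (+1 for members, -1 for nonmembers), and the map `phi` are made up for this sketch and are not taken from the lecture.

```python
# Toy 1-D data (assumed): members (+1) sit on both sides of the nonmembers (-1),
# so no single threshold on x can separate the two groups.
points = [(-2.0, +1), (-1.0, -1), (0.0, -1), (1.0, -1), (2.0, +1)]


def separable_by_threshold(values_labels):
    """True if one threshold puts all +1 values on one side and all -1 on the other."""
    pos = [v for v, y in values_labels if y > 0]
    neg = [v for v, y in values_labels if y < 0]
    return max(pos) < min(neg) or max(neg) < min(pos)


def phi(x):
    """Illustrative projection from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)


# In the original space the classes overlap along the axis.
in_1d = separable_by_threshold(points)

# After projection, the new coordinate x^2 alone separates them
# (a horizontal line in the 2-D space acts as the new border).
in_2d = separable_by_threshold([(phi(x)[1], y) for x, y in points])

print(in_1d, in_2d)
```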

12 SVM Method
[Figure: protein family members and nonmembers separated by the new border; the support vectors are marked.]

13 SVM Method
[Figure: protein family members and nonmembers separated by the new border; the support vectors are marked.]

14 Best Linear Separator?

15 Best Linear Separator?

16 Find Closest Points in Convex Hulls
[Figure: closest points c and d of the two convex hulls.]

17 Plane Bisects Closest Points
[Figure: the separating plane bisects the segment joining c and d.]
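Once the closest points c and d are known, the bisecting plane follows directly. The sketch below assumes the plane is written as w·x = b with normal w = c - d passing through the midpoint of c and d; the coordinates and the helper names `bisecting_plane` and `side` are illustrative, not from the slides.

```python
def bisecting_plane(c, d):
    """Plane w.x = b that is perpendicular to c - d and passes through their midpoint."""
    w = tuple(ci - di for ci, di in zip(c, d))
    midpoint = tuple((ci + di) / 2 for ci, di in zip(c, d))
    b = sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def side(w, b, x):
    """Signed score of x: positive on the side of c, negative on the side of d."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


# Illustrative closest points of the two convex hulls (made-up coordinates).
c = (3.0, 3.0)
d = (1.0, 1.0)
w, b = bisecting_plane(c, d)
print(w, b, side(w, b, c), side(w, b, d))
```

The two closest points end up at equal and opposite scores, which is what "bisect" means here.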

18 Find the closest points using a quadratic program. Many existing and new solvers are available.

19 Best Linear Separator: Supporting Plane Method
Maximize the distance between two parallel supporting planes. This distance is the "margin".
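The margin formula did not survive in this transcript. As a sketch under a common normalization (an assumption, not necessarily the lecturer's notation), take the two supporting planes to be w·x = b + 1 and w·x = b - 1, with labels y_i = +1 or -1. The distance between the planes and the resulting optimization problem are then:

```latex
% Assumed normalization: supporting planes  w \cdot x = b + 1  and  w \cdot x = b - 1
\[
\text{margin}
  = \frac{\lvert (b+1) - (b-1) \rvert}{\lVert w \rVert}
  = \frac{2}{\lVert w \rVert}
\]
% Maximizing the margin is equivalent to minimizing the norm of w,
% with every training point on the correct side of its supporting plane:
\[
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i - b) \ge 1 \quad \text{for all } i
\]
```

This is a quadratic program, which is why general-purpose quadratic programming solvers apply.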

20 Best Linear Separator?

21 SVM Method Border line is nonlinear

22 SVM method Non-linear transformation: use of kernel function
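A small check of what a kernel function does: it returns the inner product in the transformed space without building the transformed vectors. The degree-2 polynomial kernel and the explicit map `phi` below are a standard textbook pair chosen for illustration; the slides do not say which kernel the lecture used.

```python
import math


def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit non-linear map for 2-D input: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # inner product computed in the transformed space
implicit = poly2_kernel(x, z)    # same quantity computed in the input space
print(explicit, implicit)
```

The two values agree up to floating-point rounding, so a classifier that only needs inner products can work in the transformed space at the cost of the input space.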

23 SVM method Non-linear transformation

24 SVM Method

25 SVM Method

26 SVM Method

27 SVM Method

28 SVM for Classification of Drugs
How to represent a drug? Each structure is represented by a specific feature vector assembled from structural and physico-chemical properties:
– Simple molecular properties (molecular weight, number of rotatable bonds, etc.; 18 in total)
– Molecular connectivity and shape (28 in total)
– Electro-topological state polarity (84 in total)
– Quantum chemical properties (electric charge, polarizability, etc.; 13 in total)
– Geometrical properties (molecular size vector, van der Waals volume, molecular surface, etc.; 16 in total)
J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).
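A sketch of how the descriptor groups listed above could be concatenated into one feature vector. The group sizes come from the slide; the function name, the dictionary layout, and the zero-valued placeholders are assumptions for illustration, and real descriptor values would come from a cheminformatics package that the slides do not name.

```python
# Descriptor group sizes as listed on the slide.
DESCRIPTOR_GROUPS = {
    "simple molecular properties": 18,
    "molecular connectivity and shape": 28,
    "electro-topological state": 84,
    "quantum chemical properties": 13,
    "geometrical properties": 16,
}


def assemble_feature_vector(group_values):
    """Concatenate the descriptor groups in a fixed order, checking each group's size."""
    vector = []
    for name, size in DESCRIPTOR_GROUPS.items():
        values = group_values[name]
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        vector.extend(values)
    return vector


# Placeholder values only; a real molecule would supply computed descriptors.
placeholder = {name: [0.0] * size for name, size in DESCRIPTOR_GROUPS.items()}
vector = assemble_feature_vector(placeholder)
print(len(vector))  # total dimension if the five groups do not overlap
```

Keeping the group order fixed matters because the classifier treats each position in the vector as the same property for every molecule.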

SVM-based drug design and property prediction software
Useful for inhibitor/activator/substrate prediction, drug safety and pharmacokinetic prediction.
[Diagram: your drug structure (chemical structure) is input either through the internet or on a local machine (Options 1 and 2) and sent to a computer loaded with SVMProt, which holds a support vector machine classifier for every drug class. The identified classes answer the question "Which class does your drug belong to?" and give the drug designed or the property predicted.]
J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).

SVM Drug Prediction Results
Protein inhibitor/activator/substrate prediction:
– 86% of the 129 estrogen receptor activators and 84% of 101 non-activators correctly predicted.
– 81% of 116 P-glycoprotein substrates and 79% of 85 non-substrates correctly predicted.
Drug toxicity prediction:
– 97% of 102 TdP+ and 84% of 243 TdP- agents correctly predicted.
– 73% of 229 genotoxic and 93% of 631 non-genotoxic agents correctly predicted.
Pharmacokinetics prediction:
– 95% of 276 BBB+ and 82% of 139 BBB- agents correctly predicted.
– 90% of 131 human intestine absorption and 80% of 65 non-absorption agents correctly predicted.
J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).
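The results above are reported separately for the positive and negative class of each task. As a rough sketch, the two rates can be combined into a single overall accuracy by weighting each by its class size. The combined figures are not reported in the slides, and because the percentages are rounded the outputs are approximate; the function name and dictionary keys are illustrative.

```python
def overall_accuracy(pos_rate, n_pos, neg_rate, n_neg):
    """Class-size-weighted accuracy from the per-class rates of correct prediction."""
    correct = pos_rate * n_pos + neg_rate * n_neg
    return correct / (n_pos + n_neg)


# (rate on positives, number of positives, rate on negatives, number of negatives),
# copied from the slide; the rates are rounded percentages.
results = {
    "estrogen receptor activators": (0.86, 129, 0.84, 101),
    "P-glycoprotein substrates": (0.81, 116, 0.79, 85),
    "TdP": (0.97, 102, 0.84, 243),
    "genotoxicity": (0.73, 229, 0.93, 631),
    "BBB penetration": (0.95, 276, 0.82, 139),
    "human intestinal absorption": (0.90, 131, 0.80, 65),
}

for task, (pos_rate, n_pos, neg_rate, n_neg) in results.items():
    print(f"{task}: {overall_accuracy(pos_rate, n_pos, neg_rate, n_neg):.1%}")
```

Reporting both rates, as the slide does, is more informative than the single number when the classes are unbalanced, as in the genotoxicity set.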

31 SVM for Classification of Proteins
How to represent a protein? Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
– amino acid composition
– hydrophobicity
– normalized van der Waals volume
– polarity
– polarizability
– charge
– surface tension
– secondary structure
– solvent accessibility
Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.
Nucleic Acids Res. 2003; 31:

32 SVM for Classification of Proteins
How to represent a protein? From protein sequence to feature vector:
(C_amino acid composition, T_amino acid composition, D_amino acid composition, C_hydrophobicity, T_hydrophobicity, D_hydrophobicity, …)
Nucleic Acids Res. 2003; 31:
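A minimal sketch of composition (C), transition (T), and distribution (D) descriptors for one property, hydrophobicity. The three-class residue grouping, the rule used to pick the 25%, 50%, and 75% occurrences, and all function names are assumptions for illustration; the slides do not give these details, and the published implementation may differ.

```python
# Assumed three-class hydrophobicity grouping (not stated in the slides).
HYDROPHOBICITY_CLASSES = {
    "1": "RKEDQN",    # polar
    "2": "GASTPHY",   # neutral
    "3": "CLVIMFW",   # hydrophobic
}
RESIDUE_TO_CLASS = {aa: cls for cls, aas in HYDROPHOBICITY_CLASSES.items() for aa in aas}


def encode(sequence):
    """Replace each residue by its class label; non-standard residues raise KeyError."""
    return "".join(RESIDUE_TO_CLASS[aa] for aa in sequence)


def composition(encoded):
    """C: fraction of residues in each class."""
    return [encoded.count(c) / len(encoded) for c in "123"]


def transition(encoded):
    """T: fraction of adjacent pairs that switch between two given classes."""
    pairs = list(zip(encoded, encoded[1:]))
    out = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        out.append(sum(1 for p in pairs if p in ((a, b), (b, a))) / len(pairs))
    return out


def distribution(encoded):
    """D: relative position of the first, 25%, 50%, 75% and last residue of each class."""
    out = []
    for c in "123":
        positions = [i + 1 for i, ch in enumerate(encoded) if ch == c]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                out.append(0.0)   # class absent from the sequence
                continue
            k = max(1, int(frac * len(positions)))   # assumed rounding rule
            out.append(positions[k - 1] / len(encoded))
    return out


def ctd_features(sequence):
    """Concatenate C (3 values), T (3 values) and D (15 values) for one property."""
    encoded = encode(sequence)
    return composition(encoded) + transition(encoded) + distribution(encoded)


features = ctd_features("MKVLAAGRE")   # made-up example sequence
print(len(features), features[:6])
```

Repeating the same three steps with a different residue grouping for each property, and concatenating the results, gives a fixed-length vector regardless of sequence length.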

33 SVM for Classification of Proteins How to represent a protein?

Protein function prediction software SVMProt
Useful for functional prediction of novel proteins, distantly-related proteins, and homologous proteins of different functions.
[Diagram: your protein sequence is input either through the internet or on a local machine (Options 1 and 2) and sent to a computer loaded with SVMProt, which holds a support vector machine classifier for every protein functional family. The identified functional families answer the question "Which functional families does your protein belong to?" and give protein functional indications.]
Nucl. Acids Res. 31, (2003)

Protein function prediction software SVMProt Useful for functional prediction of novel proteins, distantly-related proteins, homologous proteins of different functions. Protein families covered: 46 enzyme families, 3 receptor families, 4 transporter and channel families, 6 DNA- and RNA-binding families, 8 structural families, 2 regulator/factor families. SVMProt web-version at: Nucl. Acids Res. 31, (2003)

Protein function prediction software SVMProt
[Figure: probability of correct prediction versus prediction score.]
Nucl. Acids Res. 31, (2003)

SVMProt Protein Functional Family Prediction Results
Overall prediction accuracies:
– 87% of the 34,582 proteins correctly assigned to their respective functional family.
– 97% of the 310,000 non-member proteins correctly predicted.
Novel enzymes:
– 67% of the 12 non-homologous enzymes (having no homologous proteins by PSI-BLAST search of NR databases) are correctly assigned.
– 83% of the 29 non-homologous enzymes (having no homologous proteins by PSI-BLAST search of SwissProt database) are correctly assigned.
– 70% of the 20 pairs of homologous enzymes of different functions are correctly assigned.
NR databases include all non-redundant GenBank CDS translations, PDB, SwissProt, PIR, and PRF databases.
For comparison: 92% of 12,900 enzymes correctly assigned by BLAST in 1997.
Nucleic Acids Res 2003; 31, 3692; Proteins 2004; 55, 66