PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological.

Presentation transcript:

PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming. Zhiyong Wang and Jinbo Xu, Toyota Technological Institute at Chicago. Web server at See for an extended version http://arxiv.org/abs/

Problem Definition Contact: distance between two Cα or Cβ atoms < 8 Å. Short range: 6-12 AAs apart; medium range: 12-24 AAs apart; long range: >24 AAs apart. Example protein: 1J8B
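The contact definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `contact_map` and `contact_range` are hypothetical, the input is assumed to be one (x, y, z) coordinate per residue (for example the Cβ atom), and the medium-range interval of 12 to 24 residues and the handling of interval endpoints are assumptions.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return the set of residue pairs (i, j), i < j, whose atoms lie
    closer than `threshold` Å. `coords` holds one (x, y, z) per residue."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.add((i, j))
    return contacts

def contact_range(i, j):
    """Classify a residue pair by sequence separation.
    Endpoint handling (12 counts as medium, 24 as medium) is an assumption."""
    sep = abs(i - j)
    if sep < 6:
        return "local"   # closer than the short-range window
    if sep < 12:
        return "short"
    if sep <= 24:
        return "medium"
    return "long"
```

Pairs closer than 6 residues in sequence are labelled "local" here only so that every pair gets a label.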

Existing Work
– Residue co-evolution methods: mutual information (MI), PSICOV, Evfold
  – Need a large number of homologous sequences
  – PSICOV and Evfold outperform MI because they differentiate direct from indirect residue couplings (residues A and C are indirectly coupled if their coupling is due to direct A-B and B-C couplings)
  – PSICOV and Evfold also enforce sparsity
– Supervised learning methods: NNcon, SVMcon, CMAPpro
  – Use mutual information, sequence profile and other features
  – Predict contacts one by one, ignoring their correlation
  – Do not differentiate direct and indirect residue couplings
– First-principle method: Astro-Fold
  – No evolutionary information
  – Minimizes contact potential
  – Enforces physical feasibility, including sparsity

Our Method: PhyCMAP 1. Focus on proteins with few sequence homologs – proteins with many sequence homologs very likely have similar templates in PDB 2. Integrate by machine learning – seq profile, residue co-evolution and non-evolutionary info – (implicitly) differentiate direct and indirect residue couplings through feature engineering 3. Enforce physical constraints, which imply sparsity

Info used by Random Forests Evolutionary info from a single protein family – sequence profile – co-evolution: 2 types of mutual information (MI) Non-evolutionary info from the whole structure space: residue contact potential Mixed info from the above 2 sources – homologous pairwise contact score – EPAD: context-specific evolutionary-based distance-dependent statistical potential – amino acid physicochemical properties

Mutual Information 1. Contrastive Mutual Information (CMI): remove local background by measuring the MI difference of one pair with its neighbors. 2. Chaining effect of residue couplings: MI, MI^2, MI^3, MI^4, equivalent to (1-MI), (1-MI)^2, (1-MI)^3, (1-MI)^4 (see for more details)
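As a rough illustration of the quantities on this slide, the sketch below computes plain mutual information between alignment columns and then the powers MI through MI^4. It rests on assumptions not stated in the transcript: the alignment is a list of equal-length strings, frequencies are raw counts without pseudocounts or sequence weighting, and the powers are read as matrix powers of the MI matrix. All function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """MI (in nats) between two alignment columns given as sequences of symbols."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

def mi_matrix(msa):
    """L x L matrix of pairwise column MI; the diagonal is set to 0."""
    cols = list(zip(*msa))
    size = len(cols)
    return [[0.0 if i == j else mutual_information(cols[i], cols[j])
             for j in range(size)] for i in range(size)]

def mat_mul(a, b):
    """Plain-Python product of two square matrices."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size))
             for j in range(size)] for i in range(size)]

def mi_powers(mi, max_power=4):
    """Return [MI, MI^2, ..., MI^max_power] as matrix powers (an assumption)."""
    powers = [mi]
    for _ in range(max_power - 1):
        powers.append(mat_mul(powers[-1], mi))
    return powers
```

Entry (i, j) of MI^2 sums MI(i, k) * MI(k, j) over intermediate residues k, which is one way to read the "chaining" of couplings through a third residue.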

CMI Example: 1J8B Upper triangle: mutual information Lower triangle: contrastive mutual information Blue boxes: native contacts

Homologous Pairwise Contact Score: probability of a residue pair forming a contact between 2 secondary structures. PS_beta(a, b): probability of two AAs a and b forming a beta contact. PS_helix(a, b): probability of two AAs a and b forming a helix contact. H: the set of sequence homologs in a multiple sequence alignment.

Training Random Forests Training dataset – Chosen before CASP10 started – 900 non-redundant protein structures – <25% sequence identity – All contacts and 20% of non-contacts Model parameters – Number of features: 300 – Number of trees: 500 – 5-fold cross validation
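The "all contacts and 20% of non-contacts" rule can be illustrated with a small subsampling helper. This is only a sketch under assumptions: the function name, the seeded random sampling, and the rounding of the kept fraction are not taken from the slides, and the random forest itself is not shown.

```python
import random

def build_training_set(pairs, labels, negative_fraction=0.2, seed=0):
    """Keep every contact (label 1) and a random fraction of non-contacts (label 0)."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    positives = [p for p, y in zip(pairs, labels) if y == 1]
    negatives = [p for p, y in zip(pairs, labels) if y == 0]
    n_keep = round(negative_fraction * len(negatives))
    kept = rng.sample(negatives, n_keep)
    return [(p, 1) for p in positives] + [(p, 0) for p in kept]
```

Subsampling the negatives reduces the class imbalance, since non-contacts far outnumber contacts in any protein.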

Select Physically Feasible Contacts by Integer Linear Programming: maximize the accumulated contact probability while minimizing violation of physical constraints. X_{i,j}: indicates a contact between two residues i and j. R_r: a relaxation variable of the r-th soft constraint. g(R): penalty for violation of physical constraints.

Soft Constraints 1: the number of contacts between two secondary structure segments is limited
s1,s2   95%   Max
H,H     5     12
H,E     3     10
H,C     4     11
E,H     4     12
E,E     9     13
E,C     6     15
C,H     3     12
C,E     5     12
C,C     6     20
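One way to read this table is sketched below: the 95% column acts as a soft bound whose excess is absorbed by a relaxation variable, and the Max column acts as a hard cap. That reading, the parsing of the flattened numbers into (95%, Max) pairs, and the names `BOUNDS` and `relaxation` are all assumptions rather than details confirmed by the slides.

```python
# (segment type 1, segment type 2) -> (95th-percentile bound, maximum)
# H = helix, E = beta strand, C = coil
BOUNDS = {
    ("H", "H"): (5, 12), ("H", "E"): (3, 10), ("H", "C"): (4, 11),
    ("E", "H"): (4, 12), ("E", "E"): (9, 13), ("E", "C"): (6, 15),
    ("C", "H"): (3, 12), ("C", "E"): (5, 12), ("C", "C"): (6, 20),
}

def relaxation(segment_types, n_contacts):
    """Return the relaxation needed for a pair of segments.

    Contacts beyond the soft (95%) bound are counted as relaxation;
    exceeding the hard maximum is treated as infeasible.
    """
    soft, hard = BOUNDS[segment_types]
    if n_contacts > hard:
        raise ValueError("exceeds hard upper bound on contacts")
    return max(0, n_contacts - soft)
```

For example, `relaxation(("H", "H"), 7)` gives 2, the number of contacts above the soft bound of 5.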

Soft Constraints 2 Upper and lower bounds for #contacts between two beta strands

Soft Constraints 3: statistics show that only 3.4% of loop segments have a contact between their start and end residues.

Hard Constraints 1 For parallel contacts between two β strands, the contacts of neighboring residue pairs should satisfy the following constraints For anti-parallel contacts

Hard Constraints 2 1) One residue cannot form contacts with both j and j+2 when j and j+2 are in the same alpha helix 2) One beta-strand can form beta-sheets with up to 2 other beta-strands.
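The first rule on this slide can be checked on a candidate contact set as follows. This is a hedged sketch: the function name is hypothetical, helices are assumed to be given as inclusive (start, end) residue index ranges, and in the actual method the rule would appear as a linear constraint inside the integer program rather than as a post hoc check.

```python
def violates_helix_rule(contacts, helices):
    """True if some residue contacts both j and j+2 with j, j+2 in one helix.

    contacts: iterable of (i, j) residue index pairs
    helices:  list of (start, end) inclusive index ranges
    """
    partners = {}
    for i, j in contacts:
        partners.setdefault(i, set()).add(j)
        partners.setdefault(j, set()).add(i)
    for residue, others in partners.items():
        for j in others:
            same_helix = any(s <= j and j + 2 <= e for s, e in helices)
            if j + 2 in others and same_helix:
                return True
    return False
```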

Test Datasets CASP10: 123 proteins – 36 are “hard”, i.e., no similar templates in PDB – low sequence identity (<25%) among them – low seq id with the training data, which were chosen before CASP10 started Set600: 601 proteins – share <25% seq ID with the training proteins – each has ≥50 AAs and an X-ray structure with resolution <1.9Å – each has ≥5 AAs with predicted secondary structure being alpha-helix or beta-strand

Accuracy w.r.t. #sequence homologs 1. Meff: #non-redundant sequence homologs of a protein 2. Divide the CASP10 targets into groups by Meff 3. Top L/10 predicted medium- and long-range contacts [Figure: accuracy vs. log Meff]
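The "top L/10" (and, in later slides, "top L/5") accuracy measure can be sketched as below. The sketch assumes that predictions are a dict from residue pairs to predicted probabilities, that L is the protein length, and that medium- and long-range means a sequence separation of at least 12; the function name and these conventions are assumptions.

```python
def top_accuracy(pred_probs, native, length, frac=10, min_sep=12):
    """Fraction of the top length/frac predicted contacts that are native.

    pred_probs: dict mapping (i, j) -> predicted contact probability
    native:     set of true contact pairs (i, j)
    """
    candidates = [p for p in pred_probs if abs(p[0] - p[1]) >= min_sep]
    ranked = sorted(candidates, key=lambda p: pred_probs[p], reverse=True)
    top = ranked[:max(1, length // frac)]
    if not top:
        return 0.0
    return sum(1 for p in top if p in native) / len(top)
```

Passing `frac=5` gives the top L/5 variant.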

Results on CASP10 – Medium Range. Overall accuracy on top L/5 predicted Cβ contacts: PhyCMAP 0.465, CMAPpro 0.370, PSICOV [value missing]. [Figure: CMAPpro, PSICOV and PhyCMAP compared]

Results on CASP10 – Long Range. Overall accuracy on top L/5 predicted Cβ contacts: PhyCMAP 0.373, CMAPpro 0.313, PSICOV [value missing]. [Figure: CMAPpro, PSICOV and PhyCMAP compared]

Results on 36 hard CASP10 targets. Accuracy on top L/5 medium- and long-range Cβ contacts: PhyCMAP 0.363, CMAPpro 0.308, PSICOV [value missing]. [Figure: CMAPpro, PSICOV and PhyCMAP compared]

Results on Set600 with few homologs (Meff ≤ 100). Accuracy on top L/5 predicted medium- and long-range Cβ contacts: PhyCMAP 0.345, CMAPpro 0.287, PSICOV 0.059. [Figure: CMAPpro, PSICOV and PhyCMAP compared]

Example: T0677-D2. Dozens of sequence homologs, Meff=31. Upper triangle: native Cβ contacts. Left lower triangle: PhyCMAP accuracy. Right lower triangle: Evfold accuracy ~0. Note that contacts between alpha helices are not continuous.

Example: T0693-D2. Many sequence homologs, Meff=2208. Upper triangle: native Cβ contacts. Left lower triangle: PhyCMAP accuracy. Right lower triangle: Evfold accuracy 0.419.

Example: T0701-D1. Many sequence homologs, Meff=3300. Upper triangle: native Cβ contacts. Left lower triangle: PhyCMAP accuracy. Right lower triangle: Evfold accuracy 0.444.

Example: T0756-D1. Many sequence homologs, Meff=1824. Upper triangle: native Cβ contacts. Left lower triangle: PhyCMAP accuracy. Right lower triangle: Evfold accuracy 0.500.

Summary: Combining seq profile, residue co-evolution and non-evolutionary info can result in good accuracy even for proteins with non-redundant seq homologs. Physical constraints are helpful for proteins with few sequence homologs. [Figure: Cβ accuracy on 130 proteins with Meff ≤ 100]

Acknowledgements Student: Zhiyong Wang Funding – NIH R01GM – NSF CAREER award – Alfred P. Sloan Research Fellowship Computational resources – University of Chicago Beagle team – TeraGrid Web server at

Protein contact Contact: distance between two Cα or Cβ atoms < 8 Å; or distance between the closest atoms of 2 residues. Short range: 6-12 AAs apart; medium range: 12-24 AAs apart; long range: >24 AAs apart. Example protein: 1J8B

Why contact prediction? Contacts describe spatial and functional relationship of residues Contains key information for 3D structure Useful for protein structure prediction Used for protein structure alignment and classification

Contrastive Mutual Information Contrastive Mutual Information (CMI) removes local background, by measuring the MI difference between one pair of residues and neighboring pairs.
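The transcript does not preserve the CMI formula, so the sketch below uses one plausible form: for each pair (i, j), sum the squared differences between MI(i, j) and the MI of its four grid neighbors. The formula, the skipping of neighbors outside the matrix, and the function name are assumptions.

```python
def contrastive_mi(mi):
    """Contrast each MI entry against its four neighbors in the MI matrix.

    mi: square matrix (list of lists) of mutual information values.
    Neighbors falling outside the matrix are skipped.
    """
    size = len(mi)
    cmi = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < size and 0 <= nj < size:
                    cmi[i][j] += (mi[i][j] - mi[ni][nj]) ** 2
    return cmi
```

A uniformly high MI background yields zero everywhere under this form, while an isolated peak stands out, which matches the stated goal of removing local background.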

Integer Linear Programming Objective function: g(R): penalty for violation of physical constraints
Variables and explanations:
– X_{i,j}: equal to 1 if there is a contact between two residues i and j.
– AP_{u,v}: equal to 1 if two beta-strands u and v form an anti-parallel beta-sheet.
– P_{u,v}: equal to 1 if two beta-strands u and v form a parallel beta-sheet.
– S_{u,v}: equal to 1 if two beta-strands u and v form a beta-sheet.
– T_{u,v}: equal to 1 if there is an alpha-bridge between two helices u and v.
– R_r: a non-negative integral relaxation variable of the r-th soft constraint.
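To make the role of the binary variables X_{i,j} and the penalty g(R) concrete, here is a toy selection routine. It is not an ILP solver and not the authors' formulation: it enumerates every 0/1 assignment by brute force (only practical for a handful of candidate pairs), and it assumes the objective has the form "sum of predicted probabilities of selected contacts minus a penalty". The callbacks `feasible` and `penalty` are hypothetical stand-ins for the hard and soft constraints.

```python
from itertools import product

def select_contacts(probs, k, feasible, penalty):
    """Pick the contact subset maximizing sum(prob) - penalty by enumeration.

    probs:    dict mapping (i, j) -> predicted contact probability
    k:        maximum number of contacts to select
    feasible: function(set of pairs) -> bool, the hard constraints
    penalty:  function(set of pairs) -> float, the soft-constraint cost g(R)
    """
    pairs = sorted(probs)
    best_score, best = float("-inf"), set()
    for bits in product((0, 1), repeat=len(pairs)):  # each bit plays the role of X_{i,j}
        chosen = {p for p, b in zip(pairs, bits) if b}
        if len(chosen) > k or not feasible(chosen):
            continue
        score = sum(probs[p] for p in chosen) - penalty(chosen)
        if score > best_score:
            best_score, best = score, chosen
    return best, best_score
```

A real implementation would hand the same objective and constraints to an integer programming solver instead of enumerating assignments.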

Hard Constraints 3 One beta-strand can form beta-sheets with up to 2 other beta-strands.

Global constraints: antiparallel and parallel contacts. A residue contact implies a segment-wise contact. Limit the total number of contacts: k is the number of top contacts we want to predict.

Results on Set600 with many sequence homologs (Meff > 100). Accuracy on top L/5 predicted medium- and long-range Cβ contacts: PhyCMAP 0.611, CMAPpro 0.515, PSICOV 0.569. [Figure: CMAPpro, PSICOV and PhyCMAP compared]

Contribution of HPS and CMI features: average Cβ accuracy on the 471 proteins with Meff > 100

Contribution of physical constraints: average Cβ accuracy on the 130 proteins with Meff ≤ 100