PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming. Zhiyong Wang and Jinbo Xu, Toyota Technological Institute at Chicago. Web server: http://raptorx.uchicago.edu Extended version: http://arxiv.org/abs/1308.1975
Problem Definition Contact: the distance between two Cα or Cβ atoms is < 8 Å. Sequence separation classes: short range, 6-12 AAs apart; medium range, 12-24 AAs apart; long range, >24 AAs apart. (Figure: example protein 1J8B.)
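The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, coordinates are assumed to be one Cα or Cβ position per residue, and because the slide's ranges overlap at a separation of 12, the sketch assumes 12 counts as medium range.

```python
import math

CONTACT_CUTOFF = 8.0  # Angstrom, from the slide's definition


def range_class(i, j):
    """Classify a residue pair by sequence separation |i - j|.

    Assumption: a separation of exactly 12 is treated as medium range,
    since the slide lists both "6-12" and "12-24".
    """
    sep = abs(i - j)
    if sep < 6:
        return None  # too close in sequence to be counted
    if sep < 12:
        return "short"
    if sep <= 24:
        return "medium"
    return "long"


def contact_map(coords, cutoff=CONTACT_CUTOFF):
    """Return {(i, j): range_class} for all pairs i < j closer than cutoff.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    """
    contacts = {}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            cls = range_class(i, j)
            if cls is not None and math.dist(coords[i], coords[j]) < cutoff:
                contacts[(i, j)] = cls
    return contacts
```

Calling `contact_map` on a list of coordinates yields only pairs at least 6 residues apart, each labelled with its range class.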
Existing Work
Residue co-evolution methods: mutual information (MI), PSICOV, Evfold
– Need a large number of homologous sequences
– PSICOV and Evfold perform better than MI because they differentiate direct from indirect residue couplings (residues A and C are indirectly coupled if their coupling is due to direct A-B and B-C couplings)
– PSICOV and Evfold also enforce sparsity
Supervised learning methods: NNcon, SVMcon, CMAPpro
– Use mutual information, sequence profile and other features
– Predict contacts one by one, ignoring their correlation
– Do not differentiate direct and indirect residue couplings
First-principle method: Astro-Fold
– Uses no evolutionary information
– Minimizes contact potential
– Enforces physical feasibility, including sparsity
Our Method: PhyCMAP
1. Focus on proteins with few sequence homologs: proteins with many sequence homologs very likely have similar templates in PDB
2. Integrate sequence profile, residue co-evolution and non-evolutionary info by machine learning; (implicitly) differentiate direct and indirect residue couplings through feature engineering
3. Enforce physical constraints, which imply sparsity
Info used by Random Forests Evolution info from a single protein family – sequence profile – co-evolution: 2 types of mutual information (MI) Non-evolution info from the whole structure space: residue contact potential Mixed info from the above 2 sources – homologous pairwise contact score – EPAD: context-specific evolutionary-based distance-dependent statistical potential amino acid physico-chemical properties
Mutual Information 1. Contrastive Mutual Information (CMI): remove local background by measuring the MI difference of one pair with its neighbors. 2. Chaining effect of residue couplings: MI, MI^2, MI^3, MI^4, equivalent to (1-MI), (1-MI)^2, (1-MI)^3, (1-MI)^4 (see http://arxiv.org/abs/1308.1975 for more details)
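Both feature types above start from plain mutual information between two alignment columns. The sketch below shows that baseline only, using the standard definition MI = Σ p(x,y) log[p(x,y) / (p(x) p(y))] with raw frequencies; the function names are hypothetical, and any sequence weighting, pseudocounts or gap handling PhyCMAP may apply is not shown on the slide and is not modelled here.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns.

    col_a, col_b: equal-length sequences of residue letters, one entry
    per homolog in the multiple sequence alignment.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (x, y), c in count_ab.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((count_a[x] / n) * (count_b[y] / n)))
    return mi


def mi_matrix(msa):
    """L x L matrix of MI values for an alignment given as a list of strings."""
    length = len(msa[0])
    columns = [[seq[k] for seq in msa] for k in range(length)]
    return [[mutual_information(columns[i], columns[j]) for j in range(length)]
            for i in range(length)]
```

Two perfectly co-varying columns with two equally frequent letters give MI = ln 2, while statistically independent columns give 0.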
CMI Example: 1J8B. Upper triangle: mutual information. Lower triangle: contrastive mutual information. Blue boxes: native contacts.
Homologous Pairwise Contact Score: probability of a residue pair forming a contact between 2 secondary structures. PS_beta(a, b): probability of two AAs a and b forming a beta contact. PS_helix(a, b): probability of two AAs a and b forming a helix contact. H: the set of sequence homologs in a multiple sequence alignment.
Training Random Forests
Training dataset
– Chosen before CASP10 started
– 900 non-redundant protein structures
– <25% sequence identity
– All contacts and 20% of non-contacts
Model parameters
– Number of features: 300
– Number of trees: 500
– 5-fold cross validation
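The data preparation listed above (all contacts, 20% of non-contacts, 5 folds) might look like the following sketch. It covers only the sampling and fold assignment, not the Random Forests themselves. The function names are hypothetical, and splitting folds by protein rather than by residue pair is an assumption made here; the slide does not say how folds were formed.

```python
import random


def build_training_pairs(labels, neg_fraction=0.2, seed=0):
    """Keep every contact and a random fraction of the non-contacts.

    labels: dict mapping (protein_id, i, j) -> True (contact) or False.
    """
    rng = random.Random(seed)
    positives = [key for key, is_contact in labels.items() if is_contact]
    negatives = [key for key, is_contact in labels.items() if not is_contact]
    n_keep = round(neg_fraction * len(negatives))
    return positives + rng.sample(negatives, n_keep)


def five_fold_by_protein(protein_ids, k=5, seed=0):
    """Assign proteins to k folds (assumption: folds are split by protein)."""
    ids = sorted(set(protein_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    return [ids[fold::k] for fold in range(k)]
```

With 900 training proteins this produces five folds of 180 proteins each.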
Select Physically Feasible Contacts by Integer Linear Programming Maximize the accumulated contact probability while minimizing violation of physical constraints. X_{i,j}: indicates a contact between two residues i and j. R_r: a relaxation variable of the r-th soft constraint. g(R): penalty for violation of physical constraints.
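The objective described on this slide can be written compactly as below. This is a sketch reconstructed from the slide's wording, not a copy of the slide's formula, which did not survive in the transcript. The symbol P_{i,j}, standing for the Random Forests contact probability of residues i and j, is introduced here for illustration, and the form of g(R) is left open.

```latex
\max_{X,\,R}\;\; \sum_{i<j} P_{i,j}\, X_{i,j} \;-\; g(R),
\qquad X_{i,j} \in \{0,1\}, \quad R_r \ge 0 .
```

The first term rewards selecting high-probability contacts; the second charges for every soft constraint that has to be relaxed.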
Soft Constraints 1: the number of contacts between two secondary structure segments is limited
s1,s2   95%   Max
H,H     5     12
H,E     3     10
H,C     4     11
E,H     4     12
E,E     9     13
E,C     6     15
C,H     3     12
C,E     5     12
C,C     6     20
Soft Constraints 2 Upper and lower bounds for #contacts between two beta strands
Soft Constraints 3 Statistics show that only 3.4% of loop segments have a contact between their start and end residues.
Hard Constraints 1 For parallel contacts between two β strands, the contacts of neighboring residue pairs should satisfy the following constraints: [equations not captured in transcript]. For anti-parallel contacts: [equations not captured in transcript].
Hard Constraints 2 1) One residue i cannot form contacts with both residues j and j+2 when j and j+2 are in the same alpha helix. 2) One beta-strand can form beta-sheets with up to 2 other beta-strands.
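The two rules above are easy to check on a candidate contact set. The sketch below is a plain feasibility checker, not the ILP formulation itself. The function names are hypothetical, secondary structure is assumed to be a string over H/E/C, and "in the same alpha helix" is interpreted as j, j+1 and j+2 all being labelled H.

```python
def helix_violations(contacts, ss):
    """Find residues i that contact both j and j+2 inside one helix.

    contacts: iterable of (i, j) residue index pairs (unordered).
    ss: secondary structure string, one of 'H', 'E', 'C' per residue.
    Returns a sorted list of (i, j, j + 2) triples that break rule 1.
    """
    partners = {}
    for i, j in contacts:
        partners.setdefault(i, set()).add(j)
        partners.setdefault(j, set()).add(i)
    bad = []
    for i, js in partners.items():
        for j in sorted(js):
            # Assumption: same helix means positions j..j+2 are all 'H'.
            if j + 2 in js and ss[j:j + 3] == "HHH":
                bad.append((i, j, j + 2))
    return sorted(bad)


def strand_partner_violations(sheet_pairs, max_partners=2):
    """Return strands paired with more than max_partners other strands.

    sheet_pairs: iterable of (u, v) strand index pairs that form a sheet.
    """
    partners = {}
    for u, v in sheet_pairs:
        partners.setdefault(u, set()).add(v)
        partners.setdefault(v, set()).add(u)
    return sorted(s for s, p in partners.items() if len(p) > max_partners)
```

An empty result from both functions means the candidate set satisfies these two hard constraints.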
Test Datasets
CASP10: 123 proteins
– 36 are “hard”, i.e., no similar templates in PDB
– low sequence identity (<25%) among them
– low sequence identity with the training data, which were chosen before CASP10 started
Set600: 601 proteins
– share <25% sequence identity with the training proteins
– each has ≥50 AAs and an X-ray structure with resolution <1.9 Å
– each has ≥5 AAs with predicted secondary structure being alpha-helix or beta-strand
Accuracy w.r.t. #sequence homologs 1. Meff: #non-redundant sequence homologs of a protein 2. Divide the CASP10 targets into groups by Meff 3. Top L/10 predicted medium- and long-range contacts (Plot: accuracy vs. log Meff)
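The "top L/10" accuracy used here can be computed as in the sketch below: rank the predicted pairs by probability, keep the best L/k, and report the fraction that are native contacts. The function name and argument layout are hypothetical, and a minimum sequence separation of 12 is assumed for "medium- and long-range".

```python
def top_accuracy(probs, native, length, k=10, min_sep=12):
    """Fraction of the top length/k predicted contacts that are native.

    probs: dict (i, j) -> predicted contact probability, with i < j.
    native: set of (i, j) native contacts, with i < j.
    length: protein length L.
    min_sep: minimum |i - j| considered (assumed 12 for medium + long range).
    """
    candidates = [(p, pair) for pair, p in probs.items()
                  if abs(pair[0] - pair[1]) >= min_sep]
    candidates.sort(key=lambda item: item[0], reverse=True)
    n_top = max(1, length // k)
    top = candidates[:n_top]
    if not top:
        return 0.0
    hits = sum(1 for _, pair in top if pair in native)
    return hits / len(top)
```

Setting `k=5` gives the top L/5 accuracy reported on the results slides.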
Results on CASP10 – Medium Range. Overall accuracy on top L/5 predicted Cβ contacts: PhyCMAP: 0.465, CMAPpro: 0.370, PSICOV: 0.316
Results on CASP10 – Long Range. Overall accuracy on top L/5 predicted Cβ contacts: PhyCMAP: 0.373, CMAPpro: 0.313, PSICOV: 0.315
Results on 36 hard CASP10 targets. Accuracy on top L/5 predicted medium- and long-range Cβ contacts: PhyCMAP: 0.363, CMAPpro: 0.308, PSICOV: 0.180
Results on Set600 with few homologs (Meff ≤ 100). Accuracy on top L/5 predicted medium- and long-range Cβ contacts: PhyCMAP: 0.345, CMAPpro: 0.287, PSICOV: 0.059
Example: T0677-D2, dozens of sequence homologs (Meff = 31). Upper triangle: native Cβ contacts. Left lower triangle: PhyCMAP, accuracy 0.357. Right lower triangle: Evfold, accuracy ~0. Note that contacts between alpha helices are not continuous.
Example: T0693-D2, many sequence homologs (Meff = 2208). Upper triangle: native Cβ contacts. Left lower triangle: PhyCMAP, accuracy 0.744. Right lower triangle: Evfold, accuracy 0.419.
Example: T0701-D1, many sequence homologs (Meff = 3300). Upper triangle: native Cβ contacts. Left lower triangle: PhyCMAP, accuracy 0.794. Right lower triangle: Evfold, accuracy 0.444.
Example: T0756-D1, many sequence homologs (Meff = 1824). Upper triangle: native Cβ contacts. Left lower triangle: PhyCMAP, accuracy 0.944. Right lower triangle: Evfold, accuracy 0.500.
Summary Combining sequence profile, residue co-evolution and non-evolutionary info can result in good accuracy even for proteins with 10-100 non-redundant sequence homologs. Physical constraints are helpful for proteins with few sequence homologs. (Figure: Cβ accuracy on 130 proteins with Meff ≤ 100.)
Acknowledgements Student: Zhiyong Wang Funding – NIH R01GM0897532 – NSF CAREER award – Alfred P. Sloan Research Fellowship Computational resources – University of Chicago Beagle team – TeraGrid Web server at http://raptorx.uchicago.edu
Protein contact Contact: the distance between two Cα or Cβ atoms is < 8 Å; or the distance between the closest atoms of 2 residues. Sequence separation classes: short range, 6-12 AAs apart; medium range, 12-24 AAs apart; long range, >24 AAs apart. (Figure: example protein 1J8B.)
Why contact prediction? Contacts describe the spatial and functional relationships of residues. They contain key information for 3D structure, are useful for protein structure prediction, and are used for protein structure alignment and classification.
Contrastive Mutual Information Contrastive Mutual Information (CMI) removes local background, by measuring the MI difference between one pair of residues and neighboring pairs.
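One way to realise this idea is sketched below. The exact CMI formula is not given on the slide, so the form used here, the sum of squared differences between an entry of the MI matrix and its four immediate neighbours, is an assumption for illustration only; the extended version on arXiv should be consulted for the authors' definition. The function name is hypothetical.

```python
def contrastive_mi(mi):
    """Contrast each MI entry against its up/down/left/right neighbours.

    mi: square matrix (list of lists) of mutual information values.
    Assumed form: CMI[i][j] = sum over the in-range neighbours (a, b) of
    (mi[i][j] - mi[a][b]) ** 2. Border cells use only existing neighbours.
    """
    n = len(mi)
    cmi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < n:
                    total += (mi[i][j] - mi[a][b]) ** 2
            cmi[i][j] = total
    return cmi
```

A uniformly high MI background yields zero everywhere, while an isolated peak stands out, which is the "remove local background" behaviour the slide describes.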
Integer Linear Programming
Objective function (g(R): penalty for violation of physical constraints)
Variables and explanations:
– X_{i,j}: equal to 1 if there is a contact between two residues i and j.
– AP_{u,v}: equal to 1 if two beta-strands u and v form an anti-parallel beta-sheet.
– P_{u,v}: equal to 1 if two beta-strands u and v form a parallel beta-sheet.
– S_{u,v}: equal to 1 if two beta-strands u and v form a beta-sheet.
– T_{u,v}: equal to 1 if there is an alpha-bridge between two helices u and v.
– R_r: a non-negative integral relaxation variable of the r-th soft constraint.
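To make the roles of X and R concrete, here is a toy version of the selection problem solved by brute-force enumeration instead of an ILP solver. It is only a sketch under stated assumptions: the function and argument names are hypothetical, the penalty g(R) is assumed to be the plain sum of the relaxation variables, and only three constraint types are modelled (a cap k on the total number of contacts, mutually exclusive contact pairs as a hard constraint, and per-group bounds as soft constraints). Enumeration is exponential in the number of candidates, so this is for tiny examples only.

```python
from itertools import product


def select_contacts(probs, k, soft_groups=(), forbidden_pairs=()):
    """Pick the contact set maximising sum(prob) - sum(R_r).

    probs: dict pair -> contact probability.
    k: hard upper limit on the number of selected contacts.
    soft_groups: iterable of (set_of_pairs, bound); selecting more than
        bound contacts from a group costs R_r = excess (assumed penalty).
    forbidden_pairs: iterable of (pair_a, pair_b) that cannot both be chosen.
    Returns (selected_set, objective_value).
    """
    pairs = sorted(probs)
    best_score, best = float("-inf"), set()
    for bits in product((0, 1), repeat=len(pairs)):
        chosen = {p for p, bit in zip(pairs, bits) if bit}
        if len(chosen) > k:
            continue  # global hard constraint
        if any(a in chosen and b in chosen for a, b in forbidden_pairs):
            continue  # pairwise hard constraint
        # Smallest non-negative R_r that satisfies each soft constraint.
        penalty = sum(max(0, len(chosen & group) - bound)
                      for group, bound in soft_groups)
        score = sum(probs[p] for p in chosen) - penalty
        if score > best_score:
            best_score, best = score, chosen
    return best, best_score
```

In a realistic setting the same objective and constraints would be handed to an ILP solver; the enumeration here just keeps the example dependency-free.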
Hard Constraints 3 One beta-strand can form beta-sheets with up to 2 other beta-strands.
Global constraints: antiparallel and parallel contacts; a residue contact implies a segment-wise contact; put a limit on the total number of contacts, where k is the number of top contacts we want to predict.
Results on Set600 with many sequence homologs (Meff > 100). Accuracy on top L/5 predicted medium- and long-range Cβ contacts: PhyCMAP: 0.611, CMAPpro: 0.515, PSICOV: 0.569
Contribution of HPS and CMI features: average Cβ accuracy on the 471 proteins with Meff > 100
Contribution of physical constraints: average Cβ accuracy on 130 proteins with Meff ≤ 100