Graphical Modeling of Multiple Sequence Alignment Jinbo Xu Toyota Technological Institute at Chicago Computational Institute, The University of Chicago.


Two applications of MSA
Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso
– An important subproblem of protein folding
Align two MSAs through alignment of two Markov Random Fields (MRFs)
– Homology detection and fold recognition
– Merge two MSAs into a larger one

Modeling MSA by Markov Random Fields

Numeric Representation of MSA
[Figure: binary encoding of an MSA column, e.g. …00 …10…]
– 21 elements for each column in MSA
– Represent a sequence in MSA as an L×21 binary vector
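The encoding on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the 21-letter alphabet (20 standard amino acids plus the gap symbol), the names `ALPHABET`, `encode_sequence` and `encode_msa`, and the choice to treat unknown letters as gaps are all assumptions.

```python
# Assumed alphabet: 20 standard amino acids plus "-" for a gap (21 symbols).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def encode_sequence(seq):
    """Return a flat binary list with 21 entries per alignment column."""
    vec = []
    for ch in seq:
        col = [0] * len(ALPHABET)
        idx = ALPHABET.find(ch.upper())
        # Assumption: letters outside the alphabet (e.g. X) count as gaps.
        col[idx if idx >= 0 else ALPHABET.index("-")] = 1
        vec.extend(col)
    return vec


def encode_msa(msa):
    """Encode every aligned sequence; all rows must have the same length L."""
    return [encode_sequence(s) for s in msa]
```

For an alignment of length L, each encoded row has 21·L entries and exactly L of them are 1.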

Gaussian Graphical Model (GGM)

Covariance and Precision Matrix
[Figure: L×L grid of 21×21 blocks]
– The precision matrix has dimension 21L×21L
– Each 21×21 submatrix corresponds to one residue pair
– Larger values indicate stronger interaction
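One way to turn the block structure into a per-pair interaction score is to summarize each 21×21 submatrix by a norm. The sketch below uses the Frobenius norm, which is an assumption (the slide only says that larger values indicate stronger interaction); the function name `block_scores` and the parameter `q` are hypothetical.

```python
import math


def block_scores(precision, L, q=21):
    """Summarize each off-diagonal q x q block of a (q*L) x (q*L) matrix.

    precision: nested lists; block (i, j) holds rows i*q..i*q+q-1 and
    columns j*q..j*q+q-1. Returns {(i, j): Frobenius norm} for i < j.
    """
    scores = {}
    for i in range(L):
        for j in range(i + 1, L):
            total = 0.0
            for a in range(q):
                for b in range(q):
                    total += precision[i * q + a][j * q + b] ** 2
            scores[(i, j)] = math.sqrt(total)
    return scores
```

Sorting the returned pairs by score gives a ranked list of candidate contacts.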

Today’s talk
Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso
– An important subproblem of protein folding
Align two MSAs through alignment of two Markov Random Fields (MRFs)
– Homology detection and fold recognition
– Merge two MSAs into a larger one

Protein Contact Map (residue interaction network)
[Figure: residues 1-4 with distance labels 6.0, 8.1, 5.9 and 3.8]
     1  2  3  4
1    0  1  1  0
2    1  0  1  1
3    1  1  0  1
4    0  1  1  0
Two residues are in contact if their Cα or Cβ distance is < 8 Å
Shorter distance, stronger interaction
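The contact definition can be sketched as a distance threshold over 3D coordinates. This is a minimal illustration: the function name `contact_map` and the input format (one `(x, y, z)` tuple per residue, for whichever atom is chosen) are assumptions.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact matrix: 1 if two residues are closer than threshold (in Å).

    coords: list of (x, y, z) tuples, one per residue (e.g. Cα or Cβ atoms).
    The diagonal is left at 0, as in the matrix on the slide.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap
```

As a usage example, four hypothetical points spaced 3.8 Å apart on a straight line, `[(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0)]`, give the same 0/1 pattern as the 4×4 matrix above: only residues 1 and 4 are not in contact.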

Contact Matrix is Sparse
– Short range: 6-12 AAs apart along primary sequence
– Medium range: 12-24 AAs apart
– Long range: >24 AAs apart
– #contacts is linear w.r.t. sequence length
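The range categories can be written as a small helper. The slide's ranges overlap at 12, so the boundary handling here is an assumption: 6 to 11 is treated as short, 12 to 24 as medium, and anything above 24 as long. The function name `contact_range` is hypothetical.

```python
def contact_range(i, j):
    """Classify a residue pair by sequence separation |i - j|.

    Assumed boundaries: short = 6..11, medium = 12..24, long = > 24.
    Pairs closer than 6 positions are not classified (None).
    """
    sep = abs(i - j)
    if sep > 24:
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= 6:
        return "short"
    return None
```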

Protein Contact Prediction
Input: MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTK EVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLAN LESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKK KASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN
Output: contact map (figure)
With L/12 long-range native contacts, the fold of a protein can be roughly determined [Baker group]

Contact Prediction Methods
Evolutionary coupling analysis (unsupervised learning)
– Identify co-evolved residues from multiple sequence alignment
– No solved protein structures used at all
– High-throughput sequencing makes this method promising
– e.g., mutual information, Evfold, PSICOV, plmDCA, GREMLIN
Supervised machine learning
– Input features: sequence (profile) similarity, chemical properties similarity, mutual information
– (Implicitly) learn information from solved structures
– Examples: NNcon, SVMcon, CMAPpro, PhyCMAP

Evolutionary Coupling (EC) Analysis
Observation: two residues in contact tend to co-evolve; conversely, two co-evolved residues are likely to form a contact
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028766

Evolutionary Coupling (EC) Analysis (Cont’d)
Local statistical methods: examine the correlation between two residues independent of the others
– Mutual information (MI): two residues in contact are likely to have large MI
– Not all residue pairs with large MI are in contact, due to indirect evolutionary coupling: if A~B and B~C, then likely A~C
Global statistical methods: examine the correlation between two residues conditioned on the others
– Need a large number of sequences
– Maximum-Entropy: Evfold
– Graphical lasso: PSICOV
– Pseudo-likelihood: plmDCA, GREMLIN
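As an illustration of the local statistic, mutual information between two alignment columns can be computed from raw frequencies. This is a bare sketch: it uses no sequence weighting and no pseudocounts, which practical tools usually add, and the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an MSA.

    msa: list of equal-length aligned sequences (strings).
    """
    n = len(msa)
    fi = Counter(s[i] for s in msa)            # marginal counts, column i
    fj = Counter(s[j] for s in msa)            # marginal counts, column j
    fij = Counter((s[i], s[j]) for s in msa)   # joint counts
    mi = 0.0
    for (a, b), c in fij.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((fi[a] / n) * (fj[b] / n)))
    return mi
```

Two perfectly covarying columns with two equally frequent letters give ln 2, while independent columns give 0. A large value alone does not separate direct from indirect coupling, which is the limitation the slide points out.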

Single MSA-based Contact Prediction
Enforce sparse precision matrix. Why?
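The formula on this slide did not survive in the transcript. For orientation only, the standard graphical lasso estimate of a sparse precision matrix can be written as below; the symbols are assumptions rather than the slide's own notation: S is the empirical covariance matrix of the encoded MSA, Ω the precision matrix, and λ > 0 the weight of the sparsity penalty.

```latex
\hat{\Omega} \;=\; \arg\max_{\Omega \succ 0} \;
  \log\det\Omega \;-\; \operatorname{tr}(S\,\Omega) \;-\; \lambda \,\lVert \Omega \rVert_{1}
```

The L1 term drives most entries of Ω to zero, which fits the earlier observation that the contact matrix is sparse, with a number of contacts linear in the sequence length.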

Issues with Existing Methods
– Evolutionary coupling (EC) analysis works only for proteins with a large number of sequence homologs
– Focus on improving the statistical methods (e.g., relaxing the Gaussian assumption, consensus of a few EC methods) instead of using extra biological information/insight
– Use information mostly in a single protein family
– Physical constraints other than sparsity not used

Our Work: contact prediction using multiple MSAs
Jointly predict contacts for related families of similar folds, i.e., predict contacts using multiple MSAs
– These MSAs share inter-residue interaction network to some degree
Integrate evolutionary coupling (EC) analysis with supervised learning
– EC analysis makes use of residue co-evolution information
– Supervised learning makes use of sequence (profile) similarity
Goal: focus on proteins without many sequence homologs
Strategy: increase statistical power by information aggregation

Red: shared; Blue: unique to PF00116; Green: unique to PF13473 Observation: different protein families share similar contact maps

Joint evolutionary coupling (EC) analysis
Jointly predict contacts for a set of related protein families
– Predict contacts for a protein family using information in other related families
– Enforce contact map consistency among related families
– Do not lose family-specific information

Joint graphical lasso for joint evolutionary coupling analysis How to enforce contact map consistency?

Residue Pair/Submatrix Grouping In total ≤L(L-1)/2 groups where L is the seq length

Enforce Contact Map Consistency by Group Penalty Group conservation level
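The penalty formula on this slide is not preserved. As a generic illustration of how a group penalty couples aligned residue pairs across families, the sketch below shows the standard group-lasso shrinkage step: all members of a group are scaled toward zero together, and the whole group is zeroed when its norm falls below the threshold. It is not necessarily the author's formulation, and `group_shrink`, `values` and `lam` are hypothetical names.

```python
import math


def group_shrink(values, lam):
    """Group soft-thresholding (proximal step of lam * ||values||_2).

    values: coupling strengths of one group, e.g. one aligned residue pair
            taken across several families (an assumed interpretation).
    lam:    penalty weight; a weaker penalty could be used for less
            conserved groups, so that family-specific contacts survive.
    """
    norm = math.sqrt(sum(v * v for v in values))
    if norm <= lam:
        return [0.0] * len(values)
    scale = 1.0 - lam / norm
    return [scale * v for v in values]
```

Because the group shrinks or survives as a unit, a pair that is strongly coupled in most families is kept in all of them.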

Supervised Machine Learning
Input features: sequence profile, amino acid chemical properties, mutual information power series, context-specific statistical potential
Mutual information power series:
– Local info: mutual information matrix (MI)
– Partially global info: MI^2, MI^3, …, MI^11
– Can be calculated much faster than PSICOV
Random Forests trained on 800-900 proteins
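Assuming the power series means successive matrix powers of the L×L mutual information matrix, it can be sketched with plain matrix multiplication. The function names `matmul` and `mi_power_series` are hypothetical, and any normalization applied before taking powers is not shown on the slide and is omitted here.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    m = len(b)
    p = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(m)) for j in range(p)]
            for row in a]


def mi_power_series(mi, max_power=11):
    """Return [MI^1, MI^2, ..., MI^max_power] for a square matrix mi."""
    powers = [mi]
    for _ in range(max_power - 1):
        powers.append(matmul(powers[-1], mi))
    return powers
```

Entry (i, j) of the k-th power sums over chains of k pairwise MI values linking i to j, which is one way to read the "partially global" label. Each power costs one matrix multiplication.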

Joint EC Analysis with Supervised Prediction as Prior
[Objective function, with its terms labeled as follows]
– Log-likelihood of K families
– Sparsity
– Contact map consistency among families
– Similarity with supervised prediction
This optimization problem can be solved by ADMM to a suboptimal solution

Accuracy on 98 Pfam families
             Medium-range            Long-range
Method       L/10   L/5    L/2       L/10   L/5    L/2
CoinDCA      0.496  0.435  0.312     0.561  0.502  0.391
PSICOV       0.375  0.312  0.213     0.446  0.400  0.311
PSICOV_b     0.388  0.306  0.199     0.462  0.400  0.294
plmDCA       0.433  0.354  0.233     0.484  0.443  0.343
plmDCA_h     0.433  0.339  0.211     0.480  0.413  0.292
GREMLIN      0.401  0.332  0.225     0.447  0.423  0.329
GREMLIN_h    0.391  0.316  0.204     0.428  0.400  0.301
Merge_p      0.303  0.246  0.178     0.370  0.328  0.253
Merge_m      0.276  0.223  0.169     0.355  0.309  0.232
Voting       0.405  0.280  0.168     0.337  0.353  0.275

Accuracy vs. # Sequence Homologs (A) Medium-range (B) Long-range X-axis: ln of the number of non-redundant sequence homologs Y-axis: L/10 accuracy

Accuracy on 123 CASP10 targets
             Medium-range            Long-range
Method       L/10   L/5    L/2       L/10   L/5    L/2
CoinDCA      0.500  0.440  0.340     0.412  0.351  0.279
Evfold       0.294  0.249  0.188     0.257  0.225  0.171
PSICOV       0.310  0.259  0.192     0.276  0.225  0.168
plmDCA       0.344  0.289  0.214     0.326  0.280  0.213
GREMLIN      0.343  0.280  0.229     0.320  0.278  0.159
NNcon        0.393  0.334  0.226     0.239  0.188  0.001
CMAPpro      0.414  0.363  0.276     0.336  0.297  0.227

Accuracy vs. # sequence homologs (CASP10) X-axis: ln of # non-redundant sequence homologs Y-axis: L/10 long-range prediction accuracy

Accuracy vs. Contact Conservation Level (A)Medium-range; (B) long-range X-axis: conservation level, the larger, the more conserved

Today’s Talk
Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso
– An important subproblem of protein folding
Align two MSAs through alignment of two Markov Random Fields (MRFs)
– Remote homology detection and fold recognition
– Merge two MSAs into a larger one

Homology Detection & Fold Recognition
Primary sequence comparison
– Similar sequences -> very likely homologous
– Sequence alignment methods, e.g., BLAST, FASTA
– Works only for close homologs
Profile-based method
– Compare two protein families instead of primary sequences, using evolutionary information in a family
– Sequence-profile alignment & profile-profile alignment
– Profile can be represented as a matrix (e.g., FFAS) or an HMM (e.g., HHpred, HMMER)
– Sometimes works for remote homologs, but not sensitive enough

MSA to Sequence Profile Two popular profile representations: (1) Position-specific scoring matrix (PSSM); (2) Hidden Markov Model (HMM)

Position-Specific Scoring Matrix (PSSM) Taken from http://carrot.mcb.uconn.edu/~olgazh/bioinf2010/class10.html
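A PSSM can be sketched as per-column log-odds scores computed from the MSA. The version below is deliberately simplified and rests on stated assumptions: a uniform background frequency, a flat pseudocount, gaps ignored, and no sequence weighting. The function name `pssm` and its parameters are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm(msa, pseudocount=1.0):
    """Per-column log2-odds scores for each of the 20 amino acids.

    msa: list of equal-length aligned sequences; gap characters are skipped.
    Returns one dict {amino acid: score} per alignment column.
    """
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    matrix = []
    for c in range(len(msa[0])):
        counts = Counter(s[c] for s in msa if s[c] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {a: math.log2(((counts[a] + pseudocount) / total) / background)
               for a in AMINO_ACIDS}
        matrix.append(row)
    return matrix
```

Each column is scored independently of all others, which is the limitation the MRF representation in this talk is meant to address.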

Hidden Markov Model (HMM) http://www.biopred.net/eddy.html

Our Work: Markov Random Fields (MRF) Representation
1) MRF encodes long-range residue interaction patterns while HMM does not
2) Long-range interaction patterns encode global information of a protein, so MRF can deal with proteins of similar folds but divergent sequences

Protein alignment by aligning two MRFs
[Figure: MSAs of Family 1 and Family 2, represented as MRF1 and MRF2; sequence rows shown: GRK-YSA, GRK-YSA, FLV-LYI, KLV-LYI, PTAKFRE, PTAKFRS, PTVPGYE, PTVPGRS]

Scoring function for MRF alignment
[Figure: MRF1 aligned to MRF2, with two labeled terms: local alignment potential and pairwise alignment potential]
NP-hard due to 1) gaps allowed and 2) pairwise potential

Alternating Direction Method of Multipliers (ADMM)
– Make a copy of z to y
– Add a penalty term to obtain an augmented problem
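The two steps on this slide (copy the variable, add a penalty) can be seen on a toy problem. The sketch below is not the MRF alignment problem: it applies scaled-form ADMM to the convex scalar problem min ½(z − a)² + λ|y| subject to z = y, whose exact solution is soft-thresholding of a by λ. The function names and the parameters `rho` and `iters` are assumptions.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (closed-form minimizer for the |y| term)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_scalar_lasso(a, lam, rho=1.0, iters=200):
    """Toy ADMM: minimize 0.5*(z - a)**2 + lam*|y| subject to z == y."""
    z = y = u = 0.0  # u is the scaled dual variable for the constraint z == y
    for _ in range(iters):
        # z-step: quadratic term plus penalty (rho/2)*(z - y + u)**2
        z = (a + rho * (y - u)) / (1.0 + rho)
        # y-step: |y| term plus the same penalty, solved by soft-thresholding
        y = soft_threshold(z + u, lam / rho)
        # dual update: accumulate the remaining disagreement between z and y
        u += z - y
    return y
```

With a = 3 and λ = 1 the iterates approach 2, matching the closed-form answer. Each step touches only one of the two terms, which is why splitting into a variable and its copy is useful.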

ADMM (Cont’d)

Both subproblems can be solved efficiently by dynamic programming!
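As a reminder of the kind of dynamic programming involved, the sketch below is a generic global alignment recurrence with a linear gap penalty. It is not the author's exact subproblem; in this illustration the position-pair score is a hypothetical callback `score(i, j)`, and `align_score` and `gap` are assumed names.

```python
def align_score(n, m, score, gap=-1.0):
    """Best global alignment score for two sequences of lengths n and m.

    score(i, j): score for aligning position i of the first sequence with
                 position j of the second (0-based).
    gap:         score added for each gapped position.
    """
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(i - 1, j - 1),  # align i with j
                           dp[i - 1][j] + gap,                      # gap in second
                           dp[i][j - 1] + gap)                      # gap in first
    return dp[n][m]
```

The table has (n+1)·(m+1) cells and each is filled in constant time, so once position-pair scores are fixed the alignment step costs O(nm).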

Superfamily & Fold Recognition Rate Superfamily level detection Fold level detection

Conclusion
– Joint evolutionary coupling analysis + supervised learning can significantly improve protein contact prediction by using information in multiple MSAs
– Long-range residue interaction encoded in an MSA is helpful for remote homolog detection

Acknowledgements
RaptorX servers at http://raptorx.uchicago.edu
Students: Jianzhu Ma, Zhiyong Wang, Sheng Wang
Funding
– NIH R01GM0897532
– NSF CAREER award and NSF ABI
– Alfred P. Sloan Research Fellowship
Computational resources
– University of Chicago Beagle team
– TeraGrid and Open Science Grid

Protein Structure Prediction
Input: MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTK EVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLAN LESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKK KASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN
Output: 3D structure (figure)
1. One of the most challenging problems in computational biology!
2. Improved due to better algorithms and large databases
3. Knowledge-based methods outperform physics-based methods
4. Big demand: our server processes > 800 jobs/week, >12k users in 3 yrs

Performance in CASP9 (2010)
A blind test for protein structure prediction
Server ranking tested on the 50 hardest TBM targets
Adapted from http://predictioncenter.org/casp9/doc/presentations/CASP9_TBM.pdf

Performance in CASP10 (2012)
A blind test for protein structure prediction
The top 10 performing human/server groups on the hardest TBM targets
The only server group among top 10
Adapted from http://predictioncenter.org/casp10/doc/presentations/CASP10_TBM_GM.pdf

My Work
Analyze large-scale biological data and build predictive models
– Protein sequence and structure alignment
– Homology detection and fold recognition
– Protein structure prediction
– Protein function prediction (e.g., interaction and binding site prediction)
– Biological network construction and analysis
Study computational methods that have applications beyond bioinformatics
– Machine learning (e.g., probabilistic graphical model)
– Optimization (discrete, combinatorial and continuous)

Homology Detection & Fold Recognition
Homology detection & fold recognition
– Determine the relationship between two proteins
– Given a query, search for all homologs in a database
Homology search/fold recognition useful for
– Study of protein evolutionary relationships
– Functional transfer
– Homology modeling (i.e., template-based modeling)
Two proteins are homologous if they have shared ancestry. Two proteins have the same fold if their 3D structures are similar.

Structure Prediction (Cont’d)
Template-based modeling (TBM)
– Using solved protein structures as templates, e.g., homology modeling and protein threading
– Most reliable, but fails when there are no good templates
Template-free modeling (FM) or ab initio folding
– Not using solved protein structures as templates
– Mostly works only on some small proteins
Subproblems
– Loop modeling
– Inter-residue contact prediction

Residue Pair Grouping

Precision Submatrix Grouping
Suppose that residue pair (2,4) in Family 1 is aligned to pair (3,5) in Family 2
In total ≤L(L-1)/2 groups, where L is the seq length

Performance on the 31 Pfam families with only distantly-related auxiliary families
             Medium-range            Long-range
Method       L/10   L/5    L/2       L/10   L/5    L/2
CoinFold     0.457  0.400  0.267     0.558  0.524  0.416
PSICOV       0.413  0.360  0.252     0.494  0.465  0.377
PSICOV_p     0.320  0.295  0.212     0.396  0.355  0.290
PSICOV_v     0.400  0.320  0.179     0.396  0.375  0.261
CoinFold: our work
PSICOV: single-family method
PSICOV_p: merge multiple families and apply single-family method
PSICOV_v: single-family method for each family and then consensus

Performance on the 13 Pfam families with closely-related auxiliary families
             Medium-range            Long-range
Method       L/10   L/5    L/2       L/10   L/5    L/2
CoinFold     0.501  0.395  0.251     0.462  0.413  0.293
PSICOV       0.433  0.351  0.231     0.398  0.331  0.234
PSICOV_p     0.335  0.220  0.175     0.322  0.276  0.194
PSICOV_v     0.423  0.320  0.188     0.386  0.384  0.301
CoinFold: our work
PSICOV: single-family method
PSICOV_p: merge multiple families and apply single-family method
PSICOV_v: single-family method for each family and then consensus

Our method vs. PSICOV

Our method vs. GREMLIN
Top L/10 predicted long-range contacts are evaluated

Performance vs. family size
CoinFold: our work
PSICOV: single-family method
PSICOV_p: merge multiple families and apply single-family method
PSICOV_v: single-family method for each family and then consensus

Multiple Sequence Alignment (MSA) of One Protein Family

Top L/10 long-range prediction accuracy on 15 large Pfam families
(blank cells are missing in the transcript; their column position is inferred from the spacing)
PFAM ID    MEFF    CoinFold  PSICOV  PSICOV_p
PF00041    2981    0.767             0.667
PF00595    3026    0.556     0.444   0.233
PF03061    3334    0.375     0.250   0.500
PF01522    3519    0.615     0.462   0.077
PF00578    3733    0.308             0.231
PF00059    3744    0.455             0.182
PF07686    3801    0.917     0.583   0.667
PF00034    4060    0.600     0.500   0.200
PF00989    4596    0.583     0.250   0.500
PF00144    4684    0.272     0.212   0.242
PF00085    5075    0.636     0.545   0.182
PF00168    6735    0.667             0.556
PF00515    7230    0.500             0.250
PF00089    9045    0.783             0.739
PF00550    11476   0.857             0.714

Running Time
[Figure axes: average protein sequence length; time (in seconds)]

Performance: Alignment Accuracy

Performance: Homology Detection

Performance: Alignment Accuracy
Tmalign, Matt and DeepAlign provide three different ground truths

Joint Graphical Lasso Formulation Rewrite the original problem as

Alternating Direction Method of Multipliers (ADMM)
Add a penalty term to obtain an augmented problem, which has the same solution but converges faster.

Lagrangian Relaxation

ADMM (Cont’d)
For a fixed U, split the relaxation problem into two subproblems and solve them alternately
