Graphical Modeling of Multiple Sequence Alignment Jinbo Xu Toyota Technological Institute at Chicago Computational Institute, The University of Chicago
Two applications of MSA Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso – An important subproblem of protein folding Align two MSAs through alignment of two Markov Random Fields (MRFs) – Homology detection and fold recognition – Merge two MSAs into a larger one
Modeling MSA by Markov Random Fields
Numeric Representation of MSA Each column of the MSA is encoded by 21 binary elements (e.g., …0010…) Represent a sequence in MSA as an L×21 binary vector
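The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the 21-letter alphabet (20 amino acids plus a gap symbol), its ordering, and the function names `one_hot_encode` and `encode_msa` are assumptions made for the example.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids plus the gap symbol (21 states).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def one_hot_encode(seq):
    """Encode one aligned sequence of length L as a flat binary vector of length 21*L."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    vec = np.zeros((len(seq), len(ALPHABET)), dtype=np.int8)
    for pos, aa in enumerate(seq):
        # Symbols outside the alphabet (e.g. X) are treated as gaps in this sketch.
        vec[pos, index.get(aa, index["-"])] = 1
    return vec.reshape(-1)

def encode_msa(msa):
    """Stack the encoded sequences into an N x (21*L) binary matrix."""
    return np.stack([one_hot_encode(s) for s in msa])

msa = ["GRK-YSA", "FLV-LYI", "PTAKFRE"]
X = encode_msa(msa)
print(X.shape)  # N rows, 21*L columns
```

Each row has exactly L ones, one per alignment column, which is what makes the later covariance computation a statement about amino-acid co-occurrence.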
Gaussian Graphical Model (GGM)
Covariance and Precision Matrix The precision matrix has dimension 21L×21L Each 21×21 submatrix corresponds to one residue pair Larger values indicate stronger interaction
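To make the block structure concrete, here is a hedged sketch that estimates a precision matrix from one-hot encoded sequences and reduces every 21×21 submatrix to a single interaction score. The shrinkage toward a scaled identity (used only to make the covariance invertible), the Frobenius norm as the block summary, and the name `pair_scores` are assumptions for illustration; this is not the graphical lasso itself.

```python
import numpy as np

def pair_scores(X, L, q=21, shrink=0.1):
    """X: N x (q*L) one-hot matrix. Returns an L x L matrix of block norms."""
    S = np.cov(X, rowvar=False)                      # (q*L) x (q*L) covariance
    # Shrink toward a scaled identity so that S is invertible (assumed regularizer).
    target = np.eye(S.shape[0]) * np.trace(S) / S.shape[0]
    S = (1.0 - shrink) * S + shrink * target
    P = np.linalg.inv(S)                             # precision matrix
    # blocks[i, a, j, b] = P[i*q + a, j*q + b]: one q x q block per residue pair.
    blocks = P.reshape(L, q, L, q)
    scores = np.sqrt((blocks ** 2).sum(axis=(1, 3))) # Frobenius norm of each block
    np.fill_diagonal(scores, 0.0)                    # ignore self-interaction
    return scores

# Toy data: 50 random sequences of length 5 (hypothetical, for shape checking only).
rng = np.random.default_rng(0)
L, N = 5, 50
states = rng.integers(0, 21, size=(N, L))
X = np.zeros((N, L * 21))
for n in range(N):
    for i in range(L):
        X[n, i * 21 + states[n, i]] = 1.0

scores = pair_scores(X, L)
print(scores.shape)
```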
Today’s talk Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso – An important subproblem of protein folding Align two MSAs through alignment of two Markov Random Fields (MRFs) – Homology detection and fold recognition – Merge two MSAs into a larger one
Protein Contact Map (residue interaction network) Two residues are in contact if their Cα or Cβ distance is < 8 Å Shorter distance means stronger interaction
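The contact definition translates directly into a distance threshold on atomic coordinates. The sketch below assumes one 3D coordinate per residue (Cα or Cβ, whichever is chosen); the toy coordinates, including the 3.8 Å spacing along a straight line, are hypothetical values picked so the result is easy to check by hand.

```python
import numpy as np

def contact_map(coords, threshold=8.0):
    """Boolean L x L matrix: True where two residues are closer than `threshold` angstroms."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise coordinate differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # pairwise Euclidean distances
    contacts = dist < threshold
    np.fill_diagonal(contacts, False)                # a residue is not in contact with itself
    return contacts

# Toy chain: five residues on a line, 3.8 angstroms apart (hypothetical geometry).
coords = [[3.8 * i, 0.0, 0.0] for i in range(5)]
cmap = contact_map(coords)
print(cmap.astype(int))
```

In this toy chain residues one or two positions apart fall under the threshold (3.8 Å and 7.6 Å) while residues three apart (11.4 Å) do not.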
Contact Matrix is Sparse short range: 6-12 AAs apart along primary sequence medium range: 12-24 AAs apart long range: >24 AAs apart #contacts is linear w.r.t. sequence length
Protein Contact Prediction Input: MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTK EVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLAN LESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKK KASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN Output: contact map (figure) With L/12 long-range native contacts, the fold of a protein can be roughly determined [Baker group]
Contact Prediction Methods Evolutionary coupling analysis (unsupervised learning): identify co-evolved residues from multiple sequence alignment; no solved protein structures used at all; high-throughput sequencing makes this method promising; e.g., mutual information, Evfold, PSICOV, plmDCA, GREMLIN Supervised machine learning: input features are sequence (profile) similarity, chemical properties similarity, mutual information; (implicitly) learn information from solved structures; examples: NNcon, SVMcon, CMAPpro, PhyCMAP
Evolutionary Coupling (EC) Analysis Observation: two residues in contact tend to co-evolve, i.e., two co-evolved residues likely to form a contact
Evolutionary Coupling (EC) Analysis (Cont’d) Local statistical methods: examine the correlation between two residues independently of the others. Mutual information (MI): two residues in contact are likely to have large MI, but not all residue pairs with large MI are in contact, because of indirect evolutionary coupling: if A~B and B~C, then likely A~C. Global statistical methods: examine the correlation between two residues conditioned on the others; need a large number of sequences. Maximum-entropy: Evfold; graphical lasso: PSICOV; pseudo-likelihood: plmDCA, GREMLIN
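As an illustration of the local statistic, the following sketch computes the mutual information between two alignment columns from empirical frequencies. It uses no pseudocounts or sequence weighting, which real pipelines usually add; the function name and the toy columns are made up for the example.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """MI (in nats) between two alignment columns given as equal-length strings."""
    n = len(col_a)
    pa = Counter(col_a)               # marginal counts for column A
    pb = Counter(col_b)               # marginal counts for column B
    pab = Counter(zip(col_a, col_b))  # joint counts
    mi = 0.0
    for (a, b), count in pab.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((pa[a] / n) * (pb[b] / n)))
    return mi

# Column B changes whenever column A changes: perfectly co-varying.
mi_coupled = mutual_information("AAGG", "LLVV")
# Column B varies independently of column A.
mi_independent = mutual_information("AAGG", "LVLV")
print(mi_coupled, mi_independent)
```

Because MI looks at one pair at a time, a chain A~B, B~C raises the MI of (A, C) as well, which is the indirect-coupling problem the global methods address.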
Single MSA-based Contact Prediction Enforce sparse precision matrix Why?
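For reference, the standard graphical lasso objective, which is presumably what this slide displays, can be written as below. The symbols are assumptions: $\hat{\Sigma}$ denotes the empirical covariance of the encoded MSA, $\Omega$ the 21L×21L precision matrix, and $\lambda$ the sparsity weight. One natural reading of the "Why?" is that contact matrices are themselves sparse, with the number of contacts growing only linearly in sequence length.

```latex
\max_{\Omega \succ 0} \;\; \log\det\Omega \;-\; \operatorname{tr}\!\left(\hat{\Sigma}\,\Omega\right) \;-\; \lambda \,\lVert \Omega \rVert_{1}
```

The first two terms are the Gaussian log-likelihood (up to constants); the $\ell_1$ term drives most entries of $\Omega$ to zero.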
Issues with Existing Methods Evolutionary coupling (EC) analysis works for proteins with a large number of sequence homologs Focus on improving the statistical methods (e.g., relaxing the Gaussian assumption, consensus of a few EC methods) instead of using extra biological information/insight Use information mostly in a single protein family Physical constraints other than sparsity not used
Our Work: contact prediction using multiple MSAs Jointly predict contacts for related families of similar folds. That is, predict contacts using multiple MSAs. These MSAs share inter-residue interaction network to some degree Integrate evolutionary coupling (EC) analysis with supervised learning EC analysis makes use of residue co-evolution information Supervised learning makes use of sequence (profile) similarity Goal: focus on proteins without many sequence homologs Strategy: increase statistical power by information aggregation
Red: shared; Blue: unique to PF00116; Green: unique to PF13473 Observation: different protein families share similar contact maps
Joint evolutionary coupling (EC) analysis Jointly predict contacts for a set of related protein families Predict contacts for a protein family using information in other related families Enforce contact map consistency among related families Do not lose family-specific information
Joint graphical lasso for joint evolutionary coupling analysis How to enforce contact map consistency?
Residue Pair/Submatrix Grouping In total ≤L(L-1)/2 groups where L is the seq length
Enforce Contact Map Consistency by Group Penalty Group conservation level
Supervised Machine Learning Input features: sequence profile, amino acid chemical properties, mutual information power series, context-specific statistical potential Mutual information power series: – Local info: mutual information matrix (MI) – Partially global info: MI^2, MI^3, …, MI^11 – Can be calculated much faster than PSICOV Random Forests trained by proteins
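A minimal sketch of the power series, assuming the powers are ordinary matrix powers of the L×L mutual information matrix (the slide does not say so explicitly). The function name and the toy matrix are hypothetical.

```python
import numpy as np

def mi_power_series(mi, max_power=11):
    """Return an array of shape (max_power, L, L) holding MI^1, MI^2, ..., MI^max_power."""
    mi = np.asarray(mi, dtype=float)
    powers = [mi]
    for _ in range(2, max_power + 1):
        powers.append(powers[-1] @ mi)   # next matrix power
    return np.stack(powers)

# Toy MI matrix for residues A, B, C: A~B and B~C are coupled, A and C are not directly.
mi = np.array([[0.0, 0.5, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.5, 0.0]])
series = mi_power_series(mi)
print(series[0][0, 2], series[1][0, 2])  # direct vs. two-step signal between A and C
```

In the toy matrix the A-C entry is zero in MI but positive in MI^2, since the second power sums over the intermediate residue B. That is the sense in which higher powers carry partially global information, and only matrix multiplications are needed to obtain them.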
Joint EC Analysis with Supervised Prediction as Prior Objective terms: log-likelihood of K families; sparsity; contact map consistency among families; similarity with supervised prediction This optimization problem can be solved by ADMM to a suboptimal solution
Accuracy on 98 Pfam families Table columns: medium-range (L/10, L/5, L/2) and long-range (L/10, L/5, L/2) Methods compared: CoinDCA, PSICOV, PSICOV_b, plmDCA, plmDCA_h, GREMLIN, GREMLIN_h, Merge_p, Merge_m, Voting
Accuracy vs. # Sequence Homologs (A) Medium-range (B) Long-range X-axis: ln of the number of non-redundant sequence homologs Y-axis: L/10 accuracy
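The L/10 accuracy on the Y-axis is the fraction of native contacts among the top L/10 predicted pairs. A small sketch of that metric follows; the function name, the `min_sep` sequence-separation filter (inclusive lower bound) and the toy matrices are assumptions for illustration.

```python
import numpy as np

def top_accuracy(scores, native, k_div=10, min_sep=6):
    """Fraction of native contacts among the top L/k_div scored pairs with j - i >= min_sep."""
    L = scores.shape[0]
    i, j = np.triu_indices(L, k=min_sep)              # candidate pairs, upper triangle only
    order = np.argsort(-scores[i, j], kind="stable")  # highest score first
    n_top = max(1, L // k_div)
    top = order[:n_top]
    return float(native[i[top], j[top]].mean())

# Toy example with L = 20: the two best-scored pairs are one native and one non-native contact.
L = 20
scores = np.zeros((L, L))
native = np.zeros((L, L), dtype=bool)
scores[0, 19] = 0.9
native[0, 19] = True
scores[2, 10] = 0.8

acc_l10 = top_accuracy(scores, native, k_div=10)  # top 2 pairs
acc_l20 = top_accuracy(scores, native, k_div=20)  # top 1 pair
print(acc_l10, acc_l20)
```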
Accuracy on 123 CASP10 targets Table columns: medium-range (L/10, L/5, L/2) and long-range (L/10, L/5, L/2) Methods compared: CoinDCA, Evfold, PSICOV, plmDCA, GREMLIN, NNcon, CMAPpro
Accuracy vs. # sequence homologs (CASP10) X-axis: ln of # non-redundant sequence homologs Y-axis: L/10 long-range prediction accuracy
Accuracy vs. Contact Conservation Level (A) Medium-range; (B) Long-range X-axis: conservation level; the larger, the more conserved
Today’s Talk Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso – An important subproblem of protein folding Align two MSAs through alignment of two Markov Random Fields (MRFs) – Remote homology detection and fold recognition – Merge two MSAs into a larger one
Homology Detection & Fold Recognition Primary sequence comparison – Similar sequences -> very likely homologous – Sequence alignment method, e.g., BLAST, FASTA – Works only for close homologs Profile-based method – Compare two protein families instead of primary sequences, using evolutionary information in a family – Sequence-profile alignment & profile-profile alignment – Profile can be represented as a matrix (e.g., FFAS) or an HMM (e.g., HHpred, HMMER) – Sometimes works for remote homologs, but not sensitive enough
MSA to Sequence Profile Two popular profile representations: (1) Position-specific scoring matrix (PSSM); (2) Hidden Markov Model (HMM)
Position-Specific Scoring Matrix (PSSM) Taken from
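A PSSM can be sketched as per-column log-odds scores derived from an MSA. The version below is deliberately simplified and rests on stated assumptions: a uniform background distribution, a flat pseudocount, no sequence weighting, and gaps ignored. Tools such as PSI-BLAST use more elaborate estimates. The function name is hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids; gaps are ignored here

def pssm(msa, pseudocount=1.0):
    """Return a list with one dict per column mapping amino acid -> log2-odds score."""
    background = 1.0 / len(ALPHABET)            # uniform background: an assumption
    matrix = []
    for pos in range(len(msa[0])):
        counts = Counter(seq[pos] for seq in msa if seq[pos] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        row = {aa: math.log2(((counts[aa] + pseudocount) / total) / background)
               for aa in ALPHABET}
        matrix.append(row)
    return matrix

msa = ["GRK-YSA", "FLV-LYI", "PTAKFRE", "PTVPGYE"]
profile = pssm(msa)
print(round(profile[0]["P"], 3), round(profile[0]["A"], 3))
```

Residues observed often in a column (here P in the first column) get positive scores, unobserved ones get negative scores. Each column is scored independently, so no pairwise information is captured.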
Hidden Markov Model (HMM)
Our Work: Markov Random Fields (MRF) Representation 1) MRF encodes long-range residue interaction pattern while HMM does not; 2) Long-range interaction pattern encodes global information of a protein, so MRF can deal with proteins of similar folds but divergent sequences
Protein alignment by aligning two MRFs GRK-YSA GRK-YSA FLV-LYI KLV-LYI PTAKFRE PTAKFRS PTVPGYE PTVPGRS MRF1 MRF2 Family 1 Family 2
Scoring function for MRF alignment between MRF1 and MRF2 = local alignment potential + pairwise alignment potential NP-hard due to 1) gaps allowed; 2) pairwise potential
Alternating Direction Method of Multipliers (ADMM) Make a copy of z to y Add a penalty term to obtain an augmented problem
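In generic form, the two steps named on this slide look as follows. This is a textbook-style sketch, not the slide's actual formulas: $f$ stands for the alignment score as a function of the alignment variables $z$, $\tilde{f}$ is the same score with one occurrence of $z$ in each pairwise term replaced by the copy $y$, and $\lambda$ and $\rho$ are the multiplier and penalty weight. All of these symbols are assumptions.

```latex
\begin{aligned}
&\text{copy:} && \max_{z} f(z) \;=\; \max_{z,\,y} \tilde{f}(z, y) \quad \text{s.t. } z = y \\
&\text{augmented problem:} && \max_{z,\,y} \; \tilde{f}(z, y) - \tfrac{\rho}{2}\lVert z - y \rVert_2^2 \quad \text{s.t. } z = y \\
&\text{augmented Lagrangian:} && L_\rho(z, y, \lambda) = \tilde{f}(z, y) + \lambda^{\top}(z - y) - \tfrac{\rho}{2}\lVert z - y \rVert_2^2 \\
&\text{iterations:} && z^{t+1} = \arg\max_{z} L_\rho(z, y^{t}, \lambda^{t}), \quad
y^{t+1} = \arg\max_{y} L_\rho(z^{t+1}, y, \lambda^{t}), \quad
\lambda^{t+1} = \lambda^{t} - \rho\,(z^{t+1} - y^{t+1})
\end{aligned}
```

The penalty term vanishes whenever $z = y$, so the augmented problem has the same optimum as the original one.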
ADMM (Cont’d)
Both subproblems can be solved efficiently by dynamic programming!
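To show why a subproblem becomes tractable once only position-to-position scores remain, here is a generic global-alignment dynamic program (Needleman-Wunsch style) over a score matrix. It is an illustration under assumptions: a linear gap penalty, a made-up score matrix, and a hypothetical function name. The actual subproblems in the talk may use different potentials and gap models.

```python
def align(score, gap=-1.0):
    """Global alignment by dynamic programming.

    score[i][j] is the reward for aligning position i of the first model
    with position j of the second. Returns (best total score, aligned pairs).
    """
    n, m = len(score), len(score[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                           dp[i - 1][j] + gap,                      # gap in second model
                           dp[i][j - 1] + gap)                      # gap in first model
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs

# Hypothetical score matrix: 3 positions in the first model, 4 in the second.
score = [[2, -1, -1, -1],
         [-1, -1, 2, -1],
         [-1, -1, -1, 2]]
best, pairs = align(score)
print(best, pairs)
```

The running time is O(nm), which is what makes each alternating step cheap even though the joint problem with pairwise potentials is NP-hard.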
Superfamily & Fold Recognition Rate Superfamily level detection Fold level detection
Conclusion Joint evolutionary coupling analysis + supervised learning can significantly improve protein contact prediction by using information in multiple MSAs Long-range residue interaction encoded in an MSA is helpful for remote homolog detection
Acknowledgements RaptorX servers at Students: Jianzhu Ma, Zhiyong Wang, Sheng Wang Funding – NIH R01GM – NSF CAREER award and NSF ABI – Alfred P. Sloan Research Fellowship Computational resources – University of Chicago Beagle team – TeraGrid and Open Science Grid
Protein Structure Prediction Input: MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTK EVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLAN LESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKK KASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN Output: 3D structure (figure) 1. One of the most challenging problems in computational biology! 2. Improved due to better algorithms and large databases 3. Knowledge-based methods outperform physics-based methods 4. Big demand: our server processes > 800 jobs/week, > 12k users in 3 yrs
Performance in CASP9 (2010) A blind test for protein structure prediction Server ranking tested on the 50 hardest TBM targets Adapted from
Performance in CASP10 (2012) A blind test for protein structure prediction The only server group among top 10 Adapted from The top 10 performing human/server groups on the hardest TBM targets
My Work Analyze large-scale biological data and build predictive models Protein sequence and structure alignment Homology detection and fold recognition Protein structure prediction Protein function prediction (e.g., interaction and binding site prediction) Biological network construction and analysis Study computational methods that have applications beyond bioinformatics Machine learning (e.g. probabilistic graphical model) Optimization (discrete, combinatorial and continuous)
Homology Detection & Fold Recognition Homology detection & fold recognition – Determine the relationship between two proteins – Given a query, search for all homologs in a database Homology search/fold recognition useful for – Study protein evolutionary relationship – Functional transfer – Homology modeling (i.e., template-based modeling) Two proteins are homologous if they have shared ancestry. Two proteins have the same fold if their 3D structures are similar.
Structure Prediction (Cont’d) Template-based modeling (TBM) – Using solved protein structures as template, e.g., homology modeling and protein threading – Most reliable, but fails when no good templates Template-free modeling (FM) or ab initio folding – Not using solved protein structures as template – Mostly works only on some small proteins Subproblems – Loop modeling – Inter-residue contact prediction
Residue Pair Grouping
Precision Submatrix Grouping Suppose that residue pair (2,4) in Family 1 aligned to pair (3,5) in Family 2 In total ≤L(L-1)/2 groups where L is the seq length
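The grouping can be sketched as follows, assuming each family's residues have already been mapped onto shared alignment columns (for instance through an alignment of the families). The `column_maps` structure, the column numbering and the function name are hypothetical; the example reuses the slide's pairs (2,4) in Family 1 and (3,5) in Family 2.

```python
from collections import defaultdict
from itertools import combinations

def group_pairs(column_maps):
    """Group residue pairs from different families that map to the same pair of shared columns.

    column_maps[family] = {residue_index: shared_column} (a hypothetical input format).
    Returns {(col_i, col_j): [(family, residue_a, residue_b), ...]}.
    """
    groups = defaultdict(list)
    for family, cmap in column_maps.items():
        for a, b in combinations(sorted(cmap), 2):
            key = tuple(sorted((cmap[a], cmap[b])))
            groups[key].append((family, a, b))
    return dict(groups)

column_maps = {
    "Family 1": {2: 0, 4: 1},        # residues 2 and 4 sit in shared columns 0 and 1
    "Family 2": {3: 0, 5: 1, 6: 2},  # residues 3 and 5 sit in the same two columns
}
groups = group_pairs(column_maps)
print(groups[(0, 1)])
```

Each group then collects the 21×21 precision submatrices of its member pairs, and with L shared columns there are at most L(L-1)/2 groups.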
Performance on the 31 Pfam families with only distantly-related auxiliary families Table columns: medium-range (L/10, L/5, L/2) and long-range (L/10, L/5, L/2) Methods compared: CoinFold: our work; PSICOV: single-family method; PSICOV_p: merge multiple families and apply single-family method; PSICOV_v: single-family method for each family and then consensus
Performance on the 13 Pfam families with closely-related auxiliary families Table columns: medium-range (L/10, L/5, L/2) and long-range (L/10, L/5, L/2) Methods compared: CoinFold: our work; PSICOV: single-family method; PSICOV_p: merge multiple families and apply single-family method; PSICOV_v: single-family method for each family and then consensus
Our method vs. PSICOV
Our method vs. GREMLIN L/10 top predicted long-range contacts are evaluated
Performance vs. family size CoinFold: our work PSICOV: single-family method PSICOV_p: merge multiple families and apply single-family method PSICOV_v: single-family method for each family and then consensus
Multiple Sequence Alignment (MSA) of One Protein Family
Top L/10 long-range prediction accuracy on 15 large Pfam families Table columns: PFAM ID, MEFF, CoinFold, PSICOV, PSICOV_p
Running Time X-axis: average protein sequence length Y-axis: time (in seconds)
Performance: Alignment Accuracy
Performance: Homology Detection
Performance: Alignment Accuracy TM-align, Matt and DeepAlign provide three different ground truths
Joint Graphical Lasso Formulation Rewrite the original problem as
Alternating Direction Method of Multipliers (ADMM) Add a penalty term to obtain an augmented problem, which has the same solution but converges faster.
Lagrangian Relaxation
ADMM (Cont’d) For a fixed U, split the relaxation problem into two subproblems and solve them alternately
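The alternating scheme can be illustrated on a much simpler problem than the joint graphical lasso. The sketch below applies scaled-form ADMM to the ordinary lasso, min 0.5·||Ax − b||² + lam·||z||₁ subject to x = z: one subproblem handles the smooth term, the other the sparsity term, and the multiplier (here `u`) is updated after each pass. The swapped-in problem, the parameter values and the function names are assumptions chosen so the answer can be checked by hand.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, n_iter=200):
    """Scaled-form ADMM for min 0.5*||Ax - b||^2 + lam*||z||_1 subject to x = z."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    M = np.linalg.inv(A.T @ A + rho * np.eye(n))  # reused in every x-update
    Atb = A.T @ b
    for _ in range(n_iter):
        x = M @ (Atb + rho * (z - u))             # subproblem 1: smooth quadratic term
        z = soft_threshold(x + u, lam / rho)      # subproblem 2: sparsity term
        u = u + x - z                             # multiplier update
    return z

# With A = I the lasso solution is the soft-thresholded b, which gives a hand check.
A = np.eye(3)
b = np.array([3.0, 0.05, -2.0])
z = admm_lasso(A, b, lam=0.1)
print(np.round(z, 4))
```

Each subproblem has a closed-form solution here; in the talk's setting the corresponding subproblems are the ones the slides say can be solved efficiently.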