Graphical Modeling of Multiple Sequence Alignment
Jinbo Xu
Toyota Technological Institute at Chicago; Computational Institute, The University of Chicago


Two applications of MSA
- Predict the inter-residue interaction network (i.e., protein contact map) from an MSA using joint graphical lasso
  - An important subproblem of protein folding
- Align two MSAs through alignment of two Markov Random Fields (MRFs)
  - Homology detection and fold recognition
  - Merge two MSAs into a larger one

Modeling MSA by Markov Random Fields

Numeric Representation of MSA
- Each column in the MSA is encoded by 21 binary elements
- A sequence in the MSA is represented as an L×21 binary vector (21L entries in total)
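A minimal sketch of this encoding, assuming the 21 elements are the 20 standard amino acids plus a gap symbol (the slide does not spell out the alphabet). The names `ALPHABET`, `encode_sequence` and `encode_msa` are hypothetical.

```python
# Assumption: 21 states = 20 standard amino acids plus '-' for a gap.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def encode_sequence(aligned_seq):
    """Return a flat binary list of length 21 * L for one aligned sequence."""
    vec = []
    for ch in aligned_seq.upper():
        column = [0] * len(ALPHABET)
        # Assumption: unknown symbols (e.g. 'X') are treated like gaps.
        column[INDEX.get(ch, INDEX["-"])] = 1
        vec.extend(column)
    return vec

def encode_msa(msa):
    """Encode every row of an MSA; all rows must share the same length L."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    return [encode_sequence(row) for row in msa]

msa = ["GRK-YSA", "FLV-LYI", "PTAKFRE"]
X = encode_msa(msa)
print(len(X), len(X[0]))  # number of sequences, 21 * L
```

Each row of `X` has exactly one 1 per alignment column, so a row sums to L.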

Gaussian Graphical Model (GGM)

Covariance and Precision Matrix
- The precision matrix has dimension 21L×21L
- Each 21×21 submatrix corresponds to one residue pair
- Larger values indicate stronger interaction
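As a rough illustration of how a 21L×21L precision matrix could be turned into per-pair interaction strengths, the sketch below sums the absolute values of the submatrix for residues i and j. The choice of the absolute-value sum is an assumption (the slide only says larger values indicate stronger interaction), and `pair_score` and `rank_pairs` are hypothetical names. The toy example uses 2 states instead of 21 to stay small.

```python
def pair_score(precision, i, j, states=21):
    """Sum of absolute values in the states x states block for residues i, j."""
    rows = range(i * states, (i + 1) * states)
    cols = range(j * states, (j + 1) * states)
    return sum(abs(precision[r][c]) for r in rows for c in cols)

def rank_pairs(precision, length, states=21):
    """Return (i, j, score) for all pairs i < j, strongest first."""
    scored = [(pair_score(precision, i, j, states), i, j)
              for i in range(length) for j in range(i + 1, length)]
    scored.sort(reverse=True)
    return [(i, j, s) for s, i, j in scored]

# Toy precision matrix: 3 residues, 2 states each (assumption for brevity).
P = [[0.0] * 6 for _ in range(6)]
P[0][4] = P[4][0] = 0.9
P[1][5] = P[5][1] = -0.5
P[0][2] = P[2][0] = 0.1
ranking = rank_pairs(P, length=3, states=2)
print(ranking[0][:2])  # the residue pair with the largest block score
```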

Today’s talk
- Predict the inter-residue interaction network (i.e., protein contact map) from an MSA using joint graphical lasso
  - An important subproblem of protein folding
- Align two MSAs through alignment of two Markov Random Fields (MRFs)
  - Homology detection and fold recognition
  - Merge two MSAs into a larger one

Protein Contact Map (residue interaction network)
- Two residues are in contact if their Cα or Cβ distance is < 8 Å
- Shorter distance means stronger interaction
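The contact definition above can be sketched as follows, assuming one representative atom (Cα or Cβ) per residue with coordinates in angstroms. The function name `contact_map` and the toy coordinates are illustrative only.

```python
import math

def contact_map(coords, threshold=8.0):
    """coords: list of (x, y, z) tuples in angstroms, one per residue.

    Returns a symmetric 0/1 matrix; entry [i][j] is 1 when the two
    residues are closer than `threshold`.
    """
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = 1
    return contacts

# Toy example: four atoms on a straight line (not a real structure).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)
for row in cmap:
    print(row)
```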

Contact Matrix is Sparse
- short range: 6-12 AAs apart along primary sequence
- medium range: AAs apart
- long range: >24 AAs apart
- #contacts is linear w.r.t. sequence length

Protein Contact Prediction
Input: MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTK EVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLAN LESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKK KASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN
Output: predicted contact map
With L/12 long-range native contacts, the fold of a protein can be roughly determined [Baker group]

Contact Prediction Methods
- Evolutionary coupling analysis (unsupervised learning)
  - Identify co-evolved residues from a multiple sequence alignment
  - No solved protein structures used at all
  - High-throughput sequencing makes this method promising
  - e.g., mutual information, Evfold, PSICOV, plmDCA, GREMLIN
- Supervised machine learning
  - Input features: sequence (profile) similarity, chemical properties similarity, mutual information
  - (Implicitly) learns information from solved structures
  - e.g., NNcon, SVMcon, CMAPpro, PhyCMAP

Evolutionary Coupling (EC) Analysis
Observation: two residues in contact tend to co-evolve; conversely, two co-evolved residues are likely to form a contact

Evolutionary Coupling (EC) Analysis (Cont’d)
- Local statistical methods: examine the correlation between two residues independent of the others
  - Mutual information (MI): two residues in contact are likely to have large MI
  - Not all residue pairs with large MI are in contact, due to indirect evolutionary coupling: if A~B and B~C, then likely A~C
- Global statistical methods: examine the correlation between two residues conditioned on the others
  - Need a large number of sequences
  - Maximum-Entropy: Evfold
  - Graphical lasso: PSICOV
  - Pseudo-likelihood: plmDCA, GREMLIN
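To make the local (MI) statistic concrete, here is a minimal sketch that computes the mutual information between two MSA columns from raw counts. It uses no sequence weighting or pseudocounts, which real tools typically add. The helper names and the toy MSA are hypothetical.

```python
import math
from collections import Counter

def msa_column(msa, i):
    """Extract column i of an MSA given as a list of aligned strings."""
    return [row[i] for row in msa]

def mutual_information(col_a, col_b):
    """MI (in nats) between two equally long columns of symbols."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        mi += p_ab * math.log(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

# Toy MSA: columns 0 and 1 co-vary perfectly, column 2 is independent of column 0.
toy_msa = ["AKL", "AKV", "GEL", "GEV"]
mi_01 = mutual_information(msa_column(toy_msa, 0), msa_column(toy_msa, 1))
mi_02 = mutual_information(msa_column(toy_msa, 0), msa_column(toy_msa, 2))
print(mi_01, mi_02)
```

Because MI looks at one pair at a time, a chain A~B, B~C also raises the MI of (A, C), which is the indirect-coupling problem the global methods address.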

Single MSA-based Contact Prediction
- Enforce a sparse precision matrix. Why?

Issues with Existing Methods
- Evolutionary coupling (EC) analysis works only for proteins with a large number of sequence homologs
- Existing work focuses on improving the statistical methods (e.g., relaxing the Gaussian assumption, taking the consensus of a few EC methods) instead of using extra biological information/insight
- Information is used mostly from a single protein family
- Physical constraints other than sparsity are not used

Our Work: contact prediction using multiple MSAs
- Jointly predict contacts for related families of similar folds, that is, predict contacts using multiple MSAs
  - These MSAs share the inter-residue interaction network to some degree
- Integrate evolutionary coupling (EC) analysis with supervised learning
  - EC analysis makes use of residue co-evolution information
  - Supervised learning makes use of sequence (profile) similarity
Goal: focus on proteins without many sequence homologs
Strategy: increase statistical power by information aggregation

Observation: different protein families share similar contact maps
[Figure legend] Red: shared; Blue: unique to PF00116; Green: unique to PF13473

Joint evolutionary coupling (EC) analysis
Jointly predict contacts for a set of related protein families
- Predict contacts for a protein family using information in other related families
- Enforce contact map consistency among related families
- Do not lose family-specific information

Joint graphical lasso for joint evolutionary coupling analysis How to enforce contact map consistency?

Residue Pair/Submatrix Grouping
- In total ≤ L(L-1)/2 groups, where L is the sequence length

Enforce Contact Map Consistency by Group Penalty Group conservation level

Supervised Machine Learning
- Input features: sequence profile, amino acid chemical properties, mutual information power series, context-specific statistical potential
- Mutual information power series:
  - Local info: mutual information matrix (MI)
  - Partially global info: MI^2, MI^3, …, MI^11
  - Can be calculated much faster than PSICOV
- Random Forests trained by proteins
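A small sketch of the power series, reading MI^k as the k-th matrix power of the L×L mutual information matrix (an interpretation of the slide's notation, not something it states explicitly). The function names are hypothetical and the toy matrix is not real data.

```python
def matmul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def mi_power_series(mi, max_power=11):
    """Return [MI^1, MI^2, ..., MI^max_power] as matrix powers."""
    powers = [mi]
    for _ in range(max_power - 1):
        powers.append(matmul(powers[-1], mi))
    return powers

# Toy chain A~B, B~C: residues 0 and 2 have no direct MI signal.
toy_mi = [[0, 1, 0],
          [1, 0, 1],
          [0, 1, 0]]
powers = mi_power_series(toy_mi)
print(powers[0][0][2], powers[1][0][2])  # direct signal vs. two-step signal
```

In the toy chain, the entry for residues 0 and 2 is zero in MI but non-zero in MI^2, which is the sense in which higher powers carry partially global information.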

Joint EC Analysis with Supervised Prediction as Prior
Terms of the objective:
- Log-likelihood of K families
- Sparsity
- Contact map consistency among families
- Similarity with supervised prediction
This optimization problem can be solved by ADMM to a suboptimal solution

Accuracy on 98 Pfam families
Columns: Medium-range (L/10, L/5, L/2); Long-range (L/10, L/5, L/2)
Methods: CoinDCA, PSICOV, PSICOV_b, plmDCA, plmDCA_h, GREMLIN, GREMLIN_h, Merge_p, Merge_m, Voting
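The L/10, L/5 and L/2 columns refer to the accuracy of the top L/k predicted contacts. A minimal sketch of that metric follows, assuming accuracy means the fraction of top-ranked pairs that are native contacts and that long range means more than 24 residues apart. `top_accuracy` and the toy inputs are hypothetical.

```python
def top_accuracy(scored_pairs, native_contacts, seq_len, k=10,
                 min_sep=25, max_sep=None):
    """Fraction of the top seq_len // k predictions that are native contacts.

    scored_pairs: iterable of (i, j, score).
    native_contacts: set of (i, j) tuples with i < j.
    min_sep=25 selects long-range pairs (more than 24 residues apart).
    """
    in_range = [
        (score, i, j) for i, j, score in scored_pairs
        if abs(i - j) >= min_sep and (max_sep is None or abs(i - j) <= max_sep)
    ]
    in_range.sort(reverse=True)
    top = in_range[: max(1, seq_len // k)]
    if not top:
        return 0.0
    hits = sum(1 for _, i, j in top
               if (min(i, j), max(i, j)) in native_contacts)
    return hits / len(top)

predictions = [(0, 28, 0.9), (1, 29, 0.8), (2, 27, 0.7), (0, 26, 0.6), (3, 5, 0.99)]
native = {(0, 28), (2, 27)}
acc = top_accuracy(predictions, native, seq_len=30, k=10)
print(acc)
```

The pair (3, 5) has the highest score but is ignored because it is not long range.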

Accuracy vs. # Sequence Homologs
Panels: (A) Medium-range; (B) Long-range
X-axis: ln of the number of non-redundant sequence homologs
Y-axis: L/10 accuracy

Accuracy on 123 CASP10 targets
Columns: Medium-range (L/10, L/5, L/2); Long-range (L/10, L/5, L/2)
Methods: CoinDCA, Evfold, PSICOV, plmDCA, GREMLIN, NNcon, CMAPpro

Accuracy vs. # sequence homologs (CASP10)
X-axis: ln of # non-redundant sequence homologs
Y-axis: L/10 long-range prediction accuracy

Accuracy vs. Contact Conservation Level
Panels: (A) Medium-range; (B) Long-range
X-axis: conservation level; the larger, the more conserved

Today’s Talk
- Predict the inter-residue interaction network (i.e., protein contact map) from an MSA using joint graphical lasso
  - An important subproblem of protein folding
- Align two MSAs through alignment of two Markov Random Fields (MRFs)
  - Remote homology detection and fold recognition
  - Merge two MSAs into a larger one

Homology Detection & Fold Recognition
- Primary sequence comparison
  - Similar sequences -> very likely homologous
  - Sequence alignment methods, e.g., BLAST, FASTA
  - Works only for close homologs
- Profile-based method
  - Compare two protein families instead of primary sequences, using evolutionary information in a family
  - Sequence-profile alignment & profile-profile alignment
  - A profile can be represented as a matrix (e.g., FFAS) or an HMM (e.g., HHpred, HMMER)
  - Sometimes works for remote homologs, but not sensitive enough

MSA to Sequence Profile Two popular profile representations: (1) Position-specific scoring matrix (PSSM); (2) Hidden Markov Model (HMM)

Position-Specific Scoring Matrix (PSSM) Taken from
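As a rough illustration of how an MSA becomes a PSSM, the sketch below computes per-column log-odds scores. The uniform background frequency and the add-one pseudocount are simplifying assumptions, not what any particular tool does, and `build_pssm` is a hypothetical name.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(msa, pseudocount=1.0):
    """Return one dict per column mapping amino acid -> log2 odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for i in range(len(msa[0])):
        # Gaps and unknown symbols are skipped when counting.
        counts = Counter(row[i] for row in msa if row[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

family = ["GRK-YSA", "GRK-YSA", "FLV-LYI", "KLV-LYI"]
pssm = build_pssm(family)
print(round(pssm[0]["G"], 3), round(pssm[0]["A"], 3))
```

Residues seen often in a column get positive scores and unseen residues get negative ones, so scoring a query sequence against the matrix rewards matches at conserved positions.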

Hidden Markov Model (HMM)

Our Work: Markov Random Fields (MRF) Representation
1) An MRF encodes long-range residue interaction patterns while an HMM does not
2) Long-range interaction patterns encode global information of a protein, so an MRF can deal with proteins of similar folds but divergent sequences

Protein alignment by aligning two MRFs
[Figure: the MSAs of Family 1 and Family 2 (example rows: GRK-YSA, FLV-LYI, KLV-LYI, PTAKFRE, PTAKFRS, PTVPGYE, PTVPGRS) are represented as MRF1 and MRF2, which are then aligned]

Scoring function for MRF alignment
- Two terms between MRF1 and MRF2: local alignment potential and pairwise alignment potential
- Finding the optimal alignment is NP-hard due to 1) gaps allowed and 2) the pairwise potential
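The two-term score can be sketched as below. The exact potentials used in the talk are not given in the transcript, so both are passed in as caller-supplied functions, and the toy potentials (letter identity and shared edges) are purely illustrative assumptions.

```python
from itertools import combinations

def alignment_score(matches, node_potential, edge_potential, weight=1.0):
    """Score an alignment given as a list of matched positions (i, j).

    i indexes a node of MRF1 and j a node of MRF2. The local term scores
    each matched node pair; the pairwise term scores every pair of matches,
    i.e. how well an edge (i, k) in MRF1 maps onto an edge (j, l) in MRF2.
    """
    local = sum(node_potential(i, j) for i, j in matches)
    pairwise = sum(edge_potential(i, k, j, l)
                   for (i, j), (k, l) in combinations(matches, 2))
    return local + weight * pairwise

# Toy potentials (assumptions, not the talk's scoring function).
seq1, seq2 = "GRKY", "GKY"
edges1, edges2 = {(0, 3)}, {(0, 2)}

def node(i, j):
    return 1.0 if seq1[i] == seq2[j] else 0.0

def edge(i, k, j, l):
    return 1.0 if (i, k) in edges1 and (j, l) in edges2 else 0.0

matches = [(0, 0), (2, 1), (3, 2)]  # position 1 of seq1 is left unaligned
score = alignment_score(matches, node, edge)
print(score)
```

The pairwise term couples every match to every other match, which is why the score cannot be optimized column by column the way a profile alignment can.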

Alternating Direction Method of Multipliers (ADMM)
- Make a copy of z, called y
- Add a penalty term to obtain an augmented problem
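The copy-and-penalize idea can be seen on a much simpler problem than MRF alignment. The sketch below applies scaled-form ADMM to the scalar problem of minimizing 0.5*(z - a)^2 + lam*|y| subject to z = y. This toy objective is an assumption chosen for illustration; it is not the alignment objective from the talk.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |y|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def admm_scalar_lasso(a, lam, rho=1.0, iterations=200):
    """Minimize 0.5*(z - a)**2 + lam*abs(y) subject to z == y."""
    z = y = u = 0.0  # u is the scaled dual variable for the constraint z == y
    for _ in range(iterations):
        # z-update: quadratic term plus the penalty (rho/2)*(z - y + u)**2
        z = (a + rho * (y - u)) / (1.0 + rho)
        # y-update: absolute-value term plus the same penalty
        y = soft_threshold(z + u, lam / rho)
        # dual update: accumulate the remaining disagreement between z and y
        u += z - y
    return y

result = admm_scalar_lasso(3.0, 1.0)
print(result)  # approaches the closed-form answer soft_threshold(3.0, 1.0)
```

Each update touches only one of the two copies, so each subproblem stays easy even though the original problem couples both terms.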

ADMM (Cont’d)

Both subproblems can be solved efficiently by dynamic programming!

Superfamily & Fold Recognition Rate
Panels: superfamily-level detection; fold-level detection

Conclusion
- Joint evolutionary coupling analysis + supervised learning can significantly improve protein contact prediction by using information in multiple MSAs
- Long-range residue interaction encoded in an MSA is helpful for remote homolog detection

Acknowledgements
- RaptorX servers at
- Students: Jianzhu Ma, Zhiyong Wang, Sheng Wang
- Funding
  - NIH R01GM
  - NSF CAREER award and NSF ABI
  - Alfred P. Sloan Research Fellowship
- Computational resources
  - University of Chicago Beagle team
  - TeraGrid and Open Science Grid

Protein Structure Prediction
Input: MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTK EVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLAN LESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKK KASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN
Output: predicted 3D structure
1. One of the most challenging problems in computational biology!
2. Improved due to better algorithms and large databases
3. Knowledge-based methods outperform physics-based methods
4. Big demand: our server processes > 800 jobs/week, >12k users in 3 yrs

Performance in CASP9 (2010)
- A blind test for protein structure prediction
- Server ranking tested on the 50 hardest TBM targets
Adapted from

Performance in CASP10 (2012)
- A blind test for protein structure prediction
- The top 10 performing human/server groups on the hardest TBM targets
- The only server group among the top 10
Adapted from

My Work
- Analyze large-scale biological data and build predictive models
  - Protein sequence and structure alignment
  - Homology detection and fold recognition
  - Protein structure prediction
  - Protein function prediction (e.g., interaction and binding site prediction)
  - Biological network construction and analysis
- Study computational methods that have applications beyond bioinformatics
  - Machine learning (e.g., probabilistic graphical models)
  - Optimization (discrete, combinatorial and continuous)

Homology Detection & Fold Recognition
- Homology detection & fold recognition
  - Determine the relationship between two proteins
  - Given a query, search for all homologs in a database
- Homology search/fold recognition is useful for
  - Studying protein evolutionary relationships
  - Functional transfer
  - Homology modeling (i.e., template-based modeling)
Two proteins are homologous if they have shared ancestry. Two proteins have the same fold if their 3D structures are similar.

Structure Prediction (Cont’d)
- Template-based modeling (TBM)
  - Uses solved protein structures as templates, e.g., homology modeling and protein threading
  - Most reliable, but fails when there are no good templates
- Template-free modeling (FM) or ab initio folding
  - Does not use solved protein structures as templates
  - Mostly works only on some small proteins
- Subproblems
  - Loop modeling
  - Inter-residue contact prediction

Residue Pair Grouping

Precision Submatrix Grouping
- Suppose that residue pair (2,4) in Family 1 is aligned to pair (3,5) in Family 2; the two corresponding precision submatrices then fall into the same group
- In total ≤ L(L-1)/2 groups, where L is the sequence length
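A minimal sketch of the grouping step, assuming the alignment between the two families is available as a mapping from Family 1 residue indices to Family 2 residue indices. That data layout, and the name `group_residue_pairs`, are hypothetical. Indices are 1-based to match the slide's example.

```python
def group_residue_pairs(length_1, col_map):
    """Group residue pairs of Family 1 with their aligned pairs in Family 2.

    col_map maps a residue index in Family 1 to its aligned index in
    Family 2; residues missing from col_map are unaligned. One group is
    created per Family 1 pair, so there are length_1 * (length_1 - 1) / 2
    groups in this two-family sketch.
    """
    groups = []
    for i in range(1, length_1 + 1):
        for j in range(i + 1, length_1 + 1):
            group = {"family1": (i, j)}
            if i in col_map and j in col_map:
                group["family2"] = (col_map[i], col_map[j])
            groups.append(group)
    return groups

# Residues 1, 2 and 4 of Family 1 align to residues 2, 3 and 5 of Family 2.
col_map = {1: 2, 2: 3, 4: 5}
groups = group_residue_pairs(4, col_map)
by_pair = {g["family1"]: g for g in groups}
print(by_pair[(2, 4)])
```

Pairs that involve an unaligned residue keep a group of their own, which is one way family-specific contacts can survive the consistency penalty.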

Performance on the 31 Pfam families with only distantly-related auxiliary families
Columns: Medium-range (L/10, L/5, L/2); Long-range (L/10, L/5, L/2)
Methods: CoinFold, PSICOV, PSICOV_p, PSICOV_v
- CoinFold: our work
- PSICOV: single-family method
- PSICOV_p: merge multiple families and apply single-family method
- PSICOV_v: single-family method for each family and then consensus

Performance on the 13 Pfam families with closely-related auxiliary families
Columns: Medium-range (L/10, L/5, L/2); Long-range (L/10, L/5, L/2)
Methods: CoinFold, PSICOV, PSICOV_p, PSICOV_v
- CoinFold: our work
- PSICOV: single-family method
- PSICOV_p: merge multiple families and apply single-family method
- PSICOV_v: single-family method for each family and then consensus

Our method vs. PSICOV

Our method vs. GREMLIN
The top L/10 predicted long-range contacts are evaluated

Performance vs. family size
- CoinFold: our work
- PSICOV: single-family method
- PSICOV_p: merge multiple families and apply single-family method
- PSICOV_v: single-family method for each family and then consensus

Multiple Sequence Alignment (MSA) of One Protein Family

Top L/10 long-range prediction accuracy on 15 large Pfam families
Columns: PFAM ID, MEFF, CoinFold, PSICOV, PSICOV_p

Running Time
Axes: average protein sequence length; time (in seconds)

Performance: Alignment Accuracy

Performance: Homology Detection

Performance: Alignment Accuracy
TM-align, Matt and DeepAlign provide three different ground truths

Joint Graphical Lasso Formulation Rewrite the original problem as

Alternating Direction Method of Multipliers (ADMM)
Add a penalty term to obtain an augmented problem, which has the same solution but converges faster.

Lagrangian Relaxation

ADMM (Cont’d)
For a fixed U, split the relaxation problem into two subproblems and solve them alternately