Direct-Coupling Analysis (DCA) and Its Applications in Protein Structure and Protein-Protein Interaction Prediction Wang Yang 2014.1.3.



Outline
1. Molecular co-evolution phenomenon
2. Applications of co-evolution in protein structure prediction and PPI prediction
3. Co-evolution measurement:
   − Local model: Mutual Information (MI) measure of coupling
   − Global model: Direct Coupling Analysis (DCA)
4. Principle of Direct Coupling Analysis (DCA)
5. Summary

Molecular Co-evolution
What is molecular co-evolution? Two (or more) genes, proteins, or residues:
1) exert selective pressures on each other
2) evolve in response to each other
Molecular co-evolution can be due to specific co-adaptation between the two co-evolving elements, where changes in one of them are compensated by changes in the other, or to a less specific external force affecting the evolutionary rates of both elements to a similar degree.
Co-evolutionary signatures between proteins serve as markers of physical interactions and/or functional relationships. For this reason, computational methods emerged for studying co-evolution at the protein or residue level so as to predict features such as protein-protein interactions, residue contacts within protein structures, and protein functional sites.

Native contact by co-evolution analysis
Co-evolution information for protein structure prediction.
Marks, D. S.; Colwell, L. J.; Sheridan, R.; Hopf, T. A.; Pagnani, A.; Zecchina, R.; Sander, C. PLoS One 2011, 6, e28766.

Use co-evolution to predict protein 3D structure
De Juan, D.; Pazos, F.; Valencia, A. Nat Rev Genet 2013, 14,

Groups of co-evolving residues are implicated in functional specificity and structure–function coordination
Specificity-determining positions (SDPs) are groups of positions that coordinately mutate in the context of subfamily divergence.
De Juan, D.; Pazos, F.; Valencia, A. Nat Rev Genet 2013, 14,

Co-evolution measurement
Local statistical model: calculates the correlation of each residue pair (i, j) in the multiple sequence alignment independently. Measure: Mutual Information (MI).
Global statistical model: the coupling of the pair i and j depends on the rest of the alignment. The aim is to compute a set of direct residue couplings that best explains all pair correlations observed in the multiple sequence alignment. Measure: Direct-coupling Information (DI).
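The formulas for the two measures are not present in this transcript. As a hedged reconstruction, the definitions below follow the form commonly given in the DCA literature cited in this talk (Weigt et al. 2009; Morcos et al. 2011). The symbols are assumed notation, not taken from the slide: f_i(A) is the frequency of amino acid A in column i of the MSA, f_ij(A,B) is the joint frequency of A in column i and B in column j, e_ij(A,B) is the direct coupling inferred by the global model, and the auxiliary fields \tilde h are chosen so that the two-site model reproduces the single-site frequencies.

```latex
\mathrm{MI}_{ij} = \sum_{A,B} f_{ij}(A,B)\,
    \ln \frac{f_{ij}(A,B)}{f_i(A)\, f_j(B)}

\mathrm{DI}_{ij} = \sum_{A,B} P^{\mathrm{dir}}_{ij}(A,B)\,
    \ln \frac{P^{\mathrm{dir}}_{ij}(A,B)}{f_i(A)\, f_j(B)},
\qquad
P^{\mathrm{dir}}_{ij}(A,B) = \frac{1}{Z_{ij}}
    \exp\!\big\{ e_{ij}(A,B) + \tilde h_i(A) + \tilde h_j(B) \big\}

\text{with } \sum_{B} P^{\mathrm{dir}}_{ij}(A,B) = f_i(A),
\qquad \sum_{A} P^{\mathrm{dir}}_{ij}(A,B) = f_j(B)
```

The two expressions have the same shape. The only difference is that DI replaces the observed pair frequencies f_ij with a two-site distribution built from the direct coupling alone, so correlations mediated by other positions do not contribute.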

Shortcomings of the local statistical model
Correlation in amino acid substitution may arise from direct as well as indirect interactions. Local covariance methods are unable to distinguish between direct and indirect correlation. However:
1. All direct interactions are contained in the local correlations.
2. All detected correlations in substitutions are generated by the set of direct interactions.
[Figure: residues A, B, and C, with direct and indirect interactions labeled]
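The A, B, C picture can be made concrete with a small simulation. This is a toy sketch under stated assumptions, not the analysis from the cited papers: three binary "columns" stand in for alignment positions, B copies A with a 10% flip rate, and C copies B with a 10% flip rate, so A and C never interact directly. The function names, the flip rate, and the sample size are all hypothetical choices made for illustration.

```python
import numpy as np


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            if p_ab > 0:
                p_a = np.mean(x == a)
                p_b = np.mean(y == b)
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi


def simulate_chain(n_seq=20000, flip=0.1, seed=0):
    """Sample columns A -> B -> C; only neighbours interact directly."""
    rng = np.random.default_rng(seed)
    a = rng.integers(0, 2, size=n_seq)
    b = np.where(rng.random(n_seq) < flip, 1 - a, a)  # B depends on A only
    c = np.where(rng.random(n_seq) < flip, 1 - b, b)  # C depends on B only
    return a, b, c


a, b, c = simulate_chain()
mi_ab = mutual_information(a, b)
mi_bc = mutual_information(b, c)
mi_ac = mutual_information(a, c)
print(f"MI(A,B)={mi_ab:.3f}  MI(B,C)={mi_bc:.3f}  MI(A,C)={mi_ac:.3f}")
```

MI(A,C) comes out clearly above zero even though A and C share no direct interaction, which is the indirect correlation a local measure cannot separate from a direct one.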

Direct information vs. mutual information
Intradomain contact prediction using DI and MI pairs.

Direct information vs. mutual information
Intradomain contact (<=8 Å) prediction using DI and MI pairs. Interdomain contact prediction using DI and MI pairs.
Morcos, F.; Pagnani, A.; Lunt, B.; Bertolino, A.; Marks, D. S.; Sander, C.; Zecchina, R.; Onuchic, J. N.; Hwa, T.; Weigt, M. Proc Natl Acad Sci U S A 2011, 108, E
Weigt, M.; White, R. A.; Szurmant, H.; Hoch, J. A.; Hwa, T. Proc Natl Acad Sci U S A 2009, 106,

Principle of DCA
The aim is to find a minimal set of pair interactions that, through transitivity, produces all the observed pair correlations. More precisely, we seek a general model, the full joint probability distribution P(A_1…A_L) for a particular amino acid sequence A_1…A_L to be a member of the family under consideration, such that the marginal probabilities P_ij(A_i, A_j) for pair occurrences are consistent with the pair frequencies observed in the MSA.
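The consistency condition is not written out in this transcript. A hedged reconstruction in the notation of the cited DCA papers, where f_i and f_ij (assumed symbols) denote the single-site and pair frequency counts of the MSA, is:

```latex
P_{ij}(A_i, A_j) \;\equiv\; \sum_{\{A_k \,\mid\, k \neq i,j\}} P(A_1, \dots, A_L)
    \;=\; f_{ij}(A_i, A_j)

P_{i}(A_i) \;\equiv\; \sum_{\{A_k \,\mid\, k \neq i\}} P(A_1, \dots, A_L)
    \;=\; f_{i}(A_i)
```

In words: summing the model over every position except i and j must give back the frequency with which amino acids A_i and A_j co-occur in columns i and j of the alignment.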

Maximum-entropy Modeling
Information is the reduction of uncertainty. When you have only limited information, the best and safest guess is to model all that is known and assume nothing about what is unknown:
− satisfy a set of constraints that must hold
− among the distributions that satisfy them, choose the most “uniform” one, i.e. the one with maximum entropy S
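The entropy and the constraints are not written out in this transcript. As a hedged sketch in the usual notation (f_i and f_ij are assumed symbols for the single-site and pair frequencies of the MSA), the optimization problem for the sequence model reads:

```latex
S[P] = -\sum_{A_1, \dots, A_L} P(A_1, \dots, A_L)\, \ln P(A_1, \dots, A_L)

\max_{P}\; S[P]
\quad \text{subject to} \quad
\sum_{A_1, \dots, A_L} P(A_1, \dots, A_L) = 1, \qquad
P_i(A_i) = f_i(A_i), \qquad
P_{ij}(A_i, A_j) = f_{ij}(A_i, A_j)
```

Without the frequency constraints the maximizer would be the uniform distribution over sequences. Each constraint moves the solution away from uniform only as far as the data require.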

Why can maximum-entropy modeling find the direct interactions (couplings)?
Strictly speaking, we are not finding the direct couplings themselves. What we observe in the MSA (mutual information) is redundant. We are finding the minimal set of couplings from which all observed correlations can be deduced, that is, a model that reproduces all our observations while keeping every pairwise coupling as low as possible. Indirect couplings are thereby removed. No additional assumption (information) is added to the system, so our guess carries the lowest risk.
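How indirect couplings drop out can be seen in a toy case. The sketch below makes two assumptions that go beyond the slide: it uses three ±1 variables in a chain A, B, C instead of 21-state alignment columns, and it uses the mean-field estimate, in which couplings are taken as the negative inverse of the correlation matrix. The neighbour correlation t = 0.8 is an arbitrary illustrative value. For such a chain the correlation between the two ends is exactly t squared, so the matrix can be written down without sampling.

```python
import numpy as np

t = 0.8  # assumed correlation between directly interacting neighbours

# Exact correlation matrix of a chain A - B - C of +/-1 variables:
# neighbours correlate with strength t, the two ends with t**2,
# although A and C have no direct interaction.
C = np.array([
    [1.0,  t,    t**2],
    [t,    1.0,  t   ],
    [t**2, t,    1.0 ],
])

# Mean-field estimate of the couplings: negative inverse correlation matrix.
J = -np.linalg.inv(C)
np.fill_diagonal(J, 0.0)  # diagonal entries are not pair couplings

print("observed correlation A-C:", C[0, 2])
print("inferred coupling    A-B:", J[0, 1])
print("inferred coupling    A-C:", J[0, 2])
```

The A-C correlation is large, yet the inferred A-C coupling vanishes up to rounding error, while the A-B and B-C couplings stay positive. The two neighbour couplings alone are enough to reproduce every observed correlation.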

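Maximization of the entropy
Our goal is to maximize S while satisfying all the constraints, which is done with Lagrange multipliers. The resulting distribution contains the partition function Z, which cannot be computed exactly for realistic sequence lengths, so approximations are used:
− MCMC sampling
− Message passing
− Mean field approximation
− …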
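The result of the Lagrange-multiplier calculation is not written out in this transcript. In the form given in the cited DCA papers, with assumed symbols e_ij for the pair couplings and h_i for the single-site fields (the Lagrange multipliers of the pair and single-site constraints), the maximum-entropy distribution is:

```latex
P(A_1, \dots, A_L) = \frac{1}{Z}
    \exp\!\Big\{ \sum_{i<j} e_{ij}(A_i, A_j) + \sum_{i} h_i(A_i) \Big\}

Z = \sum_{A_1, \dots, A_L}
    \exp\!\Big\{ \sum_{i<j} e_{ij}(A_i, A_j) + \sum_{i} h_i(A_i) \Big\}
```

Z sums over all q^L sequences for an alphabet of q symbols and length L, which is why it has to be approximated rather than evaluated directly.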

Example 1: Use DI to predict the protein-protein binding interface
Low MI implies low DI, but high MI does not necessarily imply high DI.

Example 1: Use DI to predict the protein-protein binding interface
High DI pairs are physical interactions. Direct Information is inversely correlated with the residue distance of pairs in the Spo0B/Spo0F cocrystal structure.
Weigt, M.; White, R. A.; Szurmant, H.; Hoch, J. A.; Hwa, T. Proc Natl Acad Sci U S A 2009, 106,

Example 2: Use DI to predict protein 3D structure without using template information
Top-ranked predicted structures can make correct contacts in the absence of constraints and avoid incorrect contacts in spite of false-positive constraints.
Marks, D. S.; Colwell, L. J.; Sheridan, R.; Hopf, T. A.; Pagnani, A.; Zecchina, R.; Sander, C. PLoS One 2011, 6, e28766.

How many distance constraints are needed for fold prediction?

When would it have been possible to fold from sequence?

Summary
1. Until recently, co-evolution information had not been used effectively. Local statistical models are not good enough.
2. DCA is a powerful global statistical model for finding direct interactions.
3. However, high-scoring pairs are not always spatial contacts. Statistical background noise (e.g. low statistical resolution in the empirical correlations due to an insufficient number of proteins in the family, or global correlations from phylogenetic bias in the frequency counts) can obscure the signal, and residues coupled by functional constraints, such as those imposed by protein-protein or protein-ligand interactions, may not be spatially close.
4. Even so, DCA can be used to improve current structure prediction and refinement and to identify potential binding proteins.
Thank you very much!