Direct-Coupling Analysis (DCA) and Its Applications in Protein Structure and Protein-Protein Interaction Prediction
Wang Yang, 2014.1.3
Outline
1. The molecular co-evolution phenomenon
2. Applications of co-evolution in protein structure prediction and PPI prediction
3. Co-evolution measurement:
   − Local model: mutual information (MI) as a measure of coupling
   − Global model: direct-coupling analysis (DCA)
4. Principle of direct-coupling analysis (DCA)
5. Summary
Molecular Co-evolution
What is molecular co-evolution? Two (or more) genes, proteins, or residues:
1. exert selective pressures on each other
2. evolve in response to each other
Molecular co-evolution can arise from specific co-adaptation between the two co-evolving elements, where changes in one are compensated by changes in the other, or from a less specific external force that affects the evolutionary rates of both elements to a similar degree.
Co-evolutionary signatures between proteins serve as markers of physical interactions and/or functional relationships. For this reason, computational methods have emerged for studying co-evolution at the protein or residue level in order to predict features such as protein-protein interactions, residue contacts within protein structures, and protein functional sites.
Co-evolution information for protein structure prediction
[Figure: native contacts identified by co-evolution analysis]
Marks, D. S.; Colwell, L. J.; Sheridan, R.; Hopf, T. A.; Pagnani, A.; Zecchina, R.; Sander, C. PLoS One 2011, 6, e28766.
Using co-evolution to predict protein 3D structure
De Juan, D.; Pazos, F.; Valencia, A. Nat Rev Genet 2013, 14, 249-61.
Groups of co-evolving residues are implicated in functional specificity and structure-function coordination
Specificity-determining positions (SDPs) are groups of positions that mutate in a coordinated way in the context of subfamily divergence.
De Juan, D.; Pazos, F.; Valencia, A. Nat Rev Genet 2013, 14, 249-61.
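Co-evolution measurement
Local statistical model: the correlation of each residue pair (i, j) in the multiple sequence alignment (MSA) is calculated independently of all other positions.
Mutual information (MI), with $f_i(A)$ the frequency of amino acid $A$ in column $i$ and $f_{ij}(A,B)$ the frequency of the pair $(A,B)$ in columns $i$ and $j$:
$$MI_{ij} = \sum_{A,B} f_{ij}(A,B) \, \ln \frac{f_{ij}(A,B)}{f_i(A)\, f_j(B)}$$
Global statistical model: the coupling of the pair i and j depends on the rest of the alignment. The aim is to compute the set of direct residue couplings that best explains all pair correlations observed in the MSA.
Direct information (DI), where $P^{dir}_{ij}$ is the pair distribution of a two-site model that keeps only the direct coupling between $i$ and $j$ while reproducing the single-site frequencies $f_i$ and $f_j$:
$$DI_{ij} = \sum_{A,B} P^{dir}_{ij}(A,B) \, \ln \frac{P^{dir}_{ij}(A,B)}{f_i(A)\, f_j(B)}$$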
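As a minimal sketch of the local model, the MI of two alignment columns can be computed from frequency counts alone. The function name `mutual_information` and the four-sequence `toy_msa` are hypothetical illustrations, not taken from the slides, and no pseudocounts or sequence reweighting are applied.

```python
import math

def mutual_information(msa, i, j):
    """MI between columns i and j of an alignment given as equal-length strings."""
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    n = len(msa)
    mi = 0.0
    for a in set(col_i):
        f_a = col_i.count(a) / n
        for b in set(col_j):
            f_b = col_j.count(b) / n
            # Joint frequency of observing a in column i together with b in column j
            f_ab = sum(1 for x, y in zip(col_i, col_j) if x == a and y == b) / n
            if f_ab > 0:
                mi += f_ab * math.log(f_ab / (f_a * f_b))
    return mi

# Hypothetical toy alignment: columns 0 and 1 always change together,
# column 2 varies independently of column 0.
toy_msa = ["AKL", "AKV", "GEL", "GEV"]
mi_coupled = mutual_information(toy_msa, 0, 1)
mi_independent = mutual_information(toy_msa, 0, 2)
```

In this toy case the perfectly covarying pair scores ln 2 and the independent pair scores 0. Because each pair is scored in isolation, the measure has no way of telling whether a high value comes from a direct or an indirect effect.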
Shortcomings of the local statistical model
Correlation in amino acid substitutions may arise from direct as well as indirect interactions. Local covariance methods are unable to distinguish between direct and indirect correlations.
However:
1. All direct interactions are contained in the local correlations.
2. All detected correlations in substitutions are generated by the set of direct interactions.
[Diagram: three residues A, B, and C, with direct and indirect interactions marked]
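The A, B, C picture can be made concrete with a small exact calculation. The sketch below assumes a hypothetical three-site chain of two-state variables in which only A-B and B-C are coupled (strength `J`, chosen arbitrarily). It shows that A and C still have non-zero MI, while their coupling vanishes once B is held fixed.

```python
import itertools
import math

J = 1.0            # hypothetical coupling strength for A-B and B-C; A-C has none
spins = (-1, 1)    # two-state stand-in for amino acid identities

# Exact joint distribution of the chain A - B - C
weights = {(a, b, c): math.exp(J * (a * b + b * c))
           for a, b, c in itertools.product(spins, repeat=3)}
Z = sum(weights.values())
P = {state: w / Z for state, w in weights.items()}

def marginal(keep):
    """Marginal distribution over the site indices listed in `keep`."""
    out = {}
    for state, p in P.items():
        key = tuple(state[k] for k in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def mutual_information(i, j):
    p_ij, p_i, p_j = marginal((i, j)), marginal((i,)), marginal((j,))
    return sum(p * math.log(p / (p_i[(x,)] * p_j[(y,)]))
               for (x, y), p in p_ij.items() if p > 0)

def conditional_coupling_ac(b):
    """Log-odds coupling between A and C with B held fixed at value b."""
    return 0.25 * math.log(P[(1, b, 1)] * P[(-1, b, -1)]
                           / (P[(1, b, -1)] * P[(-1, b, 1)]))

mi_ab = mutual_information(0, 1)       # direct pair
mi_ac = mutual_information(0, 2)       # indirect pair, still correlated
direct_ac = conditional_coupling_ac(1) # no direct coupling between A and C
```

The indirect pair A-C has a clearly positive MI that is smaller than that of the direct pair A-B, which is the situation a local measure cannot resolve.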
Direct information vs. mutual information
[Figure: intradomain contact (<= 8 Å) prediction using DI and MI pairs]
[Figure: interdomain contact prediction using DI and MI pairs]
Morcos, F.; Pagnani, A.; Lunt, B.; Bertolino, A.; Marks, D. S.; Sander, C.; Zecchina, R.; Onuchic, J. N.; Hwa, T.; Weigt, M. Proc Natl Acad Sci U S A 2011, 108, E1293-301.
Weigt, M.; White, R. A.; Szurmant, H.; Hoch, J. A.; Hwa, T. Proc Natl Acad Sci U S A 2009, 106, 67-72.
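A comparison of this kind is usually scored by ranking residue pairs and counting how many of the top-ranked pairs are true contacts. The following is a minimal sketch using the 8 Å cutoff from the slide. The function name `top_n_precision` and all scores and distances are made-up illustrative values, not results from the cited papers.

```python
def top_n_precision(scores, distances, n, cutoff=8.0):
    """Fraction of the n highest-scoring pairs whose distance is <= cutoff (in Å)."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:n]
    hits = sum(1 for pair in ranked if distances[pair] <= cutoff)
    return hits / len(ranked)

# Hypothetical DI-like scores and residue-residue distances, keyed by (i, j)
toy_scores = {(1, 9): 0.42, (3, 20): 0.31, (5, 14): 0.08, (2, 30): 0.05}
toy_distances = {(1, 9): 5.1, (3, 20): 7.8, (5, 14): 15.2, (2, 30): 6.0}

precision_top2 = top_n_precision(toy_scores, toy_distances, n=2)
precision_top3 = top_n_precision(toy_scores, toy_distances, n=3)
```

Running the same function on MI-ranked and DI-ranked pairs for increasing `n` would give curves that can be compared directly.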
Principle of DCA
The aim is to find a minimal set of pair interactions that, through transitivity, produces all the observed pair correlations.
More precisely, we seek a general model, the full joint probability distribution $P(A_1, \dots, A_L)$ for a particular amino acid sequence $A_1 \dots A_L$ to be a member of the family under consideration, such that its marginal probabilities $P_{ij}(A_i, A_j)$ for pair occurrences are consistent with the observations in the MSA:
$$P_{ij}(A_i, A_j) = f_{ij}(A_i, A_j)$$
where $f_{ij}(A_i, A_j)$ is the pair frequency observed in the MSA and
$$P_{ij}(A_i, A_j) = \sum_{\{A_k \,:\, k \neq i, j\}} P(A_1, \dots, A_L)$$
Maximum-entropy modeling
Information is the reduction of uncertainty. When only limited information is available, the best and safest guess is to model everything that is known and to assume nothing about what is unknown:
− satisfy the set of constraints that must hold
− among the distributions that do, choose the most "uniform" one, that is, the one with maximum entropy
Constraints:
Maximum S:
For our case:
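The formulas for these three items did not survive in the transcript. As a sketch of the standard maximum-entropy setup they most likely referred to, the generic symbols $x$, $f_k$, and $F_k$ below are placeholders introduced here and are not taken from the slide.

```latex
\begin{align}
\text{Constraints:}\quad
  & \sum_{x} P(x) = 1, \qquad
    \sum_{x} P(x)\, f_k(x) = F_k \quad (k = 1, \dots, K) \\
\text{Maximum } S\text{:}\quad
  & S[P] = -\sum_{x} P(x) \ln P(x) \\
\text{For our case:}\quad
  & x = (A_1, \dots, A_L), \qquad
    \sum_{\{A_k \,:\, k \neq i, j\}} P(A_1, \dots, A_L) = f_{ij}(A_i, A_j)
    \quad \text{for all pairs } i, j
\end{align}
```

Here each constraint function $f_k$ picks out one amino acid pair at one pair of positions, so the constrained averages $F_k$ are simply the pair frequencies $f_{ij}(A_i, A_j)$ counted in the MSA.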
Why can maximum-entropy modeling find the direct interactions (couplings)?
Strictly speaking, we are not finding the direct couplings themselves. What we observe in the MSA (mutual information) is redundant. We look for the minimal set of couplings from which all observed correlations can be deduced: a model that reproduces all our observations while keeping every pairwise coupling as small as possible. Indirect couplings are thereby removed. No additional assumption (information) is added to the system, so this guess carries the lowest risk.
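Maximization of the entropy
Our goal is to maximize S while satisfying all the constraints. Introducing Lagrange multipliers $e_{ij}(A_i, A_j)$ (pair couplings) and $h_i(A_i)$ (single-site fields) gives
$$P(A_1, \dots, A_L) = \frac{1}{Z} \exp\Big\{ \sum_{i<j} e_{ij}(A_i, A_j) + \sum_{i} h_i(A_i) \Big\}$$
where Z is the partition function:
$$Z = \sum_{A_1, \dots, A_L} \exp\Big\{ \sum_{i<j} e_{ij}(A_i, A_j) + \sum_{i} h_i(A_i) \Big\}$$
The sum in Z runs over all possible sequences and cannot be evaluated exactly for realistic lengths, so the parameters are estimated approximately:
− MCMC sampling
− Message passing
− Mean-field approximation
− ...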
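Of the approximations listed, the mean-field one is the simplest to sketch: the couplings are estimated as the negative inverse of the connected correlation matrix. The code below is a simplified illustration under stated assumptions, not the published implementation. It applies no sequence reweighting, the names `mean_field_couplings` and `coupling_strength` are hypothetical, the `pseudocount` value is arbitrary, and the Frobenius norm of each coupling block is used as a stand-in score rather than the DI measure.

```python
import numpy as np

def mean_field_couplings(msa, alphabet, pseudocount=0.5):
    """Naive mean-field estimate of pair couplings e_ij(A, B) from an alignment.

    `pseudocount` is a mixing weight in (0, 1) with a uniform distribution;
    it keeps the correlation matrix invertible.
    """
    q = len(alphabet)
    index = {letter: k for k, letter in enumerate(alphabet)}
    X = np.array([[index[ch] for ch in seq] for seq in msa])  # shape (M, L)
    M, L = X.shape
    onehot = np.eye(q)[X]                                     # shape (M, L, q)

    # Single-site and pair frequencies, mixed with the uniform pseudocount
    f_i = (1 - pseudocount) * onehot.mean(axis=0) + pseudocount / q
    f_ij = ((1 - pseudocount) * np.einsum("mia,mjb->iajb", onehot, onehot) / M
            + pseudocount / q**2)
    # Same-site blocks are diagonal: f_ii(A, B) = f_i(A) if A == B, else 0
    for i in range(L):
        f_ij[i, :, i, :] = np.diag(f_i[i])

    # Connected correlations; the last state of each site is dropped (gauge choice)
    C = f_ij - np.einsum("ia,jb->iajb", f_i, f_i)
    C = C[:, :q - 1, :, :q - 1].reshape(L * (q - 1), L * (q - 1))

    # Mean-field inversion: e_ij(A, B) = -(C^{-1})_ij(A, B)
    e = -np.linalg.inv(C)
    return e.reshape(L, q - 1, L, q - 1)

def coupling_strength(e, i, j):
    """Frobenius norm of the coupling block between sites i and j (not DI)."""
    return float(np.linalg.norm(e[i, :, j, :]))

# Hypothetical toy alignment: columns 0 and 1 covary, column 2 is independent
toy_msa = ["AAA", "AAB", "BBA", "BBB"]
couplings = mean_field_couplings(toy_msa, alphabet="AB")
score_01 = coupling_strength(couplings, 0, 1)
score_02 = coupling_strength(couplings, 0, 2)
```

On this toy input the covarying pair gets a coupling of 8/3 and the independent pair gets 0. A practical pipeline would additionally down-weight similar sequences and convert the couplings into DI before ranking pairs.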
Example 1: Using DI to predict a protein-protein binding interface
Low MI implies low DI, but high MI does not necessarily imply high DI.
Example 1: Using DI to predict a protein-protein binding interface
High-DI pairs correspond to physical interactions: direct information is inversely correlated with the residue distance of pairs in the Spo0B/Spo0F co-crystal structure.
Weigt, M.; White, R. A.; Szurmant, H.; Hoch, J. A.; Hwa, T. Proc Natl Acad Sci U S A 2009, 106, 67-72.
Example 2: Using DI to predict protein 3D structure without template information
Top-ranked predicted structures can form correct contacts even where no constraints were supplied, and can avoid incorrect contacts despite false-positive constraints.
Marks, D. S.; Colwell, L. J.; Sheridan, R.; Hopf, T. A.; Pagnani, A.; Zecchina, R.; Sander, C. PLoS One 2011, 6, e28766.
How many distance constraints are needed for fold prediction?
When would it have been possible to fold from sequence?
Summary
1. Until recently, co-evolution information had not been used effectively. Local statistical models are not good enough.
2. DCA is a powerful global statistical model for finding direct interactions.
3. However, DCA predictions are still affected by statistical background noise (e.g. low statistical resolution in the empirical correlations due to an insufficient number of proteins in the family, or global correlations from phylogenetic bias in the frequency counts). In addition, some coupled pairs may not be spatially close, because they reflect functional constraints such as those imposed by protein-protein or protein-ligand interactions.
4. Even so, DCA can be used to improve current structure prediction and refinement, and to identify potential binding proteins.
Thank you very much!