
Chain Growing Using Statistical Energy Functions
David A. O'Brien, Balasubramanian Krishnamoorthy, Jack Snoeyink, Alex Tropsha, Andrew Leaver-Fay, Shuquan Zong
UNC Chapel Hill

Overview
- Lattice Chain Growth Algorithm
- Statistical Energy Functions
  - 2-body Miyazawa-Jernigan Potential
  - 4-body Potential
  - Local Shape Potential
- Results
  - Chains
  - Identifying Good Decoys
- Current Work
  - New Scoring Functions
  - Incremental Tetrahedralization
- Future work

Chain Growing - Introduction
- Lattice chain growing goals:
  - Test measures of proteins.
  - Build protein chains that maximize a given measure.
  - If these chains appear native-like, this confirms that the measure is valid.
- Predict protein structures ab initio, from sequence information alone.
  - Develop an algorithm to build 3D folded protein decoys from the sequence that are similar to the native structure.
  - Evaluate these decoys and determine which are native-like. In short, be able to pick the most native-like structure from the large set of decoys generated.

Lattice Chain Growth Algorithm
- Cubic (311) lattice with 24 possible moves: {(3,1,1), (3,1,-1), …, (-3,1,1)}.
- Generate a chain configuration by sequentially adding links until the full length of the chain is reached.
- New links cannot be placed in the zone of exclusion of other links and must satisfy angle constraints.
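
The 24 moves listed above can be enumerated as the signed permutations of (3,1,1). The following is a minimal sketch; the names `lattice_moves` and `MOVES` are illustrative and not taken from the slides.

```python
from itertools import permutations, product

def lattice_moves(base=(3, 1, 1)):
    """Return every distinct signed permutation of `base` as a move vector."""
    moves = set()
    for perm in set(permutations(base)):            # 3 distinct orderings of (3,1,1)
        for signs in product((1, -1), repeat=3):    # 8 sign patterns
            moves.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(moves)

MOVES = lattice_moves()
print(len(MOVES))  # 24
```

Every move has the same squared length, 3² + 1² + 1² = 11, so all links in a grown chain are equally long.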

Lattice Chain Growth Algorithm: Adding a New Link
- Generate the set of possible open lattice nodes.
- For each, calculate a temperature-dependent transition probability.
- Choose one of these open lattice nodes with a Monte Carlo step.
- Variations include looking two steps ahead or building from the middle of the chain.

Temperature-Dependent Transition Probability
- Probability at step i of picking configuration x' from the candidates x_1 … x_C:
  [Equation not preserved in transcript]
  - T = temperature
  - k_B = Boltzmann constant
  - E = energy (lower is better)
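
The transcript does not preserve the formula itself. A common choice consistent with the symbols listed (T, k_B, E, candidates x_1 … x_C) is the Boltzmann form p(x') = exp(-E(x')/(k_B T)) / Σ_j exp(-E(x_j)/(k_B T)); the sketch below assumes that form. The function names and the combined parameter `kT` (standing for k_B·T) are illustrative assumptions.

```python
import math
import random

def transition_probabilities(energies, kT=1.0):
    """Assumed Boltzmann form: weights exp(-E/kT), normalized over all candidates."""
    e_min = min(energies)  # shifting by the minimum avoids overflow, leaves ratios unchanged
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def choose_candidate(energies, kT=1.0, rng=random):
    """Monte Carlo step: sample one candidate index according to its probability."""
    probs = transition_probabilities(energies, kT)
    return rng.choices(range(len(energies)), weights=probs, k=1)[0]
```

Under this assumption, lower-energy candidates are favored, and raising `kT` flattens the distribution toward a uniform choice.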

Statistical Energy Functions
- Statistical energy functions assume that "contact" energies between amino acid residues in native proteins are related to their observed frequency in a representative structural database.
- If a potential configuration (decoy) has a set of nearby residues that is common in nature, it receives a good score.
- The score for the entire protein is the sum of all contact energies.
- We use three statistical energy functions:
  - 2-body Miyazawa-Jernigan
  - 4-body Potential
  - Local Shape Potential

Statistical Energy Functions Overview
- Global vs. local
  - Global: measures the entire protein (or a partial fragment)
  - Local: measures just a small sequence of consecutive residues
- 2-body Miyazawa-Jernigan
  - Easy to calculate
  - Can be global or local
- 4-body Potential
  - Expensive to calculate
  - Works better as a global measure
  - Good for determining native-like folded structures
- Local Shape Potential
  - Easy to calculate
  - Defined as a local measure
  - Global measure?

Two-body Statistical Energy Function
- For two-body potentials:
  [Equation not preserved in transcript]
- The actual pairwise (ij) values are taken from the Miyazawa-Jernigan matrix as reevaluated in 1996.

Miyazawa S, Jernigan RL. Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 1996;256:623–644.
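
As an illustration of summing pairwise contact energies over a structure, here is a minimal sketch. The energy table holds made-up placeholder numbers, not the published Miyazawa-Jernigan values, and the contact rule (a distance `cutoff` plus a minimum sequence separation) is an assumption, since the slides do not state how contacts are defined.

```python
import math

# Hypothetical placeholder energies, NOT the published Miyazawa-Jernigan matrix.
TOY_ENERGY = {("A", "A"): -0.1, ("A", "L"): -0.4, ("L", "L"): -0.7}

def pair_energy(a, b, table=TOY_ENERGY):
    """Look up the symmetric contact energy for residue types a and b."""
    return table[tuple(sorted((a, b)))]

def two_body_energy(sequence, coords, cutoff=6.5, min_separation=2, table=TOY_ENERGY):
    """Sum contact energies over residue pairs closer than `cutoff` (assumed rule)."""
    total = 0.0
    n = len(sequence)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                total += pair_energy(sequence[i], sequence[j], table)
    return total
```

For example, `two_body_energy("ALL", [(0, 0, 0), (3, 1, 1), (4, 0, 0)])` counts only the pair (0, 2), because adjacent residues are skipped when `min_separation=2`.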

Four-Body Statistical Energy Function
- Calculates the energy based on sets of four nearby residues (quads).
- Quads are calculated from the Delaunay tessellation.
- The four vertices of each tetrahedron define a quad.
- Each quad is given a statistical score.
[Figure: Delaunay tessellation of a protein. The convex hull is formed by the tetrahedral edges; each tetrahedron corresponds to a cluster of four residues.]

Four-Body Statistical Energy Function - Overview
- The four-body potential is written:
  [Equation not preserved in transcript]
- A training set of 1166 proteins was tessellated.
- The frequency of each quad type is counted.
- Each quad is typed in two ways:
  - by the combination of the four residue types {i,j,k,l}
  - by the number of consecutively appearing residues (α)
[Figure: the five α-types, labeled 25.5%, 35.6%, 11.4%, 22.1%, and 5.4%]
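
Typing a quad by its consecutively appearing residues can be sketched by grouping the four sequence positions into runs of neighbors. The five run patterns below follow from that description; the numbering assigned to each pattern is an assumption, although the later scoring slides do treat type 0 as the non-bonded case. All names are illustrative.

```python
def run_pattern(indices):
    """Group four sequence positions into runs of consecutive residues.

    Returns the run lengths sorted from longest to shortest, e.g. (2, 1, 1).
    """
    idx = sorted(indices)
    runs, current = [], 1
    for prev, nxt in zip(idx, idx[1:]):
        if nxt == prev + 1:
            current += 1
        else:
            runs.append(current)
            current = 1
    runs.append(current)
    return tuple(sorted(runs, reverse=True))

# Assumed numbering of the five connectivity types (type 0 = no consecutive residues).
TYPE_OF_PATTERN = {(1, 1, 1, 1): 0, (2, 1, 1): 1, (2, 2): 2, (3, 1): 3, (4,): 4}

def connectivity_type(indices):
    """Map four sequence positions to one of the five connectivity types."""
    return TYPE_OF_PATTERN[run_pattern(indices)]
```

For instance, positions (5, 6, 7, 8) form a single run of four, while (3, 9, 20, 40) contains no consecutive residues at all.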

Four-Body Statistical Energy Function - Classifying Quadruplets
- Denote each quad by {i,j,k,l}.
- i, j, k, and l can be any of the 20 amino acids (L20).
  - e.g. AALV, TLKM, TTLK, YYYY, etc.
  - 8855 possible combinations
- Alternatively, the 20 amino acids can be grouped into just 6 types (L6).
  - Groups are defined by chemical properties of the amino acids.
  - 126 possible combinations
  - c = {cysteine}
  - f = {phenylalanine, tyrosine, tryptophan}
  - h = {histidine, arginine, lysine}
  - n = {asparagine, aspartic acid, glutamine, glutamic acid}
  - s = {serine, threonine, proline, alanine, glycine}
  - v = {methionine, isoleucine, leucine, valine}
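
The six-type grouping can be expressed as a lookup from one-letter amino acid codes to group labels. This is a minimal sketch; `GROUPS`, `CLASS_OF`, and `quad_key` are illustrative names, and sorting the letters is just one way to make a quad's key independent of residue order.

```python
# Six groups from the slide, written with standard one-letter amino acid codes.
GROUPS = {
    "c": "C",      # cysteine
    "f": "FYW",    # phenylalanine, tyrosine, tryptophan
    "h": "HRK",    # histidine, arginine, lysine
    "n": "NDQE",   # asparagine, aspartic acid, glutamine, glutamic acid
    "s": "STPAG",  # serine, threonine, proline, alanine, glycine
    "v": "MILV",   # methionine, isoleucine, leucine, valine
}
CLASS_OF = {aa: group for group, members in GROUPS.items() for aa in members}

def quad_key(residues, reduced=False):
    """Order-independent key for a quad, in the L20 or the reduced L6 alphabet."""
    letters = [CLASS_OF[r] for r in residues] if reduced else list(residues)
    return "".join(sorted(letters))

print(quad_key("AALV", reduced=True))  # ssvv
```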

Four-Body Statistical Energy Function - Classifying Quadruplets
- L20 case:
  - 5 α-types x 8855 combinations ==> 44,275 quad types
  - Not all quad types are observed in the training set.
  - The potential of unobserved types is set to some fraction of the lowest score for a represented quad type.
- L6 case:
  - 5 α-types x 126 combinations ==> 630 quad types
  - All but a few quad types are observed in the training set.
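
The counts above can be checked by enumerating unordered quads with repetition allowed (multisets of size four). The sketch also shows one way to assign a fallback score to unobserved quad types; the `fraction` default of 0.5 is a hypothetical value, since the slide only says "some fraction".

```python
from itertools import combinations_with_replacement

def count_quad_combinations(alphabet_size):
    """Count unordered quads (repeats allowed) over an alphabet of the given size."""
    return sum(1 for _ in combinations_with_replacement(range(alphabet_size), 4))

def fill_unobserved(scores, all_types, fraction=0.5):
    """Give unobserved quad types a fraction of the lowest observed score.

    `fraction` is a hypothetical parameter; the slides do not state its value.
    """
    fallback = fraction * min(scores.values())
    return {t: scores.get(t, fallback) for t in all_types}

print(count_quad_combinations(20), 5 * count_quad_combinations(20))  # 8855 44275
print(count_quad_combinations(6), 5 * count_quad_combinations(6))    # 126 630
```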

Four-Body Statistical Energy Function - Formulation
- The formulation is an extension of the previous 2-body formula:
  [Equation not preserved in transcript]

Local Shape Statistical Energy Function
- Motivation:
  - Fragment libraries model protein structures accurately.
  - Use the frequency of common fragments to construct a statistical function that supplements the 2-body and 4-body energy functions to grow better decoys.
  - Good fragment libraries exist, but lattice chain building needs fragments that fit in the 311 lattice.
- Main idea:
  - For each possible consecutive sequence of four residues i, j, k, and l, calculate in which shape these residues most often occur.
  - If Shape A is found more often in nature than Shape B, try to build the chain accordingly.
[Figure: Shape A and Shape B]

Local Shape Statistical Energy Function
- Create a set of canonical lattice shapes of length 4 (and 5).
  - Calculate the ways to embed a chain of length 4 (or 5) in the 311 lattice.
  - 155 canonical shapes for length 4 (2789 for length 5).
  - For L6, there are 6^4 = 1,296 sequences.
  - 155 x 1,296 = 200,880 combinations.
- Parse a representative set of 971 proteins into segments.
  - For each length-4 segment, calculate the RMSD against each canonical shape.
[Figure: a sample protein segment compared against Shape 1, Shape 2, …, Shape 155]
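
Comparing a segment against each canonical shape can be sketched as follows. Note the substitution: the slides use RMSD, which normally requires an optimal superposition step; to stay short and dependency-free this sketch uses distance RMSD (dRMSD) over internal pairwise distances instead, which needs no superposition but cannot distinguish mirror images. The two shapes in the usage lines are hypothetical examples, not entries from the 155-shape library.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """All internal point-to-point distances (6 values for a 4-residue segment)."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

def drmsd(a, b):
    """Distance RMSD between two equally long point lists (no superposition needed)."""
    da, db = pairwise_distances(a), pairwise_distances(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))

def closest_shape(segment, shapes):
    """Return the index of the best-matching shape and the score of every shape."""
    scores = [drmsd(segment, shape) for shape in shapes]
    best = min(range(len(shapes)), key=scores.__getitem__)
    return best, scores

# Hypothetical shapes built from (3,1,1)-type moves, for illustration only.
shape_a = [(0, 0, 0), (3, 1, 1), (6, 2, 2), (9, 3, 3)]
shape_b = [(0, 0, 0), (3, 1, 1), (4, 4, 0), (1, 5, 1)]
segment = [(x + 10, y + 10, z + 10) for x, y, z in shape_b]
best, scores = closest_shape(segment, [shape_a, shape_b])
```

Because only internal distances are compared, translating or rotating the segment does not change its scores.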

Local Shape Statistical Energy Function
- Turning RMSD values into frequencies:
  - If only the canonical shape with the best RMSD is counted, not all 200,880 combinations are found in the training set.
  - If two canonical shapes both have low RMSD, give each some credit.
  - For each RMSD value, indexed by the residue types i, j, k, l and by the shape:
    [Equation not preserved in transcript]
  - Normalize the 155 RMSD values.
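
The exact weighting formula is not preserved in the transcript. One plausible way to give several low-RMSD shapes partial credit is to weight each shape by its inverse RMSD and normalize the weights to sum to one; the sketch below assumes that scheme, and it should not be read as the authors' formula. The names and the `eps` guard are illustrative.

```python
def shape_credits(rmsds, eps=1e-6):
    """Assumed scheme: credit proportional to 1/RMSD, normalized to sum to 1."""
    inverse = [1.0 / (r + eps) for r in rmsds]  # eps guards against RMSD == 0
    total = sum(inverse)
    return [w / total for w in inverse]

def accumulate(counts, sequence_key, rmsds):
    """Add one segment's credits to the running per-shape totals for its sequence."""
    row = counts.setdefault(sequence_key, [0.0] * len(rmsds))
    for shape_index, credit in enumerate(shape_credits(rmsds)):
        row[shape_index] += credit
    return counts
```

With this scheme each observed segment contributes a total weight of one, spread across shapes, so sparse sequence-shape combinations still receive non-zero counts.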

Results - Building Decoys
- Decoys produced by chain growing are still not good enough.
- Relatively good correlation between RMSD and 4-body energy.
[Figure: decoys for 2mhu built with the MJ potential and with the local shape potential, shown with the native state; axis label: four-body energy per residue]

Identifying Good Decoys
- 20L or 6L Non-bonded
  - Sum only the contribution of α-type 0 tetrahedra.

Discriminating Native & Non-Native
- Non-bonded L20 scoring function applied to a set of folded and unfolded decoys.

Adjustments to Scoring Functions
- 20L or 6L Non-bonded
  - Sum only the contribution of α-type 0 tetrahedra.
- 20L or 6L 5T
  - Sum the contribution of all tetrahedra.
- 20L Ratio All
  - As above, but define:
    [Equation not preserved in transcript]

Incremental Tetrahedralization
- Maintain a constant tetrahedralization and only add and remove single vertices.
- When evaluating a new candidate, update the total energy by tagging new quadruplets as well as any that have been removed.
- Add the effect of the new quadruplets and subtract the effect of those removed.
[Figure: add candidate and evaluate; add next candidate and reevaluate; remove candidate and reset state.]
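
The energy bookkeeping described above can be sketched independently of the tessellation code. The sketch assumes each quadruplet is represented as a `frozenset` of four vertex ids and that a hypothetical `score` function returns its energy; how the tessellation itself is updated is not shown.

```python
def incremental_energy(total, old_quads, new_quads, score):
    """Update a running energy after the set of quadruplets changes.

    Adds the score of quadruplets that appeared and subtracts the score
    of those that disappeared, instead of re-summing every quadruplet.
    """
    added = new_quads - old_quads
    removed = old_quads - new_quads
    return total + sum(score(q) for q in added) - sum(score(q) for q in removed)

# Hypothetical quadruplets and scores, for illustration only.
q1, q2, q3 = frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5}), frozenset({3, 4, 5, 6})
scores = {q1: -1.0, q2: 0.5, q3: -2.0}
before, after = {q1, q2}, {q2, q3}
energy_before = sum(scores[q] for q in before)
energy_after = incremental_energy(energy_before, before, after, scores.__getitem__)
```

Calling the function again with the two sets swapped resets the energy to its earlier value, which matches the "remove candidate and reset state" step.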

References
- H.H. Gan, A. Tropsha and T. Schlick. Generating folded protein structures with a lattice chain-growth algorithm. J. Chem. Phys. 113, (2000).
- H.H. Gan, A. Tropsha and T. Schlick. Lattice protein folding with two and four-body statistical potentials. Proteins: Structure, Function, and Genetics 43, (2001).
- Miyazawa S, Jernigan RL. Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 1996;256:623–644.
- Tropsha A, Singh RK, Vaisman II. Delaunay tessellation of proteins: Four body nearest neighbor propensities of amino acid residues. J. Comput. Biol. 1996;3:2, (1996).
- R. Kolodny, P. Koehl, L. Guibas and M. Levitt. Small libraries of protein fragments model native protein structures accurately. J. Mol. Biol. 323, (2002).