Protein Quaternary Fold Recognition Using Conditional Graphical Models


1 Protein Quaternary Fold Recognition Using Conditional Graphical Models
Yan Liu and Jaime Carbonell (Carnegie Mellon), Vanathi Gopalakrishnan (U Pitt), Peter Weigele (MIT). Language Technologies Institute, School of Computer Science, Carnegie Mellon University. IJCAI-2007, Hyderabad, India

2 Snapshot of Cell Biology
Nobelprize.org. Protein structure + protein function. DSCTFTTAAAAKAGKAKAG (protein sequence). On Earth there are many kinds of living things, such as plants, animals, and human beings, and all of them are made up of cells. The nucleus of the cell stores the DNA, which is transcribed and translated into proteins. Proteins, as chains of amino acids, adopt a unique three-dimensional structure in their native environment, and that structure ultimately determines their function: proteins make up a large portion of living organisms and perform most of their important functions. Proteins are composed of amino acids and have a defined three-dimensional structure; this form dictates the function. Proteins are responsible for all the reactions and activities of the cell. The structure of each individual protein is encoded in the DNA in the cell nucleus. Cytoskeletal proteins control the shape and movement of the cell. Proteins are synthesized on ribosomes. Mitochondrial proteins are responsible for cell respiration and the synthesis of ATP, which provides cellular energy. Enzymes in the cell catalyze chemical reactions. Storage vesicles contain, and release, hormones and neurotransmitters, which act on receptors and control ion channels. In this way cells can communicate with each other and order proteins in the cell to work in concert with the entire organism.

3 Example Protein Structures
Triple beta-spiral fold in the adenovirus fiber shaft. To emphasize the importance of protein structure to function, here is an interesting example: the triple beta-spiral fold. The figure on the left shows the 3-D structure of the adenovirus fiber, which consists of a rigid shaft (a protein with the triple beta-spiral fold) and a knob; both are part of the virus capsid. The antenna-like fiber protein serves as the "hands" of the virus, attaching it to the cell surface so that the viral DNA can be injected. We will come back to this fold later in the discussion, but at this stage it illustrates the importance of protein structure to function, which motivates us to determine protein structures. Structures provide important information about function, and this motivates extensive work on identifying protein structures. The triple beta-spiral fold occurs in different kinds of virus proteins without any sequence similarity; identifying more examples of this fold would demonstrate how widespread it is among virus proteins and furthermore suggest that they come from a common ancestor. Enlarged view of an adenovirus particle: the viral capsid is an icosahedron with 12 antenna-like fiber projections that function to attach the virus to the cell surface during infection; the viral DNA is packaged inside the particle. Adenovirus fibre shaft; virus capsid.

4 Predicting Protein Structures
Protein structure is a key determinant of protein function. Crystallography, which resolves protein structures experimentally in vitro, is very expensive, and NMR can only resolve very small proteins. Hence the gap between known protein sequences and structures: 3,023,461 sequences vs. 36,247 resolved structures (1.2%). Therefore we need to predict structures in silico. Before digging into the details of the prediction algorithms, we start by introducing the current understanding of protein structures. Protein structure is defined at four conceptual levels in a hierarchy. Primary structure refers to the linear polymer of amino acids; there are 20 types of standard amino acid in nature, represented by English letters. The secondary structure of a protein can be thought of as the local conformation of the polypeptide chain, or intuitively as the building blocks of its three-dimensional structure. There are two dominant types of secondary structure, alpha-helices and parallel or anti-parallel beta-sheets; these exhibit a high degree of regularity and are connected by the remaining irregular regions, called loops. The tertiary structure of a protein is its global three-dimensional structure, usually represented as a set of 3-D coordinates for each atom. An important property is that protein sequences have been selected by the evolutionary process to achieve a unique, reproducible, and stable structure. Sometimes several protein chains (either identical or non-identical) unite and form chemical bonds with each other to reach a structurally stable unit. As we can see, there are many interesting and challenging problems concerning protein structure. Here we focus on protein quaternary fold recognition and alignment prediction.
That is, given the protein sequence information only, our goal is to predict what the secondary structure elements are, how they arrange themselves in three-dimensional space, and how multiple chains associate into complexes. All experimentally solved 3-D structures are deposited in a worldwide repository, the Protein Data Bank. Structural biologists have systematically annotated the structures and their evolutionary relationships. SCOP, an acronym for Structural Classification of Proteins, is a database that manually annotates the structural and evolutionary relationships of proteins according to current understanding; a family groups proteins with clear evolutionary relationships.

5 Quaternary Folds and Alignments
Protein fold: an identifiable regular arrangement of secondary structural elements. Thus far, a limited number of protein folds have been discovered (~1000). There is very little research on quaternary folds, because the structures are complex and labeled data are few. Quaternary fold recognition. Biology task: given membership and non-membership proteins of a fold, will a new protein take the fold? AI task: given a pattern to be induced and training data (sequence-structure pairs + physics), does the pattern appear in the testing sequence? The structure stabilization mechanism is similar to that of tertiary structures, but with a more complex interaction map involving inter-chain and intra-chain dependencies. The key question is how to represent the pattern; graphical models are a natural approach. Seq 1: APA FSVSPA … SGACGP ECAESG Seq 2: DSCTFT…TAAAAKAGKAKCSTITL

6 Previous Work Sequence similarity perspective
Sequence similarity searches, e.g. PSI-BLAST [Altschul et al., 1997]; profile HMMs, e.g. HMMER [Durbin et al., 1998] and SAM [Karplus et al., 1998]; window-based methods, e.g. PSIPRED [Jones, 2001]. Physical forces perspective: homology modeling or threading, e.g. Threader [Jones, 1998]. Structural biology perspective: painstakingly hand-engineered methods for specific structures, e.g. αα- and ββ-hairpins, β-turns, and β-helices [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al., 2001]. Sequence-similarity methods fail to capture the structural properties and long-range dependencies, and in particular fail on folds shared without sequence similarity. Generative models based on a rough approximation of free energy perform very poorly on complex structures and rely strongly on the validity of the assumptions made in the free-energy definition. Previous work on protein structure prediction can thus be summarized as three different approaches, none of which offers a principled probabilistic model for the structured properties of proteins. There are informative features without a clear mapping to structures: polar, hydrophobic, aromatic, etc. Hand-engineered methods are very hard to generalize because of built-in constants and fixed features. Motivated by previous work in protein structure prediction and conditional random fields, we propose the generalized conditional graphical model.

7 Conditional Random Fields
Hidden Markov models (HMMs) [Rabiner, 1989]. Conditional random fields (CRFs) [Lafferty et al., 2001] model the conditional probability directly (discriminative models, directly optimizable), allow arbitrary dependencies in the observation, and adapt to different loss functions and regularizers, with promising results in multiple applications. But they need to be scaled up computationally and extended to long-distance dependencies. One of the simplest forms of structure is a chain. A series of graphical models have been developed to model sequential data, among which CRFs have demonstrated both theoretical advantages and strong empirical success. Unlike the HMM, which defines the joint probability of the labels and observations as a product of emission and transition probabilities, the CRF takes a discriminative training approach and defines the conditional probability of the labels given the observation as an exponential function of the features f, with weights λ and normalizer Z. The dependencies are captured via the features; one way to define a feature is to factorize it. The CRF models the conditional probability directly without assumptions about the data generation, which yields a series of nice properties, and its global normalization finds a globally optimal solution; the hidden Markov model, by contrast, makes the Markov and output-independence assumptions. A simple Markov network over a chain is also known as a chain conditional random field. The conditional probability of y given x is a product of node and edge potentials. The node potentials roughly correspond to emission probabilities in an HMM, while the edge potentials correspond to transition probabilities; however, these potentials do not need to sum to 1 as in HMMs, but simply need to be positive functions. Consider the node potential, phi_n. A natural way to represent potentials is as a log-linear combination of basis functions.
Each basis function can be an indicator function asking a question like "is pixel p on and the letter equal to 'z'?". Note that the weights w can be negative. Now consider the edge potentials, phi_e. The basis functions here are indicator functions asking a question like "is the current letter 'z' and the next one an 'a'?". The products of such potentials are again log-linear combinations of basis functions. We use a compact vector notation by stacking all the parameters and basis functions together; the basis functions then become simple counts, such as how many times 'z' is followed by 'a' and how many times pixel p is on when the letter is 'z'.
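The chain-CRF definition sketched above can be made concrete with a tiny brute-force example. This is an illustrative toy, not the paper's model: the label set, the features, and the weights below are invented, and the partition function Z is computed by exhaustive enumeration, which is only feasible for very short chains.

```python
import itertools
import math

# Minimal brute-force chain CRF (toy sketch):
# P(y | x) = exp(sum_t node(y_t, x_t) + sum_t edge(y_{t-1}, y_t)) / Z(x)

LABELS = ["H", "E", "C"]  # invented label set (helix / strand / coil)

def score(y, x, node_w, edge_w):
    """Unnormalized log-score: sum of node and edge feature weights that fire."""
    s = sum(node_w.get((y[t], x[t]), 0.0) for t in range(len(x)))
    s += sum(edge_w.get((y[t - 1], y[t]), 0.0) for t in range(1, len(x)))
    return s

def crf_prob(y, x, node_w, edge_w):
    """Conditional probability of one labeling, normalized over all labelings of x."""
    Z = sum(math.exp(score(yy, x, node_w, edge_w))
            for yy in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x, node_w, edge_w)) / Z

node_w = {("H", "A"): 1.0, ("E", "V"): 1.0}  # invented feature weights
edge_w = {("H", "H"): 0.5, ("E", "E"): 0.5}

x = ["A", "A", "V"]
p = crf_prob(["H", "H", "E"], x, node_w, edge_w)
```

For real protein-length sequences Z cannot be enumerated; chain CRFs use forward-backward dynamic programming instead, and the linked graphs discussed later require approximate inference.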

8 Our Solution: Conditional Graphical Models
Local dependency; long-range dependency. Outputs: Y = {M, {Wi}}, where Wi = {pi, qi, si}. Feature definition: node features, local interaction features, long-range interaction features. Here is the general framework of conditional graphical models. Given the protein structure level we want to predict, we generate an undirected graph, referred to as the protein structure graph in the later discussion. In the graph, the nodes correspond to structural segments, and the edges indicate the dependencies between nodes: local dependencies, intuitively the peptide bonds connecting the amino acids, or long-range dependencies, intuitively the chemical bonds between residues that are far away in sequence order. These dependencies are captured via the features, including the node potentials and edge potentials. To use graphical models for protein structure prediction, we first need to define the graph that encodes the structural information. Long-range interactions: residues that are distant in the primary structure, with an unknown number of insertions between them.
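The three feature types listed above can be illustrated with a small sketch. Everything here is hypothetical: the hydrophobicity set, the segment tuples, and the particular feature definitions are stand-ins chosen only to show the node / local / long-range distinction over segments Wi = (pi, qi, si).

```python
# Hypothetical feature functions over segments (p, q, s) of a sequence x.
HYDROPHOBIC = set("AVLIMFWC")  # illustrative residue class

def node_feature(x, seg):
    """Node feature: a property of one segment, e.g. its hydrophobic fraction."""
    p, q, s = seg
    residues = x[p:q + 1]
    return sum(r in HYDROPHOBIC for r in residues) / len(residues)

def local_feature(x, seg_a, seg_b):
    """Local interaction: fires when two segments are directly adjacent in sequence."""
    return 1.0 if seg_b[0] - seg_a[1] == 1 else 0.0

def long_range_feature(x, seg_a, seg_b):
    """Long-range interaction: compatibility of two sequence-distant segments,
    here the similarity of their hydrophobic fractions (e.g. paired strands)."""
    return 1.0 - abs(node_feature(x, seg_a) - node_feature(x, seg_b))

x = "AVLDSCTFTAVLIM"                                  # toy sequence
w1, w2, w3 = (0, 2, "B1"), (3, 8, "T1"), (9, 13, "B2")  # toy segmentation
```

A real model would combine many such features with learned weights; these three merely mirror the node, local, and long-range slots in the framework.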

9 Linked Segmentation CRF
Nodes: secondary structure elements and/or simple folds. Edges: local interactions and long-range inter-chain and intra-chain interactions. L-SCRF: the conditional probability of y given x is defined over an exponential form of features over single nodes and pairs of nodes. In a quaternary protein fold we have both inter-chain and intra-chain interactions, and therefore both kinds of arcs in the graph. We therefore introduce the linked SCRF model. Notice that we define the joint probability of the labels y_i for all the chains, because the fold is stabilized by the participation of all the chains. This results in a complex graph in which approximate inference has to be applied. Quaternary structures: multiple sequences with tertiary structures associate to form a stable structure, with a structure stabilization mechanism similar to that of tertiary structures. There is very limited research work to date: the structures are complex and positive training data are few. Joint labels.
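The defining equation itself does not survive in this transcript. A hedged reconstruction, consistent with the note's description (an exponential form of features over single nodes and pairs of nodes, normalized jointly over all chains), would be:

```latex
P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
\exp\!\Big( \sum_{v \in V} \boldsymbol{\lambda}^{\top} \mathbf{f}(\mathbf{x}, y_v)
\;+\; \sum_{(u,v) \in E} \boldsymbol{\mu}^{\top} \mathbf{g}(\mathbf{x}, y_u, y_v) \Big)
```

where V is the set of segment nodes over all chains, E contains both the local and the long-range (intra- and inter-chain) edges, f and g are node and pairwise feature vectors with weights λ and μ, and Z(x) sums the exponential over all joint segmentations y. The exact notation in the paper may differ.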

10 Linked Segmentation CRF (II)
Classification. Training: learn the model parameters λ by minimizing the regularized negative log loss, using iterative search algorithms that seek the direction in which the empirical feature values agree with their model expectation. Complex graphs result in huge computational complexity. To use graphical models for protein structure prediction, we first need to define the graph that encodes the structural information. Long-range interactions: residues that are distant in the primary structure, with an unknown number of insertions between them.
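The training direction described above (move the parameters until empirical feature values agree with their model expectation, plus a regularization pull toward zero) can be sketched on a toy chain model. The features, data, and hyperparameters below are invented for illustration, and expectations are computed by brute force rather than by the paper's inference machinery.

```python
import itertools
import math

# Gradient ascent on an L2-regularized conditional log-likelihood (toy sketch).
# Per-weight gradient: (empirical feature value - expected feature value) - lam_k / sigma^2.

LABELS = [0, 1]

def features(y, x):
    """Invented feature vector: [#positions where y_t == x_t, #adjacent equal labels]."""
    f0 = sum(1.0 for t in range(len(x)) if y[t] == x[t])
    f1 = sum(1.0 for t in range(1, len(y)) if y[t - 1] == y[t])
    return [f0, f1]

def expected_features(lam, x):
    """E_{P(y|x)}[f(y, x)] by brute-force enumeration over all labelings."""
    seqs = list(itertools.product(LABELS, repeat=len(x)))
    scores = [math.exp(sum(l * f for l, f in zip(lam, features(y, x)))) for y in seqs]
    Z = sum(scores)
    exp_f = [0.0, 0.0]
    for y, s in zip(seqs, scores):
        for k, fk in enumerate(features(y, x)):
            exp_f[k] += (s / Z) * fk
    return exp_f

def train(data, steps=200, lr=0.1, sigma2=10.0):
    """Iteratively move lambda toward agreement of empirical and expected features."""
    lam = [0.0, 0.0]
    for _ in range(steps):
        grad = [-l / sigma2 for l in lam]    # gradient of the Gaussian prior term
        for x, y in data:
            emp = features(y, x)
            exp_f = expected_features(lam, x)
            for k in range(2):
                grad[k] += emp[k] - exp_f[k]  # empirical minus expected
        lam = [l + lr * g for l, g in zip(lam, grad)]
    return lam

data = [([0, 0, 1], [0, 0, 1]), ([1, 1, 0], [1, 1, 0])]
lam = train(data)
```

In the L-SCRF setting the same update applies, except that the expectation cannot be enumerated and is estimated by the approximate inference of the next slide.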

11 Approximate Inference of L-SCRF
Most approximation algorithms cannot handle a variable number of nodes in the graph, but we need variable graph topologies, so we use reversible jump MCMC sampling [Green, 1995; Schmidler et al., 2001] with four types of Metropolis operators: state switching, position switching, segment split, and segment merge, together with simulated annealing reversible jump MCMC [Andrieu et al., 2000]. Replacing the sampler with its simulated-annealing variant theoretically converges on the global optimum. Sampling algorithms are pursued because of their simplicity and their ability to handle random variables of variable dimension. The operators are as follows. State switching: given a segmentation y_i = (M_i, w_i), select a segment j uniformly from [1, M] and a state value s' uniformly from the state set S; set y*_i = y_i except that s*_{i,j} = s'. Position switching: given a segmentation y_i = (M_i, w_i), select a segment j uniformly from [1, M] and a position assignment d' ~ U[d_{i,j-1} + 1, d_{i,j+1} - 1]; set y*_i = y_i except that d*_{i,j} = d'. Segment split: given a segmentation y_i = (M_i, w_i), propose y*_i = (M*_i, w*_i) with M*_i = M_i + 1 segments by splitting the j-th segment, where j is sampled from U[1, M]; set w*_{i,k} = w_{i,k} for k = 1, ..., j - 1 and w*_{i,k+1} = w_{i,k} for k = j + 1, ..., M_i; sample a value assignment v ~ P(v) and compute (w*_{i,j}, w*_{i,j+1}, v') = Ψ(w_{i,j}, v). Segment merge: given a segmentation y_i = (M_i, w_i), propose M*_i = M_i - 1 by merging the j-th and (j+1)-th segments, where j is sampled uniformly from [1, M - 1]; set w*_{i,k} = w_{i,k} for k = 1, ..., j - 1 and w*_{i,k-1} = w_{i,k} for k = j + 1, ..., M_i; sample a value assignment v' ~ P(v') and compute (w*_{i,j}, v) = Ψ^{-1}(w_{i,j}, w_{i,j+1}, v'). The acceptance rate for each proposed transition then follows the standard reversible-jump formula.
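The four operator types can be sketched on a toy single-chain segmentation. This is a simplified illustration, not the paper's sampler: the scoring function is invented, and the acceptance rule below is plain Metropolis on the score difference, omitting the proposal-ratio and dimension-matching terms that a correct reversible-jump acceptance rate requires.

```python
import math
import random

STATES = "ab"  # toy state set

def log_score(segs, x):
    """Toy log-score: number of positions whose character equals its segment's state."""
    return float(sum(sum(1 for t in range(p, q + 1) if x[t] == s)
                     for p, q, s in segs))

def valid(segs, n):
    """Segments (p, q, s) must tile positions 0..n-1 with non-empty contiguous spans."""
    ok = segs[0][0] == 0 and segs[-1][1] == n - 1
    ok = ok and all(p <= q for p, q, _ in segs)
    return ok and all(segs[k][0] == segs[k - 1][1] + 1 for k in range(1, len(segs)))

def propose(segs, rng):
    """Draw one of the four move types and return a proposed segmentation."""
    segs = list(segs)
    move = rng.choice(["state", "position", "split", "merge"])
    if move == "state":                            # state switching
        j = rng.randrange(len(segs))
        p, q, _ = segs[j]
        segs[j] = (p, q, rng.choice(STATES))
    elif move == "position" and len(segs) > 1:     # position switching
        j = rng.randrange(len(segs) - 1)           # boundary between j and j+1
        p1, _, s1 = segs[j]
        _, q2, s2 = segs[j + 1]
        d = rng.randrange(p1, q2)                  # new end of segment j
        segs[j], segs[j + 1] = (p1, d, s1), (d + 1, q2, s2)
    elif move == "split":                          # segment split
        j = rng.randrange(len(segs))
        p, q, s = segs[j]
        if q > p:
            d = rng.randrange(p, q)
            segs[j:j + 1] = [(p, d, s), (d + 1, q, rng.choice(STATES))]
    elif move == "merge" and len(segs) > 1:        # segment merge
        j = rng.randrange(len(segs) - 1)
        p1, s1, q2 = segs[j][0], segs[j][2], segs[j + 1][1]
        segs[j:j + 2] = [(p1, q2, s1)]
    return segs

def sample(x, steps=500, seed=0):
    """Metropolis chain over segmentations; tracks the best segmentation seen."""
    rng = random.Random(seed)
    segs = [(0, len(x) - 1, STATES[0])]
    best, best_score = segs, log_score(segs, x)
    for _ in range(steps):
        prop = propose(segs, rng)
        if math.log(max(rng.random(), 1e-300)) < log_score(prop, x) - log_score(segs, x):
            segs = prop
        if log_score(segs, x) > best_score:
            best, best_score = segs, log_score(segs, x)
    return best, best_score

x = "aaabbbaab"
best, best_score = sample(x)
```

Because every operator maps a valid segmentation to a valid segmentation, the chain can change the number of segments freely while keeping the tiling constraint, which is the point of the variable-dimension sampler.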

12 Experiments: Target Quaternary Fold
Triple beta-spirals [van Raaij et al., Nature 1999]: virus fibers in adenovirus, reovirus, and PRD1. Double barrel trimer [Benson et al., 2004]: coat protein of adenovirus, PRD1, STIV, and PBCV. The triple beta-spiral consists of beta-strands and an alpha-helix connected by coils: three parallel β-strands (B1, B2, B3) plus T2, a conserved two-residue turn. These folds show structural similarity without sequence similarity, reaching the twilight zone of sequence-based algorithms. Reasons for choosing these two: computationally, they are examples of folds that are important but have a limited number of known instances; biologically, both folds are related to virus proteins, and their common occurrence in viruses attacking different species reveals important evolutionary information and suggests a common ancestor of all viruses.

13 Features for Protein Fold Recognition

14 Tertiary Fold Recognition: β-Helix fold
Histogram and ranks for known β-helices scored against the PDB-minus dataset. The chain graph model reduces the real running time of the SCRF model by around a factor of 50.

15 Fold Alignment Prediction: β-Helix
Predicted alignment for known β-helices on cross-family validation

16 Discovery of New Potential β-helices
We ran the structural predictor seeking potential β-helices in the Uniprot (structurally unresolved) databases. Full list (98 new predictions) can be accessed at Verification on 3 proteins from different organisms whose structures were later experimentally resolved: 1YP2, potato tuber ADP-glucose pyrophosphorylase; 1PXZ, the major allergen from cedar pollen; and GP14 of Shigella bacteriophage as a β-helix protein. Not a single false positive among the 93 top-ranked proteins, and the method is good at predicting non-homologous sequences. Pollen from cedar and cypress trees is a major cause of seasonal hypersensitivity in humans in several regions of the Northern Hemisphere.

17 Experiment Results: Fold Recognition
Triple beta-spirals; double barrel trimer. We can see that predicting the DBT fold is extremely difficult. However, our method gives higher ranks to 3 of the 4 known DBT proteins, although we are unable to reach a clear separation between the DBT proteins and the rest. The results are within our expectation, because the lack of signal features and the unclear understanding of the inter-chain interactions make the prediction significantly harder. We believe further improvement can be achieved by combining the results from multiple algorithms.

18 Experiment Results: Alignment Prediction
Triple beta-spirals Four states: B1, B2, T1 and T2 Correct Alignment: B1: i – o B2: a - h Predicted Alignment B1 B2

19 Experiment Results: Discovery of New Membership Proteins
Predicted membership proteins of triple beta-spirals can be accessed at Membership proteins of the double barrel trimer suggested by biologists [Benson, 2005], compared with L-SCRF predictions. Further verification requires experimental data.

20 Conclusion Conditional graphical models for protein structure prediction: an effective representation of protein structural properties, the feasibility of incorporating different kinds of informative features, and efficient inference algorithms for large-scale applications. A major extension of previous work: knowledge representation through graphical models, and the ability to handle long-range interactions within one chain and between chains. Future work: automatic learning of the graph topology; applications to other domains. In our exploration, we have demonstrated the effectiveness of conditional graphical models for general secondary structure prediction on globular proteins; tertiary fold (motif) recognition for two specific folds, the right-handed β-helix and leucine-rich repeats (mostly non-globular proteins); and quaternary fold recognition for two specific folds, the triple β-spirals and the double barrel trimer. We have therefore confirmed the thesis statement: conditional graphical models are theoretically justified and empirically effective for protein structure prediction, independent of the protein structure hierarchy. Contributions and limitations.

21

22 Graphical Models A graphical model is a graph representation of probability dependencies [Pearl, 1993; Jordan, 1999]. Nodes: random variables. Edges: dependency relations. Directed graphical models (Bayesian networks); undirected graphical models (Markov random fields). Graphical models are a natural choice for the structured prediction problem because they conveniently represent probability dependencies via graphs: the nodes denote the random variables and the edges represent the dependency relations. Based on the directionality of the graph, we distinguish directed and undirected graphical models. To represent constraints or correlations, a graphical model uses the edges qualitatively and the potentials, that is, the features, quantitatively. One of the simplest forms of structure is a chain, and a series of graphical models have been developed to model sequential data; CRFs have both theoretical advantages and strong empirical success. The hidden Markov model makes the Markov and output-independence assumptions, whereas the CRF makes no assumption about the data generation and uses global normalization to find a globally optimal solution. A simple Markov network over a chain is also known as a chain conditional random field (CRF). The conditional probability of y given x is a product of node and edge potentials. The node potentials roughly correspond to emission probabilities in an HMM, while the edge potentials correspond to transition probabilities; however, these potentials do not need to sum to 1 as in HMMs, but simply need to be positive functions. Consider the node potential, phi_n. A natural way to represent potentials is as a log-linear combination of basis functions. Each basis function can be an indicator function asking a question like "is pixel p on and the letter equal to 'z'?". Note that the weights w can be negative. Now consider the edge potentials, phi_e.
The basis functions here are indicator functions asking a question like "is the current letter 'z' and the next one an 'a'?". The products of such potentials are again log-linear combinations of basis functions. We use a compact vector notation by stacking all the parameters and basis functions together; the basis functions then become simple counts, such as how many times 'z' is followed by 'a' and how many times pixel p is on when the letter is 'z'.
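The node/edge potential discussion above can be made concrete. The pixel/letter names follow the handwriting example in the text, but the weights and helper functions are invented; the point is only that each potential is an exponential of weighted indicator basis functions, so a product of potentials stays log-linear.

```python
import math

# Potentials as exponentials of weighted indicator basis functions (toy sketch).

def node_potential(letter, pixels, w):
    """phi_n = exp(sum_k w_k * indicator_k), where an indicator asks
    e.g. 'is pixel p on and the letter equal to z?'."""
    return math.exp(sum(w.get((letter, p), 0.0) for p, on in pixels.items() if on))

def edge_potential(letter, next_letter, w):
    """phi_e over letter pairs, e.g. 'is the current letter z and the next one a?'."""
    return math.exp(w.get((letter, next_letter), 0.0))

node_w = {("z", 3): 2.0}   # invented weights (weights may also be negative)
edge_w = {("z", "a"): 1.0}

# The product of potentials is exp of the summed weight*count terms: still log-linear.
pix = {3: True, 5: False}
product = node_potential("z", pix, node_w) * edge_potential("z", "a", edge_w)
```

Here the product equals exp(2.0 + 1.0), matching the text's observation that stacking parameters and basis-function counts into vectors keeps the whole expression log-linear.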

