Carnegie Mellon School of Computer Science 1 Protein Quaternary Fold Recognition Using Conditional Graphical Models Yan Liu IBM Research Jaime Carbonell.


1 Carnegie Mellon School of Computer Science
Protein Quaternary Fold Recognition Using Conditional Graphical Models
Yan Liu (IBM Research), Jaime Carbonell (CMU), Vanathi Gopalakrishnan (U Pitt), Peter Weigele (MIT)
ICML-2007 workshop

2 Snapshot of Cell Biology
[Figure (image source: Nobelprize.org). Labels: Protein sequence (DSCTFTTAAAAKAGKAKAG), Protein structure, Protein function]

3 Example Protein Structures
[Figure. Labels: Adenovirus Fiber Shaft; Virus Capsid; Triple beta-spiral fold in Adenovirus Fiber Shaft]

4 Predicting Protein Structures
- Protein structure is a key determinant of protein function.
- Crystallography to resolve protein structures experimentally in vitro is very expensive; NMR can only resolve very small proteins.
- The gap between the known protein sequences and structures:
  - 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
  - Therefore we need to predict structures in silico

5 Quaternary Folds and Alignments
- Protein fold
  - Identifiable regular arrangement of secondary structural elements
- Thus far, a limited number of protein folds have been discovered (~1000)
  - Very little research on quaternary folds
- Complex structures and little labeled data
- Quaternary fold recognition
  Seq 1: APA FSVSPA … SGACGP ECAESG
  Seq 2: DSCTFT…TAAAAKAGKAKCSTITL

  Biology task | Protein fold          | Membership and non-membership proteins    | Will the protein take the fold?
  AI task      | Pattern to be induced | Training data (seq-struc pairs + physics) | Does the pattern appear in the testing sequence?

6 Related Work
- Previous work in general protein structure prediction
  - Sequence similarity perspective [Altschul et al., 1997; Durbin et al., 1998; Karplus et al., 1998; Jones, 2001]
  - Physical forces perspective [Jones, 1998]
  - Structural biology perspective [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al., 2001]
- Previous work in quaternary structure prediction
  - Mostly on partial tasks, e.g. classification of protein sequences, analysis of domain-domain docking or interaction types, and geometric regularities and constraints
- Computational challenges in viral fold recognition
  - Complex structures, insufficient data, and low sequence similarity between membership proteins

7 Conditional Random Fields
- Hidden Markov model (HMM) [Rabiner, 1989]
- Conditional random fields (CRFs) [Lafferty et al., 2001]
  - Model conditional probability directly (discriminative models, directly optimizable)
  - Allow arbitrary dependencies in observation
  - Adaptive to different loss functions and regularizers
  - Promising results in multiple applications
  - But need to scale up (computationally) and extend to long-distance dependencies
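For reference, a sketch of the linear-chain CRF in standard textbook notation. The symbols $f_k$ (feature functions), $\lambda_k$ (weights), $N$ (sequence length) and $Z(x)$ (normalizer) are the usual ones and are not taken from the slide itself:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{N} \sum_{k} \lambda_k \, f_k(x, i, y_{i-1}, y_i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i=1}^{N} \sum_{k} \lambda_k \, f_k(x, i, y'_{i-1}, y'_i) \Big)
```

Each $f_k$ receives the whole observation $x$, which is what permits arbitrary dependencies in the observation. The labels interact only through adjacent pairs $(y_{i-1}, y_i)$, which is why long-distance dependencies need an extension.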

8 Our Solution: Conditional Graphical Models
- Segmentation CRF:
  - Outputs Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i}
- Feature definition
  - Node feature
  - Local interaction feature
  - Long-range interaction feature
[Figure. Labels: Long-range dependency; Local dependency]
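The output structure Y = {M, {W_i}} can be pictured as a list of M labeled segments. Below is a minimal Python sketch, assuming p_i and q_i are the inclusive start and end positions of segment i and s_i is its state label (the slide does not spell this out). The class and function names are hypothetical, and the state names are borrowed from the alignment slide only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One W_i = {p_i, q_i, s_i} (interpretation assumed, see lead-in)."""
    start: int  # p_i: first residue index of the segment (inclusive)
    end: int    # q_i: last residue index of the segment (inclusive)
    state: str  # s_i: structural state label of the segment


def is_valid_segmentation(segments: List[Segment], seq_len: int) -> bool:
    """True if the segments tile positions 0..seq_len-1 with no gaps or overlaps."""
    expected_start = 0
    for seg in segments:
        if seg.start != expected_start or seg.end < seg.start:
            return False
        expected_start = seg.end + 1
    return expected_start == seq_len


def to_residue_labels(segments: List[Segment]) -> List[str]:
    """Expand a segmentation into one state label per residue."""
    labels: List[str] = []
    for seg in segments:
        labels.extend([seg.state] * (seg.end - seg.start + 1))
    return labels


# Y = {M, {W_i}} for an 8-residue toy sequence; M is simply len(segments).
segments = [Segment(0, 2, "B1"), Segment(3, 4, "T1"), Segment(5, 7, "B2")]
print(len(segments), is_valid_segmentation(segments, 8))
```

Labeling whole segments rather than single residues is what lets features look at a complete structural element, and at pairs of elements that are far apart in the sequence.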

9 Linked Segmentation CRF
- Node: secondary structure elements and/or simple fold
- Edges: local interactions and long-range inter-chain and intra-chain interactions
- L-SCRF: conditional probability of y given x is defined as [equation not preserved in the transcript]
[Figure label: Joint Labels]
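A hedged sketch of the general log-linear form a model like this takes, assuming node potentials over segments and edge potentials over linked segment pairs. The notation ($R$ chains, graph $G = (V_G, E_G)$, segment $W_{i,j}$ for the $j$-th segment of chain $i$, features $f_k$ and $g_l$, weights $\lambda_k$ and $\mu_l$) is assumed here for illustration and is not recovered from the slide:

```latex
P(y_1, \dots, y_R \mid x_1, \dots, x_R)
  = \frac{1}{Z} \exp\Big(
      \sum_{W_{i,j} \in V_G} \sum_{k} \lambda_k \, f_k(x_i, W_{i,j})
      + \sum_{\langle W_{a,u}, W_{b,v} \rangle \in E_G} \sum_{l} \mu_l \, g_l(x_a, x_b, W_{a,u}, W_{b,v})
    \Big)
```

Edges with $a = b$ would correspond to intra-chain interactions and edges with $a \neq b$ to inter-chain interactions, which matches the two edge types listed above.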

10 Linked Segmentation CRF (II)
- Objective: [equation not preserved in the transcript]
- Training: learn the model parameters λ
  - Minimizing regularized negative log loss
  - Iterative search algorithms by seeking the direction whose empirical values agree with the expectation
- Complex graphs result in huge computational complexity
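A sketch of what a regularized negative log-loss objective and its gradient look like for a log-linear conditional model. A Gaussian (L2) regularizer with variance $\sigma^2$ is assumed here; the slide does not name the regularizer.

```latex
L(\lambda) = -\sum_{n} \log P\big(y^{(n)} \mid x^{(n)}; \lambda\big) + \frac{\lVert \lambda \rVert^2}{2\sigma^2}

\frac{\partial L}{\partial \lambda_k}
  = -\Big( \sum_{n} f_k\big(x^{(n)}, y^{(n)}\big)
         - \sum_{n} \mathbb{E}_{P(y \mid x^{(n)}; \lambda)}\big[ f_k\big(x^{(n)}, y\big) \big] \Big)
    + \frac{\lambda_k}{\sigma^2}
```

The gradient vanishes when the empirical feature counts match their model expectations (up to the regularization term). The expectation is the expensive part, since it sums over all labelings of the graph.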

11 Approximate Inference - Learning
- Most approximation algorithms cannot handle a variable number of nodes in the graph, but we need variable graph topologies, so…
- Contrastive Divergence [Hinton & Welling, 2002]
  - Δλ_k = E_p0[f_k] - E_p1[f_k]
  - p0: estimated from empirical samples
  - p1: estimated from a few samples, starting the seeds from the empirical samples
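The update Δλ_k = E_p0[f_k] - E_p1[f_k] can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `feature_fn`, `mcmc_step`, the learning rate `lr` and the step count `n_steps` are all hypothetical placeholders supplied by the caller.

```python
from typing import Callable, List, Sequence, TypeVar

Y = TypeVar("Y")  # type of one labeling/sample


def mean_features(samples: Sequence[Y],
                  feature_fn: Callable[[Y], Sequence[float]]) -> List[float]:
    """Average feature vector over a set of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    total = [0.0] * len(feature_fn(samples[0]))
    for y in samples:
        total = [t + v for t, v in zip(total, feature_fn(y))]
    return [t / len(samples) for t in total]


def cd_update(weights: Sequence[float],
              empirical: Sequence[Y],
              mcmc_step: Callable[[Y, Sequence[float]], Y],
              feature_fn: Callable[[Y], Sequence[float]],
              n_steps: int = 1,
              lr: float = 0.1) -> List[float]:
    """One contrastive-divergence step: lambda += lr * (E_p0[f] - E_p1[f])."""
    # p0: feature expectations under the empirical samples
    e_p0 = mean_features(empirical, feature_fn)
    # p1: run a few MCMC steps, seeding each chain at an empirical sample
    perturbed = []
    for y in empirical:
        for _ in range(n_steps):
            y = mcmc_step(y, weights)
        perturbed.append(y)
    e_p1 = mean_features(perturbed, feature_fn)
    return [w + lr * (a - b) for w, a, b in zip(weights, e_p0, e_p1)]
```

Seeding the chains at the training samples is what keeps the cost low: only a few sampler steps are run instead of waiting for the chain to reach equilibrium.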

12 Approximate Inference - Inference
- Reversible jump MCMC sampling [Green, 1995; Schmidler et al., 2001] with four types of Metropolis operators
  - State switching
  - Position switching
  - Segment split
  - Segment merge
- MAP estimate using simulated annealing reversible jump MCMC [Andrieu et al., 2000]
  - Replace the sampling step with RJ MCMC
  - Theoretically converges to the global optimum
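The four move types can be sketched as pure functions on a segmentation, here a list of `(start, end, state)` tuples with inclusive ends. This is a simplified illustration with hypothetical function names. The acceptance rule shown is a plain annealed Metropolis test on log-scores, and it deliberately omits the proposal-probability ratio and Jacobian that a full reversible-jump acceptance ratio needs for the dimension-changing moves (split and merge).

```python
import math
import random
from typing import List, Tuple

Seg = Tuple[int, int, str]  # (start, end, state), end inclusive


def switch_state(segs: List[Seg], i: int, new_state: str) -> List[Seg]:
    """State switching: relabel segment i."""
    out = list(segs)
    start, end, _ = out[i]
    out[i] = (start, end, new_state)
    return out


def switch_position(segs: List[Seg], i: int, new_end: int) -> List[Seg]:
    """Position switching: move the boundary between segments i and i+1."""
    s0, _, st0 = segs[i]
    _, e1, st1 = segs[i + 1]
    if not (s0 <= new_end < e1):
        raise ValueError("boundary must leave both segments non-empty")
    out = list(segs)
    out[i] = (s0, new_end, st0)
    out[i + 1] = (new_end + 1, e1, st1)
    return out


def split_segment(segs: List[Seg], i: int, cut: int, right_state: str) -> List[Seg]:
    """Segment split: cut segment i after position `cut` (adds one segment)."""
    start, end, state = segs[i]
    if not (start <= cut < end):
        raise ValueError("cut must leave both halves non-empty")
    segs = list(segs)
    return segs[:i] + [(start, cut, state), (cut + 1, end, right_state)] + segs[i + 1:]


def merge_segments(segs: List[Seg], i: int, state: str) -> List[Seg]:
    """Segment merge: fuse segments i and i+1 (removes one segment)."""
    segs = list(segs)
    start, _, _ = segs[i]
    _, end, _ = segs[i + 1]
    return segs[:i] + [(start, end, state)] + segs[i + 2:]


def annealed_accept(score_old: float, score_new: float,
                    temperature: float, rng=random) -> bool:
    """Annealed Metropolis test on log-scores (simplified, see lead-in)."""
    if score_new >= score_old:
        return True
    return rng.random() < math.exp((score_new - score_old) / temperature)
```

Lowering `temperature` over the iterations makes downhill moves increasingly unlikely, so the chain settles on a high-scoring segmentation, which is the role simulated annealing plays in the MAP estimate.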

13 Experiments: Target Quaternary Fold
- Triple beta-spirals [van Raaij et al., Nature 1999]
  - Virus fibers in adenovirus, reovirus and PRD1
- Double barrel trimer [Benson et al., 2004]
  - Coat protein of adenovirus, PRD1, STIV, PBCV

14 Features for Protein Fold Recognition

15 Experiment Results: Fold Recognition
[Figure. Panel titles: Double barrel-trimer; Triple beta-spirals]

16 Experiment Results: Alignment Prediction
- Triple beta-spirals
- Four states: B1, B2, T1 and T2
- Correct alignment: B1: i-o; B2: a-h
[Figure: predicted alignment, with B1 and B2 marked]

17 Experiment Results: Discovery of New Membership Proteins
- Predicted membership proteins of triple beta-spirals can be accessed at http://www.cs.cmu.edu/~yanliu/swissprot_list.xls
- Membership proteins of double barrel-trimer suggested by biologists [Benson, 2005] compared with L-SCRF predictions

18 Conclusion
- Conditional graphical models for protein structure prediction
  - Effective representation for protein structural properties
  - Feasibility to incorporate different kinds of informative features
  - Efficient inference algorithms for large-scale applications
- A major extension compared with previous work
  - Knowledge representation through graphical models
  - Ability to handle long-range interactions within one chain and between chains
- Future work
  - Automatic learning of graph topology
  - Applications to other domains


20 Tertiary Fold Recognition: β-Helix Fold
[Figure: histogram and ranks for known β-helices against the PDB-minus dataset]
- Chain graph model reduces the real running time of the SCRFs model by around 50 times

21 Fold Alignment Prediction: β-Helix
[Figure: predicted alignment for known β-helices on cross-family validation]

22 Discovery of New Potential β-helices
- Run structural predictor seeking potential β-helices from Uniprot (structurally unresolved) databases
  - Full list (98 new predictions) can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html
- Verification on 3 proteins with later experimentally resolved structures from different organisms
  - 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
  - 1PXZ: The Major Allergen From Cedar Pollen
  - GP14 of Shigella bacteriophage as a β-helix protein
  - Not a single false positive!

23 Previous Work
- Sequence similarity perspective
  - Sequence similarity searches, e.g. PSI-BLAST [Altschul et al., 1997]
  - Profile HMM, e.g. HMMER [Durbin et al., 1998] and SAM [Karplus et al., 1998]
  - Window-based methods, e.g. PSI_pred [Jones, 2001]
- Physical forces perspective
  - Homology modeling or threading, e.g. Threader [Jones, 1998]
- Structural biology perspective
  - Painstakingly hand-engineered methods for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al., 2001]
- Limitations:
  - Generative models based on a rough approximation of free energy perform very poorly on complex structures
  - Very hard to generalize due to built-in constants and fixed features
  - Fail to capture the structural properties and long-range dependencies

24 Graphical Models
- A graphical model is a graph representation of probability dependencies [Pearl 1993; Jordan 1999]
  - Node: random variables
  - Edges: dependency relations
- Directed graphical model (Bayesian networks)
- Undirected graphical model (Markov random fields)
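The two families differ in how the joint distribution factorizes over the graph. A sketch in standard notation, where $\mathrm{pa}(x_i)$ (parents of node $i$), $\mathcal{C}$ (cliques of the graph) and $\psi_c$ (clique potentials) are textbook symbols not shown on the slide:

```latex
\text{Directed (Bayesian network):} \quad
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\big(x_i \mid \mathrm{pa}(x_i)\big)

\text{Undirected (Markov random field):} \quad
P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(x_c),
\qquad
Z = \sum_{x} \prod_{c \in \mathcal{C}} \psi_c(x_c)
```

A conditional random field is the undirected form with every potential also conditioned on the observation, so the normalizer becomes $Z(x)$.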

