Presentation is loading. Please wait.

Presentation is loading. Please wait.

Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie.

Similar presentations


Presentation on theme: "Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie."— Presentation transcript:

1 Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie Mellon University Thesis Committee Jaime Carbonell (Chair) John Lafferty Eric P. Xing Vanathi Gopalakrishnan (Univ. of Pittsburgh) PhD Thesis Proposal

2 Carnegie Mellon School of Computer Science 2 Proteins in Our Life Nobelprize.org Predict protein structures from sequences

3 Carnegie Mellon School of Computer Science 3 Protein Structure Hierarchy Our focus: predicting the topology of the structures

4 Carnegie Mellon School of Computer Science 4 Previous work General approaches for structural motif recognition  Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]  Profile HMM,.e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]  Homology modeling or threading, e.g. Threader [Jones, 1998]  Window-based methods, e.g. PSI_pred [Jones, 2001] Methods of careful design for specific structure motifs  Example: αα- and ββ- hairpins, β-turn and β-helix - Structural similarity without clear sequence similarity - Long-range interactions - Hard to generalize

5 Carnegie Mellon School of Computer Science 5 Missing pieces Informative features without clear interpretation No principled models to formulate the structured properties of proteins Discriminative models Graphical models Conditional graphical models for protein structure prediction

6 Carnegie Mellon School of Computer Science 6 Conditional Graphical Models (CGM) Structural prediction is reduced to segmentation and labeling problem Define segment semanticsstructural modules  W = (M, {Wi}), M: # of segments, Wi: configuration of segment i The conditional probability of a possible segmentation W given the observation x is defined as

7 Carnegie Mellon School of Computer Science 7 Advantages of Conditional Graphical Models Flexible feature definition and kernelization Different loss functions and regularizers Convex functions for globally optimal solutions Structured information, especially the long-range interactions

8 Carnegie Mellon School of Computer Science 8 Training and Testing Training phase : learn the model parameters  Minimizing regularized log loss  Seek the direction whose empirical values agree with the expectation  Iterative searching algorithms have to be applied Testing phase: search the segmentation that maximizes P(w| x )

9 Carnegie Mellon School of Computer Science 9 Thesis Work Conditional graphical models for protein structure prediction

10 Carnegie Mellon School of Computer Science 10 Model Roadmap Conditional random fields Kernel CRFs Segmentation CRFs Chain graph model Factorial segmentation CRFs Kernelized Segmentation Locally normalized Factorized

11 Carnegie Mellon School of Computer Science 11 Outline Conditional graphical models for protein structure prediction

12 Carnegie Mellon School of Computer Science 12 Task Definition and Evaluation Materials Given a protein sequence, predict its secondary structure assignments  Three class: helix (C), sheets (E) and coil (C) APAFSVSPASGACGP ECA CCEEEEECCCCCHH HCCC Protein Secondary Structure Prediction

13 Carnegie Mellon School of Computer Science 13 CGM on Secondary Structure Prediction [Liu, Carbonell, Klein-Seetharaman and Gopalakrishnan, Bioinformatics 2004] Structural component  Individual residues Segmentation definition  W = (n, Y), n: # of residues, Yi = H, E or C Specific models  Conditional random fields (CRFs)  Kernel conditional random fields (kCRFs) where Protein Secondary Structure Prediction

14 Carnegie Mellon School of Computer Science 14 Target Directions Protein Secondary Structure Prediction Input Features (PSI-BLAST profile) Prediction Combination (CRFs) Feature Exploration (KCRFs) Learning Algorithm (SVM) Beta-sheet Detection

15 Carnegie Mellon School of Computer Science 15 Prediction Combination Previous work  Window-based label combination Rule-based algorithm  Window-based score combination SVM, Neural networks Other graphical models for score combination  Maximum entropy Markov model (MEMM)  Higher-order MEMMs (HOMEMM)  Pseudo state duration MEMMs (PSMEMM) Protein Secondary Structure Prediction MEMM HOMEMM

16 Carnegie Mellon School of Computer Science 16 Experiment Results Protein Secondary Structure Prediction - Prediction Combination Graphical models are consistently better than the window-based approaches CRFs perform the best among the four graphical models

17 Carnegie Mellon School of Computer Science 17 Experiment Results Offset from the correct register for the correctly predicted beta-strand pairs (CB513) Protein Secondary Structure Prediction - Beta-sheet detection Considerable improvement in prediction performance Prediction accuracy for beta-sheets

18 Carnegie Mellon School of Computer Science 18 Experiment Results Prediction accuracy for KCRFs with vertex cliques (V) and edge cliques (E)  PSI-BLAST profile with RBF kernel Protein Secondary Structure Prediction - Feature exploration

19 Carnegie Mellon School of Computer Science 19 Summary Conditional graphical model for protein secondary structure prediction  Conditional random fields (CRFs)  Kernel conditional random fields (KCRFs) Protein Secondary Structure Prediction

20 Carnegie Mellon School of Computer Science 20 Outline Conditional graphical models for protein structure prediction

21 Carnegie Mellon School of Computer Science 21 Task Definition and Evaluation Materials..APAFSVSPASGACGPECA.. Contains the structural motif?..NNEEEEECCCCCHHHCCC.. Structural motif recognition Structural motif  Regular arrangement of secondary structural elements  Super-secondary structure, or protein fold Yes

22 Carnegie Mellon School of Computer Science 22 CGM for Structural Motif Recognition Structural component  Secondary structure elements Protein structural graph  Nodes for the states of secondary structural elements of unknown lengths  Edges for interactions between nodes in 3-D Example: β-α-β motif Tradeoff between fidelity of the model and graph complexity Structural Motif Recognition

23 Carnegie Mellon School of Computer Science 23 CGM for Structural Motif Recognition Segmentation definition  Yi: state of segment i Si: the length of segment i M: number of segments Segmentation conditional random fields (SCRFs)  For any graph, we have  For a simplified graph, we have Structural Motif Recognition

24 Carnegie Mellon School of Computer Science 24 Protein Folds with Structural Repeats Prevalent in proteins and important in functions Each repeats consists of structural motifs and insertions Challenge  Low sequence similarity in structural motifs  Long-range interactions Structural Motif Recognition

25 Carnegie Mellon School of Computer Science 25 CGM for Structural Motif Recognition Chain graph  A graph consisting of directed and undirected graphs  Given a variable set V that forms multiple subgraphs U Segmentation defintion  Two layer segmentation W = {M, {Ξ i }, T } M: # of envelops Ξ i : the segmentation of envelops Ti: the state of envelop i = repeat, non-repeat Chain graph model Structural Motif Recognition

26 Carnegie Mellon School of Computer Science 26 Experiments Right-handed β-helix fold  An elongated helix-like structures whose successive rungs composed of three parallel β - strands (B1, B2, B3 strands)  T2 turn: a unique two-residue turn  Low sequence identity Leucine-rich repeats (LLR)  Solenoid-like regular arrangement of beta-strands and an alpha-helix, connected by coils  Relatively high sequence identity (many Leucines) Structural Motif Recognition

27 Carnegie Mellon School of Computer Science 27 Experiment Results Cross-family validation for classifying β -helix proteins  SCRFs can score all known β -helices higher than non β -helices Structural Motif Recognition SCRFs model

28 Carnegie Mellon School of Computer Science 28 Experiment Results Predicted Segmentation for known Beta-helices Structural Motif Recognition SCRFs model

29 Carnegie Mellon School of Computer Science 29 Experiment Results Histograms of scores for known β- helices against PDB-minus dataset  18 non β-helix proteins have a score higher than 0  13 from β-class and 5 from α/β class  Most confusing proteins: β-sandwiches and left-handed β-helix 5 Structural Motif Recognition SCRFs model

30 Carnegie Mellon School of Computer Science 30 Experiment Results Verification on recently crystallized structures Successfully identify  1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase score 10.47  1PXZ: Jun A 1, The Major Allergen From Cedar Pollen score 32.35  GP14 of Shigella bacteriophage as a β-helix protein with scoring 15.63 Structural Motif Recognition SCRFs model

31 Carnegie Mellon School of Computer Science 31 Experiment Results Cross-family validation for classifying β -helix proteins  Chain graph model can score all known β -helices higher than non β -helices Structural Motif Recognition Chain graph model

32 Carnegie Mellon School of Computer Science 32 Experiment Results Cross-family validation for classifying LLR proteins  Chain graph model can score all known LLR higher than non- LLR Structural Motif Recognition Chain graph model

33 Carnegie Mellon School of Computer Science 33 Experiment Results Predicted Segmentation for known Beta-helices and LLRs Structural Motif Recognition Chain graph model

34 Carnegie Mellon School of Computer Science 34 Further Experiments Virus proteins  Noncellular biological entity that can reproduce only within a host cell  Dynamical properties  Various kinds of viruses Adenovirus (common cold) Bacteriophage (which infects bacteria) DNA virus, RNA virus Structural Motif Recognition Gp 41: core protein of HIV virus (1AIK)

35 Carnegie Mellon School of Computer Science 35 Summary Segmented conditional graphical models for structural motif recognition  Segmentation conditional random fields  Chain graph model Successful applications  Right-handed beta-helices  Leucine-rich repeats Further verfication  Virus spike folds and others Structural Motif Recognition

36 Carnegie Mellon School of Computer Science 36 Outline Conditional graphical models for protein structure prediction

37 Carnegie Mellon School of Computer Science 37 Quaternary Structure Prediction Quaternary structures  Multiple chains associated together through noncovalent bonds or disulfide bonds Classes of quaternary structures  Based on the number of subunits  Based on the identity of the subunits Homo-oligomers* and hetero-oligomers Very few related research work..APAFSVSPASGACGPECA.. Contains the quaternary structures?..NNEEEEECCCCCHHHCCC.. Yes

38 Carnegie Mellon School of Computer Science 38 Proposed Work Structural component  Secondary structure elements  Super-secondary structure elements Protein structural graph  Nodes for the states of secondary structural or super-secondary structural elements of unknown lengths  Edges for interactions between nodes in 3-D  Protocol: nodes representing secondary structure elements must involve long-range interactions Quaternary structure prediction

39 Carnegie Mellon School of Computer Science 39 Proposed Work Segmentation definition  W = (M, {Wi}), M: # of segments Wi: configuration of segment i Factorial segmentation conditional random fields R: # of chains in quaternary structures Quaternary structure prediction

40 Carnegie Mellon School of Computer Science 40 Experiment Design Triple beta-spirals  Described by van Raaij et al. in Nature (1999)  Clear sequence repeats  Two proteins with crystallized structures and about 20 without structure annotation Tripe beta-helices  Described by van Raaij et al. in JMB (2001)  Structurally similar to beta-helix without clear sequence similarity  Two proteins with crystallized structures  Information from beta-helices can be used Both folds characterized by unusual stability to heat, protease, and detergent Quaternary structure prediction

41 Carnegie Mellon School of Computer Science 41 Summary Conditional graphical models for protein structure prediction

42 Carnegie Mellon School of Computer Science 42 Expected Contribution Computational contribution  Conditional graphical models for structured data prediction  “First effective models for the long-range interactions” Biological contribution  Improvement in protein structure prediction  Hypothesis on potential proteins with biological important folds  Aids for function analysis and drug design

43 Carnegie Mellon School of Computer Science 43 Timeline Jun, 2005 -- Aug, 2005  Data collection for triple beta-spirals and triple beta-helices  Preliminary investigation for virus protein folds Sep, 2005 -- Nov, 2005  Implementation and Testing the model on synthetic data and real data  Determining the specific virus proteins to work on Nov, 2005 -- Jan, 2006  Virus protein fold recognition  Analysis the properties for virus protein folds Feb, 2006 -- June, 2006:  Virus protein fold recognition  Investigation the possibility for protein function prediction or information extraction July, 2006 -- Aug, 2006  Writing up thesis

44 Carnegie Mellon School of Computer Science 44

45 Carnegie Mellon School of Computer Science 45 Features Node features  Regular expression template, HMM profiles  Secondary structure prediction scores  Segment length Inter-node features  β -strand Side-chain alignment scores  Preferences for parallel alignment scores  Distance between adjacent B23 segments Features are general and easy to extend Structural Motif Recognition

46 Carnegie Mellon School of Computer Science 46 Evaluation Measure Q3 (accuracy) Precision, Recall Segment Overlap quantity (SOV) Matthew’s Correlation coefficients P +P- T +Pu T -on

47 Carnegie Mellon School of Computer Science 47 Local Information PSI-blast profile  Position-specific scoring matrices (PSSM)  Linear transformation[Kim & Park, 2003] SVM classifier with RBF kernel Feature#1 (Si): Prediction score for each residue Ri

48 Carnegie Mellon School of Computer Science 48 Previous Work Huge literature over decades  Window-based methods  Hidden Markov models Major breakthroughs in recent years  Combine the predictions from neighboring residues or various methods (Cuff & Barton, 1999)  Explore evolutionary information from sequences (Jones, 1999)  Specific algorithm for beta-sheets or infer the paring of beta-strands (Meiler & Baker, 2003) Protein Secondary Structure Prediction


Download ppt "Carnegie Mellon School of Computer Science 1 Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute Carnegie."

Similar presentations


Ads by Google