
1 Conditional Graphical Models for Protein Structure Prediction
PhD Thesis Proposal
Yan Liu, Language Technologies Institute, Carnegie Mellon University
Carnegie Mellon School of Computer Science
Thesis Committee: Jaime Carbonell (Chair), John Lafferty, Eric P. Xing, Vanathi Gopalakrishnan (Univ. of Pittsburgh)

2 Proteins in Our Life
- Predict protein structures from sequences
- (Image source: Nobelprize.org)

3 Protein Structure Hierarchy
- Focus on predicting the topology of the structures rather than the specific 3-D coordinates

4 A Machine Learning View of Previous Work
- Ab initio: similar to a generative model
- Fold recognition: a classification problem
- Homology modeling: nearest neighbor

5 Missing Pieces
- Informative features without clear understanding
- No explicit models to consider the structured properties of proteins
- Building blocks: discriminative models, graphical models
- Proposal: conditional graphical models for protein structure prediction

6 Conditional Graphical Models
- Define segment semantics in terms of structural modules
  - W = (M, {W_i}), where M is the number of segments and W_i is the configuration of segment i
- The conditional probability of a possible segmentation W given the observation x is defined as: [equation not preserved]
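
For orientation, conditional models of this kind are usually written as a log-linear distribution over the cliques of the structural graph. The form below is a sketch in standard CRF notation, not the formula recovered from the slide: the clique set C_G, the feature functions f_k, the weights λ_k, and the normalizer Z(x) are all notation assumed here.

```latex
P(W \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k \, f_k(x, W_c) \Big),
\qquad
Z(x) \;=\; \sum_{W'} \exp\!\Big( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k \, f_k(x, W'_c) \Big)
```

Here W_c denotes the part of the segmentation W that falls in clique c, and the sum in Z(x) ranges over all admissible segmentations of x.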

7 Conditional Graphical Models
- Flexible feature definition and kernelization
- Applicable to different loss functions and regularizers
- Convex functions for globally optimal solutions
- Capture the structured information directly, especially the long-range interactions

8 Training and Testing
- Training phase: learn the model parameters
  - Minimize the regularized log loss
  - Seek the direction in which the empirical values agree with the expectation
  - An iterative search algorithm has to be applied
- Testing phase: search for the segmentation that maximizes P(W|x)
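
The two phases can be illustrated on the simplest case, a linear-chain model. The following is a minimal sketch, not the thesis implementation: it assumes the features have already been collapsed into per-position label scores (`NODE`) and label-transition scores (`TRANS`), both hypothetical toy values, and it omits the regularizer. `log_loss` is the quantity that training would minimize, and `viterbi` performs the test-time search for the highest-scoring labeling.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(node, trans, labels):
    """Unnormalized log-score of one complete label sequence."""
    score = node[0][labels[0]]
    for t in range(1, len(labels)):
        score += trans[labels[t - 1]][labels[t]] + node[t][labels[t]]
    return score

def log_partition(node, trans):
    """log Z(x), computed with the forward algorithm."""
    k = len(node[0])
    alpha = list(node[0])
    for t in range(1, len(node)):
        alpha = [node[t][j] + logsumexp([alpha[i] + trans[i][j] for i in range(k)])
                 for j in range(k)]
    return logsumexp(alpha)

def log_loss(node, trans, labels):
    """Negative conditional log-likelihood of the given labeling (no regularizer)."""
    return log_partition(node, trans) - path_score(node, trans, labels)

def viterbi(node, trans):
    """Label sequence with the highest score, i.e. the argmax of P(labels | x)."""
    k = len(node[0])
    best = list(node[0])
    back = []
    for t in range(1, len(node)):
        pointers, nxt = [], []
        for j in range(k):
            i_best = max(range(k), key=lambda i: best[i] + trans[i][j])
            pointers.append(i_best)
            nxt.append(best[i_best] + trans[i_best][j] + node[t][j])
        back.append(pointers)
        best = nxt
    j = max(range(k), key=lambda s: best[s])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]

# Hypothetical toy scores: 4 positions, 3 labels (e.g. H, E, C).
NODE = [[1.2, 0.1, 0.3], [0.2, 1.5, 0.4], [0.3, 1.1, 0.9], [0.8, 0.2, 1.4]]
TRANS = [[0.9, -0.5, 0.1], [-0.4, 1.0, 0.2], [0.0, 0.3, 0.6]]

print(viterbi(NODE, TRANS), log_loss(NODE, TRANS, viterbi(NODE, TRANS)))
```

In a full trainer the gradient of this loss with respect to each weight is the expected feature value minus the empirical one, which is why training stops where the two agree.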

9 Proposed Work
- Conditional graphical models for protein structure prediction

10 Model Roadmap
- Conditional random fields
- Kernel CRFs (kernelized)
- Segmentation CRFs (segmented)
- Chain graph model (locally normalized)
- Factorial segmentation CRFs (factorized)

11 Outline
- Conditional graphical models for protein structure prediction

12 Task Definition and Evaluation Materials (Protein Secondary Structure Prediction)
- Given a protein sequence, predict its secondary structure assignments
  - Three classes: helix (H), sheet (E), and coil (C)
- Focus on globular proteins
- Experiment materials
  - Datasets: RS126 and CB513
  - Evaluation measures: accuracy, precision, recall, segment overlap (SOV)
- Example
  - Sequence: APAFSVSPASGACGPECA
  - Labels:   CCEEEEECCCCCHHHCCC
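
The per-residue measures listed above can be computed directly from two label strings. This is a minimal sketch: `TRUE` is the label string from the slide's example, while `PRED` is a hypothetical prediction invented for illustration. SOV is segment-based and is not covered here.

```python
def q3_accuracy(true, pred):
    """Fraction of residues whose predicted class (H, E or C) matches the true one."""
    if len(true) != len(pred):
        raise ValueError("label sequences must have equal length")
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def precision_recall(true, pred, label):
    """Per-class precision and recall; returns 0.0 where the denominator is zero."""
    hits = sum(t == label and p == label for t, p in zip(true, pred))
    predicted = sum(p == label for p in pred)
    actual = sum(t == label for t in true)
    precision = hits / predicted if predicted else 0.0
    recall = hits / actual if actual else 0.0
    return precision, recall

TRUE = "CCEEEEECCCCCHHHCCC"  # labels from the slide's example
PRED = "CCEEEECCCCCCHHHHCC"  # hypothetical prediction, for illustration only

print(q3_accuracy(TRUE, PRED))
for cls in "HEC":
    print(cls, precision_recall(TRUE, PRED, cls))
```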

13 Previous Work (Protein Secondary Structure Prediction)
- Huge literature over decades
  - Window-based methods
  - Hidden Markov models
- Major breakthroughs in recent years
  - Combining the predictions from neighboring residues or from various methods (Cuff & Barton, 1999)
  - Exploring evolutionary information from sequences (Jones, 1999)
  - Specific algorithms for beta-sheets, or inferring the pairing of beta-strands (Meiler & Baker, 2003)

14 Proposed Work (Protein Secondary Structure Prediction)
- Structural component: individual residues
- Protein structural graph: a chain structure
- Segmentation definition: W = (n, Y), where n is the number of residues and Y_i = H, E or C
- Specific models
  - Conditional random fields (CRFs)
  - Kernel conditional random fields (KCRFs)
  - [model equations not preserved]

15 Target Directions (Protein Secondary Structure Prediction)
- Input features (PSI-BLAST profile)
- Prediction combination (CRFs)
- Feature exploration (KCRFs)
- Learning algorithm (SVM)
- Beta-sheet detection

16 Prediction Combination (Protein Secondary Structure Prediction)
- Previous work
  - Window-based label combination: rule-based algorithm
  - Window-based score combination: SVM, neural networks
- Other graphical models for score combination
  - Maximum entropy Markov model (MEMM)
  - Higher-order MEMMs (HOMEMM)
  - Pseudo state duration MEMMs (PSMEMM)
- (Figure: graph structures of MEMM and HOMEMM)

17 Experiment Results (Protein Secondary Structure Prediction: Prediction Combination)
- Graphical models are consistently better than the window-based approaches
- CRFs perform the best among the four graphical models

18 Beta-sheet Detection (Protein Secondary Structure Prediction)
- Given a protein sequence, predict where the beta-strands are and how they are aligned to form beta-sheets
- Example
  - Sequence: APAFSVSPASGACGPECA
  - Labels:   NNEEEEENNNEEEEENN

19 Experiment Results (Protein Secondary Structure Prediction: Beta-sheet Detection)
- Shown: prediction accuracy for beta-sheets
- Shown: offset from the correct register for the correctly predicted beta-strand pairs (CB513)
- Considerable improvement in prediction performance

20 Experiment Results (Protein Secondary Structure Prediction: Feature Exploration)
- Prediction accuracy for KCRFs with vertex cliques (V) and edge cliques (E)
  - PSI-BLAST profile with RBF kernel

21 Summary (Protein Secondary Structure Prediction)
- Conditional graphical model with a chain structure for protein secondary structure prediction
  - Conditional random fields (CRFs)
  - Kernel conditional random fields (KCRFs)

22 Outline
- Conditional graphical models for protein structure prediction

23 Task Definition and Evaluation Materials (Structural Motif Recognition)
- Structural motif
  - Regular arrangement of secondary structural elements
  - Super-secondary structure, or protein fold
- Example
  - Sequence: ..APAFSVSPASGACGPECA..
  - Labels:   ..NNEEEEECCCCCHHHCCC..
  - Contains the structural motif? Yes
- Experiment materials
  - Dataset: PDB-minus dataset
  - Cross-family validation
  - Evaluation measures: rank of scores and segmentation results

24 Previous Work (Structural Motif Recognition)
- General approaches for structural motif recognition
  - Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]
  - Profile HMMs, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]
  - Homology modeling or threading, e.g. Threader [Jones, 1998]
- Carefully designed methods for specific structural motifs
  - Examples: αα- and ββ-hairpins, β-turn and β-helix
- Difficulties noted on the slide
  - Structural similarity without clear sequence similarity
  - Long-range interactions
  - Hard to generalize
- Our goal is a general probabilistic framework that addresses all of these problems for structural motif prediction

25 Proposed Work (I) (Structural Motif Recognition)
- Structural component: secondary structure elements
- Protein structural graph
  - Nodes for the states of secondary structural elements of unknown lengths
  - Edges for interactions between nodes in 3-D
- Example: β-α-β motif
- Tradeoff between fidelity of the model and graph complexity

26 Proposed Work (II) (Structural Motif Recognition)
- Segmentation definition
  - W = (M, {W_i}), where M is the number of segments and W_i = {(Y_i, S_i)}
  - Y_i: the state of segment i
  - S_i: the length of segment i
- Segmentation conditional random fields (SCRFs)
  - For any graph, we have: [equation not preserved]
  - For a simplified graph, we have: [equation not preserved]
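
To make the segmentation W = (M, {(Y_i, S_i)}) concrete, the sketch below finds the best-scoring segmentation when scores depend only on a segment and the state of the segment before it. This is a generic semi-Markov dynamic program offered as an illustration, not the inference procedure from the thesis, and it ignores long-range edges. The function `toy_score`, the string `HINT`, and the state names are hypothetical.

```python
def best_segmentation(n, states, max_len, seg_score):
    """Return the highest-scoring list of (state, length) segments covering n positions.

    seg_score(prev_state, state, start, end) scores the segment [start, end);
    prev_state is None for the first segment.
    """
    neg_inf = float("-inf")
    best = [dict.fromkeys(states, neg_inf) for _ in range(n + 1)]
    back = [dict.fromkeys(states) for _ in range(n + 1)]
    for end in range(1, n + 1):
        for state in states:
            for length in range(1, min(max_len, end) + 1):
                start = end - length
                # The first segment has no predecessor; later ones try every previous state.
                prev_options = [None] if start == 0 else states
                for prev in prev_options:
                    base = 0.0 if prev is None else best[start][prev]
                    cand = base + seg_score(prev, state, start, end)
                    if cand > best[end][state]:
                        best[end][state] = cand
                        back[end][state] = (start, prev)
    # Backtrack from the best final state to recover the segments.
    state = max(states, key=lambda s: best[n][s])
    segments, end = [], n
    while end > 0:
        start, prev = back[end][state]
        segments.append((state, end - start))
        end, state = start, prev
    return segments[::-1]

HINT = "BBBLLBBB"  # hypothetical per-residue preference: B = strand, L = loop

def toy_score(prev, state, start, end):
    """+1 per residue agreeing with HINT, -1 otherwise, minus a flat cost per segment."""
    match = sum(1 if HINT[i] == state else -1 for i in range(start, end))
    return match - 0.5

SEGMENTS = best_segmentation(len(HINT), ["B", "L"], 5, toy_score)
print(SEGMENTS)
```

The returned list plays the role of {(Y_i, S_i)}, and its length is M. The `max_len` bound keeps the search cost proportional to n × max_len × (number of states)².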

27 Protein Folds with Structural Repeats (Structural Motif Recognition)
- Prevalent in proteins and important for function
- Each repeat consists of structural motifs and insertions
- Challenges
  - Low sequence similarity in structural motifs
  - Low structural conservation in insertions
  - Long-range interactions

28 Proposed Work (III) (Structural Motif Recognition)
- Chain graph
  - A graph consisting of directed and undirected graphs
  - Given a variable set V that forms multiple subgraphs U: [equation not preserved]
- Segmentation definition
  - Two-layer segmentation W = {M, {Ξ_i}, T}
  - M: the number of envelopes
  - Ξ_i: the segmentation of envelope i
  - T_i: the state of envelope i (repeat or non-repeat)
- Chain graph model

29 Experiments (Structural Motif Recognition)
- Right-handed β-helix fold
  - An elongated helix-like structure whose successive rungs are composed of three parallel β-strands (B1, B2, B3 strands)
  - T2 turn: a unique two-residue turn
  - Low sequence identity
- Leucine-rich repeats (LRR)
  - Solenoid-like regular arrangement of beta-strands and an alpha-helix, connected by coils
  - Relatively high sequence identity (many leucines)

30 Experiment Results (Structural Motif Recognition: SCRFs model)
- Cross-family validation for classifying β-helix proteins
  - SCRFs can score all known β-helices higher than non-β-helices

31 Experiment Results (Structural Motif Recognition: SCRFs model)
- Predicted segmentation for known β-helices

32 Experiment Results (Structural Motif Recognition: SCRFs model)
- Histograms of scores for known β-helices against the PDB-minus dataset
  - 18 non-β-helix proteins have a score higher than 0
  - 13 from the β class and 5 from the α/β class
  - Most confusing proteins: β-sandwiches and left-handed β-helices

33 Experiment Results (Structural Motif Recognition: SCRFs model)
- Verification on recently crystallized structures
- Successfully identified as β-helix proteins
  - 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase, score 10.47
  - 1PXZ: Jun A 1, The Major Allergen From Cedar Pollen, score 32.35
  - GP14 of Shigella bacteriophage, score 15.63

34 Experiment Results (Structural Motif Recognition: Chain graph model)
- Cross-family validation for classifying β-helix proteins
  - The chain graph model can score all known β-helices higher than non-β-helices

35 Experiment Results (Structural Motif Recognition: Chain graph model)
- Cross-family validation for classifying LRR proteins
  - The chain graph model can score all known LRR proteins higher than non-LRR proteins

36 Experiment Results (Structural Motif Recognition: Chain graph model)
- Predicted segmentation for known β-helices and LRRs

37 Summary (Structural Motif Recognition)
- Segmented conditional graphical models for structural motif recognition
  - Segmentation conditional random fields
  - Chain graph model
- Successful applications
  - Right-handed beta-helices
  - Leucine-rich repeats

38 Outline
- Conditional graphical models for protein structure prediction

39 Quaternary Structure Prediction
- Quaternary structures
  - Multiple chains associated together through noncovalent bonds or disulfide bonds
- Classes of quaternary structures
  - Based on the number of subunits: monomer, dimer, trimer, tetramer, etc.
  - Based on the identity of the subunits: homo-oligomers and hetero-oligomers
- Symmetrical structural arrangement of oligomeric proteins
- Very little related research work
- Example
  - Sequence: ..APAFSVSPASGACGPECA..
  - Labels:   ..NNEEEEECCCCCHHHCCC..
  - Contains the quaternary structure? Yes

40 Proposed Work (Quaternary Structure Prediction)
- Structural component
  - Secondary structure elements
  - Super-secondary structure elements
- Protein structural graph
  - Nodes for the states of secondary structural or super-secondary structural elements of unknown lengths
    - Secondary structure elements must involve long-range interactions
  - Edges for interactions between nodes in 3-D

41 Proposed Work (Quaternary Structure Prediction)
- Segmentation definition
  - W = (M, {W_i}), where M is the number of segments and W_i = {(Y_i, S_i)}
  - Y_i: the state of segment i
  - S_i: the length of segment i
- Factorial segmentation conditional random fields
  - R: the number of chains in the quaternary structure

42 Experiment Design (Quaternary Structure Prediction)
- Triple beta-spirals
  - Described by van Raaij et al. in Nature (1999)
  - Clear sequence repeats
  - Two proteins with crystallized structures and about 20 without structure annotation
- Triple beta-helices
  - Described by van Raaij et al. in JMB (2001)
  - Structurally similar to the beta-helix, without clear sequence similarity
  - Two proteins with crystallized structures
  - Information from beta-helices can be used
- Both folds are characterized by unusual stability to heat, protease, and detergent

43 Experiment Design (Structural Motif and Quaternary Structure Prediction)
- Virus proteins
  - A virus is a noncellular biological entity that can reproduce only within a host cell
  - Dynamical properties
  - Various kinds of viruses
    - Adenovirus (common cold)
    - Bacteriophage (which infects bacteria)
    - DNA virus, RNA virus
- (Figure: Gp41, core protein of HIV (1AIK))

44 Summary
- Conditional graphical models for protein structure prediction

45 Expected Contribution
- Computational contribution
  - Conditional graphical models for structured data prediction
  - "First effective models for the long-range interactions"
- Biological contribution
  - Improvement in protein structure prediction
  - Hypotheses on potential proteins with biologically important folds
  - Aids for function analysis and drug design

46 Timeline
- Jun 2005 to Aug 2005
  - Data collection for triple beta-spirals and triple beta-helices
  - Preliminary investigation of virus protein folds
- Sep 2005 to Nov 2005
  - Implementing and testing the model on synthetic data and real data
  - Determining the specific virus proteins to work on
- Nov 2005 to Jan 2006
  - Virus protein fold recognition
  - Analyzing the properties of virus protein folds
- Feb 2006 to Jun 2006
  - Virus protein fold recognition
  - Investigating the possibility of protein function prediction or information extraction
- Jul 2006 to Aug 2006
  - Writing up the thesis


48 Features (Structural Motif Recognition)
- Node features
  - Regular expression templates, HMM profiles
  - Secondary structure prediction scores
  - Segment length
- Inter-node features
  - β-strand side-chain alignment scores
  - Preferences for parallel alignment scores
  - Distance between adjacent B23 segments
- Features are general and easy to extend

49 Evaluation Measure
- Q3 (accuracy)
- Precision, recall
- Segment overlap quantity (SOV)
- Matthews correlation coefficient, based on the following counts:

             Predicted +   Predicted −
  True +          p             u
  True −          o             n
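
The Matthews correlation coefficient can be computed from the four confusion counts. The sketch below assumes the usual reading of those counts (p: true positives, n: true negatives, u: under-predictions or false negatives, o: over-predictions or false positives) and returns 0.0 when the denominator vanishes, which is a common convention rather than something stated on the slide.

```python
import math

def matthews_cc(p, n, u, o):
    """Matthews correlation coefficient from confusion counts.

    p: true positives, n: true negatives,
    u: under-predicted (false negatives), o: over-predicted (false positives).
    """
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0

# Hypothetical counts for one class on an 18-residue example.
print(matthews_cc(p=4, n=13, u=1, o=0))
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect), which makes it more informative than accuracy when one class dominates.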

50 Local Information
- PSI-BLAST profile
  - Position-specific scoring matrices (PSSM)
  - Linear transformation [Kim & Park, 2003]
- SVM classifier with RBF kernel
- Feature #1 (S_i): prediction score for each residue R_i
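
The slide names a linear transformation of the PSSM but does not spell it out. The sketch below assumes a piecewise-linear scaling that clips raw scores to [-5, 5] and maps them onto [0, 1]; the thresholds, the window layout, and the zero padding at the sequence ends are assumptions for illustration, as are the function names.

```python
def scale_pssm(value):
    """Assumed piecewise-linear scaling: <= -5 maps to 0, >= 5 maps to 1, linear between."""
    return min(1.0, max(0.0, 0.5 + 0.1 * value))

def window_features(pssm, center, half_width):
    """Concatenate scaled PSSM rows in a window around `center`.

    pssm is a list of rows, one per residue. Positions that fall outside the
    sequence are padded with zeros (an assumption of this sketch).
    """
    width = len(pssm[0])
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(pssm):
            features.extend(scale_pssm(v) for v in pssm[pos])
        else:
            features.extend([0.0] * width)
    return features

# Hypothetical two-residue profile with two columns per residue.
print(window_features([[0, 5], [-5, 0]], center=0, half_width=1))
```

A vector built this way would be the input to the per-residue SVM classifier, whose output score then serves as the feature S_i.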

