Presentation is loading. Please wait.

Presentation is loading. Please wait.

Predicting Protein Structures and Structural Features on a Genomic Scale Pierre Baldi School of Information and Computer Sciences Institute for Genomics.

Similar presentations


Presentation on theme: "Predicting Protein Structures and Structural Features on a Genomic Scale Pierre Baldi School of Information and Computer Sciences Institute for Genomics."— Presentation transcript:

1 Predicting Protein Structures and Structural Features on a Genomic Scale Pierre Baldi School of Information and Computer Sciences Institute for Genomics and Bioinformatics University of California, Irvine

2 UNDERSTANDING INTELLIGENCE Human intelligence (inverse problem) AI (direct problem) Choice of specific problems is key Protein structure prediction is a good problem

3 PROTEINS R 1 R 3 | | C α N C β C α / \ / \ / \ / \ N C β C α N C β | R 2

4

5 Utility of Structural Information (Baker and Sali, 2001)

6 CAVEAT

7 REMARKS Structure/Folding Backbone/Full Atom Homology Modeling Fold Recognition (Threading) Ab Initio (Physical Potentials/Molecular Dynamics, Statistical Mechanics/Lattice Models) Statistical/Machine Learning (Training Sets, SS prediction) Mixtures: ab-initio with statistical potentials, machine learning with profiles, etc.

8 PROTEIN STRUCTURE PREDICTION (ab initio)

9

10 Helices 1GRJ (Grea Transcript Cleavage Factor From Escherichia Coli)

11 Antiparallel β-sheets 1MSC (Bacteriophage Ms2 Unassembled Coat Protein Dimer)

12 Parallel β-sheets 1FUE (Flavodoxin)

13 Contact map

14 Secondary structure prediction

15 GRAPHICAL MODELS: BAYESIAN NETWORKS X1, …,Xn random variables associated with the vertices of a DAG = Directed Acyclic Graph The local conditional distributions P(Xi|Xj: j parent of i) are the parameters of the model. They can be represented by look-up tables (costly) or other more compact parameterizations (Sigmoidal Belief Networks, XOR, etc). The global distribution is the product of the local characteristics: P(X1,…,Xn) = Π i P(Xi|Xj : j parent of i)

16

17

18

19

20

21 DATA PREPARATION Starting point: PDB data base.  Remove sequences not determined by X ray diffraction.  Remove sequences where DSSP crashes.  Remove proteins with physical chain breaks (neighboring AA having distances exceeding 4 Angstroms)  Remove sequences with resolution worst than 2.5 Angstroms.  Remove chains with less than 30 AA.  Remove redundancy (Hobohm’s algorithm, Smith-Waterman, PAM 120, etc.)  Build multiple alignments (BLAST, PSI-BLAST, etc.)

22 SECONDARY STRUCTURE PROGRAMS  DSSP (Kabsch and Sander, 1983): works by assigning potential backbone hydrogen bonds (based on the 3D coordinates of the backbone atoms) and subsequently by identifying repetitive bonding patterns.  STRIDE (Frishman and Argos, 1995): in addition to hydrogen bonds, it uses also dihedral angles.  DEFINE (Richards and Kundrot, 1988): uses difference distance matrices for evaluating the match of interatomic distances in the protein to those from idealized SS.

23 SECONDARY STRUCTURE ASSIGNMENTS DSSP classes: H = alpha helix E = sheet G = 3-10 helix S = kind of turn T = beta turn B = beta bridge I = pi-helix (very rare) C = the rest CASP (harder) assignment: α = H and G β = E and B γ = the rest Alternative assignment: α = H β = B γ = the rest

24 ENSEMBLES

25

26

27

28 FUNDAMENTAL LIMITATIONS 100% CORRECT RECOGNITION IS PROBABLY IMPOSSIBLE FOR SEVERAL REASONS SOME PROTEINS DO NOT FOLD SPONTANEOUSLY OR MAY NEED CHAPERONES QUATERNARY STRUCTURE [BETA-STRAND PARTNERS MAY BE ON A DIFFERENT CHAIN] STRUCTURE MAY DEPEND ON OTHER VARIABLES [ENVIRONMENT, PH] DYNAMICAL ASPECTS FUZZINESS OF DEFINITIONS AND ERRORS IN DATABASES

29

30

31 BB-RNNs

32 2D RNNs

33 2D INPUTS AA at positions i and j Profiles at positions i and j Correlated profiles at positions i and j + Secondary Structure, Accessibility, etc.

34

35 PERFORMANCE (%) 6Å6Å8Å8Å10Å12Å non- contacts contacts all

36 Protein Reconstruction Using predicted secondary structure and predicted contact map PDB ID : 1HCR, chain A Sequence: GRPRAINKHEQEQISRLLEKGHPRQQLAIIFGIGVSTLYRYFPASSIKKRMN True SS : CCCCCCCCHHHHHHHHHHHCCCCHHHHHHHCECCHHHHHHHCCCCCCCCCCC Pred SS : CCCCCCCHHHHHHHHHHHHCCCCHHHHEEHECHHHHHHHHCCCHHHHHHHCC PDB ID: 1HCR Chain A (52 residues) Model # 147 RMSD 3.47Å

37 Protein Reconstruction Using predicted secondary structure and predicted contact map PDB ID : 1BC8, chain C Sequence: MDSAITLWQFLLQLLQKPQNKHMICWTSNDGQFKLLQAEEVARLWGIRKNKPNMNYDKLSRALRYYYVKNIIKKVNGQKFVYKFVSYPEILNM True SS : CCCCCCHHHHHHHHCCCHHHCCCCEECCCCCEEECCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHCCEEECCCCCCEEEECCCCHHHCC Pred SS : CCCHHHHHHHHHHHHHCCCCCCEEEEECCCEEEEECCHHHHHHHHHHHCCCCCCCHHHHHHHHHHHHHCCCEEECCCCEEEEEEECCHHHHCC PDB ID: 1BC8 Chain C (93 residues) Model # 1714 RMSD 4.21Å

38 CASP6 Self Assessment Evaluation based on GDT_TS of first submitted model GDT_TS: Global Distance Test Total Score GDT_TS = (GDT_P 1 + GDT_P 2 + GDT_P 4 + GDT_P 8 ) / 4 P n : percentage of residues under distance cutoff n

39 Hard Target Summary N: number of targets predicted Av.R.: average rank sumZ: sum of Z scores on all targets in set sumZpos: sum of Z scores for predictions with positive Z score groupNAv.R.sumZsumZpos BAKER-ROBETTA baldi-group-server Rokky Pmodeller ZHOUSPARKS ACE Pcomb RAPTOR zhousp PROTINFO-AB Top 10 groups displayed, of 65 registered servers Assessment on 25 new fold and fold recognition analogous target domains

40 Hard Target Summary Top 10 groups displayed, of 65 registered servers Assessment on 19 new fold and fold recognition analogous target domains less than 120 residues N: number of targets predicted Av.R.: average rank sumZ: sum of Z scores on all targets in set sumZpos: sum of Z scores for predictions with positive Z score groupNAv.R.sumZsumZpos baldi-group-server BAKER-ROBETTA Rokky PROTINFO-AB ZHOUSPARKS Pcomb Pmodeller PROTINFO ACE RAPTOR

41 Target Information –Length: 70 amino acids –Resolution: 1.52 Å –PDB code: 1WHZ –Description: Hypothetical Protein From Thermus Thermophilus Hb8 –Domains: single domain Target T0281 Detailed Target Analysis Assessment –GDT_TS server rank of our 1 st model: 2 –GDT_TS: –RMSD to native: 6.15

42 Target T0281 Contact Map Comparison True Map vs. Predicted MapTrue Map vs. Recovered Map *note: true map is lower left

43 Target T0281 Structure Comparison true structurepredicted structure

44 Target T0281 Structure Comparison Superposition True structure: thick trace Predicted structure: thin trace

45 Target Information –Length: 51 amino acids –Resolution: 2.00 Å –PDB code: 1WD5 –Description: Putative phosphoribosyl transferase, T. thermophilus –Domains: 2 nd domain, residues of 208 AA sequence Target T0280_2 Detailed Target Analysis Assessment –GDT_TS server rank of our 1 st model: 1 (also 1 st among human groups) –GDT_TS: –RMSD to native: 5.81

46 Target T0280_2 Contact Map Comparison *note: true map is lower left True Map vs. Predicted MapTrue Map vs. Recovered Map

47 Target T0281 Structure Comparison true structurepredicted structure

48 Target T0281 Structure Comparison Superposition True structure: thick trace Predicted structure: thin trace

49 THE SCRATCH SUITE –DOMpro: domains –DISpro: disordered regions –SSpro: secondary structure –SSpro8: secondary structure –ACCpro: accessibility –CONpro: contact number –DI-pro: disulphide bridges –BETA-pro: beta partners –CMAP-pro: contact map –CCMAP-pro: coarse contact map –CON23D-pro: contact map to 3D –3D-pro: 3D structure (homology + fold recognition + ab-initio)

50

51 SISQQTVWNQMATVRTPLNFDSSKQSFCQFSVDLLGGGISVDKTGDWITLVQNSPISNLL CCCECCCCCCEEEECCCCCCCCCCCCEEEEEEECCCCEEEECCCCCCEEEEECCHHHHHH CCCEEEEECEEEEECCCCCCCTCCCCEEEEEEEETCSEEEECTTTTEEEEEECCHHHHHH eeeeee---e--e-e-eee-ee-eee e-e--eeeeee RVAAWKKGCLMVKVVMSGNAAVKRSDWASLVQVFLTNSNSTEHFDACRWTKSEPHSWELI HHHHHHCCCEEEEEEEEEECCEEECCCCCEEEEEEEECCCCCCCCCEEEEEECCCCCCCC HHHHHHTTCEEEEEEEEEEEEEEECCCCCEEEEEEEECCCTTCCCEEEEEEECCTCCEEE ee---e e-e-ee-e-e-e-----e--eeee--e e-e-ee-e.. Solvent accessibility threshold: 25% PSI-BLAST hits : 24.. Query served in 151 seconds

52 Advantage of Machine Learning Pitfalls of traditional ab-initio approaches Machine learning systems take time to train (weeks). Once trained however they can predict structures almost faster than proteins can fold. Predict or search protein structures on a genomic or bioengineering scale.

53 DAG-RNNs APPROACH Two steps: –1. Build relevant DAG to connect inputs, outputs, and hidden variables –2. Use a deterministic (neural network) parameterization together with appropriate stationarity assumptions/weight sharing—overall models remains probabilistic Process structured data of variable size, topology, and dimensions efficiently Sequences, trees, d-lattices, graphs, etc Convergence theorems Other applications

54

55 Convergence Theorems Posterior Marginals: σBN  dBN in distribution σBN  dBN in probability (uniformly) Belief Propagation: σBN  dBN in distribution σBN  dBN in probability (uniformly)

56 Structural Databases PPDB = Poxvirus Proteomic Database ICBS = Inter Chain Beta Sheet Database

57

58

59

60 Strategies for drug design Block, modulate, mediate β-sheet interactions [Covalent modification of a chain to prevent β- sheet formation]

61

62

63 Three-Stage Prediction of Protein Beta-Sheets Using Neural Networks, Alignments, and Graph Algorithms Jianlin Cheng and Pierre Baldi School of Info. and Computer Sci. University of California Irvine

64 Beta-Sheet Architecture

65 Importance of Predicting Beta-Sheet Structure AB-Initio Structure Prediction Fold Recognition Model Refinement Protein Design Protein Stability

66 Previous Work Methods –Statistical potential approach for strand alignment. (Hubbard, 1994; Zhu and Braun, 1999) –Statistical potentials to improve beta-sheet secondary structure prediction.(Asogawa,1997) –Information theory approach for strand alignment. (Steward and Thornton, 2000) –Neural networks for beta-residue contacts. (Baldi, et.al, 2000) Shortcomings Focus on one single aspect; not utilize structural contexts and evolutionary information; not exploit constraints enough; not publicly available.

67 Three-Stage Prediction of Beta- Sheets Stage 1 Predict beta-residue pairings using 2D- Recursive Neural Networks (2D-RNN). Stage 2 Align beta-strands using alignment algorithms. Stage 3 Predict beta-strand pairs and beta-sheet architecture using graph algorithms.

68 Dataset and Statistics Num Chains916 Beta residues 48,996 Residue Pairs 31,638 Beta Strand 10,745 Strand Pairs 8,172 Beta Sheet 2,533

69 Stage 1: Prediction of Beta-Residue Pairings Using 2D-RNN Input Matrix I (m×m) 2D-RNN O = f(I) Target / Output Matrix (m×m) (i,j) i-2i-1 ii+1i+2j-2j-1 jj+1j+2|i-j| 20 profiles3 SS2 SA T ij : 0/1 O ij : Pairing Prob. (i,j) I ij Total: 251 inputs

70 An Example Target Protein 1VJG Beta-Residue Pairing Map (Target Matrix)

71 An Example Output

72 Stage 2: Beta-Strand Alignment Use output probability matrix as scoring matrix Dynamic programming Disallow gaps and use simplified searching algorithms 1m n1 1m 1n Anti-parallel Parallel Total number of alignments = 2(m+n-1)

73 Strand Alignment and Pairing Matrix The alignment score (Pseudo Binding Energy) is the sum of the probabilities of paired residues. The best alignment is the alignment with maximum score. Strand Pairing Matrix. Strand Pairing Matrix of 1VJG

74 Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Architecture Strand Pairing Constraints

75 Minimum Spanning Tree Like Algorithm Strand Pairing Graph (SPG) Goal: Find a set of connected subgraphs that maximize the sum of pseudo-energy and satisfy the constraints. Algorithm: Minimum Spanning Tree Like Algorithm.

76 Example of MST Like Algorithm Strand Pairing Matrix of 1VJG Assembly of beta-strands Step 1: Pair strand 4 and 5

77 Example of MST Like Algorithm Strand Pairing Matrix of 1VJGA N Assembly of beta-strands Step 2: Pair strand 1 and 2

78 Example of MST Like Algorithm Strand Pairing Matrix of 1VJGA N Assembly of beta-strands Step 3: Pair strand 1 and 3

79 Example of MST Like Algorithm Strand Pairing Matrix of 1VJGA N Assembly of beta-strands Step 4: Pair strand 3 and 6

80 Example of MST Like Algorithm Strand Pairing Matrix of 1VJGA N C Assembly of beta-strands Step 5: Pair strand 6 and 7

81 Beta-Residue Pairing Results Sensitivity = Specificity = 41% Base-line: 2.3%. Ratio of improvement = ROC area: 0.86 At 5% FPR, TPR is 58% CMAPpro: Spec. and Sens. is 27%. ROC area:0.8. TPR=42% at 5% FPR.

82 Strand Pairing Results Naïve algorithm of pairing all adjacent strands –Specificity = 42% –Sensitivity = 50% MST like algorithm –Specificity = 53% –Sensitivity = 59% –>20% correctly predicted strand pairs are non-adjacent strand pairs

83 Strand Alignment Results Paring Direction Align. All Align. Anti-P Align. Para. Align. Bridge Acc.93%72%69%71%88% On the correctly predicted pairs: Pairing Direction Align. All Align. Anti-P Align. Para. Align. Bridge Acc.84%66%63%66%73% On all native pairs: Pairing direction is 15% higher than of random algorithm. Alignment accuracy is improved by >15%.

84 Application and Future Work New methods for beta-residue pairings (e.g. Linear Programming, SVM), and strand alignment and pairings. More inputs (Punta and Rost, 2005). Applications –AB-Initio Structure Sampling (beta-sheet) –Fold Recognition (conservation of beta-sheets) –Contact Map –Model Refinement (pairing direction/alignment) Web server and dataset

85 A New Fold Example (CASP6) 1S12 (T0201, 94 residues) CEEEEEECCEEEECCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHEHHCCCCEEEEHHHHHHHHHHHHHHHHHHHHHHHHHCCCCEEEEEEECCC Predicted: 1-2, 2-4, 3-4, 4-5 CEEEEECCCEEEEECCCCCHHHHHHHHHHHHHHHHHHHHCCCEEEEEECCEEEEEECCCCHHHHHHHHHHHHHHHHHHHHCCCCEEEEECCCCCC True: 1–2, 2-4, 3-4, 1-5 True SS: Predicted SS: Rendered in Rasmol Strand Pairing Matrix

86 ACKNOWLEDGMENTS UCI: –Gianluca Pollastri, Pierre-Francois Baisnee, Michal Rosen-Zvi –Arlo Randall, S. Joshua Swamidass, Jianlin Cheng, Yimeng Dou, Yann Pecout, Mike Sweredoski, Alessandro Vullo, Lin Wu –James Nowick, Luis Villareal DTU: Soren Brunak Columbia: Burkhard Rost U of Florence: Paolo Frasconi U of Bologna: Rita Casadio, Piero Fariselli

87 1DFN| Defensin

88 Sequence with cysteine's position identified: MSNHTHHLKFKTLKRAWKASKYFIVGLSC[29]LYKFNLKSLVQTALSTLAMITLTSLVITAIIYISVGNA KAKPTSKPTIQQTQQPQNHTSPFFTEHNYKSTHTSIQSTTLSQLLNIDTTRGITYGHSTNETQNRKI KGQSTLPATRKPPINPSGSIPPENHQDHNNFQTLPYVPC[173]STC[176]EGNLAC[182]LSLC[18 6]HIETERAPSRAPTITLKKTPKPKTTKKPTKTTIHHRTSPETKLQPKNNTATPQQG ILSSTEHHTNQSTTQI Length: 257, Total number of cysteines: 5 Four bonded cysteines form two disulfide bonds : ( red cysteine pair) (blue cysteine pair) Prediction Results from DIpro (http://contact.ics.uci.edu/bridge.html) Predicted Bonded Cysteines: 173,176,182,186 Predicted disulfide bonds Bond_IndexCys1_PositionCys2_Position Prediction Accuracy for both bond state and bond pair are 100%. A Perfectly Predicted Example

89 Sequence with cysteine's position identified: MTLGRRLAC[9]LFLAC[14]VLPALLLGGTALASEIVGGRRARPHAWPFMVSLQLRGGHFC[ 55]GATLIAPNFVMSAAHC[71]VANVNVRAVRVVLGAHNLSRREPTRQVFAVQRIFENGYD PVNLLNDIVILQLNGSATINANVQVAQLPAQGRRLGNGVQC[151]LAMGWGLLGRNRGIA SVLQELNVTVVTSLC[181]RRSNVC[187]TLVRGRQAGVC[198]FGDSGSPLVC[208]NGLI HGIASFVRGGC[223]ASGLYPDAFAPVAQFVNWIDSIIQRSEDNPC[254]PHPRDPDPASR TH Length: 267, Total Cysteine Number: 11 Eight bonded cysteines form four disulfide bonds: (Red), (Blue), (Green), (Purple) A Hard Example with Many Non-Bonded Cysteines Prediction Results from DIpro (http://contact.ics.uci.edu/bridge.html) Predicted Bonded Cysteines: 9,14,55,71,181,187,223,254 Predicted Disulfide Bonds: Bond_IndexCys1_PositionCys2_Position (correct) 2914 (wrong) (wrong) (correct) Bond State Recall: 5 / 8 = 0.625, Bond State Precision = 5 / 8 = Pair Recall = 2 / 4 = 0.5 ; Pair Precision = 2 / 4 = 0.5 Bond number is predicted correctly.

90 Bond Num Bond State Recall(%) Bond State Precision(%) Pair Recall(%) Pair Precision(%) Overall bond state recall: 78%; overall bond state precision: 74%; bond number prediction accuracy: 53%; average difference between true bond number and predicted bond number: 1.1. Prediction Accuracy on SP51 Dataset on All Cysteines

91 CURRENT WORK Feedback: Ex: SS  Contacts  SS  Contacts Homology, homology, homology: SSpro 4.0 performs at 88%

92


Download ppt "Predicting Protein Structures and Structural Features on a Genomic Scale Pierre Baldi School of Information and Computer Sciences Institute for Genomics."

Similar presentations


Ads by Google