Statistical Challenges in Proteomics Shortcourse Joint Statistical Meetings Denver, CO 2008 Scott C. Schmidler Department of Statistical Science Duke University.


1 Statistical Challenges in Proteomics Shortcourse Joint Statistical Meetings Denver, CO 2008 Scott C. Schmidler Department of Statistical Science Duke University

2 2 Instructor Scott C. Schmidler Assistant Professor Department of Statistical Science Program in Computational Biology & Bioinformatics Program in Structural Biology & Biophysics Duke University 223 Old Chemistry Building schmidler@stat.duke.edu Box 90251 www.stat.duke.edu/~scs Duke University Ph: (919) 684-8064 Durham, NC 27708-0251 Fax: (919) 684-8594

3 3 Abstract Introduction to principal aims, technologies, and statistical issues arising in structural and functional proteomics studies. Overview of experimental data sources: X-ray, NMR, mass spectrometry (MALDI, SELDI, MS/MS), peptide arrays. Statistical problems in structural proteomics: molecular comparison and database search, classification of structures, structure-based function prediction. Statistical problems in functional proteomics: fragment identification, normalization and registration of spectra, peak finding, sample comparison, classification and marker identification.

4 4 Overview What is proteomics? –Biological overview and questions of interest Structural proteomics –Structure comparison and alignment –Protein structure prediction –Protein folding simulations Functional proteomics –Mass spectrometry –Protein-protein interaction networks

5 5 Timetable 8:30-10:15: –Overview of proteomics –Structure and function of proteins –Alignment & shape analysis 10:15-10:30: Break 10:30-12:30: –Protein structure prediction –Protein folding simulations 12:30-2: Lunch 2:00-3:15: –Mass spectrometry methods and data analysis 3:15-3:30: Break 3:30-5:00: –Protein-protein interaction networks –Docking & drug design

6 6 What is proteomics? The proteome is the entire complement of proteins encoded by a genome. It is distinguished from the genome, the ribonome, the metabolome, etc. by the focus on proteins.

7 7 Molecular biology review ‘Central Dogma’... NWVLSTAADM...... AAC UGG GUC CUA UCG ACA GCA GCC... DNA sequence Protein structure
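The codon-to-residue mapping on this slide can be checked with a short sketch. It assumes the standard genetic code, enumerated in the conventional U, C, A, G order; the function name `translate` is illustrative and does not come from the slides.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code; codons are enumerated in UCAG order (UUU, UUC, UUA, ...).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(rna):
    """Translate an RNA (or DNA) string codon by codon, stopping at a stop codon."""
    rna = rna.replace(" ", "").upper().replace("T", "U")
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(rna) - len(rna) % 3, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AAC UGG GUC CUA UCG ACA GCA GCC"))  # NWVLSTAA
```

The eight codons shown on the slide translate to the first eight residues of the peptide fragment NWVLSTAADM.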

8 8 Molecular biology of the cell

9 9 What is proteomics? Proteomics is the study of proteomes. Proteomics is particularly concerned with large-scale, high-throughput studies of protein function, expression, and interactions. Often it is assumed to be synonymous with a particular technology, such as mass spectrometry, 2D PAGE, or Y2H.

10 10 Why proteomics? Goals: Cataloging and characterization of protein function and interactions, toward an integrated view of cellular processes. Many different technologies bear on these questions. All produce exciting data with interesting statistical challenges. We will examine a few. We will distinguish structural and functional proteomics.

11 11 Journals Proteins: Structure, Function, and Bioinformatics Protein Science Journal of Molecular Biology Biophysical Journal Journal of Computational Biology Bioinformatics PLoS Computational Biology Proteomics Journal of Proteome Research

12 12 Proteins

13 13 Proteins are linear polymers

14 14 Amino Acid Side Chains Vary in charge, polarity and hydrophobicity, volume, and hydrogen bonding potential

15 15 Protein folding PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVT NKGRQKVVPLTNTTNQKTELQAIYLALQDSGLEVNIVTDSQYALGIIQAQP DKSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVKQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQ Sequence of 984 amino acids: Compact 3-dimensional structure: (7404 atoms) HIV reverse transcriptase

16 16 The role(s) of protein structure... In living organisms: Catalytic, structural, regulatory, signaling, transport Cellular development, differentiation, metabolism, and replication In molecular medicine: Understanding function and mechanism Inherited and infectious disease –Sickle cell anemia Development of novel therapeutics Example: Hemoglobin

17 17 Regulation of gene expression Helix-turn-helix Structural motif in DNA-binding proteins, transcription regulation.

18 18 DNA Helicase Unwinds DNA duplex to allow polymerase access for replication/ transcription.

19 19 Immunoglobulins Antibodies: Hyper-variable regions form antigen binding sites. Assembly yields tremendous structural diversity.

20 20 Membrane channels

21 21 Overview What is proteomics? –Biological overview and questions of interest Structural proteomics –Structure comparison and alignment –Protein structure prediction –Protein folding simulations Functional proteomics –Mass spectrometry –Protein-protein interaction networks

22 22 Structural Proteomics

23 23 Structural data X-ray crystallography –Protein crystallization and X-ray diffraction –Can be very high-resolution –Labor intensive; crystal may distort structure NMR spectroscopy –Lower resolution – observe ensembles –Dynamical behavior and solution conformation observable

24 24 Overview of protein X-ray crystallography

25 25 Overview of protein NMR spectroscopy

26 26 Reasons for studying structure Determining function(s) Understanding mechanisms Identifying interactions Understanding folding Evolutionary comparison Protein engineering Rational drug design

27 27 Protein structure resources

28 28 The Protein Data Bank (PDB) www.rcsb.org

29 29 Protein Data Bank (PDB) www.rcsb.org January 27, 2009: 51,977 released atomic coordinate entries http://www.rcsb.org/pdb/statistics/holdings.do Experimental Technique Diffraction and other NMR Theoretical modeling Counts by molecule Type Proteins, peptides, and viruses Nucleic acids Protein/nucleic acid complexes Carbohydrates

30 30 PDB statistics

31 31 Structure of PDB files Example: 1hho oxyhemoglobin –Header –Sequence, secondary structure information –Atomic positions –Heterogens and connectivity information

32 32 Viewing structures Rasmol –www.umass.edu/microbio/rasmol/ Chime (browser plug-in) –http://www.mdlchime.com/chime/ MAGE –kinemage.biochem.duke.edu/ Others –www.rcsb.org/pdb/software-list.html - Graphics

33 33 Viewing protein structures Rasmol demo

34 34 Statistical methods for protein structure Secondary structure prediction Threading and fold recognition Homology modeling Backbone dihedral angle (phi/psi) distributions Loop modeling (indels) Side chain rotamer libraries Active site recognition Protein folding theory (stat mech)

35 35 Statistical problems in structural biology Estimation in random fields Inverse problems Nonparametrics Hard Monte Carlo optimization/integration HMMs; classification Statistical mechanics Complex spatial models and shape analysis

36 36 References

37 37 Protein structure comparison and analysis

38 38 Structural genomics “High-throughput”, high-resolution structure determination 9 pilot sites in operation –NIH: ~ $185M first 4 years (2001-4) –Estimate ~$75M/yr in production phase

39 39 Protein Data Bank growth

40 40 Growth in new folds Many new structures related to existing ones

41 41 Pairwise structure alignment How similar? 4hhb_A: Human deoxyhemoglobin A 5mbn: Sperm whale deoxymyoglobin ?

42 42 Goals of structure comparison –Identifying homology Determining function(s) and mechanism Database search, clustering –Studying variability Interpreting SNPs –Inherited disease –Drug response Function and mechanism –Evolutionary distance –Visualization and 3D statistics Holm & Sander 1996

43 43 Protein structure classifications Large hierarchical classifications –SCOP – (S)tructural (C)lassification (O)f (P)roteins Murzin et al. (1995) J. Mol. Biol. Hand curated scop.berkeley.edu Classes, families, topologies, folds –CATH (Thornton et al.) (C)lass, (A)rchitecture, (T)opology, (H)omologous superfamily Algorithmic (except Architecture) www.cathdb.info CATH hierarchy

44 44 Pairwise protein structure comparison Steps: –Find corresponding positions Hard: Iterative dynamic programming or heuristic methods –Rotate/translate for optimal match Easy: Least-squares computations (SVD) –Statistical significance

45 45 Protein landmarks Cα’s Others: –Side chain centroids –Active site residues/atoms –Electrostatic or solvent accessible surface 4hhb_A

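46 46 Size-and-shape matching Partial Procrustes distance: $d_P(X, Y) = \min_{R, T} \lVert X - (Y R + 1 T^\top) \rVert_F$, a least-squares problem with the rotation R and translation T as nuisance parameters. Solution is obtained by centering X and Y and setting $R = U V^\top$, where $Y^\top X = U D V^\top$ is the singular value decomposition of the centered coordinates (or by using quaternions)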
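The least-squares superposition described on this slide can be sketched with NumPy. This is a minimal illustration of the standard SVD solution, assuming two landmark sets with known one-to-one correspondence stored as n-by-3 arrays; the function name `procrustes_superpose` and the sign correction that rules out reflections are choices made for this sketch, not details taken from the slides.

```python
import numpy as np

def procrustes_superpose(X, Y):
    """Rigidly superpose Y onto X (both n x 3, rows in correspondence).

    Returns the rotation R, the transformed copy of Y, and the RMSD.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y          # centering removes the translation
    H = Yc.T @ Xc                            # 3 x 3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (det = +1) rather than a reflection.
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    Y_fit = Yc @ R + mean_x
    rmsd = np.sqrt(((X - Y_fit) ** 2).sum(axis=1).mean())
    return R, Y_fit, rmsd

# Example: recover a known rotation and translation.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Y = X @ Rz + np.array([1.0, 2.0, 3.0])
R, Y_fit, rmsd = procrustes_superpose(X, Y)
print(rmsd)
```

For real structures the rows would typically be the Cα coordinates of matched residues, so the hard part remains finding the correspondence, not the superposition.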

47 47 Partial Procrustes solution Optimal rotation and translation 4hhb_A 5mbn

48 48 Multiple structure superposition Iterative pairwise methods MPOSE 1 –Affine-invariant statistical model of family: –Estimate model by least-squares computation –Yields non-iterative multiple superposition –Available at dna.stanford.edu where 1: Wu, Schmidler, Hastie, Brutlag (1998) J. Comp. Biol

49 49 Example: globin family 7 globins: Human deoxyhemoglobin α and β (4hhbA/B) Sperm whale deoxymyoglobin (5mbn) Larval deoxyhemoglobin (1ecd) from Chironomus thummi Sea lamprey cyanohemoglobin (2lhb) Yellow lupin root nodule cyanoleghemoglobin (2lh3) from Lupinus luteus Annelid worm deoxyhemoglobin (2hbg) from Glycera dibranchiata

50 50 Structural alignment of globins

51 51 Structural variability in globins Note: E,G helices conserved

52 52 Shape variability PCA of Procrustes residuals (Approximate tangent space coords) PC 1, PC 2

53 53 Bayesian shape matching and alignment

54 54 Bayesian shape matching Correspondence between landmarks unknown –Shape analysis assumes labels –Ignores (significant) uncertainty in matching Define an alignment as a pair: –Matching M –Registration T( ; ) Rigid body: Affine: for

55 55 Matching M is the adjacency matrix of a bipartite graph
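A matching between two landmark sets can be represented directly as a 0/1 matrix. The sketch below is a hypothetical illustration (the function names are not from the slides): it builds M from a list of matched index pairs, checks that each landmark is matched at most once, and checks the order-preserving property used for sequence-like alignments.

```python
def matching_matrix(pairs, n, m):
    """Build an n x m 0/1 matrix with M[i][j] = 1 when landmark i of the
    first structure is matched to landmark j of the second."""
    M = [[0] * m for _ in range(n)]
    for i, j in pairs:
        M[i][j] = 1
    return M

def is_matching(M):
    """A valid matching uses every row and every column at most once."""
    rows_ok = all(sum(row) <= 1 for row in M)
    cols_ok = all(sum(col) <= 1 for col in zip(*M))
    return rows_ok and cols_ok

def is_order_preserving(M):
    """For a valid matching, check that matched pairs never cross."""
    pairs = sorted((i, j) for i, row in enumerate(M)
                   for j, value in enumerate(row) if value)
    return all(j1 < j2 for (_, j1), (_, j2) in zip(pairs, pairs[1:]))

M = matching_matrix([(0, 0), (1, 2)], n=3, m=3)
print(is_matching(M), is_order_preserving(M))
```

Unmatched landmarks simply have all-zero rows or columns, which is how gaps appear in this representation.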

56 56 Bayesian shape matching Obtain posterior distribution where Exponential number alignments –Can find MAP –Draw samples by MCMC

57 57 Likelihood Statistical model Likelihood –Alternative: shape distribution (Profile likelihood)

58 58 Distributions on shape Multivariate distributions on figure space (R np ) –R, u, are nuisance parameters –conditional and marginal approaches Distributions on general shape space –Tangent space approximations allow MVA

59 59 Gibbs sampling Draw from conditional distributions and Sampling M –For some priors, dynamic prog algorithms exist Liu & Lawrence 1999, Schmidler 2003 Sampling  –Easy for some priors –Alternative: integrate out

60 60 Priors on alignments: gap penalty method Order-preserving matching: Sequence alignment – gap penalties Prior distribution See e.g. (Liu & Lawrence 1999) for sequence alignment
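The sequence-alignment analogy can be made concrete with the classic dynamic program for global alignment under a linear gap penalty. This is a sketch of that textbook recursion, not of the Bayesian prior itself; the scoring values (match, mismatch, gap) are arbitrary illustrative choices.

```python
def global_alignment_score(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Best score of an order-preserving alignment of a and b,
    charging `gap` for every unmatched position."""
    n, m = len(a), len(b)
    prev = [j * gap for j in range(m + 1)]      # row for the empty prefix of a
    for i in range(1, n + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair,   # match a[i-1] with b[j-1]
                          prev[j] + gap,        # leave a[i-1] unmatched
                          curr[j - 1] + gap)    # leave b[j-1] unmatched
        prev = curr
    return prev[m]

print(global_alignment_score("ACGT", "AGT"))  # 1.0
```

In the structural setting the pair score would come from the likelihood of the superposed coordinates, and the gap terms play the role of the prior on matchings.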

61 61 Bayesian flexible matching

62 62 Transformations with changepoints Consider as a sequence of transformations Introduce changepoints j such that: Multiple changepoints: with

63 63 Bayesian flexible alignment Changepoint likelihood (shape version): Posterior distribution: MAP matching or sampling (Schmidler 2000)

64 64 Rigid alignment - Calmodulin RMSD = 11.97 A

65 65 Flexible alignment - Calmodulin RMSD = 0.7 A

66 66 Detection of hinge points, flexible regions and disorder Change-point analysis –Bayesian approach: Calmodulin Posterior distn of changepoint location

67 67 Example: TIM barrels Identification of flexible loop at active site: Posterior for k=2 changepoints

68 68 Predicting functional sites Multiple structures of related function –Enzyme active sites Multiple structure alignment Fit statistical models –Spherical or lattice statistics –A 3D “motif” –Scan new structures for predicted sites Sensitivity, specificity From Wei et al. 1998

69 69 References Schmidler SC. Bayesian Flexible Shape Matching With Applications to Structural Bioinformatics. Submitted to Journal of the American Statistical Association. Rodriguez A, Schmidler SC. Bayesian Protein Structure Alignment. Submitted to Annals of Applied Statistics. Wang R, Schmidler SC. Bayesian Multiple and Flexible Protein Structure Alignment. In preparation for Journal of Computational Biology. Schmidler SC (2006). Fast Bayesian Shape Matching Using Geometric Algorithms (with discussion). Bayesian Statistics 8, Eds. J.M. Bernardo, S. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith and M. West, Oxford University Press, pp 471-490. Wu TD, Schmidler SC, Hastie T, Brutlag DL (1998). Regression analysis of multiple protein structures. Journal of Computational Biology, Vol 5, No 3, pp 585-595. See http://www.stat.duke.edu/~scs/Publications.html

70 70 Protein structure prediction

71 71 Sequence vs. structure Computational solutions sorely needed

72 72 Protein structure prediction PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVT NKGRQKVVPLTNTTNQKTELQAIYLALQDSGLEVNIVTDSQYALGIIQAQP DKSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVKQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQ Sequence of 984 amino acids: 3D coordinates of 7404 atoms: HIV reverse transcriptase

73 73 Protein Architecture Levels of protein structure: –Primary –Secondary –(Super-secondary) –Tertiary –Quaternary

74 74 Secondary structure prediction for protein folding Sequence of amino acids (the HIV reverse transcriptase sequence shown on slide 15) Predict structural segments Goal: Recover 3D coords The secondary structure prediction problem: Given a protein sequence: NWVLSTAADMQGVVTDGMASGLDKD... Predict a secondary structure sequence: LLEEEELLLLHHHHHHHHHHLHHHL... H = α-helix E = Extended β-strand L = Loop/coil

75 75 Local effects in protein folding: Amino acid propensities Secondary structure propensities: Position-specific propensities:

76 76 Local interactions Intra-segment correlations, induced by –Side chain interactions –Native environment E.g. hydrophobicity patterns: Sequence of helical peptide: NLAKMVVKTAEAILKD Amphipathic α-helices: Amphipathic β-strands: Hydrophobic moment:
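The hydrophobic moment can be sketched as the length of a vector sum in which each residue contributes its hydrophobicity at an angle that advances by a fixed step per residue (about 100° for an α-helix, and roughly 160° to 180° for a β-strand). The slide's formula did not survive extraction, so the version below is an assumption: it uses the Kyte-Doolittle hydropathy scale, which may differ from the scale used in the original, and it does not normalize by peptide length.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(seq, angle_deg=100.0, scale=KYTE_DOOLITTLE):
    """Magnitude of the hydrophobicity-weighted vector sum around the axis."""
    delta = math.radians(angle_deg)
    s = sum(scale[aa] * math.sin(i * delta) for i, aa in enumerate(seq))
    c = sum(scale[aa] * math.cos(i * delta) for i, aa in enumerate(seq))
    return math.hypot(s, c)

peptide = "NLAKMVVKTAEAILKD"  # helical peptide from the slide
print("helix periodicity (100 deg): ", hydrophobic_moment(peptide, 100.0))
print("strand periodicity (170 deg):", hydrophobic_moment(peptide, 170.0))
```

A peptide whose hydrophobic residues all fall on one face of the helix gives a large moment at 100°, while a sequence with no periodic pattern gives a value near zero.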

77 77 Window-based prediction For each position in a protein sequence: –...LSTAADMQGVVTDGMASGLDKD... Predict its secondary structure based on a local window: –...LSTAADMQGVVTDGMASGLDKD... Slide window along sequence: –...LSTAADMQGVVTDGMASGLDKD...
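The sliding-window idea can be sketched as a generator that yields, for each residue, the local stretch of sequence a predictor would see. The window width of 7 and the padding character are illustrative assumptions; the classifier applied to each window is not shown.

```python
def sliding_windows(seq, width=7, pad="-"):
    """Yield (position, residue, window) with the window centered on the residue.

    The ends of the sequence are padded so that every residue gets a full window.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a center")
    half = width // 2
    padded = pad * half + seq + pad * half
    for i in range(len(seq)):
        yield i, seq[i], padded[i:i + width]

for pos, residue, window in sliding_windows("LSTAADMQGVVTDGMASGLDKD"):
    print(pos, residue, window)
```

Each window would be fed to a classifier (for example the neural network described later) that outputs helix, strand, or loop for the central residue.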

78 78 Modeling structural correlations A R N D C Q E G H I L K M F P S T W Y V D...... W Y V N R A Pair-wise dependence: A R N D C Q E G H I L K M F P S T W Y V HELIX STRAND LOOP P(A|H)... P(A|E) P(A|L) Conditional independence models: * * *

79 79 PHD (Rost & Sander, 1993) Neural network based –2 levels: Sequence -> Structure Structure -> Structure –Uses multiple sequence alignment Amino acid frequencies Conservation weight –Post-processing by dynamic programming

80 80 2-level neural network system Accuracy: 70.2% (Multiple sequence alignment) ~63% (Single sequence)

81 81 Special case: helical transmembrane proteins Membrane proteins biologically important Difficult to determine experimentally Easier to predict –Constraints imposed by lipid bilayer –Strong hydrophobicity signal –Cytoplasmic residues positively charged –2-state bacteriorhodopsin Accuracy: 95% (Multiple alignment)

82 82 Bayesian protein structure prediction Probability model: Bayesian inference Predict Structure to maximize probability Model-based structure prediction –Probabilistic modeling of segments Hydrophobicity patterns Side chain interactions Helical capping

83 83 Stochastic models HMM models: –Asai et al. (1993), Stultz et al. (1993) –Relatively poor predictive accuracy Stochastic segment models: –Schmidler et al. (2000) –Significantly more accurate HHHLL

84 84 Probabilistic model Structure- and position- specific frequencies: Segment length priors: Conditional independence of inter-segment residues: Markovian dependence in segment types:

85 85 Example prediction: Cytochrome C5 (1cc5) True: Predicted:

86 86 Stochastic models for local and non-local dependencies Modeling correlated mutations for structure prediction –Local interactions Helical side-chains Helix capping –Non-local interactions β-sheets, coiled-coils, hp contacts, salt bridges, Cys-Cys

87 87 Example: prediction of β-strand contact map for 5pti Pairing and register of β-hairpin correctly predicted Predicted contacts: True contacts:

88 88 Dissimilar sequence, function, same fold Observations: 1. > 30% sequence identity yields >85% structure 2. < 5% sequence identity can still have same structure –convergent evolution? 3. Many different functions arise from same fold –e.g. TIM barrels Implication: far fewer folds than proteins

89 89 Fold recognition tasks 1. Given a sequence and a fold – do they match? Score function Algorithm for matching (“threading”) 2. Create a library of known folds Representative, non-redundant domains 3. Statistical evaluation of scores Which is “best”? Are any “significant”?

90 90 Threading: Contact potentials Profile methods miss interactions Empirical potentials: Probabilities estimated from experimental structures (GRF) Implicitly account for –Van der Waals, Electrostatics –Other? properties of native structures

91 91 Modern strategy for protein structure prediction Search for sequence homologs Matches sequence or family with known structure? (30% rule) Threading and homology modeling Fold recognition Ab initio methods Hit? Yes No Yes No (60%) analogy vs first principles

92 92 3D Structure Prediction

93 93 Challenges Representation –Atomic detail vs coarse-grained models Conformational sampling –Generating potential conformations Scoring –Evaluating quality using physical or statistical energy functions Global optimization Significance and predictive uncertainty

94 94 Fragment-based methods Rosetta (Baker et al) –Build up library of short protein fragments from PDB structures –Conformational sampling by MC fragment substitution –Coarse- and finer-grained energetics as search progresses

95 95 Evaluation of secondary structure prediction Large database of protein sequences: –Known structures X-ray crystallography, NMR “Gold-standard” assignment –(see Colloc’h et al ‘93) –Non-homologous < 25-30% identity Cross-validation
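Accuracy figures such as those quoted for PHD are usually per-residue accuracies over the three states H, E, and L. The sketch below computes that fraction for one protein, assuming the predicted and gold-standard strings are already aligned position by position; the function name is illustrative.

```python
def per_residue_accuracy(true_ss, pred_ss):
    """Fraction of positions where the predicted state (H/E/L) matches
    the gold-standard assignment."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("sequences must have the same length")
    if not true_ss:
        raise ValueError("sequences must be non-empty")
    correct = sum(t == p for t, p in zip(true_ss, pred_ss))
    return correct / len(true_ss)

print(per_residue_accuracy("HHHEEELL", "HHHEELLL"))  # 0.875
```

In a cross-validation study this score would be computed only on proteins held out from model fitting, and with no close homologs in the training set.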

96 96 CASP: Critical Assessment of protein Structure Prediction

97 97 References Schmidler SC, Liu JS, Brutlag DL. Stochastic Segment Interaction Models for Biological Sequence Analysis. Submitted to Journal of the American Statistical Association. Schmidler SC, Liu JS, Brutlag DL. Bayesian Modeling of Non-local Interactions in Protein Sequences: Prediction of β-Sheets. Submitted to Journal of Computational Biology. Schmidler SC, Liu JS, Brutlag DL (2000). Bayesian segmentation of protein secondary structure. Journal of Computational Biology, Vol 7, No 1/2, pp 233-248. Schmidler SC, Liu JS, Brutlag DL (1999). Bayesian protein structure prediction. Case Studies in Bayesian Statistics, Vol 5, 363-378. See http://www.stat.duke.edu/~scs/Publications.html

98 98 Protein folding simulations

99 99 Simulation of folding Physics models: –QM intractable –Classical mechanics: pair potentials –Van der Waals –Electrostatics –Covalent bond vibration –Hydrogen bonds, desolvation Molecular dynamics –Solve dynamical system (Newton's 2nd law: F = ma) ‘Monte Carlo’ method
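Solving F = ma numerically means stepping positions and velocities forward in small time increments. The sketch below uses velocity Verlet, a common integrator in molecular dynamics; the slides do not say which integrator is intended, so this is an assumed choice, shown for a single particle on a one-dimensional spring in place of a full force field.

```python
import math

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate m * x'' = force(x) and return the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory

# Harmonic oscillator with unit mass and spring constant: exact x(t) = cos(t).
traj = velocity_verlet(x=1.0, v=0.0, force=lambda x: -x, mass=1.0,
                       dt=0.01, n_steps=1000)
x_end, v_end = traj[-1]
print(x_end, math.cos(10.0))
```

For a protein, `force` would be the negative gradient of the pair potentials listed on the slide, evaluated for all atoms at once.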

100 100 Molecular dynamics How to choose Δt? Bond vibration –1-2 femtoseconds Time scales Recall –Protein folding occurs on the order of ms-sec –Some very small proteins may fold in 10-20 μs 1 ms = 10^-3 s, 1 μs = 10^-6 s, 1 ns = 10^-9 s, 1 ps = 10^-12 s, 1 fs = 10^-15 s
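The gap between the integration timestep and folding times is easiest to appreciate as a step count. The short calculation below uses integer femtoseconds to avoid floating-point rounding; the helper name `md_steps` is illustrative.

```python
FEMTOSECONDS_PER = {"s": 10**15, "ms": 10**12, "us": 10**9,
                    "ns": 10**6, "ps": 10**3, "fs": 1}

def md_steps(duration, unit, dt_fs=2):
    """Number of integration steps needed to simulate `duration` (an integer,
    in `unit`) with a timestep of `dt_fs` femtoseconds."""
    total_fs = duration * FEMTOSECONDS_PER[unit]
    return total_fs // dt_fs

print(md_steps(10, "us", dt_fs=2))  # 5000000000
print(md_steps(1, "ms", dt_fs=2))   # 500000000000
```

Even the fastest-folding proteins therefore need billions of force evaluations per trajectory, and millisecond folding needs hundreds of billions.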

101 101 Molecular dynamics

102 102 Simulation movie Helical peptide simulation HIV mper at lipid bilayer

103 103 Convergence analysis

104 104 Statistical Evaluation

105 105 Statistical Challenges Parameter estimation Evaluating likelihoods –“ensemble measurement” problem Detailed experimental data

106 106 References Schmidler SC, Cooke B. Preserving the Boltzmann Ensemble in Replica-Exchange Molecular Dynamics Simulations. Journal of Chemical Physics, (to appear). Cooke B, Schmidler SC (2008). Statistical Prediction and Molecular Dynamics Simulation. Biophysical Journal, (to appear). See http://www.stat.duke.edu/~scs/Publications.html

107 107 Overview What is proteomics? –Biological overview and questions of interest Structural proteomics –Structure comparison and alignment –Protein structure prediction –Protein folding simulations Functional proteomics –Mass spectrometry –Protein-protein interaction networks

108 108 Functional Proteomics

109 109 What is functional proteomics? Analogy to functional genomics. Large scale measurement of protein expression and identification of differential expression. Goal: to identify and characterize key functional proteins in different physiological or disease states.

110 110 Why functional proteomics? mRNA transcript levels are at best a very noisy measure of cell state. –Proteins and RNA have different half-lives –Post-translational modification critical in protein function, interaction, localization –Proteins are the real cellular machinery. Since mRNA levels correlate poorly with protein abundance, better to measure protein directly.

111 111 2D PAGE Dim 1: isoelectric focusing (IEF) –pH gradient –Applied current causes proteins to migrate to pI Dim 2: SDS-PAGE –Polyacrylamide gel electrophoresis –Migration according to size/molecular weight

112 112 Differential expression Stain with fluorescent dye Digitized imaging Spot finding –Segmentation Gel comparison –Registration –Differential intensities –Spots present/absent –Automation is hard (F. Seillier-Moiseiwitsch)

113 113 2D PAGE Many difficulties: –Requires substantial pre-processing which is hard to automate (each study different) –Hydrophobic (membrane) proteins don’t work –Spots may contain multiple proteins –Low concentration proteins not detected –Reproducibility Quantitative comparison of intensities difficult –Present/absent more reliable

114 114 Protein Identification Peptide Mass Fingerprinting (PMF) –Digestion: trypsin (protease) cleaves carboxyl side of Arg and Lys Mass spectrometry: –MALDI-TOF –ESI: electrospray ionization –MS/MS: tandem mass spec
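The digestion step can be simulated in a few lines. The sketch below cuts after every Lys (K) or Arg (R); the optional rule that trypsin does not cut when the next residue is proline is a commonly used convention that is not stated on the slide, so it is exposed as a switch. Missed cleavages are not modeled.

```python
def tryptic_digest(seq, skip_before_proline=True):
    """Split a protein sequence into peptides by cutting after K or R."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR":
            next_aa = seq[i + 1] if i + 1 < len(seq) else ""
            if skip_before_proline and next_aa == "P":
                continue                      # assumed rule: no cut before proline
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])          # C-terminal peptide
    return peptides

print(tryptic_digest("AAKGGRPLKDE"))  # ['AAK', 'GGRPLK', 'DE']
```

The list of peptide masses computed from such a digest is the theoretical fingerprint that an observed spectrum is compared against.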

115 115 MALDI-TOF: Matrix-Assisted Laser Desorption Ionization – Time of Flight Small organic compound (matrix) added to dilute macromolecules; dries into a crystalline deposit Short laser pulse excites matrix; energy causes desorption and ionization Ionized aerosol enters a vacuum and ions are accelerated by an electric field Flight time determined by mass/charge ratio (m/z) –MALDI produces singly-charged ions, so usually just mass Rapid – used for most high-throughput proteomics
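The dependence of flight time on m/z follows from energy conservation in an idealized linear instrument: an ion of charge ze accelerated through a potential V gains kinetic energy zeV, so its drift time over a length L is t = L * sqrt(m / (2 z e V)). The sketch below evaluates that relation; the 20 kV accelerating voltage and 1 m drift tube are hypothetical instrument parameters, and real instruments are calibrated empirically.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs
DALTON = 1.66053906660e-27            # kilograms

def flight_time(mass_da, charge=1, voltage=20e3, drift_length=1.0):
    """Idealized drift time in seconds for an ion of mass `mass_da` (Da)."""
    mass_kg = mass_da * DALTON
    kinetic_energy = charge * ELEMENTARY_CHARGE * voltage   # z * e * V
    velocity = math.sqrt(2.0 * kinetic_energy / mass_kg)
    return drift_length / velocity

for mass in (1000.0, 4000.0):
    print(mass, "Da:", flight_time(mass) * 1e6, "microseconds")
```

Because the time grows with the square root of m/z, quadrupling the mass only doubles the flight time, which is one reason mass calibration matters at high mass.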

116 116 MALDI-TOF Animation

117 117 Database identification Match fingerprint against genome databases Assumptions: –Reproducible fingerprint (all sites cleaved) not all proteases suitable; trypsin common –Sequence in database – organism genome sequenced (cDNA, ESTs can sometimes be used)
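Matching a fingerprint against a database entry comes down to comparing observed masses with the theoretical masses of that entry's peptides. The sketch below computes neutral monoisotopic peptide masses from standard residue masses and counts observed masses that fall within a relative tolerance; the 50 ppm tolerance and the simple hit count are illustrative assumptions, not the scoring used by any particular search engine. Note that MALDI typically reports singly protonated ions, so observed m/z values would need the proton mass subtracted first.

```python
# Monoisotopic residue masses in daltons (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def peptide_mass(peptide):
    """Neutral monoisotopic mass: residues plus one water for the termini."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def count_matches(observed_masses, peptides, tol_ppm=50.0):
    """Count observed masses within `tol_ppm` of some theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in peptides]
    hits = 0
    for m_obs in observed_masses:
        if any(abs(m_obs - m) / m * 1e6 <= tol_ppm for m in theoretical):
            hits += 1
    return hits

candidate_peptides = ["AAK", "GGRPLK", "DE"]
observed = [peptide_mass("AAK"), 5000.0]
print(count_matches(observed, candidate_peptides))  # 1
```

Leucine and isoleucine have identical masses, a small instance of the mass redundancy that makes identification from masses alone ambiguous.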

118 118 Example: ProFound Zhang and Chait (2000) Bayesian matching –Prior based on: organism taxonomy mass range other available experimental information –Likelihood based on fragment matches –http://prowl.rockefeller.edu/cgi-bin/ProFound

119 119 Example: ProFound ProFound: Zhang and Chait (2000), Anal. Chem. 72, 11 High resolution spectrum: 30kDa band from Sacch. cerevisiae

120 120 Identification of a two-protein mixture: close homologues Low resolution spectrum for human mitochondrial sample, searching all taxa ProFound Zhang and Chait (2000) Anal. Chem. 72, 11

121 121 Other PMF algorithms Online PMF searches –ProFound http://prowl.rockefeller.edu/cgi-bin/ProFound –MS-Fit: http://prospector.ucsf.edu/ucsfhtml4.0/msfit.htm –Mascot: http://www.matrixscience.com/home.html –Several at once: http://us.expasy.org/tools/peptident.html Databases –Sequenced organisms: http://www.ebi.ac.uk/ –ESTs http://www.ncbi.nlm.nih.gov/dbEST/

122 122 Whole proteome expression (well, not really) MALDI: SELDI: Surface-enhanced Laser Desorption Ionization –Protein “chips” – surface immobilization –Allows differential capture of proteins based on properties Monoclonal antibody arrays –Promising but hard to develop, primarily useful for known proteins

123 123 Analysis issues with MALDI/SELDI Mass calibration –One trick: autolysis products (protease fragments) Registration and normalization –sounds familiar? microarrays…

124 124 Registration between datasets Baggerly et al reanalysis of Petricoin et al Datasets 2 & 3

125 125 Discriminating features (?) Baggerly et al reanalysis of Petricoin et al Datasets 2 & 3

126 126 Discriminating features (?) Baggerly et al reanalysis of Petricoin et al Datasets 2 & 3 Note the low m/z < 500

127 127 Technology challenges Post-translational modification –Phosphorylation Tyr, Ser, Thr Difficult to analyze – suppresses ionization Membrane proteins –Hydrophobic, do not work in PAGE gels Quantitative expression levels –Isotope labelling

128 128 Statistical analysis challenges Peak finding –Whole sample spectra suppress most interesting proteins –Fractionation; statistical modeling Classification based on whole-sample spectrum Variable selection for biomarker identification Representation of spectra –Wavelets and other local bases –Non-parametric Bayes (Clyde & Wolpert) Quantitative abundance and differentiation

129 129 Bayesian biomarker identification

130 130 Goal: Whole-genome protein expression High-throughput protein expression Mass spectrometry – MALDI-TOF Peptide identification Abundance Collaboration: Tim Haystead lab, Mike Datto –Students: Casper Wu Keck project: Joe Nevins, Mike West

131 131 Mass spec data MALDI-TOF Time of flight determines m/z ratio Intensity a noisy indicator of (relative) abundance Vast majority of peptides lost in noise

132 132 Bioinformatics challenges Classification of spectra –Subject to many problems Noise structure (protocol), calibration, overfitting, interpretation Protein identification: –Peak location: spectrum → peptide masses Baseline correction, signal/noise ratio –Database search: peptide masses → proteins Mass redundancy

133 133 Challenges of whole-proteome expression Identification of large numbers of peptides –vs. a few key differences –Multiple peptides/protein needed Estimate abundance Protein ID under mass redundancy HPLC fractionation and recombination –Protocol design Requires detection of low signal/noise ratio

134 134 Mass spec data MALDI-TOF Time of flight determines m/z ratio Intensity a noisy indicator of (relative) abundance Vast majority of peptides lost in noise

135 135 Statistical detection Missing data problem –# peptides, locations “missing” –Estimate probabilities via Monte Carlo sampling, exact recursions Utilize natural isotope signature

136 136 Simulated data Peptide signature: + noise = Bayesian posterior Probabilities for peptide location:

137 137 Serum profiling data Ongoing work: Abundance estimation from multiple fractions Experimental validation

138 138 Conclusions Bioinformatics challenges different from existing uses of technology Requires careful: –Sample collection and preparation –Experimental protocol design, calibration, sample randomization (Datto serum) –Statistical analysis (baseline correction, noise filtering) –All this must precede complex multivariate algorithms for analyzing abundances/intensities Petricoin et al study –Baggerly et al review; others

139 139 References

140 140 Protein-protein interactions

141 141 Protein-protein interactions Another major emphasis of functional proteomics is the elucidation of protein interactions Often cast as the problem of measuring and reconstructing interaction networks. Techniques include Y2H, and computational methods: genome sequence analysis, structure- structure docking.

142 142 Predicted protein interaction network Eisenberg et al 1999 Genome sequence analysis to predict fusion events Combination of multiple methods

143 143 Protein-protein interactions: Yeast two hybrid (Y2H) Y2H: High-throughput screening of protein-protein interactions

144 144 Plasmid constructs for Y2H

145 145 Y2H Pros: –Amenable to high-throughput –Putative identification of many interactions Cons: –Highly noisy –Multiple high-throughput datasets have little overlap (D’haeseleer and Church, 2003)

146 146 References

147 147 Molecular Docking

148 148 Structural basis of protein-protein interactions Protein interaction data quite noisy Protein structure represents significant prior information about likelihood of interaction Suggests combining information Structural information more difficult to use: –Docking

149 149 Protein-protein interactions Protein-protein docking –Similar principles as protein-ligand docking, but harder (?) –Geometry + Energetics + Sampling/Search

150 150 Connolly surface (cont’d) Define surface by points where probe sphere ‘touches’ surfaces Sample contact surface and represent with points and surface normals Intelligent sampling for sparseness

151 151 Example: Connolly surface for HIV-I Protease active site Dock program, Kuntz et al, UCSF dock.compbio.ucsf.edu

152 152 Modeling the target site: Sphere generation Create complementary image of potential binding sites Generate spheres for every two surface points, do not intersect surface Keep approx. 1 sphere/atom Sphere clusters of interest Image of ligand geometry

153 153 Geometric hashing (cont’d) Same algorithm described earlier for protein structure alignment

154 154 Protein-protein docking

155 155 References

156 156 Summary Many important problems with complex data: shapes, spectra, images, networks Many new technologies generating large amounts of noisy data Many opportunities for statisticians!

157 157 Journals Proteins: Structure, Function, and Bioinformatics Protein Science Journal of Molecular Biology Biophysical Journal Journal of Computational Biology Bioinformatics PLoS Computational Biology Proteomics Journal of Proteome Research

158 158 Instructor Scott C. Schmidler Assistant Professor Department of Statistical Science Program in Computational Biology & Bioinformatics Program in Structural Biology & Biophysics Duke University 223 Old Chemistry Building schmidler@stat.duke.edu Box 90251 www.stat.duke.edu/~scs Duke University Ph: (919) 684-8064 Durham, NC 27708-0251 Fax: (919) 684-8594

