Statistical Challenges in Proteomics Shortcourse Joint Statistical Meetings Denver, CO 2008 Scott C. Schmidler Department of Statistical Science Duke University.



2 Instructor Scott C. Schmidler Assistant Professor Department of Statistical Science Program in Computational Biology & Bioinformatics Program in Structural Biology & Biophysics Duke University 223 Old Chemistry Building Box Duke University Ph: (919) Durham, NC Fax: (919)

3 Abstract Introduction to principal aims, technologies, and statistical issues arising in structural and functional proteomics studies. Overview of experimental data sources: X-ray, NMR, mass spectrometry (MALDI, SELDI, MS/MS), peptide arrays. Statistical problems in structural proteomics: molecular comparison and database search, classification of structures, structure-based function prediction. Statistical problems in functional proteomics: fragment identification, normalization and registration of spectra, peak finding, sample comparison, classification and marker identification.

4 Overview What is proteomics? –Biological overview and questions of interest Structural proteomics –Structure comparison and alignment –Protein structure prediction –Protein folding simulations Functional proteomics –Mass spectrometry –Protein-protein interaction networks

5 Timetable 8:30-10:15: –Overview of proteomics –Structure and function of proteins –Alignment & shape analysis 10:15-10:30: Break 10:30-12:30: –Protein structure prediction –Protein folding simulations 12:30-2:00: Lunch 2:00-3:15: –Mass spectrometry methods and data analysis 3:15-3:30: Break 3:30-5:00: –Protein-protein interaction networks –Docking & drug design

6 What is proteomics? The proteome is the entire complement of proteins encoded by a genome. It is distinguished from the genome, the ribonome, the metabolome, etc. by the focus on proteins.

7 Molecular biology review ‘Central Dogma’... NWVLSTAADM AAC UGG GUC CUA UCG ACA GCA GCC... DNA sequence Protein structure
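The codons on this slide can be checked against the standard genetic code. A minimal Python sketch follows; the function name `translate` and the compact table layout are illustrative choices, not taken from the slides.

```python
# Standard genetic code, with codons ordered by bases U, C, A, G at each position.
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W"   # first base U
         "LLLLPPPPHHQQRRRR"   # first base C
         "IIIMTTTTNNKKSSRR"   # first base A
         "VVVVAAAADDEEGGGG")  # first base G
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(rna):
    """Translate an mRNA string codon by codon; '*' marks a stop codon."""
    rna = rna.upper().replace("T", "U").replace(" ", "")
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[c] for c in codons)


print(translate("AAC UGG GUC CUA UCG ACA GCA GCC"))  # NWVLSTAA
```

The eight codons shown translate to NWVLSTAA, the start of the peptide NWVLSTAADM on the slide.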

8 Molecular biology of the cell

9 What is proteomics? Proteomics is the study of proteomes. Proteomics is particularly concerned with large-scale, high-throughput studies of protein function, expression, and interactions. Often it is assumed to be synonymous with a particular technology, such as mass spectrometry, 2D PAGE, or Y2H.

10 Why proteomics? Goals: Cataloging and characterization of protein function and interactions, toward an integrated view of cellular processes. Many different technologies bear on these questions. All produce exciting data with interesting statistical challenges. We will examine a few. We will distinguish structural and functional proteomics.

11 Journals Proteins: Structure, Function, and Bioinformatics Protein Science Journal of Molecular Biology Biophysical Journal Journal of Computational Biology Bioinformatics PLoS Computational Biology Proteomics Journal of Proteome Research

12 Proteins

13 Proteins are linear polymers

14 Amino Acid Side Chains Vary in charge, polarity and hydrophobicity, volume, and hydrogen bonding potential

15 Protein folding PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVT NKGRQKVVPLTNTTNQKTELQAIYLALQDSGLEVNIVTDSQYALGIIQAQP DKSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVKQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQ Sequence of 984 amino acids: Compact 3-dimensional structure: (7404 atoms) HIV reverse transcriptase

16 The role(s) of protein structure... In living organisms: Catalytic, structural, regulatory, signaling, transport Cellular development, differentiation, metabolism, and replication In molecular medicine: Understanding function and mechanism Inherited and infectious disease –Sickle cell anemia Development of novel therapeutics Example: Hemoglobin

17 Regulation of gene expression Helix-turn-helix Structural motif in DNA-binding proteins, transcription regulation.

18 DNA Helicase Unwinds DNA duplex to allow polymerase access for replication/ transcription.

19 Immunoglobulins Antibodies: Hyper-variable regions form antigen binding sites. Assembly yields tremendous structural diversity.

20 Membrane channels

21 Overview What is proteomics? –Biological overview and questions of interest Structural proteomics –Structure comparison and alignment –Protein structure prediction –Protein folding simulations Functional proteomics –Mass spectrometry –Protein-protein interaction networks

22 Structural Proteomics

23 Structural data X-ray crystallography –Protein crystallization and X-ray diffraction –Can be very high-resolution –Labor intensive; crystal may distort structure NMR spectroscopy –Lower resolution – observe ensembles –Dynamical behavior and solution conformation observable

24 Overview of protein X-ray crystallography

25 Overview of protein NMR spectroscopy

26 Reasons for studying structure Determining function(s) Understanding mechanisms Identifying interactions Understanding folding Evolutionary comparison Protein engineering Rational drug design

27 Protein structure resources

28 The Protein Data Bank (PDB)

29 Protein Data Bank (PDB) January 27, 2009: 51,977 released atomic coordinate entries Experimental Technique Diffraction and other NMR Theoretical modeling Counts by molecule Type Proteins, peptides, and viruses Nucleic acids Protein/nucleic acid complexes Carbohydrates

30 PDB statistics

31 Structure of PDB files Example: 1hho oxyhemoglobin –Header –Sequence, secondary structure information –Atomic positions –Heterogens and connectivity information

32 Viewing structures Rasmol – Chime (browser plug-in) – MAGE – kinemage.biochem.duke.edu/ Others – www.rcsb.org/pdb/software-list.html - Graphics

33 Viewing protein structures Rasmol demo

34 Statistical methods for protein structure Secondary structure prediction Threading and fold recognition Homology modeling Backbone dihedral angle (phi/psi) distributions Loop modeling (indels) Side chain rotamer libraries Active site recognition Protein folding theory (stat mech)

35 Statistical problems in structural biology Estimation in random fields Inverse problems Nonparametrics Hard Monte Carlo optimization/integration HMMs; classification Statistical mechanics Complex spatial models and shape analysis

36 References

37 Protein structure comparison and analysis

38 Structural genomics “High-throughput”, high-resolution structure determination 9 pilot sites in operation –NIH: ~ $185M first 4 years (2001-4) –Estimate ~$75M/yr in production phase

39 Protein Data Bank growth

40 Growth in new folds Many new structures related to existing ones

41 Pairwise structure alignment How similar? 4hhb_A: Human deoxyhemoglobin A 5mbn: Sperm whale deoxymyoglobin ?

42 Goals of structure comparison –Identifying homology Determining function(s) and mechanism Database search, clustering –Studying variability Interpreting SNPs –Inherited disease –Drug response Function and mechanism –Evolutionary distance –Visualization and 3D statistics Holm & Sander 1996

43 Protein structure classifications Large hierarchical classifications –SCOP – (S)tructural (C)lassification (O)f (P)roteins Murzin et al. (1995) J. Mol. Biol. Hand curated scop.berkeley.edu Classes, families, topologies, folds –CATH (Thornton et al.) (C)lass, (A)rchitecture, (T)opology (H)omologous superfamily Algorithmic (except Architecture) CATH hierarchy

44 Pairwise protein structure comparison Steps: –Find corresponding positions Hard: Iterative dynamic programming or heuristic methods –Rotate/translate for optimal match Easy: Least-squares computations (SVD) –Statistical significance

45 Protein landmarks Cα’s Others: –Side chain centroids –Active site residues/atoms –Electrostatic or solvent accessible surface 4hhb_A

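46 Size-and-shape matching Partial Procrustes distance: Least-squares problem with rotation R and translation T as nuisance parameters. Solution is obtained by centering X and Y and setting R from the SVD of the cross-product matrix of the centered coordinates (or by using quaternions).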
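The least-squares superposition described here can be sketched with NumPy. This is an illustrative sketch of the standard SVD solution, assuming the two landmark sets are already in one-to-one correspondence; the function name `partial_procrustes` and the row-vector convention (fitted = Y R + t) are choices made for this example.

```python
import numpy as np


def partial_procrustes(X, Y):
    """Superimpose Y onto X (both n x 3, rows in correspondence).

    Returns rotation R, translation t, and the RMSD after fitting,
    minimizing ||X - (Y @ R + t)|| over proper rotations R.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Centering removes the translation nuisance parameter.
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # SVD of the 3 x 3 cross-product matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(Yc.T @ Xc)
    # Flip the last axis if needed so that det(R) = +1 (no reflection).
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    t = x_mean - y_mean @ R
    fitted = Y @ R + t
    rmsd = np.sqrt(np.mean(np.sum((X - fitted) ** 2, axis=1)))
    return R, t, rmsd


# Usage: recover a known rigid-body motion applied to random landmarks.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
a = 0.7
R0 = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a), np.cos(a), 0.0],
               [0.0, 0.0, 1.0]])
Y = X @ R0 + np.array([1.0, -2.0, 0.5])
R, t, rmsd = partial_procrustes(X, Y)
print(rmsd < 1e-8)
```

For real structures the rows would be matched Cα coordinates; finding that correspondence is the hard step and is not addressed by this sketch.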

47 Partial Procrustes solution Optimal rotation and translation 4hhb_A 5mbn

48 Multiple structure superposition Iterative pairwise methods MPOSE 1 –Affine-invariant statistical model of family: –Estimate model by least-squares computation –Yields non-iterative multiple superposition –Available at dna.stanford.edu where 1: Wu, Schmidler, Hastie, Brutlag (1998) J. Comp. Biol

49 Example: globin family 7 globins: Human deoxyhemoglobin α and β (4hhbA/B) Sperm whale deoxymyoglobin (5mbn) Larval deoxyhemoglobin (1ecd) from Chironomus thummi Sea lamprey cyanohemoglobin (2lhb) Yellow lupin root nodule cyanoleghemoglobin (2lh3) from Lupinus luteus Annelid worm deoxyhemoglobin (2hbg) from Glycera dibranchiata

50 Structural alignment of globins

51 Structural variability in globins Note: E,G helices conserved

52 Shape variability PCA of Procrustes residuals (Approximate tangent space coords) PC 1, PC 2

53 Bayesian shape matching and alignment

54 Bayesian shape matching Correspondence between landmarks unknown –Shape analysis assumes labels –Ignores (significant) uncertainty in matching Define an alignment as a pair: –Matching M –Registration T(·; θ) Rigid body: Affine: for

55 Matching M is the adjacency matrix of a bipartite graph

56 Bayesian shape matching Obtain posterior distribution where Exponential number alignments –Can find MAP –Draw samples by MCMC

57 Likelihood Statistical model Likelihood –Alternative: shape distribution (Profile likelihood)

58 Distributions on shape Multivariate distributions on figure space (R np ) –R, u, are nuisance parameters –conditional and marginal approaches Distributions on general shape space –Tangent space approximations allow MVA

59 Gibbs sampling Draw from the conditional distributions of M and θ in turn Sampling M –For some priors, dynamic programming algorithms exist Liu & Lawrence 1999, Schmidler 2003 Sampling θ –Easy for some priors –Alternative: integrate out

60 Priors on alignments: gap penalty method Order-preserving matching: Sequence alignment – gap penalties Prior distribution See e.g. (Liu & Lawrence 1999) for sequence alignment

61 Bayesian flexible matching

62 Transformations with changepoints Consider as a sequence of transformations Introduce changepoints j such that: Multiple changepoints: with

63 Bayesian flexible alignment Changepoint likelihood (shape version): Posterior distribution: MAP matching or sampling (Schmidler 2000)

64 Rigid alignment - Calmodulin RMSD = A

65 Flexible alignment - Calmodulin RMSD = 0.7 Å

66 Detection of hinge points, flexible regions and disorder Change-point analysis –Bayesian approach: Calmodulin Posterior distn of changepoint location

67 Example: TIM barrels Identification of flexible loop at active site: Posterior for k=2 changepoints

68 Predicting functional sites Multiple structures of related function –Enzyme active sites Multiple structure alignment Fit statistical models –Spherical or lattice statistics –A 3D “motif” –Scan new structures for predicted sites Sensitivity, specificity From Wei et al. 1998

69 References Schmidler SC. Bayesian Flexible Shape Matching With Applications to Structural Bioinformatics. Submitted to Journal of the American Statistical Association. Rodriguez A, Schmidler SC. Bayesian Protein Structure Alignment. Submitted to Annals of Applied Statistics. Wang R, Schmidler SC. Bayesian Multiple and Flexible Protein Structure Alignment. In preparation for Journal of Computational Biology. Schmidler SC (2006). Fast Bayesian Shape Matching Using Geometric Algorithms (with discussion). Bayesian Statistics 8, Eds. J.M. Bernardo, S. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith and M. West, Oxford University Press, pp Wu TD, Schmidler SC, Hastie T, Brutlag DL (1998). Regression analysis of multiple protein structures. Journal of Computational Biology, Vol 5, No 3, pp See

70 Protein structure prediction

71 Sequence vs. structure Computational solutions sorely needed

72 Protein structure prediction PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVT NKGRQKVVPLTNTTNQKTELQAIYLALQDSGLEVNIVTDSQYALGIIQAQP DKSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVKQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQ Sequence of 984 amino acids: 3D coordinates of 7404 atoms: HIV reverse transcriptase

73 Protein Architecture Levels of protein structure: –Primary –Secondary –(Super-secondary) –Tertiary –Quaternary

74 Secondary structure prediction for protein folding PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKE GKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFW EVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAFTI PSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFKKQNP DIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLLRWGLTTPDKK HQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKL NWASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAENRE ILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTG KYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKE TWETWWTEYWQATWIPEWEFVNTPPLVKLWYQLEKEPIVGAETF YVDGAANRETKLGKAGYVTNKGRQKVVPLTNTTNQKTELQAIYL ALQDSGLEVNIVTDSQYALGIIQAQPDKSESELVNQIIEQLIKK EKVYLAWVPAHKGIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKE GKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFW EVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAFTI PSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFKKQNP DIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLLRWGLTTPDKK HQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKL NWASQIYPGIKVKQLCKLLRGTKALTEVIPLTEEAELELAENRE ILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTG KYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKE TWETWWTEYWQATWIPEWEFVNTPPLVKLWYQ Sequence of amino acids: Predict structural segments: Goal: Recover 3D coords Given a protein sequence: NWVLSTAADMQGVVTDGMASGLDKD... Predict a secondary structure sequence: LLEEEELLLLHHHHHHHHHHLHHHL... The secondary structure prediction problem: H = α-helix E = Extended β-strand L = Loop/coil

75 Local effects in protein folding: Amino acid propensities Secondary structure propensities: Position-specific propensities:

76 Local interactions Intra-segment correlations, induced by –Side chain interactions –Native environment E.g. hydrophobicity patterns: Sequence of helical peptide: NLAKMVVKTAEAILKD Amphipathic α-helices: Amphipathic β-strands: Hydrophobic moment:
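The hydrophobic moment can be sketched as the length of the vector sum of residue hydrophobicities placed at a fixed angular step around the segment axis (Eisenberg's definition). The sketch below assumes the Kyte-Doolittle hydropathy scale and a per-residue normalization, neither of which is specified on the slide; an angle of 100° per residue corresponds to an α-helix, and roughly 160-180° is typical for a β-strand.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(seq, delta_deg=100.0):
    """Per-residue hydrophobic moment for angular step delta_deg."""
    delta = math.radians(delta_deg)
    s = sum(KD[a] * math.sin(i * delta) for i, a in enumerate(seq))
    c = sum(KD[a] * math.cos(i * delta) for i, a in enumerate(seq))
    return math.hypot(s, c) / len(seq)


# Usage: compare helical and strand periodicity for the slide's peptide.
peptide = "NLAKMVVKTAEAILKD"
print(hydrophobic_moment(peptide, 100.0))
print(hydrophobic_moment(peptide, 170.0))
```

A segment whose moment is larger at 100° than at strand-like angles has its hydrophobic residues clustered on one face of a helix, which is the amphipathic pattern referred to above.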

77 Window-based prediction For each position in a protein sequence: –...LSTAADMQGVVTDGMASGLDKD... Predict its secondary structure based on a local window: –...LSTAADMQGVVTDGMASGLDKD... Slide window along sequence: –...LSTAADMQGVVTDGMASGLDKD...
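The sliding window can be sketched in a few lines of Python. The half-width of 6 (a 13-residue window) and the padding character are illustrative assumptions; a classifier would then map each window to H, E, or L for the central residue.

```python
def windows(seq, half_width=6, pad="-"):
    """Yield (position, central residue, window) for every position in seq."""
    padded = pad * half_width + seq + pad * half_width
    for i, residue in enumerate(seq):
        yield i, residue, padded[i:i + 2 * half_width + 1]


# Usage on the sequence fragment from the slide.
for pos, res, win in windows("LSTAADMQGVVTDGMASGLDKD"):
    print(pos, res, win)
```

Padding the ends keeps every window the same length, so that terminal residues can be treated like any other position.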

78 Modeling structural correlations Conditional independence models: class-conditional amino acid probabilities P(A|H), P(A|E), P(A|L), ... for HELIX, STRAND, LOOP over the 20 amino acids (A R N D C Q E G H I L K M F P S T W Y V) Pair-wise dependence: [figure: amino acid by amino acid dependence matrix]

79 PHD (Rost & Sander, 1993) Neural network based –2 levels: Sequence -> Structure Structure -> Structure –Uses multiple sequence alignment Amino acid frequencies Conservation weight –Post-processing by dynamic programming

80 2-level neural network system Accuracy: 70.2% (Multiple sequence alignment) ~63% (Single sequence)

81 Special case: helical transmembrane proteins Membrane proteins biologically important Difficult to determine experimentally Easier to predict –Constraints imposed by lipid bilayer –Strong hydrophobicity signal –Cytoplasmic residues positively charged –2-state bacteriorhodopsin Accuracy: 95% (Multiple alignment)

82 Bayesian protein structure prediction Probability model: Bayesian inference Predict Structure to maximize probability Model-based structure prediction –Probabilistic modeling of segments Hydrophobicity patterns Side chain interactions Helical capping

83 Stochastic models HMM models: –Asai et al. (1993), Stultz et al. (1993) –Relatively poor predictive accuracy Stochastic segment models: –Schmidler et al. (2000) –Significantly more accurate

84 Probabilistic model Structure- and position- specific frequencies: Segment length priors: Conditional independence of inter-segment residues: Markovian dependence in segment types:
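The formulas for the four ingredients listed here did not survive in the transcript. One way to write a segment model of this kind is sketched below; the notation is an assumption made for illustration (residues R_1..R_n, m segments with end positions S_1..S_m and types T_j in {H, E, L}, S_0 = 0) and may differ from the original slide.

```latex
% Joint model: segmentation prior times segment likelihoods
P(R, S, T) = P(S, T)\,\prod_{j=1}^{m}
    P\big(R_{[S_{j-1}+1\,:\,S_j]} \mid S_{j-1}, S_j, T_j\big)

% Markovian dependence in segment types and segment length priors
P(S, T) = \prod_{j=1}^{m} P(T_j \mid T_{j-1})\, P(S_j \mid S_{j-1}, T_j)

% Simplest segment likelihood: structure- and position-specific frequencies
P\big(R_{[S_{j-1}+1\,:\,S_j]} \mid S_{j-1}, S_j, T_j\big)
    = \prod_{i=S_{j-1}+1}^{S_j} P\big(R_i \mid T_j,\; i - S_{j-1}\big)
```

The product over j expresses conditional independence of residues in different segments given the segmentation; dependence within a segment can be added by replacing the last factorization with a richer model.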

85 Example prediction: Cytochrome C5 (1cc5) True: Predicted:

86 Stochastic models for local and non-local dependencies Modeling correlated mutations for structure prediction –Local interactions Helical side-chains Helix capping –Non-local interactions β-sheets, coiled-coils, hp contacts, salt bridges, Cys-Cys

87 Example: prediction of β-strand contact map for 5pti Pairing and register of β-hairpin correctly predicted Predicted contacts: True contacts:

88 Dissimilar sequence, function, same fold Observations: 1. >30% sequence identity yields >85% structure 2. <5% sequence identity can still have same structure –convergent evolution? 3. Many different functions arise from same fold –e.g. TIM barrels Implication: far fewer folds than proteins

89 Fold recognition tasks 1. Given a sequence and a fold – do they match? Score function Algorithm for matching (“threading”) 2. Create a library of known folds Representative, non-redundant domains 3. Statistical evaluation of scores Which is “best”? Are any “significant”?

90 Threading: Contact potentials Profile methods miss interactions Empirical potentials: Probabilities estimated from experimental structures (GRF) Implicitly account for –Van der Waals, Electrostatics –Other? properties of native structures

91 Modern strategy for protein structure prediction Search for sequence homologs. Matches sequence or family with known structure? (30% rule) Yes: threading and homology modeling. No: fold recognition. Hit? Yes: threading and homology modeling. No (60%): ab initio methods. Analogy vs first principles

92 3D Structure Prediction

93 Challenges Representation –Atomic detail vs coarse-grained models Conformational sampling –Generating potential conformations Scoring –Evaluating quality using physical or statistical energy functions Global optimization Significance and predictive uncertainty

94 Fragment-based methods Rosetta (Baker et al) –Build up library of short protein fragments from PDB structures –Conformational sampling by MC fragment substitution –Coarse- and finer-grained energetics as search progresses

95 Evaluation of secondary structure prediction Large database of protein sequences: –Known structures X-ray crystallography, NMR “Gold-standard” assignment –(see Colloc’h et al ‘93) –Non-homologous < 25-30% identity Cross-validation
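Accuracies such as those quoted for PHD are per-residue fractions of correctly predicted states. A minimal sketch of that score follows, assuming three states H/E/L and a gold-standard assignment given as a string; the function name and the toy strings are illustrative, not data from the slides.

```python
def q3_accuracy(true_ss, pred_ss):
    """Fraction of residues whose predicted state matches the true state."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("sequences must have equal length")
    if not true_ss:
        raise ValueError("sequences must be non-empty")
    correct = sum(t == p for t, p in zip(true_ss, pred_ss))
    return correct / len(true_ss)


# Toy example: one of eight residues is mispredicted.
print(q3_accuracy("HHHEELLL", "HHHEELLH"))  # 0.875
```

In a cross-validated evaluation this score would be computed only on proteins that are non-homologous to every training protein.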

96 CASP Critical Assessment of Structure Prediction

97 References Schmidler SC, Liu JS, Brutlag DL. Stochastic Segment Interaction Models for Biological Sequence Analysis. Submitted to Journal of the American Statistical Association. Schmidler SC, Liu JS, Brutlag DL. Bayesian Modeling of Non-local Interactions in Protein Sequences: Prediction of β-Sheets. Submitted to Journal of Computational Biology. Schmidler SC, Liu JS, Brutlag DL (2000). Bayesian segmentation of protein secondary structure. Journal of Computational Biology, Vol 7, No 1/2, pp Schmidler SC, Liu JS, Brutlag DL (1999). Bayesian protein structure prediction. Case Studies in Bayesian Statistics, Vol 5, See

98 Protein folding simulations

99 Simulation of folding Physics models: –QM intractable –Classical mechanics: pair potentials –Van der Waals –Electrostatics –Covalent bond vibration –Hydrogen bonds, desolvation Molecular dynamics –Solve dynamical system (2nd law: F=ma) ‘Monte Carlo’ method

100 Molecular dynamics How to choose Δt? Bond vibration –1-2 femtoseconds Time scales Recall –Protein folding occurs on the order of ms-sec –Some very small proteins may fold in μsec 1 ms = 10^-3 s, 1 μs = 10^-6 s, 1 ns = 10^-9 s, 1 ps = 10^-12 s, 1 fs = 10^-15 s
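Why the time step must resolve the fastest motion can be illustrated on a toy system. The sketch below integrates F = ma for a single harmonic "bond" with the velocity Verlet scheme, a common MD integrator chosen here for illustration; the slides do not name an integrator, and the unit mass and stiffness are arbitrary.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate F = m a for one particle in one dimension."""
    a = force(x) / mass
    traj = [(x, v)]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt
        a_new = force(x) / mass
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
        traj.append((x, v))
    return traj


K, M = 1.0, 1.0  # toy spring constant and mass (oscillation period 2*pi)


def spring(x):
    return -K * x


def energy(x, v):
    return 0.5 * M * v * v + 0.5 * K * x * x


def max_energy_error(dt, n_steps=200):
    traj = velocity_verlet(1.0, 0.0, spring, M, dt, n_steps)
    return max(abs(energy(x, v) - 0.5) for x, v in traj)


for dt in (0.01, 0.1, 1.0, 2.5):
    print(f"dt={dt}: max energy error {max_energy_error(dt):.3g}")

# Steps needed to cover 1 ms of folding at a 2 fs time step.
print(f"{1e-3 / 2e-15:.1e} steps")
```

Small steps conserve energy well, while a step longer than about a third of the oscillation period makes the integration blow up. With bond vibrations forcing steps of 1-2 fs, a millisecond of folding needs on the order of 10^11 to 10^12 steps.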

101 Molecular dynamics

102 Simulation movie Helical peptide simulation HIV mper at lipid bilayer

103 Convergence analysis

104 Statistical Evaluation

105 Statistical Challenges Parameter estimation Evaluating likelihoods –“ensemble measurement” problem Detailed experimental data

106 References Schmidler SC, Cooke B. Preserving the Boltzmann Ensemble in Replica-Exchange Molecular Dynamics Simulations. Journal of Chemical Physics, (to appear). Cooke B, Schmidler SC (2008). Statistical Prediction and Molecular Dynamics Simulation. Biophysical Journal, (to appear). See

107 Overview What is proteomics? –Biological overview and questions of interest Structural proteomics –Structure comparison and alignment –Protein structure prediction –Protein folding simulations Functional proteomics –Mass spectrometry –Protein-protein interaction networks

108 Functional Proteomics

109 What is functional proteomics? Analogy to functional genomics. Large scale measurement of protein expression and identification of differential expression. Goal: to identify and characterize key functional proteins in different physiological or disease states.

110 Why functional proteomics? mRNA transcript levels are at best a very noisy measure of cell state. –Proteins and RNA have different half-lives –Post-translational modification critical in protein function, interaction, localization –Proteins are the real cellular machinery. Since mRNA levels correlate poorly with protein abundance, better to measure protein directly.

111 2D PAGE Dim 1: isoelectric focusing (IEF) –pH gradient –Applied current causes proteins to migrate to pI Dim 2: SDS-PAGE –Polyacrylamide gel electrophoresis –Migration according to size/molecular weight

112 Differential expression Stain with fluorescent dye Digitized imaging Spot finding –Segmentation Gel comparison –Registration –Differential intensities –Spots present/absent –Automation is hard (F. Seillier-Moiseiwitsch)

113 2D PAGE Many difficulties: –Requires substantial pre-processing which is hard to automate (each study different) –Hydrophobic (membrane) proteins don’t work –Spots may contain multiple proteins –Low concentration proteins not detected –Reproducibility Quantitative comparison of intensities difficult –Present/absent more reliable

114 Protein Identification Peptide Mass Fingerprinting (PMF) –Digestion: trypsin (protease) cleaves on the carboxyl side of Arg and Lys Mass spectrometry: –MALDI-TOF –ESI: electrospray ionization –MS/MS: tandem mass spec
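The digestion step and the resulting mass fingerprint can be sketched in Python. This is a simplified illustration: it cleaves after every K or R with no missed cleavages, ignores the common exception for a following proline, and uses standard monoisotopic residue masses; the function names are hypothetical.

```python
import re

# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free termini
PROTON = 1.00728   # singly protonated ion, as typically seen in MALDI


def tryptic_peptides(protein):
    """Split after every Lys (K) or Arg (R); idealized complete digestion."""
    return re.findall(r"[^KR]*[KR]|[^KR]+", protein)


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral peptide."""
    return sum(MONO[a] for a in peptide) + WATER


# Usage on the first residues of the HIV reverse transcriptase sequence.
for pep in tryptic_peptides("PISPIETVPVKLKPGMDGPKVKQWPLTEEK"):
    print(f"{pep:12s} [M+H]+ = {peptide_mass(pep) + PROTON:.4f}")
```

The list of [M+H]+ values is the theoretical fingerprint that would be compared against the observed peak list.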

115 MALDI-TOF Matrix-Assisted Laser Desorption Ionization – Time of Flight Small organic compound (matrix) added to dilute macromolecules; dries into a crystalline deposit Short laser pulse excites matrix, energy causes desorption and ionization Ionized aerosol enters a vacuum and ions are accelerated by an electric field Flight time determined by mass/charge ratio (m/z) –MALDI produces singly-charged ions, so usually just mass Rapid – used for most high-throughput proteomics
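The dependence of flight time on m/z follows from energy conservation in an idealized linear instrument: z e V = (1/2) m v^2, so t = L sqrt(m / (2 z e V)). The sketch below assumes a 20 kV accelerating voltage and a 1 m drift tube purely for illustration; real instruments add delayed extraction, reflectrons, and empirical calibration.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, coulombs
DALTON = 1.66053906660e-27   # atomic mass unit, kilograms


def flight_time(mz, voltage=20e3, length=1.0):
    """Ideal flight time in seconds for an ion of given m/z (Da per charge)."""
    return length * math.sqrt(mz * DALTON / (2.0 * E_CHARGE * voltage))


for mz in (1000.0, 2000.0, 4000.0):
    print(f"m/z {mz:6.0f}: {flight_time(mz) * 1e6:.2f} microseconds")
```

Because t grows with the square root of m/z, quadrupling the mass only doubles the flight time, and mass calibration amounts to fitting this square-root relationship to peaks of known mass.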

116 MALDI-TOF Animation

117 Database identification Match fingerprint against genome databases Assumptions: –Reproducible fingerprint (all sites cleaved) not all proteases suitable; trypsin common –Sequence in database – organism genome sequenced (cDNA, ESTs can sometimes be used)
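Matching a fingerprint against a database can be sketched as counting observed peaks that fall within a mass tolerance of each candidate's theoretical peptide masses. This simple counting score is only an illustration and is not the Bayesian score used by tools such as ProFound; the candidate names, masses, and the 50 ppm tolerance below are made up for the example.

```python
def count_matches(observed, theoretical, tol_ppm=50.0):
    """Number of observed masses within tol_ppm of some theoretical mass."""
    hits = 0
    for obs in observed:
        if any(abs(obs - th) <= th * tol_ppm * 1e-6 for th in theoretical):
            hits += 1
    return hits


def rank_candidates(observed, database, tol_ppm=50.0):
    """Rank database entries by number of matched peaks, best first."""
    scores = {name: count_matches(observed, masses, tol_ppm)
              for name, masses in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy database of theoretical peptide masses (hypothetical values).
database = {
    "protA": [500.0, 1000.0, 1500.0],
    "protB": [500.0, 1200.0, 1800.0],
}
observed = [500.01, 1000.02, 1500.0, 900.0]
print(rank_candidates(observed, database))
```

Raw match counts favor large proteins, which produce many peptides; that is one reason probabilistic scoring with priors on mass range and taxonomy is preferred.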

118 Example: ProFound Zhang and Chait (2000) Bayesian matching –Prior based on: organism taxonomy mass range other available experimental information –Likelihood based on fragment matches –

119 Example: ProFound ProFound: Zhang and Chait (2000), Anal. Chem. 72, 11 High resolution spectrum: 30kDa band from Sacch. cerevisiae

120 Identification of a two-protein mixture: close homologues Low resolution spectrum for human mitochondrial sample, searching all taxa ProFound Zhang and Chait (2000) Anal. Chem. 72, 11

121 Other PMF algorithms Online PMF searches –ProFound –MS-Fit: –Mascot: –Several at once: Databases –Sequenced organisms: –ESTs

122 Whole proteome expression (well, not really) MALDI: SELDI: Surface-enhanced Laser Desorption Ionization –Protein “chips” – surface immobilization –Allows differential capture of proteins based on properties Monoclonal antibody arrays –Promising but hard to develop, primarily useful for known proteins

123 Analysis issues with MALDI/SELDI Mass calibration –One trick: autolysis products (protease fragments) Registration and normalization –sounds familiar? microarrays…
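Registration and normalization can be illustrated on intensity vectors sampled on a common grid. The sketch below assumes a deliberately simple model: registration is a single integer shift chosen to maximize the cross-correlation with a reference spectrum, and normalization divides by total ion current. Real mass calibration acts on the m/z axis and is generally nonlinear, so this is a toy; all function names are hypothetical.

```python
def best_shift(reference, spectrum, max_shift=5):
    """Integer shift s maximizing sum_i reference[i] * spectrum[i + s]."""
    n = len(reference)

    def score(s):
        return sum(
            reference[i] * spectrum[i + s]
            for i in range(n)
            if 0 <= i + s < len(spectrum)
        )

    return max(range(-max_shift, max_shift + 1), key=score)


def apply_shift(spectrum, s):
    """Resample so that output[i] = spectrum[i + s], zero-padded at the edges."""
    n = len(spectrum)
    return [spectrum[i + s] if 0 <= i + s < n else 0.0 for i in range(n)]


def tic_normalize(spectrum):
    """Scale intensities so they sum to one (total ion current normalization)."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum has no positive total intensity")
    return [x / total for x in spectrum]


if __name__ == "__main__":
    reference = [0, 0, 1, 5, 1, 0, 0, 0]
    shifted = [0, 0, 0, 0, 1, 5, 1, 0]  # same peak, two bins later
    s = best_shift(reference, shifted, max_shift=3)
    print(s, tic_normalize(apply_shift(shifted, s)))
```

As with microarrays, the order matters: normalizing before registration compares misaligned peaks, so alignment is usually done first.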

124 Registration between datasets Baggerly et al reanalysis of Petricoin et al Datasets 2 & 3

125 Discriminating features (?) Baggerly et al reanalysis of Petricoin et al Datasets 2 & 3

126 Discriminating features (?) Baggerly et al reanalysis of Petricoin et al Datasets 2 & 3 Note the low m/z < 500

127 Technology challenges Post-translational modification –Phosphorylation Tyr, Ser, Thr Difficult to analyze – suppresses ionization Membrane proteins –Hydrophobic, do not work in PAGE gels Quantitative expression levels –Isotope labelling

128 Statistical analysis challenges Peak finding –Whole sample spectra suppress most interesting proteins –Fractionation; statistical modeling Classification based on whole-sample spectrum Variable selection for biomarker identification Representation of spectra –Wavelets and other local bases –Non-parametric Bayes (Clyde & Wolpert) Quantitative abundance and differentiation

129 Bayesian biomarker identification

130 Goal: Whole-genome protein expression High-throughput protein expression Mass spectrometry – MALDI-TOF Peptide identification Abundance Collaboration: Tim Haystead lab, Mike Datto –Students: Casper Wu Keck project: Joe Nevins, Mike West

131 Mass spec data MALDI-TOF Time of flight determines m/z ratio Intensity a noisy indicator of (relative) abundance Vast majority of peptides lost in noise

132 Bioinformatics challenges Classification of spectra –Subject to many problems Noise structure (protocol), calibration, over fitting, interpretation Protein identification: –Peak location: spectrum peptide masses Baseline correction, signal/noise ratio –Database search: peptide masses proteins Mass redundancy
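The peak-location step (baseline correction followed by a signal/noise threshold) can be sketched as follows. This assumes one simple choice for each piece: a rolling-minimum baseline, a noise scale estimated from the median absolute deviation (MAD), and peaks defined as local maxima exceeding a multiple of that noise. These are illustrative choices rather than the methods used in the work described here, and the function names and default parameters are hypothetical.

```python
from statistics import median


def rolling_min_baseline(y, half_window):
    """Baseline estimate: minimum intensity in a window around each point."""
    n = len(y)
    return [
        min(y[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]


def find_peaks(y, half_window=10, snr=5.0):
    """Indices of local maxima whose baseline-corrected height exceeds snr * noise."""
    baseline = rolling_min_baseline(y, half_window)
    corrected = [a - b for a, b in zip(y, baseline)]

    center = median(corrected)
    mad = median([abs(c - center) for c in corrected])
    # 1.4826 * MAD estimates the standard deviation for Gaussian noise;
    # the tiny fallback avoids a zero threshold on perfectly flat input.
    noise = 1.4826 * mad or 1e-12

    peaks = []
    for i in range(1, len(corrected) - 1):
        is_local_max = corrected[i] > corrected[i - 1] and corrected[i] >= corrected[i + 1]
        if is_local_max and corrected[i] - center > snr * noise:
            peaks.append(i)
    return peaks


if __name__ == "__main__":
    # Synthetic spectrum: sloping baseline, small alternating ripple, one peak.
    y = [10 + 0.1 * i + (0.5 if i % 2 else 0.0) for i in range(50)]
    y[25] += 20
    print(find_peaks(y, half_window=5, snr=5.0))
```

Using the median and MAD rather than the mean and standard deviation keeps the noise estimate from being inflated by the peaks themselves.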

133 Challenges of whole-proteome expression Identification of large numbers of peptides –vs. a few key differences –Multiple peptides/protein needed Estimate abundance Protein ID under mass redundancy HPLC fractionation and recombination –Protocol design Requires detection of low signal/noise ratio

134 Mass spec data MALDI-TOF Time of flight determines m/z ratio Intensity a noisy indicator of (relative) abundance Vast majority of peptides lost in noise

135 Statistical detection Missing data problem –# peptides, locations “missing” –Estimate probabilities via Monte Carlo sampling, exact recursions Utilize natural isotope signature

136 Simulated data Peptide signature: + noise = Bayesian posterior Probabilities for peptide location:

137 Serum profiling data Ongoing work: Abundance estimation from multiple fractions Experimental validation

138 Conclusions Bioinformatics challenges different from existing uses of technology Requires careful: –Sample collection and preparation –Experimental protocol design, calibration, sample randomization (Datto serum) –Statistical analysis (baseline correction, noise filtering) –All this must precede complex multivariate algorithms for analyzing abundances/intensities Petricoin et al study –Baggerly et al review; others

139 References

140 Protein-protein interactions

141 Protein-protein interactions Another major emphasis of functional proteomics is the elucidation of protein interactions Often cast as the problem of measuring and reconstructing interaction networks. Techniques include Y2H, and computational methods: genome sequence analysis, structure- structure docking.

142 Predicted protein interaction network Eisenberg et al 1999 Genome sequence analysis to predict fusion events Combination of multiple methods

143 Protein-protein interactions: Yeast two hybrid (Y2H) Y2H: High-throughput screening of protein-protein interactions

144 Plasmid constructs for Y2H

145 Y2H Pros: –Amenable to high-throughput –Putative identification of many interactions Cons: –Highly noisy –Multiple high-throughput datasets have little overlap (D’haeseleer and Church, 2003)

146 References

147 Molecular Docking

148 Structural basis of protein-protein interactions Protein interaction data quite noisy Protein structure represents significant prior information about likelihood of interaction Suggests combining information Structural information more difficult to use: –Docking

149 Protein-protein interactions Protein-protein docking –Similar principles as protein-ligand docking, but harder (?) –Geometry + Energetics + Sampling/Search

150 Connolly surface (cont’d) Define surface by points where probe sphere ‘touches’ surfaces Sample contact surface and represent with points and surface normals Intelligent sampling for sparseness

151 Example: Connolly surface for HIV-1 protease active site Dock program, Kuntz et al, UCSF dock.compbio.ucsf.edu

152 Modeling the target site: Sphere generation Create complementary image of potential binding sites Generate spheres for every two surface points, do not intersect surface Keep approx. 1 sphere/atom Sphere clusters of interest Image of ligand geometry

153 Geometric hashing (cont’d) Same algorithm described earlier for protein structure alignment

154 Protein-protein docking

155 References

156 Summary Many important problems with complex data: shapes, spectra, images, networks Many new technologies generating large amounts of noisy data Many opportunities for statisticians!

157 Journals Proteins: Structure, Function, and Bioinformatics Protein Science Journal of Molecular Biology Biophysical Journal Journal of Computational Biology Bioinformatics PLoS Computational Biology Proteomics Journal of Proteome Research

158 Instructor Scott C. Schmidler Assistant Professor Department of Statistical Science Program in Computational Biology & Bioinformatics Program in Structural Biology & Biophysics Duke University 223 Old Chemistry Building Box Duke University Ph: (919) Durham, NC Fax: (919)