Improving the Reliability of Peptide Identification by Tandem Mass Spectrometry Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

2 Mass Spectrometry for Proteomics. Measure mass of many (bio)molecules simultaneously: high bandwidth. Mass is an intrinsic property of all (bio)molecules: no prior knowledge required.

3 Mass Spectrometer. Sample, then Ionizer: MALDI, Electrospray Ionization (ESI). Mass Analyzer: Time-of-Flight (TOF), Quadrupole, Ion Trap. Detector: Electron Multiplier (EM).

4 High Bandwidth

5 Mass is fundamental!

6 Mass Spectrometry for Proteomics. Measure mass of many molecules simultaneously... but not too many (abundance bias). Mass is an intrinsic property of all (bio)molecules... but we need a reference to compare to.

7 Mass Spectrometry for Proteomics. Mass spectrometry has been around since the turn of the century... so why is MS-based proteomics so new? Ionization methods: MALDI, electrospray. Protein chemistry and automation: chromatography, gels, computers. Protein sequence databases: a reference for comparison.

8 Sample Preparation for Peptide Identification Enzymatic Digest and Fractionation

9 Single Stage MS. [Figure: MS spectrum, m/z axis]

10 Tandem Mass Spectrometry (MS/MS). Precursor selection. [Figure: MS spectrum, m/z axis]

11 Tandem Mass Spectrometry (MS/MS). Precursor selection + collision induced dissociation (CID). [Figure: MS/MS spectrum, m/z axis]

12 The big picture... MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. Key concepts: spectrum acquisition is unbiased; direct observation of amino-acid sequence; sensitive to minor sequence variation; observed peptides represent folded proteins.

13 Peptide Identification. For each (likely) peptide sequence: 1. Compute fragment masses. 2. Compare with spectrum. 3. Retain those that match well. Peptide sequences come from protein sequence databases: Swiss-Prot, IPI, NCBI’s nr, ... Automated, high-throughput peptide identification in complex mixtures.
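The three steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by any search engine named in the talk: it assumes singly charged b- and y-ions only, monoisotopic residue masses, a hypothetical fixed tolerance `tol`, and a simple count of matched fragments as the score. The function names are made up for this example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to y-ions
PROTON = 1.00728   # charge carrier for singly charged ions


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal prefix
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal suffix
    return b_ions, y_ions


def count_matches(peptide, observed_mz, tol=0.5):
    """Step 2: count theoretical fragments within tol of an observed peak."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(
        any(abs(t - o) <= tol for o in observed_mz) for t in b_ions + y_ions
    )


def rank_candidates(candidates, observed_mz, tol=0.5):
    """Step 3: rank candidate peptides by matched-fragment count, best first."""
    scored = [(count_matches(p, observed_mz, tol), p) for p in candidates]
    return sorted(scored, reverse=True)
```

A real search engine would also filter candidates by precursor mass and use a statistically motivated score; the matched-fragment count here only shows the shape of the loop.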

14 Peptide Identification, but... What about novel peptides? Search compressed ESTs (C3, PepSeqDB). What about peak intensity? Spectral matching using HMMs (HMMatch). Which identifications are correct? Unsupervised, model-free result combiner with false discovery rate estimation.

15 Why don’t we see more novel peptides? Tandem mass spectrometry doesn’t discriminate against novel peptides... but protein sequence databases do! Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

16 What goes missing? Known coding SNPs; novel coding mutations; alternative splicing isoforms; alternative translation start-sites; microexons; alternative translation frames.

17 Why should we care? Alternative splicing is the norm! Only 20-25K human genes; each gene makes many proteins. Proteins have clinical implications: biomarker discovery. Evidence for SNPs and alternative splicing stops with transcription: genomic assays, ESTs, mRNA sequence. Little hard evidence for translation start site.

18 Novel Splice Isoform. Human Jurkat leukemia cell-line. Lipid-raft extraction protocol, targeting T cells (von Haller, et al. MCP 2003). LIME1 gene: LCK interacting transmembrane adaptor 1. LCK gene: leukocyte-specific protein tyrosine kinase; proto-oncogene; chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications.

19 Novel Splice Isoform

20 Novel Splice Isoform

21 Novel Mutation. HUPO Plasma Proteome Project. Pooled samples from 10 male & 10 female healthy Chinese subjects; plasma/EDTA sample protocol (Li, et al. Proteomics 2005; Lab 29). TTR gene: transthyretin (pre-albumin). Defects in TTR are a cause of amyloidosis. Familial amyloidotic polyneuropathy: late-onset, dominant inheritance.

22 Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy

23 Novel Mutation

24 Searching ESTs. Proposed long ago: Yates, Eng, and McCormack; Anal Chem, ’95. Now: protein sequences are sufficient for protein identification; computationally expensive/infeasible; difficult to interpret. Goal: make EST searching feasible for routine searching to discover novel peptides.

25 Searching Expressed Sequence Tags (ESTs). Pros: no introns; primary splicing evidence for annotation pipelines; evidence for dbSNP; often derived from clinical cancer samples. Cons: no frame; large (8Gb); “untrusted” by annotation pipelines; highly redundant; nucleotide error rate ~1%.

26 Compressed EST Peptide Sequence Database. For all ESTs mapped to a UniGene gene: six-frame translation; eliminate ORFs < 30 amino-acids; eliminate amino-acid 30-mers observed once; compress to a C² FASTA database (complete and correct for amino-acid 30-mers). Gene-centric peptide sequence database. Size: < 3% of naïve enumeration, 20774 FASTA entries. Running time: ~1% of naïve enumeration search. E-values: ~2% of naïve enumeration search results.
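The first two steps of this pipeline (six-frame translation and the short-ORF filter) can be sketched as below. This is only an illustration under stated assumptions: it uses the standard genetic code, treats an "ORF" as any stop-free stretch between stop codons, and maps codons containing ambiguous bases to `X`. The function names and the `min_len` parameter are hypothetical; the 30-mer counting and compression steps are not shown.

```python
from itertools import product

# Standard genetic code, laid out in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_len=30):
    """Return stop-free stretches of at least min_len residues, all 6 frames."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    orfs = []
    for strand in (dna, reverse_complement):
        for offset in range(3):
            for orf in translate(strand[offset:]).split("*"):
                if len(orf) >= min_len:
                    orfs.append(orf)
    return orfs
```

Because ESTs carry no frame information, all six frames have to be kept; the length filter is what discards most of the spurious translations.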

27 PepSeq FASTA Databases. Organisms: human, mouse, rat, zebrafish. Peptide evidence: GenBank mRNA, EST, HTC; RefSeq mRNA, proteins; Swiss-Prot/TrEMBL, EMBL, VEGA, H-Inv, IPI proteins; Swiss-Prot variants; Swiss-Prot signal peptide & init. Met removal. Single FASTA entry per gene.

28 Spectral Matching for Peptide Identification. Detection vs. identification: increased sensitivity & specificity; no novel peptides! NIST GC/MS Spectral Library: identifies small molecules; 100,000’s of (consensus) spectra; bundled/sold with many instruments; “dot-product” spectral comparison. Current project: peptide MS/MS.
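A "dot-product" spectral comparison can be illustrated with a plain normalized dot product (cosine similarity) over binned peaks. This sketch is an assumption about the general idea only: library search tools typically apply additional intensity and m/z weighting that is not reproduced here, and the `bin_width` parameter and function names are hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def dot_product_score(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product of two spectra, in [0, 1] for non-negative intensities."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Identical spectra score 1 and spectra with no shared bins score 0, which is why the comparison works for detecting a known peptide but cannot propose a peptide that has no library spectrum.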

29 NIST MS Search: Peptides

30 Peptide DLATVYVDVLK

31 Protein Families

32 Protein Families

33 Peptide DLATVYVDVLK

34 Hidden Markov Models for Spectral Matching. Capture statistical variation and consensus in peak intensity. Only need 10 spectra to build a model. Capture semantics of peaks. Extrapolate model to other peptides. Good specificity with superior sensitivity for peptide detection. Assign 1000’s of additional spectra (p-value < 10⁻⁵).

35 Hidden Markov Model. States: Ion, Delete, Insert. An (m/z, int) pair is emitted by ion & insert states.

36 The devil in the details. Intensity normalization; discretize (m/z, int) pairs; Viterbi distance as score; compute p-value using “random” spectra.

37 Random Spectra. Uniform sample of (m/z, int); permutation (m/z) of true spectra peaks; m/z distribution between true spectra and uniform sample (parameter). [Figure: number of spectra vs. Viterbi score for random, true, and false spectra]
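One common way to turn scores of "random" spectra into a p-value is an empirical estimate: score many random spectra against the model and report the fraction that score at least as well as the real spectrum. The sketch below assumes that approach and assumes that a higher score is better; the talk does not spell out either detail, and the function names are hypothetical. Only the uniform-sampling variant of random spectrum generation is shown.

```python
import random


def uniform_random_spectrum(n_peaks, mz_range, max_intensity, rng):
    """Draw n_peaks (m/z, intensity) pairs uniformly at random, sorted by m/z."""
    low, high = mz_range
    return sorted(
        (rng.uniform(low, high), rng.uniform(0.0, max_intensity))
        for _ in range(n_peaks)
    )


def empirical_p_value(observed_score, null_scores):
    """Fraction of null scores >= observed, with a +1 correction.

    The +1 in numerator and denominator keeps the estimate above zero,
    so the smallest reportable p-value is 1 / (len(null_scores) + 1).
    """
    exceed = sum(1 for s in null_scores if s >= observed_score)
    return (exceed + 1) / (len(null_scores) + 1)
```

With this estimator, resolving p-values as small as 10⁻⁵ needs on the order of 100,000 random spectra per model, or else a fitted distribution for the null scores.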

38 HMM Peptide Identification Results – DLATV

39 Spectral Matching of Peptide Variants DFLAGGVAAAISK DFLAGGIAAAISK

40 HMM model extrapolation

41 Mascot Search Results

42 Peptide Identification Results. Search engines always provide an answer. Current search engines: hard to determine “good” scores; significance estimates are unreliable. Need better methods!

43 Common Algorithmic Framework. Pre-process experimental spectra; filter peptide candidates; score match between peptides and spectra; rank peptides and assign.

44 Comparison of search engines. No single score is comprehensive. Search engines disagree. Many spectra lack confident peptide assignment. [Venn diagram: overlap of OMSSA, X!Tandem, and Mascot identifications; region percentages 4%, 10%, 2%, 5%, 9%, 69%, 2%]

45 Lots of published solutions! Treat search engines as black-boxes: apply supervised machine learning to results; use multiple match metrics. Combine/refine using multiple search engines: agreement suggests correctness. Use empirical significance estimates: “decoy” databases (FDR).
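Decoy-database FDR estimation can be sketched as follows. The assumptions here are the usual ones for a target-decoy search, not details taken from the talk: each hit carries a score (higher is better) and a flag saying whether it came from the decoy database, and the FDR at a score threshold is estimated as decoy hits divided by target hits at or above that threshold. The function name is hypothetical.

```python
def decoy_fdr(hits):
    """Estimate FDR at each score threshold from target and decoy hits.

    hits: iterable of (score, is_decoy) pairs, higher score = better match.
    Returns a list of (score, estimated_fdr) in descending score order,
    where estimated_fdr = decoys / targets among hits scoring >= score.
    """
    targets = decoys = 0
    result = []
    for score, is_decoy in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # max(..., 1) avoids division by zero before the first target hit.
        result.append((score, decoys / max(targets, 1)))
    return result
```

The estimate is empirical in the sense of the slide: it relies on counting wrong-by-construction matches rather than on a model of the score distribution.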

46 PepArML: Peptide identification arbiter by machine learning. Unifies these ideas within a model-free, combining machine learning framework. Unsupervised training procedure.

47 PepArML Overview. Unify Tandem, Mascot, and OMSSA results. [Diagram: X!Tandem, Mascot, OMSSA, and other search results feed PepArML, which separates identified from unidentified spectra]

48 Voting Heuristic Combiner. Choose peptide ID with the most votes; use best FDR as confidence; break ties (single votes) using FDR. Strawman for comparison.
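The voting rules on this slide can be written down directly. The sketch below is one reading of them, with a hypothetical input layout: for a single spectrum, each search engine contributes an `(engine, peptide, fdr)` triple, where a lower FDR means higher confidence.

```python
from collections import defaultdict


def vote(identifications):
    """Combine per-engine peptide IDs for one spectrum by majority vote.

    identifications: list of (engine, peptide, fdr) triples.
    Returns (peptide, fdr) for the winner, or None if there are no IDs.
    The winner has the most votes; ties are broken by the lowest FDR,
    and the reported confidence is the best (lowest) FDR among its votes.
    """
    if not identifications:
        return None
    fdrs_by_peptide = defaultdict(list)
    for _engine, peptide, fdr in identifications:
        fdrs_by_peptide[peptide].append(fdr)
    peptide, fdrs = min(
        fdrs_by_peptide.items(), key=lambda kv: (-len(kv[1]), min(kv[1]))
    )
    return peptide, min(fdrs)
```

For example, two engines agreeing on one peptide outvote a single engine reporting a different peptide, even if the lone vote has the lower FDR; FDR only decides between candidates with equal vote counts.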

49 Dataset construction. Extract features from search tool results (X!Tandem, Mascot, OMSSA, other) and compare with spectra: matched ions, peak intensity, mass delta, number of missed cleavages, peptide length, Tandem score, Mascot score, OMSSA score. The features feed machine learning.

50 Dataset construction. Build feature vectors. [Table: true/false (T/F) entries per search engine (Tandem, Mascot, OMSSA) ...]

51 Dataset construction. Synthetic protein mixtures provide ground truth. C8: 8 standard proteins (Calibrant Biosystems); 4594 MS/MS spectra (LTQ); 618 (11.2%) true positives. S17: 17 standard proteins (Sashimi Repository); 1389 MS/MS spectra (Q-TOF); 354 (25.4%) true positives. AURUM: 364 standard proteins (AURUM 1.0); 7508 MS/MS spectra (MALDI-TOF-TOF); 3775 (50.3%) true positives.

52 Machine learning improves single search engines (S17)

53 Multiple search engines are better than single search engines (S17)

54 Feature Evaluation

55 Application to Real Data. How well do these models generalize? Different instruments: spectral characteristics change scores. Search parameters: different parameters change score values. Supervised learning requires: (synthetic) experimental data from every instrument; search results from available search engines; training/models for all parameters × search engine sets × instruments.

56 Model Generalization

57 Rescuing Machine Learning. Train a new machine-learning model for every dataset! Generalization not required; no predetermined search engines, parameters, instruments, features. Perhaps we can “guess” the true proteins: most proteins not in doubt; machine learning can tolerate imperfect labels.

58 Unsupervised Learning. 1. Heuristic selection of “true” proteins. 2. Train classifier, predict true peptide IDs. 3. Update “true” proteins: heuristic selection of “true” proteins from classifier predictions. 4. Iterate until convergence.
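The iterate-until-convergence loop can be sketched with toy components. Everything specific below is an assumption made for illustration and is not the PepArML procedure itself: the "classifier" is a single score threshold placed midway between the mean scores of the two label classes, the protein heuristic keeps proteins supported by at least `min_peptides` distinct accepted peptides, and each peptide-spectrum match is a dict with hypothetical keys `protein`, `peptide`, and `score`.

```python
def select_proteins(psms, accepted, min_peptides=2):
    """Heuristic: proteins with >= min_peptides distinct accepted peptides."""
    support = {}
    for i in accepted:
        support.setdefault(psms[i]["protein"], set()).add(psms[i]["peptide"])
    return {prot for prot, peps in support.items() if len(peps) >= min_peptides}


def train_threshold(psms, labels):
    """Toy stand-in for a classifier: midpoint between class mean scores."""
    pos = [p["score"] for p, y in zip(psms, labels) if y]
    neg = [p["score"] for p, y in zip(psms, labels) if not y]
    if not pos or not neg:
        # Degenerate labelling: fall back to accepting every match.
        return min(p["score"] for p in psms)
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def iterate(psms, initial_cut, max_iter=20):
    """Alternate protein selection and classifier training until stable."""
    accepted = {i for i, p in enumerate(psms) if p["score"] >= initial_cut}
    proteins = select_proteins(psms, accepted)
    for _ in range(max_iter):
        # Label every match by whether its protein is currently "true".
        labels = [p["protein"] in proteins for p in psms]
        threshold = train_threshold(psms, labels)
        accepted = {i for i, p in enumerate(psms) if p["score"] >= threshold}
        updated = select_proteins(psms, accepted)
        if updated == proteins:
            break  # converged: the protein set no longer changes
        proteins = updated
    return proteins, accepted
```

The labels are imperfect by construction, since a low-scoring match to a "true" protein is still labelled positive; the loop relies on the classifier tolerating that noise, as the previous slide argues.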

59 Unsupervised Learning Performance

60 Unsupervised Learning Convergence

61 Conclusions. Proteomics can inform genome annotation: eukaryotic and prokaryotic; functional vs silencing variants. Peptides identify more than just proteins: untapped source of disease biomarkers. Computational inference can make a substantial impact in proteomics.

62 Conclusions. Compressed peptide sequence databases make routine EST searching feasible. HMMatch spectral matching improves identification performance for familiar peptides. The unsupervised, model-free, combining PepArML framework solves the peptide identification interpretation problem.

63 Acknowledgements. Chau-Wen Tseng, Xue Wu (UMCP Computer Science); Catherine Fenselau (UMCP Biochemistry); Cheng Lee (Calibrant Biosystems); PeptideAtlas, HUPO PPP, X!Tandem. Funding: NIH/NCI, USDA/ARS.
