Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,


1 Novel Peptide Identification using ESTs and Genomic Sequence. Nathan Edwards, Center for Bioinformatics and Computational Biology, University of Maryland, College Park

2 Novel Peptides: absent from traditional protein sequence databases (IPI, SwissProt, TrEMBL, NCBI’s nr, MSDB). Due to: deliberate “redundancy” elimination; “dark-side” genes; bias towards high-quality, high-confidence, full-length protein sequence.

3 What is missing? Known coding SNPs; novel coding mutations; alternative splicing isoforms; alternative translation start-sites; microexons; alternative translation frames.

4 Why should we care? Alternative splicing is the norm: only 20-25K human genes, and each gene makes many proteins. Proteins have clinical implications: biomarker discovery. Evidence for SNPs and alternative splicing stops with transcription: genomic assays, ESTs, mRNA sequence. No hard evidence for translation start site.

5 Novel Protein: HEQASNVLSDISEFR. Evidence: log10(E-value) = -9.6; 100s of ESTs; full-length mRNA sequence. Details: Peptide Atlas A8_IP (Resing et al.);

6 Novel Protein [figure slide; no text in transcript]

7-8 [figure slides; no text in transcript]

9 Novel Splice Isoform: LQGSATAAEAQVGHQTAR. Evidence: log10(E-value) = -6.8; 10s of ESTs; full-length mRNA sequence. Details: Peptide Atlas raftflow (von Haller et al.); LIME1 gene.

10-12 Novel Splice Isoform [figure slides; no text in transcript]

13 Novel Frame: TAGSPLCLPTPGAAPGSAGSCSHR. Evidence: log10(E-value) = -3.9; 10s of ESTs; full-length mRNA sequence. Details: Peptide Atlas raftflow (von Haller et al.); LIME1 gene, downstream from LQGSA...

14-16 Novel Frame [figure slides; no text in transcript]

17 “Novel” Microexon: LQTASDESYKDPTNIQLSK. Evidence: log10(E-value) = -6.4; 10s of ESTs / mRNA sequences; SwissProt variant, absent from IPI. Details: Peptide Atlas raftflow (von Haller et al.); SPTAN1 gene.

18-21 “Novel” Microexon [figure slides; no text in transcript]

22 Novel Mutation: KADDTWEPFASGK. Evidence: log10(E-value) = -7.6; 2 ESTs from the same clone library; Ala2 deletion. Details: HUPO PPP 29_b1-EDTA_1 (Qian/He; Omenn et al.); TTR gene. Known mutation: Ala2-to-Pro, associated with familial amyloidotic polyneuropathy.

23-26 Novel Mutation [figure slides; no text in transcript]

27 Known Coding SNP: DTEEEDFHVDQ[V|A]TTVK. Evidence: log10(E-value) = -9.5 / -9.4; known dbSNP (coding): Val12-to-Ala; wildtype also observed. Details: HUPO PPP 40 (Wang; Omenn et al.); SERPINA1 gene.
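The bracket notation on this slide packs the wildtype and variant peptides into one string. As a minimal sketch (the function name `expand_variants` is hypothetical, and the `[X|Y]` syntax is assumed to mean "exactly one of these residues at this position"), the explicit peptides can be enumerated like this:

```python
import itertools
import re


def expand_variants(pattern: str) -> list[str]:
    """Expand bracketed alternatives such as 'Q[V|A]T' into explicit peptides."""
    # re.split with a capturing group keeps the bracketed groups in the result.
    parts = re.split(r"(\[[^\]]*\])", pattern)
    choices = []
    for part in parts:
        if part.startswith("["):
            # "[V|A]" -> ["V", "A"]
            choices.append(part[1:-1].split("|"))
        elif part:
            # A literal run of residues has exactly one choice.
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]


for peptide in expand_variants("DTEEEDFHVDQ[V|A]TTVK"):
    print(peptide)
```

Each expanded peptide can then be scored against its own spectrum, which is how the slide can report two E-values for one bracketed sequence.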

28 Wildtype [figure slide; no text in transcript]

29-30 Known Coding SNP [figure slides; no text in transcript]

31 Known Coding SNP: LQHL[E|V]NELTHDIITK. Evidence: log10(E-value) = -6.7 / -10.9; 4 ESTs, same clone library; known dbSNP (coding): Glu5-to-Val; wildtype also observed. Details: HUPO PPP 28_b2-CIT (Pounds/Adkins/Rodland/Anderson; Omenn et al.); SERPINA1 gene.

32 IPI Common Variant Elimination: YYGGGYGSTQATFMVFQALAQYQK. Evidence: log10(E-value) = -5.9; 100s of ESTs, mRNA sequence; IPI has the (rare) variant (insertion of AS@10); the two differ in the 5’ splice site. Details: HUPO PPP 29 (Qian/He; Omenn et al.); C3 gene.

33 Why don’t we see more novel peptides? Tandem mass spectrometry doesn’t discriminate against novel peptides... ...but protein sequence databases do! Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

34 Why don’t we see more novel peptides? Traditional protein sequence databases: high-quality, full-length proteins only; many interesting peptides are omitted; exclusive, so peptide identifications are lost. ESTs, genomic & mRNA sequence: used as evidence for full-length protein sequences; inclusive, so results may need to be filtered.
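The slides do not say how nucleotide sequence is turned into searchable peptides. One common approach, shown here purely as an assumption, is to translate each EST or genomic fragment in all six reading frames with the standard genetic code. The names `six_frame` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str, frame: int) -> str:
    """Translate one forward frame (0, 1 or 2); unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame(dna: str) -> dict[str, str]:
    """Return translations for frames +1..+3 and -1..-3 ('*' marks a stop)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = translate(dna, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames


print(six_frame("ATGGCC"))
```

Because every frame is kept, such a translation is inclusive in the sense of the slide: it also emits many peptides that are never expressed, which is why the results may need filtering.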

35 Significant False Positives: E-values are not enough! Random guessers are easy to beat. Post-translational modifications vs. amino-acid substitution: methylation (on I/L, Q, R, C, H, K, S, T, N): +14; D → E, G → A, V → I/L, N → Q, S → T: +14. Peptide extension: z=+2 → z=+3. Nonsense AA masses sum to precursor. Need to ensure: fragment ions define the novel sequence; sequence evidence is strong; other plausible explanations can be eliminated.
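The "+14" ambiguity on this slide can be checked with simple arithmetic. The sketch below uses standard monoisotopic residue masses (rounded to five decimals) and compares each listed substitution with the mass of an added CH2 group; the 0.01 Da tolerance is an arbitrary choice for illustration.

```python
# Monoisotopic residue masses in Da (only the residues named on the slide).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "E": 129.04259,
}
METHYLATION = 14.01565  # mass of one added CH2 group

# Substitutions listed on the slide as mimicking methylation.
SUBSTITUTIONS = [("D", "E"), ("G", "A"), ("V", "I"), ("V", "L"), ("N", "Q"), ("S", "T")]


def mass_shift(original: str, replacement: str) -> float:
    """Mass change caused by replacing one residue with another."""
    return MONO[replacement] - MONO[original]


def mimics_methylation(original: str, replacement: str, tol: float = 0.01) -> bool:
    """True if the substitution is within tol Da of a methylation."""
    return abs(mass_shift(original, replacement) - METHYLATION) <= tol


for a, b in SUBSTITUTIONS:
    print(f"{a} -> {b}: {mass_shift(a, b):+.5f} Da, mimics methylation: {mimics_methylation(a, b)}")
```

Every listed substitution differs from its partner by exactly one CH2, so precursor mass alone cannot separate a substitution from a methylated residue; only fragment ions that localize the shift can.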

36 Significant False Positives
Peptide              E-value     Sequence evidence
DFLAGGLAAAISK        2.2x10^-8   2 ESTs
DFLAGGIAAAISK        2.2x10^-8   IPI (2), RefSeq, mRNA, ~1400 ESTs
DFLAGGVAAAISK        3.7x10^-8   IPI, RefSeq, mRNA, ~700 ESTs
DFLAGGVAAAISKMAVVPI  3.5x10^-5   Genscan exon
AISFAKDFLAGGIAAAISK  3.3x10^-4   Genscan exon
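A short calculation suggests why the first three candidates score so similarly. This is a sketch using standard monoisotopic residue masses; `peptide_mass` is a hypothetical helper and ignores modifications.

```python
# Monoisotopic residue masses in Da for the residues in the three candidates.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "D": 115.02694, "K": 128.09496,
    "F": 147.06841,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def peptide_mass(sequence: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[residue] for residue in sequence) + WATER


leu = peptide_mass("DFLAGGLAAAISK")
ile = peptide_mass("DFLAGGIAAAISK")
val = peptide_mass("DFLAGGVAAAISK")

print(f"L vs I difference: {leu - ile:.5f} Da")
print(f"L vs V difference: {leu - val:.5f} Da")
```

Leucine and isoleucine are isobaric, so the L and I peptides cannot be told apart by mass, and the V peptide sits one CH2 below them. Here the amount of sequence evidence (2 ESTs versus roughly 1400) is what separates the plausible explanation from the novel one.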

37 Significant False Positives [figure slide; no text in transcript]

38 How do we know they are novel? How do we know they are real? Good spectra; good E-value; good ion ladders; good sequence evidence; lack of other explanations...

39 Peptide Sequence Evidence: C3 Compression: amino-acid 30-mers; Complete, Correct(, Compact); present at least twice (ESTs only).
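The "present at least twice" filter can be sketched as k-mer counting. This is an illustration, not the actual C3 implementation: the function names are hypothetical, and "twice" is assumed here to mean "in at least two distinct input sequences". The toy input reuses the three strings from the SBH-graph slide with k=4 so the result is easy to follow; the slide itself uses 30-mers.

```python
from collections import Counter


def kmers(sequence: str, k: int):
    """Yield every length-k substring of the sequence."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def supported_kmers(sequences: list[str], k: int = 30, min_support: int = 2) -> set[str]:
    """Keep k-mers that occur in at least min_support distinct sequences."""
    support = Counter()
    for sequence in sequences:
        # set() so a k-mer repeated inside one sequence counts only once.
        support.update(set(kmers(sequence, k)))
    return {kmer for kmer, count in support.items() if count >= min_support}


toy = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
print(sorted(supported_kmers(toy, k=4)))
```

With this toy input only the 4-mers of ACDEFGI survive, because each of them is confirmed by a second sequence, while the 4-mers unique to one sequence are dropped as possible sequencing errors.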

40 SBH-graph: ACDEFGI, ACDEFACG, DEFGEFGI [figure not included in transcript]
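Since the figure is missing, here is a minimal sketch of how such a graph can be built from the three strings. It assumes the usual sequencing-by-hybridization construction (nodes are (k-1)-mers, each k-mer is a directed edge) and an illustrative k=4; the k used on the original slide is not recoverable from the transcript.

```python
from collections import Counter


def sbh_graph(sequences: list[str], k: int) -> Counter:
    """Map each edge (prefix (k-1)-mer, suffix (k-1)-mer) to its k-mer count."""
    edges = Counter()
    for sequence in sequences:
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            # The k-mer links its first k-1 characters to its last k-1.
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
for (source, target), count in sorted(graph.items()):
    print(f"{source} -> {target}  (seen {count}x)")
```

Shared substrings collapse onto shared edges, so the graph stores each distinct k-mer once no matter how many input sequences contain it.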

41 Compressed-SBH-graph: ACDEFGI [figure not included in transcript; surviving labels: 2, 2, 1, 2, 1]
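Compression of such a graph is usually done by merging chains of nodes that have exactly one incoming and one outgoing edge. The sketch below applies that idea to the three strings from the previous slide with an assumed k=4. It is not claimed to reproduce the slide's figure, and it ignores isolated cycles made up entirely of one-in-one-out nodes.

```python
from collections import defaultdict


def compress_sbh(sequences: list[str], k: int) -> list[str]:
    """Return the sequence spelled by each maximal non-branching path."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for sequence in sequences:
        for i in range(len(sequence) - k + 1):
            source, target = sequence[i:i + k - 1], sequence[i + 1:i + k]
            successors[source].add(target)
            predecessors[target].add(source)

    def is_simple(node: str) -> bool:
        # Exactly one way in and one way out: safe to merge through.
        return len(successors.get(node, ())) == 1 and len(predecessors.get(node, ())) == 1

    paths = []
    for start in sorted(set(successors) | set(predecessors)):
        if is_simple(start):
            continue  # paths begin only at branch points, sources and sinks
        for node in sorted(successors.get(start, ())):
            label = start + node[-1]
            while is_simple(node):
                node = next(iter(successors[node]))
                label += node[-1]
            paths.append(label)
    return paths


print(compress_sbh(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4))
# -> ['ACDEF', 'DEFACG', 'DEFG', 'EFGEFG', 'EFGI']
```

Ten 4-mer edges collapse into five labeled edges, and every original 4-mer is still spelled by exactly one of them, which is the sense in which a compressed graph stays complete and correct while getting smaller.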

42 Peptide Sequence Databases: MS/MS search engine input only; protein context is lost; inclusive, rather than exclusive (download from http://www.umiacs.umd.edu/~nedwards). Exact string search for gene/protein context: recover peptide sequence evidence. Relational database to reassemble... ...with respect to genes & genome. Grid Computing + Web Services + Viewer. Work in progress.
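Recovering protein context by exact string search can be sketched in a few lines. The function name, the return format, and the toy sequences below are all made up for illustration; a production version would more likely use an indexed or multi-pattern search over the full sequence collection.

```python
def find_contexts(peptide: str, proteins: dict[str, str], flank: int = 1):
    """Return (name, start, left flank, right flank) for every exact match."""
    hits = []
    for name, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            end = start + len(peptide)
            # "-" marks a match that touches the sequence terminus.
            left = sequence[max(0, start - flank):start] or "-"
            right = sequence[end:end + flank] or "-"
            hits.append((name, start, left, right))
            # Continue from the next position to catch overlapping matches.
            start = sequence.find(peptide, start + 1)
    return hits


# Toy sequences, invented for this example only.
toy_proteins = {
    "protA": "MKDFLAGGIAAAISKTAV",
    "protB": "MAAAGGS",
}
print(find_contexts("DFLAGGIAAAISK", toy_proteins))
```

The flanking residues matter in practice: a preceding K or R is consistent with a tryptic cleavage site, which is one more piece of evidence for or against an identification.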

43-44 Peptide Identification Navigator [figure slides; no text in transcript]

45 Conclusions: Peptides identify more than proteins. Search EST sequences (at least). Compressed peptide sequence databases make this feasible.

