Novel Peptide Identification using ESTs and Sequence Database Compression Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

2 What is missing from protein sequence databases? Known coding SNPs Novel coding mutations Alternative splicing isoforms Alternative translation start-sites Microexons Alternative translation frames

3 Why don’t we see more novel peptides? Tandem mass spectrometry doesn’t discriminate against novel peptides... but protein sequence databases do! Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

4 Novel Splice Isoform

6 Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy

7 Novel Mutation

8 Searching ESTs Proposed long ago: Yates, Eng, and McCormack; Anal Chem, ’95. Now: Protein sequences are sufficient for protein identification; Computationally expensive/infeasible; Difficult to interpret. Make EST searching feasible for routine searching to discover novel peptides.

9 Searching Expressed Sequence Tags (ESTs) Pros: No introns!; Primary splicing evidence for annotation pipelines; Evidence for dbSNP; Often derived from clinical cancer samples. Cons: No frame; Large (8Gb); “Untrusted” by annotation pipelines; Highly redundant; Nucleotide error rate ~ 1%

10 Other Search Strategies Genome Corrected ESTs: Large (2Gb); Controls for nucleotide error rate; Polymorphism lost, potential errors introduced. Genome Clustered ESTs: Small, Gene model; Convergence to well-understood isoforms; Controls nucleotide error rate. Full-Length mRNAs: Incomplete gene coverage, “most” are already in IPI

11 Other Search Strategies Genome: Large (6Gb), lots of non-coding DNA; Find novel ORFs, no sampling bias; Miss spliced peptide sequences. Genscan Exons: Small, find novel ORFs; Miss spliced peptide sequences. How should we interpret peptide identifications with no mRNA evidence?

12 Compressed EST Peptide Sequence Database For all ESTs mapped to a UniGene gene: Six-frame translation; Eliminate ORFs < 30 amino-acids; Eliminate amino-acid 30-mers observed once; Compress to C² FASTA database (Complete, Correct for amino-acid 30-mers). Gene-centric peptide sequence database: Size: < 3% of naïve enumeration, 20774 FASTA entries; Running time: ~ 1% of naïve enumeration search; E-values: ~ 2% of naïve enumeration search results
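The first two pipeline steps above (six-frame translation, then dropping ORFs shorter than 30 amino acids) can be sketched in Python as follows. This is an illustrative sketch, not the author's code: the function name `six_frame_orfs`, the choice to treat an ORF as any stop-free stretch between stop codons, and the use of `X` for codons containing ambiguous bases are all assumptions. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_orfs(dna, min_len=30):
    """Yield stop-free translated stretches of at least min_len residues.

    Assumption: an "ORF" is any stretch between stop codons; codons with
    ambiguous bases (e.g. N) are translated as X.
    """
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for frame in range(3):
            protein = "".join(
                CODON.get(strand[i:i + 3], "X")
                for i in range(frame, len(strand) - 2, 3)
            )
            for orf in protein.split("*"):
                if len(orf) >= min_len:
                    yield orf


# Toy input with a lowered threshold so the short example yields something.
toy = "ATG" + "GCT" * 3 + "TAA"
print(list(six_frame_orfs(toy, min_len=3)))
```

With the slide's threshold, the call would be `six_frame_orfs(est_sequence, min_len=30)` for each EST sequence.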

14 SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

15 Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

16 Sequence Databases & CSBH-graphs Original sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI
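A minimal Python sketch of the SBH-graph and its compressed form for the three example sequences. The slides do not state k; k = 4 is assumed here. The function names, and the rule that a node is "non-trivial" when its in-degree or out-degree differs from 1, are also assumptions for illustration.

```python
from collections import defaultdict


def sbh_graph(seqs, k):
    """Nodes are (k-1)-mers; each distinct k-mer is an edge prefix -> suffix."""
    succ, pred = defaultdict(set), defaultdict(set)
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred


def compress(succ, pred):
    """Merge chains of trivial nodes into single labeled edges.

    Limitation of this sketch: a cycle made only of trivial nodes is
    not reported.
    """
    nodes = set(succ) | set(pred)
    nontrivial = {
        n for n in nodes
        if len(succ.get(n, ())) != 1 or len(pred.get(n, ())) != 1
    }
    edges = []
    for u in sorted(nontrivial):
        for v in sorted(succ.get(u, ())):
            label = u + v[-1]
            while v not in nontrivial:
                v = next(iter(succ[v]))  # a trivial node has one successor
                label += v[-1]
            edges.append(label)
    return nontrivial, edges


seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
nontrivial, edges = compress(*sbh_graph(seqs, 4))
print(sorted(nontrivial))
print(edges)
```

Each original sequence can be read off as a walk along these compressed edges, which is the sense in which the sequences correspond to paths.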

17 Sequence Databases & CSBH-graphs All k-mers represented by an edge have the same count [Figure: CSBH-graph with edge counts 2, 2, 1, 2, 1]

18 CSBH-graphs Quickly determine which k-mers occur at least twice [Figure: CSBH-graph with edge counts 2, 2, 1, 2]
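The counting step can be illustrated directly on the example sequences. This sketch counts k-mers with a plain `Counter` rather than through the CSBH-graph, and again assumes k = 4. Whether "observed at least twice" means total occurrences or distinct source sequences is not stated in the slides; total occurrences are counted here.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every occurrence of each k-mer across all sequences."""
    return Counter(
        s[i:i + k] for s in seqs for i in range(len(s) - k + 1)
    )


def repeated_kmers(seqs, k):
    """Return the k-mers observed at least twice."""
    return {kmer for kmer, n in kmer_counts(seqs, k).items() if n >= 2}


seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
print(sorted(repeated_kmers(seqs, 4)))
```

For the pipeline described earlier, the same idea would apply with k = 30 over the translated ORFs.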

19 de Bruijn Sequences de Bruijn sequences represent all words of length k from some alphabet A. A = {0,1}, k = 3: s = 0001110100 A = {0,1}, k = 4: s = 0000111101011001000
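The two example strings can be checked mechanically. The sketch below tests whether a linear string contains every word of length k over the alphabet exactly once; the function name `is_de_bruijn` is an assumption for illustration. A linear (non-cyclic) string with this property has length |A|^k + k - 1, which matches the lengths 10 and 19 above.

```python
from itertools import product


def is_de_bruijn(s, alphabet, k):
    """True if every length-k word over alphabet occurs exactly once in s."""
    words = [s[i:i + k] for i in range(len(s) - k + 1)]
    expected = {"".join(w) for w in product(alphabet, repeat=k)}
    return len(words) == len(expected) and set(words) == expected


print(is_de_bruijn("0001110100", "01", 3))
print(is_de_bruijn("0000111101011001000", "01", 4))
```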

20 de Bruijn Graph: A = {0,1}, k = 4 [Figure: de Bruijn graph on the eight 3-bit nodes 110, 011, 100, 001, 000, 010, 111, 101, with eight edges labeled 1 and eight edges labeled 0]

21 Correct, Complete, Compact (C³) Enumeration Set of paths that use each edge exactly once: ACDEFGEFGI, DEFACG

22 Correct, Complete (C²) Enumeration Set of paths that use each edge at least once: ACDEFGEFGI, DEFACG
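The three properties can be tested on the example enumeration. In this sketch, "complete" means every original k-mer appears in the enumeration, "correct" means the enumeration adds no k-mer absent from the originals, and "compact" means no k-mer appears more than once. These readings, the function names, and k = 4 are assumptions; the slides define the terms only through edge use in the graph.

```python
from collections import Counter


def kmer_counts(seqs, k):
    return Counter(
        s[i:i + k] for s in seqs for i in range(len(s) - k + 1)
    )


def check_enumeration(original, enumeration, k):
    """Return (complete, correct, compact) for an enumeration."""
    want = set(kmer_counts(original, k))
    got = kmer_counts(enumeration, k)
    complete = want <= set(got)                   # nothing lost
    correct = set(got) <= want                    # nothing invented
    compact = all(n == 1 for n in got.values())   # nothing repeated
    return complete, correct, compact


original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
enumeration = ["ACDEFGEFGI", "DEFACG"]
print(check_enumeration(original, enumeration, 4))
print(sum(map(len, original)), "->", sum(map(len, enumeration)))
```

With k = 4 the example enumeration satisfies all three properties while shrinking from 23 to 16 residues. With k = 3 it would still be complete and correct but not compact, since EFG and DEF each occur twice.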

23 Patching the CSBH-graph Use artificial edges to fix unbalanced nodes

24 Patching the CSBH-graph Use matching-style formulations to choose artificial edges. Optimal C²/C³ enumeration in polynomial time. Chinese Postman Problem (Edmonds and Johnson, ’73); l-tuple DNA sequencing (Pevzner, ’89); Shortest (Common) Superstring: MAX-SNP-hard, 2.5-approximation algorithm

25 C³ Enumeration [Figure: #in-#out node labels 1, 3, 2, 1, 3, -2, -4, -2; edge cost: k]

26 C³ Enumeration [Figure: #in-#out node labels 1, 3, 2, 1, 3, -2, -4, -2, 0, 0; edge costs: k and 0]
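The #in-#out labels in these figures are node imbalances. A short Python sketch of how they might be computed for the earlier three-sequence example, assuming k = 4 and one edge per distinct k-mer (the function names are hypothetical):

```python
from collections import defaultdict


def imbalance(seqs, k):
    """Return {node: #in - #out}, with one edge per distinct k-mer."""
    kmers = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    balance = defaultdict(int)
    for kmer in kmers:
        balance[kmer[:-1]] -= 1   # edge leaves its (k-1)-mer prefix
        balance[kmer[1:]] += 1    # edge enters its (k-1)-mer suffix
    return dict(balance)


def surplus_in(balance):
    """Total excess of incoming edges, summed over nodes with #in > #out."""
    return sum(b for b in balance.values() if b > 0)


bal = imbalance(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
print({node: b for node, b in sorted(bal.items()) if b != 0})
print(surplus_in(bal))
```

In this example two nodes have one surplus incoming edge and two have one surplus outgoing edge, which is consistent with the two-path enumeration ACDEFGEFGI, DEFACG shown earlier.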

27 Reusing Edges ACDHAC EHAC FHAC GHAC D ACDEHAC, ACDFHAC, ACDGHACD

28 Reusing Edges C³: ACDEHACDFHAC, ACDGHACD [Figure labels: ACDHAC EHAC FHAC GHAC D $ACD]

29 Reusing Edges C²: ACDEHACDFHACDGHAC [Figure labels: ACDHAC EHAC FHAC GHAC D D]

30 C² Enumeration [Figure: #in-#out node labels 1, 3, 2, 1, 3, -2, -4, -2; values 4, 7, 10 on “shortcut paths”]

31 Implementation CSBH-graph construction: Determine non-trivial nodes directly; Consecutive non-trivial nodes determine edges. C³/C² enumeration: C³: Trivial “assignment” of artificial edges; C²: Depth-first search & Goldberg’s CS2 min cost flow code; Eulerian path algorithm. Can be applied to entire EST database: Condor grid and PBS cluster for CSBH-graph construction; Large memory machine for C³/C² enumeration
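As a rough illustration of the C³ route (arbitrary assignment of artificial edges, then an Eulerian traversal), the Python sketch below patches the uncompressed k-mer graph and walks it with Hierholzer's algorithm. It is not the author's implementation: the slides do not name the Eulerian algorithm used, the function name `c3_paths` and k = 4 are assumptions, and the C² case with min cost flow is not covered.

```python
from collections import defaultdict

ARTIFICIAL = None  # label marking a patch edge


def c3_paths(seqs, k):
    """Cover each distinct k-mer exactly once with a set of paths."""
    kmers = sorted({s[i:i + k] for s in seqs for i in range(len(s) - k + 1)})
    out = defaultdict(list)      # node -> [(next node, appended letter)]
    balance = defaultdict(int)   # node -> #in - #out
    for kmer in kmers:
        out[kmer[:-1]].append((kmer[1:], kmer[-1]))
        balance[kmer[:-1]] -= 1
        balance[kmer[1:]] += 1
    # Pair surplus-in nodes with surplus-out nodes in arbitrary order.
    ends = [n for n in sorted(balance) for _ in range(max(balance[n], 0))]
    starts = [n for n in sorted(balance) for _ in range(max(-balance[n], 0))]
    for u, v in zip(ends, starts):
        out[u].append((v, ARTIFICIAL))
    paths = []
    for start in sorted(balance):
        if not out[start]:
            continue  # edges already used by an earlier circuit
        # Hierholzer's algorithm on the now-balanced component.
        stack, circuit = [(start, "")], []
        while stack:
            node = stack[-1][0]
            if out[node]:
                stack.append(out[node].pop())
            else:
                circuit.append(stack.pop())
        steps = circuit[::-1][1:]  # (node reached, edge label), in order
        cut = next(
            (i for i, (_, lab) in enumerate(steps) if lab is ARTIFICIAL), None
        )
        if cut is None:  # balanced cycle: emit it as a single path
            paths.append(start + "".join(lab for _, lab in steps))
            continue
        # Rotate so the walk begins at an artificial edge, then cut there.
        for node, lab in steps[cut:] + steps[:cut]:
            if lab is ARTIFICIAL:
                paths.append(node)
            else:
                paths[-1] += lab
    return sorted(paths)


print(c3_paths(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4))
```

On the three example sequences this traversal order happens to reproduce the enumeration from the earlier slides; other valid Eulerian circuits would give different but equally compact path sets.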

32 Conclusions Peptides identify more than just proteins. Compressed peptide sequence databases make routine EST searching feasible: Currently available for download; Can include other sources of peptide sequence at little additional cost. CSBH-graph + edge counts + C²/C³ enumeration algorithms: Minimal FASTA representation of k-mer sets

33 Acknowledgements Chau-Wen Tseng, Xue Wu UMCP Computer Science Catherine Fenselau, Crystal Harvey UMCP Biochemistry Calibrant Biosystems PeptideAtlas, HUPO PPP, X!Tandem Funding: National Cancer Institute
