Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University.


1 Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

2 k-mer (Sub-)Problems. Enumerate: for all (distinct) k-mers, do... Existence: ...with respect to exact (& inexact) count ≥ x. Uniqueness: ...with respect to exact & inexact match. Near-neighbors: ...with respect to inexact match. Representation: represent (distinct) k-mers for other tools; fast annotation of k-mer counts on original sequences.
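The enumerate and existence sub-problems can be illustrated with a minimal in-memory sketch. The function names are hypothetical, the example sequences are the ones used on the SBH-graph slides, and k = 4 is an assumption; a genome-scale tool would not hold every k-mer in a Python dictionary.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every k-mer (all overlapping windows) across the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def kmers_with_min_count(sequences, k, x):
    """Distinct k-mers whose exact count is at least x (existence sub-problem)."""
    return {kmer for kmer, c in kmer_counts(sequences, k).items() if c >= x}

# Example sequences taken from the SBH-graph slides; k = 4 is assumed here.
seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
counts = kmer_counts(seqs, 4)
repeated = kmers_with_min_count(seqs, 4, 2)
```

Uniqueness corresponds to a count of exactly 1, so the same table answers both questions for exact matches.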

3 Applications of k-mer sets. Peptide identification: represent all amino-acid 30-mers that occur at least twice in human dbEST. PCR primer design: test DNA 20-mer primers for uniqueness (what does it mean to be unique?). DNA sequencing error / repeat detection: eliminate mers that are too rare or too frequent. Pathogen signatures: near-neighbors imply potential false positives.

4 k-mer Superstring Problem. Given: a set of sequences S = { S_1, ..., S_n } (a sequence database) and a word size k. Find: a new set of sequences T = { T_1, ..., T_m } such that the total length of T is minimized, and T is complete and correct w.r.t. the k-mers of S.

5 k-mer Superstring Problem. Completeness: all of the k-mers of S are represented. Correctness: no additional k-mers are present. Minimize the total representation length, which correlates with running time.

6 Shortest (common) superstring problem. General strings (arbitrary length). Single output string. Completeness for input sequences only. Classical NP-hard problem (Garey and Johnson). Approximable within ~2.5*OPT. MAX-SNP-hard. One of the first algorithmic approaches to genome assembly.

7 de Bruijn Sequences. de Bruijn sequences represent all words of length k from some alphabet A. A = {0,1}, k = 3: s = 0001110100. A = {0,1}, k = 4: s = 0000111101011001000.
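The two example strings can be checked mechanically. The sketch below (function name hypothetical) treats s as a linear string, so a de Bruijn sequence for word length k has |A|^k + k - 1 symbols and contains every length-k word exactly once.

```python
from itertools import product

def is_de_bruijn(s, alphabet, k):
    """True if the linear string s contains every length-k word over the alphabet exactly once."""
    words = [s[i:i + k] for i in range(len(s) - k + 1)]
    expected = {"".join(p) for p in product(alphabet, repeat=k)}
    return len(words) == len(expected) and set(words) == expected

ok_k3 = is_de_bruijn("0001110100", "01", 3)
ok_k4 = is_de_bruijn("0000111101011001000", "01", 4)
```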

8 de Bruijn Graph: A = {0,1}, k = 4. [Figure: de Bruijn graph on the eight 3-bit nodes 110, 011, 100, 001, 000, 010, 111, 101, with 16 edges labeled 0 or 1.]

9 de Bruijn Sequences & Graphs. de Bruijn graphs (k, A): edges represent length-k words over A. Each node has in-degree |A| and out-degree |A|. An Eulerian tour constructs a de Bruijn sequence.
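One way to realize the Eulerian-tour construction is Hierholzer's algorithm on the graph whose nodes are (k-1)-mers. This is a sketch under that assumption and not necessarily the construction used in the talk; the function name is hypothetical.

```python
from itertools import product

def de_bruijn_sequence(alphabet, k):
    """Linear de Bruijn sequence from an Eulerian circuit (Hierholzer's algorithm).

    Nodes are (k-1)-mers; the edge u -> (u + a)[1:] spells the k-mer u + a.
    """
    if k == 1:
        return "".join(alphabet)
    start = alphabet[0] * (k - 1)
    # Each node starts with one unused outgoing edge per alphabet symbol.
    unused = {"".join(p): list(alphabet) for p in product(alphabet, repeat=k - 1)}
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        if unused[u]:
            a = unused[u].pop()
            stack.append((u + a)[1:])  # follow an unused edge
        else:
            circuit.append(stack.pop())  # dead end: node is finished
    circuit.reverse()
    # Spell the circuit: first node, then the last symbol of each later node.
    return circuit[0] + "".join(node[-1] for node in circuit[1:])

s = de_bruijn_sequence("01", 4)
```

Because every node has equal in-degree and out-degree and the graph is connected, the circuit uses all |A|^k edges, so the output has |A|^k + k - 1 symbols.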

10 Sequencing-by-Hybridization (SBH) graph for ACDEFGI, ACDEFACG, DEFGEFGI

11 Compressed SBH-graph for ACDEFGI, ACDEFACG, DEFGEFGI

12 Sequence Databases & CSBH-graphs. Original sequences correspond to paths: ACDEFGI, ACDEFACG, DEFGEFGI

13 C^3 Enumeration. Complete: all k-mers are present. Correct: no other k-mers are present. Compact: no k-mer is present more than once.

14 Correct, Complete, Compact (C^3) Enumeration. Set of paths that use each edge exactly once: ACDEFGEFGI, DEFACG.
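The three properties can be tested directly on the running example. The slides do not state the word size for this example; k = 4 is assumed here because it makes the two output paths a C^3 enumeration of the three input sequences. The function names are hypothetical.

```python
def kmers(seqs, k):
    """All k-mer occurrences (with repeats) in a list of sequences."""
    return [s[i:i + k] for s in seqs for i in range(len(s) - k + 1)]

def check_enumeration(S, T, k):
    """Return (complete, correct, compact) for output T against input S."""
    source = set(kmers(S, k))
    target = kmers(T, k)
    complete = source <= set(target)           # every k-mer of S appears in T
    correct = set(target) <= source            # T introduces no new k-mers
    compact = len(target) == len(set(target))  # no k-mer appears twice in T
    return complete, correct, compact

S = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
T = ["ACDEFGEFGI", "DEFACG"]
result = check_enumeration(S, T, 4)
```

The input S is itself complete and correct but not compact, which is why it qualifies as a C^2 enumeration and not a C^3 one.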

15 Correct, Complete (C^2) Enumeration. Set of paths that use each edge at least once: ACDEFGEFGI, DEFACG.

16 Patching the CSBH-graph. Use artificial edges to fix unbalanced nodes.
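A small sketch of what "unbalanced" means, under stated assumptions: it uses the uncompressed SBH graph (nodes are (k-1)-mers, one edge per distinct k-mer), takes k = 4 for the running example, and assumes the graph is weakly connected. In that setting each edge can be used exactly once with max(1, total surplus out-degree) paths, and every path costs k - 1 extra symbols. The function names are hypothetical.

```python
from collections import defaultdict

def node_imbalance(seqs, k):
    """Return the distinct k-mer edges and #in - #out for every (k-1)-mer node."""
    edges = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    balance = defaultdict(int)
    for e in edges:
        balance[e[:-1]] -= 1  # edge leaves its prefix node
        balance[e[1:]] += 1   # edge enters its suffix node
    return edges, dict(balance)

def min_c3_length(seqs, k):
    """Total length of a shortest exactly-once path cover (connected graph assumed)."""
    edges, balance = node_imbalance(seqs, k)
    paths = max(1, sum(-b for b in balance.values() if b < 0))
    return len(edges) + paths * (k - 1)

seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
edges, balance = node_imbalance(seqs, 4)
length = min_c3_length(seqs, 4)
```

For the running example this gives two paths and a total length of 16, matching ACDEFGEFGI plus DEFACG.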

17 Patching the CSBH-graph. Use matching-style formulations to choose artificial edges. Optimal C^2/C^3 enumeration in polynomial time. Related problems: Chinese Postman Problem (Edmonds and Johnson, ’73); l-tuple DNA sequencing (Pevzner, ’89); Shortest (Common) Superstring (MAX-SNP-hard, 2.5-approximation algorithm).

18 Related work: Chinese Postman Problem. Undirected graph, weighted edges. Shortest path that uses all the edges. Solvable in polynomial time: construct a minimum-weight matching between nodes of odd degree, add the matching to the graph, and find an Eulerian path. Minimize the weight of the extra edges used.

19 C^2 Enumeration. Chinese postman problem, except: Directed graph: add edges from nodes with surplus in-degree to nodes with surplus out-degree. Fixed-cost teleportation option: can always “start” a new sequence. Find the optimal set of additional edges: a transportation problem / min-cost flow instance.

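20 C^3 Enumeration. [Figure: graph annotated with the values 1, 3, 2, 1, 3, -2, -4, -2; legend: #in-#out; cost: k.]

21 Reusing Edges. Sequences: ACDEHAC, ACDFHAC, ACDGHACD. [Figure: graph with labels ACDHAC, EHAC, FHAC, GHAC, D.]

22 Reusing Edges. C^3: ACDEHACDFHAC, ACDGHACD. [Figure: graph with labels ACDHAC, EHAC, FHAC, GHAC, D, $ACD.]

23 Reusing Edges. C^2: ACDEHACDFHACDGHAC. [Figure: graph with labels ACDHAC, EHAC, FHAC, GHAC, D, D.]

24 C^2 Enumeration. [Figure: graph annotated with the values 1, 3, 2, 1, 3, -2, -4, -2 (#in-#out) and “shortcut paths” labeled 4, 7, 10.]

25 C^3 Enumeration. [Figure: graph annotated with the values 1, 3, 2, 1, 3, -2, -4, -2 (#in-#out); legend: cost: k; cost: 0.]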

26 Sample Preparation for Peptide Identification. Enzymatic digest and fractionation.

27 Single Stage MS. [Figure: MS spectrum; axis: m/z.]

28 Tandem Mass Spectrometry (MS/MS). Precursor selection. [Figure: spectrum; axis: m/z.]

29 Tandem Mass Spectrometry (MS/MS). Precursor selection + collision induced dissociation (CID). [Figure: MS/MS spectrum; axis: m/z.]

30 Peptide Identification. For each (likely) peptide sequence: 1. Compute fragment masses. 2. Compare with spectrum. 3. Retain those that match well. Peptide sequences come from protein sequence databases: Swiss-Prot, IPI, NCBI’s nr, ... Automated, high-throughput peptide identification in complex mixtures.
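The three steps can be sketched for singly charged b- and y-ions using standard monoisotopic residue masses. The shared-peak count and the 0.5 m/z tolerance are illustrative assumptions, not the scoring of any particular search engine, and the function names are hypothetical.

```python
# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728

def fragment_mz(peptide):
    """Step 1: singly charged b-ion and y-ion m/z values for each cleavage site."""
    masses = [MONO[a] for a in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b, y  # y runs from the longest fragment down to y1

def count_matches(peptide, peaks, tol=0.5):
    """Steps 2-3: number of predicted fragments with a peak within tol (toy score)."""
    b, y = fragment_mz(peptide)
    return sum(any(abs(f - p) <= tol for p in peaks) for f in b + y)

b_ions, y_ions = fragment_mz("PEPTIDE")
score = count_matches("PEPTIDE", [98.06, 148.06])
```

Candidates would then be ranked by score and only the best-matching peptides retained for each spectrum.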

31 Novel Splice Isoform. Human Jurkat leukemia cell line; lipid-raft extraction protocol, targeting T cells (von Haller, et al. MCP 2003). LIME1 gene: LCK interacting transmembrane adaptor 1. LCK gene: leukocyte-specific protein tyrosine kinase; proto-oncogene; chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications.

32 32 Novel Splice Isoform

33 33 Novel Splice Isoform

34 Novel Mutation. HUPO Plasma Proteome Project: pooled samples from 10 male & 10 female healthy Chinese subjects; plasma/EDTA sample protocol (Li, et al. Proteomics 2005; Lab 29). TTR gene: transthyretin (pre-albumin). Defects in TTR are a cause of amyloidosis. Familial amyloidotic polyneuropathy: late-onset, dominant inheritance.

35 35 Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy

36 36 Novel Mutation

37 Compressed EST Peptide Sequence Database. For all ESTs mapped to a UniGene gene: six-frame translation; eliminate ORFs < 30 amino acids; eliminate amino-acid 30-mers observed once; compress to a C^2 FASTA database that is complete and correct for amino-acid 30-mers. Inclusive gene-centric peptide sequence database. Size: < 3% of naïve enumeration, 20774 FASTA entries. Running time: ~1% of naïve enumeration search. E-values: ~2% of naïve enumeration search results.

38 Compressed EST Peptide Sequence Database. For all ESTs mapped to a UniGene gene: six-frame translation; eliminate ORFs < 30 amino acids; eliminate amino-acid 30-mers observed once; compress to a C^2 FASTA database that is complete and correct for amino-acid 30-mers. Gene-centric peptide sequence database. Size: < 3% of naïve enumeration, 20774 FASTA entries. Running time: ~1% of naïve enumeration search. E-values: ~2% of naïve enumeration search results.

39 Sequence Databases & CSBH-graphs. All k-mers represented by an edge have the same count. [Figure: CSBH-graph with edge labels 2, 2, 1, 2, 1.]

40 CSBH-graph subgraphs. Quickly determine those that occur twice. [Figure: CSBH-graph with edge labels 2, 2, 1, 2.]

41 k-mer (Sub-)Problems. Enumerate: for all (distinct) k-mers, do... Existence: ...with exact (& inexact) count ≥ x. Uniqueness: ...exact & inexact match. Near-neighbors: ...inexact match. Representation: represent (distinct) k-mers for other tools; fast annotation of k-mer counts on original sequences.

42 Large scale instances! CSBH-graph instances: partition the set of all k-mers, determine non-trivial nodes; days on a Condor grid (250 CPUs) to construct; ≥ 100,000,000 nodes and edges (sparse & dense). Min-cost flow instances: ≥ 500,000 nodes and edges. Algorithms must be linear in problem size. Out-of-core Eulerian path algorithm? Currently testing out-of-core connected-components.

43 Grid computing. Heterogeneous machines: varying disk/memory/MHz/cores capabilities. Centralized scheduler: jobs started asynchronously; other jobs may preempt the current job. Input files may need to be staged: 250 simultaneous requests for a 3Gb file? How to guarantee integrity of input files? Problem decomposition may be non-trivial: job sizes need to fit the least capable machine. Sometimes need to “game” the scheduler. Need to ensure the integrity of job output.

44 Uniqueness Oracles. Oracle for uniqueness of 20-mers in the human genome (size: 3Gb). Count occurrences in the genome: 0, 1, 2+. Construct a 20-mer superstring for 20-mers with count 1. Construct a 20-mer superstring for 20-mers with count > 1. Easy(-ish) for exact sequence match: O(n). Fast automata, hash tables, suffix trees.
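A toy version of the exact-match oracle, using hash sets in place of the k-mer superstrings the slide proposes. It scans one strand only (reverse complements are ignored, which is a simplifying assumption), and the function name is hypothetical.

```python
def build_uniqueness_oracle(genome, k):
    """Return a function mapping a k-mer to '0', '1' or '2+' occurrences in genome."""
    seen_once, seen_multi = set(), set()
    for i in range(len(genome) - k + 1):
        w = genome[i:i + k]
        if w in seen_multi:
            continue
        if w in seen_once:
            # Second occurrence: promote from "unique" to "repeated".
            seen_once.remove(w)
            seen_multi.add(w)
        else:
            seen_once.add(w)

    def oracle(kmer):
        if kmer in seen_once:
            return "1"
        if kmer in seen_multi:
            return "2+"
        return "0"

    return oracle

oracle = build_uniqueness_oracle("ACGTACGTTT", 4)
```

The two sets play the role of the two superstrings (count 1 and count > 1); a single pass over the genome gives the O(n) behaviour for exact matches.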

45 45 Polymerase Chain Reaction

46 46 Polymerase Chain Reaction

47 Inexact sequence match. Inexact sequence matching is O(n*m*k). Errors/mismatches (k): 1, 2, 3. Number of distinct 20-mers (m): O(n). Achieve expected linear time using a hybrid approach (blastn): exact search for short chunks of primers; expensive alignment only where chunks match. Large chunks ⇒ fast, but miss occurrences. Small chunks ⇒ slow, find all matches.

48 Inexact sequence match. Baeza-Yates & Perleberg: correct and O(n) for small k. At least 1 chunk is observed with no error. Small k → large chunks → fast and correct. A form of locality sensitive hashing. [Figure: query q aligned to genome g in chunks marked ≠, =, ≠.]
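The pigeonhole idea can be sketched for mismatches only: split the primer into k + 1 chunks, find exact chunk hits, and verify each candidate alignment with a Hamming-distance check. This is a simplified illustration that uses Python's built-in substring search where a real implementation would use an automaton or hash table; it assumes the primer has at least k + 1 symbols, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_with_mismatches(primer, genome, k):
    """Start positions where primer matches genome with at most k mismatches."""
    m, n = len(primer), len(genome)
    # k + 1 chunk boundaries: with at most k errors, some chunk is error-free.
    bounds = [j * m // (k + 1) for j in range(k + 2)]
    hits = set()
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = primer[lo:hi]
        pos = genome.find(chunk)
        while pos != -1:
            start = pos - lo  # where the whole primer would begin
            if 0 <= start <= n - m and hamming(primer, genome[start:start + m]) <= k:
                hits.add(start)
            pos = genome.find(chunk, pos + 1)
    return sorted(hits)

genome = "TTTACGTACGTAAATTT"
primer = "ACGTTCGTAA"
hits = find_with_mismatches(primer, genome, 1)
```

Smaller k means longer chunks, fewer spurious chunk hits, and fewer expensive verifications, which is the "fast and correct" regime the slide describes.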

49 Locality Sensitive Hashing. For each primer: store a (set of) hash(es) in a hash table. At each position in the genome: look up a (set of) hash(es) in the hash table; if any hash is found, do a more expensive check. Need to weigh sensitivity (false negatives) vs specificity (false positives). Our application requires speed and no false negatives!

50 Random Projection. Choose T templates of l random “care” positions. [Figure: query q aligned to genome g.]

51 Random Projection. Choose T templates of l random “care” positions. t_1 = 01101010001010100. [Figure: template t_1 applied to genome g.]

52 Random Projection. Choose T templates of l random positions. t_1 = 01101010001010100, t_2 = 10100101000001001. [Figure: templates t_1 and t_2 applied to genome g.]

53 Random Projection. Choose T templates of l random positions. t_1 = 01101010001010100, t_2 = 10100101000001001. [Figure: templates t_1 and t_2 applied to genome g.]

54 Random Projection. Choose T templates of l random positions. t_1 = 01101010001010100, t_2 = 10100101000001001. [Figure: templates t_1 and t_2 applied to genome g.]
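A minimal sketch of the template-based lookup: each template is a tuple of care positions, a word is hashed by the symbols at those positions, and genome windows that share a projection with a primer become candidates for the more expensive check. The function names and the dictionary-based index are assumptions for illustration. Randomly chosen templates can miss true matches, which is what motivates the designed seed sets on the following slides.

```python
import random

def make_templates(m, l, T, seed=0):
    """T templates, each a sorted tuple of l random care positions out of m."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(m), l))) for _ in range(T)]

def project(word, template):
    """Keep only the symbols at the template's care positions."""
    return "".join(word[i] for i in template)

def build_index(primers, templates):
    """Map (template id, projection) -> set of primers with that projection."""
    index = {}
    for p in primers:
        for t_id, t in enumerate(templates):
            index.setdefault((t_id, project(p, t)), set()).add(p)
    return index

def candidates(window, templates, index):
    """Primers sharing at least one projection with this genome window."""
    found = set()
    for t_id, t in enumerate(templates):
        found |= index.get((t_id, project(window, t)), set())
    return found

templates = [(0, 1, 2, 3), (4, 5, 6, 7)]  # fixed templates for a reproducible demo
index = build_index(["ACGTACGT"], templates)
near = candidates("ACGTTTTT", templates, index)
far = candidates("TCGTACTT", templates, index)
```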

55 Gapped seed-set design problem. Given: mer-size m (= 20), number of errors k (= 1, 2, 3), number of cares l (= 10, 12, 14). Find the smallest set of templates with no false negatives. Minimize running time.

56 Gapped seed set design formulation (for k = 2). Cover the edges of K_m with copies of K_{m-l}. How many triangles to cover K_6? (m = 6, k = 2, l = 3.) Some instances of (m, 2, m-3) cover each edge exactly once: Steiner triple systems.

57 How many triangles cover K_6? 15 edges total. Is 5 triangles possible?

58 How many triangles cover K_6? 15 edges total. Is 5 triangles possible? NO!

59 How many triangles cover K_6? 15 edges total. Is 5 triangles possible? NO! Each node requires 3 triangles. Triangles must account for at least 18 “edges”.

60 How many triangles cover K_6? 15 edges total. Is 5 triangles possible? NO! Each node requires 3 triangles. Triangles must account for at least 18 “edges”.
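The counting argument can be confirmed by brute force, since K_6 has only 20 triangles. The sketch below computes the bound from the slide (each node has degree 5 and a triangle covers 2 of its edges, so it needs 3 triangles; 6 × 3 node-triangle incidences at 3 nodes per triangle gives 6) and then searches exhaustively for the smallest cover. The function names are hypothetical.

```python
from itertools import combinations

def counting_lower_bound(n):
    """Bound from the slide: each node needs ceil((n-1)/2) triangles, 3 nodes per triangle."""
    per_node = -(-(n - 1) // 2)
    return -(-(n * per_node) // 3)

def min_triangle_cover(n):
    """Smallest number of triangles covering every edge of K_n (exhaustive search)."""
    edges = set(combinations(range(n), 2))
    triangles = [set(combinations(t, 2)) for t in combinations(range(n), 3)]
    for size in range(1, len(triangles) + 1):
        for choice in combinations(triangles, size):
            if set().union(*choice) == edges:
                return size

bound = counting_lower_bound(6)
best = min_triangle_cover(6)
```

The search finds no cover with 5 triangles and succeeds with 6, for example 123, 145, 126, 245, 346, 356 with nodes numbered 1 to 6.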

61 Gapped seed set design formulation #2. Set cover instance. Ground set: all possible placements of the k errors (alignments). Covering sets: all possible placements of the l care positions. For (m=20, k=2, l=10): 190 elements, 184,756 sets! Greedy approximation algorithm works.
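The instance sizes follow from binomial coefficients, and the greedy rule is the textbook one: repeatedly pick the template that tolerates the most still-uncovered error placements. This is a generic greedy set cover sketch, not the talk's implementation; it is only practical for small m as written, and the function name is hypothetical.

```python
from itertools import combinations
from math import comb

def greedy_templates(m, k, l):
    """Greedy set cover: templates (care-position tuples) tolerating every k-error placement."""
    if k > m - l:
        raise ValueError("no template can avoid k errors when k > m - l")
    uncovered = set(combinations(range(m), k))
    # A template covers an error placement if no error falls on a care position.
    covers = {
        t: {e for e in uncovered if not set(e) & set(t)}
        for t in combinations(range(m), l)
    }
    chosen = []
    while uncovered:
        best = max(covers, key=lambda t: len(covers[t] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen

n_elements = comb(20, 2)   # placements of k = 2 errors in a 20-mer
n_sets = comb(20, 10)      # placements of l = 10 care positions
small = greedy_templates(6, 2, 3)  # the K_6 triangle instance
```

On the small instance each template's don't-care positions form one triangle, so the greedy answer can be compared with the optimum of 6 triangles.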

62 Gapped seed set design formulation #3. [Figure: bipartite graph between templates and positions (m); each template has degree l.] Remove any k position nodes; at least 1 template must still have degree l.

63 Gapped seed set design formulation #3. Polynomial size in terms of number of templates. Select T in advance and test whether it is sufficient. Greedily add 1, 2, 3, ... templates. Apply iteratively to achieve a feasible solution.

64 Solution for (20,2,10). [Figure: six templates t_1 to t_6 drawn over the 20 positions; t_1, t_2 and t_6 appear as one run of 10 care positions, and t_3, t_4 and t_5 as two runs of 5; column alignment lost in the transcript.] Need > 4 templates, 6 is optimal.

65 Remember the application! We are checking some templates twice! We compute hash(es) at each position in the genome. Any template that is a shift of another will be computed at some nearby genomic position!

66 Solution for (20,2,10). [Figure: the same six templates t_1 to t_6; column alignment lost in the transcript.] Need at most 3 templates... can we do better?
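The column alignment of the six templates is lost in the transcript. One layout consistent with the surviving run lengths, and assumed here, splits the 20 positions into four blocks of 5 and lets each template care about two of the four blocks. The sketch checks that this set has no false negatives for k = 2, that dropping any template breaks it, and that the six templates reduce to three shapes when shifts are identified. The function names are hypothetical.

```python
from itertools import combinations

def has_no_false_negatives(templates, m, k):
    """True if every placement of k errors is avoided by some template's care positions."""
    return all(
        any(not set(err) & set(t) for t in templates)
        for err in combinations(range(m), k)
    )

def block_templates(m=20, block=5, care_blocks=2):
    """Assumed layout: each template cares about care_blocks of the m // block blocks."""
    n_blocks = m // block
    return [
        tuple(p for b in bs for p in range(b * block, (b + 1) * block))
        for bs in combinations(range(n_blocks), care_blocks)
    ]

def shapes_up_to_shift(templates):
    """Distinct templates after translating each so its first care position is 0."""
    return {tuple(p - t[0] for p in t) for t in templates}

ts = block_templates()
valid = has_no_false_negatives(ts, 20, 2)
n_shapes = len(shapes_up_to_shift(ts))
```

Under this layout, two errors touch at most two blocks, so the template that cares about the other two blocks always survives; the three shapes are one run of 10, two runs of 5 separated by 5, and two runs of 5 separated by 10.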

67 Solution for (20,2,10) w/ shift. [Figure: two templates over the 20 positions; t_1 has care runs of 4, 2 and 4, and t_2 has care runs of 4, 1 and 5; gap widths lost in the transcript.] Optimal is 2 templates...

68 Gapped seed set design: solution strategies. Randomized algorithms. Greedy algorithm: directly on the set cover instance; indirectly on the bipartite instance. Integer programming: on set cover and bipartite instances; solution of the greedy algorithm subproblem... in parallel, using COIN-OR SYMPHONY. Branch-and-bound enumeration: solution of the greedy algorithm subproblem... in parallel, using the COIN-OR ALPS library.

69 What about edit-distance? Formulations can be generalized. Similar solution strategies can be applied. (All) symmetry lost! This may actually be helpful. Much harder to solve. Is greedy still good? Solutions typically require more templates.

70 Uniqueness Oracles. Integrated with the CSBH-graph construction algorithm: ensure the edge-count property is preserved. Sequence database of unique / non-unique 20-mers for small genomes: D. melanogaster, up to edit-distance 2. Currently working to scale to human...

71 Other Projects / Interests. HMMs for Peptide Spectrum Matching (with UMd, CS). Rapid Microorganism Identification Database (www.RMIDb.org). Pathogen detection using spectral matching (with USDA). Locality sensitive hashing: spectra, peptide sequence. Statistical techniques: statistical significance, importance sampling. CSBH-graph applications: genome assembly. Grid computing. Web applications. Relational databases.

72 Future Research Directions. Extend k-mer superstring algorithms: range of word sizes, variable-length words; other sequence properties (tryptic peptides, T_m). Identification of protein isoforms: optimize proteomics workflow for isoform detection; identify splice variants in cancer cell lines (MCF-7) and clinical brain tumor samples. Aggressive peptide sequence enumeration: dbPep for genomic annotation. Open, flexible informatics infrastructure for peptide identification.

73 Future Research Directions. Proteomics for microorganism identification: specificity of tandem mass spectra; revamp RMIDb prototype; incorporate spectral matching. Primer design: uniqueness oracle for inexact match in human; integration with Primer3; tiling, multiplexing, pooling, & tag arrays.

74 Acknowledgements. Chau-Wen Tseng, Xue Wu (UMCP Computer Science). Catherine Fenselau, Steve Swatkoski (UMCP Biochemistry). Calibrant Biosystems. PeptideAtlas, HUPO PPP, X!Tandem. Funding: National Cancer Institute.

