Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University.

Presentation transcript:

Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

2 k-mer (Sub-)Problems Enumerate: For all (distinct) k-mers, do ... Existence: ...with respect to exact (& inexact) count ≥ x Uniqueness: ...with respect to exact & inexact match Near-neighbors: ...with respect to inexact match Representation: Represent (distinct) k-mers for other tools Fast annotation of k-mer counts on original sequences
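The enumeration and existence sub-problems can be sketched with a plain hash table. This is a minimal illustration, not the talk's implementation: the function names are hypothetical, and the word size k = 4 is an assumption chosen because it appears consistent with the example sequences used on the later graph slides.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def kmers_with_count_at_least(sequences, k, x):
    """Existence sub-problem: distinct k-mers whose exact count is >= x."""
    return {w for w, c in kmer_counts(sequences, k).items() if c >= x}

# Example sequences taken from the SBH-graph slides; k = 4 is an assumption.
seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
counts = kmer_counts(seqs, 4)
```

A uniqueness query is then simply a check for `counts[w] == 1`; inexact counts need the filtering techniques discussed later in the talk.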

3 Applications of k-mer sets Peptide Identification Represent all amino-acid 30-mers...that occur at least twice in human dbEST PCR Primer Design: Test DNA 20-mer primers for uniqueness What does it mean to be unique? DNA sequencing error / repeat detection Eliminate mers that are too rare or too frequent Pathogen signatures Near-neighbors imply potential false-positives

4 k-mer Superstring Problem Given: a set of sequences S = {S_1, ..., S_n} (sequence database) and a word size k. Find: a new set of sequences T = {T_1, ..., T_m}. Such that: the total length of T is minimized, and T is complete and correct w.r.t. the k-mers of S.

5 k-mer Superstring Problem Completeness All of the k-mers of S are represented Correctness No additional k-mers are present Minimize the total representation length Correlates with running time
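The two constraints reduce to set inclusions between k-mer sets, which the following sketch checks directly. The helper names are hypothetical, and k = 4 is again an assumption; the sets S and T are the example sequences that appear on the later enumeration slides.

```python
def kmer_set(sequences, k):
    """All distinct k-mers occurring in a collection of sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}

def is_complete(S, T, k):
    """Every k-mer of S is represented in T."""
    return kmer_set(S, k) <= kmer_set(T, k)

def is_correct(S, T, k):
    """T introduces no k-mer that is absent from S."""
    return kmer_set(T, k) <= kmer_set(S, k)

def total_length(T):
    """Objective value: total number of characters in T."""
    return sum(len(t) for t in T)

S = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
T = ["ACDEFGEFGI", "DEFACG"]
```

Under these assumptions T is complete and correct for S while using 16 characters instead of 23.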

6 Shortest (common) superstring problem General strings (arbitrary length) Single output string Completeness for input sequences only Classical NP-hard problem Garey and Johnson Approximate within ~ 2.5*OPT Max-SNP hard One of the first algorithmic approaches to genome assembly

7 de Bruijn Sequences de Bruijn sequences represent all words of length k from some alphabet A. Examples: A = {0,1}, k = 3 and A = {0,1}, k = 4 [example sequences s not preserved in the transcript]

8 de Bruijn Graph: A = {0,1}, k =

9 de Bruijn Sequences & Graphs de Bruijn graphs (k,A): Edges represent length-k words over A. Each node has in-degree |A| and out-degree |A|. An Eulerian tour constructs a de Bruijn sequence.
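As a minimal sketch of the Eulerian-tour construction, the code below builds the graph implicitly (nodes are words of length k-1, edges are words of length k) and walks it with Hierholzer's algorithm. The function name is hypothetical, and the sequence it returns is only one of many valid de Bruijn sequences, so it need not match the examples shown on the slide.

```python
from itertools import product

def de_bruijn(alphabet, k):
    """Return a cyclic de Bruijn sequence over `alphabet` for word size k."""
    if k == 1:
        return "".join(alphabet)
    nodes = ["".join(p) for p in product(alphabet, repeat=k - 1)]
    # Each node has one unused outgoing edge per alphabet symbol.
    unused = {n: list(alphabet) for n in nodes}
    stack, tour = [nodes[0]], []
    while stack:                       # Hierholzer's algorithm, iterative form
        v = stack[-1]
        if unused[v]:
            a = unused[v].pop()
            stack.append((v + a)[1:])  # follow edge v+a to the next node
        else:
            tour.append(stack.pop())
    tour.reverse()                     # Eulerian circuit as a node sequence
    # Each edge contributes the last symbol of the node it enters.
    return "".join(node[-1] for node in tour[1:])

s3 = de_bruijn("01", 3)
s4 = de_bruijn("01", 4)
```

Reading the result cyclically, every word of length k occurs exactly once, so the sequence has length |A|^k.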

10 Sequencing-by- Hybridization-graph ACDEFGI, ACDEFACG, DEFGEFGI

11 Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

12 Sequence Databases & CSBH-graphs Original sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI

13 C³ Enumeration Complete: all k-mers are present. Correct: no other k-mers are present. Compact: no k-mer is present more than once.

14 Correct, Complete, Compact (C³) Enumeration Set of paths that use each edge exactly once: ACDEFGEFGI, DEFACG

15 Correct, Complete (C²) Enumeration Set of paths that use each edge at least once: ACDEFGEFGI, DEFACG

16 Patching the CSBH-graph Use artificial edges to fix unbalanced nodes
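The role of unbalanced nodes can be illustrated by counting them. The sketch below works on the uncompressed graph (k-mers as edges between (k-1)-mers), computes each node's out-degree minus in-degree, and derives how many paths are needed to use every edge exactly once: within each connected component, one path per unit of surplus out-degree, or a single path if the component is balanced. The function name is hypothetical, k = 4 is an assumption, and the length estimate ignores any separator character between output sequences.

```python
from collections import defaultdict

def c3_path_count(sequences, k):
    """Minimum number of edge-disjoint paths covering every k-mer edge once,
    plus the total length of the sequences those paths spell."""
    edges = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    surplus = defaultdict(int)   # out-degree minus in-degree
    parent = {}                  # union-find over nodes (weak connectivity)

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for e in edges:
        u, v = e[:-1], e[1:]
        surplus[u] += 1
        surplus[v] -= 1
        parent[find(u)] = find(v)

    starts = defaultdict(int)    # required path starts per component
    for node, d in surplus.items():
        starts[find(node)] += max(d, 0)
    paths = sum(max(1, s) for s in starts.values())
    # Each path spells (k - 1) characters plus one per edge it uses.
    return paths, len(edges) + (k - 1) * paths

paths, length = c3_path_count(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
```

For the example sequences this gives two paths and 16 characters, which agrees with the enumeration ACDEFGEFGI, DEFACG shown on the slides.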

17 Patching the CSBH-graph Use matching-style formulations to choose artificial edges. Optimal C²/C³ enumeration in polynomial time. Related problems: Chinese Postman Problem (Edmonds and Johnson, ’73); l-tuple DNA sequencing (Pevzner, ’89); Shortest (Common) Superstring (MAX-SNP-hard, 2.5-approximation algorithm).

18 Related work Chinese Postman Problem: undirected graph, weighted edges. Find the shortest tour that uses every edge at least once. Solvable in polynomial time: construct a minimum-weight matching between nodes of odd degree, add the matching to the graph, and find an Eulerian tour. This minimizes the weight of the extra edges used.

19 C² Enumeration Chinese postman problem, except: Directed graph: add edges from nodes with surplus in-degree to nodes with surplus out-degree. Fixed-cost teleportation option: can always “start” a new sequence. Find the optimal set of additional edges: a transportation problem / min-cost flow instance.

20 C³ Enumeration [Figure labels: Cost: k, #in-#out]

21 Reusing Edges ACDEHAC, ACDFHAC, ACDGHACD [Figure labels: ACDHAC, EHAC, FHAC, GHAC, D]

22 Reusing Edges C³: ACDEHACDFHAC, ACDGHACD [Figure labels: ACDHAC, EHAC, FHAC, GHAC, D, $ACD]

23 Reusing Edges C²: ACDEHACDFHACDGHAC [Figure labels: ACDHAC, EHAC, FHAC, GHAC, D, D]

24 C² Enumeration “Shortcut paths” [Figure labels: #in-#out]

25 C³ Enumeration [Figure labels: #in-#out, Cost: k, 0, 0, Cost: 0]

26 Sample Preparation for Peptide Identification Enzymatic Digest and Fractionation

27 Single Stage MS MS m/z

28 Tandem Mass Spectrometry (MS/MS) Precursor selection m/z

29 Tandem Mass Spectrometry (MS/MS) Precursor selection + collision induced dissociation (CID) MS/MS m/z

30 Peptide Identification For each (likely) peptide sequence 1. Compute fragment masses 2. Compare with spectrum 3. Retain those that match well Peptide sequences from protein sequence databases Swiss-Prot, IPI, NCBI’s nr,... Automated, high-throughput peptide identification in complex mixtures
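Steps 1 to 3 can be sketched as follows. This is a simplified illustration and not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, standard monoisotopic residue masses, and a plain shared-peak count with a hypothetical tolerance of 0.5 m/z.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values of a peptide."""
    masses = [MONO[a] for a in peptide]
    b, y, total = [], [], 0.0
    for m in masses[:-1]:              # prefixes give b-ions
        total += m
        b.append(total + PROTON)
    total = 0.0
    for m in reversed(masses[1:]):     # suffixes give y-ions
        total += m
        y.append(total + WATER + PROTON)
    return b, y

def shared_peaks(peptide, spectrum, tol=0.5):
    """Step 2: number of predicted fragments within `tol` of an observed peak."""
    b, y = fragment_mz(peptide)
    return sum(any(abs(f - p) <= tol for p in spectrum) for f in b + y)

def best_matches(candidates, spectrum, min_shared):
    """Step 3: retain candidates that match well, best first."""
    scored = [(shared_peaks(c, spectrum), c) for c in candidates]
    return sorted((s, c) for s, c in scored if s >= min_shared)[::-1]
```

Real search engines also filter candidates by precursor mass and use statistical scores, which this sketch leaves out.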

31 Novel Splice Isoform Human Jurkat leukemia cell-line Lipid-raft extraction protocol, targeting T cells von Haller, et al. MCP LIME1 gene: LCK interacting transmembrane adaptor 1 LCK gene: Leukocyte-specific protein tyrosine kinase Proto-oncogene Chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications

32 Novel Splice Isoform

33 Novel Splice Isoform

34 Novel Mutation HUPO Plasma Proteome Project Pooled samples from 10 male & 10 female healthy Chinese subjects Plasma/EDTA sample protocol Li, et al. Proteomics (Lab 29) TTR gene Transthyretin (pre-albumin) Defects in TTR are a cause of amyloidosis. Familial amyloidotic polyneuropathy late-onset, dominant inheritance

35 Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy

36 Novel Mutation

37 Compressed EST Peptide Sequence Database For all ESTs mapped to a UniGene gene: Six-frame translation Eliminate ORFs < 30 amino-acids Eliminate amino-acid 30-mers observed once Compress to C² FASTA database Complete, Correct for amino-acid 30-mers Inclusive gene-centric peptide sequence database: Size: < 3% of naïve enumeration, FASTA entries Running time: ~ 1% of naïve enumeration search E-values: ~ 2% of naïve enumeration search results

38 Compressed EST Peptide Sequence Database For all ESTs mapped to a UniGene gene: Six-frame translation Eliminate ORFs < 30 amino-acids Eliminate amino-acid 30-mers observed once Compress to C² FASTA database Complete, Correct for amino-acid 30-mers Gene-centric peptide sequence database: Size: < 3% of naïve enumeration, FASTA entries Running time: ~ 1% of naïve enumeration search E-values: ~ 2% of naïve enumeration search results
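The first two pipeline steps, six-frame translation and the ORF length filter, might look like the sketch below. It is an illustration under stated assumptions: an ORF is taken to be any stop-to-stop stretch of a translated frame, the standard genetic code is used, and codons containing ambiguous bases translate to "X". The function name is hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frame_orfs(dna, min_len):
    """Translate all six reading frames and keep stop-free stretches
    of at least `min_len` amino acids."""
    orfs = []
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for frame in range(3):
            codons = (strand[i:i + 3]
                      for i in range(frame, len(strand) - 2, 3))
            protein = "".join(CODON.get(c, "X") for c in codons)
            orfs.extend(p for p in protein.split("*") if len(p) >= min_len)
    return orfs
```

With `min_len=30` the output would feed the 30-mer counting step, after which only 30-mers observed at least twice are kept.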

39 Sequence Databases & CSBH-graphs All k-mers represented by an edge have the same count

40 CSBH-graph subgraphs Quickly determine those that occur twice

41 k-mer (Sub-)Problems Enumerate: For all (distinct) k-mers, do ... Existence: ...with exact (& inexact) count ≥ x Uniqueness: ...exact & inexact match Near-neighbors: ...inexact match Representation: Represent (distinct) k-mers for other tools Fast annotation of k-mer counts on original sequences

42 Large scale instances! CSBH-graph instances: partition the set of all k-mers, determine non-trivial nodes. Days on Condor grid (250 CPUs) to construct ≥ 100,000,000 nodes and edges (sparse & dense). Min-cost flow instances: ≥ 500,000 nodes and edges. Algorithms must be linear in problem size. Out-of-core Eulerian path algorithm? Currently testing out-of-core connected-components.

43 Grid computing Heterogeneous machines Varying disk/memory/MHz/cores capabilities Centralized scheduler Jobs started asynchronously Other jobs may preempt current job Input files may need to be staged 250 simultaneous requests for a 3Gb file? How to guarantee integrity of input files? Problem decomposition may be non-trivial Jobs sizes need to fit the least capable machine Sometimes need to “game” the scheduler Need to ensure the integrity of job output

44 Uniqueness Oracles Oracle for uniqueness of 20-mers in the Human genome (size: 3Gb) Count occurrences in the genome: 0,1,2+ Construct 20-mer superstring for 20-mers with count 1 Construct 20-mer superstring for 20-mers with count > 1 Easy(-ish) for exact sequence match: O(n) Fast automata, hash tables, suffix trees.
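For exact matching, the 0 / 1 / 2+ oracle can be sketched with a hash table whose counts saturate at 2. This is only an in-memory illustration with hypothetical function names; it scans the forward strand only, and a 3 Gb genome would need the compact superstring representations the slide describes rather than a Python dictionary.

```python
def build_oracle(genome, k):
    """Map each k-mer of the genome to 1 (unique) or 2 (meaning two or more)."""
    state = {}
    for i in range(len(genome) - k + 1):
        w = genome[i:i + k]
        state[w] = min(state.get(w, 0) + 1, 2)   # saturate at "2+"
    return state

def occurrences(oracle, kmer):
    """Return 0, 1, or 2 (meaning 2+) for a query k-mer."""
    return oracle.get(kmer, 0)

def split_by_uniqueness(oracle):
    """k-mer sets that would feed the two superstring constructions."""
    unique = {w for w, c in oracle.items() if c == 1}
    repeated = {w for w, c in oracle.items() if c > 1}
    return unique, repeated
```

The single pass over the genome is what gives the O(n) behaviour quoted for the exact case.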

45 Polymerase Chain Reaction

46 Polymerase Chain Reaction

47 Inexact sequence match Inexact sequence matching: O(n*m*k). Errors/mismatches (k): 1, 2, 3. # distinct 20-mers (m): O(n). Achieve expected linear time using a hybrid approach (blastn): exact search for short chunks of primers; expensive alignment only where chunks match. Large chunks ⇒ fast, but miss occurrences. Small chunks ⇒ slow, find all matches.

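48 Inexact sequence match Baeza-Yates, Perleberg: correct and O(n) for small k. Split the primer into k+1 chunks: with at most k errors, at least 1 chunk is observed with no error. Small k → large chunks → fast and correct. Form of locality sensitive hashing. [Figure: query q against genome g, chunks marked ≠, =, ≠]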
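The chunk argument can be sketched for mismatches (Hamming distance) only. The primer is split into k+1 chunks, each chunk is searched exactly, and the expensive comparison runs only where a chunk matches. The function names are hypothetical, the exact search uses Python's `str.find` as a stand-in for the fast automata or hash tables mentioned earlier, and edit distance is not handled.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_with_mismatches(primer, genome, k):
    """All start positions where primer matches genome with <= k mismatches.
    Assumes len(primer) >= k + 1 so that every chunk is non-empty."""
    m = len(primer)
    bounds = [i * m // (k + 1) for i in range(k + 2)]   # k + 1 chunks
    hits = set()
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = primer[lo:hi]
        pos = genome.find(chunk)
        while pos != -1:
            start = pos - lo            # implied start of the whole primer
            if 0 <= start <= len(genome) - m:
                if hamming(primer, genome[start:start + m]) <= k:
                    hits.add(start)
            pos = genome.find(chunk, pos + 1)
    return sorted(hits)
```

Because at most k of the k+1 chunks can contain a mismatch, every true occurrence is reached through at least one exact chunk hit, so the filter has no false negatives.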

49 Locality Sensitive Hashing For each primer: store a (set of) hash(es) in hash-table At each position in the genome: look-up a (set of) hash(es) in hash-table if any hash is found, do more expensive check Need to weigh sensitivity (false negatives) vs specificity (false positives) Our application requires speed and no false negatives!
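The store, look-up, and check loop can be sketched with templates of "care" positions used as hash keys. Everything named here is hypothetical, and the templates are drawn at random purely for illustration: a random set can miss inexact occurrences, which is exactly the false-negative problem the template design slides address.

```python
import random

def random_templates(m, l, T, seed=0):
    """T templates, each a sorted tuple of l care positions out of m."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(m), l))) for _ in range(T)]

def project(word, template):
    """Hash key: the characters of `word` at the care positions."""
    return "".join(word[i] for i in template)

def build_index(primers, templates):
    """For each primer, store one hash per template."""
    index = {}
    for p in primers:
        for t_id, t in enumerate(templates):
            index.setdefault((t_id, project(p, t)), set()).add(p)
    return index

def scan(genome, m, templates, index, k):
    """Look up hashes at each genome position; verify candidates exactly."""
    hits = set()
    for pos in range(len(genome) - m + 1):
        window = genome[pos:pos + m]
        for t_id, t in enumerate(templates):
            for p in index.get((t_id, project(window, t)), ()):
                if sum(a != b for a, b in zip(p, window)) <= k:
                    hits.add((p, pos))
    return hits
```

The verification step removes false positives; sensitivity depends entirely on the choice of templates.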

50 Random Projection Choose T templates of l random “care” positions [Figure: query q aligned against genome g]

51 Random Projection Choose T templates of l random “care” positions [Figure: template t1 applied to genome g]

52 Random Projection Choose T templates of l random positions [Figure: templates t1 and t2 applied to genome g]

53 Random Projection Choose T templates of l random positions [Figure: templates t1 and t2 applied to genome g]

54 Random Projection Choose T templates of l random positions [Figure: templates t1 and t2 applied to genome g]

55 Gapped seed-set design problem Given: mer-size: m ( = 20 ) # errors: k ( = 1,2,3) # cares: l ( = 10,12,14 ) Find the smallest set of templates with no false negatives. Minimize running time.
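The "no false negatives" requirement can be checked by brute force for small parameters: a template set is safe if, for every placement of k errors among the m positions, at least one template has all of its care positions error-free. The function name and the example template set are assumptions made for illustration.

```python
from itertools import combinations

def uncovered_error_patterns(m, k, templates):
    """Placements of k errors that hit a care position in every template.
    An empty result means the template set has no false negatives."""
    return [errs for errs in combinations(range(m), k)
            if not any(set(errs).isdisjoint(t) for t in templates)]

# Small example with m = 6, k = 2, l = 3: each template's care positions
# are the complement of one 3-position "don't care" set.
dont_care = [(0, 1, 2), (0, 1, 3), (0, 4, 5), (1, 4, 5), (2, 3, 4), (2, 3, 5)]
templates = [tuple(sorted(set(range(6)) - set(d))) for d in dont_care]
missed = uncovered_error_patterns(6, 2, templates)
```

Covering every placement of exactly k errors also covers fewer errors, since any smaller error set is contained in some k-set.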

56 Gapped seed set design formulation (for k = 2) Cover the edges of K_m with copies of K_{m-l}. How many triangles to cover K_6? (m = 6, k = 2, l = 3) Some instances of (m,2,m-3) cover each edge exactly once: Steiner triple systems.

57 How many triangles cover K_6? 15 edges total. Is 5 triangles possible?

58 How many triangles cover K_6? 15 edges total. Is 5 triangles possible? NO!

59 How many triangles cover K_6? 15 edges total. Is 5 triangles possible? NO! Each node requires 3 triangles. Triangles must account for at least 18 “edges”.

60 How many triangles cover K_6? 15 edges total. Is 5 triangles possible? NO! Each node requires 3 triangles. Triangles must account for at least 18 “edges”.
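The counting argument can be confirmed by exhaustive search, since K_6 has only 20 triangles. Each node has 5 incident edges and a triangle covers 2 of them, so every node needs 3 triangles; 6 × 3 = 18 node incidences at 3 per triangle gives at least 6 triangles. The sketch below, with a hypothetical function name, tries cover sizes in increasing order.

```python
from itertools import combinations

def min_triangle_cover(n):
    """Smallest set of triangles whose edges cover every edge of K_n
    (brute force, intended for small n only)."""
    edges = set(combinations(range(n), 2))
    triangles = list(combinations(range(n), 3))
    tri_edges = {t: set(combinations(t, 2)) for t in triangles}
    for size in range(1, len(triangles) + 1):
        for choice in combinations(triangles, size):
            covered = set().union(*(tri_edges[t] for t in choice))
            if covered == edges:
                return size, choice
    return None

size, cover = min_triangle_cover(6)
```

The search finds no cover with 5 triangles and stops at 6, matching the bound of 18 “edges” for 15 actual edges.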

61 Gapped seed set design formulation #2 Set cover instance: Ground set: all possible placements of the k errors (alignments) Covering sets: all possible placements of the l care positions For (m=20,k=2,l=10), 190 elements, 184,756 sets! Greedy approximation algorithm works
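A greedy pass over the set cover instance might look like the sketch below: the ground set is every placement of k errors, each candidate template covers the placements that avoid all of its care positions, and the template covering the most uncovered placements is chosen repeatedly. The function names are hypothetical, and the exhaustive candidate scan is only practical for small m; it is not the talk's implementation.

```python
from itertools import combinations
from math import comb

def instance_size(m, k, l):
    """(number of ground-set elements, number of covering sets)."""
    return comb(m, k), comb(m, l)

def greedy_templates(m, k, l):
    """Greedy set cover; returns templates as tuples of care positions."""
    uncovered = set(combinations(range(m), k))
    candidates = list(combinations(range(m), l))
    chosen = []
    while uncovered:
        def gain(t):
            return sum(1 for e in uncovered if set(t).isdisjoint(e))
        best = max(candidates, key=gain)
        covered = {e for e in uncovered if set(best).isdisjoint(e)}
        if not covered:
            raise ValueError("no template can cover the remaining patterns")
        uncovered -= covered
        chosen.append(best)
    return chosen

small = greedy_templates(8, 2, 4)
```

`instance_size(20, 2, 10)` reproduces the 190 elements and 184,756 sets quoted on the slide. Greedy gives a feasible template set but not necessarily an optimal one.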

62 Gapped seed set design formulation #3 [Figure: bipartite graph of templates and positions (m); each template node has degree l] Remove any k position nodes: at least 1 template must still have degree l.

63 Gapped seed set design formulation #3 Polynomial size in terms of number of templates Select T in advance and test whether sufficient. Greedily add 1,2,3,... templates. Apply iteratively to achieve feasible solution

64 Solution for (20,2,10) Positions ********** t 1 ********** t 2 ***** ***** t 3 ***** ***** t 4 ***** ***** t 5 ********** t 6 Need > 4 templates, 6 is optimal

65 Remember the application! We are checking some templates twice! We compute hash(es) at each position in the genome Any template that is a shift of another will be computed at some nearby genomic position!

66 Solution for (20,2,10) Positions ********** t 1 ********** t 2 ***** ***** t 3 ***** ***** t 4 ***** ***** t 5 ********** t 6 Need at most 3 templates...can we do better?

67 Solution for (20,2,10) w/ shift Positions **** ** **** t 1 **** * ***** t 2 Optimal is 2 templates...

68 Gapped seed set design Solution strategies Randomized algorithms Greedy algorithm Directly to set cover instance Indirectly to bipartite instance Integer programming On set cover and bipartite instances Solution of greedy algorithm subproblem...in parallel, using COIN-OR SYMPHONY Branch-and-bound enumeration Solution of greedy algorithm subproblem...in parallel, using COIN-OR ALPS library

69 What about edit-distance? Formulations can be generalized Similar solution strategies can be applied (All) symmetry lost! This may actually be helpful Much harder to solve Is greedy still good? Solutions typically require more templates

70 Uniqueness Oracles Integrated with CSBH-graph construction algorithm Ensure edge-count property is preserved Sequence database of unique / non- unique 20-mers for small genomes D. melanogaster, up to edit-distance 2 Currently working to scale to human...

71 Other Projects / Interests HMMs for Peptide Spectrum Matching with UMd, CS Rapid Microorganism Identification Database Pathogen detection using Spectral Matching with USDA Locality sensitive hashing spectra, peptide sequence Statistical techniques statistical significance importance sampling CSBH-graph applications genome assembly Grid computing Web-applications Relational databases

72 Future Research Directions Extend k-mer superstring algorithms Range of word sizes, variable length words Other sequence properties (Tryptic peptides, T m ) Identification of protein isoforms: Optimize proteomics workflow for isoform detection Identify splice variants in cancer cell-lines (MCF-7) and clinical brain tumor samples Aggressive peptide sequence enumeration dbPep for genomic annotation Open, flexible informatics infrastructure for peptide identification

73 Future Research Directions Proteomics for Microorganism Identification Specificity of tandem mass spectra Revamp RMIDb prototype Incorporate spectral matching Primer design Uniqueness oracle for inexact match in human Integration with Primer3 Tiling, multiplexing, pooling, & tag arrays

74 Acknowledgements Chau-Wen Tseng, Xue Wu UMCP Computer Science Catherine Fenselau, Steve Swatkoski UMCP Biochemistry Calibrant Biosystems PeptideAtlas, HUPO PPP, X!Tandem Funding: National Cancer Institute