Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data Gabriel Ilie 1, Alex Zelikovsky 2 and Ion Măndoiu 1 1 CSE Department,


1 Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data Gabriel Ilie 1, Alex Zelikovsky 2 and Ion Măndoiu 1 1 CSE Department, University of Connecticut 2 CS Department, Georgia State University

2 MassCLEAVE assay for MS-based nucleic acid sequence analysis

3 Error model
Two types of error are incurred when matching compomer c to a peak of mass m and intensity i(m):
– Relative mass error
– Relative intensity error
Signed relative errors are assumed to follow a normal distribution with mean 0 and standard deviation σ for masses and σ’ for intensities.
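The two error formulas are not preserved in this transcript. The sketch below assumes the usual definition of a signed relative error, (observed − expected) / expected, and simulates a noisy peak under the stated normal model. The function names and the `simulate_peak` helper are illustrative assumptions, not the authors' code.

```python
import random


def relative_mass_error(observed_mass, expected_mass):
    """Signed relative mass error (assumed form: (observed - expected) / expected)."""
    return (observed_mass - expected_mass) / expected_mass


def relative_intensity_error(observed_intensity, expected_intensity):
    """Signed relative intensity error, same assumed form as for masses."""
    return (observed_intensity - expected_intensity) / expected_intensity


def simulate_peak(expected_mass, expected_intensity, sigma, sigma_prime, rng):
    """Draw a noisy (mass, intensity) pair with N(0, sigma) and N(0, sigma') relative errors."""
    mass = expected_mass * (1.0 + rng.gauss(0.0, sigma))
    intensity = expected_intensity * (1.0 + rng.gauss(0.0, sigma_prime))
    return mass, intensity


rng = random.Random(0)
mass, intensity = simulate_peak(1500.0, 2.0, sigma=0.0001, sigma_prime=0.1, rng=rng)
print(relative_mass_error(mass, 1500.0), relative_intensity_error(intensity, 2.0))
```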

4 Notations
– Σ = {A, C, G, T} = DNA alphabet
– CS_σ(s) = compomer spectrum of a sequence s digested at cut base σ
– CS(s) = (CS_σ(s)), σ ∈ Σ = the compomer spectra obtained by performing all four cleavage reactions on s
– MS_σ = mass spectrum obtained by base-specific cleavage of the unknown target at cut base σ
– MS = (MS_σ), σ ∈ Σ = the mass spectra obtained by performing all four cleavage reactions
– m_0 = minimum detectable mass
– σ = standard deviation of signed relative measurement errors in the masses
– σ’ = standard deviation of signed relative measurement errors in the intensities
– Tolerance parameter = user-specified
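To make the notation concrete, here is a minimal sketch of computing CS_σ(s). It assumes complete cleavage after every occurrence of the cut base, with the cut base kept at the end of its fragment, and it represents a compomer as a tuple of (A, C, G, T) counts. Those modelling choices and the function names are assumptions for illustration; the slides do not spell them out.

```python
from collections import Counter

ALPHABET = "ACGT"


def compomer_spectrum(sequence, cut_base):
    """Return a Counter mapping each compomer (A, C, G, T counts) to its multiplicity."""
    fragments, current = [], []
    for base in sequence:
        current.append(base)
        if base == cut_base:  # assumed: cleavage happens right after the cut base
            fragments.append("".join(current))
            current = []
    if current:  # trailing fragment with no cut base
        fragments.append("".join(current))

    spectrum = Counter()
    for fragment in fragments:
        counts = Counter(fragment)
        spectrum[tuple(counts.get(b, 0) for b in ALPHABET)] += 1
    return spectrum


def all_compomer_spectra(sequence):
    """CS(s): one compomer spectrum per cut base."""
    return {b: compomer_spectrum(sequence, b) for b in ALPHABET}


print(compomer_spectrum("ACGTACGT", "A"))
```

The filter for detectable compomers (mass at least m_0) is left out because it needs a mass table that the transcript does not give.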

5 Problem formulation
Given:
– Mass spectra MS
– Reference sequence r, including the positions of the PCR primers
– Maximum edit distance D
– Standard deviations σ and σ’, and the tolerance parameter
Find: target sequence t, flanked by the PCR primers, that
a. is within edit distance D of r, and
b. yields a matching of the compomers of CS(t) to the masses of MS with minimum total relative error

6 Naïve Algorithm
Exhaustive search:
– Generate all sequences within edit distance D of the reference, and
– Compute the minimum total relative error for matching the compomers of each of these sequences to the masses in MS.
The number of candidate sequences grows exponentially with D.
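The candidate-generation step of the exhaustive search can be sketched as follows. This is an illustrative enumeration, not the authors' implementation: it expands the reference by single substitutions, deletions, and insertions, D times. Scoring each candidate against MS is omitted.

```python
def one_edit_neighbors(seq, alphabet="ACGT"):
    """All distinct sequences exactly one substitution, deletion, or insertion away."""
    out = set()
    for i in range(len(seq) + 1):
        for b in alphabet:
            out.add(seq[:i] + b + seq[i:])  # insertion before position i
        if i < len(seq):
            out.add(seq[:i] + seq[i + 1:])  # deletion of position i
            for b in alphabet:
                if b != seq[i]:
                    out.add(seq[:i] + b + seq[i + 1:])  # substitution at position i
    return out


def sequences_within(reference, max_edits):
    """All sequences within edit distance max_edits of the reference."""
    seen = {reference}
    frontier = {reference}
    for _ in range(max_edits):
        frontier = {n for s in frontier for n in one_edit_neighbors(s)} - seen
        seen |= frontier
    return seen


for d in range(3):
    print(d, len(sequences_within("ACGTACGT", d)))
```

Printing the counts for D = 0, 1, 2 on even a short reference shows how quickly the candidate set grows.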

7 3-Stage Algorithm
1. Identify regions of the reference sequence that are unambiguously supported by MS data
– High probability of being present in the unknown target sequence
2. Branch-and-bound approach to fill in the remaining gaps
– Generates a set of candidate sequences with compomers supported by MS data
3. Compute candidate sequences with minimum total relative error
– Min-cost flow problem, currently solved as a linear program
– With or without intensities

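8 First stage: finding strongly supported regions of the reference
Chebyshev’s inequality: for a random variable X with mean μ and standard deviation σ, P(|X − μ| ≥ kσ) ≤ 1/k².
A detectable compomer c ∈ CS_σ(s) is strongly matched to mass m ∈ MS_σ if the relative mass error of the match is at most ε in absolute value, where ε = σ / (tolerance)^0.5 is set based on the user-specified tolerance.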
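As a worked instance of the threshold, the sketch below computes ε = σ / √tolerance and applies it to a single compomer/peak pair. The exact matching condition is not preserved in the transcript, so the test |relative mass error| ≤ ε, with the error taken relative to the compomer mass, is an assumption, as are the function names.

```python
import math


def chebyshev_threshold(sigma, tolerance):
    """epsilon = sigma / sqrt(tolerance); Chebyshev bounds P(|error| >= epsilon) by tolerance."""
    return sigma / math.sqrt(tolerance)


def is_strong_match(compomer_mass, peak_mass, sigma, tolerance):
    """Assumed test: absolute relative mass error within the Chebyshev threshold."""
    relative_error = (peak_mass - compomer_mass) / compomer_mass
    return abs(relative_error) <= chebyshev_threshold(sigma, tolerance)


# With the simulation settings from the slides (sigma = 0.0001, tolerance = 0.01),
# the threshold is ten standard deviations.
print(chebyshev_threshold(0.0001, 0.01))
print(is_strong_match(2000.0, 2001.0, 0.0001, 0.01))
```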

9 First stage: finding strongly supported regions of the reference
A strong match between compomer c and mass m is unambiguous if:
– c has multiplicity 1 in the reference
– c can be strongly matched only to m
– m can be strongly matched only to c
The set M of unambiguous matches can be found efficiently by binary search.
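The three conditions above can be checked with binary search over the sorted peak list, roughly as follows. The input layout (a dict from a compomer label to its mass and multiplicity) and the relative-error window around each compomer mass are assumptions made for this sketch.

```python
from bisect import bisect_left, bisect_right


def unambiguous_matches(compomers, peaks, eps):
    """compomers: {label: (mass, multiplicity)}; peaks: iterable of masses.

    Returns [(label, peak_mass)] for matches that satisfy all three conditions.
    """
    peaks = sorted(peaks)
    windows = {}
    peak_hits = [0] * len(peaks)  # how many compomers strongly match each peak
    for label, (mass, _) in compomers.items():
        lo = bisect_left(peaks, mass * (1.0 - eps))
        hi = bisect_right(peaks, mass * (1.0 + eps))
        windows[label] = (lo, hi)
        for j in range(lo, hi):
            peak_hits[j] += 1

    matches = []
    for label, (_, multiplicity) in compomers.items():
        lo, hi = windows[label]
        # multiplicity 1, exactly one peak in range, and that peak has no other partner
        if multiplicity == 1 and hi - lo == 1 and peak_hits[lo] == 1:
            matches.append((label, peaks[lo]))
    return matches


compomers = {"c1": (1000.0, 1), "c2": (2000.0, 1), "c3": (2000.5, 1), "c4": (3000.0, 2)}
print(unambiguous_matches(compomers, [1000.2, 2000.2, 3000.1], eps=0.001))
```

In this example only c1 survives: c2 and c3 compete for the same peak, and c4 has multiplicity 2.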

10 First stage: finding strongly supported regions of the reference
(c_1, m_1), ..., (c_n, m_n) = unambiguous matches for cut base σ, indexed in non-decreasing order of relative errors.
We iteratively apply Chebyshev’s inequality with the user-specified tolerance to the running means of signed relative errors, which are normally distributed with mean 0 and standard deviation σ / i^0.5.
If Chebyshev’s inequality fails for index i, match (c_i, m_i) is removed from M.
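A minimal sketch of this filtering step follows. Two details are assumptions, since the slide does not state them: matches are ordered by the absolute value of their relative error, and a removed match is excluded from later running means.

```python
import math


def filter_by_running_mean(signed_errors, sigma, tolerance):
    """Keep the errors whose running mean passes the Chebyshev test.

    The mean of i errors has standard deviation sigma / sqrt(i), so the bound
    at index i is (sigma / sqrt(i)) / sqrt(tolerance).
    """
    kept = []
    total = 0.0
    for err in sorted(signed_errors, key=abs):  # assumed ordering: by |error|
        i = len(kept) + 1
        running_mean = (total + err) / i
        bound = sigma / math.sqrt(i) / math.sqrt(tolerance)
        if abs(running_mean) <= bound:
            kept.append(err)
            total += err
        # otherwise the match is dropped and does not enter later means (assumption)
    return kept


print(filter_by_running_mean([1e-4, -2e-4, 5e-4, 5e-3], sigma=1e-4, tolerance=0.01))
```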

11 First stage: finding strongly supported regions of the reference
A position in the reference sequence has strong support if:
– All detectable compomers overlapping it can be strongly matched, and
– At least one of these matches is in M (unambiguous and not removed)
Positions in the PCR primers are automatically marked as having strong support.

12 Second stage: generating candidate targets by branch-and-bound
Reference regions with strong support are assumed to be present in the target.
Gaps are filled one base at a time, in left-to-right order, using branch-and-bound:
– Choice order: reference base, substitutions, deletion, insertions
– Chebyshev test with the user-specified tolerance applied to running means of signed relative errors of closest matches
The search is pruned when the test fails or when more than D mutations have been used.
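The search order and the two pruning rules can be sketched as below. The Chebyshev test is abstracted into a caller-supplied predicate `is_supported(prefix)`, which is a hypothetical placeholder, and `gap_positions` is an assumed representation of the positions without strong support. This illustrates the control flow only and is not the authors' implementation.

```python
def branch_and_bound(reference, gap_positions, max_mutations, is_supported):
    """Enumerate candidate targets, trying the reference base first at each gap position."""
    results = []

    def extend(prefix, pos, used):
        if used > max_mutations or not is_supported(prefix):
            return  # prune: too many mutations or the support test failed
        if pos == len(reference):
            results.append(prefix)
            return
        ref_base = reference[pos]
        if pos not in gap_positions:
            extend(prefix + ref_base, pos + 1, used)  # strongly supported: keep as is
            return
        extend(prefix + ref_base, pos + 1, used)  # 1. reference base
        for b in "ACGT":
            if b != ref_base:
                extend(prefix + b, pos + 1, used + 1)  # 2. substitutions
        extend(prefix, pos + 1, used + 1)  # 3. deletion
        for b in "ACGT":
            extend(prefix + b, pos, used + 1)  # 4. insertions before pos

    extend("", 0, 0)
    return list(dict.fromkeys(results))  # drop duplicates, keep discovery order


# Toy predicate standing in for the Chebyshev test: reject any prefix containing "GG".
candidates = branch_and_bound("ACGT", {1, 2}, 1, lambda p: "GG" not in p)
print(candidates)
```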

13 Third stage: scoring candidates by linear programming
Objective:
– Minimize total relative error
Variables:
– For each c ∈ CS_σ and m ∈ MS_σ, x_{c,m} is set to 1 if c is matched to m, 0 otherwise (integrality follows from total unimodularity)
Constraints:
– No missing peaks: each detectable compomer c ∈ CS_σ(t) must be matched to one mass in MS_σ
– No extraneous peaks: each mass m ∈ MS_σ must be matched to at least one detectable compomer c ∈ CS_σ(t)
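For very small instances, the optimum of this matching problem can be found by brute force, which makes the objective and both constraints easy to inspect. The sketch below is a stand-in for the linear program, not a replacement: it enumerates every assignment of compomers to peaks, so it is exponential in the number of compomers. It scores masses only, using absolute relative error relative to the compomer mass, which is an assumed form of the objective.

```python
from itertools import product


def best_matching(compomer_masses, peak_masses):
    """Return (min total |relative error|, assignment) or (inf, None) if infeasible.

    assignment[i] is the index of the peak matched to compomer i.
    """
    all_peaks = set(range(len(peak_masses)))
    best_cost, best_assignment = float("inf"), None
    # Each compomer picks exactly one peak (no missing peaks).
    for assignment in product(range(len(peak_masses)), repeat=len(compomer_masses)):
        if set(assignment) != all_peaks:
            continue  # some peak is unmatched (extraneous peak): infeasible
        cost = sum(
            abs(peak_masses[j] - c) / c for c, j in zip(compomer_masses, assignment)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment


print(best_matching([1000.0, 1000.1, 2000.0], [1000.05, 2000.2]))
```

Two compomers of nearly equal mass may share one peak, which is why the peak constraint is "at least one" rather than "exactly one".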

14 LP w/o intensities

15 LP with intensities

16 Simulation setup
– Reference length: 100-500 bp
– Reference sequences/targets:
  – D=1: 10 random references, all sequences at edit distance 1 used as targets
  – D=2,3: 100 random reference-target pairs
– Error-free MS data: σ = σ’ = 0
– Noisy MS data: σ = 0.0001, σ’ = 0-1
– Tolerance parameter: 0.01

17 Precision and Recall
Outcomes are defined by comparing the predicted target(s) with the actual target:
– tp (true positive): prediction is unique and correct
– fp (false positive): prediction is unique and incorrect
– fn (false negative): prediction is not unique
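Using these outcome definitions with the standard formulas precision = tp / (tp + fp), recall = tp / (tp + fn), and F-measure = 2PR / (P + R), the evaluation can be sketched as follows. The function names and the convention of returning 0.0 for an empty denominator are assumptions for illustration.

```python
def classify(predicted_targets, actual_target):
    """Label one experiment as 'tp', 'fp', or 'fn' from its list of predicted targets."""
    if len(predicted_targets) != 1:
        return "fn"  # prediction is not unique
    return "tp" if predicted_targets[0] == actual_target else "fp"


def precision_recall_f(outcomes):
    """Compute (precision, recall, F-measure) from a list of outcome labels."""
    tp = outcomes.count("tp")
    fp = outcomes.count("fp")
    fn = outcomes.count("fn")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


outcomes = [
    classify(["ACGT"], "ACGT"),
    classify(["ACGA"], "ACGT"),
    classify(["ACGT", "ACGA"], "ACGT"),
]
print(outcomes, precision_recall_f(outcomes))
```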

18 Branch-and-bound vs. Naïve (F-measure for D=1, error free data, w/o intensities)

19 Branch-and-bound speed-up (D=1, error free data, w/o intensities)

20 Results on noisy data (F-measure, D=1, σ = 0.0001, w/o intensities)

21 Effect of the number of mutations (F-measure, σ = 0.0001, w/o intensities)

22 Do intensities help? (F-measure, σ = 0.0001, 1 substitution)

23 Do intensities help? (F-measure, σ = 0.0001)

24 Ongoing Work
Experiments on EPLD clone data:
– Branch-and-bound relaxation + penalty in LP objective to handle missing/extraneous peaks
– Intensity data normalization: correct for mass and base composition effects

