kGEM: an EM Error Correction Algorithm for NGS Amplicon-based Data. Alexander Artyomenko


1 kGEM: an EM Error Correction Algorithm for NGS Amplicon-based Data Alexander Artyomenko

2 Introduction
Reconstructing the spectrum of a viral population
Challenges:
  – Assembling short reads to span the entire genome
  – Distinguishing sequencing errors from mutations
Avoid assembling:
  – Identify sequences via a high-variability region

3 Previous Work
KEC (k-mer Error Correction) [Skums et al.]
  – Incorporates counts (frequencies) of k-mers (substrings of length k)
QuasiRecomb (Quasispecies Recombination) [Töpfer et al.]
  – Hidden Markov Model-based approach
  – Incorporates the possibility of recombinant progeny
  – Parameter: k generators (ancestor haplotypes)

4 Problem Formulation
Given: a set of reads R emitted by a set of unknown haplotypes H’
Find: a set of haplotypes H = {H_1, …, H_k} maximizing Pr(R|H)

5 Fractional Haplotype
Fractional haplotype: a string of 5-tuples of probabilities, one for each possible symbol: a, c, t, g, d=‘-’
      a     c     -     t     c     t     g     c
a   0.71  0.06  0.0   0.13  0.0   0.27  0.10  0.03
c   0.13  0.94  0.0   0.0   0.64  0.0   0.14  0.58
t   0.16  0.0   0.01  0.87  0.11  0.73  0.0   0.09
g   0.0   0.0   0.21  0.0   0.25  0.0   0.76  0.09
d   0.0   0.0   0.78  0.0   0.0   0.0   0.0   0.21
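One possible in-memory form of a fractional haplotype is a list of per-position probability tables. The sketch below is only an illustration: the names ALPHABET, rows and is_valid are hypothetical, and the numbers are transcribed from the slide's example on the assumption that blank cells repeat the neighbouring 0.0.

```python
ALPHABET = "actg-"  # the slide writes the deletion symbol '-' as d

# One row per symbol, one entry per position (values from the slide's example).
rows = {
    "a": [0.71, 0.06, 0.0, 0.13, 0.0, 0.27, 0.10, 0.03],
    "c": [0.13, 0.94, 0.0, 0.0, 0.64, 0.0, 0.14, 0.58],
    "t": [0.16, 0.0, 0.01, 0.87, 0.11, 0.73, 0.0, 0.09],
    "g": [0.0, 0.0, 0.21, 0.0, 0.25, 0.0, 0.76, 0.09],
    "-": [0.0, 0.0, 0.78, 0.0, 0.0, 0.0, 0.0, 0.21],
}

# Position-major view: fractional_haplotype[m][e] is the probability of symbol e at position m.
fractional_haplotype = [{e: rows[e][m] for e in ALPHABET} for m in range(8)]


def is_valid(frac, tol=1e-6):
    """Check that every position holds a probability distribution over ALPHABET."""
    return all(
        min(col.values()) >= 0.0 and abs(sum(col.values()) - 1.0) <= tol
        for col in frac
    )
```

With this layout, fractional_haplotype[2]["-"] reads off the deletion probability at the third position (0.78 in the example).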

6 kGEM
Initialize (fractional) haplotypes
Repeat until haplotypes are unchanged:
  – Estimate Pr(r|H_i), the probability of a read r being emitted by haplotype H_i
  – Estimate frequencies of haplotypes
  – Update and round haplotypes
  – Collapse identical and drop rare haplotypes
Output haplotypes

7 Initialization
Find a set of reads representing the haplotype population
  – Start with a random read
  – Each next read maximizes the minimum distance to the previously chosen reads
[Figure: selected reads labeled 1 to 4]
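The seed selection described on this slide can be sketched as a farthest-first traversal. This is a minimal illustration, not the authors' code: the slide does not name the distance, so Hamming distance over aligned, equal-length reads is an assumption, and hamming and select_seeds are hypothetical names.

```python
import random


def hamming(a, b):
    """Number of positions at which two aligned, equal-length reads differ (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def select_seeds(reads, k, rng=None):
    """Pick up to k reads: a random first one, then repeatedly the read farthest from all chosen."""
    rng = rng or random.Random()
    seeds = [rng.choice(reads)]
    while len(seeds) < k:
        def gap(read):
            # Minimum distance from this read to the seeds chosen so far.
            return min(hamming(read, s) for s in seeds)

        best = max(reads, key=gap)
        if gap(best) == 0:
            break  # every remaining read duplicates a seed
        seeds.append(best)
    return seeds
```

Passing a seeded random.Random makes the first pick reproducible; the early exit returns fewer than k seeds when the data set has fewer than k distinct reads.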

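8 Initialization
Transform each selected read s into a fractional haplotype: at position m, the symbol s_m (the m-th symbol of s) gets probability 1 − 4ε, and each of the other four symbols gets probability ε.
Example with ε = 0.01 for the read a c - t g - g a - c:
      a     c     -     t     g     -     g     a     -     c
a   0.96  0.01  0.01  0.01  0.01  0.01  0.01  0.96  0.01  0.01
c   0.01  0.96  0.01  0.01  0.01  0.01  0.01  0.01  0.01  0.96
t   0.01  0.01  0.01  0.96  0.01  0.01  0.01  0.01  0.01  0.01
g   0.01  0.01  0.01  0.01  0.96  0.01  0.96  0.01  0.01  0.01
d   0.01  0.01  0.96  0.01  0.01  0.01  0.01  0.01  0.96  0.01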
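A short sketch of this transformation, assuming the rule implied by the ε = 0.01 example (0.96 = 1 − 4ε for the symbol present in the read, ε for each of the other four). The names ALPHABET and to_fractional are hypothetical.

```python
ALPHABET = "actg-"  # '-' is the deletion symbol, written d on the slide


def to_fractional(read, eps=0.01):
    """Turn an aligned read into a list of per-position probability tables."""
    return [
        {e: (1.0 - 4.0 * eps if e == symbol else eps) for e in ALPHABET}
        for symbol in read
    ]


# The read used in the slide's example.
example = to_fractional("ac-tg-ga-c")
```

Keeping ε strictly positive means no symbol starts with probability zero, so a read that disagrees with a seed at a few positions still has a nonzero chance of being attributed to it.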

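9 Read Emission Probability
For each i = 1, …, k and for each read r_j from R, compute the value h_{i,j}, the probability that r_j is emitted by haplotype H_i.
[Figure: bipartite graph between reads and haplotypes, with edges labeled h_{1,1}, h_{1,2}, h_{2,1}, h_{2,2}, h_{3,1}, h_{3,2}]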
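One natural way to compute such a value, given the per-position probabilities of a fractional haplotype, is to multiply the probability of each read symbol at its position. Treating positions as independent is an assumption made for this sketch, and emission_probability is a hypothetical name.

```python
def emission_probability(read, frac):
    """Probability of an aligned read under a fractional haplotype.

    Assumes positions are independent, so the result is the product of the
    per-position probabilities of the symbols observed in the read.
    """
    p = 1.0
    for symbol, column in zip(read, frac):
        p *= column[symbol]
    return p


# Tiny example: a two-position haplotype 'ac' with eps = 0.01.
frac = [
    {"a": 0.96, "c": 0.01, "t": 0.01, "g": 0.01, "-": 0.01},
    {"a": 0.01, "c": 0.96, "t": 0.01, "g": 0.01, "-": 0.01},
]
```

For reads a few hundred symbols long, a plain product of small numbers can underflow to zero in double precision; summing logarithms is a common workaround, provided zero entries are handled.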

10 Estimate Frequencies
Estimate haplotype frequencies via the Expectation Maximization (EM) method
Repeat two steps until the change < σ:
  – E-step: expected portion of read r emitted by haplotype H_i
  – M-step: updated frequency of haplotype H_i
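The two steps can be sketched in the standard mixture-model form: the E-step shares each read among haplotypes in proportion to frequency times emission probability, and the M-step averages those shares. This form is assumed here rather than taken from the slide, and estimate_frequencies, sigma and max_iter are hypothetical names.

```python
def estimate_frequencies(h, sigma=1e-6, max_iter=1000):
    """EM for haplotype frequencies.

    h[i][j] is the probability that read j is emitted by haplotype i.
    Returns (freqs, p), where p[i][j] is the expected portion of read j
    attributed to haplotype i.
    """
    k, n = len(h), len(h[0])
    freqs = [1.0 / k] * k  # start from uniform frequencies
    p = [[0.0] * n for _ in range(k)]
    for _ in range(max_iter):
        # E-step: share each read among haplotypes.
        for j in range(n):
            total = sum(freqs[i] * h[i][j] for i in range(k))
            for i in range(k):
                # A read that no haplotype can emit contributes nothing.
                p[i][j] = freqs[i] * h[i][j] / total if total > 0 else 0.0
        # M-step: new frequency is the average share over all reads.
        new_freqs = [sum(p[i]) / n for i in range(k)]
        change = max(abs(a - b) for a, b in zip(new_freqs, freqs))
        freqs = new_freqs
        if change < sigma:
            break
    return freqs, p
```

For example, with h = [[1.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]] three of the four reads can only come from the first haplotype, and the estimate settles at frequencies 0.75 and 0.25.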

11 Update Haplotypes
Update allele frequencies for each haplotype according to each read’s contribution:
      a     c     -     t     c     t     …     g     c
a   0.71  0.06  0.0   0.13  0.0   0.27  …   0.10  0.03
c   0.13  0.94  0.0   0.0   0.64  0.0   …   0.14  0.58
t   0.16  0.0   0.01  0.87  0.11  0.73  …   0.0   0.09
g   0.0   0.0   0.21  0.0   0.25  0.0   …   0.76  0.09
d   0.0   0.0   0.78  0.0   0.0   0.0   …   0.0   0.21
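A plausible reading of this update is a weighted vote: at each position, every read adds its expected portion for the haplotype to the symbol it carries, and the totals are normalized. The sketch below assumes that reading; update_haplotype and weights are hypothetical names.

```python
ALPHABET = "actg-"  # '-' is the deletion symbol, written d on the slide


def update_haplotype(reads, weights):
    """Recompute one fractional haplotype from aligned reads.

    weights[j] is the expected portion of read j attributed to this
    haplotype (for instance, the E-step output of the frequency estimation).
    """
    length = len(reads[0])
    frac = [{e: 0.0 for e in ALPHABET} for _ in range(length)]
    for read, w in zip(reads, weights):
        for m, symbol in enumerate(read):
            frac[m][symbol] += w  # weighted vote for the observed symbol
    total = sum(weights)
    if total > 0:
        for column in frac:
            for e in ALPHABET:
                column[e] /= total
    return frac
```

With reads ["ac", "at", "ac"] and weights [1.0, 1.0, 2.0], the second position comes out as c = 0.75 and t = 0.25.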

12 Round Haplotypes
Round each haplotype’s position to the most probable allele
Before rounding:
a   0.76  0.0   0.01  0.06  0.77  0.0   0.29  …   0.14  0.09
c   0.11  0.89  0.01  0.01  0.23  0.68  0.0   …   0.06  0.50
t   0.13  0.0   0.11  0.93  0.0   0.14  0.71  …   0.0   0.04
g   0.01  0.0   0.21  0.0   0.0   0.18  0.0   …   0.80  0.23
d   0.01  0.11  0.68  0.0   0.0   0.0   0.0   …   0.0   0.14
Rounded haplotype: a c - t a c t … g c
After rounding:
a   0.96  0.01  0.01  0.01  0.96  0.01  0.01  …   0.01  0.01
c   0.01  0.96  0.01  0.01  0.01  0.96  0.01  …   0.01  0.96
t   0.01  0.01  0.01  0.96  0.01  0.01  0.96  …   0.01  0.01
g   0.01  0.01  0.01  0.01  0.01  0.01  0.01  …   0.96  0.01
d   0.01  0.01  0.96  0.01  0.01  0.01  0.01  …   0.01  0.01
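Rounding amounts to taking the most probable symbol at each position. The sketch below applies that to the slide's example; the values are transcribed on the assumption that blank cells repeat their neighbour, the elided "…" columns are simply skipped, and round_haplotype is a hypothetical name.

```python
ALPHABET = "actg-"  # '-' is the deletion symbol, written d on the slide


def round_haplotype(frac):
    """Return the string of most probable symbols, one per position.

    Ties go to the symbol that comes first in ALPHABET.
    """
    return "".join(max(ALPHABET, key=lambda e: col[e]) for col in frac)


# The nine columns visible on the slide (the elided '…' columns are not included).
rows = {
    "a": [0.76, 0.0, 0.01, 0.06, 0.77, 0.0, 0.29, 0.14, 0.09],
    "c": [0.11, 0.89, 0.01, 0.01, 0.23, 0.68, 0.0, 0.06, 0.50],
    "t": [0.13, 0.0, 0.11, 0.93, 0.0, 0.14, 0.71, 0.0, 0.04],
    "g": [0.01, 0.0, 0.21, 0.0, 0.0, 0.18, 0.0, 0.80, 0.23],
    "-": [0.01, 0.11, 0.68, 0.0, 0.0, 0.0, 0.0, 0.0, 0.14],
}
frac = [{e: rows[e][m] for e in ALPHABET} for m in range(9)]
rounded = round_haplotype(frac)  # "ac-tactgc", as shown on the slide
```

The slide's second table suggests the rounded string is then expanded back to 0.96 and 0.01 entries, in the same way as during initialization.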

13 Collapse and Drop Rare
Collapse haplotypes which have the same integral strings
Drop haplotypes with coverage ≤ δ
  – Empirically, δ<5 implies a drop in PPV without improving sensitivity
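A minimal sketch of this step, assuming that coverage means the number of reads attributed to a haplotype and that collapsed haplotypes pool their coverage. The function name collapse_and_drop and the default delta=5 are hypothetical choices, the latter loosely motivated by the empirical remark above.

```python
from collections import defaultdict


def collapse_and_drop(haplotypes, coverage, delta=5):
    """Merge identical rounded haplotypes, then drop those with coverage <= delta.

    haplotypes: rounded haplotype strings.
    coverage:   reads attributed to each haplotype (assumed meaning of coverage).
    Returns a dict mapping each surviving haplotype to its pooled coverage.
    """
    pooled = defaultdict(float)
    for hap, cov in zip(haplotypes, coverage):
        pooled[hap] += cov  # identical strings collapse into one entry
    return {hap: cov for hap, cov in pooled.items() if cov > delta}
```

For instance, collapse_and_drop(["ac", "ac", "tt", "gg"], [3, 4, 5, 6]) keeps "ac" (pooled coverage 7) and "gg" (6), and drops "tt" because its coverage equals δ.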

14 kGEM
Initialize (fractional) haplotypes
Repeat until haplotypes are unchanged:
  – Estimate Pr(r|H_i), the probability of a read r being emitted by haplotype H_i
  – Estimate frequencies of haplotypes
  – Update and round haplotypes
  – Collapse identical and drop rare haplotypes
Output haplotypes

15 Experimental Setup
HCV E1E2 sub-region (315 bp)
20 simulated data sets of 10 variants
100,000 reads from Grinder 0.5
10 data sets with homopolymer errors
Frequency distribution: uniform and power-law model with parameter α = 2.0

16

17 Acknowledgements
Nicholas Mancuso
Alex Zelikovsky
Pavel Skums
Ion Măndoiu

18 Thank you! Questions?

