KMERSTREAM: Streaming algorithms for k-mer abundance estimation. Páll Melsted, joint work with Bjarni V. Halldórsson.


1 KMERSTREAM Streaming algorithms for k-mer abundance estimation Páll Melsted @pmelsted joint work with Bjarni V. Halldórsson

2 Error rates vs. Quality values  What error rates can we expect from NGS?  Specifically, whole genome sequencing with Illumina sequencing technology  How informative are quality values?  Rubbish?  Worth using for analysis?

3 Quality values  A probability estimate that the basecall is correct  FASTQ record: @SEQ_ID (header), GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACT (bases), + (separator), !''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>C (quality string)  Phred scale: Pr[base call incorrect] ≈ 10^(-Q/10)  '!' is ASCII 33 (Q = 0), bad basecall
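The Phred relation on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the common Phred+33 encoding in which '!' (ASCII 33) stands for Q = 0; the function names are illustrative and not taken from KmerStream.

```python
def phred_to_error(q: float) -> float:
    """Phred scale: estimated probability that a base call is incorrect."""
    return 10 ** (-q / 10)


def decode_qualities(qual: str, offset: int = 33) -> list[int]:
    """Map each quality character to its Phred score (ASCII code minus offset)."""
    return [ord(c) - offset for c in qual]


# Print the score and error probability for three quality characters.
for ch, q in zip("!5I", decode_qualities("!5I")):
    print(ch, q, phred_to_error(q))
```

Under this encoding, '!' decodes to Q = 0 (error probability 1.0), which is why the slide calls it a bad basecall.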

4 Error rates  What percentage of basecalls are incorrect?  How to estimate  Align reads to a reference  Count mismatches and non-alignments  Correct for SNPs and variants.  Reference free  Whole genome assembly?

5 K-mer counting  Count k-mers, want k large, say ~31.  Example with k = 3:  Read GATTTGGGGTTCAAAGCAGTA gives GAT ATT TTT TTG TGG...  Read ATTTGGGGTTGATT gives ATT TTT TTG TGG GGG GGT GTT TTG TGA GAT ATT  Counts over both reads: GAT 2, ATT 3, TTT 2, TTG 3, TGG 2...
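The counts on this slide can be reproduced with exact counting. This is a minimal sketch with k = 3 for readability; it does not merge a k-mer with its reverse complement, which is a simplifying assumption, and the helper names are illustrative.

```python
from collections import Counter


def kmers(read: str, k: int) -> list[str]:
    """All overlapping substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads: list[str], k: int) -> Counter:
    """Exact k-mer counts over a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


reads = ["GATTTGGGGTTCAAAGCAGTA", "ATTTGGGGTTGATT"]
counts = count_kmers(reads, 3)
for kmer in ["GAT", "ATT", "TTT", "TTG", "TGG"]:
    print(kmer, counts[kmer])
```

Exact counting like this keeps every distinct k-mer in memory, which is what becomes expensive at whole-genome scale.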

6 Errors and k-mers  Basecall errors impact many k-mers  [Figure: the read GATTTGGGGTTCAAAGCAGTA shown with its overlapping k-mers GATTT ATTTG TTTGG TTGGG TGGGG GGGGT GGGTT GGTTC ...]
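To make the point concrete, the sketch below counts how many k-mers of a read change when one base is substituted. The read is the one from the slide; the positions, the replacement bases, and k = 5 are arbitrary choices for illustration.

```python
def kmers(read: str, k: int) -> list[str]:
    """All overlapping substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def affected_kmers(read: str, pos: int, new_base: str, k: int) -> int:
    """Number of k-mers that change when read[pos] is replaced by new_base."""
    if read[pos] == new_base:
        raise ValueError("new_base must differ from the original base")
    mutated = read[:pos] + new_base + read[pos + 1:]
    return sum(a != b for a, b in zip(kmers(read, k), kmers(mutated, k)))


read = "GATTTGGGGTTCAAAGCAGTA"
middle = affected_kmers(read, 10, "A", 5)             # error in the middle of the read
first = affected_kmers(read, 0, "C", 5)               # error at the first base
last = affected_kmers(read, len(read) - 1, "C", 5)    # error at the last base
print(middle, first, last)
```

An error in the interior of a read corrupts k overlapping k-mers, while an error at either end corrupts only one.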

7 Errors and k-mers  Basecall errors are not independent  Multiple errors more likely  Ends of reads contain more errors  K-mer error rate underestimates true basecall error rate  Discounts reads with many errors or errors at the ends  Can be off by a factor of 2

8 Frequency histograms  Sequencing at normal coverage, ~30x, most true k-mers will have high coverage and most error k-mers will have coverage of 1

9 Naïve method  Assumptions:  Sampling from a genome of size G  Coverage of each position follows a Poisson distribution, Poi(λ)  Each sampled k-mer is an error with probability ε, independently of the others.  An error k-mer is the true k-mer with a single nucleotide substitution chosen at random

10 Naïve model  Probability that a k-mer has coverage 1:  ε · Pr[error k-mer has cov 1] + (1-ε) · Pr[true k-mer has cov 1]  [Figure: sample a random position in a genome of length G; with probability 1-ε produce the correct k-mer, with probability ε introduce one error; example k-mers TGAC and TGGC]

11 Frequency moments  From the frequency histogram we define  f_i = number of k-mers with coverage i  f_1 = number of singletons  F_0 = number of distinct k-mers = Σ_i f_i  F_1 = number of all k-mers = Σ_i i·f_i
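Given exact k-mer counts, these three statistics follow directly from the definitions. This is a small sketch using a made-up toy count table; the function name is illustrative.

```python
from collections import Counter


def frequency_moments(counts: dict[str, int]) -> tuple[int, int, int]:
    """Return (f1, F0, F1) from a mapping of k-mer -> coverage."""
    hist = Counter(counts.values())               # hist[i] = f_i
    f1 = hist[1]                                  # number of singletons
    F0 = sum(hist.values())                       # number of distinct k-mers
    F1 = sum(i * fi for i, fi in hist.items())    # number of all k-mers
    return f1, F0, F1


toy_counts = {"AAA": 1, "AAC": 3, "ACG": 1}       # hypothetical counts
f1, F0, F1 = frequency_moments(toy_counts)
print(f1, F0, F1)
```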

12 Fitting the model  3 unknown parameters: G, λ, ε  3 k-mer frequency statistics: f_1, F_0, F_1
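The slides do not give the fitting equations, so the following is only a sketch of how three statistics can pin down three parameters. It is not the authors' model: it assumes the genome has G distinct k-mers, that error-free coverage of a true k-mer is Poisson with mean μ = λ(1-ε), and, as a deliberate simplification, that every error k-mer is a distinct singleton. Under those assumptions F_1 = Gλ, F_0 = G(1 - e^(-μ)) + εGλ, and f_1 = Gμe^(-μ) + εGλ, so the ratio (F_1 - F_0)/(F_0 - f_1) depends on μ alone and can be inverted by bisection.

```python
import math


def forward(G: float, lam: float, eps: float) -> tuple[float, float, float]:
    """Expected (f1, F0, F1) under the simplified model described above."""
    mu = lam * (1 - eps)       # mean error-free coverage of a true k-mer
    errors = eps * G * lam     # error k-mers, each assumed to be a distinct singleton
    F1 = G * lam
    F0 = G * (1 - math.exp(-mu)) + errors
    f1 = G * mu * math.exp(-mu) + errors
    return f1, F0, F1


def fit(f1: float, F0: float, F1: float) -> tuple[float, float, float]:
    """Recover (G, lam, eps) from (f1, F0, F1) by bisection on mu."""
    target = (F1 - F0) / (F0 - f1)

    def ratio(mu: float) -> float:
        return (mu - 1 + math.exp(-mu)) / (1 - (1 + mu) * math.exp(-mu))

    lo, hi = 1e-3, 1e3         # search interval for mu; ratio grows with mu here
    for _ in range(200):
        mid = (lo + hi) / 2
        if ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    mu = (lo + hi) / 2
    G = (F1 - F0) / (mu - 1 + math.exp(-mu))
    lam = F1 / G
    eps = 1 - mu / lam
    return G, lam, eps


stats = forward(G=1e6, lam=10.0, eps=0.05)   # synthetic statistics, not real data
print(fit(*stats))
```

The round trip recovers the parameters used to generate the synthetic statistics; real data would need a model that accounts for repeated and colliding error k-mers.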

13 Computing the moments  Count all k-mers? Very memory intensive  Sample k-mers (à la KmerGenie)  Streaming algorithm, KmerStream  Estimates f_1, F_0, F_1 directly without storing any k-mers  Accuracy can be specified (default ~2%)

14 KmerStream  Very fast, 5-10s per million reads  Low memory overhead, ~11M  One pass over the dataset  Uses hashing to sample k-mers adaptively  Lossy counting similar to Bloom filter  Does not keep track of k-mers  2-3x faster than KmerGenie, 10x better memory
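The idea of sampling k-mers adaptively by hash value can be illustrated with a generic adaptive-sampling sketch for distinct counting. This is not KmerStream's data structure: unlike KmerStream, which per the slide does not keep track of k-mers, the sketch below stores the hash values of the sampled k-mers. The class name, the capacity, and the choice of BLAKE2b as hash function are all assumptions made for illustration.

```python
import hashlib


def hash64(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class AdaptiveSampler:
    """Keep only k-mers whose hash has `level` trailing zero bits."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.level = 0
        self.sample = {}   # hash value -> exact count of that sampled k-mer
        self.total = 0     # F1 needs no sampling: count every k-mer seen

    def add(self, kmer: str) -> None:
        self.total += 1
        h = hash64(kmer)
        if h & ((1 << self.level) - 1):
            return         # hash falls outside the current sample
        self.sample[h] = self.sample.get(h, 0) + 1
        while len(self.sample) > self.capacity:
            # Sample is full: halve the sampling rate and drop non-matching hashes.
            self.level += 1
            mask = (1 << self.level) - 1
            self.sample = {x: c for x, c in self.sample.items() if x & mask == 0}

    def estimates(self) -> tuple[int, int, int]:
        """Return estimates of (f1, F0) scaled by the sampling rate, and exact F1."""
        scale = 1 << self.level
        f1 = sum(1 for c in self.sample.values() if c == 1) * scale
        F0 = len(self.sample) * scale
        return f1, F0, self.total
```

Memory is bounded by the capacity regardless of how many k-mers stream past, and a larger capacity gives a tighter estimate.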

15 Validation  Sampled reads from PhiX sequencing lane at 30x coverage, repeated 1000 times.  [Figure: KmerStream estimates vs. true k-mer counts]

16 Real data  2656 individuals sequenced at deCODE genetics, at 10x to 30x coverage.  KmerStream run for all samples, model fit to estimate k-mer error rates for k=31

17 K-mer error rates

18 Quality cutoff  Keep only k-mers in reads where quality is above q.  Run for q = 0, 13, 20, 30.  Should correspond to upper bound on error of 1.0, 0.05, 0.01, 0.001
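One way to read the cutoff rule is sketched below. It assumes that a k-mer is kept only when every one of its bases has a Phred score of at least q (so q = 0 keeps everything) and that qualities use the Phred+33 encoding; both are assumptions, and KmerStream's exact rule may differ. The function name is illustrative.

```python
def kmers_above_quality(seq: str, qual: str, k: int, q: int, offset: int = 33) -> list[str]:
    """k-mers of seq in which every base has Phred quality >= q."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string must have equal length")
    scores = [ord(c) - offset for c in qual]
    kept = []
    for i in range(len(seq) - k + 1):
        if min(scores[i:i + k]) >= q:   # the lowest-quality base decides
            kept.append(seq[i:i + k])
    return kept


# 'I' is Q40 and '!' is Q0 under Phred+33; the low-quality base is at index 4.
print(kmers_above_quality("ACGTAC", "IIII!I", k=3, q=20))
print(kmers_above_quality("ACGTAC", "IIII!I", k=3, q=0))
```

A single low-quality base removes every k-mer that overlaps it, so the cutoff discards up to k k-mers per bad base.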

19 K-mer error rates  Moving from q0 to q13: huge improvement  Moving from q20 to q30: not recommended, 50% of samples had an increased error rate

20 Wrap up  Quality values are informative  Can get a speed-up by prioritizing processing based on quality values, e.g. in alignment  Error rates are highly variable  Quality value cutoffs can be chosen on a case-by-case basis with minimal overhead.

21 Thank you  Paper on bioRxiv  Code on github.com/pmelsted/KmerStream  Ph.D. position available “Streaming algorithms for whole genome assembly.”

