EBI is an Outstation of the European Molecular Biology Laboratory. CRAM: reference-based compression format developed by Vadim Zalunin.


1 EBI is an Outstation of the European Molecular Biology Laboratory. CRAM: reference-based compression format developed by Vadim Zalunin

2 Data horror. EMBL-EBI: 10 petabytes. SRA: ~1 petabyte. Over 2 million DVDs, or 2.5 km. Complete Genomics: 0.5 TB for a single file.

3 The need for compression Red alert

4 Compression, what is it? LOSSLESS: BMP, 190 kb; PNG, 100 kb. LOSSY: JPG, 21 kb; JPG, 4 kb.

5 Compression, when we know what to expect. LOSSLESS: BMP, 145 kb; PNG, 2 kb. LOSSY: JPG, 6 kb; JPG, 3 kb. But the actual message is only 40 characters (bytes) long!

6 Compression at its best. IMAGE, 145 kb → (compress) → TEXT, 40 b: "Five little ducks went swimming one day" → (uncompress) → IMAGE, 145 kb. ~3500 times more efficient.

7 What are we talking about? Labels: sample, sequencing machines, bug, bunch of huge files. The bug’s DNA is hidden somewhere.

8 Looking closer at the data. The bunch of huge files boils down to a long list of reads: read 1, read 2, read 3, ….., read bazillion. Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.

9 What is a read? An excerpt from a FASTQ file: CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG… + IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…

10 What is a read? An excerpt from a FASTQ file: CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG… + IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI… Labels: read name.

11 What is a read? An excerpt from a FASTQ file: CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG… + IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI… Labels: read name, read bases. Bases: ACGTN.

12 What is a read? An excerpt from a FASTQ file: CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG… + IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI… Labels: read name, read bases, read quality scores. Bases: ACGTN. Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126).
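The record layout described on slides 9 to 12 can be sketched in a few lines of Python. This is only an illustration, assuming the common four-line FASTQ layout (name line starting with `@`, bases, a `+` separator, quality symbols). The read name `read_1` is hypothetical, since the name line is not visible in the excerpt, and the bases and qualities are just the first ten symbols of the excerpt.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    name: str
    bases: str
    qualities: str


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record: @name, bases, '+' separator, qualities."""
    header, bases, separator, qualities = (line.strip() for line in lines)
    if not header.startswith("@"):
        raise ValueError("read name line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(bases) != len(qualities):
        raise ValueError("bases and quality scores must have the same length")
    if set(bases) - set("ACGTN"):
        raise ValueError("unexpected base symbol")
    # Quality symbols range from '!' (ASCII 33) to '~' (ASCII 126).
    if any(not 33 <= ord(symbol) <= 126 for symbol in qualities):
        raise ValueError("quality symbol outside '!'..'~'")
    return FastqRecord(header[1:], bases, qualities)


# Hypothetical read name; bases and qualities are truncated from the slide excerpt.
record = parse_fastq_record(["@read_1", "CCAGATCCTG", "+", "IDCEFFGGHH"])
```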

13 What is a quality score? The quality score is a Phred quality score encoded as an ASCII symbol. Basically: higher scores are better, so ‘!’ is bad and ‘I’ is good.
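To make the encoding concrete, here is a small Python sketch of how a quality symbol maps back to a number. It assumes the offset of 33 implied by the ‘!’ (ASCII 33) lower bound, and the standard Phred relation between a score Q and the error probability, P = 10^(-Q/10). The function names are illustrative only.

```python
def phred_scores(quality_string, offset=33):
    """Convert ASCII quality symbols to numeric Phred scores (offset 33 assumed)."""
    return [ord(symbol) - offset for symbol in quality_string]


def error_probability(score):
    """Estimated probability that the base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


scores = phred_scores("!I")  # '!' is the worst score, 'I' is a good one
probabilities = [error_probability(score) for score in scores]
```

With these definitions ‘!’ decodes to score 0 (an error probability of 1, so the base is essentially a guess) and ‘I’ decodes to score 40 (roughly one error in 10,000 calls).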

14 Reference based encoding
Reference sequence: TGAGCTCTAAGTACCCGCGGTCTGTCCG
read 1: TGAGCTCTTAGTAGC
read 2: GCTCTAAGTAGCCGC
read 3: CTCTAAGTAGCCGCG
read 4: GTAGCCGCGGACTGT
read 5: CGGTCTGTCCG
Labels: read start position, read end position.

15 Reference based encoding Reference sequence TGAGCTCTAAGTACCCGCGGTCTGTCCG read T read read read A.... read

16 Reference based encoding Reference sequence TGAGCTCTAAGTACCCGCGGTCTGTCCG read T read read read A.... read Mismatching bases
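The idea on slides 14 to 16 can be sketched as follows: store each read as its start position, its length, and only the bases that differ from the reference. This is a minimal illustration, not the CRAM format itself. The 0-based start positions below are an assumption obtained by lining the five reads up against the reference by eye, and the function names are hypothetical.

```python
REFERENCE = "TGAGCTCTAAGTACCCGCGGTCTGTCCG"


def encode_read(reference, start, read):
    """Return (start, length, mismatches); mismatches are (offset, base) pairs."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]
    return start, len(read), mismatches


def decode_read(reference, encoded):
    """Rebuild the read from the reference plus the stored mismatches."""
    start, length, mismatches = encoded
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


# Start positions (0-based) are assumed alignments, not taken from the slides.
aligned_reads = [
    (0, "TGAGCTCTTAGTAGC"),
    (3, "GCTCTAAGTAGCCGC"),
    (4, "CTCTAAGTAGCCGCG"),
    (10, "GTAGCCGCGGACTGT"),
    (17, "CGGTCTGTCCG"),
]
encoded_reads = [encode_read(REFERENCE, start, read) for start, read in aligned_reads]
decoded_reads = [decode_read(REFERENCE, encoded) for encoded in encoded_reads]
```

A read that matches the reference exactly, such as read 5 here, needs nothing beyond its position and length, which is where the space saving comes from.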

17 Lossy quality scores. Approach 1: quality scores are usually values from 0 to 39; let’s shrink them so that they range from 0 to 7. Approach 2: let’s treat quality scores using alignment information, for example by preserving quality scores only for mismatching bases. Labels: horizontal, vertical.
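Both approaches can be illustrated with a short Python sketch. The slide does not say how the 0 to 39 range is shrunk, so equal-width bins of five scores each are an assumption here. Likewise, using `None` to mark a discarded score is just a convention for this example, and the function names are hypothetical.

```python
def bin_quality(scores, levels=8, max_score=39):
    """Approach 1: map scores in 0..max_score onto 0..levels-1 (equal-width bins assumed)."""
    width = (max_score + 1) // levels  # 40 // 8 = 5 scores per bin
    return [min(score // width, levels - 1) for score in scores]


def keep_mismatch_qualities(scores, mismatch_offsets):
    """Approach 2: keep scores only where the read differs from the reference."""
    keep = set(mismatch_offsets)
    return [score if index in keep else None for index, score in enumerate(scores)]


binned = bin_quality([0, 4, 5, 39])
sparse = keep_mismatch_qualities([30, 31, 32, 33], mismatch_offsets=[2])
```

Eight levels fit in 3 bits per score, while the second approach stores no score at all for bases that agree with the reference.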

18 Comparison study: 1K Genomes exomes. BAM → (compress) → CRAM → (uncompress) → BAM.

19 Comparison study: 1K Genomes exomes. BAM → (compress) → CRAM → (uncompress) → BAM. Some analysis pipeline.

20 Comparison study: 1K Genomes exomes. BAM → (compress) → CRAM → (uncompress) → BAM. Some analysis pipeline. Original SNPs; restored SNPs.

21 Comparison study: 1K Genomes exomes

22 CRAM NGS data compression. [Figure: bits/base, from (bad) to (good). Labels: Do nothing, Untreated, Lossless, Lossy, CRAM lossless, CRAM lossy, CRAM very lossy.]

23 Progressive application of compression. [Figure: axes Sample value (High to Low) and Sample accessibility (Hard to Easy). Compression levels: Lossless, 2-fold, 20-fold, 200-fold.]

24 References. More information: Mailing list: Publications: Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5); Cochrane, G., Cook, C.E. and Birney, E. (2012) The future of DNA sequence archiving. GigaScience 1.

