EBI is an Outstation of the European Molecular Biology Laboratory. CRAM: reference-based compression format developed by Vadim Zalunin.


1 EBI is an Outstation of the European Molecular Biology Laboratory. CRAM: reference-based compression format developed by Vadim Zalunin

2 Data horror: EMBL-EBI 10 petabytes; SRA ~1 petabyte; over 2 million DVDs, or 2.5 km; Complete Genomics 0.5 TB for a single file.

3 The need for compression Red alert

4 Compression, what is it? Lossless: BMP, 190 kb; PNG, 100 kb. Lossy: JPG, 21 kb; JPG, 4 kb.

5 Compression, when we know what to expect. Lossless: BMP, 145 kb; PNG, 2 kb. Lossy: JPG, 6 kb; JPG, 3 kb. But the actual message is only 40 characters (bytes) long!

6 Compression at its best. IMAGE, 145 kb, compresses to TEXT, 40 bytes ("Five little ducks went swimming one day"), which uncompresses back to IMAGE, 145 kb: ~3500 times more efficient.

7 What are we talking about? Sample (bug), sequencing machines, bunch of huge files. The bug’s DNA is hidden somewhere.

8 Looking closer at the data. The bunch of huge files boils down to a long list of reads: read 1, read 2, read 3, ….. read bazillion. Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.

9 What is a Read? An excerpt from a FASTQ file:
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…

10 What is a Read? An excerpt from a FASTQ file:
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…
Labelled: read name.

11 What is a Read? An excerpt from a FASTQ file:
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…
Labelled: read name, read bases. Bases: ACGTN

12 What is a Read? An excerpt from a FASTQ file:
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…
Labelled: read name, read bases, read quality scores. Bases: ACGTN. Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126)
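A FASTQ record like the one on these slides occupies four lines: an `@`-prefixed read name, the bases, a `+` separator, and one quality character per base. The sketch below splits such a record into its parts. The `FastqRecord` type and `parse_fastq_record` function are hypothetical names chosen for illustration, and the sample record is invented because the slide's read is truncated.

```python
from typing import NamedTuple, Sequence


class FastqRecord(NamedTuple):
    name: str       # read name, without the leading '@'
    bases: str      # nucleotide letters (A, C, G, T, N)
    qualities: str  # one ASCII quality character per base


def parse_fastq_record(lines: Sequence[str]) -> FastqRecord:
    """Parse one four-line FASTQ record into name, bases and qualities."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record is expected to have four lines")
    header, bases, separator, qualities = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("the header line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("the separator line must start with '+'")
    if len(bases) != len(qualities):
        raise ValueError("bases and qualities must have the same length")
    return FastqRecord(header[1:], bases, qualities)


# Invented example record, not taken from the slides.
record = parse_fastq_record(["@example.1\n", "ACGTN\n", "+\n", "II5!~\n"])
print(record.name, record.bases, record.qualities)
```

The length check reflects the one-quality-character-per-base layout shown on the slide; multi-line FASTQ variants are not handled here.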

13 What is a quality score? The quality score is a Phred quality score encoded as ASCII symbols 33-126. Basically: higher scores are better, so ‘!’ is bad, ‘I’ is good.
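As a rough illustration of the encoding on this slide, the sketch below subtracts the ASCII offset 33 from each character to recover the Phred score, then converts a score Q to an error probability using the standard Phred definition P = 10^(-Q/10). The function names are hypothetical; the sample string is the first six quality characters of the read shown earlier.

```python
def phred_scores(quality_string: str, offset: int = 33) -> list[int]:
    """Decode ASCII quality characters into integer Phred scores."""
    return [ord(symbol) - offset for symbol in quality_string]


def error_probability(score: int) -> float:
    """Convert a Phred score Q to an error probability, P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# '!' is ASCII 33, so it decodes to 0 (bad); 'I' is ASCII 73, so it decodes to 40 (good).
print(phred_scores("!I"))
# First six quality characters of the FASTQ excerpt.
print(phred_scores("IDCEFF"))
print(error_probability(40))
```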

14 Reference based encoding
Reference sequence  TGAGCTCTAAGTACCCGCGGTCTGTCCG
read 1              TGAGCTCTTAGTAGC
read 2                 GCTCTAAGTAGCCGC
read 3                  CTCTAAGTAGCCGCG
read 4                        GTAGCCGCGGACTGT
read 5                               CGGTCTGTCCG
Labels: Read start position, Read end position

15 Reference based encoding
Reference sequence  TGAGCTCTAAGTACCCGCGGTCTGTCCG
read 1              ........T......
read 2                 ...............
read 3                  ...............
read 4                        ..........A....
read 5                               ...........

16 Reference based encoding
Reference sequence  TGAGCTCTAAGTACCCGCGGTCTGTCCG
read 1              ........T......
read 2                 ...............
read 3                  ...............
read 4                        ..........A....
read 5                               ...........
Label: Mismatching bases
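The idea on these slides can be sketched as follows: instead of storing a read's bases, store where it starts on the reference, how long it is, and only the bases that differ. This is a minimal illustration, not the actual CRAM record layout. It assumes 0-based positions and substitutions only (no insertions or deletions), and `encode_read` and `decode_read` are hypothetical names.

```python
def encode_read(reference: str, read: str, start: int):
    """Encode a read as (start, length, mismatches) relative to the reference.

    mismatches is a list of (offset_in_read, read_base) pairs.
    Assumes a gap-free alignment that begins at 0-based position `start`.
    """
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("the read extends past the end of the reference")
    mismatches = [
        (offset, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]
    return start, len(read), mismatches


def decode_read(reference: str, start: int, length: int, mismatches) -> str:
    """Rebuild the read from the reference and the stored differences."""
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


# Invented toy data: one substitution at offset 4 of the read.
toy_reference = "ACGTACGTACGT"
encoded = encode_read(toy_reference, "GTACCT", 2)
print(encoded)
print(decode_read(toy_reference, *encoded))

# Read 5 from the slides matches the reference exactly at 0-based position 17.
slide_reference = "TGAGCTCTAAGTACCCGCGGTCTGTCCG"
print(encode_read(slide_reference, "CGGTCTGTCCG", 17))
```

A read that matches perfectly costs only a position and a length, which is where most of the saving comes from when reads closely follow the reference.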

17 Lossy quality scores. Approach 1: Quality scores are usually values from 0 to 39. Let’s shrink them, so that they are from 0 to 7 now. Approach 2: Let’s treat quality scores using alignment information. For example: preserve only quality scores for mismatching bases. Slide labels: horizontal, vertical.
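Both approaches on this slide can be sketched in a few lines. The slide does not say how scores are shrunk from 0-39 to 0-7, so uniform binning is assumed here purely for illustration. The function names and the sample data are hypothetical.

```python
def bin_quality(score: int, max_score: int = 39, levels: int = 8) -> int:
    """Approach 1 (assumed uniform binning): map 0..max_score onto 0..levels-1."""
    if not 0 <= score <= max_score:
        raise ValueError("score is outside the expected range")
    return score * levels // (max_score + 1)


def keep_mismatch_qualities(reference_window: str, read: str, scores: list[int]) -> dict[int, int]:
    """Approach 2: keep quality scores only where the read differs from the reference.

    Returns {offset_in_read: score}; scores at matching positions are dropped.
    """
    return {
        offset: score
        for offset, (ref_base, read_base, score)
        in enumerate(zip(reference_window, read, scores))
        if ref_base != read_base
    }


print([bin_quality(q) for q in (0, 4, 5, 20, 39)])

# Invented example: the read differs from the reference window at offset 4 only.
print(keep_mismatch_qualities("GTACGT", "GTACCT", [30, 31, 32, 33, 12, 35]))
```

With 40 input values and 8 output levels, each bin covers five consecutive scores, so three bits per score suffice instead of six.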

18 Comparison study: 1K Genomes exomes. Diagram labels: BAM, CRAM, compress, uncompress.

19 Comparison study: 1K Genomes exomes. Diagram labels: BAM, CRAM, compress, uncompress, Some analysis pipeline.

20 Comparison study: 1K Genomes exomes. Diagram labels: BAM, CRAM, compress, uncompress, Some analysis pipeline, Original SNPs, Restored SNPs.

21 Comparison study: 1K Genomes exomes

22 CRAM NGS data compression. [Chart of bits/base, on an axis from (bad) to (good). Labels: Do nothing, Untreated, CRAM lossless, CRAM lossy, CRAM very lossy. Groups: Lossless, Lossy.]

23 Progressive application of compression. [Chart. Labels: Sample value (High, Low); Sample accessibility (Hard, Easy). Compression levels: Lossless, 2-fold, 20-fold, 200-fold.]

24 References
More information: http://www.ebi.ac.uk/ena/about/cram_toolkit
Mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev
Publications:
Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40
Cochrane G., Cook C.E. and Birney E. (2012) The future of DNA sequence archiving. Gigascience 1

