Burrows-Wheeler Transformation Review

Compression techniques
Lossless: Huffman coding, run-length encoding (RLE), Lempel-Ziv (LZ77), Burrows-Wheeler transform (BWT).
Lossy: used to handle audio, image, and video.

Huffman coding
If we send a telegram with the content ‘a b a c c d a’, there are four different characters, so we can use two bits to code each of them. 00:a 01:b 10:c :d “abaccda” can be coded as ‘ ’
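The telegram example can be sketched in Python as follows. The slide does not show the code for d or the encoded result, so the fixed-length table below assumes d is 11 (the only two-bit code left), and `huffman_codes` is a textbook Huffman construction rather than the presenter's own tree; both are illustrative assumptions.

```python
import heapq
from collections import Counter

# Fixed-length table; the entry for "d" is an assumption (not shown on the slide).
FIXED = {"a": "00", "b": "01", "c": "10", "d": "11"}

def encode(text, table):
    """Concatenate the code word of every symbol."""
    return "".join(table[ch] for ch in text)

def huffman_codes(text):
    """Build a prefix code from symbol frequencies (classic Huffman construction)."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate input with a single distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

fixed_bits = encode("abaccda", FIXED)
huffman_bits = encode("abaccda", huffman_codes("abaccda"))
print(len(fixed_bits), len(huffman_bits))
```

Because a occurs three times out of seven, the Huffman construction gives it a one-bit code, so the variable-length encoding is shorter than the fixed two-bit encoding.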

Run-length encoding
To encode the data, RLE transforms a run of identical values into a (count, value) pair.
Input={ 1,1,1,1,1,1 }; Output={ 6,1 }
Input={ 6,1,0,1,1,1,1,1,1 }; Output={ 6,1,0,6,1 }
In the second output, what does 6,1 mean: a literal 6 followed by a literal 1, or six 1s? To resolve this we need a control code (the least frequently occurring code can be used).
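One way to resolve the ambiguity is sketched below. The escape value 255, the threshold of 4, and the triple layout (control code, count, value) are assumptions chosen for illustration; the slide only says that a control code is needed.

```python
from itertools import groupby

def rle_encode(data, esc=255, min_run=4):
    """Emit (esc, count, value) for long runs; copy short runs literally."""
    out = []
    for value, group in groupby(data):
        n = len(list(group))
        if n >= min_run or value == esc:
            # A literal esc is always escaped so the decoder never misreads it.
            out.extend([esc, n, value])
        else:
            out.extend([value] * n)
    return out

def rle_decode(encoded, esc=255):
    """Invert rle_encode."""
    out, i = [], 0
    while i < len(encoded):
        if encoded[i] == esc:
            count, value = encoded[i + 1], encoded[i + 2]
            out.extend([value] * count)
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return out

packed = rle_encode([6, 1, 0, 1, 1, 1, 1, 1, 1])
print(packed)
```

With the control code in place, the leading 6, 1 are unambiguously literals and only the triple after the control code denotes a run.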

Lempel-Ziv (LZ77)
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the input (uncompressed) data stream.
Input=“the brown fox jumped over the brown foxy jumping frog”
The result is
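The back-reference idea can be sketched with a naive encoder that emits (offset, length, next character) triples. The triple format, the window size, and the brute-force match search are assumptions for illustration; this is not the output format the slide had in mind, and real LZ77 implementations search far more efficiently.

```python
def lz77_encode(text, window=255):
    """Greedy LZ77: emit (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(text):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Stop one character early so a literal next_char always exists.
            while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    """Rebuild the text by copying from earlier output, one character at a time."""
    out = []
    for offset, length, ch in triples:
        start = len(out) - offset
        for k in range(length):
            out.append(out[start + k])  # copying char by char allows overlapping matches
        out.append(ch)
    return "".join(out)

sentence = "the brown fox jumped over the brown foxy jumping frog"
triples = lz77_encode(sentence)
print(len(sentence), len(triples))
```

The second occurrence of "the brown fox" is covered by references into the earlier text, so the number of triples is smaller than the number of input characters.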

Outline
Burrows-Wheeler Transformation
Procedures of BWT
Compression based on BWT
What is FM-index
FM-index and compression based on BWT
Experiment results
Conclusion

Burrows-Wheeler Transformation
The Burrows-Wheeler Transformation (BWT) was first proposed by Burrows and Wheeler in 1994.
Properties:
BWT compression works on data blocks, not on a data stream.
BWT itself does not compress data; it permutes the data to make it more compressible.
BWT compression achieves good results at a competitive time cost.

Procedures of BWT
Three steps, for S=abraca:
① Cyclically shift the data block of length N.
② Sort the results of step ① to get matrix M.
③ Output the last column L and the index of the original string in M.

Cyclic shifts:
index  result
0      abraca
1      bracaa
2      racaab
3      acaabr
4      caabra
5      aabrac

Sorted (matrix M):
index  result
0      aabrac
1      abraca
2      acaabr
3      bracaa
4      caabra
5      racaab

Output: L=caraab, Index=1
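The three steps can be sketched directly in Python. This version materializes every rotation, which is fine for a slide-sized example but quadratic in memory; practical implementations use suffix sorting instead. The function name `bwt_forward` is only illustrative.

```python
def bwt_forward(s):
    """Return (last column L, index of the original string in the sorted matrix M)."""
    n = len(s)
    rotations = [s[i:] + s[:i] for i in range(n)]  # step 1: cyclic shifts
    m = sorted(rotations)                          # step 2: sort to get M
    last = "".join(row[-1] for row in m)           # step 3: last column
    return last, m.index(s)

print(bwt_forward("abraca"))
```

For S=abraca this yields L=caraab and Index=1, matching the slide.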

Reversible Procedures of BWT
It is critical to understand the following two properties of matrix M:
1. For the i-th row of M, the last character L[i] precedes the first character F[i] in the original string S, namely …L[i]F[i]…, where F is the first column of M and can be obtained by sorting L.
2. Last-to-First mapping (LF-mapping). Let L[i]=c and let ri be the number of occurrences of c in the prefix L[0,i-1]. Let M[j] be the ri-th row of M starting with c (counting from 0). Then the character F[j] in the first column corresponds to L[i] in the last column. Set the LF-mapping array LF[i]=j, meaning that F[j] and L[i] are the same character in the original string.

Put simply, the occurrences of any character keep the same relative order in F and L. Let us see an example with S’=abraca$:

index  sorted rotation
0      $abraca
1      a$abrac
2      abraca$
3      aca$abr
4      braca$a
5      ca$abra
6      raca$ab

The three rows starting with a (rows 1, 2, 3) are in order, so their remainders $abrac, braca$, ca$abr are in order. The three rows ending with a (rows 0, 4, 5) begin with the same three strings $abrac, braca$, ca$abr, so they are still in the same order.

How to compute the LF array?
Occ(i) means the number of occurrences of L[i] in the prefix L[0…i-1].
C[c] means the number of characters in L that have a lower order than c.
LF[i]=C[L[i]]+Occ(i). For example, LF[5]=C[a]+Occ(5)=1+2=3, so L[5] corresponds to F[3], the first character of M[3].
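A small sketch of the LF computation, using the slide's definitions of C and Occ. The function name `lf_mapping` and the use of `Counter` are illustrative choices, and the sentinel $ is assumed to sort before every letter (true for ASCII).

```python
from collections import Counter

def lf_mapping(L):
    """Return (C, LF) where LF[i] = C[L[i]] + Occ(i)."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):  # C[c]: number of characters smaller than c
        C[c] = total
        total += counts[c]
    seen = Counter()          # seen[c]: occurrences of c in L[0..i-1]
    lf = []
    for ch in L:
        lf.append(C[ch] + seen[ch])
        seen[ch] += 1
    return C, lf

C, LF = lf_mapping("ac$raab")  # last column of the sorted rotations of abraca$
print(C, LF)
```

Here LF[5] evaluates to 3, the same value as the worked example LF[5]=C[a]+Occ(5)=1+2=3.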

Reversible Procedures of BWT
Algorithm BWT_reverse(L)
  i=0;  //M[0]=$S
  for j=N-1 to 0
    S[j]=L[i];
    i=Occ[i]+C[L[i]];  //compute the LF-mapping array

Example: S’=abraca$, L=ac$raab, I=0
i=0: S[5]=a  (recovered so far: a)
i=1: S[4]=c  (ca)
i=5: S[3]=a  (aca)
i=3: S[2]=r  (raca)
i=6: S[1]=b  (braca)
i=4: S[0]=a  (abraca)
i=2: L[2]=$. End! The procedure ends on meeting $.
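The reverse procedure can be sketched as a loop that follows the LF-mapping from row 0 until the sentinel is met. This assumes the input was transformed with a unique sentinel $ appended that sorts before every other character, so that row 0 of M is $S; the name `bwt_inverse` is illustrative.

```python
from collections import Counter

def bwt_inverse(L, sentinel="$"):
    """Recover S from the last column L of the sorted rotations of S + sentinel."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    seen, lf = Counter(), []
    for ch in L:
        lf.append(C[ch] + seen[ch])  # LF[i] = C[L[i]] + Occ(i)
        seen[ch] += 1

    out, i = [], 0                   # row 0 is sentinel + S, so L[0] is the last char of S
    while L[i] != sentinel:
        out.append(L[i])             # characters come out from the end of S to the start
        i = lf[i]
    return "".join(reversed(out))

print(bwt_inverse("ac$raab"))
```

The loop visits rows 0, 1, 5, 3, 6, 4 and stops at row 2, emitting a, c, a, r, b, a, which reversed gives abraca.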

Compression based on BWT
BWT itself does not compress data, so compression based on BWT combines the BWT with existing compression techniques:
Input data → BWT → GST → RLE → EC → Output data
GST: Move-to-Front (MTF)
RLE: Run-Length Encoding
EC: Huffman coding / entropy coding
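The Move-to-Front stage can be sketched as below. Passing the alphabet explicitly as a short string is an assumption for readability; byte-oriented compressors typically start from the full table of 256 byte values.

```python
def mtf_encode(text, alphabet):
    """Replace each symbol by its current index in a self-organizing list."""
    table = list(alphabet)
    out = []
    for ch in text:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))  # move the symbol to the front
    return out

def mtf_decode(indices, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    table = list(alphabet)
    out = []
    for idx in indices:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)

print(mtf_encode("caraab", "abcr"))  # BWT output of abraca
```

Repeated symbols in the BWT output turn into zeros after MTF, which is what makes the following RLE and entropy coding stages effective on longer inputs.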

FM-index based on BWT
What is FM-index?
Full-text, minute-space index. FM-index combines the BWT-based compression algorithm with the suffix array data structure, and achieves efficient random access to the compressed data without decompressing all of it at query time.

How does FM-index work?
An FM-index search consists of two steps:
1) Counting the occurrences of the matching pattern.
2) Locating the occurrences.

Step 1: Counting the occurrences
Algorithm counting(P[0,p-1])
  c=P[p-1], i=p-1;
  sp=C[c], ep=C[c+1]-1;
  while((sp≤ep)&&(i≥1)) do
    c=P[i-1];
    sp=C[c]+Occ(c,sp);
    ep=C[c]+Occ(c,ep+1)-1;
    i=i-1;
  if(ep<sp) return 0; else return ep-sp+1;
C[c] means the number of characters that have a lower order than c; C[c+1] is the same count for the character that follows c in alphabet order.
Occ(c,i) means the number of occurrences of c in the prefix L[0…i-1].
Once P[i] has been processed, sp points to the first row of M prefixed by P[i, p-1] and ep points to the last such row.

Example: S’=abraca$, L=ac$raab, P=aca
index  row of M
0      $abraca
1      a$abrac
2      abraca$
3      aca$abr
4      braca$a
5      ca$abra
6      raca$ab

P[2]=a: sp=C[a]=1, ep=C[b]-1=3
P[1]=c: sp=C[c]+Occ(c,1)=5+0=5, ep=C[c]+Occ(c,4)-1=5+1-1=5
P[0]=a: sp=C[a]+Occ(a,5)=1+2=3, ep=C[a]+Occ(a,6)-1=1+3-1=3
Result: ep-sp+1=1 occurrence, in row M[3]=aca$abr.

Step 2: Locating the occurrences
M[3] is not a marked row, so follow the LF-mapping: LF(3)=6. M[6] is a marked row with pos(6)=2. Set pos(3)=pos(6)+1=3.
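Both steps can be sketched in a few lines of Python. This sketch uses the convention that Occ(c,i) counts c in L[0…i-1], so the upper bound is updated as C[c]+Occ(c,ep+1)-1. The names `build_fm`, `fm_range`, and `locate`, the naive `occ` scan, and the choice to mark every row whose text position is a multiple of `step` are all assumptions for illustration; a real FM-index stores Occ in compressed, sampled form.

```python
def build_fm(text, step=2, sentinel="$"):
    """Return (L, C, marked) for text + sentinel; marked maps row -> text position."""
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
    L = "".join(s[i - 1] for i in sa)                # BWT: character before each suffix
    C, total = {}, 0
    for c in sorted(set(s)):
        C[c] = total
        total += s.count(c)
    marked = {row: pos for row, pos in enumerate(sa) if pos % step == 0}
    return L, C, marked

def occ(L, c, i):
    """Occurrences of c in L[0..i-1] (naive scan, for clarity only)."""
    return L[:i].count(c)

def fm_range(pattern, L, C):
    """Backward search: rows sp..ep of M are prefixed by pattern (empty if sp > ep)."""
    sp, ep = 0, len(L) - 1
    for c in reversed(pattern):
        if c not in C:
            return 0, -1
        sp, ep = C[c] + occ(L, c, sp), C[c] + occ(L, c, ep + 1) - 1
        if sp > ep:
            break
    return sp, ep

def locate(row, L, C, marked):
    """Walk the LF-mapping until a marked row is reached, counting the steps."""
    steps = 0
    while row not in marked:
        c = L[row]
        row = C[c] + occ(L, c, row)  # LF(row)
        steps += 1
    return marked[row] + steps

L, C, marked = build_fm("abraca")
sp, ep = fm_range("aca", L, C)
print(sp, ep, [locate(r, L, C, marked) for r in range(sp, ep + 1)])
```

With `step=2`, row 6 is marked with position 2, so locating row 3 takes one LF step and returns 3, the same walk as in the slide's example. Position 0 is always marked, which guarantees that the walk terminates.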

Relationship
Compression algorithm based on BWT: Input data → BWT → GST → RLE0 → EC → Output data
FM-index based on BWT: the BWT result is partitioned into buckets and stored together with auxiliary information.

Experiment results
We compare several widely used tools: gzip (v1.2.4), szip (v1.12a), bzip2 (v1.0.6), and bicom (v1.01). bzip2 and szip are based on BWT, gzip is based on LZ77, and bicom is based on PPM. The results are shown below.

File          File size   Bicom  Szip  Bzip2  gzip
Large.txt     4,047,392   1.69   1.63  1.67   2.35
E.Coli        4,638,690   2.12   2.02  2.16   2.31
World192.txt  2,473,400   1.44   1.60  1.58   2.34

Experiment results
Columns: Read length, Program, CPU time, Peak memory (megabytes), Speed-up, Reads aligned. Only the program names and CPU-time fragments survive in this transcript.

Read length  Program      CPU time
36 bp        Bowtie       m15s
             Maq          h52m26s
             Bowtie -v 2  4m55s
             SOAP         h44m3s
50 bp        Bowtie       m11s
             Maq          h39m56s
             Bowtie -v 2  5m32s
             SOAP         h42m4s
76 bp        Bowtie       m58s
             Maq          h45m7s
             Bowtie -v    m35s
             SOAP         not supported

Conclusion BWT is a data transformation method.
Both the compression techniques and the FM-index based on BWT achieve good results at low time cost. The dynamic FM-index will be an interesting topic.

Thanks The end! Thanks!
