1 Jadavpur University. Presentation on: Data Compression Using the Burrows-Wheeler Transform. Presented by: Suvendu Rup

2 10/17/2015 Introduction: What is Data Compression?
- Data compression is often referred to as coding, where coding is a general term encompassing any special representation of data that satisfies a given need.
- Data compression may be viewed as a branch of information theory in which the primary objective is to minimize the amount of data to be transmitted.
- Data compression has important applications in the areas of data transmission and data storage.
- Compressing data to be stored or transmitted reduces storage and communication costs.
Types of Data Compression:
- Lossless data compression
- Lossy data compression

3 Lossless Data Compression:
- Lossless data compression is a class of data compression that allows the exact original data to be reconstructed from the compressed data. Most lossless compressors combine two kinds of algorithm:
- Statistical modeling: Burrows-Wheeler Transform, LZ77
- Encoding algorithms: Huffman coding, arithmetic coding
Lossy Data Compression:
- A lossy data compression method is one where compressing data and then decompressing it retrieves data that may differ from the original but is close enough to be useful in some way.

4 Advantages of Data Compression:
- More disk space.
- Faster file upload and download.
- More file storage options.
Disadvantages of Data Compression:
- Added complication.
- Effect of errors in transmission.
- Slower for sophisticated methods.
- Need to decompress previously compressed data.

5 DATA COMPRESSION USING THE BURROWS-WHEELER TRANSFORM: Why BWT?
Consider the patterns:
A B B A B A
A A A B B B
The second pattern is more promising: frequency alone is not the important issue; the context of a symbol also matters. The more regular the structure, the better the compression.
- Burrows and Wheeler devised the transformation in 1983 and later observed that it is quite suitable for data compression.
- The Burrows-Wheeler Transform transforms a block of data into a format that is well suited for compression.

6 DATA COMPRESSION USING THE BURROWS-WHEELER TRANSFORM (CONTD.)
Burrows-Wheeler compression is a relatively new approach to lossless compression, first presented by Burrows and Wheeler in 1994.
Steps for compression: Source text → Forward BWT → Move-to-front encoding → Huffman compression → Compressed file
Steps for decompression: Compressed file → Huffman decompression → Reverse move-to-front → Reverse BWT → Original source text

7 Algorithm for the forward transformation:
1. [Sort rotations] Form an N x N matrix M whose elements are characters and whose rows are the rotations of S, sorted in lexicographic order. At least one of the rows of M contains the original string S; let I be the index of the first such row, numbering from 0.
2. [Find last characters of rotations] Let the string L be the last column of M.

8 The Burrows-Wheeler forward transformation:
1) Write the input as the first row of a matrix, one symbol per column.
2) Form all cyclic permutations of that row and write them as the other rows of the matrix.
3) Sort the matrix rows according to the lexicographic order of the elements of the rows.
4) Take as output the final column of the sorted matrix, together with the number of the row that corresponds to the original input.
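The four steps above can be sketched in Python. This is a naive version that materializes every rotation and sorts them (fine for illustration; practical implementations use suffix arrays to avoid the quadratic memory cost):

```python
def bwt_forward(s):
    """Naive forward BWT: sort all cyclic rotations of s, return the
    last column L and the index I of the row holding the original."""
    n = len(s)
    rotations = [s[i:] + s[:i] for i in range(n)]  # all cyclic rotations
    rotations.sort()                               # lexicographic order
    last_column = "".join(row[-1] for row in rotations)
    index = rotations.index(s)                     # row of the original string
    return last_column, index

print(bwt_forward("DRDOBBS"))  # the slide's example: ('OBRSDDB', 3)
```

Running it on the later example `this$is$the` likewise returns `('sshtth$ii$e', 10)`, matching slide 16.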

9 Example: Let's encode the sequence D R D O B B S. We start with all the cyclic permutations of this sequence. As there are a total of 7 characters, there are 7 permutations. Now let's sort these sequences in lexicographic (dictionary) order. The sequence L in this case is L: O B R S D D B

10 Cyclic permutations of D R D O B B S:
0 DRDOBBS
1 RDOBBSD
2 DOBBSDR
3 OBBSDRD
4 BBSDRDO
5 BSDRDOB
6 SDRDOBB

11 Sequences sorted into lexicographic order:
0 BBSDRDO
1 BSDRDOB
2 DOBBSDR
3 DRDOBBS
4 OBBSDRD
5 RDOBBSD
6 SDRDOBB
The original sequence appears as sequence number 3 in the sorted list. We tag the first and last columns F and L.

12 REVERSE BWT
We can decode the original sequence by using the sequence L and the index I of the original sequence in the sorted list. The sequence F is simply the sequence L in lexicographic order. In the example, F: B B D D O R S.
Let's call the sorted array A and the cyclically shifted array A_s:
   A        A_s
0  BBSDRDO  OBBSDRD
1  BSDRDOB  BBSDRDO
2  DOBBSDR  RDOBBSD
3  DRDOBBS  SDRDOBB
4  OBBSDRD  DOBBSDR
5  RDOBBSD  DRDOBBS
6  SDRDOBB  BSDRDOB

13 REVERSE BWT (contd.)
The first elements of the lines of A form the sequence F, while the first elements of the lines of A_s form the sequence L. For example, row 0 of A_s corresponds to row 4 of A. Let's store this information in the array T: row T[j] of A is the same as row j of A_s. Thus T[0] = 4 and T = {4 0 5 6 2 3 1}.
We define two operators F and L, where F[j] is the first element in the jth row of A and L[j] is the last element in the jth row of A. Since row T[j] of A is the same as row j of A_s, F[T[j]] = L[j].
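T can be computed from L alone, without building A and A_s: equal characters keep their relative order between L and F = sorted(L), so the k-th occurrence of a character in L maps to the k-th occurrence of the same character in F. A sketch:

```python
from collections import defaultdict

def compute_T(L):
    """Build T from the last column L alone: the k-th occurrence of
    each character in L maps to the k-th occurrence of that character
    in F = sorted(L)."""
    F = sorted(L)
    positions = defaultdict(list)   # where each character sits in F
    for i, c in enumerate(F):
        positions[c].append(i)
    seen = defaultdict(int)         # occurrences of c consumed so far
    T = []
    for c in L:
        T.append(positions[c][seen[c]])
        seen[c] += 1
    return T

print(compute_T("OBRSDDB"))  # the slide's example: [4, 0, 5, 6, 2, 3, 1]
```

For the later example `sshtth$ii$e` the same function returns T = [7, 8, 3, 9, 10, 4, 0, 5, 6, 1, 2], matching slide 27.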

14 Move-to-front coding
A coding scheme that takes advantage of long runs of identical symbols is move-to-front (MTF) coding. We start with some initial listing of the source alphabet. The symbol at the top of the list is assigned the number 0, the next one is assigned the number 1, and so on.

15 Example: Let's encode L = O B R S D D B. Assume the source alphabet is A = {B, D, O, R, S}. We start with the assignment:
0 1 2 3 4
B D O R S
The first element of L is O, which is encoded as 2. We then move O to the top of the list, which gives us:
0 1 2 3 4
O B D R S
The next letter, B, is encoded as 1 and moved to the top of the list:
0 1 2 3 4
B O D R S
The next letter is R, which is encoded as 3. Moving R to the top of the list, we get:
0 1 2 3 4
R B O D S
The next letter is S, so that is encoded as 4 and moved to the front of the list. Continuing in this fashion, we get the sequence 2 1 3 4 4 0 3.
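The procedure illustrated above can be sketched as:

```python
def mtf_encode(sequence, alphabet):
    """Move-to-front: emit each symbol's current list position,
    then move that symbol to the front of the list."""
    table = list(alphabet)
    out = []
    for symbol in sequence:
        rank = table.index(symbol)
        out.append(rank)
        table.pop(rank)
        table.insert(0, symbol)  # move to front
    return out

print(mtf_encode("OBRSDDB", "BDORS"))  # [2, 1, 3, 4, 4, 0, 3]
```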

16 Implementation: The implementation performs a forward BWT on an input file stream and sends the result to an output file. The input file is New.txt, and the corresponding output files are p.txt and t.txt for the cyclic shifts and the lexicographically sorted sequences. Let's encode the small sequence stored in New.txt: this$is$the. We start with all the cyclic permutations of this sequence. As there are a total of 11 characters, there are 11 permutations. Now let's sort these sequences in lexicographic (dictionary) order. The sequence L in this case is L: sshtth$ii$e

17 Permutations of this$is$the:
0 this$is$the
1 his$is$thet
2 is$is$theth
3 s$is$thethi
4 $is$thethis
5 is$thethis$
6 s$thethis$i
7 $thethis$is
8 thethis$is$
9 hethis$is$t
10 ethis$is$th

18 Sequences sorted into lexicographic order:
0 $is$thethis
1 $thethis$is
2 ethis$is$th
3 hethis$is$t
4 his$is$thet
5 is$is$theth
6 is$thethis$
7 s$is$thethi
8 s$thethis$i
9 thethis$is$
10 this$is$the
The original sequence appears as sequence number 10 in the sorted list. We tag the first and last columns F and L.

19 Move-to-Front Coding:
- This scheme performs a move-to-front encoding function on an input file stream New.txt and sends the result to an output file move.txt.
- An MTF encoder encodes each character using the count of distinct characters seen since that character's last appearance.
- Each new input character is encoded with its current position in the array. The character is then moved to position 0 in the array, and all the higher-order characters are moved down by one position to make room.
- Both the encoder and the decoder have to start with the order array initialized to the same values.
- This scheme takes two arguments: an input file and an output file.

20 Example: Let's encode L = sshtth$ii$e. Assume the source alphabet is A = {$, e, h, i, s, t}. We start with the assignment:
0 1 2 3 4 5
$ e h i s t
The first element of L is s, which is encoded as 4. We then move s to the top of the list, which gives us:
0 1 2 3 4 5
s $ e h i t
The next s is encoded as 0. Because s is already at the top of the list, we do not need to make any changes. The next letter is h, which we encode as 3. We then move h to the top of the list:
0 1 2 3 4 5
h s $ e i t
The next letter is t, which is encoded as 5. Moving t to the top of the list, we get:
0 1 2 3 4 5
t h s $ e i
The next letter is also t, so that is encoded as 0. Continuing in this fashion, we get the sequence 4 0 3 5 0 1 3 5 0 1 5.

21 Huffman Compression: This method compresses and decompresses files. The basic compression idea is:
1) Count the occurrences of each character.
2) Sort by occurrence, highest first.
3) Build the Huffman tree. Characters with higher probabilities end up near the top of the tree, and the others nearer the bottom.
After constructing the tree, label each left branch 0 and each right branch 1.

22 COMPRESSION USING HUFFMAN CODING
Design a Huffman code for the sequence 4 0 3 5 0 1 3 5 0 1 5.
The source alphabet is {0, 1, 3, 4, 5}. The probabilities of occurrence are:
P(0) = 3/11, P(1) = 2/11, P(3) = 2/11, P(4) = 1/11, P(5) = 3/11

23 CONSTRUCTION OF THE HUFFMAN TREE
The symbols are merged pairwise into internal nodes of weight 3/11 ({4, 1}), 5/11 ({3, 4, 1}), 8/11 ({0, 3, 4, 1}) and finally 1 (the root), with 0 labelling left branches and 1 labelling right branches. Reading the branch labels from the root down gives the code table:
5 → 1, 0 → 01, 3 → 001, 1 → 0000, 4 → 0001

24 HUFFMAN COMPRESSION
The code words for the symbols 1, 4, 3, 0, and 5 are 0000, 0001, 001, 01, and 1 respectively.
HUFFMAN DECOMPRESSION
From the compressed form 0001 01 001 1 01 0000 001 1 01 0000 1 we get back the original sequence 4 0 3 5 0 1 3 5 0 1 5.
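A minimal Huffman coder for this example might look like the sketch below. Note one caveat: a strict Huffman construction always merges the two lowest-weight nodes, which encodes this sequence in 25 bits, slightly fewer than the 27 bits of the code table above (whose third merge combines the 3/11 node with the 5/11 node rather than the other 3/11 node). Tie-breaking also means individual code words may differ from the slides':

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code by repeatedly merging the two
    lowest-frequency nodes. Each heap entry carries a unique
    counter so ties never fall through to comparing dicts."""
    freq = Counter(symbols)
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}   # left branch: 0
        merged.update({s: "1" + c for s, c in right.items()})  # right: 1
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_decode(bits, codes):
    """Walk the bit string, matching prefix codes back to symbols."""
    inverse = {c: s for s, c in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out

seq = [4, 0, 3, 5, 0, 1, 3, 5, 0, 1, 5]
codes = huffman_codes(seq)
encoded = "".join(codes[s] for s in seq)  # 25 bits for this sequence
```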

25 REVERSE MOVE-TO-FRONT CODING
By applying this technique we get back the encoded sequence, that is, L. From the sequence 4 0 3 5 0 1 3 5 0 1 5 we get back sshtth$ii$e, where A = {$, e, h, i, s, t}.
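Reverse MTF mirrors the encoder exactly: each received rank indexes the current list, and the recovered symbol is then moved to the front:

```python
def mtf_decode(ranks, alphabet):
    """Inverse move-to-front: each rank indexes the current list;
    the recovered symbol is moved to the front, as in the encoder."""
    table = list(alphabet)
    out = []
    for rank in ranks:
        symbol = table.pop(rank)
        out.append(symbol)
        table.insert(0, symbol)
    return "".join(out)

print(mtf_decode([4, 0, 3, 5, 0, 1, 3, 5, 0, 1, 5], "$ehist"))  # sshtth$ii$e
```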

26 REVERSE BWT
We can decode the original sequence by using the sequence L and the index I of the original sequence in the sorted list. The sequence F is simply the sequence L in lexicographic order. In the example, F: $$ehhiisstt.
Let's call the sorted array A and the cyclically shifted array A_s:
    A            A_s
0   $is$thethis  s$is$thethi
1   $thethis$is  s$thethis$i
2   ethis$is$th  hethis$is$t
3   hethis$is$t  thethis$is$
4   his$is$thet  this$is$the
5   is$is$theth  his$is$thet
6   is$thethis$  $is$thethis
7   s$is$thethi  is$is$theth
8   s$thethis$i  is$thethis$
9   thethis$is$  $thethis$is
10  this$is$the  ethis$is$th

27 REVERSE BWT (CONTD.):
The first elements of the lines of A form the sequence F, while the first elements of the lines of A_s form the sequence L. For example, row 0 of A_s corresponds to row 7 of A. Let's store this information in the array T: row T[j] of A is the same as row j of A_s. Thus T[0] = 7 and T = {7 8 3 9 10 4 0 5 6 1 2}.
We define two operators F and L, where F[j] is the first element in the jth row of A and L[j] is the last element in the jth row of A. Since row T[j] of A is the same as row j of A_s, F[T[j]] = L[j].

28 ALGORITHM
For a sequence of length N we can write the decoding procedure as follows, where D contains the decoded sequence:
D[N] = L[I]
k ← I
for j = 1 to N-1
{
  k ← T[k]
  D[N-j] = L[k]
}
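The procedure above can be sketched in Python (0-indexed), with T built from L by the occurrence-matching rule described on the earlier slides:

```python
from collections import defaultdict

def bwt_inverse(L, I):
    """Reverse BWT: build T by matching the k-th occurrence of each
    character in L to the k-th occurrence in F = sorted(L), then
    start at row I and walk T backwards through the output."""
    F = sorted(L)
    pos, seen = defaultdict(list), defaultdict(int)
    for i, c in enumerate(F):
        pos[c].append(i)
    T = []
    for c in L:
        T.append(pos[c][seen[c]])
        seen[c] += 1
    # D[N] = L[I]; then repeatedly k <- T[k], D[N-j] = L[k]
    N = len(L)
    D = [""] * N
    k = I
    D[N - 1] = L[I]
    for j in range(1, N):
        k = T[k]
        D[N - 1 - j] = L[k]
    return "".join(D)

print(bwt_inverse("sshtth$ii$e", 10))  # this$is$the
```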

29 Explanation: Let us use the decoding algorithm to recover the original sequence from the sequence L of the previous example, using T = [7 8 3 9 10 4 0 5 6 1 2]. Note that L = [s s h t t h $ i i $ e], k = I = 10, and N = 11. We start with the last symbol:
D[11] = L[10] = e
Updating k: k ← T[10] = 2. Then D[10] = L[2] = h. Continuing in this fashion:
k ← T[2] = 3, D[9] = L[3] = t
k ← T[3] = 9, D[8] = L[9] = $
k ← T[9] = 1, D[7] = L[1] = s
k ← T[1] = 8, D[6] = L[8] = i
k ← T[8] = 6, D[5] = L[6] = $
k ← T[6] = 0, D[4] = L[0] = s
k ← T[0] = 7, D[3] = L[7] = i
k ← T[7] = 5, D[2] = L[5] = h
k ← T[5] = 4, D[1] = L[4] = t
And we have decoded the entire sequence: this$is$the.

30 Result:
- To compress files using the BWT followed by Huffman compression, I have taken 7 files from the Calgary corpus.
- To compress a file, the command is: encode
- To decompress: encode /d

31 Compression results:
File name  Size (KB) / compressed size  Compress time (ms)  Decompress time (ms)
thelp      13640                        219                 109
file       501224                       656                 125
geo        631422                       875                 154
local      12810985                     641                 250
aa         864451                       172                 125
thesis     87258767                     4625                969
book       88059767                     5234                906

32 Second step of BWT by an alternative to Move-to-Front:
- Much research has focused on refining MTF, most of it based on controlling when to move a symbol to the front of the stack. Instead of attempting to improve the MTF algorithm itself, this work examines two relatively simple model-based methods.
- To employ an efficient alternative method, deterministic measures can be used: the complexity, information, and entropy associated with a string will be referred to as T-complexity, T-information, and T-entropy.
- The T-complexity of a string is represented by the number of T-augmentation steps required to build it.
- The T-information of a string s is its deterministic information content, referred to as I_det(s). It is calculated as the inverse logarithmic integral of the T-complexity: I_det(s) = li^-1(C_det(s)).
- T-entropy is represented as T_E = ΔT_i / Δl. It is the rate of change of the T-information T_i along a string of length l, and will be used to measure the average T-entropy over a file.

33 Dual modeling of the Move-to-Front data stream:
- Conventionally, the entire MTF input stream is represented in one model.
- One method, suggested by Fenwick, relies on a cache system where the most probable symbols are stored in a prominent foreground model and the bulk of the remaining symbols are stored in a larger background model.
- In a dual-model representation of the MTF data, successful compression is assured by providing the decoder with a means of knowing when to switch between models.
- If a background symbol is encountered, the encoder emits a special ESCAPE symbol from the foreground model, informing the decoder to switch models, before then encoding the symbol in the background model.
- Balkenhol has suggested a similar approach, encoded in the original ASCII alphabet instead of the conventional MTF alphabet.
- Further enhancements are made to the cache provisions to ensure a symbol is only moved to the very front when it is assigned zero.

34 Further Work: Image compression:
- The BWT can be used for waveform coding, and the wavelet transform can be applied before the BWT to improve compression performance for many classes of signals.
Waveform coding using the BWT:
- Many natural waveforms contain repetitions. After uniform sampling and quantization, these repetitions often remain in the digitized sequences, which can then be compressed using the BWT.
- This scheme will not work well for many real-world signals, because almost all real-world signals are noisy, and noise destroys the exact repetition the BWT requires.
Image compression using the DWT and the BWT:
- The wavelet transform is a powerful analysis tool and has been used successfully in image compression.

35 Baseline Algorithm:
The baseline image compression algorithm using the DWT and BWT works as follows:
- The images are first wavelet transformed and quantized.
- The next step is to convert the 2D image into a 1D sequence by a zigzag scan.
- Next, the BWT and MTF are performed on the sequence.
- Finally, Huffman coding compresses the result.
Block diagram of baseline image compression:
Wavelet Transform → Zigzag → BWT → MTF → Huffman Coding
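The zigzag step can be sketched as follows. This assumes the common JPEG-style anti-diagonal scan order; the slides do not specify which zigzag variant is used, so treat this as one plausible choice:

```python
def zigzag(block):
    """Flatten a 2-D array into 1-D by walking anti-diagonals,
    alternating direction on each diagonal (JPEG-style order)."""
    n, m = len(block), len(block[0])
    out = []
    for d in range(n + m - 1):
        # all cells (i, j) on the diagonal i + j == d
        cells = [(i, d - i) for i in range(n) if 0 <= d - i < m]
        if d % 2 == 0:
            cells.reverse()  # even diagonals run bottom-left to top-right
        out.extend(block[i][j] for i, j in cells)
    return out

print(zigzag([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # [1, 2, 4, 7, 5, 3, 6, 8, 9]
```

The 1-D sequence this produces is what the BWT and MTF stages then operate on.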

36 Multiprocessor Approach to Data Compression:
- Growth in computer network bandwidth has created the need for increased speed in compressing data and transferring the compressed data over telecommunication lines.
- Single-processor dictionary compression techniques include LZ77 and LZ78.
- An alternative to single-processor techniques, however, is the use of modern multiprocessor computers.
- With the recent growth in processor speeds and the increased availability of multiprocessor computers, software compression solutions may be fast enough for many applications.
- The parallel process involves arranging processors in a pipeline fashion and passing data between them.
- In this approach, the data being compressed is treated as symbols or symbol strings.
- Symbols originate from the uncompressed data set and are 8 bits in length, with 256 possible values.
- A code word is a bit string that is substituted for a symbol or a symbol string.

37 Parallel Processor Dictionary Compression Technique:
- James Storer has been a pioneer in the use of parallel processors to implement data compression within very-large-scale integration (VLSI).
- His design chained together up to 3840 of the processors in a pipeline. The design was fabricated into a VLSI chip and was able to process 160 million bits per second.
- Each pipeline processor contains Buf A and Buf B (data-stream buffers), Stor A and Stor B (library storage), and a processor index, with data flowing through from input to output.

38 Parallel Processor Dictionary Compression Technique (contd.):
1) Buf A and Buf B operate as a data stream for the symbols passing through.
2) Stor A and Stor B operate as library locations, each storing a single symbol or code word.
3) The processor index functions as the code word that is associated with the library contents.
4) The processor embodies logic to compare the stored library contents to the data stream.
- Data steps through each processor one symbol at a time, right to left.
- The library is built when a special signal called the "leader" is sent to a processor to store the contents of Buf A and Buf B into Stor A and Stor B.
- The data symbols in Buf A and Buf B are compared with the library contents in Stor A and Stor B.
- If a match occurs between the buffers and the storage locations, the processor index is substituted directly into the data stream for both the symbols currently in Buf A and Buf B.
- This replaces two 8-bit symbols with one 12-bit code word.
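As an illustration of the pair-replacement idea only (not of Storer's pipelined hardware), here is a sequential sketch: the first time a symbol pair is seen it is stored in the library, and later occurrences are replaced by a code word. Indices start at 256 so they never collide with the 8-bit symbols, mimicking the 12-bit code-word space. The function name and the sequential scan are illustrative assumptions:

```python
def pair_substitute(symbols, max_codes=4096 - 256):
    """Sequential model of dictionary pair replacement: store each
    new symbol pair in the library (the 'leader' step), and replace
    later occurrences of a stored pair with its library index."""
    library = {}
    out = []
    i = 0
    while i < len(symbols):
        pair = tuple(symbols[i:i + 2])
        if len(pair) == 2 and pair in library:
            out.append(library[pair])  # one 12-bit code for two symbols
            i += 2
        else:
            if len(pair) == 2 and len(library) < max_codes:
                library[pair] = 256 + len(library)  # store pair in library
            out.append(symbols[i])
            i += 1
    return out

print(pair_substitute([65, 66, 65, 66, 65, 66]))  # [65, 66, 256, 256]
```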

39 CONCLUSION
Thirteen years have passed since the introduction of the BWT. In these thirteen years our understanding of several theoretical and practical issues related to the BWT has significantly increased. The BWT has many interesting facets and is going to deeply influence the field of lossless data compression. The biggest drawback of BWT-based algorithms is that they are not online: a large portion of the input data must be processed before a single output bit can be produced. The issue of developing online counterparts of BWT-based compressors has been addressed, but further work is still needed in this direction.

40 BIBLIOGRAPHY:
1. Arnold, R., and Bell, T. 2000. The Canterbury corpus home page.
2. Bentley, J., Sleator, D., Tarjan, R., and Wei, V. 1986. A locally adaptive data compression scheme. Commun.
3. Burrows, M., and Wheeler, D. J. 1994. A block-sorting lossless data compression algorithm.
4. Cleary, J. G., and Teahan, W. J. 1997. Unbounded length contexts for PPM.
5. Cormack, G. V., and Horspool, R. N. S. 1987. Data compression using dynamic Markov modeling.
6. Effros, M. 1999. Universal lossless source coding with the Burrows-Wheeler transform.
7. Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif.
8. Fenwick, P. 1996a. Block sorting text compression - final report.
9. Fenwick, P. 1996b. The Burrows-Wheeler transform for block sorting text compression.
10. Bell, T. C., Cleary, J. G., and Witten, I. H. Text Compression.
11. Donoho, D. L. De-noising by soft-thresholding. Transactions on Information Theory.
12. Balkenhol, B., and Kurtz, S. Universal data compression based on the Burrows-Wheeler transformation.
13. Larsson, N. J. The context trees of block sorting compression.
14. Balkenhol, B. One attempt of a compression algorithm using the BWT.


