# More on Canonical Huffman coding. Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-20062 As we have seen canonical Huffman coding allows.

## Presentation on theme: "More on Canonical Huffman coding. Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-20062 As we have seen canonical Huffman coding allows."— Presentation transcript:

More on Canonical Huffman coding

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-20062 As we have seen canonical Huffman coding allows a faster decompression, and a more efficient use of memory But... until now we’ve supposed that code lengths are given. In order to prove the viability of canonical Huffman we need a way to compute the code lengths

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-20063 Computing code lengths - I The efficiency of computing the code lengths affects only encoding performances, and encoding is less important than decoding However, when we use whole words as source symbols, it is common to have an alphabet composed of many thousand different symbols (words). In this case efficient solutions can make the difference w.r.t. the use of the traditional Huffman tree

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-20064 The problem Input data a file of n integers, where the i-th integer is the number of times symbol i appears in the text if is the i-th integer in the file, the probability of the i-th symbol is... only small changes if we have directly a file with the probabilities Output data Huffman code lengths

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-20065 The key idea We use a heap data structure It is easy to find the smallest value, that is located at the root There is an elegant and efficient solution to store in memory the binary tree using a simple vector

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-20066 A closer view to the heap - I As we have already seen, the left child of a node in position i is in position 2i, while the right child si stored in location 2i+1. This means that the parent of a node in position k is in location What is the depth of the heap if there are n elements? SOL.

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-20067 A closer view to the heap - II How this heap is stored in an array? 211120 8172312 810 2 2 8 10 8 17 23 12 21 11 20

A closer view to the heap - III Removing the smallest item and reorganizing the heap cost, 2 comparisons for each level of the tree 211120 8172312 810 2 8 < 10 20 > 8 8 < 17 20 > 8 11 < 21 20 > 11

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-20069 Construction of the heap It is possible to prove that the cost of constructing a heap from an unsorted list of n items requires about 2n comparisons By the way, how much does it cost, in the worst case, to sort the vector with one of the popular sorting algorithms? SOL. nlogn

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-200610 Computing code lengths - II The algorithm works constructing an initial array with 2n positions the last n positions store the frequencies of the symbols the first half of the array contains a heap in order to efficiently locate the symbol with le lowest frequency As entries are removed from the heap, freed space is used to store branches of the tree

Computing code lengths - III Also frequencies are overwritten to store the pointers that constitute the tree At the end the array contains only a one- element heap and the Huffman tree heap leaves (frequencies) 1n2n heap leaves & tree pointers 1h2n tree pointers h=12n

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-200612 Computing code lengths: phase 1 Frequencies are read from the file and stored in the last n positions of the array (let’s call it A) Each position, points to the corrispondent frequency A[n+i] Then the first half of A is ordered in a heap with the method seen before In practice we must ensure that At the end A[1] stores m1: A[m1]=min{A[n+1...2n]}

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-200613 Computing code lengths: phase 2 h=n while h>1 { m1=A[1] -- take the root of the heap h=h-1 -- position A[h] is no more part of the heap “reorder the heap” m2=A[1] -- now m1,m2 point to the two smallest freq. A[h+1]=A[m1]+A[m2] -- new item is saved in pos. h+1 A[1]=h+1 -- the new element is pushed back in the heap A[m1]=A[m2]=h+1 -- smallest frequencies are discarded and changed into tree pointers “reorder the heap” }

example 457 1hm1m1 457 1h+1m1m1m2m2 97 1m1m1m2m2 97 1 h m1m1m2m2

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-200615 Computing code lengths: phase 3 After n-1 iterations a single aggregate remains in the heap, in position 2 and A[1]=2, as this is the only item in the heap To find the deep in the tree of a particular leaf we can simply start from it and follow the pointers until we reach location 2. The number of pointers followed are the desidered length for { d=0; r=i; -- d is the counter, r is the current element while r>2 d=d+1; r=A[r]; -- follow the pointer to location 2 A[i]=d } -- now A[i] is the length of codeword i

The cost of the algorithm first phase  O(n) second phase  knlogn a heap of n element is reordered about 2n times each reordering takes iterations, each of which has a constant cost third phase  O(n 2 ) in the worst case there is one iteration for each bit of each codeword how many total bits? uniform distribution  n*logn bits...... worst case: i-th symbol requires i bits to be coded

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-200617 Revised third phase - I Note that nodes are added to the tree from position h toward position 2 for this reason all pointers of the tree are “from right to left” Then if we start from position 2 towards position 2n, when we find a node we’ve always already found its parent!

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-200618 Revised third phase - II So a very efficient algorithm (O(n)) to find code lengths is to start from position 2 that has length 0 and then to proceed towards position 2n labeling each position with the length associated to its parent augmented by 1 Then third phase becomes A[2]=0 for i=3 to 2n A[i]=A[A[i]]+1 -- A[A[i]] is the parent of A[i]

Download ppt "More on Canonical Huffman coding. Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-20062 As we have seen canonical Huffman coding allows."

Similar presentations