# DL 2004, Compression 3 – Beeri/Feitelson: Canonical Huffman Trees


Slide 1: Canonical Huffman trees. Goals: a scheme for large alphabets with efficient decoding, efficient coding, and economic use of main memory.

Slide 2: A non-Huffman tree of the same cost. Code 1 (Huffman): lca(e,b) is the internal node 0; Code 2 (canonical): lca(e,b) is the root. Code 2 assigns successive integers, going down from the longest codes:

| symbol | frequency | Code 1 (Huffman) | Code 2 | decimal |
|--------|-----------|------------------|--------|---------|
| a | 10 | 000 | 000 | 0 |
| b | 11 | 001 | 001 | 1 |
| c | 12 | 100 | 010 | 2 |
| d | 13 | 101 | 011 | 3 |
| e | 22 | 01 | 10 | 4 |
| f | 23 | 11 | 11 | 5 |

Slide 3: The tree for Code 2 has leaves a, b, c, d at depth 3 and e, f at depth 2. Lemma: the number of nodes at each level of a Huffman tree is even. Proof: a parent with a single child is impossible (every internal node of a Huffman tree has two children).

Slide 4: General approach.

Slide 5: Canonical Huffman algorithm:
compute the code lengths and the number num(i) of symbols of each length i (as for regular Huffman)
L = max length; first(L) = 0
for i = L-1 downto 1 { first(i) = ceil((first(i+1) + num(i+1)) / 2) (this line is elided on the slide); assign to symbols of length i codes of this length, starting at first(i) }
Q: What happens when there are no symbols of length i?
Q: Does first(L) = 0 < first(L-1) < … < first(1) hold?
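The assignment step above can be sketched in Python. This is my own illustration of the slide's scheme, not its code: the names `canonical_codes`, `num`, `first` are mine, and within each length the symbols are assumed to be ordered alphabetically (which reproduces the table of slide 2).

```python
def canonical_codes(lengths):
    """lengths: dict symbol -> code length (from a regular Huffman run).
    Returns dict symbol -> canonical code string."""
    L = max(lengths.values())
    num = [0] * (L + 2)                 # num[i] = number of symbols of length i
    for l in lengths.values():
        num[l] += 1
    first = [0] * (L + 2)
    first[L] = 0                        # longest codes start at 0
    for i in range(L - 1, 0, -1):
        # ceil((first(i+1) + num(i+1)) / 2), the recurrence from the slide
        first[i] = (first[i + 1] + num[i + 1] + 1) // 2
    codes, next_code = {}, first[:]
    for sym in sorted(lengths, key=lambda s: (lengths[s], s)):
        i = lengths[sym]
        codes[sym] = format(next_code[i], "0{}b".format(i))
        next_code[i] += 1
    return codes
```

For the six-symbol example (lengths 3, 3, 3, 3, 2, 2) this reproduces Code 2 of slide 2: first(3) = 0, first(2) = 2, so a..d get 000..011 and e, f get 10, 11.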

Slide 6: Decoding (assume we start on a new symbol):
i = 1; v = nextbit()   // we have read the first bit
while v < first(i) { v = 2*v + nextbit(); i = i + 1 }
output S(i)[v - first(i)]
(The loop is cut off on the slide; it is reconstructed here from the data structures described on the next slide.)

Slide 7: Data structures for the decoder: the array first(i), and arrays S(i) of the symbols with code length i, ordered by their code (v - first(i) is the index used to get the symbol for code v). Thus decoding uses efficient arithmetic operations plus array look-up – more efficient than storing a tree and traversing pointers. What about coding (for large alphabets, where symbols = words or blocks)? The problem: millions of symbols → a large Huffman tree, …
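A runnable sketch of this decoder, with `first` and `S` hard-coded for the six-symbol example (the names `decode`, `first`, `S` are mine; note first(1) = 2 exceeds any 1-bit value, so no length-1 code is ever accepted):

```python
first = {1: 2, 2: 2, 3: 0}                    # first code value at each length
S = {2: ['e', 'f'], 3: ['a', 'b', 'c', 'd']}  # symbols of each length, in code order

def decode(bits):
    """bits: iterable of 0/1 ints forming complete codes; returns symbols."""
    out, it = [], iter(bits)
    try:
        while True:
            i, v = 1, next(it)        # first bit of the next code
            while v < first[i]:       # not yet a complete code of length i
                v = 2 * v + next(it)
                i += 1
            out.append(S[i][v - first[i]])   # array look-up, no tree traversal
    except StopIteration:
        return out
```

For instance the bit stream 000 10 11 011 decodes to a, e, f, d.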

Slide 8: Construction of canonical Huffman (sketch). Assumption: we have the symbol frequencies. Input: a sequence of (symbol, freq). Output: a sequence of (symbol, length). Idea: use a single array to represent the heap used for creating the tree, and then the resulting tree and the lengths. We illustrate by example.

Slide 9: Example: frequencies 2, 8, 11, 12 (each cell with a frequency also contains a symbol – not shown). The representatives of 2 and 8 (the smallest) leave the heap, the rest percolate. The sum 10 is put into cell 4, and its representative into cell 3; cell 4 is the parent ("sum") of cells 5 and 8. (The slide shows the array contents after each step.)

Slide 10: After one more step the array holds a representation of the Huffman tree. Next, for i = 2 to 8, assign the lengths (the slide shows the array after i = 4).

Slide 11: Summary: insertion of (symbol, freq) into the array – O(n). Creation of the heap – O(n). Creating the tree from the heap: each step is O(log n), the total is O(n log n). Computing the lengths – O(n). Storage requirement: 2n cells (compare to an explicit tree!).
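As a cross-check of the lengths (not the slides' in-place 2n-array construction, just a simpler sketch using Python's `heapq`; the name `code_lengths` is mine):

```python
import heapq, itertools

def code_lengths(freqs):
    """freqs: dict symbol -> frequency. Returns dict symbol -> code length."""
    counter = itertools.count()     # tie-breaker so heap tuples compare cleanly
    heap = [(f, next(counter), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    depth = {s: 0 for s in freqs}
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:     # every leaf under the merged node gets 1 deeper
            depth[s] += 1
        heapq.heappush(heap, (f1 + f2, next(counter), syms1 + syms2))
    return depth
```

On slide 9's frequencies 2, 8, 11, 12 this gives lengths 3, 3, 2, 1, and on slide 2's frequencies it gives the lengths 3, 3, 3, 3, 2, 2 used there.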

Slide 12: The entropy H is a lower bound on compression. How can one still improve? Huffman works for given frequencies, e.g., those of the English language – static modeling. Plus: no need to store the model in coder/decoder. Alternatively, one can construct a frequency table for each file – semi-static modeling. Minus: the model must be stored in the compressed file (negligible for large files), and compression takes more time. Plus: may provide better compression.

Slide 13: A third option: start compressing with default frequencies and, as coding proceeds, update them. After reading a symbol: compress it, then update the frequency table (one may use a sliding window to better reflect local changes). This is adaptive modeling. Decoding must use precisely the same algorithm for updating the frequencies, so it can follow the coding. Plus: the model need not be stored, and compression may adapt to the file, including local changes of frequencies. Minus: less efficient than the previous models.

Slide 14: Adaptive Huffman: reconstructing the Huffman tree after each symbol costs O(n); incremental adaptation in O(log n) is possible. Both are too expensive for practical use with large alphabets. We illustrate adaptivity with arithmetic coding (soon).

Slide 15: Higher-order modeling: use of context. E.g., for each block of 2 letters, construct a frequency table for the next letter (2nd-order compression). This uses conditional probabilities – hence the improvement. It too can be static, semi-static, or adaptive.

Slide 16: Arithmetic coding: can be static, semi-static, or adaptive. Basic idea – coder: start with the interval [0,1); the 1st symbol selects a sub-interval, based on its probability; the i-th symbol selects a sub-interval of the (i-1)-th interval, based on its probability; when the file ends, store a number in the final interval. Decoder: reads the number and reconstructs the sequence of intervals, i.e., of symbols. Important: the length of the file is stored at the beginning of the compressed file (otherwise the decoder does not know when to stop).

Slide 17: Example (static): p(a) = 3/4, p(b) = 1/4. The file to be compressed: aaaba. The sequence of intervals (and the symbols creating them): [0,1), a: [0,3/4), a: [0,9/16), a: [0,27/64), b: [81/256, 108/256), a: [324/1024, 405/1024). Assuming this is the end, we store 5 – the length of the file – and any number in the final interval, say 0.011 (3 binary digits). (After the first 3 a's, one digit would suffice! For a large file, the length field is negligible.)
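The interval sequence can be reproduced exactly with rationals (a minimal sketch; the name `narrow` is mine, and the convention that a takes the lower sub-interval [0, 3/4) matches the slide's numbers):

```python
from fractions import Fraction

def narrow(low, high, sym):
    """Shrink [low, high) to the sub-interval of sym, with p(a) = 3/4."""
    width = high - low
    if sym == 'a':
        return low, low + width * Fraction(3, 4)   # a: lower 3/4
    return low + width * Fraction(3, 4), high      # b: upper 1/4

low, high = Fraction(0), Fraction(1)
for sym in "aaaba":
    low, high = narrow(low, high, sym)
# final interval: [81/256, 405/1024) = [324/1024, 405/1024)
```

The stored number 0.011 in binary is 3/8, which indeed lies in the final interval.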

Slide 18: Why is this a good approach in general? For a symbol with large probability, the number of binary digits needed to represent an occurrence is smaller than 1, yet Huffman must spend at least one whole bit on it – poor compression. Arithmetic coding represents such a symbol by a small shrinkage of the interval, so the extra number of digits really is smaller than 1. Consider the example above after aaa.

Slide 19: Arithmetic coding – adaptive – an example. The symbols: {a, b, c}; initial frequencies: 1, 1, 1 (= initial accumulated frequencies; 0 is illegal, one cannot code a symbol with probability 0!). On b, the model passes to the coder the triple (1, 2, 3): 1 – the accumulated frequencies up to, not including, b; 2 – the accumulated frequencies up to and including b; 3 – the sum of the frequencies. The coder notes the new interval [1/3, 2/3); the model updates the frequencies to 1, 2, 1. On c, the model passes (3, 4, 4) (the upper quarter); the coder updates the interval to [7/12, 8/12); the model updates the frequencies to 1, 2, 2. And so on…
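A sketch of this coder-model interaction (names `freqs`, `order`, `encode_symbol` are mine; the decoder would run the identical update, as the slides require):

```python
from fractions import Fraction

freqs = {'a': 1, 'b': 1, 'c': 1}     # adaptive counts, all start at 1
order = ['a', 'b', 'c']
low, high = Fraction(0), Fraction(1)

def encode_symbol(sym):
    """Model passes (lo_cum, hi_cum, total); coder narrows; model updates."""
    global low, high
    total = sum(freqs.values())
    lo_cum = 0
    for s in order:                  # accumulated freqs up to, not including, sym
        if s == sym:
            break
        lo_cum += freqs[s]
    hi_cum = lo_cum + freqs[sym]     # up to and including sym
    width = high - low
    low, high = (low + width * Fraction(lo_cum, total),
                 low + width * Fraction(hi_cum, total))
    freqs[sym] += 1                  # model update: decoder does exactly the same

encode_symbol('b')                   # triple (1, 2, 3) -> interval [1/3, 2/3)
encode_symbol('c')                   # triple (3, 4, 4) -> interval [7/12, 8/12)
```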

Slide 20: Practical considerations: the interval ends are held as binary numbers, and the number of bits in the number to be stored is proportional to the size of the file – impractical to compute it all before storing. Solution: as the interval gets small, the first bit of any number in it becomes determined; this bit is written by the coder into the compressed file and "removed" from the interval ends (= multiply by 2). Example: in the 1st example, when the interval becomes [0, 27/64) ~ [0.000000, 0.011011) (after 3 a's), output 0 and update to [0.00000, 0.11011). The decoder sees the 1st 0, knows the first three symbols are a's, computes the interval, and "throws away" the 0.

Slide 21: Practically, (de)coders maintain a fixed-size word for each number and the computations are approximate, so there is some (very small) loss of compression; both sides must perform the same approximations at the "same time". Is the initial assignment of frequency 1 fair to low-frequency symbols? Solution: assign 1 collectively to all symbols not seen so far; if k symbols were not yet seen and one of them now occurs, give it 1/k. Since the decoder does not know when to stop, the file length must be stored in the compressed file.

Slide 22: Frequencies data structure: we need to allow both updates and prefix sums of the form f1 + … + fk (expensive for large alphabets). Solution: a tree-like structure with O(log n) accesses:

| cell | binary | contents |
|------|--------|----------|
| 1 | 1 | f1 |
| 2 | 10 | f1+f2 |
| 3 | 11 | f3 |
| 4 | 100 | f1+f2+f3+f4 |
| 5 | 101 | f5 |
| 6 | 110 | f5+f6 |
| 7 | 111 | f7 |
| 8 | 1000 | f1+…+f8 |

If the binary representation of the cell number k ends with i 0's, the cell contains f_k + f_{k-1} + … + f_{k-2^i+1}. What is the algorithm to compute a prefix sum, and to update a frequency?
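This structure is a binary indexed (Fenwick) tree; a sketch (class and method names are mine). Both operations walk the binary representation of k via its lowest set bit, hence O(log n):

```python
class Fenwick:
    def __init__(self, n):
        self.tree = [0] * (n + 1)   # cell k holds f_k + ... + f_{k-2^i+1}

    def update(self, k, delta):     # f_k += delta
        while k < len(self.tree):
            self.tree[k] += delta
            k += k & -k             # add the lowest set bit

    def prefix_sum(self, k):        # f_1 + ... + f_k
        s = 0
        while k > 0:
            s += self.tree[k]
            k -= k & -k             # drop the lowest set bit
        return s
```

With f1..f8 = 1..8, cell 4 holds f1+f2+f3+f4 = 10 and cell 6 holds f5+f6 = 11, exactly as in the table.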

Slide 23: Dictionary-based methods. Huffman is a dictionary-based method: each symbol in the dictionary has an associated code. But adaptive Huffman is not practical. Famous adaptive methods: LZ77 and LZ78 (Lempel-Ziv). We describe LZ77 (the basis of gzip in Unix).

Slide 24: Basic idea: the dictionary is the set of symbol sequences in a window before the current position (typical window size: …). When the coder is at position p, the window is the symbols in positions p-w, …, p-1. The coder searches for the longest sequence in the window that matches the one starting at position p. If one of length l is found, put (n, l) into the file (n – the offset, l – the length) and move forward l positions; else output the current symbol.

Slide 25: Example: the input is a b a a b a b b…b (11 b's). The code is: a b (2,1) (1,1) (3,2) (2,1) (1,10). Decoding: a → a, b → b, (2,1) → a, (1,1) → a; the currently known string: a b a a. (3,2) → b a, (2,1) → b; the currently known string: a b a a b a b. (1,10): go back one step to b, then 10 times: output the scanned symbol and advance one position (note: run-length encoding hides here). Note: decoding is extremely fast!
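The decoder for this example fits in a few lines (a sketch; `lz77_decode` is my name, and tokens are either a literal character or an (offset, length) pair). Copying one symbol at a time is what makes the overlapping pair (1, 10) act as run-length encoding:

```python
def lz77_decode(tokens):
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            offset, length = t
            for _ in range(length):      # copy symbol by symbol:
                out.append(out[-offset]) # correct even when offset < length
        else:
            out.append(t)
    return "".join(out)
```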

Slide 26: Practical issues: 1) maintenance of the window: use a cyclic buffer; 2) searching for the longest matching sequence makes coding expensive; 3) how to distinguish a pair (n, l) from a symbol? 4) can we save on the space for (n, l)? The gzip solutions for 2–4. For 2: a hash table of 3-sequences, with lists of the positions where a sequence starting with them occurs (what about shorter matches?). An option: limit the search in the list (saves time); this does not always find the longest match, but the loss is very small.

Slide 27: For 3: one bit suffices (but see below). For 4: offsets are integers in the range [1, 2^k], and smaller values are often more frequent. The semi-static solution (gzip): divide the file into segments of 64K; for each segment, find the offsets used and their frequencies and code them using canonical Huffman; do the same for the lengths. Actually, the symbols (issue 3) are added to the set of lengths and coded together with it using one code, and this code is put in the file before the offset code (why?).

Slide 28: One last issue (for all methods): synchronization. Suppose you want to start decoding in mid-file, e.g., in a db of files coded using one code. Bit-based addresses for the files occur in many inverted lists (IL's), which are loaded into main memory; 32 bits per address is ok, 64 bits per address may be costly. Byte- or word-based addresses allow for much larger db's; it may even pay to use addresses of k-word blocks. But then, how does one synchronize?

Slide 29: Solution: fill the last block with 01…1 (if the code exactly fills the last block, add a block). Since the file addresses/lengths are known, the filling can be removed. Does this work for Huffman? Arithmetic? LZ77? What is the cost?

Slide 30: Summary of file compression: for large db's, compression helps reduce storage. Fast query processing requires synchronization and fast decoding. The db is often given, so statistics can be collected – semi-static is a viable option (plus regular re-organization). Context-based methods give good compression but expensive decoding. Word-based Huffman is recommended (semi-static); construct two models: one for words, another for non-words.

Slide 31: Compression of inverted lists:
- Introduction
- Global, non-parametric methods
- Global parametric methods
- Local parametric methods

Slide 32: Introduction. Important parameters: N – # of documents in the db; n – # of distinct words; F – # of word occurrences; f – # of inverted-list entries. The index contains the lexicon (in main memory, if possible) and the IL's (on disc); IL compression helps to reduce the size of the index and the cost of I/O. In TREC (1999): N = 741,856; n = 535,346; F = 333,338,738; f = 134,994,414; total size: 2 GB.

Slide 33: The IL for a term t contains f_t entries. An entry: d (= doc id), {in-doc frequency, in-doc positions, …}. For ranked answers, the entry is usually (d, f_{d,t}). We consider each component separately – the compressions are independent and can be composed.

Slide 34: Compression of doc numbers: a sequence of increasing numbers in [1..N]; how can it be compressed? Most methods use gaps: g1 = d1, g2 = d2 - d1, … We know that the gaps sum to at most N, and for long lists most gaps are small. These facts can be used for compression. (Each method has an associated probability distribution on the gaps, defined by its code lengths: p(g) = 2^(-length(g)).)
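The gap transform itself is trivial but worth writing down (a sketch; the names `to_gaps` and `from_gaps` are mine):

```python
def to_gaps(docs):
    """Increasing doc numbers -> gaps: g1 = d1, gi = di - d(i-1)."""
    return [d - prev for prev, d in zip([0] + docs, docs)]

def from_gaps(gaps):
    """Gaps -> doc numbers, by running sums."""
    docs, d = [], 0
    for g in gaps:
        d += g
        docs.append(d)
    return docs
```

Note that the gaps of a list bounded by N sum to the last doc number, hence to at most N.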

Slide 35: Global, non-parametric methods. Binary coding: represent each gap by a fixed-length binary number. Code length for g: ceil(log N) bits. Implied probability: the uniform distribution p(g) = 1/N.

Slide 36: Unary coding: represent each g > 0 by g-1 digits 1, then a 0: 1 → 0, 2 → 10, 3 → 110, 4 → 1110, … Code length for g: g bits. Worst case for the sum over an IL: N (hence for all IL's: nN) – is this a nice bound? Implied p(g) = 2^(-g): exponential decay; if this does not hold in practice, there is a compression penalty.
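A two-line sketch of unary coding (function names are mine):

```python
def unary_encode(g):
    """g >= 1: (g-1) ones followed by a zero."""
    return "1" * (g - 1) + "0"

def unary_decode(bits):
    """Read ones until the terminating zero; the count + 1 is g."""
    g = 1
    for b in bits:
        if b == "0":
            return g
        g += 1
```

Note the code length is exactly g bits, matching the implied p(g) = 2^(-g).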

Slide 37: Gamma (γ) code: a number g is represented by a prefix – the unary code for 1 + floor(log g) – and a suffix – the binary code, with floor(log g) digits, for g - 2^floor(log g). Examples: 1 → 0, 2 → 10 0, 3 → 10 1, 4 → 110 00. (Why the unary code for 1 + floor(log g) and not for floor(log g)?)
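A sketch of the gamma code (function names are mine; `bit_length() - 1` computes floor(log2 g)):

```python
def gamma_encode(g):
    n = g.bit_length() - 1                     # floor(log2 g)
    prefix = "1" * n + "0"                     # unary code for n + 1
    suffix = format(g - (1 << n), "0{}b".format(n)) if n else ""
    return prefix + suffix                     # total length: 1 + 2n bits

def gamma_decode(bits):
    it = iter(bits)
    n = 0
    while next(it) == "1":                     # read the unary prefix
        n += 1
    rest = "".join(next(it) for _ in range(n)) # n suffix bits
    return (1 << n) + (int(rest, 2) if rest else 0)
```

The code length 1 + 2·floor(log g) is what yields the implied probability p(g) ≈ 1/(2g²).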

Slide 38: Delta (δ) code: a number g is represented by the gamma code of 1 + floor(log g), followed by floor(log g) binary digits for g - 2^floor(log g). It is shorter than gamma for large gaps.

Slide 39: Interim summary: we have codes with implied probability distributions – binary: p(g) = 1/N; unary: p(g) = 2^(-g); gamma: p(g) ≈ 1/(2g^2); delta: p(g) ≈ 1/(2g (log g)^2). Q: can you prove that the (exact) formulas for the probabilities of gamma and delta sum to 1?

Slide 40: Golomb code: semi-static, uses db statistics – a global, parametric code. 1) Select a basis b (based on db statistics – later). 2) For g > 0 we represent g-1. Prefix: let q = (g-1) div b (integer division); represent q+1 in unary. Suffix: the remainder is (g-1) - qb (in [0..b-1]); represent it by a binary tree code – some leaves at distance floor(log b), the others at distance ceil(log b).

Slide 41: The binary tree code: cut 2j leaves (j = 2^k - b, where k = ceil(log b)) from the full binary tree of depth k, and assign the remaining leaves, in order, to the values in [0..b-1]. Example for b = 6: the values 0, 1 get 2-bit codes, and the values 2, 3, 4, 5 get 3-bit codes.
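A sketch of the full Golomb encoder, assuming b >= 2 (function names are mine; the suffix is the truncated binary code of this slide: 2^k - b short codewords of length k-1, the rest of length k, assigned in order):

```python
import math

def truncated_binary(r, b):
    """Binary tree code for the remainder r in [0..b-1]."""
    k = math.ceil(math.log2(b))
    short = (1 << k) - b                  # number of (k-1)-bit codewords
    if r < short:
        return format(r, "0{}b".format(k - 1)) if k > 1 else ""
    return format(r + short, "0{}b".format(k))

def golomb_encode(g, b):
    """g >= 1: unary for q+1 where q = (g-1) div b, then the remainder."""
    q, r = divmod(g - 1, b)
    return "1" * q + "0" + truncated_binary(r, b)
```

For b = 6 this gives remainders 0..5 the codes 00, 01, 100, 101, 110, 111, matching the tree on the slide.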

Slide 42: Summary of the Golomb code: exponential decay as with unary, but at a slower rate, controlled by b. Q: what is the underlying theory? Q: how is b chosen?

Slide 43: Infinite Huffman trees. Example: consider the probabilities 1/2, 1/4, 1/8, … The code (*) 0, 10, 110, 1110, … seems natural, but the Huffman algorithm is not applicable (why?). For each m, consider the (finite) m-approximation; each has a Huffman tree code 0, 10, …, 1…10, and the code for m+1 refines that of m. The sequence of codes converges to (*).

Slide 44 (figure): probabilities 1/2, 1/4, 1/8, 1/16, …
- approximation 1, code words: 0, 1
- approximation 2, code words: 0, 10, 11
- approximation 3, code words: 0, 10, 110, 111
- approximation 4, code words: 0, 10, 110, 1110, 1111

Slide 45: A more general approximation scheme. Given the sequence p1, p2, p3, …, an m-approximation with skip b is the finite sequence p1, …, pm, t1, …, tb, where tj is the sum of the tail probabilities p(m+j), p(m+j+b), p(m+j+2b), … For example, with b = 3 the tail is approximated by three such sums.

Slide 46: Fact: refining the m-approximation by splitting t1 into p(m+1) and the new last tail sum gives the (m+1)-approximation. A sequence of m-approximations is good if (*) the last two elements are the smallest in the sequence, so they are the 1st pair merged by Huffman (why is this important?). Whether (*) holds depends on the p_i and on b.

Slide 47: Let p_i = (1-p)^(i-1) p – the Bernoulli (geometric) distribution. This is a decreasing sequence; to prove (*), one needs to show: (1-p)^b + (1-p)^(b+1) ≤ 1 and 1 ≤ (1-p)^(b-1) + (1-p)^b. For which b do these hold?

Slide 48: (derivation continues; the formulas were not captured in the transcript)

Slide 49: We select the strict inequality on the right (useful later): (1-p)^b + (1-p)^(b+1) ≤ 1 < (1-p)^(b-1) + (1-p)^b has a unique solution. Solving the left side we obtain (1-p)^b ≤ 1/(2-p), i.e., b ≥ log(2-p) / (-log(1-p)). Hence the solution is (b is an integer): b = ceil( log(2-p) / (-log(1-p)) ).
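The formula above can be checked numerically against the defining inequalities (a sketch; `golomb_b` is my name):

```python
import math

def golomb_b(p):
    """The unique integer b with (1-p)^b + (1-p)^(b+1) <= 1 < (1-p)^(b-1) + (1-p)^b."""
    return math.ceil(math.log(2 - p) / -math.log(1 - p))

def holds(p, b):
    q = 1 - p
    return q ** b + q ** (b + 1) <= 1 < q ** (b - 1) + q ** b
```

For p = 1/2, b = 1 (gaps behave like unary); for small p, b is roughly 0.69/p, the approximation used on a later slide.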

Slide 50: Next: what do these Huffman trees look like? Start with the 0-approximation t1, …, tb. Facts: 1. it is a decreasing sequence (so the last two elements are the smallest); 2. when b > 3, t1 is smaller than the sum of the last two elements (follows from the choice of b and (**)); 3. the previous two properties are preserved when the last two elements are replaced by their sum; 4. the Huffman tree for the sequence assigns to t1, …, tb codes of lengths of the same cost as the Golomb code for the remainders. Proof: induction on b.

Slide 51: Now expand the approximations to obtain the infinite tree: this is the Golomb code, with the places of prefix and suffix exchanged!

Slide 52: Last question: where do we get p, and why Bernoulli? Assume an equal probability p for a term t to occur in a document d. For a given t, the probability of a gap g from one document to the next is then (1-p)^(g-1) p. For p: there are f pairs (t, d) out of nN possible ones, so estimate p by f/(nN). Since N is large, this is a reasonable estimate.

Slide 53: For TREC: p = f/(nN). To estimate b for a small p: log(2-p) ≈ log 2 and log(1-p) ≈ -p, so b ≈ (log 2)/p ≈ 0.69 nN/f = 1917. End of (global) Golomb.

Slide 54: Global observed frequency (a global method): construct all IL's, collect statistics on the frequencies of the gaps, and construct a canonical Huffman tree for the gaps. The model/tree needs to be stored (gaps are in [1..N]; for TREC this is up to 3/4M gap values, so the storage overhead may not be so large). In practice it is not far from gamma and delta, but local methods are better.

Slide 55: Local (parametric) methods: the coding of IL(t) is based on the statistics of IL(t) itself. Local observed frequency: construct a canonical Huffman code for IL(t) based on its own gap frequencies. Problem: in small IL's the number of distinct gaps is close to the number of gaps, so the size of the model is close to the size of the compressed data. Example: 25 entries, 15 gap values; the model stores 15 gaps and 15 lengths (or frequencies). Way out: construct a model for groups of IL's (see the book for details).

Slide 56: Local Bernoulli/Golomb. Assumption: f_t, the number of entries of IL(t), is known to both coder and decoder; estimate b from it and construct the Golomb code. Note: a large f_t means a larger p, hence a smaller b, and the code gets close to unary (reasonable: many small gaps). A small f_t means a large b, and most of the coding costs ~ log b bits. For example, f_t = 2 (one gap): b ≈ 0.69N; a gap < 0.69N is coded in about log(0.69N) bits; a larger gap takes one more bit.

Slide 57: Interpolative coding: uses the original d's, not the gaps. Let f = f_t, and assume the d's are stored in L[0, …, f-1] (each entry is at most N). Code the middle d in standard binary, with the number of bits determined by its possible range; then continue as in binary search, coding each d in binary with the number of bits determined by its progressively narrowed range.

Slide 58: Example: L = [3, 8, 9, 11, 12, 13, 18] (f = 7), N = 20. h ← 7 div 2 = 3; L[3] = 11 (the 4th d). The smallest possible d is 1 and there are 3 entries to the left of L[3]; the largest is 20 and there are 3 entries to its right. The size of the interval is (20-3) - (1+3) = 17 - 4 = 13, so code 11 in 4 bits. For the sub-list left of 11 – 3, 8, 9: h ← 3 div 2 = 1; L[1] = 8; bounds: lower 1+1 = 2, upper 10-1 = 9; code using 3 bits. For L[2] = 9 the range is [9..10]; use 2 bits. For the sub-list right of 11 – do on the board (note the element that is coded in 0 bits!).
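The recursion can be sketched as follows (names are mine; the bit count uses ceil(log2(range size)), whose exact bounds convention may differ by one from the slide's arithmetic). It records a (value, bits) pair per document number:

```python
import math

def interpolative(L, lo_idx, hi_idx, lo_val, hi_val, out):
    """Binary-search order; each L[h] is coded within its forced range."""
    if lo_idx > hi_idx:
        return
    h = (lo_idx + hi_idx) // 2
    # L[h] is at least lo_val plus the entries to its left,
    # and at most hi_val minus the entries to its right
    lo = lo_val + (h - lo_idx)
    hi = hi_val - (hi_idx - h)
    bits = math.ceil(math.log2(hi - lo + 1)) if hi > lo else 0
    out.append((L[h], bits))
    interpolative(L, lo_idx, h - 1, lo_val, L[h] - 1, out)
    interpolative(L, h + 1, hi_idx, L[h] + 1, hi_val, out)

out = []
interpolative([3, 8, 9, 11, 12, 13, 18], 0, 6, 1, 20, out)
```

On the slide's list, 11 takes 4 bits, 8 takes 3, and 12 is forced into the one-value range [12..12], so it is coded in 0 bits.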

Slide 59: Advantages: relatively easy to code and decode; very efficient for clusters (a word that occurs in many documents close to each other). Disadvantages: more complex to implement, requires a stack, and the cost of decoding is a bit higher than Golomb. Summary of methods: see table 3.8.

Slide 60: An entry in IL(t) also contains f_{d,t} – the frequency of t in d. Compression of f_{d,t}: in TREC, F/f ≈ 2.7, so these are small numbers. Unary: the total overhead is F bits, i.e., a cost per entry of F/f (for TREC: 2.7). Gamma: shorter than unary except for g = 2, 4 (for TREC: ≈ 2.13 bits per entry). It does not pay to use a more complex code. Total cost of the compression of an IL: 8–9 bits per entry.

