Algoritmi per IR Dictionary-based compressors. Lempel-Ziv Algorithms Keep a “dictionary” of recently-seen strings. The differences are: How the dictionary.

Slides:



Advertisements
Similar presentations
Paolo Ferragina, Università di Pisa XML Compression and Indexing Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio,
Advertisements

Paolo Ferragina, Università di Pisa On Compression and Indexing: two sides of the same coin Paolo Ferragina Dipartimento di Informatica, Università di.
Lecture 4: Data Compression Techniques TSBK01 Image Coding and Data Compression Jörgen Ahlberg Div. of Sensor Technology Swedish Defence Research Agency.
Hokkaido University Lecture on Information Knowledge Network "Information retrieval and pattern matching" Laboratory of Information Knowledge Network,
Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts Sunho Lee and Kunsoo Park Seoul National Univ.
Lecture #1 From 0-th order entropy compression To k-th order entropy compression.
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
15-583:Algorithms in the Real World
Applied Algorithmics - week7
Paolo Ferragina, Università di Pisa Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa.
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Paolo Ferragina, Università di Pisa Compressing and Indexing Strings and (labeled) Trees Paolo Ferragina Dipartimento di Informatica, Università di Pisa.
Source Coding Data Compression A.J. Han Vinck. DATA COMPRESSION NO LOSS of information and exact reproduction (low compression ratio 1:4) general problem.
Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.
Lecture 10: Dictionary Coding
Algorithms for Data Compression
Lecture 6 Source Coding and Compression Dr.-Ing. Khaled Shawky Hassan
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
1 Suffix Trees © Jeff Parker, Outline An introduction to the Suffix Tree Some sample applications How to build a Suffix Tree efficiently.
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
Lossless Compression - II Hao Jiang Computer Science Department Sept. 18, 2007.
Algorithm Programming Some Topics in Compression Bar-Ilan University תשס"ח by Moshe Fresko.
1 Lempel-Ziv algorithms Burrows-Wheeler Data Compression.
Algorithm Programming Some Topics in Compression
Web Algorithmics Dictionary-based compressors. LZ77 Algorithm’s step: Output Advance by len + 1 A buffer “window” has fixed length and moves aacaacabcaaaaaa.
Suffix arrays. Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The.
Suffix trees and suffix arrays presentation by Haim Kaplan.
Tries. (Compacted) Trie y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin.
Advanced Algorithms for Massive DataSets Data Compression.
Source Coding Hafiz Malik Dept. of Electrical & Computer Engineering The University of Michigan-Dearborn
Efficient encoding methods  Coding theory refers to study of code properties and their suitability to specific applications.  Efficient codes are used,
Noiseless Coding. Introduction Noiseless Coding Compression without distortion Basic Concept Symbols with lower probabilities are represented by the binary.
Source Coding-Compression
296.3Page 1 CPS 296.3:Algorithms in the Real World Data Compression: Lecture 2.5.
Information and Coding Theory Heuristic data compression codes. Lempel- Ziv encoding. Burrows-Wheeler transform. Juris Viksna, 2015.
Fundamental Structures of Computer Science Feb. 24, 2005 Ananda Guna Lempel-Ziv Compression.
Fundamental Data Structures and Algorithms Aleks Nanevski February 10, 2004 based on a lecture by Peter Lee LZW Compression.
Basics of Data Compression Paolo Ferragina Dipartimento di Informatica Università di Pisa.
The LZ family LZ77 LZ78 LZR LZSS LZB LZH – used by zip and unzip
Fundamental Data Structures and Algorithms Margaret Reid-Miller 24 February 2005 LZW Compression.
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Parallel Data Compression Utility Jeff Gilchrist November 18, 2003 COMP 5704 Carleton University.
Lecture 7 Source Coding and Compression Dr.-Ing. Khaled Shawky Hassan
Lempel-Ziv methods.
Information Retrieval
The Burrows-Wheeler Transform: Theory and Practice Article by: Giovanni Manzini Original Algorithm by: M. Burrows and D. J. Wheeler Lecturer: Eran Vered.
Index construction: Compression of documents Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading Managing-Gigabytes: pg 21-36, 52-56,
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Basics
Lampel ZIV (LZ) code The Lempel-Ziv algorithm is a variable-to-fixed length code Basically, there are two versions of the algorithm LZ77 and LZ78 are the.
LZW (Lempel-Ziv-welch) compression method The LZW method to compress data is an evolution of the method originally created by Abraham Lempel and Jacob.
15-853Page :Algorithms in the Real World Data Compression III Lempel-Ziv algorithms Burrows-Wheeler Introduction to Lossy Compression.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Page 1 Algorithms in the Real World Lempel-Ziv Burroughs-Wheeler ACB.
Lecture 4: Data Compression Techniques TSBK01 Image Coding and Data Compression Jörgen Ahlberg Div. of Sensor Technology Swedish Defence Research Agency.
Linear Time Suffix Array Construction Using D-Critical Substrings
CSE 589 Applied Algorithms Spring 1999
Burrows-Wheeler Transformation Review
15-853:Algorithms in the Real World
Compression & Huffman Codes
Data Compression.
Information and Coding Theory
Advanced Algorithms for Massive DataSets
Algorithms in the Real World
Applied Algorithmics - week7
Lempel-Ziv-Welch (LZW) Compression Algorithm
Suffix trees.
Problem with Huffman Coding
COMS 161 Introduction to Computing
Advanced Seminar in Data Structures
CPS 296.3:Algorithms in the Real World
Presentation transcript:

Algoritmi per IR Dictionary-based compressors

Lempel-Ziv Algorithms Keep a “dictionary” of recently-seen strings. The differences are: How the dictionary is stored How it is extended How it is indexed How elements are removed LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n   !! No explicit frequency estimation

LZ77 Algorithm’s step: Output d = distance of copied string wrt current position len = length of longest match c = next char in text beyond longest match Advance by len + 1 A buffer “window” has fixed length and moves aacaacabcababac Dictionary (all substrings starting here) Cursor ?? ?

aacaacabcabaaac (3,4,b) aacaacabcabaaac (1,1,c) aacaacabcabaaac Example: LZ77 with window a acaacabcabaaac (0,0,a) aacaacabcabaaac Window size = 6 Longest matchNext character aacaacabcabaaac caacaacababaaaccaacaacababaaac (3,3,a) aacaacabcabaaac (1,2,c) within W

LZ77 Decoding Decoder keeps same dictionary window as encoder. Finds substring and inserts a copy of it What if l > d? (overlap with text to be compressed) E.g. seen = abcd, next codeword is (2,9,e) Simply copy starting at the cursor for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i] Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip LZSS: Output one of the following formats (0, position, length) or (1,char) Typically uses the second format if length < 3. Special greedy: possibly use shorter match so that next match is better Hash Table for speed-up searches on triplets Triples are coded with Huffman’s code

LZ78 Dictionary: substrings stored in a trie (each has an id). Coding loop: find the longest match S in the dictionary Output its id and the next character c after the match in the input string Add the substring Sc to the dictionary Decoding: builds the same dictionary and looks at ids

LZ78: Coding Example aabaacabcabcb (0,a) 1 = a Dict.Output aabaacabcabcb (1,b) 2 = ab aabaacabcabcb (1,a) 3 = aa aabaacabcabcb (0,c) 4 = c aabaacabcabcb (2,c) 5 = abc aabaacabcabcb (5,b) 6 = abcb

LZ78: Decoding Example a (0,a) 1 = a aab (1,b) 2 = ab aabaa (1,a) 3 = aa aabaac (0,c) 4 = c aabaacabc (2,c) 5 = abc aabaacabcabcb (5,b) 6 = abcb Input Dict.

LZW (Lempel-Ziv-Welch) Don’t send extra character c, but still add Sc to the dictionary. Dictionary: initialized with 256 ascii entries (e.g. a = 112 ) Decoder is one step behind the coder since it does not know c There is an issue for strings of the form SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example aabaacababacb =aa Dict.Output aabaacababacb 257=ab aabaacababacb =ba aabaacababacb =aac aabaacababacb =ca aabaacababacb =aba 112 aabaacababacb =abac aabaacababacb =cb

LZW: Decoding Example a =aa aa 257=ab aab =ba aabaa =aac aabaac =ca aabaacab =aba 112 aabaacab 261 InputDict one step later ? 261 a b 114

LZ78 and LZW issues How do we keep the dictionary small? Throw the dictionary away when it reaches a certain size (used in GIF) Throw the dictionary away when it is no longer effective at compressing ( e.g. compress ) Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)

You find this at:

Algoritmi per IR Burrows-Wheeler Transform

The big (unconscious) step...

p i#mississi p p pi#mississ i s ippi#missi s s issippi#mi s s sippi#miss i s sissippi#m i i ssippi#mis s m ississippi # i ssissippi# m The Burrows-Wheeler Transform (1994) Let us given a text T = mississippi# mississippi# ississippi#m ssissippi#mi sissippi#mis sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi ssippi#missi issippi#miss Sort the rows # mississipp i i #mississip p i ppi#missis s FL T

A famous example Much longer...

p i#mississi p p pi#mississ i s ippi#missi s s issippi#mi s s sippi#miss i s sissippi#m i i ssippi#mis s m ississippi # i ssissippi# m # mississipp i i #mississip p i ppi#missis s FL Take two equal L’s chars How do we map L’s onto F’s chars ?... Need to distinguish equal chars in F... Rotate rightward their rows Same relative order !! unknown A useful tool: L  F mapping

T =.... # i #mississip p p i#mississi p p pi#mississ i s ippi#missi s s issippi#mi s s sippi#miss i s sissippi#m i i ssippi#mis s m ississippi # i ssissippi# m The BWT is invertible # mississipp i i ppi#missis s FL unknown 1. LF-array maps L’s to F’s chars 2. L[ i ] precedes F[ i ] in T Two key properties: Reconstruct T backward: i p p i InvertBWT(L) Compute LF[0,n-1]; r = 0; i = n; while (i>0) { T[i] = L[r]; r = LF[r]; i--; }

BWT matrix #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m How to compute the BWT ? ipssm#pissiiipssm#pissii L SA L[3] = T[ 7 ] We said that: L[i] precedes F[i] in T Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ? # i# ippi# issippi# ississippi# mississippi pi# ppi# sippi# sissippi# ssippi# ssissippi# SA Elegant but inefficient Obvious inefficiencies:  (n 2 log n) time in the worst-case  (n log n) cache misses or I/O faults Input: T = mississippi#

Many algorithms, now...

Compressing L seems promising... Key observation: l L is locally homogeneous L is highly compressible Algorithm Bzip : Œ Move-to-Front coding of L  Run-Length coding Ž Statistical coder  Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

RLE0 = An encoding example T = mississippimississippimississippi L = ipppssssssmmmii#pppiiissssssiiiiii Mtf = Mtf = [i,m,p,s] # at 16 Bzip2-output = Arithmetic/Huffman on |  |+1 symbols plus  (16), plus the original Mtf-list (i,m,p,s) Mtf = Alphabet |  |+1 Bin(6)=110, Wheeler’s code

You find this in your Linux distribution