CSE 589 Applied Algorithms Spring 1999


CSE 589 Applied Algorithms Spring 1999 Dictionary Coding Sequitur

LZW Encoding Algorithm
repeat
  find the longest match w in the dictionary
  output the index of w
  add wa to the dictionary, where a is the first unmatched symbol
CSE 589 - Lecture 13 - Spring 1999
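The loop above can be sketched in a few lines (an illustrative implementation, not the lecture's own code; the function name and dictionary layout are my choices):

```python
def lzw_encode(text, alphabet):
    # initialize the dictionary with one entry per alphabet symbol
    dictionary = {s: i for i, s in enumerate(sorted(alphabet))}
    output = []
    w = ""
    for a in text:
        if w + a in dictionary:
            w = w + a                            # extend the current match
        else:
            output.append(dictionary[w])         # output index of longest match
            dictionary[w + a] = len(dictionary)  # add w plus unmatched symbol
            w = a                                # restart the match at a
    if w:
        output.append(dictionary[w])             # flush the final match
    return output
```

On the slides' example, alphabet {a, b} and input ababababa, this produces the indices 0 1 2 4 3 and builds the entries 2 ab, 3 ba, 4 aba, 5 abab.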

LZW Encoding Example
Input: a b a b a b a b a; initial dictionary: 0 a, 1 b
Step 2: match a, output 0; add 2 ab
Step 3: match b, output 1; add 3 ba
Step 4: match ab, output 2; add 4 aba
Step 5: match aba, output 4; add 5 abab
Step 6: match ba, output 3
Output: 0 1 2 4 3

LZW Decoding Algorithm
Emulate the encoder in building the dictionary; the decoder is one step behind the encoder.
initialize dictionary;
decode first index to w;
put w? in dictionary (entry incomplete);
repeat
  decode the first symbol s of the next index;
  complete the previous dictionary entry with s;
  finish decoding the remainder of the index to w;
  put w? in the dictionary, where w was just decoded;
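A sketch of the decoder above, including the tricky case where an index refers to the dictionary entry that is still being completed (names are mine, not the lecture's):

```python
def lzw_decode(codes, alphabet):
    # initialize the dictionary with one entry per alphabet symbol
    dictionary = {i: s for i, s in enumerate(sorted(alphabet))}
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        else:
            # the index refers to the entry being built right now;
            # its completion must be w plus the first symbol of w
            entry = w + w[0]
        out.append(entry)
        # complete the pending entry: w plus the first symbol of entry
        dictionary[len(dictionary)] = w + entry[0]
        w = entry
    return "".join(out)
```

For example, decoding 0 1 2 4 3 over alphabet {a, b} reconstructs ababababa, matching the encoding example.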

LZW Decoding Example
Input indices: 0 1 2 4 3 6; initial dictionary: 0 a, 1 b
Step 1: 0 -> a; add incomplete entry 2 a?
Step 2: 1 -> b; complete 2 ab; add 3 b?
Step 3: 2 -> ab; complete 3 ba; add 4 ab?
Step 4: 4 -> aba (its first symbol a completes entry 4 itself); add 5 aba?
Step 5: 3 -> ba; complete 5 abab; add 6 ba?
Step 6: 6 -> bab (its first symbol b completes entry 6 itself); add 7 bab?
Output: a b ab aba ba bab

Trie Data Structure for Dictionary
Fredkin (1960). Dictionary: 0 a, 1 b, 2 c, 3 d, 4 r, 5 ab, 6 br, 7 ra, 8 ac, 9 ca, 10 ad, 11 da, 12 abr, 13 raa, 14 abra. The trie stores each phrase as a path from the root, one symbol per edge; each node is labeled with its dictionary index, so a phrase's entry is found by walking down from the root.
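A minimal trie for the encoder's dictionary might look like this (an illustrative sketch, not the lecture's implementation; class and function names are mine):

```python
class TrieNode:
    """One dictionary phrase: its index plus a map from symbol to child."""
    def __init__(self, index):
        self.index = index
        self.children = {}

def make_dictionary(alphabet):
    # the root is a dummy node; its children are the single-symbol phrases
    root = TrieNode(None)
    for i, s in enumerate(sorted(alphabet)):
        root.children[s] = TrieNode(i)
    return root

def longest_match(root, text, pos):
    """Walk down the trie consuming text[pos:]; return the index of the
    longest dictionary phrase found and its length in symbols."""
    node, length = root, 0
    while pos + length < len(text) and text[pos + length] in node.children:
        node = node.children[text[pos + length]]
        length += 1
    return node.index, length
```

Finding the longest match and adding the one new node per step are both proportional to the match length, which is what makes the trie attractive for the encoder.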

Encoder Uses a Trie
Input: a b r a c a d a b r a a b r a c a d a b r a
The encoder walks down the trie to find the longest match, outputs that node's index, and adds one new child node: the match extended by the first unmatched symbol.
Output so far: 0 1 4 0 2 0 3 5 7 12 8

Decoder's Data Structure
Simply an array of strings: 0 a, 1 b, 2 c, 3 d, 4 r, 5 ab, 6 br, 7 ra, 8 ac, 9 ca, 10 ad, 11 da, 12 abr, 13 raa, 14 abr? (last entry still incomplete). Indices decoded so far: 0 1 4 0 2 0 3 5 7 12 8 ... -> a b r a c a d ab ra abr

Notes on Dictionary Coding
Extremely effective when repeated patterns in the data are widely spread, i.e. where local context is not as significant: text, some graphics, program sources or binaries. Variants of LZW are pervasive: Unix compress, GIF.

Sequitur
Nevill-Manning and Witten, 1996. Uses a context-free grammar (without recursion) to represent a string; the grammar is inferred from the string. If there is structure and repetition in the string, the grammar may be very small compared to the original string. Clever encoding of the grammar yields impressive compression ratios. Compression plus structure!

Context-Free Grammars
Invented by Chomsky in 1959 to explain the grammar of natural languages, and independently by Backus in 1959 to generate and parse Fortran. Example: terminals b, e; nonterminals S, A; production rules S -> SA, S -> A, A -> bSe, A -> be; S is the start symbol.

Context-Free Grammar Example
Derivation of bbebee: S => A => bSe => bSAe => bAAe => bbeAe => bbebee. The parse tree is hierarchical; in this grammar b and e are matched like parentheses.

Arithmetic Expressions
Grammar: S -> S + T, S -> T, T -> T * F, T -> F, F -> a, F -> (S).
Derivation of a * (a + a) + a:
S => S+T => T+T => T*F+T => F*F+T => a*F+T => a*(S)+T => a*(S+T)+T => a*(T+T)+T => a*(T+F)+T => a*(F+F)+T => a*(a+F)+T => a*(a+a)+T => a*(a+a)+F => a*(a+a)+a

Sequitur Principles
Digram uniqueness: no pair of adjacent symbols (digram) appears more than once in the grammar.
Rule utility: every production rule is used more than once.
These two principles are maintained as an invariant while inferring a grammar for the input string.

Sequitur Example
Input: bbebeebebebbebee (grammar shown after each input symbol and repair).
(1) S -> b
(2) S -> bb
(3) S -> bbe
(4) S -> bbeb
(5) S -> bbebe; be occurs twice, so create new rule A -> be
(6) S -> bAA, A -> be
(7) S -> bAAe
(8) S -> bAAeb
(9) S -> bAAebe; be occurs twice, so use existing rule A -> be
(10) S -> bAAeA
(11) S -> bAAeAb
(12) S -> bAAeAbe; be occurs twice, so use A
(13) S -> bAAeAA; AA occurs twice, so create new rule B -> AA
(14) S -> bBeB, A -> be, B -> AA
(15) S -> bBeBb
(16) S -> bBeBbb
(17) S -> bBeBbbe; be occurs twice, so use A
(18) S -> bBeBbA
(19) S -> bBeBbAb
(20) S -> bBeBbAbe; be occurs twice, so use A
(21) S -> bBeBbAA; AA occurs twice, so use B
(22) S -> bBeBbB; bB occurs twice, so create new rule C -> bB
(23) S -> CeBC, A -> be, B -> AA, C -> bB
(24) S -> CeBCe; Ce occurs twice, so create new rule D -> Ce
(25) S -> DBD, A -> be, B -> AA, C -> bB, D -> Ce; C is now used only once, so remove C -> bB
(26) S -> DBD, A -> be, B -> AA, D -> bBe

The Hierarchy
For input bbebeebebebbebee, the grammar S -> DBD, A -> be, B -> AA, D -> bBe forms a parse hierarchy over the string. Is there compression? In this small example, probably not.

Sequitur Algorithm
Input the first symbol s to create the production S -> s;
repeat
  input a new symbol s:    S -> X1X2...Xk  becomes  S -> X1X2...Xk s
  match an existing rule:  A -> ...XY..., B -> XY  becomes  A -> ...B..., B -> XY
  create a new rule:       A -> ...XY..., B -> ...XY...  becomes  A -> ...C..., B -> ...C..., C -> XY
  remove a rule:           A -> ...B..., B -> X1X2...Xk  becomes  A -> ...X1X2...Xk...
until no symbols left
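The loop above can be sketched directly, at the cost of efficiency (a quadratic-time illustration; the real algorithm runs in linear time using the data structures described below; all names here are mine). Terminals are characters, nonterminals are names like "R0", and "S" is the start rule.

```python
def find_dup_digram(rules):
    """Return a digram occurring twice (non-overlapping) in the grammar, or None."""
    first = {}
    for name, body in rules.items():
        for i in range(len(body) - 1):
            d = (body[i], body[i + 1])
            if d in first:
                n0, i0 = first[d]
                if n0 != name or i - i0 > 1:   # ignore overlapping pairs
                    return d
            else:
                first[d] = (name, i)
    return None

def replace_digram(body, digram, nt):
    out, i = [], 0
    while i < len(body):
        if i + 1 < len(body) and (body[i], body[i + 1]) == digram:
            out.append(nt)
            i += 2
        else:
            out.append(body[i])
            i += 1
    return out

def sequitur(text):
    rules, counter = {"S": []}, 0
    for ch in text:
        rules["S"].append(ch)                  # input a new symbol
        while True:
            d = find_dup_digram(rules)
            if d is not None:                  # enforce digram uniqueness
                nt = next((n for n, b in rules.items()
                           if n != "S" and tuple(b) == d), None)
                if nt is None:                 # create a new rule
                    nt = f"R{counter}"
                    counter += 1
                    rules[nt] = list(d)
                for name in list(rules):       # replace the digram everywhere
                    if name != nt:
                        rules[name] = replace_digram(rules[name], d, nt)
                continue
            used = {}                          # enforce rule utility
            for body in rules.values():
                for sym in body:
                    if sym in rules:
                        used[sym] = used.get(sym, 0) + 1
            once = next((n for n, c in used.items() if c == 1), None)
            if once is not None:               # inline a rule used only once
                for name, body in rules.items():
                    if once in body:
                        k = body.index(once)
                        rules[name] = body[:k] + rules[once] + body[k + 1:]
                        break
                del rules[once]
                continue
            break
    return rules

def expand(rules, name="S"):
    """Re-derive the original string from the grammar."""
    return "".join(expand(rules, s) if s in rules else s
                   for s in rules[name])
```

On the worked example bbebeebebebbebee, this produces a four-rule grammar equivalent to S -> DBD, A -> be, B -> AA, D -> bBe, and expanding it recovers the input.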

Complexity
The number of non-input Sequitur operations applied is less than 2n, where n is the input length.
Amortized complexity argument: let s = the sum of the lengths of the right-hand sides of all production rules and r = the number of rules, and consider the quantity s - r/2.
Initially s - r/2 = 1/2, because s = 1 and r = 1.
s - r/2 > 0 at all times, because each rule has at least one symbol on its right-hand side.
s - r/2 increases by 1 for every input operation.
s - r/2 decreases by at least 1/2 for each non-input Sequitur operation applied.
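The argument above amounts to a potential-function bound; written out (the symbol Phi is notation introduced here, not the slides'):

```latex
% Potential \Phi = s - r/2, with s = total length of right-hand sides
% and r = number of rules, as defined above.
\[
  \Phi_0 = 1 - \tfrac{1}{2} = \tfrac{1}{2}, \qquad
  \Phi \ge \tfrac{r}{2} \ge \tfrac{1}{2}
  \text{ (each rule has a nonempty right-hand side, so } s \ge r\text{).}
\]
\[
  \text{each input operation: } \Delta\Phi = +1; \qquad
  \text{each non-input operation: } \Delta\Phi \le -\tfrac{1}{2}.
\]
\[
  \#\{\text{non-input operations}\}
  \;\le\; \frac{\Phi_0 + n - \Phi_{\text{final}}}{1/2}
  \;\le\; \frac{\tfrac{1}{2} + n - \tfrac{1}{2}}{1/2}
  \;=\; 2n .
\]
```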

Sequitur Rule Complexity
Digram uniqueness, match an existing rule (A -> ...XY..., B -> XY becomes A -> ...B..., B -> XY): s changes by -1, r by 0, so s - r/2 changes by -1.
Digram uniqueness, create a new rule (A -> ...XY..., B -> ...XY... becomes A -> ...C..., B -> ...C..., C -> XY): s changes by 0, r by +1, so s - r/2 changes by -1/2.
Rule utility, remove a rule (A -> ...B..., B -> X1X2...Xk becomes A -> ...X1X2...Xk...): s changes by -1, r by -1, so s - r/2 changes by -1/2.

Linear Time Algorithm
There is a data structure that implements every Sequitur operation in constant time:
Production rules kept in an array of doubly linked lists.
Each production rule carries a reference count of the number of times it is used.
Each nonterminal points to its production rule.
Digrams stored in a hash table for quick lookup.

Data Structure Example
Grammar: S -> CeBCe, A -> be, B -> AA, C -> bB; current digram: Ce.
Digram table: BC, eB, Ce, be, AA, bB.
Reference counts: A: 2, B: 2, C: 2.

Basic Encoding of a Grammar
Grammar: S -> DBD, A -> be, B -> AA, D -> bBe.
Serialized with # as a rule separator: D B D # b e # A A # b B e.
Symbol codes (3 bits each): 101 100 101 110 000 001 110 011 011 110 000 100 001 = 39 bits.
With r = number of rules, s = sum of the right-hand sides, and a = number of symbols in the original alphabet, |Grammar Code| = (s + r - 1) * ceil(log2(a + r)) bits: here s = 10, r = 4, a = 2, giving 13 * 3 = 39.
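The bit accounting can be checked in a few lines (a back-of-the-envelope sketch; the (s + r - 1) * ceil(log2(a + r)) formula is my reconstruction, chosen to be consistent with the 39-bit figure on the slide):

```python
import math

# The example grammar from the slide.
rules = {"S": "DBD", "A": "be", "B": "AA", "D": "bBe"}
r = len(rules)                            # number of rules
s = sum(len(b) for b in rules.values())   # sum of right-hand-side lengths
a = 2                                     # original alphabet {b, e}

symbols = s + (r - 1)                     # rule bodies plus '#' separators
bits_per_symbol = math.ceil(math.log2(a + r))  # codes for b, e, A, B, D, '#'
total_bits = symbols * bits_per_symbol
print(total_bits)                         # -> 39
```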

Better Encoding of the Grammar
Nevill-Manning and Witten suggest a more efficient encoding of the grammar that resembles LZ77. The first time a nonterminal is sent, its right-hand side is transmitted instead. The second time a nonterminal is sent, the new production rule is established: a pointer to the previous occurrence is sent along with the length of the rule. Subsequently, the nonterminal is represented by the index of its production rule.

Compression Quality
Nevill-Manning and Witten, 1997. Files from the Calgary corpus; units are bits per character (out of 8).

file    size     compress  gzip  sequitur  PPMC
bib     111261   3.35      2.51  2.48      2.12
book    768771   3.46      3.35  2.82      2.52
geo     102400   6.08      5.34  4.74      5.01
obj2    246814   4.17      2.63  2.68      2.77
pic     513216   0.97      0.82  0.90      0.98
progc   38611    3.87      2.68  2.83      2.49

compress is based on LZW; gzip is based on LZ77; PPMC is adaptive arithmetic coding with context.

Notes on Sequitur
Very new and quite different from the standard methods. Yields compression and structure simultaneously. With clever encoding, it is competitive with the best of the standards. Practical linear-time encoding and decoding.