CS336: Intelligent Information Retrieval Lecture 6: Compression

Compression
- Change the representation of a file so that it takes less space to store and less time to transmit.
- Text compression: the original file can be reconstructed exactly.
- Data compression: can tolerate small changes or noise; typically sound or images, which are already a digital approximation of an analog waveform.

Text Compression: Two classes
- Symbol-wise methods
  - estimate the probability of occurrence of symbols
  - more accurate estimates => greater compression
- Dictionary methods
  - represent several symbols as one codeword
  - compression performance is based on the length of the codewords
  - replace words and other fragments of text with an index to an entry in a "dictionary"

Compression: Based on text model
[Slide diagram: the model feeds both the encoder and the decoder; the encoder turns the text into compressed text, which the decoder reconstructs.]
- The output code is determined by the probability distribution of the model.
- The number of bits used to encode a symbol s should equal its information content, I(s) = -log Pr(s).
- Recall information theory and entropy H, the average amount of information per symbol over an alphabet: H = Σ Pr(s) * I(s) = -Σ Pr(s) log Pr(s).
- The more probable a symbol is, the less information it provides (it is easier to predict); the less probable, the more information it provides.
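As a quick illustration of these quantities, here is a small sketch (not from the lecture; the symbol probabilities are hypothetical) that computes information content and entropy in bits.

```python
import math

def information_content(p):
    """I(s) = -log2 Pr(s): bits needed to encode a symbol of probability p."""
    return -math.log2(p)

def entropy(probs):
    """H = sum of Pr(s) * I(s) over the alphabet, in bits per symbol."""
    return sum(p * information_content(p) for p in probs.values() if p > 0)

# Hypothetical distribution for illustration:
probs = {"e": 0.5, "a": 0.25, "q": 0.125, "z": 0.125}
print(information_content(probs["e"]))  # 1.0 bit  (very probable, little information)
print(information_content(probs["z"]))  # 3.0 bits (rare, more information)
print(entropy(probs))                   # 1.75 bits per symbol on average
```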

Compression: Based on text model (continued)
- Symbol-wise methods (statistical)
  - may make an independence assumption between symbols
  - considering context achieves the best compression rates, e.g. the probability of seeing 'u' next is higher if we have just seen 'q'
  - must yield a probability distribution in which the probability of the symbol that actually occurs is very high
- Dictionary methods
  - string representations are typically based upon the text seen so far; most characters are coded as part of a string that has already occurred
  - space is saved if the reference or pointer takes fewer bits than the string it replaces

Statistical Models
- Static: the same model is used regardless of the text processed.
  - Problem: no single static model is good for all types of input; e.g. a model for English probably won't perform well on a file of numbers.
- Recall that entropy measures information content over a collection of symbols:
  - high Pr(s) => little information; low Pr(s) => much information
  - Pr(s) = 1 => only s is possible, so no encoding is needed
  - Pr(s) = 0 => s should never occur, so s cannot be encoded
- What is the implication for models in which some Pr(s) = 0, but s does in fact occur? Such an s cannot be encoded, so in practice all symbols must be assigned Pr(s) > 0.

Semi-static Statistical Models
- Generate the model on the fly for the file being processed.
- A first pass estimates the probabilities.
- The probabilities are transmitted to the decoder before the encoded transmission.
- Disadvantage: the model must be transmitted first.

Adaptive Statistical Models
- The model is constructed from the text just encoded: begin with a general model and modify it gradually as more text is seen.
- The best models take context into account.
- The decoder can generate the same model at each time step, since it has seen all characters up to that point.
- Advantages: effective without fine-tuning to a particular text, and only one pass is needed.

Adaptive Statistical Models (continued)
- What about the Pr(s) = 0 problem?
  - Recall that probabilities are estimated via occurrence counts; allow 1 extra count, divided evenly amongst the unseen symbols.
  - Assume 72,300 total occurrences and 5 unseen characters: each unseen character u gets
    Pr(u) = Pr(u | unseen) * Pr(next char is unseen) = 1/5 * 1/72301
- Adaptive models are not good for full-text retrieval: the text must be decoded from the beginning, so they do not support random access to files.
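A minimal sketch, assuming one simple way to implement the counting scheme above: the seen symbols share the occurrence counts, and a single extra count is split evenly among the still-unseen symbols. The function and its name are illustrative, not the lecture's code.

```python
from collections import Counter

def adaptive_probabilities(text_so_far, alphabet):
    """Estimate symbol probabilities from occurrence counts, reserving
    one extra count that is shared evenly by all still-unseen symbols."""
    counts = Counter(text_so_far)
    total = sum(counts.values())                  # e.g. 72,300 occurrences
    unseen = [s for s in alphabet if s not in counts]

    probs = {s: c / (total + 1) for s, c in counts.items()}
    for s in unseen:
        # Pr(u) = Pr(u | unseen) * Pr(next char is unseen)
        probs[s] = (1 / len(unseen)) * (1 / (total + 1))
    return probs

# With 72,300 occurrences and 5 unseen characters, each unseen
# character would receive 1/5 * 1/72301, as in the slide's example.
```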

Symbol-wise Methods
- Examples: Morse code and Huffman coding.
  - Common symbols are assigned short codes (few bits); rare symbols get longer codes.
- Huffman coding produces a prefix-free code: no codeword is a prefix of another symbol's codeword.

Huffman coding
- What string does 1010110111111110 decode to?
  - The decoder identifies 10 as the first codeword => e.
  - Decoding proceeds left to right on the remainder of the string, giving eefggf.
- Huffman coding is good when the probability distribution is static, works best when the model is word-based, and is typically used for compressing text in IR.
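A minimal sketch of left-to-right prefix-free decoding. The code table (e = 10, f = 110, g = 111) is an assumption inferred from the slide's answer; the lecture's full table is not shown in the transcript.

```python
# Hypothetical codewords consistent with the slide's answer "eefggf";
# the lecture's complete code table is not reproduced here.
CODES = {"10": "e", "110": "f", "111": "g"}

def decode_prefix_free(bits, codes):
    """Decode left to right: extend the current prefix until it matches
    a codeword, emit that symbol, and start a new prefix."""
    out, prefix = [], ""
    for b in bits:
        prefix += b
        if prefix in codes:      # prefix-free => the first match is the codeword
            out.append(codes[prefix])
            prefix = ""
    return "".join(out)

print(decode_prefix_free("1010110111111110", CODES))  # -> eefggf
```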

Huffman coding: create a decoding tree bottom-up
1. Each symbol and its probability becomes a leaf.
2. The 2 nodes with the smallest probabilities are joined under the same parent, with p = p(s1) + p(s2).
3. Repeat, ignoring nodes that already have parents, until all nodes are connected under a single root (p = 1.0).
4. Label the branches: each left branch gets 0, each right branch gets 1.
[Slide figure: an example tree over the symbols a (0.05), c (0.1), d (0.2), e (0.3), and b, f, g, built up to a root with probability 1.0; the remaining probabilities are not recoverable from the transcript.]
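A minimal sketch of the bottom-up construction using a priority queue; the probabilities passed in at the end are hypothetical stand-ins, since the slide's full distribution is not recoverable from the transcript.

```python
import heapq
import itertools

def huffman_codes(probs):
    """Build Huffman codewords by repeatedly merging the two
    lowest-probability nodes; left branches get '0', right branches '1'."""
    tie = itertools.count()                      # tie-breaker so heap tuples compare
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)        # smallest probability
        p2, _, right = heapq.heappop(heap)       # second smallest
        merged = {s: "0" + c for s, c in left.items()}   # left subtree gets 0
        merged.update({s: "1" + c for s, c in right.items()})  # right gets 1
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

# Hypothetical distribution (not the slide's exact one):
print(huffman_codes({"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
                     "e": 0.3, "f": 0.2, "g": 0.1}))
```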

Huffman coding
- Requires 2 passes over the document: one to gather statistics and build the coding table, and one to encode the document.
- Must explicitly store the coding table with the document, which eats into the space savings on short documents.
- Exploits only non-uniformity in the symbol distribution; adaptive algorithms can also recognize higher-order redundancy in strings, e.g. 0101010101...

Dictionary Methods
- Braille: special codes are used to represent whole words.
- Dictionary: a list of sub-strings with corresponding codewords.
- We do this naturally, e.g. replace "december" with "12".
- A codebook for ASCII might contain the 128 characters and 128 common letter pairs: at best each character is encoded in 4 bits rather than 7.
- Dictionaries can be static, semi-static, or adaptive (best).

Dictionary Methods: Ziv-Lempel
- The dictionary is built on the fly: strings are replaced with a reference to a previous occurrence.
- The codebook is all the text seen prior to the current position; codewords are represented by "pointers".
- Compression: the pointer is stored in fewer bits than the string it replaces.
- Especially useful when decoding resources must be minimized, e.g. 1 machine (server) distributes data to many.
- Relatively easy to implement, decoding is fast, and only a small amount of memory is required.

Ziv-Lempel
- Coded output: a series of triples <x, y, z>
  - x: how far back to look in the previously decoded text to find the next string
  - y: how long the string is
  - z: the next character from the input (only necessary if the character to be coded did not occur previously)
- Example: <0,0,a> <0,0,b> <2,1,a> <3,2,b> <5,3,b> <1,10,a> decodes to a b a a b a b a b b ..., with the final triple producing 10 b's followed by an a.
- It is common to use different representations for pointers: for the offset, shorter codewords are used for recent matches and longer ones for matches further back in the window; the match length can be represented with variable-length codes that use fewer bits for smaller numbers.
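The sketch below decodes such triples under one common convention: x is measured back from the current end of the output, y characters are copied (the copy may overlap the text being produced), and then z is appended. The exact convention is an assumption, so the decoded string may differ in detail from the slide's.

```python
def lz_decode(triples):
    """Decode <offset, length, char> triples: copy `length` characters
    starting `offset` back in the output, then append `char`."""
    out = []
    for offset, length, ch in triples:
        start = len(out) - offset
        for i in range(length):
            out.append(out[start + i])   # works even when the copy overlaps itself
        out.append(ch)
    return "".join(out)

triples = [(0, 0, "a"), (0, 0, "b"), (2, 1, "a"),
           (3, 2, "b"), (5, 3, "b"), (1, 10, "a")]
print(lz_decode(triples))   # the last triple yields a run of 10 b's then an a
```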

Accessing Compressed Text
- Store information identifying a document's location relative to the beginning of the file.
- Option 1: store a bit offset, since variable-length codes mean documents may not begin or end on byte boundaries.
- Option 2: insist that documents begin on byte boundaries; some documents will then have wasted bits at the end, so a method is needed to explicitly identify the end of the coded text.

Text Compression
- Key difference between symbol-wise and dictionary methods: symbol-wise methods base the coding of a symbol on its context, whereas dictionary methods group symbols together, creating an implicit context.
- Present techniques give compression of about 2 bits/character for general English text. To do better, techniques must leverage semantic content and external world knowledge.
- Rule of thumb: the greater the compression, the slower the program runs or the more memory it uses.
- Inverted lists contain skewed data, so general text compression methods are not appropriate for them.

Inverted File Compression
- Note: each inverted list can be stored as an ascending sequence of integers:
  <8; 2, 4, 9, 45, 57, 76, 78, 80>
- Replace every entry after the initial position with d-gaps (differences between successive positions):
  <8; 2, 2, 5, 36, 12, 19, 2, 2>
- On average, gaps are much smaller than the largest document number.
- Compression models describe the probability distribution of d-gap sizes.
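A small sketch of the d-gap transform applied to the slide's list; the function names are illustrative.

```python
def to_dgaps(postings):
    """Convert an ascending posting list to d-gaps (first entry kept as-is)."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_dgaps(gaps):
    """Rebuild the original posting list by cumulative summation."""
    postings, total = [], 0
    for g in gaps:
        total += g
        postings.append(total)
    return postings

postings = [2, 4, 9, 45, 57, 76, 78, 80]
print(to_dgaps(postings))                           # [2, 2, 5, 36, 12, 19, 2, 2]
print(from_dgaps(to_dgaps(postings)) == postings)   # True
```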

Fixed-Length Compression
- Typically an integer is stored in 4 bytes.
- To compress the d-gaps for <5; 1, 3, 7, 70, 250>, i.e. the gaps <5; 1, 2, 4, 63, 180>, use fewer bytes:
  - the two leftmost bits store the number of bytes used (1-4)
  - the d-gap is stored in the remaining 6, 14, 22, or 30 bits

Example
- The inverted list is: 1, 3, 7, 70, 250
- After computing gaps: 1, 2, 4, 63, 180
- The gaps 1, 2, 4, and 63 each fit in one byte (6 data bits), while 180 needs two bytes (14 data bits), so the number of bytes is reduced from 5*4 = 20 to 6.
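A sketch of the byte scheme described on the previous slide, assuming the two length bits hold the byte count minus one and share a big-endian layout with the value (the lecture does not pin down the exact bit layout).

```python
def encode_gap(x):
    """Encode x in 1-4 bytes: the two leftmost bits hold the byte count
    (assumed stored as count - 1), the remaining 6/14/22/30 bits hold x."""
    for nbytes, data_bits in ((1, 6), (2, 14), (3, 22), (4, 30)):
        if x < (1 << data_bits):
            value = ((nbytes - 1) << data_bits) | x
            return value.to_bytes(nbytes, "big")
    raise ValueError("gap too large for this scheme")

gaps = [1, 2, 4, 63, 180]
encoded = [encode_gap(g) for g in gaps]
print(sum(len(e) for e in encoded))   # 6 bytes instead of 5*4 = 20
```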

Gamma (γ) Code
- This method is superior to a plain binary encoding. Represent the number x in two parts:
  - a unary code that specifies the number of bits needed to code x, i.e. 1 + floor(log2 x), followed by
  - a binary code of floor(log2 x) bits that represents x - 2^floor(log2 x).
- Unary code: n is represented by (n-1) 1-bits followed by a 0, so 1 is represented by 0, 2 by 10, 3 by 110, etc.
- Example: let x = 9, so floor(log2 x) = 3 (since 2^3 = 8 <= 9 < 16 = 2^4).
  - Part 1: 1 + 3 = 4 => 1110 (unary)
  - Part 2: 9 - 8 = 1 => 001 (in 3 bits)
  - Code: 1110 001
- Decoding: extract a unary code ca, then treat the next ca - 1 bits as a binary code to get cb; then x = 2^(ca-1) + cb. For 1110 001: x = 2^(4-1) + 1 = 8 + 1 = 9.
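A small sketch of γ encoding and decoding as described above; the function names and the string representation of bits are mine, not the lecture's.

```python
def gamma_encode(x):
    """Elias gamma code: unary(1 + floor(log2 x)) followed by
    floor(log2 x) binary bits for x - 2**floor(log2 x)."""
    if x < 1:
        raise ValueError("gamma code is defined for integers >= 1")
    k = x.bit_length() - 1                  # floor(log2 x)
    unary = "1" * k + "0"                   # (k + 1) in unary
    binary = format(x - (1 << k), "b").zfill(k) if k else ""
    return unary + binary

def gamma_decode(bits):
    """Decode one gamma codeword at the front of `bits`;
    returns (value, remaining bits)."""
    k = bits.index("0")                     # number of leading 1-bits
    rest = bits[k + 1:]
    cb = int(rest[:k], 2) if k else 0
    return (1 << k) + cb, rest[k:]

print(gamma_encode(9))          # 1110001
print(gamma_decode("1110001"))  # (9, '')
```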

Example
- The inverted list is: 1, 3, 7, 70, 250
- After computing gaps: 1, 2, 4, 63, 180
- For each gap x, the unary part codes 1 + floor(log2 x) and the binary part codes x - 2^floor(log2 x) in floor(log2 x) bits.
- The number of bits is reduced from 5*32 = 160 to 1 + 3 + 5 + 11 + 15 = 35.
- To decode, let u be the value of the unary part and b the value of the binary part; then x = 2^(u-1) + b.
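Using the gamma_encode sketch from the previous slide, the bit counts for this example can be checked directly.

```python
gaps = [1, 2, 4, 63, 180]
codes = [gamma_encode(g) for g in gaps]   # gamma_encode defined in the sketch above
print([len(c) for c in codes])            # [1, 3, 5, 11, 15]
print(sum(len(c) for c in codes))         # 35 bits versus 5 * 32 = 160
```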