Huffman code and Lossless Decomposition Prof. Sin-Min Lee Department of Computer Science

Data Compression Data discussed so far have used a FIXED length for representation. For data transfer in particular, this method is inefficient. For speed and storage efficiency, data symbols should be represented with the minimum number of bits possible.

Data Compression Methods used for compression: –Encode high-probability symbols with fewer bits (Shannon-Fano, Huffman, UNIX compact) –Encode sequences of symbols by their location in a dictionary (PKZIP, ARC, GIF, UNIX compress, V.42bis) –Lossy compression (JPEG and MPEG)

Data Compression Average code length Rather than the lengths of individual code symbols or words, we want to characterize the behavior of the complete information source.

Data Compression Average code length Assume that the symbols of a source alphabet {a_1, a_2, …, a_M} are generated with probabilities p_1, p_2, …, p_M, i.e., P(a_i) = p_i (i = 1, 2, …, M). Assume that each symbol of the source alphabet is encoded with a code word of length l_1, l_2, …, l_M respectively.

Data Compression Average code length Then the average code length, L, of an information source is given by: L = Σ_{i=1}^{M} p_i l_i.
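
To make the formula concrete, here is a minimal Python sketch; the probabilities and code lengths in the example are made-up illustrative values, not taken from the slides.

def average_code_length(probs, lengths):
    # L = sum over all symbols of p_i * l_i
    return sum(p * l for p, l in zip(probs, lengths))

# Hypothetical example: four symbols with probabilities 1/2, 1/4, 1/8, 1/8
# encoded with code words of 1, 2, 3 and 3 bits respectively.
print(average_code_length([0.5, 0.25, 0.125, 0.125], [1, 2, 3, 3]))  # 1.75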

Data Compression Variable Length Bit Codings Rules: 1. Use the minimum number of bits, AND 2. No code is the prefix of another code, AND 3. Enable left-to-right, unambiguous decoding.

Data Compression Variable Length Bit Codings No code is a prefix of another –For example, we can’t have ‘A’ map to 10 and ‘B’ map to 100, because 10 is a prefix of (the start of) 100.

Data Compression Variable Length Bit Codings Enables left-to-right, unambiguous decoding –That is, if you see 10, you know it’s ‘A’, not the start of another character.
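
A small Python sketch of such left-to-right decoding; the two-symbol code below (‘A’ -> 10, ‘B’ -> 11) is our own illustration, chosen to be prefix-free, and is not the code built later in these slides.

def decode(bits, code):
    # Scan the bit string left to right, emitting a symbol as soon as the
    # accumulated bits match a code word. This is unambiguous only because
    # no code word is a prefix of another.
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in code:
            out.append(code[buf])
            buf = ""
    return "".join(out)

code = {"10": "A", "11": "B"}   # illustrative prefix-free code
print(decode("101110", code))   # -> ABA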

Data Compression Variable Length Bit Codings Suppose ‘A’ appears 50 times in a text, but ‘B’ appears only 10 times. ASCII coding assigns 8 bits per character, so the total bits for ‘A’ and ‘B’ is 60 * 8 = 480. If ‘A’ gets a 4-bit code and ‘B’ gets a 12-bit code, the total is 50 * 4 + 10 * 12 = 320.

Data Compression Variable Length Bit Codings Example: a table (not reproduced here) listing source symbols A, B, C, D, their probabilities P, and six candidate codes C1–C6. Average code length = 1.75.

Data Compression Variable Length Bit Codings Question: Is this the best that we can get?

Data Compression Huffman code –Constructed by using a code tree, but starting at the leaves –A compact code constructed using the binary Huffman code construction method

Data Compression Huffman code Algorithm ① Make a leaf node for each code symbol. Attach the generation probability of each symbol to its leaf node. ② Take the two remaining nodes with the smallest probabilities and connect them under a new node. Label one of the two branches 0 and the other 1. The probability of the new node is the sum of the probabilities of the two connected nodes. ③ If there is only one node left, the code construction is complete. If not, go back to ②.
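
The three steps map directly onto a priority queue. A minimal Python sketch, assuming the input is a dict of symbol probabilities (the function name and the tie-breaking counter are our additions):

import heapq
from itertools import count

def huffman_code(freqs):
    # Step 1: one leaf per symbol, carrying its probability.
    # Step 2: repeatedly merge the two lowest-probability nodes, prepending
    #         0 to the codes in one subtree and 1 to the codes in the other.
    # Step 3: stop when a single node (the root) remains.
    uid = count()   # tie-breaker so equal probabilities never compare the dicts
    heap = [(p, next(uid), {sym: ""}) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(uid), merged))
    return heap[0][2]

# Frequencies from the example on the following slides. The exact bit patterns
# depend on how 0/1 are assigned to the branches, but the code lengths
# (E: 1 bit; A, C, D: 3 bits; B, F: 4 bits) agree with the slides.
print(huffman_code({"A": .20, "B": .09, "C": .15, "D": .11, "E": .40, "F": .05}))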

Data Compression Huffman code Example Character (or symbol) frequencies: –A: 20% (.20) (e.g., ‘A’ occurs 20 times in a 100-character document, 1000 times in a 5000-character document, etc.) –B: 9% (.09) –C: 15% (.15) –D: 11% (.11) –E: 40% (.40) –F: 5% (.05) This also works if you use character counts. You must know the frequency of every character in the document.

Data Compression Huffman code Example Symbols and their associated frequencies: C .15, A .20, D .11, F .05, B .09, E .40. Now we combine the two least common symbols (those with the smallest frequencies) to make a new symbol string and corresponding frequency.

Data Compression Huffman code Example Here’s the result of combining symbols once: B .09 and F .05 merge into BF .14, leaving C .15, A .20, D .11, BF .14, E .40. Now repeat until you’ve combined all the symbols into a single string.

Data Compression Huffman code Example The fully combined tree: B .09 + F .05 → BF .14; BF .14 + D .11 → BFD .25; A .20 + C .15 → AC .35; BFD .25 + AC .35 → ABCDF .60; ABCDF .60 + E .40 → ABCDEF 1.0.

Data Compression Now assign 0s and 1s to each branch of the tree from the previous slide. Codes (reading from top to bottom): –A: 010 –B: 0000 –C: 011 –D: 001 –E: 1 –F: 0001 Note –None is a prefix of another. Average code length = ? (For these frequencies it works out to 2.34 bits per symbol.)
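
As a quick sanity check (our addition, not part of the slides), the average code length and the prefix property of the codes above can be verified directly:

codes = {"A": "010", "B": "0000", "C": "011", "D": "001", "E": "1", "F": "0001"}
freqs = {"A": .20, "B": .09, "C": .15, "D": .11, "E": .40, "F": .05}

# Average code length: sum of probability * code-word length over all symbols.
avg = sum(freqs[s] * len(c) for s, c in codes.items())
print(round(avg, 2))   # 2.34

# Prefix property: no code word is a prefix of another.
words = list(codes.values())
assert not any(a != b and b.startswith(a) for a in words for b in words)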

Data Compression Huffman code There is no unique Huffman code: –Assigning 0 and 1 to the branches is arbitrary. –If several nodes have the same probability, it doesn’t matter how they are connected. Every Huffman code has the same average code length!

Data Compression Huffman code Quiz: Symbols A, B, C, D, E, F are produced by an information source with probabilities 0.3, 0.4, 0.06, 0.1, 0.1, 0.04 respectively. What is the binary Huffman code? 1) A = 00, B = 1, C = 0110, D = 0100, E = 0101, F = … 2) A = 00, B = 1, C = 01000, D = 011, E = 0101, F = … 3) A = 11, B = 0, C = 10111, D = 100, E = 1010, F = 10110

Data Compression Huffman code Applied extensively: network data transfer, the MP3 audio format, the GIF image format, HDTV, modelling algorithms.

Loss-less Decompositions Definition: A decomposition of R into (R1, R2) is called lossless if, for every legal instance r of R: r = π_R1(r) ⋈ π_R2(r). In other words, projecting onto R1 and R2 and joining back results in the relation you started with. Rule: A decomposition of R into (R1, R2) is lossless iff R1 ∩ R2 → R1 or R1 ∩ R2 → R2 is in F+.
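
The rule reduces to an attribute-closure computation. A minimal Python sketch, assuming functional dependencies are given as (lhs, rhs) pairs of attribute sets; the function names are ours:

def closure(attrs, fds):
    # Closure of an attribute set under a list of FDs (lhs, rhs).
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_lossless(r1, r2, fds):
    # Binary decomposition (R1, R2) is lossless iff R1 ∩ R2 → R1 or R1 ∩ R2 → R2.
    common_closure = closure(r1 & r2, fds)
    return r1 <= common_closure or r2 <= common_closure

# R = (A, B, C) with A -> B, decomposed into R1 = (A, B) and R2 = (A, C):
print(is_lossless({"A", "B"}, {"A", "C"}, [({"A"}, {"B"})]))   # True, since A -> AB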

Exercise

Answer

Dependency-preserving Decompositions Is it easy to check whether the dependencies in F hold? It is okay as long as each dependency can be checked within a single table. Consider R = (A, B, C) and F = {A → B, B → C}. 1. Decompose into R1 = (A, B) and R2 = (A, C). Lossless? Yes. But this makes it hard to check B → C: the data is in multiple tables. 2. On the other hand, R1 = (A, B) and R2 = (B, C) is both lossless and dependency-preserving. Really? What about A → C? If we can check A → B and B → C, then A → C is implied.

Dependency-preserving Decompositions Definition: Consider a decomposition of R into R1, …, Rn. Let F_i be the set of dependencies in F+ that include only attributes in R_i. The decomposition is dependency preserving if (F_1 ∪ F_2 ∪ … ∪ F_n)+ = F+.
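
Computing F+ outright is expensive, so the usual test works FD by FD: for each X → Y in F, repeatedly enlarge X using only the attributes visible inside each R_i, and check that Y is eventually covered. A hedged Python sketch (the closure helper is repeated so the block stands alone; the names are ours):

def closure(attrs, fds):
    # Closure of an attribute set under FDs given as (lhs, rhs) pairs of sets.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def preserves_dependencies(schemas, fds):
    # True if every FD in fds can be checked using only the decomposed schemas.
    for lhs, rhs in fds:
        z = set(lhs)
        changed = True
        while changed:
            changed = False
            for ri in schemas:
                gained = (closure(z & ri, fds) & ri) - z
                if gained:
                    z |= gained
                    changed = True
        if not rhs <= z:
            return False
    return True

# R = (A, B, C) with F = {A -> B, B -> C}, as in the previous slide:
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(preserves_dependencies([{"A", "B"}, {"B", "C"}], fds))   # True
print(preserves_dependencies([{"A", "B"}, {"A", "C"}], fds))   # False: B -> C is lost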

Example: a decomposition that is lossless but not dependency-preserving. Why?

BCNF Given a relation schema R and a set of functional dependencies F, R is in BCNF (Boyce-Codd Normal Form) if, for every FD A → B, either: 1. A → B is trivial, or 2. A is a superkey of R. Why is BCNF good?

BCNF What if the schema is not in BCNF? Decompose (split) the schema into two pieces. Careful: you want the decomposition to be lossless.

Example

Achieving BCNF Schemas For each dependency A → B in F+, check whether A is a superkey (by using attribute closure). If not: Choose a dependency in F+ that breaks the BCNF rules, say A → B. Create R1 = A B. Create R2 = A (R – B – A), i.e., A together with the attributes of R that are in neither A nor B. Note that R1 ∩ R2 = A and A → AB (= R1), so this is a lossless decomposition. Repeat for R1 and R2, defining F1+ to be all dependencies in F+ that contain only attributes in R1, and similarly F2+.
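
A compact Python sketch of this procedure (our own, not the slides' code). To stay simple it enumerates candidate left-hand sides by brute force, which also catches dependencies that appear only in F+, such as AC → D in Example 2-1 below; this is fine for slide-sized schemas but exponential in general:

from itertools import combinations

def closure(attrs, fds):
    # Closure of an attribute set under FDs given as (lhs, rhs) pairs of sets.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_decompose(schema, fds):
    # Find a non-trivial FD X -> Y inside this schema whose left-hand side is
    # not a superkey of the schema; if one exists, split and recurse.
    for size in range(1, len(schema)):
        for lhs in combinations(sorted(schema), size):
            lhs = set(lhs)
            implied = closure(lhs, fds) & schema
            if implied > lhs and not schema <= implied:
                rhs = implied - lhs
                r1 = lhs | rhs                 # R1 = X Y
                r2 = (schema - rhs) | lhs      # R2 = X (R - Y - X)
                return bcnf_decompose(r1, fds) + bcnf_decompose(r2, fds)
    return [schema]                            # already in BCNF

# Example 2 from the slides: R = (A, B, C, D, E), F = {A -> B, BC -> D}.
fds = [({"A"}, {"B"}), ({"B", "C"}, {"D"})]
for r in bcnf_decompose({"A", "B", "C", "D", "E"}, fds):
    print(sorted(r))   # (A, B), (A, C, D), (A, C, E), as in Example 2-1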

Example 1 (decomposing on B → C) R = (A, B, C), F = {A → B, B → C}, candidate keys = {A}. BCNF? No: B → C violates it. R1 = (B, C), F1 = {B → C}, candidate keys = {B}, BCNF = true. R2 = (A, B), F2 = {A → B}, candidate keys = {A}, BCNF = true.

Example 2-1 (decomposing on A → B first) R = (A, B, C, D, E), F = {A → B, BC → D}, candidate keys = {ACE}. BCNF? Violated by A → B, BC → D, etc. R1 = (A, B), F1 = {A → B}, candidate keys = {A}, BCNF = true. R2 = (A, C, D, E), F2 = {AC → D}, candidate keys = {ACE}, BCNF = false (AC → D). (From A → B and BC → D, by pseudo-transitivity, AC → D.) Decomposing on AC → D: R3 = (A, C, D), F3 = {AC → D}, candidate keys = {AC}, BCNF = true. R4 = (A, C, E), F4 = {} (only trivial), candidate keys = {ACE}, BCNF = true. Dependency preservation? We can check A → B (R1) and AC → D (R3), but we lost BC → D, so this is not a dependency-preserving decomposition.

Example 2-2 (decomposing on BC → D first) R = (A, B, C, D, E), F = {A → B, BC → D}, candidate keys = {ACE}. BCNF? Violated by A → B, BC → D, etc. R1 = (B, C, D), F1 = {BC → D}, candidate keys = {BC}, BCNF = true. R2 = (B, C, A, E), F2 = {A → B}, candidate keys = {ACE}, BCNF = false (A → B). Decomposing on A → B: R3 = (A, B), F3 = {A → B}, candidate keys = {A}, BCNF = true. R4 = (A, C, E), F4 = {} (only trivial), candidate keys = {ACE}, BCNF = true. Dependency preservation? We can check BC → D (R1) and A → B (R3), so this is a dependency-preserving decomposition.

Example 3 (decomposing on A → BC first) R = (A, B, C, D, E, H), F = {A → BC, E → HA}, candidate keys = {DE}. BCNF? Violated by A → BC, etc. R1 = (A, B, C), F1 = {A → BC}, candidate keys = {A}, BCNF = true. R2 = (A, D, E, H), F2 = {E → HA}, candidate keys = {DE}, BCNF = false (E → HA). Decomposing on E → HA: R3 = (E, H, A), F3 = {E → HA}, candidate keys = {E}, BCNF = true. R4 = (D, E), F4 = {} (only trivial), candidate keys = {DE}, BCNF = true. Dependency preservation? We can check A → BC (R1) and E → HA (R3), so this is a dependency-preserving decomposition.