Huffman code and ID3 Prof. Sin-Min Lee Department of Computer Science.

Data Compression The data discussed so far have used a FIXED length for representation. For data transfer in particular, this method is inefficient. For speed and storage efficiency, data symbols should be represented with the minimum number of bits possible.

Data Compression Methods used for compression:
–Encode high-probability symbols with fewer bits (Shannon-Fano, Huffman, UNIX compact)
–Encode sequences of symbols by their location in a dictionary (PKZIP, ARC, GIF, UNIX compress, V.42bis)
–Lossy compression (JPEG and MPEG)

Data Compression Average code length Instead of the lengths of individual code symbols or words, we want to know the behavior of the complete information source.

Data Compression Average code length Assume that the symbols of a source alphabet {a1, a2, …, aM} are generated with probabilities p1, p2, …, pM, i.e. P(ai) = pi for i = 1, 2, …, M. Assume that each symbol of the source alphabet is encoded with a code of length l1, l2, …, lM respectively.

Data Compression Average code length Then the average code length, L, of the information source is given by: L = p1 l1 + p2 l2 + … + pM lM, i.e. L = Σ pi li, summed over i = 1 to M.
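As a minimal illustration of this definition (the numbers here are ours, chosen only so the weighted sum is easy to check by hand):

```python
# Average code length: L = p1*l1 + p2*l2 + ... + pM*lM
def average_code_length(probabilities, lengths):
    """Expected number of bits per symbol for the given code lengths."""
    return sum(p * l for p, l in zip(probabilities, lengths))

# Illustrative source: four symbols with probabilities 0.5, 0.25, 0.125, 0.125
# encoded with 1, 2, 3 and 3 bits respectively gives L = 1.75 bits/symbol.
print(average_code_length([0.5, 0.25, 0.125, 0.125], [1, 2, 3, 3]))  # 1.75
```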

Data Compression Variable Length Bit Codings Rules:
1. Use the minimum number of bits, AND
2. No code is the prefix of another code, AND
3. Enable left-to-right, unambiguous decoding.

Data Compression Variable Length Bit Codings No code is a prefix of another –For example, we can't have 'A' map to 10 and 'B' map to 100, because 10 is a prefix (the start) of 100.

Data Compression Variable Length Bit Codings Enables left-to-right, unambiguous decoding –That is, if you see 10, you know it's 'A', not the start of another character.
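A small sketch of why the prefix property gives unambiguous left-to-right decoding; the code table here is a hypothetical one, not the code from these slides:

```python
# Decode a bit string left to right against a prefix-free code table.
# Because no codeword is a prefix of another, the first complete match
# is always the right one, so no backtracking is needed.
def decode(bits, code_table):
    inverse = {code: symbol for symbol, code in code_table.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:   # a full codeword has been read
            decoded.append(inverse[current])
            current = ""
    return "".join(decoded)

# Hypothetical prefix-free code: 'A' -> 10, 'B' -> 110, 'C' -> 0
print(decode("100110", {"A": "10", "B": "110", "C": "0"}))  # prints 'ACB'
```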

Data Compression Variable Length Bit Codings Suppose 'A' appears 50 times in a text, but 'B' appears only 10 times. ASCII coding assigns 8 bits per character, so the total bits for 'A' and 'B' is 60 * 8 = 480. If 'A' gets a 4-bit code and 'B' gets a 12-bit code, the total is 50 * 4 + 10 * 12 = 320.

Data Compression Variable Length Bit Codings Example: [table of source symbols A, B, C, D with their probabilities P and candidate codes C1 through C6; the values are not preserved in the transcript] Average code length = 1.75

Data Compression Variable Length Bit Codings Question: Is this the best that we can get?

Data Compression Huffman code
–Constructed by using a code tree, but starting at the leaves
–A compact code constructed using the binary Huffman code construction method

Data Compression Huffman code Algorithm
1. Make a leaf node for each code symbol. Attach the generation probability of each symbol to its leaf node.
2. Take the two nodes with the smallest probabilities and connect them into a new node. Add 1 or 0 to each of the two branches. The probability of the new node is the sum of the probabilities of the two connected nodes.
3. If there is only one node left, the code construction is complete. If not, go back to step 2.
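A minimal Python sketch of these three steps, using a heap to find the two smallest-probability nodes. The frequencies are the ones from the example on the following slides; the exact 0/1 labels may differ from the slides, since the assignment of 0 and 1 to branches is arbitrary:

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code from a dict mapping symbol -> probability."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    # Step 1: one leaf per symbol, carrying its probability.
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probabilities.items()]
    heapq.heapify(heap)
    # Steps 2-3: repeatedly merge the two smallest nodes until one remains.
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"A": 0.20, "B": 0.09, "C": 0.15, "D": 0.11, "E": 0.40, "F": 0.05}
print(huffman_code(freqs))  # e.g. E gets a 1-bit code, B and F get 4-bit codes
```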

Data Compression Huffman code Example Character (or symbol) frequencies:
–A: 20% (.20), e.g., 'A' occurs 20 times in a 100-character document, 1000 times in a 5000-character document, etc.
–B: 9% (.09)
–C: 15% (.15)
–D: 11% (.11)
–E: 40% (.40)
–F: 5% (.05)
This also works if you use character counts. You must know the frequency of every character in the document.

Data Compression Huffman code Example Symbols and their associated frequencies: C .15, A .20, D .11, F .05, B .09, E .40. Now we combine the two least common symbols (those with the smallest frequencies) to make a new symbol string and corresponding frequency.

Data Compression Huffman code Example Here is the result of combining symbols once: B .09 and F .05 merge into BF .14, leaving C .15, A .20, D .11, BF .14, E .40. Now repeat until you've combined all the symbols into a single string.

Data Compression Huffman code Example Continuing the merges: BF .14 and D .11 combine into BFD .25; A .20 and C .15 combine into AC .35; BFD .25 and AC .35 combine into ABCDF .60; finally, ABCDF .60 and E .40 combine into ABCDEF 1.0.

Data Compression Now assign 0s/1s to each branch. Codes (reading from top to bottom):
–A: 010
–B: 0000
–C: 011
–D: 001
–E: 1
–F: 0001
Note that none are prefixes of another. Average code length = ?
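Working the question out from the frequencies and code lengths above: L = (.20)(3) + (.09)(4) + (.15)(3) + (.11)(3) + (.40)(1) + (.05)(4) = 2.34 bits per symbol, compared with 3 bits per symbol for a fixed-length code over six symbols.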

Data Compression Huffman code There is no unique Huffman code:
–Assigning 0 and 1 to the branches is arbitrary.
–If there are multiple nodes with the same probability, it doesn't matter how they are connected.
Every Huffman code for a given source has the same average code length!

Data Compression Huffman code Quiz: Symbols A, B, C, D, E, F are produced by the information source with probabilities 0.3, 0.4, 0.06, 0.1, 0.1, 0.04 respectively. What is the binary Huffman code?
1) A = 00, B = 1, C = 0110, D = 0100, E = 0101, F = …
2) A = 00, B = 1, C = 01000, D = 011, E = 0101, F = …
3) A = 11, B = 0, C = 10111, D = 100, E = 1010, F = 10110

Data Compression Huffman code Applied extensively in:
–Network data transfer
–The MP3 audio format
–The GIF image format
–HDTV
–Modelling algorithms

A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.

One measure is the amount of information provided by the attribute. Example: Suppose you are going to bet $1 on the flip of a coin. With a normal coin, you might be willing to pay up to $1 for advance knowledge of the outcome. However, if the coin were rigged so that heads comes up 99% of the time, you would bet heads with an expected value of $0.98. So, in the case of the rigged coin, you would be willing to pay less than $0.02 for advance knowledge of the result. The less you know, the more valuable the information is. Information theory uses this intuition.

We measure the entropy of a dataset S, with respect to one attribute (in this case the target attribute), with the following calculation: Entropy(S) = - Σ pi log2(pi), where pi is the proportion of instances in the dataset that take the ith value of the target attribute. This probability measure gives us an indication of how uncertain we are about the data. We use a log2 measure because it represents how many bits we would need in order to specify what the class (the value of the target attribute) of a random instance is.

Entropy Calculations If we have a set with k different values in it, we can calculate the entropy as follows: Entropy(Set) = - Σ P(value_i) log2(P(value_i)), summed over i = 1 to k, where P(value_i) is the probability of getting the ith value when randomly selecting one element from the set. So, for the set R = {a,a,a,b,b,b,b,b}, P(a) = 3/8 and P(b) = 5/8.
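A minimal sketch of this calculation in Python (the helper name is ours); applying it to the set R above gives roughly 0.954 bits:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy of a multiset: -sum of P(value_i) * log2(P(value_i))."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

# R = {a,a,a,b,b,b,b,b}: P(a) = 3/8, P(b) = 5/8
R = list("aaabbbbb")
print(entropy(R))  # about 0.954
```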

Using the example of the marketing data, we know that there are two classes in the data, and so we use the fraction that each class represents in an entropy calculation: Entropy(S = [9/14 responses, 5/14 no responses]) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940 bits

The initial decision tree is a single node containing all examples. There are 4 +ve examples and 3 -ve examples, i.e. the probability of +ve is 4/7 = 0.57 and the probability of -ve is 3/7 = 0.43. The entropy is: - (0.57 * log2 0.57) - (0.43 * log2 0.43) = 0.99

Evaluate possible ways of splitting. Try a split on size, which has three values: large, medium and small. There are four instances with size = large: two large positive examples and two large negative examples. The probability of +ve is 0.5, so the entropy is: - (0.5 * log2 0.5) - (0.5 * log2 0.5) = 1

There is one small +ve and one small -ve, so the entropy is: - (0.5 * log2 0.5) - (0.5 * log2 0.5) = 1. There is only one medium +ve and no medium -ves, so the entropy is 0. The expected information for a split on size is: (4/7)(1) + (2/7)(1) + (1/7)(0) = 0.86. The expected information gain is: 0.99 - 0.86 = 0.13
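A quick check of these numbers (a sketch using the counts given above: the parent node has 4 positive and 3 negative examples, and the size split produces large (2+, 2-), small (1+, 1-) and medium (1+, 0-) subsets):

```python
from math import log2

def binary_entropy(pos, neg):
    """Entropy of a node with pos positive and neg negative examples."""
    total = pos + neg
    return -sum((n / total) * log2(n / total) for n in (pos, neg) if n)

parent = binary_entropy(4, 3)                      # about 0.99
subsets = [(2, 2), (1, 1), (1, 0)]                 # large, small, medium
expected = sum((p + n) / 7 * binary_entropy(p, n) for p, n in subsets)
print(round(parent - expected, 2))                 # information gain: 0.13
```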

Now try splitting on colour and shape. Colour has an information gain of 0.52 and shape has an information gain of 0.7, so split on shape. Repeat for all subtrees.

How do we construct the decision tree? Basic algorithm (a greedy algorithm):
–The tree is constructed in a top-down recursive divide-and-conquer manner.
–At the start, all the training examples are at the root.
–Attributes are categorical (if continuous-valued, they can be discretized in advance).
–Examples are partitioned recursively based on selected attributes.
–Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Information gain = (information before split) - (information after split)

Conditions for stopping partitioning:
–All samples for a given node belong to the same class.
–There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
–There are no samples left.

The basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner. The basic strategy is as follows. The tree STARTS as a single node representing the whole training dataset (all samples). IF the samples are ALL in the same class, THEN the node becomes a LEAF and is labeled with that class. OTHERWISE, the algorithm uses an entropy-based measure known as information gain as a heuristic for selecting the ATTRIBUTE that will best separate the samples into individual classes; this attribute becomes the node name (the test, or tree-split, decision attribute). Select the attribute with the highest information gain (information gain is the expected reduction in entropy).

A branch is created for each value of the node attribute (and is labeled by this value; this is syntax), and the samples are partitioned accordingly (this is semantics; see the example which follows). The algorithm uses the same process recursively to form a decision tree at each partition. Once an attribute has occurred at a node, it need not be considered in any of that node's descendants. The recursive partitioning STOPS only when any one of the following conditions is true:
–All samples for the given node belong to the same class; or
–There are no remaining attributes on which the samples may be further partitioned. In this case we convert the node into a LEAF and label it with the class in the majority among its samples (majority voting); or
–There are no samples left, in which case a leaf is created and labeled by majority vote over the training samples.
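A compact sketch of this recursive procedure in Python; the dataset representation (each example as a pair of an attribute dictionary and a class label) and the helper names are ours, not from the slides:

```python
from collections import Counter
from math import log2

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(examples, attribute):
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex, _ in examples}:
        subset = [(ex, lbl) for ex, lbl in examples if ex[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:              # all samples in the same class
        return labels[0]
    if not attributes:                     # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {}
    # Branch on each observed value of the best attribute and recurse;
    # the attribute is not reused in any of this node's descendants.
    for value in {ex[best] for ex, _ in examples}:
        subset = [(ex, lbl) for ex, lbl in examples if ex[best] == value]
        tree[value] = id3(subset, [a for a in attributes if a != best])
    return {best: tree}
```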

Information Gain as a Splitting Criterion

Information Gain Computation (ID3/C4.5): Case of Two Classes