
CS157B Lecture 19: Huffman Trees and ID3. Prof. Sin-Min Lee, Department of Computer Science

Professor David A. Huffman (August 9, 1925 - October 7, 1999). Huffman coding is an algorithm for lossless data compression developed by David A. Huffman as a PhD student at MIT in 1952 and published in "A Method for the Construction of Minimum-Redundancy Codes". Huffman codes are widely used in applications that involve the compression and transmission of digital data, such as fax machines, modems, computer networks, and high-definition television (HDTV).

Motivation. The motivations for data compression are obvious: reducing the space required to store files on disk or tape, and reducing the time needed to transmit large files. Huffman coding typically yields savings of between 20% and 90%.

Basic idea: use a variable-length code table for encoding each source symbol (such as a character in a file), where the table has been derived in a particular way from the frequency of occurrence of each possible value of the source symbol.

Fixed-length codewords. Example: suppose you have a file with 100K characters. For simplicity, assume that there are only 6 distinct characters in the file, a through f, with the frequencies indicated below. We represent the file using a unique binary string for each character.

Character               a    b    c    d    e    f
Frequency (in 1000s)   45   13   12   16    9    5
Fixed-length codeword  000  001  010  011  100  101

Space = (45*3 + 13*3 + 12*3 + 16*3 + 9*3 + 5*3) * 1000 = 300K bits

Can we do better? YES! By using variable-length codes instead of fixed-length codes. Idea: give frequent characters short codewords and infrequent characters long codewords, i.e. the length of a character's codeword decreases as its frequency increases.

Character                  a    b    c    d     e     f
Frequency (in 1000s)      45   13   12   16     9     5
Fixed-length codeword    000  001  010  011   100   101
Variable-length codeword   0  101  100  111  1101  1100

Space = (45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4) * 1000 = 224K bits (savings = 25%)
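To double-check the space figures, here is a minimal Python sketch; the frequencies and codeword lengths are taken from the table above, and the variable names are my own:

```python
# Sanity check of the fixed- vs variable-length space calculation above.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}     # in thousands
fixed_len = {c: 3 for c in freq}                                # every fixed codeword is 3 bits
var_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}      # lengths of 0, 101, 100, 111, 1101, 1100

fixed_bits = sum(freq[c] * fixed_len[c] for c in freq)          # 300 (K bits)
var_bits = sum(freq[c] * var_len[c] for c in freq)              # 224 (K bits)
print(fixed_bits, var_bits, f"savings = {1 - var_bits / fixed_bits:.0%}")  # ~25%
```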

PREFIX CODES: codes in which no codeword is also a prefix of some other codeword ("prefix-free codes" would have been a more appropriate name). Our variable-length code is a prefix code: a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100. It is very easy to encode and decode using prefix codes. No ambiguity! It is possible to show (although we won't do so here) that the optimal data compression achievable by a character code can always be achieved with a prefix code, so there is no loss of generality in restricting attention to prefix codes.

Benefits of using prefix codes. Example: using the code above, FACE is encoded as 1100 0 100 1101 = 110001001101. To decode, we have to decide where each codeword begins and ends, since they are no longer all the same length. But this is easy, since no codewords share a prefix. This means we need only scan the input string from left to right, and as soon as we recognize a codeword, we can print the corresponding character and start looking for the next codeword. In the case above, the only codeword that begins "1100..." is the one for "f", so we can print "f" and start decoding "0100...", get "a", etc.

To see why the no-common-prefix property is essential, suppose that we encoded "e" with the shorter codeword "110", so that a = 0, b = 101, c = 100, d = 111, e = 110, f = 1100. Then FACE = 11000100110, and when we try to decode the leading "1100" we cannot tell whether it is 1100 = "f" or 1100 = 110 + 0 = "ea".
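The left-to-right scanning decoder described above is easy to write down. Here is a minimal sketch using the prefix code from the example; the function name decode and the dictionaries are my own:

```python
# Decode a bit string by scanning left to right and emitting a character
# as soon as the accumulated bits match a codeword (this works because no
# codeword is a prefix of another).
CODES = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}
DECODE = {bits: ch for ch, bits in CODES.items()}   # invert the code table

def decode(bitstring):
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        if buf in DECODE:            # recognized a complete codeword
            out.append(DECODE[buf])
            buf = ""                 # start looking for the next codeword
    if buf:
        raise ValueError("leftover bits: " + buf)
    return "".join(out)

print(decode("110001001101"))   # -> "face"
```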

Representation: the Huffman code is represented as a binary tree. Each edge represents either 0 or 1: 0 means "go to the left child", 1 means "go to the right child". Each leaf corresponds to the sequence of 0s and 1s traversed from the root to reach it, i.e. to a particular codeword. Since no prefix is shared, all legal codewords are at the leaves, and decoding a string means following edges, according to the sequence of 0s and 1s in the string, until a leaf is reached.

[Figure: the two code trees for the example. Left: the tree of the fixed-length code 000...101, with internal node frequencies 100, 86, 58, 28, 14. Right: the tree of the variable-length code a=0, b=101, c=100, d=111, e=1101, f=1100, with internal node frequencies 100, 55, 30, 25, 14. Edges to left children are labeled 0, edges to right children are labeled 1.]

Labeling: each leaf is labeled with the character it represents and the frequency with which that character appears in the text; each internal node is labeled with the frequency with which all the leaves under it appear in the text (i.e. the sum of their frequencies).

Optimal Code. An optimal code for a file is always represented by a full binary tree, in which every non-leaf node has two children. The fixed-length code in our example is not optimal, since its tree is not a full binary tree: there are codewords beginning 10..., but none beginning 11... Since we can now restrict our attention to full binary trees, we can say that if C is the alphabet from which the characters are drawn, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter of the alphabet, and exactly |C| - 1 internal nodes.

Given a tree T corresponding to a prefix code, it is a simple matter to compute the number of bits required to encode a file. For each character c in the alphabet C, let f(c) denote the frequency of c in the file and d_T(c) denote the depth of c's leaf in the tree (d_T(c) is also the length of the codeword for character c). The number of bits required to encode the file is thus

B(T) = Σ_{c ∈ C} f(c) · d_T(c),

which we define as the cost of the tree T.

Constructing a Huffman code. Huffman invented a greedy algorithm that constructs an optimal prefix code, called a Huffman code. The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner: it begins with a set of |C| leaves and performs a sequence of |C| - 1 "merging" operations to create the final tree. Greedy choice? The two smallest nodes are chosen at each step, and this local decision results in a globally optimal encoding tree. In general, greedy algorithms make small-grained, locally minimal/maximal choices in the hope of arriving at a global minimum/maximum.

HUFFMAN(C)
1  n ← |C|
2  Q ← C
3  for i ← 1 to n - 1
4      do ALLOCATE-NODE(z)
5         left[z] ← x ← EXTRACT-MIN(Q)
6         right[z] ← y ← EXTRACT-MIN(Q)
7         f[z] ← f[x] + f[y]
8         INSERT(Q, z)
9  return EXTRACT-MIN(Q)

C is a set of n characters, and each character c ∈ C is an object with a defined frequency f[c]. A min-priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge together. The result of merging two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged.

For our example, Huffman's algorithm proceeds as follows. Line 1 sets the initial queue size n = 6 (the number of letters in the alphabet), and line 2 initializes the min-priority queue Q with the characters in C (a through f). The for loop in lines 3-8 performs n - 1 (6 - 1 = 5) merge steps to build the tree: it repeatedly extracts the two nodes x and y of lowest frequency from the queue and replaces them with a new node z representing their merger. The frequency of z is computed in line 7 as the sum of the frequencies of x and y, and z gets x as its left child and y as its right child. After the mergers, the one node left in the queue, the root, is returned in line 9. The final tree represents the optimal prefix code; the codeword for a letter is the sequence of edge labels on the path from the root to that letter.
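The pseudocode translates almost line for line into Python if we use the standard library's heapq module as the min-priority queue. This is a sketch under that assumption; the names Node, build_huffman, and codebook are mine, not from the slides:

```python
import heapq

class Node:
    """A leaf (char is not None) or internal node of the code tree."""
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char, self.left, self.right = freq, char, left, right
    def __lt__(self, other):                 # ordering needed by heapq
        return self.freq < other.freq

def build_huffman(freqs):
    """Build the Huffman tree from a {char: frequency} map (lines 1-9 of HUFFMAN(C))."""
    heap = [Node(f, c) for c, f in freqs.items()]      # Q <- C
    heapq.heapify(heap)                                # BUILD-MIN-HEAP, O(n)
    for _ in range(len(freqs) - 1):                    # n - 1 merge steps
        x = heapq.heappop(heap)                        # x <- EXTRACT-MIN(Q)
        y = heapq.heappop(heap)                        # y <- EXTRACT-MIN(Q)
        heapq.heappush(heap, Node(x.freq + y.freq, left=x, right=y))
    return heap[0]                                     # the root

def codebook(node, prefix="", book=None):
    """Read codewords off the tree: 0 = left edge, 1 = right edge."""
    if book is None:
        book = {}
    if node.char is not None:                          # leaf
        book[node.char] = prefix or "0"
    else:
        codebook(node.left, prefix + "0", book)
        codebook(node.right, prefix + "1", book)
    return book

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}       # in thousands
codes = codebook(build_huffman(freqs))
print(codes)                                                       # e.g. a=0, b=101, ...
print(sum(freqs[c] * len(codes[c]) for c in freqs), "K bits")      # cost B(T) = 224
```

For this input the cost comes out to 224K bits, matching the example. The exact 0/1 labels can differ if the heap breaks ties differently, but the total cost is still optimal.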

[Figure: the steps of Huffman's algorithm on the example. Starting from the queue f:5, e:9, c:12, b:13, d:16, a:45, the algorithm merges f and e into a node of frequency 14, then c and b into 25, then 14 and d into 30, then 25 and 30 into 55, and finally a and 55 into the root of frequency 100.]

Running Time Analysis. The analysis of the running time of Huffman's algorithm assumes that Q is implemented as a binary min-heap. For a set C of n characters, the initialization of Q in line 2 can be performed in O(n) time using the BUILD-MIN-HEAP procedure. The for loop in lines 3-8 is executed exactly n - 1 times, and each heap operation requires O(log n) time, so the loop contributes (n - 1) * O(log n) = O(n log n). Thus, the total running time of HUFFMAN on a set of n characters is O(n) + O(n log n) = O(n log n).

Correctness of Huffman's algorithm To prove that the greedy algorithm HUFFMAN is correct, we show that the problem of determining an optimal prefix code exhibits the greedy-choice and optimal-substructure properties.

Lemma that shows that the greedy-choice property holds. Lemma: Let C be an alphabet in which each character c ∈ C has frequency f[c]. Let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit. Why? x and y must be at the bottom of the tree (they are the least frequent), and since the tree is full they can be made siblings, whose codewords differ only in the last bit. Proof: The idea of the proof is to take the tree T representing an arbitrary optimal prefix code and modify it to make a tree representing another optimal prefix code such that the characters x and y appear as sibling leaves of maximum depth in the new tree. If we can do this, then their codewords will have the same length and differ only in the last bit.

Proof: [Figure: tree T, with sibling leaves a and b at maximum depth and leaves x and y elsewhere, and tree T', obtained from T by exchanging a and x.] Let a and b be two characters that are sibling leaves of maximum depth in T. Without loss of generality, assume that f[a] ≤ f[b] and f[x] ≤ f[y]. Since f[x] and f[y] are the two lowest frequencies (in that order), and f[a] and f[b] are two arbitrary frequencies (in that order), we have f[x] ≤ f[a] and f[y] ≤ f[b]. Exchange the positions of a and x in T to produce T'. The difference in cost between T and T' is

B(T) - B(T') = Σ_{c ∈ C} f(c) d_T(c) - Σ_{c ∈ C} f(c) d_{T'}(c)
             = f[x] d_T(x) + f[a] d_T(a) - f[x] d_{T'}(x) - f[a] d_{T'}(a)
             = f[x] d_T(x) + f[a] d_T(a) - f[x] d_T(a) - f[a] d_T(x)
             = (f[a] - f[x]) (d_T(a) - d_T(x)) ≥ 0,

so the cost is not increased.

[Figure: trees T, T', and T'', where T'' is obtained from T' by also exchanging b and y.] Similarly, exchanging the positions of b and y in T' to produce T'' does not increase the cost, i.e. B(T') - B(T'') is non-negative. Therefore B(T'') ≤ B(T), and since T is optimal, B(T) ≤ B(T''), so B(T'') = B(T). Thus T'' is an optimal tree in which x and y appear as sibling leaves of maximum depth, from which the lemma follows.

Lemma that shows that the optimal-substructure property holds. Let C be a given alphabet with frequency f[c] defined for each character c ∈ C. Let x and y be two characters in C with minimum frequency. Let C' be the alphabet C with the characters x, y removed and a (new) character z added, so that C' = (C - {x, y}) ∪ {z}; define f for C' as for C, except that f[z] = f[x] + f[y]. Let T' be any tree representing an optimal prefix code for the alphabet C'. Then the tree T, obtained from T' by replacing the leaf node for z with an internal node having x and y as children, represents an optimal prefix code for the alphabet C. Proof: We first express B(T) in terms of B(T'). For each c ∈ C - {x, y} we have d_T(c) = d_{T'}(c), and hence f[c] d_T(c) = f[c] d_{T'}(c).

Since d_T(x) = d_T(y) = d_{T'}(z) + 1, we have

f[x] d_T(x) + f[y] d_T(y) = (f[x] + f[y]) (d_{T'}(z) + 1) = f[z] d_{T'}(z) + (f[x] + f[y]),

from which we conclude that B(T) = B(T') + (f[x] + f[y]), i.e. B(T') = B(T) - (f[x] + f[y]).

Proof by contradiction: Suppose that T does not represent an optimal prefix code for C. Then there exists a tree T'' such that B(T'') < B(T). Without loss of generality (by the previous lemma), T'' has x and y as siblings. Let T''' be the tree T'' with the common parent of x and y replaced by a leaf z with frequency f[z] = f[x] + f[y]. Then

B(T''') = B(T'') - (f[x] + f[y]) < B(T) - (f[x] + f[y]) = B(T'),

which contradicts the assumption that T' represents an optimal prefix code for C'. Thus, T must represent an optimal prefix code for the alphabet C.

Drawbacks The main disadvantage of Huffman’s method is that it makes two passes over the data: one pass to collect frequency counts of the letters in the message, followed by the construction of a Huffman tree and transmission of the tree to the receiver; and a second pass to encode and transmit the letters themselves, based on the static tree structure. This causes delay when used for network communication, and in file compression applications the extra disk accesses can slow down the algorithm. We need one-pass methods, in which letters are encoded “on the fly”.

ID3 algorithm. To get the fastest decision-making procedure, one has to arrange the attributes in a decision tree in the proper order: the most discriminating attributes first. This is done by the algorithm called ID3. The most discriminating attribute can be defined in precise terms as the attribute for which fixing its value reduces the entropy of the possible decisions the most. Let w_j be the frequency of the j-th decision in a set of examples x. Then the entropy of the set is

E(x) = -Σ_j w_j · log(w_j)

Let fix(x, a, v) denote the set of those elements of x whose value of attribute a is v. The average entropy that remains in x after the value of a has been fixed is

H(x, a) = Σ_v k_v · E(fix(x, a, v)),

where k_v is the fraction of examples in x whose attribute a has value v.

OK, now we want a quantitative way of seeing the effect of splitting the dataset on a particular attribute (which is part of the tree-building process). We can use a measure called Information Gain, which calculates the reduction in entropy (gain in information) that would result from splitting the data on an attribute A:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where v ranges over the values of A, S_v is the subset of instances of S where A takes the value v, |S_v| is the number of instances in S_v, and |S| is the total number of instances.

Continuing with our example dataset (let's name it S just for convenience), we can work out the information gain that splitting on the attribute District would give over the entire dataset. By calculating this value for each attribute that remains, we can see which attribute splits the data most purely. If we want to select an attribute for the root node, performing the above calculation for all attributes gives:

Gain(S, House Type) = 0.049 bits
Gain(S, Income) = 0.151 bits
Gain(S, Previous Customer) = 0.048 bits

We can clearly see that District results in the highest reduction in entropy, i.e. the highest information gain. We would therefore choose it at the root node, splitting the data up into subsets corresponding to the different values of the District attribute. With this node-evaluation technique we can proceed recursively through the subsets we create until leaf nodes have been reached throughout and all subsets are pure, with zero entropy. This is exactly how ID3 and other variants work.

If S is a collection of 14 examples with 9 YES and 5 NO examples, then

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Notice that entropy is 0 if all members of S belong to the same class (the data is perfectly classified). The range of entropy is 0 ("perfectly classified") to 1 ("totally random"). Gain(S, A), the information gain of example set S on attribute A, is defined as

Gain(S, A) = Entropy(S) - Σ_v ((|S_v| / |S|) * Entropy(S_v))

where the sum is over each value v of all possible values of attribute A, S_v is the subset of S for which attribute A has value v, |S_v| is the number of elements in S_v, and |S| is the number of elements in S.

Example 2. Suppose S is a set of 14 examples in which one of the attributes is wind speed. The values of Wind can be Weak or Strong. The classification of these 14 examples is 9 YES and 5 NO. For attribute Wind, suppose there are 8 occurrences of Wind = Weak and 6 occurrences of Wind = Strong. For Wind = Weak, 6 of the examples are YES and 2 are NO; for Wind = Strong, 3 are YES and 3 are NO. Then

Entropy(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
Entropy(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.00

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_weak) - (6/14) Entropy(S_strong)
              = 0.940 - (8/14)*0.811 - (6/14)*1.00 = 0.048

For each attribute the gain is calculated, and the attribute with the highest gain is used in the decision node.
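The entropy and gain arithmetic above is easy to reproduce. A small sketch follows; entropy and gain are my own helper names, and the class counts are the ones stated in the example:

```python
import math

def entropy(counts):
    """Entropy of a collection given its class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, splits):
    """Information gain; splits is one class-count list per attribute value."""
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * entropy(s) for s in splits)
    return entropy(parent_counts) - remainder

print(entropy([9, 5]))                   # Entropy(S)        ~ 0.940
print(entropy([6, 2]), entropy([3, 3]))  # S_weak, S_strong  ~ 0.811, 1.0
print(gain([9, 5], [[6, 2], [3, 3]]))    # Gain(S, Wind)     ~ 0.048
```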

Decision Tree Construction Algorithm (pseudo-code):

Input: a data set S
Output: a decision tree

1. If all the instances have the same value for the target attribute, return a decision tree that is simply this value (not really a tree, more of a stump).
2. Else:
   a. Compute Gain values (see above) for all attributes and select the attribute with the highest gain; create a node for that attribute.
   b. Make a branch from this node for every value of the attribute, assigning the corresponding value to each branch.
   c. Follow each branch by partitioning the dataset to only those instances for which the branch's value is present, and then go back to step 1.
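Here is a minimal recursive ID3 sketch in Python that follows the pseudocode above. The data representation (a list of dicts with a target key) and the helper names are my own assumptions, not part of the original slides:

```python
import math
from collections import Counter

def entropy_of(rows, target):
    counts = Counter(r[target] for r in rows).values()
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)

def info_gain(rows, attr, target):
    total = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == v]
        remainder += len(subset) / total * entropy_of(subset, target)
    return entropy_of(rows, target) - remainder

def id3(rows, attrs, target):
    classes = set(r[target] for r in rows)
    if len(classes) == 1:                          # step 1: pure subset -> leaf ("stump")
        return classes.pop()
    if not attrs:                                  # no attributes left -> majority class
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))   # highest gain
    tree = {best: {}}
    for v in set(r[best] for r in rows):           # step 2: one branch per attribute value
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = id3(subset, [a for a in attrs if a != best], target)
    return tree
```

Calling id3(data, ["Outlook", "Temperature", "Humidity", "Windy"], "Play?") on the weather data would produce a nested-dict version of the tree discussed below.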

Decision Tree Example: the weather dataset, with attributes Outlook (sunny, overcast, rain), Temperature (hot, mild, cool), Humidity (high, normal), Windy (true, false), and the class attribute Play? (Yes / No). [Table of training instances omitted.]

[Figure: the decision tree for the weather data. Outlook is tested at the root: Outlook = overcast -> Yes; Outlook = sunny -> test Humidity (high -> No, normal -> Yes); Outlook = rain -> test Windy (true -> No, false -> Yes).]

Which Attributes to Select?

A Criterion for Attribute Selection. Which is the best attribute? The one that will result in the smallest tree. Heuristic: choose the attribute that produces the "purest" nodes (for the weather data, for example, the Outlook = overcast branch is pure: all Yes).

Information gain = (information before split) - (information after split). The information gain is computed for each attribute of the weather data to decide which attribute to split on first.

Continuing to Split

The Final Decision Tree Splitting stops when data can’t be split any further

Person   Hair Length  Weight  Age  Class
Homer    0"           250     36   M
Marge    10"          150     34   F
Bart     2"           90      10   M
Lisa     6"           78      8    F
Maggie   4"           20      1    F
Abe      1"           170     70   M
Selma    8"           160     41   F
Otto                  180     38   M
Krusty                200     45   M
Comic    8"           290     38   ?

Let us try splitting on Hair Length (Hair Length <= 5?).

Entropy of the whole set (4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911
Entropy(yes branch: 1F, 3M) = -(1/4) log2(1/4) - (3/4) log2(3/4) = 0.8113
Entropy(no branch: 3F, 2M) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.9710

Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911

Let us try splitting on Weight (Weight <= 160?).

Entropy of the whole set (4F, 5M) = 0.9911 (as before)
Entropy(yes branch: 4F, 1M) = -(4/5) log2(4/5) - (1/5) log2(1/5) = 0.7219
Entropy(no branch: 0F, 4M) = -(0/4) log2(0/4) - (4/4) log2(4/4) = 0

Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900

Let us try splitting on Age (Age <= 40?).

Entropy of the whole set (4F, 5M) = 0.9911 (as before)
Entropy(yes branch: 3F, 3M) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
Entropy(no branch: 1F, 2M) = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.9183

Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
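The three gains can be recomputed directly from the class counts stated above (H is a small entropy helper of my own):

```python
import math

def H(*counts):                        # entropy from class counts, skipping zero counts
    t = sum(counts)
    return -sum(c / t * math.log2(c / t) for c in counts if c)

parent = H(4, 5)                                      # 0.9911
print(parent - (4/9) * H(1, 3) - (5/9) * H(3, 2))     # Gain(Hair Length <= 5) ~ 0.0911
print(parent - (5/9) * H(4, 1) - (4/9) * H(0, 4))     # Gain(Weight <= 160)    ~ 0.5900
print(parent - (6/9) * H(3, 3) - (3/9) * H(1, 2))     # Gain(Age <= 40)        ~ 0.0183
```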

Of the three features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified, so we simply recurse on that subset. This time we find that we can split on Hair Length (Hair Length <= 2?), and we are done!

We don't need to keep the data around, just the test conditions: first test Weight <= 160? (no -> Male); if yes, test Hair Length <= 2? (yes -> Male, no -> Female). How would new people be classified?

It is trivial to convert decision trees to rules. Reading the tree above off gives:

Rules to Classify Males/Females:
If Weight greater than 160, classify as Male
Else if Hair Length less than or equal to 2, classify as Male
Else classify as Female
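Written as code, the rule set is just a chain of conditionals (classify is my own name for it):

```python
def classify(weight, hair_length):
    if weight > 160:            # everyone over 160 is classified as Male
        return "Male"
    elif hair_length <= 2:      # short hair in the under-160 group -> Male
        return "Male"
    else:
        return "Female"

print(classify(weight=290, hair_length=8))   # the unlabeled "Comic" row -> "Male"
```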

Once we have learned the decision tree, we don't even need a computer! This decision tree is attached to a medical machine, and is designed to help nurses make decisions about what type of doctor to call. [Figure: decision tree for a typical shared-care setting, applying the system for the diagnosis of prostatic obstructions.]

The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data. When you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets. For example, the rule "Wears green?" (Yes -> Female, No -> Male) perfectly classifies the data, but so does "Mother's name is Jacqueline?", and so does "Has blue shoes?".