1 Decision Trees
Greg Grudic (notes borrowed from Thomas G. Dietterich and Tom Mitchell) [Edited by J. Wiebe]

2 Outline
- Decision tree representations
  - ID3 and C4.5 learning algorithms (Quinlan, 1986)
  - CART learning algorithm (Breiman et al., 1984)
- Entropy, information gain
- Overfitting

3 Training Data Example: Goal is to Predict When This Player Will Play Tennis

8 Learning Algorithm for Decision Trees
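The learning algorithm itself appears on the slide as an image. As a rough sketch of the greedy, ID3-style procedure it describes (function and variable names below are illustrative, not from the slides; the information-gain criterion is defined on the following slides):

    import math
    from collections import Counter

    def entropy(labels):
        # H(S) = -sum_i p_i * log2(p_i) over the class labels in S
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
        n = len(labels)
        gain = entropy(labels)
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            gain -= (len(subset) / n) * entropy(subset)
        return gain

    def build_tree(rows, labels, attrs):
        # Stop when the node is pure or no attributes remain: return the majority class.
        if len(set(labels)) == 1 or not attrs:
            return Counter(labels).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        tree = {}
        for value in set(row[best] for row in rows):
            idx = [i for i, row in enumerate(rows) if row[best] == value]
            tree[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
        return (best, tree)

Calling build_tree(rows, labels, attribute_names) on rows represented as dicts returns either a class label (a leaf) or an (attribute, {value: subtree}) pair.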

9 Choosing the Best Attribute
- Many different frameworks for choosing BEST have been proposed!
- We will look at entropy gain.
- Number of + and – examples before and after a split; A1 and A2 are “attributes” (i.e. features or inputs).

10 Entropy
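The formula on this slide is an image; for a two-class sample S with proportions p_+ and p_-, the standard definition (and its general k-class form) is:

    H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-
    H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i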

11 Entropy is like a measure of purity…

12 Entropy

13 Entropy & Bits
- You are watching a set of independent random samples of X.
- X has 4 possible values: P(X=A) = 1/4, P(X=B) = 1/4, P(X=C) = 1/4, P(X=D) = 1/4.
- You get a string of symbols: ACBABBCDADDC…
- To transmit the data over a binary link you can encode each symbol with two bits (A=00, B=01, C=10, D=11).
- You need 2 bits per symbol.

14 Fewer Bits – Example
- Now someone tells you the probabilities are not equal: P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8.
- Now it is possible to find a coding that uses 1.75 bits per symbol on average.
- Use more bits for the less probable symbols; we expect those to appear less often.

    >>> -1 * logBaseB(2, 0.5)      # code length for A; logBaseB(b, x) is the log of x in base b
    1.0
    >>> -1 * logBaseB(2, 0.25)     # code length for B
    2.0
    >>> -1 * logBaseB(2, 1/8.0)    # code length for C (and for D)
    3.0
    >>> 0.5 * 1 + 0.25 * 2 + 1/8.0 * 3 + 1/8.0 * 3   # expected bits per symbol
    1.75
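A quick check (not from the slides) that 1.75 bits is exactly the entropy of this distribution:

    import math

    probs = [1/2, 1/4, 1/8, 1/8]
    H = -sum(p * math.log2(p) for p in probs)
    print(H)   # 1.75, matching the expected code length computed above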

15 Reality
- Of course, we can't use partial bits, so the specific numbers above are theoretical only.
- Common encoding method: Huffman coding (from a 1951 class project at MIT!).

In 1951 David A. Huffman and his classmates in an electrical engineering graduate course on information theory were given the choice of a term paper or a final exam. For the term paper, Huffman’s professor, Robert M. Fano, had assigned what at first appeared to be a simple problem. Students were asked to find the most efficient method of representing numbers, letters or other symbols using a binary code. Besides being a nimble intellectual exercise, finding such a code would enable information to be compressed for transmission over a computer network or for storage in a computer’s memory. Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. “It was the most singular moment of my life,” Huffman says. “There was the absolute lightning of sudden realization.” Huffman says he might never have tried his hand at the problem—much less solved it at the age of 25—if he had known that Fano, his professor, and Claude E. Shannon, the creator of information theory, had struggled with it. “It was my luck to be there at the right time and also not have my professor discourage me by telling me that other good people had struggled with this problem,” he says.

- An optimal encoding exists whose expected code length L* satisfies H(X) ≤ L* < H(X) + 1.

16 A Simple Example
- Suppose we have a message over an alphabet of 5 symbols, e.g. [ ►♣♣♠☻►♣☼►☻ ].
- How can we code this message using 0/1 so the coded message has minimum length (for transmission or saving)?
- 5 symbols → at least 3 bits per symbol for a fixed-length code.
- For this simple encoding, the length of the coded message is 10 * 3 = 30 bits.

17 A Simple Example – cont.
- Intuition: symbols that are more frequent should have shorter codes; yet since the code lengths are not all the same, there must be a way of distinguishing each code.
- With a Huffman code, the length of the encoded message ►♣♣♠☻►♣☼►☻ is 3*2 + 3*2 + 2*2 + 3 + 3 = 22 bits.

18 Huffman Coding
- We won't cover the algorithm here (perhaps you covered it in a systems course?); a small sketch is included below for reference.
- This was to give you an idea.
- Information theory comes up in many (all?) areas of CS.
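Although the deck skips the algorithm, a minimal heapq-based sketch can verify the 22-bit total from the example above. This is an illustrative addition, not code from the slides; huffman_code_lengths is a made-up helper name.

    import heapq
    from collections import Counter

    def huffman_code_lengths(freqs):
        # freqs: dict symbol -> count. Returns dict symbol -> code length in bits.
        # Repeatedly merge the two lightest subtrees; every merge adds one bit to
        # the code length of each symbol inside the merged subtrees.
        heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
        heapq.heapify(heap)
        lengths = {sym: 0 for sym in freqs}
        tie = len(heap)  # tie-breaker so tuples never compare the symbol lists
        while len(heap) > 1:
            c1, _, syms1 = heapq.heappop(heap)
            c2, _, syms2 = heapq.heappop(heap)
            for s in syms1 + syms2:
                lengths[s] += 1
            tie += 1
            heapq.heappush(heap, (c1 + c2, tie, syms1 + syms2))
        return lengths

    msg = "►♣♣♠☻►♣☼►☻"                 # the 10-symbol message from the slide
    freqs = Counter(msg)
    lengths = huffman_code_lengths(freqs)
    print(sum(freqs[s] * lengths[s] for s in freqs))   # 22 bits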

19 Information Gain
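The gain formula on this slide is an image; the standard definition it refers to (with S_v the subset of S for which attribute A takes value v) is:

    Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, H(S_v)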

20 Training Example

21 Selecting the Next Attribute

23 Non-Boolean Features
- Features with multiple discrete values: multi-way splits.
- Real-valued features: use thresholds (a small sketch of scoring a threshold split follows below).
- Regression: Segaran considers variance from the mean.

    mean = sum(data) / len(data)
    return sum([(d - mean) ** 2 for d in data]) / len(data)

- Idea: a high variance means the numbers are widely dispersed; a low variance means the numbers are close together.
- We'll look at how this is used in his code later.
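As an illustration (not Segaran's code; variance_reduction is a made-up helper), one way a threshold on a real-valued feature can be scored is by the drop in variance of the target values:

    def variance(data):
        if not data:
            return 0.0
        mean = sum(data) / len(data)
        return sum((d - mean) ** 2 for d in data) / len(data)

    def variance_reduction(xs, ys, threshold):
        # Split the targets ys by whether the feature value reaches the threshold
        # and measure how much the size-weighted variance drops.
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        n = len(ys)
        weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
        return variance(ys) - weighted

    # Pick the candidate threshold with the largest reduction.
    xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
    best = max(set(xs), key=lambda t: variance_reduction(xs, ys, t))
    print(best, variance_reduction(xs, ys, best))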

24 Hypothesis Space Search
- You do not get the globally optimal tree!
- The search space is exponential.

25 Overfitting

26 Overfitting in Decision Trees

27 Development Data Is Used to Control Overfitting
- Prune the tree to reduce error on a validation set.
- Segaran: start at the leaves (a small sketch of the procedure follows below).
  - Create a combined data set from sibling leaves; suppose there are two, called tb and fb.
  - delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2
  - If delta < mingain (a parameter): merge the branches.
  - Then return up one level of recursion and consider further merging of branches.
- Note: Segaran just uses the training data.
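A minimal sketch of this bottom-up pruning, assuming nodes shaped like the structure described on the next slide (leaves hold a results dict of label counts; internal nodes hold tb and fb subtrees). Names and details are illustrative, not Segaran's exact code:

    import math

    def entropy_of_counts(counts):
        # counts: dict label -> count
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def merge_counts(a, b):
        merged = dict(a)
        for k, v in b.items():
            merged[k] = merged.get(k, 0) + v
        return merged

    def prune(node, mingain):
        # Call on an internal node; recurse first so pruning works bottom-up.
        if node.tb.results is None:
            prune(node.tb, mingain)
        if node.fb.results is None:
            prune(node.fb, mingain)
        # If both children are now leaves, consider merging them into this node.
        if node.tb.results is not None and node.fb.results is not None:
            combined = merge_counts(node.tb.results, node.fb.results)
            delta = entropy_of_counts(combined) - (entropy_of_counts(node.tb.results) +
                                                   entropy_of_counts(node.fb.results)) / 2
            if delta < mingain:
                node.tb, node.fb = None, None
                node.results = combined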

28 Segaran's Trees
- Same ideas, but a different structure.
- Each node corresponds to a test: Attr_i == Value_j? Yes / No.
- In lecture: examples of his trees and how they are built.
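A minimal node structure along these lines, plus the yes/no classification walk (field names are illustrative, not necessarily Segaran's exact ones):

    class DecisionNode:
        def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
            self.col = col          # index of the attribute tested at this node
            self.value = value      # the value compared against (Attr_i == Value_j?)
            self.results = results  # leaf nodes only: dict of label -> count
            self.tb = tb            # "true" branch (test passes)
            self.fb = fb            # "false" branch (test fails)

    def classify(row, node):
        if node.results is not None:          # leaf: return the majority label
            return max(node.results, key=node.results.get)
        v = row[node.col]
        if isinstance(v, (int, float)):       # numeric feature: threshold test
            branch = node.tb if v >= node.value else node.fb
        else:                                 # categorical feature: equality test
            branch = node.tb if v == node.value else node.fb
        return classify(row, branch)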