CS 540 - Fall 2016 (© Jude Shavlik), Lecture 3: Decision Tree Learning


If Still on Waitlist
At END of class, turn in a sheet of paper with:
- your name
- your student ID number
- your major(s); indicate if CS certificate student
- your year (grad, sr, jr, soph)
- other CS class(es) currently in this term
I'll review & email those invited by NOON tomorrow.
You can turn in HW0 until NEXT Tues without penalty (but no extension for HW1).

Midterm Date?
- Current plan is Thurs Oct 20, but a significant conflict has arisen
- HW2 can be turned in up to Oct 18 (one week late)
- Tues Oct 25 for the midterm?
- Drop date is Fri Nov 4 (it will take a week to grade the midterm)
- FINAL IS FIXED (and late in Dec!)

"Mid-class" Break
One, tough robot …
https://m.youtube.com/watch?v=rVlhMGQgDkY

Today's Topics
- HW0 due 11:55pm 9/13/16 and no later than 9/20/16
- HW1 out on class home page (due in two weeks); discussion page in Moodle
- Feature Space Revisited
- Learning Decision Trees (Chapter 18)
  - We'll use d-trees to introduce/motivate many general issues in ML (eg, overfitting reduction)
- "Forests" of Decision Trees - very successful ML approach, arguably the best on many tasks
- Expected-Value Calculations (a topic we'll revisit a few times)
- Information Gain
- Advanced Topic: Regression Trees
- Coding Tips for HW1
- Fun (and easy) reading: http://homes.cs.washington.edu/~pedrod/Prologue.pdf

Recall: Feature Space
If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space.
[Figure: a 3-D feature space with axes Size, Color, and Weight; an unlabeled example at (Big, Gray, 2500) is marked "?".]
A "concept" is then a (possibly disjoint) volume in this space.
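As a purely illustrative sketch of this view, here is one way examples could be represented in code; the feature names and values below come from the figure, not from any HW specification.

```python
# A minimal sketch: an example is a point in feature space, i.e. an
# assignment of a value to each feature (plus a class label, if known).
# Feature names/values are illustrative only.
from dataclasses import dataclass

@dataclass
class Example:
    features: dict   # feature name -> value, e.g. {"Size": "Big", "Color": "Gray", "Weight": 2500}
    label: str       # "+", "-", or "?" if unlabeled

query = Example({"Size": "Big", "Color": "Gray", "Weight": 2500}, "?")
# With N features, each example is a point in an N-dimensional space;
# a learned concept labels regions (volumes) of that space.
```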

Supervised Learning and Venn Diagrams
[Figure: a feature space containing labeled + and - points; two regions, A and B, enclose the + points.]
Concept = A or B (ie, a disjunctive concept)
Examples = labeled points in feature space
Concept = a label for regions of feature space

Brief Introduction to Logic
Instances
- Conjunctive Concept ("and"): Color(?obj1, red) ∧ Size(?obj1, large)
- Disjunctive Concept ("or"): Color(?obj2, blue) ∨ Size(?obj2, small)
More formally, a "concept" is of the form
  ∀x ∀y ∃z F(x, y, z) → Member(x, Class1)

Logical Symbols
∧  and
∨  or
¬  not
→  implies
↔  equivalent
∀  for all
∃  there exists

Induction vs. Deduction
Deduction: compute what logically follows
- if we know P(Mary) is true and ∀x P(x) → Q(x), we can deduce Q(Mary)
Induction: if we observe P(1), P(2), …, P(100) we can induce ∀x P(x)
- might be wrong
Which does supervised ML do?

Some Common Jargon
Classification: learning a discrete-valued function
Regression: learning a real-valued function
(The distinction is about the outputs; inputs can be real valued in both cases.)
Most ML algo's easily extend to regression tasks (and to multi-category classification) - we will focus on BINARY outputs but will discuss how to handle richer outputs as we go.

Learning Decision Trees (Ch 18): The ID3 Algorithm (Quinlan 1979; Machine Learning 1:1, 1986)
- Induction of Decision Trees (top-down)
- Based on Hunt's CLS psych model (1963)
- Handles noisy & missing feature values
- C4.5 and C5.0 successors; CART very similar
[Figure: a small d-tree that tests COLOR? (Red / Blue) and then SIZE? (Big / Small), with + and - leaves.]

Main Hypothesis of ID3
The simplest tree that classifies the training examples will work best on future examples (Occam's Razor).
[Figure: Ross Quinlan; two trees that fit the same data - a larger one testing COLOR? then SIZE?, vs. a smaller one testing only SIZE?]
NP-hard to find the smallest tree (Hyafil & Rivest, 1976).

Why Occam's Razor? (Occam lived 1285 – 1349)
- There are fewer short hypotheses (small trees in ID3) than long ones
- Short hypothesis that fits training data unlikely to be coincidence
- Long hypothesis that fits training data might be (since many more possibilities)
- COLT community formally addresses these issues (ML theory)

Finding Small Decision Trees
ID3 - generate small trees with a greedy algorithm:
- Find a feature that "best" divides the data
- Recur on each subset of the data that the feature creates
What does "best" mean? We'll briefly postpone answering this.

Overview of ID3 (Recursion!)
[Figure: ID3 applied to a dataset of positive examples (+1 … +5) and negative examples (-1, -2, -3), with candidate splitting attributes (aka features) A1-A4; the resulting d-tree is shown in red. On a branch that receives no examples, use the majority class at the parent node (+) - why?]

ID3 Algorithm (Figure 18.5 of textbook)

Given:  E, a set of classified examples
        F, a set of features not yet in the decision tree
        the majority class at the parent

If |E| = 0 then
    return majority class at parent
Else if All_Examples_Same_Class then
    return <the class>
Else if |F| = 0 then
    return majority class at parent   (have +/- ex's with same feature values)
Else
    Let bestF = FeatureThatGainsMostInfo(E, F)
    Let leftF = F - bestF
    Add node bestF to decision tree
    For each possible value, v, of bestF do
        Add arc (labeled v) to decision tree
        And connect to result of ID3({ex in E | ex has value v for feature bestF}, leftF)
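For concreteness, here is a runnable Python sketch of the pseudocode above. It assumes examples are plain dicts mapping feature names to discrete values plus a "label" key holding "+" or "-", and it takes the feature-scoring function as a parameter; these representational choices are mine, not the HW's. An info-gain scorer is sketched after the Info Gain slide below.

```python
# A sketch of the ID3 pseudocode above; "examples" are dicts of
# feature name -> value plus a "label" key ("+" or "-").
from collections import Counter

def majority_label(examples, default="+"):
    """Most common label among the examples (default used when empty)."""
    if not examples:
        return default
    return Counter(ex["label"] for ex in examples).most_common(1)[0][0]

def id3(examples, features, parent_majority, feature_values, score):
    """Return a leaf label (a string) or an interior node (a dict)."""
    if not examples:
        return parent_majority                    # |E| = 0: use majority class at parent
    labels = {ex["label"] for ex in examples}
    if len(labels) == 1:
        return labels.pop()                       # all examples have the same class
    if not features:
        return parent_majority                    # |F| = 0: +/- ex's with same feature values
    best = max(features, key=lambda f: score(examples, f))   # FeatureThatGainsMostInfo
    node = {"feature": best, "branches": {}}
    majority_here = majority_label(examples)      # passed down as each child's parent majority
    for value in feature_values[best]:            # one arc per possible value of bestF
        subset = [ex for ex in examples if ex[best] == value]
        node["branches"][value] = id3(subset, [f for f in features if f != best],
                                      majority_here, feature_values, score)
    return node
```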

Venn Diagram View of ID3
Question: How do decision trees divide feature space?
[Figure: + and - examples plotted in a 2-D feature space with axes F1 and F2.]

Venn Diagram View of ID3
Question: How do decision trees divide the feature space?
Answer: with "axis-parallel splits".
[Figure: the same F1-F2 feature space carved into rectangular regions by axis-parallel splits, alongside the corresponding d-tree that tests F1 and then F2.]

[Figure: sample ASCII printout of a learned d-tree.]
Use this as a guide on how to print d-trees in ASCII.

Main Issue
How to choose the next feature to place in the decision tree?
- Random choice? [works better than you'd expect]
- Feature with the largest number of values? Feature with the fewest?
- Information-theoretic measure (Quinlan's approach)
General-purpose tool, eg often used for "feature selection".

Expected Value Calculations: Sample Task
Imagine you invest $1 in a lottery ticket. It says the odds are:
- 1 in 10 times you'll win $5
- 1 in 1,000,000 times you'll win $100,000
How much do you expect to get back?
  0.1 x $5 + 0.000001 x $100,000 = $0.60

More Generally
Assume eventA has N discrete and disjoint random outcomes. Then

  Expected value(eventA) = Σ_{i=1}^{N} prob(outcome_i occurs) × value(outcome_i)
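A tiny sketch of this expected-value calculation (the function name is mine):

```python
# Expected value = sum over outcomes of prob(outcome) * value(outcome).
def expected_value(outcomes):
    """outcomes: iterable of (probability, value) pairs."""
    return sum(prob * value for prob, value in outcomes)

# The lottery ticket from the previous slide:
print(expected_value([(0.1, 5), (0.000001, 100_000)]))   # 0.6
```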

Scoring the Features (so we can pick the best one)
Let f+ = fraction of positive examples
Let f- = fraction of negative examples
  f+ = p / (p + n),  f- = n / (p + n)   where p = #pos, n = #neg
The expected information needed to determine the category of one of these examples is

  InfoNeeded(f+, f-) = - f+ lg(f+) - f- lg(f-)

This is also called the entropy of the set of examples (derived later).
From where will we get this info?
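A short sketch of InfoNeeded (binary entropy), using the usual convention that 0 · lg(0) = 0; the two printed cases are the extremes discussed on the next slide.

```python
import math

def info_needed(f_pos, f_neg):
    """- f+ lg(f+) - f- lg(f-), with 0 lg(0) taken to be 0."""
    total = 0.0
    for f in (f_pos, f_neg):
        if f > 0:
            total -= f * math.log2(f)
    return total

print(info_needed(1.0, 0.0))   # 0.0 : all examples in one class
print(info_needed(0.5, 0.5))   # 1.0 : a 50-50 mixture
```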

Consider the Extreme Cases of InfoNeeded(f+, f-)
- All same class (+, say): InfoNeeded(1, 0) = -1 lg(1) - 0 lg(0) = 0   (0 lg(0) is 0 by def'n)
- 50-50 mixture: InfoNeeded(½, ½) = 2 [ -½ lg(½) ] = 1
[Figure: plot of InfoNeeded(f+, 1-f+) versus f+; it is 0 at f+ = 0 and f+ = 1 and peaks at 1 when f+ = 0.5.]

Evaluating a Feature
How much does it help to know the value of attribute/feature A?
Assume A divides the current set of examples into N groups. Let
  q_i  = fraction of data on branch i
  f_i+ = fraction of +'s on branch i
  f_i- = fraction of -'s on branch i

Evaluating a Feature (cont.)

  InfoRemaining(A) ≡ Σ_{i=1}^{N} q_i × InfoNeeded(f_i+, f_i-)

This is the info still needed after determining the value of attribute A - another expected-value calculation.
[Figure: feature A at the root with InfoNeeded(f+, f-); branches labeled v1 … vN lead to subsets needing InfoNeeded(f1+, f1-) through InfoNeeded(fN+, fN-).]
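A sketch of InfoRemaining, reusing info_needed from the sketch above and the dict-based example representation assumed earlier:

```python
def info_remaining(examples, feature):
    """Weighted sum of InfoNeeded over the branches created by 'feature'."""
    remaining = 0.0
    for value in {ex[feature] for ex in examples}:
        branch = [ex for ex in examples if ex[feature] == value]
        q = len(branch) / len(examples)                 # q_i: fraction of data on branch i
        pos = sum(ex["label"] == "+" for ex in branch)  # number of +'s on branch i
        remaining += q * info_needed(pos / len(branch), 1 - pos / len(branch))
    return remaining
```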

Info Gain

  Gain(A) ≡ InfoNeeded(f+, f-) - InfoRemaining(A)

This is our scoring function in our hill-climbing (greedy) algorithm.
InfoNeeded(f+, f-) is constant for all features, so picking the A with the largest gain is the same as picking the A with the smallest InfoRemaining(A).
That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution.
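Putting the pieces together (again building on the sketches above); best_feature is what you would pass as the scorer to the id3 sketch earlier.

```python
def info_gain(examples, feature):
    """Gain(A) = InfoNeeded(f+, f-) - InfoRemaining(A)."""
    pos = sum(ex["label"] == "+" for ex in examples)
    f_pos = pos / len(examples)
    return info_needed(f_pos, 1 - f_pos) - info_remaining(examples, feature)

def best_feature(examples, features):
    """Largest gain == smallest InfoRemaining, since InfoNeeded is constant."""
    return max(features, key=lambda f: info_gain(examples, f))
```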

Sample Info-Gain Calculation

  InfoNeeded(f+, f-) = - f+ lg(f+) - f- lg(f-)

[Table: a small training set with columns Color, Shape, Size, and Class; the surviving values include Red, Blue, and Yellow for Color, BIG and SMALL for Size, and + / - for Class. The Shape column was drawn graphically and did not survive.]

Info-Gain Calculation (cont.)
Note that "Size" provides complete classification, so done.
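The numbers below use a hypothetical four-example dataset (not the slide's original table, whose entries were lost) with the same flavor: Size perfectly separates the classes while Color does not. It is run through the sketches above.

```python
# Hypothetical dataset (NOT the slide's original table): Size perfectly
# separates + from -, while Color leaves the two Red examples mixed.
data = [
    {"Color": "Red",    "Size": "BIG",   "label": "+"},
    {"Color": "Blue",   "Size": "BIG",   "label": "+"},
    {"Color": "Red",    "Size": "SMALL", "label": "-"},
    {"Color": "Yellow", "Size": "SMALL", "label": "-"},
]
print(info_gain(data, "Size"))    # 1.0 -> complete classification, so done
print(info_gain(data, "Color"))   # 0.5 -> the Red branch is still a 50-50 mix
```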

Recursive Methods You'll Need to Write
- The d-tree learning algo (pseudocode appeared above)
- Classifying a 'testset' example
  - Leaf nodes: return the leaf's label (ie, the predicted category)
  - Interior nodes: determine which feature value to look up in the ex, then return the result of the recursive call on the 'left' or 'right' branch
- Printing the d-tree in 'plain ASCII' (you need not follow this verbatim)
  - Tip: pass in 'currentDepthOfRecursion' (initially 0)
  - Leaf nodes: print LABEL (and maybe # training ex's reaching here) + LINEFEED
  - Interior nodes: for each outgoing arc
    - print LINEFEED and 3 x currentDepthOfRecursion spaces
    - print FEATURE NAME + " = " + the arc's value + ": "
    - make the recursive call on the arc, with currentDepthOfRecursion + 1
A sketch of the latter two routines appears below.
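Here is a sketch of the classification and printing routines for the dict-based tree returned by the id3 sketch above (leaves are label strings, interior nodes are {"feature": ..., "branches": {...}}). The 3-spaces-per-level layout follows the tip on this slide, but the exact output format you must produce should be taken from the HW itself.

```python
def classify(tree, example):
    """Traverse the tree, following the branch matching the example's feature value."""
    if isinstance(tree, str):
        return tree                                   # leaf: return its label
    value = example[tree["feature"]]                  # which feature value to look up in the ex
    return classify(tree["branches"][value], example)

def print_tree(tree, depth=0):
    """Print the d-tree in plain ASCII, indenting 3 spaces per level."""
    if isinstance(tree, str):
        print(tree)                                   # leaf: LABEL (plus linefeed)
        return
    for value, subtree in tree["branches"].items():   # one line per outgoing arc
        print(" " * (3 * depth) + tree["feature"] + " = " + str(value) + ": ", end="")
        if isinstance(subtree, str):
            print_tree(subtree, depth + 1)            # leaf label prints on the same line
        else:
            print()                                   # interior subtree continues on new lines
            print_tree(subtree, depth + 1)
```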

Suggested Approach
- Randomly choose a feature
  - Get tree building to work
  - Get tree printing to work
  - Get tree traversal (for test ex's) to work
- Add in code for infoGain
- Test on simple, handcrafted datasets
  - Train and test on the SAME file (why?)
  - Should get ALL correct (except if extreme noise)
- Produce what the HW requests