CS 540 - Fall 2016 (© Jude Shavlik), Lecture 3 (9/13/16)
If Still on Waitlist
  At the END of class, turn in a sheet of paper with
    your name
    your student ID number
    your major(s); indicate if you are a CS certificate student
    your year (grad, sr, jr, soph)
    other CS class(es) you are currently taking this term
  I’ll review & email those invited by NOON tomorrow
  You can turn in HW0 until NEXT Tues without penalty (but no extension for HW1)
Midterm Date?
  Current plan is Thurs Oct 20, but a significant conflict has arisen
  HW2 can be turned in up to Oct 18 (one week late)
  Tues Oct 25 for midterm?
  Drop date is Fri Nov 4 (will take a week to grade the midterm)
  FINAL IS FIXED (and late in Dec!)
“Mid-class” Break
  One, tough robot … https://m.youtube.com/watch?v=rVlhMGQgDkY
Today’s Topics
  HW0 due 11:55pm 9/13/16 and no later than 9/20/16
  HW1 out on class home page (due in two weeks); discussion page in Moodle
  Feature Space Revisited
  Learning Decision Trees (Chapter 18)
    We’ll use d-trees to introduce/motivate many general issues in ML (e.g., overfitting reduction)
    “Forests” of decision trees are a very successful ML approach, arguably the best on many tasks
  Expected-Value Calculations (a topic we’ll revisit a few times)
  Information Gain
  Advanced Topic: Regression Trees
  Coding Tips for HW1
  Fun (and easy) reading: http://homes.cs.washington.edu/~pedrod/Prologue.pdf
CS 540 - Fall 2016 (Shavlik©), Lecture 2 (9/8/16)
Recall: Feature Space
  If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space
  [Figure: a 3-D feature space with axes Size, Color, and Weight; one example is plotted at Size = Big, Color = Gray, Weight = 2500]
  A “concept” is then a (possibly disjoint) volume in this space
Supervised Learning and Venn Diagrams
  [Figure: a feature space filled with + and - examples, with two regions labeled A and B]
  Concept = A or B (i.e., a disjunctive concept)
  Examples = labeled points in feature space
  Concept = a label for regions of feature space
Brief Introduction to Logic
  Instances
    Conjunctive concept (“and”): Color(?obj1, red) ∧ Size(?obj1, large)
    Disjunctive concept (“or”): Color(?obj2, blue) ∨ Size(?obj2, small)
  More formally, a “concept” is of the form
    ∀x ∀y ∀z F(x, y, z) → Member(x, Class1)
Logical Symbols
  ∧  and
  ∨  or
  ¬  not
  →  implies
  ↔  equivalent
  ∀  for all
  ∃  there exists
Induction vs. Deduction
  Deduction
    compute what logically follows
    if we know P(Mary) is true and ∀x P(x) → Q(x), we can deduce Q(Mary)
  Induction
    if we observe P(1), P(2), …, P(100) we can induce ∀x P(x)
    might be wrong
  Which does supervised ML do?
Some Common Jargon: Discrete/Real Outputs (inputs can be real valued in both cases)
  Classification: learning a discrete-valued function
  Regression: learning a real-valued function
  Most ML algorithms are easily extended to regression tasks (and to multi-category classification); we will focus on BINARY outputs but will discuss how to handle richer outputs as we go
Learning Decision Trees (Ch 18): The ID3 Algorithm (Quinlan 1979; Machine Learning 1:1 1986)
  Induction of Decision Trees (top-down)
  Based on Hunt’s CLS psych model (1963)
  Handles noisy & missing feature values
  C4.5 and C5.0 successors; CART very similar
  [Figure: example d-tree that tests COLOR? (Red / Blue) and SIZE? (Big / Small), with - and + leaves]
Main Hypothesis of ID3 (Ross Quinlan)
  The simplest tree that classifies training examples will work best on future examples (Occam’s Razor)
  [Figure: a larger tree that tests COLOR? and then SIZE? VS. a smaller tree that tests only SIZE?]
  NP-Hard to find the smallest tree (Hyafil + Rivest, 1976)
Why Occam’s Razor? (Occam lived 1285 – 1349)
  There are fewer short hypotheses (small trees in ID3) than long ones
  A short hypothesis that fits the training data is unlikely to be a coincidence
  A long hypothesis that fits the training data might be (since there are many more possibilities)
  The COLT community formally addresses these issues (ML theory)
Finding Small Decision Trees
  ID3: generate small trees with a greedy algorithm
    Find a feature that “best” divides the data
    Recur on each subset of the data that the feature creates
  What does “best” mean?
    We’ll briefly postpone answering this
Overview of ID3 (Recursion!)
  [Figure: ID3 is run on a dataset of examples +1 … +5 and -1 … -3 described by features A1 … A4; a splitting attribute (aka feature) is chosen at each node and ID3 recurs on each resulting subset; the resulting d-tree is shown in red]
  Use majority class at parent node (+); why?
ID3 Algorithm (Figure 18.5 of textbook)
  Given
    E, a set of classified examples
    F, a set of features not yet in decision tree
    majority class at parent
  If |E| = 0 then return majority class at parent
  Else if All_Examples_Same_Class, return <the class>
  Else if |F| = 0, return majority class at parent (have +/- ex’s with same feature values)
  Else
    Let bestF = FeatureThatGainsMostInfo(E, F)
    Let leftF = F – bestF
    Add node bestF to decision tree
    For each possible value, v, of bestF do
      Add arc (labeled v) to decision tree
      And connect to result of ID3({ex in E | ex has value v for feature bestF}, leftF, majority class of E)
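The pseudocode above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the HW1 solution: the tree representation (a leaf is a label, an interior node is a `(feature, children)` tuple), the helper names, the `feature_values` table, and the four-example dataset are all made up for this sketch. The best feature is chosen as the one with the least information remaining after the split, which is the same as the one that gains the most information.

```python
import math
from collections import Counter

def entropy(labels):
    """Bits needed to determine the class of one example from this list."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_remaining(examples, feature):
    """Expected entropy after splitting the examples on the given feature."""
    n = len(examples)
    total = 0.0
    for v in {ex[feature] for ex, _ in examples}:
        subset = [label for ex, label in examples if ex[feature] == v]
        total += len(subset) / n * entropy(subset)
    return total

def majority(labels):
    """Most common label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def id3(examples, features, feature_values, parent_majority):
    """examples: list of (feature_dict, label); features: names not yet used."""
    if not examples:                       # |E| = 0
        return parent_majority
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:              # all examples in the same class
        return labels[0]
    if not features:                       # |F| = 0, as on the slide
        return parent_majority
    # Most info gained == least info remaining after the split.
    best = min(features, key=lambda f: info_remaining(examples, f))
    left = [f for f in features if f != best]
    node_majority = majority(labels)
    children = {}
    for v in feature_values[best]:         # one arc per possible value
        subset = [(ex, label) for ex, label in examples if ex[best] == v]
        children[v] = id3(subset, left, feature_values, node_majority)
    return (best, children)

# Hypothetical dataset, invented for this sketch.
examples = [
    ({"color": "red", "size": "big"}, "+"),
    ({"color": "blue", "size": "big"}, "+"),
    ({"color": "red", "size": "small"}, "-"),
    ({"color": "yellow", "size": "small"}, "-"),
]
feature_values = {"color": ["red", "blue", "yellow"], "size": ["big", "small"]}
# Assumption: the root has no parent, so it is given the overall majority class.
root_majority = majority([label for _, label in examples])
tree = id3(examples, ["color", "size"], feature_values, root_majority)
```

On this made-up data the sketch splits on `size` first, because that split leaves no information remaining, and both branches become leaves.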
Venn Diagram View of ID3
  Question: How do decision trees divide feature space?
  [Figure: + and - examples plotted in a feature space with axes F1 and F2]
Venn Diagram View of ID3
  Question: How do decision trees divide the feature space?
  ‘Axis-parallel splits’
  [Figure: the F1-F2 feature space cut into rectangular regions by axis-parallel lines, next to the corresponding d-tree that tests F1 and F2 and has + and - leaves]
Use this as a guide on how to print d-trees in ASCII
Main Issue
  How to choose the next feature to place in the decision tree?
    Random choice? [works better than you’d expect]
    Feature with largest number of values?
    Feature with fewest?
    Information theoretic measure (Quinlan’s approach)
      General-purpose tool, e.g., often used for “feature selection”
Expected Value Calculations: Sample Task
  Imagine you invest $1 in a lottery ticket
  It says the odds are
    1 in 10 times you’ll win $5
    1 in 1,000,000 times you’ll win $100,000
  How much do you expect to get back?
    0.1 x $5 + 0.000001 x $100,000 = $0.60
More Generally
  Assume eventA has N discrete and disjoint random outcomes
  Expected value(eventA) = Σ_{i=1}^{N} prob(outcome_i occurs) x value(outcome_i)
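As a quick check of the formula, here is a minimal Python sketch applied to the lottery from the sample task. The function name and the list-of-pairs representation are assumptions made for this illustration.

```python
def expected_value(outcomes):
    """Sum of prob x value over (prob, value) pairs of disjoint outcomes."""
    return sum(prob * value for prob, value in outcomes)

# Lottery from the sample task: 1-in-10 chance of $5,
# 1-in-1,000,000 chance of $100,000.
lottery = [(0.1, 5.0), (0.000001, 100_000.0)]
lottery_return = expected_value(lottery)
print(f"${lottery_return:.2f}")  # prints $0.60
```

Since the ticket costs $1, the expected return of $0.60 means an expected loss of $0.40 per ticket.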
Scoring the Features (so we can pick the best one)
  Let f+ = fraction of positive examples
  Let f- = fraction of negative examples
    f+ = p / (p + n), f- = n / (p + n), where p = #pos, n = #neg
  The expected information needed to determine the category of one of these examples is
    InfoNeeded(f+, f-) = - f+ lg(f+) - f- lg(f-)
  This is also called the entropy of the set of examples (derived later)
  From where will we get this info?
Consider the Extreme Cases of InfoNeeded(f+, f-)
  All same class (+, say)
    InfoNeeded(1, 0) = -1 lg(1) - 0 lg(0) = 0    (0 lg(0) is 0 by def’n)
  50-50 mixture
    InfoNeeded(½, ½) = 2 [ -½ lg(½) ] = 1
  [Plot: InfoNeeded(f+, 1-f+) versus f+; the curve rises from 0 to a peak of 1 at f+ = 0.5 and falls back to 0 at f+ = 1]
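The two extreme cases can be checked with a small Python sketch of InfoNeeded. The function name `info_needed` is an assumption for this illustration; the `f > 0` test implements the convention that 0 lg(0) is 0.

```python
import math

def info_needed(f_pos, f_neg):
    """Entropy, in bits, of a two-class set with the given class fractions."""
    total = 0.0
    for f in (f_pos, f_neg):
        if f > 0:                      # 0 lg(0) is taken to be 0
            total -= f * math.log2(f)
    return total

all_same_class = info_needed(1.0, 0.0)   # no information needed
fifty_fifty = info_needed(0.5, 0.5)      # one full bit needed
```

Evaluating `info_needed(f, 1 - f)` for f between 0 and 1 traces out the curve described in the plot.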
Evaluating a Feature
  How much does it help to know the value of attribute/feature A?
  Assume A divides the current set of examples into N groups
  Let q_i = fraction of data on branch i
      f_i+ = fraction of +’s on branch i
      f_i- = fraction of –’s on branch i
Evaluating a Feature (cont.)
  InfoRemaining(A) = Σ_{i=1}^{N} q_i x InfoNeeded(f_i+, f_i-)
  Info still needed after determining the value of attribute A
  Another expected value calc
  Pictorially
    [Figure: node A, with InfoNeeded(f+, f-) above it, has branches v1 … vN leading to InfoNeeded(f_1+, f_1-) … InfoNeeded(f_N+, f_N-)]
Info Gain
  Gain(A) = InfoNeeded(f+, f-) – InfoRemaining(A)
  Our scoring function in our hill-climbing (greedy) algorithm
  InfoNeeded(f+, f-) is constant for all features, so pick the A with the smallest InfoRemaining(A)
  That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution
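A minimal Python sketch of the gain calculation follows. The function names, the `(feature_dict, label)` representation, and the four-example dataset are hypothetical, invented only to show the arithmetic; they are not the lecture's sample data.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """InfoNeeded for a list of class labels, in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_remaining(examples, feature):
    """Sum over branches of q_i x InfoNeeded on branch i."""
    branches = defaultdict(list)
    for features, label in examples:
        branches[features[feature]].append(label)
    n = len(examples)
    return sum(len(b) / n * entropy(b) for b in branches.values())

def gain(examples, feature):
    """Info needed before the split minus info remaining after it."""
    labels = [label for _, label in examples]
    return entropy(labels) - info_remaining(examples, feature)

# Hypothetical dataset, made up for this sketch.
examples = [
    ({"color": "red", "size": "big"}, "+"),
    ({"color": "blue", "size": "big"}, "+"),
    ({"color": "red", "size": "small"}, "-"),
    ({"color": "yellow", "size": "small"}, "-"),
]
size_gain = gain(examples, "size")    # 1.0: size separates the classes fully
color_gain = gain(examples, "color")  # 0.5: the red branch is still mixed
```

Because the first term of the gain is the same for every feature, ranking features by `gain` and ranking them by smallest `info_remaining` give the same choice.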
Sample Info-Gain Calculation
  InfoNeeded(f+, f-) = - f+ lg(f+) - f- lg(f-)
  [Table of training examples with columns Color, Shape, Size, Class; the rows did not survive extraction. Surviving values: Color includes Red, Yellow, Blue; Size is BIG or SMALL; Class is + or -]
Info-Gain Calculation (cont.)
  Note that “Size” provides complete classification, so done
Recursive Methods You’ll Need to Write
  The d-tree learning algo (pseudocode appeared above)
  Classifying a ‘testset’ example
    Leaf nodes: return leaf’s label (i.e., the predicted category)
    Interior nodes: determine which feature value to look up in the example, then return the result of the recursive call on the ‘left’ or ‘right’ branch
  Printing the d-tree in ‘plain ASCII’ (you need not follow verbatim)
    Tip: pass in ‘currentDepthOfRecursion’ (initially 0)
    Leaf nodes: print LABEL (and maybe # training ex’s reaching here) + LINEFEED
    Interior nodes: for each outgoing arc
      print LINEFEED and 3 x currentDepthOfRecursion spaces
      print FEATURE NAME + “ = ” + the arc’s value + “: ”
      make recursive call on arc, with currentDepthOfRecursion + 1
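The two traversal methods might look like the Python sketch below. It is an illustration under assumptions, not the required HW1 design: a leaf is a plain label, an interior node is a hypothetical `(feature, children)` tuple with one child per arc value, and the printer builds a string instead of printing directly so that the result is easy to inspect.

```python
def classify(tree, example):
    """Follow the arcs that match the example until a leaf is reached."""
    if not isinstance(tree, tuple):        # leaf: return its label
        return tree
    feature, children = tree               # interior: look up feature value
    return classify(children[example[feature]], example)

def format_tree(tree, depth=0):
    """Render the tree as plain ASCII, indenting 3 spaces per level."""
    if not isinstance(tree, tuple):        # leaf: just the label
        return str(tree)
    feature, children = tree
    text = ""
    for value, subtree in children.items():
        # New line, indentation, then "FEATURE = value: " for this arc.
        text += "\n" + " " * (3 * depth) + f"{feature} = {value}: "
        text += format_tree(subtree, depth + 1)
    return text

# Hypothetical hand-built tree, used only to exercise the two functions.
tree = ("size", {"big": ("color", {"red": "+", "blue": "-"}), "small": "-"})
prediction = classify(tree, {"size": "big", "color": "blue"})
rendering = format_tree(tree)
print(rendering)
```

Passing the depth as a parameter, as the tip suggests, keeps the indentation logic in one place; each recursive call only needs to add 1.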
Suggested Approach
  Randomly choose a feature
    Get tree building to work
    Get tree printing to work
    Get tree traversal (for test ex’s) to work
  Add in code for infoGain
    Test on simple, handcrafted datasets
  Train and test on the SAME file (why?)
    Should get ALL correct (except if extreme noise)
  Produce what the HW requests