9 Inductive learning method
Construct/adjust h to agree with f on the training set. (h is consistent if it agrees with f on all examples.)
E.g., curve fitting. [Figures: successively more complex curves fit to the same data points.]
Ockham's razor: prefer the simplest consistent hypothesis.
10 Inductive Learning
Given examples of some concepts and a description (features) for these concepts as training data, learn how to classify subsequent descriptions into one of the concepts.
Key terms: concepts (classes), features, training set, test set.
Here, the function has discrete outputs (the classes).
11 Decision Trees
“Should I play tennis today?”

Outlook:
  Sunny -> Humidity:
    High -> No
    Low  -> Yes
  Overcast -> Yes
  Rain -> Wind:
    Strong -> No
    Weak   -> Yes

Note: a decision tree can be expressed as a disjunction of conjunctions:
(Outlook = Sunny ∧ Humidity = Low) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
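As a quick illustration (a sketch, not from the slides; the function name and value strings are my own, matching the tree above), the same tree can be written directly as that disjunction of conjunctions:

```python
def play_tennis(outlook, humidity, wind):
    """Sketch: the tennis decision tree as a disjunction of conjunctions."""
    return ((outlook == "Sunny" and humidity == "Low")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))
```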
12 Learning Decision Trees
Inductive learning: given a set of positive and negative training examples of a concept, can we learn a decision tree that can be used to appropriately classify other examples?
Identification trees: ID3 [Quinlan, 1979].
13 What on Earth causes people to get sunburns? I don’t know, so let’s go to the beach and collect some data.
14 Sunburn data

Name   Hair   Height   Swim Suit Color  Lotion  Result
Sarah  Blond  Average  Yellow           No      Sunburned
Dana   Blond  Tall     Red              Yes     Fine
Alex   Brown  Short    Red              Yes     Fine
Annie  Blond  Short    Red              No      Sunburned
Emily  Red    Average  Blue             No      Sunburned
Pete   Brown  Tall     Blue             No      Fine
John   Brown  Average  Blue             No      Fine
Katie  Blond  Short    Yellow           Yes     Fine

There are 3 × 3 × 3 × 2 = 54 possible feature vectors.
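For the worked examples that follow, here is one possible encoding of this table as plain Python tuples (a sketch; the DATA name and tuple layout are my own):

```python
# (name, hair, height, suit color, lotion, result) -- one tuple per table row
DATA = [
    ("Sarah", "Blond", "Average", "Yellow", "No",  "Sunburned"),
    ("Dana",  "Blond", "Tall",    "Red",    "Yes", "Fine"),
    ("Alex",  "Brown", "Short",   "Red",    "Yes", "Fine"),
    ("Annie", "Blond", "Short",   "Red",    "No",  "Sunburned"),
    ("Emily", "Red",   "Average", "Blue",   "No",  "Sunburned"),
    ("Pete",  "Brown", "Tall",    "Blue",   "No",  "Fine"),
    ("John",  "Brown", "Average", "Blue",   "No",  "Fine"),
    ("Katie", "Blond", "Short",   "Yellow", "Yes", "Fine"),
]
```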
15 Exact Matching Method
Construct a table recording observed cases; use table lookup to classify new data.
Problem: for realistic problems, exact matching can't be used.
With 8 people and 54 possible feature vectors, there is only about a 15% chance of finding an exact match.
Another example: 10^6 examples, 12 features, 5 values per feature: 10^6 / 5^12 ≈ 0.4% of the feature space is covered.
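Both coverage figures are easy to sanity-check with plain arithmetic:

```python
print(8 / 54)         # ~0.148: about a 15% chance of an exact match
print(10**6 / 5**12)  # ~0.0041: about 0.4% of the feature space covered
```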
16 How can we do the classification?
Nearest-neighbor method (but only if we can establish a distance between feature vectors).
Use identification trees: an identification tree is a decision tree in which each set of possible conclusions is implicitly established by a list of samples of known class.
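For discrete features, one natural distance is the number of mismatched feature values (Hamming distance). A minimal nearest-neighbor sketch over the DATA tuples above (helper names are mine):

```python
def hamming(u, v):
    """Count the features on which two discrete feature vectors disagree."""
    return sum(a != b for a, b in zip(u, v))

def nearest_neighbor(query, examples):
    """Return the class label of the stored example closest to `query`."""
    best = min(examples, key=lambda row: hamming(row[1:-1], query))
    return best[-1]

# e.g. a short blond person in a yellow suit, no lotion:
print(nearest_neighbor(("Blond", "Short", "Yellow", "No"), DATA))  # "Sunburned"
```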
17 An ID tree consistent with the data

Hair Color:
  Blond -> Lotion Used:
    No  -> Sarah, Annie (Sunburned)
    Yes -> Dana, Katie (Not Sunburned)
  Red   -> Emily (Sunburned)
  Brown -> Alex, Pete, John (Not Sunburned)
18 Another consistent ID tree

Height:
  Tall    -> Dana, Pete (Not Sunburned)
  Short   -> further tests on Hair Color and Suit Color separate Annie (Sunburned) from Alex, Katie (Not Sunburned)
  Average -> further tests on Hair Color and Suit Color separate Sarah, Emily (Sunburned) from John (Not Sunburned)

[Figure: a larger tree than the previous one, also consistent with all eight samples.]
19 An idea
Select tests that divide people as well as possible into sets with homogeneous labels. Candidate first tests:

Hair Color:
  Blond -> Sarah, Annie, Dana, Katie
  Red   -> Emily
  Brown -> Alex, Pete, John

Lotion used:
  No  -> Sarah, Annie, Emily, Pete, John
  Yes -> Dana, Alex, Katie

(and similarly for Height and Suit Color)
20 Then among blonds...

Height:
  Short   -> Annie, Katie
  Average -> Sarah
  Tall    -> Dana

Suit Color:
  Yellow -> Sarah, Katie
  Red    -> Dana, Annie
  Blue   -> (none)

Lotion used:
  No  -> Sarah, Annie
  Yes -> Dana, Katie

The Lotion split is perfectly homogeneous: everyone in the No branch is sunburned, everyone in the Yes branch is fine.
21 Combining these two together...

Hair Color:
  Blond -> Lotion Used:
    No  -> Sarah, Annie (Sunburned)
    Yes -> Dana, Katie (Not Sunburned)
  Red   -> Emily (Sunburned)
  Brown -> Alex, Pete, John (Not Sunburned)
23 Problem: for practical problems, it is unlikely that any test will produce even one completely homogeneous subset.
Solution: minimize a measure of inhomogeneity (disorder), available from information theory.
24 Information
Say we have a question with n possible answers v1, ..., vn, where answer vi occurs with probability P(vi). The information content (entropy), measured in bits, of knowing the answer is:

I(P(v1), ..., P(vn)) = -Σi P(vi) log2 P(vi)

One bit of information is enough to answer a yes/no question. E.g., consider flipping a fair coin: how much information do you have if you know which side comes up?

I(1/2, 1/2) = -(1/2 log2 1/2 + 1/2 log2 1/2) = 1 bit
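The formula is a one-liner in code. A sketch (the function name is mine), using the usual convention that 0 · log2 0 = 0:

```python
from math import log2

def information(probs):
    """I(p1, ..., pn) = -sum(pi * log2 pi), taking 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(information([0.5, 0.5]))  # 1.0 -- the fair-coin example
```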
25 Information at a node
In our decision tree, for a given feature (e.g. hair color) we have:
b: number of branches (e.g. possible values for the feature)
Nb: number of samples in branch b
Np: total number of samples over all branches
Nbc: number of samples of class c in branch b

Using frequencies as estimates of the probabilities, the information still needed after applying the test is the sample-weighted average over the branches:

Information = Σb (Nb / Np) × Ib,  where  Ib = -Σc (Nbc / Nb) log2 (Nbc / Nb)

For a single branch, the information is simply Ib.
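In code, the per-branch term and the weighted average over branches might look like this (a sketch; the names are mine):

```python
from collections import Counter
from math import log2

def branch_information(labels):
    """I_b = -sum_c (Nbc/Nb) log2(Nbc/Nb) over the classes in one branch."""
    n = len(labels)
    # "+ 0.0" normalizes the -0.0 that pure (single-class) branches produce
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values()) + 0.0

def test_information(branches):
    """sum_b (Nb/Np) * I_b, where `branches` is a list of label lists."""
    total = sum(len(b) for b in branches)
    return sum(len(b) / total * branch_information(b) for b in branches)
```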
26 Example
Consider a single branch (b = 1) which only contains members of two classes A and B.
If half of the points belong to A and half to B:

I = -(1/2 log2 1/2 + 1/2 log2 1/2) = 1 bit

What if all the points belong to A (or to B)? Taking 0 log2 0 = 0:

I = -(1 log2 1 + 0 log2 0) = 0 bits

We prefer the latter situation: the branch is homogeneous, so less information is needed to make a decision (equivalently, the test maximizes information gain).
27 What is the amount of information required for classification after we have used the hair test?

Hair Color:
  Blond -> Sarah, Annie (Sunburned), Dana, Katie (Fine):  -2/4 log2 2/4 - 2/4 log2 2/4 = 1
  Red   -> Emily (Sunburned):                             -1 log2 1 - 0 log2 0 = 0
  Brown -> Alex, Pete, John (Fine):                       -3/3 log2 3/3 = 0

Information = 4/8 × 1 + 1/8 × 0 + 3/8 × 0 = 0.5 bits
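Using test_information from the sketch above, the same 0.5-bit figure drops out:

```python
blond = ["Sunburned", "Sunburned", "Fine", "Fine"]  # Sarah, Annie, Dana, Katie
red   = ["Sunburned"]                               # Emily
brown = ["Fine", "Fine", "Fine"]                    # Alex, Pete, John
print(test_information([blond, red, brown]))        # 0.5
```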
28 Selecting the top-level feature
Using the 8 samples we have so far, we get:

Test        Information (bits)
Hair        0.50
Height      0.69
Suit Color  0.94
Lotion      0.61

Hair wins: it needs the least additional information for the rest of the classification. This is used to build the first level of the identification tree:

Hair Color:
  Blond -> Sarah, Annie, Dana, Katie
  Red   -> Emily
  Brown -> Alex, Pete, John
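Feature selection is then an argmin over the candidate tests. A sketch using the DATA tuples and the helpers above (FEATURES maps a test name to its tuple column; the names are mine):

```python
FEATURES = {"Hair": 1, "Height": 2, "Suit Color": 3, "Lotion": 4}

def split(rows, col):
    """Group the class labels of `rows` by their value in column `col`."""
    branches = {}
    for row in rows:
        branches.setdefault(row[col], []).append(row[-1])
    return list(branches.values())

def best_feature(rows):
    """Pick the test that leaves the least remaining information."""
    return min(FEATURES, key=lambda f: test_information(split(rows, FEATURES[f])))

print(best_feature(DATA))  # "Hair"
```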
29 Selecting the second-level feature
Consider the remaining features for the blond branch (4 samples: Sarah, Annie, Dana, Katie):

Test        Information (bits)
Height      0.50
Suit Color  1.00
Lotion      0.00

Lotion wins: it needs the least additional information.
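The same helpers reproduce the second-level numbers when restricted to the blond subset:

```python
blonds = [row for row in DATA if row[1] == "Blond"]
for name, col in FEATURES.items():
    if name != "Hair":
        print(name, round(test_information(split(blonds, col)), 2))
# Height 0.5, Suit Color 1.0, Lotion 0.0 -> Lotion wins
```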
30 Thus we get to the tree we had arrived at earlier:

Hair Color:
  Blond -> Lotion Used:
    No  -> Sarah, Annie (Sunburned)
    Yes -> Dana, Katie (Not Sunburned)
  Red   -> Emily (Sunburned)
  Brown -> Alex, Pete, John (Not Sunburned)
31 Using the identification tree as a classification procedure

Hair Color:
  Blond -> Lotion Used:
    No  -> Sunburn
    Yes -> OK
  Red   -> Sunburn
  Brown -> OK

Rules:
If blond and uses lotion, then OK.
If blond and does not use lotion, then gets burned.
If red-haired, then gets burned.
If brown-haired, then OK.
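Read off as code, the procedure is just those four rules (a sketch; the function name is mine):

```python
def classify(hair, lotion):
    """The four rules read off the identification tree above."""
    if hair == "Blond":
        return "OK" if lotion == "Yes" else "Sunburn"
    if hair == "Red":
        return "Sunburn"
    return "OK"  # brown hair

print(classify("Blond", "No"))  # "Sunburn"
```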
32 Performance measurement
How do we know that h ≈ f?
Use theorems of computational/statistical learning theory.
Try h on a new test set of examples (drawn from the same distribution over the example space as the training set).
Learning curve = % correct on the test set as a function of training set size.
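A learning curve can be estimated by repeatedly splitting the data, training on the first n examples, and scoring the rest. A sketch assuming hypothetical learn(train) and predict(h, x) callables (neither is defined on the slides):

```python
import random

def learning_curve(examples, learn, predict, sizes, trials=20):
    """% correct on held-out examples as a function of training-set size."""
    curve = []
    for n in sizes:
        correct = total = 0
        for _ in range(trials):
            pool = list(examples)
            random.shuffle(pool)
            train, test = pool[:n], pool[n:]
            h = learn(train)  # e.g. build an ID tree from the training rows
            correct += sum(predict(h, x) == x[-1] for x in test)
            total += len(test)
        curve.append((n, correct / total))
    return curve
```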