INTRODUCTION TO ARTIFICIAL INTELLIGENCE Massimo Poesio Machine Learning: Decision Trees.


1 INTRODUCTION TO ARTIFICIAL INTELLIGENCE Massimo Poesio Machine Learning: Decision Trees

2 A DEFINITION OF LEARNING: LEARNING AS IMPROVEMENT Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the task or tasks drawn from the same population more efficiently and more effectively the next time. -- Herb Simon

3 LEARNING AS IMPROVEMENT, 2
Improve on task T, with respect to performance metric P, based on experience E.
– T: Assign to words their senses. P: Percentage of words correctly classified. E: Corpus of words, some with human-given labels.
– T: Categorize email messages as spam or legitimate. P: Percentage of email messages correctly classified. E: Database of emails, some with human-given labels.
– T: Playing checkers. P: Percentage of games won against an arbitrary opponent. E: Playing practice games against itself.
– T: Recognizing hand-written words. P: Percentage of words correctly classified. E: Database of human-labeled images of handwritten words.

4 SPECIFYING A LEARNING SYSTEM
– Choose the training experience.
– Choose exactly what is to be learned, i.e. the target function.
– Choose how to represent the target function.
– Choose a learning algorithm to infer the target function from the experience.
(Diagram: Environment/Experience -> Learner -> Knowledge -> Performance Element)

5 FEATURES The functions learned by ML algorithms specify a mapping from input FEATURES to output FEATURES

6 EXAMPLE 1: CHECKERS
Features used in the linear function seen last time:
– bp(b): number of black pieces on board b
– rp(b): number of red pieces on board b
– bk(b): number of black kings on board b
– rk(b): number of red kings on board b
– bt(b): number of black pieces threatened (i.e. which can be immediately taken by red on its next turn)
– rt(b): number of red pieces threatened
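
For concreteness, here is a small Python sketch of such a linear evaluation function over these six features; the weights and the board encoding are illustrative placeholders, not values from the lecture.

```python
# Hypothetical sketch: a linear board-evaluation function over the six
# checkers features listed above. The weights and the board encoding are
# made-up placeholders for illustration only.

def board_features(b):
    """Return the six feature values for board b; here b is assumed to
    already carry precomputed counts."""
    return {
        "bp": b["black_pieces"], "rp": b["red_pieces"],
        "bk": b["black_kings"],  "rk": b["red_kings"],
        "bt": b["black_threatened"], "rt": b["red_threatened"],
    }

WEIGHTS = {"bp": 1.0, "rp": -1.0, "bk": 1.5, "rk": -1.5, "bt": -0.5, "rt": 0.5}

def evaluate(b, w0=0.0):
    """Linear evaluation: V(b) = w0 + sum_i w_i * x_i(b)."""
    x = board_features(b)
    return w0 + sum(WEIGHTS[name] * value for name, value in x.items())

# Example board state (counts made up for illustration):
board = {"black_pieces": 10, "red_pieces": 9, "black_kings": 1,
         "red_kings": 0, "black_threatened": 2, "red_threatened": 3}
print(evaluate(board))  # higher values are better for black under these weights
```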

7 EXAMPLE 2: WHEN THE NEIGHBOR DRIVES
Suppose we are trying to learn when our neighbor goes to work by car, so that we can ask for a ride. Their decision appears to be influenced by:
– Temperature
– Whether it’s going to rain or not
– Day of the week
– Whether they need to stop at a shop on the way back
– How they are dressed

8 PAST EXPERIENCE IN TERMS OF FEATURES

9 PREDICTION TASK

10 PREDICTION

11 THE NEED FOR AVERAGING

12 THE NEED FOR GENERALIZATION

13 FIRST EXAMPLE OF ML ALGORITHM: DECISION TREES A method developed independently by Quinlan in AI and by Breiman et al. in statistics

14 DECISION TREES
Tree-based classifiers for instances represented as feature vectors. Nodes test features, there is one branch for each value of the feature, and leaves specify the category. Can represent arbitrary conjunctions and disjunctions, and hence any classification function over discrete feature vectors.
(Diagrams: two example trees that test color {red, blue, green} and then shape {circle, square, triangle}; the first outputs pos/neg, the second outputs the classes A, B, C.)

15 A DECISION TREE FOR THE DRIVING PROBLEM

16 LEARNING DECISION TREES FROM DATA Use the data in the training set to build a decision tree that will then be used to make decisions about unseen data. The decision tree specifies a function from feature values to a decision.

17 TRAVERSING THE DECISION TREE
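
The traversal diagram itself is not reproduced in this transcript, so here is a minimal Python sketch of the idea, assuming a hypothetical nested-dict tree that mirrors the keys? split built in the later slides.

```python
# Hypothetical sketch: a learned tree stored as nested dicts, and a traversal
# routine that follows one branch per feature value until it reaches a leaf.
# The representation and names are illustrative, not from the lecture.

tree = {
    "feature": "keys",
    "branches": {
        "keys": "Drive",      # leaf: a class label
        "no-keys": "Walk",
    },
}

def classify(node, example):
    """At each internal node, test the node's feature on the example and
    follow the matching branch; a non-dict node is a leaf (the decision)."""
    while isinstance(node, dict):
        node = node["branches"][example[node["feature"]]]
    return node

print(classify(tree, {"day": "Tue", "t": "hot", "keys": "no-keys"}))  # -> Walk
```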

18

19

20 Decision Tree Learning
– Discrete class values: slight changes in the input have either no effect or a full effect on the classification
– Discrete feature values (or discretized)
– Fast
– Modern DT induction algorithms also handle:
  – Noisy feature values
  – Noisy labels
  – Missing feature values

21 Top-down DT induction
Partition training examples into good “splits”, based on the values of a single “good” feature:
(1) Sat, hot, no, casual, keys -> +
(2) Mon, cold, snow, casual, no-keys -> -
(3) Tue, hot, no, casual, no-keys -> -
(4) Tue, cold, rain, casual, no-keys -> -
(5) Wed, hot, rain, casual, keys -> +
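
A minimal sketch of this partitioning step in Python, with the five examples encoded as (feature-dict, label) pairs; the field names day, t, precip, clothing, keys are assumptions made for this illustration.

```python
from collections import defaultdict

# The five training examples above ('+' = Drive, '-' = Walk).
# Field names are illustrative assumptions, not from the slides.
examples = [
    ({"day": "Sat", "t": "hot",  "precip": "no",   "clothing": "casual", "keys": "keys"},    "+"),
    ({"day": "Mon", "t": "cold", "precip": "snow", "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Tue", "t": "hot",  "precip": "no",   "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Tue", "t": "cold", "precip": "rain", "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Wed", "t": "hot",  "precip": "rain", "clothing": "casual", "keys": "keys"},    "+"),
]

def split_on(feature, data):
    """Group examples by their value for the given feature."""
    groups = defaultdict(list)
    for features, label in data:
        groups[features[feature]].append((features, label))
    return dict(groups)

# Splitting on 'keys' gives two pure groups, matching the next slide:
for value, group in split_on("keys", examples).items():
    print(value, [label for _, label in group])
# keys    ['+', '+']
# no-keys ['-', '-', '-']
```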

22 Top-down DT induction
(Tree: keys? yes -> Drive: 1, 5; no -> Walk: 2, 3, 4)

23 Top-down DT induction
Partition training examples into good “splits”, based on the values of a single “good” feature (here the keys feature has been dropped):
(1) Sat, hot, no, casual -> +
(2) Mon, cold, snow, casual -> -
(3) Tue, hot, no, casual -> -
(4) Tue, cold, rain, casual -> -
(5) Wed, hot, rain, casual -> +
No single acceptable classification: proceed recursively.

24 Top-down DT induction
(Tree: t? cold -> Walk: 2, 4; hot -> mixed: Drive: 1, 5 and Walk: 3)

25 Top-down DT induction
(Tree: t? cold -> Walk: 2, 4; hot -> day? Sat -> Drive: 1, Tue -> Walk: 3, Wed -> Drive: 5)

26 Top-down DT induction
(Tree: t? cold -> Walk: 2, 4; hot -> day? Sat -> Drive: 1, Tue -> Walk: 3, Wed -> Drive: 5, Mon/Thu/Fri/Sun -> ? Drive)

27 Top-down DT induction: divide-and-conquer algorithm
– Pick a feature.
– Split your examples into subsets based on the values of the feature.
– For each subset, examine the examples:
  – Zero examples: assign the most popular class of the parent.
  – All from the same class: assign this class.
  – Otherwise, process recursively.
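
Below is a minimal Python sketch of this divide-and-conquer recursion; the nested-dict tree format, the helper names, and the trivial pick-the-first-feature heuristic are my own illustrative choices, with proper feature selection deferred to the information-gain slides that follow.

```python
from collections import Counter

def majority(labels):
    """Most popular class among a list of labels."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(examples, features, parent_majority=None):
    """Recursive top-down induction; examples are (feature-dict, label) pairs."""
    if not examples:                      # zero examples: most popular class of the parent
        return parent_majority
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:             # all examples share one class: make a leaf
        return labels[0]
    if not features:                      # no features left: fall back to the majority class
        return majority(labels)
    f = features[0]                       # placeholder choice; information gain comes later
    node = {"feature": f, "branches": {}}
    for value in {x[f] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[f] == value]
        node["branches"][value] = build_tree(subset, features[1:], majority(labels))
    return node

# The five driving examples, reduced to two features for brevity.
data = [
    ({"t": "hot",  "keys": "keys"},    "Drive"),
    ({"t": "cold", "keys": "no-keys"}, "Walk"),
    ({"t": "hot",  "keys": "no-keys"}, "Walk"),
    ({"t": "cold", "keys": "no-keys"}, "Walk"),
    ({"t": "hot",  "keys": "keys"},    "Drive"),
]
print(build_tree(data, ["t", "keys"]))
```

For brevity the sketch only branches on feature values that occur in the examples; a fuller version would enumerate every possible value (e.g. all seven days), which is when the zero-examples case above actually triggers.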

28 Top-down DT induction
Different trees can be built for the same data, depending on the order of features:
(Tree: t? cold -> Walk: 2, 4; hot -> day? Sat -> Drive: 1, Tue -> Walk: 3, Wed -> Drive: 5, Mon/Thu/Fri/Sun -> ? Drive)

29 Top-down DT induction
Different trees can be built for the same data, depending on the order of features:
(Tree: a variant of the previous tree in which the remaining branch is resolved with the clothing feature, casual vs. halloween, instead of a default class)

30 Selecting features
Intuitively:
– We want more “informative” features to be higher in the tree. A chain like “Is it Monday? Is it raining? Good political news? No Halloween clothes? Hat on? Coat on? Car keys? Yes?? -> Driving!” does not look like a good result of learning.
– We want a nice, compact tree.

31 Selecting features, 2
Formally:
– Define a “tree size” (number of nodes, number of leaves, depth, ...)
– Try all the possible trees and find the smallest one: NP-hard
Top-down DT induction is a greedy search; it depends on a heuristic for ordering the features (so there is no guarantee of finding the smallest tree). The heuristic used here: information gain.

32 Entropy
Information theory: entropy is the number of bits needed to encode some information.
S: a set of N examples, p*N of them in one class (“Walk”) and q*N in the other (“Drive”).
Entropy(S) = -p*lg(p) - q*lg(q)
p=1, q=0 => Entropy(S) = 0
p=1/2, q=1/2 => Entropy(S) = 1
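
A quick numerical check of these values, as a small Python sketch (the helper name entropy is mine, not from the slides):

```python
from math import log2

def entropy(p, q):
    """Binary entropy (in bits) of a set with class proportions p and q = 1 - p.
    The convention 0 * lg(0) = 0 is applied."""
    return -sum(x * log2(x) for x in (p, q) if x > 0)

print(entropy(1.0, 0.0))   # 0.0   -> a pure set needs no bits
print(entropy(0.5, 0.5))   # 1.0   -> a 50/50 set needs one bit per example
print(entropy(0.6, 0.4))   # ~0.971, the E(S) = 0.97 used on the next slides
```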

33 Entropy and Decision Trees
(Tree: keys? no -> Walk: 2, 3, 4; yes -> Drive: 1, 5)
E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) = 0.97
E(S_no) = 0, E(S_keys) = 0

34 Entropy and Decision Trees
(Tree: t? cold -> Walk: 2, 4; hot -> Drive: 1, 5 and Walk: 3)
E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) = 0.97
E(S_cold) = 0, E(S_hot) = -(1/3)*lg(1/3) - (2/3)*lg(2/3) = 0.92

35 Information gain
For each feature f, compute the reduction in entropy on the split:
Gain(S,f) = E(S) - ∑ (Entropy(S_i) * |S_i| / |S|)
f = keys?: Gain(S,f) = 0.97
f = t?: Gain(S,f) = 0.97 - 0*(2/5) - 0.92*(3/5) = 0.42
f = clothing?: Gain(S,f) = ?
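
A minimal Python check of these numbers, using the same five training examples and made-up field names as in the earlier sketches; it also evaluates the clothing? split the slide leaves as a question (all five examples share the value casual, so the split is uninformative and the gain is 0).

```python
from collections import Counter
from math import log2

# The five driving examples ('+' = Drive, '-' = Walk); field names are assumptions.
examples = [
    ({"day": "Sat", "t": "hot",  "clothing": "casual", "keys": "keys"},    "+"),
    ({"day": "Mon", "t": "cold", "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Tue", "t": "hot",  "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Tue", "t": "cold", "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Wed", "t": "hot",  "clothing": "casual", "keys": "keys"},    "+"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, feature):
    """Gain(S, f) = E(S) - sum_i |S_i|/|S| * E(S_i) over the values of f."""
    total = entropy([y for _, y in examples])
    n = len(examples)
    remainder = 0.0
    for value in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder

for f in ("keys", "t", "clothing", "day"):
    print(f, round(gain(examples, f), 2))
# -> keys 0.97, t 0.42, clothing 0.0, day 0.97
```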

36 Divide-and-conquer with information gain
– Batch learning (reads all the input data and computes information gain over all the examples simultaneously)
– Greedy search: may converge to a local optimum
– Outputs a single solution
– Optimizes depth

37 Complexity
Worst case: build a complete tree, computing gains at every node (at level i we have already examined i features, so m - i remain).
In practice the tree is rarely complete, and induction is linear in the number of features and the number of examples, i.e. very fast.

38 Overfitting
Suppose we build a very complex tree. Is it good? As noted in the last lecture, we measure the quality (“goodness”) of the predictions, not the performance on the training data.
Why can complex trees yield mistakes?
– Noise in the data
– Even without noise, the decisions at the last levels are based on too few observations

39 Overfitting
Observations per day:
– Mon: Walk (50), Drive (5)
– Tue: Walk (40), Drive (3)
– Wed: Drive (1)
– Thu: Walk (42), Drive (14)
– Fri: Walk (50)
– Sat: Drive (20), Walk (20)
– Sun: Drive (10)
Can we conclude that “Wed -> Drive”?

40 Overfitting
A hypothesis H is said to overfit the training data if there exists another hypothesis H' such that:
Error(H, train data) <= Error(H', train data)
Error(H, unseen data) > Error(H', unseen data)
Overfitting is related to hypothesis complexity: a more complex hypothesis (e.g., a larger decision tree) is more prone to overfitting.

41 Overfitting Prevention for DT: Pruning
“Prune” a complex tree: produce a smaller tree that is less accurate on the training data (but, ideally, more accurate on unseen data).
Original tree: ... Mon: hot -> drive (2), cold -> walk (100)
Pruned tree: ... Mon -> walk (100/2)
Pruning can be applied to a fully built tree (post-pruning) or during tree construction (pre-pruning).

42 Pruning criteria
– Cross-validation: reserve some training data to evaluate the utility of the subtrees.
– Statistical tests: use a test to determine whether the observations at a given level could be random.
– MDL (minimum description length): compare the added complexity of the subtree against simply memorizing the exceptions.
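
As one hedged illustration of the first criterion (hold out some data and keep a subtree only if it earns its keep), here is a sketch of reduced-error post-pruning over the nested-dict trees used in the earlier sketches; it is one simple way to realize the idea, not the specific procedure from the lecture.

```python
from collections import Counter

def classify(node, example, default=None):
    """Follow branches to a leaf; unseen feature values fall back to `default`."""
    while isinstance(node, dict):
        node = node["branches"].get(example[node["feature"]], default)
    return node

def majority(labels, default=None):
    return Counter(labels).most_common(1)[0][0] if labels else default

def prune(node, train, heldout, default=None):
    """Bottom-up reduced-error pruning: replace a subtree by the majority class
    of the training examples reaching it whenever that single leaf classifies
    the held-out examples reaching the same node at least as well."""
    if not isinstance(node, dict):
        return node                                   # already a leaf
    leaf = majority([y for _, y in train], default)   # candidate replacement
    f = node["feature"]
    for value, child in node["branches"].items():
        node["branches"][value] = prune(
            child,
            [(x, y) for x, y in train if x[f] == value],
            [(x, y) for x, y in heldout if x[f] == value],
            default=leaf,
        )
    if not heldout:
        return node                                   # nothing to validate against here
    subtree_hits = sum(classify(node, x, leaf) == y for x, y in heldout)
    leaf_hits = sum(leaf == y for _, y in heldout)
    return leaf if leaf_hits >= subtree_hits else node
```

A call like prune(tree, train_examples, heldout_examples) on a tree in the nested-dict format of the earlier sketches returns either the original tree or a smaller one whose pruned branches have been replaced by majority-class leaves.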

43 DT: issues
– Splitting criteria: information gain favors splits on features with many values
– Non-discrete features
– Non-discrete outputs (“regression trees”)
– Costs
– Missing values
– Incremental learning
– Memory issues

44 ACKNOWLEDGMENTS
Some of the slides are adapted from:
– Ray Mooney’s UT Austin ML course
– the MIT OpenCourseWare AI course

