INTRODUCTION TO ARTIFICIAL INTELLIGENCE Massimo Poesio Machine Learning: Decision Trees.


1 INTRODUCTION TO ARTIFICIAL INTELLIGENCE Massimo Poesio Machine Learning: Decision Trees

2 A DEFINITION OF LEARNING: LEARNING AS IMPROVEMENT Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the task or tasks drawn from the same population more efficiently and more effectively the next time. -- Herb Simon

3 LEARNING AS IMPROVEMENT, 2
Improve on task T, with respect to performance metric P, based on experience E.
– T: Assign to words their senses. P: Percentage of words correctly classified. E: Corpus of words, some with human-given labels.
– T: Categorize email messages as spam or legitimate. P: Percentage of email messages correctly classified. E: Database of emails, some with human-given labels.
– T: Playing checkers. P: Percentage of games won against an arbitrary opponent. E: Playing practice games against itself.
– T: Recognizing hand-written words. P: Percentage of words correctly classified. E: Database of human-labeled images of handwritten words.

4 SPECIFYING A LEARNING SYSTEM
– Choose the training experience.
– Choose exactly what is to be learned, i.e. the target function.
– Choose how to represent the target function.
– Choose a learning algorithm to infer the target function from the experience.
(Diagram: Environment/Experience -> Learner -> Knowledge -> Performance Element)

5 FEATURES The functions learned by ML algorithms specify a mapping from input FEATURES to output FEATURES

6 EXAMPLE 1: CHECKERS
Features used in the linear function seen last time:
– bp(b): number of black pieces on board b
– rp(b): number of red pieces on board b
– bk(b): number of black kings on board b
– rk(b): number of red kings on board b
– bt(b): number of black pieces threatened (i.e. which can be immediately taken by red on its next turn)
– rt(b): number of red pieces threatened
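
For concreteness, here is a small Python sketch of such a linear evaluation function over these six features; the weights and the board encoding are illustrative placeholders, not values from the lecture.

```python
# Hypothetical sketch: a linear board-evaluation function over the six
# checkers features listed above. The weights and the board encoding are
# made-up placeholders for illustration only.

def board_features(b):
    """Return the six feature values for board b; here b is assumed to
    already carry precomputed counts."""
    return {
        "bp": b["black_pieces"], "rp": b["red_pieces"],
        "bk": b["black_kings"],  "rk": b["red_kings"],
        "bt": b["black_threatened"], "rt": b["red_threatened"],
    }

WEIGHTS = {"bp": 1.0, "rp": -1.0, "bk": 1.5, "rk": -1.5, "bt": -0.5, "rt": 0.5}

def evaluate(b, w0=0.0):
    """Linear evaluation: V(b) = w0 + sum_i w_i * x_i(b)."""
    x = board_features(b)
    return w0 + sum(WEIGHTS[name] * value for name, value in x.items())

# Example board state (counts made up for illustration):
board = {"black_pieces": 10, "red_pieces": 9, "black_kings": 1,
         "red_kings": 0, "black_threatened": 2, "red_threatened": 3}
print(evaluate(board))  # higher values are better for black under these weights
```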

7 EXAMPLE 2: WHEN THE NEIGHBOR DRIVES
Suppose we are trying to learn when our neighbor goes to work by car, so that we can ask for a ride. Their decision appears to be influenced by:
– Temperature
– Whether it’s going to rain or not
– Day of the week
– Whether they need to stop at a shop on the way back
– How they are dressed

8 PAST EXPERIENCE IN TERMS OF FEATURES

9 PREDICTION TASK

10 PREDICTION

11 THE NEED FOR AVERAGING

12 THE NEED FOR GENERALIZATION

13 FIRST EXAMPLE OF ML ALGORITHM: DECISION TREES A method developed independently by Quinlan in AI and by Breiman et al. in statistics

14 DECISION TREES
Tree-based classifiers for instances represented as feature vectors. Nodes test features, there is one branch for each value of the feature, and leaves specify the category. Can represent arbitrary conjunctions and disjunctions, and hence any classification function over discrete feature vectors.
(Diagrams: two example trees that test color {red, blue, green} and then shape {circle, square, triangle}; the first outputs pos/neg, the second outputs the classes A, B, C.)

15 A DECISION TREE FOR THE DRIVING PROBLEM

16 LEARNING DECISION TREES FROM DATA Use the data in the training set to build a decision tree that will then be used to make decisions about unseen data. The decision tree specifies a function from feature values to a decision.

17 TRAVERSING THE DECISION TREE
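
The traversal diagram itself is not reproduced in this transcript, so here is a minimal Python sketch of the idea, assuming a hypothetical nested-dict tree that mirrors the keys? split built in the later slides.

```python
# Hypothetical sketch: a learned tree stored as nested dicts, and a traversal
# routine that follows one branch per feature value until it reaches a leaf.
# The representation and names are illustrative, not from the lecture.

tree = {
    "feature": "keys",
    "branches": {
        "keys": "Drive",      # leaf: a class label
        "no-keys": "Walk",
    },
}

def classify(node, example):
    """At each internal node, test the node's feature on the example and
    follow the matching branch; a non-dict node is a leaf (the decision)."""
    while isinstance(node, dict):
        node = node["branches"][example[node["feature"]]]
    return node

print(classify(tree, {"day": "Tue", "t": "hot", "keys": "no-keys"}))  # -> Walk
```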

18

19

20 Decision Tree Learning
– Discrete class values: slight changes in the input have either no effect or a full effect on the classification
– Discrete feature values (or discretized)
– Fast
– Modern DT induction algorithms also handle:
  – Noisy feature values
  – Noisy labels
  – Missing feature values

21 Top-down DT induction
Partition training examples into good “splits”, based on the values of a single “good” feature:
(1) Sat, hot, no, casual, keys -> +
(2) Mon, cold, snow, casual, no-keys -> -
(3) Tue, hot, no, casual, no-keys -> -
(4) Tue, cold, rain, casual, no-keys -> -
(5) Wed, hot, rain, casual, keys -> +
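
A minimal sketch of this partitioning step in Python, with the five examples encoded as (feature-dict, label) pairs; the field names day, t, precip, clothing, keys are assumptions made for this illustration.

```python
from collections import defaultdict

# The five training examples above ('+' = Drive, '-' = Walk).
# Field names are illustrative assumptions, not from the slides.
examples = [
    ({"day": "Sat", "t": "hot",  "precip": "no",   "clothing": "casual", "keys": "keys"},    "+"),
    ({"day": "Mon", "t": "cold", "precip": "snow", "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Tue", "t": "hot",  "precip": "no",   "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Tue", "t": "cold", "precip": "rain", "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Wed", "t": "hot",  "precip": "rain", "clothing": "casual", "keys": "keys"},    "+"),
]

def split_on(feature, data):
    """Group examples by their value for the given feature."""
    groups = defaultdict(list)
    for features, label in data:
        groups[features[feature]].append((features, label))
    return dict(groups)

# Splitting on 'keys' gives two pure groups, matching the next slide:
for value, group in split_on("keys", examples).items():
    print(value, [label for _, label in group])
# keys    ['+', '+']
# no-keys ['-', '-', '-']
```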

22 Top-down DT induction
(Tree: keys? yes -> Drive: 1, 5; no -> Walk: 2, 3, 4)

23 Top-down DT induction
Partition training examples into good “splits”, based on the values of a single “good” feature (here the keys feature has been dropped):
(1) Sat, hot, no, casual -> +
(2) Mon, cold, snow, casual -> -
(3) Tue, hot, no, casual -> -
(4) Tue, cold, rain, casual -> -
(5) Wed, hot, rain, casual -> +
No single acceptable classification: proceed recursively.

24 Top-down DT induction
(Tree: t? cold -> Walk: 2, 4; hot -> mixed: Drive: 1, 5 and Walk: 3)

25 Top-down DT induction
(Tree: t? cold -> Walk: 2, 4; hot -> day? Sat -> Drive: 1, Tue -> Walk: 3, Wed -> Drive: 5)

26 Top-down DT induction
(Tree: t? cold -> Walk: 2, 4; hot -> day? Sat -> Drive: 1, Tue -> Walk: 3, Wed -> Drive: 5, Mon/Thu/Fri/Sun -> ? Drive)

27 Top-down DT induction: divide-and-conquer algorithm
– Pick a feature.
– Split your examples into subsets based on the values of the feature.
– For each subset, examine the examples:
  – Zero examples: assign the most popular class of the parent.
  – All from the same class: assign this class.
  – Otherwise, process recursively.
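
Below is a minimal Python sketch of this divide-and-conquer recursion; the nested-dict tree format, the helper names, and the trivial pick-the-first-feature heuristic are my own illustrative choices, with proper feature selection deferred to the information-gain slides that follow.

```python
from collections import Counter

def majority(labels):
    """Most popular class among a list of labels."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(examples, features, parent_majority=None):
    """Recursive top-down induction; examples are (feature-dict, label) pairs."""
    if not examples:                      # zero examples: most popular class of the parent
        return parent_majority
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:             # all examples share one class: make a leaf
        return labels[0]
    if not features:                      # no features left: fall back to the majority class
        return majority(labels)
    f = features[0]                       # placeholder choice; information gain comes later
    node = {"feature": f, "branches": {}}
    for value in {x[f] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[f] == value]
        node["branches"][value] = build_tree(subset, features[1:], majority(labels))
    return node

# The five driving examples, reduced to two features for brevity.
data = [
    ({"t": "hot",  "keys": "keys"},    "Drive"),
    ({"t": "cold", "keys": "no-keys"}, "Walk"),
    ({"t": "hot",  "keys": "no-keys"}, "Walk"),
    ({"t": "cold", "keys": "no-keys"}, "Walk"),
    ({"t": "hot",  "keys": "keys"},    "Drive"),
]
print(build_tree(data, ["t", "keys"]))
```

For brevity the sketch only branches on feature values that occur in the examples; a fuller version would enumerate every possible value (e.g. all seven days), which is when the zero-examples case above actually triggers.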

28 Top-down DT induction
Different trees can be built for the same data, depending on the order of features:
(Tree: t? cold -> Walk: 2, 4; hot -> day? Sat -> Drive: 1, Tue -> Walk: 3, Wed -> Drive: 5, Mon/Thu/Fri/Sun -> ? Drive)

29 Top-down DT induction
Different trees can be built for the same data, depending on the order of features:
(Tree: a variant of the previous tree in which the remaining branch is resolved with the clothing feature, casual vs. halloween, instead of a default class)

30 Selecting features
Intuitively:
– We want more “informative” features to be higher in the tree. A chain like “Is it Monday? Is it raining? Good political news? No Halloween clothes? Hat on? Coat on? Car keys? Yes?? -> Driving!” does not look like a good result of learning.
– We want a nice, compact tree.

31 Selecting features, 2
Formally:
– Define a “tree size” (number of nodes, number of leaves, depth, ...)
– Try all the possible trees and find the smallest one: NP-hard
Top-down DT induction is a greedy search; it depends on a heuristic for ordering the features (so there is no guarantee of finding the smallest tree). The heuristic used here: information gain.

32 Entropy
Information theory: entropy is the number of bits needed to encode some information.
S: a set of N examples, p*N of them in one class (“Walk”) and q*N in the other (“Drive”).
Entropy(S) = -p*lg(p) - q*lg(q)
p=1, q=0 => Entropy(S) = 0
p=1/2, q=1/2 => Entropy(S) = 1
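
A quick numerical check of these values, as a small Python sketch (the helper name entropy is mine, not from the slides):

```python
from math import log2

def entropy(p, q):
    """Binary entropy (in bits) of a set with class proportions p and q = 1 - p.
    The convention 0 * lg(0) = 0 is applied."""
    return -sum(x * log2(x) for x in (p, q) if x > 0)

print(entropy(1.0, 0.0))   # 0.0   -> a pure set needs no bits
print(entropy(0.5, 0.5))   # 1.0   -> a 50/50 set needs one bit per example
print(entropy(0.6, 0.4))   # ~0.971, the E(S) = 0.97 used on the next slides
```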

33 Entropy and Decision Trees
(Tree: keys? no -> Walk: 2, 3, 4; yes -> Drive: 1, 5)
E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) = 0.97
E(S_no) = 0, E(S_keys) = 0

34 Entropy and Decision Trees
(Tree: t? cold -> Walk: 2, 4; hot -> Drive: 1, 5 and Walk: 3)
E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) = 0.97
E(S_cold) = 0, E(S_hot) = -(1/3)*lg(1/3) - (2/3)*lg(2/3) = 0.92

35 Information gain
For each feature f, compute the reduction in entropy on the split:
Gain(S,f) = E(S) - ∑ (Entropy(S_i) * |S_i| / |S|)
f = keys?: Gain(S,f) = 0.97
f = t?: Gain(S,f) = 0.97 - 0*(2/5) - 0.92*(3/5) = 0.42
f = clothing?: Gain(S,f) = ?
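
A minimal Python check of these numbers, using the same five training examples and made-up field names as in the earlier sketches; it also evaluates the clothing? split the slide leaves as a question (all five examples share the value casual, so the split is uninformative and the gain is 0).

```python
from collections import Counter
from math import log2

# The five driving examples ('+' = Drive, '-' = Walk); field names are assumptions.
examples = [
    ({"day": "Sat", "t": "hot",  "clothing": "casual", "keys": "keys"},    "+"),
    ({"day": "Mon", "t": "cold", "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Tue", "t": "hot",  "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Tue", "t": "cold", "clothing": "casual", "keys": "no-keys"}, "-"),
    ({"day": "Wed", "t": "hot",  "clothing": "casual", "keys": "keys"},    "+"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, feature):
    """Gain(S, f) = E(S) - sum_i |S_i|/|S| * E(S_i) over the values of f."""
    total = entropy([y for _, y in examples])
    n = len(examples)
    remainder = 0.0
    for value in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder

for f in ("keys", "t", "clothing", "day"):
    print(f, round(gain(examples, f), 2))
# -> keys 0.97, t 0.42, clothing 0.0, day 0.97
```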

36 Divide-and-conquer with information gain
– Batch learning (reads all the input data and computes information gain over all the examples simultaneously)
– Greedy search: may converge to a local optimum
– Outputs a single solution
– Optimizes depth

37 Complexity
Worst case: build a complete tree, computing gains at every node (at level i we have already examined i features, so m - i remain).
In practice the tree is rarely complete, and induction is linear in the number of features and the number of examples, i.e. very fast.

38 Overfitting
Suppose we build a very complex tree. Is it good? As noted in the last lecture, we measure the quality (“goodness”) of the predictions, not the performance on the training data.
Why can complex trees yield mistakes?
– Noise in the data
– Even without noise, the decisions at the last levels are based on too few observations

39 Overfitting
Observations per day:
– Mon: Walk (50), Drive (5)
– Tue: Walk (40), Drive (3)
– Wed: Drive (1)
– Thu: Walk (42), Drive (14)
– Fri: Walk (50)
– Sat: Drive (20), Walk (20)
– Sun: Drive (10)
Can we conclude that “Wed -> Drive”?

40 Overfitting
A hypothesis H is said to overfit the training data if there exists another hypothesis H' such that:
Error(H, train data) <= Error(H', train data)
Error(H, unseen data) > Error(H', unseen data)
Overfitting is related to hypothesis complexity: a more complex hypothesis (e.g., a larger decision tree) is more prone to overfitting.

41 Overfitting Prevention for DT: Pruning
“Prune” a complex tree: produce a smaller tree that is less accurate on the training data (but, ideally, more accurate on unseen data).
Original tree: ... Mon: hot -> drive (2), cold -> walk (100)
Pruned tree: ... Mon -> walk (100/2)
Pruning can be applied to a fully built tree (post-pruning) or during tree construction (pre-pruning).

42 Pruning criteria
– Cross-validation: reserve some training data to evaluate the utility of the subtrees.
– Statistical tests: use a test to determine whether the observations at a given level could be random.
– MDL (minimum description length): compare the added complexity of the subtree against simply memorizing the exceptions.
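
As one hedged illustration of the first criterion (hold out some data and keep a subtree only if it earns its keep), here is a sketch of reduced-error post-pruning over the nested-dict trees used in the earlier sketches; it is one simple way to realize the idea, not the specific procedure from the lecture.

```python
from collections import Counter

def classify(node, example, default=None):
    """Follow branches to a leaf; unseen feature values fall back to `default`."""
    while isinstance(node, dict):
        node = node["branches"].get(example[node["feature"]], default)
    return node

def majority(labels, default=None):
    return Counter(labels).most_common(1)[0][0] if labels else default

def prune(node, train, heldout, default=None):
    """Bottom-up reduced-error pruning: replace a subtree by the majority class
    of the training examples reaching it whenever that single leaf classifies
    the held-out examples reaching the same node at least as well."""
    if not isinstance(node, dict):
        return node                                   # already a leaf
    leaf = majority([y for _, y in train], default)   # candidate replacement
    f = node["feature"]
    for value, child in node["branches"].items():
        node["branches"][value] = prune(
            child,
            [(x, y) for x, y in train if x[f] == value],
            [(x, y) for x, y in heldout if x[f] == value],
            default=leaf,
        )
    if not heldout:
        return node                                   # nothing to validate against here
    subtree_hits = sum(classify(node, x, leaf) == y for x, y in heldout)
    leaf_hits = sum(leaf == y for _, y in heldout)
    return leaf if leaf_hits >= subtree_hits else node
```

A call like prune(tree, train_examples, heldout_examples) on a tree in the nested-dict format of the earlier sketches returns either the original tree or a smaller one whose pruned branches have been replaced by majority-class leaves.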

43 DT: issues
– Splitting criteria: information gain favors splits on features with many values
– Non-discrete features
– Non-discrete outputs (“regression trees”)
– Costs
– Missing values
– Incremental learning
– Memory issues

44 ACKNOWLEDGMENTS
Some of the slides are adapted from:
– Ray Mooney’s UT Austin ML course
– the MIT OpenCourseWare AI course

