Learning from Observations


Learning from Observations Chapter 18 Section 1 – 4

Outline Learning agents Inductive learning Decision tree learning Boosting: this is about the best "off-the-shelf" classification algorithm! (and you'll get to know it)

Learning agents Sometimes we want to invest time and effort in observing the feedback our environment gives to our actions, in order to improve these actions and optimize our utility more effectively in the future.

Learning element Design of a learning element is affected by: which components of the performance element are to be learned (e.g. learning to stop for a traffic light), what feedback is available to learn these components (e.g. visual feedback from a camera), and what representation is used for the components (e.g. logic, probabilistic descriptions, attributes, ...). Types of feedback: Supervised learning: correct answers (labels) given for each example. Unsupervised learning: correct answers not given. Reinforcement learning: occasional rewards.

Two Examples of Learning Object Categories. Here is your training set (2 classes):

Here is your test set: Does it belong to one of the above classes?

Learning from 1 Example (copied from P. Perona's talk slides; S. Savarese, 2003) Having said that, we all know that the human visual system can do much better than today's computer vision algorithms. We learn a vast number of different objects and object categories in our lifetime. In many cases we never had the luxury of hundreds of training examples. For example, here I show a member of a new object category that most of you have never seen. Now let's see if you are able to find another member of this new category in the following picture.

P. Bruegel, 1562

Inductive learning Simplest form: learn a function from examples. f is the target function. An example is a pair (x, f(x)). Problem: find a hypothesis h such that h ≈ f, given a training set of examples.

Inductive learning method Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting:

Inductive learning method Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting:

Inductive learning method

Inductive learning method

Inductive learning method Which curve is best?

Inductive learning method Ockham’s razor: prefer the simplest hypothesis consistent with data
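To make the curve-fitting idea concrete, here is a minimal sketch (not from the slides) that fits polynomials of increasing degree to a few made-up noisy points; the data and the degree choices are purely illustrative.

```python
# A minimal curve-fitting sketch: fit polynomials of increasing degree to a
# few noisy points and compare their training error. The data is invented
# purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + 0.1 * rng.standard_normal(8)        # roughly linear data

for degree in (1, 3, 7):
    coeffs = np.polyfit(x, y, degree)             # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {train_err:.5f}")

# The degree-7 polynomial can pass through all 8 points, driving training
# error to ~0, but Ockham's razor prefers the simpler, smoother hypothesis.
```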

Decision Trees Problem: decide whether to wait for a table at a restaurant, based on the following attributes: Alternate: is there an alternative restaurant nearby? Bar: is there a comfortable bar area to wait in? Fri/Sat: is today Friday or Saturday? Hungry: are we hungry? Patrons: number of people in the restaurant (None, Some, Full) Price: price range ($, $$, $$$) Raining: is it raining outside? Reservation: have we made a reservation? Type: kind of restaurant (French, Italian, Thai, Burger) WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Attribute-based representations Examples described by attribute values (Boolean, discrete, continuous) E.g., situations where I will/won't wait for a table: Classification of examples is positive (T) or negative (F) General form for data: a number N of instances, each with attributes (x1, x2, ..., xd) and a target value y.
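As an illustration of this general form, here is one possible way to encode such examples in code. The two rows below are hypothetical stand-ins using the attributes from the restaurant problem, not the full table from the slide.

```python
# One way to encode attribute-based examples: a list of (attributes, label)
# pairs. The two rows below are illustrative only.
examples = [
    ({"Alt": True, "Bar": False, "Fri": False, "Hun": True,
      "Pat": "Some", "Price": "$$$", "Rain": False, "Res": True,
      "Type": "French", "Est": "0-10"}, True),     # will wait
    ({"Alt": True, "Bar": False, "Fri": False, "Hun": True,
      "Pat": "Full", "Price": "$", "Rain": False, "Res": False,
      "Type": "Thai", "Est": "30-60"}, False),     # won't wait
]
```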

Decision trees One possible representation for hypotheses We imagine someone taking a sequence of decisions. E.g., here is the “true” tree for deciding whether to wait: Note you can use the same attribute more than once.

Expressiveness Decision trees can express any function of the input attributes. E.g., for Boolean functions, truth table row → path to leaf: Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples. Prefer to find more compact decision trees: we don't want to memorize the data, we want to find structure in the data!

Hypothesis spaces How many distinct decision trees are there with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^(2^n). E.g., with 6 Boolean attributes there are 2^64 = 18,446,744,073,709,551,616 trees. For n=2: 2^2 = 4 rows, and for each row we can choose T or F, giving 2^4 = 16 functions.
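A quick check of the counting argument, assuming nothing beyond the arithmetic above:

```python
# With n Boolean attributes there are 2**n truth-table rows and hence
# 2**(2**n) distinct Boolean functions (trees up to equivalence).
for n in (2, 6):
    print(n, 2 ** (2 ** n))
# n=2 -> 16 functions; n=6 -> 18446744073709551616
```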

Decision tree learning If there are so many possible trees, can we actually search this space? (solution: greedy search). Aim: find a small tree consistent with the training examples Idea: (recursively) choose "most significant" attribute as root of (sub)tree.

Choosing an attribute Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative". Patrons or Type? After splitting on Type, to wait or not to wait is still at 50%.

Using information theory Entropy measures the amount of uncertainty in a probability distribution. Consider tossing a biased coin: if you toss the coin VERY often, the frequency of heads is, say, p, and hence the frequency of tails is 1-p (fair coin: p = 0.5). The uncertainty in any actual outcome is given by the entropy H(p) = -p log2 p - (1-p) log2(1-p). Note that the uncertainty is zero if p = 0 or 1 and maximal if p = 0.5.

Using information theory If there are more than two states s = 1, 2, ..., n (e.g. a die), we have H = -sum_{s=1..n} p_s log2 p_s.

Using information theory Imagine we have p examples which are true (positive) and n examples which are false (negative). Our best estimates of the probabilities of true and false are P(true) ≈ p/(p+n) and P(false) ≈ n/(p+n). Hence the entropy is I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)).
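A minimal sketch of these entropy formulas in code (the function name is my own):

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Biased coin: maximal uncertainty at p = 0.5, zero at p = 0 or 1.
print(entropy([0.5, 0.5]))                    # 1.0 bit
print(entropy([1.0, 0.0]))                    # 0.0 bits

# A set with p positive and n negative examples: I(p/(p+n), n/(p+n)).
p, n = 6, 6
print(entropy([p / (p + n), n / (p + n)]))    # 1.0 bit, as on the next slides
```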

Using Information Theory How much information do we gain if we disclose the value of some attribute? Answer: uncertainty before minus uncertainty after

Example Before: entropy = -1/2 log2(1/2) - 1/2 log2(1/2) = log2(2) = 1 bit: there is "1 bit of information to be discovered". After, for Type: if we go into the branch "French" we have 1 bit, and similarly for the others (French: 1 bit, Italian: 1 bit, Thai: 1 bit, Burger: 1 bit); on average still 1 bit, so we gained nothing! After, for Patrons: in the branches "None" and "Some" the entropy is 0, and in "Full" the entropy is -1/3 log2(1/3) - 2/3 log2(2/3). So Patrons gains more information!

Information Gain How do we combine the branches? 1/6 of the time we enter "None", so we weight "None" with 1/6. Similarly, "Some" has weight 1/3 and "Full" has weight 1/2. The remaining uncertainty is the sum over the branches of the weight of each branch times the entropy of that branch.

Information gain Information Gain (IG) is the reduction in entropy from the attribute test: IG(A) = I(p/(p+n), n/(p+n)) - remainder(A), where remainder(A) = sum_i (p_i+n_i)/(p+n) * I(p_i/(p_i+n_i), n_i/(p_i+n_i)) is the weighted entropy of the branches. Choose the attribute with the largest IG.

Information gain For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit. Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
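Here is a hedged sketch of the information-gain computation for the root split. The Patrons counts follow the weights and entropies quoted on the previous slides; the Type counts (French 1+/1-, Italian 1+/1-, Thai 2+/2-, Burger 2+/2-) are assumptions, since the slide only states that each Type branch has entropy 1 bit.

```python
import math

def I(p, n):
    """Entropy (bits) of a set with p positive and n negative examples."""
    qs = [p / (p + n), n / (p + n)]
    return -sum(q * math.log2(q) for q in qs if q > 0)

def info_gain(branches, p, n):
    """IG(A) = I(p, n) minus the weighted entropy of the branches,
    where branches is a list of (p_i, n_i) counts."""
    remainder = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in branches)
    return I(p, n) - remainder

# Patrons: None (0+, 2-), Some (4+, 0-), Full (2+, 4-) -> about 0.541 bits
print(info_gain([(0, 2), (4, 0), (2, 4)], 6, 6))
# Type (assumed counts): French, Italian, Thai, Burger -> 0.0 bits
print(info_gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))
```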

Example contd. Decision tree learned from the 12 examples: substantially simpler than the "true" tree; a more complex hypothesis isn't justified by the small amount of data.

What to Do if... In some leaf there are no examples: choose True or False according to the number of positive/negative examples at the parent. There are no attributes left but two or more examples have the same attributes and different labels: we have an error/noise, so stop and use a majority vote. Demo: http://www.cs.ubc.ca/labs/lci/CIspace/Version4/dTree/
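A possible sketch of the recursive tree learner that follows these rules (empty branch: parent majority; no attributes left: majority vote). The helper gain(examples, attribute) is assumed to compute the information gain defined above, and the data format matches the earlier encoding sketch.

```python
from collections import Counter

def majority(examples):
    """Most common label among (attributes, label) examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtl(examples, attributes, domains, parent_examples):
    """Recursive decision-tree learner following the rules on the slide.
    domains[a] lists the possible values of attribute a; gain() is an
    assumed helper that computes an attribute's information gain."""
    if not examples:                          # empty branch: use parent majority
        return majority(parent_examples)
    labels = {label for _, label in examples}
    if len(labels) == 1:                      # pure leaf
        return labels.pop()
    if not attributes:                        # same attributes, different labels:
        return majority(examples)             # noise, so take a majority vote
    best = max(attributes, key=lambda a: gain(examples, a))
    tree = {"attribute": best, "branches": {}}
    for value in domains[best]:
        subset = [(x, y) for x, y in examples if x[best] == value]
        rest = [a for a in attributes if a != best]
        tree["branches"][value] = dtl(subset, rest, domains, examples)
    return tree
```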

Boosting Main idea: train classifiers (e.g. decision trees) in a sequence. A new classifier should focus on those cases which were incorrectly classified in the last round. Combine the classifiers by letting them vote on the final prediction. Each classifier could be (should be) very "weak", e.g. a DT with only one node: a decision stump.

Example This line is one simple classifier, saying that everything to its left is + and everything to its right is -.

Decision Stump Data: {(X1,Y1), (X2,Y2), ..., (Xn,Yn)}, where Y is the label (e.g. True or False, 0 or 1, -1 or +1) and X holds the attributes (e.g. the temperature outside). The stump predicts +1 if the chosen attribute Xi is above a threshold and -1 otherwise.

The Algorithm in Detail
(Z, H) = AdaBoost[Xtrain, Ytrain, Rounds]
weight(i) = 1/N for all i (N = # data cases)
For r = 1 to Rounds Do
  H[r] = LearnWeakClassifier[Xtrain, Ytrain, weights]
  error = 0
  For i = 1 to N Do
    If H[r, Xtrain(i)] ≠ Ytrain(i): error = error + weight(i)
  For i = 1 to N Do
    If H[r, Xtrain(i)] = Ytrain(i): weight(i) = weight(i) * error/(1-error)
  Normalize the weights
  Z[r] = log[(1-error)/error]
Final classifier: weighted majority vote of the H[r], each weighted by Z[r].
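Below is a rough Python translation of this pseudocode using one-node decision stumps as the weak classifiers. It is only a sketch: it assumes labels in {-1, +1}, that each round's weighted error stays strictly between 0 and 1, and all function names are my own.

```python
import numpy as np

def learn_stump(X, y, w):
    """Weighted decision stump on a single feature: returns (feature, threshold,
    polarity) minimizing the weighted training error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(X[:, j] > t, polarity, -polarity)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best[1:]

def stump_predict(stump, X):
    j, t, polarity = stump
    return np.where(X[:, j] > t, polarity, -polarity)

def adaboost(X, y, rounds):
    """AdaBoost as in the pseudocode above; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps, votes = [], []
    for _ in range(rounds):
        stump = learn_stump(X, y, w)
        pred = stump_predict(stump, X)
        error = np.sum(w[pred != y])             # total weight of mistakes
        w[pred == y] *= error / (1 - error)      # shrink correctly classified cases
        w /= w.sum()                             # normalize the weights
        stumps.append(stump)
        votes.append(np.log((1 - error) / error))   # Z[r]
    return stumps, votes

def predict(stumps, votes, X):
    """Weighted majority vote of the weak classifiers."""
    scores = sum(z * stump_predict(s, X) for s, z in zip(stumps, votes))
    return np.sign(scores)
```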

And in a Picture (figure: a training case that is correctly classified, a training case that has a large weight in this round, and a DT that has a strong vote).

And in animation Original training set: equal weights for all training samples. Taken from "A Tutorial on Boosting" by Yoav Freund and Rob Schapire.

AdaBoost(Example) ROUND 1

AdaBoost(Example) ROUND 2

AdaBoost(Example) ROUND 3

AdaBoost(Example)

k-Nearest Neighbors Another simple classification algorithm. Idea: look around you to see how your neighbors classify data. Classify a new data point according to a majority vote of its k nearest neighbors.

k-NN decision line

Distance Metric How do we measure what it means to be “close”? Depending on the problem we should choose an appropriate distance metric.

Algorithm in Detail For a new test example: find the k closest neighbors. Each neighbor predicts the class to be its own class. Take a majority vote. Demo: http://www.comp.lancs.ac.uk/~kristof/research/notes/nearb/cluster.html Demo: MATLAB
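A minimal k-NN sketch under the Euclidean distance metric mentioned above (any other metric could be swapped in); the function name is my own.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote of its k nearest training neighbors,
    using Euclidean distance as the metric."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```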

Logistic Regression / Perceptron Fits a soft decision boundary between the classes (figures: the 1-dimensional and 2-dimensional cases).

The logit / sigmoid f(x) = 1 / (1 + exp(-(w·x + b))): the bias b determines the offset, and the weight w determines the angle and the steepness.

Objective We interpret f(x) as the probability of classifying a data case as positive. We want to maximize the total probability of the data vectors, i.e. the likelihood prod_i f(x_i)^{y_i} (1 - f(x_i))^{1-y_i}, or equivalently the log-likelihood sum_i [ y_i log f(x_i) + (1 - y_i) log(1 - f(x_i)) ].

Algorithm in detail Repeat until convergence (gradient descent on the negative log-likelihood): w ← w + η sum_i (y_i - f(x_i)) x_i (or better: Newton steps).
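A small sketch of this update loop for the 0/1-label case, using plain batch gradient steps on the log-likelihood; the learning rate and iteration count are arbitrary defaults.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=1000):
    """Batch gradient ascent on the log-likelihood; y in {0, 1}.
    w holds the weights (angle/steepness), b the offset."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iters):
        p = sigmoid(X @ w + b)          # f(x): probability of the positive class
        grad_w = X.T @ (y - p)          # gradient of the log-likelihood
        grad_b = np.sum(y - p)
        w += lr * grad_w / len(y)
        b += lr * grad_b / len(y)
    return w, b
```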

Cross-validation You are ultimately interested in good performance on new (unseen) test data. To estimate that, split off a (smallish) subset of the training data, called the validation set. Train without the validation data and "test" on the validation data. This gives an indication of how long to run boosting.
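A minimal hold-out validation sketch for picking the number of boosting rounds, reusing the adaboost() and predict() functions from the earlier AdaBoost sketch; the 80/20 split and the candidate round counts are arbitrary illustrative choices.

```python
import numpy as np

def choose_rounds(X, y, candidates=(10, 50, 100), seed=0):
    """Pick the boosting round count with the best hold-out accuracy.
    Uses adaboost() and predict() from the AdaBoost sketch above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    split = int(0.8 * len(y))                 # 80/20 train/validation split
    train, val = idx[:split], idx[split:]
    best_rounds, best_acc = None, -1.0
    for rounds in candidates:
        stumps, votes = adaboost(X[train], y[train], rounds)
        acc = np.mean(predict(stumps, votes, X[val]) == y[val])
        if acc > best_acc:
            best_rounds, best_acc = rounds, acc
    return best_rounds
```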

Project Demo For this project I will obtain data from a company that wants to predict the activity of a chemical compound. Training data will be provided; test data will not! May the best win this competition.

Summary Learning agent = performance element + learning element. For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with the training examples. Decision tree learning + boosting. Learning performance = prediction accuracy measured on the test set.