CS-424 Gregory Dudek Lecture 14 Learning –Inductive inference –Probably approximately correct learning.

CS-424 Gregory Dudek What is learning? Key point: all learning can be seen as learning the representation of a function. This will become clearer with more examples! Example representations:
–propositional if-then rules
–first-order if-then rules
–first-order logic theories
–decision trees
–neural networks
–Java programs

CS-424 Gregory Dudek Learning: formalism Come up with some function f such that f(x) = y for all training examples (x,y) and f (somehow) generalizes to yet unseen examples. –In practice, we don’t always do it perfectly.

CS-424 Gregory Dudek Inductive bias: intro There has to be some structure apparent in the inputs in order to support generalization. Consider the following pairs from the phone book.
Inputs            Outputs
Ralph Student
Louie Reasoner
Harry Coder
Fred Flintstone   ???-????
There is not much to go on here. Suppose we were to add zip code information. Suppose phone numbers were issued based on the spelling of a person's last name. Suppose the outputs were user passwords?

CS-424 Gregory Dudek Example 2 Consider the problem of fitting a curve to a set of (x,y) pairs.
[scatter plot of (x,y) points]
–Should you fit a linear, quadratic, cubic, or piece-wise linear function?
–It would help to have some idea of how smooth the target function is, or to know from what family of functions (e.g., polynomials of degree 3) to choose.
–Does this sound like cheating? What's the alternative?
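
To make the choice concrete, here is a small sketch in plain Python (the data points and helper names are made up, not from the slides): a least-squares line embodies a strong restriction bias, while the exact degree-4 interpolant through all five points has almost none, and the two disagree sharply when extrapolating.

```python
# Two hypothesis families for the same five (x, y) points:
# a least-squares line (strong restriction bias) vs. the exact
# degree-4 interpolating polynomial (no bias: it threads every
# point, noise included).

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]   # roughly y = x plus noise

def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def lagrange(xs, ys, x):
    """Value at x of the unique degree-(n-1) polynomial through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

a, b = fit_line(xs, ys)
print(a * 5.0 + b)            # the line extrapolates along the trend
print(lagrange(xs, ys, 5.0))  # the interpolant can swing far off-trend
```

Both hypotheses are consistent with the data to varying degrees, yet they make very different predictions at x = 5; only knowledge about the target family tells you which to trust.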

CS-424 Gregory Dudek Inductive Learning Given a collection of examples (x,f(x)), return a function h that approximates f. h is called the hypothesis and is chosen from the hypothesis space. What if f is not in the hypothesis space?

CS-424 Gregory Dudek Inductive Bias: definition This "some idea of what to choose from" is called an inductive bias. Terminology:
–H, hypothesis space: a set of functions to choose from
–C, concept space: a set of possible functions to learn
Often in learning we search for a hypothesis f in H that is consistent with the training examples, i.e., f(x) = y for all training examples (x,y). In some cases, any hypothesis consistent with the training examples is likely to generalize to unseen examples. The trick is to find the right bias.
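
The search for a consistent hypothesis can be sketched directly. The toy hypothesis space and training examples below are our own invention, not from the slides; note that more than one hypothesis survives, which is exactly why a bias is needed to pick one.

```python
# Generic "find a consistent hypothesis" loop over a small finite
# hypothesis space H of boolean functions on integers.

examples = [(2, True), (4, True), (7, False)]   # (x, y) training pairs

H = {
    "even": lambda x: x % 2 == 0,
    "positive": lambda x: x > 0,
    "less_than_5": lambda x: x < 5,
}

def consistent(h, examples):
    """h is consistent iff h(x) == y for every training example."""
    return all(h(x) == y for x, y in examples)

survivors = [name for name, h in H.items() if consistent(h, examples)]
print(survivors)   # ['even', 'less_than_5']
```

Both "even" and "less_than_5" fit the data perfectly; the training examples alone cannot distinguish them.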

CS-424 Gregory Dudek Which hypothesis?

CS-424 Gregory Dudek Bias explanation How does a learning algorithm decide among hypotheses? Bias leads it to prefer one hypothesis over another. Two types of bias:
–preference bias (or search bias): depending on how the hypothesis space is explored, you get different answers
–restriction bias (or language bias): the "language" used: Java, FOL, etc. (h is not equal to c). E.g., language: piece-wise linear functions gives (b)/(d).

CS-424 Gregory Dudek Issues in selecting the bias Tradeoff (similar to one in reasoning): the more expressive the language, the harder it is to find (compute) a good hypothesis. Compare: propositional Horn clauses with first-order logic theories or Java programs. Also, a more expressive language often needs more examples.

CS-424 Gregory Dudek Occam’s Razor The most standard and intuitive preference bias: Occam’s Razor (aka Ockham’s Razor). The most likely hypothesis is the simplest one that is consistent with all of the observations. Named after William of Ockham.

CS-424 Gregory Dudek Implications The world is simple. The chances of an accidentally correct explanation are low for a simple theory.

CS-424 Gregory Dudek Example (simple vs. complex theories) Mel’s skills:
–Input data: Mel flunked his exam with a 25% mark.
–Simple case. Hypothesis space: Mel is smart, or Mel is dumb. In this case, we can “reliably” choose.
–Larger hypothesis space: Mel is dumb. Mel is dumb but knows how to cheat. Mel is dumb but his brother took the same course last year. Mel is dumb but the course is easy. Mel is smart. Mel is smart but was sick. Mel is smart but is taking 7 courses. Mel is smart but is very neurotic and self-destructive.

CS-424 Gregory Dudek Probably Approximately Correct (PAC) Learning Two important questions that we have yet to address: Where do the training examples come from? How do we test performance, i.e., are we doing a good job learning? PAC learning is one approach to dealing with these questions.

CS-424 Gregory Dudek Classifier example Consider learning the predicate Flies(Z) = {true, false}. We are assigning objects to one of two categories: recall we call this a classifier. Suppose that X = {pigeon, dodo, penguin, 747}, Y = {true, false}, and that
–Pr(pigeon) = 0.3, Flies(pigeon) = true
–Pr(dodo) = 0.1, Flies(dodo) = false
–Pr(penguin) = 0.2, Flies(penguin) = false
–Pr(747) = 0.4, Flies(747) = true
Pr is the distribution governing the presentation of training examples (how often we see such examples). We will use the same distribution for evaluation purposes.

CS-424 Gregory Dudek Note that if we mis-classified dodos but got everything else right, then we would still be doing pretty well in the sense that 90% of the time we would get the right answer. We formalize this as follows.

CS-424 Gregory Dudek The error associated with a hypothesis f is
error(f) = Σ_{x : f(x) ≠ Flies(x)} Pr(x)
We say that a hypothesis is approximately correct with error at most ε if error(f) ≤ ε.
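
This definition can be computed directly for the Flies example from the previous slide. The hypothesis f below is our own illustration: it misclassifies only the dodo, so its error is Pr(dodo) = 0.1.

```python
# error(f) = sum of Pr(x) over the x that f misclassifies,
# using the distribution and labels from the Flies slide.

Pr = {"pigeon": 0.3, "dodo": 0.1, "penguin": 0.2, "747": 0.4}
flies = {"pigeon": True, "dodo": False, "penguin": False, "747": True}

def error(f):
    return sum(p for x, p in Pr.items() if f(x) != flies[x])

# An illustrative hypothesis: everything except the penguin flies.
# It is wrong only on the dodo.
f = lambda x: x != "penguin"

print(error(f))   # 0.1: only Pr(dodo) contributes
```

So f is approximately correct for any ε ≥ 0.1, matching the observation that misclassifying dodos still gives the right answer 90% of the time.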

CS-424 Gregory Dudek The chances that a theory is correct increases with the number of consistent examples it predicts. Or…. A badly wrong theory will probably be uncovered after only a few tests.

CS-424 Gregory Dudek Epsilon-ball We can formalize this with the notion of an epsilon ball. Consider the hypothesis space: lay it out in R^n. Around the correct hypothesis, there is a group of PAC hypotheses with error < ε.
–These are in a “ball”: the epsilon-ball.

CS-424 Gregory Dudek PAC: definition Relax this requirement by not requiring that the learning program necessarily achieve a small error, but only that it keep the error small with high probability. A program is probably approximately correct (PAC) with probability δ and error at most ε if, given any set of training examples drawn according to the fixed distribution, the program outputs a hypothesis f such that
Pr(Error(f) > ε) < δ

CS-424 Gregory Dudek PAC Idea: Consider space of hypotheses. Divide these into “good” and “bad” sets. Want to assure that we can close in on the set of good hypotheses that are close approximations of the correct theory.

CS-424 Gregory Dudek PAC Training examples Theorem: If the number of hypotheses |H| is finite, then a program that returns a hypothesis that is consistent with
m ≥ ln(δ/|H|) / ln(1-ε)
training examples (drawn according to Pr) is guaranteed to be PAC with probability δ and error bounded by ε.

CS-424 Gregory Dudek PAC theorem: proof If f is not approximately correct, then Error(f) > ε, so the probability of f being correct on one example is < 1-ε, and the probability of f being correct on m examples is < (1-ε)^m. Suppose that H = {f, g}. The probability that f correctly classifies all m examples is < (1-ε)^m. The probability that g correctly classifies all m examples is < (1-ε)^m. The probability that one of f or g correctly classifies all m examples is < 2(1-ε)^m. To ensure that any hypothesis consistent with m training examples is correct with error at most ε with probability δ, we choose m so that 2(1-ε)^m < δ.

CS-424 Gregory Dudek Generalizing, there are |H| hypotheses in the restricted hypothesis space, and hence the probability that some (ε-bad) hypothesis in H correctly classifies all m examples is bounded by |H|(1-ε)^m. Solving for m in |H|(1-ε)^m < δ, we obtain m ≥ ln(δ/|H|)/ln(1-ε). QED
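
The bound just derived can be turned into a small calculator (a sketch; the numbers plugged in are our own, and the function name is made up):

```python
# Sample complexity from the PAC theorem: the smallest integer m
# with |H| * (1 - eps)**m < delta.
import math

def sample_complexity(H_size, eps, delta):
    # Both logs are negative, so the ratio is positive.
    return math.ceil(math.log(delta / H_size) / math.log(1 - eps))

m = sample_complexity(H_size=100, eps=0.1, delta=0.05)
print(m)   # 73 examples suffice for |H| = 100, eps = 0.1, delta = 0.05

# Check the guarantee of the theorem directly:
assert 100 * (1 - 0.1) ** m < 0.05
```

Note that m grows only logarithmically in |H| and 1/δ, which is why even fairly large finite hypothesis spaces remain learnable from modest samples.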

CS-424 Gregory Dudek Stationarity Key assumption of PAC learning: Past examples are drawn randomly from the same distribution as future examples: stationarity. The number m of examples required is called the sample complexity.

CS-424 Gregory Dudek A class of concepts C is said to be PAC learnable for a hypothesis space H if (roughly) there exists a polynomial-time algorithm such that: for any c in C, distribution Pr, ε, and δ, if the algorithm is given a number of training examples polynomial in 1/ε and 1/δ, then with probability 1-δ the algorithm will return a hypothesis f from H such that Error(f) ≤ ε.

CS-424 Gregory Dudek Overfitting Consider the error of hypothesis h over:
–the training data: error_train(h)
–the entire distribution D of data: error_D(h)
Hypothesis h ∈ H overfits the training data if
–there is an alternative hypothesis h' ∈ H such that error_train(h) < error_train(h') but error_D(h) > error_D(h')
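
The condition transcribes directly into code; the error numbers below are invented purely for illustration (a deep decision tree that memorizes the training set vs. a shallower one).

```python
# The slide's overfitting condition: h overfits if some alternative h'
# has *worse* training error but *better* error on the distribution D.

def overfits(h_errors, alternatives):
    """h_errors = (train_error, true_error); alternatives = list of same."""
    train_h, true_h = h_errors
    return any(train_h < train_alt and true_h > true_alt
               for train_alt, true_alt in alternatives)

deep_tree = (0.02, 0.25)      # fits the training data almost perfectly
shallow_tree = (0.10, 0.15)   # worse on training, better on D

print(overfits(deep_tree, [shallow_tree]))   # True: the deep tree overfits
```

The deep tree looks better on the training data, but the shallow tree wins on the distribution, which is precisely the definition above.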

CS-424 Gregory Dudek Learning Decision Trees A decision tree takes as input a set of properties and outputs yes/no “decisions”. Example goal predicate: WillWait.