Supervised Learning
Berlin Chen, 2005

Learning a class (discrete output) from its positive and negative examples; generalized to multiple classes and to regression (continuous output).

References:
1. E. Alpaydin, Introduction to Machine Learning, Chapter 2
2. Tom M. Mitchell, Machine Learning, Chapter 7
3. T. Hastie et al., The Elements of Statistical Learning, Chapter 2

Supervised Learning

Learning with a teacher: the training examples (instances) are equipped with (class) label information.

Simplest case
- Two-class problem: the training set is composed of positive examples (Class 1) and negative examples (Class 0)
- Output is discrete (symbolic)
- Use the learned machine to make predictions on examples not seen before (predictive machine learning)

More complicated cases
- Multi-class problem
- Regression or function approximation: output is numeric (continuous)

Learning a Class from Examples

Case 1: Credit scoring
- Examples (dataset): the bank's customer data
- Features (or attributes) of the customers: income, savings, collateral, profession, etc.
  (Collateral: an asset pledged as security for a loan)

Learning a Class from Examples (cont.)

Case 2: Car class ("family car")
- Examples (dataset): a set of cars
- Features (or attributes) of the cars: price, engine power, seat capacity, color, etc.

Hypothesis Class

- The learning algorithm should find a hypothesis h from a hypothesis class H that approximates the ideal class (concept) C as closely as possible
- h can make a prediction for any instance x
- In the "family car" case, a possible hypothesis class is the set of axis-aligned rectangles, defined by
  (p1 <= price <= p2) AND (e1 <= engine power <= e2)

Empirical Error
- The proportion of training instances where the predictions of h do not match the true class labels:
  E(h | X) = (1/N) * sum_{t=1..N} 1(h(x^t) != r^t)

(Figure: an axis-aligned rectangle hypothesis)
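
To make the rectangle hypothesis and the empirical error concrete, here is a minimal Python sketch; the bounds, feature names, and toy data are illustrative assumptions, not values from the slides:

```python
# A hypothesis from the axis-aligned rectangle class: predict "family car" (1)
# iff price and engine power both fall inside the chosen bounds.
def make_rectangle_hypothesis(p1, p2, e1, e2):
    def h(x):
        price, power = x
        return 1 if (p1 <= price <= p2) and (e1 <= power <= e2) else 0
    return h

# Empirical error: proportion of training instances whose label h gets wrong.
def empirical_error(h, X, r):
    return sum(h(x) != label for x, label in zip(X, r)) / len(X)

# Toy training set: (price, engine power) pairs with class labels.
X = [(15, 90), (22, 120), (30, 200), (8, 60)]
r = [1, 1, 0, 0]
h = make_rectangle_hypothesis(p1=10, p2=25, e1=80, e2=150)
print(empirical_error(h, X, r))  # 0.0, since this rectangle is consistent with the toy data
```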

Hypothesis Class (cont.)

Generalization
- How well the hypothesis h will correctly classify future (new) examples that are not part of the training set
- False alarm: a negative example classified as positive
- False rejection: a positive example classified as negative
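
The two error types above can be measured directly as rates; a small self-contained sketch with an invented 1-D hypothesis and toy labels (not from the slides):

```python
# False alarm: a negative example classified as positive.
# False rejection: a positive example classified as negative.
def error_rates(h, X, r):
    pos = [x for x, y in zip(X, r) if y == 1]
    neg = [x for x, y in zip(X, r) if y == 0]
    false_rejection = sum(h(x) == 0 for x in pos) / len(pos)
    false_alarm = sum(h(x) == 1 for x in neg) / len(neg)
    return false_alarm, false_rejection

# Toy 1-D interval hypothesis and labeled data (illustrative values only).
h = lambda x: 1 if 10 <= x <= 25 else 0
X = [5, 12, 18, 24, 30, 40]
r = [0, 1, 1, 1, 0, 1]
print(error_rates(h, X, r))  # false alarm 0.0, false rejection 0.25 (the positive at x = 40)
```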

Hypothesis Class (cont.)

Given hypotheses in the form of rectangles:

Most specific hypothesis S
- The tightest rectangle that includes all the positive examples and none of the negative examples

Most general hypothesis G
- The largest rectangle that includes all the positive examples and none of the negative examples

Version space (Mitchell, 1997)
- Any hypothesis h between S and G
- Consists of valid hypotheses with no error (consistent with the training set)
- To be discussed in more detail later on!
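
For the rectangle class, the most specific hypothesis S can be computed as the bounding box of the positive examples and then checked against the negatives; a small sketch under that assumption, with made-up data:

```python
# Most specific hypothesis S for axis-aligned rectangles: the tightest bounding
# box of the positive examples. It belongs to the version space only if no
# negative example falls inside it.
def most_specific_rectangle(positives):
    xs = [p[0] for p in positives]
    ys = [p[1] for p in positives]
    return (min(xs), max(xs), min(ys), max(ys))  # (x_lo, x_hi, y_lo, y_hi)

def inside(rect, point):
    x_lo, x_hi, y_lo, y_hi = rect
    return x_lo <= point[0] <= x_hi and y_lo <= point[1] <= y_hi

positives = [(12, 100), (20, 140), (16, 110)]
negatives = [(30, 200), (5, 50)]
S = most_specific_rectangle(positives)
print(S, not any(inside(S, n) for n in negatives))  # (12, 20, 100, 140) True
```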

Some Assessments

- The difficulty of machine learning problems
  E.g., inherent computational complexity
- The capability of various types of machine learning algorithms
  E.g., the number of mistakes that will be made
- Frameworks for such analysis, e.g.
  Vapnik-Chervonenkis (VC) Dimension
  Probably Approximately Correct (PAC) Learning

Shattering a Set of Instances

A dichotomy (two-class problem) on a dataset of N points
- There are 2^N ways for the points to be labeled as positive and negative
- Namely, 2^N labeling problems

Shattering for a dichotomy
- A hypothesis class H that can separate the positive examples from the negative ones for every such labeling shatters the N points

Shattering a Set of Instances (cont.)

A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

(Figure: eight dichotomies of three instances, each realized by an ellipse hypothesis over instance space X)

Vapnik-Chervonenkis (VC) Dimension

- The complexity of the hypothesis space H is not measured by the number of distinct hypotheses |H|, but instead by the size of the largest set of distinct instances S (|S| = N) from the instance space X that can be completely discriminated using H
- Each hypothesis imposes some dichotomy on S (S ⊆ X); there are 2^N possible dichotomies, and some of them may not be realizable
- The larger the subset S that can be shattered, the more expressive H is
- This yields a tight bound on sample complexity

VC Dimension (cont.)

Def: The VC dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H.
- If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞
- For any finite H, VC(H) ≤ log2 |H|
- If we can find any set of instances of size d that is shattered, then VC(H) ≥ d
- To show that VC(H) < d, we must show that no set of size d can be shattered

VC Dimension (cont.)

Case 1
- S: a set of real values
- H: the set of all intervals on the real number line (|H| ≡ ∞)
- A subset of S consisting of two instances can always be shattered, so VC(H) ≥ 2
- A subset of S consisting of three instances cannot be shattered in all cases (e.g., the labeling +, -, + cannot be realized by a single interval)
- Therefore VC(H) = 2
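
The claim for Case 1 can be checked by brute force: enumerate every labeling of the points and ask whether some single interval realizes it. A sketch of that check (my own helper functions):

```python
from itertools import product

# Can a single closed interval [a, b] realize a given labeling of the points?
# Trying endpoints drawn from the points plus two sentinels is sufficient here.
def interval_realizes(points, labels):
    candidates = sorted(points) + [min(points) - 1, max(points) + 1]
    return any(all((a <= x <= b) == bool(y) for x, y in zip(points, labels))
               for a in candidates for b in candidates)

def shattered(points):
    return all(interval_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered([1.0, 2.0]))       # True: all 4 labelings of two points work, so VC(H) >= 2
print(shattered([1.0, 2.0, 3.0]))  # False: the labeling +, -, + cannot be realized
```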

VC Dimension (cont.)

Case 2
- S: a set of points on the two-dimensional (x, y) plane
- H: the set of all linear decision surfaces (|H| ≡ ∞)
- A subset of S consisting of two instances can always be shattered, so VC(H) ≥ 2
- A subset of S consisting of three instances can be shattered as long as the points are not collinear, while no set of four points can be shattered
- Therefore VC(H) = 3

VC Dimension (cont.)

Case 3
- S: a set of points on the two-dimensional (x, y) plane
- H: the set of all axis-aligned rectangles (|H| ≡ ∞)
- A subset of S consisting of three instances can always be shattered, so VC(H) ≥ 3
- A suitably placed subset of four instances (one point extremal in each of the four axis directions) can also be shattered, so VC(H) ≥ 4
- We cannot dichotomize five points for all possible labelings: any rectangle containing the leftmost, rightmost, topmost, and bottommost points also contains the fifth
- Therefore VC(H) = 4

VC Dimension (cont.)

- The VC dimension may seem pessimistic
  E.g., it tells us that rectangle hypotheses can learn (shatter) only datasets of four 2-dimensional points
- The VC dimension does not take the probability distribution of the data samples into consideration
  Instances that are close by usually have the same label (color)
  Datasets containing many more than four points might therefore still be learnable

Probably Approximately Correct (PAC) Learning

(Figure: instance space X, the target concept C, the hypothesis h, and the region where C and h disagree)

- True error: the probability that h misclassifies an instance drawn at random from the underlying distribution (the probability mass of the region where C and h disagree)
- Training error: the proportion of training instances misclassified by h

PAC Learning (cont.)

Weakened demands
- The error made by the hypothesis is bounded by some constant ε
- The probability that the hypothesis fails to achieve that error bound is itself bounded by δ

PAC Learning (cont.) (Valiant, 1984)

Def: Given a class C and examples drawn from some unknown but fixed probability distribution p(x), we want to find the number of examples N such that, with probability at least 1 - δ, the hypothesis h has error at most ε, for arbitrary δ ≤ 1/2 and ε > 0:

  P( error(h) ≤ ε ) ≥ 1 - δ

The error measures how closely the hypothesis h approximates the actual target class (concept) C.

PAC Learning (cont.)

An illustrative case (Blumer et al., 1984)
- S: a set of points on the two-dimensional (x, y) plane
- H: the set of all axis-aligned rectangles
- h: the tightest rectangle hypothesis
- The region of difference between C and h can be covered by four strips, one along each side of C
- We want the probability of a positive example falling in the region where C and h disagree to be at most ε
- It suffices that the probability of falling in each of the four strips is less than ε/4

(Figure: the region of difference between C and h)

PAC Learning (cont.)

An illustrative case (cont.)
- The probability that N independent draws all miss (do not fall in) a given strip is (1 - ε/4)^N
- The probability that N independent draws miss any one of the four strips is at most 4 (1 - ε/4)^N
- We would like this to be at most δ, namely 4 (1 - ε/4)^N ≤ δ
- We also have that (1 - x) ≤ e^(-x), so it suffices that 4 e^(-εN/4) ≤ δ, i.e., N ≥ (4/ε) ln(4/δ)  (a loose bound)

PAC Learning (cont.)

An illustrative case (cont.)
- The number of examples N is a slowly growing function of 1/ε and 1/δ, linear and logarithmic respectively
- Provided that we take at least N ≥ (4/ε) ln(4/δ) independent examples and use the tightest rectangle as the hypothesis h, then with confidence (probability) at least 1 - δ a given point will be misclassified with error probability at most ε
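
A quick numeric check of the loose bound N ≥ (4/ε) ln(4/δ); the ε and δ values below are arbitrary examples:

```python
import math

# Loose PAC sample-size bound for the tightest axis-aligned rectangle:
# N >= (4 / epsilon) * ln(4 / delta)
def pac_sample_bound(epsilon, delta):
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))

# Halving epsilon roughly doubles N (linear in 1/epsilon), while shrinking
# delta by a factor of ten only adds a logarithmic amount.
for eps, delta in [(0.1, 0.05), (0.05, 0.05), (0.1, 0.005)]:
    print(eps, delta, pac_sample_bound(eps, delta))  # 176, 351, 268 examples
```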

Noise

Unwanted anomaly in the data, due to:
- Imprecision in recording the input attributes
- Errors in labeling the data points (teacher noise)
- Misuse of features, or hidden/latent features that are unobservable

(Figure: hypotheses with simple shapes vs. hypotheses with wiggly shapes)

Noise (cont.)

But simple models (like axis-aligned rectangles) are:
- Easy to use
- Defined by few parameters → only a small training set is needed
- Easy to explain (interpret)
- Less affected by single instances

Occam's Razor

- Simpler explanations are more plausible, and unnecessary complexity should be shaved off
- A simple model has more bias; a complex model has more variance
- Trade-off between bias and variance (to be discussed in more detail later on!)

Learning Multiple Classes Extend a two-class problem to a K-class problem

Learning Multiple Classes (cont.)

- A K-class classification problem can be viewed as K two-class problems
- For each class C_i, training examples are either positive (belonging to C_i) or negative (belonging to any other class)
- Build a hypothesis h_i for each class C_i
- What if no h_i, or two or more of them, output 1? This is the case of doubt: reject such cases or defer the decision to a human expert
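
A minimal one-vs-rest sketch of this K two-class view, with an explicit doubt case; the interval-shaped hypotheses and thresholds are placeholders for whatever two-class learner is actually used:

```python
# One-vs-rest: one two-class hypothesis h_i per class C_i.
# Predict class i if exactly one h_i outputs 1; otherwise signal doubt
# (reject, or defer the decision to a human expert).
def one_vs_rest_predict(hypotheses, x):
    votes = [i for i, h in enumerate(hypotheses) if h(x) == 1]
    return votes[0] if len(votes) == 1 else None  # None = doubt

# Toy 1-D example with three interval-shaped class regions (illustrative only).
hypotheses = [
    lambda x: 1 if 0 <= x < 10 else 0,   # h_0 for class 0
    lambda x: 1 if 10 <= x < 20 else 0,  # h_1 for class 1
    lambda x: 1 if 20 <= x < 30 else 0,  # h_2 for class 2
]
print(one_vs_rest_predict(hypotheses, 15))  # 1
print(one_vs_rest_predict(hypotheses, 45))  # None -> reject / defer
```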

Interpolation/Regression

To learn an unknown function whose output is numeric from a training set of examples
- Interpolation: when there is no error (noise), we want to find a function that passes through the training points
- Extrapolation: to predict the output for any input not seen in the training set

Regression
- There is noise added to the output of the unknown function f: r = f(x) + ε, where ε is random noise (which can be thought of as the effect of extra hidden variables)

Interpolation/Regression

Regression

- Given the noise ε arising from hidden variables, try to approximate the function output r^t by our model g(x^t)
- Empirical error on the training set X is defined as
  E(g | X) = (1/N) * sum_{t=1..N} [ r^t - g(x^t) ]^2
- E.g., if g is linear and x is one-dimensional:
  g(x) = w1 * x + w0
- Take the partial derivatives of E with respect to the parameters w1 and w0, set them equal to zero, and solve over the N examples:
  w1 = ( sum_t x^t r^t - N x̄ r̄ ) / ( sum_t (x^t)^2 - N x̄^2 ),   w0 = r̄ - w1 x̄
  where x̄ and r̄ are the sample means of the inputs and outputs
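
The closed-form solution above can be verified numerically; a short NumPy sketch in which the toy data and noise level are my own choices:

```python
import numpy as np

# Closed-form least squares for g(x) = w1 * x + w0, obtained by setting the
# partial derivatives of the empirical squared error to zero.
def fit_line(x, r):
    n, x_bar, r_bar = len(x), x.mean(), r.mean()
    w1 = ((x * r).sum() - n * x_bar * r_bar) / ((x ** 2).sum() - n * x_bar ** 2)
    w0 = r_bar - w1 * x_bar
    return w1, w0

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
r = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)  # noisy linear function
print(fit_line(x, r))  # close to the true slope 2.0 and intercept 1.0
```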

Regression (cont.)

(Derivation details: see Appendix A)

Regression (cont.)

Regression (cont.) Two-dimensional Inputs

Model Selection

An illustrative case
- Binary d-dimensional examples (examples with d binary inputs/features) and binary output values (0 and 1)
- There are 2^d possible input instances and 2^(2^d) possible Boolean functions (hypotheses); if d = 2, that is 4 instances and 16 hypotheses
- When more training examples are observed, more hypotheses inconsistent with them are removed
- Each distinct training example removes half of the remaining hypotheses

(Table: truth table of the inputs and the output)

Model Selection (cont.)

An illustrative case (cont.)
- Given N distinct d-dimensional training examples, there are 2^(2^d - N) possible functions remaining

Ill-posed problem
- The training data by itself is not sufficient to find a unique solution

Inductive bias
- The assumptions made to make learning possible (e.g., to find a unique solution given insufficient training data)
- Car classification: a hypothesis class of rectangles instead of wiggly shapes
- Regression: a linear hypothesis class instead of polynomial ones
- Model selection: how to choose the right (inductive) bias
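
The counting argument can be checked directly for d = 2 by enumerating all 2^(2^d) = 16 Boolean functions and discarding those inconsistent with the observed examples; the particular training pairs below are arbitrary:

```python
from itertools import product

d = 2
inputs = list(product([0, 1], repeat=d))                 # 2^d = 4 possible instances
hypotheses = list(product([0, 1], repeat=len(inputs)))   # 2^(2^d) = 16 Boolean functions
print(len(hypotheses))                                   # 16

# Each distinct observed (input, output) pair removes the hypotheses that disagree with it.
training = [((0, 0), 0), ((1, 1), 1)]
consistent = [h for h in hypotheses
              if all(h[inputs.index(x)] == y for x, y in training)]
print(len(consistent))  # 4 = 2^(2^d - N) remaining, with N = 2 distinct examples
```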

Model Selection (cont.)

Model Complexity

- We should match the complexity of the hypothesis with the complexity of the function underlying the training/test data
- Underfitting: the hypothesis is less complex than the function
  E.g., trying to fit a line to data sampled from a third-order polynomial
- Overfitting: the hypothesis is more complex than the function
  E.g., trying to fit a sixth-order polynomial to data sampled from a third-order polynomial
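
A small NumPy sketch of this contrast: data sampled from a third-order polynomial, then fit with degree 1 (too simple), degree 3 (matched), and degree 6 (too complex). The specific polynomial, noise level, and sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 20)
true_f = lambda t: t ** 3 - 0.5 * t                      # underlying third-order polynomial
r = true_f(x) + rng.normal(scale=0.05, size=x.shape)     # noisy training outputs

x_new = np.linspace(-1.0, 1.0, 200)                      # fresh points to probe generalization
for degree in (1, 3, 6):
    coeffs = np.polyfit(x, r, deg=degree)                # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x) - r) ** 2)
    gen_err = np.mean((np.polyval(coeffs, x_new) - true_f(x_new)) ** 2)
    print(degree, round(train_err, 4), round(gen_err, 4))
# Degree 1 underfits (large error everywhere); degree 6 pushes the training
# error below that of degree 3 but tends to do worse on the fresh points.
```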

Model Complexity (cont.)

Trade-off between three factors (the triple trade-off)
- The complexity (capacity) of the hypothesis class
- The amount of training data
- The generalization error on new samples

Model Complexity (cont.)

Data sets (for measuring generalization ability)
- Training set: used to fit the model
- Validation set (development): used to select the best model (complexity)
- Test set (evaluation): used to test the generalization ability
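
A minimal sketch of the three-way split just described; the 60/20/20 proportions and the toy data are arbitrary illustrative choices:

```python
import numpy as np

# Shuffle once, then carve out training, validation (development),
# and test (evaluation) portions.
def train_val_test_split(X, r, val_frac=0.2, test_frac=0.2, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test, n_val = int(len(X) * test_frac), int(len(X) * val_frac)
    test_idx, val_idx, train_idx = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train_idx], r[train_idx]), (X[val_idx], r[val_idx]), (X[test_idx], r[test_idx])

X = np.arange(100, dtype=float).reshape(100, 1)
r = (X[:, 0] > 50).astype(int)
train, val, test = train_val_test_split(X, r)
print(len(train[0]), len(val[0]), len(test[0]))  # 60 20 20
# Fit candidate models on the training set, pick the complexity that does best
# on the validation set, and report final performance on the untouched test set.
```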

Summary

Given a training set of labeled examples, supervised learning aims to build a good and useful approximation to the underlying labeling function, with three issues taken into account:

- Model to be employed
  Car classification: rectangles, with four coordinates making up the parameters
  Linear regression: a linear function, with slope and intercept as the parameters (intercept: where the line crosses the output axis)
- Loss function computing the difference between the desired output and the approximation
- Optimization procedure finding the parameters that minimize the approximation error

For this to work:
- Model: the hypothesis class should be large enough
- Loss function: there should be enough training data to pinpoint the correct hypothesis
- Optimization procedure: we need a good optimization method

Some Demos on Speech Technology
- AT&T LVCSR
- Microsoft MIPAD
- CMU Universal Speech Interface