1 Machine Learning CSE 681 CH2 - Supervised Learning

2 Learning a Class from Examples 2  Let us say we want to learn the class, C, of a “family car.”  We have a set of examples of cars, and we survey a group of people to whom we show these cars. The people look at the cars and label each one as “family car” or “not family car.”  A car may have many features, for example: year, make, model, color, seating capacity, price, engine power, type of transmission, miles/gallon, etc.  Based on expert knowledge or some other technique, we decide that the most important (relevant) features (attributes) separating a family car from other cars are the price and engine power.  This is called dimensionality reduction: there are many algorithms for dimensionality reduction (principal component analysis, factor analysis, vector quantization, mutual information, etc.).
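
A hedged sketch, purely for illustration (the features, data, and the use of PCA are assumptions, not from the slides), of reducing many car attributes down to two dimensions:

```python
# Hypothetical illustration only: project a many-feature car description
# onto 2 dimensions with PCA.
import numpy as np
from sklearn.decomposition import PCA

# Each row is one car: [year, seats, price (USD), engine power (cc), mpg]
cars = np.array([
    [2008, 5, 18000, 1600, 32],
    [2012, 2, 95000, 4200, 15],
    [2010, 7, 27000, 2200, 25],
    [2015, 5, 21000, 1800, 30],
], dtype=float)

pca = PCA(n_components=2)           # keep the 2 directions of largest variance
cars_2d = pca.fit_transform(cars)   # shape (4, 2): one 2-D point per car
print(cars_2d)
```

In the running example, however, the two dimensions are not computed; the expert simply selects price and engine power as the relevant attributes.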

3 Class Learning Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 3  Class learning is finding a description (model) that is shared by all positive examples and none of the negative examples (the same idea extends to multiple classes).  After finding a model, we can make a prediction: given a car that we have not seen before, we check it against the learned model and say whether it is a family car or not.

4 Input representation 4  Let us denote price as the first input attribute x₁ (e.g., in U.S. dollars) and engine power as the second attribute x₂ (e.g., engine volume in cubic centimeters). Thus we represent each car with two numeric values, x = (x₁, x₂), together with its label r (1 for a family car, 0 otherwise). [Table: example instances with columns x₁ (price), x₂ (engine power), and label r.]
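
A minimal sketch of this input representation in Python (the numeric values are made up, since the slide's table values are not recoverable):

```python
import numpy as np

# Each instance: x = (x1, x2) = (price in USD, engine power in cc),
# with label r = 1 for "family car" and r = 0 for "not family car".
X = np.array([
    [27000.0, 1800.0],   # hypothetical positive example
    [95000.0, 4200.0],   # hypothetical negative example
    [ 9000.0,  900.0],   # hypothetical negative example
])
r = np.array([1, 0, 0])
```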

5 Training set X Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 5

6 Learning a Class from Examples 6  After further discussions with the expert and analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should each lie in a certain range: (p₁ ≤ price ≤ p₂) AND (e₁ ≤ engine power ≤ e₂), for suitable values of p₁, p₂, e₁, e₂.  This equation (function) assumes class C to be a rectangle in the price-engine power space.
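
The range condition translates directly into code; a small sketch (the bounds below are illustrative values, not taken from the lecture):

```python
# Class C as an axis-aligned rectangle in the (price, engine power) plane.
P1, P2 = 15000.0, 30000.0   # assumed price range (USD)
E1, E2 = 1400.0, 2200.0     # assumed engine power range (cc)

def in_class_C(price, engine_power):
    """True if the car falls inside the rectangle defining class C."""
    return (P1 <= price <= P2) and (E1 <= engine_power <= E2)

print(in_class_C(27000.0, 1800.0))   # True: inside both ranges
print(in_class_C(95000.0, 4200.0))   # False: outside both ranges
```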

7 Class C 7

8 Hypothesis class H 8  Formally, the learner should choose in advance a set of predictors (functions). This set is called the hypothesis class and is denoted by H.  A hypothesis can be a well-known type of function: hyperplanes (straight lines in 2-D), circles, ellipses, rectangles, donut shapes, etc.  In our example, we assume that the hypothesis class is the set of axis-aligned rectangles.  The choice of hypothesis class is also called the inductive bias.  The learning algorithm then finds the particular hypothesis, h ∈ H, that approximates C as closely as possible.
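
In code, a hypothesis class can be pictured as a parameterized family of functions; a hedged sketch for the rectangle class of the example (the parameter values are illustrative):

```python
def make_rectangle_hypothesis(p1, p2, e1, e2):
    """Return one hypothesis h ∈ H: an axis-aligned rectangle classifier."""
    def h(price, engine_power):
        return 1 if (p1 <= price <= p2) and (e1 <= engine_power <= e2) else 0
    return h

# H is the set of all such rectangles; each choice of the four parameters
# picks out one particular hypothesis h ∈ H.
h = make_rectangle_hypothesis(15000.0, 30000.0, 1400.0, 2200.0)
print(h(27000.0, 1800.0))   # 1: this h predicts "family car"
```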

9 What's the right hypothesis class H? 9

10 Linearly separable data 10

11 Not linearly separable 11 Source: CS540

12 Quadratically separable 12 Source: CS540

13 Hypothesis class H 13 Source: CS540 Function Fitting (Curve Fitting)

14 Hypothesis h ∈ H 14  Each hypothesis h ∈ H is a function mapping from x to r. After deciding on H, the learner samples a training set S and uses a minimization rule to choose a predictor out of the hypothesis class.  The learner tries to choose a hypothesis h ∈ H that minimizes the error over the training set. By restricting the learner to choose a predictor from H, we bias it toward a particular set of predictors.  This preference is often called an inductive bias. Since H is chosen in advance, we refer to it as prior knowledge about the problem.  Though the expert defines this hypothesis class, the values of the parameters are not known; that is, though we choose H, we do not know which particular h ∈ H is equal, or closest, to the real class C.

15 Hypothesis for the example 15  Depending on the values of p₁, p₂, e₁, and e₂, there are many possible rectangles; each one is a hypothesis h ∈ H in the hypothesis class H.  Given a hypothesis class (rectangles in the example), the learning problem is just to find the four parameters that define h.  The aim is to find the h ∈ H that is as similar as possible to C. The hypothesis h makes a prediction for an instance x such that h(x) = 1 if h classifies x as a positive example (a family car) and h(x) = 0 if h classifies x as a negative example.

16 Hypothesis class H Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 16 [Figure: error of h — the regions where h and the true class C disagree.]

17 Empirical Error Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 17  In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x). What we have is the training set X, which is a small subset of the set of all possible x. The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X. The error of hypothesis h given the training set X is E(h | X) = Σ_{t=1}^{N} 1(h(x^t) ≠ r^t), where 1(a ≠ b) is 1 if a ≠ b and 0 if a = b.
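
A small sketch of computing this empirical error, reusing make_rectangle_hypothesis from the earlier sketch (the data are illustrative):

```python
def empirical_error(h, X, r):
    """Count of training instances on which h disagrees with the given label."""
    return sum(1 for (price, power), label in zip(X, r)
               if h(price, power) != label)

X = [(27000.0, 1800.0), (95000.0, 4200.0), (9000.0, 900.0)]   # (price, power)
r = [1, 0, 0]                                                 # labels

h = make_rectangle_hypothesis(15000.0, 30000.0, 1400.0, 2200.0)
print(empirical_error(h, X, r))   # 0: h is consistent with this training set
```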

18 Generalization 18  In our example, each rectangle with values (p₁, p₂, e₁, e₂) defines one hypothesis, h, from H.  We need to choose the best one; in other words, we need to find the values of these four parameters, given the training set, that include all the positive examples and none of the negative examples.  We can find infinitely many rectangles that are consistent with the training examples, i.e., whose error (loss) E is 0.  However, different hypotheses that are consistent with the training examples may behave differently on future examples that are not part of the training set.  Generalization is the problem of how well the learned classifier will classify future unseen examples. A good learned hypothesis will make fewer mistakes in the future.

19 Most Specific Hypothesis S 19  The most specific hypothesis, S, is the tightest rectangle that includes all the positive examples and none of the negative examples.  The most general hypothesis, G, is the largest axis-aligned rectangle we can draw that includes all positive examples and no negative examples.  Any hypothesis h ∈ H between S and G is a valid hypothesis with no errors, and is thus consistent with the training set.  All such hypotheses h make up the version space of hypotheses.
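
A hedged sketch of computing S from the positive examples (the data are illustrative; computing G would additionally require growing the rectangle until it nearly touches the closest negative examples):

```python
def most_specific_hypothesis(X, r):
    """Tightest axis-aligned rectangle (p1, p2, e1, e2) around the positive examples."""
    positives = [x for x, label in zip(X, r) if label == 1]
    prices = [price for price, _ in positives]
    powers = [power for _, power in positives]
    return min(prices), max(prices), min(powers), max(powers)

X = [(27000.0, 1800.0), (21000.0, 1600.0), (95000.0, 4200.0), (9000.0, 900.0)]
r = [1, 1, 0, 0]
print(most_specific_hypothesis(X, r))   # (21000.0, 27000.0, 1600.0, 1800.0)
```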

20 S, G, and the Version Space Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 20 [Figure: the most specific hypothesis S and the most general hypothesis G. Every h ∈ H between S and G is consistent with the training set; together they make up the version space (Mitchell, 1997).]

21 How to choose h Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 21  It seems intuitive to choose the h halfway between S and G, with the maximum margin.  For the error function to have its minimum at the h with maximum margin, we should use an error (loss) function that checks not only whether an instance is on the correct side of the boundary but also how far away it is.
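
A rough, hypothetical illustration of "halfway between S and G": averaging the corresponding bounds of the two rectangles gives a boundary roughly equidistant from the closest positives and negatives (an assumption made for illustration, not the procedure on the slide):

```python
def halfway_hypothesis(S, G):
    """Rectangle whose bounds lie midway between those of S and G."""
    return tuple((s + g) / 2.0 for s, g in zip(S, G))

S = (21000.0, 27000.0, 1600.0, 1800.0)   # tightest rectangle (illustrative)
G = (12000.0, 40000.0, 1100.0, 2600.0)   # largest consistent rectangle (illustrative)
print(halfway_hypothesis(S, G))          # (16500.0, 33500.0, 1350.0, 2200.0)
```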

22 Margin 22  Choose the h with the largest margin (the distance between the boundary of h and the nearest training instances on either side).

23 Supervised Learning Process 23  In the supervised learning problem, our goal is to learn a function h : x → r so that h(x) is a “good” predictor for the corresponding value of r.  For historical reasons, this function h is called a hypothesis.  Formally, we are given a training set X = {(x^t, r^t)}_{t=1}^{N} of N input-output pairs.
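
Putting the pieces together, a brute-force sketch of this process for the rectangle example: try a small grid of candidate rectangles and keep the one with the lowest empirical error (a stand-in for a real learning algorithm; the helpers and data come from the earlier sketches, and all values are illustrative):

```python
import itertools

# Candidate boundary values to try (a crude grid; a real learner is smarter).
price_bounds = [10000.0, 15000.0, 20000.0, 30000.0, 40000.0]
power_bounds = [1000.0, 1400.0, 1800.0, 2200.0, 2600.0]

def fit(X, r):
    """Choose the candidate rectangle h ∈ H with the smallest empirical error."""
    best_h, best_err = None, float("inf")
    for p1, p2 in itertools.combinations(price_bounds, 2):
        for e1, e2 in itertools.combinations(power_bounds, 2):
            h = make_rectangle_hypothesis(p1, p2, e1, e2)
            err = empirical_error(h, X, r)
            if err < best_err:
                best_h, best_err = h, err
    return best_h

h = fit(X, r)               # X, r as in the empirical-error sketch above
print(h(25000.0, 1700.0))   # prediction (1 or 0) for a new, unseen car
```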

24 Supervised Learning Process 24 [Diagram: training set → learning algorithm → hypothesis h; a new input x is passed to h, which outputs the predicted r.]

25 Supervised Learning 25  When r can take on only a small number of discrete values (such as “family car” or “not family car” in our example), we call it a classification problem.  When the target variable r that we’re trying to predict is continuous, such as temperature in weather prediction, we call the learning problem a regression problem (or a prediction problem in some data mining books).

