CH. 2: Supervised Learning

Presentation on theme: "CH. 2: Supervised Learning"— Presentation transcript:

1 CH. 2: Supervised Learning
Supervised learning (SL) learns an unknown mapping f from an input x to an output r, where the correct outputs are provided by a supervisor. Given a training set X = {(x^t, r^t)}, t = 1, ..., N, figure out f.

2 2.1 Learning from Examples
Example: Learn the class C of a “family car.” Car representation: x = [x1, x2]^T, where x1: price, x2: engine power. Given a training set X = {(x^t, r^t)}, t = 1, ..., N, where r^t = 1 if x^t is a family car (positive example) and r^t = 0 otherwise (negative example).

3 Through discussion with experts and analysis of the training data, a hypothesis class H for the car class C is defined as the set of axis-aligned rectangles: (p1 <= x1 <= p2) AND (e1 <= x2 <= e2), where p1, p2 bound the price and e1, e2 bound the engine power.

4 Problem: Given X and H, the learning algorithm A attempts to find an h ∈ H that minimizes the empirical error E(h|X) = sum over t of 1(h(x^t) ≠ r^t), where 1(·) equals 1 if its argument is true and 0 otherwise.
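A minimal sketch of this setup in Python (the toy data and rectangle bounds below are illustrative assumptions, not values from the text):

import numpy as np

def h(x, p1, p2, e1, e2):
    # Rectangle hypothesis: predict 1 (family car) iff price and
    # engine power both fall inside the bounds.
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

def empirical_error(X, r, p1, p2, e1, e2):
    # E(h|X): number of training examples the rectangle misclassifies.
    return sum(h(x, p1, p2, e1, e2) != label for x, label in zip(X, r))

# Toy training set: [price, engine power], label 1 = family car.
X = np.array([[14.0, 110.0], [16.0, 130.0], [30.0, 220.0], [8.0, 60.0]])
r = np.array([1, 1, 0, 0])
print(empirical_error(X, r, p1=10, p2=20, e1=90, e2=150))  # -> 0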

5 S, G, and Version Space
S: the most specific hypothesis, the tightest rectangle enclosing all positive examples and no negative ones. G: the most general hypothesis, the largest rectangle enclosing all positives and no negatives. Version space: all hypotheses h ∈ H between S and G, i.e., those consistent with the training set.

6 Choose the h with the largest margin, i.e., halfway between S and G, so that the decision boundary is as far as possible from the closest training instances.

7 2.2 Vapnik-Chervonenkis (VC) Dimension
N points can be labeled in 2^N ways as +/– (or 1/0); e.g., N = 3 gives 8 labelings. If for every such labeling there exists an h ∈ H that separates the + examples from the –, then H shatters the N points; e.g., the line class and the axis-aligned rectangle class can both shatter 3 points. VC(H): the maximum number of points that can be shattered by H, e.g., VC(line) = 3 and VC(axis-aligned rectangle) = 4.

8 Examples: i) The VC dimension of the “line” hypothesis class is 3 in 2D space, i.e., VC(line) = 3.

9 (ii) An axis-aligned rectangle shatters 4 points, i.e., VC(rectangle) = 4. (Only the rectangles covering two of the points are shown in the figure.) (iii) VC(triangle) = 7. Assignment (monotonicity property): show that for any two hypothesis classes H1 ⊆ H2, VC(H1) ≤ VC(H2).

10 2.3 Probably Approximately Correct (PAC) Learning – using the tightest rectangle as the hypothesis, i.e., h = S. C: actual class; h: induced class. The error region is the region of difference between C and h, i.e., the area between the two rectangles.
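Since S is simply the tightest axis-aligned rectangle around the positive examples, it can be computed directly; a minimal sketch, with made-up data:

import numpy as np

def tightest_rectangle(X, r):
    # S: the most specific hypothesis, i.e., the tightest axis-aligned
    # rectangle enclosing all positive examples.
    pos = X[r == 1]
    p1, e1 = pos.min(axis=0)   # lower bounds on price and engine power
    p2, e2 = pos.max(axis=0)   # upper bounds
    return p1, p2, e1, e2

X = np.array([[14.0, 110.0], [16.0, 130.0], [12.0, 100.0], [30.0, 220.0]])
r = np.array([1, 1, 1, 0])
print(tightest_rectangle(X, r))  # -> (12.0, 16.0, 100.0, 130.0)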

11 Problem: How many training examples N should we have, such that with probability at least 1 – δ, h has error at most ε? Mathematically, P(error(h) ≤ ε) ≥ 1 – δ, where error(h) is the probability mass of the region of difference between C and h.

12 The region of difference between C and h = S is the union of four rectangular strips. In order that the probability of a positive car falling in this region (i.e., an error) is at most ε, it suffices that each strip has probability at most ε/4. Probability that a random instance falls in one strip is upper bounded by ε/4. Probability that an instance misses (i.e., is correct on) a strip: at least 1 – ε/4. Probability that N independent instances all miss a strip: (1 – ε/4)^N. Probability that N instances miss any of the 4 strips: at most 4(1 – ε/4)^N. This probability should be at most δ.

13 Using the inequality 1 – x ≤ e^(–x): 4(1 – ε/4)^N ≤ 4e^(–εN/4) ≤ δ, which gives N ≥ (4/ε) ln(4/δ). So with at least (4/ε) ln(4/δ) training examples, h = S has error at most ε with probability at least 1 – δ.
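A quick check of the resulting bound; the sketch below just evaluates N ≥ (4/ε) ln(4/δ) for sample values of ε and δ:

import math

def pac_sample_size(eps, delta):
    # Smallest N satisfying N >= (4/eps) * ln(4/delta), which guarantees
    # 4 * (1 - eps/4)**N <= delta.
    return math.ceil((4 / eps) * math.log(4 / delta))

print(pac_sample_size(eps=0.1, delta=0.05))  # -> 176 examples suffice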

14 2.4 Noise and Model Complexity
Noise is due to error, imprecision, uncertainty, etc. Complicated hypotheses are generally necessary to fit noisy data exactly. However, simpler hypotheses make more sense because they are simple to use, easy to check, train, and explain, and they generalize better. Occam’s razor: simpler explanations are more plausible, and any unnecessary complexity should be shaved off.

15 Examples: 1) Classification 2) Regression

16 2.5 Learning Multiple Classes
Multiple classes C_i, i = 1, ..., K. Training set: X = {(x^t, r^t)}, where r_i^t = 1 if x^t ∈ C_i and 0 otherwise. Treat a K-class classification problem as K 2-class problems, i.e., train hypotheses h_i, i = 1, ..., K, with h_i(x^t) = 1 if x^t ∈ C_i and 0 otherwise. Total error: E = sum over t and i of 1(h_i(x^t) ≠ r_i^t).
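A sketch of this one-vs-rest reduction; the nearest-mean 2-class learner used here is an illustrative assumption, since the text leaves the 2-class learner unspecified:

import numpy as np

def fit_nearest_mean(X, y):
    # Toy 2-class learner (an assumption for illustration): predict 1 if x
    # is closer to the mean of the positive examples than to the negative mean.
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    return lambda x: int(np.linalg.norm(x - mu1) < np.linalg.norm(x - mu0))

def train_one_vs_rest(X, r, K):
    # Train K 2-class hypotheses h_i, with r_i^t = 1 iff x^t belongs to C_i.
    return [fit_nearest_mean(X, (r == i).astype(int)) for i in range(K)]

def total_error(hs, X, r):
    # E = sum over t and i of 1(h_i(x^t) != r_i^t).
    return sum(int(h(x) != int(lab == i))
               for x, lab in zip(X, r) for i, h in enumerate(hs))

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [9., 0.], [10., 1.]])
r = np.array([0, 0, 1, 1, 2, 2])
hs = train_one_vs_rest(X, r, K=3)
print(total_error(hs, X, r))  # -> 0 on this toy set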

17 2.6 Regression
Different from classification problems, whose outputs are Boolean values (yes/no), the outputs of regression problems are numeric values. Training set: X = {(x^t, r^t)}, r^t ∈ R, with r^t = f(x^t) + ε for some unknown f. Find g, the estimate of f, that minimizes the total (empirical) error E(g|X) = (1/N) sum over t of [r^t – g(x^t)]^2. For a linear model: g(x) = w1 x + w0.

18 Taking the partial derivatives of E with respect to w1 and w0 and setting them to zero gives two equations: sum_t r^t = N w0 + w1 sum_t x^t ... (1); sum_t r^t x^t = w0 sum_t x^t + w1 sum_t (x^t)^2 ... (2)

19 Solve (1) and (2) for w1 and w0: w1 = (sum_t x^t r^t – N x̄ r̄) / (sum_t (x^t)^2 – N x̄^2) and w0 = r̄ – w1 x̄, where x̄ = (1/N) sum_t x^t and r̄ = (1/N) sum_t r^t.
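These closed-form expressions translate directly into code; a sketch with synthetic data:

import numpy as np

def fit_line(x, r):
    # Solve normal equations (1) and (2) for the linear model g(x) = w1*x + w0.
    N = len(x)
    xbar, rbar = x.mean(), r.mean()
    w1 = (np.sum(x * r) - N * xbar * rbar) / (np.sum(x ** 2) - N * xbar ** 2)
    w0 = rbar - w1 * xbar
    return w1, w0

x = np.array([1.0, 2.0, 3.0, 4.0])
r = np.array([2.1, 3.9, 6.2, 7.8])   # roughly r = 2x
print(fit_line(x, r))  # -> (1.94, 0.15)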


21 2.7 Model Selection
Example: Learning a Boolean function from examples. With d binary inputs there are 2^d possible input combinations and 2^(2^d) candidate Boolean functions; e.g., d = 2 gives 4 input rows and 16 possible hypotheses.

22 Each training example removes half the remaining hypotheses; e.g., seeing an input whose output is 0 removes every hypothesis that outputs 1 on that input. This illustrates that learning starts with all possible hypotheses, and as more examples are seen, the inconsistent hypotheses are removed. Ill-posed problem: the training examples alone are not sufficient to lead to a unique solution.
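A sketch of this elimination process for d = 2 binary inputs (the training examples below are hypothetical):

from itertools import product

# All 2**(2**d) = 16 Boolean functions of d = 2 inputs, each represented by
# its output tuple over the 4 input rows (0,0), (0,1), (1,0), (1,1).
rows = list(product([0, 1], repeat=2))
hypotheses = list(product([0, 1], repeat=len(rows)))

# Hypothetical training examples: (input, correct output).
examples = [((0, 1), 0), ((1, 1), 1)]

for x, output in examples:
    i = rows.index(x)
    hypotheses = [h for h in hypotheses if h[i] == output]  # drop inconsistent h
    print(len(hypotheses))  # 8, then 4: each example halves the set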

23 Inductive bias: extra assumptions or restrictions on the hypothesis class, introduced to make learning possible. Model selection: choosing a good inductive bias, i.e., deciding between hypothesis classes. Generalization: how well a model trained on the training set predicts the right output for new instances. Underfitting: hypothesis class H is less complex than the true class C, e.g., fitting a line to data sampled from a 3rd-order polynomial.

24 Overfitting: hypothesis class H is more complex than the true class C, e.g., fitting a 3rd-order polynomial to data sampled from a line. Triple trade-off: a trade-off between three factors: 1. the size of the training set, N; 2. the complexity (or capacity) of H, c(H); 3. the generalization error, E, on new data. As N increases, E decreases; as c(H) increases, E first decreases and then increases.
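A small synthetic illustration of overfitting, comparing a line and a 3rd-order polynomial fit to noisy data sampled from a line (numpy’s polyfit stands in for the learner; the data is made up):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
r = 2 * x + rng.normal(scale=0.1, size=x.size)  # data sampled from a line + noise

for deg in (1, 3):                     # line vs. 3rd-order polynomial
    w = np.polyfit(x, r, deg)
    train_err = np.mean((r - np.polyval(w, x)) ** 2)
    print(deg, train_err)              # the cubic fits the noise more closely

The cubic achieves lower training error but only by chasing the noise, so its error on new data will tend to be worse.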

25 Cross-validation
Data are divided into: i) a training set, ii) a validation set, and iii) a publication (test) set. Training set: for inducing a hypothesis. Validation set: for testing the generalization ability of the induced hypothesis. Publication (test) set: for providing the expected error of the best hypothesis.
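A sketch of such a three-way split; the 50/25/25 fractions are an arbitrary assumption, not prescribed by the text:

import numpy as np

def three_way_split(X, r, f_train=0.5, f_val=0.25, seed=0):
    # Shuffle, then divide into training, validation, and publication (test)
    # sets according to the given fractions.
    idx = np.random.default_rng(seed).permutation(len(X))
    n_tr, n_val = int(f_train * len(X)), int(f_val * len(X))
    tr, val, te = np.split(idx, [n_tr, n_tr + n_val])
    return (X[tr], r[tr]), (X[val], r[val]), (X[te], r[te])

X, r = np.arange(40).reshape(20, 2), np.arange(20)
train, val, test = three_way_split(X, r)  # 10 / 5 / 5 examples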

26 2.8 Dimensions of a Supervised Learning Algorithm
Training sample: X = {(x^t, r^t)}, t = 1, ..., N. 1. Model: g(x|θ), which defines the hypothesis class H; θ are the parameters. 2. Loss function: E(θ|X) = sum over t of L(r^t, g(x^t|θ)). 3. Optimization procedure: θ* = arg min over θ of E(θ|X).
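These three dimensions can be made concrete for the linear model of Section 2.6; in the sketch below, gradient descent is one possible choice of optimization procedure, not one prescribed by the text:

import numpy as np

def g(x, theta):                 # 1. Model: g(x|theta) = w1*x + w0
    w1, w0 = theta
    return w1 * x + w0

def E(theta, x, r):              # 2. Loss: sum_t (r^t - g(x^t|theta))^2
    return np.sum((r - g(x, theta)) ** 2)

def optimize(x, r, lr=0.01, steps=2000):
    # 3. Optimization: theta* = argmin_theta E(theta|X), by gradient descent.
    theta = np.zeros(2)
    for _ in range(steps):
        err = r - g(x, theta)
        theta -= lr * np.array([-2 * np.sum(err * x), -2 * np.sum(err)])
    return theta

x = np.array([1.0, 2.0, 3.0, 4.0])
r = np.array([2.1, 3.9, 6.2, 7.8])
print(optimize(x, r))  # converges to roughly the (w1, w0) of Section 2.6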


