CH. 2: Supervised Learning
Supervised learning (SL) learns an unknown mapping f from inputs to outputs, where the correct output for each training input is provided by a supervisor. Given a training set $X = \{x^t, r^t\}_{t=1}^N$, figure out f.
2.1 Learning from Examples
Example: learn the class C of a "family car". Car representation: $\mathbf{x} = [x_1, x_2]^T$, where $x_1$ is the price and $x_2$ is the engine power. Given a training set $X = \{\mathbf{x}^t, r^t\}_{t=1}^N$, where $r^t = 1$ if $\mathbf{x}^t$ is a positive example (a family car) and $r^t = 0$ if it is a negative example.
Through discussion with experts and analysis of the training data, a hypothesis class H for the car class C is defined as the set of axis-aligned rectangles $(p_1 \le x_1 \le p_2)$ AND $(e_1 \le x_2 \le e_2)$, with price bounds $p_1, p_2$ and engine-power bounds $e_1, e_2$.
Problem: Given X and H, the learning algorithm A attempts to find an $h \in H$ that minimizes the empirical error $E(h|X) = \sum_{t=1}^N 1(h(\mathbf{x}^t) \ne r^t)$, where $1(\cdot)$ is 1 if its argument is true and 0 otherwise.
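A minimal Python sketch of this setup (not from the slides; the rectangle bounds and the toy data are invented for illustration):

    # Axis-aligned rectangle hypothesis: h(x) = 1 iff x is inside the rectangle.
    def h(x, p1, p2, e1, e2):
        price, power = x
        return 1 if (p1 <= price <= p2) and (e1 <= power <= e2) else 0

    # Empirical error E(h|X): number of training examples h labels incorrectly.
    def empirical_error(X, R, p1, p2, e1, e2):
        return sum(1 for x, r in zip(X, R) if h(x, p1, p2, e1, e2) != r)

    # Toy training set: (price, engine power) pairs with labels r^t (invented).
    X = [(15.0, 1.6), (22.0, 2.0), (40.0, 3.5), (9.0, 1.0)]
    R = [1, 1, 0, 0]
    print(empirical_error(X, R, p1=12, p2=30, e1=1.4, e2=2.4))  # -> 0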
S, G, and Version Space
S: the most specific hypothesis consistent with X, i.e., the tightest rectangle covering the positive examples. G: the most general consistent hypothesis, i.e., the largest rectangle covering the positives and none of the negatives. Version space: all hypotheses between S and G.
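As a sketch, S can be computed as the bounding box of the positive examples (same toy data as above):

    # Most specific hypothesis S: the tightest rectangle covering all positives.
    def most_specific(X, R):
        pos = [x for x, r in zip(X, R) if r == 1]
        prices = [p for p, _ in pos]
        powers = [e for _, e in pos]
        return min(prices), max(prices), min(powers), max(powers)

    X = [(15.0, 1.6), (22.0, 2.0), (40.0, 3.5), (9.0, 1.0)]
    R = [1, 1, 0, 0]
    print(most_specific(X, R))  # -> (15.0, 22.0, 1.6, 2.0)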
Choose the h with the largest margin, i.e., the distance between the decision boundary and the training instances closest to it on either side.
2.2 Vapnik-Chervonenkis (VC) Dimension
N points can be labeled in $2^N$ ways as +/− (or 1/0), e.g., N = 3. If for every such labeling there exists an $h \in H$ that separates the + from the − examples, H shatters the N points; e.g., the line class and the rectangle class can both shatter 3 points. VC(H): the maximum number of points that can be shattered by H, e.g., VC(line) = 3 and VC(axis-aligned rectangle) = 4.
Examples: (i) The VC dimension of the "line" hypothesis class is 3 in 2D space, i.e., VC(line) = 3.
(ii) An axis-aligned rectangle shatters 4 points, i.e., VC(rectangle) = 4 (only the rectangles covering two points are shown in the figure). (iii) VC(triangle) = 7. Assignment (monotonicity property): show that for any two hypothesis classes $H_1 \subseteq H_2$, $VC(H_1) \le VC(H_2)$.
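A brute-force shattering check for the rectangle class, as a sketch (not from the slides): a labeling is realizable by an axis-aligned rectangle iff the bounding box of the + points contains no − point. The four-point diamond layout below is one arrangement that is shattered.

    from itertools import product

    def rectangles_shatter(points):
        # Try every +/- labeling of the points.
        for labels in product([0, 1], repeat=len(points)):
            pos = [p for p, l in zip(points, labels) if l == 1]
            neg = [p for p, l in zip(points, labels) if l == 0]
            if not pos:
                continue  # an empty rectangle realizes the all-negative labeling
            x_lo = min(x for x, _ in pos); x_hi = max(x for x, _ in pos)
            y_lo = min(y for _, y in pos); y_hi = max(y for _, y in pos)
            # If a negative point lies in the bounding box, no rectangle works.
            if any(x_lo <= x <= x_hi and y_lo <= y <= y_hi for x, y in neg):
                return False
        return True

    # Four points in a diamond arrangement are shattered: VC(rectangle) = 4.
    print(rectangles_shatter([(0, 1), (1, 0), (2, 1), (1, 2)]))  # -> True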
2.3 Probably Approximately Correct (PAC) Learning – using the tightest rectangle as the hypothesis, i.e., h = S. C: the actual class; h: the induced class. The error region $C \Delta h$ is the region of difference between C and h, consisting of four strips between the boundaries of h and C.
Problem: How many training examples N should we have such that, with probability at least $1 - \delta$, h has error at most $\epsilon$? Mathematically, $P(C \Delta h \le \epsilon) \ge 1 - \delta$, where $C \Delta h$ is the region of difference between C and h.
In order that the probability of a positive car falling in $C \Delta h$ (i.e., an error) is at most $\epsilon$, it suffices that each of the 4 strips has probability at most $\epsilon/4$. The probability that a random instance falls in a strip is then upper bounded by $\epsilon/4$, so the probability that an instance misses a strip (i.e., is handled correctly) is at least $1 - \epsilon/4$. The probability that all N instances miss one given strip is at most $(1 - \epsilon/4)^N$, and the probability that the N instances miss any of the 4 strips is at most $4(1 - \epsilon/4)^N$. This probability should be at most $\delta$.
Using the inequality $(1 - x) \le e^{-x}$: $4(1 - \epsilon/4)^N \le 4e^{-N\epsilon/4} \le \delta$ holds when $N \ge (4/\epsilon)\ln(4/\delta)$. So, provided we take at least $(4/\epsilon)\ln(4/\delta)$ independent examples and use the tightest rectangle h = S, with probability at least $1 - \delta$ the error of h is at most $\epsilon$.
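A small sketch computing this bound:

    from math import ceil, log

    # Smallest N with N >= (4/epsilon) * ln(4/delta).
    def pac_sample_size(epsilon, delta):
        return ceil((4 / epsilon) * log(4 / delta))

    # Error at most 0.1 with probability at least 0.95:
    print(pac_sample_size(epsilon=0.1, delta=0.05))  # -> 176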
2.4 Noise and Model Complexity
Noise is due to error, imprecision, uncertainty, etc. Complicated hypotheses are generally necessary to cope with noise. However, simpler hypotheses often make more sense because they are simple to use, easy to check, train, and explain, and they generalize well. Occam's razor: simpler explanations are more plausible, and any unnecessary complexity should be shaved off.
Examples: 1) classification; 2) regression.
2.5 Learning Multiple Classes
Multiple classes $C_i$, $i = 1, \ldots, K$. Training set: $X = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^N$, where $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and $r_i^t = 0$ otherwise. Treat a K-class classification problem as K two-class problems, i.e., train K hypotheses $h_i$, $i = 1, \ldots, K$, where $h_i(\mathbf{x}^t) = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise. Total error: $E(\{h_i\}_{i=1}^K | X) = \sum_{t=1}^N \sum_{i=1}^K 1(h_i(\mathbf{x}^t) \ne r_i^t)$.
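A minimal sketch of this K two-class reduction; train_two_class is a hypothetical stand-in for any base learner (e.g., the rectangle learner sketched earlier), and the demo learner below is invented:

    # Train one hypothesis h_i per class: positives are the examples of C_i.
    def one_vs_rest(X, R, K, train_two_class):
        return [train_two_class(X, [1 if r == i else 0 for r in R])
                for i in range(K)]

    # Total error: disagreements summed over examples and classes.
    def total_error(hs, X, R):
        return sum(h(x) != (1 if r == i else 0)
                   for x, r in zip(X, R)
                   for i, h in enumerate(hs))

    # Trivial demo learner: predicts the majority label, ignoring x.
    def majority(X2, R2):
        label = 1 if 2 * sum(R2) >= len(R2) else 0
        return lambda x: label

    X = [0, 1, 2, 3]; R = [0, 0, 1, 2]   # class indices as labels
    hs = one_vs_rest(X, R, K=3, train_two_class=majority)
    print(total_error(hs, X, R))         # -> 4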
2.6 Regression
Different from classification problems, whose outputs are Boolean values (yes/no), the outputs of regression problems are numeric values. Training set: $X = \{x^t, r^t\}_{t=1}^N$ with $r^t \in \mathbb{R}$. Find f such that $r^t = f(x^t) + \varepsilon$, where $\varepsilon$ is random noise, and let g be the estimate of f. Total error: $E(g|X) = \frac{1}{N}\sum_{t=1}^N [r^t - g(x^t)]^2$. For the linear model $g(x) = w_1 x + w_0$: $E(w_1, w_0|X) = \frac{1}{N}\sum_{t=1}^N [r^t - (w_1 x^t + w_0)]^2$.
Taking the partial derivatives of E with respect to $w_1$ and $w_0$ and setting them to zero gives $\partial E/\partial w_1 = 0$ (1) and $\partial E/\partial w_0 = 0$ (2). Solve (1) and (2) for $w_1$ and $w_0$:
$$w_1 = \frac{\sum_t x^t r^t - N\bar{x}\bar{r}}{\sum_t (x^t)^2 - N\bar{x}^2}, \qquad w_0 = \bar{r} - w_1 \bar{x},$$
where $\bar{x} = \frac{1}{N}\sum_t x^t$ and $\bar{r} = \frac{1}{N}\sum_t r^t$.
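A sketch of this closed-form solution in plain Python (toy data invented; the points below lie exactly on r = 2x + 1):

    def fit_line(xs, rs):
        N = len(xs)
        x_bar = sum(xs) / N
        r_bar = sum(rs) / N
        # w1 and w0 from the normal equations (1) and (2).
        w1 = ((sum(x * r for x, r in zip(xs, rs)) - N * x_bar * r_bar)
              / (sum(x * x for x in xs) - N * x_bar ** 2))
        w0 = r_bar - w1 * x_bar
        return w1, w0

    print(fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # -> (2.0, 1.0)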
2.7 Model Selection
Example: learning a Boolean function from examples. With d binary inputs there are $2^d$ possible input instances and $2^{2^d}$ possible Boolean functions (hypotheses).
Each training example removes half of the remaining hypotheses; e.g., if a training example has output 0, every hypothesis whose output for that input is 1 is removed. This illustrates that learning starts with all possible hypotheses and, as more examples are seen, the inconsistent hypotheses are removed. Ill-posed problem: the training examples alone are not sufficient to lead to a unique solution.
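A sketch of this elimination process for d = 2 binary inputs (the example input/output pair is invented, since the slide leaves it unspecified):

    from itertools import product

    d = 2
    inputs = list(product([0, 1], repeat=d))   # the 2^d possible instances
    # Each hypothesis maps every input to an output: 2^(2^d) = 16 in total.
    hypotheses = [dict(zip(inputs, outs))
                  for outs in product([0, 1], repeat=len(inputs))]

    examples = [((0, 1), 0)]                   # one (input, output) pair
    for x, r in examples:
        hypotheses = [h for h in hypotheses if h[x] == r]

    print(len(hypotheses))  # -> 8: the example removed half of the 16 hypotheses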
Inductive bias: extra assumptions or restrictions on the hypothesis class may be introduced to make learning possible. Model selection: choosing a good inductive bias, or deciding between hypotheses. Generalization: how well a model trained on the training set predicts the right output for new instances. Underfitting: the hypothesis class H is less complex than the true class C, e.g., fitting a line to data sampled from a third-order polynomial.
Overfitting: the hypothesis class H is more complex than the true class C, e.g., fitting a third-order polynomial to data sampled from a line. Triple trade-off: a trade-off among three factors: 1. the size of the training set, N; 2. the complexity (or capacity) of H, c(H); 3. the generalization error, E, on new data.
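An illustrative sketch of this trade-off, assuming NumPy is available (all values invented): fit polynomials of orders 1 and 3 to noisy data sampled from a line and compare their errors on new instances.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    r = 2 * x + 1 + rng.normal(0, 0.1, size=x.shape)  # line + noise

    x_new = np.linspace(0, 1, 100)    # new instances
    r_new = 2 * x_new + 1             # true noise-free outputs
    for order in (1, 3):
        w = np.polyfit(x, r, order)   # least-squares polynomial fit
        err = np.mean((np.polyval(w, x_new) - r_new) ** 2)
        print(f"order {order}: error on new data = {err:.5f}")

The matched model (order 1) typically generalizes better than the order-3 fit, which has enough capacity to fit the noise.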
Cross-validation
The data are divided into: i) a training set, ii) a validation set, and iii) a publication (test) set. Training set: used to induce a hypothesis. Validation set: used to test the generalization ability of the induced hypothesis. Publication (test) set: used to estimate the expected error of the best hypothesis.
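A sketch of the three-way split (the 50/25/25 ratios are a common choice, not specified in the slides):

    import random

    def split(data, train=0.5, val=0.25, seed=0):
        data = data[:]                     # copy, then shuffle
        random.Random(seed).shuffle(data)
        n_train = int(train * len(data))
        n_val = int(val * len(data))
        return (data[:n_train],                    # training set
                data[n_train:n_train + n_val],     # validation set
                data[n_train + n_val:])            # publication (test) set

    train_set, val_set, test_set = split(list(range(100)))
    print(len(train_set), len(val_set), len(test_set))  # -> 50 25 25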
2.8 Dimensions of a Supervised Learning Algorithm
Training sample: $X = \{\mathbf{x}^t, r^t\}_{t=1}^N$.
1. Model: $g(\mathbf{x}|\theta)$, which defines the hypothesis class H, with parameters $\theta$.
2. Loss function: $E(\theta|X) = \sum_t L(r^t, g(\mathbf{x}^t|\theta))$.
3. Optimization procedure: $\theta^* = \arg\min_\theta E(\theta|X)$.
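A sketch instantiating the three dimensions for the linear model $g(x|\theta) = w_1 x + w_0$ with squared loss; gradient descent is used here as one possible optimization procedure (the slides do not fix one):

    def g(x, theta):                    # 1. model: defines the hypothesis class
        w1, w0 = theta
        return w1 * x + w0

    def E(theta, X, R):                 # 2. loss: sum of squared errors
        return sum((r - g(x, theta)) ** 2 for x, r in zip(X, R))

    def optimize(X, R, lr=0.01, steps=2000):   # 3. optimization: argmin E
        w1, w0 = 0.0, 0.0
        for _ in range(steps):
            dw1 = sum(-2 * x * (r - g(x, (w1, w0))) for x, r in zip(X, R))
            dw0 = sum(-2 * (r - g(x, (w1, w0))) for x, r in zip(X, R))
            w1, w0 = w1 - lr * dw1, w0 - lr * dw0
        return w1, w0

    print(optimize([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # ~ (2.0, 1.0)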