2 Supervised learning for two classes
We are given n training samples (x_i, y_i) for i = 1..n, drawn i.i.d. from a probability distribution P(x, y). Each x_i is a d-dimensional vector (x_i in R^d) and y_i is +1 or -1. Our problem is to learn a function f(x) for predicting the labels of test samples x_i' in R^d for i = 1..n', also drawn i.i.d. from P(x, y).
3 Loss function
Loss function: c(x, y, f(x)), which maps to [0, inf).
Examples: the squared loss c(x, y, f(x)) = (y - f(x))^2, and the 0/1 loss, which is 0 if f(x) = y and 1 otherwise.
4 Test error
We quantify the test error as the expected error on the test set (in other words, the average test error). In the case of two classes:
R_test(f) = (1/n') Σ_{i=1..n'} Σ_{y in {+1,-1}} c(x_i', y, f(x_i')) P(y | x_i')
We'd like to find f that minimizes this, but we need P(y|x), which we don't have access to.
5 Expected risk
Suppose we didn't have test data (x'). Then we average the test error over all possible data points x:
R(f) = ∫ c(x, y, f(x)) dP(x, y)
We want to find f that minimizes this, but we don't have all data points. We only have training data.
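Although the expected risk cannot be computed from training data alone, it helps to see what it measures. In the sketch below, a toy distribution P(x, y), a 0/1 loss, and a candidate predictor f (all illustrative assumptions, not from the slides) let us approximate the integral by a Monte Carlo average over samples drawn from P(x, y):

```python
import numpy as np

# Illustrative toy distribution: x ~ N(0, 1), label = sign(x + noise).
rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(size=N)
y = np.where(x + rng.normal(size=N) > 0, 1.0, -1.0)
f = np.sign  # candidate predictor f(x) = sign(x)

# Expected risk under the 0/1 loss, approximated by averaging over samples
expected_risk = np.mean(f(x) != y)
print(expected_risk)  # close to arccos(1/sqrt(2))/pi = 0.25 for this toy model
```

In practice the distribution is unknown, which is exactly why the slides move on to the empirical risk.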
6 Empirical risk
Since we only have training data, we can't calculate the expected risk (we don't even know P(x, y)).
Solution: we approximate P(x, y) with the empirical distribution
p_emp(x, y) = (1/n) Σ_{i=1..n} δ_{x_i}(x) δ_{y_i}(y)
The delta function δ_x(y) = 1 if x = y and 0 otherwise.
7 Empirical risk
We can now define the empirical risk as
R_emp(f) = (1/n) Σ_{i=1..n} c(x_i, y_i, f(x_i))
Once the loss function is defined and the training data are given, we can find f that minimizes this.
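As a concrete sketch, the empirical risk is just the average loss over the n training pairs. The 0/1 loss and the toy predictor below are illustrative choices, not fixed by the slides:

```python
import numpy as np

def empirical_risk(X, y, f, loss):
    """Average loss of predictor f over the training pairs (x_i, y_i)."""
    return np.mean([loss(xi, yi, f(xi)) for xi, yi in zip(X, y)])

# 0/1 loss: 0 if the predicted sign matches the label, 1 otherwise
zero_one = lambda x, y, fx: 0.0 if np.sign(fx) == y else 1.0
f = lambda x: x.sum()  # toy linear predictor f(x) = 1^T x

X = np.array([[1.0, 2.0], [-1.0, -3.0], [0.5, 0.5]])
y = np.array([1.0, -1.0, -1.0])
risk = empirical_risk(X, y, f, zero_one)  # misclassifies one of three -> 1/3
```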
8 Example of minimizing empirical risk (least squares)
Suppose we are given n data points (x_i, y_i), where each x_i in R^d and y_i in R. We want to determine a linear function f(x) = a·x + b for predicting test points.
Loss function: c(x_i, y_i, f(x_i)) = (y_i - f(x_i))^2
What is the empirical risk?
9 Empirical risk for least squares
R_emp(a, b) = (1/n) Σ_{i=1..n} (y_i - a·x_i - b)^2
Now finding f has reduced to finding a and b. Since this function is convex in a and b, we know there is a global optimum, which is easy to find by setting the first derivatives to 0.
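Setting the first derivatives of R_emp(a, b) to zero gives the normal equations, which a least-squares solver handles directly. A minimal sketch on synthetic data (the data-generating values here are illustrative assumptions, not from the slides):

```python
import numpy as np

# Synthetic data: y = a_true . x + b_true + small noise (illustrative values)
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
a_true = np.array([1.0, -2.0, 0.5])
y = X @ a_true + 0.7 + 0.1 * rng.normal(size=n)

# Append a column of ones so the intercept b is learned jointly with a
X1 = np.hstack([X, np.ones((n, 1))])

# Zeroing the gradient of R_emp yields the normal equations (X1^T X1) w = X1^T y;
# lstsq solves them in a numerically stable way
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
a_hat, b_hat = w[:d], w[d]

risk = np.mean((y - X1 @ w) ** 2)  # empirical risk at the optimum
```

Because the objective is convex, this closed-form solution is the global minimizer of the empirical risk.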
10 Maximum likelihood and empirical risk
Maximizing the likelihood P(D|M) is the same as maximizing log P(D|M), which is the same as minimizing -log P(D|M). Set the loss function to
c(x_i, y_i, f(x_i)) = -log P(y_i | x_i, f)
Since the samples are i.i.d., -log P(D|M) = Σ_{i=1..n} -log P(y_i | x_i, f), so minimizing the empirical risk is the same as maximizing the likelihood.
11 Empirical risk
We pose the empirical risk in terms of a loss function and set out to solve it.
Input: n training samples x_i, each of dimension d, along with labels y_i
Output: a linear function f(x) = w^T x + w_0 that minimizes the empirical risk
12 Empirical risk examples
Linear regression. How about logistic regression?
13 Logistic regression
Recall the logistic regression model:
P(y = +1 | x) = 1/(1 + e^{-(w^T x + w_0)}) and P(y = -1 | x) = 1 - P(y = +1 | x) = 1/(1 + e^{w^T x + w_0})
Let y = +1 be case and y = -1 be control. The sample likelihood of the training data is given by
L(w, w_0) = Π_{i=1..n} P(y_i | x_i)
14 Logistic regression
We find our parameters w and w_0 by maximizing the likelihood, or equivalently minimizing the -log(likelihood). The -log of the likelihood is
-log L(w, w_0) = Σ_{i=1..n} log(1 + e^{-y_i(w^T x_i + w_0)})
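Unlike least squares, this objective has no closed-form minimizer, but it is convex, so plain gradient descent works. A sketch on synthetic data (the generating parameters, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

# Synthetic data drawn from a logistic model with illustrative parameters
rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(size=(n, d))
true_margin = X @ np.array([2.0, -1.0]) + 0.5
y = np.where(rng.uniform(size=n) < 1 / (1 + np.exp(-true_margin)), 1.0, -1.0)

# Gradient descent on the -log likelihood above
w, w0 = np.zeros(d), 0.0
for _ in range(2000):
    m = y * (X @ w + w0)        # margins y_i (w^T x_i + w_0)
    g = -y / (1 + np.exp(m))    # per-sample gradient factor of log(1 + e^{-m})
    w -= 0.1 * (X.T @ g) / n    # gradient step on w
    w0 -= 0.1 * g.mean()        # gradient step on w_0

nll = np.mean(np.log1p(np.exp(-y * (X @ w + w0))))  # -log likelihood / n
acc = np.mean(np.sign(X @ w + w0) == y)             # training accuracy
```

Convexity again guarantees that gradient descent with a small enough step converges to the global minimizer of the empirical risk.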