Supervised learning for two classes

We are given n training samples (x_i, y_i) for i = 1..n, drawn i.i.d. from a probability distribution P(x, y). Each x_i is a d-dimensional vector (x_i in R^d) and y_i is +1 or -1. Our problem is to learn a function f(x) for predicting the labels of test samples x_i' in R^d for i = 1..n', also drawn i.i.d. from P(x, y).
Loss function

The loss function c(x, y, f(x)) maps to [0, inf]. Examples: the 0/1 loss, c = 1 if y != f(x) and 0 otherwise, and the squared loss c = (y - f(x))^2.
Test error

We quantify the test error as the expected error on the test set (in other words, the average test error). In the case of two classes:

(1/n') * sum_{i=1..n'} c(x_i', y_i', f(x_i'))

We'd like to find the f that minimizes this, but we would need P(y|x), which we don't have access to.
Expected risk

Suppose we didn't have test data x'. Then we average the test error over all possible data points x:

R[f] = integral of c(x, y, f(x)) dP(x, y)

We want to find the f that minimizes this, but we don't have all data points; we only have training data.
Empirical risk

Since we only have training data, we can't calculate the expected risk (we don't even know P(x, y)). Solution: we approximate P(x, y) with the empirical distribution

p_emp(x, y) = (1/n) * sum_{i=1..n} delta_{x_i}(x) * delta_{y_i}(y)

where the delta function delta_x(y) = 1 if x = y and 0 otherwise.
Empirical risk

We can now define the empirical risk as

R_emp[f] = (1/n) * sum_{i=1..n} c(x_i, y_i, f(x_i))

Once the loss function is defined and training data is given, we can find the f that minimizes this.
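The definition above can be sketched directly in code. This is a minimal illustration, not part of the lecture: the data points and the toy classifier f below are made-up assumptions, and the 0/1 loss stands in for a generic c(x, y, f(x)).

```python
import numpy as np

# Toy data: four labeled points in R^2 (illustrative assumption).
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

def f(x):
    # Toy classifier: sign of the first coordinate (illustrative assumption).
    return np.sign(x[0])

def zero_one_loss(y_true, y_pred):
    # c(x, y, f(x)) = 1 if the label is wrong, 0 otherwise.
    return float(y_true != y_pred)

def empirical_risk(X, y, f):
    # R_emp[f] = (1/n) * sum_i c(x_i, y_i, f(x_i))
    return sum(zero_one_loss(yi, f(xi)) for xi, yi in zip(X, y)) / len(y)

print(empirical_risk(X, y, f))  # → 0.0 (f classifies all four points correctly)
```

Swapping in a different loss function only changes `zero_one_loss`; the averaging over training samples stays the same.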
Example of minimizing empirical risk (least squares)

Suppose we are given n data points (x_i, y_i), where each x_i in R^d and y_i in R. We want to determine a linear function f(x) = a^T x + b for predicting test points, with the loss function c(x_i, y_i, f(x_i)) = (y_i - f(x_i))^2. What is the empirical risk?
Empirical risk for least squares

R_emp(a, b) = (1/n) * sum_{i=1..n} (y_i - a^T x_i - b)^2

Finding f has now reduced to finding a and b. Since this function is convex in a and b, there is a global optimum, which is easy to find by setting the first derivatives to 0.
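As a sketch of this step in one dimension: setting the derivatives with respect to a and b to zero yields a linear system (the normal equations), which NumPy can solve directly. The data points here are made up, chosen so the optimum is exact.

```python
import numpy as np

# Noiseless data on the line y = 2x + 1 (illustrative assumption).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix with columns [x_i, 1]; minimizing the empirical risk
# over (a, b) is the least-squares solution of A @ [a, b] = y.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)  # recovers a = 2, b = 1
```

For d > 1 the same call works with x_i as rows of the design matrix; only the shape of `A` changes.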
Maximum likelihood and empirical risk

Maximizing the likelihood P(D|M) is the same as maximizing log P(D|M), which is the same as minimizing -log P(D|M). Set the loss function to

c(x_i, y_i, f(x_i)) = -log P(x_i, y_i | M)

Now minimizing the empirical risk is the same as maximizing the likelihood, since for i.i.d. data -log P(D|M) = -sum_{i=1..n} log P(x_i, y_i | M).
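This equivalence can be checked numerically. The Bernoulli coin-flip model below is an illustrative assumption (it is not the model from the slides); the point is only that the parameter minimizing the summed -log loss is the same one maximizing the product of likelihoods.

```python
import numpy as np

# Observed coin flips and a grid of candidate parameters
# (both illustrative assumptions).
y = np.array([1, 1, 1, 0])
thetas = np.linspace(0.05, 0.95, 19)

def neg_log_lik(theta):
    # Empirical risk with loss c = -log P(y_i | theta), summed over samples.
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

def likelihood(theta):
    # Product of per-sample likelihoods P(y_i | theta).
    return np.prod(theta ** y * (1 - theta) ** (1 - y))

risks = [neg_log_lik(t) for t in thetas]
liks = [likelihood(t) for t in thetas]

# Both criteria pick the same theta: the sample mean 3/4.
print(thetas[np.argmin(risks)], thetas[np.argmax(liks)])
```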
Empirical risk

We pose the empirical risk in terms of a loss function and then solve for the minimizer.
Input: n training samples x_i, each of dimension d, along with labels y_i.
Output: a linear function f(x) = w^T x + w_0 that minimizes the empirical risk.
Empirical risk examples

Linear regression we have seen. How about logistic regression?
Logistic regression

Recall the logistic regression model:

P(y = +1 | x) = 1 / (1 + exp(-(w^T x + w_0)))

Let y = +1 be case and y = -1 be control. The sample likelihood of the training data is given by

prod_{i=1..n} P(y_i | x_i)
Logistic regression

We find our parameters w and w_0 by maximizing the likelihood, or equivalently minimizing the -log(likelihood). Since P(y_i | x_i) = 1 / (1 + exp(-y_i (w^T x_i + w_0))) for y_i in {+1, -1}, the -log of the likelihood is

sum_{i=1..n} log(1 + exp(-y_i (w^T x_i + w_0)))
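A minimal sketch of minimizing this -log likelihood with plain gradient descent (the slides do not prescribe an optimizer; the data, step size, and iteration count below are illustrative assumptions):

```python
import numpy as np

# Toy 1-d data with labels in {+1, -1} (illustrative assumption).
X = np.array([[2.0], [1.5], [-1.0], [-2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def neg_log_lik(w, w0):
    # sum_i log(1 + exp(-y_i (w^T x_i + w0)))
    margins = y * (X @ w + w0)
    return np.sum(np.log1p(np.exp(-margins)))

w, w0 = np.zeros(1), 0.0
for _ in range(500):
    margins = y * (X @ w + w0)
    # d/dm of log(1 + exp(-m)) is -1 / (1 + exp(m)); chain rule gives
    # the gradients w.r.t. w and w0 below.
    coef = -y / (1.0 + np.exp(margins))
    w -= 0.1 * (coef @ X)       # gradient step in w
    w0 -= 0.1 * np.sum(coef)    # gradient step in w0

print(neg_log_lik(w, w0) < neg_log_lik(np.zeros(1), 0.0))  # risk decreased
```

Because the -log likelihood is convex in (w, w_0), gradient descent with a small enough step converges toward the global optimum, mirroring the least-squares argument earlier.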