Regularized risk minimization Usman Roshan
Supervised learning for two classes We are given $n$ training samples $(x_i, y_i)$ for $i = 1, \dots, n$ drawn i.i.d. from a probability distribution $P(x, y)$. Each $x_i$ is a $d$-dimensional vector ($x_i \in \mathbb{R}^d$) and $y_i$ is $+1$ or $-1$. Our problem is to learn a function $f(x)$ for predicting the labels of test samples $x_i' \in \mathbb{R}^d$ for $i = 1, \dots, n'$, also drawn i.i.d. from $P(x, y)$.
Loss function Loss function: $c(x, y, f(x))$ Maps to $[0, \infty]$ Examples:
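Test error We quantify the test error as the expected error on the test set (in other words, the average test error). In the case of two classes: $R_{test}[f] = \frac{1}{n'} \sum_{i=1}^{n'} \left[ c(x_i', +1, f(x_i'))\, P(+1 \mid x_i') + c(x_i', -1, f(x_i'))\, P(-1 \mid x_i') \right]$. We'd like to find the $f$ that minimizes this, but we need $P(y \mid x)$, which we don't have access to.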
Expected risk Suppose we didn't have test data ($x'$). Then we average the test error over all possible data points $x$: $R[f] = \int c(x, y, f(x))\, dP(x, y)$. We want to find the $f$ that minimizes this, but we don't have all data points. We only have training data.
Empirical risk Since we only have training data we can't calculate the expected risk (we don't even know $P(x, y)$). Solution: we approximate $P(x, y)$ with the empirical distribution $p_{emp}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\, \delta_{y_i}(y)$. The delta function $\delta_x(y) = 1$ if $x = y$ and $0$ otherwise.
Empirical risk We can now define the empirical risk as $R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} c(x_i, y_i, f(x_i))$. Once the loss function is defined and training data is given, we can find the $f$ that minimizes this.
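The empirical risk described above is just the average loss over the training samples, so it can be sketched in a few lines of Python. The names `empirical_risk`, `zero_one_loss`, the toy data, and the predictor `f` are assumptions made for illustration; the 0/1 loss in particular is a common choice that the slides do not spell out.

```python
def empirical_risk(loss, f, xs, ys):
    """Average loss of predictor f over the n training samples."""
    n = len(xs)
    return sum(loss(x, y, f(x)) for x, y in zip(xs, ys)) / n


def zero_one_loss(x, y, fx):
    # Assumed example loss: 1 if the sign of f(x) disagrees with the label y.
    return 0.0 if y * fx > 0 else 1.0


# Toy two-class data in R^2 (made up for illustration).
xs = [(1.0, 2.0), (2.0, 1.0), (-1.0, -1.5), (-2.0, 2.5)]
ys = [+1, +1, -1, -1]


def f(x):
    # Hypothetical linear predictor with w = (1, 1) and w0 = 0.
    return x[0] + x[1]


risk = empirical_risk(zero_one_loss, f, xs, ys)
print(risk)  # one of the four points is misclassified
```

Swapping `zero_one_loss` for another function with the same signature changes the risk being minimized without touching `empirical_risk`.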
Example of minimizing empirical risk (least squares) Suppose we are given $n$ data points $(x_i, y_i)$ where each $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. We want to determine a linear function $f(x) = a^T x + b$, with $a \in \mathbb{R}^d$ and $b \in \mathbb{R}$, for predicting test points. Loss function: $c(x_i, y_i, f(x_i)) = (y_i - f(x_i))^2$. What is the empirical risk?
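Empirical risk for least squares $R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} (y_i - a^T x_i - b)^2$. Finding $f$ has now reduced to finding $a$ and $b$. Since this function is convex in $a$ and $b$, any point where the first derivatives are 0 is a global optimum, so it is easy to find by setting the first derivatives to 0.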
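As a minimal sketch of "setting first derivatives to 0", here is the one-dimensional case ($d = 1$), where the two stationarity conditions can be solved by hand for the slope $a$ and intercept $b$. The function name `fit_least_squares_1d` and the toy data are assumptions for illustration; the general $d$-dimensional case needs a linear solve instead.

```python
def fit_least_squares_1d(xs, ys):
    """Closed-form minimizer of (1/n) * sum (y_i - a*x_i - b)^2 for scalar x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Setting dR/da = 0 and dR/db = 0 gives a = S_xy / S_xx, b = y_mean - a * x_mean.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    a = s_xy / s_xx
    b = y_mean - a * x_mean
    return a, b


# Toy data lying exactly on the line y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_least_squares_1d(xs, ys)
print(a, b)
```

The sketch assumes the $x_i$ are not all equal; otherwise `s_xx` is zero and the slope is not determined.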
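Maximum likelihood and empirical risk Maximizing the likelihood $P(D \mid M)$ is the same as maximizing $\log P(D \mid M)$, which is the same as minimizing $-\log P(D \mid M)$. Set the loss function to $c(x_i, y_i, f(x_i)) = -\log P(y_i \mid x_i, M)$. Because the samples are i.i.d., the log-likelihood is a sum over samples, so minimizing the empirical risk is the same as maximizing the likelihood.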
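The equivalence can be written out in one line. This is a sketch under the assumption that the model $M$ specifies the conditional probabilities $P(y_i \mid x_i, M)$ and that the likelihood of the data $D$ factorizes over the i.i.d. samples:

```latex
-\log P(D \mid M)
  = -\log \prod_{i=1}^{n} P(y_i \mid x_i, M)
  = \sum_{i=1}^{n} -\log P(y_i \mid x_i, M)
  = \sum_{i=1}^{n} c(x_i, y_i, f(x_i))
  = n \, R_{emp}[f]
```

The factor $n$ does not depend on $f$, so both objectives have the same minimizer.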
Empirical risk We pose the empirical risk in terms of a loss function and then minimize it. Input: $n$ training samples $x_i$, each of dimension $d$, along with labels $y_i$. Output: a linear function $f(x) = w^T x + w_0$ that minimizes the empirical risk.
Empirical risk examples Linear regression: $\frac{1}{n} \sum_{i=1}^{n} (y_i - (w^T x_i + w_0))^2$. How about logistic regression?
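Logistic regression Recall the logistic regression model: $P(y = +1 \mid x) = \frac{1}{1 + e^{-(w^T x + w_0)}}$. Let $y = +1$ be case and $y = -1$ be control, so that $P(y \mid x) = \frac{1}{1 + e^{-y (w^T x + w_0)}}$. The sample likelihood of the training data is given by $\prod_{i=1}^{n} \frac{1}{1 + e^{-y_i (w^T x_i + w_0)}}$.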
Logistic regression We find our parameters $w$ and $w_0$ by maximizing the likelihood or, equivalently, minimizing the $-\log$(likelihood). The $-\log$ of the likelihood is $\sum_{i=1}^{n} \log\left(1 + e^{-y_i (w^T x_i + w_0)}\right)$.
Logistic regression loss function $c(x_i, y_i, f(x_i)) = \log\left(1 + e^{-y_i (w^T x_i + w_0)}\right)$
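To make the logistic loss concrete, here is a minimal sketch that minimizes the negative log-likelihood with plain gradient descent. The toy data, the step size `lr`, the number of steps, and all function names are assumptions for illustration, not the method used in the course software.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logistic_loss(margin):
    """log(1 + exp(-margin)), written to avoid overflow for large |margin|."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def neg_log_likelihood(w, w0, xs, ys):
    return sum(logistic_loss(y * (dot(w, x) + w0)) for x, y in zip(xs, ys))


def gradient_step(w, w0, xs, ys, lr):
    grad_w = [0.0] * len(w)
    grad_w0 = 0.0
    for x, y in zip(xs, ys):
        margin = y * (dot(w, x) + w0)
        # Derivative of log(1 + exp(-y*z)) with respect to z = w.x + w0.
        coeff = -y * sigmoid(-margin)
        for j in range(len(w)):
            grad_w[j] += coeff * x[j]
        grad_w0 += coeff
    new_w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return new_w, w0 - lr * grad_w0


# Toy two-class data in R^2 (made up for illustration).
xs = [(2.0, 1.0), (1.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
ys = [+1, +1, -1, -1]

w, w0 = [0.0, 0.0], 0.0
initial_nll = neg_log_likelihood(w, w0, xs, ys)
for _ in range(100):
    w, w0 = gradient_step(w, w0, xs, ys, lr=0.1)
final_nll = neg_log_likelihood(w, w0, xs, ys)
print(initial_nll, final_nll)
```

On linearly separable data like this toy set, the unregularized loss keeps decreasing as the weights grow, which is one motivation for adding a regularizer.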
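SVM loss function Recall the SVM optimization problem: $\min_{w, w_0, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$ subject to $y_i (w^T x_i + w_0) \ge 1 - \xi_i$ and $\xi_i \ge 0$. The loss function (second term) can be written as $\max(0, 1 - y_i (w^T x_i + w_0))$.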
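Different loss functions Linear regression: $(y_i - (w^T x_i + w_0))^2$. Logistic regression: $\log\left(1 + e^{-y_i (w^T x_i + w_0)}\right)$. SVM: $\max(0, 1 - y_i (w^T x_i + w_0))$.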
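The three losses can be compared side by side by evaluating them on the same label and prediction. This is a small sketch; the function names and the sample predictions are assumptions for illustration, and `fx` stands for the linear score $w^T x + w_0$.

```python
import math


def squared_loss(y, fx):
    # Linear regression loss.
    return (y - fx) ** 2


def logistic_loss(y, fx):
    # Logistic regression loss: log(1 + exp(-y * fx)), written to avoid overflow.
    margin = y * fx
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def hinge_loss(y, fx):
    # SVM loss: zero once the margin y * fx reaches 1.
    return max(0.0, 1.0 - y * fx)


# Compare the losses for a positive example at several predicted scores.
for fx in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(fx, squared_loss(1, fx), logistic_loss(1, fx), hinge_loss(1, fx))
```

Note that the squared loss also penalizes confident correct predictions (for example `fx = 2.0` with `y = +1`), while the hinge loss is exactly zero there and the logistic loss is small but positive.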
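Regularized risk minimization Minimize $\frac{1}{n} \sum_{i=1}^{n} c(x_i, y_i, f(x_i)) + \lambda\, \Omega(w)$, where $\Omega(w)$ is a regularizer and $\lambda \ge 0$ controls its weight. Note the additional term added to the empirical risk.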
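A minimal sketch of the effect of the added term, assuming a scalar input, no intercept, squared loss, and the regularizer $w^2$ with weight `lam` (all of these are simplifying assumptions for illustration). Setting the derivative of $\frac{1}{n} \sum_i (y_i - w x_i)^2 + \lambda w^2$ to zero gives $w = \sum_i x_i y_i / (\sum_i x_i^2 + n \lambda)$.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of (1/n) * sum (y_i - w*x_i)^2 + lam * w^2 for scalar x."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    return s_xy / (s_xx + n * lam)


def regularized_risk(w, xs, ys, lam):
    n = len(xs)
    empirical = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / n
    return empirical + lam * w * w


# Toy data lying exactly on y = 2x (made up for illustration).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_unreg = ridge_slope(xs, ys, lam=0.0)  # plain empirical risk minimizer
w_reg = ridge_slope(xs, ys, lam=1.0)    # shrunk toward zero by the regularizer
print(w_unreg, w_reg)
```

With `lam = 1.0` the regularized solution gives up a little training fit in exchange for a smaller weight, and it attains a lower regularized risk than the unregularized slope does.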
Other loss functions From “A Scalable Modular Convex Solver for Regularized Risk Minimization”, Teo et al., KDD 2007
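Regularizer L1 norm: $\|w\|_1 = \sum_{j=1}^{d} |w_j|$. L1 gives a sparse solution (many entries will be zero). Squared loss with L1 is known as the “lasso”; logistic loss with L1 is L1-regularized logistic regression. L2 norm: $\|w\|_2 = \sqrt{\sum_{j=1}^{d} w_j^2}$.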
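The sparsity claim can be illustrated on a deliberately simplified problem in which each coordinate is treated separately: minimize $\frac{1}{2}(w_j - v_j)^2$ plus a penalty, where $v_j$ is the unregularized solution. This per-coordinate setup, the function names, and the numbers are assumptions for illustration. With the L1 penalty $\lambda |w_j|$ the minimizer is the soft-threshold of $v_j$; with the squared L2 penalty $\frac{\lambda}{2} w_j^2$ it is $v_j / (1 + \lambda)$.

```python
def l1_shrink(v, lam):
    """Minimizer of 0.5*(w - v)^2 + lam*|w| (soft-thresholding)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0  # small coefficients are set exactly to zero


def l2_shrink(v, lam):
    """Minimizer of 0.5*(w - v)^2 + 0.5*lam*w^2 (uniform shrinkage)."""
    return v / (1.0 + lam)


# Hypothetical unregularized coefficients.
v = [3.0, 0.5, -0.2, -2.0]
lam = 1.0

l1_solution = [l1_shrink(vj, lam) for vj in v]
l2_solution = [l2_shrink(vj, lam) for vj in v]
print(l1_solution)
print(l2_solution)
```

The L1 solution zeroes out the two small coefficients, while the L2 solution shrinks every coefficient but leaves all of them nonzero.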
Regularized risk minimizer exercise Compare SVM to regularized logistic regression. Software: Version 2.1 executables for OSL machines are available on the course website.