Regularized risk minimization

Regularized risk minimization Usman Roshan

Supervised learning for two classes We are given n training samples (x_i, y_i) for i = 1..n, drawn i.i.d. from a probability distribution P(x,y). Each x_i is a d-dimensional vector (x_i in R^d) and y_i is +1 or -1. Our problem is to learn a function f(x) for predicting the labels of test samples x_i' in R^d for i = 1..n', also drawn i.i.d. from P(x,y).

Loss function Loss function: c(x, y, f(x)), which maps to [0, ∞). Examples (see the sketch below):
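The slide's examples did not survive the transcript; standard loss functions for labels y in {-1, +1}, written in the notation above, are sketched here.

```latex
% Common loss functions c(x, y, f(x)) for labels y \in \{-1, +1\}
% (standard choices; the specific examples on the original slide are not in the transcript).
\begin{align*}
\text{0/1 loss:}      \quad & c(x, y, f(x)) = \mathbf{1}\big[\, y \neq \operatorname{sign} f(x) \,\big] \\
\text{hinge loss:}    \quad & c(x, y, f(x)) = \max\big(0,\; 1 - y f(x)\big) \\
\text{logistic loss:} \quad & c(x, y, f(x)) = \log\!\big(1 + e^{-y f(x)}\big) \\
\text{squared loss:}  \quad & c(x, y, f(x)) = \big(y - f(x)\big)^2
\end{align*}
```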

Test error We quantify the test error as the expected error on the test set (in other words, the average test error). In the case of two classes it takes the form sketched below. We want to find f that minimizes this, but we need P(y|x), which we don't have access to.
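The formula itself was an image on the original slide; a standard way to write the expected test error over the n' test points, assuming the notation above, is:

```latex
% Expected test error: average over the test points, with the loss averaged over the
% unknown label distribution P(y | x) (hedged reconstruction, not copied from the slide).
R_{\text{test}}[f] \;=\; \frac{1}{n'} \sum_{i=1}^{n'} \;\sum_{y \in \{-1,+1\}} c\big(x_i', y, f(x_i')\big)\, P(y \mid x_i')
```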

Expected risk Suppose we don't have test data (x'). Then we average the test error over all possible data points x. This is also known as the expected risk, or the expected value of the loss function in Bayesian decision theory. We want to find f that minimizes this, but we don't have all data points; we only have training data, and we don't know P(x,y).
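The slide's formula is not in the transcript; the standard definition of the expected risk, in the notation above, is:

```latex
% Expected risk: the loss averaged over all (x, y) drawn from P(x, y).
R[f] \;=\; \int c\big(x, y, f(x)\big)\; dP(x, y)
```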

Empirical risk Since we only have training data we can't calculate the expected risk (we don't even know P(x,y)). Solution: we approximate P(x,y) with the empirical distribution p_emp(x,y), sketched below, where the delta function δ_x(y) = 1 if x = y and 0 otherwise.
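The empirical distribution itself was an image; the standard form, assuming the delta-function notation above, is:

```latex
% Empirical distribution placing mass 1/n on each training pair (x_i, y_i).
p_{\text{emp}}(x, y) \;=\; \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\, \delta_{y_i}(y)
```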

Empirical risk We can now define the empirical risk (sketched below). Once the loss function is defined and the training data are given, we can then find f that minimizes this.
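Substituting the empirical distribution into the expected risk gives the standard definition, which the slide displayed as an image:

```latex
% Empirical risk: the average loss on the n training samples.
R_{\text{emp}}[f] \;=\; \frac{1}{n} \sum_{i=1}^{n} c\big(x_i, y_i, f(x_i)\big)
```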

Bounding the expected risk Recall from earlier that we bounded the expected risk by the empirical risk plus a complexity term (sketched below). This suggests we should minimize the empirical risk plus a measure of classifier complexity.
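The bound referred to here is not reproduced in the transcript; a generic form, with one classical VC-style instance noted in the comment as an assumption, is:

```latex
% Generic generalization bound, holding with probability 1 - \delta over the training sample:
% expected risk <= empirical risk + a complexity term \phi that grows with the capacity h
% of the function class and shrinks with n. One classical VC-theory instance is
% \phi(h, n, \delta) = \sqrt{ (h(\ln(2n/h) + 1) + \ln(4/\delta)) / n }.
R[f] \;\le\; R_{\text{emp}}[f] \;+\; \phi(h, n, \delta)
```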

Regularized risk minimization Minimize the regularized risk sketched below. Note the additional term added to the empirical risk; this term measures classifier complexity.
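The objective was displayed as an image; its standard form, with λ > 0 trading off fit against complexity, is:

```latex
% Regularized risk: empirical risk plus a complexity penalty weighted by \lambda > 0.
% For a linear classifier f(x) = w^T x + b, a typical penalty is \Omega(f) = \tfrac{1}{2}\lVert w \rVert^2.
R_{\text{reg}}[f] \;=\; R_{\text{emp}}[f] \;+\; \lambda\, \Omega(f)
```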

Representer theorem Plays a central role in statistical estimation. The statement (sketched below) is taken from Learning with Kernels by Schölkopf and Smola.
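The theorem statement itself is not in the transcript; the standard form, for a kernel k and a regularizer that is a strictly increasing function of the RKHS norm of f, says that a minimizer of the regularized risk can be expanded over the training points:

```latex
% Representer theorem (standard statement): a minimizer of the regularized risk over an
% RKHS with kernel k admits a finite expansion over the n training points.
f^{*}(x) \;=\; \sum_{i=1}^{n} \alpha_i\, k(x_i, x), \qquad \alpha_i \in \mathbb{R}
```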

Regularized empirical risk Instances (objectives sketched below): linear regression, logistic regression, SVM.
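The three objectives were images on the slide; standard L2-regularized forms of each, assuming a linear model f(x) = w^T x + b and regularization weights λ (or C for the SVM), are:

```latex
% Standard L2-regularized objectives for a linear model f(x) = w^T x + b
% (the slide's exact notation is not in the transcript).
\begin{align*}
\text{Linear (ridge) regression:} \quad & \min_{w,b}\; \sum_{i=1}^{n} \big(y_i - w^{T} x_i - b\big)^2 \;+\; \lambda \lVert w \rVert^2 \\
\text{Logistic regression:}       \quad & \min_{w,b}\; \sum_{i=1}^{n} \log\!\big(1 + e^{-y_i (w^{T} x_i + b)}\big) \;+\; \lambda \lVert w \rVert^2 \\
\text{SVM (hinge loss):}          \quad & \min_{w,b}\; C \sum_{i=1}^{n} \max\!\big(0,\; 1 - y_i (w^{T} x_i + b)\big) \;+\; \tfrac{1}{2} \lVert w \rVert^2
\end{align*}
```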

Single layer neural network Compare the linear regression regularized risk with the single layer neural network regularized risk (both sketched below).
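Both formulas were images; a plausible reconstruction, assuming squared loss, a sigmoid output unit σ, and an L2 penalty, is:

```latex
% Hedged reconstruction: linear regression vs. a single sigmoid output unit, both with
% squared loss and an L2 penalty (the slide's exact formulas are not in the transcript).
\begin{align*}
\text{Linear regression:}           \quad & \min_{w,b}\; \sum_{i=1}^{n} \big(y_i - (w^{T} x_i + b)\big)^2 \;+\; \lambda \lVert w \rVert^2 \\
\text{Single layer neural network:} \quad & \min_{w,b}\; \sum_{i=1}^{n} \big(y_i - \sigma(w^{T} x_i + b)\big)^2 \;+\; \lambda \lVert w \rVert^2,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\end{align*}
```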

Other loss functions From "A Scalable Modular Convex Solver for Regularized Risk Minimization", Teo et al., KDD 2007.

Regularizer Two common choices, written out below: the L1 norm and the L2 norm. L1 gives a sparse solution (many entries of w will be zero); squared loss with an L1 penalty is known as the "lasso", while logistic loss with L1 gives sparse logistic regression.
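The two norms written out explicitly for w in R^d:

```latex
% The two regularizers for w \in \mathbb{R}^d.
\lVert w \rVert_1 = \sum_{j=1}^{d} |w_j|
\qquad\qquad
\lVert w \rVert_2^2 = \sum_{j=1}^{d} w_j^2
```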

Regularized risk minimizer exercise Compare the SVM to regularized logistic regression. Software: http://users.cecs.anu.edu.au/~chteo/BMRM.html. Version 2.1 executables for OSL machines are available on the course website.
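The exercise points to the BMRM executables above; as a hypothetical stand-in, the same comparison can be sketched with scikit-learn's LinearSVC (hinge loss, L2 penalty) and LogisticRegression (logistic loss, L2 penalty). The synthetic dataset and parameter choices below are illustrative assumptions, not part of the original exercise.

```python
# Hedged sketch: compare an L2-regularized SVM (hinge loss) with L2-regularized
# logistic regression, as a stand-in for the BMRM-based exercise on the slide.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data standing in for a real course dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM (hinge loss, L2)": make_pipeline(StandardScaler(), LinearSVC(C=1.0, loss="hinge")),
    "Logistic regression (L2)": make_pipeline(StandardScaler(), LogisticRegression(C=1.0)),
}

for name, model in models.items():
    model.fit(X_train, y_train)        # minimize loss + L2 penalty on the training set
    acc = model.score(X_test, y_test)  # empirical test accuracy
    print(f"{name}: test accuracy = {acc:.3f}")
```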