Regularized risk minimization

Regularized risk minimization Usman Roshan

Supervised learning for two classes: We are given $n$ training samples $(x_i, y_i)$, $i = 1, \dots, n$, drawn i.i.d. from a probability distribution $P(x, y)$. Each $x_i$ is a $d$-dimensional vector ($x_i \in \mathbb{R}^d$) and $y_i$ is +1 or -1. Our problem is to learn a function $f(x)$ for predicting the labels of test samples $x_i' \in \mathbb{R}^d$, $i = 1, \dots, n'$, also drawn i.i.d. from $P(x, y)$.

Loss function: $c(x, y, f(x))$, which maps to $[0, \infty)$. Examples:
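
As a hedged illustration of what such a loss can look like, here is a minimal Python sketch of four commonly used losses for labels in {-1, +1}. The function names are illustrative, and this selection is an assumption, not necessarily the set of examples the slide showed. None of these losses uses x directly, so x is dropped from the signatures.

```python
import numpy as np

# Labels y are in {-1, +1}; fx holds real-valued predictions f(x).

def zero_one_loss(y, fx):
    # 1 when the sign of f(x) disagrees with y.
    # Convention assumed here: f(x) = 0 counts as an error.
    return np.where(y * fx <= 0, 1.0, 0.0)

def hinge_loss(y, fx):
    # max(0, 1 - y f(x)), the loss used by the soft-margin SVM
    return np.maximum(0.0, 1.0 - y * fx)

def squared_loss(y, fx):
    # (y - f(x))^2, the loss used by least-squares regression
    return (y - fx) ** 2

def logistic_loss(y, fx):
    # log(1 + exp(-y f(x))), computed in a numerically stable way
    return np.logaddexp(0.0, -y * fx)

y = np.array([1.0, -1.0, 1.0])
fx = np.array([2.0, 0.5, -0.25])

for loss in (zero_one_loss, hinge_loss, squared_loss, logistic_loss):
    print(loss.__name__, loss(y, fx))
```

All four return non-negative values, as the definition requires. The hinge and logistic losses are convex upper bounds on the 0/1 loss (the logistic one after rescaling by 1/log 2), which is why they are used in its place for optimization.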

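Test error: We quantify the test error as the expected error on the test set (in other words, the average test error). In the case of two classes: $R_{\text{test}}[f] = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{y \in \{+1, -1\}} c(x_i', y, f(x_i'))\, P(y \mid x_i')$. We want to find the $f$ that minimizes this, but that requires $P(y \mid x)$, which we don't have access to.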
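
To make the two-class test error concrete, the sketch below computes it in a toy setting where P(y|x) is known. It assumes the test error is the average over test points of the loss for each label weighted by P(y|x). The sigmoid conditional `p_pos_given_x` and the classifier `f` are hypothetical choices made only for illustration; in practice P(y|x) is exactly what we do not have.

```python
import numpy as np

def zero_one_loss(y, fx):
    # 1 when the sign of f(x) disagrees with y (f(x) = 0 counts as an error)
    return np.where(y * fx <= 0, 1.0, 0.0)

def p_pos_given_x(x):
    # Hypothetical P(y = +1 | x): a sigmoid, assumed known for this toy example
    return 1.0 / (1.0 + np.exp(-x))

def f(x):
    # Hypothetical classifier; the predicted label is sign(f(x))
    return x

def expected_test_error(x_test, f, loss, p_pos):
    fx = f(x_test)
    p = p_pos(x_test)
    # For each test point, weight the loss of each label by its probability
    per_point = loss(+1.0, fx) * p + loss(-1.0, fx) * (1.0 - p)
    return per_point.mean()

x_test = np.array([-1.0, 0.5, 2.0])
r_test = expected_test_error(x_test, f, zero_one_loss, p_pos_given_x)
print(r_test)
```

Here `f` predicts the more probable label at every test point, so each point contributes min(P(+1|x), P(-1|x)), the smallest error any classifier could achieve at that point.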

Expected risk: Suppose we don't have test data ($x'$). Then we average the test error over all possible data points $x$: $R[f] = \int c(x, y, f(x))\, dP(x, y)$. This is also known as the expected risk, or the expected value of the loss function in Bayesian decision theory. We want to find the $f$ that minimizes this, but we don't have all data points, only training data, and we don't know $P(x, y)$.
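
When P(x,y) is known and has finite support, the expected risk is just a probability-weighted sum of losses. The sketch below uses a made-up joint distribution `P` and a made-up classifier `f` (both assumptions for illustration) to show the calculation.

```python
# Hypothetical joint distribution P(x, y) over x in {0, 1} and y in {-1, +1}
P = {(0, -1): 0.4, (0, +1): 0.1, (1, -1): 0.2, (1, +1): 0.3}

def f(x):
    # Hypothetical classifier: predict +1 when x = 1, else -1
    return 1 if x == 1 else -1

def zero_one_loss(x, y, fx):
    return 0.0 if y == fx else 1.0

def expected_risk(P, f, loss):
    # Sum of c(x, y, f(x)) weighted by P(x, y) over the whole support
    return sum(p * loss(x, y, f(x)) for (x, y), p in P.items())

risk = expected_risk(P, f, zero_one_loss)
print(risk)
```

The classifier is wrong on the pairs (0, +1) and (1, -1), so the risk is the total probability of those two pairs.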

Empirical risk: Since we only have training data, we can't calculate the expected risk (we don't even know $P(x, y)$). Solution: we approximate $P(x, y)$ with the empirical distribution $p_{\text{emp}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\, \delta_{y_i}(y)$. The delta function $\delta_x(y) = 1$ if $x = y$ and 0 otherwise.

Empirical risk: We can now define the empirical risk as $R_{\text{emp}}[f] = \frac{1}{n} \sum_{i=1}^{n} c(x_i, y_i, f(x_i))$. Once the loss function is defined and training data is given, we can find the $f$ that minimizes this.
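
The sketch below computes the empirical risk in two ways: as the plain average of the loss over the training samples, and as the expected loss under the empirical distribution, which puts mass 1/n on each training sample. The toy data and the classifier `f` are assumptions for illustration.

```python
from collections import Counter

# Toy training set (hypothetical), with repeated samples
X = [0, 1, 1, 0, 1]
y = [-1, +1, -1, -1, +1]

def f(x):
    # Hypothetical classifier: maps x = 0 to -1 and x = 1 to +1
    return 2 * x - 1

def zero_one_loss(x, y, fx):
    return 0.0 if y == fx else 1.0

def empirical_risk(X, y, f, loss):
    # Average loss over the n training samples
    return sum(loss(xi, yi, f(xi)) for xi, yi in zip(X, y)) / len(X)

def risk_under_p_emp(X, y, f, loss):
    # Expected loss under p_emp, which gives each distinct (x, y) pair
    # probability (number of occurrences) / n
    n = len(X)
    p_emp = {pair: count / n for pair, count in Counter(zip(X, y)).items()}
    return sum(p * loss(xi, yi, f(xi)) for (xi, yi), p in p_emp.items())

r_avg = empirical_risk(X, y, f, zero_one_loss)
r_emp = risk_under_p_emp(X, y, f, zero_one_loss)
print(r_avg, r_emp)
```

Both routes give the same number, since substituting the empirical distribution for P(x,y) in the expected risk turns the integral into a sample average.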

Bounding the expected risk: Recall from earlier that we bounded the expected risk with the empirical risk plus a complexity term. This suggests we should minimize empirical risk plus classifier complexity.

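Regularized risk minimization: Minimize $R_{\text{emp}}[f] + \lambda\, \Omega(f)$, where $\Omega(f)$ is a regularizer and $\lambda > 0$ sets its weight. Note the additional term added to the empirical risk. This term measures classifier complexity.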
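
A minimal sketch of the trade-off, under stated assumptions: a linear model f(x) = w·x, squared loss, and the squared L2 norm of w as the complexity term with a weight `lam`. These choices are made here only because the minimizer then has a closed form; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])  # hypothetical generating weights
y = X @ w_true + 0.1 * rng.normal(size=n)

def regularized_risk(w, X, y, lam):
    # Empirical risk (mean squared loss) plus lam times the complexity term
    empirical = np.mean((X @ w - y) ** 2)
    return empirical + lam * np.dot(w, w)

def ridge_minimizer(X, y, lam):
    # Setting the gradient to zero gives (X^T X + n lam I) w = X^T y
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

lams = [0.0, 0.1, 1.0, 10.0]
solutions = [ridge_minimizer(X, y, lam) for lam in lams]
norms = [np.linalg.norm(w) for w in solutions]

for lam, w, norm in zip(lams, solutions, norms):
    emp = np.mean((X @ w - y) ** 2)
    print(f"lam={lam:5.1f}  empirical risk={emp:.4f}  ||w||={norm:.4f}")
```

As `lam` grows, the minimizer accepts a higher empirical risk in exchange for a smaller norm, which is the trade-off the complexity term is meant to control.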

Representer theorem: Plays a central role in statistical estimation. Taken from Learning with Kernels by Schölkopf and Smola.
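
Informally, the theorem says that, under its conditions, a minimizer of a regularized risk over a kernel-induced function space can be written as a kernel expansion over the training points, f(x) = Σ_i α_i k(x_i, x). The sketch below checks this numerically in one special case chosen here for convenience (squared loss, squared-norm regularizer, linear kernel), where both the weight-space and the kernel-expansion solutions have closed forms. The data and the value of `lam` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 3
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])  # synthetic labels in {-1, +1}
lam = 0.5

# Weight-space solution of  (1/n) sum (w.x_i - y_i)^2 + lam ||w||^2
w_primal = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Kernel-expansion solution with the linear kernel k(x, x') = x.x'
K = X @ X.T
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

def f_kernel(x_new):
    # f(x) = sum_i alpha_i k(x_i, x)
    return np.sum(alpha * (X @ x_new))

x_new = rng.normal(size=d)
f_primal = w_primal @ x_new
print(f_primal, f_kernel(x_new))
```

The two predictions agree because the weight vector equals `X.T @ alpha`, a linear combination of the training points. The kernel form only touches the data through inner products, which is what allows other kernels to be substituted.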

Regularized empirical risk: Linear regression, logistic regression, and the SVM can each be written as a regularized empirical risk; they differ in the loss function.
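
The following sketch treats the three methods as one template with different losses: squared loss for linear regression, logistic loss for logistic regression, and hinge loss for the SVM. It assumes a linear model without a bias term, a squared L2 regularizer with weight `lam`, and plain (sub)gradient descent; the data, step size, and iteration count are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = np.where(np.arange(n) < n // 2, 1.0, -1.0)
X = rng.normal(size=(n, 2)) + 1.5 * y[:, None]  # two synthetic Gaussian blobs

# Derivative of each loss with respect to the prediction f(x)
def d_squared(y, fx):
    return 2.0 * (fx - y)

def d_logistic(y, fx):
    return -y / (1.0 + np.exp(y * fx))

def d_hinge(y, fx):
    # A subgradient of max(0, 1 - y f(x))
    return np.where(y * fx < 1.0, -y, 0.0)

def minimize_regularized_risk(X, y, dloss, lam, lr=0.05, steps=1000):
    # (Sub)gradient descent on (1/n) sum c(x_i, y_i, w.x_i) + lam ||w||^2
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        fx = X @ w
        grad = X.T @ dloss(y, fx) / len(y) + 2.0 * lam * w
        w -= lr * grad
    return w

lam = 0.01
losses = {"squared": d_squared, "logistic": d_logistic, "hinge": d_hinge}
weights = {name: minimize_regularized_risk(X, y, dl, lam) for name, dl in losses.items()}
accuracy = {name: np.mean(np.sign(X @ w) == y) for name, w in weights.items()}

for name in losses:
    print(name, weights[name], accuracy[name])
```

Only the `dloss` argument changes between the three runs; the regularizer and the optimization loop are shared.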

Single layer neural network Linear regression regularized risk: Single layer neural network regularized risk:

Other loss functions: From “A Scalable Modular Convex Solver for Regularized Risk Minimization”, Teo et al., KDD 2007.

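Regularizer: For a weight vector $w$, the L1 norm is $\|w\|_1 = \sum_j |w_j|$. L1 gives a sparse solution (many entries will be zero). Squared loss with L1 is known as the “lasso”; logistic loss with L1 is L1-regularized logistic regression. The L2 norm is $\|w\|_2 = \sqrt{\sum_j w_j^2}$, usually applied in squared form as a regularizer.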
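
A small sketch of why L1 produces exact zeros while L2 only shrinks. It assumes the simplest possible setting: each coordinate is fit separately by minimizing ½(w − z)² plus a penalty, where `z` stands for the unregularized solution (this is exact for squared loss with an orthonormal design, and is used here only as an illustration). The values in `z` and `lam` are made up.

```python
import numpy as np

def l1_solution(z, lam):
    # Minimizer of 0.5 * (w - z)^2 + lam * |w|  (soft thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_solution(z, lam):
    # Minimizer of 0.5 * (w - z)^2 + lam * w^2
    return z / (1.0 + 2.0 * lam)

z = np.array([3.0, -0.4, 0.2, -1.5, 0.05])  # hypothetical unregularized weights
lam = 0.5

w_l1 = l1_solution(z, lam)
w_l2 = l2_solution(z, lam)

print("L1:", w_l1, "zeros:", np.sum(w_l1 == 0))
print("L2:", w_l2, "zeros:", np.sum(w_l2 == 0))
```

Every coordinate with |z| at or below `lam` is set exactly to zero by the L1 penalty, while the L2 penalty scales all coordinates by the same factor and leaves none of them at zero.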

Regularized risk minimizer exercise: Compare SVM to regularized logistic regression. Software: http://users.cecs.anu.edu.au/~chteo/BMRM.html (version 2.1 executables for OSL machines are available on the course website).