CS 59000 Statistical Machine Learning, Lecture 13. Yuan (Alan) Qi, Purdue CS, Oct. 8, 2008.

Outline
- Review of the kernel trick, kernel ridge regression, and kernel Principal Component Analysis (PCA)
- Gaussian processes (GPs): from linear regression to GPs, and GPs for regression

Kernel Trick
1. Reformulate an algorithm so that the input vector x enters only in the form of inner products.
2. Replace each input x by its feature mapping φ(x).
3. Replace the inner product by a kernel function: k(x, x′) = φ(x)ᵀφ(x′).
Examples: kernel PCA, kernel Fisher discriminant, support vector machines.
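As a small illustration of step 3 (not from the original slides; the feature map and kernel are a textbook degree-2 polynomial example), the kernel evaluates the feature-space inner product without ever forming φ explicitly:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 monomial feature map for a 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: k(x, z) = (x^T z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# Both evaluate the same feature-space inner product; each prints ~2.25.
print(np.dot(phi(x), phi(z)))
print(poly_kernel(x, z))
```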

Dual Representation for Ridge Regression Dual variables:
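A standard form of the dual representation, in PRML Chapter 6 notation (a reconstruction; the equations on the original slide were images and may differ in detail):

```latex
J(\mathbf{w}) = \tfrac{1}{2}\sum_{n=1}^{N}\bigl(\mathbf{w}^\top\phi(\mathbf{x}_n) - t_n\bigr)^2
              + \tfrac{\lambda}{2}\,\mathbf{w}^\top\mathbf{w},
\qquad
\mathbf{w} = \Phi^\top\mathbf{a},
\qquad
\mathbf{a} = (K + \lambda I_N)^{-1}\mathbf{t},
\quad K = \Phi\Phi^\top,
\qquad
y(\mathbf{x}) = \mathbf{k}(\mathbf{x})^\top (K + \lambda I_N)^{-1}\mathbf{t}.
```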

Kernel Ridge Regression Apply the kernel trick to ridge regression, then minimize the resulting objective over the dual variables:
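A minimal NumPy sketch of kernel ridge regression with this dual solution (the function names, RBF kernel width, and toy data are illustrative, not from the lecture):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, t, lam=1e-2, gamma=1.0):
    """Solve for the dual variables a = (K + lambda I)^{-1} t."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def kernel_ridge_predict(X_train, a, X_new, gamma=1.0):
    """Predict y(x) = sum_n a_n k(x_n, x) at each row of X_new."""
    return rbf_kernel(X_new, X_train, gamma) @ a

# Toy usage on a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(30, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
a = kernel_ridge_fit(X, t)
print(kernel_ridge_predict(X, a, np.array([[2.5]])))
```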

Generating the Kernel Matrix The kernel matrix must be positive semidefinite. Consider the Gaussian kernel:
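The kernel formula itself was an image on the slide; the standard Gaussian kernel and the positive-semidefiniteness condition it refers to can be written as:

```latex
k(\mathbf{x}, \mathbf{x}') = \exp\!\Bigl(-\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^2}{2\sigma^2}\Bigr),
\qquad
K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m),
\qquad
\mathbf{a}^\top K\,\mathbf{a} \ge 0 \ \ \text{for all } \mathbf{a} \in \mathbb{R}^N .
```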

Principal Component Analysis (PCA) Assume the data have zero mean. We then have the sample covariance matrix S, and each principal direction u_i is a normalized eigenvector of S (S u_i = λ_i u_i, with u_iᵀu_i = 1).

Feature Mapping Eigen-problem in feature space

Dual Variables Suppose we expand each eigenvector as a linear combination of the mapped data points, v_i = Σ_n a_in φ(x_n); we then have:

Eigen-problem in Feature Space (1)

Eigen-problem in Feature Space (2) Normalization condition: Projection coefficient:
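The equations for these two slides were images; the standard kernel-PCA eigen-problem, normalization condition, and projection coefficient (in PRML §12.3 form, a reconstruction) read:

```latex
\mathbf{v}_i = \sum_{n=1}^{N} a_{in}\,\phi(\mathbf{x}_n)
\;\Rightarrow\;
K\mathbf{a}_i = \lambda_i N\,\mathbf{a}_i ,
\qquad
\text{normalization: } \lambda_i N\,\mathbf{a}_i^\top\mathbf{a}_i = 1,
\qquad
\text{projection: } y_i(\mathbf{x}) = \phi(\mathbf{x})^\top\mathbf{v}_i
 = \sum_{n=1}^{N} a_{in}\,k(\mathbf{x}, \mathbf{x}_n).
```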

General Case: Non-zero Mean Data The centered kernel matrix:
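A sketch of kernel PCA from a precomputed Gram matrix, including the standard centering for non-zero-mean data (the function name, component count, and toy Gaussian-kernel data are illustrative):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Kernel PCA from a precomputed N x N Gram matrix K.

    Uses the standard centering K~ = K - 1N K - K 1N + 1N K 1N, with (1N)_ij = 1/N.
    Returns the projections of the training points and the dual coefficients a_i.
    """
    N = K.shape[0]
    one_n = np.full((N, N), 1.0 / N)
    K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Eigendecomposition of the centered Gram matrix (eigh gives ascending order).
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    eigvals = eigvals[::-1][:n_components]          # these equal lambda_i * N
    eigvecs = eigvecs[:, ::-1][:, :n_components]

    # Rescale so that lambda_i * N * a_i^T a_i = 1.
    a = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))

    # Projection of training point m onto component i: sum_n a_in k~(x_m, x_n).
    return K_tilde @ a, a

# Toy usage with a Gaussian kernel on random 2-D data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
Z, a = kernel_pca(np.exp(-d2), n_components=2)
print(Z.shape)   # (50, 2)
```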

Gaussian Processes How do kernels arise naturally in a Bayesian setting? Instead of assigning a prior to the parameters w, we assign a prior directly to the function values y. The space of functions is infinite-dimensional in theory, but in practice we only ever work with a finite set of function values (at the finitely many training and test points).

Linear Regression Revisited Let y(x) = wᵀφ(x), with a zero-mean Gaussian prior on w. We then have an induced distribution over the vector of function values at the training inputs.

From a Prior on Parameters to a Prior on Functions The induced prior on the function values:
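In PRML's notation, the induced prior presumably takes the standard form (a reconstruction; α denotes the prior precision on w):

```latex
\mathbf{y} = \Phi\mathbf{w},
\qquad
p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{0},\,\alpha^{-1}I)
\;\Rightarrow\;
p(\mathbf{y}) = \mathcal{N}(\mathbf{y}\,|\,\mathbf{0},\,K),
\qquad
K_{nm} = \frac{1}{\alpha}\,\phi(\mathbf{x}_n)^\top\phi(\mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m).
```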

Stochastic Process A stochastic process is specified by giving the joint distribution of the function values at any finite set of points, in a consistent manner. (Loosely speaking, consistency means that marginalizing the joint distribution over some of the variables yields the same distribution as the one defined directly on the remaining subset.)

Gaussian Processes A Gaussian process is a stochastic process for which the joint distribution of any finite set of function values is a multivariate Gaussian. Without prior knowledge, we often set the mean to zero; the GP is then fully specified by its covariance, E[y(x_n) y(x_m)] = k(x_n, x_m).

Impact of the Kernel Function The covariance matrix is built from the kernel function, so the choice of kernel determines the character of the functions drawn from the GP. Applications include economics and finance.

Gaussian Process for Regression Likelihood: Prior: Marginal distribution:
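A reconstruction of the likelihood and prior named on this slide, in PRML's standard notation (β denotes the noise precision):

```latex
t_n = y_n + \epsilon_n,
\qquad
p(\mathbf{t}\,|\,\mathbf{y}) = \mathcal{N}(\mathbf{t}\,|\,\mathbf{y},\,\beta^{-1}I_N)
\ \text{(likelihood)},
\qquad
p(\mathbf{y}) = \mathcal{N}(\mathbf{y}\,|\,\mathbf{0},\,K)
\ \text{(prior)} .
```

Integrating out y gives the marginal distribution over the targets, written out with the final slide below.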

Samples of GP Prior over Functions
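The sample plots from this slide are not in the transcript; a small sketch of how such prior samples can be drawn (the grid, kernel width, and jitter are illustrative choices):

```python
import numpy as np

# Draw sample functions from a zero-mean GP prior with a Gaussian kernel.
x = np.linspace(0.0, 1.0, 100)
gamma = 10.0                                        # illustrative kernel width
K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)

rng = np.random.default_rng(0)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))   # jitter for numerical stability
samples = L @ rng.standard_normal((len(x), 5))      # 5 sample functions, one per column
print(samples.shape)                                # (100, 5)
```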

Samples of Data Points

Predictive Distribution The predictive distribution p(t_{N+1} | t_1, ..., t_N) is a Gaussian distribution with mean and variance:
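The standard expressions, in PRML Chapter 6 notation (a reconstruction rather than the exact slide content):

```latex
m(\mathbf{x}_{N+1}) = \mathbf{k}^\top C_N^{-1}\,\mathbf{t},
\qquad
\sigma^2(\mathbf{x}_{N+1}) = c - \mathbf{k}^\top C_N^{-1}\,\mathbf{k},
\qquad
\text{where } C_N = K + \beta^{-1}I_N,\;
k_n = k(\mathbf{x}_n, \mathbf{x}_{N+1}),\;
c = k(\mathbf{x}_{N+1}, \mathbf{x}_{N+1}) + \beta^{-1}.
```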

Predictive Mean The predictive mean is a linear combination of kernel functions evaluated at the training points; we see the same form as in kernel ridge regression and kernel PCA.
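A self-contained NumPy sketch of GP regression prediction using these formulas (the kernel, noise precision, and toy data are illustrative, not the lecture's):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def gp_predict(X, t, X_new, beta=25.0, gamma=1.0):
    """GP regression predictive mean and variance at the rows of X_new."""
    C = rbf(X, X, gamma) + np.eye(len(X)) / beta        # C_N = K + (1/beta) I
    k = rbf(X, X_new, gamma)                            # cross-covariances k(x_n, x*)
    mean = k.T @ np.linalg.solve(C, t)                  # k^T C_N^{-1} t: weighted kernel sum
    c = 1.0 + 1.0 / beta                                # k(x*, x*) + 1/beta for this kernel
    var = c - np.sum(k * np.linalg.solve(C, k), axis=0) # c - k^T C_N^{-1} k
    return mean, var

# Toy usage.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 5.0, size=(20, 1))
t = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(20)
mean, var = gp_predict(X, t, np.array([[2.5], [4.0]]))
print(mean, var)
```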

GP Regression Discussion: what is the difference between GP regression and Bayesian linear regression with Gaussian basis functions?

Marginal Distribution of Target Values
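A reconstruction of the marginal in standard notation, obtained by integrating the GP prior against the Gaussian likelihood:

```latex
p(\mathbf{t}) = \int p(\mathbf{t}\,|\,\mathbf{y})\,p(\mathbf{y})\,d\mathbf{y}
             = \mathcal{N}(\mathbf{t}\,|\,\mathbf{0},\,C_N),
\qquad
(C_N)_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) + \beta^{-1}\delta_{nm}.
```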