Subspace and Kernel Methods April 2004 Seong-Wook Joo

Motivation of Subspace Methods

A subspace is a "manifold" (surface) embedded in a higher-dimensional vector space
– Visual data is represented as a point in a high-dimensional vector space
– Constraints in the natural world and in the imaging process cause the points to "live" in a lower-dimensional subspace

Dimensionality reduction
– Achieved by extracting 'important' features from the dataset → Learning
– Desirable to avoid the "curse of dimensionality" in pattern recognition → Classification: with a fixed sample size, classification performance decreases as the number of features increases

Example: appearance-based methods (vs. model-based)

Linear Subspaces

Definitions/Notation
– X (d×n): sample data set, n d-vectors
– U (d×k): basis vector set, k d-vectors
– Q (k×n): coefficient (component) sets, n k-vectors
– The model is X ≈ U Q, i.e., x_i ≈ Σ_{b=1..k} q_{bi} u_b
– Note: k can be as large as d, in which case the above is a "change of basis" and ≈ becomes =

Selection of U
– Orthonormal bases: Q is simply the projection of X onto U, Q = Uᵀ X
– General independent bases: if k = d, Q is obtained by solving a linear system; if k < d, solve an optimization problem (e.g., least squares)
– Different criteria for selecting U lead to different subspace methods
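Below is a minimal numpy sketch (my addition, not part of the original slides) of the orthonormal-basis case: U is taken as the top-k left singular vectors of the mean-centered data (the PCA choice of basis), the coefficients are the projection Q = Uᵀ X, and the data is reconstructed as X ≈ U Q. The data here is synthetic.

```python
import numpy as np

# Toy data: n = 200 samples in d = 5 dimensions (columns are samples, as in the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))

# Center the data before fitting a linear subspace.
Xc = X - X.mean(axis=1, keepdims=True)

# One common choice of orthonormal basis U: the top-k left singular vectors (PCA).
k = 2
U, s, _ = np.linalg.svd(Xc, full_matrices=False)
U_k = U[:, :k]                      # d x k orthonormal basis

# For an orthonormal basis the coefficients are a simple projection: Q = U^T X.
Q = U_k.T @ Xc                      # k x n coefficients

# Reconstruction: x_i ≈ sum_b q_bi u_b, i.e., X ≈ U Q.
X_hat = U_k @ Q
print("reconstruction error:", np.linalg.norm(Xc - X_hat))
```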

ICA (Independent Component Analysis)

Assumptions, notation
– Measured data is a linear combination of a set of independent signals (random variables x representing (x(1)…x(d)), i.e., row d-vectors)
– x_i = a_{i1} s_1 + … + a_{in} s_n = a_i S (a_i: row n-vector)
– Zero-mean x_i and a_i are assumed
– X = A S (X: n×d measured data, i.e., n different mixtures; A: n×n mixing matrix; S: n×d, n independent signals)

Algorithm
– Goal: given X, find A and S (or find W = A⁻¹ such that S = W X)
– Key idea: by the Central Limit Theorem, a sum of independent random variables is more 'Gaussian' than the individual random variables; a linear combination v X is therefore maximally non-Gaussian when v X = s_i, i.e., v = w_i (naturally, this does not work when s is Gaussian)
– Non-Gaussianity measures: kurtosis (a 4th-order statistic), negentropy
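As a hedged illustration (not from the slides), the following Python sketch builds the mixing model X = A S with two made-up source signals and an arbitrary 2×2 mixing matrix, then recovers the sources with scikit-learn's FastICA. As usual for ICA, the sources are recovered only up to permutation and scaling.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian source signals (rows of S, as in the slide notation).
S = np.vstack([np.sign(np.sin(3 * t)),                              # square-ish wave
               np.sin(5 * t) + 0.1 * rng.normal(size=t.size)])      # noisy sinusoid

# Mixing: X = A S, with an arbitrary (made-up) 2x2 mixing matrix A.
A = np.array([[1.0, 0.5],
              [0.7, 2.0]])
X = A @ S

# FastICA estimates W ≈ A^{-1}; sklearn expects samples in rows, hence the transposes.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X.T).T
print("estimated mixing matrix:\n", ica.mixing_)
```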

ICA Examples (figures, not reproduced here): ICA bases learned from natural images; ICA bases for faces, compared with PCA

CCA (Canonical Correlation Analysis)

Assumptions, notation
– Two sets of vectors X = [x_1 … x_m], Y = [y_1 … y_n]
– X, Y: measured from the same semantic object (physical phenomenon)
– A projection for each set: x' = w_x x, y' = w_y y

Algorithm
– Goal: given X, Y, find w_x, w_y that maximize the correlation between x' and y'
– X Xᵀ = C_xx, Y Yᵀ = C_yy: within-set covariances; X Yᵀ = C_xy: between-set covariance
– Solutions for w_x, w_y via a generalized eigenvalue problem or SVD
– Taking the top k vector pairs W_x = (w_x1 … w_xk), W_y = (w_y1 … w_yk), the k×k correlation matrix of the projected k-vectors x', y' is diagonal with maximized diagonal entries, where k ≤ min(m, n)
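The following is a small illustrative sketch (my addition) using scikit-learn's CCA on synthetic two-view data generated from a shared latent variable. It is only meant to show the "maximize the correlation of the projections" idea, not the generalized-eigenvalue derivation above.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 500

# A shared latent "semantic" variable observed through two different views X and Y.
z = rng.normal(size=(n, 2))
X = z @ rng.normal(size=(2, 6)) + 0.2 * rng.normal(size=(n, 6))   # view 1: 6-dim
Y = z @ rng.normal(size=(2, 4)) + 0.2 * rng.normal(size=(n, 4))   # view 2: 4-dim

# Find the top-k pairs of directions (w_x, w_y) maximizing corr(X w_x, Y w_y).
k = 2
cca = CCA(n_components=k)
Xp, Yp = cca.fit_transform(X, Y)      # projected k-vectors x', y'

# The canonical correlations: the diagonal of the k x k correlation matrix of x', y'.
for i in range(k):
    print("canonical correlation", i, np.corrcoef(Xp[:, i], Yp[:, i])[0, 1])
```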

CCA Example

X: training images; Y: corresponding pose parameters (pan, tilt)
Figures (not reproduced): the first 3 principal components, parameterized by pose (pan, tilt); the first 2 CCA factors, parameterized by pose (pan, tilt)

Comparisons

PCA
– Unsupervised
– Orthogonal bases → minimum Euclidean (reconstruction) error
– Transforms into uncorrelated (Cov = 0) variables

LDA
– Supervised
– (Basis properties otherwise the same as PCA)

ICA
– Unsupervised
– General linear bases
– Transforms into variables that are not only uncorrelated (2nd order) but also as independent as possible (higher order)

CCA
– Supervised
– Separate (orthogonal) linear bases for each data set
– The transformed variables' correlation matrix is 'maximized'

Kernel Methods

Kernels
– φ(·): a nonlinear mapping into a high-dimensional feature space
– Mercer kernels can be decomposed into a dot product: K(x, y) = φ(x)·φ(y)

Kernel PCA
– X (d×n, columns are d-vectors) → φ(X) (high-dimensional vectors)
– Inner-product matrix: φ(X)ᵀ φ(X) = [K(x_i, x_j)] ≡ K_{n×n}(X, X)
– First k eigenvectors e give the transform matrix E (n×k) = [e_1 … e_k]
– The 'real' (feature-space) eigenvectors are φ(X) E
– A new pattern y is mapped onto the principal components by (φ(X) E)ᵀ φ(y) = Eᵀ φ(X)ᵀ φ(y) = Eᵀ K_{n×1}(X, y)
– The "trick" is to use dot products wherever φ(x) occurs

Kernel versions of FDA, ICA, CCA, … also exist
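A rough numpy sketch (my addition) of the kernel-PCA steps listed above, assuming an RBF kernel and synthetic data. Note that proper kernel PCA also centers the kernel matrix in feature space; that step is omitted here to mirror the slide's outline.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all column pairs of A and B."""
    d2 = (np.sum(A**2, axis=0)[:, None]
          + np.sum(B**2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))          # d x n data, columns are samples (slide notation)

# n x n inner-product matrix in feature space: K_ij = <phi(x_i), phi(x_j)>.
K = rbf_kernel(X, X)
# NOTE: full kernel PCA would also center K in feature space; omitted for brevity.

# First k eigenvectors of K form the transform matrix E (n x k).
k = 2
vals, vecs = np.linalg.eigh(K)                      # eigenvalues in ascending order
E = vecs[:, ::-1][:, :k]                            # e_1 ... e_k, largest first
E = E / np.sqrt(np.maximum(vals[::-1][:k], 1e-12))  # so phi(X) E has unit-norm columns

# A new pattern y is mapped onto the principal components by E^T K(X, y).
y = rng.normal(size=(3, 1))
print((E.T @ rbf_kernel(X, y)).ravel())
```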

References

Overview
– H. Bischof and A. Leonardis, "Subspace Methods for Visual Learning and Recognition", ECCV 2002 tutorial slides (shorter version)
– H. Bischof and A. Leonardis, "Kernel and subspace methods for computer vision" (Editorial), Pattern Recognition, Vol. 36, Issue 9, 2003
– B. Moghaddam, "Principal Manifolds and Probabilistic Subspaces for Visual Recognition", PAMI, Vol. 24, No. 6, Jun. 2002 (Introduction section)
– A. Jain, R. Duin, and J. Mao, "Statistical Pattern Recognition: A Review", PAMI, Vol. 22, No. 1, Jan. 2000 (Section 4: Dimensionality Reduction)

ICA
– A. Hyvärinen and E. Oja, "Independent component analysis: algorithms and applications", Neural Networks, Vol. 13, Issue 4, 2000

CCA
– T. Melzer, M. Reiter, and H. Bischof, "Appearance models based on kernel canonical correlation analysis", Pattern Recognition, Vol. 36, Issue 9, 2003

Kernel Density Estimation (aka Parzen window estimator)

The KDE estimate at x using a "kernel" K(·,·) is equivalent to the inner product ⟨φ(x), (1/n) Σ_i φ(x_i)⟩ = (1/n) Σ_i K(x, x_i)
– The inner product can be seen as a similarity measure

KDE and classification
– Let x' = φ(x), and assume the means c_1', c_2' of classes ω_1, ω_2 are at the same distance from the origin (= equal priors?)
– Linear classifier: ⟨x', c_1' − c_2'⟩ > 0 ? ω_1 : ω_2
  = (1/n_1) Σ_{i∈ω_1} ⟨x', x_i'⟩ − (1/n_2) Σ_{i∈ω_2} ⟨x', x_i'⟩
  = (1/n_1) Σ_{i∈ω_1} K(x, x_i) − (1/n_2) Σ_{i∈ω_2} K(x, x_i)
– This is equivalent to the "Bayes classifier" with the class densities estimated by KDE
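A tiny numpy sketch (toy Gaussian class data and a Gaussian Parzen kernel are assumed, not from the slides) of the kernel-mean classifier described above: the sign of the mean kernel similarity to class ω_1 minus the mean kernel similarity to class ω_2 decides the label.

```python
import numpy as np

def gauss_kernel(x, Xi, h=1.0):
    """Parzen kernel values K(x, x_i) for every stored sample x_i (rows of Xi)."""
    d2 = np.sum((Xi - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * h ** 2))

rng = np.random.default_rng(0)
# Toy 2-D training data for two classes omega_1 and omega_2.
X1 = rng.normal(loc=[-1.5, 0.0], size=(100, 2))
X2 = rng.normal(loc=[+1.5, 0.0], size=(100, 2))

def classify(x):
    # <x', c1' - c2'> = (1/n1) sum_{i in omega_1} K(x, x_i) - (1/n2) sum_{i in omega_2} K(x, x_i)
    score = gauss_kernel(x, X1).mean() - gauss_kernel(x, X2).mean()
    return 1 if score > 0 else 2          # omega_1 if positive, else omega_2

print(classify(np.array([-1.0, 0.2])))    # expected: 1
print(classify(np.array([+1.2, -0.3])))   # expected: 2
```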

Getting the coefficients for orthonormal basis vectors: Q = Uᵀ X (Q is k×n, U is d×k, X is d×n)