Principal Component Analysis Based on L1-Norm Maximization. Nojun Kwak, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

Outline: Introduction, Background Knowledge, Problem Description, Algorithms, Experiments, Conclusion.

Introduction. In data analysis problems, why do we need dimensionality reduction? Principal Component Analysis (PCA) is a standard technique for this, but PCA based on the L2-norm is sensitive to the presence of outliers.

Introduction. Some algorithms proposed for this problem: – L1-PCA (weighted median method, convex programming method, maximum likelihood estimation method) – R1-PCA.

Background Knowledge: the L1-norm and L2-norm; Principal Component Analysis (PCA).

Lp-Norm. Consider an n-dimensional vector $x = (x_1, x_2, \ldots, x_n)$. Define the p-norm $\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$. The L1-norm is $\|x\|_1 = \sum_{i=1}^{n} |x_i|$ and the L2-norm is $\|x\|_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}$.

Lp-Norm. For example, for x = [1, 2, 3]:
name       symbol    value          approximation
L1-norm    ||x||_1   1 + 2 + 3      6
L2-norm    ||x||_2   (14)^(1/2)     3.742
L3-norm    ||x||_3   (36)^(1/3)     3.302
L4-norm    ||x||_4   (98)^(1/4)     3.146
L∞-norm    ||x||_∞   max_i |x_i|    3
Special case: the L∞-norm, $\|x\|_\infty = \max_i |x_i|$.
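As a quick check, the table values can be reproduced with NumPy (a minimal sketch; the snippet is not part of the original slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# p-norms for p = 1..4, matching the table above
for p in (1, 2, 3, 4):
    print(f"L{p}-norm:", np.sum(np.abs(x) ** p) ** (1.0 / p))

# the L-infinity norm is the limit p -> infinity: the largest |x_i|
print("Linf-norm:", np.max(np.abs(x)))
```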

Principal Component Analysis. Principal component analysis (PCA) is a technique that seeks the projections that best preserve the data in a least-squares sense; these projections constitute a low-dimensional linear subspace.

Principal Component Analysis. The projection vectors $w_1, \ldots, w_m$ are the eigenvectors of the scatter matrix having the largest eigenvalues. Scatter matrix: $S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$, where $\bar{x}$ is the sample mean.
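A minimal sketch of L2-PCA computed this way (the function name and return convention are mine, not from the slides):

```python
import numpy as np

def pca_l2(X, m):
    """Ordinary (L2) PCA: the top-m eigenvectors of the scatter matrix.

    X : (n_samples, n_features) array.
    Returns (W, mean) with W of shape (n_features, m) and orthonormal columns.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                          # center the data
    S = Xc.T @ Xc                          # scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :m]            # keep the eigenvectors of the m largest eigenvalues
    return W, mean
```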

Principal Component Analysis. Rotational invariance is a fundamental property of Euclidean space equipped with the L2-norm, so PCA, being based on the L2-norm, has the rotational invariance property.

Problem Description. Traditional PCA is sensitive to the presence of outliers: the effect of outliers with a large norm is exaggerated by the use of the L2-norm. Is there another approach?

Problem Description. If we use the L1-norm instead of the L2-norm, the problem becomes $\min_{W,V} \|X - WV\|_1 = \min_{W,V} \sum_{i=1}^{n} \sum_{j=1}^{d} \big| x_{ji} - \sum_{k=1}^{m} w_{jk} v_{ki} \big|$, where $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ is the dataset, $W \in \mathbb{R}^{d \times m}$ is the projection matrix, and $V \in \mathbb{R}^{m \times n}$ is the coefficient matrix.

Problem Description. However, it is very hard to obtain the exact solution of this L1 minimization. To resolve this, Ding et al. proposed the R1-norm, $\|X - WV\|_{R1} = \sum_{i=1}^{n} \big( \sum_{j=1}^{d} ( x_{ji} - \sum_{k=1}^{m} w_{jk} v_{ki} )^2 \big)^{1/2}$, together with an approximate solution procedure; we call this method R1-PCA.

Problem Description. The solution of R1-PCA depends on the dimension m of the subspace being found: the optimal solution for one value of m is not necessarily a subspace of the optimal solution for a larger m. The proposed method: PCA-L1.

Algorithms. We consider instead $W^* = \arg\max_W \|W^T X\|_1 = \arg\max_W \sum_{i=1}^{n} \sum_{k=1}^{m} |w_k^T x_i|$ subject to $W^T W = I_m$. The maximization is done in the feature (projected) space, and the constraint $W^T W = I_m$ ensures the orthonormality of the projection matrix.

Algorithms. However, it is difficult to find a global solution of this problem for m > 1: the optimal i-th projection vector varies with different values of m, as in R1-PCA. How can we solve it?

Algorithms. We simplify it into a series of m = 1 problems using a greedy search method. Setting m = 1, the problem becomes $w^* = \arg\max_w \sum_{i=1}^{n} |w^T x_i|$ subject to $\|w\|_2 = 1$. Although the successive greedy solutions may differ from the optimal solution, they are expected to provide a good approximation.

Algorithms. The optimization is still difficult because the objective contains the absolute value operation, which is nonlinear.

Algorithms. The PCA-L1 procedure: (1) Initialization: pick any $w(0)$ with $\|w(0)\|_2 = 1$ and set $t = 0$. (2) Polarity check: for each $i$, set $p_i(t) = 1$ if $w(t)^T x_i \ge 0$ and $p_i(t) = -1$ otherwise. (3) Flipping and maximization: set $t = t + 1$, $w(t) = \sum_{i=1}^{n} p_i(t-1)\, x_i$, and normalize $w(t) \leftarrow w(t) / \|w(t)\|_2$. (4) Convergence check: (a) if $w(t) \neq w(t-1)$, go to Step 2; (b) else if $w(t)^T x_i = 0$ for some $i$, perturb $w(t)$ by a small nonzero random vector $\Delta w$, renormalize, and go to Step 2; (c) otherwise set $w^* = w(t)$ and stop.
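A minimal sketch of this single-vector procedure in Python (the function name, default initialization, and tolerances are mine; it follows the four steps above but is not the author's code):

```python
import numpy as np

def pca_l1_single(X, w0=None, max_iter=1000, rng=None):
    """One projection vector of PCA-L1 (a sketch of the procedure above).

    X : (n_samples, n_features) array, assumed already centered.
    Returns a unit vector w that locally maximizes sum_i |w^T x_i|.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    # Step 1: initialization (any unit vector works; the choice affects which
    # local maximum is found, as a later slide notes).
    w = rng.standard_normal(d) if w0 is None else np.asarray(w0, dtype=float)
    w = w / np.linalg.norm(w)
    for _ in range(max_iter):
        p = np.where(X @ w >= 0, 1.0, -1.0)          # Step 2: polarity check
        w_new = X.T @ p                              # Step 3: flipping and summation
        w_new /= np.linalg.norm(w_new)
        if np.allclose(w_new, w):                    # Step 4a: no change since last iteration
            if np.any(np.isclose(X @ w_new, 0.0)):   # Step 4b: a sample lies on the boundary
                w_new = w_new + 1e-6 * rng.standard_normal(d)
                w_new /= np.linalg.norm(w_new)
            else:
                return w_new                         # Step 4c: local maximum reached
        w = w_new
    return w
```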

Algorithms. However, does the PCA-L1 procedure find a local maximum point $w^*$? We should prove it.

Theorem. Theorem: with the PCA-L1 procedure, the projection vector $w(t)$ converges to $w^*$, which is a local maximum point of $\sum_{i=1}^{n} |w^T x_i|$. The proof consists of two parts: – $\sum_{i=1}^{n} |w(t)^T x_i|$ is a non-decreasing function of $t$; – the objective function has a local maximum value at $w^*$.

Proof. $\sum_{i=1}^{n} |w(t)^T x_i|$ is a non-decreasing function of $t$. Let $\{p_i(t)\}_{i=1}^{n}$ be the set of optimal polarities corresponding to $w(t)$, so that $p_i(t)\, w(t)^T x_i \ge 0$ for all $i$. Then $\sum_{i=1}^{n} |w(t+1)^T x_i| \ge w(t+1)^T \big( \sum_{i=1}^{n} p_i(t)\, x_i \big) \ge w(t)^T \big( \sum_{i=1}^{n} p_i(t)\, x_i \big) = \sum_{i=1}^{n} |w(t)^T x_i|$.

Proof. The second inequality holds because $w(t+1)$ and $\sum_{i=1}^{n} p_i(t)\, x_i$ are parallel ($w(t+1)$ is the normalized version of this sum), and the inner product of two vectors of fixed length is maximized when they are parallel. The first inequality holds because $|w(t+1)^T x_i| \ge p_i(t)\, w(t+1)^T x_i$ for every $i$.

Proof. So the objective function is non-decreasing, and since there are a finite number of data points (hence a finite number of possible polarity assignments), the PCA-L1 procedure converges to a projection vector $w^*$.

Proof. The objective function has a local maximum value at $w^*$. Because $w(t)$ converges to $w^*$ by the PCA-L1 procedure, $w^*$ is parallel to $\sum_{i=1}^{n} p_i^*\, x_i$, so $w^{*T} \big( \sum_{i=1}^{n} p_i^*\, x_i \big) \ge w^T \big( \sum_{i=1}^{n} p_i^*\, x_i \big)$ for all unit vectors $w$. By Step 4b, $w^{*T} x_i \neq 0$ for all $i$.

Proof. There exists a small neighborhood $N(w^*)$ of $w^*$ such that if $w \in N(w^*)$, then $p_i^*\, w^T x_i > 0$ for all $i$ (the optimal polarities do not change). Then, since $w^*$ is parallel to $\sum_{i=1}^{n} p_i^*\, x_i$, the inequality $\sum_{i=1}^{n} |w^{*T} x_i| \ge \sum_{i=1}^{n} |w^T x_i|$ holds for all $w \in N(w^*)$. Hence $w^*$ is a local maximum point.

Algorithms. So the PCA-L1 procedure finds a local maximum point $w^*$. Because $w^*$ is a linear combination of the data points, i.e., $w^* \propto \sum_{i=1}^{n} p_i\, x_i$, the solution is invariant to rotations: under a rotational transformation $R: X \to RX$, the projection transforms as $W \to RW$.

Algorithms. Computational complexity: the procedure is proportional to $n\, d\, n_{it}$, where $n_{it}$ is the number of iterations for convergence; $n_{it}$ does not depend on the dimension $d$.

Algorithms. The PCA-L1 procedure only finds a local maximum solution; it may not be the global solution. We can choose the initial vector $w(0)$ appropriately: – by setting $w(0)$ to a suitable starting vector; – by running PCA-L1 with several different initial vectors.

Algorithms. Extracting multiple features follows the idea of the original PCA: run the PCA-L1 procedure greedily for each feature dimension. Set $w_0 = 0$ and $x_i^0 = x_i$; for $j = 1, \ldots, m$, deflate the samples, $x_i^j = x_i^{j-1} - w_{j-1} (w_{j-1}^T x_i^{j-1})$, and apply the PCA-L1 procedure to $\{x_i^j\}_{i=1}^{n}$ to obtain $w_j$.
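Continuing the sketch above, the greedy multi-feature version deflates the data before each new direction (again a sketch of the described procedure, not the author's code):

```python
def pca_l1(X, m, **kwargs):
    """Greedy multi-feature PCA-L1 (sketch): find one direction at a time,
    deflating the samples before searching for the next direction."""
    Xj = np.array(X, dtype=float)        # working copy, assumed centered
    W = []
    for _ in range(m):
        w = pca_l1_single(Xj, **kwargs)  # single-vector procedure sketched above
        W.append(w)
        Xj = Xj - np.outer(Xj @ w, w)    # remove the component along w from every sample
    return np.column_stack(W)            # shape (n_features, m)
```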

Algorithms. How can we guarantee the orthonormality of the projection vectors? We should show that $w_j$ is orthogonal to $w_{j-1}$.

Proof. The projection vector $w_j$ is a linear combination of the samples $\{x_i^j\}$, so it lies in the subspace spanned by $\{x_i^j\}$. Then consider $w_{j-1}^T x_i^j = w_{j-1}^T \big( x_i^{j-1} - w_{j-1} (w_{j-1}^T x_i^{j-1}) \big) = w_{j-1}^T x_i^{j-1} - (w_{j-1}^T w_{j-1}) (w_{j-1}^T x_i^{j-1}) = 0$, using the deflation step of the greedy search algorithm and the fact that $w_{j-1}$ is a unit vector ($w_{j-1}^T w_{j-1} = 1$).

Proof. Because $w_{j-1}^T x_i^j = 0$ for every $i$, $w_{j-1}$ is orthogonal to the subspace spanned by $\{x_i^j\}$; since $w_j$ lies in that subspace, $w_j$ is orthogonal to $w_{j-1}$. The orthonormality of the projection vectors is therefore guaranteed.
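As a quick numerical sanity check of this argument, reusing the sketches above (the data here are random and purely illustrative):

```python
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)                   # center the data
W = pca_l1(X, m=3, rng=rng)
print(np.allclose(W.T @ W, np.eye(3)))   # the extracted directions are orthonormal
```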

Algorithms. Even if the greedy search algorithm does not provide the optimal solution, it provides a set of good projections that maximize the L1 dispersion.

Algorithms. For data analysis, we may want to decide how much of the data is captured. In L2-PCA we can compute the eigenvalues of the scatter matrix: each eigenvalue $\lambda_j$ equals the variance of the corresponding feature, the total variance is $\sum_j \lambda_j$, and we can compute the ratio of captured variance to total variance. For example, $m$ is set to the smallest number of eigenvalues whose sum exceeds 95% of the total variance.

Algorithms. In PCA-L1, once $W = [w_1, \ldots, w_m]$ is obtained, we can likewise compute the variance of each extracted feature (the sample variance of $w_j^T x_i$), the sum of these variances, and the total variance of the data. Thus we can set the appropriate number of extracted features just as in the original PCA.
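A small sketch of this selection rule, continuing the code above (the 95% threshold follows the preceding slide; the exact variance definitions are my reading of it):

```python
def choose_num_features(X, W, threshold=0.95):
    """Smallest number of extracted features whose variances capture at least
    `threshold` of the total variance of the centered data X."""
    total_var = np.sum(np.var(X, axis=0))      # total variance of the data
    feat_var = np.var(X @ W, axis=0)           # variance of each extracted feature
    cumulative = np.cumsum(feat_var) / total_var
    m = np.searchsorted(cumulative, threshold) + 1
    return int(min(m, W.shape[1]))             # clamp if the threshold is never reached
```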

Experiments. In the experiments, we apply the PCA-L1 algorithm and compare it with R1-PCA and the original (L2) PCA. Three experiments: – a toy problem with an outlier – UCI data sets – face reconstruction.

A Toy Problem with an Outlier. Consider a set of data points in a 2D space, one of which is an outlier. If we discard the outlier, the projection vector should follow the direction of the remaining points.
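The exact toy data points appear only in the slide figure, so the setup below is a hypothetical one in the same spirit (the data values are mine, not the paper's), reusing the sketches above:

```python
# Hypothetical 2-D data: points near the line y = x plus one distant outlier.
# These values are illustrative only, not the data points used in the paper.
rng = np.random.default_rng(0)
t = rng.uniform(-5, 5, size=30)
X = np.column_stack([t, t + 0.1 * rng.standard_normal(30)])
X = np.vstack([X, [8.0, -8.0]])          # a single outlier off the line
X = X - X.mean(axis=0)

W_l2, _ = pca_l2(X, 1)                   # L2-PCA direction (sketch above)
w_l1 = pca_l1_single(X, rng=rng)         # PCA-L1 direction (sketch above)
print("L2-PCA direction: ", W_l2.ravel())
print("PCA-L1 direction:", w_l1)
```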

A Toy Problem with an Outlier. The projection vectors (figure: the data points with the outlier marked, together with the projection directions found by each method).

A Toy Problem with an Outlier. The residual errors (figure: average residual error of PCA-L1, L2-PCA, and R1-PCA, with the outlier marked; the L2-based projection is much influenced by the outlier).

UCI Data Sets. Data sets from the UCI machine learning repository are used to compare classification performance. A 1-NN classifier was used, with 10-fold cross validation to obtain the average classification rate. For PCA-L1, a fixed initial projection vector $w(0)$ was used.
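A sketch of this evaluation protocol with scikit-learn (loading the data and computing the projection matrix W are assumed to happen elsewhere; the helper below is mine, not from the slides):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate_projection(X, y, W, n_folds=10):
    """Average 1-NN classification rate of the projected features via k-fold CV."""
    Z = (X - X.mean(axis=0)) @ W               # project the centered data
    clf = KNeighborsClassifier(n_neighbors=1)  # 1-NN classifier, as on the slide
    scores = cross_val_score(clf, Z, y, cv=n_folds)
    return scores.mean()
```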

UCI Data Sets. The data sets (table: the UCI data sets used and their characteristics).

UCI Data Sets. The average correct classification rates (figure: classification rate versus the number of extracted features for PCA-L1, L2-PCA, and R1-PCA).

UCI Data Sets. The average correct classification rates, continued (figure).

UCI Data Sets. The average correct classification rates, continued (figure). In many cases, PCA-L1 outperformed L2-PCA and R1-PCA when the number of extracted features was small.

UCI Data Sets. Average classification rate on the UCI data sets: PCA-L1 outperformed the other methods by 1% on average.

UCI Data Sets. Computation cost (comparison of the computational cost of each method).

Face Reconstruction. The Yale face database: – 15 individuals – 11 face images per person. Among the 165 images, 20% were selected randomly and occluded with a noise block.

Face Reconstruction. To these image sets we applied: – L2-PCA (eigenfaces) – R1-PCA – PCA-L1. Then we used the extracted features to reconstruct the images.
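A sketch of the reconstruction step shared by all three methods (the mean-squared error against the unoccluded originals is my assumption of what "average reconstruction error" denotes):

```python
def reconstruct(X, W, mean):
    """Project the centered images onto the learned directions and map back."""
    return mean + (X - mean) @ W @ W.T

def average_reconstruction_error(X_clean, X_occluded, W, mean):
    """Mean squared error between reconstructions of the occluded images and
    the original, unoccluded images."""
    X_rec = reconstruct(X_occluded, W, mean)
    return np.mean(np.sum((X_rec - X_clean) ** 2, axis=1))
```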

Face Reconstruction. Experimental results (figure: sample occluded faces and their reconstructions by each method).

Face Reconstruction. The average reconstruction error (figure: error between the original and reconstructed images versus the number of extracted features). From 10 to 20 features the difference became apparent, and PCA-L1 outperformed the other methods.

Face Reconstruction. We added 30 dummy images consisting of random black-and-white dots to the original 165 Yale images and applied: – L2-PCA (eigenfaces) – R1-PCA – PCA-L1. We then reconstructed the images from the extracted features.

Face Reconstruction. Experimental results (figure: reconstructions in the presence of the dummy images).

Face Reconstruction. The average reconstruction error (figure). From 6 to 36 features the error of L2-PCA is constant, and from 14 to 36 features the error of R1-PCA is increasing: in both cases the dummy images seriously affect the projection vectors.

Conclusion. PCA-L1 was proven to find a local maximum point. Its computational complexity is proportional to: – the number of samples – the dimension of the input space – the number of iterations. The method is usually faster than the alternatives and robust to outliers.

Principal Component Analysis. Given a dataset $D$ of $l$ samples $\{x_1, \ldots, x_l\}$, we represent $D$ by projecting the data onto a line running through the sample mean $m$, writing each sample as $x_k \approx m + a_k e$, where $e$ is a unit direction vector and $a_k$ is a scalar coefficient.

Principal Component Analysis. Then the squared-error criterion is $J(a_1, \ldots, a_l, e) = \sum_{k=1}^{l} \| (m + a_k e) - x_k \|^2$, and minimizing over the coefficients gives $a_k = e^T (x_k - m)$.

Principal Component Analysis. To look for the best direction $e$, substitute $a_k = e^T (x_k - m)$ back into $J$: $J_1(e) = -\, e^T S e + \sum_{k=1}^{l} \| x_k - m \|^2$, where $S = \sum_{k=1}^{l} (x_k - m)(x_k - m)^T$ is the scatter matrix.

Principal Component Analysis. We want to minimize $J_1(e)$, i.e., maximize $e^T S e$ subject to $\|e\| = 1$. We use a Lagrange multiplier $\lambda$: $u = e^T S e - \lambda (e^T e - 1)$, and setting $\partial u / \partial e = 2 S e - 2 \lambda e = 0$ gives $S e = \lambda e$.

Principal Component Analysis. Since $e^T S e = \lambda e^T e = \lambda$, minimizing $J_1$ is achieved by choosing $e$ as the eigenvector of $S$ with the largest eigenvalue. Similarly, we can extend the 1-d projection to an $m$-d projection by taking the eigenvectors corresponding to the $m$ largest eigenvalues.