Introduction
In data analysis problems, why do we need dimensionality reduction? Principal Component Analysis (PCA) is the standard tool, but PCA based on the L2-norm is sensitive to the presence of outliers.
Introduction
Some algorithms proposed for this problem:
– L1-PCA: weighted median method, convex programming method, maximum likelihood estimation method
– R1-PCA
Background Knowledge
– L1-norm, L2-norm
– Principal Component Analysis (PCA)
Lp-Norm
Consider an n-dimensional vector $x = [x_1, \dots, x_n]^T$. Define the p-norm:
$\|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$.
The L1-norm is $\|x\|_1 = \sum_{i=1}^{n} |x_i|$; the L2-norm is $\|x\|_2 = \left(\sum_{i=1}^{n} x_i^2\right)^{1/2}$.
Lp-Norm
For example, x = [1, 2, 3]:

name    | symbol | value          | approximation
L1-norm | |x|_1  | 6              | 6.000
L2-norm | |x|_2  | 14^{1/2}       | 3.742
L3-norm | |x|_3  | 36^{1/3}       | 3.302
L4-norm | |x|_4  | 98^{1/4}       | 3.146
L∞-norm | |x|_∞  | max_i |x_i| = 3 | 3.000

Special case: the L∞-norm is the limit of the Lp-norm as p → ∞.
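These values can be checked numerically; a quick sketch using NumPy (not part of the original slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Lp-norms of x for p = 1, 2, 3, 4 and the infinity norm
for p in [1, 2, 3, 4]:
    print(f"L{p}-norm: {np.linalg.norm(x, ord=p):.4f}")
print(f"Linf-norm: {np.linalg.norm(x, ord=np.inf):.4f}")
```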
Principal Component Analysis
Principal component analysis (PCA) is a technique that seeks the projections which best preserve the data in a least-squares sense. These projections constitute a low-dimensional linear subspace.
Principal Component Analysis
The projection vectors $w_1, \dots, w_m$ are the m eigenvectors of the scatter matrix S with the largest eigenvalues.
Scatter matrix: $S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$, where $\bar{x}$ is the sample mean.
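As an illustrative sketch (not from the slides; the function name and shapes are my choices), L2-PCA via the scatter matrix can be written as:

```python
import numpy as np

def pca_l2(X, m):
    """L2-PCA: top-m eigenvectors of the scatter matrix.

    X: (n, d) data matrix, one sample per row.
    Returns W: (d, m) projection matrix with orthonormal columns.
    """
    Xc = X - X.mean(axis=0)               # center the data
    S = Xc.T @ Xc                         # scatter matrix, shape (d, d)
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :m]        # largest-eigenvalue vectors first

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W = pca_l2(X, 2)
print(W.shape)  # (5, 2)
```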
Principal Component Analysis
Rotational invariance is a fundamental property of Euclidean space with the L2-norm; consequently, PCA has the rotational invariance property.
Problem Description
Traditional PCA suffers in the presence of outliers: the effect of outliers with a large norm is exaggerated by the use of the L2-norm. Is there another approach?
Problem Description
If we use the L1-norm instead of the L2-norm, the problem becomes
$\min_{W,V} E(W,V) = \|X - WV\|_1 = \sum_{i=1}^{d}\sum_{j=1}^{n} \left| x_{ij} - \sum_{k=1}^{m} w_{ik} v_{kj} \right|$,
where $X \in \mathbb{R}^{d \times n}$ is the dataset, $W \in \mathbb{R}^{d \times m}$ is the projection matrix, and $V \in \mathbb{R}^{m \times n}$ is the coefficient matrix.
Problem Description
However, it is very hard to obtain the exact solution. To address this, Ding et al. proposed the R1-norm and an approximate solution, which we call R1-PCA.
Problem Description
The solution of R1-PCA depends on the dimension m of the subspace being found: the optimal solution for a given m is not necessarily a subspace of the optimal solution for a larger m.
The proposed method: PCA-L1.
Algorithms
We consider the following problem:
$W^{*} = \arg\max_{W} \|W^{T}X\|_1 = \arg\max_{W} \sum_{i=1}^{n} \|W^{T}x_i\|_1$, subject to $W^{T}W = I_m$.
The maximization is done in the feature space; the constraint ensures the orthonormality of the projection matrix.
Algorithms
However, it is difficult to find a global solution for m > 1: the optimal i-th projection vector varies with different m, as in R1-PCA. How can we solve this?
Algorithms
We simplify it into a series of m = 1 problems using a greedy search method. If we set m = 1, the problem becomes
$w^{*} = \arg\max_{\|w\|_2 = 1} \sum_{i=1}^{n} |w^{T}x_i|$.
Although the successive greedy solutions may differ from the optimal solution, they are expected to provide a good approximation.
Algorithms
The optimization is still difficult because the objective contains the absolute-value operation, which is nonlinear.
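A minimal NumPy sketch of the single-component PCA-L1 iteration referenced by the theorem and proof that follow (the function name, initial vector, and convergence test are my choices; the perturbation of Step 4b, used when some $w^{T}x_i = 0$, is omitted for brevity):

```python
import numpy as np

def pca_l1_1d(X, n_iter=100, w0=None):
    """One projection vector maximizing sum_i |w^T x_i| (PCA-L1, m = 1).

    X: (n, d) data matrix.  Returns a unit vector w of shape (d,).
    """
    n, d = X.shape
    w = np.ones(d) if w0 is None else np.asarray(w0, float)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        p = np.where(X @ w >= 0, 1.0, -1.0)  # polarities p_i = sign(w^T x_i)
        w_new = X.T @ p                      # w <- sum_i p_i x_i
        w_new /= np.linalg.norm(w_new)
        converged = np.allclose(w_new, w)
        w = w_new
        if converged:
            break
    return w

# example: points on the line spanned by (1, 1)/sqrt(2)
u = np.array([1.0, 1.0]) / np.sqrt(2)
w = pca_l1_1d(np.outer([1.0, 2.0, -1.5, 3.0], u))
```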
Algorithms
However, does the PCA-L1 procedure find a local maximum point? We should prove it.
Theorem
Theorem: With the PCA-L1 procedure, the vector $w(t)$ converges to $w^{*}$, which is a local maximum point of $f(w) = \sum_{i=1}^{n} |w^{T}x_i|$.
The proof includes two parts:
– $f(w(t))$ is a non-decreasing function of t.
– The objective function has a local maximum value at $w^{*}$.
Proof
$f(w(t))$ is a non-decreasing function of t. Let $\{p_i(t)\}_{i=1}^{n}$ be the set of optimal polarities corresponding to $w(t)$, i.e., $p_i(t) = \mathrm{sign}(w(t)^{T}x_i)$, so that $p_i(t)\,w(t)^{T}x_i \ge 0$ for all i. Then
$f(w(t+1)) = \sum_i |w(t+1)^{T}x_i| \ge \sum_i p_i(t)\,w(t+1)^{T}x_i = w(t+1)^{T}\Big(\sum_i p_i(t)\,x_i\Big) \ge w(t)^{T}\Big(\sum_i p_i(t)\,x_i\Big) = \sum_i |w(t)^{T}x_i| = f(w(t)).$
Proof
The second inequality holds because $w(t+1)$ and $\sum_i p_i(t)\,x_i$ are parallel: among unit vectors w, the inner product $w^{T}v$ of two vectors is maximized when w is parallel to v.
Proof
So the objective function is non-decreasing, and since there are a finite number of data points (hence a finite number of polarity patterns), the PCA-L1 procedure converges to a projection vector $w^{*}$.
Proof
The objective function has a local maximum value at $w^{*}$. Because $w(t)$ converges to $w^{*}$ by the PCA-L1 procedure, $p_i\,w^{*T}x_i \ge 0$ for all i. By Step 4b, $w^{*T}x_i \ne 0$ for all i.
Proof
There exists a small neighborhood $N(w^{*})$ of $w^{*}$ such that if $w \in N(w^{*})$, then $p_i\,w^{T}x_i > 0$ for all i. Then, since $w^{*}$ is parallel to $\sum_i p_i x_i$, the inequality $f(w^{*}) = w^{*T}\sum_i p_i x_i \ge w^{T}\sum_i p_i x_i = f(w)$ holds for all $w \in N(w^{*})$. Hence $w^{*}$ is a local maximum point.
Algorithms
So the PCA-L1 procedure finds a local maximum point $w^{*}$. Because $w^{*}$ is a linear combination of the data points, i.e., $w^{*} \propto \sum_i p_i x_i$, the solution is invariant to rotations: under a rotational transformation R: X → RX, we get W → RW.
Algorithms
Computational complexity: $O(n\,d\,n_{it})$, where $n_{it}$ is the number of iterations until convergence. $n_{it}$ does not depend on the dimension d.
Algorithms
The PCA-L1 procedure finds only a local maximum solution, which may not be the global solution. We can mitigate this:
– by setting the initial vector $w(0)$ appropriately, or
– by running PCA-L1 with several different initial vectors and keeping the best solution.
Algorithms
Extracting multiple features: following the original PCA's idea, we run the PCA-L1 procedure once per feature dimension on greedily deflated data:
$X^{j} = X^{j-1} - w_{j-1}\,(w_{j-1}^{T}X^{j-1})$, then apply PCA-L1 to $X^{j}$ to obtain $w_j$.
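A sketch of the greedy multi-feature procedure under the deflation scheme above (function name, initial vector, and iteration cap are my choices):

```python
import numpy as np

def pca_l1(X, m, n_iter=100):
    """Greedy multi-feature PCA-L1: deflate and re-run the m = 1 procedure.

    X: (n, d) data matrix.  Returns W: (d, m) with orthonormal columns.
    """
    X = np.asarray(X, float).copy()
    n, d = X.shape
    W = np.zeros((d, m))
    for j in range(m):
        w = np.ones(d) / np.sqrt(d)              # simple initial vector
        for _ in range(n_iter):
            p = np.where(X @ w >= 0, 1.0, -1.0)  # polarities
            w_new = X.T @ p                      # w <- sum_i p_i x_i
            w_new /= np.linalg.norm(w_new)
            converged = np.allclose(w_new, w)
            w = w_new
            if converged:
                break
        W[:, j] = w
        X = X - np.outer(X @ w, w)               # deflation: remove w component
    return W
```

Because each deflated sample is orthogonal to the previous projection vector, the columns of W come out orthonormal, as the proof on the next slides shows.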
Algorithms
How is the orthonormality of the projection vectors guaranteed? We should show that $w_j$ is orthogonal to $w_{j-1}$.
Proof
The projection vector $w_j$ is a linear combination of the samples $\{x_i^{j}\}$, so it lies in the subspace spanned by them. From the greedy search algorithm, $x_i^{j} = x_i^{j-1} - w_{j-1}(w_{j-1}^{T}x_i^{j-1})$, where $w_{j-1}$ is a unit normal vector ($\|w_{j-1}\|_2 = 1$). Then we consider $w_{j-1}^{T}x_i^{j}$:
Proof
$w_{j-1}^{T}x_i^{j} = w_{j-1}^{T}x_i^{j-1} - (w_{j-1}^{T}w_{j-1})(w_{j-1}^{T}x_i^{j-1}) = 0.$
Because each $x_i^{j}$ is orthogonal to $w_{j-1}$, $w_j$ is also orthogonal to $w_{j-1}$. The orthonormality of the projection vectors is guaranteed.
Algorithms
Even if the greedy search algorithm does not provide the optimal solution, it provides a set of good projections that maximize the L1 dispersion.
Algorithms
For data analysis, we can decide how much of the data is captured. In PCA, we can compute the eigenvalues: the eigenvalue $\lambda_j = w_j^{T}S\,w_j$ is equivalent to the variance of the j-th feature, and we can compute the ratio of the captured variance to the total variance, $\sum_{j=1}^{m}\lambda_j / \sum_{j=1}^{d}\lambda_j$. m is set to the smallest number of features whose cumulative eigenvalue sum exceeds, e.g., 95% of the total variance.
Algorithms
In PCA-L1, once $w_j$ is obtained, we can likewise compute the variance of the j-th feature, $s_j = \sum_{i=1}^{n}(w_j^{T}x_i)^2$ (for centered data). The sum of captured variances is $\sum_{j=1}^{m}s_j$, and the total variance is $\mathrm{tr}(S)$. We can then set the appropriate number of extracted features as in the original PCA.
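A sketch of this feature-count selection rule (function name and threshold handling are assumptions):

```python
import numpy as np

def choose_m(X, W, threshold=0.95):
    """Pick the number of features whose captured variance
    first exceeds `threshold` of the total variance.

    X: (n, d) centered data; W: (d, k) orthonormal projection vectors.
    """
    total = np.sum(X ** 2)                      # total variance (trace of scatter)
    per_feature = np.sum((X @ W) ** 2, axis=0)  # variance captured by each w_j
    cumulative = np.cumsum(per_feature) / total
    return int(np.searchsorted(cumulative, threshold) + 1)
```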
Experiments
In the experiments, we applied the PCA-L1 algorithm and compared it with R1-PCA and the original PCA (L2-PCA). Three experiments:
– a toy problem with an outlier
– UCI data sets
– face reconstruction
A Toy Problem with an Outlier
Consider a set of data points in a 2D space, one of which is an outlier. If we discard the outlier, the projection vector should follow the direction of the remaining points.
A Toy Problem with an Outlier
The projection vectors: [figure showing the projection vector found by each method, with the outlier marked]
A Toy Problem with an Outlier
The residual errors: the average residual errors of PCA-L1, L2-PCA, and R1-PCA were compared; L2-PCA was much influenced by the outlier.
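The slides' exact toy data are not reproduced here; the following sketch uses a hypothetical 2-D set with one large outlier to illustrate the effect, comparing the L2-PCA direction with the PCA-L1 direction:

```python
import numpy as np

# Hypothetical toy data: points near the direction (1, 1), plus one outlier.
rng = np.random.default_rng(1)
t = rng.uniform(-5, 5, size=30)
X = np.outer(t, [1.0, 1.0]) + rng.normal(scale=0.1, size=(30, 2))
X = np.vstack([X, [30.0, -30.0]])              # a single large outlier

# L2-PCA direction: top eigenvector of the scatter matrix.
Xc = X - X.mean(axis=0)
_, vecs = np.linalg.eigh(Xc.T @ Xc)
w_l2 = vecs[:, -1]

# PCA-L1 direction via the sign-flip iteration.
w = np.ones(2) / np.sqrt(2)
for _ in range(100):
    p = np.where(Xc @ w >= 0, 1.0, -1.0)
    w_new = Xc.T @ p
    w_new /= np.linalg.norm(w_new)
    converged = np.allclose(w_new, w)
    w = w_new
    if converged:
        break
w_l1 = w

u = np.array([1.0, 1.0]) / np.sqrt(2)          # direction of the inliers
print("L2 alignment with inlier direction:", abs(w_l2 @ u))
print("L1 alignment with inlier direction:", abs(w_l1 @ u))
```

On this data the outlier dominates the scatter matrix, pulling the L2 direction away from the inliers, while the L1 objective stays close to them.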
UCI Data Sets
Data sets from the UCI machine learning repository were used to compare classification performance. A 1-NN classifier was used, with 10-fold cross-validation to obtain the average classification rate. For PCA-L1, a fixed initial projection vector was used.
UCI Data Sets
The data sets: [table not reproduced]
UCI Data Sets
The average correct classification rates: [figures not reproduced]
UCI Data Sets
The average correct classification rates: in many cases, PCA-L1 outperformed L2-PCA and R1-PCA when the number of extracted features was small.
UCI Data Sets
Average classification rate on the UCI data sets: PCA-L1 outperformed the other methods by about 1% on average.
UCI Data Sets
Computation cost: [table not reproduced]
Face Reconstruction
The Yale face database:
– 15 individuals
– 11 face images per person
Among the 165 images, 20% were selected at random and occluded with a noise block.
Face Reconstruction
For these image sets, we applied:
– L2-PCA (eigenface)
– R1-PCA
– PCA-L1
Then we used the extracted features to reconstruct the images.
Face Reconstruction
Experimental results: [images not reproduced]
Face Reconstruction
The average reconstruction error (the distance between each original image and its reconstructed image) was compared across methods. From 10 to 20 features, the difference became apparent, and PCA-L1 outperformed the other methods.
Face Reconstruction
We added 30 dummy images consisting of random black-and-white dots to the original 165 Yale images, and applied:
– L2-PCA (eigenface)
– R1-PCA
– PCA-L1
We then reconstructed the images with the extracted features.
Face Reconstruction
Experimental results: [images not reproduced]
Face Reconstruction
The average reconstruction error: from 6 to 36 features, the error of L2-PCA stays constant, and from 14 to 36 features, the error of R1-PCA increases. In both cases, the dummy images seriously affect those projection vectors.
Conclusion
PCA-L1 was proven to find a local maximum point. Its computational complexity is proportional to:
– the number of samples
– the dimension of the input space
– the number of iterations
The method is usually faster than the alternatives and robust to outliers.
Principal Component Analysis
Given a dataset of l samples, $D = \{x_1, \dots, x_l\}$, we represent D by projecting the data onto a line running through the sample mean m, denoted as $x = m + a\,e$ (with $\|e\| = 1$), where e is the direction of the line and a is a scalar coefficient.
Principal Component Analysis
Then the squared-error criterion is
$J(a_1, \dots, a_l, e) = \sum_{k=1}^{l} \|(m + a_k e) - x_k\|^2$,
which is minimized over the coefficients by $a_k = e^{T}(x_k - m)$.
Principal Component Analysis
To look for the best direction e, substitute $a_k = e^{T}(x_k - m)$ into J:
$J(e) = -e^{T}Se + \sum_{k=1}^{l} \|x_k - m\|^2$, where $S = \sum_{k=1}^{l} (x_k - m)(x_k - m)^{T}$ is the scatter matrix.
Principal Component Analysis
We want to minimize J(e), i.e., maximize $e^{T}Se$ subject to $\|e\| = 1$. We use a Lagrange multiplier $\lambda$:
$u = e^{T}Se - \lambda(e^{T}e - 1)$, and setting $\partial u/\partial e = 2Se - 2\lambda e = 0$ gives $Se = \lambda e$.
Principal Component Analysis
Since $e^{T}Se = \lambda$, minimizing J can be achieved by choosing e as the eigenvector of S with the largest eigenvalue. Similarly, we can extend the 1-d projection to an m-d projection by taking the m largest eigenvectors.
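As a numerical sanity check of this derivation (illustrative only, not from the slides), the top eigenvector of the scatter matrix should give a smaller 1-d reconstruction error than any other direction:

```python
import numpy as np

# Illustrative check: the largest-eigenvalue direction of the scatter
# matrix yields the smallest 1-d least-squares reconstruction error.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])
m = X.mean(axis=0)
S = (X - m).T @ (X - m)
e = np.linalg.eigh(S)[1][:, -1]          # top eigenvector

def recon_error(direction):
    d = direction / np.linalg.norm(direction)
    a = (X - m) @ d                      # optimal coefficients a_k = e^T(x_k - m)
    Xhat = m + np.outer(a, d)            # reconstructed points on the line
    return np.sum((Xhat - X) ** 2)

err_top = recon_error(e)
err_rand = min(recon_error(rng.normal(size=3)) for _ in range(20))
print(err_top <= err_rand)  # True
```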