CSE 446 Dimensionality Reduction and PCA Winter 2012 Slides adapted from Carlos Guestrin & Luke Zettlemoyer

Machine Learning: supervised learning (parametric and non-parametric), unsupervised learning, reinforcement learning.
– Fri: K-means & agglomerative clustering
– Mon: Expectation maximization (EM)
– Wed: Principal component analysis (PCA)

EM: another iterative clustering algorithm.
– Pick K random cluster models
– Alternate: assign data instances proportionately to the different models; revise each cluster model based on its (proportionately) assigned points
– Stop when no changes
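A minimal sketch of this loop for a 1-D mixture of Gaussians in numpy (not from the slides; the random initialization, fixed iteration count, and small variance floor are illustrative choices):

import numpy as np

def em_gmm_1d(x, k, iters=100, seed=0):
    """Soft EM: alternate proportional assignments and cluster-model updates."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu = rng.choice(x, size=k, replace=False)     # pick K random cluster centers
    var = np.full(k, x.var())                     # start each cluster at the data variance
    pi = np.full(k, 1.0 / k)                      # equal mixing weights
    for _ in range(iters):
        # E-step: responsibility of each cluster for each point (proportional assignment)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: revise each cluster model from its (proportionately) assigned points
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return pi, mu, var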

Gaussian Mixture Example: Start

After first iteration

After 20th iteration

Can we do EM with hard assignments? (univariate case for simplicity)
Iterate: on the t'th iteration let our estimates be θ^(t) = { μ_1^(t), μ_2^(t), …, μ_k^(t) }
– E-step: compute the "expected" classes of all datapoints, where δ represents a hard assignment to the "most likely" or nearest cluster
– M-step: compute the most likely new μ's given the class expectations
Equivalent to the k-means clustering algorithm!!!
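Replacing the soft responsibilities with a hard δ (each point goes entirely to its nearest mean) turns the same loop into k-means; a small univariate sketch, with the initialization and stopping rule chosen for illustration:

import numpy as np

def kmeans_hard_em(x, k, iters=100, seed=0):
    """Hard-assignment EM on the means: E-step = nearest center, M-step = cluster mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu = x[rng.choice(len(x), size=k, replace=False)]
    assign = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # E-step: delta is a hard assignment to the "most likely" (nearest) cluster
        assign = np.argmin(np.abs(x[:, None] - mu), axis=1)
        # M-step: most likely new mus given the hard class assignments
        new_mu = np.array([x[assign == j].mean() if np.any(assign == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):               # no change: converged
            break
        mu = new_mu
    return mu, assign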

Summary
– K-means for clustering: the algorithm; converges because it's coordinate ascent
– Hierarchical agglomerative clustering
– EM for mixture of Gaussians: viewed as coordinate ascent in parameter space (π_i, μ_i, Σ_i); how to "learn" maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data; relation to K-means (hard / soft clustering, probabilistic model)
Remember: EM can (& does) get stuck in local minima.

Dimensionality Reduction. Input data may have thousands or millions of dimensions! (e.g., how many dimensions do images have? how many does text data have?) Yet much of that is redundant; this bio is still readable with every vowel removed: "Dnl S. Wld s Thms J. Cbl / WRF Prfssr f Cmptr Scnc nd ngnrng t th nvrsty f Wshngtn. ftr gng t schl t Phllps cdmy, h rcvd bchlr's dgrs n bth Cmptr Scnc nd Bchmstry t Yl nvrsty n …" Are all those dimensions (letters, pixels) necessary?

Dimensionality Reduction: represent data with fewer dimensions!
– Easier learning: fewer parameters; what if |features| >> |training examples|?
– Better visualization: hard to understand more than 3D or 4D
– Discover the "intrinsic dimensionality" of the data: high-dimensional data that is truly lower dimensional

Feature Selection: want to learn f: X → Y, where X = (X_1, …, X_n), but some features are more important than others.
Approach: select a subset of the features to be used by the learning algorithm
– Score each feature (or set of features)
– Select the set of features with the best score

Greedy forward feature selection algorithm
– Pick a dictionary of features, e.g., polynomials for linear regression
– Greedy: start from the empty (or a simple) set of features F_0 = ∅
– Run the learning algorithm for the current set of features F_t, obtaining h_t
– Select the next best feature X_i, e.g., the X_j giving the lowest held-out error when learning with F_t ∪ {X_j}
– F_{t+1} ← F_t ∪ {X_i}
– Repeat
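A hedged sketch of the forward loop; train_and_score stands in for whatever learner and held-out-error measure you use and is assumed here, not given in the slides:

def forward_selection(all_features, train_and_score, max_features):
    """Greedily grow F, each round adding the feature that most lowers held-out error."""
    selected = []                                  # F_0 = empty set
    best_err = float("inf")
    while len(selected) < max_features:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # score F_t plus each remaining feature X_j
        scored = [(train_and_score(selected + [f]), f) for f in candidates]
        err, best_f = min(scored, key=lambda pair: pair[0])
        if err >= best_err:                        # no candidate improves: stop
            break
        best_err = err
        selected = selected + [best_f]             # F_{t+1} = F_t plus {X_i}
    return selected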

Greedy backward feature selection algorithm
– Pick a dictionary of features, e.g., polynomials for linear regression
– Greedy: start with all features, F_0 = F
– Run the learning algorithm for the current set of features F_t, obtaining h_t
– Select the next worst feature X_i, e.g., the X_j giving the lowest held-out error when learning with F_t − {X_j}
– F_{t+1} ← F_t − {X_i}
– Repeat

Impact of feature selection on classification of fMRI data [Pereira et al. ’05]

Feature Selection through Regularization. Previously, we discussed regularization with a squared (L2) norm penalty on the weights. What if we used an L1 norm instead? What about L∞? These norms work, but are harder to optimize! And it can be tricky to set λ!!!
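As an illustration of L1-based selection (assuming scikit-learn is available; the synthetic data and the alpha value are arbitrary), the L1 penalty drives many weights to exactly zero, and the surviving features are the "selected" ones:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))               # 50 candidate features
w_true = np.zeros(50)
w_true[:5] = 2.0                             # only the first 5 actually matter
y = X @ w_true + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1).fit(X, y)           # alpha plays the role of lambda and still needs tuning
selected = np.flatnonzero(model.coef_)       # indices of features with nonzero weight
print(selected)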

Regularization [figure: comparison of the L1 and L2 penalties]

Lower Dimensional Projections: rather than picking a subset of the features, we can make new ones by combining the existing features x_1 … x_n. The new features are linear combinations of the old ones; this reduces dimension when k < n. Let's see this in the unsupervised setting – just X, but no Y.

Linear projection and reconstruction [figure]: project points (x_1, x_2) into 1 dimension, z_1. Reconstruction: knowing only z_1 and the projection weights, what was (x_1, x_2)?

Principal component analysis – basic idea: project n-dimensional data into a k-dimensional space while preserving information (e.g., project a space of words into 3 dimensions, or project 3-D into 2-D). Choose the projection with minimum reconstruction error.

Linear projections, a review. Project a point into a (lower dimensional) space:
– point: x = (x_1, …, x_n)
– select a basis – a set of unit (length 1) basis vectors (u_1, …, u_k); we consider an orthonormal basis: u_i · u_i = 1, and u_i · u_j = 0 for i ≠ j
– select a center – x̄, which defines the offset of the space
– the best coordinates in the lower dimensional space are defined by dot products: (z_1, …, z_k), with z_i = (x − x̄) · u_i
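A small numpy illustration of these formulas, z_i = (x − x̄) · u_i for projection and x̄ + Σ z_i u_i for reconstruction (the data points and the single basis vector are made up):

import numpy as np

X = np.array([[2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])
x_bar = X.mean(axis=0)                     # the center of the space
U = np.array([[1.0, 1.0]]) / np.sqrt(2)    # one unit-length basis vector (k = 1)

Z = (X - x_bar) @ U.T                      # coordinates z_i = (x - x_bar) . u_i
X_hat = x_bar + Z @ U                      # reconstruction from the k coordinates
print(np.sum((X - X_hat) ** 2))            # reconstruction error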

3D Data [figure]

Projection to 2D [figure]

PCA finds the projection that minimizes reconstruction error. Given m data points x^i = (x_1^i, …, x_n^i), i = 1…m, represent each point by its projection x̂^i = x̄ + Σ_{j=1..k} z_j^i u_j. PCA: given k < n, find (u_1, …, u_k) minimizing the reconstruction error
error_k = Σ_{i=1..m} ‖x^i − x̂^i‖²
[figure: points in the (x_1, x_2) plane and the direction u_1]

Understanding the reconstruction error. Note that x^i can be represented exactly by its full n-dimensional projection: x^i = x̄ + Σ_{j=1..n} z_j^i u_j. Rewriting the error (given k < n, find (u_1, …, u_k) minimizing it):
error_k = Σ_{i=1..m} Σ_{j=k+1..n} (z_j^i)²
Aha! The error is the sum of squared weights that would have been used for the dimensions that are cut!!!
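This identity is easy to check numerically: project onto a full orthonormal basis, rebuild from only the first k coordinates, and compare the error with the sum of squared coordinates that were cut (the toy data and the QR-derived basis below are just for illustration):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
x_bar = X.mean(axis=0)

U, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # columns form a full orthonormal basis
Z = (X - x_bar) @ U                            # all n coordinates z_j^i

k = 2
X_hat = x_bar + Z[:, :k] @ U[:, :k].T          # keep only the first k dimensions
err = np.sum((X - X_hat) ** 2)
print(np.isclose(err, np.sum(Z[:, k:] ** 2)))  # True: error = sum of squared cut weights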

Reconstruction error and covariance matrix. Now, to find the u_j, we minimize Σ_{j=k+1..n} u_j^T Σ u_j (with Σ the covariance matrix of the centered data), using a Lagrange multiplier to ensure each u_j is orthonormal (u_j^T u_j = 1). Take the derivative, set it equal to 0, …, and the solutions are eigenvectors: Σ u_j = λ_j u_j.

Minimizing reconstruction error and eigenvectors. Minimizing the reconstruction error is equivalent to picking an orthonormal basis (u_1, …, u_n) minimizing the quantity above. The solutions are eigenvectors, so minimizing reconstruction error is equivalent to picking (u_{k+1}, …, u_n) to be the eigenvectors with the smallest eigenvalues; our projection should therefore be onto the (u_1, …, u_k) with the largest eigenvalues.

Basic PCA algorithm
– Start from the m × n data matrix X
– Recenter: subtract the mean from each row of X: X_c ← X − X̄
– Compute the covariance matrix: Σ ← (1/m) X_c^T X_c
– Find the eigenvectors and eigenvalues of Σ
– Principal components: the k eigenvectors with the highest eigenvalues
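A direct numpy translation of these steps (a sketch rather than a production implementation; eigh is used because the covariance matrix is symmetric):

import numpy as np

def pca_eig(X, k):
    """Basic PCA: center, form the covariance matrix, keep the top-k eigenvectors."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                            # recenter: subtract the mean from each row
    Sigma = (Xc.T @ Xc) / len(X)              # covariance matrix, (1/m) Xc^T Xc
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: ascending eigenvalues, orthonormal vectors
    order = np.argsort(eigvals)[::-1][:k]     # the k eigenvectors with the highest eigenvalues
    U = eigvecs[:, order]                     # principal components, one per column
    Z = Xc @ U                                # coordinates of each point
    return U, Z, x_bar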

PCA example [figure: data, projection, reconstruction]

Handwriting [figure]

Two Principal Eigenvectors [figure]

Eigenfaces [Turk & Pentland '91]. Input images: N images, each 50 × 50 pixels, i.e., 2500 features. (A potentially misleading figure; it is best to think of the data as an N × 2500 matrix: |examples| × |features|.)

Reduce dimensionality 2500 → 15: the average face, the first principal component, and the other components. For all except the average, "gray" = 0, "white" > 0, "black" < 0.

Using PCA for compression: store each face as its coefficients of projection onto the first few principal components.
Using PCA for recognition: compute the projections of the target image and compare to the database (a "nearest neighbor classifier").
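A hedged sketch of both uses; faces would be the N × 2500 matrix from the earlier slides and labels the identity of each stored face (both are assumed inputs, not given here):

import numpy as np

def build_face_db(faces, k=15):
    """Compression: keep only k projection coefficients per stored face."""
    mean_face = faces.mean(axis=0)
    _, _, Vt = np.linalg.svd(faces - mean_face, full_matrices=False)
    U = Vt[:k].T                                   # top-k principal components, as columns
    Z = (faces - mean_face) @ U                    # k coefficients replace 2500 pixels per face
    return U, Z, mean_face

def recognize(target_image, U, Z, mean_face, labels):
    """Recognition: project the target image, then nearest neighbor in coefficient space."""
    z = (target_image.ravel() - mean_face) @ U
    nearest = int(np.argmin(np.sum((Z - z) ** 2, axis=1)))
    return labels[nearest]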

Eigenfaces reconstruction: each image corresponds to adding together the weighted principal components [figure].

Scaling up: the covariance matrix can be really big! Σ is n × n, and thousands of features are common, so finding eigenvectors is very slow… Use the singular value decomposition (SVD): it finds the top k eigenvectors, and great implementations are available, e.g., Matlab's svd.

SVD: write X = W S V^T
– X: data matrix, one row per datapoint
– W: weight matrix, one row per datapoint – the coordinates of x^i in eigenspace
– S: singular value matrix, diagonal; in our setting each entry is an eigenvalue λ_j
– V^T: singular vector matrix; in our setting each row is an eigenvector v_j

SVD notation change. Treat it as a black box: code is widely available. In Matlab: [U,W,V] = svd(A,0).

PCA using SVD algorithm
– Start from the m × n data matrix X
– Recenter: subtract the mean from each row of X: X_c ← X − X̄
– Call the SVD algorithm on X_c, asking for k singular vectors
– Principal components: the k singular vectors with the highest singular values (rows of V^T)
– Coefficients: project each point onto the new vectors
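The same recipe in numpy (a sketch; np.linalg.svd returns V^T directly, with singular values already sorted from largest to smallest):

import numpy as np

def pca_svd(X, k):
    """PCA via SVD: center, decompose Xc = W S V^T, keep the top-k right singular vectors."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                   # recenter
    W, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                              # k singular vectors with the highest singular values
    Z = Xc @ components.T                            # coefficients: project each point onto them
    return components, Z, x_bar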

Singular Value Decomposition (SVD): a handy mathematical technique that has applications to many problems. Given any m × n matrix A, an algorithm finds matrices U, W, and V such that A = U W V^T, where U is m × n and orthonormal, W is n × n and diagonal, and V is n × n and orthonormal.

SVD: the w_i are called the singular values of A. If A is singular, some of the w_i will be 0; in general rank(A) = the number of nonzero w_i. The SVD is mostly unique (up to permutation of singular values, or if some w_i are equal).
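A quick numerical check of these facts on a made-up matrix: the second row is a multiple of the first, so one singular value is (numerically) zero and counting the nonzero w_i recovers the rank:

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],      # a multiple of row 1, so A is singular
              [0.0, 1.0, 1.0]])
w = np.linalg.svd(A, compute_uv=False)    # singular values, largest first
rank = int(np.sum(w > 1e-10))             # rank(A) = number of nonzero w_i
print(w, rank)                            # one w_i is ~0; the rank is 2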

What you need to know
– Dimensionality reduction: why and when it's important
– Simple feature selection
– Regularization as a type of feature selection
– Principal component analysis: minimizing reconstruction error; relationship to the covariance matrix and eigenvectors; using SVD