
1 LECTURE 14: DIMENSIONALITY REDUCTION: PRINCIPAL COMPONENT REGRESSION March 21, 2016 SDS 293 Machine Learning B.A. Miller

2 Outline - Fun* application: Compressed Sensing - Principal components analysis - Principal components regression - If there's time, another fun* exercise

3 Chemistry lab You’ve been waiting for your lab partner to bring you the solution you’ve been working on together - It’s clear, odorless, tasteless, and worth 40% of your grade You get a call... - “I had to leave, I left it on the table in the corner. There were 3 beakers” You go to pick it up, and...

4 Hmmm...

5 Constrained optimization People have been putting beakers of water on the table! You can’t wait any longer You can’t get by with just one, or even two If you want to finish in time, you only have enough time to test 10 samples - And you can test for how concentrated it is What can we do here? - Hint: you learned about it before the break

6 Since you're confident there are 3 beakers with the solution you want, you can use subset selection methods with compressed observations Test 10 combinations, each with a random sampling of the liquid in the beakers [Figure: mixing of observations across the first and second tests]

7 Using the lasso to find the right beakers Take your measurements (y) and your mixing matrix (X), and plug them into the lasso We can identify the 3 beakers that contained the solution! And you can turn them in on time and pass the class [Figure: lasso coefficient estimates]
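
To make this concrete, here is a minimal R sketch of the recovery using the glmnet package; the number of beakers, the mixing proportions, and the noise level are made up for illustration, and recovery from so few measurements is likely but not guaranteed:

```r
library(glmnet)
set.seed(1)
p <- 40; n <- 10; k <- 3                       # 40 beakers on the table, 10 tests, 3 hold the solution
beta <- rep(0, p); beta[sample(p, k)] <- 1     # which beakers actually contain it (unknown to us)
X <- matrix(runif(n * p), n, p)                # mixing matrix: fraction of each beaker in each test
y <- drop(X %*% beta) + rnorm(n, sd = 0.01)    # measured concentration of each mixed sample
fit <- cv.glmnet(X, y, alpha = 1, nfolds = 5)  # alpha = 1 is the lasso
which(coef(fit, s = "lambda.min")[-1] != 0)    # nonzero coefficients point to the right beakers
```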

8 Compressive (or compressed) sensing Very hot topic in signal processing starting ~10 years ago With knowledge that the data have a sparse representation, you can take fewer "mixed" measurements than you would need for an arbitrary signal Applied in a variety of areas - MRI acquisition - Wideband radio receivers - Single-pixel camera (DMD: digital micromirror device)

9 What happened before the break? Ridge regression and the lasso Both are “shrinkage” methods Provide estimates for the regression coefficients that are biased toward the origin Why would we want a biased estimate?
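
As a quick refresher, a small sketch of that shrinkage on made-up data (glmnet assumed; the penalty values are chosen only for illustration): ridge pulls every coefficient toward zero, while the lasso sets some exactly to zero:

```r
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- drop(X %*% c(3, -2, 1, 0, 0)) + rnorm(100)
ridge <- glmnet(X, y, alpha = 0)               # ridge: quadratic penalty
lasso <- glmnet(X, y, alpha = 1)               # lasso: absolute-value penalty
cbind(ls    = coef(lm(y ~ X))[-1],             # unpenalized least squares
      ridge = as.numeric(coef(ridge, s = 5))[-1],    # shrunk toward zero, all nonzero
      lasso = as.numeric(coef(lasso, s = 0.5))[-1])  # some coefficients exactly zero
```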

10 What's wrong with bias? Prefers some estimates to others Does not yield the true value in expectation But what if your unbiased estimator gives you this? In practice, you may know that this isn't what the solution should be - May not know what it is, but know what you see is wrong May want to bias our estimate to reduce variance [Figure: unbiased estimates of the coefficients]

11 Flashback: subset selection Image credit: Ming Malaykham

12 Estimating heroes Suppose the superhero height formula took on the following values (each ε_i was drawn from a standard normal distribution)

13 Estimate for β When we try to estimate, we get the following: (Relatively) huge difference between actual and estimated coefficients [Figure: true vs. estimated coefficients]

14 Why did this happen? The data are redundant Little information in the 3rd dimension not captured in the first two When performing linear regression, this redundancy causes noise to be amplified
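
A small simulation of exactly this effect, on made-up data: when one feature is nearly a combination of the others, the least-squares coefficient estimates become extremely noisy:

```r
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.01)   # almost no information beyond x1 and x2
ests <- replicate(500, {
  y <- 2 * x1 - x2 + 0.5 * x3 + rnorm(n)   # true coefficients: 2, -1, 0.5
  coef(lm(y ~ x1 + x2 + x3))[-1]           # drop the intercept
})
apply(ests, 1, sd)   # standard deviations far larger than the coefficients themselves
```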

15 Another way to bias data Data live in a p-dimensional space - But maybe not all p dimensions are useful Previous idea: select a subset of features that do a good job of representing the whole dataset Today: create new features that are mixtures of the old ones Project data into a new feature space to reduce variance in the estimate

16 Projection Transform the data before performing regression Bad: lose information Good: fewer dimensions to deal with Possibly good: can improve stability and improve linear regression performance (lower variance) Instead of solving for β in y ≈ Xβ, we solve for θ in y ≈ Zθ, where Z is the projected (lower-dimensional) data

17 Projection

18 Linear projection The focus today is linear projection, which is accomplished by taking linear combinations of the original data: z_j = φ_1j x_1 + φ_2j x_2 + ... + φ_pj x_p for j between 1 and M (the new dimension) You can look at this as multiplying the data matrix by a projection matrix: Z = XΦ, where Φ is the p × M matrix whose columns hold the φs
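
A sketch of the matrix view on made-up data; the projection directions here are arbitrary orthonormal columns, not yet principal components:

```r
set.seed(1)
n <- 100; p <- 3; M <- 2
X <- matrix(rnorm(n * p), n, p)
Phi <- qr.Q(qr(matrix(rnorm(p * M), p, M)))   # p x M matrix with orthonormal columns
Z <- X %*% Phi                                # each new feature z_j = sum_i phi_ij * x_i
dim(Z)                                        # n x M: same observations, fewer dimensions
```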

19 How do we decide how to project? How do we determine the φs in z_j = φ_1j x_1 + ... + φ_pj x_p ?

20 How do we decide how to project? Why did I suggest this line?

21 Principal components analysis Answer: that line is the direction along which the variance of the data is greatest Data can be rotated and reoriented in any arbitrary way - Data will have components along the new coordinate axes Principal components are the directions where the variance is highest - Don't we want variance to be low? Higher variance in the data yields lower variance in the coefficient estimates

22 The first principal component We want the φ values to maximize the variance of z_1 = φ_11 x_1 + φ_21 x_2 + ... + φ_p1 x_p Any real-valued φs will do... - Why isn't this possible? (Scaling φ up makes the variance arbitrarily large) Have to restrict the magnitude of the φ vector - Set its L2 norm to 1
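
A sketch verifying the definition on made-up correlated data: the first loading vector returned by prcomp has unit norm and gives the largest variance any unit-norm direction can achieve (it is the top eigenvector of the covariance matrix):

```r
set.seed(1)
X <- scale(matrix(rnorm(200 * 3), 200, 3) %*% matrix(rnorm(9), 3, 3),
           scale = FALSE)                  # centered, correlated features
phi1 <- prcomp(X)$rotation[, 1]            # first principal component direction
sum(phi1^2)                                # L2 norm constraint: equals 1
var(X %*% phi1)                            # the variance being maximized
var(X %*% eigen(cov(X))$vectors[, 1])      # same value via the top covariance eigenvector
```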

23 More principal components For M > 1, we consider M orthogonal dimensions - Why is this necessary? Can think of this recursively To find the Mth principal component... - Find the first (M-1) principal components - Subtract the projection onto that space - Maximize the variance in the remaining complementary space
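
The recursion can be checked numerically on made-up data: subtract the projection onto the first component, run PCA again on what remains, and the leading direction you get is the original second component:

```r
set.seed(1)
X <- scale(matrix(rnorm(200 * 3), 200, 3) %*% matrix(rnorm(9), 3, 3),
           scale = FALSE)
pc <- prcomp(X)
v1 <- pc$rotation[, 1]
X_rest <- X - X %*% v1 %*% t(v1)       # remove the component along the first PC
v2 <- prcomp(X_rest)$rotation[, 1]     # maximize variance in the complementary space
abs(sum(v2 * pc$rotation[, 2]))        # ~1: same direction as the second PC (up to sign)
```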

24 Another interpretation of principal components Minimize the distance between the projected points and the original points - In aggregate, the sum of squared distances Why is this the same as maximizing the variance? By the Pythagorean theorem, each point's squared distance from the origin splits into (1) the squared length of its projection, which contributes to the variance in the projected space, and (2) the squared distance between the original and projected points - Since the total is fixed, maximizing one piece minimizes the other
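
A quick numeric check of that decomposition on made-up centered data: the total sum of squares equals the variance captured by the projection plus the squared reconstruction error, so maximizing one term is the same as minimizing the other:

```r
set.seed(1)
X <- scale(matrix(rnorm(200 * 3), 200, 3) %*% matrix(rnorm(9), 3, 3),
           scale = FALSE)                  # centered data
v <- prcomp(X)$rotation[, 1]
P <- X %*% v %*% t(v)                      # points projected onto the first PC
sum(X^2)                                   # total squared distance from the origin (fixed)
sum(P^2) + sum((X - P)^2)                  # captured variance + reconstruction error: same total
```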

25 Regression in the principal components We set out with an objective: solve for β in y ≈ Xβ (That's still our objective) But now we're going to work in the space of reduced dimension: y ≈ Zθ The new features are related to the old features by Z = XΦ So we're computing y ≈ XΦθ, which corresponds to β = Φθ on the original features
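
A hand-rolled sketch of the whole pipeline on made-up data (the redundant-feature setup and M = 3 are invented for illustration): compute the principal components, regress on the first M scores, then map the fitted θ back to coefficients on the (standardized) original features:

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
X[, 5] <- X[, 1] + X[, 2] + rnorm(100, sd = 0.05)   # one nearly redundant feature
y <- drop(X %*% c(1, 2, 0, 0, 1)) + rnorm(100)
M  <- 3
pc <- prcomp(X, scale. = TRUE)
Z  <- pc$x[, 1:M]                             # scores: the M new features
theta_hat <- coef(lm(y ~ Z))[-1]              # regression in the reduced space
beta_hat  <- pc$rotation[, 1:M] %*% theta_hat # beta = Phi * theta, on the standardized scale
beta_hat
```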

26 Say it with matrices We want to solve for θ in y ≈ XΦθ Since the columns of Φ are orthonormal, the values in each component of the fitted θ don't change, regardless of the value of M - Details omitted This means we can easily obtain the solution for M from 1 to p, and use some criterion to pick the best value of M - Cross-validation error, AIC, BIC, adjusted R²
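
In practice the pcr function from the pls package (assumed installed) fits all values of M in one call and reports cross-validation error for each; the data below are made up:

```r
library(pls)
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- drop(X %*% c(1, 2, 0, 0, 1)) + rnorm(100)
df <- data.frame(y = y, X)                             # columns y, X1, ..., X5
fit <- pcr(y ~ ., data = df, scale = TRUE, validation = "CV")
summary(fit)                                           # CV error for every M from 1 to 5
validationplot(fit, val.type = "MSEP")                 # pick the M with the lowest CV error
```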

27 Comparison with ridge regression and the lasso How is PCR similar to the lasso? - Reduce dimensionality of the solution space How is it different? - Finds a solution in the space of all features, rather than a subset - Results can be much more difficult to interpret In this way, PCR is more similar to ridge regression

28 Back to the beginning We wanted to estimate the coefficients for superhero height, but got a solution that was way off What happens if we use 2 components instead of 3? Using only the components with higher variability significantly improves our estimate! [Figures: true vs. estimated coefficients, least squares alone and with PCR]

29 Parting thought Good to "standardize" your data - Make every feature have unit variance This prevents large differences in variance among the features (e.g. from their units) from dominating the components Then PCR picks up actual redundancy in the data rather than differences in scale - And you don't lose information just because some features happen to have larger values than others The pcr function in R can do this automatically
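
A small illustration with invented units of why this matters: without standardizing, whichever feature happens to be measured in big numbers dominates the first component; the scale arguments of prcomp (and of pcr, as noted above) do the standardization for you:

```r
set.seed(1)
height_m <- rnorm(100, mean = 1.7, sd = 0.1)                              # heights in metres
weight_g <- 1000 * (55 + 40 * (height_m - 1.7)) + rnorm(100, sd = 5000)   # correlated weights, in grams
X <- cbind(height_m, weight_g)
round(prcomp(X)$rotation[, 1], 3)                # ~ (0, 1): weight dominates purely because of its units
round(prcomp(X, scale. = TRUE)$rotation[, 1], 3) # after standardizing, both features load comparably
```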

30 Check the clock If there’s time left, let’s go to the board for a fun* exercise

