Outline
- Linear Regression: Least Squares Solution
- Subset Least Squares: subset selection, forward/backward selection
- Penalized Least Squares: Ridge Regression, LASSO, Elastic Net (LASSO + Ridge)
Linear Methods for Regression
- Input (feature) vector, p-dimensional: X = (X1, X2, …, Xp)
- Real-valued output: Y
- Joint distribution of (Y, X)
- Regression function: E(Y | X) = f(X)
- Training data (x1, y1), (x2, y2), …, (xN, yN) used to estimate the input-output relation f
Linear Model
- f(x): the regression function, or a good approximation to it
- LINEAR in the unknown parameters (weights, coefficients):
  $f(x) = \beta_0 + \sum_{j=1}^{p} x_j \beta_j$
Features
- Quantitative inputs: any arbitrary but known function of the measured attributes
- Transformations of quantitative attributes: g(x), e.g., log, square, square root, etc.
- Basis expansions: e.g., a polynomial approximation of f as a function of X1 (Taylor series expansion with unknown coefficients)
Features (cont.)
- Qualitative (categorical) input G
- Dummy codes: for an attribute with k categories (levels), use k indicator variables Xj, j = 1, 2, …, k, indicating which level is present
- Together, this collection of inputs represents the effect of G through $\sum_{j=1}^{k} X_j \beta_j$, a set of level-dependent constants, since only one of the Xj equals one and the others are zero
Features (cont.)
- Interactions: second- or higher-order interactions of some features, e.g., a product term $X_3 = X_1 \cdot X_2$
- Feature vector for the i-th case in the training set: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$
Generalized Linear Models: Basis Expansion
- Wide variety of flexible models
- The model for f is a linear expansion of basis functions: $f(x) = \sum_{m=1}^{M} \beta_m h_m(x)$
- Dictionary: a prescribed set of basis functions $h_m$
Other Basis Functions
- Polynomial basis of degree s (smooth functions, $C^s$)
- Fourier series (band-limited functions, a compact subspace of $C^\infty$)
- Splines: piecewise polynomials of degree K between the knots, joined with continuity of degree K − 1 at the knots (Sobolev spaces)
- Wavelets (Besov spaces)
- Radial basis functions: symmetric p-dimensional kernels located at particular centroids, $\phi(\|x - \mu\|)$, e.g., a Gaussian kernel at each centroid
- And more …
- Curse of dimensionality: p could be equal to or much larger than N
Method of Least Squares
- Find the coefficients b that minimize the residual sum of squares:
  $\mathrm{RSS}(b) = \sum_{i=1}^{N}\Big(y_i - b_0 - \sum_{j=1}^{p} x_{ij} b_j\Big)^2 = (Y - Xb)^T (Y - Xb)$
- RSS is the empirical risk over the training set; it does not assure predictive performance over all inputs of interest
Min-RSS Criterion
- Statistically reasonable provided the examples in the training set are a large number of independent random draws from the input population for which prediction is desired
- Given the inputs (x1, x2, …, xN), the outputs (y1, y2, …, yN) are conditionally independent
- In principle, predictive performance over the set of future input vectors should be examined
- Gaussian noise: the least squares method is equivalent to maximum likelihood
- Minimize RSS(b) over $b \in \mathbb{R}^{p+1}$, a quadratic function of b
- Optimal solution: take the derivatives with respect to the elements of b and set them equal to zero
Optimal Solution
- The Hessian (second derivative) of the criterion function is $2X^T X$
- The optimal solution satisfies the normal equations: $X^T(Y - Xb) = 0$, i.e., $(X^T X)\,b = X^T Y$
- For a unique solution, the matrix $X^T X$ must be full rank, giving $\hat\beta = (X^T X)^{-1} X^T Y$
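As an illustration, a minimal base-R sketch of solving the normal equations; the simulated data and all names here are illustrative, not from the course:

```r
# Least squares via the normal equations (X^T X) b = X^T y, on simulated data.
set.seed(1)
N <- 100; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))   # design matrix with intercept column
beta <- c(2, 1, -1, 0.5)
y <- drop(X %*% beta + rnorm(N))

b_hat <- solve(t(X) %*% X, t(X) %*% y)      # solve the normal equations
b_lm  <- coef(lm(y ~ X - 1))                # cross-check with R's QR-based solver
max(abs(b_hat - b_lm))                      # should be numerically ~0
```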
Projection
- When the matrix $X^T X$ is full rank, the estimated response for the training set is $\hat Y = X\hat\beta = X(X^T X)^{-1} X^T Y = HY$
- H: the projection (hat) matrix
- HY: the orthogonal projection of Y onto the space spanned by the columns of X
- Note: the projection is linear in Y
Simple Univariate Regression
- One variable, no intercept: $y = x\beta + \varepsilon$
- LS estimate: $\hat\beta = \langle x, y\rangle / \langle x, x\rangle$
- The normalized inner product $\langle x, y\rangle / (\|x\|\,\|y\|)$ is the cosine of the angle between the vectors x and y, a measure of similarity between y and x
- Residuals $r = y - x\hat\beta$: the projection onto the orthogonal (normal) space
- Definition, "regress b on a": the simple regression of response b on input a with no intercept, with estimate $\hat\gamma = \langle a, b\rangle / \langle a, a\rangle$ and residual $r = b - \hat\gamma a$
- The residual is "b adjusted for a", or "b orthogonalized with respect to a"
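A small helper for this "regress b on a" operation (an illustrative sketch; the function name regress_on is ours):

```r
# Simple regression of b on a with no intercept: estimate <a,b>/<a,a>.
regress_on <- function(a, b) {
  gamma <- sum(a * b) / sum(a * a)            # LS estimate
  list(coef = gamma, resid = b - gamma * a)   # residual = "b adjusted for a"
}
```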
Multiple Regression
- For p > 1, the LS estimates differ from the simple univariate regression estimates, unless the columns of the input matrix X are orthogonal
- If $\langle x_j, x_k\rangle = 0$ for all $j \ne k$, then $\hat\beta_j = \langle x_j, y\rangle / \langle x_j, x_j\rangle$
- These estimates are uncorrelated
- Orthogonal inputs occur sometimes in balanced, designed experiments (experimental design); observational studies will almost never have orthogonal inputs
- We must "orthogonalize" the inputs in order to give the coefficients a similar interpretation
- Use the Gram-Schmidt procedure to obtain an orthogonal basis for multiple regression
Multiple Regression Estimates: Sequence of Simple Regressions
Regression by successive orthogonalization:
1. Initialize $z_0 = x_0 = 1$
2. For j = 1, 2, …, p: regress $x_j$ on $z_0, z_1, \ldots, z_{j-1}$ to produce coefficients $\hat\gamma_{\ell j} = \langle z_\ell, x_j\rangle / \langle z_\ell, z_\ell\rangle$ and the residual vector $z_j = x_j - \sum_{\ell=0}^{j-1} \hat\gamma_{\ell j}\, z_\ell$
3. Regress y on the residual $z_p$ for the estimate $\hat\beta_p = \langle z_p, y\rangle / \langle z_p, z_p\rangle$
(In the two-variable case: instead of using x1 and x2 as features, take x1 and the residual z.)
Multiple Regression = Gram-Schmidt Orthogonalization
- The vector $z_p$ is the residual of the multiple regression of $x_p$ on all the other inputs
- The successive z's in the above algorithm are orthogonal and form an orthogonal basis for the column space of X
- The least squares projection onto this subspace is the usual fitted vector $\hat Y$
- By rearranging the order of the variables, any input can be labeled as the p-th variable
- If $x_p$ is highly correlated with the other variables, the residual $z_p$ is quite small, and the coefficient $\hat\beta_p$ has high variance
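A minimal sketch of the successive-orthogonalization algorithm, reusing regress_on() from the earlier sketch; under the stated algorithm it recovers the multiple-regression coefficient of the last variable:

```r
# Regression by successive orthogonalization (Gram-Schmidt).
# X is assumed to include an intercept column as its first column.
successive_ls <- function(X, y) {
  p <- ncol(X)
  Z <- X
  for (j in 2:p) {
    for (l in 1:(j - 1)) {
      Z[, j] <- regress_on(Z[, l], Z[, j])$resid  # orthogonalize x_j against earlier z's
    }
  }
  regress_on(Z[, p], y)$coef   # simple regression of y on z_p
}
# Matches the last coefficient of lm(y ~ X - 1) on the simulated data above.
successive_ls(X, y)
```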
Statistical Properties of the LS Model
- Uncorrelated noise with mean zero and variance $\sigma^2$: then $\mathrm{Var}(\hat\beta) = (X^T X)^{-1}\sigma^2$
- Noise estimation: $\hat\sigma^2 = \frac{1}{N - p - 1}\sum_{i=1}^{N}(y_i - \hat y_i)^2$
- Model d.f. = p + 1 (the dimension of the model space)
- To draw inferences on the parameter estimates, we need assumptions on the noise. If we assume $\varepsilon \sim N(0, \sigma^2)$, then $\hat\beta \sim N\big(\beta,\, (X^T X)^{-1}\sigma^2\big)$ and $(N - p - 1)\,\hat\sigma^2 \sim \sigma^2\,\chi^2_{N-p-1}$
Gauss-Markov Theorem
- If $c^T y$ is any linear estimator that is unbiased for $a^T\beta$, i.e., $E(c^T y) = a^T\beta$, then $\mathrm{Var}(a^T\hat\beta) \le \mathrm{Var}(c^T y)$
- It says that, for inputs in the row space of X, the LS estimate has minimum variance among all linear unbiased estimates
Bias-Variance Tradeoff
- Mean squared error of an estimator = variance + squared bias
- The least squares estimator achieves the minimal variance among all linear unbiased estimators
- Biased estimators can further reduce variance: Stein's estimator, shrinkage/thresholding (LASSO, etc.)
- The more complicated a model is, the more variance but the less bias; a trade-off is needed
Hypothesis Tests
- Single-parameter test of $\beta_j = 0$: t-statistic $z_j = \hat\beta_j / (\hat\sigma\sqrt{v_j})$, where $v_j$ is the j-th diagonal element of $V = (X^T X)^{-1}$
- Confidence interval: $\hat\beta_j \pm z\,\hat\sigma\sqrt{v_j}$, e.g., z = 1.96 for 95% coverage
- Group of parameters: F-statistic for nested models, $F = \dfrac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}$
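A quick check of these formulas in base R, reusing the simulated X and y from the earlier sketch:

```r
# t-statistics for beta_j = 0, computed from the formulas above.
fit <- lm(y ~ X - 1)
V <- solve(t(X) %*% X)
sigma2_hat <- sum(residuals(fit)^2) / (nrow(X) - ncol(X))  # RSS / (N - p - 1)
t_stats <- coef(fit) / sqrt(diag(V) * sigma2_hat)
t_stats            # matches the t values reported by summary(fit)
```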
Rank Deficiency
- X rank deficient: the normal equations have infinitely many solutions
- The hat matrix H, and hence the projection HY, are still unique
- For an input in the row space of X, the LS estimate is unique
- For an input not in the row space of X, the estimate may change with the particular solution used
- How to generalize to inputs outside the training set? Penalized methods (!)
Reasons for Alternatives to LS Estimates
- Prediction accuracy: LS estimates have low bias but high variance when the inputs are highly correlated, giving a larger expected squared prediction error (ESPE)
- Prediction accuracy can sometimes be improved by shrinking coefficients or setting some of them to zero: a small bias in the estimates may yield a large decrease in variance, and this bias/variance trade-off may provide better predictive ability
- Better interpretation: with a large number of input variables, we would like to determine a smaller subset that exhibits the strongest effects
- Many tools achieve these objectives: subset selection; penalized regression (constrained optimization)
Best Subset Selection
- Algorithm: leaps and bounds
- Finds the best subset (smallest RSS) for each subset size
- For each fixed size k, it can also find a specified number of subsets close to the best
- For each chosen subset, obtain the LS estimates
- Feasible for p up to about 40
- The choice of the optimal k is based on model selection criteria, to be discussed later
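A hedged example using the leaps package (assumed available), which implements the leaps-and-bounds search; it reuses the simulated data from the earlier sketch:

```r
# Best subsets of each size via leaps-and-bounds.
library(leaps)
Xm <- X[, -1]                               # predictors without the intercept column
fit_bs <- regsubsets(x = Xm, y = y, nvmax = ncol(Xm), nbest = 1)
summary(fit_bs)$which                       # which variables enter the best subset of each size
summary(fit_bs)$rss                         # and its RSS
```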
Other Subset Selection Procedures
- For larger p, the classical methods are: forward selection (step-up), backward elimination (step-down), and hybrid forward-backward (stepwise) methods
- Given a model, these methods only provide local controls for variable selection or deletion:
  - which current variable is least effective (candidate for deletion)
  - which variable not in the model is most effective (candidate for inclusion)
- They do not attempt to find the best subset of a given size
- Not too popular in current practice
Forward Stagewise Selection (Incremental)
Forward stagewise:
1. Standardize the input variables; start with residual r = y and $\beta_1 = \cdots = \beta_p = 0$
2. Find the input $x_j$ most correlated with the current residual r
3. Update $\beta_j \leftarrow \beta_j + \delta_j$, where $\delta_j = \epsilon \cdot \mathrm{sign}(\langle x_j, r\rangle)$ for a small step size $\epsilon$, and set $r \leftarrow r - \delta_j x_j$
4. Repeat steps 2-3 until no input remains correlated with the residual
Note: each step adjusts only one coefficient by a small amount, so the procedure can take many more than p steps to reach the LS fit.
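A minimal sketch of the incremental procedure (the step size eps and step count are illustrative choices):

```r
# Incremental forward stagewise on standardized inputs.
stagewise <- function(X, y, eps = 0.01, steps = 5000) {
  X <- scale(X)                          # standardized columns: inner product ~ correlation
  r <- y - mean(y)
  b <- rep(0, ncol(X))
  for (s in 1:steps) {
    cors <- drop(t(X) %*% r)             # correlations with the current residual
    j <- which.max(abs(cors))            # most correlated input
    delta <- eps * sign(cors[j])
    b[j] <- b[j] + delta                 # small move in that coordinate
    r <- r - delta * X[, j]
  }
  b
}
stagewise(X[, -1], y)
```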
Penalized Regression
Instead of directly minimizing the residual sum of squares, penalized regression usually takes the form
$$\min_f \; \mathrm{RSS}(f) + \lambda J(f),$$
where J(f) is the penalization term, which usually penalizes the smoothness or complexity of the function f, and λ is chosen by cross-validation.
Model Assessment and Selection
- In a data-rich situation, split the data into three parts: training, validation, and test sets
- [Figure: data split into Train | Validation | Test blocks]
- See Chapter 7.1 for details
Cross-Validation
- When the sample size is not sufficiently large, cross-validation is a way to estimate the out-of-sample estimation error (or classification rate)
- Randomly split the available data into a training set and a test set; fit on the training set and compute error₁ on the test set
- Split many times to get error₂, …, error_m, then average over all the errors to get the estimate
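A minimal k-fold sketch in base R (the fold count k = 5 is an illustrative default):

```r
# k-fold cross-validation of a least squares fit.
cv_error <- function(X, y, k = 5) {
  n <- nrow(X)
  folds <- sample(rep(1:k, length.out = n))   # random fold assignment
  errs <- numeric(k)
  for (i in 1:k) {
    test <- folds == i
    fit <- lm.fit(X[!test, , drop = FALSE], y[!test])
    pred <- X[test, , drop = FALSE] %*% fit$coefficients
    errs[i] <- mean((y[test] - pred)^2)       # out-of-sample squared error
  }
  mean(errs)                                  # averaged over the k folds
}
cv_error(X, y)
```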
Ridge Regression (Tikhonov Regularization)
- Ridge regression shrinks the coefficients by imposing a penalty on their size
- Minimize a penalized RSS:
  $\hat\beta^{\mathrm{ridge}} = \arg\min_\beta \Big\{ \sum_{i=1}^{N}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 \Big\}$
- Here $\lambda \ge 0$ is a complexity parameter that controls the amount of shrinkage: the larger its value, the greater the amount of shrinkage; the coefficients are shrunk towards zero
- The choice of the penalty parameter is based on cross-validation
- [Figure: prostate cancer example]
Ridge Regression (cont.)
- Equivalent problem: minimize RSS subject to $\sum_{j=1}^{p}\beta_j^2 \le s$ (Lagrange multiplier formulation; there is a one-to-one correspondence between s and λ)
- With many correlated variables, LS estimates can become unstable and exhibit high variance and high correlations: a wildly large positive coefficient on one variable can be cancelled by a large negative coefficient on another. Imposing a size constraint on the coefficients prevents this phenomenon.
- Ridge solutions are not invariant under scaling of the inputs, so normally one standardizes the inputs before solving the optimization problem
- Since the penalty term does not include the intercept, estimate the intercept by the mean of the response y
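A minimal closed-form ridge sketch on standardized inputs (the function name ridge_fit is ours; pass the predictors without the intercept column):

```r
# Ridge solution in closed form: b = (X^T X + lambda I)^{-1} X^T y,
# with the intercept handled by centering the response.
ridge_fit <- function(X, y, lambda) {
  Xs <- scale(X)                       # standardize inputs (ridge is not scale-invariant)
  yc <- y - mean(y)
  p <- ncol(Xs)
  b <- solve(t(Xs) %*% Xs + lambda * diag(p), t(Xs) %*% yc)
  list(intercept = mean(y), coef = drop(b))
}
ridge_fit(X[, -1], y, lambda = 1)
```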
Ridge Regression (cont.)
- The ridge criterion in matrix form: $(Y - X\beta)^T(Y - X\beta) + \lambda\,\beta^T\beta$, with solution $\hat\beta^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T Y$
- Shrinkage: for orthonormal inputs, ridge is a scaled version of the LS estimate, $\hat\beta^{\mathrm{ridge}} = \hat\beta / (1 + \lambda)$
- Ridge is the mean (or mode) of the posterior distribution of β under a normal prior
- For the centered input matrix X, take the SVD $X = UDV^T$: U and V are orthogonal matrices; the columns of U span the column space of X, the columns of V span the row space of X, and D is a diagonal matrix of singular values
- Eigendecomposition of $X^T X = V D^2 V^T$: the eigenvectors $v_j$ are the principal component directions of X (Karhunen-Loève directions)
Ridge Regression and Principal Components
- First PC direction $v_1$: among all normalized linear combinations of the columns of X, $z_1 = Xv_1$ has the largest sample variance
- The derived variable $z_1 = Xv_1$ is the first principal component of X
- Subsequent PCs have maximal variance subject to being orthogonal to the earlier ones; the last PC has minimal variance
- Ridge shrinks the coordinates along each PC direction by the factor $d_j^2/(d_j^2 + \lambda)$, so the low-variance directions are shrunk the most
- Effective degrees of freedom: $\mathrm{df}(\lambda) = \sum_{j=1}^{p} \dfrac{d_j^2}{d_j^2 + \lambda}$
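The effective degrees of freedom follow directly from the singular values; a small sketch, reusing the simulated X:

```r
# Effective degrees of freedom of ridge via the SVD of the standardized inputs.
df_ridge <- function(X, lambda) {
  d <- svd(scale(X))$d                 # singular values d_j
  sum(d^2 / (d^2 + lambda))
}
df_ridge(X[, -1], 0)                   # lambda = 0 recovers p (here 3)
df_ridge(X[, -1], 10)                  # larger lambda => fewer effective d.f.
```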
Ridge Regression (Summary)
- Ridge regression penalizes the complexity of a linear model by the sum of squares of the coefficients
- It is equivalent to minimizing RSS subject to the constraint $\sum_j \beta_j^2 \le s$
- The matrix $X^T X + \lambda I$ is always invertible for λ > 0
- The penalization parameter λ controls how simple "you" want the model to be
Ridge Regression (Summary, cont.)
- Solutions are not sparse in the coefficient space: the $\hat\beta_j$'s are nonzero almost all the time
- The computational complexity is $O(p^3)$ for inverting the matrix $X^T X + \lambda I$
- [Figure: prostate cancer example]
Least Absolute Shrinkage and Selection Operator (LASSO)
- Penalized RSS with an L1-norm penalty:
  $\hat\beta^{\mathrm{lasso}} = \arg\min_\beta \Big\{ \tfrac{1}{2}\sum_{i=1}^{N}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p}|\beta_j| \Big\}$,
  or equivalently subject to the constraint $\sum_{j=1}^{p}|\beta_j| \le s$
- It shrinks like ridge with its L2-norm penalty, but LASSO coefficients hit exactly zero as the penalty increases
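A hedged sketch of coordinate descent with soft-thresholding, one standard algorithm for this criterion (not the one named on these slides), assuming columns scaled to zero mean and unit norm:

```r
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)   # soft-threshold operator

# LASSO by cyclic coordinate descent on unit-norm, centered columns.
lasso_cd <- function(X, y, lambda, iters = 200) {
  X <- apply(X, 2, function(v) { v <- v - mean(v); v / sqrt(sum(v^2)) })
  yc <- y - mean(y)
  b <- rep(0, ncol(X)); r <- yc
  for (it in 1:iters) {
    for (j in 1:ncol(X)) {
      bj_ls <- b[j] + sum(X[, j] * r)    # univariate LS estimate on the partial residual
      bj_new <- soft(bj_ls, lambda)      # shrink and possibly zero out
      r <- r - (bj_new - b[j]) * X[, j]  # update the residual
      b[j] <- bj_new
    }
  }
  b
}
lasso_cd(X[, -1], y, lambda = 2)
```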
LASSO as Penalized Regression
Instead of directly minimizing the residual sum of squares, the penalized regression takes the form
$$\min_\beta \; \mathrm{RSS}(\beta) + \lambda J(\beta), \quad \text{where } J(\beta) = \|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|.$$
LASSO (cont.)
- The computation is a quadratic programming problem; the solution path can be obtained and is piecewise linear
- The coefficients are nonlinear in the response y (they are linear in y in ridge regression)
- The regularization parameter λ is chosen by cross-validation
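In practice one would typically use the glmnet package (assumed installed); a short hedged example, reusing the simulated data:

```r
# LASSO with cross-validated lambda via glmnet (alpha = 1 selects the LASSO).
library(glmnet)
Xm <- X[, -1]                              # predictor matrix without intercept column
cvfit <- cv.glmnet(Xm, y, alpha = 1)
coef(cvfit, s = "lambda.min")              # sparse coefficients at the selected lambda
```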
LASSO and Ridge
[Figure: contours of the RSS in the space of the β's, together with the constraint regions of LASSO and ridge]
Generalization to the Lq Norm as Penalty
- Minimize RSS subject to a constraint on the $\ell_q$ norm; equivalently, minimize $\sum_{i}\big(y_i - \beta_0 - \sum_j x_{ij}\beta_j\big)^2 + \lambda\sum_{j}|\beta_j|^q$
- Bridge regression, with ridge (q = 2) and LASSO (q = 1) as special cases; q = 1 is the smallest value giving a convex constraint region
- For q = 0, best subset regression
- For 0 < q < 1, the problem is not convex!
Why Nonconvex Penalties?
- LASSO is biased: a nonconvex penalty is necessary for an unbiased estimator
Elastic Net
- A compromise between ridge and LASSO (Zou and Hastie, 2005), with penalty $\lambda \sum_{j=1}^{p}\big(\alpha\beta_j^2 + (1-\alpha)|\beta_j|\big)$: it selects variables like the LASSO and shrinks the coefficients of correlated variables together like ridge
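A hedged glmnet example; note that glmnet uses its own convention, penalty $= \lambda\big[(1-\alpha)/2\,\|\beta\|_2^2 + \alpha\|\beta\|_1\big]$, so an intermediate alpha gives an elastic net:

```r
# Elastic net with cross-validated lambda; alpha = 0.5 mixes ridge and LASSO.
enet <- cv.glmnet(Xm, y, alpha = 0.5)      # Xm, y from the earlier sketch
coef(enet, s = "lambda.min")
```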
The Group LASSO
- Group norm $\ell_1$-$\ell_2$ (also $\ell_1$-$\ell_\infty$): the penalty sums the $\ell_2$ (or $\ell_\infty$) norms of the within-group coefficient blocks, $\lambda \sum_{g=1}^{G} \|\beta_g\|_2$
- Every group of variables is simultaneously selected or dropped
Methods Using Derived Directions
- Principal components regression
- Partial least squares
Principal Components Regression
- Motivation: the leading eigenvectors describe most of the variability in X
- [Figure: data cloud in the (X1, X2) plane with principal component directions Z1 and Z2]
Principal Components Regression (cont.)
- The derived inputs Zi and Zj are now orthogonal
- The dimension is reduced
- High correlation between the independent variables is eliminated
- Noise in the X's is removed (hopefully)
- Computation: PCA + regression
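A minimal PCA-plus-regression sketch (the function name pcr_fit and the choice M = 2 are illustrative):

```r
# Principal components regression: PCA on the inputs, then
# least squares on the leading M derived components.
pcr_fit <- function(X, y, M) {
  pc <- prcomp(X, center = TRUE, scale. = TRUE)
  Z <- pc$x[, 1:M, drop = FALSE]           # leading M components (orthogonal regressors)
  lm(y ~ Z)
}
summary(pcr_fit(X[, -1], y, M = 2))
```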
Partial Least Squares
- Partial least squares (PLS) uses the inputs as well as the response y to form the directions Zm
- It seeks directions that have high variance and high correlation with the response y
- Popular in chemometrics
- If the original inputs are orthogonal, PLS finds the LS solution after one step; subsequent steps have no effect
- Since the derived inputs use y, the estimates are nonlinear functions of the response when the inputs are not orthogonal
- The coefficients for the original variables tend to shrink as fewer PLS directions are used
- The choice of the number of directions M can be made via cross-validation
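A hedged example using the pls package (assumed installed), with cross-validation over the number of components:

```r
# Partial least squares regression; validation = "CV" cross-validates over M.
library(pls)
d <- data.frame(y = y, X[, -1])            # reuse the simulated data
fit_pls <- plsr(y ~ ., data = d, validation = "CV")
RMSEP(fit_pls)                             # CV error as a function of # components
```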
Relationship Among LARS, LASSO, and Forward Stagewise
- LASSO and forward stagewise can be thought of as restricted versions of LARS (least angle regression)
- For LASSO: start with LARS; if a coefficient crosses zero, stop, drop that predictor, recompute the best direction, and continue. This gives the LASSO path.
- For stagewise: start with LARS; select the most correlated direction at each stage, and move in that direction with a small step.
- There are other related methods: orthogonal matching pursuit; linearized Bregman iteration
Homework Project II
- Click prediction (classification), two subproblems: click/impression and click/bidding
- Data directory: /data/ipinyou/
- Files:
  - bid txt: bidding log file, 1.2M rows, 470 MB
  - imp txt: impression log, 0.8M rows, 360 MB
  - clk txt: click log file, 796 rows, 330 KB
  - data.zip: the compressed files above (password: ipinyou2013)
  - dsp_bidding_data_format.pdf: format file
  - Region&citys.txt: region and city codes
- Questions:
Homework Project II: Data Input in R

```r
bid <- read.table("/Users/Liaohairen/DSP/bid txt", sep = '\t', comment.char = '')
imp <- read.table("/Users/Liaohairen/DSP/imp txt", sep = '\t', comment.char = '')
```

By default, read.table uses '#' as the comment character (comment.char = '#'), but the user-agent field may contain a '#' character. To read the files correctly, interpretation of comments must be turned off by setting comment.char = ''.