Outline
- Jinmiao Fu: Introduction and History
- Ning Ma: Establishing and Fitting the Model
- Ruoyu Zhou: Multiple Regression Model in Matrix Notation
- Dawei Xu and Yuan Shang: Statistical Inference for Multiple Regression
- Yu Mu: Regression Diagnostics
- Chen Wang and Tianyu Lu: Topics in Regression Modeling
- Tian Feng: Variable Selection Methods
- Hua Mo: Chapter Summary and Modern Application
Introduction
Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. Every value of the independent variables x is associated with a value of the dependent variable y.
Example: the relationship between an adult's health and his/her daily intake of wheat, vegetables, and meat.
Karl Pearson (1857–1936)
Lawyer, Germanist, eugenicist, mathematician, and statistician. Contributions include:
- Correlation coefficient
- Method of moments
- Pearson's system of continuous curves
- Chi distance, p-value
- Statistical hypothesis testing theory, statistical decision theory
- Pearson's chi-square test, principal component analysis
Sir Francis Galton FRS (16 February 1822 – 17 January 1911)
Anthropologist and polymath; doctoral students: Karl Pearson. In the late 1860s, Galton conceived the standard deviation. He created the statistical concept of correlation and also discovered the properties of the bivariate normal distribution and its relationship to regression analysis.
Galton invented the use of the regression line (Bulmer 2003, p. 184), and was the first to describe and explain the common phenomenon of regression toward the mean, which he first observed in his experiments on the size of the seeds of successive generations of sweet peas.
The publication by his cousin Charles Darwin of The Origin of Species in 1859 was an event that changed Galton's life. He came to be gripped by the work, especially the first chapter on "Variation under Domestication" concerning the breeding of domestic animals.
Adrien-Marie Legendre (18 September 1752 – 10 January 1833) was a French mathematician. He made important contributions to statistics, number theory, abstract algebra, and mathematical analysis. He developed the least squares method, which has broad application in linear regression, signal processing, statistics, and curve fitting.
Johann Carl Friedrich Gauss (30 April 1777 – 23 February 1855) was a German mathematician and scientist who contributed significantly to many fields, including number theory, statistics, analysis, differential geometry, geodesy, geophysics, electrostatics, astronomy and optics.
Gauss, who was 23 at the time, heard about the problem and tackled it. After three months of intense work, he predicted a position for Ceres in December 1801, just about a year after its first sighting, and this turned out to be accurate within a half-degree. In the process, he so streamlined the cumbersome mathematics of 18th-century orbital prediction that his work, published a few years later as Theory of Celestial Movement, remains a cornerstone of astronomical computation.
It introduced the Gaussian gravitational constant, and contained an influential treatment of the method of least squares, a procedure used in all sciences to this day to minimize the impact of measurement error. Gauss was able to prove the method in 1809 under the assumption of normally distributed errors (see Gauss–Markov theorem; see also Gaussian). The method had been described earlier by Adrien-Marie Legendre in 1805, but Gauss claimed that he had been using it since 1795.
Sir Ronald Aylmer Fisher FRS (17 February 1890 – 29 July 1962) was an English statistician, evolutionary biologist, eugenicist and geneticist. He was described by Anders Hald as "a genius who almost single-handedly created the foundations for modern statistical science," and Richard Dawkins described him as "the greatest of Darwin's successors".
In addition to "analysis of variance", Fisher invented the technique of maximum likelihood and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminator and Fisher information.
Probabilistic Model
The observed value of the random variable (r.v.) Yi depends on the fixed predictor values xi1, …, xik:
Yi = β0 + β1xi1 + ⋯ + βkxik + εi,  i = 1, 2, …, n,
where β0, β1, …, βk are the unknown model parameters, n is the number of observations, and the random errors εi are i.i.d. ~ N(0, σ²).
Fitting the Model
The LS method provides estimates β̂0, β̂1, …, β̂k of the unknown model parameters, which minimize
Q = Σ [yi − (β0 + β1xi1 + ⋯ + βkxik)]².
They are found by setting ∂Q/∂βj = 0 (j = 0, 1, …, k).
Tire tread wear vs. mileage (Example 11.1 in the textbook)
The table gives the measurements on the groove of one tire after every 4000 miles. Our goal: to build a model relating the mileage and the groove depth of the tire.

Mileage (in 1000 miles)   Groove Depth (in mils)
 0                        394.33
 4                        329.50
 8                        291.00
12                        255.17
16                        229.33
20                        204.83
24                        179.00
28                        163.83
32                        150.33
SAS code: fitting the model

data example;
  input mile depth;
  sqmile = mile*mile;
datalines;
0 394.33
4 329.50
8 291.00
12 255.17
16 229.33
20 204.83
24 179.00
28 163.83
32 150.33
;
run;

proc reg data=example;
  model depth = mile sqmile;
run;
Goodness of Fit of the Model
Residuals: ei = yi − ŷi, where the ŷi are the fitted values.
An overall measure of the goodness of fit is the error sum of squares (SSE): SSE = Σ ei².
Total sum of squares (SST): SST = Σ (yi − ȳ)². SST is the SSE obtained when fitting the model Yi = β0 + εi, which ignores all the x's.
Regression sum of squares (SSR): SSR = SST − SSE = Σ (ŷi − ȳ)².
Coefficient of multiple determination: R² = SSR/SST = 1 − SSE/SST. R² = 0.5 means 50% of the variation in y is accounted for by x, in this case all the x's.
Statistical Inference for Multiple Regression
Determine which predictor variables have statistically significant effects. We test the hypotheses
H0j: βj = 0 vs. H1j: βj ≠ 0 (j = 1, …, k).
If we cannot reject H0j, then xj is not a significant predictor of y.
Statistical Inference on the βj's
Review: statistical inference for simple linear regression.
Statistical Inference on the βj's
What about multiple regression? The steps are similar.
Statistical Inference on the βj's
What is vjj? Why?
1. Mean: recall from simple linear regression that the least squares estimators of the regression parameters β0 and β1 are unbiased. Here the vector β̂ of least squares estimators is also unbiased: E(β̂) = β.
Statistical Inference on the βj's
2. Variance: under the constant variance assumption, Var(εi) = σ² for all i, and
Var(β̂) = σ²(X'X)⁻¹.
Statistical Inference on the βj's
Let vjj be the jth diagonal entry of the matrix V = (X'X)⁻¹, so that Var(β̂j) = σ²vjj. Estimating σ² by s² = MSE gives SE(β̂j) = s√vjj, and
t = (β̂j − βj)/SE(β̂j)
has a t-distribution with n − (k+1) d.f.
Prediction of Future Observations
Having fitted a multiple regression model, suppose we wish to predict the future value of Y for a specified vector of predictor variables x* = (x0*, x1*, …, xk*)' with x0* = 1. One way is to estimate E(Y*) by a confidence interval (CI); another is to give a prediction interval (PI) for Y* itself.
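In SAS, both intervals come directly from PROC REG: the CLM option prints confidence limits for the mean response E(Y*) and CLI prints prediction limits for an individual Y*. A minimal sketch, reusing the tire data set `example` fitted earlier; the new observation (mile = 36) is a hypothetical prediction point, and since its depth is missing it is excluded from the fit but still receives predicted values and limits.

data newobs;
  mile = 36;           * hypothetical x* at which to predict;
  sqmile = mile*mile;
  depth = .;           * missing response: not used in fitting;
run;

data combined;
  set example newobs;
run;

proc reg data=combined;
  model depth = mile sqmile / clm cli;  * CLM: CI for E(Y*); CLI: PI for Y*;
run;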
F-Test for β1 = ⋯ = βk = 0
Consider H0: β1 = β2 = ⋯ = βk = 0 vs. H1: at least one βj ≠ 0. Here H0 is the overall null hypothesis, which states that none of the x variables is related to y; the alternative states that at least one is related.
How to Build an F-Test
The test statistic F = MSR/MSE follows the F-distribution with k and n − (k+1) d.f. The α-level test rejects H0 if F > f_{k, n−(k+1), α}. Recall that MSE = SSE/[n − (k+1)] (the error mean square) has n − (k+1) degrees of freedom; similarly MSR = SSR/k.
The Relation Between F and R²
F can be written as a function of R². Using the formula R² = SSR/SST, F can be expressed as
F = [R²/k] / [(1 − R²)/(n − (k+1))].
We see that F is an increasing function of R² and tests its significance.
Analysis of Variance (ANOVA)
The relation between SST, SSR, and SSE: SST = SSR + SSE, where
SSR = Σ (ŷi − ȳ)², SSE = Σ (yi − ŷi)², SST = Σ (yi − ȳ)².
The corresponding degrees of freedom (d.f.) are k, n − (k+1), and n − 1, respectively.
ANOVA Table for Multiple Regression

Source      SS    d.f.       MS                     F
Regression  SSR   k          MSR = SSR/k            MSR/MSE
Error       SSE   n − (k+1)  MSE = SSE/[n − (k+1)]
Total       SST   n − 1

This table gives us a clear view of the analysis of variance for multiple regression.
Extra Sum of Squares Method for Testing Subsets of Parameters
Before, we considered the full model with k predictors. Now we consider the partial model
Yi = β0 + β1xi1 + ⋯ + βk−mxi,k−m + εi,
in which the remaining m coefficients are set to zero. We can test these m coefficients to check their significance:
H0: βk−m+1 = ⋯ = βk = 0 vs. H1: at least one of these βj ≠ 0.
Building the F-Test by the Extra Sum of Squares Method
Let SSRp and SSEp be the regression and error sums of squares for the partial model, and SSRk and SSEk those for the full model. Since SST is fixed regardless of the particular model,
SST = SSRp + SSEp = SSRk + SSEk, so SSRk − SSRp = SSEp − SSEk.
Then we have the test statistic
F = [(SSEp − SSEk)/m] / [SSEk/(n − (k+1))] = [(SSEp − SSEk)/m] / MSEk.
The α-level F-test rejects the null hypothesis if F > f_{m, n−(k+1), α}.
Remarks on the F-Test
The numerator d.f. is m, the number of coefficients set to zero, while the denominator d.f. is n − (k+1), the error d.f. for the full model. The MSE in the denominator is the normalizing factor, an estimate of σ² for the full model. If the ratio is large, we reject H0.
Links Between ANOVA and the Extra Sum of Squares Method
Let m = 1 and m = k, respectively.
For m = 1, the partial F-test of a single βj is equivalent to the t-test, with F = t².
For m = k, the partial model is Yi = β0 + εi, for which SSEp = SST. From the above we derive SSEp − SSEk = SST − SSE = SSR. Hence the F-ratio equals
F = (SSR/k)/MSE = MSR/MSE with k and n − (k+1) d.f.,
which is the overall ANOVA F-test.
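In SAS, the extra sum of squares F-test for a subset of coefficients is carried out by the TEST statement of PROC REG. A minimal sketch, assuming a data set `cement` with variables y and x1-x4 (the example introduced later in this chapter):

proc reg data=cement;
  model y = x1 x2 x3 x4;
  * partial F-test of H0: beta3 = beta4 = 0 (m = 2 coefficients set to zero);
  test x3 = 0, x4 = 0;
run;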
5 Regression Diagnostics
5.1 Checking the Model Assumptions
- Plots of the residuals against individual predictor variables: check for linearity.
- A plot of the residuals against the fitted values: check for constant variance.
- A normal plot of the residuals: check for normality.
- A run chart of the residuals: check whether the random errors are autocorrelated.
- Plots of the residuals against any omitted predictor variables: check whether any of the omitted predictor variables should be included in the model.
Example: plots of the residuals against individual predictor variables.
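A minimal SAS sketch of these diagnostic plots, assuming the tire model fitted earlier (data set `example`): the residuals and fitted values are saved with an OUTPUT statement and then plotted with PROC SGPLOT.

proc reg data=example;
  model depth = mile sqmile;
  output out=diag r=resid p=fitted;  * save residuals and fitted values;
run;

proc sgplot data=diag;  * residuals vs. fitted values: check constant variance;
  scatter x=fitted y=resid;
  refline 0 / axis=y;
run;

proc sgplot data=diag;  * residuals vs. a predictor: check linearity;
  scatter x=mile y=resid;
  refline 0 / axis=y;
run;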
5.3 Data Transformation
Transformations of the variables (both y and the x's) are often necessary to satisfy the assumptions of linearity, normality, and constant error variance. Many seemingly nonlinear models can be written in the multiple linear regression model form after making a suitable transformation. For example, the multiplicative model
Y = β0 x1^β1 x2^β2 ε
becomes, after a log transformation,
log Y = log β0 + β1 log x1 + β2 log x2 + log ε,
or the exponential model Y = e^(β0 + β1x) ε becomes ln Y = β0 + β1x + ln ε.
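A sketch of the log transformation in SAS, assuming a hypothetical data set `raw` with variables y, x1, and x2 that follow the multiplicative model above:

data logform;
  set raw;
  logy  = log(y);   * natural logs linearize the multiplicative model;
  logx1 = log(x1);
  logx2 = log(x2);
run;

proc reg data=logform;
  model logy = logx1 logx2;
run;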
Multicollinearity
Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Examples of multicollinear predictors are height and weight of a person, years of education and income, and assessed value and square footage of a home.
Consequences of high multicollinearity:
a. Increased standard errors of the estimates of the β's.
b. Often confusing and misleading results.
Detecting Multicollinearity
Easy way: compute the correlations between all pairs of predictors. If some r are close to 1 or −1, remove one of the two correlated predictors from the model. (The accompanying correlation table shows x1 collinear with x2, with r near 1, while x3 is independent of them.)
Detecting Multicollinearity
Another way: calculate the variance inflation factor (VIF) for each predictor xj:
VIFj = 1/(1 − Rj²),
where Rj² is the coefficient of determination of the model that regresses xj on all the other predictors. If VIFj ≥ 10, then there is a problem of multicollinearity.
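PROC REG computes the VIFs directly via the VIF option of the MODEL statement. A sketch for the cement example that follows (data set `cement` assumed):

proc reg data=cement;
  model y = x1 x2 x3 x4 / vif;  * prints VIFj for each predictor;
run;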
Multicollinearity - Example
See Example 11.5 on page 416. The response is the heat evolved by cement on a per-gram basis (y), and the predictors are tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).
Multicollinearity - Example
Estimated parameters in the first-order model:
ŷ = 62.41 + 1.551x1 + 0.510x2 + 0.102x3 − 0.144x4.
F = 111.5, with p-value below 0.0001. Individual t-statistics and p-values: 2.08 (0.071), 0.70 (0.501), 0.14 (0.896), and −0.20 (0.844). Note that the sign on β4 is the opposite of what is expected, and such a high F would suggest more than just one significant predictor.
Multicollinearity - Example
Correlations: r13 = −0.824 and r24 = −0.973. Also, the VIFs were all greater than 10. So there is a multicollinearity problem in this model, and we need a suitable algorithm to help us select the necessary variables.
Multicollinearity - Subset Selection
Algorithms for selecting subsets:
- All possible subsets: only feasible with a small number of potential predictors (maybe 10 or fewer); one can then use one or more of the possible numerical criteria to find the overall best subset.
- Leaps and bounds method: identifies the best subsets for each value of p; requires fewer variables than observations; can be quite effective for medium-sized data sets; an advantage is having several slightly different models to compare.
Multicollinearity - Subset Selection
- Forward stepwise regression: start with no predictors; first include the predictor with the highest correlation with the response; in subsequent steps, add the predictor with the highest partial correlation with the response, controlling for the variables already in the equation; stop when the numerical criterion signals a maximum (minimum); sometimes eliminate variables when their t-values get too small. It is the only possible method for very large predictor pools, but it optimizes locally at each step, with no guarantee of finding the overall optimum.
- Backward elimination: start with all predictors in the equation; remove the predictor with the smallest t-value; continue until the numerical criterion signals a maximum (minimum). Often produces a different final model than the forward stepwise method.
Multicollinearity - Best Subsets Criteria
Numerical criteria for choosing the best subsets:
- There is no single generally accepted criterion, and none should be followed too mindlessly. The most common criteria combine a measure of fit with a penalty for increasing complexity (number of predictors).
- Coefficient of determination: the ordinary multiple R-square always increases with an increasing number of predictors, so it is not very good for comparing models with different numbers of predictors.
- Adjusted R-square: will decrease if the increase in R-square with increasing p is small.
Multicollinearity - Best Subsets Criteria
- Residual mean square (MSEp): equivalent to the adjusted R-square, except one looks for the minimum. The minimum occurs when an added variable does not decrease the error sum of squares enough to offset the loss of an error degree of freedom.
- Mallows' Cp statistic: should be about equal to p; look for small values near p. Requires an estimate of the overall error variance.
- PRESS statistic: the model associated with the minimum value of PRESSp is chosen. Intuitively easier to grasp than the Cp criterion.
Multicollinearity - Forward Stepwise
First include the predictor with the highest correlation with the response, provided its partial F-statistic exceeds F_IN = 4.
Multicollinearity - Forward Stepwise
In subsequent steps, add the predictor with the highest partial correlation with the response, controlling for the variables already in the equation (enter xi if Fi > F_IN = 4, and remove xi if Fi < F_OUT = 4).
Multicollinearity - Forward Stepwise
Summarizing the stepwise algorithm: our "best model" should include only x1 and x2, with fitted equation
ŷ = 52.58 + 1.468x1 + 0.662x2.
Multicollinearity - Forward Stepwise
Check the significance of the model and of the individual parameters again. We find that the p-values are all small and each VIF is far less than 10.
Multicollinearity - Best Subsets
Alternatively, we can stop when the numerical criterion signals a maximum (minimum) and sometimes eliminate variables when their t-values get too small.
Multicollinearity - Best Subsets
The largest R-square value is associated with the full model. The best subset minimizing the Cp criterion includes x1 and x2. The subset maximizing the adjusted R-square, or equivalently minimizing MSEp, is {x1, x2, x4}; however, the adjusted R-square increases only slightly by the addition of x4 to the model already containing x1 and x2. Thus the simpler model chosen by the Cp criterion is preferred, with fitted model
ŷ = 52.58 + 1.468x1 + 0.662x2.
Polynomial Models
Polynomial models are useful in situations where the analyst knows that curvilinear effects are present in the true response function. We can model such effects using the polynomial regression model
Y = β0 + β1x + β2x² + ⋯ + βkx^k + ε,
and the same idea extends to more than one explanatory variable.
Multicollinearity - Polynomial Models
Multicollinearity is a problem in polynomial regression (with terms of second and higher order): x and x² tend to be highly correlated. A special solution in polynomial models is to use zi = xi − x̄ instead of xi; that is, first subtract from the predictor its mean, and then use the deviations in the model.
Multicollinearity - Polynomial Models
Example: x = 2, 3, 4, 5, 6 and x² = 4, 9, 16, 25, 36. As x increases, so does x²: r(x, x²) = 0.98. With x̄ = 4, z = −2, −1, 0, 1, 2 and z² = 4, 1, 0, 1, 4; thus z and z² are no longer correlated: r(z, z²) = 0. We fit Y = γ0 + γ1z + γ2z² + ε and can recover the estimates of the β's from the estimates of the γ's, since substituting z = x − x̄ and expanding expresses the model in terms of x.
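A minimal SAS sketch of this centering trick, assuming a hypothetical data set `poly` with variables y and x: the sample mean of x is stored in a macro variable, and the quadratic model is then fitted in the centered variable z.

proc sql noprint;
  select mean(x) into :xbar from poly;  * sample mean of x;
quit;

data centered;
  set poly;
  z  = x - &xbar;  * deviation from the mean;
  z2 = z*z;
run;

proc reg data=centered;
  model y = z z2;  * z and z2 are far less correlated than x and x2;
run;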
Dummy Predictor Variables
The dummy variable is a simple and useful method of introducing into a regression analysis information contained in variables that are not conventionally measured on a numerical scale, e.g., race, gender, region, etc.
Dummy Predictor Variables
The categories of an ordinal variable can be assigned suitable numerical scores. A nominal variable with c ≥ 2 categories can be coded using c − 1 indicator variables, X1, …, Xc−1, called dummy variables:
Xi = 1 for the ith category and 0 otherwise (i = 1, …, c−1);
X1 = ⋯ = Xc−1 = 0 for the cth category.
Dummy Predictor Variables
If y is a worker's salary and
Di = 1 if a non-smoker, Di = 0 if a smoker,
we can model this in the following way:
E(Y) = β0 + β1D,
so the mean salary is β0 for smokers and β0 + β1 for non-smokers.
Dummy Predictor Variables
Equally, we could use the dummy variable in a model with other explanatory variables. In addition to the dummy variable, we could also add years of experience (x), to give
E(Y) = β0 + β1D + β2x:
for a smoker, E(Y) = β0 + β2x; for a non-smoker, E(Y) = (β0 + β1) + β2x.
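A sketch in SAS, assuming a hypothetical data set `workers` with a salary, years of experience (exper), and a character smoking-status variable (smoker, coded 'Y'/'N'):

data coded;
  set workers;
  d = (smoker = 'N');  * dummy: 1 for non-smokers, 0 for smokers;
run;

proc reg data=coded;
  model salary = d exper;  * common slope in exper, shifted intercept;
run;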
Standardized Regression Coefficients
We typically want to compare predictors in terms of the magnitudes of their effects on the response variable. We use standardized regression coefficients to judge the effects of predictors with different units.
Standardized Regression Coefficients
They are the LS parameter estimates obtained by running a regression on the standardized variables, defined as follows:
yi* = (yi − ȳ)/sy, xij* = (xij − x̄j)/sxj,
where sy and sxj are the sample SDs of y and xj.
Standardized Regression Coefficients
Let β̂j* denote the standardized regression coefficients. Then
β̂j* = β̂j (sxj/sy), j = 1, …, k.
The magnitudes of the β̂j* can be directly compared to judge the relative effects of the xj on y.
Standardized Regression Coefficients
Since ȳ* = 0 and all x̄j* = 0, the constant can be dropped from the model. Let r be the vector of the correlations between y and each xj, and let R be the matrix of correlations among the xj's.
Standardized Regression Coefficients
So we can get
β̂* = R⁻¹r.
This method of computing the β̂j* is numerically more stable than computing β̂ directly, because all entries of R and r are between −1 and 1.
Standardized Regression Coefficients
Example (given on page 424): from the calculation we obtain the LS estimates β̂1 and β̂2 and the sample standard deviations of x1, x2, and y, and then the standardized coefficients β̂1* and β̂2*. Note that β̂1* > β̂2* although the unstandardized estimates suggest the opposite ordering; thus x1 has a larger effect than x2 on y.
Standardized Regression Coefficients
We can also use the matrix method to compute the standardized regression coefficients. First compute the correlation matrix R between x1 and x2 and the vector r of their correlations with y; then calculate β̂* = R⁻¹r. This gives the same result as before.
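In SAS, the standardized coefficients are printed by the STB option of PROC REG's MODEL statement, so the hand calculation above can be checked directly. A sketch for the cement example (data set `cement` assumed):

proc reg data=cement;
  model y = x1 x2 / stb;  * adds a Standardized Estimate column to the output;
run;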
How to Do the Test?
We reject H0: βk+1 = 0 in favor of H1: βk+1 ≠ 0 at level α if the partial F-statistic
F = (SSEk − SSEk+1) / [SSEk+1/(n − (k+2))]
exceeds f_{1, n−(k+2), α}.
Another way to interpret the test: the test statistic
t = β̂k+1/SE(β̂k+1), with t² = F.
We reject H0 at level α if |t| > t_{n−(k+2), α/2}.
Partial Correlation Coefficients
In terms of the partial correlation coefficient r_{yxk+1·x1…xk}, the test statistic is
F = [n − (k+2)] r²_{yxk+1·x1…xk} / (1 − r²_{yxk+1·x1…xk}).
Add xk+1 to the regression equation that includes x1, …, xk only if this F is large enough.
How to do it by SAS? (Ex 9, a continuation of Ex 5)
The table shows data on the heat evolved in calories during the hardening of cement on a per-gram basis (y), along with the percentages of four ingredients: tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).

No.  x1  x2  x3  x4    y
 1    7  26   6  60   78.5
 2    1  29  15  52   74.3
 3   11  56   8  20  104.3
 4   11  31   8  47   87.6
 5    7  52   6  33   95.9
 6   11  55   9  22  109.2
 7    3  71  17   6  102.7
 8    1  31  22  44   72.5
 9    2  54  18  22   93.1
10   21  47   4  26  115.9
11    1  40  23  34   83.8
12   11  66   9  12  113.3
13   10  68   8  12  109.4
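A SAS sketch for this exercise: enter the data and let PROC REG run the stepwise algorithm. Note that SAS controls entry and removal by significance levels (SLENTRY/SLSTAY) rather than by the F_IN = F_OUT = 4 thresholds used above; the value 0.05 here is an illustrative choice.

data cement;
  input x1 x2 x3 x4 y;
datalines;
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
;
run;

proc reg data=cement;
  model y = x1-x4 / selection=stepwise slentry=0.05 slstay=0.05;
run;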
Interpretation
At the first step, x4 is chosen into the equation, as it has the largest correlation with y among the four predictors. At the second step, we choose x1 into the equation, for it has the highest partial correlation with y controlling for x4. At the third step, since the partial F-statistic of x2 is greater than that of x3, x2 is chosen into the equation rather than x3.
Interpretation
At the fourth step, we removed x4 from the model, since its partial F-statistic is too small. From Ex 11.5, we know that x4 is highly correlated with x2. Note that the R-square in Step 4 is slightly higher than the R-square of Step 2. This indicates that even though x4 is the best single predictor of y, the pair (x1, x2) is a better predictor than the pair (x1, x4).
Drawbacks
The final model is not guaranteed to be optimal in any specified sense. The method yields a single final model, while in practice there are often several equally good models.
Best Subsets Regression
- Comparison to the stepwise method
- Optimality criteria
- How to do it by SAS?
Comparison to Stepwise Regression
In best subsets regression, a subset of variables is chosen that optimizes a well-defined objective criterion. The best subsets algorithm permits determination of a specified number of best subsets, from which the choice of the final model can be made by the investigator.
Interpretation
The best subset minimizing the Cp criterion is {x1, x2}, which is the same model selected using stepwise regression in the former example. The subset maximizing the adjusted R-square is {x1, x2, x4}; however, the adjusted R-square increases only slightly by the addition of x4 to the model which already contains x1 and x2. Thus, the model chosen by the Cp criterion is preferred.
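A sketch of best subsets selection in SAS with the same `cement` data: SELECTION=RSQUARE enumerates subsets of each size, and the ADJRSQ and CP options add the criteria discussed above to the output.

proc reg data=cement;
  model y = x1-x4 / selection=rsquare adjrsq cp best=3;  * top 3 subsets per size;
run;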
Chapter Summary and Modern Application
Multiple Regression Model: Summary
Model (extension of simple regression):
Yi = β0 + β1xi1 + ⋯ + βkxik + εi,
where β0, β1, …, βk are unknown parameters.
MLR model in matrix notation: y = Xβ + ε.
Least squares method: β̂ = (X'X)⁻¹X'y.
Goodness of fit of the model: R² = 1 − SSE/SST.
Statistical Inference for Multiple Regression: Summary
Statistical inference on the βj's. Hypotheses: H0j: βj = 0 vs. H1j: βj ≠ 0. Test statistic: t = β̂j/SE(β̂j).
Overall F-test. Hypotheses: H0: β1 = ⋯ = βk = 0 vs. H1: at least one βj ≠ 0. Test statistic: F = MSR/MSE.
Regression diagnostics: residual analysis and data transformation.
The General Hypothesis Test: Summary
Compare the full model Yi = β0 + β1xi1 + ⋯ + βkxik + εi with the partial model Yi = β0 + β1xi1 + ⋯ + βk−mxi,k−m + εi.
Hypotheses: H0: βk−m+1 = ⋯ = βk = 0 vs. H1: at least one of these βj ≠ 0.
Test statistic: F = [(SSEp − SSEk)/m]/MSEk. Reject H0 when F > f_{m, n−(k+1), α}.
Estimating and predicting future observations: let x* = (1, x1*, …, xk*)' and Ŷ* = x*'β̂.
CI for the estimated mean E(Y*): Ŷ* ± t_{n−(k+1), α/2} s√(x*'(X'X)⁻¹x*).
PI for the predicted Y*: Ŷ* ± t_{n−(k+1), α/2} s√(1 + x*'(X'X)⁻¹x*).
Topics in Regression Modeling and Variable Selection Methods: Summary
Topics in regression modeling:
- Multicollinearity
- Polynomial regression
- Dummy predictor variables
- Logistic regression model
Variable selection methods:
- Stepwise regression: the stepwise regression algorithm, the partial F-test, and the partial correlation coefficient
- Best subsets regression
- Strategy for building an MLR model
Application of the MLR Model
Linear regression is widely used in biology, chemistry, finance, and the social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.
Multiple linear regression finds application in the financial market, biology, housing prices, heredity, and chemistry.
Example
Broadly speaking, an asset pricing model can be expressed as
Ri = βi1f1 + βi2f2 + ⋯ + βikfk + ui,
where Ri, fk, and k denote the expected return on asset i, the kth risk factor, and the number of risk factors, respectively, and ui denotes the specific return on asset i.
The equation can also be expressed in matrix notation:
R = Bf + u,
where B is called the factor loading matrix.
What are the most important factors? Candidates include the inflation rate, GDP, the interest rate, the rate of return on the market portfolio, the employment rate, and government policies.
Method
Step 1: Find the efficient factors (EM algorithm, maximum likelihood).
Step 2: Fit the model and estimate the factor loadings (multiple linear regression).
By running the multiple linear regression on the data in SAS, we can obtain the factor loadings and the coefficient of multiple determination. From the SAS output we can identify the factors that most affect the return, build an appropriate multiple-factor model, and use it to predict future returns and make a good choice!
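A closing sketch, with all variable names hypothetical: assuming a data set `returns` containing an asset's return series (asset_ret) and candidate factor series, the factor loadings are simply the LS coefficients of a multiple regression.

proc reg data=returns;
  * factor loadings = regression coefficients;
  * R-square = proportion of return variation explained by the factors;
  model asset_ret = mkt_ret inflation gdp_growth int_rate emp_rate;
run;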