Multiple Linear Regression uses 2 or more predictors. General form:
$y = b_1 x_1 + b_2 x_2 + \dots + b_k x_k + \text{constant}$
Let us take the simplest multiple regression case, two predictors:
$y = b_1 x_1 + b_2 x_2 + \text{constant}$
Here, the b's (in standardized units) are not simply $\mathrm{cor}_{y,x_1}$ and $\mathrm{cor}_{y,x_2}$, unless $x_1$ and $x_2$ have zero correlation with one another. Any correlation between $x_1$ and $x_2$ makes determining the b's less simple. The b's are related to the partial correlation, in which the value of the other predictor(s) is held constant. Holding the other predictors constant removes the part of the correlation that is due to them rather than to the predictor at hand. Notation: the partial correlation of y with $x_1$, with $x_2$ held constant, is written $r_{y,x_1 . x_2}$.
For 2 (or any n) predictors, there are 2 (or any n) equations in 2 (or any n) unknowns to be solved simultaneously. When n > 3 or so, determinant operations are necessary. For the case of 2 predictors, and using z values (variables standardized by subtracting their mean and then dividing by the standard deviation) for simplicity, the solution can be done by hand. The two equations to be solved simultaneously are:
$b_{1.2} + b_{2.1}\,\mathrm{cor}_{x_1,x_2} = \mathrm{cor}_{y,x_1}$
$b_{1.2}\,\mathrm{cor}_{x_1,x_2} + b_{2.1} = \mathrm{cor}_{y,x_2}$
The goal is to find the two b coefficients, $b_{1.2}$ and $b_{2.1}$.
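The two simultaneous equations above can be solved with a 2x2 determinant (Cramer's rule). The following is a minimal sketch, not part of the original slides; the function name `solve_two_predictor_betas` and the input correlations at the bottom are illustrative assumptions.

```python
def solve_two_predictor_betas(r_y1, r_y2, r_12):
    """Solve  b1 + b2*r_12 = r_y1  and  b1*r_12 + b2 = r_y2  by Cramer's rule.

    r_y1, r_y2: simple correlations of y with x1 and with x2.
    r_12:       correlation between the two predictors.
    Returns the standardized coefficients (b1.2, b2.1).
    """
    det = 1.0 - r_12 ** 2  # determinant of [[1, r_12], [r_12, 1]]
    if det == 0.0:
        raise ValueError("predictors are perfectly correlated; no unique solution")
    b1 = (r_y1 - r_12 * r_y2) / det
    b2 = (r_y2 - r_12 * r_y1) / det
    return b1, b2


# Illustrative inputs (assumed for this sketch).
b1, b2 = solve_two_predictor_betas(r_y1=0.20, r_y2=0.40, r_12=0.30)
print(round(b1, 3), round(b2, 3))  # 0.088 0.374
```

When the predictors are uncorrelated (`r_12 = 0`), the function returns the simple correlations themselves, which matches the remark that the b's equal the correlations only in that case.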
Example of a multiple regression problem with two predictors. The number of Atlantic hurricanes between June and November is slightly predictable 6 months in advance (in early December) using several precursor atmospheric and oceanic variables. Two variables used are (1) 500 millibar geopotential height in November in the polar North Atlantic (67.5°N-85°N latitude, 10°E-50°W longitude); and (2) sea level pressure in November in the north tropical Pacific (7.5°N-22.5°N latitude, °W longitude).
[Figure: Location of the two long-lead Atlantic hurricane predictor regions (SLP and 500 mb).]
Physical reasoning behind the two predictors:
(1) 500 millibar geopotential height in November in the polar North Atlantic. High heights are associated with a negative North Atlantic Oscillation (NAO) pattern, tending to associate with a stronger thermohaline circulation, and also tending to be followed by weaker upper atmospheric westerlies and weaker low-level trade winds in the tropical Atlantic the following hurricane season. All of these favor hurricane activity.
(2) Sea level pressure in November in the north tropical Pacific. High pressure in this region in winter tends to be followed by La Niña conditions in the coming summer and fall, which favor easterly Atlantic wind anomalies aloft, and hurricane activity.
First step: Find the "regular" correlations among all the variables ($x_1$, $x_2$, y): $\mathrm{cor}_{x_1,y}$, $\mathrm{cor}_{x_2,y}$, $\mathrm{cor}_{x_1,x_2}$.
$x_1$: Polar North Atlantic 500 millibar height
$x_2$: North tropical Pacific sea level pressure
$\mathrm{cor}_{x_1,y} = 0.20$
$\mathrm{cor}_{x_2,y} = 0.40$
$\mathrm{cor}_{x_1,x_2} = 0.30$ (one predictor vs. the other)
Simultaneous equations to be solved:
$b_{1.2} + (0.30)\,b_{2.1} = 0.20$
$(0.30)\,b_{1.2} + b_{2.1} = 0.40$
Solution: Multiply the 1st equation by 3.333, then subtract the second equation from it. This gives $(3.033)\,b_{1.2} = 0.267$. So $b_{1.2} = 0.088$, and using this in either equation gives $b_{2.1} = 0.374$.
Regression equation is $z_y = (0.088)\,z_{x_1} + (0.374)\,z_{x_2}$
Multiple correlation coefficient = R = correlation between predicted y and actual y using multiple regression. In the example above,
$R = \sqrt{b_{1.2}\,\mathrm{cor}_{x_1,y} + b_{2.1}\,\mathrm{cor}_{x_2,y}} = \sqrt{(0.088)(0.20) + (0.374)(0.40)} = 0.409$
Note this is only very slightly better than using the second predictor alone in simple regression (0.40). This is not surprising, since the first predictor's total correlation with y is only 0.2, and it is correlated 0.3 with the second predictor, so that the second predictor already accounts for some of what the first predictor has to offer. A decision would probably be made concerning whether it is worth the effort to include the first predictor for such a small gain. Note: the multiple correlation (computed on the sample used to fit the equation) can never decrease when more predictors are added.
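The multiple correlation can be computed from the standardized coefficients and the simple correlations. A short sketch, using the hurricane example's values; the function name `multiple_r` is an assumption introduced here for illustration.

```python
import math


def multiple_r(betas, correlations):
    """Multiple correlation R from standardized coefficients and simple correlations.

    Uses R^2 = sum_i b_i * cor(y, x_i), which holds for standardized (z) variables.
    """
    r_squared = sum(b * r for b, r in zip(betas, correlations))
    if r_squared < 0.0:
        raise ValueError("inconsistent inputs: R^2 came out negative")
    return math.sqrt(r_squared)


# Hurricane example: b's of 0.088 and 0.374, correlations with y of 0.20 and 0.40.
R = multiple_r([0.088, 0.374], [0.20, 0.40])
print(round(R, 3))  # 0.409
```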
Multiple R is usually inflated somewhat compared with the true relationship, since additional predictors fit the accidental variations found in the sample used to build the equation. Adjustment (decrease) of R for the existence of multiple predictors gives a less biased estimate of R:
$R_{adj} = \sqrt{1 - (1 - R^2)\,\dfrac{n-1}{n-k-1}}$
where n = sample size and k = number of predictors.
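As a rough illustration of the adjustment, the sketch below assumes the common adjusted R-squared form, $1 - (1 - R^2)(n-1)/(n-k-1)$, and takes its square root. The function name `adjusted_r` and the example inputs (R = 0.5, n = 50, k = 2) are hypothetical and not taken from the slides.

```python
import math


def adjusted_r(R, n, k):
    """Adjusted multiple correlation, assuming the usual adjusted R^2 formula.

    R: unadjusted multiple correlation, n: sample size, k: number of predictors.
    A negative adjusted R^2 (possible for weak fits) is reported as 0.
    """
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    adj_r2 = 1.0 - (1.0 - R ** 2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(adj_r2, 0.0))


# Hypothetical inputs, for illustration only.
print(round(adjusted_r(0.5, n=50, k=2), 3))  # 0.467
```

The shrinkage grows as k approaches n, which is the sense in which many predictors and a short record inflate R.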
The sampling variability (standard error) of a simple (x, y) correlation coefficient around zero, when the population correlation is zero, is approximately
$1/\sqrt{n-1}$
In multiple regression the same approximate relationship holds, except that n must be further decreased by the number of predictors additional to the first one. If the number of predictors (x's) is denoted by k, then the sampling variability of R around zero, when there is no true relationship with any of the predictors, is given by
$1/\sqrt{n-k}$
It is easier to get a given multiple correlation by chance as the number of predictors increases.
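Partial correlation is the correlation between y and $x_1$ when a variable $x_2$ is not allowed to vary. Example: in an elementary school, reading ability (y) is highly correlated with the child's weight ($x_1$). But both y and $x_1$ are really caused by something else: the child's age (call it $x_2$). What would the correlation be between weight and reading ability if age were held constant? (Would it drop down to zero?)
$r_{y,x_1 . x_2} = \dfrac{\mathrm{cor}_{y,x_1} - \mathrm{cor}_{y,x_2}\,\mathrm{cor}_{x_1,x_2}}{\sqrt{(1 - \mathrm{cor}_{y,x_2}^2)(1 - \mathrm{cor}_{x_1,x_2}^2)}}$
$b_{1.2} = \dfrac{\mathrm{cor}_{y,x_1} - \mathrm{cor}_{y,x_2}\,\mathrm{cor}_{x_1,x_2}}{1 - \mathrm{cor}_{x_1,x_2}^2}$
A similar set of equations exists for the second predictor.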
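The idea of holding a third variable constant can be sketched with the standard first-order partial correlation formula. The function name `partial_correlation` is an assumption, and the reading/weight/age correlations used below (0.48, 0.80, 0.60) are hypothetical numbers chosen only to show a case where the partial correlation does drop to zero.

```python
import math


def partial_correlation(r_y1, r_y2, r_12):
    """Partial correlation of y with x1, holding x2 constant.

    r_y1: cor(y, x1), r_y2: cor(y, x2), r_12: cor(x1, x2).
    """
    return (r_y1 - r_y2 * r_12) / math.sqrt((1.0 - r_y2 ** 2) * (1.0 - r_12 ** 2))


# Hypothetical values: reading vs. weight 0.48, reading vs. age 0.80,
# weight vs. age 0.60. Weight adds nothing once age is held constant.
print(round(partial_correlation(0.48, 0.80, 0.60), 3))

# Swapping roles gives the partial correlation of reading with age,
# holding weight constant; it stays large.
print(round(partial_correlation(0.80, 0.48, 0.60), 3))
```

In this made-up case the weight-reading relationship is entirely explained by age, so the first partial correlation is zero while the second remains strong.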
Suppose the three correlations are: reading vs. weight : reading vs. age: weight vs. age: The two partial correlations come out to be: Finally, the two regression weights turn out to be: Weight is seen to be a minor factor compared with age.
Another Example: Sahel Drying Trend. Suppose 50 years of climate data suggest that the drying of the Sahel in northern Africa in July to September may be related both to warming in the tropical Atlantic and Indian oceans ($x_1$) and to local changes in land use in the Sahel itself ($x_2$). $x_1$ is expressed as sea surface temperature (SST), and $x_2$ is expressed as the percentage vegetation decrease (expressed as a positive percentage) from the vegetation found at the beginning of the 50-year period. While both factors appear related to the downward trend in rainfall, the two predictors are somewhat correlated with one another. Suppose the correlations come out as follows:
$\mathrm{cor}_{y,x_1} = -0.52$
$\mathrm{cor}_{y,x_2} = -0.37$
$\mathrm{cor}_{x_1,x_2} = 0.50$
What would be the multiple regression equation in "unit-free" standard deviation (z) units?
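$\mathrm{cor}_{x_1,y} = -0.52$, $\mathrm{cor}_{x_2,y} = -0.37$, $\mathrm{cor}_{x_1,x_2} = 0.50$
First we set up the two equations to be solved simultaneously:
$b_{1.2} + b_{2.1}\,\mathrm{cor}_{x_1,x_2} = \mathrm{cor}_{y,x_1}$
$b_{1.2}\,\mathrm{cor}_{x_1,x_2} + b_{2.1} = \mathrm{cor}_{y,x_2}$
With the numbers inserted:
$b_{1.2} + (0.50)\,b_{2.1} = -0.52$
$(0.50)\,b_{1.2} + b_{2.1} = -0.37$
We want to eliminate (or cancel) $b_{1.2}$ or $b_{2.1}$. To eliminate $b_{2.1}$, multiply the first equation by 2 and subtract the second one from it: $1.5\,b_{1.2} = -0.67$, so $b_{1.2} = -0.447$ and $b_{2.1} = -0.147$.
Regression equation is $z_y = -0.447\,z_{x_1} - 0.147\,z_{x_2}$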
If we want to express the above equation in physical units, we must know the means and standard deviations of y, $x_1$ and $x_2$ and make substitutions to replace the z's. After substituting and simplifying, y, $x_1$ and $x_2$ terms will appear instead of z terms. There generally will also be a constant term that is not found in the z expression, because the original variables probably do not have means of 0 the way z's always do.
The means and the standard deviations of the three data sets are:
y: Jul-Aug-Sep Sahel rainfall: mean 230 mm, SD 88 mm
$x_1$: Tropical Atlantic/Indian Ocean SST: mean 28.3 °C, SD 1.7 °C
$x_2$: Deforestation (percent of initial): mean 34%, SD 22%
Substituting into $z_y = -0.447\,z_{x_1} - 0.147\,z_{x_2}$ gives
$\dfrac{y - 230}{88} = -0.447\,\dfrac{x_1 - 28.3}{1.7} - 0.147\,\dfrac{x_2 - 34}{22}$
After simplification, the final form will be $y = b_1 x_1 + b_2 x_2 + \text{constant}$ (here, both $b_1$ and $b_2$ are < 0).
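The substitution into physical units can be sketched as follows. This assumes the standardized coefficients are about -0.447 (SST) and -0.147 (deforestation), the values implied by the worked prediction in this example. The function name `to_physical_units` is hypothetical.

```python
def to_physical_units(betas, x_means, x_sds, y_mean, y_sd):
    """Convert standardized coefficients to raw-unit coefficients and a constant.

    Each raw coefficient is b_i * SD_y / SD_xi; the constant makes the equation
    return the mean of y when every predictor is at its own mean.
    """
    coeffs = [b * y_sd / sd for b, sd in zip(betas, x_sds)]
    constant = y_mean - sum(c * m for c, m in zip(coeffs, x_means))
    return coeffs, constant


# Sahel example statistics from the slide; the betas are the assumed values above.
coeffs, constant = to_physical_units(
    betas=[-0.447, -0.147],
    x_means=[28.3, 34.0],
    x_sds=[1.7, 22.0],
    y_mean=230.0,
    y_sd=88.0,
)
print([round(c, 3) for c in coeffs], round(constant, 1))
```

With these inputs the SST coefficient is roughly -23 mm per °C and the deforestation coefficient roughly -0.6 mm per percentage point, both negative as the slide states.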
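We now compute the multiple correlation R, and the standard error of estimate for the multiple regression, using the two individual correlations and the b terms:
$\mathrm{cor}_{x_1,y} = -0.52$, $\mathrm{cor}_{x_2,y} = -0.37$, $\mathrm{cor}_{x_1,x_2} = 0.50$
Regression equation is $z_y = -0.447\,z_{x_1} - 0.147\,z_{x_2}$
$R = \sqrt{b_{1.2}\,\mathrm{cor}_{x_1,y} + b_{2.1}\,\mathrm{cor}_{x_2,y}} = \sqrt{(-0.447)(-0.52) + (-0.147)(-0.37)} \approx 0.535$
The deforestation factor helps the prediction accuracy only slightly. If there were less correlation between the two predictors, then the second predictor would be more valuable.
Standard Error of Estimate $= \sqrt{1 - R^2} = \sqrt{1 - 0.535^2} = 0.845$
In physical units it is (0.845)(88 mm) = 74.3 mm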
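The standard error of estimate follows directly from R. A small sketch, assuming R is about 0.535 for the Sahel example; the function name `standard_error_of_estimate` is introduced here for illustration.

```python
import math


def standard_error_of_estimate(R, sd_y=1.0):
    """Standard error of estimate, sqrt(1 - R^2), scaled by the SD of y.

    With sd_y=1.0 the result is in standardized (z) units.
    """
    return math.sqrt(1.0 - R ** 2) * sd_y


see_z = standard_error_of_estimate(0.535)              # in SD units
see_mm = standard_error_of_estimate(0.535, sd_y=88.0)  # in mm of rainfall
print(round(see_z, 3), round(see_mm, 1))
```

Even with R above 0.5, the typical prediction error is still about 85% of the rainfall's own standard deviation.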
Let us evaluate the significance of the multiple correlation of about 0.535. How likely is it to have arisen by chance alone? First we find the standard error of R for samples of 50 drawn from a population having no correlations at all, using 2 predictors. For n = 50 and k = 2 we get a standard error of about 0.145. For a 2-sided z test at the 0.05 level, R must exceed 1.96(0.145) = 0.28. This is easily exceeded, suggesting that the combination of the two predictors (SST and deforestation) does have an impact on Sahel summer rainfall. (Using SST alone in simple regression, with a correlation of magnitude 0.52, would have given nearly the same level of significance.)
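The significance check can be sketched as below. This assumes the null standard error takes the form $1/\sqrt{n-k}$, which gives about 0.144 for n = 50 and k = 2, close to the 0.145 quoted on the slide. The function names are hypothetical.

```python
import math


def null_standard_error(n, k):
    """Approximate standard error of R around zero (assumed form: 1/sqrt(n - k))."""
    if n <= k:
        raise ValueError("need n > k")
    return 1.0 / math.sqrt(n - k)


def is_significant(R, n, k, z_crit=1.96):
    """Two-sided z test: does |R| exceed z_crit standard errors?"""
    return abs(R) > z_crit * null_standard_error(n, k)


se = null_standard_error(n=50, k=2)
threshold = 1.96 * se
print(round(se, 3), round(threshold, 2), is_significant(0.535, n=50, k=2))

# The chance threshold rises as predictors are added to a fixed-length record.
for k in (1, 2, 5, 10):
    print(k, round(1.96 * null_standard_error(50, k), 3))
```

The loop shows numerically why a given R is easier to reach by chance with more predictors.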
Example problem using this regression equation: Suppose that a climate change model predicts that in year 2050, the SST in the tropical Atlantic and Indian oceans will be 2.4 standard deviations above the means given for the 50-year period of the preceding problem. (It is now about 1.6 standard deviations above that mean.) Assume that land use practices (percentage deforestation) will be the same as they are now, which is 1.3 standard deviations above the mean. Under this scenario, using the multiple regression relationship above, how many standard deviations away from the mean will Jul-Aug-Sep Sahel rainfall be, and what seasonal total rainfall does that correspond to?
The problem can be solved either in physical units or in standard deviation units, and then the answer can be expressed in either (or both) kinds of units afterward. If solved in physical units, the values of the two predictors in SD units (2.4 and 1.3) can be converted to raw units using the means and standard deviations of the variables provided previously, and the raw-units form of the regression equation would be used. If solved in SD units, the simpler equation can be used:
$z_y = -0.447\,z_{x_1} - 0.147\,z_{x_2}$
The z's of the two predictors, according to the scenario given, will be 2.4 and 1.3, respectively. Then
$z_y = -0.447(2.4) - 0.147(1.3) = -1.264$
This is how many SDs away from the mean the rainfall would be. Since the rainfall mean and SD are 230 and 88 mm, respectively, the actual amount predicted is 230 - 1.264(88) = 230 - 111.2 = 118.8 mm.
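The SD-units route to the answer can be sketched in a few lines. The SST coefficient of -0.447 is an assumption here, being the value implied by the result 230 - 1.264(88). The function name `predict_rainfall` is hypothetical.

```python
def predict_rainfall(z_x1, z_x2, b1=-0.447, b2=-0.147, y_mean=230.0, y_sd=88.0):
    """Predict Sahel Jul-Aug-Sep rainfall from standardized predictors.

    z_x1: SST anomaly in SD units, z_x2: deforestation anomaly in SD units.
    Returns (z_y, rainfall in mm).
    """
    z_y = b1 * z_x1 + b2 * z_x2
    return z_y, y_mean + z_y * y_sd


z_y, rain_mm = predict_rainfall(z_x1=2.4, z_x2=1.3)
print(round(z_y, 3), round(rain_mm, 1))  # -1.264 118.8
```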
Colinearity. When the predictors are highly correlated with one another in multiple regression, a condition of colinearity exists. When this happens, the coefficients of two highly correlated predictors may have opposing signs, even when each of them has the same sign of simple correlation with the predictand. (Such opposite-signed coefficients minimize the squared errors.) Issues and problems with this are (1) it is counterintuitive, and (2) the coefficients are very unstable, such that if one more sample is added to the data, they may change drastically. When colinearity exists, the multiple regression formula will often still provide useful and accurate predictions. To eliminate colinearity, predictors that are highly correlated can be combined into a single predictor.
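Both symptoms of colinearity, opposing signs and instability, can be reproduced with the two-predictor solution in standardized units. The correlations below are hypothetical numbers chosen for illustration, and `standardized_betas` is a name introduced for this sketch.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized coefficients for two predictors from the three correlations."""
    det = 1.0 - r_12 ** 2
    return (r_y1 - r_12 * r_y2) / det, (r_y2 - r_12 * r_y1) / det


# Two predictors correlated 0.95 with each other; both correlate positively with y.
b_before = standardized_betas(r_y1=0.50, r_y2=0.60, r_12=0.95)

# A small change in one simple correlation (0.50 -> 0.58), as new data might cause.
b_after = standardized_betas(r_y1=0.58, r_y2=0.60, r_12=0.95)

print([round(b, 3) for b in b_before])  # first coefficient is negative
print([round(b, 3) for b in b_after])   # both positive, and very different
```

In this made-up case a shift of 0.08 in one correlation moves the first coefficient from about -0.7 to about +0.1, because dividing by the small quantity 1 - 0.95² amplifies small changes.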
Overfitting. When too many predictors are included in a multiple regression equation, random correlations between the variations of y (the predictand) and one of the predictors are "explained" by the equation. Then, when the equation is used on independent (e.g. future) cases, the results are worse than expected. Overfitting and colinearity are two different issues. Overfitting is more serious, since it is "deceptive". To reduce the effects of overfitting, cross-validation can be used:
--withhold one or more cases when forming the equation, then predict those cases; rotate the cases withheld
--withhold part of the period when forming the equation, then predict that part of the period.
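The first cross-validation option (withhold one case, predict it, rotate) can be sketched for a two-predictor regression. This is a minimal illustration, not a definitive implementation; the function names and the synthetic data, in which the second predictor is pure noise, are assumptions.

```python
import math
import random


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = a + b1*x1 + b2*x2 via the 2x2 normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return my - b1 * m1 - b2 * m2, b1, b2


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def in_sample_rmse(x1, x2, y):
    """Error when the equation is checked on the same cases used to fit it."""
    a, b1, b2 = fit_two_predictors(x1, x2, y)
    return rmse([w - (a + b1 * u + b2 * v) for u, v, w in zip(x1, x2, y)])


def leave_one_out_rmse(x1, x2, y):
    """Withhold each case in turn, fit on the rest, and predict the withheld case."""
    errors = []
    for i in range(len(y)):
        a, b1, b2 = fit_two_predictors(
            x1[:i] + x1[i + 1:], x2[:i] + x2[i + 1:], y[:i] + y[i + 1:]
        )
        errors.append(y[i] - (a + b1 * x1[i] + b2 * x2[i]))
    return rmse(errors)


# Synthetic data (assumed): y depends on x1 only; x2 is an irrelevant predictor.
random.seed(0)
x1 = [random.gauss(0, 1) for _ in range(30)]
x2 = [random.gauss(0, 1) for _ in range(30)]
y = [0.5 * u + random.gauss(0, 1) for u in x1]
print(in_sample_rmse(x1, x2, y), leave_one_out_rmse(x1, x2, y))
```

The cross-validated error is never smaller than the in-sample error for a least-squares fit, and the gap between the two gives a sense of how much the in-sample skill is inflated.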