Multiple Linear Regression


Multiple linear regression uses 2 or more predictors. General form: y = b0 + b1 x1 + b2 x2 + ... + bk xk. Let us take the simplest multiple regression case, two predictors: y = b0 + b1 x1 + b2 x2. Here, the b's are not simply determined by the individual correlations cor(y,x1) and cor(y,x2) alone, unless x1 and x2 have zero correlation with one another. Any correlation between x1 and x2 makes determining the b's less simple. The b's are related to the partial correlation, in which the value of the other predictor(s) is held constant. Holding other predictors constant eliminates the part of the correlation due to the other predictors and not just to the predictor at hand. Notation: the partial correlation of y with x1, with x2 held constant, is written cor y,x1.x2 (the variable after the dot is the one held constant).

For 2 (or in general n) predictors, there are 2 (or n) equations in 2 (or n) unknowns to be solved simultaneously. When n > 3 or so, determinant operations are necessary. For the case of 2 predictors, and using z values (variables standardized by subtracting their mean and then dividing by the standard deviation) for simplicity, the solution can be done by hand. The two equations to be solved simultaneously are:

b1.2 + b2.1 (cor x1,x2) = cor y,x1
b1.2 (cor x1,x2) + b2.1 = cor y,x2

The goal is to find the two b coefficients, b1.2 and b2.1.
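The two-predictor system above also has a closed-form solution (its 2x2 determinant is 1 - cor(x1,x2)^2); a minimal sketch in Python, with function and variable names of my own choosing:

```python
def standardized_coeffs(cor_y_x1, cor_y_x2, cor_x1_x2):
    """Solve the two simultaneous equations
         b12 + b21*cor_x1_x2 = cor_y_x1
         b12*cor_x1_x2 + b21 = cor_y_x2
    for the standardized coefficients b12 (= b1.2) and b21 (= b2.1)."""
    det = 1.0 - cor_x1_x2 ** 2          # determinant of the 2x2 system
    b12 = (cor_y_x1 - cor_y_x2 * cor_x1_x2) / det
    b21 = (cor_y_x2 - cor_y_x1 * cor_x1_x2) / det
    return b12, b21

# Hurricane example worked by hand below (correlations 0.20, 0.40, 0.30):
b12, b21 = standardized_coeffs(0.20, 0.40, 0.30)   # ~0.088, ~0.374
```

If the two predictors are uncorrelated (cor_x1_x2 = 0), the determinant is 1 and each b collapses to the simple correlation, matching the remark above.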

b1.2 + b2.1 (cor x1,x2) = cor y,x1
b1.2 (cor x1,x2) + b2.1 = cor y,x2

Example of a multiple regression problem with two predictors: The number of Atlantic hurricanes between June and November is slightly predictable 6 months in advance (in early December) using several precursor atmospheric and oceanic variables. Two variables used are (1) 500-millibar geopotential height in November in the polar North Atlantic (67.5-85°N latitude, 10°E-50°W longitude); and (2) sea level pressure in November in the north tropical Pacific (7.5-22.5°N latitude, °W longitude).

[Figure: map showing the locations of the two long-lead Atlantic hurricane predictor regions: 500-mb height (polar North Atlantic) and SLP (north tropical Pacific).]

Physical reasoning behind the two predictors:

(1) 500-millibar geopotential height in November in the polar North Atlantic. High heights are associated with a negative North Atlantic Oscillation (NAO) pattern, tending to associate with a stronger thermohaline circulation, and also tending to be followed by weaker upper-atmospheric westerlies and weaker low-level trade winds in the tropical Atlantic the following hurricane season. All of these favor hurricane activity.

(2) Sea level pressure in November in the north tropical Pacific. High pressure in this region in winter tends to be followed by La Niña conditions in the coming summer and fall, which favor easterly Atlantic wind anomalies aloft, and hurricane activity.

First step: find the "regular" correlations among all the variables (x1, x2, y): cor x1,y; cor x2,y; cor x1,x2.

x1: polar North Atlantic 500-millibar height
x2: north tropical Pacific sea level pressure

cor x1,y = 0.20      cor x2,y = 0.40      cor x1,x2 = 0.30

Simultaneous equations to be solved:

b1.2 + (0.30) b2.1 = 0.20
(0.30) b1.2 + b2.1 = 0.40

Solution: Multiply the 1st equation by 3.333, then subtract the second equation from the first. This gives (3.033) b1.2 = 0.267, so b1.2 = 0.088, and using this in either equation gives b2.1 = 0.374.

Regression equation: Zy = (0.088) zx1 + (0.374) zx2

Multiple correlation coefficient = R = correlation between predicted y and actual y using multiple regression. In the example above, R = sqrt[(0.088)(0.20) + (0.374)(0.40)] = sqrt(0.167) = 0.41. Note this is only very slightly better than the 0.40 obtained using the second predictor alone in simple regression. This is not surprising, since the first predictor's total correlation with y is only 0.20, and it is correlated 0.30 with the second predictor, so that the second predictor already accounts for some of what the first predictor has to offer. A decision would probably be made concerning whether it is worth the effort to include the first predictor for such a small gain. Note: the multiple correlation can never decrease when more predictors are added.
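For standardized predictors, R can be computed directly from the coefficients and the simple correlations via R^2 = b1.2 cor(y,x1) + b2.1 cor(y,x2); a small sketch:

```python
import math

def multiple_R(b12, b21, cor_y_x1, cor_y_x2):
    """Multiple correlation for a two-predictor standardized regression:
    R^2 = b1.2*cor(y,x1) + b2.1*cor(y,x2)."""
    return math.sqrt(b12 * cor_y_x1 + b21 * cor_y_x2)

# Hurricane example from the slides: barely better than 0.40 alone
R = multiple_R(0.088, 0.374, 0.20, 0.40)   # ~0.41
```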

Multiple R is usually inflated somewhat compared with the true relationship, since additional predictors fit the accidental variations found in the test sample. Adjustment (decrease) of R for the existence of multiple predictors gives a less biased estimate of R:

Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1)

where n = sample size and k = number of predictors; the adjusted R is the square root of this quantity.
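A sketch of the adjustment, applying the standard adjusted-R^2 formula (the example numbers come from the Sahel case worked later: R = 0.54, n = 50, k = 2):

```python
import math

def adjusted_R(R, n, k):
    """Shrink R toward a less biased estimate given sample size n
    and number of predictors k (standard adjusted-R^2 formula)."""
    adj_R2 = 1.0 - (1.0 - R ** 2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(adj_R2, 0.0))   # clamp: adj R^2 can go negative

R_adj = adjusted_R(0.54, n=50, k=2)      # ~0.51, slightly below 0.54
```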

Sampling variability of a simple (x, y) correlation coefficient around zero, when the population correlation is zero, is approximately 1/sqrt(n - 2). In multiple regression the same approximate relationship holds, except that n must be further decreased by the number of predictors additional to the first one. If the number of predictors (x's) is denoted by k, then the sampling variability of R around zero, when there is no true relationship with any of the predictors, is given by 1/sqrt(n - k - 1). It is easier to get a given multiple correlation by chance as the number of predictors increases.
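A sketch of a chance-level significance check built on this standard error (the n = 50, k = 2 numbers reappear in the Sahel example below):

```python
import math

def chance_stderr(n, k):
    """Approximate sampling standard deviation of R about zero
    when no true relationship exists (n cases, k predictors)."""
    return 1.0 / math.sqrt(n - k - 1)

se = chance_stderr(50, 2)        # ~0.145
threshold = 1.96 * se            # two-sided 5% level, ~0.28
```

An observed R above `threshold` would be unlikely to arise by chance alone at the 0.05 level.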

Partial correlation is the correlation between y and x1, where a variable x2 is not allowed to vary. Example: in an elementary school, reading ability (y) is highly correlated with the child's weight (x1). But both y and x1 are really caused by something else: the child's age (call it x2). What would the correlation be between weight and reading ability if the age were held constant? (Would it drop down to zero?) A similar set of equations exists for the second predictor.

Suppose the three correlations are:

reading vs. weight:
reading vs. age:
weight vs. age:

The two partial correlations come out to be:

Finally, the two regression weights turn out to be:

Weight is seen to be a minor factor compared with age.
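The partial correlation has a standard closed form in terms of the three simple correlations; a sketch, illustrated with the hurricane-example correlations from earlier (0.20, 0.40, 0.30) rather than the reading/weight/age values:

```python
import math

def partial_corr(r_y1, r_y2, r_12):
    """Partial correlation of y with x1, holding x2 constant:
    r(y,x1.x2) = (r_y1 - r_y2*r_12) / sqrt((1 - r_y2^2)(1 - r_12^2))."""
    return (r_y1 - r_y2 * r_12) / math.sqrt(
        (1 - r_y2 ** 2) * (1 - r_12 ** 2))

# First hurricane predictor: partial correlation is well below its
# simple correlation of 0.20 once the second predictor is held fixed.
r_partial = partial_corr(0.20, 0.40, 0.30)   # ~0.09
```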

Another Example – Sahel Drying Trend

Suppose 50 years of climate data suggest that the drying of the Sahel in northern Africa in July to September may be related both to warming in the tropical Atlantic and Indian oceans (x1) as well as to local changes in land use in the Sahel itself (x2). x1 is expressed as SST, and x2 is expressed as percentage vegetation decrease (expressed as a positive percentage) from the vegetation found at the beginning of the 50-year period. While both factors appear related to the downward trend in rainfall, the two predictors are somewhat correlated with one another. Suppose the correlations come out as follows:

cor(y,x1) = -0.52      cor(y,x2) = -0.37      cor(x1,x2) = 0.50

What would be the multiple regression equation in "unit-free" standard deviation (z) units?

cor(x1,y) = -0.52      cor(x2,y) = -0.37      cor(x1,x2) = 0.50

First we set up the two equations to be solved simultaneously:

b1.2 + b2.1 (cor x1,x2) = cor y,x1
b1.2 (cor x1,x2) + b2.1 = cor y,x2

b1.2 + (0.50) b2.1 = -0.52
(0.50) b1.2 + b2.1 = -0.37

We want to eliminate (or cancel) b1.2 or b2.1. To eliminate b2.1, multiply the first equation by 2 and subtract the second one from it:

1.5 b1.2 = -0.67, so b1.2 = -0.447, and then b2.1 = -0.147.

Regression equation: Zy = -0.447 zx1 - 0.147 zx2

If we want to express the above equation in physical units, we must know the means and standard deviations of y, x1 and x2 and make substitutions to replace the z's. When we substitute and simplify the result, y, x1 and x2 terms will appear instead of z terms. There will generally also be a constant term that is not found in the z expression, because the original variables probably do not have means of 0 the way z's always do.

The means and the standard deviations of the three data sets are:

y: Jul-Aug-Sep Sahel rainfall (mm): mean 230 mm, SD 88 mm
x1: tropical Atlantic/Indian ocean SST: mean 28.3 deg C, SD 1.7 C
x2: deforestation (percent of initial): mean 34%, SD 22%

Zy = -0.447 zx1 - 0.147 zx2

After simplification, the final form will be: y = b1 x1 + b2 x2 + constant (here, both coefficients b1 and b2 are < 0).
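The substitution can be sketched in code: each physical-unit coefficient is the z-unit coefficient times SD_y/SD_x, and the constant follows from requiring the equation to pass through the means (function name and argument layout are my own):

```python
def to_physical_units(b_z, means, sds):
    """Convert standardized coefficients to physical-unit coefficients
    and an intercept, given means/SDs ordered as (y, x1, x2)."""
    my, m1, m2 = means
    sy, s1, s2 = sds
    b1 = b_z[0] * sy / s1              # mm per deg C
    b2 = b_z[1] * sy / s2              # mm per percent deforestation
    b0 = my - b1 * m1 - b2 * m2        # intercept from the means
    return b0, b1, b2

# Sahel example: Zy = -0.447 zx1 - 0.147 zx2
b0, b1, b2 = to_physical_units((-0.447, -0.147),
                               means=(230.0, 28.3, 34.0),
                               sds=(88.0, 1.7, 22.0))
# b1 ~ -23.1, b2 ~ -0.59; both negative, as stated above
```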

We now compute the multiple correlation R, and the standard error of estimate for the multiple regression, using the two individual correlations and the b terms:

cor(x1,y) = -0.52      cor(x2,y) = -0.37      cor(x1,x2) = 0.50
Regression equation: Zy = -0.447 zx1 - 0.147 zx2

R = sqrt[b1.2 cor(y,x1) + b2.1 cor(y,x2)] = sqrt[(-0.447)(-0.52) + (-0.147)(-0.37)] = sqrt(0.287) = 0.54

The deforestation factor helps the prediction accuracy only slightly. If there were less correlation between the two predictors, then the second predictor would be more valuable.

Standard error of estimate = sqrt(1 - R^2) = sqrt(1 - 0.287) = 0.845 in z units. In physical units it is (0.845)(88 mm) = 74.3 mm.

Let us evaluate the significance of the multiple correlation of 0.54. How likely could it have arisen by chance alone? First we find the standard error of samples of 50 drawn from a population having no correlations at all, using 2 predictors. For n = 50 and k = 2 we get 1/sqrt(50 - 2 - 1) = 0.145. For a 2-sided z test at the 0.05 level, we need 1.96(0.145) = 0.28. This is easily exceeded, suggesting that the combination of the two predictors (SST and deforestation) does have an impact on Sahel summer rainfall. (Using SST alone in simple regression, with cor = -0.52, would have given nearly the same level of significance.)

Example problem using this regression equation: Suppose that a climate change model predicts that in year 2050, the SST in the tropical Atlantic and Indian oceans will be 2.4 standard deviations above the means given for the 50-year period of the preceding problem. (It is now about 1.6 standard deviations above that mean.) Assume that land use practices (percentage deforestation) will be the same as they are now, which is 1.3 standard deviations above the mean. Under this scenario, using the multiple regression relationship above, how many standard deviations away from the mean will Jul-Aug-Sep Sahel rainfall be, and what seasonal total rainfall does that correspond to?

The problem can be solved either in physical units or in standard deviation units, and the answer can then be expressed in either (or both) kinds of units. If solved in physical units, the values of the two predictors in SD units (2.4 and 1.3) can be converted to raw units using the means and standard deviations of the variables provided previously, and the raw-units form of the regression equation would be used. If solved in SD units, the simpler equation can be used:

Zy = -0.447 zx1 - 0.147 zx2

The z's of the two predictors, according to the scenario given, will be 2.4 and 1.3, respectively. Then Zy = -0.447(2.4) - 0.147(1.3) = -1.264. This is how many SDs away from the mean the rainfall would be. Since the rainfall mean and SD are 230 and 88 mm, respectively, the actual amount predicted is 230 - 1.264(88) = 230 - 111.2 = 118.8 mm.
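The z-unit route sketched in code:

```python
# Sahel 2050 scenario: predictors at +2.4 and +1.3 SDs, using the
# standardized regression equation Zy = -0.447 zx1 - 0.147 zx2.
def predict_rainfall(z_x1, z_x2, mean_y=230.0, sd_y=88.0):
    """Predict Jul-Aug-Sep Sahel rainfall (mm) from predictor z-scores,
    returning both the z-score of rainfall and the physical amount."""
    z_y = -0.447 * z_x1 - 0.147 * z_x2
    return z_y, mean_y + z_y * sd_y

z_y, rainfall_mm = predict_rainfall(2.4, 1.3)   # ~-1.26 SD, ~119 mm
```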

Collinearity

When the predictors are highly correlated with one another in multiple regression, a condition of collinearity exists. When this happens, the coefficients of two highly correlated predictors may have opposing signs, even when each of them has the same sign of simple correlation with the predictand. (Such opposing-signed coefficients minimize squared errors.) Issues and problems with this are (1) it is counterintuitive, and (2) the coefficients are very unstable, such that if one more sample is added to the data, they may change drastically. When collinearity exists, the multiple regression formula will often still provide useful and accurate predictions. To eliminate collinearity, predictors that are highly correlated can be combined into a single predictor.
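The opposing-sign effect falls straight out of the two-predictor solution; a sketch with made-up correlations (0.50 and 0.45 with y, 0.95 between the predictors) chosen purely for illustration:

```python
# Both predictors correlate positively with y, but because they are
# nearly collinear (r12 = 0.95) the second coefficient comes out negative.
def two_predictor_coeffs(r_y1, r_y2, r_12):
    det = 1.0 - r_12 ** 2
    return ((r_y1 - r_y2 * r_12) / det,
            (r_y2 - r_y1 * r_12) / det)

b1, b2 = two_predictor_coeffs(0.50, 0.45, 0.95)
# b1 > 0 while b2 < 0, despite both simple correlations being positive
```

Note also how the small determinant (1 - 0.95^2 ≈ 0.10) inflates both coefficients, which is the source of the instability described above.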

Overfitting

When too many predictors are included in a multiple regression equation, random correlations between the variations of y (the predictand) and one of the predictors are "explained" by the equation. Then, when the equation is used on independent (e.g. future) predictions, the results are worse than expected. Overfitting and collinearity are two different issues. Overfitting is more serious, since it is "deceptive". To reduce the effects of overfitting, use cross-validation:
--withhold one or more cases when forming the equation, then predict those cases; rotate the cases withheld
--withhold part of the period when forming the equation, then predict that part of the period
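The first variant (leave-one-out cross-validation) can be sketched for a one-predictor regression; the tiny data set here is invented purely for illustration:

```python
def fit_simple(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def loo_predictions(xs, ys):
    """Leave-one-out cross-validation: refit with case i withheld,
    then predict the withheld case; rotate through all cases."""
    preds = []
    for i in range(len(xs)):
        xs_tr = xs[:i] + xs[i + 1:]
        ys_tr = ys[:i] + ys[i + 1:]
        a, b = fit_simple(xs_tr, ys_tr)
        preds.append(a + b * xs[i])
    return preds

# Invented illustrative data (roughly y = 2x plus small noise):
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
cv_preds = loo_predictions(xs, ys)
```

The errors of `cv_preds` against `ys` estimate how the equation would do on independent data, which is what the fitted (in-sample) errors overstate when overfitting is present.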