A Brief Introduction to Spatial Regression

Eugene Brusilovskiy

Outline
- Review of Correlation
- OLS Regression
- Regression with a non-normal dependent variable
- Spatial Regression

Correlation
- A measure of how much two variables X and Y change together
- Dimensionless measure: a correlation between two variables is a single number that ranges from -1 to 1, with positive values close to 1 indicating a strong direct relationship and negative values close to -1 indicating a strong inverse relationship
- E.g., a positive correlation between income and years of schooling indicates that more years of schooling correspond to greater income (i.e., an increase in years of schooling is associated with an increase in income)
- A correlation of 0 indicates a lack of relationship between the variables
- Generally denoted by the Greek letter ρ
- Pearson correlation: used when the variables are normally distributed
- Spearman correlation: used when the variables aren't normally distributed
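As a concrete illustration, here is a minimal Python sketch of both coefficients. SciPy is an assumed package (the slides name no software here), and the schooling/income numbers are made up.

```python
# Hedged sketch: SciPy and the toy data below are assumptions, not from the slides.
import numpy as np
from scipy.stats import pearsonr, spearmanr

years_schooling = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20, 21])
income = np.array([22, 27, 31, 35, 40, 52, 49, 60, 75, 90])  # in $1000s

r, p_r = pearsonr(years_schooling, income)       # assumes roughly normal variables
rho, p_rho = spearmanr(years_schooling, income)  # rank-based, no normality assumption

print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```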

Some Remarks
- In practice, we rarely see perfect positive or negative correlations (i.e., correlations of exactly 1 or -1)
- Correlations higher than 0.6 (or lower than -0.6) are generally considered strong
- There might be confounding factors that explain a strong positive or negative correlation between variables
- E.g., the volume of ice cream consumption might be correlated with crime rates. Why? Both tend to be high when temperatures are warmer!
- The correlation between two seemingly unrelated variables does not always equal exactly zero (although it will often be close to it)

Correlation does not imply causation!
Source: http://imgs.xkcd.com/comics/correlation.png

Regression
- A statistical method used to examine the relationship between a variable of interest (the dependent variable) and one or more explanatory variables (predictors)
  - Strength of the relationship
  - Direction of the relationship (positive, negative, zero)
  - Goodness of model fit
- Allows you to calculate the amount by which your dependent variable changes when a predictor variable changes by one unit (holding all other predictors constant)
- Often referred to as Ordinary Least Squares (OLS) regression
  - Regression with one predictor is called simple regression
  - Regression with two or more predictors is called multiple regression
- Available in all statistical packages
- As with correlation, even if an explanatory variable is a significant predictor of the dependent variable, this does not imply that it causes the dependent variable

Example
- Assume we have data on median income and median house value in 381 Philadelphia census tracts (i.e., our unit of analysis is the tract)
- Each of the 381 tracts has information on income (call it Y) and on house value (call it X), so we can create a scatter plot of Y against X
- From this scatter plot, we can calculate the equation of the line that best fits the pattern (recall: Y = mX + b, where m is the slope and b is the y-intercept)
- This is done by finding the line that minimizes the sum of the squared (vertical) distances between the points and the line; hence the term ordinary least squares
- Now we can examine the relationship between these two variables
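A minimal sketch of this fit in Python with statsmodels (an assumed package; the 381 "tracts" below are simulated, not the actual Philadelphia data):

```python
# Hedged sketch: statsmodels and the simulated tract data are assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
house_value = rng.uniform(50, 400, size=381)              # X: median house value ($1000s)
income = 10 + 0.12 * house_value + rng.normal(0, 8, 381)  # Y: median income ($1000s)

X = sm.add_constant(house_value)   # adds the intercept b
fit = sm.OLS(income, X).fit()      # minimizes the sum of squared vertical distances

print(fit.params)                  # [b, m]: intercept and slope of the best-fit line
print(fit.rsquared)                # equals the squared X-Y correlation in this one-predictor case
```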

We can easily extend this to cases with 2+ predictors
- When we have n > 1 predictors, rather than fitting a line in 2 dimensions, we fit a (hyper)plane in n + 1 dimensions (the '+1' accounts for the dependent variable)
- Each independent variable will have its own slope coefficient, which indicates the relationship of that particular predictor with the dependent variable, controlling for all other independent variables in the regression
- The equation of the best-fit line becomes Y = β0 + β1X1 + β2X2 + … + βnXn + ε, where the Xi are the predictors and ε is the error term
- The coefficient βi of each predictor may be interpreted as the amount by which the dependent variable changes as that independent variable increases by one unit (holding all other variables constant)

An Example with 2 Predictors: Income as a function of House Value and Crime
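A hedged sketch of what such a two-predictor fit might look like, using the same assumed statsmodels setup as above; the house-value and crime arrays are simulated placeholders, not the presentation's data.

```python
# Hedged sketch: simulated stand-ins for the house value and crime predictors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
house_value = rng.uniform(50, 400, size=381)
crime_rate = rng.uniform(0, 60, size=381)
income = 15 + 0.10 * house_value - 0.20 * crime_rate + rng.normal(0, 8, 381)

X = sm.add_constant(np.column_stack([house_value, crime_rate]))
fit = sm.OLS(income, X).fit()

# The summary reports one slope and one p-value per predictor (each slope is the
# change in income per one-unit change in that predictor, holding the other fixed),
# plus a single R-squared and adjusted R-squared for the model.
print(fit.summary())
```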

Some Basic Regression Diagnostics
- The p-value associated with each variable
  - For any statistical method, including regression, we are testing some hypothesis. In regression, we test the null hypothesis that the coefficient (i.e., slope) β is equal to zero (i.e., that the explanatory variable is not a significant predictor of the dependent variable)
  - Formally, the p-value is the probability of observing a value of β as extreme as the estimated value (i.e., as different from 0 as it is) when in reality β equals zero (i.e., when the null hypothesis holds)
  - If this probability is small enough (generally, p < 0.05), we reject the null hypothesis of β = 0 in favor of the alternative hypothesis of β ≠ 0
  - When the null hypothesis (β = 0) cannot be rejected, the dependent variable is not related to the independent variable
  - The rejection of the null hypothesis (i.e., when p < 0.05) indicates that the independent variable is a statistically significant predictor of the dependent variable
  - One p-value per independent variable

Some Basic Regression Diagnostics (Cont'd)
- The sign of the coefficient of the independent variable (i.e., the slope of the regression line)
  - One coefficient per independent variable
  - Indicates whether the relationship between the dependent and independent variables is positive or negative
  - The sign should only be interpreted when the coefficient is statistically significant (i.e., significantly different from zero)

Some Basic Regression Diagnostics (Cont'd)
- R-squared (a.k.a. the coefficient of determination): the percent of variance in the dependent variable that is explained by the predictors
  - In the single-predictor case, R-squared is simply the square of the correlation between the predictor and the dependent variable
  - The more independent variables included, the higher the R-squared
  - Adjusted R-squared: the percent of variance in the dependent variable explained, adjusted for the number of predictors
  - One R-squared for the whole regression model

Some (but not all) regression assumptions
- The dependent variable should be normally distributed (i.e., its histogram should look like a bell curve)
  - Ideally, this will also be true of the independent variables, but it is not essential. Independent variables can also be binary (i.e., take two values, such as 1 (yes) and 0 (no))
- The predictors should not be strongly correlated with each other (i.e., no multicollinearity); a quick check is sketched after this list
- Very importantly, the observations should be independent of each other (the same holds for the regression residuals). If this assumption is violated, our coefficient estimates could be wrong!
- General rule of thumb: at least 10 observations per independent variable
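One common way to screen for multicollinearity is the variance inflation factor (VIF). The sketch below uses statsmodels and simulated predictors, neither of which is prescribed by the slides.

```python
# Hedged sketch: a VIF-based multicollinearity check (an assumption, not from the slides).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)   # deliberately correlated with x1
X = sm.add_constant(np.column_stack([x1, x2]))

# Rule of thumb (itself an assumption): VIFs well above ~5-10 suggest multicollinearity
for i, name in enumerate(["const", "x1", "x2"]):
    print(name, variance_inflation_factor(X, i))
```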

An Example of a Normal Distribution

Data Transformations
- Sometimes it is possible to make a variable's distribution closer to normal by subjecting it to a simple algebraic operation
- The logarithmic transformation is the most widely used to achieve normality when the variable is positively skewed (i.e., has a long right tail)
- Analysis is then performed on the transformed variable
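A minimal sketch of the log transformation in Python; numpy and the toy skewed data are assumptions, and log1p (log of 1 + x) is used here simply to guard against zero values.

```python
# Hedged sketch: toy right-skewed data; np.log1p avoids taking log(0).
import numpy as np

rng = np.random.default_rng(3)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=1000)  # positively skewed variable
logged = np.log1p(skewed)                               # roughly symmetric afterwards

print(skewed.mean(), np.median(skewed))   # mean >> median: long right tail
print(logged.mean(), np.median(logged))   # mean close to median: near-normal shape
```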

Additional Regression Methods
- Logistic regression / probit regression
  - When your dependent variable is binary (i.e., has two possible outcomes). E.g., employment indicator (Are you employed? Yes/No)
- Multinomial logistic regression
  - When your dependent variable is categorical and has more than two categories. E.g., race: Black, Asian, White, Other
- Ordinal logistic regression
  - When your dependent variable is ordinal and has more than two categories. E.g., education (1 = less than high school, 2 = high school, 3 = more than high school)
- Poisson regression
  - When your dependent variable is a count. E.g., number of traffic violations (0, 1, 2, 3, 4, 5, etc.)
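A hedged sketch of two of these alternatives, logistic and Poisson regression, using statsmodels (an assumed package) and simulated outcomes:

```python
# Hedged sketch: the employment and traffic-violation data are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=500)
X = sm.add_constant(x)

# Logistic regression: binary outcome (employed yes = 1 / no = 0)
p_employed = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))
employed = (rng.uniform(size=500) < p_employed).astype(int)
logit_fit = sm.Logit(employed, X).fit(disp=0)

# Poisson regression: count outcome (number of traffic violations)
violations = rng.poisson(lam=np.exp(0.2 + 0.5 * x))
poisson_fit = sm.Poisson(violations, X).fit(disp=0)

print(logit_fit.params)     # coefficients on the log-odds scale
print(poisson_fit.params)   # coefficients on the log-count scale
```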

Spatial Autocorrelation
- Recall: there is spatial autocorrelation in a variable if observations that are closer to each other in space have related values (Tobler's Law)
- One of the regression assumptions is independence of observations. If this doesn't hold, we obtain inaccurate estimates of the β coefficients, and the error term ε contains spatial dependencies (i.e., meaningful information), whereas we want the error to be indistinguishable from random noise

Imagine a problem with a spatial component…
This example is obviously a dramatization, but nonetheless, in many spatial problems points that are close together have similar values.

But how do we know if spatial dependencies exist?
- Moran's I (1950): a rather old and perhaps the most widely used method of testing for spatial autocorrelation, or spatial dependencies
  - We can determine a p-value for Moran's I (i.e., an indicator of whether the spatial autocorrelation is statistically significant)
  - For more on Moran's I, see http://en.wikipedia.org/wiki/Moran%27s_I
  - Like the non-spatial correlation coefficient, it ranges from -1 to 1
  - Can be calculated in ArcGIS
- Other commonly used indices of spatial autocorrelation include:
  - Geary's c (1954)
  - Getis and Ord's G-statistic (1992), for non-negative values only
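Besides ArcGIS and GeoDa, Moran's I can be computed in code. The sketch below uses the PySAL libraries (libpysal and esda), which are an assumption on my part, on a toy grid of values.

```python
# Hedged sketch: PySAL (libpysal + esda) and the toy 10x10 lattice are assumptions.
import numpy as np
from libpysal.weights import lat2W
from esda.moran import Moran

rng = np.random.default_rng(5)
w = lat2W(10, 10)            # rook-contiguity weights on a 10 x 10 lattice
y = rng.normal(size=100)     # a variable with no built-in spatial pattern

mi = Moran(y, w)
print(mi.I)                  # Moran's I statistic
print(mi.p_sim)              # permutation-based p-value for spatial autocorrelation
```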

So, when a problem has a spatial component, we should:
1. Run the non-spatial regression
2. Test the regression residuals for spatial autocorrelation, using Moran's I or some other index
3. If no significant spatial autocorrelation exists, stop. Otherwise, if the spatial dependencies are significant, use a special model that takes them into account.

Spatial Regression Models
- A spatial lag (SL) model
  - Assumes that dependencies exist directly among the levels of the dependent variable
  - That is, the income at one location is affected by the income at nearby locations
  - A "lag" term, which is a specification of income at nearby locations, is included in the regression, and its coefficient and p-value are interpreted as for the independent variables
  - As in OLS regression, we can include independent variables in the model
  - Whereas we will see spatial autocorrelation in the OLS residuals, the SL model should account for the spatial dependencies, and the SL residuals should not be autocorrelated
  - Hence the SL residuals should not be distinguishable from random noise (i.e., should show no consistent patterns or dependencies)
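A hedged sketch of fitting a spatial lag model in code with spreg (part of the PySAL family); the slides themselves use GeoDa, so this package and the simulated data are assumptions.

```python
# Hedged sketch: spreg's maximum-likelihood spatial lag model on simulated data.
import numpy as np
from libpysal.weights import lat2W
from spreg import ML_Lag

rng = np.random.default_rng(6)
w = lat2W(10, 10)                    # contiguity weights for 100 lattice "tracts"
w.transform = "r"                    # row-standardize the weights

X = rng.normal(size=(100, 2))        # two predictors, e.g. house value and crime
y = (1.0 + X @ np.array([0.8, -0.5]) + rng.normal(size=100)).reshape(-1, 1)

model = ML_Lag(y, X, w, name_y="income", name_x=["house_value", "crime"])
print(model.rho)                     # coefficient on the spatial lag term
print(model.summary)                 # slopes, p-values, and model diagnostics
```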

OLS Residuals vs. SL Residuals
(Figure: OLS residuals show non-random patterns and clustering; SL residuals look like random noise)

But how is spatial proximity defined?
- For each point (or areal unit), we need to identify its spatial relationship with all the other points (or areal units). This can be done by looking at the (inverse of the) distance between each pair of points, or in a number of other ways:
  - A binary indicator stating whether two points (or census tract centroids) are within a certain distance of each other (1 = yes, 0 = no)
  - A binary indicator stating whether point A is one of the ___ (1, 5, 10, 15, etc.) nearest neighbors of point B (1 = yes, 0 = no)
  - For areal datasets, the proportion of the boundary that zone 1 shares with zone 2, or simply a binary indicator of whether zones 1 and 2 share a border (1 = yes, 0 = no)
  - Etc.
- When we have n observations, we form an n x n table (called a weight matrix or link matrix) that summarizes all the pairwise spatial relationships in the dataset
- These weight matrices are used in the estimation of spatial regression models (and in the calculation of Moran's I)
- Unless we have compelling reasons not to, it's generally a good idea to check whether our results hold under different types of weight matrices
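For illustration, here is a sketch of building several of these weight matrices with libpysal and geopandas (both assumed packages); "tracts.shp" is a hypothetical shapefile name.

```python
# Hedged sketch: libpysal/geopandas are assumptions; "tracts.shp" is hypothetical.
import geopandas as gpd
from libpysal.weights import Queen, KNN, DistanceBand

gdf = gpd.read_file("tracts.shp")                          # hypothetical tract polygons

w_border = Queen.from_dataframe(gdf)                       # 1 if two tracts share a border
w_knn = KNN.from_dataframe(gdf, k=5)                       # 1 if B is among A's 5 nearest neighbors
w_dist = DistanceBand.from_dataframe(gdf, threshold=2000)  # 1 if centroids within 2000 map units

w_border.transform = "r"   # row-standardization is a common choice before modeling
```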

Assume we have a map with 10 Census tracts
The hypothetical 10 x 10 weight matrix (not reproduced here) indicates whether any given Census tract shares a boundary with another tract: 1 means yes and 0 means no. For instance, tracts 3 and 6 share a boundary, so the entries in row 3, column 6 and in row 6, column 3 are both 1.
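A minimal numpy sketch of such a 10 x 10 binary contiguity matrix; apart from the tract 3-6 pair mentioned above, the neighbor pairs are invented for illustration.

```python
# Hedged sketch: only the (3, 6) adjacency comes from the slide; the rest is made up.
import numpy as np

W = np.zeros((10, 10), dtype=int)
pairs = [(1, 2), (2, 3), (3, 6), (6, 7)]      # 1-indexed tract pairs sharing a border
for i, j in pairs:
    W[i - 1, j - 1] = W[j - 1, i - 1] = 1     # symmetric, with 0s on the diagonal

print(W[2, 5], W[5, 2])   # 1 1: tracts 3 and 6 share a boundary
```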

Now, we need a software package that can…
- Run the good old OLS regression model
- Create a weight matrix
- Test for spatial autocorrelation in the OLS residuals
- Run a spatial lag model (or some other spatial model)
Such packages do exist!

GeoDa
- A software package developed by Luc Anselin
- Can be downloaded free of charge (for members of educational and research institutions) at https://www.geoda.uiuc.edu/
- Has a user-friendly interface
- Accepts ESRI shapefiles as inputs
- Is able to perform a number of basic GIS operations in addition to running sophisticated spatial statistics models

Other Spatial Regression Models
- Spatial error model (can be implemented in GeoDa)
- Geographically weighted regression (can be run in ArcGIS 9.3)
These methods also aim to account for spatial dependencies in the data.

Some References: Spatial Regression
- Bailey, T.C. and Gatrell, A.C. (1995). Interactive Spatial Data Analysis. Addison Wesley Longman, Harlow, Essex.
- Cressie, N.A.C. (1993). Statistics for Spatial Data (Revised Edition). John Wiley & Sons, Inc.
- LeSage, J. and Pace, R.K. (2009). Introduction to Spatial Econometrics. CRC Press/Taylor & Francis Group.