# Correlation and Linear Regression


QTM1310/ Sharpe

Scatterplots
A scatterplot, which plots one quantitative variable against another, can be an effective display for data. Scatterplots are the ideal way to picture associations between two quantitative variables.

Direction in scatterplots
The direction of the association is important. A pattern that runs from the upper left to the lower right is said to be negative. A pattern running from the lower left to the upper right is called positive. Look for direction: what's the sign: positive, negative, or neither?

Form of scatterplots
The second thing to look for in a scatterplot is its form. If there is a straight-line relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form. This is called linear form. Sometimes the relationship curves gently, while still increasing or decreasing steadily; sometimes it curves sharply up then down. Look for form: is it straight, curved, something else, or is there no pattern?

Strength of the relationship
The third feature to look for in a scatterplot is the strength of the relationship. Do the points appear tightly clustered in a single stream? Or do the points seem to be so variable and spread out that we can barely notice any trend or pattern? Look for strength: how scattered are the data?

Outliers in scatterplots
Finally, always look for the unexpected. An outlier is an unusual observation, standing away from the overall pattern of the scatterplot. Look for unusual features: are there unusual observations, points, or subgroups?

Example: Describing a scatterplot
The Texas Transportation Institute issues an annual report on traffic congestion and its cost to society and business. Describe the scatterplot of Congestion Cost against Freeway Speed.

Describing a scatterplot
The scatterplot of Congestion Cost against Freeway Speed is roughly linear, negative, and strong. As the Peak Period Freeway Speed (mph) increases, the Congestion Cost per person tends to decrease.

Example: Bookstore
Data gathered from a bookstore show Number of Sales People Working and Sales (in \$1000). Given the scatterplot, describe the direction, form, and strength of the relationship. Are there any outliers?

The relationship between Number of Sales People Working and Sales is positive, linear, and strong. As the Number of Sales People Working increases, Sales tends to increase also. There are no outliers.

Proper scatterplots
To make a scatterplot of two quantitative variables, assign one variable to the y-axis and the other to the x-axis. Be sure to label the axes clearly, and indicate the scales of the axes with numbers. Each variable has units, and these should appear with the display, usually near each axis.

Each point is placed on a scatterplot at a position that corresponds to the values of the two variables. The point's horizontal location is specified by its x-value, and its vertical location is specified by its y-value. Together, these values are known as coordinates and are written (x, y).

Roles of variables
One variable plays the role of the explanatory or predictor variable, while the other takes on the role of the response or explained variable. We place the explanatory variable on the x-axis and the response variable on the y-axis. The x- and y-variables are sometimes referred to as the independent and dependent variables, respectively.

Understanding Correlation
When two quantitative variables have a linear association, a measure of how strong the association is is needed. This measure should not depend on the units of the variables, so standardized values are used. Since the x's and y's are paired, multiply each standardized value of x by the standardized value of the y it is paired with and add up those cross-products. Then divide by $n - 1$:

$$r = \frac{\sum z_x z_y}{n-1}$$

Remember that the z-score of a value $w$ is found via

$$z_w = \frac{w - \bar{w}}{s}$$

and $s$ is found by taking the square root of the sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (w_i - \bar{w})^2}{n-1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (w_i - \bar{w})^2}{n-1}}$$

Hence

$$z_w = \frac{w - \bar{w}}{\sqrt{\dfrac{\sum_{i=1}^{n} (w_i - \bar{w})^2}{n-1}}}$$

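Using

$$z_w = \frac{w - \bar{w}}{\sqrt{\dfrac{\sum_{i=1}^{n} (w_i - \bar{w})^2}{n-1}}}$$

we can rewrite

$$r = \frac{\sum z_x z_y}{n-1}$$

as

$$r = \frac{\sum \dfrac{x - \bar{x}}{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}} \cdot \dfrac{y - \bar{y}}{\sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}}}{n-1}$$

Summarizing just a little gives us

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \,(n-1)}$$

The square-root parts of the denominator are nothing but

$$s_x s_y = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} = \frac{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}{n-1}$$

So

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{s_x s_y \,(n-1)}$$

Hence the $(n-1)$ factors cancel nicely. This allows us to write

$$r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

which is our definition of correlation (as discussed yesterday). The ratio of the sum of the products $z_x z_y$ for every point in the scatterplot to $n - 1$ is called the correlation coefficient. Correlation measures the strength of the linear association between two quantitative variables.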
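As a quick numerical check of the derivation, the sketch below computes $r$ both ways: as the sum of z-score products divided by $n - 1$, and from the sums of deviations with the $(n-1)$ factors cancelled. The function names and the five data points are made up for illustration and do not come from the slides.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation: divide the squared deviations by n - 1.
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation_from_z_scores(xs, ys):
    # r = sum(z_x * z_y) / (n - 1)
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

def correlation_from_sums(xs, ys):
    # r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r_z = correlation_from_z_scores(xs, ys)
r_s = correlation_from_sums(xs, ys)
print(r_z, r_s)  # the two values agree up to floating-point rounding
```

For these points the sums are 6, 10, and 6, so both routes give $6/\sqrt{60} \approx 0.775$.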

Correlation conditions
Before you use correlation, you must check three conditions. Quantitative Variables Condition: correlation applies only to quantitative variables. Linearity Condition: correlation measures only the strength of the linear association. Outlier Condition: unusual observations can distort the correlation.

Correlation properties
The sign of a correlation coefficient gives the direction of the association. Correlation has no units. Correlation is always between –1 and +1. Correlation treats x and y symmetrically. Correlation is not affected by changes in the center or scale of either variable. Correlation is sensitive to unusual observations.
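Several of these properties can be checked numerically. The sketch below uses a hand-written `correlation` helper and made-up data (both are illustrative assumptions, not from the slides) to show symmetry, invariance under a shift and positive rescaling, and sensitivity to a single unusual point.

```python
import math

def correlation(xs, ys):
    # Pearson correlation from sums of deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(xs, ys)

# Symmetry: swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(ys, xs)

# Center and scale: shifting x and multiplying it by a positive constant
# (for example, a change of units) leaves r unchanged.
xs_rescaled = [10.0 * x + 32.0 for x in xs]
r_rescaled = correlation(xs_rescaled, ys)

# Sensitivity: one unusual observation far from the pattern changes r a lot.
r_with_outlier = correlation(xs + [20.0], ys + [0.0])

print(r, r_swapped, r_rescaled, r_with_outlier)
```

In this example the added point (20, 0) is enough to turn a positive correlation into a negative one, which is why the Outlier Condition has to be checked before quoting $r$.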

Correlation tables
Sometimes the correlations between each pair of variables in a data set are arranged in a table like the one below. Using what you have learned, interpret the numbers.
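The slide's table is not reproduced in this transcript. As a rough sketch of how such a table is built, the code below computes every pairwise correlation for three made-up columns; the column names `a`, `b`, `c`, their values, and the helper `correlation_table` are all hypothetical.

```python
import math

def correlation(xs, ys):
    # Pearson correlation from sums of deviations.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_table(columns):
    # Nested dict: table[row_name][column_name] -> correlation of the pair.
    names = list(columns)
    return {a: {b: correlation(columns[a], columns[b]) for b in names} for a in names}

# Made-up illustrative columns.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 5.0, 4.0, 5.0],
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],
}

table = correlation_table(data)
for name, row in table.items():
    print(name, [round(row[other], 3) for other in table])
```

When reading such a table, note that the diagonal is always 1 (each variable with itself) and that the table is symmetric, because correlation treats x and y symmetrically.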

Lurking variables & Causation
There is no way to conclude from a high correlation alone that one variable causes the other. There's always the possibility that some third variable, a lurking variable, is simultaneously affecting both of the variables you have observed.

EXAMPLE: Lurking variable
The scatterplot below shows Life Expectancy (average of men and women, in years) against Doctors per Person for 40 countries of the world. The correlation is strong, positive, and linear (r = 0.705). Should we send more doctors to developing countries to increase life expectancy?

Should we send more doctors to developing countries to increase life expectancy? No. Countries with higher standards of living have both longer life expectancies and more doctors. Standard of living is a lurking variable. Resist the temptation to conclude that x causes y from a correlation, no matter how obvious the conclusion may seem. Use common sense when interpreting numbers of any sort!
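A small simulation can make the lurking-variable effect concrete. In the sketch below, which uses purely simulated numbers and is not the life-expectancy data, a hypothetical variable `z` drives both `x` and `y`, while `x` never enters the formula for `y`. The two still end up strongly correlated, and the correlation largely disappears once the contribution of `z` is removed.

```python
import math
import random

def correlation(xs, ys):
    # Pearson correlation from sums of deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical lurking variable (think "standard of living" on an arbitrary scale).
z = [random.uniform(0.0, 10.0) for _ in range(200)]

# x and y each depend on z plus independent noise; x has no effect on y.
x = [zi + random.gauss(0.0, 1.0) for zi in z]
y = [zi + random.gauss(0.0, 1.0) for zi in z]

r_xy = correlation(x, y)

# Remove the part of x and y that comes from z and correlate what is left.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_rest = correlation(x_rest, y_rest)

print(round(r_xy, 2), round(r_rest, 2))
```

The first value is large even though changing `x` would do nothing to `y` in this setup, which is exactly the trap described above.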

A linear model
The scatterplot below shows Lowe's sales and total home improvement expenditures between 1985 and 2007. The relationship is strong, positive, and linear (r = 0.976).

We see that the points don't all line up, but that a straight line can summarize the general pattern. We call this line a linear model. A linear model describes the relationship between x and y.

This linear model can be used to predict sales from an estimate of residential improvement expenditures for the next year. We know the model won't be perfect, so we must consider how far the model's values are from the observed values.

Residuals
A linear model can be written in the form $\hat{y} = \beta_0 + \beta_1 x$, where $\beta_0$ and $\beta_1$ are numbers estimated from the data and $\hat{y}$ is the predicted value. The difference between the observed value, $y$, and the predicted value is called the residual and is denoted $\hat{u}$:

$$\hat{u} = y - \hat{y}$$
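To make the definition concrete, here is a minimal sketch that computes predicted values and residuals for a given line. The coefficients `beta0 = 1.0` and `beta1 = 2.0` and the three data points are hypothetical values chosen so the arithmetic is easy to follow.

```python
def predict(x, beta0, beta1):
    # Predicted value on the line: y_hat = beta0 + beta1 * x
    return beta0 + beta1 * x

def residual(x, y, beta0, beta1):
    # Residual = observed value minus predicted value.
    return y - predict(x, beta0, beta1)

# Hypothetical coefficients and observations (not from the slides).
beta0, beta1 = 1.0, 2.0
points = [(1.0, 3.5), (2.0, 4.0), (3.0, 7.0)]

residuals = [residual(x, y, beta0, beta1) for x, y in points]
print(residuals)  # [0.5, -1.0, 0.0]
```

A positive residual means the observation lies above the line (the model underpredicts), a negative residual means it lies below, and zero means the point is exactly on the line.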

In the computer usage model for 301 stores, the model predicts MIPS (Millions of Instructions Per Second) and the actual value is MIPS. We may compute the residual for 301 stores.

Line of best fit
Some residuals will be positive and some negative, so adding up all the residuals is not a good assessment of how well the line fits the data. If we consider the sum of the squares of the residuals, then the smaller the sum, the better the fit. The line of best fit is the line for which the sum of the squared residuals is smallest, often called the least squares line.
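The point about signs can be seen with two candidate lines through the same made-up data. In the sketch below (data and both lines are hypothetical), each line passes through the point of means, so the raw residuals sum to zero for both, and only the sum of squared residuals tells them apart.

```python
def residuals(xs, ys, beta0, beta1):
    # Observed minus predicted for each point.
    return [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Two hypothetical candidate lines, both passing through (mean x, mean y) = (3, 4).
line_a = (2.2, 0.6)  # sloped line
line_b = (4.0, 0.0)  # flat line at the mean of y

res_a = residuals(xs, ys, *line_a)
res_b = residuals(xs, ys, *line_b)

sum_a, sum_b = sum(res_a), sum(res_b)        # both are (essentially) zero
ssr_a = sum(e ** 2 for e in res_a)           # about 2.4
ssr_b = sum(e ** 2 for e in res_b)           # 6.0

print(sum_a, sum_b, ssr_a, ssr_b)
```

Judged by the plain sum, the two lines look equally good; judged by the sum of squares, line A fits clearly better.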

Explanation
In the model $\hat{y} = \beta_0 + \beta_1 x$, how do we determine $\beta_0$ and $\beta_1$? Choose $\beta_0$ and $\beta_1$ so that the (vertical) distances from the data points to the fitted line are minimised, that is, so that the line fits the data as closely as possible.

Richard Bußmann

Actual and Fitted Value
[Figure: the actual value $y_t$, the fitted value $\hat{y}_t$, and the residual $\hat{u}_t$ at $x_t$]

Problems
If we merely minimize the distance, where do we run into problems? Some distances are positive and others negative, so they can cancel each other out, because we aren't looking at absolute values.

Ordinary Least Squares
The most common method used to fit a line to the data is known as OLS (ordinary least squares). What we actually do is take each distance and square it. Then we compute the respective areas, sum them, and draw the line that minimizes the total sum of the squares (hence least squares).
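For a single explanatory variable, minimizing the sum of squared residuals has a well-known closed-form solution: the slope is $\sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and the intercept is $\bar{y}$ minus the slope times $\bar{x}$. The sketch below applies these formulas to made-up data and checks that nudging either coefficient makes the sum of squares larger. The function names and data are illustrative assumptions, not the slides' own code.

```python
def fit_least_squares(xs, ys):
    # Closed-form OLS estimates for the line y_hat = beta0 + beta1 * x.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

def sum_squared_residuals(xs, ys, beta0, beta1):
    # Total area of the squares built on the vertical distances.
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0, beta1 = fit_least_squares(xs, ys)  # about 2.2 and 0.6 for these points
best = sum_squared_residuals(xs, ys, beta0, beta1)

# Any other line does worse, for example after nudging one coefficient.
worse_slope = sum_squared_residuals(xs, ys, beta0, beta1 + 0.1)
worse_intercept = sum_squared_residuals(xs, ys, beta0 + 0.1, beta1)

print(beta0, beta1, best, worse_slope, worse_intercept)
```

Statistical software reports the same intercept and slope; the closed form is used here only because it keeps the sketch free of external libraries.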

SALES & Price example
A linear model to predict weekly Sales of frozen pizza (in pounds) from the average price (\$/unit) charged by a sample of stores in Dallas in 39 recent weeks is [equation not shown]. What is the explanatory variable? What is the response variable? What does the slope mean in this context? Is the y-intercept meaningful in this context?

What is the explanatory variable? Average Price. What is the response variable? Sales. What does the slope mean in this context? Sales decrease by 24, pounds per dollar. Is the y-intercept meaningful in this context? It means nothing because stores will not set their price to \$0.

What is the predicted Sales if the average price charged was \$3.50 for a pizza? If the sales for a price of \$3.50 turned out to be 60,000 pounds, what would the residual be?
