SCATTERPLOTS A scatterplot, which plots one quantitative variable against another, can be an effective display for data. Scatterplots are the ideal way to picture associations between two quantitative variables.
DIRECTION IN SCATTERPLOTS The direction of the association is important. A pattern that runs from the upper left to the lower right is said to be negative. A pattern running from the lower left to the upper right is called positive. Look for direction: What’s the sign – positive, negative or neither?
FORM OF SCATTERPLOTS The second thing to look for in a scatterplot is its form. If there is a straight-line relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form. This is called linear form. Sometimes the relationship curves gently, while still increasing or decreasing steadily; sometimes it curves sharply up then down. Look for form: Is it straight, curved, something else, or is there no pattern?
STRENGTH OF THE RELATIONSHIP The third feature to look for in a scatterplot is the strength of the relationship. Do the points appear tightly clustered in a single stream? Or do the points seem to be so variable and spread out that we can barely notice any trend or pattern? Look for strength: How scattered are the data?
OUTLIERS IN SCATTERPLOTS Finally, always look for the unexpected. An outlier is an unusual observation, standing away from the overall pattern of the scatterplot. Look for unusual features: Are there unusual observations, points or subgroups?
EXAMPLE: DESCRIBING A SCATTERPLOT The Texas Transportation Institute issues an annual report on traffic congestion and its cost to society and business. Describe the scatterplot of Congestion Cost against Freeway Speed.
DESCRIBING A SCATTERPLOT The scatterplot of Congestion Cost against Freeway Speed is roughly linear, negative, and strong. As the Peak Period Freeway Speed (mph) increases, the Congestion Cost per person tends to decrease.
2ND EXAMPLE: BOOKSTORE Data gathered from a bookstore show Number of Sales People Working and Sales (in $1000). Given the scatterplot, describe the direction, form, and strength of the relationship. Are there any outliers?
2ND EXAMPLE: BOOKSTORE The relationship between Number of Sales People Working and Sales is positive, linear, and strong. As the Number of Sales People Working increases, Sales tends to increase also. There are no outliers.
PROPER SCATTERPLOTS To make a scatterplot of two quantitative variables, assign one variable to the y- and the other to the x-axis. Be sure to label the axes clearly, and indicate the scales of the axes with numbers. Each variable has units, and these should appear with the display—usually near each axis.
PROPER SCATTERPLOTS Each point is placed on a scatterplot at a position that corresponds to the values of the two variables. The point’s horizontal location is specified by its x-value, and its vertical location is specified by its y-value. Together, these values are known as coordinates and written (x, y).
ROLES OF VARIABLES One variable plays the role of the explanatory or predictor variable, while the other takes on the role of the response / explained variable. We place the explanatory variable on the x-axis and the response variable on the y-axis. The x- and y-variables are sometimes referred to as the independent and dependent variables, respectively.
CORRELATION Understanding Correlation
CORRELATION CONDITIONS Before you use correlation, you must check three conditions. Quantitative Variables Condition: correlation applies only to quantitative variables. Linearity Condition: correlation measures only the strength of the linear association. Outlier Condition: unusual observations can distort the correlation.
CORRELATION PROPERTIES The sign of a correlation coefficient gives the direction of the association. Correlation has no units. Correlation is always between –1 and +1. Correlation treats x and y symmetrically. Correlation is not affected by changes in the center or scale of either variable. Correlation is sensitive to unusual observations.
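The properties listed above can be checked numerically. The following is a minimal sketch in Python, assuming the usual Pearson formula for r; the function name `correlation` and all data values are invented for illustration and do not come from the slides.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = correlation(x, y)

# Symmetry: swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(y, x)

# Changing center and scale (multiply by 10, add 32) leaves r unchanged.
x_rescaled = [10 * v + 32 for v in x]
r_rescaled = correlation(x_rescaled, y)

# Sensitivity: one unusual observation at (20, 0) changes r drastically.
r_outlier = correlation(x + [20.0], y + [0.0])

print(r, r_swapped, r_rescaled, r_outlier)
```

With these made-up values, r is positive and close to +1, it stays the same after swapping or rescaling, and the single added outlier is enough to turn the correlation negative.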
CORRELATION TABLES Sometimes the correlations between each pair of variables in a data set are arranged in a table like the one below. Using what you have learned, interpret the numbers.
LURKING VARIABLES & CAUSATION There is no way to conclude from a high correlation alone that one variable causes the other. There’s always the possibility that some third variable—a lurking variable—is simultaneously affecting both of the variables you have observed.
EXAMPLE: LURKING VARIABLE The scatterplot below shows Life Expectancy (average of men and women, in years) against Doctors per Person for 40 countries of the world. The correlation is strong, positive, and linear (r = 0.705). Should we send more doctors to developing countries to increase life expectancy?
EXAMPLE: LURKING VARIABLE Should we send more doctors to developing countries to increase life expectancy? No. Countries with higher standards of living have both longer life expectancies and more doctors. Standard of living is a lurking variable. Resist the temptation to conclude that x causes y from a correlation, no matter how obvious the conclusion may seem. Use common sense when interpreting numbers of any sort!
A LINEAR MODEL The scatterplot below shows Lowe’s sales and total home improvement expenditures between 1985 and The relationship is strong, positive, and linear (r = 0.976).
A LINEAR MODEL We see that the points don’t all line up, but that a straight line can summarize the general pattern. We call this line a linear model. A linear model describes the relationship between x and y.
A LINEAR MODEL This linear model can be used to predict sales from an estimate of residential improvement expenditures for the next year. We know the model won’t be perfect, so we must consider how far the model’s values are from the observed values.
In the computer usage model for 301 stores, the model predicts MIPS (Millions of Instructions Per Second) and the actual value is MIPS. We may compute the residual for 301 stores.
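A residual is the observed value minus the value the model predicts. The sketch below illustrates the calculation in Python. The two numbers are hypothetical placeholders, not the values from the computer usage example, which are missing from the slide text.

```python
def residual(observed, predicted):
    """Residual = observed value minus the value predicted by the model."""
    return observed - predicted

# Hypothetical placeholder values (NOT the figures from the slide).
predicted_value = 250.0
observed_value = 235.5

e = residual(observed_value, predicted_value)
print(e)  # negative: the model predicted more than was observed
```

A negative residual means the model overestimated the observed value; a positive residual means it underestimated.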
LINE OF BEST FIT Some residuals will be positive and some negative, so adding up all the residuals is not a good assessment of how well the line fits the data. If we consider the sum of the squares of the residuals, then the smaller the sum, the better the fit. The line of best fit is the line for which the sum of the squared residuals is smallest – often called the least squares line.
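The least squares criterion can be written compactly. The notation below (b_0 for the intercept, b_1 for the slope, e_i for the i-th residual, s_x and s_y for the standard deviations) is an assumption made for this sketch and may differ from the notation on the original slides.

```latex
% Residual of the i-th observation for the line \hat{y} = b_0 + b_1 x
e_i = y_i - \hat{y}_i, \qquad \hat{y}_i = b_0 + b_1 x_i

% Least squares: choose b_0 and b_1 to minimize the sum of squared residuals
\min_{b_0,\, b_1} \; \sum_{i=1}^{n} e_i^2
  = \min_{b_0,\, b_1} \; \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2

% Standard solution in terms of the correlation r
b_1 = r \, \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}
```

Squaring makes every term non-negative, so positive and negative residuals can no longer cancel each other.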
EXPLANATION Richard Bußmann 31
ACTUAL AND FITTED VALUE
PROBLEMS If we merely minimize the sum of the distances between the points and the line, where do we run into problems? Some distances are positive and others negative, so they can cancel each other out. We are not looking at absolute values.
ORDINARY LEAST SQUARES The most common method used to fit a line to the data is known as OLS (ordinary least squares). We take each distance and square it, which can be pictured as the area of a square with that side length. We then sum these areas and choose the line that minimizes the total sum of the squares (hence least squares).
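The procedure can be sketched in a few lines of Python, assuming the standard closed-form solution for a line with one explanatory variable. The function names `least_squares_line` and `sum_squared_residuals` and the data are hypothetical, chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

b0, b1 = least_squares_line(x, y)
sse_best = sum_squared_residuals(x, y, b0, b1)

# Any other line has a larger sum of squares, e.g. a steeper or a shifted one.
sse_steeper = sum_squared_residuals(x, y, b0, b1 + 0.1)
sse_shifted = sum_squared_residuals(x, y, b0 + 0.5, b1)

# The raw residuals of the least squares line cancel out (sum is about 0),
# which is why their plain sum cannot be used to judge the fit.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(b0, b1, sse_best, sse_steeper, sse_shifted, sum(residuals))
```

Comparing `sse_best` with the two alternatives shows the defining property of the least squares line: moving the slope or the intercept away from the fitted values only increases the sum of squared residuals.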
SALES & PRICE EXAMPLE A linear model to predict weekly Sales of frozen pizza (in pounds) from the average price ($/unit) charged by a sample of stores in Dallas in 39 recent weeks is What is the explanatory variable? What is the response variable? What does the slope mean in this context? Is the y-intercept meaningful in this context?
What is the explanatory variable? Average Price. What is the response variable? Sales. What does the slope mean in this context? Sales decrease by 24, pounds per dollar. Is the y-intercept meaningful in this context? No, because stores will not set their price to $0.
SALES & PRICE EXAMPLE What is the predicted Sales if the average price charged was $3.50 for a pizza? If the sales for a price of $3.50 turned out to be 60,000 pounds, what would the residual be?