Regression line – Fitting a line to data


Regression line – Fitting a line to data

If the scatter plot shows a clear linear pattern, a straight line through the points can describe the overall pattern. Fitting a line means drawing a line that is as close as possible to the points: the "best" straight line is the regression line.

Simple Example: Productivity level

To see how productivity was related to level of maintenance, a firm randomly selected 5 of its high-speed machines for an experiment. Each machine was randomly assigned a different level of maintenance X (hours per week) and then had its average number of stoppages Y recorded. These are the data:

Hours of maintenance X:    4     6     8     10    12
Average interruptions Y:   1.6   1.2   1.1   0.5   0.6

Summary statistics: Ave(x) = 8, s(x) = 3.16; Ave(y) = 1, s(y) = 0.45; correlation coefficient r = -0.94.

[Scatter plot: hours of maintenance (X) on the horizontal axis, number of interruptions (Y) on the vertical axis, showing a clear negative linear pattern.]
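To see where these numbers come from, here is a minimal Python sketch (assuming NumPy and the five data pairs from the table above; variable names are illustrative) that reproduces the summary statistics:

    import numpy as np

    # Maintenance data: hours of maintenance (x) and average number of interruptions (y)
    x = np.array([4.0, 6.0, 8.0, 10.0, 12.0])
    y = np.array([1.6, 1.2, 1.1, 0.5, 0.6])

    print(x.mean(), x.std(ddof=1))    # 8.0, about 3.16 (sample standard deviation)
    print(y.mean(), y.std(ddof=1))    # 1.0, about 0.45
    print(np.corrcoef(x, y)[0, 1])    # about -0.94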

Least-squares regression line

Definition: The regression line of y on x is the line that makes the sum of the squares of the vertical distances (deviations) of the data points from the line as small as possible. It is defined as

    ŷ = a + b x,   with slope b = r * s(y) / s(x)   and intercept a = ave(y) - b * ave(x).

We use ŷ to distinguish the values predicted from the regression line from the observed values y; ŷ can be read as the average value of y at a given x. Note: b has the same sign as r.
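A small sketch of how the slope and intercept follow from the summary statistics (illustrative code, not from the slides):

    def regression_line(x_bar, y_bar, s_x, s_y, r):
        """Least-squares slope and intercept from summary statistics."""
        b = r * s_y / s_x        # slope: b = r * sd(y) / sd(x)
        a = y_bar - b * x_bar    # intercept: a = ave(y) - b * ave(x)
        return a, b

    # Maintenance example: ave(x)=8, s(x)=3.16, ave(y)=1, s(y)=0.45, r=-0.94
    a, b = regression_line(8, 1, 3.16, 0.45, -0.94)
    print(b, a)    # about -0.134 and 2.07 (the slides round to -0.135 and 2.08)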

Example (cont.)

The regression line of the number of interruptions on the hours of maintenance per week is calculated as follows. The descriptive statistics for x and y are: Ave(x) = 8, s(x) = 3.16; Ave(y) = 1, s(y) = 0.45; and r = -0.94.

Slope: b = r * s(y) / s(x) = -0.135
Intercept: a = ave(y) - b * ave(x) = 1 - (-0.135) * 8 = 2.08

Regression line: ŷ = 2.08 - 0.135 x = 2.08 - 0.135 * hours
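As a rough check, fitting the line directly to the five data pairs from the table above (for example with NumPy's polyfit) recovers the same coefficients:

    import numpy as np

    x = np.array([4.0, 6.0, 8.0, 10.0, 12.0])
    y = np.array([1.6, 1.2, 1.1, 0.5, 0.6])

    b, a = np.polyfit(x, y, 1)    # degree-1 least-squares fit: returns (slope, intercept)
    print(a, b)                   # 2.08 and -0.135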

Regression line: ŷ = 2.08 - 0.135 * hours

If the slope is positive, Y increases linearly with X; the slope value is the increase in Y for an increase of one unit in X. If the slope is negative, Y decreases linearly with X; the slope value is the decrease in Y for an increase of one unit in X.

[Scatter plot of number of interruptions against hours of maintenance (r = -0.94), with the fitted line, one residual marked, and the point of averages highlighted.]

The slope is b = -0.135: if you increase the maintenance schedule by one hour, the average number of stoppages will decrease by 0.135.

Residuals

For a given x, use the regression line to predict the response. The accuracy of the prediction depends on how spread out the observations are around the line.

[Diagram: for one value of x, the observed value y, the predicted value ŷ on the line, and the error (vertical distance) between them.]

Example: CPU Usage

A study was conducted to examine what factors affect CPU usage. A set of 38 processes written in a programming language was considered. For each program, data were collected on the CPU usage in seconds and the number of lines (in thousands) of the program.

[Scatter plot: CPU usage against number of lines.]

The scatter plot shows a clear positive association. We'll fit a regression line to model the association!

[Summary statistics table: N, mean, standard deviation, sum, minimum, and maximum for Y = CPU time and X = lines; the correlation coefficient and the fitted regression line are also shown on the slide.]

Goodness of fit measures

1. Coefficient of determination R² = (correlation coefficient)²
   - Describes how good the regression line is at explaining the response y.
   - It is the fraction of the variation in the values of y that is explained by the regression line of y on x.
   - It varies between 0 and 1. Values close to 1 mean the regression line provides a good explanation of the data; values close to zero mean the regression line is not able to capture the variability in the data.
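A minimal sketch of how R² can be computed from the residuals (illustrative code, not from the slides); for a simple linear regression it agrees with the squared correlation coefficient:

    import numpy as np

    def r_squared(y, y_hat):
        """Fraction of the variation in y explained by the regression line."""
        ss_res = np.sum((y - y_hat) ** 2)        # "left-over" variation around the line
        ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
        return 1 - ss_res / ss_tot

    x = np.array([4.0, 6.0, 8.0, 10.0, 12.0])
    y = np.array([1.6, 1.2, 1.1, 0.5, 0.6])
    print(r_squared(y, 2.08 - 0.135 * x))    # about 0.89
    print(np.corrcoef(x, y)[0, 1] ** 2)      # about 0.89 as well (R² = r²)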

EXAMPLE (cont.): The correlation coefficient is r = -0.94, so R² = (-0.94)² = 0.883. The regression line is able to capture 88.3% of the variability in the data. R² is computed by the Excel function RSQ.

2. Residuals

The vertical distances between the observed points and the regression line can be regarded as the "left-over" variation in the response after fitting the regression line. A residual is the difference between an observed value of the response variable y and the value predicted by the regression line:

    Residual e = observed y - predicted y = y - ŷ = y - (a + b x)

A special property: the average of the residuals is always zero.
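For the maintenance example, the residuals and their zero average can be checked with a few lines of Python (illustrative sketch, using the data from the earlier table):

    import numpy as np

    x = np.array([4.0, 6.0, 8.0, 10.0, 12.0])
    y = np.array([1.6, 1.2, 1.1, 0.5, 0.6])

    a, b = 2.08, -0.135
    y_hat = a + b * x          # values predicted by the regression line
    residuals = y - y_hat      # observed minus predicted
    print(residuals)           # approximately [ 0.06 -0.07  0.10 -0.23  0.14]
    print(residuals.mean())    # essentially zero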

EXAMPLE: Residuals for the regression line ŷ = 2.08 - 0.135 x of the number of interruptions Y on the hours of maintenance X.

Hours X   Average interr. Y   Predicted interr. ŷ         Residual y - (a + b x)
   4            1.6           2.08 - 0.135*4  = 1.54      1.6 - 1.54 =  0.06
   6            1.2           2.08 - 0.135*6  = 1.27      1.2 - 1.27 = -0.07
   8            1.1           2.08 - 0.135*8  = 1.00      1.1 - 1.00 =  0.10
  10            0.5           2.08 - 0.135*10 = 0.73      0.5 - 0.73 = -0.23
  12            0.6           2.08 - 0.135*12 = 0.46      0.6 - 0.46 =  0.14

                                                           Average = 0

3. Accuracy of the predictions

If the cloud of points is football-shaped, the prediction errors are similar along the regression line. One possible measure of the accuracy of the regression predictions is given by the root mean square error (r.m.s. error). The r.m.s. error is defined as the square root of the average squared residual:

    r.m.s. error = sqrt( sum of squared residuals / (n - 2) )

(the average here uses n - 2, as in the worked example below). This is an estimate of the variation of y about the regression line. It is computed by the Excel function STEYX.
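A small helper that computes the r.m.s. error this way (dividing by n - 2, which is also what Excel's STEYX does); the function name is illustrative:

    import numpy as np

    def rms_error(y, y_hat):
        """Root mean square error of the regression (divides by n - 2,
        matching the worked example below and Excel's STEYX)."""
        residuals = y - y_hat
        return np.sqrt(np.sum(residuals ** 2) / (len(y) - 2))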

Roughly 68% of the points lie within 1 r.m.s. error of the regression line; roughly 95% of the points lie within 2 r.m.s. errors.

Computing the r.m.s. error:

Hours X   Average interr. Y   Predicted interr.   Residual   Squared residual
   4            1.6                1.54              0.06         0.0036
   6            1.2                1.27             -0.07         0.0049
   8            1.1                1.00              0.10         0.0100
  10            0.5                0.73             -0.23         0.0529
  12            0.6                0.46              0.14         0.0196

                                                     Total        0.0911

The r.m.s. error is sqrt(0.0911 / 3) = 0.174. If the company schedules 7 hours of maintenance per week, the predicted weekly number of interruptions of the machine will be ŷ = 2.08 - 0.135 * 7 = 1.135 on average. Using the r.m.s. error, the number of interruptions will most likely be between 1.135 - 2*0.174 = 0.787 and 1.135 + 2*0.174 = 1.483.
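The same calculation in a self-contained Python sketch (values as in the table above; the band uses the rough 95% rule from the previous slide):

    import numpy as np

    x = np.array([4.0, 6.0, 8.0, 10.0, 12.0])
    y = np.array([1.6, 1.2, 1.1, 0.5, 0.6])
    a, b = 2.08, -0.135

    residuals = y - (a + b * x)
    rms = np.sqrt(np.sum(residuals ** 2) / (len(y) - 2))   # about 0.174

    pred = a + b * 7                          # about 1.135 interruptions per week
    print(pred - 2 * rms, pred + 2 * rms)     # roughly 0.79 to 1.48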

Detecting problems in the regression analysis: residual plots

The analysis of the residuals is useful to detect possible problems and anomalies in the regression. A residual plot is a scatter plot of the regression residuals against the explanatory variable. Points should be randomly scattered inside a band centered around the horizontal line at zero (the mean of the residuals).
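A minimal sketch of a residual plot for the maintenance example (assuming Matplotlib is available):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([4.0, 6.0, 8.0, 10.0, 12.0])
    y = np.array([1.6, 1.2, 1.1, 0.5, 0.6])
    residuals = y - (2.08 - 0.135 * x)

    plt.scatter(x, residuals)
    plt.axhline(0, linestyle="--")     # horizontal line at zero, the mean of the residuals
    plt.xlabel("Hours of maintenance (x)")
    plt.ylabel("Residual")
    plt.show()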

[Residual plots. "Good case": residuals randomly scattered around zero. "Bad cases": a nonlinear relationship; the variation of y changing with x.]

Anomalies in the regression analysis

If the residual plot displays a curve, the straight line is not a good description of the association between x and y. If the residual plot is fan-shaped, the variation of y is not constant. In the fan-shaped case above, predictions of y will be less precise as x increases, since y shows higher variability for higher values of x.

Example: CPU usage data

[Residual plot for the CPU usage regression.]

Do you see any striking pattern?

Example: 100-meter dash

At the 1987 World Championship in Rome, Ben Johnson set a new world record in the 100-meter dash. The data: Y = the elapsed time from the start of the race, recorded in 10-meter increments, for Ben Johnson; X = meters.

[Table of the data with averages, standard deviations, and the correlation; scatter plot of Johnson's elapsed times against meters.]

Regression Line

[Scatter plot of elapsed time against meters with the fitted regression line.]

The fitted regression line is ŷ = a + b * meters (coefficients shown on the slide). The value of R² is 0.999; therefore 99.9% of the variability in the data is explained by the regression line.

Residual Plot

[Residuals plotted against meters.]

Does the graph show any anomaly?

Outliers and Influential Points

An outlier is an observation that lies outside the overall pattern of the other observations.

[Scatter plot with one point marked as an outlier: it has a large residual.]

Influential Point

An observation is influential for the regression line if removing it would considerably change the fitted line. An influential point pulls the regression line towards itself.

[Scatter plot showing an influential point, the fitted regression line, and the regression line that would result if the influential point were omitted.]
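A small illustration with hypothetical data (not from the slides): fitting the line with and without a single extreme point shows how much that one point can change the slope.

    import numpy as np

    # Hypothetical data: ten points roughly on a line, plus one extreme point
    rng = np.random.default_rng(0)
    x = np.arange(1.0, 11.0)
    y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, size=10)

    x_with = np.append(x, 25.0)    # influential point: extreme x value ...
    y_with = np.append(y, 2.0)     # ... with a y value far below the overall pattern

    print(np.polyfit(x, y, 1))            # slope close to 0.5
    print(np.polyfit(x_with, y_with, 1))  # slope pulled sharply down by the single added point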

Example: house prices in Albuquerque

The regression line of the selling price on the annual tax is fitted to the data (equation and coefficient of determination R² shown on the slide).

[Scatter plot: selling price against annual tax, with the fitted regression line.]

What does the value of R² say? Are there any influential points?

New analysis: omitting the influential points

[Scatter plot: selling price against annual tax, showing the new regression line and, for comparison, the previous regression line.]

The regression line is refitted without the influential points (equation shown on the slide). The coefficient of determination is R² = 0.82: the new regression line explains 82% of the variation in y.

Summary – Warnings

1. Correlation measures linear association; the regression line should be used only when the association is linear.
2. Extrapolation: do not use the regression line to predict values outside the observed range of x; such predictions are not reliable.
3. The correlation and the regression line are sensitive to influential / extreme points.
4. Check residual plots to detect anomalies and "hidden" patterns that are not captured by the regression line.