Presentation on theme: "Chapter 3 Linear Regression and Correlation"— Presentation transcript:
1 Chapter 3 Linear Regression and Correlation Descriptive Analysis & Presentation of Two Quantitative Data
2 Chapter Objectives: Present two-variable data in tabular and graphic form. Display the relationship between two quantitative variables graphically using a scatter diagram. Calculate and interpret the linear correlation coefficient. Discuss the basic idea of fitting the scatter diagram with a best-fitting line called a linear regression line. Create and interpret the linear regression line.
3 Terminology: Data for a single variable is univariate data. Many or most real-world models have more than one variable: multivariate data. In this chapter we will study the relations between two variables: bivariate data.
4 Bivariate Data: In many studies, we measure more than one variable for each individual. Some examples are: rainfall amounts and plant growth; exercise and cholesterol levels for a group of people; height and weight for a group of people.
5 Types of Relations: When we have two variables, they could be related in one of several different ways. They could be unrelated. One variable (the input, explanatory, or predictor variable) could be used to explain the other (the output, response, or dependent variable). One variable could be thought of as causing the other variable to change. Note: When two variables are related to each other, one variable may not cause the change in the other variable. Relation does not always mean causation.
6 Lurking Variable: Sometimes it is not clear which variable is the explanatory variable and which is the response variable. Sometimes the two variables are related without either one being an explanatory variable. Sometimes the two variables are both affected by a third variable, a lurking variable, that was not included in the study.
7 Example 1: An example of a lurking variable. A researcher studies a group of elementary school children. Y = the student’s height; X = the student’s shoe size. It is not reasonable to claim that shoe size causes height to change. The lurking variable of age affects both of these variables.
8 More Examples: Some other examples. Rainfall amounts and plant growth: explanatory variable: rainfall; response variable: plant growth; possible lurking variable: amount of sunlight. Exercise and cholesterol levels: explanatory variable: amount of exercise; response variable: cholesterol level; possible lurking variable: diet.
9 Types of Bivariate Data: Three combinations of variable types: 1. Both variables are qualitative (attribute). 2. One variable is qualitative (attribute) and the other is quantitative (numerical). 3. Both variables are quantitative (both numerical).
10 Two Qualitative Variables: When bivariate data results from two qualitative (attribute or categorical) variables, the data is often arranged in a cross-tabulation or contingency table. Example: A survey was conducted to investigate the relationship between gender and preferences for television, radio, or newspaper for national news. The results are given in the table below:
11 Marginal Totals: This table may be extended to display the marginal totals (or marginals). The total of the marginal totals is the grand total:
              TV     Radio   NP     Row Totals
Male          280    175     305    760
Female        115    275     170    560
Col. Totals   395    450     475    1320
Note: Contingency tables often show percentages (relative frequencies). These percentages are based on the entire sample or on the subsample (row or column) classifications.
12 Percentages Based on the Grand Total (Entire Sample): The previous contingency table may be converted to percentages of the grand total by dividing each frequency by the grand total and multiplying by 100. For example, 175 becomes 13.3%: $\left(\frac{175}{1320}\right) \times 100 = 13.3$
              TV     Radio   NP     Row Totals
Male          21.2   13.3    23.1   57.6
Female        8.7    20.8    12.9   42.4
Col. Totals   29.9   34.1    36.0   100.0
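The conversion described on this slide can be sketched in a few lines of Python. This is only an illustration: the counts are the ones from the survey table, while the dictionary layout and the function name `percent_of_grand_total` are assumptions made for the example.

```python
# Observed counts from the media-preference survey (rows: gender, columns: medium).
counts = {
    "Male": {"TV": 280, "Radio": 175, "NP": 305},
    "Female": {"TV": 115, "Radio": 275, "NP": 170},
}


def percent_of_grand_total(table):
    """Express every cell as a percentage of the grand total, rounded to 0.1."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {
        label: {col: round(100 * n / grand_total, 1) for col, n in row.items()}
        for label, row in table.items()
    }


grand_total = sum(sum(row.values()) for row in counts.values())
grand_pct = percent_of_grand_total(counts)

for label, row in grand_pct.items():
    print(label, row)
```

The cell for Male and Radio is the one worked out on the slide: 175 divided by 1320, times 100.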
13 Percentages Based on Grand Total Illustration: These same statistics (numerical values describing sample results) can be shown in a (side-by-side) bar graph. [Figure: side-by-side bar graph "Percentages Based on Grand Total"; horizontal axis: Media (TV, Radio, NP); vertical axis: Percent; one bar each for Male and Female.]
14 Percentages Based on Row (Column) Totals: The entries in a contingency table may also be expressed as percentages of the row (column) totals by dividing each row (column) entry by that row’s (column’s) total and multiplying by 100. The entries in the contingency table below are expressed as percentages of the column totals: Note: These statistics may also be displayed in a side-by-side bar graph.
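The column-percentage table did not survive in this transcript, so the sketch below recomputes it from the survey counts. It is a minimal illustration, assuming the same counts as the marginal-totals slide; the variable names are chosen for the example only.

```python
# Observed counts from the media-preference survey.
counts = {
    "Male": {"TV": 280, "Radio": 175, "NP": 305},
    "Female": {"TV": 115, "Radio": 275, "NP": 170},
}
columns = ["TV", "Radio", "NP"]

# Column totals: add the counts in each column over all rows.
col_totals = {c: sum(row[c] for row in counts.values()) for c in columns}

# Each entry divided by its own column total, times 100, rounded to 0.1.
col_pct = {
    label: {c: round(100 * row[c] / col_totals[c], 1) for c in columns}
    for label, row in counts.items()
}

for label, row in col_pct.items():
    print(label, row)
```

Each column of the result adds up to 100, which is a quick way to tell column percentages apart from percentages of the grand total.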
15 One Qualitative & One Quantitative Variable: 1. When bivariate data results from one qualitative and one quantitative variable, the quantitative values are viewed as separate samples. 2. Each set is identified by levels of the qualitative variable. 3. Each sample is described using summary statistics, and the results are displayed for side-by-side comparison. 4. Statistics for comparison: measures of central tendency, measures of variation, 5-number summary. 5. Graphs for comparison: side-by-side stemplot and boxplot.
16 Example: A random sample of households from three different parts of the country was obtained and their electric bill for June was recorded. The data is given in the table below: The part of the country is a qualitative variable with three levels of response. The electric bill is a quantitative variable. The electric bills may be compared with numerical and graphical techniques.
17 Comparison Using Box-and-Whisker Plots: [Figure: box-and-whisker plots "The Monthly Electric Bill"; axis: Electric Bill.] The electric bills in the Northeast tend to be more spread out than those in the Midwest. The bills in the West tend to be higher than those in both the Northeast and the Midwest.
18 Descriptive Statistics for Two Quantitative Variables Scatter Diagrams and correlation coefficient
19 Two Quantitative Variables: The most useful graph to show the relationship between two quantitative variables is the scatter diagram. Each individual is represented by a point in the diagram. The explanatory (X) variable is plotted on the horizontal scale. The response (Y) variable is plotted on the vertical scale.
20 Example: In a study involving children’s fear related to being hospitalized, the age and the score each child made on the Child Medical Fear Scale (CMFS) are given in the table below: Construct a scatter diagram for this data.
21 Child Medical Fear Scale Solution: age = input variable, CMFS = output variable. [Figure: scatter diagram "Child Medical Fear Scale"; axes: Age and CMFS.]
22 Another Example: An example of a scatter diagram. Note: the vertical scale is truncated to show the relation in detail!
23 Types of Relations: There are several different types of relations between two variables. A relationship is linear when, plotted on a scatter diagram, the points follow the general pattern of a line. A relationship is nonlinear when, plotted on a scatter diagram, the points follow a general pattern, but it is not a line. A relationship has no correlation when, plotted on a scatter diagram, the points do not show any pattern.
24 Linear Correlations: Linear relations or linear correlations have points that cluster around a line. Linear relations can be either positive (the points slant upwards to the right) or negative (the points slant downwards to the right).
25 Positive Correlations: For positive (linear) correlation: Above-average values of one variable are associated with above-average values of the other (above/above, the points trend right and upwards). Below-average values of one variable are associated with below-average values of the other (below/below, the points trend left and downwards).
26 Example: Positive Correlation. As x increases, y also increases. [Figure: scatter diagram; axes: Input and Output.]
27 Negative Correlations: For negative (linear) correlation: Above-average values of one variable are associated with below-average values of the other (above/below, the points trend right and downwards). Below-average values of one variable are associated with above-average values of the other (below/above, the points trend left and upwards).
28 Example: Negative Correlation. As x increases, y decreases. [Figure: scatter diagram; axes: Input and Output.]
29 Nonlinear Correlations: Nonlinear relations have points that have a trend, but not around a line. The trend has some bend in it.
30 No Correlations: When two variables are not related: There is no linear trend. There is no nonlinear trend. Changes in values for one variable do not seem to have any relation with changes in the other.
31 Example: No Correlation. As x increases, there is no definite shift in y. [Figure: scatter diagram; axes: Input and Output.]
32 Distinction between Nonlinear & No Correlation: Nonlinear relations and no relations are very different. Nonlinear relations are definitely patterns, just not patterns that look like lines. No relations are when no patterns appear at all.
33 Example: Examples of nonlinear relations: “Age” and “Height” for people (including both children and adults); “Temperature” and “Comfort level” for people. Examples of no relations: “Temperature” and “Closing price of the Dow Jones Industrials Index” (probably); “Age” and “Last digit of telephone number” for adults.
34 Please Note: Perfect positive correlation: all the points lie along a line with positive slope. Perfect negative correlation: all the points lie along a line with negative slope. If the points lie along a horizontal or vertical line: no correlation. If the points exhibit some other nonlinear pattern: nonlinear relationship. We need some way to measure the strength of correlation.
36 Measure of Linear Correlation: The linear correlation coefficient is a measure of the strength of the linear relation between two quantitative variables. The sample correlation coefficient “r” is $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\sqrt{s_x^2\, s_y^2}}$. Note: $\bar{x}$, $\bar{y}$ and $s_x^2$, $s_y^2$ are the sample means and sample variances of the two variables X and Y.
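As a rough illustration of how r is built from the sample means and sample variances, here is a small Python sketch. The formula on the original slide is not legible in this transcript, so the code assumes the standard definitional form of the sample correlation coefficient. The five data pairs are hypothetical and are not taken from the chapter.

```python
from math import sqrt


def sample_mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    """Sample variance with the n - 1 divisor."""
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def correlation(xs, ys):
    """Sample correlation: sum of cross-deviations over (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = sample_mean(xs), sample_mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * sqrt(sample_variance(xs) * sample_variance(ys)))


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(round(r, 2))
```

For these made-up points the sum of cross-deviations is 6 and the two sample variances are 2.5 and 1.5, so r is 6 divided by 4 times the square root of 3.75, a moderately strong positive value.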
37 Properties of Linear Correlation Coefficients: Some properties of the linear correlation coefficient: r is a unitless measure (so r would be the same for a data set whether x and y are measured in feet, inches, meters, etc.). r is always between -1 and +1. r = -1: perfect negative correlation. r = +1: perfect positive correlation. Positive values of r correspond to positive relations. Negative values of r correspond to negative relations.
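The unitless property can be checked numerically. The sketch below is illustrative only: the heights and weights are hypothetical, and `pearson_r` is a helper written for this example rather than something from the chapter. It computes r once in feet and pounds and once in inches and kilograms.

```python
from math import sqrt, isclose


def pearson_r(xs, ys):
    """Sample correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical measurements, for illustration only.
heights_ft = [5.0, 5.5, 6.0, 5.2, 5.8]
weights_lb = [120, 150, 180, 135, 170]

# The same measurements expressed in different units.
heights_in = [12 * h for h in heights_ft]
weights_kg = [0.45359237 * w for w in weights_lb]

r_imperial = pearson_r(heights_ft, weights_lb)
r_converted = pearson_r(heights_in, weights_kg)

print(isclose(r_imperial, r_converted, rel_tol=1e-9))
print(-1 <= r_imperial <= 1)
```

Rescaling either variable by a positive constant multiplies the numerator and the denominator of r by the same factor, which is why the two values agree up to floating-point rounding.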
38 Various Expressions for r: There are other equivalent expressions for the linear correlation coefficient r as shown below: However, it is much easier to compute r using the short-cut formula shown on the next slide.
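39 Short-Cut Formula for r: $r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}}$, where $SS(x)$ = “sum of squares for x” = $\sum x^2 - \frac{(\sum x)^2}{n}$, $SS(y)$ = “sum of squares for y” = $\sum y^2 - \frac{(\sum y)^2}{n}$, and $SS(xy)$ = “sum of squares for xy” = $\sum xy - \frac{\sum x \sum y}{n}$.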
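A minimal sketch of the short-cut computation follows. It assumes the usual form in which r is SS(xy) divided by the square root of SS(x) times SS(y). The data are hypothetical and the function names are chosen for this example.

```python
from math import sqrt


def sums_of_squares(xs, ys):
    """Return SS(x), SS(y), SS(xy) using only the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return ss_x, ss_y, ss_xy


def shortcut_r(xs, ys):
    ss_x, ss_y, ss_xy = sums_of_squares(xs, ys)
    return ss_xy / sqrt(ss_x * ss_y)


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

ss_x, ss_y, ss_xy = sums_of_squares(xs, ys)
r = shortcut_r(xs, ys)
print(ss_x, ss_y, ss_xy, round(r, 2))
```

Only five running totals are needed (the sums of x, y, x squared, y squared, and xy), which is what makes this form convenient for hand or calculator work.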
40 Example: The table below presents the weight x (in thousands of pounds) and the gasoline mileage y (miles per gallon) for ten different automobiles. Find the linear correlation coefficient:
41 Completing the Calculation for r: $r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47$
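The final arithmetic step can be checked quickly in Python. The data table for this example is missing from the transcript, so the sketch starts from the three sums of squares. Those values are an assumption: they are read from the garbled digits on this slide as SS(xy) = -42.79, SS(x) = 7.449, and SS(y) = 1116.9.

```python
from math import sqrt

# Assumed values, read from the slide's garbled arithmetic (not independently verified).
ss_xy = -42.79
ss_x = 7.449
ss_y = 1116.9

# Short-cut formula: r = SS(xy) / sqrt(SS(x) * SS(y)).
r = ss_xy / sqrt(ss_x * ss_y)
print(round(r, 2))
```

A negative value is what one would expect here, since heavier automobiles tend to get lower gasoline mileage.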
42 Please Note: r is usually rounded to the nearest hundredth. r close to 0: little or no linear correlation. As the magnitude of r increases, towards -1 or +1, there is an increasingly stronger linear correlation between the two variables. We’ll also learn to obtain the linear correlation coefficient from the graphing calculator.
43 Positive Correlation Coefficients: Examples of positive correlation. In general, if the correlation is visible to the eye, then it is likely to be strong. [Figure panels: Strong Positive, r = 0.8; Moderate Positive, r = 0.5; Very Weak, r = 0.1.]
44 Negative Correlation Coefficients: Examples of negative correlation. In general, if the correlation is visible to the eye, then it is likely to be strong. [Figure panels: Strong Negative, r = -0.8; Moderate Negative, r = -0.5; Very Weak, r = -0.1.]
45 Nonlinear versus No Correlation: Nonlinear correlation and no correlation. Both sets of variables have r = 0.1, but the difference is that the nonlinear relation shows a clear pattern. [Figure panels: Nonlinear Relation; No Relation.]
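A small numerical illustration of this point: a set of points can follow a perfectly clear curve and still have a linear correlation coefficient of zero. The sketch below uses hypothetical points on the parabola y = x squared, not the data behind the slide's figure, and a helper function written for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient via the short-cut sums of squares."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return ss_xy / sqrt(ss_x * ss_y)


# Hypothetical points lying exactly on a parabola, symmetric about x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(r)
```

Because the left and right halves of the parabola cancel each other in SS(xy), r comes out as zero even though y is completely determined by x. This is why a scatter diagram should be examined alongside the value of r.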
46 Interpret the Linear Correlation Coefficients: Correlation is not causation! Just because two variables are correlated does not mean that one causes the other to change. There is a strong correlation between shoe sizes and vocabulary sizes for grade school children. Clearly larger shoe sizes do not cause larger vocabularies. Clearly larger vocabularies do not cause larger shoe sizes. Often lurking variables result in confounding.
47 How to Determine a Linear Correlation? How large does the correlation coefficient have to be before we can say that there is a relation? We’re not quite ready to answer that question.
48 Summary: Correlation between two variables can be described with both visual and numeric methods. Visual methods: scatter diagrams (analogous to histograms for single variables). Numeric methods: linear correlation coefficient (analogous to mean and variance for single variables). Care should be taken in the interpretation of linear correlation (nonlinearity and causation).
50 Learning Objectives: Find the regression line to fit the data and use the line to make predictions. Interpret the slope and the y-intercept of the regression line. Compute the sum of squared residuals.
51 Regression Analysis: Regression analysis finds the equation of the line that best describes the relationship between two variables. One use of this equation: to make predictions.
52 Best Fitted Line: If we have two variables X and Y which tend to be linearly correlated, we often would like to model the relation with a line that best fits the data. Draw a line through the scatter diagram. We want to find the line that “best” describes the linear relationship: the regression line.
53 Residuals: One difference between mathematics and statistics is that statistics assumes that the measurements are not exact, that there is an error or residual. The formula for the residual is always Residual = Observed – Predicted. This relationship is not just for this chapter; it is the general way of defining error in statistics.
54 What is a Residual? The figure shows a residual on the scatter diagram. [Figure labels: the residual; the regression line; the observed value y; the predicted value ŷ; the x value of interest.]
55 Example: Say that we want to predict a value of y for a specific value of x. Assume that we are using y = 10x + 25 as our model. To predict the value of y when x = 3, the model gives us y = 10(3) + 25 = 55, or a predicted value of 55. Assume the actual value of y for x = 3 is equal to 50. The actual value is 50, the predicted value is 55, so the residual (or error) is 50 – 55 = –5.
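The same calculation, written as a short Python sketch. The model y = 10x + 25 and the observed value 50 come from the example on this slide; the function name `predict` is an assumption made for the illustration.

```python
def predict(x):
    """Predicted value from the example model y = 10x + 25."""
    return 10 * x + 25


observed_y = 50          # the actual value of y at x = 3
predicted_y = predict(3)

# Residual = Observed - Predicted
residual = observed_y - predicted_y
print(predicted_y, residual)
```

A negative residual means the model overpredicted: the observed point lies below the line.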
56 Method of Least Squares: We want to minimize the prediction errors or residuals, but we need to define what this means. We use the method of least squares, which involves the following 3 steps: 1. We consider a possible linear model to fit the data. 2. We calculate the residual for each point. 3. We add up the squares of the residuals. (We square the residuals to avoid the cancellation of positive and negative residuals, since the proposed linear model underpredicts some observed values and overpredicts others.) The line that has the smallest overall residuals (i.e., the smallest sum of the squares of the residuals) is called the least-squares regression line, or simply the regression line; it is the best-fitting line for the data.
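The three steps can be sketched as follows. This is only an illustration on hypothetical data: the candidate lines are arbitrary choices, and the least-squares intercept and slope used for comparison (2.2 and 0.6) were worked out separately for these five points.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Steps 2 and 3: compute each residual (observed - predicted), square, and add up."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Step 1: consider some possible linear models (intercept, slope).
candidates = {
    "y = 1 + 1.0x": (1.0, 1.0),
    "y = 4 + 0.0x": (4.0, 0.0),
    "y = 2.2 + 0.6x (least squares)": (2.2, 0.6),
}

sse = {
    name: sum_squared_residuals(xs, ys, b0, b1)
    for name, (b0, b1) in candidates.items()
}

for name, value in sse.items():
    print(name, round(value, 2))
```

Among the lines tried here, the least-squares line has the smallest sum of squared residuals, and no other straight line can do better on these points.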
57 Method of Least Squares: Assume the equation of the best-fitting line: $\hat{y} = b_0 + b_1 x$, where $\hat{y}$ (called “y hat”) denotes the predicted value of y. Least squares method: Find the constants b0 and b1 such that the sum of the squared prediction errors, $\sum (y - \hat{y})^2$, is as small as possible.
58 Illustration: Observed and predicted values of y. [Figure: the line $\hat{y} = b_0 + b_1 x$, an observed point $(x, y)$, the predicted point $(x, \hat{y})$, and the vertical distance $y - \hat{y}$ between them.]
59 Linear Regression Line: The equation for the regression line is given by $\hat{y} = b_0 + b_1 x$. $\hat{y}$ denotes the predicted value for the response variable. b1 is the slope of the least-squares regression line. b0 is the y-intercept of the least-squares regression line. Note: Different textbooks may use different notations for the slope and the intercept.
60 Find the Equation of a Linear Regression Line: The equation is determined by b0 (the y-intercept) and b1 (the slope). Values that satisfy the least squares criterion: $b_1 = \frac{SS(xy)}{SS(x)}$ and $b_0 = \frac{\sum y - b_1 \sum x}{n}$.
61 Example: A recent article measured the job satisfaction of subjects with a 14-question survey. The data below represents the job satisfaction scores, y, and the salaries, x, for a sample of similar individuals: 1) Draw a scatter diagram for this data. 2) Find the equation of the line of best fit (i.e., regression line).
62 Finding b1 & b0: Preliminary calculations needed to find b1 and b0:
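63 Linear Regression Line: $b_1 = \frac{SS(xy)}{SS(x)} = \frac{118.75}{229.5} = 0.5174$; $b_0 = \frac{\sum y - b_1 \sum x}{n} = \frac{133 - (0.5174)(234)}{8} = 1.4902$. Equation of the line of best fit: $\hat{y} = 1.49 + 0.517x$. Solution 1): [Figure: scatter diagram.]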
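The slope and intercept arithmetic can be reproduced with a short Python sketch. The raw data table is not available in this transcript, so the code starts from the summary values. These are an assumption: they are read from the slide's garbled arithmetic as SS(xy) = 118.75, SS(x) = 229.5, a sum of x of 234, a sum of y of 133, and n = 8.

```python
# Assumed summary values, read from the slide (not independently verified).
ss_xy = 118.75
ss_x = 229.5
sum_x = 234
sum_y = 133
n = 8

# Slope: b1 = SS(xy) / SS(x)
b1 = ss_xy / ss_x

# Intercept: b0 = (sum of y - b1 * sum of x) / n
b0 = (sum_y - b1 * sum_x) / n

# The fitted line evaluated at the mean of x should give the mean of y.
x_bar = sum_x / n
y_bar = sum_y / n
y_hat_at_mean = b0 + b1 * x_bar

print(round(b1, 4), round(b0, 4))
print(round(y_hat_at_mean, 3), y_bar)
```

Carrying the unrounded slope into the intercept calculation, as the code does, follows the advice to keep extra decimal places until the final answer. The last line also shows that the line of best fit passes through the point of means.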
65 Please Note: Keep at least three extra decimal places while doing the calculations to ensure an accurate answer. When rounding off the calculated values of b0 and b1, always keep at least two significant digits in the final answer. The slope b1 represents the predicted change in y per unit increase in x. The y-intercept is the value of y where the line of best fit intersects the y-axis. That is, it is the predicted value of y when x is zero. The line of best fit will always pass through the point $(\bar{x}, \bar{y})$.
66 Please Note: Finding the values of b1 and b0 is a very tedious process. We should also know how to use a graphing calculator for this. Finding the coefficients b1 and b0 is only the first step of a regression analysis. We need to interpret the slope b1. We need to interpret the y-intercept b0.
67 Making Predictions: 1. One of the main purposes for obtaining a regression equation is for making predictions. 2. For a given value of x, we can predict a value of $\hat{y}$. 3. The regression equation should be used only to cover the sample domain on the input variable. You can estimate values outside the domain interval, but use caution and use values close to the domain interval. 4. Use current data. A sample taken in 1987 should not be used to make predictions in 1999.
68 Interpret the Slope: Interpreting the slope b1. The slope is sometimes defined as rise over run. The slope is also sometimes defined as the change in y divided by the change in x. The slope relates changes in y to changes in x.
69 Interpret the Slope: For example, if b1 = 4: If x increases by 1, then y will increase by 4. If x decreases by 1, then y will decrease by 4. A positive linear relationship. For example, if b1 = –7: If x increases by 1, then y will decrease by 7. If x decreases by 1, then y will increase by 7. A negative linear relationship.
70 Example: Say that a researcher studies the population in a town (which is the y or response variable) in each year (which is the x or predictor variable). To simplify the calculations, years are measured from 1900 (i.e., x = 55 is the year 1955). The model used is y = 300x + 12,000. A slope of 300 means that the model predicts that, on average, the population increases by 300 per year. An intercept of 12,000 means that the model predicts that the town had a population of 12,000 in the year 1900 (i.e., when x = 0).
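A brief sketch of how this model would be used and interpreted in code. The slope, intercept, and year coding come from the example on this slide; the function name and the choice of 1955 and 1956 as test years are assumptions for illustration.

```python
def predicted_population(year):
    """Town population predicted by y = 300x + 12,000, with x = years since 1900."""
    x = year - 1900
    return 300 * x + 12_000


pop_1900 = predicted_population(1900)   # the intercept: x = 0
pop_1955 = predicted_population(1955)   # x = 55
pop_1956 = predicted_population(1956)   # one year later

print(pop_1900, pop_1955, pop_1956 - pop_1955)
```

The difference between any two consecutive years equals the slope, and the prediction for 1900 equals the intercept. This matches the two interpretations given on the slide.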
71 Interpret the y-intercept: Interpreting the y-intercept b0. Sometimes b0 has an interpretation, and sometimes not. If 0 is a reasonable value for x, then b0 can be interpreted as the value of y when x is 0. If 0 is not a reasonable value for x, then b0 does not have an interpretation. In general, we should not use the model for values of x that are much larger or much smaller than the observed values of x (that is, it may be invalid to predict y for x values lying outside the range of the observed x).
72 Summary: Summarizing data for two quantitative variables: scatter diagrams; correlation coefficients. Linear models of correlation: least-squares regression line; prediction.
73 Obtain Linear Correlation Coefficient and Regression Line Equation from TI Calculator: 1. Turn on the diagnostic tool: CATALOG [2nd 0], DiagnosticOn, ENTER, ENTER. 2. Enter the data: STAT, EDIT. Enter the x-variable data into L1 and the corresponding y-variable data into L2. 3. Obtain the regression line and the linear correlation coefficient r: STAT, CALC, 4:LinReg(ax+b), ENTER, L1, L2, Y1. (Note: to enter Y1, use VARS, Y-VARS, 1:Function, 1:Y1, ENTER.) (The screen will also show r². Just ignore it.) 4. Display the scatter diagram and the fitted regression line: ZOOM, 9:ZoomStat, TRACE. (Press the up or down arrow keys to move the cursor to the regression line. Now you can trace the points along the line by pressing the right or left arrow keys. While the cursor is on the regression line, you can also enter a number, and the screen will show the predicted value of y for the x value you just entered.)