1 Chapter 16 Linear regression is a procedure that identifies relationship between independent variables and a dependent variable.


1 1 Chapter 16 Linear regression is a procedure that identifies the relationship between independent variables and a dependent variable. This relationship helps reduce the unexplained variation in the behavior of the dependent variable, and thus provides better predictions of its future values.

2 2 The Simple linear regression model The model is: $y = \beta_0 + \beta_1 x + \varepsilon$

3 3 The Simple linear regression model The model is: $y = \beta_0 + \beta_1 x + \varepsilon$. We try to estimate the deterministic part of it, $\beta_0 + \beta_1 x$, by developing the line with the best fit. Best fit is defined as the minimum sum of squared errors. An error is the difference between the line value and the actual value for a given x. The analysis yields $b_1 = \mathrm{cov}(x, y)/s_x^2$ and $b_0 = \bar{y} - b_1 \bar{x}$.
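
As a rough illustration of the best-fit idea, the following Python sketch computes the slope as covariance over variance (the form used in Problem 6) and the intercept from the two means, then evaluates the sum of squared errors. The function names `fit_line` and `sse` and the five data points are made up for illustration; they are not from the presentation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    n = len(xs)
    # Sample covariance and sample variance (both divide by n - 1).
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared errors between the line b0 + b1 * x and the actual y values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical illustration data (not the welding-machine data).
ages = [10, 20, 30, 40, 50]
costs = [25, 41, 68, 79, 102]

b0, b1 = fit_line(ages, costs)
best_sse = sse(ages, costs, b0, b1)
print(f"line: y = {b0:.2f} + {b1:.2f} x, SSE = {best_sse:.2f}")
```

Any other line, for example one with a shifted intercept or a different slope, gives a larger sum of squared errors on the same data.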

4 4 Problem 6 Are the breakdown costs of welding machines related to their age? From the data, answer the following:
– Find the sample regression line.
– What is the coefficient of determination? Interpret it.
– Are machine age and monthly repair costs linearly related?
– Is the fit good enough to use the model to predict the monthly repair costs of a 120-month-old machine?
– Make the prediction.

5 5 From Excel we get:
– Cov(age, cost) = 936.82; mean age $\bar{x} = 113.35$, $s_x^2 = 378.77$; mean cost $\bar{y} = 395.21$, $s_y^2 = 4094.79$.
– $b_1 = \mathrm{cov}(x, y)/s_x^2 = 936.82/378.77 = 2.4733$; $b_0 = \bar{y} - b_1 \bar{x} = 395.21 - 2.4733(113.35) = 114.86$
The regression line: $\hat{y} = 114.86 + 2.4733x$

6 6 Problem 6 Coefficient of determination.
– In this case $R^2 = [\mathrm{cov}(x, y)]^2/(s_x^2 s_y^2) = (936.82)^2/[(378.77)(4094.79)] = 0.5659$
– 56.59% of the variation in costs is explained by the model (by the different ages).
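
The Problem 6 figures can be checked directly from the Excel summary statistics quoted above. This is only a sketch: the variable names are chosen here for readability, and the reading of 378.77 and 4094.79 as the sample variances of age and cost is an assumption inferred from how they are used in the slope calculation.

```python
# Summary statistics quoted for Problem 6.
cov_xy = 936.82                      # Cov(age, cost)
mean_x, var_x = 113.35, 378.77       # age: mean and (assumed) sample variance
mean_y, var_y = 395.21, 4094.79      # cost: mean and (assumed) sample variance

b1 = cov_xy / var_x                          # slope
b0 = mean_y - b1 * mean_x                    # intercept
r_squared = cov_xy ** 2 / (var_x * var_y)    # coefficient of determination

print(round(b1, 4), round(b0, 2), round(r_squared, 4))
```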

7 7 Problem 6 Is there a linear relationship between monthly costs and machine age? We need to test whether $\beta_1$ is not equal to zero.
– $H_0: \beta_1 = 0$; $H_1: \beta_1 \neq 0$
– In this case, $t = [2.47 - 0]/.5106 = 4.837$
– The rejection region is $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$, with $n - 2$ degrees of freedom.
Can be calculated separately.

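8 8 Problem 6 The p-value is less than alpha, so the null hypothesis is rejected: machine age and monthly repair costs are linearly related.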

9 9 Problem 6 We need to forecast the expected cost for a 120-month-old machine. The equation provides a point prediction: Cost = 114.86 + 2.4733(120) = $411.65. The prediction interval (use Data Analysis Plus): LCL = $318.12; UCL = $505.18. What is the prediction for the average monthly repair cost of all machines that are 120 months old? To answer this question, construct the confidence interval (notice, not the prediction interval!).
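
The point prediction is just the regression line evaluated at 120 months. The short sketch below reproduces it and checks that it sits at the center of the quoted prediction interval; `predict_cost` is a hypothetical helper name, and the interval limits are copied from the slide rather than recomputed, since the raw data are not reproduced here.

```python
def predict_cost(age_months, b0=114.86, b1=2.4733):
    """Point prediction of monthly repair cost from the Problem 6 regression line."""
    return b0 + b1 * age_months

point = predict_cost(120)

# Prediction-interval limits as quoted on the slide.
lcl, ucl = 318.12, 505.18
midpoint = (lcl + ucl) / 2
half_width = (ucl - lcl) / 2

print(f"point prediction = {point:.2f}, interval = {midpoint:.2f} +/- {half_width:.2f}")
```

The confidence interval for the average cost of all 120-month-old machines is centered on the same point prediction but is narrower than the prediction interval for a single machine.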

10 10 Chapter 17 The multiple regression model allows more than one independent variable to explain the values of the dependent variable. We assess the model as before, using:
– the t test for a linear relationship between each independent variable and the dependent variable (tested one at a time)
– the F test for the overall usefulness of the model
– the coefficient of determination for the fit.

11 11 Problem 7 When a company buys another company, it is not unusual for some workers to be terminated. A buyout contract between Laurier Comp and Western Comp required Laurier to provide fired Western workers with a severance package equivalent to the packages offered to Laurier workers. It is suggested that severance is determined by three factors: age, length of service, and pay. Bill Smith, a Western employee, is offered 5 weeks of severance pay when his employment is terminated. Based on the data provided by Laurier (Xr19-05.xls) about severance offered to 50 of its employees in the past, answer the following questions:

12 12 Problem 7 - continued
– Determine the regression equation. Interpret the coefficients.
– Comment on how well the model fits the data.
– Do all the independent variables belong in the model?
– Does Laurier meet its obligation to Bill Smith?

13 13 Problem 8 A linear regression model for longevity
– Insurance companies are interested in predicting the longevity of their customers.
– Data for 100 deceased male customers were collected, and a regression model was run.
– The model studied was: $\text{Longevity} = \beta_0 + \beta_1\,\text{MotherAge} + \beta_2\,\text{FatherAge} + \beta_3\,\text{GrandM} + \beta_4\,\text{GrandF} + \varepsilon$

14 14 Problem 8 [Figure: regression output, annotated with "The equation" and "Coefficient of determination"]

15 15 Problem 8 Overall usefulness: $H_0$: all $\beta_i = 0$; $H_1$: at least one $\beta_i \neq 0$. F significance = p-value = $4.86 \times 10^{-27}$. Reject $H_0$. The model is useful.

16 16 Problem 8 Mother’s age and father’s age at death have strong linear relationships to an individual’s age at death. Grandparents’ ages at death are not good predictors of an individual’s age at death. The t-test for $\beta_i$: $H_0: \beta_i = 0$; $H_1: \beta_i \neq 0$. Test statistic: $t = (b_i - \beta_i)/s_{b_i}$. Rejection region: $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$

17 17 Chapter 18.2 Dummy variables help include qualitative data in a regression model. If qualitative data can be grouped into n categories, n-1 dummy variables are needed to express all the categories. Dummy variables take on the values 0 or 1.
– $X_i = 0$ if the data point in question does not belong to category i
– $X_i = 1$ if the data point in question belongs to category i.
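
A minimal sketch of the n-1 rule in Python: one category is chosen as the reference and gets no dummy variable, so it is represented by all zeros. The helper name `dummy_encode` is hypothetical, and the three machine types are borrowed from Problem 9 purely as an example.

```python
def dummy_encode(value, categories, reference):
    """Return the n-1 dummy variables (0 or 1) for one observation.

    `reference` is the category left without a dummy variable; an
    observation in that category has every dummy equal to 0.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {c: int(value == c) for c in categories if c != reference}

machine_types = ["Welding", "Lathe", "Stamping"]

for machine in machine_types:
    print(machine, dummy_encode(machine, machine_types, reference="Stamping"))
```

With three categories this produces two dummy variables per observation; adding a third would duplicate information already carried by the other two.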

18 18 Problem 9 In Problem 6 we studied the relationship between the age of welding machines and breakdown costs. The study was then expanded, and it now also includes lathe machines and stamping machines. See the Data file. Code for machine type: 1 = Welding; 2 = Lathe; 3 = Stamping. Answer the following:
– Develop a regression model.
– Interpret the coefficients.
– Can we conclude that welding machines cost more to repair than stamping machines?
– Predict the monthly cost to repair an 85-month-old lathe machine.

19 19 Problem 9 First we need to prepare the input data. [Figure: original data]

20 20 Problem 9 Run the multiple regression.

21 21 Problem 9 Run the multiple regression. The estimated model is Cost = 119.25 + 2.538 Age - 11.755 W - 199.37 L, where W = 1 for a welding machine and L = 1 for a lathe machine (0 otherwise). Note the reference line (for the stamping machine): Cost = 119.25 + 2.538 Age. Repair cost increases on average by $2.53 for each additional month of machine age. The monthly repair cost for a welding machine is $11.75 lower than for a stamping machine of the same age. However, this result is not significant (p-value = .55). There is insufficient evidence in the sample to support the hypothesis that there is any difference between the repair costs of welding machines and stamping machines. The monthly repair cost for a lathe machine is $199.37 lower than for a stamping machine of the same age. This result is significant.
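
The last question of Problem 9, the predicted monthly cost for an 85-month-old lathe machine, follows from plugging Age = 85, W = 0, L = 1 into the estimated equation. The sketch below does that in Python; the function name `repair_cost` and the lowercase machine-type strings are illustrative assumptions, while the coefficients are the ones quoted above.

```python
def repair_cost(age_months, machine_type):
    """Predicted monthly repair cost from the Problem 9 equation.

    Stamping is the reference category, so both dummies are 0 for it.
    """
    w = 1 if machine_type == "welding" else 0   # welding dummy
    l = 1 if machine_type == "lathe" else 0     # lathe dummy
    return 119.25 + 2.538 * age_months - 11.755 * w - 199.37 * l

lathe_85 = repair_cost(85, "lathe")
stamping_85 = repair_cost(85, "stamping")
print(f"lathe: {lathe_85:.2f}, stamping: {stamping_85:.2f}")
```

The gap between the two predictions is exactly the lathe coefficient, which is what the dummy-variable interpretation says it should be for machines of the same age.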

22 22 Chapter 15 We test the hypothesis that a set of data comes from a certain distribution:
– The multinomial distribution
– The normal distribution
We also study whether two variables are dependent or not. We apply a tool called the chi-squared test.

23 23 The multinomial experiment The multinomial experiment is an extension of the binomial experiment. Characteristics:
– There are n independent trials.
– Each trial can result in one of k possible outcomes.
– Outcome i occurs with probability $p_i$ in each trial (i = 1, …, k).
We test whether the sample gathered supports the hypothesis that $p_1, p_2, \ldots, p_k$ are equal to specified values. The test is called the goodness-of-fit test.

24 24 Problem 1 To determine whether a single die is balanced, or fair, the die was rolled 600 times (see Xr15-09.xls). Is there sufficient evidence at the 5% significance level to conclude that the die is not fair?

25 25 Problem 1 The hypotheses:
– $H_0: p_1 = p_2 = \ldots = p_6 = 1/6$; $H_1$: at least one $p_i$ is not 1/6.
– Build a rejection region: $\chi^2 > \chi^2_{\alpha,\,k-1}$
– In our case: $\chi^2 > \chi^2_{.05,\,5} = 11.07$

26 26 Problem 1
– We calculate $\chi^2$ as follows: $\chi^2 = \sum_{i=1}^{k} (f_i - e_i)^2 / e_i$
– In our case: $e_1 = e_2 = \ldots = e_6 = 600(1/6) = 100$. From the file we have: $f_1 = 114$; $f_2 = 92$; $f_3 = 84$; $f_4 = 101$; $f_5 = 107$; $f_6 = 103$
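
The goodness-of-fit statistic for the die can be computed in a few lines of Python. This is a sketch under stated assumptions: the critical value 11.0705 for alpha = .05 and 5 degrees of freedom is taken from a standard chi-squared table rather than computed, and the variable names are illustrative. Note that the six frequencies as listed add up to 601 rather than 600; the sketch uses them exactly as given, with expected frequencies of 100.

```python
# Observed frequencies f_1..f_6 as listed on the slide.
observed = [114, 92, 84, 101, 107, 103]

# Expected frequencies under H0 (fair die): 600 rolls / 6 faces = 100 each.
expected = [600 / 6] * 6

# Chi-squared statistic: sum of (f_i - e_i)^2 / e_i over the six faces.
chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Table value for alpha = .05 with k - 1 = 5 degrees of freedom.
critical = 11.0705
reject_h0 = chi2 > critical

print(f"chi2 = {chi2:.2f}, critical = {critical}, reject H0: {reject_h0}")
```

With these numbers the statistic falls below the critical value, so the sample does not provide sufficient evidence at the 5% level that the die is unfair.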

27 27 Contingency table Here we test the relationship between two variables. Are they dependent? We build a contingency table with r rows and c columns that cross-classifies the categories of the two variables, and compute a chi-squared statistic.

28 28 A Sample Problem Contingency table Type of music vs. geographic location
– A group of 30-year-old people was interviewed to determine whether the type of music they prefer is somehow related to the geographic location of their residence.
– From the data presented, can we infer that music preference is affected by geographic location? Use $\alpha = .10$.
$H_0$: Type of music and geographic location are independent. $H_1$: Type of music and geographic location are dependent.

29 29 A Sample problem – contd.
            Rock   R & B   Country   Classical   Total
Northeast    140      32         5          18     195
South        134      41        52           8     235
West         154      27         8          13     202
Total        428     100        65          39     632
$e_{11} = (195)(428)/632 = 132.06$; $e_{12} = (195)(100)/632 = 30.85$; …; $e_{23} = (235)(65)/632 = 24.17$; …
$\chi^2 = (140 - 132.06)^2/132.06 + \ldots + (52 - 24.17)^2/24.17 + \ldots = 64.92$
$\chi^2_{.10,\,(3-1)(4-1)} = 10.64$; 64.92 > 10.64. Reject the null hypothesis. Type of music and geographic location are not independent.
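
The full chi-squared sum over all twelve cells can be reproduced with the Python sketch below. It computes each expected frequency as (row total)(column total)/(grand total), exactly as in the hand calculation. The dictionary layout and variable names are illustrative choices, and the critical value 10.6446 for alpha = .10 with 6 degrees of freedom is taken from a standard chi-squared table.

```python
# Observed counts; columns are Rock, R & B, Country, Classical.
observed = {
    "Northeast": [140, 32, 5, 18],
    "South": [134, 41, 52, 8],
    "West": [154, 27, 8, 13],
}

row_totals = {region: sum(counts) for region, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

chi2 = 0.0
expected = {}
for region, counts in observed.items():
    expected[region] = []
    for j, f in enumerate(counts):
        # Expected frequency under independence.
        e = row_totals[region] * col_totals[j] / n
        expected[region].append(e)
        chi2 += (f - e) ** 2 / e

df = (len(observed) - 1) * (len(col_totals) - 1)   # (r - 1)(c - 1)
critical = 10.6446                                  # table value, alpha = .10, 6 d.f.

print(f"chi2 = {chi2:.2f}, d.f. = {df}, reject H0: {chi2 > critical}")
```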

30 30 A Sample problem – contd. Using Data Analysis Plus.

31 31 Chi squared test for normality
– Hypothesize on $\mu$ and $\sigma$: $\mu = \mu_0$ and $\sigma = \sigma_0$.
– Divide the Z range into equal-size sub-intervals [e.g. (-2, -1); (-1, 0); (0, 1); (1, 2)].
– Determine the corresponding probabilities covered by each sub-interval [e.g. $p_1 = P(Z < -2)$; $p_2 = P(-2 < Z < -1)$; …].
– Translate the Z scores to the associated X values [e.g. $x_1 = \mu_0 + (-2)\sigma_0$; $x_2 = \mu_0 + (-1)\sigma_0$; …].
– Find the actual frequency for each sub-interval [e.g. $f_1$ for the interval below $x_1$; $f_2$ for the interval $(x_1, x_2)$; …].
– Calculate the expected frequency for each interval: $e_1 = np_1$; $e_2 = np_2$; …
– Build a chi-squared statistic and perform the test.

