CHAPTER 17 FURTHER DATA ANALYSIS 2
17.1 INTRODUCTION
Explanatory variable
Example
－ Given sample data from a random sample of students about their IQ and their height: does the height of an individual student influence the IQ of the student?
－ Given sample data from a random sample of people about their height and the height of their father: does the height of a father influence the height of his son?
－ Given sample data about the number of newborn babies in the city per day and the number of wild geese flying over the city per day: is there a link between the number of newborn babies in the city and the number of wild geese flying over the city?
Methodology of Data Analysis
－ Initial Data Analysis leads to one of three outcomes:
  Very strong evidence to support a link.
  No evidence of any link.
  The sample evidence is inconclusive and further, more sophisticated data analysis is required.
－ Further Data Analysis: the sample evidence is consistent with either
  No link between the response variable and the explanatory variable, or
  A link between the response variable and the explanatory variable, in which case the nature of the relationship needs to be described.
17.2 WHAT IS MEANT BY A RELATIONSHIP Relationship between a measured response variable and a measured explanatory variable? － Y-- the response variable － X-- the explanatory variable － Conceptual graph of Y against X
Deterministic Relationship － Fig. 3 shows a deterministic relationship between Y and X － In a data-analysis context, the relationship would not be deterministic － For example, is there any connection between IQ and height, or between the height of a son and the height of his father?
Graph 1
－ Perfect linear relationship
－ Determining intercept and gradient/slope
－ Response Y depends only on the variable X
Graph 2
－ Statistical relationship/link
－ As the value of the explanatory variable X increases, the value of the response variable Y also tends to increase
－ The response Y may depend on a number of different variables, say X, U, V, W, Z: Y = f(X, U, V, W, Z, ...)
Y = f(X) + effect of all other variables, that is, Y = f(X) + e, where e is the effect of all other variables
The influence on Y comes in two parts:
－ Variation in Y explained by changes in X (Explained Variation)
－ Variation in Y not explained by changes in X (Unexplained Variation)
'Total Variation in Y' = 'Explained Variation' + 'Unexplained Variation'
－ In Graph 1, the Unexplained Variation is nil
－ In Graph 2, the Explained Variation is large relative to the Unexplained Variation
Graph 3 － Y seems to be unrelated to X － Explained Variation is zero － Unexplained Variation influences all changes in Y.
Graph 4 － Similar to Graph 2 Graph 5 － Similar to Graph 1
Model of relationship between Y and X － Y= f(X) + e － Total Variation in Y = Explained Variation +Unexplained Variation － Two issues Can a model of the link be made? Can 'The Total Variation in Y', 'Explained Variation' and the 'Unexplained Variation' be measured? － For Graph 1 and Graph 5 Y= a + bX a – intercept b – gradient/slope － For Graphs 2,3 & 4 Statistical model
17.3 DEVELOPING A STATISTICAL MODEL Simple numerical example － Using EXCEL － Fitting line by eye Intercept ? Gradient?
－ Actual Y
－ Predicted Yp = 2 + 5X
Measures of the disagreement between the actual data Y and the fitted line (predicted Yp)
－ (Y − Yp)
－ (Y − Yp)²: a satisfactory measure of disagreement when summed over all the data points
The Line of Best Fit: the sum of (Y − Yp)² is as small as possible
The Method of Least Squares: finding the intercept and the gradient of the line that minimize the sum of (Y − Yp)²
How to find the values of the intercept and the gradient
－ Trial and error in Excel
－ Using Excel SOLVER
－ Using MINITAB
The Unexplained Variation
－ The sum of (Y − Yp)² is the measure of the 'Unexplained Variation'
The Total Variation in Y
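The trial-and-error search described above can be compared with the closed-form least squares solution. The following is a minimal Python sketch, not the chapter's Excel workbook: the five data points are hypothetical values invented for illustration, and the function names are assumptions. It measures the disagreement for the by-eye line Yp = 2 + 5X and for the least squares line.

```python
def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of (Y - Yp)^2 for the line Yp = intercept + slope * X."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Intercept and slope that minimise the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the fitted line passes through the means
    return intercept, slope


# Hypothetical illustrative data (NOT the data used in the chapter's example).
xs = [1, 2, 3, 4, 5]
ys = [8, 11, 18, 21, 28]

by_eye = sum_squared_errors(xs, ys, 2, 5)  # the line fitted by eye, Yp = 2 + 5X
a, b = least_squares_line(xs, ys)
best = sum_squared_errors(xs, ys, a, b)

print(f"By eye:        Yp = 2 + 5X, sum of squares = {by_eye:.2f}")
print(f"Least squares: Yp = {a:.2f} + {b:.2f}X, sum of squares = {best:.2f}")
```

Whatever line is tried by eye, its sum of squares cannot be smaller than the least squares value, which is the sense in which the least squares line is the line of best fit.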
17.4 THE COEFFICIENT OF DETERMINATION R²
'Total Variation in Y' = 'Explained Variation' + 'Unexplained Variation'
－ Example: 82.00 = 78.23 + 3.77
Definition of R²: the ratio 'Explained Variation' / 'Total Variation in Y'
Example － For Graphs 1 & 5, R² = 1 or 100% － For Graph 3, R² = 0 － For Graphs 2 & 4, R² is between 0 and 1 The interpretation of the ratio R²
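As a worked illustration, R² can be computed from the variation figures quoted in the example 82.00 = 78.23 + 3.77. This sketch assumes the usual definition of R² as the ratio of Explained Variation to Total Variation; the variable names are illustrative only.

```python
# Variation figures quoted in the chapter's numerical example.
total_variation = 82.00
explained_variation = 78.23
unexplained_variation = 3.77

# The two parts should add up to the total (allowing for rounding).
assert abs(total_variation - (explained_variation + unexplained_variation)) < 1e-6

# Assumed definition: R^2 = Explained Variation / Total Variation.
r_squared = explained_variation / total_variation

print(f"R^2 = {r_squared:.3f}, i.e. {r_squared:.1%} of the variation in Y is explained by X")
```

The same value follows from 1 − (Unexplained Variation / Total Variation), since the two parts sum to the total.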
17.5 USING SAMPLE DATA TO TRACK A CONNECTION: INTRODUCTION
Example 1
－ To explore the relationship between the size of the engine, as measured by the cubic capacity in cubic centimetres (CC), and the petrol consumption, as measured in miles per gallon (M.P.G.).
Car   CC     M.P.G.
1     1670   31.40
2     2410   26.50
3     3030   22.90
4     1180   32.70
5     1300   30.60
6     1990   28.20
7     1800   29.50
8     1500   29.50
9     2200   27.40
10    2600   26.80
11    2860   25.00
12    2130   27.60
13    1880   28.40
14    2360   25.70
15    2270   28.10
Example 2 credit data － To investigate the connection between the response variable 'Amount Borrowed on Credit' CREDIT and the explanatory variable 'PAYOUT'.
17.6 THE INITIAL DATA ANALYSIS Example 1 － Graph ENGINE SIZE (CC)
－ Interpretation
By I.D.A., there is a clear link between M.P.G. and engine size: as the engine size increases, the M.P.G. falls. This is confirmed by the value of R², which suggests that 89.4% of the changes in M.P.G. are explained by changes in engine size. Equivalently, 10.6% of the changes in M.P.G. are due to other variables.
Can fuel consumption be predicted from engine size?
－ Using MINITAB: Graph > Scatterplot, Stat > Regression － Interpretation
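For readers without MINITAB, the same least squares fit can be sketched in plain Python. This is an illustrative alternative, not the chapter's method: it uses the 15 cars from Example 1, the variable names are assumptions, and the printed equation should agree, up to rounding, with the regression output for this example.

```python
# Car data from Example 1: engine size (CC) and petrol consumption (M.P.G.).
cc = [1670, 2410, 3030, 1180, 1300, 1990, 1800, 1500,
      2200, 2600, 2860, 2130, 1880, 2360, 2270]
mpg = [31.40, 26.50, 22.90, 32.70, 30.60, 28.20, 29.50, 29.50,
       27.40, 26.80, 25.00, 27.60, 28.40, 25.70, 28.10]

n = len(cc)
mean_x = sum(cc) / n
mean_y = sum(mpg) / n

# Least squares gradient and intercept.
sxx = sum((x - mean_x) ** 2 for x in cc)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cc, mpg))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Split the total variation in M.P.G. into explained and unexplained parts.
total_variation = sum((y - mean_y) ** 2 for y in mpg)
unexplained_variation = sum(
    (y - (intercept + slope * x)) ** 2 for x, y in zip(cc, mpg)
)
explained_variation = total_variation - unexplained_variation
r_squared = explained_variation / total_variation

print(f"M.P.G. = {intercept:.1f} + ({slope:.5f}) * CC")
print(f"R^2 = {r_squared:.1%}")
```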
Example 2 － Graph － Regression Analysis － Interpretation The I.D.A. is inconclusive and further analysis is required. The regression equation is CREDIT = 430 + 1.61 PAYOUT
17.7 THE FURTHER DATA ANALYSIS FDA four steps － Specify the hypotheses. － Defining the decision rule. － Examining the sample evidence. － Conclusions.
Specify the hypotheses － H0: R² = 0 There is no relationship between the response and the explanatory variable. － H1: R² > 0 There is a relationship between the response and the explanatory variable.
Defining the decision rule － If F calc > F Table then favour H1
Examining the sample evidence － MINITAB REGRESS output Conclusions Worked Example 2
－ F calc = 187.75 － F Table = 3.85 － Conclusions: since F calc is greater than F Table, the sample evidence favours H1. So there is evidence of a connection between 'CREDIT' and 'PAYOUT'.
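The decision step can be sketched in a few lines of Python. The function names are hypothetical, and the "favour H0" branch is assumed to be the complement of the stated rule. The helper `f_from_r_squared` uses the standard identity F = R²(n − 2) / (1 − R²), which holds only for a regression with a single explanatory variable; it is included as an illustration and is not taken from the MINITAB output.

```python
def f_from_r_squared(r_squared, n):
    """F statistic for a regression with ONE explanatory variable and n data points."""
    return (r_squared / 1) / ((1 - r_squared) / (n - 2))


def decide(f_calc, f_table):
    """Decision rule: favour H1 only if F calc exceeds F Table.

    The 'favour H0' outcome is an assumed complement of the stated rule.
    """
    return "favour H1" if f_calc > f_table else "favour H0"


# Worked Example 2, using the F calc and F Table values quoted in the text.
print(decide(187.75, 3.85))

# Example 1 (car data): n = 15 cars and R^2 of about 0.894.
f_cars = f_from_r_squared(0.894, 15)
print(f"F calc for the car data is about {f_cars:.1f}")
```

The table value to compare against depends on the sample size, so the F Table figure for Example 2 should not be reused for the car data.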
17.8 DESCRIBING THE RELATIONSHIP The R² value can be interpreted as a measure of the quality of predictions made from the line of best fit according to the rule of thumb:
Example 1
－ The regression equation is M.P.G. = 37.2 - 0.00444 cc
－ R² = 89.4%
－ Making predictions
CC     M.P.G.
500    35.0237
1500   30.5873
2000   28.3691
3000   23.9327
5000   15.0599
6500   8.4053
－ Inside the range from 1000cc to 3000cc, prediction is likely to be of good quality.
－ Outside the range from 1000cc to 3000cc, prediction is not likely to be very reliable. When cc = 0, M.P.G. = 37.24. When cc = 8500, M.P.G. = -0.4675, which is an impossible value.
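The distinction between predicting inside and outside the range of the data can be made explicit in code. This minimal sketch uses the rounded coefficients of the regression equation for Example 1, so its predictions differ slightly from figures computed with unrounded coefficients. The function name and the 1000cc to 3000cc limits as constants are illustrative assumptions.

```python
# Rounded coefficients of the Example 1 regression equation.
INTERCEPT = 37.2
SLOPE = -0.00444
# Range of engine sizes over which predictions are treated as good quality.
CC_MIN, CC_MAX = 1000, 3000


def predict_mpg(cc):
    """Return (predicted M.P.G., whether cc lies inside the reliable range)."""
    predicted = INTERCEPT + SLOPE * cc
    inside_range = CC_MIN <= cc <= CC_MAX
    return predicted, inside_range


for cc in (500, 1500, 2000, 3000, 5000, 6500, 8500):
    predicted, inside_range = predict_mpg(cc)
    note = "" if inside_range else "  (extrapolation: treat with caution)"
    print(f"{cc:>5} cc -> {predicted:6.2f} M.P.G.{note}")
```

Flagging extrapolated values, rather than refusing to compute them, keeps the impossible negative prediction at 8500cc visible as a warning about using the line outside the data.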
Example 2
－ CREDIT = 430 + 1.61 PAYOUT
－ R² = 22.4%
－ When PAYOUT = £10, CREDIT = £430 + 1.61 × 10 = 430 + 16.1 = £446.10
－ This is within the range of values of PAYOUT in the data, so it is a valid prediction, but it is not a very reliable one since the value of R² is only 22.4%