Regression Model Building LPGA Golf Performance - 2008.



Data Description

Response: log(Prize Winnings/Round), log-transformed because the winnings data are skewed

Potential Predictors:
- Average Drive Distance
- Percentage of Drives Reaching Fairway
- Percentage of Greens Reached in Regulation
- Average Putts per Hole
- Average Number of Sand Traps Hit per Round (Sandshot)
- Percentage of Sand Saves

Samples:
- Training Sample: 100 randomly sampled golfers
- Validation Sample: the 57 remaining golfers, used to assess fit
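The training/validation split described above can be sketched in R as follows. This is a minimal illustration only: the data frame is simulated (the real LPGA values are not reproduced here), the seed is arbitrary, and `lpga.cv.out` is a hypothetical name chosen to mirror the `lpga.cv.in` training data frame.

```r
# Simulated stand-in for the 157-golfer data set (values are NOT the LPGA data).
set.seed(2008)                       # arbitrary seed so the split is reproducible
n.total <- 157
lpga <- data.frame(golfer = seq_len(n.total), logprz = rnorm(n.total))

# Draw 100 golfers at random for training; the remaining 57 form the validation set.
train.id    <- sample(n.total, 100)
lpga.cv.in  <- lpga[train.id, ]      # training sample
lpga.cv.out <- lpga[-train.id, ]     # validation sample (hypothetical name)

c(training = nrow(lpga.cv.in), validation = nrow(lpga.cv.out))
```

Sampling row indices once and using the negative index for the complement guarantees that no golfer appears in both samples.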

Modeling Strategies

- Select the training sample.
- Select the "best" subset of predictors using Backward Elimination, Forward Selection, Stepwise Regression and/or All Possible Regressions, by minimizing a model selection criterion such as AIC.
- Identify any influential observations (based on Outliers, Leverage Values, DFFITS, DFBETAS, Cook's D).
- Test model assumptions: Normality (Shapiro-Wilk) and Constant Variance (Brown-Forsythe and Breusch-Pagan).
- Determine the validity of the model by obtaining prediction errors for the validation sample.
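The last step, obtaining prediction errors for the validation sample, might look like the sketch below. It assumes simulated data with only two predictors, a hypothetical `lpga.cv.out` validation data frame, and the mean squared prediction error (MSPR) as the summary; the slides do not say which summary of the prediction errors was actually used.

```r
# Simulated stand-in data (NOT the LPGA values); coefficients are arbitrary.
set.seed(1)
n.total <- 157
lpga <- data.frame(green = rnorm(n.total, 62, 5), putts = rnorm(n.total, 1.82, 0.03))
lpga$logprz <- 0.2 * lpga$green - 20 * lpga$putts + rnorm(n.total)

train.id    <- sample(n.total, 100)
lpga.cv.in  <- lpga[train.id, ]      # training sample
lpga.cv.out <- lpga[-train.id, ]     # validation sample (hypothetical name)

# Fit on the training sample only, then predict the held-out golfers.
fit      <- lm(logprz ~ green + putts, data = lpga.cv.in)
pred.err <- lpga.cv.out$logprz - predict(fit, newdata = lpga.cv.out)

mspr <- mean(pred.err^2)                          # mean squared prediction error
mse  <- sum(residuals(fit)^2) / df.residual(fit)  # training mean squared error
c(MSPR = mspr, MSE = mse)
```

An MSPR close to the training MSE suggests the fitted model predicts new golfers about as well as it fits the training sample; a much larger MSPR would point to overfitting.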

Top of Entire Sample (First 20 Golfers)

Backward Elimination (RSS = SSE)

Step 1 (start): logprz ~ drive + fairway + green + putts + sandshot + sandsave
Terms listed for removal: fairway, drive, sandsave, sandshot, green, putts

Step 2: logprz ~ drive + green + putts + sandshot + sandsave
Terms listed for removal: sandsave, drive, sandshot, green, putts

At Step 1, fairway is eliminated because dropping it gives the minimum AIC, which is lower than the AIC of the current model. At Step 2, no further variable is removed, since no deletion yields an AIC lower than that of the current model.
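For reference, the AIC that R's `step()` reports for a linear model is n·log(RSS/n) + 2p', where p' is the number of estimated coefficients. The sketch below checks that against `extractAIC()` and then runs a backward elimination. The data are simulated with arbitrary coefficients, so the terms it drops are not expected to match the LPGA results.

```r
# Simulated stand-in data (NOT the LPGA values).
set.seed(1)
n <- 100
d <- data.frame(drive = rnorm(n), fairway = rnorm(n), green = rnorm(n),
                putts = rnorm(n), sandshot = rnorm(n), sandsave = rnorm(n))
d$logprz <- d$green - d$putts + rnorm(n)

full <- lm(logprz ~ drive + fairway + green + putts + sandshot + sandsave, data = d)

# AIC as used by step(): n * log(RSS / n) + 2 * (number of coefficients)
rss        <- sum(residuals(full)^2)
aic.manual <- n * log(rss / n) + 2 * length(coef(full))
aic.step   <- extractAIC(full)[2]

# Backward elimination: drop a term only while doing so lowers AIC.
back <- step(full, direction = "backward", trace = 0)
formula(back)
```

Setting `trace = 1` instead prints the per-step tables of Df, Sum of Sq, RSS and AIC in the same layout as the output summarized above.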

Forward Selection (RSS = SSE)

Step 1 (start, AIC = -6.61): logprz ~ 1
Candidates, in the order listed: green, putts, drive, sandshot, sandsave, fairway

Step 2: logprz ~ green
Candidates: putts, sandsave, fairway, drive, sandshot

Step 3: logprz ~ green + putts
Candidates: sandshot, sandsave, drive, fairway

Step 4: logprz ~ green + putts + sandshot
Candidates: drive, sandsave, fairway

Step 5: logprz ~ green + putts + sandshot + drive
Candidates: sandsave, fairway

Step 6: logprz ~ green + putts + sandshot + drive + sandsave
Candidates: fairway

The variables enter in the order green, putts, sandshot, drive, sandsave. Fairway is never added, so forward selection arrives at the same five-predictor model as backward elimination.
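A forward selection of this kind can be sketched with `step()` by starting from the intercept-only model and supplying the candidate terms through `scope`. As before, the data here are simulated with arbitrary coefficients, so the order in which terms enter is illustrative only.

```r
# Simulated stand-in data (NOT the LPGA values).
set.seed(1)
n <- 100
d <- data.frame(drive = rnorm(n), fairway = rnorm(n), green = rnorm(n),
                putts = rnorm(n), sandshot = rnorm(n), sandsave = rnorm(n))
d$logprz <- d$green - d$putts + rnorm(n)

# Start from the intercept-only model, logprz ~ 1.
null <- lm(logprz ~ 1, data = d)

# At each step, add the candidate that lowers AIC the most; stop when none does.
fwd <- step(null,
            scope = list(lower = ~ 1,
                         upper = ~ drive + fairway + green + putts + sandshot + sandsave),
            direction = "forward", trace = 0)

attr(terms(fwd), "term.labels")   # terms in the order they were added
```

Using `direction = "both"` with the same `scope` gives stepwise regression, which also reconsiders previously added terms for removal at every step.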

Model: green, putts, sandshot, sandsave, drive

Call:
lm(formula = logprz ~ green + putts + sandshot + sandsave + drive, data = lpga.cv.in)

Coefficients (significance as reported):
- (Intercept): p-value on the order of 1e-14 (***)
- green: p-value < 2e-16 (***)
- putts: p-value < 2e-16 (***)
- sandshot: **
- sandsave: no significance star
- drive: *

Residual degrees of freedom: 94
F-statistic: 128 on 5 and 94 DF, p-value: < 2.2e-16

Influence Measures (n=100, p’=6)

Summary of Influence Measures - I

- Studentized Residuals (flagged when they exceed the cutoff in absolute value): two extreme values are reported.
- Leverage Values (flagged when they exceed 0.12): Golfers 111 (h = 0.1543), 127 (0.1263), 113 (0.1213). No big problem.
- DFFITS (flagged when they exceed 0.49 in absolute value): Golfers 142, 91, and 117 are flagged, and one golfer falls between 0.49 and 0.59 (Golfer 59).
- Cook's D (flagged when it exceeds 1; a cutoff of 0.5 is sometimes suggested): no value comes close to 1, or to the sometimes suggested 0.5.
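The leverage and DFFITS cutoffs quoted above are consistent with the common rules of thumb 2p'/n and 2·sqrt(p'/n) for n = 100 and p' = 6. The sketch below computes them and then extracts the influence measures from a fitted model with base R functions. The fitted model uses simulated data, so the flagged rows are not the golfers named in the slides.

```r
n <- 100   # training-sample size
p <- 6     # number of coefficients, including the intercept (p' in the slides)

cutoffs <- c(leverage = 2 * p / n,        # 0.12
             dffits   = 2 * sqrt(p / n),  # about 0.49
             dfbetas  = 2 / sqrt(n))      # 0.20
round(cutoffs, 2)

# Simulated stand-in data (NOT the LPGA values).
set.seed(1)
d <- data.frame(green = rnorm(n), putts = rnorm(n), sandshot = rnorm(n),
                sandsave = rnorm(n), drive = rnorm(n))
d$logprz <- d$green - d$putts + rnorm(n)
fit <- lm(logprz ~ green + putts + sandshot + sandsave + drive, data = d)

infl <- data.frame(rstudent = rstudent(fit),        # studentized deleted residuals
                   leverage = hatvalues(fit),       # hat matrix diagonals
                   dffits   = dffits(fit),
                   cooksD   = cooks.distance(fit))

# Rows exceeding any of the rule-of-thumb cutoffs.
flagged <- infl[infl$leverage > cutoffs["leverage"] |
                abs(infl$dffits) > cutoffs["dffits"] |
                infl$cooksD > 0.5, ]
flagged
```

These cutoffs are screening guidelines rather than formal tests, which is why cases that only slightly exceed them are treated above as no big problem.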

Summary of Influence Measures - II

DFBETAS (flagged when they exceed 0.20 in absolute value):
- Intercept: Golfers 117 (-0.54), 28 (0.24), 45 (0.29), 59 (0.34), 142 (0.45)
- Greens: Golfers 132 (-0.25), 91 (0.24), 110 (0.25), 142 (0.33)
- Putts: Golfers 142 (-0.41), 25 (0.24), 117 (0.43)
- Sandshots: Golfers 132 (-0.25), 111 (0.23), 39 (0.23), 110 (0.24)
- Sandsaves: Golfers 59 (-0.43), 22 (-0.31), 91 (-0.30), 102 (-0.25), 115 (0.23), 47 (0.43)
- Drive: Golfers 142 (-0.49), 59 (-0.24), 56 (0.28), 117 (0.29), 48 (0.30)

Although some of these values exceed the "threshold", none is far beyond it. Golfers 142 and 117 appear repeatedly, however, so they should be checked.
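One way to find observations that show up repeatedly, as golfers 142 and 117 do above, is to tabulate how many coefficients each observation moves by more than the 2/sqrt(n) cutoff. This is a sketch on simulated data, so the row numbers it reports have no connection to the actual golfers.

```r
# Simulated stand-in data (NOT the LPGA values).
set.seed(1)
n <- 100
d <- data.frame(green = rnorm(n), putts = rnorm(n), sandshot = rnorm(n),
                sandsave = rnorm(n), drive = rnorm(n))
d$logprz <- d$green - d$putts + rnorm(n)
fit <- lm(logprz ~ green + putts + sandshot + sandsave + drive, data = d)

db  <- dfbetas(fit)     # n x 6 matrix: one column per coefficient
cut <- 2 / sqrt(n)      # 0.20 for n = 100

# (row, col) positions of every DFBETAS value beyond the cutoff.
hits <- which(abs(db) > cut, arr.ind = TRUE)

# Number of coefficients each flagged observation influences, most frequent first.
counts <- sort(table(hits[, "row"]), decreasing = TRUE)
head(counts)
```

Observations at the top of `counts` are the natural candidates for a closer look, for example by refitting the model without them and comparing the coefficients.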

Residuals appear to be (reasonably) approximately normal. The Shapiro-Wilk test does not reject the hypothesis of normal errors.

> shapiro.test(residuals(lpga.mod1))
Shapiro-Wilk normality test
data: residuals(lpga.mod1)

No Evidence of non-constant error variance (Data had been transformed prior to fitting model)

Equal (Homogeneous) Variance - I

No evidence to reject the null hypothesis of equal variance among errors.

Equal (Homogeneous) Variance - II

There is no evidence of unequal variance, based on either the Brown-Forsythe or the Breusch-Pagan test.

Breusch-Pagan test
data: logprz ~ green + putts + sandshot + sandsave + drive
df = 5
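Both constant-variance tests can be computed by hand in base R, as sketched below. The assumptions here are mine: the Breusch-Pagan statistic is the non-studentized form, (SSR*/2) / (SSE/n)^2, where SSR* is the regression sum of squares from regressing the squared residuals on the predictors; the Brown-Forsythe test splits the cases into two groups at the median fitted value. The slides do not state how the groups were formed, and the data are simulated.

```r
# Simulated stand-in data (NOT the LPGA values).
set.seed(1)
n <- 100
d <- data.frame(green = rnorm(n), putts = rnorm(n), sandshot = rnorm(n),
                sandsave = rnorm(n), drive = rnorm(n))
d$logprz <- d$green - d$putts + rnorm(n)
fit <- lm(logprz ~ green + putts + sandshot + sandsave + drive, data = d)
e   <- residuals(fit)

# Breusch-Pagan: regress squared residuals on the same predictors.
d$e2     <- e^2
aux      <- lm(e2 ~ green + putts + sandshot + sandsave + drive, data = d)
ssr.star <- sum((fitted(aux) - mean(d$e2))^2)
bp       <- (ssr.star / 2) / (sum(e^2) / n)^2
bp.p     <- pchisq(bp, df = 5, lower.tail = FALSE)   # 5 predictors -> df = 5

# Brown-Forsythe: compare absolute deviations from group medians of residuals.
grp   <- factor(fitted(fit) > median(fitted(fit)), labels = c("low", "high"))
d.abs <- abs(e - ave(e, grp, FUN = median))
bf    <- t.test(d.abs ~ grp, var.equal = TRUE)

c(BP = bp, BP.p = bp.p, BF.t = unname(bf$statistic), BF.p = bf$p.value)
```

Large p-values from both tests are what "no evidence of unequal variance" refers to. The `bptest()` function in the `lmtest` package, called with `studentize = FALSE`, is a packaged alternative to the hand computation.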