Scatterplot Smoothing Using PROC LOESS and Restricted Cubic Splines


Scatterplot Smoothing Using PROC LOESS and Restricted Cubic Splines Jonas V. Bilenas Barclays Global Retail Bank/UK Adjunct Faculty, Saint Joseph University, School of Business June 23, 2011

Introduction
In this tutorial we will look at two scatterplot smoothing techniques:
- The LOESS Procedure: nonparametric regression smoothing (local regression, also known as DWLS, Distance Weighted Least Squares).
- Restricted Cubic Splines: parametric smoothing that can be used in regression procedures to fit functional models.

SUG, RUG, & LUG Pictures

LOESS documentation from SAS
The LOESS procedure implements a nonparametric method for estimating regression surfaces pioneered by Cleveland, Devlin, and Grosse (1988), Cleveland and Grosse (1991), and Cleveland, Grosse, and Shyu (1992). The LOESS procedure allows great flexibility because no assumptions about the parametric form of the regression surface are needed. The main features of the LOESS procedure are as follows:
- fits nonparametric models
- supports the use of multidimensional data
- supports multiple dependent variables
- supports both direct and interpolated fitting that uses kd trees
- performs statistical inference
- performs automatic smoothing parameter selection
- performs iterative reweighting to provide robust fitting when there are outliers in the data
- supports graphical displays produced through ODS Graphics

LOESS Procedure Details
LOESS fits a local regression function to the data within a chosen neighborhood of points. The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points, given by a smoothing parameter (0 < smooth <= 1). The larger the smoothing parameter, the smoother the fitted function. The default smoothing parameter is 0.5. The smoothing parameter can also be chosen automatically:
- AICC specifies the AICC criterion.
- AICC1 specifies the AICC1 criterion.
- GCV specifies the generalized cross validation criterion.
The regression fit is weighted by the distance of points from the center of the neighborhood. Missing values are deleted.
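The local fit at a single point can be sketched in Python. This is a minimal, hypothetical illustration of the idea (neighborhood fraction, tricube weights, weighted local linear fit), not the PROC LOESS implementation; `loess_fit_at` is an invented name.

```python
import numpy as np

def loess_fit_at(x0, x, y, smooth=0.5):
    """Estimate the LOESS fit at x0 with a local linear regression.

    The neighborhood holds the fraction `smooth` of the points closest
    to x0; each point gets a tricube weight based on its scaled distance.
    Assumes x has at least two distinct values near x0.
    """
    n = len(x)
    k = max(2, int(np.ceil(smooth * n)))   # points in the local neighborhood
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]                # k nearest points to x0
    dmax = d[idx].max()
    w = (1.0 - (d[idx] / dmax) ** 3) ** 3  # tricube weights in [0, 1]
    # weighted least squares for a degree-1 local polynomial centered at x0
    X = np.column_stack([np.ones(k), x[idx] - x0])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
    return beta[0]                         # intercept = fitted value at x0
```

On exactly linear data the local linear fit recovers the line regardless of the smoothing parameter, which is a quick sanity check for the weighting logic.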

Example of some LOESS
proc loess data=sashelp.cars;
 ods output outputstatistics=outstay;
 model MPG_Highway=MSRP / smooth=0.8 alpha=.05 all;
run;

Fit Summary
Fit Method                        kd Tree
Blending                          Linear
Number of Observations            428
Number of Fitting Points          9
kd Tree Bucket Size               68
Degree of Local Polynomials       1
Smoothing Parameter               0.80000
Points in Local Neighborhood      342
Residual Sum of Squares           8913.89292
Trace[L]                          3.77247
GCV                               0.04953
AICC                              4.05885
AICC1                             1737.19028
Delta1                            424.12399
Delta2                            424.20690
Equivalent Number of Parameters   3.66893
Lookup Degrees of Freedom         424.04109
Residual Standard Error           4.58445


Example of some LOESS
proc sort data=outstay;
 by pred;
run;

axis1 label = (angle=90 "MPG HIGHWAY");
axis2 label = (h=1.5 "MSRP");
symbol1 i=none c=black v=dot h=0.5;
symbol2 i=j value=none color=red l=1 width=30;

proc gplot data=outstay;
 plot (depvar pred)*MSRP / overlay haxis=axis2 vaxis=axis1 grid;
 title "LOESS Smooth=0.8";
run; quit;

LOESS with ODS GRAPHICS
ods html;
ods graphics on;
proc loess data=sashelp.cars;
 model MPG_Highway=MSRP / smooth=(0.5 0.6 0.7 0.8) alpha=.05 all;
run;
ods graphics off;
ods html close;

Optimized LOESS
ods html;
ods graphics on;
proc loess data=sashelp.cars;
 model MPG_Highway=MSRP / SELECT=AICC;
run;
ods graphics off;
ods html close;

LOESS in SGPLOT
ods html;
ods graphics on;
title 'LOESS/SMOOTH=0.60';
proc sgplot data=sashelp.cars;
 loess x=MSRP y=MPG_Highway / smooth=0.60;
run; quit;
ods graphics off;
ods html close;

Optimized LOESS (two regressors)
ods html;
ods graphics on;
proc loess data=sashelp.cars;
 model MPG_Highway=MSRP Horsepower / SELECT=AICC;
run;
ods graphics off;
ods html close;


LOESS for Time Series Plots
ods html;
ods graphics on;
title 'Time series plot';
proc loess data=ENSO;
 model Pressure = Month / SMOOTH=0.1 0.2 0.3 0.4;
run; quit;
ods graphics off;
ods html close;

Data from Cohen (SUGI 24). Data also online: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_loess_sect033.htm

LOESS for Time Series Plots (AICC optimized)

Large Number of Observations
Peter Flom blog: http://www.statisticalanalysisconsulting.com/scatterplots-dealing-with-overplotting/
Set PLOTS(MAXPOINTS= ) in PROC LOESS; the default limit is 5000. Run PROC LOESS on all the data, but plot after binning the independent variable and running means on the binned data.

proc loess data=test; /* output 300 for each record */
 ods output outputstatistics=outstay;
 model MPG_Highway=horsepower / smooth=0.4;
run;

proc rank data=outstay groups=100 ties=low out=ranked;
 var horsepower;
 ranks r_horsepower;
run;

proc means data=ranked noprint nway;
 class r_horsepower;
 var depvar pred Horsepower;
 output out=means mean=;
run;

axis1 label = (angle=90 "MPG HIGHWAY");
axis2 label = (h=1.5 "Horsepower");
symbol1 i=none c=black v=dot h=0.5;
symbol2 i=j value=none color=red l=1 width=10;

proc gplot data=means;
 plot (depvar pred)*Horsepower / overlay haxis=axis2 vaxis=axis1 grid;
 title "LOESS Smooth=0.4";
run; quit;
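The bin-and-average step (PROC RANK with GROUPS= followed by PROC MEANS) can be sketched in Python. `bin_means` is a hypothetical helper; it approximates quantile binning with equal-count bins, which differs slightly from PROC RANK's tie handling.

```python
import numpy as np

def bin_means(x, y, groups=100):
    """Quantile-bin x into `groups` equal-count bins and average x and y
    within each bin, roughly mimicking PROC RANK groups= + PROC MEANS."""
    order = np.argsort(x)
    bins = np.array_split(order, groups)   # ~equal-count bins of sorted points
    bx = np.array([x[b].mean() for b in bins])
    by = np.array([y[b].mean() for b in bins])
    return bx, by
```

Plotting the binned means of the LOESS predictions gives one point per bin (here 100 points) instead of one per observation, which keeps the scatterplot readable at any sample size.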

Large Number of Observations


Restricted Cubic Splines
Recommended by Frank Harrell. Knots are specified in advance, and the exact placement of the knots is not critical. Knot locations are usually set at predetermined percentiles based on the number of knots, k:

k  Quantiles
3  .10 .5 .90
4  .05 .35 .65 .95
5  .05 .275 .5 .725 .95
6  .05 .23 .41 .59 .77 .95
7  .025 .1833 .3417 .5 .6583 .8167 .975

Restricted Cubic Splines
Percentile values can be derived using PROC UNIVARIATE. The number of knots can be optimized by selecting the number that minimizes AICC. The method provides a parametric regression function. Sometimes the knot transformations make for difficult interpretation, and it may be difficult to incorporate interaction terms, but the approach is much more efficient than categorizing continuous variables into dummy terms. Macro available: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SasMacros/survrisk.txt
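Percentile-based knot placement can be sketched in Python. `knot_locations` and `DEFAULT_QUANTILES` are hypothetical names; the quantiles are the recommended values for k = 3 to 5 knots shown in the table above.

```python
import numpy as np

# Recommended knot quantiles for k = 3..5 knots (see the table above);
# hypothetical helper, not part of any SAS macro.
DEFAULT_QUANTILES = {
    3: [.10, .50, .90],
    4: [.05, .35, .65, .95],
    5: [.05, .275, .50, .725, .95],
}

def knot_locations(x, k=5):
    """Place k knots at the recommended sample quantiles of x."""
    return np.quantile(x, DEFAULT_QUANTILES[k]).tolist()
```

This plays the same role as the PROC UNIVARIATE step on the next slide: turn the recommended quantiles into concrete knot values for the variable at hand.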

Restricted Cubic Splines
proc univariate data=sashelp.cars noprint;
 var horsepower;
 output out=knots pctlpre=P_ pctlpts=5 27.5 50 72.5 95;
run;
proc print data=knots;
run;

Obs   P_5   P_27_5   P_50   P_72_5   P_95
 1    115    170      210    245     340

Restricted Cubic Splines
options nocenter mprint;
data test;
 set sashelp.cars;
 %rcspline(horsepower, 115, 170, 210, 245, 340);
run;

LOG:
MPRINT(RCSPLINE): DROP _kd_;
MPRINT(RCSPLINE): _kd_= (340 - 115)**.666666666666 ;
MPRINT(RCSPLINE): horsepower1=max((horsepower-115)/_kd_,0)**3+((245-115)*max((horsepower-340)/_kd_,0)**3 -(340-115)*max((horsepower-245)/_kd_,0)**3)/(340-245);
MPRINT(RCSPLINE): horsepower2=max((horsepower-170)/_kd_,0)**3+((245-170)*max((horsepower-340)/_kd_,0)**3 -(340-170)*max((horsepower-245)/_kd_,0)**3)/(340-245);
MPRINT(RCSPLINE): horsepower3=max((horsepower-210)/_kd_,0)**3+((245-210)*max((horsepower-340)/_kd_,0)**3 -(340-210)*max((horsepower-245)/_kd_,0)**3)/(340-245);
run;
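The basis terms the macro generates can be reproduced in plain Python. `rcs_basis` is a hypothetical helper that follows the formula visible in the log: for knots t1..tk it builds k-2 nonlinear terms, with the scaling constant kd = (tk - t1)^(2/3).

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis terms for a scalar x, following the
    formula generated by the %rcspline macro (see the log above).

    Term j (for knot t_j, j = 1..k-2) is
      max((x-t_j)/kd, 0)^3
      + ((t_{k-1}-t_j)*max((x-t_k)/kd, 0)^3
         - (t_k-t_j)*max((x-t_{k-1})/kd, 0)^3) / (t_k - t_{k-1}),
    which forces the spline to be linear beyond the outer knots.
    """
    t = sorted(knots)
    k = len(t)
    kd = (t[-1] - t[0]) ** (2.0 / 3.0)     # macro's _kd_ = range**(2/3)
    p = lambda u: max(u, 0.0) ** 3         # truncated cubic
    terms = []
    for j in range(k - 2):
        tj = t[j]
        v = (p((x - tj) / kd)
             + ((t[-2] - tj) * p((x - t[-1]) / kd)
                - (t[-1] - tj) * p((x - t[-2]) / kd)) / (t[-1] - t[-2]))
        terms.append(v)
    return terms
```

With the knots 115, 170, 210, 245, 340 this returns the same three values the macro stores in horsepower1-horsepower3: every term is zero below the first knot, and term j switches on only once x passes knot t_j.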

Restricted Cubic Splines
proc reg data=test;
 model MPG_Highway = horsepower horsepower1 horsepower2 horsepower3;
 LINEAR: TEST horsepower1, horsepower2, horsepower3;
run; quit;

Analysis of Variance
                          Sum of         Mean
Source            DF     Squares       Square    F Value    Pr > F
Model              4    8147.64458   2036.91115    145.37    <.0001
Error            423    5926.86710     14.01151
Corrected Total  427   14075

Root MSE          3.74319    R-Square   0.5789
Dependent Mean   26.84346    Adj R-Sq   0.5749
Coeff Var        13.94453

Parameter Estimates
                            Parameter    Standard
Variable      Label    DF    Estimate       Error    t Value    Pr > |t|
Intercept   Intercept   1    63.32145     2.50445      25.28      <.0001
Horsepower              1    -0.22900     0.01837     -12.46      <.0001
horsepower1             1     0.83439     0.12653       6.59      <.0001
horsepower2             1    -2.53834     0.49019      -5.18      <.0001
horsepower3             1     2.55417     0.66356       3.85      0.0001

Test LINEAR Results for Dependent Variable MPG_Highway
                       Mean
Source        DF     Square    F Value    Pr > F
Numerator      3  750.78949      53.58    <.0001
Denominator  423   14.01151
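The LINEAR: TEST statement is a joint F-test that the spline terms add nothing beyond the linear term. The same nested-model F-test can be sketched in Python; `f_test_nested` is a hypothetical helper, not a SAS routine.

```python
import numpy as np

def f_test_nested(X_full, X_reduced, y):
    """F statistic for testing that the extra columns of X_full
    (beyond those in X_reduced) have zero coefficients:
      F = ((RSS0 - RSS1)/q) / (RSS1/(n - p_full)).
    """
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return r @ r
    n, p_full = X_full.shape
    q = p_full - X_reduced.shape[1]        # number of restrictions tested
    rss0, rss1 = rss(X_reduced), rss(X_full)
    return ((rss0 - rss1) / q) / (rss1 / (n - p_full))
```

A large F (like the 53.58 on this slide) rejects linearity, i.e. the spline terms are jointly needed.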

Restricted Cubic Splines (5 Knots)

Restricted Cubic Splines (7 Knots): Time Series Data Regression terms not significant


References
Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood Principle," in Petrov and Csaki, eds., Proceedings of the Second International Symposium on Information Theory, 267-281.
Cleveland, W. S., Devlin, S. J., and Grosse, E. (1988), "Regression by Local Fitting," Journal of Econometrics, 37, 87-114.
Cleveland, W. S. and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47-62.
Cohen, R. A., "An Introduction to PROC LOESS for Local Regression," Proceedings of SUGI 24, Paper 273-24.
Harrell, F. (2010), Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis (Springer Series in Statistics), Springer.
Harrell RCSPLINE macro: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SasMacros/survrisk.txt
Stone, C. J. and Koo, C. Y. (1985), "Additive Splines in Statistics," Proceedings of the Statistical Computing Section, ASA, 45-48, Washington, DC.