# 2DS00 Statistics 1 for Chemical Engineering, lecture 4

Week schedule

- Week 1: Measurement and statistics
- Week 2: Error propagation
- Week 3: Simple linear regression analysis
- Week 4: Multiple linear regression analysis
- Week 5: Nonlinear regression analysis

Detailed contents of week 4

- multiple linear regression
- polynomial regression
- interaction
- multicollinearity
- measures of model adequacy
- selection of regression models

Specific heat

- specific heat of vapour at constant pressure as a function of temperature
- data set from Perry's Chemical Engineers' Handbook
- thermodynamic theory says that a quadratic relation between temperature and specific heat usually suffices
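A quadratic relation of this kind can be written as the regression model below. The notation is assumed here and is not taken from the slide: $c_p$ for the specific heat, $T$ for the temperature, $\beta_0, \beta_1, \beta_2$ for the regression parameters and $\varepsilon$ for the random error.

```latex
c_p = \beta_0 + \beta_1 T + \beta_2 T^2 + \varepsilon
```

The model is nonlinear in $T$ but linear in the parameters, so it can be fitted with multiple linear regression by treating $T$ and $T^2$ as two independent variables.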

Scatter plot of specific heat data

Regression output for specific heat data

Issues in regression output

- significance of model
- significance of individual regression parameters
- residual plots:
  - normality (density trace, normal probability plot)
  - constant variance (against predicted values + each independent variable)
  - model adequacy (against predicted values)
  - outliers
  - independence
- influential points
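The quantities behind this checklist can be computed by hand, as in the minimal sketch below. It assumes NumPy and uses made-up numbers, not the Perry's Handbook data. The variable names and the rule of thumb `2 * p / n` for high leverage are choices made for this illustration.

```python
import numpy as np

# Made-up illustration data (NOT the data set from the lecture).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 9.1, 12.8, 17.2, 22.1, 27.9])

# Design matrix for a quadratic model: intercept, x, x^2.
X = np.column_stack([np.ones_like(x), x, x**2])
n, p = X.shape

# Least-squares estimates, fitted values and residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Residual variance and standard errors of the parameter estimates.
s2 = resid @ resid / (n - p)
xtx_inv = np.linalg.inv(X.T @ X)
se_beta = np.sqrt(s2 * np.diag(xtx_inv))
t_ratio = beta / se_beta  # compare with a t distribution, n - p d.f.

# Overall F statistic for the significance of the model.
ss_reg = np.sum((fitted - y.mean()) ** 2)
f_stat = (ss_reg / (p - 1)) / s2  # compare with F(p - 1, n - p)

# Leverages (diagonal of the hat matrix) flag potentially influential points.
leverage = np.diag(X @ xtx_inv @ X.T)

# Internally studentized residuals help to spot outliers.
student = resid / np.sqrt(s2 * (1.0 - leverage))

print("estimates:", beta)
print("t ratios:", t_ratio)
print("F statistic:", f_stat)
print("high-leverage points:", np.where(leverage > 2 * p / n)[0])
print("studentized residuals:", student)
```

Plotting `resid` against `fitted` and against each independent variable gives the residual plots mentioned above. The normality checks need a separate plot of the (studentized) residuals.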

Residual plot for specific heat data

This behaviour is visible in the plot of the fitted line only after rescaling!

Plot of fitted quadratic model for specific heat data

Conclusion: regression models for specific heat data

- we need a third-order model (polynomial of degree 3)
- be careful with extrapolation
- the original data set contains influential points
- the original data set contains potential outliers
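The sketch below illustrates why a quadratic fit can look acceptable and still be inadequate. It assumes NumPy and uses synthetic, noise-free data generated from a cubic, so the numbers say nothing about the actual specific heat data. The coefficients and the extrapolation point `T = 15` are arbitrary choices.

```python
import numpy as np

# Synthetic, noise-free data with a mild cubic component (illustration only).
T = np.linspace(0.0, 10.0, 21)
cp = 1.0 + 0.5 * T - 0.08 * T**2 + 0.006 * T**3

def fit_poly(degree):
    """Return coefficients and residuals of a least-squares polynomial fit."""
    coef = np.polyfit(T, cp, degree)
    resid = cp - np.polyval(coef, T)
    return coef, resid

coef2, resid2 = fit_poly(2)
coef3, resid3 = fit_poly(3)

sse2 = float(resid2 @ resid2)
sse3 = float(resid3 @ resid3)
print("SSE quadratic:", sse2)
print("SSE cubic:    ", sse3)

# Long runs of equal signs in the residuals point to a missing term.
print("residual signs, quadratic fit:", np.sign(resid2).astype(int))

# Outside the observed range the two fitted models drift apart.
T_new = 15.0
print("quadratic prediction at T = 15:", np.polyval(coef2, T_new))
print("cubic prediction at T = 15:    ", np.polyval(coef3, T_new))
```

The residuals of the quadratic fit are small but follow a systematic wave, which is the kind of pattern that a residual plot reveals and a plot of the fitted curve hides.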

Yield data

- yield of a chemical reaction as a function of both temperature and pressure
- goal of the regression analysis is to find optimal settings of temperature and pressure
- start with the simplest linear models:
  - no interaction
  - interaction

No interaction

[Figure: yield against temperature (50 to 100), one line for pressure = 1 and one for pressure = 5.5; yield values marked: 65, 80, 85, 70]

Interaction

[Figure: yield against temperature (50 to 100), one line for pressure = 1 and one for pressure = 5.5; yield values marked: 65, 80, 75, 70]
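With interaction, the effect of temperature on yield depends on the pressure level. The sketch below makes this concrete with a first-order model that includes a product term. It assumes NumPy, and the pairing of yield values with temperature and pressure settings is hypothetical, chosen only to resemble the interaction sketch.

```python
import numpy as np

# Hypothetical corner points (temperature, pressure, yield); the pairing of
# yields with settings is assumed for illustration only.
points = [
    (50.0, 1.0, 65.0),
    (100.0, 1.0, 70.0),
    (50.0, 5.5, 80.0),
    (100.0, 5.5, 75.0),
]
T, P, y = (np.array(col) for col in zip(*points))

# Model with interaction: yield = b0 + b1*T + b2*P + b3*T*P.
# Four points and four parameters, so the system is solved exactly.
X = np.column_stack([np.ones_like(T), T, P, T * P])
b0, b1, b2, b3 = np.linalg.solve(X, y)

def temperature_slope(pressure):
    """Change in yield per unit of temperature at a fixed pressure."""
    return b1 + b3 * pressure

print("interaction coefficient b3:", b3)
print("slope at P = 1:  ", temperature_slope(1.0))
print("slope at P = 5.5:", temperature_slope(5.5))
```

Without interaction the coefficient `b3` is zero and the slope is the same at every pressure, which corresponds to parallel lines in the plot.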

Interaction plot for yield data

First-order interaction model for yield data

Comments on first-order interaction model

- model is significant, but R-squared is relatively low
- residual plots suggest that quadratic terms are missing

Full quadratic model for yield data

Comments on full quadratic model for yield data

- strong improvement in R-squared
- independent variable T is no longer significant
- other independent variables involving T remain significant
- refit the model omitting independent variable T while keeping the other independent variables

Incomplete quadratic model for yield data

- all parameters significant
- residual plots OK
- normality OK
- 3 influential points, but the standard deviations of the parameter estimates are OK, so no action
- 2 possible outliers at a predicted yield of 61%
- accept the model and use it for finding optimal settings for yield

Optimal settings for yield
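Once a quadratic model has been accepted, candidate optimal settings follow from setting its partial derivatives to zero. The sketch below assumes NumPy and uses hypothetical coefficients for a quadratic model without a linear temperature term. They are not the estimates from the lecture.

```python
import numpy as np

# Hypothetical coefficients of an incomplete quadratic model (no linear T term):
#   yield = b0 + b_p*P + b_tt*T**2 + b_pp*P**2 + b_tp*T*P
b0, b_p, b_tt, b_pp, b_tp = 10.0, 8.0, -0.01, -1.0, 0.1

def predicted_yield(T, P):
    """Predicted yield at temperature T and pressure P."""
    return b0 + b_p * P + b_tt * T**2 + b_pp * P**2 + b_tp * T * P

# Setting both partial derivatives to zero gives the linear system
# H @ [T, P] = -g, with H the Hessian and g the gradient at the origin.
H = np.array([[2 * b_tt, b_tp],
              [b_tp, 2 * b_pp]])
g = np.array([0.0, b_p])
T_opt, P_opt = np.linalg.solve(H, -g)

# The stationary point is a maximum only if the Hessian is negative definite.
is_maximum = bool(np.all(np.linalg.eigvalsh(H) < 0))

print("stationary point: T =", T_opt, ", P =", P_opt)
print("is a maximum:", is_maximum)
print("predicted yield there:", predicted_yield(T_opt, P_opt))
```

A stationary point found this way is only trustworthy if it lies inside the range of temperatures and pressures that was actually measured; otherwise it is an extrapolation.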

Problems with model selection

- variables may be significant in one model but not in another
- number of possible models increases rapidly with the number of independent variables
- independent variables may influence each other (multicollinearity)

Multicollinearity

- Phenomenon: variables $x_i$ (almost) satisfy a linear relation
- Consequence: large variances of parameter estimates
- Not harmful for predictions
- Unpleasant for finding causal relations
- Ways to check for multicollinearity:
  - wrong signs of parameter estimates
  - significant model, but (many) non-significant parameters
  - large variances of parameter estimates
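Besides the informal checks above, a common numerical check is the variance inflation factor (VIF). It is not mentioned on the slide and is added here only as an illustration. It measures how much the variance of a parameter estimate grows because the variable is nearly a linear function of the others. The sketch assumes NumPy and synthetic data with a fixed seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, synthetic data only
n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + 0.05 * rng.normal(size=n)  # almost a linear function of x1
x3 = rng.normal(size=n)                    # generated independently of x1, x2
X = np.column_stack([x1, x2, x3])

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the other
# independent variables. Equivalently, VIF_j is the j-th diagonal element of
# the inverse of the correlation matrix of the independent variables.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["x1", "x2", "x3"], vif):
    print(name, "VIF =", value)
```

Values near 1 indicate no problem. A frequently quoted rule of thumb treats values above about 10 as a sign of serious multicollinearity.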

Procedures for model selection

- compute all possible regression models
  - only possible with few independent variables
  - choice of best model through adequacy measures:
    - coefficient of determination (adjusted for the number of independent variables)
    - MSE (directly related to standard error)
    - Mallows' $C_p$ (estimates total mean square error)
- sequentially add terms (forward regression)
- sequentially delete terms from the full model (backward regression)

These procedures do not necessarily yield the same result.

Final models should always be checked!
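The all-possible-models procedure can be sketched as follows for three candidate variables, which gives 2^3 = 8 models. The sketch assumes NumPy and synthetic data in which only the first two variables matter. Mallows' statistic is computed as SSE_p / s^2 - (n - 2p), where s^2 is the MSE of the full model and p counts the parameters including the intercept.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, synthetic data only
n = 40
X = rng.normal(size=(n, 3))
# Only columns 0 and 1 influence y; column 2 is irrelevant by construction.
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=n)

def sse(columns):
    """Residual sum of squares for an intercept plus the given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

k = X.shape[1]
sst = float(np.sum((y - y.mean()) ** 2))
s2_full = sse(range(k)) / (n - k - 1)  # MSE of the full model

results = []
for size in range(k + 1):
    for cols in itertools.combinations(range(k), size):
        p = size + 1  # number of parameters, including the intercept
        e = sse(cols)
        mse = e / (n - p)
        adj_r2 = 1.0 - mse / (sst / (n - 1))
        cp = e / s2_full - (n - 2 * p)
        results.append((cols, adj_r2, mse, cp))

# List the models from smallest to largest Mallows' C_p.
for cols, adj_r2, mse, cp in sorted(results, key=lambda item: item[3]):
    print(cols, round(adj_r2, 3), round(mse, 3), round(cp, 2))
```

By construction the full model always has C_p equal to its number of parameters, so the interesting candidates are smaller models whose C_p is close to their own p. Whatever model comes out on top still needs the residual checks from earlier in the lecture.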