# Multivariate statistical analysis: Regression analysis

## Presentation on theme: "Multivariate statistical analysis Regression analysis."— Presentation transcript:

Multivariate statistical analysis Regression analysis

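Regression vs. correlation. Regression: analyzes the (a priori) causal relationship between the explanatory variables and the response variable. Correlation: measures the strength of the association between variables.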

Regression model: $(Y_1, Y_2, \dots, Y_j) = f(X_1, X_2, \dots, X_k)$. With $k \ge 2$ it is a multiple regression; with $j \ge 2$ it is a multivariate regression. The assumed model for observation $n$ is $y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \dots + \beta_k x_{nk} + e_n$, where $e_n$ is the random error term. The prerequisite assumptions are that the errors are normal i.i.d., $e_n \sim N(0, \sigma^2)$: normality, independence, and variance equality.

Modeling the regression line Ref.

ANOVA table for regression analysis — total model testing

Sums of squares: sum of squares for error (SSE), sum of squares for model (SSM), sum of squares for total (SST), with $SST = SSM + SSE$. $MSM = SSM / (\text{d.f. of model}) = SSM / K$. $MSE = SSE / (\text{d.f. of error}) = SSE / (N-K-1)$. d.f. of total $= N-1$. $F = MSM / MSE$.
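The decomposition behind the ANOVA table can be sketched for the simplest case, one predictor ($K = 1$), using only the Python standard library. The function name and the sample data are illustrative assumptions, not part of the slides.

```python
def anova_simple_regression(x, y):
    """ANOVA quantities for a one-predictor (K = 1) least-squares fit."""
    n, k = len(y), 1
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope
    b0 = mean_y - b1 * mean_x      # intercept
    fitted = [b0 + b1 * xi for xi in x]

    sst = sum((yi - mean_y) ** 2 for yi in y)                # total
    ssm = sum((fi - mean_y) ** 2 for fi in fitted)           # model
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # error

    msm = ssm / k                  # d.f. of model = K
    mse = sse / (n - k - 1)        # d.f. of error = N - K - 1
    return {"SST": sst, "SSM": ssm, "SSE": sse,
            "MSM": msm, "MSE": mse, "F": msm / mse}


# Made-up sample data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
table = anova_simple_regression(x, y)
for name, value in table.items():
    print(f"{name}: {value:.4f}")
```

A large F relative to the critical value of the F distribution with $(K, N-K-1)$ degrees of freedom suggests that the model as a whole explains a significant share of the variation.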

Determination. Coefficient of determination: $R^2 = SSM/SST = 1 - SSE/SST$, $0 \le R^2 \le 1$. It determines the goodness of fit of a sampled regression line. Adjusted coefficient of determination: each sum of squares is divided by its degrees of freedom, $\text{Adj. } R^2 = 1 - \frac{SSE/(N-K-1)}{SST/(N-1)} = 1 - (1-R^2)\frac{N-1}{N-K-1}$. This requires $N > K+1$: the sample size must exceed the number of explanatory variables plus one.
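As a quick numerical check of the two formulas, the following sketch computes $R^2$ and adjusted $R^2$ from given sums of squares. The function name and the input numbers are assumptions chosen for illustration.

```python
def r_squared(sse, sst, n, k):
    """Return (R^2, adjusted R^2) for N observations and K predictors."""
    if n <= k + 1:
        raise ValueError("need N > K + 1 observations")
    r2 = 1 - sse / sst
    adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
    return r2, adj_r2


# Illustrative numbers: SSE = 20, SST = 100, N = 25, K = 4.
r2, adj_r2 = r_squared(sse=20.0, sst=100.0, n=25, k=4)
print(round(r2, 4), round(adj_r2, 4))
```

The adjusted value is lower than the plain $R^2$ here because it penalizes the four predictors relative to the sample size.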

t-test for the coefficients of the explanatory variables: marginal testing
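For a single predictor, the marginal test statistic is the estimated slope divided by its standard error. The sketch below assumes the textbook standard error $\sqrt{MSE / S_{xx}}$ for simple regression; the function name and data are illustrative.

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, standard error of b1, t statistic) for y = b0 + b1*x + e."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)            # error d.f. = N - K - 1 with K = 1
    se_b1 = math.sqrt(mse / sxx)
    return b1, se_b1, b1 / se_b1


# Made-up sample data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b1, se_b1, t_stat = slope_t_statistic(x, y)
print(f"b1={b1:.3f}  se={se_b1:.4f}  t={t_stat:.2f}")
```

The statistic is compared against a t distribution with $N-K-1$ degrees of freedom; looking up the p-value is left out to keep the sketch dependency-free.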

Conflicts between total testing and marginal testing. Confidence interval vs. confidence region: the region is a joint statement about all coefficients, as opposed to several narrower individual confidence intervals, one per coefficient.

Determine the predictors: check the contribution of additional variables. Methods: stepwise regression, forward regression, backward regression.
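A forward procedure can be sketched as follows. The slides do not state which criterion measures the contribution of an added variable; this sketch assumes adjusted $R^2$ (a partial F-test is a common alternative). All function names and the data are hypothetical, and the small linear solver assumes the candidate predictors are not perfectly collinear.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def sse_of_fit(columns, y):
    """SSE of the least-squares fit of y on an intercept plus the columns."""
    design = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
               for r, yi in zip(design, y))


def forward_select(candidates, y):
    """Add one predictor at a time while adjusted R^2 keeps improving."""
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    chosen, best_adj = [], 0.0     # intercept-only model has adjusted R^2 = 0
    remaining = dict(candidates)
    while remaining:
        scores = {}
        for name, col in remaining.items():
            cols = [candidates[c] for c in chosen] + [col]
            k = len(cols)
            if n - k - 1 > 0:
                scores[name] = 1 - (sse_of_fit(cols, y) / (n - k - 1)) / (sst / (n - 1))
        if not scores:
            break
        best = max(scores, key=scores.get)
        if scores[best] <= best_adj:
            break                  # no candidate improves the fit
        chosen.append(best)
        best_adj = scores[best]
        del remaining[best]
    return chosen, best_adj


# Made-up data: y depends on x1; x2 is unrelated.
candidates = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "x2": [3, 1, 4, 1, 5, 9, 2, 6]}
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]
chosen, adj = forward_select(candidates, y)
print(chosen, round(adj, 4))
```

Backward regression would start from the full model and drop the least useful variable at each step; stepwise regression combines both directions.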

Testing the assumptions. Normality testing: the Shapiro-Wilk statistic; Q-Q / P-P plots (expected distribution vs. observed distribution). Variance equality testing: plot the error term against $x_n$ and verify that the pattern is random. Independence testing: the Durbin-Watson test checks for first-order autocorrelation of the residuals; the statistic is close to 2 when there is none, a value above 2 indicates a negative ("-") relation, and a value below 2 indicates a positive ("+") relation. For cross-sectional data, a random and independent sampling process is assumed; for longitudinal data, use time-series analysis.
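The Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares. A minimal sketch, with made-up residual sequences chosen to show both directions:

```python
def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); ranges from 0 to 4."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Residuals that flip sign every step: negative autocorrelation, d > 2.
d_alternating = durbin_watson([1, -1, 1, -1])
# Residuals that keep their sign in runs: positive autocorrelation, d < 2.
d_runs = durbin_watson([1, 1, -1, -1])
print(d_alternating, d_runs)
```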

Collinearity: a pair of predictor variables that are strongly correlated. Tolerance: $1 - R_j^2$, where $R_j^2$ is the coefficient of determination from regressing predictor $j$ on the other predictors; when strong correlation exists, the tolerance is small and near zero. VIF (variance inflation factor): the inverse of tolerance, $VIF_j = 1/(1 - R_j^2)$; when tolerance is small, VIF becomes very large.
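With exactly two predictors, $R_j^2$ reduces to the squared Pearson correlation between them, which makes tolerance and VIF easy to compute by hand. The sketch below relies on that two-predictor simplification; the function name and data are illustrative assumptions.

```python
def tolerance_and_vif(x1, x2):
    """Tolerance and VIF for a model with exactly two predictors."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    r2 = s12 ** 2 / (s11 * s22)    # squared correlation = R_j^2 here
    tol = 1 - r2
    vif = float("inf") if tol == 0 else 1 / tol
    return tol, vif


x1 = [1, 2, 3, 4, 5]
# Nearly a copy of x1: strong collinearity, tiny tolerance, large VIF.
tol_high, vif_high = tolerance_and_vif(x1, [1.1, 1.9, 3.2, 3.9, 5.0])
# Uncorrelated with x1: tolerance 1, VIF 1.
tol_none, vif_none = tolerance_and_vif(x1, [1, -1, 0, -1, 1])
print(round(vif_high, 1), round(vif_none, 1))
```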

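Outliers. Leverage $h_{jj}$ ($< 1$): $h_{jj} = \frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}$, where $x_j$ is the value for observation $j$ and $\bar{x}$ is the mean over all observations. If $h_{jj}$ is comparatively too large, remove this observation.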
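The leverage formula can be evaluated directly for a single predictor. The slides give no numeric cutoff for "too large"; the sketch assumes the common rule of thumb $2(K+1)/n$, which is not from the source. The names and data are illustrative.

```python
def leverages(x):
    """Leverage h_jj = 1/n + (x_j - mean)^2 / sum_i (x_i - mean)^2."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Made-up predictor values; the last one sits far from the rest.
x = [1, 2, 3, 4, 20]
h = leverages(x)

# Assumed cutoff: 2 * (K + 1) / n with K = 1 predictor.
cutoff = 2 * (1 + 1) / len(x)
flagged = [j for j, hj in enumerate(h) if hj > cutoff]
print([round(hj, 3) for hj in h], flagged)
```

For one predictor plus an intercept the leverages always sum to 2, so a single value near 1 means that observation largely determines its own fitted value.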

Weighted regression: accounts for the different impact of individual sample data. For outliers, set the influence weight near 0.
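A minimal sketch of weighted least squares for one predictor, showing how a near-zero weight neutralizes an outlier. The function name, the data, and the specific weights are assumptions for illustration.

```python
def weighted_fit(x, y, w):
    """Weighted least squares for y = b0 + b1*x; returns (b0, b1)."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# The first four points lie on y = 2x + 1; the last one is an outlier.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 30]

b0_equal, b1_equal = weighted_fit(x, y, [1, 1, 1, 1, 1])
b0_down, b1_down = weighted_fit(x, y, [1, 1, 1, 1, 1e-9])
print(round(b1_equal, 3), round(b1_down, 3))
```

With equal weights the outlier pulls the slope well above 2; with its weight near 0 the fit recovers the line through the other four points.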

Data transformation: transform the data to improve normality and variance equality, for example by taking the log, the inverse, or the square.
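One way to see the effect of a log transformation: a response that grows multiplicatively becomes linear in the predictor after taking logs, so an ordinary straight-line fit applies. The data below are synthetic and noise-free, and the function name is an assumption.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# Synthetic exponential response: y = 2 * exp(0.5 * x).
x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]

# After the log transform, log(y) = log(2) + 0.5 * x is exactly linear.
log_y = [math.log(yi) for yi in y]
b0, b1 = fit_line(x, log_y)
print(round(b0, 4), round(b1, 4))
```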