SIMPLE AND MULTIPLE REGRESSION


Relationship between variables: between discrete variables (example: see Titanic); between continuous variables (regression!); between discrete and continuous variables (regression!)

PISA 2003: performance in Mathematics vs. number of books at home

Linear Regression? PISA 2003

EXAMINE VARIABLES=fac1_1 BY st19q01
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL
  /MISSING=REPORT.

Linear Regression? PISA 2003

mean = c(-1.0505584, -0.6346741, -0.1713318, 0.1350648, 0.4411666)
books = c(1, 2, 3, 4, 5)
plot(books, mean, col="blue", axes=FALSE, xlab="books", ylab="level mat")
axis(1)
axis(2)
abline(lm(mean ~ books), col="red")

Linear Regression

* We first load the PISAespanya.sav file.
* This is the syntax file for the SPSS analysis.
* Q38: How often do these things happen in your math class?
* Students don't listen to what the teacher says.

CROSSTABS
  /TABLES=subnatio BY st38q02
  /FORMAT=AVALUE TABLES
  /STATISTIC=CHISQ
  /CELLS=COUNT ROW.

FACTOR
  /VARIABLES pv1math pv2math pv3math pv4math pv5math pv1math1 pv2math1
    pv3math1 pv4math1 pv5math1 pv1math2 pv2math2 pv3math2 pv4math2 pv5math2
    pv1math3 pv2math3 pv3math3 pv4math3 pv5math3 pv1math4 pv2math4 pv3math4
    pv4math4 pv5math4
  /MISSING LISTWISE
  /ANALYSIS pv1math pv2math pv3math pv4math pv5math pv1math1 pv2math1
    pv3math1 pv4math1 pv5math1 pv1math2 pv2math2 pv3math2 pv4math2 pv5math2
    pv1math3 pv2math3 pv3math3 pv4math3 pv5math3 pv1math4 pv2math4 pv3math4
    pv4math4 pv5math4
  /PRINT INITIAL EXTRACTION FSCORE
  /PLOT EIGEN ROTATION
  /CRITERIA FACTORS(1) ITERATE(25)
  /EXTRACTION ML
  /ROTATION NOROTATE
  /SAVE REG(ALL).

GRAPH
  /SCATTERPLOT(BIVAR)=st19q01 WITH fac1_1
  /MISSING=LISTWISE.

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT fac1_1
  /METHOD=ENTER st19q01
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HIST(ZRESID) NORM(ZRESID).

library(foreign)
help(read.spss)
data = read.spss("G:/DATA/PISAdata2003/ReducedDataSpain.sav",
                 use.value.labels=TRUE, to.data.frame=TRUE)
names(data)
 [1] "SUBNATIO" "SCHOOLID" "ST03Q01"  "ST19Q01"  "ST26Q04"  "ST26Q05"
 [7] "ST27Q01"  "ST27Q02"  "ST27Q03"  "ST30Q02"  "EC07Q01"  "EC07Q02"
[13] "EC07Q03"  "EC08Q01"  "IC01Q01"  "IC01Q02"  "IC01Q03"  "IC02Q01"
[19] "IC03Q01"  "MISCED"   "FISCED"   "HISCED"   "PARED"    "PCMATH"
[25] "RMHMWK"   "CULTPOSS" "HEDRES"   "HOMEPOS"  "ATSCHL"   "STUREL"
[31] "BELONG"   "INTMAT"   "INSTMOT"  "MATHEFF"  "ANXMAT"   "MEMOR"
[37] "COMPLRN"  "COOPLRN"  "TEACHSUP" "ESCS"     "W.FSTUWT" "OECD"
[43] "UH"       "FAC1.1"
attach(data)
mean(FAC1.1)
[1] -8.95814e-16
tabulate(ST19Q01)
[1]  106    0   15 1266 1927 2372 3575 1155  375
> table(ST19Q01)
ST19Q01
               Miss             Invalid                 N/A
                106                   0                  15
More than 500 books       201-500 books       101-200 books
               1266                1927                2372
       26-100 books         11-25 books          0-10 books
               3575                1155                 375

Effect of the family's cultural possessions

Data: variables Y and X observed on a sample of size n: (yi, xi), i = 1, 2, ..., n.

Covariance and correlation
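As a minimal R sketch (with made-up data): the sample correlation is the sample covariance rescaled by the two standard deviations.

set.seed(1)
x = rnorm(50)
y = 2 + 0.5*x + rnorm(50)       # hypothetical linear relation plus noise
cov(x, y)                       # sample covariance
cov(x, y) / (sd(x) * sd(y))     # correlation from its definition
cor(x, y)                       # the same value, computed directly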

Scatterplots for various values of the correlation

n = 100    # sample size
par(mfrow=c(4,4))
for (r in seq(-1, 1, .2)) {
  X = rnorm(n)
  Y = r*X + sqrt(1 - r^2)*rnorm(n)
  plot(X, Y, main=c("correlation", as.character(r)), col="blue")
  regress = lm(Y ~ X)
  abline(regress)
}

Correlation coefficient r = 0, even though there is an exact (non-linear!) functional relation.

x = seq(-10, 10, 1)
y = x^2
r = cov(x, y) / (sd(x) * sd(y))   # correlation = covariance divided by the two standard deviations
r
[1] 0
plot(x, y, col="blue", axes=FALSE)
axis(1)
axis(2)
abline(lm(y ~ x), col="red")

> cbind(x, y)
        x   y
 [1,] -10 100
 [2,]  -9  81
 [3,]  -8  64
 [4,]  -7  49
 [5,]  -6  36
 [6,]  -5  25
 [7,]  -4  16
 [8,]  -3   9
 [9,]  -2   4
[10,]  -1   1
[11,]   0   0
[12,]   1   1
[13,]   2   4
[14,]   3   9
[15,]   4  16
[16,]   5  25
[17,]   6  36
[18,]   7  49
[19,]   8  64
[20,]   9  81
[21,]  10 100

Simple Linear Regression. Variables Y, X: E(Y | X) = a + b·X, Var(Y | X) = σ².

Linear relation: y = 1 + .6 X

# values of X
X = (-0.5 + runif(100)) * 30
# exact linear relation
alpha = 1; beta = .6
y = alpha + beta*X
plot(X, y, xlim=c(-15,15), ylim=c(-15,15), type="l", col="red",
     axes=FALSE, xlab="axis X", ylab="axis y")
abline(v=0)
abline(h=0)
axis(1, seq(-14,14,1), cex=.2)
axis(2, seq(-14,14,1), cex=.2)

Linear relation and sample data

sigma = 3
E = rnorm(length(X))
Y = y + sigma*E
lines(X, Y, type="p", col="blue")

Regression Model: Yi = a + b·Xi + ei, where the errors ei have mean zero and variance σ² and are normally distributed.

Sample Data: Scatterplot

# the sample data: scatterplot
plot(X, Y, xlim=c(-15,15), ylim=c(-15,15), type="p", col="blue",
     axes=FALSE, xlab="axis X", ylab="axis y")
abline(v=0)
abline(h=0)
axis(1, seq(-14,14,1), cex=.2)
axis(2, seq(-14,14,1), cex=.2)

Fitted regression line: a = 0.5789, b = 0.6270

# the sample data: scatterplot (as above), plus the fitted line
plot(X, Y, xlim=c(-15,15), ylim=c(-15,15), type="p", col="blue",
     axes=FALSE, xlab="axis X", ylab="axis y")
abline(v=0)
abline(h=0)
axis(1, seq(-14,14,1), cex=.2)
axis(2, seq(-14,14,1), cex=.2)
abline(lm(Y ~ X), col="green")

Fitted and true regression lines: fitted a = 0.5789, b = 0.6270; true a = 1, b = .6

abline(c(alpha, beta), col="red")

Fitted and true regression lines in repeated (20) sampling; true a = 1, b = .6

# values of X
X = (-0.5 + runif(100)) * 30
# exact linear relation
alpha = 1; beta = .6
y = alpha + beta*X
plot(X, y, xlim=c(-15,15), ylim=c(-15,15), type="l", col="red",
     axes=FALSE, xlab="axis X", ylab="axis y")
abline(v=0)
abline(h=0)
axis(1, seq(-14,14,1), cex=.2)
axis(2, seq(-14,14,1), cex=.2)
abline(c(alpha, beta), col="red")
for (i in 1:20) {
  # sample data
  sigma = 3
  E = rnorm(length(X))
  Y = y + sigma*E
  abline(lm(Y ~ X), col="green")
}

OLS estimate of beta (under repeated sampling); true a = 1, b = .6

Estimates of beta for 100 different samples: 0.619 0.575 0.636 0.543 0.555 0.594 0.611 0.584 0.576 ...

> mean(bs)
[1] 0.6042086
> sd(bs)
[1] 0.03599894

# values of X
X = (-0.5 + runif(100)) * 30
# exact linear relation
alpha = 1; beta = .6
y = alpha + beta*X
plot(X, y, xlim=c(-15,15), ylim=c(-15,15), type="l", col="red",
     axes=FALSE, xlab="axis X", ylab="axis y")
abline(v=0)
abline(h=0)
axis(1, seq(-14,14,1), cex=.2)
axis(2, seq(-14,14,1), cex=.2)
bs = c()
for (i in 1:100) {
  # sample data
  sigma = 3
  E = rnorm(length(X))
  Y = y + sigma*E
  bs = c(bs, lm(Y ~ X)$coefficients[2])
  abline(lm(Y ~ X), col="green")
  abline(c(alpha, beta), col="red")
}
plot(bs, ylim=c(.3,.9), col="blue")
abline(h=.6, col="red")
mean(bs)
sd(bs)

REGRESSION: analysis of the simulated data (with R and other software)

Fitted regression line: true a = 1, b = .6; fitted a = 1.0232203, b = 0.6436286

# sample data
sigma = 3
E = rnorm(length(X))
Y = y + sigma*E
# the sample data: scatterplot
plot(X, Y, xlim=c(-15,15), ylim=c(-15,15), type="p", col="blue",
     axes=FALSE, xlab="axis X", ylab="axis y")
abline(v=0)
abline(h=0)
axis(1, seq(-14,14,1), cex=.2)
axis(2, seq(-14,14,1), cex=.2)
# fitted regression line
b = sum((Y - mean(Y)) * (X - mean(X))) / sum((X - mean(X))^2)
a = mean(Y) - b*mean(X)
abline(c(a, b), col="green")
abline(c(alpha, beta), col="red")
c(a, b)

Regression Analysis

regression = lm(Y ~ X)
summary(regression)

Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
-6.0860 -2.1429 -0.1863  1.9695  9.4817

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.0232     0.3188    3.21  0.00180 **
X             0.6436     0.0377   17.07  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.182 on 98 degrees of freedom
Multiple R-Squared: 0.7483, Adjusted R-squared: 0.7458
F-statistic: 291.4 on 1 and 98 DF, p-value: < 2.2e-16

Regression Analysis with Stata

. use "E:\Albert\COURSES\cursDAS\AS2003\data\MONT.dta", clear
. regress y x

  Source |       SS       df       MS            Number of obs =     100
---------+------------------------------         F(  1,    98) =  291.42
   Model |  2950.73479     1  2950.73479         Prob > F      =  0.0000
Residual |  992.280727    98  10.1253135         R-squared     =  0.7483
---------+------------------------------         Adj R-squared =  0.7458
   Total |  3943.01551    99  39.8284395         Root MSE      =   3.182

------------------------------------------------------------------------------
       y |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |  .6436286   .0377029    17.071   0.000      .5688085    .7184488
   _cons |   1.02322   .3187931     3.210   0.002      .3905858    1.655855

. predict yh
. graph yh y x, c(s.) s(io)

Regression analysis with SPSS

Estimation

Scatterplot

Fitted Regression: Ŷi = 1.02 + .64 Xi, R² = .74; s.e.: (.037); t-value: 17.07. The regression coefficient of X is significant (at the 5% significance level), with the expected value of Y increasing by .64 for each unit increase of X. The 95% confidence interval for the regression coefficient is [.64 − 1.96·.037, .64 + 1.96·.037] = [.57, .71]. 74% of the variation of Y is explained by the variation of X.
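A minimal R sketch of the same interval, using the regression object fitted above; confint() uses the exact t quantile, whereas 1.96 is the normal approximation.

# 95% confidence interval for the coefficients of lm(Y ~ X)
confint(regression, level = 0.95)
# by hand, with the t quantile on n - 2 = 98 degrees of freedom:
0.6436 + c(-1, 1) * qt(0.975, 98) * 0.0377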

Fitted regression line

. graph yh y x, c(s.) s(io)

Residual plot

. graph e x, yline(0)

OLS analysis

Variance decomposition
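A minimal sketch for the model fitted above: the total sum of squares splits into a regression part and a residual part, SST = SSR + SSE, and R² = SSR/SST.

Yhat = fitted(regression)
SST = sum((Y - mean(Y))^2)            # total sum of squares
SSR = sum((Yhat - mean(Y))^2)         # regression (model) sum of squares
SSE = sum(residuals(regression)^2)    # residual (error) sum of squares
c(SST, SSR + SSE)                     # equal, up to rounding
SSR / SST                             # equals summary(regression)$r.squared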

Properties of OLS estimation

Sampling distributions

Inferences

Student-t distribution

Significance tests

F-test
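A hedged sketch in R: the overall F-test reported by summary() can be reproduced by comparing the fitted model against the intercept-only null model.

null = lm(Y ~ 1)
anova(null, regression)   # F = 291.4 on 1 and 98 df for the simulated data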

Prediction of Y
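A minimal sketch of prediction in R; interval = "confidence" bounds the mean E(Y | X), while interval = "prediction" bounds an individual new observation and is therefore wider. The new X values are arbitrary.

newdata = data.frame(X = c(-5, 0, 5))
predict(regression, newdata, interval = "confidence")
predict(regression, newdata, interval = "prediction")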

Multiple Regression:

t-test and CI

F-test

Confidence bounds for Y

Interpreting multiple regression by means of simple regression
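A sketch of the idea (with made-up data and illustrative variable names): in a multiple regression of y on x1 and x2, the coefficient on x1 equals the slope of a simple regression between two sets of residuals, y adjusted for x2 and x1 adjusted for x2.

set.seed(2)
x1 = rnorm(100)
x2 = 0.5*x1 + rnorm(100)
yy = 1 + 2*x1 - x2 + rnorm(100)
coef(lm(yy ~ x1 + x2))["x1"]    # multiple-regression coefficient on x1
ry = residuals(lm(yy ~ x2))     # y with x2 partialled out
r1 = residuals(lm(x1 ~ x2))     # x1 with x2 partialled out
coef(lm(ry ~ r1))["r1"]         # the same coefficient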

Adjusted R2
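As a sketch for the model fitted above: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), with p the number of predictors, which penalizes adding predictors that do not improve the fit.

r2 = summary(regression)$r.squared
n = length(Y); p = 1
1 - (1 - r2) * (n - 1) / (n - p - 1)
summary(regression)$adj.r.squared   # the same value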

Example of Regression Analysis. Data from paisos.sav.

Variables

Matrix Plot: the variables need to be transformed!

Transformation of the variables

Matrix Plot: (approximately) linear relations between the transformed variables!

Regression Analysis

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT espvida
  /METHOD=ENTER calories logpib logmetg
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /SAVE RESID.

Residuals vs. fitted y:

Partial regression:

Partial regression:

Partial regression

Regression Analysis (with cal2 added)

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT espvida
  /METHOD=ENTER calories logpib logmetg cal2
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /SAVE RESID.

Residuals vs. fitted y

Case statistics: case misfit (residuals); potential for influence (leverage); influence.

Residuals

Hat matrix

Influence Analysis
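A minimal sketch of these case statistics in R, applied to the simple-regression fit on the simulated data:

lev = hatvalues(regression)        # leverages: diagonal of the hat matrix
rs  = rstandard(regression)        # standardized residuals
rt  = rstudent(regression)         # studentized (deleted) residuals
cd  = cooks.distance(regression)   # Cook's D
dft = dffits(regression)           # DFFITS
db  = dfbetas(regression)          # DFBETAS, one column per coefficient
head(cbind(lev, rs, rt, cd, dft))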

Diagnostic case statistics. After fitting the regression, use the instruction fpredict:

fpredict namevar        predicted value of y
  , cooksd              Cook's D influence measure
  , dfbeta(x1)          DFBETAS for the regression coefficient on variable x1
  , dfits               DFFITS influence measure
  , hat                 diagonal elements of the hat matrix (leverage)
  , leverage            (same as hat)
  , residuals           residuals
  , rstandard           standardized residuals
  , rstudent            studentized residuals
  , stdf                standard error of the predicted individual y (standard error of forecast)
  , stdp                standard error of the predicted mean y
  , stdr                standard error of the residuals
  , welsch              Welsch's distance influence measure
...

display tprob(47, 2.337)
sort namevar
list x1 x2 in 1/5

Leverages after fit …

. fpredict lev, leverage
. gsort -lev
. list nombre lev in 1/5

                                     nombre        lev
  1.  RECAGUREN SOCIEDAD LIMITADA,             .0803928
  2.  EL MARQUES DEL AMPURDAN S,L,             .0767692
  3.  ADOBOS Y DERIVADOS S,L,                  .0572497
  4.  CONSERVAS GARAVILLA SA                   .0549707
  5.  PLANTA DE ELABORADOS ALIMENTICIOS MA     .0531497

Box plot of leverage values

. predict lev, leverage
. graph lev, box s([_n])

Cases with extreme leverages

Leverage versus residual square plot . lvr2plot, s([_n])

Dfbeta's:

. fpredict beta, dfbeta(nt_paau)
. graph beta, box s([_n])

Regression: Outliers, basic idea

Regression: outlier indicators

Indicator   Description                                                                              Rule of thumb (when "wrong")
Resid       Residual: actual − predicted                                                             −
ZResid      Standardized residual: residual divided by the standard deviation of the residuals       > 3 in absolute value
SResid      Studentized residual: residual divided by its standard deviation at that particular point in X-space
DResid      Difference between the residual and the residual when the data point is deleted
SDResid     DResid, standardized by the standard deviation at that particular point in X-space

Regression: Outliers, in SPSS

Regression: Influential Points, Basic Idea Influential Point (no outlier!)

Regression: influential-point indicators (see the R sketch below)

Indicator   Description                                                                                     Rule of thumb
Lever       Distance between the point and the other points (NB: potentially influential)                   > 2(p+1)/n
Cook        Change in the residuals of the other cases when a given case is left out of the regression      > 1
DfBeta      Difference between the betas when the case is and is not included in the model (one per beta)   −
SDfBeta     DfBeta divided by its standard error (one per beta)                                             > 2/√n
DfFit       Difference between the predicted value when the case is and is not included in the model
SDfFit      DfFit divided by its standard deviation                                                         > 2·√(p/n)
CovRatio    Change in the variances/covariances when the point is left out                                  |CovRatio − 1| > 3p/n
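A hedged sketch applying these rules of thumb in R to the simple-regression fit (n = 100, with k = p + 1 parameters counting the intercept); base R also bundles them in influence.measures(), which flags cases exceeding its built-in cutoffs.

n = 100; k = 1 + 1    # intercept + 1 predictor
which(hatvalues(regression) > 2 * k / n)           # high leverage
which(cooks.distance(regression) > 1)              # large Cook's D
which(abs(dffits(regression)) > 2 * sqrt(k / n))   # large DFFITS
which(abs(covratio(regression) - 1) > 3 * k / n)   # extreme CovRatio
summary(influence.measures(regression))            # flagged cases only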

Regression: Influential points, in SPSS Case 2 is an influential point

Regression: influential points, what to do? Nothing at all; check the data; or delete the atypical data points and repeat the regression without them ("to delete a point or not is an issue statisticians disagree on").

MULTICOLLINEARITY Diagnostic tools

Regression: multicollinearity. If the predictors are highly correlated, we speak of multicollinearity. Is this a problem? If you want to assess the influence of each predictor, yes, because the standard errors blow up, making coefficients non-significant (see the sketch below).
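A minimal sketch of the mechanism (made-up data): VIFj = 1/(1 − Rj²), where Rj² comes from regressing predictor j on the other predictors; with nearly collinear predictors the VIF, and hence the standard error, explodes. (If the car package is available, car::vif() computes this directly.)

set.seed(3)
x1 = rnorm(100)
x2 = x1 + 0.1*rnorm(100)                 # nearly collinear with x1
yy = 1 + x1 + x2 + rnorm(100)
r2_1 = summary(lm(x1 ~ x2))$r.squared    # R² of x1 on the other predictor
1 / (1 - r2_1)                           # VIF for x1: very large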

Analyzing the math data

. use "G:\Albert\COURSES\cursDAS\AS2003b\data\mat.dta", clear
. save "G:\Albert\COURSES\CursMetEstad\Curs2004\Metodes\mathdata.dta"
file G:\Albert\COURSES\CursMetEstad\Curs2004\Metodes\mathdata.dta saved
. edit
- preserve
. gen perform = (nt_m1 + nt_m2 + nt_m3)/3
(110 missing values generated)
. corr perform nt_paau nt_acces nt_exp
(obs=189)
         |  perform  nt_paau nt_acces   nt_exp
---------+------------------------------------
 perform |   1.0000
 nt_paau |   0.3535   1.0000
nt_acces |   0.5057   0.8637   1.0000
  nt_exp |   0.5002   0.3533   0.7760   1.0000

. outfile nt_exp nt_paau nt_acces perform using "G:\Albert\COURSES\CursMetEstad\Curs2004\Metodes\mathdata.dat"

Multiple regression: perform vs nt_acces nt_paau (perform = performance in Mathematics I to III)

. regress perform nt_acces nt_paau

  Source |       SS       df       MS            Number of obs =     245
---------+------------------------------         F(  2,   242) =   31.07
   Model |  71.1787647     2  35.5893823         Prob > F      =  0.0000
Residual |  277.237348   242  1.14560888         R-squared     =  0.2043
---------+------------------------------         Adj R-squared =  0.1977
   Total |  348.416112   244  1.42793489         Root MSE      =  1.0703

------------------------------------------------------------------------------
 perform |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
nt_acces |  1.272819   .2427707    5.243   0.000      .7946054    1.751032
 nt_paau | -.2755092   .1835091   -1.501   0.135     -.6369882    .0859697
   _cons | -1.513124   .9729676   -1.555   0.121      -3.42969    .4034425

Collinearity

Diagnostics for multicollinearity

. corr nt_paau nt_exp nt_acces
(obs=276)
         |  nt_paau   nt_exp nt_acces
---------+---------------------------
 nt_paau |   1.0000
  nt_exp |   0.3435   1.0000
nt_acces |   0.8473   0.7890   1.0000

. fit perform nt_paau nt_exp nt_acces
. vif
Variable |      VIF       1/VIF
---------+----------------------
nt_acces |  1201.85     0.000832
 nt_paau |   514.27     0.001945
  nt_exp |   384.26     0.002602
---------+----------------------
Mean VIF |   700.13

VIF = 1/(1 − Rj²), where Rj² comes from regressing predictor j on the other predictors; the ratio 1/VIF is called the tolerance. Any explanatory variable with a VIF greater than 5 (or a tolerance less than .2) shows a degree of collinearity that may be problematic.

In the case of just nt_paau and nt_exp we get:

. vif
Variable |      VIF       1/VIF
---------+----------------------
  nt_exp |     1.14     0.875191
 nt_paau |     1.14     0.875191
---------+----------------------
Mean VIF |     1.14

Multiple regression: perform vs nt_paau nt_exp

. regress perform nt_paau nt_exp

  Source |       SS       df       MS            Number of obs =     189
---------+------------------------------         F(  2,   186) =   37.24
   Model |  75.2441994     2  37.6220997         Prob > F      =  0.0000
Residual |  187.897174   186  1.01019986         R-squared     =  0.2859
---------+------------------------------         Adj R-squared =  0.2783
   Total |  263.141373   188  1.39968815         Root MSE      =  1.0051

------------------------------------------------------------------------------
 perform |     Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
 nt_paau |  .3382551   .1109104    3.050   0.003       .119451    .5570593
  nt_exp |  .9040681   .1396126    6.476   0.000      .6286403    1.179496
   _cons | -3.295308   1.104543   -2.983   0.003     -5.474351   -1.116266

. corr nt_exp nt_paau nt_acces
(obs=276)
         |   nt_exp  nt_paau nt_acces
---------+---------------------------
  nt_exp |   1.0000
 nt_paau |   0.3435   1.0000
nt_acces |   0.7890   0.8473   1.0000

. predict yh
(option xb assumed; fitted values)
(82 missing values generated)
. predict e, resid
(169 missing values generated)

Regression: multicollinearity indicators

Indicator                   Description                                                                              Rule of thumb (when "wrong")
Overall F-test vs.          The overall F-test is significant, but the individual coefficients are not               −
tests of coefficients
Beta                        Standardized coefficient                                                                 outside [−1, +1]
Tolerance                   Unique variance of a predictor (not shared with / explained by the other predictors)    < 0.01 per coefficient
Variance Inflation Factor   √VIF indicates how much the standard error of a coefficient is inflated by the          > 10 per coefficient
                            correlation between that predictor and the other predictors
Eigenvalues                 ...rather technical...                                                                   around 0
Condition Index                                                                                                      > 30
Variance Proportion         ...rather technical... look at the "loadings" on the dimensions                          loadings around 1

Regression: Multicollinearity, in SPSS diagnostics

Regression: multicollinearity, in SPSS. Beta > 1; tolerance and VIF look fine.

Regression: multicollinearity, in SPSS. Two eigenvalues around 0; the condition index looks fine; these variables are the ones causing the multicollinearity.

Regression: multicollinearity, what to do? Nothing (if there is no interest in the individual coefficients, only in good prediction); leave one or more predictors out; or use PCA to reduce the highly correlated variables to a smaller number of uncorrelated variables.

Categorical Variables. Use: Survey_sample.sav, in i/.../data

Salary vs. gender | years of education | work status

Creating dummy (dichotomous) variables (see the R sketch below)

GET FILE='G:\Albert\Web\Metodes2005\Dades\survey_sample.sav'.
COMPUTE D1 = wrkstat=1.
EXECUTE.
COMPUTE D2 = wrkstat=2.
COMPUTE D3 = wrkstat=3.
COMPUTE D4 = wrkstat=4.
COMPUTE D5 = wrkstat=5.
COMPUTE D6 = wrkstat=6.
COMPUTE D7 = wrkstat=7.
COMPUTE D8 = wrkstat=8.
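In R the same dummies come for free; a hedged sketch with a made-up wrkstat: a factor in a model formula is automatically expanded into one dummy per category, minus the reference category.

wrkstat = factor(sample(1:8, 200, replace = TRUE))   # hypothetical work status
head(model.matrix(~ wrkstat))   # columns wrkstat2 ... wrkstat8; category 1 is the reference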

Blockwise regression

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT rincome
  /METHOD=ENTER sex
  /METHOD=ENTER d2 d3 d4 d5 d6 d7 d8.

Blockwise regression, now entering educ as a second block (see the R sketch below)

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT rincome
  /METHOD=ENTER sex
  /METHOD=ENTER educ
  /METHOD=ENTER d2 d3 d4 d5 d6 d7 d8.
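A sketch of the same blockwise entry in R (variable names mirror the SPSS syntax; the data frame survey is hypothetical): fit nested models and test each R² change with an F-test.

m1 = lm(rincome ~ sex, data = survey)
m2 = update(m1, . ~ . + educ)                               # block 2
m3 = update(m2, . ~ . + d2 + d3 + d4 + d5 + d6 + d7 + d8)   # block 3
anova(m1, m2, m3)   # each row tests the increment of one block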

Categorical Predictors: is income dependent on years of age and religion?

Categorical Predictors: compute a dummy variable for each category, except the last.

Categorical Predictors And so on…

Categorical Predictors Block 1

Categorical Predictors Block 2

Categorical Predictors Ask for R2 change

Categorical Predictors: look at the R² change to judge the importance of the categorical variable.