1 Multivariate Analysis

2 Introduction
Defined as "all statistical techniques which simultaneously analyze more than two variables on a sample of observations"
Broadly classified into two categories: dependency techniques and interdependency techniques
Dependency techniques are those that deal with problems involving one or more dependent variables, while the remaining variables are considered independent
Dependency techniques are further classified according to the measurement scales and the number of dependent variables in the problem
Interdependency techniques are those multivariate techniques that deal with problems involving more than two variables, where the variables are not segregated into dependent and independent variables. These techniques aim at analyzing the interrelationships between the variables.

3 Dependency techniques
Aim at explaining or predicting one or more dependent variables based on two or more independent variables.
Focus is on defining a relationship between one dependent variable and the many independent variables that affect it.
Some prominent techniques include regression analysis, discriminant analysis, MANOVA and canonical correlation analysis.
Selection of an appropriate technique depends on two criteria: the number of dependent variables in the problem and the measurement scale used.
– Multiple regression analysis: one dependent variable, measured on an interval or ratio scale
– Multiple discriminant analysis: one dependent variable, measured on a non-metric (i.e. ordinal or nominal) scale
– MANOVA and canonical correlation analysis: more than one dependent variable

4 Interdependency techniques
Used in situations where no distinction is made between independent and dependent variables; instead, the interdependent relationships between the variables are examined.
Prominent techniques include factor analysis, cluster analysis, metric multidimensional scaling and non-metric multidimensional scaling.
Selection of an appropriate technique depends on the measurement scale used in the problem
– Factor analysis, cluster analysis and metric multidimensional scaling: metric data
– Non-metric multidimensional scaling: non-metric data

5 Today's session
Correlation
Regression analysis
The least squares estimation method
SPSS and regression output

6 Correlation
Correlation measures to what extent two (or more) variables are related
– Correlation expresses a relationship that is not necessarily precise (e.g. height and weight)
– Positive correlation indicates that the two variables move in the same direction
– Negative correlation indicates that they move in opposite directions

7 Covariance
Covariance measures the "joint variability" of two variables:
Cov(X, Y) = E[(X - E[X])(Y - E[Y])]
where E(…) indicates the expected value (i.e. average value).
If two variables are independent, then the covariance is zero (however, Cov = 0 does not mean that the two variables are independent).
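A minimal sketch of the computation in Python (not part of the original slides; the data are invented):

import numpy as np

# Two toy variables with made-up values (e.g. height in cm, weight in kg)
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weight = np.array([55.0, 60.0, 66.0, 72.0, 80.0])

# np.cov returns the 2x2 covariance matrix; the off-diagonal element
# is the sample covariance Cov(X, Y)
cov_xy = np.cov(height, weight)[0, 1]
print(cov_xy)  # positive: the two variables move together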

8 Correlation coefficient
The correlation coefficient r gives a measure (in the range -1 to +1) of the relationship between two variables
– r = 0 means no correlation
– r = +1 means perfect positive correlation
– r = -1 means perfect negative correlation
Perfect correlation indicates that a p% variation in x corresponds to a p% variation in y

9 Correlation coefficient and covariance
The Pearson correlation coefficient is the covariance scaled by the standard deviations:
– Population: ρ = Cov(X, Y) / (σ_X σ_Y)
– Sample: r = s_xy / (s_x s_y)
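As an illustration (assumed, not from the slides), the sample coefficient can be computed directly from the covariance or with scipy; the toy data below are invented:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# r = s_xy / (s_x * s_y), using the sample (ddof=1) statistics
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# scipy returns the same r together with the two-tailed p value
r, p_value = stats.pearsonr(x, y)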

10 Bivariate and multivariate correlation
Bivariate correlation
– 2 variables
– Pearson correlation coefficient
Partial correlation
– The correlation between two variables after allowing for the effect of other "control" variables

11 Significance level in correlation
Level of correlation (value of the correlation coefficient): indicates to what extent the two variables "move together"
Significance of correlation (p value): given that the correlation coefficient is computed on a sample, indicates whether the relationship appears to be statistically significant
Examples
– Correlation is 0.50 but not significant: the sampling error is so high that the actual correlation could even be 0
– Correlation is 0.10 and highly significant: the level of correlation is very low, but we can be confident in the value of that correlation

12 Correlation and covariance in SPSS
Choose between bivariate & partial

13 Bivariate correlation
Select the variables you want to analyse
Require the significance level (two-tailed)
Ask for additional statistics (if necessary)

14 Bivariate correlation output

15 Partial correlations
List of variables to be analysed
Control variables

16 Partial correlation output
PARTIAL CORRELATION COEFFICIENTS – Controlling for: SIZE, STYLE

            AMTSPENT                USECOUP                 ORG
AMTSPENT    1.0000 (0)    P=.      .2677  (775) P=.000    -.0116 (775) P=.746
USECOUP      .2677 (775)  P=.000  1.0000  (0)   P=.        .0500 (775) P=.164
ORG         -.0116 (775)  P=.746   .0500  (775) P=.164    1.0000 (0)   P=.

(Coefficient / (D.F.) / 2-tailed significance; "." is printed if a coefficient cannot be computed)

Partial correlations still measure the correlation between two variables, but eliminate the effect of other variables, i.e. the correlations are computed on consumers shopping in stores of identical size and with the same shopping style
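A sketch of what SPSS is doing here, in Python: regress both variables of interest on the controls and correlate the residuals. The function is a hypothetical helper, not SPSS code; the arrays (amtspent, usecoup, and a controls matrix holding SIZE and STYLE) are assumed to be available.

import numpy as np

def partial_corr(x, y, controls):
    """Correlation of x and y after removing the linear effect of controls."""
    Z = np.column_stack([np.ones(len(x)), controls])  # add an intercept
    # OLS residuals of x and y on the control variables
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# e.g. partial_corr(amtspent, usecoup, controls) would reproduce the
# .2677 between amount spent and coupon use shown above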

17 Bivariate and partial correlations
Correlation between Amount spent and Use of coupon
– Bivariate correlation: 0.291 (p value 0.00)
– Partial correlation: 0.268 (p value 0.00)
The amount spent is positively correlated with the use of coupons (0 = no use, 1 = from newspaper, 2 = from mailing, 3 = both)
The level of correlation does not change much after accounting for different shop sizes and shopping styles

18 Regression Analysis

19 Variables
The dependent variable is what we are trying to predict; it is typically represented by Y.
The independent variable is a variable used to predict the dependent variable; it is typically represented by x.
Note that the independent variable predicts the dependent variable; it cannot be stated that the independent variable (x) causes changes in the dependent variable (Y).
Regression typically uses interval- or ratio-scaled variables as the independent and dependent variables. You can also use dummy coding (1, 0) for nominally scaled measures (a "1" if a characteristic is present, a "0" if that characteristic is absent).

20 Bivariate Regression
Bivariate linear regression (simple regression) investigates a straight-line relationship of the type
Y = a + bx + e
where Y is the dependent variable, x is the independent variable, and a and b are two constants to be estimated.
Regression basically fits the data to a straight line, where a is the intercept and b is the slope of the line.
SPSS fits the line to minimize the vertical distances between the points and the regression line. This is called the least squares criterion.

21 Conducting regression analysis: Procedure
1. Plot the scatter diagram
2. Formulate the general model
3. Estimate the parameters
4. Estimate the standardized regression coefficients
5. Test for significance
6. Determine the strength and significance of the association
7. Check the prediction accuracy
8. Estimate the residuals
9. Cross-validate the model

22 Linear regression analysis
y = a + b·x + e
where y is the dependent variable, a the intercept, b the regression coefficient, x the independent variable (explanatory variable, regressor…) and e the error

23 Regression analysis
[Scatter plot of y against x with the fitted regression line]

24 Example
We want to investigate if there is a relationship between cholesterol and age on a sample of 18 people
The dependent variable is the cholesterol level
The explanatory variable is age
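A sketch of how this example could be run outside SPSS, with statsmodels; the age and cholesterol values below are invented, not the slide's actual 18 observations:

import numpy as np
import statsmodels.api as sm

age = np.array([25, 32, 40, 46, 51, 58, 63, 70], dtype=float)
chol = np.array([180, 192, 205, 211, 223, 230, 241, 255], dtype=float)

X = sm.add_constant(age)        # adds the intercept a to the model
model = sm.OLS(chol, X).fit()   # least squares fit of chol = a + b*age
print(model.params)             # estimated a (const) and b (age)
print(model.summary())          # t-tests, F-test, R-square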

25 What regression analysis does
Determines whether a relationship exists between the dependent and explanatory variables
Determines how much of the variation in the dependent variable is explained by the independent variable (goodness of fit)
Allows us to predict the values of the dependent variable

26 Regression and correlation
Correlation: no causal relationship is assumed
Regression: we assume that the explanatory variables "cause" the dependent variable
– Bivariate: one explanatory variable
– Multivariate: two or more explanatory variables

27 How to estimate the regression coefficients
The objective is to estimate the population parameters a and b from our data sample
A good way to estimate them is by minimising the error e_i, which represents the difference between the actual observation and the estimated (predicted) one

28 The objective is to identify the line (i.e. the a and b coefficients) that minimises the distance between the actual points and the fitted line

29 The least squares method
This is based on minimising the square of the distance (error) rather than the distance itself:
min Σ e_i² = Σ (y_i - a - b·x_i)²
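For the bivariate case the minimisation has a closed-form solution: b = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)² and a = ȳ - b·x̄. A small sketch (illustrative, not the slide's own code):

import numpy as np

def least_squares_fit(x, y):
    """Closed-form OLS estimates for y = a + b*x + e."""
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return a, b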

30 Bivariate regression in SPSS

31 Regression dialog box
Dependent variable
Explanatory variable
Leave this unchanged!

32 Regression output
Value of the coefficients
Statistical significance: is the coefficient different from 0?

33 Model diagnostics: goodness of fit
The value of the R square lies between 0 and 1 and represents the proportion of total variation that is explained by the regression model

34 Bivariate Regression in SPSS: Results
19.0% of the variation in BI_Amtrak can be accounted for by A_Amtrak, meaning 81% of the variation is unaccounted for.
The equation is significantly better than chance, as evidenced by the F-value.
The significant t-value suggests that A_Amtrak belongs in the equation. The significant constant indicates there is considerable variation unexplained.
The unstandardized equation would be: Y = 3.132 + .507(A_Amtrak). Thus, if a subject had an A_Amtrak score of 2, the equation would predict Y = 3.132 + .507(2) = 4.146

35 R-square
R² = (variation explained by regression) / (total variation) = 1 - (residual variation) / (total variation)
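In code form (an assumed helper, for illustration):

import numpy as np

def r_square(y, y_hat):
    """Proportion of total variation explained by the regression."""
    ss_total = np.sum((y - y.mean()) ** 2)   # total variation
    ss_resid = np.sum((y - y_hat) ** 2)      # residual variation
    return 1 - ss_resid / ss_total           # explained / total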

36 Assumptions in regression analysis
The error term is normally distributed. For each fixed value of X, the distribution of Y is normal.
The means of all these normal distributions of Y, given X, lie on a straight line with slope b.
The mean of the error term is zero.
The variance of the error term is constant. This variance does not depend on the values assumed by X.
The error terms are uncorrelated. In other words, the observations have been drawn independently.

37 Multivariate regression
The principle is identical to bivariate regression, but there are more explanatory variables
The goodness of fit can be measured through the adjusted R-square, which takes into account the number of explanatory variables
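The usual adjustment, as a sketch (n is the sample size, k the number of explanatory variables):

def adjusted_r_square(r2, n, k):
    """Adjusted R-square: penalizes adding explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)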

38 Multiple Regression
Each coefficient in multiple regression is also known as a coefficient of partial regression; it assesses the relationship between its variable (X_i) and the dependent variable (Y) not accounted for by the other variables in the model.
Each variable introduced into the equation needs to account for variation in Y that has not been accounted for by any of the X variables already entered.
We typically assume that the X variables are uncorrelated with one another. If they are not uncorrelated, we have a problem of multicollinearity.

39 Multicollinearity
Multicollinearity is a problem in regression; it occurs when the independent variables are highly correlated with one another.
– Multicollinearity does not affect the model's overall ability to predict, but it can impact the interpretation of the individual beta coefficients, which describe the variation in the Y variable due to variation in an X variable.
– It also makes it difficult for the researcher to interpret the relative effect of the various independent variables on the dependent variable.
It can be handled by including more data so that the independent variables can be explained better; another way is to remove from the analysis a variable that has a high correlation with another variable. The correlated variables can also be combined into a single variable.
Multicollinearity can be assessed through the use of a statistic, the variance inflation factor (VIF)
– If VIF < 10, multicollinearity is not a problem.
– If VIF > 10, remove the variable from the independent variables and run the analysis again.
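A sketch of the VIF check with statsmodels (the design matrix X is simulated here; in practice it would hold your independent variables):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 3)))  # constant + 3 regressors

# One VIF per independent variable (index 0 is the constant, so skip it)
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
# Rule of thumb from the slide: VIF > 10 flags a problematic variable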

40 Dummy Variables
Dummy variables are used to transform categorical variables, like marital status and gender, into numerical variables for use in regression models. They can take values of 0 or 1.
If the variable has more than two categories, e.g. if a user has to be rated as a heavy user, medium user or light user, we need to keep one category aside as the reference category to prevent perfect multicollinearity.
Thus, for a variable which consists of n categories, we need to create n - 1 dummy variables instead of n dummy variables.
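A sketch with pandas (the column name and data are hypothetical); drop_first=True keeps one category as the reference, giving n - 1 dummies:

import pandas as pd

df = pd.DataFrame({"user_type": ["heavy", "medium", "light", "medium"]})
# 3 categories -> 2 dummy columns; "heavy" becomes the reference category
dummies = pd.get_dummies(df["user_type"], prefix="user", drop_first=True)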

41 Interpreting Regression Results
R²
– The coefficient of determination: it indicates the percentage of variation in Y explained by the variation in the independent variables (X_i). It determines the goodness of fit of your model (regression equation). It ranges from 0 to 1.
Std. error of the estimate
– It measures the accuracy of predictions made using the regression equation.
– The smaller the std. error of the estimate, the smaller the confidence interval (the more precise the prediction)

42 Interpreting Regression Results
F-values:
– The F-value determines whether the equation is better than chance. A p-value of .05 or lower indicates we would reject the null hypothesis that the independent variables are not related to the dependent variable.
– The F-value does not measure whether your model does a good job of predicting, only that it is better than chance.
T-tests:
– Examine the t-values to determine whether to include additional variables in the model. T-values should be statistically significant for a variable to be included in your analysis.

43 Interpreting Regression Results
Unstandardized coefficients (abbreviated as B)
– These are written in the metric of the measure, which makes them useful for prediction.
Standardized coefficients (beta)
– These are written in a standardized form, typically ranging from -1 to +1.
– The higher the absolute value of the standardized coefficient, the more important the predictor is to the model (i.e., the more unique variation in Y can be accounted for by that variable).
Introducing more variables into an equation typically explains more variation (increases R²), but each variable must be a significant contributor of otherwise unexplained variation to be included in the model (see the t-test results to determine this).

44 Multiple Regression in SPSS
Step 1
Step 2

45 Multiple Regression in SPSS: Results
1. Note that the circled t-values for two of the variables are not significant; these do not supply any unique variation to the prediction of the dependent variable, so they should be removed from the analysis.
2. Note the standardized coefficients (beta): the greater the beta, the more important a variable is to the prediction of the dependent variable.
3. Finally, note the size of the t-value for the constant; this suggests the model still has considerable unexplained variation.
Y = 3.219 + .235(A_amtrak, good/bad) + .245(A_amtrak, like/dislike) - .0638(A_auto, good/bad)

46 Multiple Regression in SPSS: Results
The model indicates that the five predictors account for 21.5% of the variation in A_amtrak. The F-value suggests that the equation is significantly better than chance.

47 Interpret this Output

48 Coefficient interpretation
The constant represents the amount spent when all other variables are 0 (£296.5)
The coefficients for health food stores, size of store and being vegetarian are not significantly different from 0
Gender coeff = -69.6: on average, being a woman (G = 1) implies spending £69.6 less
Shopping style coeff = +22.8·S
– S = 1 (shops for himself) = +22.8
– S = 2 (shops for himself & spouse) = +45.6
– S = 3 (shops for himself & family) = +68.4
Coupon use coeff = 30.4·C
– C = 1 (does not use coupons) = +30.4
– C = 2 (coupons from newspapers) = +60.8
– C = 3 (coupons from mailings) = +91.2
– C = 4 (coupons from both) = +121.6

49 Types of multiple regression
There are three types of MR; each is designed to answer a different question:
– Standard MR: to evaluate the relationships between a set of IVs and a DV.
– Hierarchical, or sequential, regression: used to examine the relationships between a set of IVs and a DV after controlling for the effects of some other IVs on the DV.
– Stepwise, or statistical, regression: used to identify the subset of IVs that has the strongest relationship to a DV.

50 Standard multiple regression
In standard MR, all IVs are entered into the regression equation at the same time
Multiple R and R² measure the strength of the relationship between the set of IVs and the DV
The F test is used to determine if the relationship can be generalized to the population represented by the sample
A t-test is used to evaluate the individual relationship between each IV and the DV

51 Hierarchical multiple regression
In hierarchical MR, the IVs are entered in two stages
In the 1st stage, the IVs we want to control for are entered into the regression
In the second stage, the IVs whose relationship we want to examine after the controls are entered
A statistical test of the change in R² from the first stage is used to evaluate the importance of the variables entered in the second stage
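The R² change test has a standard form, sketched below as a hypothetical helper: F = [(R²_full - R²_reduced)/m] / [(1 - R²_full)/(n - k - 1)], where m is the number of IVs added in stage 2 and k the total number of IVs in the full model.

def r_square_change_f(r2_full, r2_reduced, n, k_full, m_added):
    """F statistic for the change in R-square between the two stages."""
    numerator = (r2_full - r2_reduced) / m_added
    denominator = (1 - r2_full) / (n - k_full - 1)
    return numerator / denominator  # compare with F(m_added, n - k_full - 1)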

52 Stepwise multiple regression
Stepwise regression is designed to find the most parsimonious set of predictors that are most effective in predicting the DV
Variables are added to the regression equation one at a time, using the statistical criterion of maximizing the R² of the included variables
When none of the possible additions can make a statistically significant improvement in R², the analysis stops
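A rough sketch of one forward-selection variant with statsmodels (SPSS's own stepwise entry and removal criteria differ in detail; the function and its stopping rule are illustrative assumptions):

import statsmodels.api as sm

def forward_stepwise(y, X, alpha=0.05):
    """Greedily add the candidate column whose coefficient is most significant."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = fit.pvalues[-1]  # p-value of the newly added variable
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                       # no statistically significant improvement
        selected.append(best)
        remaining.remove(best)
    return selected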

53 Canonical Correlation

54 Canonical Correlation Analysis
It is a way of measuring the linear relationship between two multidimensional variables
It is an extension of multiple regression analysis
Multiple regression analysis analyzes the linear relationship between a dependent variable and multiple independent variables, whereas CCA analyzes the linear relationship between multiple independent variables and multiple dependent variables

55 Canonical Correlation Analysis
Example: a social researcher wants to know the relationship between various work environment factors (such as work culture, HR policies, compensation structure, top management) and various elements of employee behaviour (employee productivity, attrition rate, job satisfaction, etc.)
The linear combination formed for each set of variables is called a canonical variate. CCA tries to maximize the correlation between the two canonical variates
The coefficients of each canonical variate are called canonical coefficients. All conclusions are drawn from the relative magnitudes and signs of the canonical coefficients of each variable
CCA is not as popular as regression because it is a complex statistical tool involving a lot of effort
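A sketch with scikit-learn (simulated data standing in for the work environment and employee behaviour measures):

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # e.g. four work environment factors
Y = rng.normal(size=(100, 3))   # e.g. three employee behaviour measures

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)            # the canonical variates
r1 = np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1]  # first canonical correlation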

56 MANOVA

57 MANOVA
Examines the relationship between several dependent variables and several independent variables
Tries to examine whether there is any difference between the various dependent variables with respect to the independent variables
The difference from ANOVA is that while ANOVA deals with problems containing one dependent variable and several independent variables, MANOVA deals with problems containing several dependent variables and several independent variables
Another major difference is that ANOVA testing ignores the interrelationships between the dependent variables, which leads to biased results; MANOVA takes care of this aspect by testing the mean differences between groups on two or more dependent variables simultaneously
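A sketch with statsmodels (simulated data; the column names are hypothetical):

import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "productivity": rng.normal(size=60),
    "satisfaction": rng.normal(size=60),
    "department": np.repeat(["A", "B", "C"], 20),
})

# Two dependent variables tested against one grouping factor at once
fit = MANOVA.from_formula("productivity + satisfaction ~ department", data=df)
print(fit.mv_test())  # Wilks' lambda, Pillai's trace, etc.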

58 DISCRIMINANT ANALYSIS

59 Discriminant Analysis
A technique used for classifying a set of observations into predefined groups based on a set of variables known as predictors or input variables
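A sketch with scikit-learn (simulated two-group data):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),   # predictors, group 0
               rng.normal(2, 1, size=(50, 2))])  # predictors, group 1
y = np.array([0] * 50 + [1] * 50)                # predefined group labels

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[1.0, 1.0]]))                 # predicted group membership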

