1
Maureen Meadows, Senior Lecturer in Management, Open University Business School

2
• Discuss different forms of dependency and correlation that we might find in our datasets
• Explore why it might sometimes be problematic to get a good measure of correlation
• Look at examples of data analysis where correlation is an important issue
• Discuss why it is important to deal with correlations, and not to forget about them!

3
• Dependence is any statistical relationship between two random variables, or two sets of data
• In other words, they are not independent of each other
• E.g. the relationship between the height of parents and the height of children, or the relationship between the demand for a product and its price

4
• Two variables are said to be correlated if changes in one variable are associated with changes in the other variable
• So, if we know how one variable is changing, we have a good idea how the other variable is likely to be changing too
• Hence correlation is widely used in many forms of forecasting

5
• Pearson’s product moment correlation coefficient is also known as r, R, or Pearson’s r
• It is a measure of the strength and direction of the association - in particular, the linear relationship - between two (metric) variables
• It is defined as the covariance of the variables divided by the product of their standard deviations: r = cov(X, Y) / (sd(X) · sd(Y))
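
A minimal sketch in Python (NumPy), using hypothetical height data, checking the definition above against NumPy’s built-in calculation:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of
    their standard deviations (population versions, ddof = 0)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Hypothetical data: parents' vs children's heights in cm
parents  = np.array([160, 165, 170, 175, 180, 185])
children = np.array([163, 166, 168, 178, 179, 188])

print(pearson_r(parents, children))          # hand-rolled definition
print(np.corrcoef(parents, children)[0, 1])  # NumPy built-in, should agree
```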

6
• It is a measure of the linear dependence or correlation between the two variables
• The sign (+ or -) indicates the direction of the relationship
• The value can range from -1 to +1, with +1 indicating a perfect positive relationship, 0 indicating no linear relationship, and -1 indicating a perfect negative or reverse relationship

7
• A rank correlation is any of several statistics that measure the relationship between rankings of different ordinal variables, or between different rankings of the same variable, where a "ranking" is the assignment of the labels "first", "second", "third", etc. to different observations of a particular variable
• A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them

8
• Spearman’s rank correlation coefficient is a measure of how well the relationship between two variables can be described by a monotonic function
• Kendall’s tau rank correlation coefficient is a measure of the proportion of ranks that match between two data sets
• Goodman and Kruskal’s gamma is a measure of the strength of association of cross-tabulated data when both variables are measured at the ordinal level
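
A short illustration in Python (SciPy) of the first two measures, using hypothetical rankings of eight products by two judges (Goodman and Kruskal’s gamma is not available in SciPy, so it is omitted):

```python
from scipy.stats import spearmanr, kendalltau

# Hypothetical example: two judges ranking the same eight products
judge_a = [1, 2, 3, 4, 5, 6, 7, 8]
judge_b = [2, 1, 4, 3, 5, 7, 6, 8]

rho, rho_p = spearmanr(judge_a, judge_b)   # Spearman's rank correlation
tau, tau_p = kendalltau(judge_a, judge_b)  # Kendall's tau

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Kendall tau  = {tau:.3f} (p = {tau_p:.3f})")
```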

9
• Is a correlation coefficient significantly different from zero or not?
• Example: three different correlation coefficients: 0.50, 0.35, and 0.17
• Assume that we want to test whether there is a significant relationship between the two variables at hand. The null hypothesis (H0) to be tested is that these r values are not statistically different from zero (rho = 0)

10
• For rho = 0, H0 can be tested using a two-tailed t-test at a given confidence level, usually 95%
• If t calculated ≥ t table, H0 is rejected
• If t calculated < t table, H0 is not rejected and there is no significant correlation between the variables
• Here t calculated is computed as t = r / SEr = r·√[(n − 2) / (1 − r²)], with n − 2 degrees of freedom, while t table values are obtained from standard t tables

11
• For n = 14, none of the three r values (0.50, 0.35, and 0.17) is statistically different from zero
• For n = 30, r = 0.50 is statistically different from zero, while r = 0.35 and r = 0.17 are not
• In other words, r = 0.50 is not statistically different from zero when n is 14 or less, while r = 0.35 is not different from zero when n is 30 or less
• Finally, r = 0.17 is not statistically different from zero at any of the sample sizes tested
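
These results can be reproduced with a short Python (SciPy) script, assuming a 95% confidence level (alpha = 0.05):

```python
import math
from scipy.stats import t as t_dist

def r_significance(r, n, alpha=0.05):
    """Two-tailed t-test of H0: rho = 0 for a sample correlation r."""
    df = n - 2
    t_calc = r * math.sqrt(df / (1 - r**2))
    t_table = t_dist.ppf(1 - alpha / 2, df)  # two-tailed critical value
    return t_calc, t_table, abs(t_calc) >= t_table

# The three coefficients from the slides, at the two sample sizes
for n in (14, 30):
    for r in (0.50, 0.35, 0.17):
        t_calc, t_table, reject = r_significance(r, n)
        print(f"n={n:2d}, r={r:.2f}: t={t_calc:.2f}, "
              f"t_crit={t_table:.2f}, reject H0: {reject}")
```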

12
• It is a table showing the correlations between a set of variables
• Often inspected before multivariate methods are applied, e.g. regression analysis or factor analysis
• Examples: Kim et al (2011), Mohammed (2013)
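
A minimal sketch in Python (pandas) of producing such a matrix, using made-up survey-style data (the variable names are illustrative only):

```python
import pandas as pd

# Hypothetical survey responses for three variables that might later
# feed into a regression or factor analysis
df = pd.DataFrame({
    "usage":     [5, 3, 4, 2, 5, 1, 4, 3],
    "pleasure":  [4, 3, 5, 2, 4, 2, 5, 3],
    "intention": [5, 2, 4, 1, 5, 1, 4, 2],
})

print(df.corr())  # Pearson correlation matrix (method="spearman" for ranks)
```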

13
• The square of the correlation coefficient, typically denoted r², is called the coefficient of determination
• It estimates the fraction of the variance in Y that is explained by X in a simple linear regression
• Example: McDaniel (1981)
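
A brief illustration in Python (SciPy), with hypothetical data, showing r² obtained from a simple linear regression:

```python
from scipy.stats import linregress

# Hypothetical data: advertising spend (X) vs sales (Y)
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

result = linregress(x, y)
print(f"r = {result.rvalue:.3f}, r^2 = {result.rvalue ** 2:.3f}")
# r^2 is the fraction of the variance in y explained by x
```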

14
• An expression of the relationship between two variables (collinearity), or more than two variables (multicollinearity)
• Two variables exhibit complete collinearity if their correlation coefficient is 1 (or -1), and a complete lack of collinearity if their correlation coefficient is 0
• Multicollinearity occurs when a variable is highly correlated with a set of other variables

15
• Multicollinearity is the extent to which a variable can be ‘explained’ by the other variables in the analysis
• As multicollinearity increases, it complicates the interpretation of the variate (the linear combination of variables formed in a technique such as regression)
• It becomes more difficult to ascertain the effect of any single variable, because of the inter-relationships among the variables

16
• An indicator of the effect that the other independent variables have on the standard error of a regression coefficient
• Large VIF values indicate a high degree of collinearity or multicollinearity among the independent variables
• The VIF is directly related to the tolerance value (VIF = 1 / tolerance); see the sketch after the next slide
• Example: Zhou et al (2013)

17
• Another commonly used measure of collinearity and multicollinearity
• The tolerance of a variable is 1 − R², where R² is the coefficient of determination for the prediction of that variable by the other independent variables
• As the tolerance grows smaller, the variable is more highly predicted by the other independent variables (collinearity)
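
A minimal sketch in Python (statsmodels), using synthetic data in which one predictor is deliberately close to the sum of the other two, so it should show a high VIF and a low tolerance (all variable names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors; x3 is nearly x1 + x2, creating multicollinearity
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.1, size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

X_const = sm.add_constant(X)  # VIF is computed on the full design matrix
for i, name in enumerate(X.columns, start=1):  # index 0 is the constant
    vif = variance_inflation_factor(X_const.values, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```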

18
• Variables that are multicollinear are implicitly weighted more heavily
• E.g. if we cluster on 10 equally weighted variables, and they form two dimensions (one of 8 variables and the other of the remaining 2), then the first dimension will have four times as many chances to affect the similarity measure, as the sketch below illustrates
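
A quick numerical illustration in Python (NumPy), using synthetic data in which 8 of 10 variables track one latent dimension and the other 2 track a second; the 8-variable dimension contributes roughly four times as much to squared Euclidean distances:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
f1 = rng.normal(size=n)  # latent dimension behind the first 8 variables
f2 = rng.normal(size=n)  # latent dimension behind the remaining 2

# Ten nominally equally weighted variables: 8 near-copies of f1, 2 of f2
X = np.column_stack(
    [f1 + rng.normal(scale=0.1, size=n) for _ in range(8)]
    + [f2 + rng.normal(scale=0.1, size=n) for _ in range(2)]
)

# Squared Euclidean distance decomposes variable by variable, so we can
# total each variable's contribution over all pairs of observations
diffs2 = (X[:, None, :] - X[None, :, :]) ** 2
contrib = diffs2.sum(axis=(0, 1))
share = contrib[:8].sum() / contrib.sum()
print(f"share of total distance from the 8 collinear variables: {share:.2f}")  # ~0.8
```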

19
• Some degree of multicollinearity is desirable, because the objective is to identify sets of variables that are interrelated
• Examples: Kim et al (2011), Mohammed (2013)

20
• Hair, Anderson, Tatham and Black, Multivariate Data Analysis, 7th edition (Prentice Hall, 2009)
• Kim, JY, Shim, JP and Ahn, KM (2011) ‘Social Networking Service: Motivation, Pleasure, and Behavioral Intention to Use’, Journal of Computer Information Systems, Summer, 92-101
• McDaniel, SW (1981) ‘Multicollinearity in advertising-related data’, Journal of Advertising Research, 21:3, 59-63
• Mohammed, S (2013) ‘Factors Affecting E-Banking Usage in India: an Empirical Analysis’, Economic Insights – Trends and Challenges, Vol. II (LXV) No. 1, 17-25
• Zhou, X, Han, Y and Wang, R (2013) ‘An Empirical Investigation on Firms’ Proactive and Passive Motivation for Bribery in China’, Journal of Business Ethics, 118, 461-472

