# MTH 161: Introduction To Statistics

Lecture 30 Dr. MUMTAZ AHMED

Review of Previous Lecture
In the last lecture we discussed: describing bivariate data; the scatter plot; the concept of correlation; properties of correlation; related examples and an Excel demo.

Objectives of Current Lecture
In the current lecture: common misconceptions about correlation, with related examples; an introduction to regression analysis; regression versus correlation; simple and multiple regression models.

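There are many situations in which correlation is misleading. The correlation coefficient is a complete summary of the relationship between X and Y only when the two variables are jointly normal (or close to it). The main sources of trouble are: non-linearity, outliers, ecological correlations, and trends.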

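Non-Linearity: Consider the data set on X and $Y = X^2$, where X takes the integer values from -10 to 10.

| X | Y |
|---|---|
| -10 | 100 |
| -9 | 81 |
| -8 | 64 |
| -7 | 49 |
| -6 | 36 |
| -5 | 25 |
| -4 | 16 |
| -3 | 9 |
| -2 | 4 |
| -1 | 1 |
| 0 | 0 |
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
| 4 | 16 |
| 5 | 25 |
| 6 | 36 |
| 7 | 49 |
| 8 | 64 |
| 9 | 81 |
| 10 | 100 |

Note: The scatter plot shows a very strong (perfect) relationship between X and Y, but Correl(X, Y) is zero.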

Non-Linearity: The correlation coefficient measures only the strength of the linear relationship, so it misses the perfect but non-linear relationship $Y = X^2$. Hence it is essential to plot the data before doing any statistical analysis. If the data do not fit a standard joint normal pattern (or something close to it), the standard analysis can be quite misleading.
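The non-linearity example can be checked numerically. The sketch below is an illustration and not part of the original slides (which used Excel); the helper name `pearson` is an assumption introduced here, and it computes the usual sample correlation coefficient in plain Python.

```python
def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# X runs over the integers -10..10 and Y is an exact function of X.
x = list(range(-10, 11))
y = [v ** 2 for v in x]

# Despite the perfect (non-linear) dependence, the linear correlation vanishes.
r_nonlinear = pearson(x, y)
print(r_nonlinear)
```

The printed value is zero up to floating-point rounding, because the positive and negative halves of X contribute equal and opposite amounts to the covariance.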

Outliers: Outliers present in a data set can mislead. Consider the data set:

| x | y |
|---|---|
| 10 | 7.46 |
| 8 | 6.77 |
| 13 | 12.7 |
| 9 | 7.11 |
| 11 | 7.81 |
| 14 | 8.84 |
| 6 | 6.08 |
| 4 | 5.39 |
| 12 | 8.15 |
| 7 | 6.42 |
| 5 | 5.73 |

Note: A perfect linear relationship between X and Y is spoiled by one outlier (x = 13, y = 12.7). The calculated correlation is 0.82.

Outliers: LESSON: One outlier, or a small group of outliers, can distort a strong correlation and make it appear weaker, zero, or even negative.
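The effect of the single outlier can be seen by computing the correlation with and without it. This is a minimal sketch using the eleven (x, y) pairs from the outlier example; the helper `pearson` and the variable names are assumptions introduced for illustration.

```python
def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.7, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Correlation using all eleven points, outlier included.
r_all = pearson(x, y)

# Drop the single point that sits far above the line (x = 13, y = 12.7).
kept = [(a, b) for a, b in zip(x, y) if a != 13]
x_kept = [a for a, _ in kept]
y_kept = [b for _, b in kept]
r_without_outlier = pearson(x_kept, y_kept)

print(round(r_all, 2), round(r_without_outlier, 4))
```

With all points the result should be close to the 0.82 quoted in the example, while the remaining ten points lie almost exactly on a straight line, so their correlation is very nearly 1.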

Ecological Correlations: When a correlation is measured at the group level and conclusions are then drawn about individuals within the groups, it is called an “ecological correlation”. Example: Suppose we look at country data on the total number of cigarettes consumed and the total number of lung cancer cases, and find a strong correlation. From this, we might be tempted to conclude that smoking causes cancer. However, countries do not smoke; individuals do. So this is an ecological correlation. It is easy to construct data such that, despite a strong ecological correlation, there is no relation between smoking and cancer at the individual level.

Ecological Correlations: For example, suppose that there is a sequence of countries with increasing populations: 10, 100, 200, 500, etc., each with the same proportion of males and females. Suppose all males in each country smoke but none of them get lung cancer, while none of the females smoke but all of them get lung cancer. If we look at individual data on smoking and cancer (at the level of persons), we will find a perfect correlation of -100%: no one who smokes gets cancer, and no one who gets cancer smokes. However, if we look at the ecological correlation at the group level, we will find a perfect +100% correlation between smoking and cancer: the larger the number of smokers, the larger the number of lung cancer cases in each country. There will be a perfect linear relation between the two at the level of the country. This example shows that group-level correlations cannot necessarily be carried down to the level of individuals.
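The made-up smoking example can be reproduced in a few lines. The sketch below assumes four countries with populations 10, 100, 200 and 500, each split evenly between males and females (the even split is an assumption added here so that the counts are well defined); the helper `pearson` is likewise introduced only for illustration.

```python
def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

populations = [10, 100, 200, 500]  # hypothetical countries, half male, half female

# Individual level: one (smokes, has_cancer) pair per person.
smokes, cancer = [], []
for pop in populations:
    half = pop // 2
    smokes += [1] * half + [0] * half   # males smoke, females do not
    cancer += [0] * half + [1] * half   # males stay healthy, females get cancer
r_individual = pearson(smokes, cancer)

# Group level: one (number of smokers, number of cancer cases) pair per country.
smokers_per_country = [pop // 2 for pop in populations]
cases_per_country = [pop // 2 for pop in populations]
r_group = pearson(smokers_per_country, cases_per_country)

print(r_individual, r_group)
```

The individual-level correlation comes out at -1 and the country-level correlation at +1 (up to rounding), which is the reversal described in the example.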

Trends: One of the most damaging and least understood phenomena is that of spurious correlation. Correlation reveals the relationship between two stationary variables; it does not reveal any relationship between nonstationary or trending variables. The most important such case is when the two variables in question both have increasing (or decreasing) trends.

Trends: Example: Consider data on GNP per capita for Bhutan and El Salvador for the years 1979 to 2005. In practical terms, we could easily consider these to be “independent” series: these two small economies are remote from each other geographically and have no linkages to speak of. Hence the correlation is expected to be zero.

[Table: Year, Bhutan, El Salvador; annual GNP per capita, 1979 to 2005]

Trends: But the calculated value of the correlation is found to be 0.90. This is because both series have trends. This 90% does not measure any real association between the two series. Before we measure correlation, it is necessary to transform the series into stationary ones. One way to do this is to take rates of growth for each economy. Differencing the series is another commonly used method. It is also possible to subtract a fitted trend from the series. There is a substantial literature on the best method to make a series stationary (its statistical properties the same across time) before applying any standard statistical technique to it.

Trends: The correlation of the two series after differencing is found to be only 0.26, which is much less than 0.90. LESSON: Trends can give a misleading picture of the real correlation.
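The same effect can be reproduced with simulated data. The sketch below does not use the Bhutan and El Salvador figures (they are not available in this transcript); instead it generates two independent, purely synthetic series that each drift upwards, so every number in it is an illustrative assumption. The helpers `pearson`, `random_walk_with_drift` and `difference` are names introduced here.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def random_walk_with_drift(n, start, drift, noise_sd, rng):
    """Synthetic trending series: each step adds a fixed drift plus random noise."""
    values = [start]
    for _ in range(n - 1):
        values.append(values[-1] + drift + rng.gauss(0, noise_sd))
    return values

def difference(series):
    """First differences: the year-to-year changes of the series."""
    return [b - a for a, b in zip(series, series[1:])]

# Two independent series of 27 "years" each, built from separate random generators.
series_a = random_walk_with_drift(27, 200.0, 15.0, 5.0, random.Random(1))
series_b = random_walk_with_drift(27, 900.0, 40.0, 12.0, random.Random(2))

r_levels = pearson(series_a, series_b)                        # inflated by the shared trend
r_diffs = pearson(difference(series_a), difference(series_b))  # trend removed

print(round(r_levels, 2), round(r_diffs, 2))
```

Although the two series are unrelated by construction, the correlation of the levels is close to 1 because both trend upwards. After differencing, the correlation is typically much smaller in magnitude.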

Trends: Note: For El Salvador and Bhutan, it is easy to see on intuitive grounds that the two series have no relation with each other. This makes it easy to dismiss the statistical correlation of 90% as spurious or nonsensical (both words are used in the literature on this subject). However, when we expect to see a relation between the two series, the same problem becomes much more serious. Suppose someone computes the correlation between GNP and the money stock for Pakistan. The result will be a very large number, and they could use it to argue for a very strong relationship between the two. Because we expect some real relationship between these two variables, it is much less obvious that the correlation here is just as nonsensical.

General Lesson: We have considered many cases where correlation can mislead us. Quoting a precise number to a lay audience sounds definite and authoritative, and it can help win arguments. As a statistics student, you should be aware of all these misconceptions and should not be trapped by false interpretations.

Regression
Regression analysis is arguably the most important tool at the statistician’s and econometrician’s disposal. Regression is concerned with describing and evaluating the relationship between a given variable and one or more other variables. More specifically, regression is an attempt to explain movements in a variable by reference to movements in one or more other variables.

To make the idea more concrete, denote the variable whose movements the regression seeks to explain by y and the variables which are used to explain those variations by x1, x2, …, xk. Hence, in this relatively simple setup, it would be said that variations in k variables (the xs) cause changes in some other variable, y.

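There are various completely interchangeable names for y and the xs. In this lecture, y is called the dependent variable and the xs are called explanatory (or independent) variables.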

Regression Versus Correlation
Regression and correlation have some fundamental differences. In regression analysis there is an asymmetry in the way the dependent and explanatory variables are treated. The dependent variable is assumed to be statistical, random, or stochastic, that is, to have a probability distribution. The explanatory variables, on the other hand, are assumed to have fixed values. In correlation analysis, by contrast, we treat any two variables symmetrically; there is no distinction between the dependent and explanatory variables. After all, the correlation between two variables, say scores on mathematics and statistics examinations, is the same as that between scores on statistics and mathematics examinations. Moreover, both variables are assumed to be random.
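The symmetry of correlation and the asymmetry of regression can be illustrated numerically. The sketch below uses hypothetical examination scores invented for this illustration, and it uses the ordinary least-squares slope formula, which goes slightly beyond what this lecture has covered; the helper names `sums_of_squares`, `pearson` and `ols_slope` are assumptions.

```python
def sums_of_squares(xs, ys):
    """Return (Sxy, Sxx, Syy), the centred sums of products and squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson(xs, ys):
    """Sample correlation: symmetric in its two arguments."""
    sxy, sxx, syy = sums_of_squares(xs, ys)
    return sxy / (sxx * syy) ** 0.5

def ols_slope(explanatory, dependent):
    """Least-squares slope from regressing `dependent` on `explanatory`."""
    sxy, sxx, _ = sums_of_squares(explanatory, dependent)
    return sxy / sxx

# Hypothetical scores for seven students (illustrative values only).
maths = [55, 60, 65, 70, 80, 85, 90]
stats = [50, 62, 60, 75, 78, 88, 85]

r_ms = pearson(maths, stats)
r_sm = pearson(stats, maths)
slope_stats_on_maths = ols_slope(maths, stats)
slope_maths_on_stats = ols_slope(stats, maths)

print(r_ms, r_sm)
print(slope_stats_on_maths, slope_maths_on_stats)
```

Swapping the two variables leaves the correlation unchanged, but the regression slope changes depending on which variable is treated as dependent. The two slopes multiply to the squared correlation.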

Simple vs Multiple Regression Models
If it is believed that y (the dependent variable) depends on only one x (explanatory or independent) variable, then the regression model is said to be simple. Examples: wage depends on education; consumption depends on income. If it is believed that y depends on two or more explanatory variables (x1, x2, …, xk), then the regression model is said to be multiple. Example: wage depends on education, experience, etc.
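As a sketch of how the two cases are usually written, assuming a linear relationship: the intercept $\beta_0$, the slope coefficients $\beta_1, \dots, \beta_k$ and the random error term $u$ are notation introduced here for illustration and do not appear in the slides.

```latex
\begin{align*}
\text{Simple:}   \quad & y = \beta_0 + \beta_1 x + u \\
\text{Multiple:} \quad & y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u
\end{align*}
```

In the wage example, the simple model would have education as the single x, and the multiple model would add experience as x2.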

Review
Let’s review the main concepts: common misconceptions about correlation, with related examples; an introduction to regression analysis; regression versus correlation; simple and multiple regression models.

Next Lecture
In the next lecture, we will study: more on regression and its importance; the method of least squares and related concepts; related examples.