Lecture 30 Dr. MUMTAZ AHMED MTH 161: Introduction To Statistics.


1 Lecture 30 Dr. MUMTAZ AHMED MTH 161: Introduction To Statistics

2 Review of Previous Lecture: In the last lecture we discussed describing bivariate data, the scatter plot, the concept of correlation, properties of correlation, and related examples with an Excel demo.

3 Objectives of Current Lecture: In the current lecture we cover common misconceptions about correlation and related examples.

4 Objectives of Current Lecture (continued): An introduction to regression analysis, regression versus correlation, and the simple and multiple regression models.

5 Common Confusion about Correlation: There are many situations in which correlation is misleading. Correlation is a complete summary of the relationship between X and Y only when the two variables are jointly normal. Four common sources of trouble are non-linearity, outliers, ecological correlations, and trends.

6 Common Confusion about Correlation: Non-Linearity. Consider the data set on X and Y = X². Note: the scatter plot shows a very strong (perfect) relationship between X and Y, but Correl(X, Y) is approximately zero.
  X     Y
-10   100
 -9    81
 -8    64
 -7    49
 -6    36
 -5    25
 -4    16
 -3     9
 -2     4
 -1     1
  0     0
  1     1
  2     4
  3     9
  4    16
  5    25
  6    36
  7    49
  8    64
  9    81
 10   100

7 Common Confusion about Correlation: Non-Linearity. Consider the data set on X and Y = X². The correlation coefficient measures only the strength of the linear relationship. Hence it is essential to plot the data before doing any statistical analysis. If the data do not follow a joint normal pattern, at least approximately, the standard analysis can be quite misleading.
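The non-linearity example can be checked numerically. The following is a minimal sketch in Python; the helper name `pearson` is our own and simply applies the textbook formula for the correlation coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# X runs from -10 to 10 and Y is an exact function of X (Y = X squared)
x = list(range(-10, 11))
y = [v ** 2 for v in x]

r = pearson(x, y)
# Y is fully determined by X, yet the linear correlation vanishes
print(f"Correl(X, Y) = {abs(r):.6f}")
```

The positive and negative halves of the parabola cancel in the covariance sum, which is why a perfect but non-linear relationship produces a correlation of zero.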

8 Common Confusion about Correlation: Outliers. Outliers present in a data set can mislead. Consider the data set below. Note: a perfect linear relationship between X and Y is spoiled by one outlier (the point x = 13, y = 12.7). The calculated correlation is 0.82.
 x      y
10   7.46
 8   6.77
13   12.7
 9   7.11
11   7.81
14   8.84
 6   6.08
 4   5.39
12   8.15
 7   6.42
 5   5.73

9 Common Confusion about Correlation: Outliers. Outliers present in a data set can mislead. LESSON: One outlier, or a small group of outliers, can distort a strong correlation and make it appear as a zero or even a negative correlation.
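The effect of the single outlier can be seen by computing the correlation with and without it. This is a minimal sketch using the eleven (x, y) pairs from the outlier slide; the helper name `pearson` and the choice to drop the point at x = 13 are our own illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Data as listed on the outlier slide
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.7, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Correlation on all eleven points (the slide reports 0.82)
r_all = pearson(x, y)

# Drop the single outlying point (x = 13) and recompute
kept = [(a, b) for a, b in zip(x, y) if a != 13]
r_without = pearson([a for a, _ in kept], [b for _, b in kept])

print(f"with outlier:    {r_all:.2f}")
print(f"without outlier: {r_without:.4f}")
```

Once the outlier is removed, the remaining ten points lie almost exactly on a straight line, so the correlation returns to nearly 1.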

10 Common Confusion about Correlation: Ecological Correlations. When a correlation is measured at a group level and conclusions are then drawn for individuals within the groups, this is called an “ecological correlation”. Example: Suppose we look at country data on the total number of cigarettes consumed and the total number of lung cancer cases, and find a strong correlation. From this, we might be tempted to conclude that smoking causes cancer. However, countries do not smoke; individuals do. So this is an ecological correlation. It is easy to make up data such that, despite a strong ecological correlation, there is no relation between smoking and cancer at the individual level.

11 Common Confusion about Correlation: Ecological Correlations. For example, suppose that there is a sequence of countries with increasing populations: 10, 100, 200, 500, etc., each with equal numbers of males and females. Suppose all males in each country smoke but none of them get lung cancer, while none of the females smoke but all females get lung cancer. If we look at individual data on smoking and cancer (at the level of persons), we will find a perfect correlation of -100%. No one who smokes gets cancer, and no one who gets cancer smokes. However, if we look at the ecological correlation at the group level, we will find a perfect +100% correlation between smoking and cancer: the larger the number of smokers, the larger the number of lung cancer cases in each country. There will be a perfect linear relation between the two at the level of the country. This example shows that group-level correlations cannot necessarily be reduced to the level of individuals.
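The made-up smoking example can be built explicitly to confirm both correlations. This is a minimal sketch; it assumes each country is split evenly between males and females (the slide does not state the split), and the variable names are our own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

populations = [10, 100, 200, 500]  # country sizes from the slide

individuals = []      # one (smokes, has_cancer) record per person, coded 0/1
country_smokers = []  # number of smokers per country
country_cases = []    # number of lung cancer cases per country

for pop in populations:
    males = pop // 2          # assumption: half of each country is male
    females = pop - males
    individuals += [(1, 0)] * males    # males smoke, none get cancer
    individuals += [(0, 1)] * females  # females do not smoke, all get cancer
    country_smokers.append(males)
    country_cases.append(females)

# Person-level correlation between smoking and cancer
r_individual = pearson([s for s, _ in individuals],
                       [c for _, c in individuals])

# Country-level (ecological) correlation between smokers and cases
r_country = pearson(country_smokers, country_cases)

print(f"individual level: {r_individual:+.2f}")
print(f"country level:    {r_country:+.2f}")
```

The two levels give opposite answers because the country totals are both driven by population size, not by any link between smoking and cancer within a person.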

12 Common Confusion about Correlation: Trends. One of the most damaging and least understood phenomena is that of spurious correlation. Correlation measures the relationship between two stationary variables; it does not reliably reveal a relationship between nonstationary or trending variables. The most important such case is when the two variables in question have increasing (or decreasing) trends.

13 Common Confusion about Correlation: Trends. Example: Consider data on GNP per capita for Bhutan and El Salvador. In practical terms, we could easily consider these to be “independent” series: these two small economies are remote from each other geographically and have no linkages to speak of. Hence the correlation is expected to be zero.
Year   Bhutan     El Salvador
1979   1478.424   4171.818
1980   1583.599   3693.96
1981   1626.714   3434.062
1982   1722.021   3466.129
1983   1800.135   3490.863
1984   1807.344   3484.543
1985   1924.189   3455.915
1986   2203.557   3500.369
1987   2156.315   3516.656
1988   2197.854   3495.197
1989   2344.948   3601.444
1990   2393.125   3660.578
1991   2527.632   3857.446
1992   2651.113   4054.301
1993   2820.177   4207.451
1994   3020.178   4381.249
1995   3194.396   4362.509
1996   3312.917   4453.65
1997   3417.545   4526.712
1998   3590.644   4589.47
1999   3684.819   4596.743
2000   3840.869   4586.245
2001   4105.917   4606.568
2002   4295.516   4627.453
2003   4471.634   4629.312
2004   4658.292   4674.763
2005   4929.535   4775.517

14 Common Confusion about Correlation: Trends. But the calculated value of the correlation is found to be 0.90. This is due to the fact that both series have trends. This 90% does not measure any real association between the two series. Before we measure correlation, it is necessary to transform the series to stationary ones. One way to do this is by taking rates of growth for each economy. Differencing the series is another method that is commonly used. It is also possible to subtract a trend from the series to eliminate the trend. There is a substantial literature on the best method to make a series stationary (same across time) before applying any standard statistical techniques to it.

15 Common Confusion about Correlation: Trends. The correlation of the two series after differencing is found to be only 0.26, which is much less than 0.90. LESSON: Trends can give a misleading picture of the real correlation.
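The two figures quoted for Bhutan and El Salvador can be reproduced from the GNP per capita table. This is a minimal sketch, assuming that "differencing" means first differences (year-to-year changes); the helper names `pearson` and `diff` are our own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def diff(series):
    """First differences: the change from one year to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# (year, Bhutan, El Salvador) GNP per capita, as listed on the slides
data = [
    (1979, 1478.424, 4171.818), (1980, 1583.599, 3693.96),
    (1981, 1626.714, 3434.062), (1982, 1722.021, 3466.129),
    (1983, 1800.135, 3490.863), (1984, 1807.344, 3484.543),
    (1985, 1924.189, 3455.915), (1986, 2203.557, 3500.369),
    (1987, 2156.315, 3516.656), (1988, 2197.854, 3495.197),
    (1989, 2344.948, 3601.444), (1990, 2393.125, 3660.578),
    (1991, 2527.632, 3857.446), (1992, 2651.113, 4054.301),
    (1993, 2820.177, 4207.451), (1994, 3020.178, 4381.249),
    (1995, 3194.396, 4362.509), (1996, 3312.917, 4453.65),
    (1997, 3417.545, 4526.712), (1998, 3590.644, 4589.47),
    (1999, 3684.819, 4596.743), (2000, 3840.869, 4586.245),
    (2001, 4105.917, 4606.568), (2002, 4295.516, 4627.453),
    (2003, 4471.634, 4629.312), (2004, 4658.292, 4674.763),
    (2005, 4929.535, 4775.517),
]
bhutan = [b for _, b, _ in data]
salvador = [s for _, _, s in data]

# Correlation of the raw, trending levels (the slides report about 0.90)
r_levels = pearson(bhutan, salvador)

# Correlation after differencing (the slides report about 0.26)
r_diff = pearson(diff(bhutan), diff(salvador))

print(f"levels:      {r_levels:.2f}")
print(f"differenced: {r_diff:.2f}")
```

Differencing removes the shared upward drift, so what remains is the co-movement of the yearly changes, which is far weaker than the correlation of the levels suggests.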

16 Common Confusion about Correlation: Trends. Note: For El Salvador and Bhutan, it is easy to see on intuitive grounds that the two series have no relation with each other. This makes it easy to dismiss the statistical correlation of 90% as being spurious or nonsensical; both words have been used in the literature on this subject. However, when we expect to see a relation between the two series, this same problem becomes much more serious. Suppose someone computes the correlation between GNP and money stock for Pakistan. The result will be a very large number, and they could then argue for a very strong relationship between the two. Because we expect that there is some real relationship between these two variables, the fact that the correlation here is nonsensical is not nearly so obvious.

17 Common Confusion about Correlation: General Lesson. We have considered many cases where correlation can mislead us. Quoting a decisive number to a lay audience sounds very definite and authoritative, and it helps win arguments. As a statistics student, you should be well aware of all these misconceptions and avoid being trapped by false interpretations.

18 Regression: Regression analysis is certainly the most important tool at the statistician’s and econometrician’s disposal. Regression is concerned with describing and evaluating the relationship between a given variable and one or more other variables. More specifically, regression is an attempt to explain movements in a variable by reference to movements in one or more other variables.

19 Regression: To make the idea more concrete, denote the variable whose movements the regression seeks to explain by y and the variables which are used to explain those variations by x1, x2, ..., xk. Hence, in this relatively simple setup, it would be said that variations in k variables (the xs) cause changes in some other variable, y.

20 Regression: There are various completely interchangeable names for y and the xs; for example, y is the dependent variable and the xs are the explanatory or independent variables.

21 Regression Versus Correlation: Regression and correlation have some fundamental differences. In regression analysis there is an asymmetry in the way the dependent and explanatory variables are treated. The dependent variable is assumed to be statistical, random, or stochastic, that is, to have a probability distribution. The explanatory variables, on the other hand, are assumed to have fixed values. In correlation analysis we treat any two variables symmetrically; there is no distinction between the dependent and explanatory variables. After all, the correlation between scores on mathematics and statistics examinations is the same as that between scores on statistics and mathematics examinations. Moreover, both variables are assumed to be random.
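The symmetry of correlation and the asymmetry of regression can be illustrated with a small numerical sketch. The exam scores below are hypothetical, and the slope is computed with the usual least-squares formula (covariance divided by the variance of the explanatory variable), which is an assumption here since the lecture has not yet introduced it.

```python
import math

def sums(xs, ys):
    """Return (Sxy, Sxx, Syy): sums of cross-products and squares about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson(xs, ys):
    """Correlation coefficient: treats the two variables symmetrically."""
    sxy, sxx, syy = sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def slope(explanatory, dependent):
    """Least-squares slope of `dependent` regressed on `explanatory`."""
    sxy, sxx, _ = sums(explanatory, dependent)
    return sxy / sxx

# Hypothetical exam scores for five students
maths = [55, 60, 65, 70, 80]
stats = [50, 65, 60, 75, 85]

# Correlation does not care which variable comes first
r_ms = pearson(maths, stats)
r_sm = pearson(stats, maths)

# Regression does: swapping the roles changes the slope
slope_stats_on_maths = slope(maths, stats)
slope_maths_on_stats = slope(stats, maths)

print(f"correlation (either order): {r_ms:.3f}, {r_sm:.3f}")
print(f"slope of stats on maths:    {slope_stats_on_maths:.3f}")
print(f"slope of maths on stats:    {slope_maths_on_stats:.3f}")
```

Swapping the two variables leaves the correlation unchanged but gives a different regression slope, which is the asymmetry the slide describes.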

22 Simple vs Multiple Regression Models: If it is believed that y (the dependent variable) depends on only one x (explanatory or independent) variable, then the regression model is said to be simple. Examples: wage depends on education; consumption depends on income. If it is believed that y (the dependent variable) depends on two or more explanatory variables (x1, x2, ..., xk), then the regression model is said to be multiple. Example: wage depends on education, experience, etc.
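In symbols, the two models might be written as follows. This is a sketch in conventional notation: the intercept \beta_0, the slope coefficients \beta_1, ..., \beta_k and the random error term u are assumptions of ours and do not appear on the slides.

```latex
% Simple regression: one explanatory variable, e.g. wage on education
\[
  y = \beta_0 + \beta_1 x + u
\]

% Multiple regression: k explanatory variables, e.g. wage on education and experience
\[
  y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u
\]
```

Setting k = 1 in the second equation gives back the first, so the simple model is a special case of the multiple model.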

23 Review: Let’s review the main concepts: common misconceptions about correlation and related examples.

24 Review (continued): An introduction to regression analysis, regression versus correlation, and the simple and multiple regression models.

25 Next Lecture: In the next lecture we will study more on regression and its importance, the method of least squares and related concepts, and related examples.

