Presentation on theme: "MTH 161: Introduction To Statistics"— Presentation transcript:
1 MTH 161: Introduction To Statistics. Lecture 30. Dr. MUMTAZ AHMED
2 Review of Previous Lecture. In the last lecture we discussed: Describing Bivariate Data; Scatter Plot; Concept of Correlation; Properties of Correlation; Related examples and Excel Demo.
3 Objectives of Current Lecture. In the current lecture: Common misconceptions about correlation; Related Examples.
4 Objectives of Current Lecture. In the current lecture: Introduction to Regression Analysis; Regression versus Correlation; Simple and Multiple Regression Model.
5 Common Confusion about Correlation. There are many situations in which correlation is misleading. The correlation coefficient is a reliable summary of the relationship only when both variables (X and Y) are, at least approximately, jointly normal. Common sources of trouble: Non-Linearity; Outliers; Ecological Correlations; Trends.
6 Common Confusion about Correlation. Non-Linearity: Consider the data set on X and Y = X^2.
X: -10, -9, -8, -7, -6, -5, -4, -3, -2, -1
Y: 100, 81, 64, 49, 36, 25, 16, 9, 4, 1
and symmetrically for X = 1 to 10, with Y = X^2.
Note: The scatter plot shows a very strong (perfect) relationship between X and Y. But Correl(X,Y) is approx. zero.
7 Common Confusion about Correlation. Non-Linearity: Consider the data set on X and Y = X^2. The correlation coefficient only measures the strength of the linear relationship. Hence it is essential to plot the data before doing any statistical analysis. If the data do not fit a standard joint normal pattern (or something close to it), the standard analysis can be quite misleading.
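The non-linearity point can be checked numerically. The following is a minimal sketch, assuming NumPy is available and assuming X takes the integer values from -10 to 10; it computes the Pearson correlation between X and Y = X^2.

```python
import numpy as np

# Assumption: X runs symmetrically over the integers -10..10.
x = np.arange(-10, 11)
# Y is an exact (but non-linear) function of X.
y = x ** 2

# Pearson correlation coefficient between X and Y.
r = np.corrcoef(x, y)[0, 1]
print(f"Correl(X, Y) = {r:.4f}")
```

Although Y is completely determined by X, the coefficient comes out at essentially zero, because the positive and negative halves of the parabola cancel in the linear measure.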
8 Common Confusion about Correlation. Outliers: Outliers present in a data set can mislead. Consider a data set:
x: 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5
y: 7.46, 6.77, 12.7, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73
Note: An otherwise perfect linear relationship between X and Y is spoiled by one outlier (x = 13, y = 12.7). Calculated Correlation is 0.82.
9 Common Confusion about Correlation. Outliers: Outliers present in a data set can mislead. LESSON: One outlier, or a small group of outliers, can distort a strong correlation and make it appear as a zero or even negative correlation.
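The effect of the single outlier in the x, y data set above can be reproduced with a short sketch. It assumes NumPy and uses the eleven (x, y) pairs from the slide; the variable names are illustrative only.

```python
import numpy as np

# The eleven (x, y) pairs from the outlier example.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([7.46, 6.77, 12.7, 7.11, 7.81, 8.84,
              6.08, 5.39, 8.15, 6.42, 5.73])

# Correlation using all points, including the outlier at x = 13.
r_all = np.corrcoef(x, y)[0, 1]

# Correlation after dropping the single outlier.
keep = x != 13
r_clean = np.corrcoef(x[keep], y[keep])[0, 1]

print(f"with outlier:    {r_all:.2f}")
print(f"without outlier: {r_clean:.4f}")
```

With the outlier removed, the remaining ten points lie almost exactly on a straight line, so the coefficient rises from about 0.82 to very nearly 1.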
10 Common Confusion about Correlation. Ecological Correlations: When a correlation is measured at a group level, and conclusions are then drawn for individuals within the groups, this is called an “ecological correlation”. Example: Suppose we look at country data on the total number of cigarettes consumed and the total number of lung cancer cases, and find a strong correlation. From this, we might be tempted to conclude that smoking causes cancer. However, countries do not smoke, individuals do. So this is an ecological correlation. It is easily possible to make up data such that, despite a strong ecological correlation, there is no relation between smoking and cancer at the individual level.
11 Common Confusion about Correlation. Ecological Correlations: For example, suppose that there is a sequence of countries with increasing populations: 10, 100, 200, 500, etc. Suppose all males in each country smoke, but none of them get lung cancer, while none of the females smoke, but all females get lung cancer. If we look at individual data on smoking and cancer (at the level of persons), we will find a perfect correlation of -100%: no one who smokes gets cancer, and no one who gets cancer smokes. However, if we look at the ecological correlation at the group level, we will find a perfect +100% correlation between smoking and cancer (provided the proportion of males is the same in every country): the larger the number of smokers, the larger the number of lung cancer cases in each country. There will be a perfect linear relation between the two at the level of the country. This example shows that group-level correlations cannot necessarily be reduced to the level of individuals.
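The made-up smoking example can be simulated directly. This is a minimal sketch, assuming NumPy and assuming (this is not stated on the slide) that every country is half male and half female; the populations 10, 100, 200, 500 are taken from the example.

```python
import numpy as np

# Populations from the example; the 50/50 male/female split is an assumption.
populations = [10, 100, 200, 500]

smokes, cancer = [], []                  # individual-level records (1 = yes, 0 = no)
smokers_per_country, cases_per_country = [], []

for n in populations:
    males = n // 2
    females = n - males
    # Every male smokes and has no cancer; every female does not smoke and has cancer.
    smokes += [1] * males + [0] * females
    cancer += [0] * males + [1] * females
    smokers_per_country.append(males)
    cases_per_country.append(females)

# Correlation across persons versus correlation across country totals.
r_individual = np.corrcoef(smokes, cancer)[0, 1]
r_group = np.corrcoef(smokers_per_country, cases_per_country)[0, 1]

print(f"individual level: {r_individual:+.2f}")
print(f"country level:    {r_group:+.2f}")
```

The same records give a perfectly negative correlation across persons and a perfectly positive one across countries, since the country totals only reflect population size.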
12 Common Confusion about Correlation. Trends: One of the most damaging and least understood phenomena is that of spurious correlation. Correlation reveals the relationship between two stationary variables, and cannot be relied on to reveal any relationship between nonstationary or trending variables. The most important such case is when the two variables in question have increasing (or decreasing) trends.
13 Common Confusion about Correlation. Trends: [Table: GNP per capita by year, 1979 to 2005, for Bhutan and El Salvador.] Example: Consider data on GNP per capita for Bhutan and El Salvador. In practical terms, we could easily consider these to be “independent” series: these two small economies are remote from each other geographically, and have no linkages to speak of. Hence the correlation is expected to be zero.
14 Common Confusion about Correlation. Trends: But the calculated value of the correlation is found to be 0.90. This is due to the fact that both series have trends. This 90% does not measure any real association between the two series. Before we measure correlation, it is necessary to transform the series into stationary ones. One way to do this is by taking rates of growth for each economy. Differencing the series is another method that is commonly used. It is also possible to subtract a trend from the series to eliminate the trend. There is a substantial literature on the best method to make a series stationary (same across time) before applying any standard statistical techniques to it.
15 Common Confusion about Correlation. Trends: The correlation of the two series after differencing is found to be only 0.26, which is much less than 0.90. LESSON: Trends can give a misleading picture of the real correlation.
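The trend effect can be illustrated on synthetic data. The sketch below does not use the Bhutan and El Salvador figures; it assumes NumPy and generates two hypothetical, independent series that each have a linear trend plus random noise, then compares the correlation of the levels with the correlation of the first differences. All numbers (intercepts, slopes, noise sizes, seed) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the run is repeatable
years = np.arange(1979, 2006)       # same span as the example: 1979..2005
t = np.arange(years.size)

# Two hypothetical series, generated independently of each other:
# each is a linear trend plus its own random noise.
series_a = 200 + 10 * t + rng.normal(0, 10, size=t.size)
series_b = 900 + 25 * t + rng.normal(0, 25, size=t.size)

# Correlation of the raw (trending) levels.
r_levels = np.corrcoef(series_a, series_b)[0, 1]

# Correlation after first-differencing, which removes the linear trends.
r_diff = np.corrcoef(np.diff(series_a), np.diff(series_b))[0, 1]

print(f"levels:      {r_levels:.2f}")
print(f"differenced: {r_diff:.2f}")
```

The levels are highly correlated only because both series rise over time; once the trend is differenced out, what remains is the correlation between two unrelated noise sequences.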
16 Common Confusion about Correlation. Trend: Note: For El Salvador and Bhutan, it is easy to see on intuitive grounds that the two series have no relation with each other. This makes it easy to dismiss the statistical correlation of 90% as being spurious or nonsensical; these two words have been used in the literature on this subject. However, when we expect to see a relation between the two series, this same problem becomes much more serious. Suppose someone computes the correlation between GNP and Money Stock for Pakistan. The result will be a very large number, and he could then argue for a very strong relationship between the two. Because we expect that there is some real relationship between these two variables, the fact that the correlation here is nonsensical does not seem quite so obvious.
17 Common Confusion about Correlation. General Lesson: We have considered many cases where correlation can mislead us. Quoting a decisive number to a lay audience sounds very definite and authoritative and, in addition, helps win arguments. As a statistics student, you should be well aware of all these misconceptions and should not get trapped by false interpretations.
18 Regression. Regression analysis is certainly the most important tool at the statistician’s and econometrician’s disposal. Regression is concerned with describing and evaluating the relationship between a given variable and one or more other variables. More specifically, regression is an attempt to explain movements in a variable by reference to movements in one or more other variables.
19 Regression. To make the idea more concrete, denote the variable whose movements the regression seeks to explain by y and the variables which are used to explain those variations by x1, x2, ..., xk. Hence, in this relatively simple setup, it would be said that variations in k variables (the xs) cause changes in some other variable, y.
20 Regression. There are various completely interchangeable names for y and the xs.
21 Regression Versus Correlation. Regression and correlation have some fundamental differences. In regression analysis there is an asymmetry in the way the dependent and explanatory variables are treated. The dependent variable is assumed to be statistical, random, or stochastic, that is, to have a probability distribution. The explanatory variables, on the other hand, are assumed to have fixed values. In correlation analysis, by contrast, we treat any (two) variables symmetrically; there is no distinction between the dependent and explanatory variables. After all, the correlation between two variables, say scores on mathematics and statistics examinations, is the same as that between scores on statistics and mathematics examinations. Moreover, both variables are assumed to be random.
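The symmetry of correlation and the asymmetry of regression can be seen on a small example. This is a sketch under stated assumptions: the exam scores are hypothetical, NumPy is assumed, and the regression line is fitted by least squares via `np.polyfit` (a fitting method this lecture has not yet introduced).

```python
import numpy as np

# Hypothetical exam scores for six students (made up for illustration).
math_scores = np.array([55, 60, 65, 70, 80, 90], dtype=float)
stats_scores = np.array([50, 65, 60, 75, 78, 95], dtype=float)

# Correlation is symmetric: swapping the variables changes nothing.
r_ms = np.corrcoef(math_scores, stats_scores)[0, 1]
r_sm = np.corrcoef(stats_scores, math_scores)[0, 1]

# Regression is not symmetric: the slope depends on which variable
# is treated as dependent. polyfit(x, y, 1) returns [slope, intercept].
slope_stats_on_math = np.polyfit(math_scores, stats_scores, 1)[0]
slope_math_on_stats = np.polyfit(stats_scores, math_scores, 1)[0]

print(f"correlation (math, stats): {r_ms:.3f}")
print(f"correlation (stats, math): {r_sm:.3f}")
print(f"slope, stats on math: {slope_stats_on_math:.3f}")
print(f"slope, math on stats: {slope_math_on_stats:.3f}")
```

The two correlations are identical, while the two slopes differ; for least-squares lines their product equals the squared correlation.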
22 Simple vs Multiple Regression Models. If it is believed that y (the dependent variable) depends on only one x (explanatory or independent) variable, then the regression model is said to be simple. Examples: Wage depends on education; Consumption depends on income. If it is believed that y (the dependent variable) depends on two or more (explanatory) variables (x1, x2, ..., xk), then the regression model is said to be a multiple regression model. Example: Wage depends on education and experience, etc.
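The difference between the two kinds of model is only in how many explanatory columns enter the fit. The following sketch assumes NumPy, uses hypothetical noise-free wage data generated from made-up coefficients, and fits both models with `np.linalg.lstsq` (ordinary least squares, used here only as an illustrative fitting method).

```python
import numpy as np

# Hypothetical data: years of education and experience for eight workers.
education = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
experience = np.array([20, 5, 10, 2, 8, 1, 12, 3], dtype=float)

# Made-up wage rule with no noise: wage = 5 + 2*education + 0.5*experience.
wage = 5 + 2 * education + 0.5 * experience

ones = np.ones_like(education)   # column of ones for the intercept

# Simple regression: wage explained by education only.
X_simple = np.column_stack([ones, education])
beta_simple, *_ = np.linalg.lstsq(X_simple, wage, rcond=None)

# Multiple regression: wage explained by education and experience.
X_multiple = np.column_stack([ones, education, experience])
beta_multiple, *_ = np.linalg.lstsq(X_multiple, wage, rcond=None)

print("simple   (intercept, education):", beta_simple)
print("multiple (intercept, education, experience):", beta_multiple)
```

Because the data were generated without noise, the multiple regression recovers the made-up coefficients, while the simple regression has to absorb the omitted experience effect into its intercept and slope.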
23 Review. Let’s review the main concepts: Common misconceptions about correlation; Related Examples.
24 Review. Let’s review the main concepts: Introduction to Regression Analysis; Regression versus Correlation; Simple and Multiple Regression Model.
25 Next Lecture. In the next lecture, we will study: More on Regression and its importance; Method of Least Squares and related concepts; Related Examples.