
Lecture 3 Today: Statistical Review cont’d:


1 Lecture 3 Today: Statistical Review cont’d: unbiasedness and efficiency; sample equivalents of variance, covariance and correlation; probability limits and consistency (quick); the Simple Regression Model

2 SAMPLING AND ESTIMATORS
[Figure: probability density functions of X̄ and of X, both centred at μX.] We will next demonstrate that the variance of the distribution of X̄ is smaller than that of X, as depicted in the diagram. © Christopher Dougherty 1999–2006

3 SAMPLING AND ESTIMATORS
We start by replacing X̄ by its definition and then using variance rule 2 to take 1/n out of the expression as a common factor. © Christopher Dougherty 1999–2006

4 SAMPLING AND ESTIMATORS
Next we use variance rule 1 to replace the variance of a sum with a sum of variances. In principle there are many covariance terms as well, but they are zero if we assume that the sample values are generated independently. © Christopher Dougherty 1999–2006

5 SAMPLING AND ESTIMATORS
Now we come to the bit that requires thought. Start with X1. When we are still at the planning stage, we do not know what the value of X1 will be. © Christopher Dougherty 1999–2006

6 SAMPLING AND ESTIMATORS
All we know is that it will be generated randomly from the distribution of X. The variance of X1, as a beforehand concept, will therefore be σX². The same is true for all the other sample components, thinking about them beforehand. Hence we write this line. © Christopher Dougherty 1999–2006

7 SAMPLING AND ESTIMATORS
Thus we have demonstrated that the variance of the sample mean is equal to the variance of X divided by n, a result with which you will be familiar from your statistics course. © Christopher Dougherty 1999–2006
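As a quick numerical check of this result, the sketch below (Python with NumPy; the mean, standard deviation and sample size are arbitrary illustrative choices, not taken from the slides) draws many samples of size n and compares the empirical variance of the sample means with σX²/n.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_X, n, reps = 10.0, 25, 100_000

# Draw `reps` samples of size n and compute the sample mean of each.
sample_means = rng.normal(loc=100.0, scale=sigma_X, size=(reps, n)).mean(axis=1)

print(sample_means.var())   # empirical variance of the sample mean, roughly 4.0
print(sigma_X**2 / n)       # theoretical value sigma_X^2 / n = 4.0
```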

8 UNBIASEDNESS AND EFFICIENCY
Unbiasedness of X̄: Generalized estimator Z = λ1X1 + λ2X2. However, the sample mean is not the only unbiased estimator of the population mean. We will demonstrate this supposing that we have a sample of two observations (to keep it simple). Thus Z is an unbiased estimator of μX if the sum of the weights is equal to one. An infinite number of combinations of λ1 and λ2 satisfy this condition, not just the sample mean (here, λi = 1/n). © Christopher Dougherty 1999–2006

9 UNBIASEDNESS AND EFFICIENCY
[Figure: probability density functions of estimators A and B, both centred at μX.] Generalized estimator Z = λ1X1 + λ2X2 is an unbiased estimator of μX if the sum of the weights is equal to one. An infinite number of combinations of the λi satisfy this condition, not just the sample mean. How do we choose among them? The answer is to use the most efficient estimator, the one with the smallest population variance, because it will tend to be the most accurate. © Christopher Dougherty 1999–2006

10 UNBIASEDNESS AND EFFICIENCY
[Figure: probability density functions of estimators A and B, both centred at μX.] In the diagram, A and B are both unbiased estimators but B is superior because it is more efficient. © Christopher Dougherty 1999–2006

11 UNBIASEDNESS AND EFFICIENCY
Generalized estimator Z = λ1X1 + λ2X2. We will analyze the variance of the generalized estimator and find out what condition the weights must satisfy in order to minimize it. © Christopher Dougherty 1999–2006

12 UNBIASEDNESS AND EFFICIENCY
Generalized estimator Z = λ1X1 + λ2X2. The first variance rule is used to decompose the variance. © Christopher Dougherty 1999–2006

13 UNBIASEDNESS AND EFFICIENCY
Generalized estimator Z = λ1X1 + λ2X2. Note that we are assuming that X1 and X2 are independent observations and so their covariance is zero. The second variance rule is used to bring λ1 and λ2 out of the variance expressions. © Christopher Dougherty 1999–2006

14 UNBIASEDNESS AND EFFICIENCY “If λ1 + λ2 = 1, then λ1² + λ2² ≥ ½.”
Generalized estimator Z = λ1X1 + λ2X2. The variance of X1, at the planning stage, is σX². The same goes for the variance of X2. At this step, you can use the result “If λ1 + λ2 = 1, then λ1² + λ2² ≥ ½” to show that the sample mean is more efficient because it has a lower variance. © Christopher Dougherty 1999–2006

15 UNBIASEDNESS AND EFFICIENCY
Generalized estimator Z = λ1X1 + λ2X2. Or, you can use calculus as follows: we take account of the condition for unbiasedness and rewrite the variance of Z, substituting for λ2. © Christopher Dougherty 1999–2006

16 UNBIASEDNESS AND EFFICIENCY
Generalized estimator Z = λ1X1 + λ2X2. The quadratic is expanded. To minimize the variance of Z, we must choose λ1 so as to minimize the final expression. © Christopher Dougherty 1999–2006

17 UNBIASEDNESS AND EFFICIENCY
Generalized estimator Z = λ1X1 + λ2X2. We differentiate with respect to λ1 to obtain the first-order condition. © Christopher Dougherty 1999–2006

18 UNBIASEDNESS AND EFFICIENCY
Generalized estimator Z = λ1X1 + λ2X2. The expression is minimized for λ1 = 0.5. It follows that λ2 = 0.5 as well. So we have demonstrated that the sample mean is the most efficient unbiased estimator, at least in this example. (Note that the second derivative is positive, confirming that we have a minimum.) © Christopher Dougherty 1999–2006
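For a numerical check of this argument, the following sketch (an added illustration; σX² = 4 is an arbitrary choice) evaluates Var(Z) = (λ1² + λ2²)σX² over a grid of weights satisfying λ1 + λ2 = 1 and confirms that the minimum occurs at λ1 = λ2 = 0.5, where the variance equals that of the sample mean.

```python
import numpy as np

sigma2_X = 4.0
lam1 = np.linspace(0.0, 1.0, 101)   # candidate weights, with lam2 = 1 - lam1
lam2 = 1.0 - lam1

var_Z = (lam1**2 + lam2**2) * sigma2_X   # variance of Z = lam1*X1 + lam2*X2

best = lam1[np.argmin(var_Z)]
print(best, var_Z.min())   # 0.5 and 2.0, i.e. sigma2_X / 2, the variance of the sample mean
```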

19 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
[Figure: probability density functions of estimator A (unbiased) and estimator B (biased but less dispersed), with the true value θ marked.] Suppose that you have alternative estimators of a population characteristic θ, one unbiased, the other biased but with a smaller variance. How do you choose between them? © Christopher Dougherty 1999–2006

20 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
[Figure: probability density function of estimator B, with the true value θ marked.] A widely-used loss function is the mean square error of the estimator, defined as the expected value of the square of the deviation of the estimator about the true value of the population characteristic. © Christopher Dougherty 1999–2006

21 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
[Figure: probability density function of estimator B, with its mean μZ, the true value θ, and the bias marked.] The mean square error involves a trade-off between the variance of the estimator and its bias. Suppose you have a biased estimator like estimator B above, with expected value μZ. © Christopher Dougherty 1999–2006

22 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
[Figure: probability density function of estimator B, with μZ, θ, and the bias marked.] The mean square error can be shown to be equal to the sum of the variance of the estimator and the square of the bias. © Christopher Dougherty 1999–2006

23 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
To demonstrate this, we start by subtracting and adding μZ. © Christopher Dougherty 1999–2006

24 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
We expand the quadratic using the rule (a + b)² = a² + b² + 2ab, where a = Z – μZ and b = μZ – θ. © Christopher Dougherty 1999–2006

25 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
We use the first expected value rule to break up the expectation into its three components. © Christopher Dougherty 1999–2006

26 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
The first term in the expression is by definition the variance of Z. © Christopher Dougherty 1999–2006

27 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
(μZ – θ) is a constant, so the second term is a constant. © Christopher Dougherty 1999–2006

28 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
In the third term, (μZ – θ) may be brought out of the expectation, again because it is a constant, using the second expected value rule. © Christopher Dougherty 1999–2006

29 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
Now E(Z) is μZ, and E(–μZ) is –μZ. © Christopher Dougherty 1999–2006

30 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
Hence the third term is zero and the mean square error of Z is shown to be the sum of the variance of Z and the bias squared. © Christopher Dougherty 1999–2006

31 CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE
[Figure: probability density functions of estimators A and B, with the true value θ marked.] In the case of the estimators shown, estimator B is probably a little better than estimator A according to the MSE criterion. © Christopher Dougherty 1999–2006
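To make the trade-off concrete, here is a small simulation (an added illustration with arbitrary numbers, not taken from the slides) comparing the unbiased sample mean with a deliberately shrunk, and therefore biased, estimator. With the parameters chosen here the shrunk estimator happens to have the smaller mean square error, because its reduction in variance outweighs its squared bias; the printout also verifies the decomposition MSE = variance + bias².

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 5.0, 20.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
A = samples.mean(axis=1)   # estimator A: unbiased sample mean, variance sigma^2/n = 40
B = 0.4 * A                # estimator B: biased (shrunk) estimator with much smaller variance

for name, est in (("A", A), ("B", B)):
    bias = est.mean() - mu
    mse = np.mean((est - mu) ** 2)
    print(name, round(mse, 2), round(est.var() + bias**2, 2))   # MSE equals variance + bias^2
```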

32 ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION
Given a sample of n observations, the usual estimator of the variance is the sum of the squared deviations around the sample mean divided by n – 1, typically denoted s²X. Since the variance is the expected value of the squared deviation of X about its mean, it makes intuitive sense to use the average of the sample squared deviations as an estimator. But why divide by n – 1 rather than by n? The reason is that the sample mean is by definition in the middle of the sample, while the unknown population mean is not, except by coincidence. As a consequence, the sum of the squared deviations from the sample mean tends to be slightly smaller than the sum of the squared deviations from the population mean. Hence a simple average of the squared sample deviations is a downwards biased estimator of the variance. However, the bias can be shown to be a factor of (n – 1)/n. Thus one can allow for the bias by dividing the sum of the squared deviations by n – 1 instead of n. The proof is in the appendix of the review chapter. © Christopher Dougherty 1999–2006
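As a small illustration of the degrees-of-freedom correction (the data below are arbitrary), NumPy exposes the divisor through its ddof argument: ddof=0 gives the divide-by-n estimator and ddof=1 the divide-by-(n – 1) version, and the two differ by exactly the factor (n – 1)/n discussed above.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

biased = np.var(x, ddof=0)     # sum of squared deviations divided by n
unbiased = np.var(x, ddof=1)   # sum of squared deviations divided by (n - 1)

print(biased, unbiased, biased * n / (n - 1))   # the last two values agree
```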

33 ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION
A similar adjustment has to be made when estimating a covariance. For two random variables X and Y an unbiased estimator of the covariance σXY is given by the sum of the products of the deviations around the sample means divided by n – 1. © Christopher Dougherty 1999–2006

34 ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION
The population correlation coefficient ρXY for two variables X and Y is defined to be their covariance divided by the square root of the product of their variances. The sample correlation coefficient, rXY, is obtained from this by replacing the covariance and variances by their estimators. © Christopher Dougherty 1999–2006

35 ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION
The 1/(n – 1) terms in the numerator and the denominator cancel and one is left with a straightforward expression. © Christopher Dougherty 1999–2006
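The resulting expression is easy to compute directly. The sketch below (an added illustration with arbitrary data) evaluates the sample correlation from deviations about the sample means, where the 1/(n – 1) factors have cancelled, and checks it against np.corrcoef.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

dx, dy = X - X.mean(), Y - Y.mean()

# Sample correlation: the 1/(n-1) terms cancel between numerator and denominator.
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

print(r, np.corrcoef(X, Y)[0, 1])   # the two values agree
```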

36 Probability Limits and Consistency

37 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
[Figure: probability density function of X̄ for n = 1; standard deviation 50.] If n is equal to 1, the sample consists of a single observation. X̄ is the same as X and its standard deviation is 50. © Christopher Dougherty 1999–2006

38 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
[Figure: probability density function of X̄ for n = 4; standard deviation 25.] We will see how the shape of the distribution changes as the sample size is increased. © Christopher Dougherty 1999–2006

39 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
[Figure: probability density function of X̄ for n = 25; standard deviation 10.] The distribution becomes more concentrated about the population mean. © Christopher Dougherty 1999–2006

40 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
[Figure: probability density function of X̄ for n = 100; standard deviation 5.] To see what happens for n greater than 100, we will have to change the vertical scale. © Christopher Dougherty 1999–2006

41 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
[Figure: the same distribution for n = 100, with the vertical scale enlarged.] We have increased the vertical scale by a factor of 10. © Christopher Dougherty 1999–2006

42 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
[Figure: probability density function of X̄ for n = 1000.] The distribution continues to contract about the population mean. © Christopher Dougherty 1999–2006

43 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
[Figure: probability density function of X̄ for n = 5000.] In the limit, the variance of the distribution tends to zero. The distribution collapses to a spike at the true value. The plim of the sample mean is therefore the population mean. © Christopher Dougherty 1999–2006

44 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
An estimator of a population characteristic is said to be consistent if it satisfies two conditions: (1) It possesses a probability limit, and so its distribution collapses to a spike as the sample size becomes large, and (2) The spike is located at the true value of the population characteristic. Hence we can say plim X̄ = μX. © Christopher Dougherty 1999–2006

45 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
[Figure: probability density function of X̄ for n = 5000.] The sample mean in our example satisfies both conditions and so it is a consistent estimator of μX. Most standard estimators in simple applications satisfy the first condition because their variances tend to zero as the sample size becomes large. © Christopher Dougherty 1999–2006

46 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
The only issue then is whether the distribution collapses to a spike at the true value of the population characteristic. A sufficient condition for consistency is that the estimator should be unbiased and that its variance should tend to zero as n becomes large. It is easy to see why this is a sufficient condition. If the estimator is unbiased for a finite sample, it must stay unbiased as the sample size becomes large. Meanwhile, if the variance of its distribution is decreasing, its distribution must collapse to a spike. Since the estimator remains unbiased, this spike must be located at the true value. The sample mean is an example of an estimator that satisfies this sufficient condition. © Christopher Dougherty 1999–2006
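A simulation makes the collapse of the distribution visible. The sketch below (an added illustration) uses the same mean of 100 and standard deviation of 50 as the diagrams above and shows the standard deviation of X̄ shrinking towards zero while its mean stays at μX, consistent with plim X̄ = μX.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_X, sigma_X, reps = 100.0, 50.0, 2_000

for n in (1, 4, 25, 100, 1000, 5000):
    # Simulated sampling distribution of the sample mean for this n.
    means = rng.normal(mu_X, sigma_X, size=(reps, n)).mean(axis=1)
    # The mean stays near mu_X; the standard deviation shrinks like sigma_X / sqrt(n).
    print(n, round(means.mean(), 1), round(means.std(), 1))
```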

47 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
Why are we interested in consistency, when in practice we have finite samples? As a first approximation, the answer is that if we can show that an estimator is consistent, then we may be optimistic about its finite sample properties, whereas if the estimator is inconsistent, we know that for finite samples it will definitely be biased. © Christopher Dougherty 1999–2006


49 ASYMPTOTIC PROPERTIES OF ESTIMATORS: PLIMS AND CONSISTENCY
However, there are reasons for being cautious about preferring consistent estimators to inconsistent ones. First, a consistent estimator may be biased for finite samples. Second, we are usually also interested in variances. If a consistent estimator has a larger variance than an inconsistent one, the latter might be preferable if judged by the mean square error or similar criterion that allows a trade-off between bias and variance. How can you resolve these issues? Mathematically they are intractable, otherwise we would not have resorted to large sample analysis in the first place. © Christopher Dougherty 1999–2006


51 The Simple Regression Model

52 SIMPLE REGRESSION MODEL
[Figure: the line Y = β1 + β2X, with X values X1, X2, X3, X4 and intercept β1.] Suppose that a variable Y is a linear function of another variable X, with unknown parameters β1 and β2 that we wish to estimate. Suppose that we have a sample of 4 observations with X values as shown. © Christopher Dougherty 1999–2006

53 SIMPLE REGRESSION MODEL
[Figure: the points Q1, Q2, Q3, Q4 lying exactly on the line.] If the relationship were an exact one, the observations would lie on a straight line and we would have no trouble obtaining accurate estimates of β1 and β2. © Christopher Dougherty 1999–2006

54 SIMPLE REGRESSION MODEL
[Figure: actual observations P1, P2, P3 diverging from the points Q1, Q2, Q3, Q4 on the line.] In practice, most economic relationships are not exact and the actual values of Y are different from those corresponding to the straight line. © Christopher Dougherty 1999–2006

55 SIMPLE REGRESSION MODEL
To allow for such divergences, we will write the model as Y = β1 + β2X + u, where u is a disturbance term. © Christopher Dougherty 1999–2006

56 SIMPLE REGRESSION MODEL
Each value of Y thus has a non-random component, β1 + β2X, and a random component, u. The first observation has been decomposed into these two components. © Christopher Dougherty 1999–2006

57 SIMPLE REGRESSION MODEL
In practice we can see only the P points. © Christopher Dougherty 1999–2006

58 SIMPLE REGRESSION MODEL
Obviously, we can use the P points to draw a line which is an approximation to the line Y = β1 + β2X. If we write the fitted line as Ŷ = b1 + b2X, then b1 is an estimate of β1 and b2 is an estimate of β2. © Christopher Dougherty 1999–2006

59 SIMPLE REGRESSION MODEL
[Figure: the actual values (P points) and the fitted values (R points) on the fitted line.] The line is called the fitted model and the values of Y predicted by it are called the fitted values of Y. They are given by the heights of the R points. © Christopher Dougherty 1999–2006

60 SIMPLE REGRESSION MODEL
[Figure: the residuals e1, e2, e3, e4 measured between the P points and the fitted line.] The discrepancies between the actual and fitted values of Y are known as the residuals. © Christopher Dougherty 1999–2006

61 SIMPLE REGRESSION MODEL
[Figure: the true line and the fitted line shown together.] Note that the values of the residuals are not the same as the values of the disturbance term. The diagram now shows the true unknown relationship as well as the fitted line. © Christopher Dougherty 1999–2006

62 SIMPLE REGRESSION MODEL
The disturbance term in each observation is responsible for the divergence between the non-random component of the true relationship and the actual observation. © Christopher Dougherty 1999–2006

63 SIMPLE REGRESSION MODEL
The residuals are the discrepancies between the actual and the fitted values. If the fit is a good one, the residuals and the values of the disturbance term will be similar, but they must be kept apart conceptually. © Christopher Dougherty 1999–2006

64 SIMPLE REGRESSION MODEL
[Figure: the fourth observation decomposed both ways, showing the residual e4 relative to the fitted line and the disturbance u4 relative to the true line.] Both of these lines will be used in our analysis. Each permits a decomposition of the value of Y. The decompositions will be illustrated with the fourth observation. © Christopher Dougherty 1999–2006

65 SIMPLE REGRESSION MODEL
Using the theoretical relationship, Y can be decomposed into its non-stochastic component β1 + β2X and its random component u: Y = β1 + β2X + u. This is a theoretical decomposition because we do not know the values of β1 or β2, or the values of the disturbance term. We shall use it in our analysis of the properties of the regression coefficients. The other decomposition is with reference to the fitted line. In each observation, the actual value of Y is equal to the fitted value plus the residual: Y = Ŷ + e = b1 + b2X + e. This is an operational decomposition which we will use for practical purposes. © Christopher Dougherty 1999–2006

66 SIMPLE REGRESSION MODEL
Least squares criterion: Minimize RSS (residual sum of squares), where RSS = e1² + e2² + … + en². To begin with, we will draw the fitted line so as to minimize the sum of the squares of the residuals, RSS. This is described as the least squares criterion. © Christopher Dougherty 1999–2006

67 SIMPLE REGRESSION MODEL
Least squares criterion: Minimize RSS (residual sum of squares). Why the squares of the residuals? Why not just minimize the sum of the residuals, e1 + e2 + … + en? © Christopher Dougherty 1999–2006

68 SIMPLE REGRESSION MODEL
[Figure: a horizontal line drawn through the mean value of Y, with the observations P1, P2, P3 scattered about it.] The answer is that you would get an apparently perfect fit by drawing a horizontal line through the mean value of Y. The sum of the residuals would be zero. You must prevent negative residuals from cancelling positive ones, and one way to do this is to use the squares of the residuals. Of course there are other ways of dealing with the problem. The least squares criterion has the attraction that the estimators derived with it have desirable properties, provided that certain conditions are satisfied. © Christopher Dougherty 1999–2006

69 DERIVING LINEAR REGRESSION COEFFICIENTS
Next, we’ll see how the regression coefficients for a simple regression model are derived, using the least squares criterion (OLS, for ordinary least squares). We will start with a numerical example with just three observations: (1,3), (2,5), and (3,6). © Christopher Dougherty 1999–2006

70 DERIVING LINEAR REGRESSION COEFFICIENTS
Writing the fitted regression as Ŷ = b1 + b2X, we will determine the values of b1 and b2 that minimize RSS, the sum of the squares of the residuals. © Christopher Dougherty 1999–2006

71 DERIVING LINEAR REGRESSION COEFFICIENTS
Given our choice of b1 and b2, the residuals are as shown. © Christopher Dougherty 1999–2006

72 DERIVING LINEAR REGRESSION COEFFICIENTS
The sum of the squares of the residuals is thus as shown above. © Christopher Dougherty 1999–2006

73 DERIVING LINEAR REGRESSION COEFFICIENTS
The quadratics have been expanded. © Christopher Dougherty 1999–2006

74 DERIVING LINEAR REGRESSION COEFFICIENTS
Like terms have been added together. © Christopher Dougherty 1999–2006

75 DERIVING LINEAR REGRESSION COEFFICIENTS
For a minimum, the partial derivatives of RSS with respect to b1 and b2 should be zero. (We should also check a second-order condition.) © Christopher Dougherty 1999–2006

76 DERIVING LINEAR REGRESSION COEFFICIENTS
The first-order conditions give us two equations in two unknowns. © Christopher Dougherty 1999–2006

77 DERIVING LINEAR REGRESSION COEFFICIENTS
Solving them, we find that RSS is minimized when b1 and b2 are equal to 1.67 and 1.50, respectively. © Christopher Dougherty 1999–2006
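The arithmetic can be checked directly. The short sketch below (an added check, not part of the slides) applies the standard closed-form solution of the first-order conditions to the three observations (1,3), (2,5) and (3,6) and reproduces b1 ≈ 1.67 and b2 = 1.50.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 5.0, 6.0])

# OLS slope and intercept from the usual closed-form solution of the first-order conditions.
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b1 = Y.mean() - b2 * X.mean()

print(b1, b2)   # approximately 1.67 and 1.50
```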

78 DERIVING LINEAR REGRESSION COEFFICIENTS
Here is the scatter diagram again. © Christopher Dougherty 1999–2006

79 DERIVING LINEAR REGRESSION COEFFICIENTS
[Figure: the fitted line with intercept 1.67 and slope 1.50.] The fitted line and the fitted values of Y are as shown. © Christopher Dougherty 1999–2006

80 DERIVING LINEAR REGRESSION COEFFICIENTS
Now we will do the same thing for the general case with n observations. © Christopher Dougherty 1999–2006

81 DERIVING LINEAR REGRESSION COEFFICIENTS
Given our choice of b1 and b2, we will obtain a fitted line as shown. © Christopher Dougherty 1999–2006

82 DERIVING LINEAR REGRESSION COEFFICIENTS
The residual for the first observation, e1 = Y1 – b1 – b2X1, is defined. © Christopher Dougherty 1999–2006

83 DERIVING LINEAR REGRESSION COEFFICIENTS
Similarly we define the residuals for the remaining observations. That for the last one, en = Yn – b1 – b2Xn, is marked. © Christopher Dougherty 1999–2006

84 DERIVING LINEAR REGRESSION COEFFICIENTS
RSS, the sum of the squares of the residuals, is defined for the general case. The data for the numerical example are shown for comparison. © Christopher Dougherty 1999–2006

85 DERIVING LINEAR REGRESSION COEFFICIENTS
The quadratics are expanded. © Christopher Dougherty 1999–2006

86 DERIVING LINEAR REGRESSION COEFFICIENTS
Like terms are added together. © Christopher Dougherty 1999–2006

87 DERIVING LINEAR REGRESSION COEFFICIENTS
Note that in this equation the observations on X and Y are just data that determine the coefficients in the expression for RSS. The choice variables in the expression are b1 and b2. This may seem a bit strange because in elementary calculus courses b1 and b2 are usually constants and X and Y are variables. However, if you have any doubts, compare what we are doing in the general case with what we did in the numerical example. © Christopher Dougherty 1999–2006

88 DERIVING LINEAR REGRESSION COEFFICIENTS
The first derivative with respect to b1. © Christopher Dougherty 1999–2006

89 DERIVING LINEAR REGRESSION COEFFICIENTS
With some simple manipulation we obtain a tidy expression for b1: b1 = Ȳ – b2X̄. © Christopher Dougherty 1999–2006

90 DERIVING LINEAR REGRESSION COEFFICIENTS
The first derivative with respect to b2. © Christopher Dougherty 1999–2006

91 DERIVING LINEAR REGRESSION COEFFICIENTS
Divide through by 2. © Christopher Dougherty 1999–2006

92 DERIVING LINEAR REGRESSION COEFFICIENTS
We now substitute for b1 using the expression obtained for it and we thus obtain an equation that contains b2 only. © Christopher Dougherty 1999–2006

93 DERIVING LINEAR REGRESSION COEFFICIENTS
The definition of the sample mean has been used. © Christopher Dougherty 1999–2006

94 DERIVING LINEAR REGRESSION COEFFICIENTS
The last two terms have been disentangled. © Christopher Dougherty 1999–2006

95 DERIVING LINEAR REGRESSION COEFFICIENTS
Terms not involving b2 have been transferred to the right side. © Christopher Dougherty 1999–2006

96 DERIVING LINEAR REGRESSION COEFFICIENTS
Hence we obtain an expression for b2. © Christopher Dougherty 1999–2006

97 DERIVING LINEAR REGRESSION COEFFICIENTS
In practice, we shall use an alternative expression. We will demonstrate that it is equivalent. © Christopher Dougherty 1999–2006

98 DERIVING LINEAR REGRESSION COEFFICIENTS
Expanding the numerator, we obtain the terms shown. © Christopher Dougherty 1999–2006

99 DERIVING LINEAR REGRESSION COEFFICIENTS
In the second term the mean value of Y is a common factor. In the third, the mean value of X is a common factor. The last term is the same for all i. © Christopher Dougherty 1999–2006

100 DERIVING LINEAR REGRESSION COEFFICIENTS
We use the definitions of the sample means to simplify the expression. © Christopher Dougherty 1999–2006

101 DERIVING LINEAR REGRESSION COEFFICIENTS
Hence we have shown that the numerators of the two expressions are the same. © Christopher Dougherty 1999–2006

102 DERIVING LINEAR REGRESSION COEFFICIENTS
The denominator is mathematically a special case of the numerator, replacing Y by X. Hence the expressions are equivalent. © Christopher Dougherty 1999–2006
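Putting the two expressions together, here is a minimal sketch (added for illustration; the data are simulated) of OLS for the general case, using b2 = Σ(Xi – X̄)(Yi – Ȳ) / Σ(Xi – X̄)² and b1 = Ȳ – b2X̄, with a cross-check against np.polyfit.

```python
import numpy as np

def ols(X, Y):
    """Simple-regression OLS estimates (b1, b2) for the fitted line Y-hat = b1 + b2*X."""
    dx = X - X.mean()
    dy = Y - Y.mean()
    b2 = np.sum(dx * dy) / np.sum(dx ** 2)   # slope
    b1 = Y.mean() - b2 * X.mean()            # intercept
    return b1, b2

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=50)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=50)

b1, b2 = ols(X, Y)
print(b1, b2)
print(np.polyfit(X, Y, 1))   # returns [slope, intercept]; should match (b2, b1)
```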

103 DERIVING LINEAR REGRESSION COEFFICIENTS
The scatter diagram is shown again. We will summarize what we have done. We hypothesized that the true model is as shown, we obtained some data, and we fitted a line. © Christopher Dougherty 1999–2006

104 DERIVING LINEAR REGRESSION COEFFICIENTS
We chose the parameters of the fitted line so as to minimize the sum of the squares of the residuals. As a result, we derived the expressions for b1 and b2. © Christopher Dougherty 1999–2006

105 Today: Ch. 1: The Simple Regression Model Interpretation of regression results Goodness of fit

106 DERIVING LINEAR REGRESSION COEFFICIENTS
We chose the parameters of the fitted line so as to minimize the sum of the squares of the residuals. As a result, we derived the expressions for b1 and b2. © Christopher Dougherty 1999–2006

107 INTERPRETATION OF A REGRESSION EQUATION
The scatter diagram shows hourly earnings in 2002 plotted against years of schooling, defined as highest grade completed, for a sample of 540 respondents from the National Longitudinal Survey of Youth. © Christopher Dougherty 1999–2006

108 INTERPRETATION OF A REGRESSION EQUATION
Highest grade completed means just that for elementary and high school. Grades 13, 14, and 15 mean completion of one, two and three years of college. © Christopher Dougherty 1999–2006

109 INTERPRETATION OF A REGRESSION EQUATION
Grade 16 means completion of four-year college. Higher grades indicate years of postgraduate education. © Christopher Dougherty 1999–2006

110 INTERPRETATION OF A REGRESSION EQUATION
. reg EARNINGS S

Source   | SS | df | MS        Number of obs = 540
                               F(1, 538)     =
Model    |                     Prob > F      =
Residual |                     R-squared     =
                               Adj R-squared =
Total    |                     Root MSE      =

EARNINGS | Coef. | Std. Err. | t | P>|t| | [95% Conf. Interval]
S        |  2.46
_cons    | –13.93

This is the output from a regression of earnings on years of schooling, using Stata. © Christopher Dougherty 1999–2006

111 INTERPRETATION OF A REGRESSION EQUATION
. reg EARNINGS S [output as above] For the time being, we will be concerned only with the estimates of the parameters. The variables in the regression are listed in the first column and the second column gives the estimates of their coefficients. © Christopher Dougherty 1999–2006

112 INTERPRETATION OF A REGRESSION EQUATION
. reg EARNINGS S [output as above] In this case there is only one variable, S, and its coefficient is 2.46. _cons, in Stata, refers to the constant. The estimate of the intercept is –13.93. © Christopher Dougherty 1999–2006

113 INTERPRETATION OF A REGRESSION EQUATION
[Fitted equation: EARNINGS-hat = –13.93 + 2.46 S.] Here is the scatter diagram again, with the regression line shown. © Christopher Dougherty 1999–2006

114 INTERPRETATION OF A REGRESSION EQUATION
What do the coefficients actually mean? © Christopher Dougherty 1999–2006

115 INTERPRETATION OF A REGRESSION EQUATION
To answer this question, you must refer to the units in which the variables are measured. © Christopher Dougherty 1999–2006

116 INTERPRETATION OF A REGRESSION EQUATION
S is measured in years (strictly speaking, grades completed), EARNINGS in dollars per hour. So the slope coefficient implies that hourly earnings increase by $2.46 for each extra year of schooling. © Christopher Dougherty 1999–2006

117 INTERPRETATION OF A REGRESSION EQUATION
We will look at a geometrical representation of this interpretation. To do this, we will enlarge the marked section of the scatter diagram. © Christopher Dougherty 1999–2006

118 INTERPRETATION OF A REGRESSION EQUATION
The regression line indicates that completing 12th grade instead of 11th grade would increase earnings by $2.46, from $13.07 to $15.53, as a general tendency. © Christopher Dougherty 1999–2006
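Taking the reported coefficients at face value (the slides give roughly 2.46 for the slope and –13.93 for the intercept; both are rounded, so the figures below are approximate), the fitted earnings at 11 and 12 years of schooling can be reproduced as follows.

```python
# Approximate coefficients from the fitted earnings equation reported in the slides.
b1, b2 = -13.93, 2.46   # intercept and slope (dollars per hour, per year of schooling)

for s in (11, 12):
    # About 13.13 and 15.59 with these rounded coefficients; the slide's unrounded
    # estimates give 13.07 and 15.53.
    print(s, round(b1 + b2 * s, 2))
```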

119 INTERPRETATION OF A REGRESSION EQUATION
^ You should ask yourself whether this is a plausible figure. If it is implausible, this could be a sign that your model is misspecified in some way. © Christopher Dougherty 1999–2006

120 INTERPRETATION OF A REGRESSION EQUATION
^ For low levels of education it might be plausible. But for high levels it would seem to be an underestimate. © Christopher Dougherty 1999–2006

121 INTERPRETATION OF A REGRESSION EQUATION
^ What about the constant term? (Try to answer this question yourself before continuing with this sequence.) © Christopher Dougherty 1999–2006

122 INTERPRETATION OF A REGRESSION EQUATION
^ Literally, the constant indicates that an individual with no years of education would have to pay $13.93 per hour to be allowed to work. © Christopher Dougherty 1999–2006

123 INTERPRETATION OF A REGRESSION EQUATION
^ This does not make any sense at all. In former times craftsmen might require an initial payment when taking on an apprentice, and might pay the apprentice little or nothing for quite a while, but an interpretation of negative payment is impossible to sustain. © Christopher Dougherty 1999–2006

124 INTERPRETATION OF A REGRESSION EQUATION
^ A safe solution to the problem is to limit the interpretation to the range of the sample data, and to refuse to extrapolate on the ground that we have no evidence outside the data range. © Christopher Dougherty 1999–2006

125 INTERPRETATION OF A REGRESSION EQUATION
^ With this explanation, the only function of the constant term is to enable you to draw the regression line at the correct height on the scatter diagram. It has no meaning of its own. © Christopher Dougherty 1999–2006

126 INTERPRETATION OF A REGRESSION EQUATION
^ Another solution is to explore the possibility that the true relationship is nonlinear and that we are approximating it with a linear regression. We will soon extend the regression technique to fit nonlinear models. © Christopher Dougherty 1999–2006

127 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: This sequence explains measures of goodness of fit in regression analysis. It is convenient to start by demonstrating four useful results. The first is that the mean value of the residuals must be zero. © Christopher Dougherty 1999–2006

128 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: The residual in any observation is given by the difference between the actual and fitted values of Y for that observation. © Christopher Dougherty 1999–2006

129 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: First substitute for the fitted value. © Christopher Dougherty 1999–2006

130 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: Now sum over all the observations. © Christopher Dougherty 1999–2006

131 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: Dividing through by n, we obtain the sample mean of the residuals in terms of the sample means of X and Y and the regression coefficients. © Christopher Dougherty 1999–2006

132 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: If we substitute for b1, the expression collapses to zero. © Christopher Dougherty 1999–2006

133 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: Next we will demonstrate that the mean of the fitted values of Y is equal to the mean of the actual values of Y. © Christopher Dougherty 1999–2006

134 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: Again, we start with the definition of a residual. © Christopher Dougherty 1999–2006

135 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: Sum over all the observations. © Christopher Dougherty 1999–2006

136 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: Divide through by n. The terms in the equation are the means of the residuals, actual values of Y, and fitted values of Y, respectively. © Christopher Dougherty 1999–2006

137 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: We have just shown that the mean of the residuals is zero. Hence the mean of the fitted values is equal to the mean of the actual values. © Christopher Dougherty 1999–2006

138 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: Next we will demonstrate that the sum of the products of the values of X and the residuals is zero. © Christopher Dougherty 1999–2006

139 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: We start by replacing the residual with its expression in terms of Y and X. © Christopher Dougherty 1999–2006

140 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: We expand the expression. © Christopher Dougherty 1999–2006

141 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: The expression is equal to zero. One way of demonstrating this would be to substitute for b1 and b2 and show that all the terms cancel out. © Christopher Dougherty 1999–2006

142 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: A neater way is to recall the first order condition for b2 when deriving the regression coefficients. You can see that it is exactly what we need. © Christopher Dougherty 1999–2006

143 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: Finally we will demonstrate that the sum of the products of the fitted values of Y and the residuals is zero. © Christopher Dougherty 1999–2006

144 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: We start by substituting for the fitted value of Y. © Christopher Dougherty 1999–2006

145 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: We expand and rearrange. © Christopher Dougherty 1999–2006

146 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Four useful results: The expression is equal to zero, given the first and third useful results. © Christopher Dougherty 1999–2006

147 © Christopher Dougherty 1999–2006
GOODNESS OF FIT We now come to the discussion of goodness of fit. One measure of the variation in Y is the sum of its squared deviations around its sample mean, often described as the Total Sum of Squares, TSS. © Christopher Dougherty 1999–2006

148 © Christopher Dougherty 1999–2006
GOODNESS OF FIT We will decompose TSS using the fact that the actual value of Y in any observation is equal to the sum of its fitted value and the residual. © Christopher Dougherty 1999–2006

149 © Christopher Dougherty 1999–2006
GOODNESS OF FIT We substitute for Yi. © Christopher Dougherty 1999–2006

150 © Christopher Dougherty 1999–2006
GOODNESS OF FIT From the useful results, the mean of the fitted values of Y is equal to the mean of the actual values. Also, the mean of the residuals is zero. © Christopher Dougherty 1999–2006

151 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Hence we can simplify the expression as shown. © Christopher Dougherty 1999–2006

152 © Christopher Dougherty 1999–2006
GOODNESS OF FIT We expand the squared terms on the right side of the equation. © Christopher Dougherty 1999–2006

153 © Christopher Dougherty 1999–2006
GOODNESS OF FIT We expand the third term on the right side of the equation. © Christopher Dougherty 1999–2006

154 © Christopher Dougherty 1999–2006
GOODNESS OF FIT The last two terms are both zero, given the first and fourth useful results. © Christopher Dougherty 1999–2006

155 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Thus we have shown that TSS, the total sum of squares of Y, can be decomposed into ESS, the ‘explained’ sum of squares, and RSS, the residual (‘unexplained’) sum of squares. © Christopher Dougherty 1999–2006

156 © Christopher Dougherty 1999–2006
GOODNESS OF FIT The words explained and unexplained were put in quotation marks because the explanation may in fact be false. Y might really depend on some other variable Z, and X might be acting as a proxy for Z. It would be safer to use the expression apparently explained instead of explained. © Christopher Dougherty 1999–2006

157 © Christopher Dougherty 1999–2006
GOODNESS OF FIT The main criterion of goodness of fit, formally described as the coefficient of determination but usually referred to as R², is defined to be the ratio of ESS to TSS, that is, the proportion of the variance of Y explained by the regression equation. © Christopher Dougherty 1999–2006

158 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Obviously we would like to locate the regression line so as to make the goodness of fit as high as possible, according to this criterion. Does this objective clash with our use of the least squares principle to determine b1 and b2? © Christopher Dougherty 1999–2006

159 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Fortunately, there is no clash. To see this, rewrite the expression for R² in terms of RSS as shown. © Christopher Dougherty 1999–2006

160 © Christopher Dougherty 1999–2006
GOODNESS OF FIT The OLS regression coefficients are chosen in such a way as to minimize the sum of the squares of the residuals. Thus it automatically follows that they maximize R². © Christopher Dougherty 1999–2006

161 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Another natural criterion of goodness of fit is the correlation between the actual and fitted values of Y. We will demonstrate that this is maximized by using the least squares principle to determine the regression coefficients. © Christopher Dougherty 1999–2006

162 © Christopher Dougherty 1999–2006
GOODNESS OF FIT We will start with the numerator and substitute for the actual value of Y, and its mean, in the first factor. © Christopher Dougherty 1999–2006

163 © Christopher Dougherty 1999–2006
GOODNESS OF FIT We rearrange a little. © Christopher Dougherty 1999–2006

164 © Christopher Dougherty 1999–2006
GOODNESS OF FIT We expand the expression. The last two terms are both zero (fourth and first useful results). © Christopher Dougherty 1999–2006

165 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Thus the numerator simplifies to the sum of the squared deviations of the fitted values. © Christopher Dougherty 1999–2006

166 © Christopher Dougherty 1999–2006
GOODNESS OF FIT We have the same expression in the denominator, under a square root. Cancelling, we are left with the square root in the numerator. © Christopher Dougherty 1999–2006

167 © Christopher Dougherty 1999–2006
GOODNESS OF FIT Thus the correlation coefficient is the square root of R². It follows that it is maximized by the use of the least squares principle to determine the regression coefficients. © Christopher Dougherty 1999–2006
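To tie the pieces together, the sketch below (an added illustration on simulated data) fits a simple regression, decomposes TSS into ESS and RSS, and confirms both that R² = ESS/TSS = 1 – RSS/TSS and that R² equals the squared correlation between the actual and fitted values of Y.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=100)
Y = 1.0 + 0.8 * X + rng.normal(0, 2, size=100)

# OLS fit of Y on X.
b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b1 = Y.mean() - b2 * X.mean()
Y_hat = b1 + b2 * X
e = Y - Y_hat

TSS = np.sum((Y - Y.mean()) ** 2)
ESS = np.sum((Y_hat - Y.mean()) ** 2)
RSS = np.sum(e ** 2)

print(np.isclose(TSS, ESS + RSS))          # TSS = ESS + RSS
R2 = ESS / TSS
r = np.corrcoef(Y, Y_hat)[0, 1]
print(R2, 1 - RSS / TSS, r ** 2)           # all three values agree
```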

