Professor William Greene Stern School of Business IOMS Department Department of Economics Statistical Inference and Regression Analysis: Stat-GB.3302.30,


1 Professor William Greene Stern School of Business IOMS Department Department of Economics Statistical Inference and Regression Analysis: Stat-GB.3302.30, Stat-UB.0015.01

2 2/98 Part 3 – Estimation Theory Immediate Reaction to the WHR Health System Performance Report New York Times, June 21, 2000

3 3/98 Part 3 – Estimation Theory A Model of the Best a Country Could Do vs. what They Actually Do

4 4/98 Part 3 – Estimation Theory The following was taken from http://www.msnbc.msn.com/id/27339545/ An msnbc.com guide to presidential polls Why results, samples and methodology vary from survey to survey WASHINGTON - A poll is a small sample of some larger number, an estimate of something about that larger number. For instance, what percentage of people reports that they will cast their ballots for a particular candidate in an election? A sample reflects the larger number from which it is drawn. Let’s say you had a perfectly mixed barrel of 1,000 tennis balls, of which 700 are white and 300 orange. You do your sample by scooping up just 50 of those tennis balls. If your barrel was perfectly mixed, you wouldn’t need to count all 1,000 tennis balls — your sample would tell you that 30 percent of the balls were orange.

5 5/98 Part 3 – Estimation Theory Use random samples and basic descriptive statistics. What is the ‘breach rate’ in a pool of tens of thousands of mortgages? (‘Breach’ = improperly underwritten or serviced or otherwise faulty mortgage.)

6 6/98 Part 3 – Estimation Theory The forensic analysis was an examination of statistics from a random sample of 1,500 loans.

7 Part 3 – Estimation Theory

8 8/98 Part 3 – Estimation Theory Estimation Nonparametric population features Mean - income Correlation – disease incidence and smoking Ratio – income per household member Proportion – proportion of ASCAP music played that is produced by Dave Matthews Distribution – histogram and density estimation Parameters Fitting distributions – mean and variance of lognormal distribution of income Parametric models of populations – relationship of loan rates to attributes of minorities and others in Bank of America settlement on mortgage bias 8

9 9/98 Part 3 – Estimation Theory Measurements as Observations Population Measurement Theory Characteristics Behavior Patterns Choices The theory argues that there are meaningful quantities to be statistically analyzed.

10 10/98 Part 3 – Estimation Theory Application – Health and Income German Health Care Usage Data, 7,293 Households, Observed 1984-1995 Data downloaded from Journal of Applied Econometrics Archive. Some variables in the file are DOCVIS = number of visits to the doctor in the observation period HOSPVIS = number of visits to a hospital in the observation period HHNINC = household nominal monthly net income in German marks / 10000. (4 observations with income=0 were dropped) HHKIDS = children under age 16 in the household = 1; otherwise = 0 EDUC = years of schooling AGE = age in years PUBLIC = decision to buy public health insurance HSAT = self assessed health status (0,1,…,10)

11 11/98 Part 3 – Estimation Theory Observed Data 11

12 12/98 Part 3 – Estimation Theory Inference about Population Population Measurement Characteristics Behavior Patterns Choices

13 13/98 Part 3 – Estimation Theory Classical Inference Population Measurement Characteristics Behavior Patterns Choices Imprecise inference about the entire population – sampling theory and asymptotics Sample The population is all 40 million German households (or all households in the entire world). The sample is the 7,293 German households in 1984-1995.

14 14/98 Part 3 – Estimation Theory Bayesian Inference Population Measurement Characteristics Behavior Patterns Choices Sharp, ‘exact’ inference about only the sample – the ‘posterior’ density is posterior to the data. Sample

15 15/98 Part 3 – Estimation Theory Estimation of Population Features Estimators and Estimates Estimator = strategy for use of the data Estimate = outcome of that strategy Sampling Distribution Qualities of the estimator Uncertainty due to random sampling 15

16 16/98 Part 3 – Estimation Theory Estimation Point Estimator: Provides a single estimate of the feature in question based on prior and sample information. Interval Estimator: Provides a range of values that incorporates both the point estimator and the uncertainty about the ability of the point estimator to find the population feature exactly. 16

17 17/98 Part 3 – Estimation Theory ‘Repeated Sampling’ - A Sampling Distribution The true mean is 500. Sample means vary around 500, some quite far off. The sample mean has a sampling mean and a sampling variance. The sample mean also has a probability distribution, which looks like a normal distribution. This is a histogram of 1,000 means of samples of 20 observations from Normal[500, 100^2].
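The repeated-sampling experiment on this slide can be sketched in a few lines. This is a minimal illustration, not the code behind the original histogram; the function name `sample_means` and the seed are assumptions made here.

```python
import random
import statistics

def sample_means(n_samples=1000, n_obs=20, mu=500.0, sigma=100.0, seed=12345):
    """Draw n_samples samples of n_obs observations from Normal[mu, sigma^2]
    and return the list of their sample means."""
    rng = random.Random(seed)  # hypothetical seed, chosen only for repeatability
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n_obs))
        for _ in range(n_samples)
    ]

means = sample_means()
print("mean of the sample means:     ", round(statistics.fmean(means), 2))
print("std. dev. of the sample means:", round(statistics.stdev(means), 2))
print("theory, sigma/sqrt(n):        ", round(100.0 / 20 ** 0.5, 2))
```

The sample means should centre near 500 with a spread close to 100/sqrt(20), which is the sampling variance the slide refers to.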

18 18/98 Part 3 – Estimation Theory Application: Credit Modeling 1992 American Express analysis of Application process: Acceptance or rejection; X = 0 (reject) or 1 (accept). Cardholder behavior Loan default (D = 0 or 1). Average monthly expenditure (E = $/month) General credit usage/behavior (Y = number of charges) 13,444 applications in November, 1992

19 19/98 Part 3 – Estimation Theory 0.7809 is the true proportion in the population of 13,444 we are sampling from.

20 20/98 Part 3 – Estimation Theory Estimation Concepts Random Sampling Finite populations i.i.d. sample from an infinite population Information Prior Sample 20

21 21/98 Part 3 – Estimation Theory Properties of Estimators 21

22 22/98 Part 3 – Estimation Theory Unbiasedness The sample mean of the 100 sample estimates is 0.7844. The population mean (true proportion) is 0.7809.

23 23/98 Part 3 – Estimation Theory Consistency [Figure: panels for N=144, N=1024 and N=4900; horizontal axis from .7 to .88]
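A rough way to see consistency is to watch the spread of the estimator shrink as N grows. The sketch below assumes Bernoulli draws with the acceptance rate 0.7809 quoted on the slides and uses the three sample sizes from the figure; it does not use the AmEx data, and the function name and seed are hypothetical.

```python
import random
import statistics

TRUE_P = 0.7809  # acceptance rate quoted on the slides

def proportion_estimates(n_obs, n_samples=200, p=TRUE_P, seed=2024):
    """Return n_samples sample proportions, each from n_obs Bernoulli(p) draws."""
    rng = random.Random(seed)  # hypothetical seed
    return [
        sum(rng.random() < p for _ in range(n_obs)) / n_obs
        for _ in range(n_samples)
    ]

spread = {}
for n_obs in (144, 1024, 4900):
    estimates = proportion_estimates(n_obs)
    spread[n_obs] = statistics.stdev(estimates)
    print(n_obs, round(statistics.fmean(estimates), 4), round(spread[n_obs], 4))
```

The standard deviation of the estimates should fall roughly in proportion to 1/sqrt(N).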

24 24/98 Part 3 – Estimation Theory Competing Estimators of a Parameter Bank costs are normally distributed with mean μ. Which is a better estimator of μ, the mean (11.46) or the median (11.27)?

25 25/98 Part 3 – Estimation Theory Interval estimates of the acceptance rate Based on the 100 samples of 144 observations
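One common way to build such intervals is the normal approximation, p ± 1.96 sqrt(p(1-p)/N). The sketch below assumes that formula (the slide does not say which was used) and simulates the 100 samples of 144 observations from a Bernoulli population with rate 0.7809 rather than from the actual application file.

```python
import math
import random

TRUE_P = 0.7809   # acceptance rate quoted on the slides
N_OBS = 144
N_SAMPLES = 100

rng = random.Random(7)  # hypothetical seed
intervals = []
for _ in range(N_SAMPLES):
    # Point estimate: the sample proportion of acceptances.
    p_hat = sum(rng.random() < TRUE_P for _ in range(N_OBS)) / N_OBS
    # Interval estimate: point estimate plus or minus 1.96 standard errors.
    half_width = 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / N_OBS)
    intervals.append((p_hat - half_width, p_hat + half_width))

covered = sum(lo <= TRUE_P <= hi for lo, hi in intervals)
print(f"{covered} of {N_SAMPLES} intervals contain the true proportion")
```

Most, but usually not all, of the intervals should contain the true proportion.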

26 26/98 Part 3 – Estimation Theory Methods of Estimation Information about the source population Approaches Method of Moments Maximum Likelihood Bayesian 26

27 27/98 Part 3 – Estimation Theory The Method of Moments

28 28/98 Part 3 – Estimation Theory Estimating a Parameter Mean of Poisson: p(y) = exp(-λ) λ^y / y!, y = 0,1,…; λ > 0. E[y] = λ. E[(1/N)Σ_i y_i] = λ. (1/N)Σ_i y_i is the estimator. Mean of Exponential: f(y) = θ exp(-θy), y > 0; θ > 0. E[y] = 1/θ. E[(1/N)Σ_i y_i] = 1/θ. 1/{(1/N)Σ_i y_i} is the estimator of θ.
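Both moment estimators can be checked on simulated data. In this sketch the true values, sample size and seed are made up for illustration, and `poisson_draw` is a small helper written here (Knuth's multiplication method) because the Python standard library has no Poisson generator.

```python
import math
import random
import statistics

def poisson_draw(rng, lam):
    """One Poisson(lam) draw by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(42)                        # hypothetical seed
TRUE_LAMBDA, TRUE_THETA, N = 3.0, 0.5, 5000    # illustrative values

# Poisson: E[y] = lambda, so the sample mean estimates lambda.
poisson_sample = [poisson_draw(rng, TRUE_LAMBDA) for _ in range(N)]
lambda_hat = statistics.fmean(poisson_sample)

# Exponential with rate theta: E[y] = 1/theta, so 1/(sample mean) estimates theta.
exponential_sample = [rng.expovariate(TRUE_THETA) for _ in range(N)]
theta_hat = 1.0 / statistics.fmean(exponential_sample)

print("lambda_hat:", round(lambda_hat, 3), " theta_hat:", round(theta_hat, 3))
```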

29 29/98 Part 3 – Estimation Theory Mean and Variance of a Normal Distribution

30 30/98 Part 3 – Estimation Theory Proportion for Bernoulli In the AmEx data, the true population acceptance rate is 0.7809 = θ. Y = 1 if application accepted, 0 if not. E[y] = θ. E[(1/N)Σ_i y_i] = p_accept = θ. This is the estimator.

31 31/98 Part 3 – Estimation Theory Gamma Distribution

32 32/98 Part 3 – Estimation Theory Method of Moments Ψ(P) = Γ′(P)/Γ(P) = d log Γ(P)/dP

33 33/98 Part 3 – Estimation Theory 33

34 34/98 Part 3 – Estimation Theory Estimate One Parameter Assume λ is known to be 0.1. Estimate P. E[y] = P/λ = P/.1 = 10P. m_1 = mean of y = 31.278. Estimate of P is 31.278/10 = 3.1278. One equation in one unknown.

35 35/98 Part 3 – Estimation Theory Application

36 36/98 Part 3 – Estimation Theory Method of Moments Solutions
create ; y1=y ; y2=log(y) ; ysq=y*y $
calc ; m1=xbr(y1) ; mlog=xbr(y2) ; m2=xbr(ysq) $
Minimize ; start = 2.0,.06 ; labels = p,l ; fcn = (m1 - p/l)^2 + (mlog - (psi(p)-log(l)))^2 $
--------+-------------------------------------------
       P|  2.41074
       L|   .07707
--------+-------------------------------------------
Minimize ; start = 2.0,.06 ; labels = p,l ; fcn = (m1 - p/l)^2 + (m2 - p*(p+1)/l^2)^2 $
--------+-------------------------------------------
       P|  2.06182
       L|   .06589
--------+-------------------------------------------
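For the second pair of moments (m1 and m2) the two equations can be solved by hand: since m2 - m1^2 = P/λ^2, it follows that λ = m1/(m2 - m1^2) and P = m1·λ. The sketch below applies that closed form to simulated gamma data. The course data set is not reproduced here, so the "true" values are illustrative, and `gamma_mom` is a name introduced for this example.

```python
import random
import statistics

def gamma_mom(y):
    """Method-of-moments estimates (P, lambda) for a gamma density with
    E[y] = P/lambda and E[y^2] = P(P+1)/lambda^2."""
    m1 = statistics.fmean(y)
    m2 = statistics.fmean(v * v for v in y)
    variance = m2 - m1 * m1          # equals P / lambda^2 in the population
    lam_hat = m1 / variance
    p_hat = m1 * lam_hat
    return p_hat, lam_hat

rng = random.Random(1)                    # hypothetical seed
TRUE_P, TRUE_LAMBDA = 2.4, 0.077          # illustrative values only
# random.gammavariate takes (shape, scale); the scale is 1/lambda.
y = [rng.gammavariate(TRUE_P, 1.0 / TRUE_LAMBDA) for _ in range(20000)]

p_hat, lam_hat = gamma_mom(y)
print("P_hat:", round(p_hat, 4), " lambda_hat:", round(lam_hat, 5))
```

With two equations in two unknowns the minimized sum of squares is zero at this solution, so a numerical minimizer and the closed form should agree up to convergence tolerance.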

37 37/98 Part 3 – Estimation Theory Properties of MoM estimator Unbiased? Sometimes, e.g., normal, Bernoulli and Poisson means Consistent? Yes by virtue of Slutsky Theorem Assumes parameters can vary continuously Assumes moment functions are continuous and smooth Efficient? Maybe – remains to be seen. (Which pair of moments should be used for the gamma distribution?) Sampling distribution? Generally normal by virtue of Lindeberg-Levy central limit theorem and the Slutsky theorem. 37

38 38/98 Part 3 – Estimation Theory Estimating Sampling Variance Exact sampling results – Poisson Mean, Normal Mean and Variance Approximation based on linearization Bootstrapping – discussed later with maximum likelihood estimator. 38

39 39/98 Part 3 – Estimation Theory Exact Variance of MoM Estimate normal or Poisson mean. Estimator is sample mean = (1/N)Σ_i Y_i. Exact variance of sample mean is 1/N * population variance.

40 40/98 Part 3 – Estimation Theory Linearization Approach – 1 Parameter 40

41 41/98 Part 3 – Estimation Theory Linearization Approach – 1 Parameter 41

42 42/98 Part 3 – Estimation Theory Linearization Approach - General 42

43 43/98 Part 3 – Estimation Theory Exercise: Gamma Parameters m_1 = (1/N)Σ_i y_i => P/λ. m_2 = (1/N)Σ_i y_i^2 => P(P+1)/λ^2. 1. What is the Jacobian? (Derivatives) 2. How to compute the variance of m_1, the variance of m_2 and the covariance of m_1 and m_2? (The variance of m_1 is 1/N times the variance of y; the variance of m_2 is 1/N times the variance of y^2. The covariance is 1/N times the covariance of y and y^2.)

44 44/98 Part 3 – Estimation Theory Sufficient Statistics 44

45 45/98 Part 3 – Estimation Theory Sufficient Statistic 45

46 46/98 Part 3 – Estimation Theory Sufficient Statistic 46

47 47/98 Part 3 – Estimation Theory Sufficient Statistics 47

48 48/98 Part 3 – Estimation Theory Gamma Density 48

49 49/98 Part 3 – Estimation Theory Rao Blackwell Theorem The mean squared error of an estimator based on sufficient statistics is smaller than one not based on sufficient statistics. We deal in consistent estimators, so a large sample (approximate) version of the theorem is that estimators based on sufficient statistics are more efficient than those that are not. 49

50 50/98 Part 3 – Estimation Theory Maximum Likelihood Estimation Criterion Comparable to method of moments Several virtues: Broadly, uses all the sample and nonsample information available ⇒ efficient (better than MoM in many cases)

51 51/98 Part 3 – Estimation Theory Setting Up the MLE The distribution of the observed random variable is written as a function of the parameter(s) to be estimated. P(y_i|θ) = probability density of data | parameters. L(θ|y_i) = likelihood of parameter | data. The likelihood function is constructed from the density. Construction: joint probability density function of the observed sample of data, generally the product when the data are a random sample. The estimator is chosen to maximize the likelihood of the data (essentially the probability of observing the sample in hand).

52 52/98 Part 3 – Estimation Theory Regularity Conditions Why? Regular MLE has known, good properties. Nonregular estimators usually do not have known properties (good or bad). What they are 1. logf(.) has three continuous derivatives wrt parameters 2. Conditions needed to obtain expectations of derivatives are met. (E.g., range of the variable is not a function of the parameters.) 3. Third derivative has finite expectation. What they mean Moment conditions and convergence. We need to obtain expectations of derivatives. We need to be able to truncate Taylor series. We will use central limit theorems MLE exists for nonregular densities (see text). Questionable statistical properties.

53 53/98 Part 3 – Estimation Theory Regular Exponential Density Exponential density f(y_i|θ) = (1/θ)exp(-y_i/θ). Average time until failure, θ, of light bulbs. y_i = observed life until failure. Regularity: (1) Range of y is 0 to ∞, free of θ. (2) log f(y_i|θ) = -log θ - y_i/θ; ∂log f(y_i|θ)/∂θ = -1/θ + y_i/θ^2; E[y_i] = θ, so E[∂log f(y_i|θ)/∂θ] = 0. (3) ∂^2 log f(y_i|θ)/∂θ^2 = 1/θ^2 - 2y_i/θ^3 has finite expectation = -1/θ^2. (4) ∂^3 log f(y_i|θ)/∂θ^3 = -2/θ^3 + 6y_i/θ^4 has finite expectation = 4/θ^3. (5) All derivatives are continuous functions of θ.

54 54/98 Part 3 – Estimation Theory Likelihood Function L(θ) = Π_i f(y_i|θ). MLE = the value of θ that maximizes the likelihood function. Generally easier to maximize the log of L. The same θ maximizes log L. In random sampling, log L = Σ_i log f(y_i|θ).

55 55/98 Part 3 – Estimation Theory Poisson Likelihood (log and ln both mean natural log throughout this course.)

56 56/98 Part 3 – Estimation Theory The MLE The log-likelihood function: log L(θ|data) = Σ_i log f(y_i|θ). The likelihood equation(s): the first derivatives of log L equal zero at the MLE. ∂[Σ_i log f(y_i|θ)]/∂θ_MLE = 0. (Interchange summation and differentiation) Σ_i [∂log f(y_i|θ)/∂θ_MLE] = 0.

57 57/98 Part 3 – Estimation Theory Applications Bernoulli Exponential Poisson Normal Gamma 57

58 58/98 Part 3 – Estimation Theory Bernoulli 58

59 59/98 Part 3 – Estimation Theory Exponential Estimating the average time until failure, θ, of light bulbs. y_i = observed life until failure. f(y_i|θ) = (1/θ)exp(-y_i/θ). L(θ) = Π_i f(y_i|θ) = θ^(-N) exp(-Σ_i y_i/θ). log L(θ) = -N log θ - Σ_i y_i/θ. Likelihood equation: ∂log L(θ)/∂θ = -N/θ + Σ_i y_i/θ^2 = 0. Solution: (multiply both sides of the equation by θ^2) θ = Σ_i y_i/N (the sample average estimates the population average).
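The closed-form solution can be checked numerically by comparing the log-likelihood at the sample mean with its value on a grid of alternatives. The bulb lifetimes below are simulated, with a made-up mean life of 1,000 hours and a hypothetical seed; only the formula for log L comes from the slide.

```python
import math
import random
import statistics

def log_likelihood(theta, y):
    """log L(theta) = -N log(theta) - sum(y)/theta for f(y|theta) = (1/theta)exp(-y/theta)."""
    return -len(y) * math.log(theta) - sum(y) / theta

rng = random.Random(3)                 # hypothetical seed
TRUE_THETA = 1000.0                    # illustrative mean life in hours
# random.expovariate takes the rate, which is 1/theta in this parameterization.
y = [rng.expovariate(1.0 / TRUE_THETA) for _ in range(500)]

theta_mle = statistics.fmean(y)        # solution of the likelihood equation

# Grid from 0.5 to 1.5 times the MLE in steps of 1 percent.
grid = [theta_mle * (0.5 + 0.01 * k) for k in range(101)]
best_on_grid = max(grid, key=lambda t: log_likelihood(t, y))

print("MLE (sample mean):", round(theta_mle, 2))
print("best grid value:  ", round(best_on_grid, 2))
```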

60 60/98 Part 3 – Estimation Theory Poisson Distribution 60
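The slide's formulas did not survive in this transcript. A standard derivation of the Poisson MLE, using the density p(y) = exp(-λ) λ^y / y! given earlier in the deck, would run roughly as follows (a reconstruction, not the original slide content):

```latex
\begin{aligned}
L(\lambda) &= \prod_{i=1}^{N} \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}, \\
\log L(\lambda) &= -N\lambda + \log\lambda \sum_{i=1}^{N} y_i - \sum_{i=1}^{N} \log (y_i!), \\
\frac{\partial \log L(\lambda)}{\partial \lambda} &= -N + \frac{1}{\lambda}\sum_{i=1}^{N} y_i = 0
\quad\Longrightarrow\quad \hat{\lambda}_{MLE} = \frac{1}{N}\sum_{i=1}^{N} y_i = \bar{y}, \\
\frac{\partial^2 \log L(\lambda)}{\partial \lambda^2} &= -\frac{1}{\lambda^2}\sum_{i=1}^{N} y_i < 0 .
\end{aligned}
```

The negative second derivative (whenever at least one y_i is positive) confirms that the sample mean is a maximum, and it coincides with the method of moments estimator.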

61 61/98 Part 3 – Estimation Theory Normal Distribution 61

62 62/98 Part 3 – Estimation Theory Gamma Distribution Ψ(P) = Γ′(P)/Γ(P) = d log Γ(P)/dP

63 63/98 Part 3 – Estimation Theory Gamma Application
Gamma (Loglinear) Regression Model
Dependent variable Y
Log likelihood function -85.37567
--------+----------------------------------------------------------------
        |               Standard            Prob.      95% Confidence
       Y| Coefficient     Error       z    |z|>Z*         Interval
--------+----------------------------------------------------------------
        |Parameters in conditional mean function
  LAMBDA|    .07707***   .02544    3.03    .0024     .02722     .12692
        |Scale parameter for gamma model
 P_scale|   2.41074***   .71584    3.37    .0008    1.00757    3.81363
--------+----------------------------------------------------------------
SAME SOLUTION AS METHOD OF MOMENTS USING M1 and Mlog
create ; y1=y ; y2=log(y) $
calc ; m1=xbr(y1) ; mlog=xbr(y2) $
Minimize ; start = 2.0,.06 ; labels = p,l ; fcn = (m1 - p/l)^2 + (mlog - (psi(p)-log(l)))^2 $
--------+---------------------------------------------------
       P|  2.41074
       L|   .07707
--------+---------------------------------------------------

64 64/98 Part 3 – Estimation Theory Properties of the MLE Estimator Regularity Finite sample vs. asymptotic properties Properties of the estimator Information used in estimation 64

65 65/98 Part 3 – Estimation Theory Properties of the MLE Sometimes unbiased, usually not Always consistent (under regularity) Large sample normal distribution Efficient Invariant Sufficient (uses sufficient statistics when they exist) 65

66 66/98 Part 3 – Estimation Theory Unbiasedness Usually when estimating a parameter that is the mean of the random variable Normal mean Poisson mean Bernoulli probability is the mean. Does not make degrees of freedom corrections Almost no other cases. 66

67 67/98 Part 3 – Estimation Theory Consistency Under regularity MLE is consistent. Without regularity, it may be consistent, but usually cannot be proved. Almost all cases, mean square consistent Expectation converges to the parameter Variance converges to zero. (Proof sketched in Rice text, 275-276) 67

68 68/98 Part 3 – Estimation Theory Large Sample Distribution

69 69/98 Part 3 – Estimation Theory The Information Equality

70 70/98 Part 3 – Estimation Theory Deduce The Variance of MLE

71 71/98 Part 3 – Estimation Theory Computing the Variance of the MLE

72 72/98 Part 3 – Estimation Theory Application: GSOEP Income
Descriptive Statistics for 1 variables
--------+---------------------------------------------------------------------
Variable|       Mean    Std.Dev.    Minimum    Maximum    Cases  Missing
--------+---------------------------------------------------------------------
  HHNINC|    .355564     .166561    .030000        2.0     2698        0
--------+---------------------------------------------------------------------

73 73/98 Part 3 – Estimation Theory Variance of MLE

74 74/98 Part 3 – Estimation Theory Bootstrapping Given the sample, i = 1,…,N: Sample N observations with replacement (some get picked more than once, some do not get picked). Recompute the estimate of θ. Repeat R times, obtaining R new estimates of θ. Estimate the variance with the sample variance of the R new estimates.
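The bootstrap recipe above can be sketched as follows. The data are synthetic (exponential draws with mean 0.35, chosen only to resemble an income variable), the estimator is the sample mean so the result can be compared with the exact formula s²/N, and `bootstrap_variance`, R = 500 and the seeds are assumptions of this example.

```python
import random
import statistics

def bootstrap_variance(sample, estimator, n_reps=500, seed=99):
    """Resample with replacement n_reps times, re-estimate each time, and
    return the sample variance of the n_reps estimates."""
    rng = random.Random(seed)  # hypothetical seed
    n = len(sample)
    estimates = [estimator(rng.choices(sample, k=n)) for _ in range(n_reps)]
    return statistics.variance(estimates)

data_rng = random.Random(5)
sample = [data_rng.expovariate(1.0 / 0.35) for _ in range(400)]  # synthetic data

boot_var = bootstrap_variance(sample, statistics.fmean)
exact_var = statistics.variance(sample) / len(sample)   # s^2 / N for a sample mean

print("bootstrap variance:", round(boot_var, 6))
print("s^2 / N:           ", round(exact_var, 6))
```

For the sample mean the two numbers should be close; the value of the bootstrap is that the same function works for estimators with no simple variance formula.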

75 75/98 Part 3 – Estimation Theory Bootstrap Results Estimated Variance = .00311^2.

76 76/98 Part 3 – Estimation Theory Sufficiency If sufficient statistics exist, the MLE will be a function of them Therefore, MLE satisfies the Rao Blackwell Theorem (in large samples).

77 77/98 Part 3 – Estimation Theory Efficiency Cramér-Rao Lower Bound: the variance of a consistent, asymptotically normally distributed estimator is ≥ -1/{N E[H_i(θ)]}. The MLE achieves the C-R lower bound, so it is efficient. Implication: For normal sampling, the mean is better than the median.
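The closing implication can be illustrated by simulation: under normal sampling, both the mean and the median centre on the true value, but the median has the larger sampling variance (by a factor of about π/2 in large samples, a standard result not stated on the slide). The sample size, number of replications and seed below are arbitrary choices.

```python
import random
import statistics

rng = random.Random(11)          # hypothetical seed
N_OBS, N_SAMPLES = 100, 2000     # illustrative sizes

means, medians = [], []
for _ in range(N_SAMPLES):
    sample = [rng.gauss(0.0, 1.0) for _ in range(N_OBS)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

var_mean = statistics.variance(means)
var_median = statistics.variance(medians)
ratio = var_median / var_mean

print("variance of the mean:  ", round(var_mean, 5))
print("variance of the median:", round(var_median, 5))
print("ratio median/mean:     ", round(ratio, 3))
```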

78 78/98 Part 3 – Estimation Theory Invariance

79 79/98 Part 3 – Estimation Theory Bayesian Estimation Philosophical underpinnings How to combine information contained in the sample

80 80/98 Part 3 – Estimation Theory “Estimation” Assembling information Prior information = out of sample. Literally prior or outside information Sample information is embodied in the likelihood Result of the analysis: “Posterior belief” = blend of prior and likelihood

81 81/98 Part 3 – Estimation Theory Using Conditional Probabilities: Bayes Theorem Typical application: We know P(B|A), we want P(A|B). In drug testing: We know P(find evidence of drug use | usage) < 1. We need P(usage | find evidence of drug use). The problem is false positives: P(find evidence of drug use | not usage) > 0. This implies that P(usage | find evidence of drug use) ≠ 1.

82 82/98 Part 3 – Estimation Theory Bayes Theorem

83 83/98 Part 3 – Estimation Theory Disease Testing Notation: + = test indicates disease, – = test indicates no disease; D = presence of disease, N = absence of disease. Known Data: P(Disease) = P(D) = .005 (Fairly rare) (Incidence). P(Test correctly indicates disease) = P(+|D) = .98 (Sensitivity) (Correct detection of the disease). P(Test correctly indicates absence) = P(–|N) = .95 (Specificity) (Correct failure to detect the disease). Objectives: Deduce these probabilities: P(D|+) (Probability disease really is present | test positive), P(N|–) (Probability disease really is absent | test negative). Note, P(D|+) = the probability that a patient actually has the disease when the test says they do.

84 84/98 Part 3 – Estimation Theory More Information Deduce: Since P(+|D) = .98, we know P(–|D) = .02 because P(–|D) + P(+|D) = 1. [P(–|D) is P(False negative).] Deduce: Since P(–|N) = .95, we know P(+|N) = .05 because P(–|N) + P(+|N) = 1. [P(+|N) is P(False positive).] Deduce: Since P(D) = .005, we know P(N) = .995 because P(D) + P(N) = 1.

85 85/98 Part 3 – Estimation Theory Now, Use Bayes Theorem
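The slide's worked formulas are missing from this transcript. Applying Bayes theorem to the known data from the preceding slides (incidence .005, sensitivity .98, specificity .95) can be sketched as below; the helper name `posterior` is introduced here for illustration.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes theorem for a two-state problem:
    P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|not A)P(not A)]."""
    numerator = likelihood_if_true * prior
    return numerator / (numerator + likelihood_if_false * (1.0 - prior))

P_D = 0.005          # P(D), incidence
SENSITIVITY = 0.98   # P(+|D)
SPECIFICITY = 0.95   # P(-|N)

# P(D|+): prior P(D), evidence "+" with P(+|D) = .98 and P(+|N) = .05
p_disease_given_positive = posterior(P_D, SENSITIVITY, 1.0 - SPECIFICITY)

# P(N|-): prior P(N), evidence "-" with P(-|N) = .95 and P(-|D) = .02
p_healthy_given_negative = posterior(1.0 - P_D, SPECIFICITY, 1.0 - SENSITIVITY)

print("P(D|+) =", round(p_disease_given_positive, 4))
print("P(N|-) =", round(p_healthy_given_negative, 6))
```

Because the disease is rare, P(D|+) comes out at roughly 9 percent even though the test is 98 percent sensitive: most positives are false positives.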

86 86/98 Part 3 – Estimation Theory Bayesian Investigation No fixed “parameters.” θ is a random variable. Data are realizations of random variables. There is a marginal distribution p(data). Parameters are part of the random state of nature; p(θ) = distribution of θ independently of (prior to) the data. Investigation combines sample information with prior information. Outcome is a revision of the prior based on the observed information (data).

87 87/98 Part 3 – Estimation Theory

88 88/98 Part 3 – Estimation Theory Symmetrical Treatment Likelihood is p(data|θ). Prior distribution summarizes the nonsample information about θ in p(θ). Joint distribution is p(data, θ). p(data, θ) = p(data|θ)p(θ) = Likelihood x Prior. Use Bayes theorem to get p(θ|data) = posterior distribution.

89 89/98 Part 3 – Estimation Theory The Posterior Distribution

90 90/98 Part 3 – Estimation Theory Priors – Where do they come from? What does the prior contain Informative priors – real prior information Noninformative priors Mathematical Complications Diffuse Uniform Normal with huge variance Improper priors Conjugate priors

91 91/98 Part 3 – Estimation Theory Application Consider estimation of the probability that a production process will produce a defective product. In case 1, suppose the sampling design is to choose N = 25 items from the production line and count the number of defectives. If the probability that any item is defective is a constant θ between zero and one, then the likelihood for the sample of data is L(θ | data) = θ^D (1 − θ)^(25−D), where D is the number of defectives, say, 8. The maximum likelihood estimator of θ will be q = D/25 = 0.32, and the asymptotic variance of the maximum likelihood estimator is estimated by q(1 − q)/25 = 0.008704.

92 92/98 Part 3 – Estimation Theory Application: Posterior Density

93 93/98 Part 3 – Estimation Theory Posterior Moments

94 94/98 Part 3 – Estimation Theory Mixing Prior and Sample Information

95 95/98 Part 3 – Estimation Theory Modern Bayesian Analysis
Bayesian Estimate of Theta
Observations = 5000 (Posterior mean was .333333)
Mean = .334017
Standard Deviation = .086336
Posterior Variance = .007936
Sample variance = .007454
Skewness = .248077
Kurtosis-3 (excess) = -.161478
Minimum = .066214
Maximum = .653625
.025 Percentile = .177090
.975 Percentile = .510028
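These figures are consistent with a uniform prior on θ for the defectives example (D = 8 of N = 25), which gives a Beta(D+1, N−D+1) = Beta(9, 18) posterior with mean 9/27 = .3333 and variance .007936. The prior is not shown in this transcript, so the uniform prior is an assumption; the sketch below simulates 5,000 posterior draws on that basis, with a hypothetical seed.

```python
import random
import statistics

N_ITEMS, DEFECTIVES = 25, 8

# Assumed uniform prior on theta: posterior is Beta(D + 1, N - D + 1).
a = DEFECTIVES + 1
b = N_ITEMS - DEFECTIVES + 1

exact_mean = a / (a + b)
exact_var = a * b / ((a + b) ** 2 * (a + b + 1))

rng = random.Random(2718)  # hypothetical seed
draws = sorted(rng.betavariate(a, b) for _ in range(5000))

print("exact posterior mean:    ", round(exact_mean, 6))
print("exact posterior variance:", round(exact_var, 6))
print("simulated mean:          ", round(statistics.fmean(draws), 6))
print("simulated variance:      ", round(statistics.variance(draws), 6))
# Rough 2.5 and 97.5 percentiles read off the sorted draws.
print("2.5% and 97.5% points:   ", round(draws[124], 4), round(draws[4874], 4))
```

The simulated moments will differ slightly from the listing above because the random draws are different.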

96 96/98 Part 3 – Estimation Theory Modern Bayesian Analysis Multiple parameter settings: Derivation of the exact form of expectations and variances for p(θ_1, θ_2, …, θ_K | data) is hopelessly complicated even if the density is tractable. Strategy: Sample joint observations (θ_1, θ_2, …, θ_K) from the posterior population and use marginal means, variances, quantiles, etc. How to sample the joint observations??? (Still hopelessly complicated.)

97 97/98 Part 3 – Estimation Theory Magic: The Gibbs Sampler
Objective: Sample joint observations on θ_1, θ_2, …, θ_K from p(θ_1, θ_2, …, θ_K | data). (Let K = 3.)
Strategy: Gibbs sampling: Derive p(θ_1 | θ_2, θ_3, data), p(θ_2 | θ_1, θ_3, data) and p(θ_3 | θ_1, θ_2, data).
Gibbs cycles produce joint observations:
0. Start θ_1, θ_2, θ_3 at some reasonable values.
1. Sample a draw from p(θ_1 | θ_2, θ_3, data) using the draws of θ_2, θ_3 in hand.
2. Sample a draw from p(θ_2 | θ_1, θ_3, data) using the draw at step 1 for θ_1.
3. Sample a draw from p(θ_3 | θ_1, θ_2, data) using the draws at steps 1 and 2.
4. Return to step 1. After a burn-in period (a few thousand cycles), start collecting the draws.
The set of draws ultimately gives a sample from the joint distribution.
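A toy version of the Gibbs cycle, with K = 2 instead of 3, can be written for a target whose conditionals are known exactly: a standard bivariate normal with correlation ρ, where θ_1 | θ_2 ~ Normal[ρθ_2, 1 − ρ²] and symmetrically for θ_2. This target is chosen purely for illustration and is not a posterior from the course; ρ, the burn-in length and the seed are assumptions.

```python
import math
import random
import statistics

def gibbs_bivariate_normal(rho, n_draws=5000, burn_in=1000, seed=8):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)  # hypothetical seed
    cond_sd = math.sqrt(1.0 - rho * rho)
    theta1, theta2 = 0.0, 0.0                        # step 0: starting values
    draws = []
    for cycle in range(burn_in + n_draws):
        theta1 = rng.gauss(rho * theta2, cond_sd)    # draw theta1 | current theta2
        theta2 = rng.gauss(rho * theta1, cond_sd)    # draw theta2 | new theta1
        if cycle >= burn_in:                         # keep draws after burn-in
            draws.append((theta1, theta2))
    return draws

RHO = 0.6
draws = gibbs_bivariate_normal(RHO)
t1 = [d[0] for d in draws]
t2 = [d[1] for d in draws]

# Sample correlation of the retained draws, computed by hand.
m1, m2 = statistics.fmean(t1), statistics.fmean(t2)
cov = sum((u - m1) * (v - m2) for u, v in zip(t1, t2)) / (len(draws) - 1)
sample_corr = cov / (statistics.stdev(t1) * statistics.stdev(t2))

print("means:", round(m1, 3), round(m2, 3), " correlation:", round(sample_corr, 3))
```

Marginal means, variances and quantiles are then read off the retained draws, as the strategy on the previous slide describes.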

98 98/98 Part 3 – Estimation Theory Methodological Issues Priors: Schizophrenia Uninformative are disingenuous Informative are not objective Using existing information? Bernstein von Mises and likelihood estimation. In large samples, the likelihood dominates The posterior mean will be the same as the MLE

