Small Area Estimation (in survey research)
Knock! Knock! Who's there? (without opening the door) "The census taker." "Go away - I don't want my senses taken." "No, you don't understand, I just want to survey you." "A statistical sample of one isn't valid -- go away." "You aren't the only one." "So you are bothering a whole bunch of people, go away." "Look, you are unique and I don't want to miss you in the survey." "How do you know I'm unique when you haven't surveyed me yet?" "Ok, I don't know you are unique, but you might be." "You mean you think I'm an oddball." "No, maybe more like an outlier." "Now you are calling me an out and out lier, go away." "No, I mean you are far from the average Joe." "I hope so, I'm Sally." "Look Sally, we are trying to get population data, how many people live here?" "Gosh, how would I know, I think there are about 15 thousand in Smugville." "No, I mean in this house!" "Oh, that's a question of a different nature." "So, how many?" "Sometimes one, sometimes two, sometimes four, now -- go away." "No, I need a precise number." "Ok, how about 1.34?" "How did you come up with that?" "I live here sometimes during the week, my sister visits me on weekends, and my mother visits me every second week, my two cats are sometimes here, and my …. and that’s none of your business." "Thanks, Sally, have a great day." (census taker wrote -- "NO PERSONS LIVING HERE - UNOCCUPIED.")
What is SAE?
"Small area estimation is the collective term for several statistical techniques involving the estimation of parameters for small sub-populations, generally used when the sub-population of interest is included in a larger survey." - Wikipedia (the free encyclopedia)
Small area: a sub-population for which there is not enough sample to construct reliable estimates directly based on the survey sample
–small geographical area, such as LHA
–small domain, such as demographic subgroups
Area with small number of respondents: estimates with low precision (large standard error)
Area with no respondent: no estimate
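The link between a small number of respondents and low precision can be illustrated with a short sketch. It assumes simple random sampling and uses made-up figures (a 20% prevalence, 2,000 versus 20 respondents); real surveys with complex designs would need design-based variance estimates instead.

```python
import math

def proportion_se(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical figures: the same 20% prevalence estimated from a large
# region (2,000 respondents) and from a small area (20 respondents).
se_region = proportion_se(0.20, 2000)
se_small = proportion_se(0.20, 20)

print(f"large region SE: {se_region:.4f}")
print(f"small area SE:   {se_small:.4f}")
```

With 100 times fewer respondents, the standard error is 10 times larger, which is why direct estimates for small areas are often too unstable to publish.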
Why SAE?
Growing demand for reliable small area statistics for policy analysis and planning purposes
–there is “increasing government concern with issues of distribution, equity and disparity”
–apportionment of government funds
–regional planning
Constraints of national surveys:
–not designed to produce reliable estimates at the small area level due to cost constraints.
Limitation of administrative data sources:
–do not have the necessary information to provide the detailed statistics needed for small areas.
How to do SAE?
“Borrow strength” from related or similar small areas through explicit or implicit models that connect the small areas via supplementary data
–combine data obtained from large scale surveys containing measures of interest with a set of covariates available for all small areas from other sources
Auxiliary information/covariates
–correlated with the measure of interest
–known for all small areas
–common source: census, administrative registries
SAE Methods
Simple approaches:
–Demographic methods
  local estimation of population in postcensal years
  latest census data + administrative registries (e.g., births, deaths)
–Synthetic estimation
  derived from direct survey estimate of a large area
  the small area is covered by the large area
  assumption: the small areas have the same characteristics as the large area
  potential bias
  Indirect standardization
–Composite estimation
  weighted average of the synthetic and survey direct estimates
  balance the potential bias of a synthetic estimator and the instability of a direct estimator
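The composite estimator described above can be sketched in a few lines. All numbers are hypothetical, and the weighting rule n / (n + k) is only one illustrative choice (an assumption, not something the slides specify); in practice the weight is usually chosen to minimize estimated mean squared error.

```python
def composite_estimate(direct, synthetic, weight):
    """Weighted average of a direct and a synthetic estimate.

    weight is the share given to the direct estimate (between 0 and 1).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return weight * direct + (1.0 - weight) * synthetic

# Hypothetical numbers for one small area.
direct = 0.30      # unstable direct survey estimate from a few respondents
synthetic = 0.20   # large-area rate assumed to hold in the small area

# Illustrative weighting rule (an assumption): trust the direct estimate
# more as the area's sample size n grows relative to a tuning constant k.
n, k = 25, 75
weight = n / (n + k)

estimate = composite_estimate(direct, synthetic, weight)
print(round(estimate, 3))
```

A weight near 1 reproduces the direct estimate (unbiased but unstable); a weight near 0 reproduces the synthetic estimate (stable but potentially biased).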
SAE Methods
Multi-level modeling:
–using individual level covariates only
–combining individual and area-level covariates
–using area level covariates only
A model-based estimate generated for a particular small area is the expected outcome for that area based on its characteristics as measured by the covariates.
Example of interpretation: given the characteristics of the local population, we would expect approximately x% of adults within LHA X to smoke, be obese, etc.
Model-based estimation enables us to provide information about the characteristics of all areas in the population, not just the sampled areas.
Indirect Standardization
Applying national (large area) direct survey estimates for each demographic class to area-level population counts to generate expected area estimates.
–intuitively appealing
  The mean level of many variables in a population is highly related to the distribution of such demographic variables as age, sex and social class.
–easy and inexpensive to apply
  local level populations of demographic classes from the Census + national estimates from survey
–assumes that the national rates for each subgroup apply uniformly across all areas.
  Differences between areas are due solely to differences in their demographic composition.
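A minimal sketch of indirect standardization follows. The age groups, national rates, and area population counts are all invented for illustration; the only logic taken from the slide is that national class-specific rates are applied to local class-specific population counts.

```python
# Hypothetical national smoking rates by age group (direct survey estimates).
national_rate = {"18-34": 0.25, "35-64": 0.20, "65+": 0.10}

# Hypothetical census population counts by age group for two small areas.
area_population = {
    "Area A": {"18-34": 4000, "35-64": 5000, "65+": 1000},
    "Area B": {"18-34": 1000, "35-64": 4000, "65+": 5000},
}

def expected_rate(pop_by_class, rate_by_class):
    """Expected area rate: national class rates weighted by local counts."""
    expected_cases = sum(pop_by_class[c] * rate_by_class[c] for c in pop_by_class)
    return expected_cases / sum(pop_by_class.values())

area_estimates = {
    area: expected_rate(pop, national_rate)
    for area, pop in area_population.items()
}

for area, rate in area_estimates.items():
    print(f"{area}: {rate:.3f}")
```

The two areas differ only because of their age structure, which is exactly the assumption the slide flags: any real local departure from the national rates is invisible to this method.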
Models using individual level covariates only
Modeling the relationship between measure of interest and covariates on individual level based on survey data.
Apply estimated model coefficients to covariates available as counts for all small areas (e.g. from the Census) to obtain expected area estimate for measure of interest.
Data requirement
–exact correspondence between the covariates used in the model and data available from the Census or other administrative data sources.
–restricts the choice of covariates in these models.
Within-area clustering is ignored.
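The second step, applying fitted coefficients to census counts, might look like the sketch below. It assumes a logistic model with two binary covariates; the coefficient values and census counts are hypothetical placeholders, not estimates from any real survey.

```python
import math

# Hypothetical coefficients from a logistic regression fitted to the survey
# (illustrative values only, not estimates from real data).
coef = {"intercept": -1.0, "male": 0.4, "age_65_plus": -0.8}

def predicted_prob(male, age_65_plus):
    """Predicted probability of the outcome for one covariate pattern."""
    eta = (coef["intercept"]
           + coef["male"] * male
           + coef["age_65_plus"] * age_65_plus)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical census counts for one small area, cross-classified by the
# same covariates used in the model: (male, age_65_plus) -> count.
census_counts = {(0, 0): 3000, (1, 0): 2800, (0, 1): 1200, (1, 1): 1000}

# Expected number of cases in each cell, summed and divided by the total.
expected_cases = sum(
    count * predicted_prob(male, old)
    for (male, old), count in census_counts.items()
)
area_estimate = expected_cases / sum(census_counts.values())
print(f"expected prevalence: {area_estimate:.3f}")
```

The census table must be cross-classified by exactly the covariates in the model, which is the data requirement that restricts covariate choice.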
Models combining individual and area level covariates
Multi-level models incorporating random effects
–fixed effects of covariates + small area specific random effects
–taking into account the clustering within small area
  suited to the clustered nature of social surveys
  provides more accurate standard error estimates
–enabling exploration of the association of area differences with individual and area level characteristics
–stringent data requirements due to inclusion of individual level covariates
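One common way to write such a model is a two-level random-intercept model. The notation here is an assumption introduced for illustration, since the slides give no formula: $y_{ij}$ is the outcome for individual $i$ in small area $j$, $\mathbf{x}_{ij}$ are individual level covariates, $\mathbf{z}_{j}$ are area level covariates, $u_j$ is the area-specific random effect, and $e_{ij}$ is the individual level error.

```latex
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}
       + \mathbf{z}_{j}^{\top}\boldsymbol{\gamma}
       + u_{j} + e_{ij},
\qquad u_{j} \sim N(0, \sigma_u^{2}),
\quad e_{ij} \sim N(0, \sigma_e^{2})
```

For a binary outcome such as smoking status, the same linear predictor would typically be placed inside a logit link rather than modeled directly.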
Models using area level covariates only
The model gives a constant predicted value for all individuals within an area: the predicted mean of the area.
–avoids the stringent data requirements
–relatively low cost
–a strong argument: controlling for differences in area level covariates is all that is needed for predicting area differences in the study variable.
–does not support subgroup estimates within each small area, such as gender-specific estimates
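As a rough illustration of predicting from area level covariates alone, the sketch below fits an ordinary least squares line to direct estimates from sampled areas and then predicts for every area, including unsampled ones. This is a deliberate simplification (plain unweighted regression, no random effects), and the covariate and all values are hypothetical.

```python
# Hypothetical sampled areas: (area level covariate, direct survey estimate).
# The covariate might be, for example, a census-based deprivation score.
sampled = {"A": (1.0, 0.15), "B": (2.0, 0.20), "C": (3.0, 0.22), "D": (4.0, 0.27)}

xs = [x for x, _ in sampled.values()]
ys = [y for _, y in sampled.values()]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Ordinary least squares slope and intercept for a single covariate.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Covariate values are known for all areas, including unsampled E and F.
all_areas = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 2.5, "F": 5.0}
predicted = {area: intercept + slope * x for area, x in all_areas.items()}

for area, value in predicted.items():
    print(f"{area}: {value:.3f}")
```

Each area receives a single predicted mean, so no subgroup breakdown within an area is possible from this kind of model.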
Data requirements
–survey dataset: holds both the outcome variables (e.g. smoking status) and the individual level covariate data (e.g. age, sex, SEC).
–area-level covariate dataset: contains the estimation area level means for a set of covariates (usually census, administrative and registration data), along with the estimation area identifiers and any higher-level area covariates and identifiers.
–analysis dataset: the survey and covariate datasets matched on estimation area identifier. The analysis dataset contains only the areas sampled in the survey. This dataset is used for modeling.
–implementation dataset: a dataset covering all areas (not just those sampled) to produce the final estimates. The implementation dataset will be at the lowest estimation area level, nested within higher-level geographic identifiers. This will allow the production of higher-level estimates by aggregating estimates for the component small areas.
–external validation dataset: relevant local and/or national surveys or other administrative sources to provide direct estimates of relevant outcomes to compare against the SAE.
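How the analysis and implementation datasets relate to the two source datasets can be sketched as follows. The field names (`area_id`, `smoker`, `pct_low_income`, `region`) and all records are hypothetical; a real project would more likely use a data-frame library for the match.

```python
# Hypothetical survey dataset: one row per respondent.
survey = [
    {"area_id": "A1", "smoker": 1, "age": 34},
    {"area_id": "A1", "smoker": 0, "age": 61},
    {"area_id": "A2", "smoker": 0, "age": 45},
]

# Hypothetical area-level covariate dataset covering every estimation area,
# keyed by area identifier, with a higher-level identifier (region).
area_covariates = {
    "A1": {"region": "North", "pct_low_income": 22.0},
    "A2": {"region": "North", "pct_low_income": 9.5},
    "A3": {"region": "South", "pct_low_income": 15.0},  # not sampled
}

# Analysis dataset: respondents matched to their area's covariates.
# It contains only areas that appear in the survey.
analysis = [{**row, **area_covariates[row["area_id"]]} for row in survey]

# Implementation dataset: one row per area, sampled or not.
implementation = [{"area_id": a, **cov} for a, cov in area_covariates.items()]

sampled_areas = {row["area_id"] for row in analysis}
unsampled = [r["area_id"] for r in implementation
             if r["area_id"] not in sampled_areas]
print("areas needing model-based estimates only:", unsampled)
```

The model is fitted on `analysis` and then applied to every row of `implementation`, which is how unsampled areas such as the hypothetical A3 still receive an estimate.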
Cautions
“Indirect estimators should be considered when better alternatives are not available, but only with appropriate caution and in conjunction with statistical research and evaluation efforts. Both producer and user must not forget that even after such efforts, indirect estimates may not be adequate for the intended purpose.”
You Might Be a Statistician if...
–no one wants your job.
–you are right 95% of the time.
–you feel complete and sufficient.
–you found accountancy too exciting.
–you never have to say you are certain.
–you may not be normal but you are transformable.
References
Ghosh M, Rao JNK (1994). Small area estimation: An appraisal. Statistical Science, 9(1), 55-76.
Pfeffermann D (2002). Small area estimation - New developments and directions. International Statistical Review, 70(1), 125-143.
Goldstein H (2003). Multilevel Statistical Models. New York: Halstead Press.
Rao JNK (2003). Small Area Estimation. Hoboken, New Jersey: John Wiley & Sons, Inc.
Top ten reasons to be a statistician
1. Estimating parameters is easier than dealing with real life.
2. Statisticians are significant.
3. I always wanted to learn the entire Greek alphabet.
4. The probability a statistics major will get a job is > .9999.
5. If I flunk out I can always transfer to Engineering.
6. We do it with confidence, frequency, and variability.
7. You never have to be right - only close.
8. We're normal and everyone else is skewed.
9. The regression line looks better than the unemployment line.
10. No one knows what we do so we are always right.