Using Stata for Subpopulation Analysis of Complex Sample Survey Data


Brady T. West
PhD Student, Michigan Program in Survey Methodology
July 30, 2009
2009 Stata Conference

2009 Stata Conference: Subpop Analysis of Survey Data
Presentation Outline
- Introduction: Subclass Analysis Issues
- Kish’s Taxonomy of Subclasses
- Two Alternative Approaches to Inference
- Variance Estimation and Methods for ‘Singletons’
- Examples using NHANES and NHAMCS Data
- Suggestions for Practice
- Directions for Future Research

Subclass Analysis Issues
- Analysts of large, complex sample survey data sets are often interested in making inferences about subpopulations of the population from which the sample was selected (e.g., Caucasian females).
- These subpopulations are referred to interchangeably in various literatures as subgroups, subclasses, subpopulations, domains, and subdomains, leading to confusion among analysts of survey data.

Subclass Analysis Issues, cont’d
- Software procedures for analysis of complex sample survey data are becoming more powerful, flexible, and widely available, offering analysts several options.
- Analysts need to be careful when analyzing subclasses, and be aware of the alternative approaches to subclass analysis that are possible and their implications for inference.

Kish’s Taxonomy of Subclasses
- Design Domains: Restricted to specific strata according to the complex sample design (usually geographically, e.g., Texas)
- Cross-Classes: Broadly distributed (in theory) across the strata and primary sampling units defining a complex sample (e.g., African-Americans over age 50)
- Mixed Classes: Disproportionately distributed across the complex sample design (e.g., Hispanics in a sample including Los Angeles as a stratum)
See Kish (1987), Statistical Design for Research.

Design Domains
X = Sample Element in Subclass
Stratum PSU 1 PSU 2
1 XXXXXXXXXXX XXXXXXXXX
2 XXXXXXXXXX XXXXXXXXXXXX
3
4
5

Cross-Classes
Stratum PSU 1 PSU 2
1 XXXXXXXXXXXX XXXXX
2 XXXX XXXXXXX
3 XXXXXXXXXXX XXXXXXXXX
4 XXXXXX
5 XXXXXXXXXX

Mixed Classes
Stratum PSU 1 PSU 2
1 XXXXXXXXXXXXXX XXXXXXXXXXXXX
2 X
3 XXXXXXXXXX
4 XX
5 XXXXXXXXXXXX

Applying Kish’s Taxonomy
- The type of subclass is critical for determining an appropriate analysis approach
- Two possible approaches to inference motivated by the taxonomy:
  1. Unconditional approach (cross-classes, mixed classes)
  2. Conditional approach (design domains)

The Unconditional Approach
- Appropriate for Cross-Classes, and in some cases Mixed Classes; the subclass of interest theoretically can appear in all design strata and primary sampling units (PSUs)
- KEY POINT: Allow the software to process the entire survey data set, and recognize all possible design strata and PSUs; DO NOT delete sample cases not in the subclass!

The Unconditional Approach
- Rationale: estimated variances for sample estimates of subclass parameters (based on within-stratum variance between PSUs) need to reflect sample-to-sample variability based on the full complex design
- In other words, if a particular subclass does not appear in a PSU in any given sample (although in theory it could have), that PSU should contribute 0 to variance estimates, rather than be ignored completely!
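The rationale above can be made concrete with the usual linearized variance estimator for a subclass mean. This is a sketch under stated assumptions, not a formula taken from the slides: the notation (h for stratum, i for PSU, j for sample element, w for the sampling weight, δ for the 0/1 subclass indicator) is introduced here for illustration, PSUs are treated as sampled with replacement within strata, and finite population corrections are ignored.

```latex
% Illustrative notation (not from the slides): h = stratum, i = PSU, j = element
\bar{y}_d = \frac{\sum_{h}\sum_{i}\sum_{j} w_{hij}\,\delta_{hij}\,y_{hij}}{\hat{N}_d},
\qquad
\hat{N}_d = \sum_{h}\sum_{i}\sum_{j} w_{hij}\,\delta_{hij},
\qquad
z_{hi} = \frac{1}{\hat{N}_d}\sum_{j} w_{hij}\,\delta_{hij}\,\bigl(y_{hij} - \bar{y}_d\bigr)

\widehat{\mathrm{Var}}(\bar{y}_d) = \sum_{h=1}^{H} \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} \bigl(z_{hi} - \bar{z}_h\bigr)^2,
\qquad
\bar{z}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} z_{hi}
```

Here n_h is the number of sampled PSUs in stratum h under the full design. A PSU with no subclass members has z_hi = 0, yet it still counts toward n_h and toward the stratum mean of the z values; deleting it changes both quantities, which is why the full data set has to be kept.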

The Unconditional Approach
- Further, the subclass sample size in each stratum is going to be a random variable, and theoretical sample-to-sample variance in realizations of this random variable should be incorporated into any variance estimation procedures

The Unconditional Approach
- If cross-classes (or in some cases mixed classes) are being analyzed, and PSUs where the subclass does not appear (by random chance) are deleted, problems arise:
  - Some strata may appear to have only one PSU by design (preventing variance estimation unless an ad hoc approach is used)
  - Entire design strata may be dropped, impacting variance estimates and calculations of degrees of freedom

The Unconditional Approach: General Stata Code
    svy, subpop(indicator): command varlist, options
- indicator = an indicator variable for the subpop or an if condition, e.g., if male == 1
- Subclass estimates for each level of a grouping variable:
    svy: mean varlist, over(groupvar)
    svy: proportion varlist, over(groupvar)
- Stata drops strata* with no subpopulation observations from degrees of freedom calculations
* Exercise: repeat 10 times really fast
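The general syntax above might be exercised as follows. This is a minimal sketch, assuming the NHANES II extract that Stata distributes through webuse rather than the file used in the talk; the variable names psu, strata, finalwgt, sex, race, and bpsystol belong to that extract, the coding sex == 1 for men is assumed, and the indicator name male is hypothetical.

```stata
* Sketch only: Stata's shipped NHANES II extract, not the data file from the talk
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* 0/1 indicator for the subpopulation (assumed coding: sex == 1 is male);
* left missing wherever sex itself is missing
generate byte male = (sex == 1) if !missing(sex)

* The full data set stays in memory; subpop() flags who is in the subclass
svy, subpop(male): mean bpsystol

* The same request written as an if condition inside subpop()
svy, subpop(if sex == 1): mean bpsystol

* Estimates for every level of a grouping variable
svy: mean bpsystol, over(race)
```

In all three calls no observations are dropped before estimation, so every design stratum and PSU remains available to the variance estimator.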

The Conditional Approach
- Appropriate for Design Domains, where a subclass cannot appear outside of specific design strata
- The rationale behind the unconditional approach no longer applies
- Certain design strata should not contribute to variance estimation or calculation of degrees of freedom

The Conditional Approach
- Restrict the analysis to only those design strata where the subclass of interest exists
- Variance estimates reflecting sample-to-sample variability should only be based on those design strata where the subclass can appear (unlike the unconditional approach)
- Subclass sample sizes in design domains are assumed to be fixed, by design

The Conditional Approach: General Stata Code
    svy: command varlist if (condition), options
- (condition) might be male == 1, or a more complex combination of conditions (e.g., male == 1 & age >= 50 & age <= 90)
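A conditional analysis written this way can be followed by a quick look at the design information it actually used. The lines below are a sketch under assumptions: they rely on the NHANES II extract shipped with Stata (variables psu, strata, finalwgt, sex, age, bpsystol), not on the data from the talk, and the condition is chosen only for illustration.

```stata
* Sketch only: Stata's shipped NHANES II extract, not the data file from the talk
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Conditional analysis: cases failing the condition are excluded before estimation
svy: mean bpsystol if sex == 1 & age >= 50

* Strata, PSUs, and design degrees of freedom used by the conditional analysis
display "strata: " e(N_strata) "  PSUs: " e(N_psu) "  design df: " e(df_r)
```

Comparing these counts with the same quantities after a subpop() analysis shows how many design units the restriction removed.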

Variance Estimation Methods
- All of these issues are only relevant when using Taylor Series Linearization (TSL), which is the default for variance estimation in Stata
- Conditional analyses are OK to perform when using replication methods, such as Balanced Repeated Replication or Jackknife Repeated Replication (Rust and Rao, 1996)

Ad-hoc Fixes for ‘Singleton’ Clusters in Stata 10.1
Stata 10.1 provides users with four ad-hoc fixes for the problem where strata are identified with only a single ultimate cluster for variance estimation in a subpopulation analysis:
- Report Missing Standard Errors, singleunit(missing) (not really a fix)
- Treat Units as Certainty Units, singleunit(certainty), which contribute nothing to the standard error
- Scale Variance using Certainty Units, singleunit(scaled), which uses the average variance from each stratum with multiple PSUs for each stratum with only a single PSU
- Center at the Grand Mean, singleunit(centered), where the variance contribution comes from a deviation from the grand mean instead of the stratum mean
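The four fixes listed above are selected through the singleunit() option of svyset. The sketch below is illustrative only: it assumes the NHANES II extract shipped with Stata (variables psu, strata, finalwgt, race, age) rather than the file from the talk, the indicator name b50 is hypothetical, and the use of an if qualifier with svydescribe to inspect a restricted sample is an assumption worth checking against the installed version's help file.

```stata
* Sketch only: Stata's shipped NHANES II extract, not the data file from the talk
webuse nhanes2, clear

* Each svyset call replaces the previous settings; only singleunit() changes
svyset psu [pweight = finalwgt], strata(strata) singleunit(missing)
svyset psu [pweight = finalwgt], strata(strata) singleunit(certainty)
svyset psu [pweight = finalwgt], strata(strata) singleunit(scaled)
svyset psu [pweight = finalwgt], strata(strata) singleunit(centered)

* Hypothetical subclass indicator, guarded against missing inputs
generate byte b50 = (race == 2 & age >= 50) if !missing(race, age)

* List strata left with a single PSU once the sample is restricted
svydescribe if b50 == 1, single
```

Only the last svyset call is in effect when estimation commands are run; the earlier calls are shown to document the available settings.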

Example: The NHANES Data
- We first consider examples based on the NHANES II data set, collected from a nationally representative multistage probability sample of the U.S. population from 1976-1980 (oldie but a goodie)
- Briefly, a sample of the U.S. population was given medical examinations in an effort to assess the health of the U.S. population

Example NHANES Analysis
- Analysis Subclass: African-Americans ages 50 and above (this is a cross-class of the U.S. population, which can theoretically appear in all design strata and PSUs)
- Analysis Objective: Estimate the mean systolic blood pressure of this subclass and an appropriate standard error
- See West et al. (2007) for more details

Conditional Approach: Stata Code for NHANES Analysis
    svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(missing)
    svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(centered)
Also singleunit(certainty), singleunit(scaled)
    gen b50subp = (race == 2 & ager >= 50)
    svy: mean bpsyst if b50subp == 1

Conditional Approach: Results
Method       Est. Mean   TSL SE   Design DF
Missing SE   144.09      .        50-29 = 21
Centered                 1.66
Certainty                1.62
Scaled                   1.90

Conditional Approach?
- This approach would not be appropriate for this particular subclass
- Computed standard errors would generally be biased downward, because additional sources of sample-to-sample variability are ignored when following this approach
- Same issues apply for analytic models
- Evidence that the “scaled” ad-hoc fix may be overly conservative!

Unconditional Approach: Stata Code for NHANES Analysis
    svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(missing)
Note: choice of single unit option does not matter when following this approach!
    gen b50subp = (race == 2 & ager >= 50)
    svy, subpop(b50subp): mean bpsyst

Unconditional Approach: Results
Method       Est. Mean   TSL SE   Des. DF*
Missing SE   144.09      1.66     58-29 = 29
Centered
Certainty
Scaled
* Note: Stata dropped three strata with no sample units in the subpopulation.

Unconditional Approach?
- This approach would be the appropriate choice for a cross-class such as African-Americans over the age of 50
- Inferences are theoretically appropriate
- Same idea for analytic models
- Results suggest that the “centered” and “certainty” ad-hoc fixes for conditional analyses are reasonable

Example: The NHAMCS Data
- Analysis Subclass: Visits to Emergency Departments (ED) by African-American men ages 60 and above (this is another cross-class of the U.S. population, which can theoretically appear in all NHAMCS design strata and PSUs)
- Analysis Objective: Estimate the percentage of all ED visits by members of this subclass for dizziness and/or vertigo in 2004
- See West et al. (2008) for more details

Stata Code for NHAMCS Analyses
    svyset cpsum [pweight = patwt], strata(cstratm) singleunit(…)
    generate subc = (settype == 3 & sex == 2 & agecat == 5 & race == 2)
    * conditional
    svy: tabulate dizzyrfv if subc == 1, se ci percent
    * unconditional
    svy, subpop(subc): tabulate dizzyrfv, se ci percent

NHAMCS Analysis Results
Method          Est. %   TSL SE   Design DF
Missing SE      4.82     1.576    106
Centered
Certainty
Scaled
Unconditional            1.590    286

NHAMCS Analysis Implications
- No problems with strata having only a single ultimate cluster: ad-hoc fixes all give the same results
- Weighted point estimates are identical
- Substantially fewer design-based degrees of freedom when following the conditional approach; the full complex design will not be reflected in estimation of sample-to-sample variance (many ultimate clusters are lost)
- Conditional analysis assumes that each sample will be of fixed size n = 397 for variance estimation purposes; no random variance!
- Conditional analysis results in overly liberal inferences

Suggestions for Practice
- Consider Kish’s Taxonomy when determining an appropriate subclass analysis approach
- Utilize the appropriate software options for unconditional analyses when analyzing cross-classes
- Be careful with missing values when creating the subpopulation indicator
- The unconditional analysis approach generally works fine for both cases (when in doubt, use this approach)
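The warning about missing values matters in Stata because a numeric missing value compares as larger than any number, so a condition such as ager >= 50 is true when ager is missing. The toy data below are invented purely for illustration, and the variable name naive is hypothetical; whether undetermined cases should be left missing or set to 0 is an analyst decision that the slides do not settle.

```stata
* Sketch only: toy data invented for illustration
clear
input race ager
2 55
2 .
2 40
1 60
. 70
end

* Naive indicator: missing ager compares as larger than 50,
* so the second row is flagged as a subclass member
generate byte naive = (race == 2 & ager >= 50)

* Guarded indicator: left missing when either input is missing
generate byte b50subp = (race == 2 & ager >= 50) if !missing(race, ager)

list
```

Listing the two indicators side by side makes the silently misclassified case visible before any svy command is run.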

Directions for Future Research
- More appropriate calculation / estimation of design-based and effective degrees of freedom for sparse subclasses or mixed classes
- Development of analytic theory for interval estimation when working with small subclasses, which does not rely on asymptotic results

References
Kish, L. 1987. Statistical Design for Research. New York: Wiley.
Rust, K. F., and J. N. K. Rao. 1996. Variance estimation for complex surveys using replication. Statistical Methods in Medical Research 5: 283–310.
West, B. T., P. Berglund, and S. G. Heeringa. 2007. Alternative Approaches to Subclass Analysis of Complex Sample Survey Data. Proceedings of the 2007 Joint Statistical Meetings.
West, B. T., P. Berglund, and S. G. Heeringa. 2008. A Closer Examination of Subpopulation Analysis of Complex Sample Survey Data. The Stata Journal 8(3): 1–12.

Questions / Thank You!
For additional questions, comments, or electronic copies of these slides or the papers, please send an email to bwest@umich.edu