
Complexities of Complex Survey Design Analysis. Why worry about this? Many government studies use these designs – CDC National Health Interview Survey.

1 Complexities of Complex Survey Design Analysis

2 Why worry about this? Many government studies use these designs: – CDC National Health Interview Survey (NHIS) – National Health and Nutrition Examination Survey (NHANES), also CDC – National Longitudinal Survey of Youth – Medicare Current Beneficiary Survey (MCBS) – Almost any survey seeking a representative sample from a large population will have a complex multi-stage probability sampling methodology.

3 Why do we care? These studies make their data available to researchers at very minimal cost (sometimes even free). Getting free data seems great, but the analysis challenges are considerable as well. The studies do not always document the study design very well, so it can be difficult to understand how to deal with it.

4 Today’s talk Will not deal with all of the issues. Will start at the basics and lead up to some of the complexities. Will talk about how various software packages deal with some of the complexities.

5 Usual assumptions Infinite populations. – Never true but can be “true enough”. – Most methods work under the infinite population assumption. – This will hold if N is very, very large and n is not too big relative to N (i.e. N >> n). – Survey design people are sort of the statistical version of numerical analysts, i.e. they work out what to do when the analysis environment is not infinite.

6 Background Types of sampling Simple random sampling with replacement – Easiest to deal with. – Population size N, sample size n. – On each draw, each population unit has probability 1/N of being selected. – Drawback: each population unit can be selected multiple times (i.e. repeat information). – If N is large relative to n, the probability of any unit being selected twice is small.
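
A rough check of the last point, as a Python sketch (the function name and the example values of N and n are illustrative, not from the talk): the chance that n draws with replacement from N units contain at least one repeat.

```python
def prob_any_repeat(N, n):
    """P(at least one unit is drawn more than once) in n draws with replacement."""
    if N < 1 or n < 0:
        raise ValueError("need N >= 1 and n >= 0")
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th draw must avoid the k units already drawn.
        p_all_distinct *= max(N - k, 0) / N
    return 1.0 - p_all_distinct

# Illustrative values only: the repeat probability shrinks as N grows with n fixed.
for N in (1_000, 100_000, 10_000_000):
    print(N, round(prob_any_repeat(N, 100), 5))
```

With n held at 100, the repeat probability is large for N = 1,000 and negligible for N in the millions, which is the sense in which N >> n makes repeats ignorable.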

7 More Background Simple random sampling without replacement – The selection probability changes from draw to draw, because a unit that has been selected cannot be drawn again. – First unit selected has probability 1/N. – Second unit selected has probability 1/(N-1). – nth unit has probability 1/(N-n+1). – If N >> n and N is large, then 1/N ≈ 1/(N-1) ≈ ... ≈ 1/(N-n+1), so this is approximately the same as simple random sampling with replacement. Can use the FPC (finite population correction) ((N-n)/(N-1))^(1/2). Note that if N >> n then this is ≈ 1.
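
A small Python sketch of the FPC formula above, to show how quickly it approaches 1 (the function name and the example (N, n) pairs are illustrative assumptions):

```python
def fpc(N, n):
    """Finite population correction ((N - n) / (N - 1)) ** (1/2)."""
    if N < 2 or not (1 <= n <= N):
        raise ValueError("need N >= 2 and 1 <= n <= N")
    return ((N - n) / (N - 1)) ** 0.5

# Illustrative population and sample sizes.
for N, n in [(110, 28), (10_000, 100), (1_000_000, 100)]:
    print(N, n, round(fpc(N, n), 5))
```

When the sample is a sizeable share of the population (28 of 110) the correction is clearly below 1; when N >> n it is essentially 1 and can be ignored.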

8 Why complex sampling Cost (main reason) – simpler and more cost effective. May differentially sample easy-to-sample units versus difficult-to-sample units, e.g. homeless, minorities, rural – the harder-to-sample units. Want to account for the inclusion difficulty of certain types of population units.

9 Sampling Strategies Strata Clusters Weights

10 Strata – Fixed, known groups: regions, groups of countries, states. – Strata themselves are not sampled; however, the sampling rate within strata is not equal across strata. – All strata are included.

11 Adjusting for Strata Assume two strata with N1 = 100 and N2 = 10 elements. Take a sample of size 20 from stratum 1 and 8 from stratum 2. Assume sampling with replacement to make the math easier. So P = .2 in stratum 1 and P = .8 in stratum 2. Use inverse probability to weight the analyses: the weight for stratum 1 is w1 = 1/.2 = 5 and for stratum 2 it is w2 = 1/.8 = 1.25.
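
The same arithmetic as a short Python sketch (the dictionary layout and function name are illustrative; the stratum sizes are the ones on the slide):

```python
# Stratum population sizes (N) and sample sizes (n) from the slide.
strata = {
    "stratum_1": {"N": 100, "n": 20},
    "stratum_2": {"N": 10, "n": 8},
}

def sampling_weight(N, n):
    """Inverse of the selection probability p = n / N, i.e. N / n."""
    if n <= 0 or n > N:
        raise ValueError("need 0 < n <= N")
    return N / n

weights = {name: sampling_weight(s["N"], s["n"]) for name, s in strata.items()}
print(weights)

# Each sampled unit "stands for" w population units, so the weights
# summed over the sample recover the population size.
represented = sum(weights[name] * s["n"] for name, s in strata.items())
print(represented)
```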

12 Example Want to estimate job openings in a town. Large businesses have more job openings than small businesses. Say that you have 10 large businesses and 100 small businesses. Get a sample of 28 businesses, with 20 small businesses and 8 large businesses. Use the probability weights from the previous slide. Let x be the number of job openings in each business.

13 Example continued Total job openings = Σ wi xi, where the weights are 5 if in stratum 1 (small businesses) and 1.25 if in stratum 2 (large businesses). Note that w1*n1 + w2*n2 = 110, the population size. So the idea is that each business sampled from stratum 1 looks like 5 businesses, while each business sampled from stratum 2 looks like 1.25 businesses. Complex survey design works on population totals and the resulting proportions. Note that in this case the PSU (primary sampling unit) is a business.

14 With no weights (assumes equal weighting)

open   Frequency   Percent   Cumulative Frequency   Cumulative Percent
----------------------------------------------------------------------
   0           6     21.43                      6                21.43
   1           4     14.29                     10                35.71
   2           3     10.71                     13                46.43
   3           2      7.14                     15                53.57
   4           2      7.14                     17                60.71
   5           2      7.14                     19                67.86
   6           1      3.57                     20                71.43
  10           1      3.57                     21                75.00
  13           1      3.57                     22                78.57
  15           1      3.57                     23                82.14
  20           1      3.57                     24                85.71
  22           1      3.57                     25                89.29
  25           1      3.57                     26                92.86
  27           1      3.57                     27                96.43
  30           1      3.57                     28               100.00

Total job openings: 202*3.93 = 793 (110/28 = 3.93), or 7.2 per company. This is an overestimate because it weights large companies the same as small companies.

15 With weights (unequal sampling)

open   Frequency   Percent   Cumulative Frequency   Cumulative Percent
----------------------------------------------------------------------
   0       30        27.27                  30                   27.27
   1       20        18.18                  50                   45.45
   2       15        13.64                  65                   59.09
   3       10         9.09                  75                   68.18
   4       10         9.09                  85                   77.27
   5       10         9.09                  95                   86.36
   6        5         4.55                 100                   90.90
  10        1.26      1.15                 101.26                92.05
  13        1.25      1.14                 102.51                93.18
  15        1.25      1.14                 103.76                94.32
  20        1.25      1.14                 105.01                95.45
  22        1.25      1.14                 106.26                96.59
  25        1.25      1.14                 107.51                97.73
  27        1.25      1.14                 108.76                98.86
  30        1.25      1.14                 110.01               100.00

Total job openings: 402.6, or around 402 (about 3.66 per company).
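
The two totals can be reproduced from the frequency tables with a few lines of Python. This is a sketch under one assumption: that the rows with 0 to 6 openings are the 20 small businesses and the rows with 10 to 30 openings are the 8 large ones, which is what the weighted frequencies (multiples of 5 versus 1.25) suggest. Variable names are illustrative.

```python
# Openings per sampled business, read off the unweighted frequency table.
small = [0] * 6 + [1] * 4 + [2] * 3 + [3] * 2 + [4] * 2 + [5] * 2 + [6]
large = [10, 13, 15, 20, 22, 25, 27, 30]

N_pop = 110                       # 100 small + 10 large businesses
n_sample = len(small) + len(large)
w_small, w_large = 100 / 20, 10 / 8   # inverse selection probabilities

# Equal weighting: every sampled business stands for N_pop / n_sample businesses.
unweighted_total = sum(small + large) * N_pop / n_sample

# Design weighting: total = sum of w_i * x_i.
weighted_total = w_small * sum(small) + w_large * sum(large)

print(round(unweighted_total, 1))           # about 793.6
print(weighted_total)                       # 402.5
print(round(weighted_total / N_pop, 2))     # openings per business
```

With weights of exactly 1.25 for every large business the weighted total is 402.5; the 402.6 in the table appears to come from the single 1.26 entry in the row for 10 openings.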

16 Types of weights
– pweights: Inverse probability weights, also known as sampling weights, wi = 1/pi.
– fweights: Frequency weights. Used when one record represents a number of identical records.
– aweights: Analytic weights. Weights that are inversely proportional to the variance of an observation (meta-analysis).
– iweights: Importance weights. Weights that indicate the "importance" of the observation in some nonstatistical sense.

17 Replicate Weights A series of weights used to correct standard errors. Also used to more securely protect the identity of the respondents. Two common kinds: – Balanced Repeated Replicates (BRR) – Jackknife (JK-1)

18 Add clustering Strata are fixed groups that are all used and are mutually exclusive, e.g. big companies and small companies. Clusters are sampled. The unit sampled is the PSU. E.g. strata: region by urban/rural; cluster: zip code. Sample zip codes within the region (PSU), then sample persons residing in the zip code area. There is unequal sampling of PSUs within strata, then unequal sampling of individuals within the zip code area. Use conditional probabilities to get weights at the various levels. Units within a cluster are likely to be more similar (i.e. smaller variability).
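
A minimal Python sketch of the conditional-probability idea for two stages. The counts (5 of 50 zip codes, 20 of 2000 residents) and the function name are hypothetical, chosen only to make the arithmetic visible.

```python
def two_stage_weight(p_psu, p_unit_given_psu):
    """Weight = 1 / P(PSU selected) / P(unit selected | PSU selected)."""
    p = p_psu * p_unit_given_psu
    if not 0 < p <= 1:
        raise ValueError("selection probabilities must lie in (0, 1]")
    return 1.0 / p

# Hypothetical design: 5 of 50 zip codes in a stratum are sampled,
# then 20 of the 2000 residents of a sampled zip code are sampled.
p_zip = 5 / 50
p_person_given_zip = 20 / 2000
w = two_stage_weight(p_zip, p_person_given_zip)
print(round(w))  # each sampled person represents about this many people
```

The overall selection probability is the product of the stage-wise conditional probabilities, so the overall weight is the product of the stage-wise weights (here 10 × 100).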

19 NHANES Sampling design (Continuous) The NHANES sample is designed to be nationally representative of the civilian, non-institutionalized U.S. population, in that it does not include persons residing in nursing homes, institutionalized persons, or U.S. nationals living abroad. Thus, for NHANES 1999-2010, each year's sample and any combination of samples from consecutive years comprise a nationally representative sample of the resident, non-institutionalized U.S. population. Stage 1: Primary sampling units (PSUs) are selected with probability proportional to a measure of size (PPS). These are mostly single counties or, in a few cases, groups of contiguous counties. Stage 2: The PSUs are divided up into segments (generally city blocks or their equivalent). As with each PSU, sample segments are selected with PPS. Stage 3: Households within each segment are listed, and a sample is randomly drawn. In geographic areas where the proportion of age, ethnic, or income groups selected for oversampling is high, the probability of selection for those groups is greater than in other areas. Stage 4: Individuals are chosen to participate in NHANES from a list of all persons residing in selected households. Individuals are drawn at random within designated age-sex-race/ethnicity screening subdomains. On average, 1.6 persons are selected per household. http://www.cdc.gov/nchs/tutorials/NHANES/SurveyDesign/SampleDesign/Info1.htm

20 Weight calculation

21 Implications of sampling design Strata – the documentation gives no definition of the strata, but it says two county PSUs are selected per stratum, so strata exist. Variables that sampling is based on: – stage 1: PSU (county): size of county (PPS, probability proportional to size, so larger counties have a greater probability of selection) – stage 2: segment: size of segment (PPS, see above) – stage 3: household: age, ethnic, income group – stage 4: individual: age-sex-race/ethnicity
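
To make the PPS idea concrete, here is a Python sketch of approximate selection probabilities when a fixed number of PSUs is drawn with probability proportional to size. The county names and sizes are hypothetical, and capping at 1 is a simplification; this is not the NHANES procedure itself.

```python
def pps_selection_probs(sizes, n_select):
    """Approximate PPS selection probability: n_select * size_i / total size.

    Probabilities above 1 are capped at 1 (a simplification: real designs
    usually treat such units as certainty selections and re-allocate).
    """
    total = sum(sizes.values())
    if total <= 0 or n_select < 1:
        raise ValueError("need positive sizes and n_select >= 1")
    return {name: min(1.0, n_select * s / total) for name, s in sizes.items()}

# Hypothetical measure of size for four counties in one stratum.
county_sizes = {"A": 400, "B": 300, "C": 200, "D": 100}
probs = pps_selection_probs(county_sizes, n_select=2)
print(probs)
```

County A is four times as likely to be selected as county D, so a person sampled in D would need a correspondingly larger stage-1 weight.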

22 Sample weights A numerical sample weight is assigned to each participant. – It is the number of people in the population represented by that particular sampled person. – It includes adjustments for unequal selection, non-response, and control totals (making sure estimates for age, sex, and race/ethnicity categories match known population totals).

23 Variance Estimates Unequal weighting causes complications in variance estimation. Can use: – Taylor series estimate – BRR, Balanced Repeated Replicates (if replicate weights are provided): get many sets of subsample weights, calculate the estimate once per set of weights, and take the variance of these estimates. – Jackknife (if replicate weights are provided): same idea as BRR.
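
A Python sketch of the replicate-weight idea using a delete-one jackknife (JK-1) for a weighted mean. It assumes an unstratified design in which each observation is its own PSU; real surveys ship their own replicate weights and scaling rules, so treat the construction of the replicate weights here as illustrative. The data values are made up.

```python
def weighted_mean(x, w):
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def jk1_variance(x, w):
    """Delete-one jackknife variance of the weighted mean.

    Replicate i sets the weight of unit i to zero and scales the others
    by n / (n - 1); the estimate is recomputed once per replicate.
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    full = weighted_mean(x, w)
    replicate_estimates = []
    for i in range(n):
        rep_w = [0.0 if j == i else wj * n / (n - 1) for j, wj in enumerate(w)]
        replicate_estimates.append(weighted_mean(x, rep_w))
    # JK-1 scaling of the squared deviations from the full-sample estimate.
    return (n - 1) / n * sum((r - full) ** 2 for r in replicate_estimates)

# Made-up data with equal weights.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0] * 5
print(jk1_variance(x, w))
```

With equal weights the result matches the textbook variance of a mean, s²/n, which is a useful sanity check before applying the same loop to unequal weights.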

24 How? You can’t do this on your calculator. SUDAAN (the original). STATA (says it is better). SAS (has come out with survey procedures). Getting variances always seems to be the issue (although unbiased estimates are usually a good thing).

25 Example of SAS code

PROC SURVEYMEANS data=d.ncsdxdm3;
  strata str;            /* stratum identifier */
  cluster secu;          /* cluster (PSU) identifier */
  var deplt1 gadlt1;     /* analysis variables */
  weight p1fwt;          /* sampling weight */
run;

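26 Example of STATA code

svyset county [pw = pwvar], strata(state) fpc(fpcvar) || school, fpc(fpcvar2)

This sets up the design; the || separates the first sampling stage (county) from the second (school). Then use the svy: prefix with estimation commands, e.g. svy: mean, svy: regress. The commands that work with svy are listed in the STATA documentation.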

