Population Health Surveys Bootstrap Hands-on Workshop Yves Beland, CCHS senior methodologist Larry MacNabb, CCHS dissemination manager developed by François.

1 Population Health Surveys Bootstrap Hands-on Workshop Yves Beland, CCHS senior methodologist Larry MacNabb, CCHS dissemination manager developed by François Brisebois CCHS/NPHS senior methodologist francois.brisebois@statcan.ca

2 Purpose of the presentation
- Justify the use of the bootstrap technique, understand the theory behind it, and become familiar with it
- Dispel misconceptions about using the bootstrap technique for variance estimation

3 Outline
- Context
- NPHS / CCHS complex survey design
- Variance estimation / Bootstrap 101
- Data support / using the bootvar program
- Why bootstrap?
- CV lookup tables
- Historical info about variance estimation for NPHS
- Variance estimation with other software programs
- Future for STC Health Surveys (re. bootstrap)

4 Context
- A data user is interested in producing some results:
  1- Compute an estimate (total, ratio, etc.)
  2- Compute the precision of the estimate (variance, coefficient of variation (CV), etc.)

5 Context 1- Compute an estimate
- Is not a problem!
- Use the survey weight provided with the NPHS/CCHS files

6 Context 1- Compute an estimate (cont’d)
- Why use the survey weight?
- Conclusion: ALWAYS USE THE WEIGHTS

7 Context 2- Compute the precision of an estimate
- Is a problem!!

8 Context 2- Compute the precision of the estimate (cont’d)
- Scaled weights:
  - Scaled weight = weight / mean(weight)
  - Used to overcome problems with the computation of the variance for some statistics in SAS
  - Reference: paper by G. Roberts et al.
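The scaled-weight definition above can be sketched in a few lines. This is only an illustration in Python (the workshop itself uses SAS and SPSS); the function name `scale_weights` and the five weights are hypothetical.

```python
def scale_weights(weights):
    """Divide each survey weight by the mean weight, so the scaled weights average 1."""
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]


# Hypothetical survey weights for five respondents (not taken from any NPHS/CCHS file).
weights = [120.0, 80.0, 200.0, 150.0, 50.0]
scaled = scale_weights(weights)

# The scaled weights sum to the sample size instead of the population size.
print(scaled, sum(scaled))
```

Scaling changes only the magnitude of the weights, not their relative sizes, so weighted proportions and means are unaffected while weighted totals are no longer population totals.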

9 Context 2- Compute the precision of the estimate (cont’d)
- Why such a difference?
  Answer: The complex survey design is the main cause (other factors to be discussed later)
  Note: CCHS and NPHS have slightly different frames, but both are considered complex survey designs

10 Complex survey design
1- Each province is divided into strata
[Diagram: Province A divided into Stratum #1 and Stratum #2]

11 Complex survey design
2- Selection of clusters within each stratum
[Diagram: Province A, Stratum #1 and Stratum #2]

12 Complex survey design
3- Selection of households within each cluster
[Diagram: Province A, Stratum #1 and Stratum #2]

13 Complex survey design
- How does the sample design affect the precision of estimates?
  - Stratification decreases variability (more precise)
  - Clustering increases variability (less precise)
  - Overall, the multistage design has the effect of increasing variability (less precise than SRS)

14 Complex survey design
- So why use a multistage cluster sample design anyway?
  - Pros:
    - Efficient for interviewing (less travel, less costly)
    - Better coverage of the entire region of interest
  - Cons:
    - Problems for variance estimation

15 Bootstrap Method
- Variance estimation with a complex multistage cluster sample design:
  - The exact formula for variance estimation is too complex; an approximate approach is required
  - NOTE: accounting for the design in variance estimation is as crucial as using the sampling weights for the estimation of a statistic

16 Bootstrap Method
- Approximate methods for variance estimation:
  - Taylor linearization
  - Re-sampling methods:
    - Balanced Repeated Replication
    - Jackknife
    - Bootstrap

17 Bootstrap Method
- Principle:
  - You want to estimate how precise your estimate of the number of smokers in Canada is
  - You could draw 500 totally new samples and compare the 500 estimates you would get from these samples. The variance of these 500 estimates would indicate the precision.
  - Problem: drawing 500 new samples is $$$
  - Solution: Use your sample as a population, and take many smaller subsamples from it.

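18 Bootstrap 101
- How bootstrap weights are created (the secret is finally revealed!!!)
[Figure: worked example, T = 40]
$\mathrm{Var} = \sum_{i=1}^{500} (B_i - \bar{B})^2 / 499$, where $B_i$ is the estimate obtained with the $i$-th set of bootstrap weights and $\bar{B}$ is the mean of the 500 bootstrap estimates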
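The variance formula on this slide can be sketched as follows. This is a hedged Python illustration, not part of bootvar; it assumes B in the formula denotes the mean of the replicate estimates, which is what the divisor 499 = 500 - 1 suggests. The function name and the five replicate values are hypothetical (a real run would use 500).

```python
def bootstrap_variance(replicate_estimates):
    """Sample variance of the bootstrap estimates: sum((B_i - mean)^2) / (B - 1)."""
    b = len(replicate_estimates)
    mean_b = sum(replicate_estimates) / b
    return sum((x - mean_b) ** 2 for x in replicate_estimates) / (b - 1)


# Five made-up replicate estimates; with the real files there would be 500 of them,
# one per bootstrap weight, and the divisor would be 499.
replicates = [38.0, 41.0, 40.0, 43.0, 38.0]
variance = bootstrap_variance(replicates)
print(variance)  # 4.5
```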

19 Bootstrap 101
- How bootstrap replicates are built (cont’d)
  - The “real” recipe:
    1- Subsampling of clusters (SRS) within strata
    2- Apply the (initial design) weight
    3- Adjust the weight for the selection of n-1 among n
    4- Apply all standard adjustments (nonresponse, share, etc.)
    5- Post-stratification to population counts
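Steps 1 to 3 of the recipe can be sketched for a single replicate as below. This is an assumption-laden Python illustration, not the production code: it assumes the n-1 clusters are drawn with replacement (the usual rescaling bootstrap) and that the step 3 adjustment is the factor n/(n-1) times the number of times a cluster was drawn. Steps 4 and 5 are deliberately left out because they need the adjustment classes, which only the survey methodologists have. All names and data are hypothetical.

```python
import random
from collections import Counter


def replicate_weights(design, rng):
    """Return one set of bootstrap design weights (steps 1-3 only).

    design: list of dicts with keys 'stratum', 'cluster' and 'weight'.
    """
    # Collect the distinct clusters in each stratum.
    clusters = {}
    for rec in design:
        clusters.setdefault(rec["stratum"], set()).add(rec["cluster"])

    # Step 1: draw n-1 clusters with replacement among the n clusters of each stratum.
    multiplicity = {}
    for stratum, ids in clusters.items():
        ids = sorted(ids)
        n = len(ids)
        if n < 2:
            raise ValueError("each stratum needs at least two clusters")
        draws = Counter(rng.choice(ids) for _ in range(n - 1))
        for cid in ids:
            multiplicity[(stratum, cid)] = draws[cid]

    # Steps 2-3: design weight times n/(n-1) times the number of times the cluster was drawn.
    weights = []
    for rec in design:
        n = len(clusters[rec["stratum"]])
        m = multiplicity[(rec["stratum"], rec["cluster"])]
        weights.append(rec["weight"] * n / (n - 1) * m)
    return weights


# Hypothetical design: one respondent per cluster, equal weights within each stratum.
design = (
    [{"stratum": 1, "cluster": c, "weight": 10.0} for c in ("a", "b", "c")]
    + [{"stratum": 2, "cluster": c, "weight": 5.0} for c in ("d", "e", "f", "g")]
)
bsw1 = replicate_weights(design, random.Random(2024))
print(bsw1)
```

Records in clusters that were not drawn get a replicate weight of zero, which is why bootstrap weight files typically contain many zeros before the later adjustments are applied.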

20 Bootstrap 101
- How bootstrap replicates are built (cont’d)
  - The bootstrap method is intended to mimic the approach used for the sampling and weighting processes
  - Be careful: some software programs claim to include the bootstrap technique, but what they really do is skip steps #4 and #5 and use the final weight directly in step #2

21
- STC methodologists create the bootstrap weight files.
- Can you create your own bootstrap weight file? No.
  Why? Because to do so you need to know:
  - The design information, i.e. strata and clusters (to generate the bootstrap subsamples)
  - The definition of all adjustment classes (including post-stratification)

22 Bootstrap 101
- The bootstrap weight files are:
  - Available for all files (except PUMF, for confidentiality reasons)
  - Distributed with the data files, in separate files
- The bootstrap weight files contain:
  - IDs (REALUKEY/SAMPLEID, PERSONID)
  - Final sampling weight (WTxx)
  - 500 bootstrap weights (BSW1-BSW500)

23 Bootstrap - Support
- NPHS/CCHS provides data users with SAS & SPSS macro programs to compute bootstrap variances
  - The macros simplify the computation of bootstrap variance estimates for totals, ratios, differences of ratios, regressions (linear and logistic), and basic generalized linear models
  - Come with documentation & examples
  - French and English
  - Referred to as “bootvar”

24 Example: Step by Step
- Let’s get to work!
- Goal: estimate the number of diabetics (total)
  - NPHS 1998-99 dummy file (see information sheet)

25 Example: Step by Step
STEP #1 Create your « analysis data file »
- Read the NPHS/CCHS data file
- Prepare the dummy variables necessary for your analysis
- Keep only the necessary variables (include the desired geography)
- Run the analysis to get point estimates only (not necessary but recommended)
STEP #2 Compute your variances with bootvar
- Location of INPUT files:
  - Your « analysis data file »
  - The bootstrap weights file
- Geography desired
- Number of bootstrap weights to use
- Specify the desired analysis:
  - Totals, ratios, differences of ratios
  - Regression (linear & logit)
  - Generalized linear modeling

26
- Step #1: On your own (but you can use the examples provided as a starting point)
- Step #2: Use the provided Bootvar program

27 STEP #1
- Read input file
- Create dummy variables
- Keep only necessary variables
- Run the analysis to get point estimates
Create dummy variables
- For qualitative/categorical variables, we need to identify which value(s) we are interested in. This is done by creating a dummy variable
- Dummy variable = 1 for the characteristic of interest, = 0 otherwise

28 STEP #1
- Create dummy variable: example #1
  - During the past 12 months, how often did you drink alcoholic beverages? (ALC8_2)
    1=Less than once a month
    2=Once a month
    3=2 to 3 times a month
    4=Once a week
    5=2 to 3 times a week
    6=4 to 6 times a week
    7=Every day
  - Interested in categories 1 to 4 (once a week or less)
    - DRINK = 1 if ALC8_2 is 1, 2, 3 or 4; = 0 otherwise
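The DRINK rule above, sketched in Python for illustration only (the workshop programs are in SAS and SPSS). The function name and the sample responses are hypothetical. Following the slide's "0 otherwise", any other code, including a hypothetical non-response code, is mapped to 0; whether non-response should instead be excluded from the analysis is a decision the slide does not address.

```python
def drink_dummy(alc8_2):
    """Return 1 for 'once a week or less' (ALC8_2 codes 1 to 4), 0 otherwise."""
    return 1 if alc8_2 in (1, 2, 3, 4) else 0


# Made-up ALC8_2 responses for six respondents; 9 stands in for a non-response code.
responses = [1, 4, 5, 7, 2, 9]
drink = [drink_dummy(code) for code in responses]
print(drink)  # [1, 1, 0, 0, 1, 0]
```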

29 STEP #1
- Create dummy variable: example #2
  Diabetes (CCC8_1J): 1=Yes, 2=No, 6=Not applicable, 7=Don’t know, 9=Not stated
  Sex (DHC8_SEX): 1=Male, 2=Female
  - Interested in “males having diabetes”
    - mdiab = 1 if CCC8_1J = 1 and DHC8_SEX = 1; = 0 otherwise

30 STEP #1
- Create dummy variable: example #2
  - How to use the dummy variable to get an estimate
    - Total, in SAS:
        proc freq; tables mdiab; weight wt56; run;
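What the weighted frequency delivers for the category mdiab = 1 is simply the sum of the survey weights over respondents with the characteristic. A minimal Python sketch of that arithmetic, with hypothetical data and a hypothetical function name (it is not a substitute for the SAS step):

```python
def weighted_total(dummy, weights):
    """Estimated population total: sum of the weights of respondents whose dummy is 1."""
    return sum(d * w for d, w in zip(dummy, weights))


# Made-up values: mdiab flags and final weights for five respondents.
mdiab = [1, 0, 0, 1, 0]
wt56 = [100.0, 250.0, 80.0, 120.0, 300.0]
total = weighted_total(mdiab, wt56)
print(total)  # 220.0
```

Repeating the same sum with each bootstrap weight in place of the final weight yields the replicate estimates that feed the variance calculation.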

31 STEP #1
- Create dummy variable: example #2
  - How to use the dummy variable to get an estimate
    - Ratio:

32 STEP #1
- See example in SPSS

33 STEP #1
- Now your turn! (exercise #1)
  - Add asthma (CCC8_1C) to the table
  - Use the existing program (step1.sas) and add SPSS code to create a dummy variable for asthma; then get the results

34 Step #2: Bootvar Program
- Created by methodologists in 1997 (first used with NPHS cycle 2 data)
- Version 1.0
  - One single program (over 1,000 lines of code)
  - Divided into 4 sections
    - Users have to adapt the program to their requests; changes in 3 sections
  - SAS: bootvar.sas / bootvarf.sas
  - SPSS: beta version available only on request (bvr_b.sps)

35 Step #2: Bootvar Program
- Version 2.0
  - Justifications:
    - Compatible with SAS 8+
    - Centralizes the code that the user has to modify
    - Can be used with both NPHS and CCHS data files
  - Now consists of 2 programs:
    (1) Contains the code users need to modify for their requests
    (2) Contains the code users do not have to modify (macros)

36 Step #2: Bootvar Program
- Version 2.0
  - SAS version:
    (1) bootvare_v20.sas / bootvarf_v20.sas
    (2) macroe_v20.sas / macrof_v20.sas
  - SPSS version:
    (1) bootvare_v21.sps / bootvarf_v21.sps
    (2) macroe_v21.sps / macrof_v21.sps

37 STEP #2: Use of bootvar
- Point estimates have already been obtained; let us now estimate the sampling variability of those estimates
  → Go through the bootvar program (bootvare_v21.sps)

38 STEP #2: Use of bootvar
- See example in SPSS

39 STEP #2
- Now your turn! (exercise #2)
  - Compute confidence intervals for asthma
  - Use bootvare_v21.sps and adjust it to obtain the desired results (use the already set up step2.sps program for this exercise)
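For orientation, a confidence interval can be derived from a point estimate and its bootstrap variance as sketched below. This assumes a normal approximation with a 95% level (z = 1.96); the slides do not state which interval bootvar reports, so treat this as an illustration. The function name and the numbers are hypothetical.

```python
import math


def confidence_interval(estimate, variance, z=1.96):
    """Normal-approximation interval: estimate +/- z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Made-up point estimate and bootstrap variance (standard error of 6,000).
low, high = confidence_interval(32000.0, 36_000_000.0)
print(low, high)
```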

40 Bootstrap - More
- Why 500 bootstrap weights?
  - Size of file (for dissemination)
  - Time of computation (for an average PC)
  - Accuracy
- Use more bootstrap weights?
  - Faster PC
  - Accuracy for small domains and more complex analysis methods

41 Bootstrap - More
- Confidentiality revealed from the bootstrap weights

42 Bootstrap - More
- Confidentiality revealed from the bootstrap weights (cont’d)
  - How can PUMF users estimate their exact variances?
    - Remote access
      - A dummy file is provided (same structure as the master files, but containing dummy data)
      - Test programs and send them by e-mail
    - Research Data Centre
    - Regional Offices

43 Why Bootstrap?
- Other techniques examined: Taylor, Jackknife
  - Taylor:
    - Need to define a linear equation for each statistic examined
  - Jackknife:
    - Cannot be disseminated because of confidentiality
    - Number of replicates depends on the number of strata (the large number of strata in 1996 makes dissemination impossible)

44 Why Bootstrap?
- Bootstrap:
  - Handles survey designs with many strata more easily
  - Sets of 500 bootstrap weights can be distributed to data users
  - Recommended (over the jackknife) for estimating the variance of nonsmooth functions like quantiles, LICO
  - Reference: “Bootstrap Variance Estimation for the National Population Health Survey”, D. Yeo, H. Mantel, and T.-P. Liu. 1999, ASA Conference.

45 Bootvar: exercise #3
- Results for diabetes broken down by sex and province

46 Bootvar: Tricks
- If you need to create a dummy variable for a characteristic based on many variables:
  - Example: Males with diabetes
  - First, create dummy variables for each individual variable (males, diabetes)
  - Then, create the dummy variable for the characteristic by multiplying the individual dummy variables

47 Bootvar: Tricks
- Example:
  - Males = 1,0 (MALES)
  - Diabetes = 1,0 (DIAB)
  - Males having diabetes (MDIAB) = MALES * DIAB
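The multiplication trick works because the product of two 0/1 variables is 1 only when both are 1. A short Python illustration with hypothetical records (the variable names MALES, DIAB and MDIAB follow the slide; the record layout is an assumption):

```python
# Made-up respondents: sex (1=Male, 2=Female) and diabetes (1=Yes, 2=No).
records = [
    {"sex": 1, "diabetes": 1},
    {"sex": 1, "diabetes": 2},
    {"sex": 2, "diabetes": 1},
    {"sex": 2, "diabetes": 2},
]

for rec in records:
    rec["MALES"] = 1 if rec["sex"] == 1 else 0
    rec["DIAB"] = 1 if rec["diabetes"] == 1 else 0
    # The product is 1 only for respondents who are both male and diabetic.
    rec["MDIAB"] = rec["MALES"] * rec["DIAB"]

print([rec["MDIAB"] for rec in records])  # [1, 0, 0, 0]
```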

48 Bootvar: Tricks
- Use the REGION parameter in bootvar to specify a “stratification” variable (doesn’t have to be a geographic variable!)
  - Example: REGION = sex → will produce results by sex
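Conceptually, producing results "by sex" means computing the estimate and its bootstrap variance separately within each value of the grouping variable. The sketch below illustrates that idea in Python; it is not how bootvar is implemented, and the function name, record layout and three-replicate weights are all hypothetical (real files carry 500 bootstrap weights).

```python
def domain_totals(records, by, var):
    """Weighted total of `var` and its bootstrap variance within each value of `by`."""
    acc = {}
    for rec in records:
        group = acc.setdefault(
            rec[by], {"total": 0.0, "reps": [0.0] * len(rec["bsw"])}
        )
        group["total"] += rec[var] * rec["wt"]
        for i, bw in enumerate(rec["bsw"]):
            group["reps"][i] += rec[var] * bw

    results = {}
    for key, group in acc.items():
        reps = group["reps"]
        mean = sum(reps) / len(reps)
        variance = sum((x - mean) ** 2 for x in reps) / (len(reps) - 1)
        results[key] = (group["total"], variance)
    return results


# Made-up records with a final weight and three bootstrap weights each.
records = [
    {"sex": 1, "diab": 1, "wt": 100.0, "bsw": [90.0, 110.0, 100.0]},
    {"sex": 1, "diab": 0, "wt": 200.0, "bsw": [210.0, 190.0, 200.0]},
    {"sex": 2, "diab": 1, "wt": 150.0, "bsw": [150.0, 0.0, 300.0]},
    {"sex": 2, "diab": 1, "wt": 50.0, "bsw": [60.0, 40.0, 50.0]},
]
by_sex = domain_totals(records, by="sex", var="diab")
print(by_sex)
```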

49 CV look-up tables
- What is it?
  - Approximate sampling variability tables
  - Produced for Canada, each province, and by age groups for Canada (also by Health Regions for cycle 2)
- Useful only for categorical estimates
  Totals & ratios only

50 CV look-up tables

51 Sampling Variability Guidelines

| Type of estimate | CV (%) | Guidelines |
|---|---|---|
| Acceptable | 0.0-16.5 | General unrestricted release. |
| Marginal | 16.6-33.3 | General unrestricted release, but with a warning cautioning users of the high sampling variability. Should be identified by the letter M. |
| Unacceptable | > 33.3 | No release. Should be flagged with the letter U. |
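The guidelines can be applied mechanically once a CV is available. The Python sketch below computes a CV as 100 * sqrt(variance) / estimate and maps it to the three release categories. The function names are hypothetical, and treating a value between 16.5 and 16.6 as Marginal is an assumption, since the published bands leave that gap unspecified.

```python
import math


def cv_percent(estimate, variance):
    """Coefficient of variation in percent: standard error divided by the estimate."""
    return 100.0 * math.sqrt(variance) / estimate


def release_category(cv):
    """Map a CV (in percent) to the release guideline categories."""
    if cv <= 16.5:
        return "Acceptable"
    if cv <= 33.3:
        return "Marginal (M)"
    return "Unacceptable (U)"


# Made-up estimate and variance: standard error 6,000 on a total of 32,000.
cv = cv_percent(32000.0, 36_000_000.0)
print(cv, release_category(cv))
```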

52 CV look-up tables
- Comparison between bootstrap CV and CV from lookup table
  - For number of people having diabetes:
    Manitoba total: T=32K → CV table = 18%, BTS = 18.7%
    Manitoba males: T=16K → CV table = 25.7%, BTS = 27.6%
    Manitoba females: T=16.5K → CV table = 25.3%, BTS = 26.4%

53 CV look-up tables
- Comparison between bootstrap CV and CV from lookup table
  - Other examples (from master - general file)
    - Number of people experiencing food insecurity:
      Manitoba total: T=40K → CV table = 11.9%, BTS = 19.8%
    - Number of people in the lowest income quintile:
      Manitoba total: T=118K → CV table = 6.4%, BTS = 11.2%

54 Bootvar: Regression models
Logistic regression model
- log(p / (1 - p)) = intercept + b1*X1 + b2*X2, where p = P(Y = 1)
  → Y has to be qualitative (categorical) (for now assume it is dichotomous, i.e. 0,1)
  → Xi can be quantitative or qualitative variables

55 Bootvar: Regression models
Logistic regression model
- Example: Diabetes vs sex and age
  → Categorical variables need to be dichotomized (“dummied”; 1 variable for each category except 1)
  → Sex: if sex=2 then FEMALE = 1; else FEMALE = 0;
  → Age: create a variable for people over 60 (if age > 60 then OVER60=1; else OVER60=0)
  → The model (on the logit scale) is:
    - DIAB = intercept + b1*FEMALE + b2*OVER60

56 Bootvar: Regression models
Logistic regression model
- Example: Diabetes vs sex and age
  - DIAB = intercept + b1*FEMALE + b2*OVER60
- In bootvar, use the %logreg macro
    %logreg(yvar,xvar);
    %logreg(DIAB,FEMALE OVER60);

57 Bootvar: Regression models
Linear regression model
- Y = intercept + b1*X1 + b2*X2
  → Y is quantitative
  → Xi can be qualitative (categorical) or quantitative

58 Bootvar: Regression models
Linear regression model
- Example: BMI (body mass index) vs sex and age
  → Categorical variables need to be dichotomized (“dummied”; 1 variable for each category except 1)
  → Sex: if sex=2 then FEMALE = 1; else FEMALE = 0;
  → Age: use it as quantitative (single year of age)
  → The model is:
    - BMI = intercept + b1*FEMALE + b2*AGE

59 Bootvar: Regression models
Linear regression model
- Example: BMI vs sex and age
  - BMI = intercept + b1*FEMALE + b2*AGE
- In bootvar, use the %regress macro
    %regress(yvar,xvar);
    %regress(BMI,FEMALE AGE);

60 Bootvar: testing
- For version 2.0/2.1:
  - Simply set 2 < B < 500
- For version 1.0:
  - See documentation!

61 Historical info about variance estimation for NPHS
- Cycle 1: Use of Jackknife technique
  - Could not disseminate with public-use microdata files; only custom requests
- Cycle 2 & +: Use of bootstrap technique
  - Can not disseminate ….; custom requests or remote access
- All cycles: CV look-up tables
  - for large domains (provinces, age groups)
  - only good for totals, ratios, and differences of...

62 Variance estimation with other software programs
- WesVar (SPSS)
- SAS
- SUDAAN
- STATA

63 Future for Stats Can Health Surveys (vs. bootstrap)
- NPHS
  - Cycle 4 (2000-2001) data processing & weighting
  - Promote the use of longitudinal data
  - Bootstrap pgms: finalize version 2.0 (SAS & SPSS)
- CCHS
  - Cycle 1.1 bootstrap weights
  - Bootstrap also used for variance estimation (same programs as for NPHS)

64 Contacts
Population Health Surveys
- Health Pgm Surveys Manager: Lorna Bailie (lorna.bailie@statcan.ca)
- NPHS Manager: France Bilocq (france.bilocq@statcan.ca)
- CCHS Manager: Marc Hamel (marc.hamel@statcan.ca)
- CCHS Dissemination manager: Larry MacNabb (larry.macnabb@statcan.ca)
- Senior Methodologists: François Brisebois (francois.brisebois@statcan.ca), Mylène Lavigne (lavimyl@statcan.ca), Yves Béland (yves.beland@statcan.ca)
- Data Access Services Manager: Mario Bédard (mario.bedard@statcan.ca)
- Custom Services Requests: Garry Macdonald (macdgar@statcan.ca)

