Population Health Surveys Bootstrap Hands-on Workshop
Yves Béland, CCHS senior methodologist
Larry MacNabb, CCHS dissemination manager
Developed by François Brisebois, CCHS/NPHS senior methodologist (francois.brisebois@statcan.ca)

Purpose of the presentation
- Justify the use of the bootstrap technique, understand the theory behind it, and become familiar with it
- Dispel misconceptions about using the bootstrap technique for variance estimation

Outline
- Context
- NPHS/CCHS complex survey design
- Variance estimation / Bootstrap 101
- Data support / using the bootvar program
- Why bootstrap?
- CV look-up tables
- Historical info about variance estimation for NPHS
- Variance estimation with other software programs
- Future for STC Health Surveys (re. bootstrap)

Context
- A data user is interested in producing some results:
  1- Compute an estimate (total, ratio, etc.)
  2- Compute the precision of the estimate (variance, coefficient of variation (CV), etc.)

Context: 1- Compute an estimate
- Is not a problem!
- Use the survey weight provided with the NPHS/CCHS files

Context: 1- Compute an estimate (cont'd)
- Why use the survey weight?
- Conclusion: ALWAYS USE THE WEIGHTS

Context: 2- Compute the precision of an estimate
- Is a problem!!

Context: 2- Compute the precision of the estimate (cont'd)
- Scaled weights:
  - Scaled weight = weight / mean(weight)
  - Used to overcome problems with the computation of the variance for some statistics in SAS
  - Reference: paper by G. Roberts et al.
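The scaled-weight formula on this slide can be illustrated with a short sketch. It is written in Python purely for illustration (the workshop itself uses SAS and SPSS), and the weights are made-up values, not NPHS/CCHS data.

```python
def scale_weights(weights):
    """Divide each survey weight by the mean weight, so scaled weights average 1."""
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]


# Made-up survey weights for four respondents (illustrative only).
weights = [120.0, 80.0, 200.0, 400.0]
scaled = scale_weights(weights)

print(scaled)       # each weight relative to the mean weight
print(sum(scaled))  # scaled weights sum to the sample size, not the population size
```

Because the scaled weights sum to the number of respondents rather than to the population, they suit analyses where the sample size drives the variance calculation, but not the estimation of population totals.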

Context: 2- Compute the precision of the estimate (cont'd)
- Why such a difference?
  Answer: the complex survey design is the main cause (other factors are discussed later)
  Note: CCHS and NPHS have slightly different frames, but both are considered complex survey designs

Complex survey design
1- Each province is divided into strata
2- Clusters are selected within each stratum
3- Households are selected within each cluster
[Diagrams: Province A divided into Stratum #1 and Stratum #2, with clusters and then households selected within each stratum]

Complex survey design
- How does the sample design affect the precision of estimates?
  - Stratification decreases variability (more precise)
  - Clustering increases variability (less precise)
  - Overall, the multistage design increases variability (less precise than simple random sampling, SRS)

Complex survey design
- So why use a multistage cluster sample design anyway?
  - Pros:
    - Efficient for interviewing (less travel, less costly)
    - Better coverage of the entire region of interest
  - Cons:
    - Problems for variance estimation

Bootstrap Method
- Variance estimation with a complex multistage cluster sample design:
  - The exact formula for variance estimation is too complex; an approximate approach is required
  - NOTE: accounting for the design in variance estimation is as crucial as using the sampling weights when estimating a statistic

Bootstrap Method
- Approximate methods for variance estimation:
  - Taylor linearization
  - Re-sampling methods:
    - Balanced Repeated Replication
    - Jackknife
    - Bootstrap

Bootstrap Method
- Principle:
  - You want to know how precise your estimate of the number of smokers in Canada is
  - You could draw 500 completely new samples and compare the 500 estimates obtained from them. The variance of these 500 estimates would indicate the precision.
  - Problem: drawing 500 new samples is $$$
  - Solution: treat your sample as a population and take many smaller subsamples from it.

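Bootstrap 101
- How bootstrap weights are created (the secret is finally revealed!!!)
[Illustration: worked example with an estimated total T = 40]

$$\mathrm{Var} = \sum_{i=1}^{500} (B_i - \bar{B})^2 / 499$$

where $B_i$ is the estimate computed with the $i$-th set of bootstrap weights and $\bar{B}$ is read here as the mean of the 500 bootstrap estimates.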
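As a rough sketch of the variance formula on this slide, the Python snippet below computes the variance of a set of bootstrap estimates and the corresponding CV. It assumes the deviations are taken around the mean of the replicate estimates and that the divisor is the number of replicates minus one (499 when 500 weights are used); the five replicate values are invented to keep the arithmetic visible.

```python
import math


def bootstrap_variance(replicate_estimates):
    """Variance of the replicate estimates around their mean, with divisor B - 1."""
    b = len(replicate_estimates)
    mean_estimate = sum(replicate_estimates) / b
    return sum((e - mean_estimate) ** 2 for e in replicate_estimates) / (b - 1)


def coefficient_of_variation(point_estimate, variance):
    """CV in percent: standard error divided by the point estimate."""
    return 100.0 * math.sqrt(variance) / point_estimate


# Invented estimates from five bootstrap replicates (a real run would use 500).
replicates = [38.0, 42.0, 40.0, 44.0, 36.0]
variance = bootstrap_variance(replicates)
cv = coefficient_of_variation(40.0, variance)

print(variance)
print(round(cv, 1))
```

With real data, each replicate estimate is obtained by recomputing the statistic with one of the bootstrap weight variables in place of the final sampling weight.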

Bootstrap 101
- How bootstrap replicates are built (cont'd)
  - The "real" recipe:
    1- Subsample clusters (SRS) within strata
    2- Apply the (initial design) weight
    3- Adjust the weight for the selection of n-1 among n
    4- Apply all standard adjustments (nonresponse, share, etc.)
    5- Post-stratify to population counts
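Steps 1 to 3 of the recipe can be sketched as follows. This is only an illustration under stated assumptions: the n-1 clusters are drawn with replacement, and the adjustment is the usual rescaling factor n/(n-1) multiplied by the number of times a cluster was drawn. The slides do not spell out either detail, and steps 4 and 5 (nonresponse adjustments, post-stratification) are not shown. All names and data are hypothetical.

```python
import random
from collections import Counter


def bootstrap_design_weights(clusters_by_stratum, design_weight, rng):
    """Build one bootstrap replicate of the design weights (steps 1 to 3 only).

    clusters_by_stratum: dict mapping stratum -> list of cluster ids (at least 2 each)
    design_weight: dict mapping cluster id -> initial design weight
    rng: a random.Random instance, so that results are reproducible
    """
    replicate = {}
    for stratum, clusters in clusters_by_stratum.items():
        n = len(clusters)
        # Step 1: simple random subsample of n-1 clusters (assumed with replacement).
        draws = Counter(rng.choice(clusters) for _ in range(n - 1))
        for cluster in clusters:
            # Steps 2 and 3: start from the design weight and rescale it for the
            # selection of n-1 among n; clusters never drawn get weight 0.
            replicate[cluster] = design_weight[cluster] * (n / (n - 1)) * draws[cluster]
    return replicate


# Hypothetical design: two strata with three and two clusters.
strata = {"stratum1": ["a", "b", "c"], "stratum2": ["d", "e"]}
weights = {"a": 10.0, "b": 10.0, "c": 10.0, "d": 5.0, "e": 5.0}
one_replicate = bootstrap_design_weights(strata, weights, random.Random(2024))
print(one_replicate)
```

When the clusters of a stratum share the same design weight, the rescaled weights in each replicate add up to the original stratum total, which is the purpose of the n/(n-1) factor.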

Bootstrap 101
- How bootstrap replicates are built (cont'd)
  - The bootstrap method is intended to mimic the approach used in the sampling and weighting processes
  - Be careful: some software programs claim to include the bootstrap technique, but what they really do is skip steps #4 and #5 and use the final weight directly in step #2

- STC methodologists create the bootstrap weight files.
- Can you create your own bootstrap weight file? No. Why? Because to do so you need to know:
  - The design information, i.e. strata and clusters (to generate the bootstrap subsamples)
  - The definition of all adjustment classes (including post-stratification)

Bootstrap 101
- The bootstrap weight files are:
  - Available for all files (except PUMF, for confidentiality reasons)
  - Distributed with the data files, as separate files
- The bootstrap weight files contain:
  - IDs (REALUKEY/SAMPLEID, PERSONID)
  - Final sampling weight (WTxx)
  - 500 bootstrap weights (BSW1--BSW500)

Bootstrap - Support
- NPHS/CCHS provides data users with SAS & SPSS macro programs to compute bootstrap variances
  - The macros simplify the computation of bootstrap variance estimates for totals, ratios, differences of ratios, regressions (linear and logistic), and basic generalized linear models
  - Come with documentation & examples
  - Available in French and English
  - Referred to as "bootvar"

Example: Step by Step
- Let's get to work!
- Goal: estimate the number of diabetics (total)
  - NPHS 1998-99 dummy file (see information sheet)

Example: Step by Step
STEP #1 Create your « analysis data file »
- Read the NPHS/CCHS data file
- Prepare the dummy variables necessary for your analysis
- Keep only the necessary variables (include the desired geography)
- Run the analysis to get point estimates only (not necessary but recommended)
STEP #2 Compute your variances with bootvar
- Location of INPUT files:
  - Your « analysis data file »
  - The bootstrap weights file
- Geography desired
- Number of bootstrap weights to use
- Specify the desired analysis:
  - Totals, ratios, differences of ratios
  - Regression (linear & logit)
  - Generalized linear modeling

- Step #1: On your own (but you can use the examples provided as a starting point)
- Step #2: Use the provided bootvar program

STEP #1
- Read input file
- Create dummy variables
- Keep only necessary variables
- Run the analysis to get point estimates

- Create dummy variables
  - For qualitative/categorical variables, we need to identify which value(s) we are interested in. This is done by creating a dummy variable.
  - Dummy variable = 1 for the characteristic of interest, = 0 otherwise

STEP #1
- Create dummy variable: example #1
  - During the past 12 months, how often did you drink alcoholic beverages? (ALC8_2)
      1 = Less than once a month
      2 = Once a month
      3 = 2 to 3 times a month
      4 = Once a week
      5 = 2 to 3 times a week
      6 = 4 to 6 times a week
      7 = Every day
  - Interested in categories 1 to 4 (once a week or less)
    - DRINK = 1 if ALC8_2 is 1, 2, 3 or 4; = 0 otherwise

STEP #1
- Create dummy variable: example #2
      Diabetes (CCC8_1J)      Sex (DHC8_SEX)
      1 = Yes                 1 = Male
      2 = No                  2 = Female
      6 = Not applicable
      7 = Don't know
      9 = Not stated
  - Interested in "males having diabetes"
    - mdiab = 1 if CCC8_1J = 1 and DHC8_SEX = 1; = 0 otherwise

STEP #1
- Create dummy variable: example #2
  - How to use the dummy variable to get an estimate
    - Total: in SAS:
        proc freq;
          tables mdiab;
          weight wt56;
        run;
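For readers who want to see the arithmetic behind the weighted frequency, here is a minimal Python sketch of the same idea: build the dummy variable, then sum the survey weights where it equals 1. The variable names follow the slides (CCC8_1J, DHC8_SEX, WT56), but the five records are invented.

```python
# Invented records; codes follow the slide (1 = yes / male, 2 = no / female, 9 = not stated).
records = [
    {"CCC8_1J": 1, "DHC8_SEX": 1, "WT56": 150.0},
    {"CCC8_1J": 2, "DHC8_SEX": 1, "WT56": 200.0},
    {"CCC8_1J": 1, "DHC8_SEX": 2, "WT56": 100.0},
    {"CCC8_1J": 1, "DHC8_SEX": 1, "WT56": 50.0},
    {"CCC8_1J": 9, "DHC8_SEX": 1, "WT56": 300.0},
]


def mdiab(record):
    """Dummy variable: 1 for males having diabetes, 0 otherwise."""
    return 1 if record["CCC8_1J"] == 1 and record["DHC8_SEX"] == 1 else 0


# Weighted total: the sum of the survey weights over records where the dummy is 1.
total_male_diabetics = sum(r["WT56"] * mdiab(r) for r in records)
print(total_male_diabetics)
```

Note that the "not stated" record contributes 0 here; how to treat such cases is an analysis decision that the sketch does not settle.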

STEP #1
- Create dummy variable: example #2
  - How to use the dummy variable to get an estimate
    - Ratio:

STEP #1
- See example in SPSS

STEP #1
- Now your turn! (exercise #1)
  - Add asthma (CCC8_1C) to the table
  - Use the existing program (step1.sas) and add SPSS code to create a dummy variable for asthma; then get the results

Step #2: Bootvar Program
- Created by methodologists in 1997 (first used with NPHS cycle 2 data)
- Version 1.0
  - One single program (over 1,000 lines of code)
  - Divided into 4 sections
    - Users have to adapt the program to their requests; changes are needed in 3 sections
  - SAS: bootvar.sas / bootvarf.sas
  - SPSS: beta version available only on request (bvr_b.sps)

Step #2: Bootvar Program
- Version 2.0
  - Justifications:
    - Compatible with SAS 8+
    - Centralizes the code that the user has to modify
    - Can be used with both NPHS and CCHS data files
  - Now consists of 2 programs:
    - One contains the code users need to modify for their requests
    - One contains the code users do not have to modify (macros)

Step #2: Bootvar Program
- Version 2.0
  - SAS version: bootvare_v20.sas / bootvarf_v20.sas, macroe_v20.sas / macrof_v20.sas
  - SPSS version: bootvare_v21.sps / bootvarf_v21.sps, macroe_v21.sps / macrof_v21.sps

STEP #2: Use of bootvar
- Point estimates have already been obtained; let us now estimate the sampling variability of those estimates
  → Go through the bootvar program (bootvare_v21.sps)

STEP #2: Use of bootvar
- See example in SPSS

STEP #2
- Now your turn! (exercise #2)
  - Compute confidence intervals for asthma
  - Use bootvare_v21.sps and adjust it to obtain the desired results (use the already set up step2.sps program for this exercise)
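One common way to turn a bootstrap variance into a confidence interval is the normal approximation, sketched below in Python. This is an assumption made for illustration: the transcript does not say which interval bootvar reports, and the estimate and variance used here are invented.

```python
import math


def normal_confidence_interval(estimate, variance, z=1.96):
    """Normal-approximation interval: estimate plus or minus z standard errors."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Invented values: an estimated total of 1,000 with a bootstrap variance of 2,500.
lower, upper = normal_confidence_interval(1000.0, 2500.0)
print(lower, upper)
```

The default z = 1.96 corresponds to a 95% interval; another confidence level would need a different multiplier.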

Bootstrap - More
- Why 500 bootstrap weights?
  - Size of file (for dissemination)
  - Computation time (for an average PC)
  - Accuracy
- Use more bootstrap weights?
  - Faster PC
  - Accuracy for small domains and more complex analysis methods

Bootstrap - More
- Confidentiality revealed from the bootstrap weights

Bootstrap - More
- Confidentiality revealed from the bootstrap weights (cont'd)
  - How can PUMF users estimate their exact variances?
    - Remote access
      - A dummy file is provided (same structure as the master files but containing dummy data)
      - Test programs and send them by e-mail
    - Research Data Centre
    - Regional Offices

Why Bootstrap?
- Other techniques examined: Taylor, Jackknife
  - Taylor:
    - Need to define a linear equation for each statistic examined
  - Jackknife:
    - Cannot be disseminated because of confidentiality
    - Number of replicates depends on the number of strata (the large number of strata in 1996 makes dissemination impossible)

Why Bootstrap?
- Bootstrap:
  - Handles survey designs with many strata more easily
  - Sets of 500 bootstrap weights can be distributed to data users
  - Recommended (over the jackknife) for estimating the variance of nonsmooth functions such as quantiles, LICO
  - Reference: "Bootstrap Variance Estimation for the National Population Health Survey", D. Yeo, H. Mantel, and T.-P. Liu. 1999, ASA Conference.

Bootvar: exercise #3
- Results for diabetes broken down by sex and province

Bootvar: Tricks
- If you need to create a dummy variable for a characteristic based on several variables:
  - Example: males with diabetes
  - First, create dummy variables for each individual variable (males, diabetes)
  - Then, create the dummy variable for the characteristic by multiplying the individual dummy variables

Bootvar: Tricks
- Example:
  - Males = 1,0 (MALES)
  - Diabetes = 1,0 (DIAB)
  - Males having diabetes (MDIAB) = MALES * DIAB

Bootvar: Tricks
- Use the REGION parameter in bootvar to specify a "stratification" variable (it doesn't have to be a geographic variable!)
  - Example: REGION = sex → will produce results by sex

CV look-up tables
- What are they?
  - Approximate sampling variability tables
  - Produced for Canada, each province, and by age groups for Canada (also by Health Regions for cycle 2)
- Useful only for categorical estimates
  → Totals & ratios only

CV look-up tables

Sampling Variability Guidelines

  Type of estimate    CV (%)        Guidelines
  Acceptable          0.0-16.5      General unrestricted release.
  Marginal            16.6-33.3     General unrestricted release, but with a warning cautioning users of the high sampling variability. Should be identified by the letter M.
  Unacceptable        > 33.3        No release. Should be flagged with the letter U.
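The guidelines table can be applied mechanically once a CV is available. The Python sketch below is a hypothetical helper, not part of bootvar; it assumes CVs are expressed in percent and treats anything above 16.5 and up to 33.3 as marginal.

```python
def release_category(cv_percent):
    """Classify a CV (in percent) according to the sampling variability guidelines."""
    if cv_percent <= 16.5:
        return "Acceptable"
    if cv_percent <= 33.3:
        return "Marginal (M)"
    return "Unacceptable (U)"


# CVs in percent; any values can be passed in.
for cv in (11.9, 18.7, 27.6, 40.0):
    print(cv, release_category(cv))
```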

CV look-up tables
- Comparison between the bootstrap CV (BTS) and the CV from the look-up table (CV table)
  - For the number of people having diabetes:
      Manitoba total:    T = 32K,    CV table = 18%,    BTS = 18.7%
      Manitoba males:    T = 16K,    CV table = 25.7%,  BTS = 27.6%
      Manitoba females:  T = 16.5K,  CV table = 25.3%,  BTS = 26.4%

CV look-up tables
- Comparison between the bootstrap CV (BTS) and the CV from the look-up table (CV table)
  - Other examples (from master - general file)
    - Number of people experiencing food insecurity:
        Manitoba total:  T = 40K,   CV table = 11.9%,  BTS = 19.8%
    - Number of people in the lowest income quintile:
        Manitoba total:  T = 118K,  CV table = 6.4%,   BTS = 11.2%

Bootvar: Regression models
Logistic regression model
- log(p / (1 - p)) = intercept + b1*X1 + b2*X2, where p = P(Y = 1)
  → Y has to be qualitative (categorical) (for now assume it is dichotomous, i.e. 0,1)
  → Xi can be quantitative or qualitative variables

Bootvar: Regression models
Logistic regression model
- Example: Diabetes vs sex and age
  → Categorical variables need to be dichotomized ("dummied"; 1 variable for each category except 1)
  → Sex: if sex=2 then FEMALE = 1; else FEMALE = 0;
  → Age: create a variable for people over 60 (if age > 60 then OVER60=1; else OVER60=0)
  → The model is:
    - log(p / (1 - p)) = intercept + b1*FEMALE + b2*OVER60, where p = P(DIAB = 1)

Bootvar: Regression models
Logistic regression model
- Example: Diabetes vs sex and age
  - log(p / (1 - p)) = intercept + b1*FEMALE + b2*OVER60, where p = P(DIAB = 1)
- In bootvar, use the %logreg macro:
    %logreg(yvar,xvar);
    %logreg(DIAB,FEMALE OVER60);

Bootvar: Regression models
Linear regression model
- Y = intercept + b1*X1 + b2*X2
  → Y is quantitative
  → Xi can be qualitative (categorical) or quantitative

Bootvar: Regression models
Linear regression model
- Example: BMI (body mass index) vs sex and age
  → Categorical variables need to be dichotomized ("dummied"; 1 variable for each category except 1)
  → Sex: if sex=2 then FEMALE = 1; else FEMALE = 0;
  → Age: use it as quantitative (single year of age)
  → The model is:
    - BMI = intercept + b1*FEMALE + b2*AGE

Bootvar: Regression models
Linear regression model
- Example: BMI vs sex and age
  - BMI = intercept + b1*FEMALE + b2*AGE
- In bootvar, use the %regress macro:
    %regress(yvar,xvar);
    %regress(BMI,FEMALE AGE);
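To make the idea behind bootstrap variances for regression coefficients concrete, the following Python sketch refits a weighted least-squares model once per set of bootstrap weights and takes the variance of each coefficient across the refits. It is not the bootvar code and does not reproduce the %regress macro; the data, the three replicate weight vectors, and all function names are invented for illustration (a real run would use the 500 bootstrap weights).

```python
def weighted_least_squares(y, X, w):
    """Solve the weighted normal equations (X'WX) b = X'Wy by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'WX | X'Wy].
    A = [[sum(wi * xi[r] * xi[c] for xi, wi in zip(X, w)) for c in range(p)]
         + [sum(wi * xi[r] * yi for xi, yi, wi in zip(X, y, w))]
         for r in range(p)]
    for col in range(p):
        # Partial pivoting, then eliminate this column from every other row.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][p] / A[r][r] for r in range(p)]


def bootstrap_coefficient_variances(y, X, final_weights, bootstrap_weights):
    """Point estimates use the final weight; variances come from the replicate fits."""
    coefficients = weighted_least_squares(y, X, final_weights)
    fits = [weighted_least_squares(y, X, bw) for bw in bootstrap_weights]
    b = len(fits)
    variances = []
    for j in range(len(coefficients)):
        mean_j = sum(fit[j] for fit in fits) / b
        variances.append(sum((fit[j] - mean_j) ** 2 for fit in fits) / (b - 1))
    return coefficients, variances


# Invented data: each row of X is [1 (intercept), FEMALE, AGE]; y is BMI.
X = [[1, 0, 30], [1, 1, 40], [1, 0, 50], [1, 1, 25], [1, 0, 60]]
bmi = [24.1, 25.3, 26.0, 23.8, 27.2]
final_w = [100.0, 150.0, 120.0, 90.0, 200.0]
boot_w = [[110.0, 140.0, 130.0, 80.0, 210.0],
          [90.0, 160.0, 100.0, 100.0, 190.0],
          [105.0, 155.0, 125.0, 95.0, 180.0]]

coefs, variances = bootstrap_coefficient_variances(bmi, X, final_w, boot_w)
print(coefs)
print(variances)
```

The sketch assumes every set of weights leaves the normal equations solvable; replicate weights with many zeros in a small domain could break that assumption.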

Bootvar: testing
- For version 2.0/2.1:
  - Simply set the number of bootstrap weights B so that 2 < B < 500
- For version 1.0:
  - See documentation!

Historical info about variance estimation for NPHS
- Cycle 1: Use of the jackknife technique
  - Could not be disseminated with public-use microdata files; custom requests only
- Cycle 2 and later: Use of the bootstrap technique
  - Cannot be disseminated ….; custom requests or remote access
- All cycles: CV look-up tables
  - For large domains (provinces, age groups)
  - Only good for totals, ratios, and differences of...

Variance estimation with other software programs
- WesVar (SPSS)
- SAS
- SUDAAN
- STATA

Future for Stats Can Health Surveys (vs. bootstrap)
- NPHS
  - Cycle 4 (2000-2001) data processing & weighting
  - Promote the use of longitudinal data
  - Bootstrap programs: finalize version 2.0 (SAS & SPSS)
- CCHS
  - Cycle 1.1 bootstrap weights
  - Bootstrap also used for variance estimation (same programs as for NPHS)

Contacts
  Health Pgm Surveys Manager: Lorna Bailie (lorna.bailie@statcan.ca)
  NPHS Manager: France Bilocq (france.bilocq@statcan.ca)
  CCHS Manager: Marc Hamel (marc.hamel@statcan.ca)
  CCHS Dissemination Manager: Larry MacNabb (larry.macnabb@statcan.ca)
  Senior Methodologists: François Brisebois (francois.brisebois@statcan.ca), Mylène Lavigne (lavimyl@statcan.ca), Yves Béland (yves.beland@statcan.ca)
  Data Access Services Manager: Mario Bédard (mario.bedard@statcan.ca)
  Custom Services Requests: Garry Macdonald (macdgar@statcan.ca)
