2 Why sample?
- To make an inference about a population
- Studying the entire population is impractical or impossible

3 Example of sampling
- Estimate the proportion of adults, ages 18-65, in Port Elizabeth that have type 2 diabetes
- Select a sample from which to estimate the proportion
- Population: adults aged 18-65 living in Port Elizabeth
- Inference: proportion with type 2 diabetes

4 Probability sampling
- Each individual has a known (non-zero) probability of selection
- Precision of estimates can be quantified

5 Non-probability sampling
- Cheaper, more convenient
- Quality of estimates cannot be assessed
- May not be representative of the population

7 Sampling error
- Random variability in sample estimates that arises from the randomness of the sample selection process
- Precision can be quantified (estimation of standard errors, confidence intervals)

8 Non-sampling error
- Estimation error that arises from sources other than random variation
  - non-response
  - undercoverage of survey
  - poorly trained interviewers
  - non-truthful answers
  - non-probability sampling
- This type of error is a bias

9 What is bias?
- We want to estimate the mean weight of all women in a given age range living in Coopersville.
- Suppose there are 50,000 such women and the true mean weight is 61.7 kg.
- We select a sample of 200 such women and interview them, asking each woman what her weight is.
- The sample mean weight is 59.4 kg.
- Is our estimate biased?

10 Bias
- Suppose we could repeat the survey many, many times. Then we compute the mean of all the sample means.
- Say the mean of the means = 62.9 kg
- Bias = (mean of means) - (true mean) = 62.9 - 61.7 = 1.2 kg

11 Unbiased estimation
- If (mean of the means) = (true mean), then the bias is zero, and we say that the estimator is unbiased.
- The “mean of the means” is called the “expected value” of the estimator.

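The idea of repeating the survey many times can be imitated with a small simulation. This is a sketch only: the population is synthetic, and the assumption that every respondent states 2 kg less than her true weight is hypothetical and not part of the example above.

```python
import random
import statistics

def mean_of_sample_means(population, n, repeats, rng, report=lambda w: w):
    """Repeat the survey `repeats` times and return the mean of the sample means."""
    means = []
    for _ in range(repeats):
        sample = rng.sample(population, n)   # simple random sample of n people
        means.append(statistics.fmean(report(w) for w in sample))
    return statistics.fmean(means)

rng = random.Random(1)
# Synthetic population of 50,000 weights in kg (made-up values).
population = [rng.gauss(61.7, 9.0) for _ in range(50_000)]
true_mean = statistics.fmean(population)

# Accurate answers: the estimator should be (nearly) unbiased.
honest = mean_of_sample_means(population, 200, 500, rng)
# Hypothetical reporting error: each stated weight is 2 kg too low.
shifted = mean_of_sample_means(population, 200, 500, rng, report=lambda w: w - 2.0)

bias_honest = honest - true_mean
bias_underreported = shifted - true_mean
print(f"bias with accurate answers: {bias_honest:+.2f} kg")
print(f"bias with under-reporting:  {bias_underreported:+.2f} kg")
```

With accurate answers the mean of the means should sit very close to the true mean; a systematic reporting error shifts it by roughly the size of that error, however many times the survey is repeated.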
12 Simple sampling methods
- Task: select a sample of n individuals or items from a population of N individuals or items
- Common methods
  - simple random sampling
  - systematic sampling

13 Simple sampling methods
- Simple random sampling (SRS)
  - each item in the population is equally likely to be selected
  - each combination of n items is equally likely to be selected
- Systematic sampling (typical method)
  - randomly select a starting point
  - select every kth item thereafter

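A simple random sample can be drawn with a few lines of Python. This is a minimal sketch that relies on the standard library's `random.sample`, which draws without replacement; the frame of 100 numbered items and the function name are hypothetical.

```python
import random

def simple_random_sample(items, n, rng=None):
    """Draw n items without replacement; every subset of size n is equally likely."""
    rng = rng or random.Random()
    return rng.sample(list(items), n)

frame = range(1, 101)   # hypothetical sampling frame: items numbered 1..100
sample = simple_random_sample(frame, 10, random.Random(42))
print(sorted(sample))
```

Passing a seeded `random.Random` makes the selection reproducible, which helps when the sampling procedure has to be documented.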
14 Systematic sampling example
- Stack of 213 hospital admission forms; select a sample of 15
- 213/15 = 14.2
- Select every 14th form
- Starting point: random number between 1 and 14 (we choose 11)
- First form selected is 11th from top
- Second form selected is 25th from top (11 + 14 = 25)
- Third form selected is 39th from top (11 + 2x14 = 39)
- And so forth . . .

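The quick method in this example might be sketched as follows. The function name and arguments are illustrative, not a standard API.

```python
import random

def systematic_sample(n_items, k, start=None, rng=None):
    """Return the 1-based positions start, start + k, start + 2k, ... up to n_items."""
    if start is None:
        rng = rng or random.Random()
        start = rng.randint(1, k)   # random starting point between 1 and k
    return list(range(start, n_items + 1, k))

k = 213 // 15                        # 213/15 = 14.2, rounded down to 14
positions = systematic_sample(213, k, start=11)
print(positions[:3], len(positions))
```

With a starting point of 11 this selects 15 forms. Because the interval is rounded down to 14, a starting point of 1, 2, or 3 would select 16 forms, which is one reason the method is only an approximation.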
15 Systematic sampling, continued
- What is the probability that the 146th form will be selected? The 195th?
- Does this qualify as a simple random sample? Why or why not?
- Is there any potential problem arising from the use of systematic sampling in this situation?

16 Example was typical quick method
- In the preceding example, we selected every 14th form
- Ideally, we would select every 14.2th form (see later example on 2-stage sample of nurses)
- Example is a quick and easy method, commonly used in the field; it is a good approximation to the more rigorous procedure

17 Systematic sampling: + and -
- Advantages of systematic sampling
  - typically simpler to implement than SRS
  - can provide more uniform coverage
- Potential disadvantage of systematic sampling
  - can produce a bias if there is a systematic pattern in the sequence of items from which the sample is selected

18 Role of simple sampling methods
- These simple sampling methods are necessary components of more complex sampling methods:
  - cluster sampling
  - stratified sampling
- We’ll discuss these more complex methods next (following some definitions)

19 Definitions
- Listing units (or enumeration units)
  - the lowest level sampled units (e.g., households or individuals)
- PSUs (primary sampling units)
  - the first units sampled (e.g., states or regions)
- Sampling probability
  - for any unit eligible to be sampled, the probability that the unit is selected in the sample

20 More definitions
- EPSEM sampling
  - “equal probability of selection method”, thus a method in which each listing unit has the same sampling probability
- Sampling frame
  - the set of items from which sampling is done, often a list of items

21 More definitions
- Undercoverage: the degree to which we fail to identify all eligible units in the population
  - incomplete lists
  - incomplete or incorrect eligibility information

22 Still more definitions
- Non-response: failure to interview sampled listing units (study subjects)
  - refusal
  - death
  - physician refusal
  - inability to locate subject
  - unavailability

23 Still more definitions
- Precision: the amount of random error in an estimate
  - often measured by the width or half-width of the confidence interval
  - standard error is another measure of precision
  - estimates with smaller standard error or narrower CI are said to be more precise

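To make these measures of precision concrete, here is a sketch for a sample proportion. It assumes simple random sampling, the usual normal approximation (z = 1.96 for a 95% interval), and no finite population correction; the counts are hypothetical.

```python
import math

def proportion_precision(successes, n, z=1.96):
    """Estimate, standard error, and confidence interval for a sample proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # standard error under SRS
    half_width = z * se               # half-width of the confidence interval
    return p, se, (p - half_width, p + half_width)

# Hypothetical data: 30 of 200 sampled adults have the condition.
p, se, ci = proportion_precision(30, 200)
print(f"p = {p:.3f}, SE = {se:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")

# Same proportion from a sample four times as large: the SE is halved.
_, se_large, _ = proportion_precision(120, 800)
print(f"SE with n = 800: {se_large:.4f}")
```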
25 Clusters
- Subsets of the listing units in the population
- Set of clusters must be mutually exclusive and collectively exhaustive
  - counties
  - townships
  - regions
  - institutions

26 Example: single-stage cluster sampling
- There are 361 nurses working at the 31 hospitals and clinics in Region 4
- We wish to interview a sample of these nurses
  - select a simple random sample of 5 hospitals/clinics
  - interview all nurses employed at the 5 selected institutions

27 Assessing the example
- Hospitals/clinics are the PSUs
- Nurses are the listing units
- Sampling probability for each nurse is 5/31
- Thus, this is an EPSEM sample
- Sampling frame is the list of 31 hospitals and clinics

29 Cluster sampling: two stage
- Select a sample of clusters, as in the single-stage method
- From each selected cluster, select a subsample of listing units

30 Cluster sampling: two stage
- EPSEM sampling is desirable because such samples are self-weighting
  - don’t need sampling weights in analysis
- A common EPSEM method for two-stage sampling is PPS (probability proportional to size)

31 PPS sampling
- The key to the method is that the sampling probabilities of clusters in the first stage are proportional to the “sizes” of the clusters
  - size = number of listing units in cluster
- At stage 2, select the same number of listing units from each selected cluster

32 Nurse example revisited: two-stage sampling
- We want to interview a sample of 36 nurses
- We can afford to visit 9 different hospitals/clinics
- Thus, we need to interview 36/9 = 4 nurses at each institution

33 Nurse example revisited: two-stage sampling
- Stage 1: select a sample of 9 hospitals/clinics
  - Selection prob. proportional to “size”
- Stage 2: select a sample of 4 nurses from each selected institution
- At each stage, use one of the simple sampling methods

34 Nurse example revisited: two-stage sampling
- PSUs are the hospitals/clinics
- Listing units are the nurses
- Sampling frames
  - Stage 1: List of 31 hospitals/clinics
  - Stage 2: Lists of nurses at each selected hospital/clinic

35 Selecting 2-stage nurse sample
- Sampling interval, I = 361/9 = 40.1
- Starting point, random number between 1 and 40; we choose R = 14
- First sampling number = R = 14
- 2nd sampling number = 14 + 1x40.1 = 54.1
- 3rd sampling number = 14 + 2x40.1 = 94.2
- We have selected institutions 2, 5, 9, . . .

37 Applying the sampling numbers
- For each sampling number, choose the first unit with cumulative “size” equal to or greater than the sampling number
- Example: sampling number 54.1
  - first unit with cumulative size ≥ 54.1 is unit 5 (cum. no. of nurses = 57)
  - so we select unit 5 for the sample

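The stage 1 selection can be sketched in code. The full list of institution sizes is not reproduced here, so the sizes below are largely hypothetical: only the 12 and 7 nurses at institutions 1 and 2 and the cumulative total of 57 at unit 5 come from the example, and the remaining values were chosen merely to be consistent with the selection of units 2, 5, and 9. The function names are illustrative.

```python
from bisect import bisect_left
from itertools import accumulate

def sampling_numbers(total_size, n_clusters, start):
    """Sampling numbers R, R + I, R + 2I, ... with interval I = total_size / n_clusters."""
    interval = total_size / n_clusters
    return [start + j * interval for j in range(n_clusters)]

def select_units(sizes, numbers):
    """For each sampling number, pick the first unit (1-based) whose
    cumulative size is equal to or greater than that number."""
    cumulative = list(accumulate(sizes))
    selected = []
    for s in numbers:
        idx = bisect_left(cumulative, s)   # first index with cumulative[idx] >= s
        if idx == len(cumulative):
            raise ValueError(f"sampling number {s} exceeds the sizes listed")
        selected.append(idx + 1)
    return selected

# First nine institutions only; sizes other than 12 and 7 are hypothetical.
first_nine_sizes = [12, 7, 15, 13, 10, 9, 11, 14, 8]
numbers = sampling_numbers(361, 9, 14)          # 14, 54.1, 94.2, ...
print(select_units(first_nine_sizes, numbers[:3]))
```

Only the first three sampling numbers are applied because only nine institutions are listed; with the full list of 31 sizes, all nine numbers would be used. An institution larger than the sampling interval could be hit by more than one sampling number, a case this sketch does not treat specially.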
38 Optional challenge
- What is the selection probability for institution 1?
  - 12/40.1 = 0.299
- What is the selection probability for a nurse in institution 1?
  - (12/40.1) x (4/12) = 4/40.1 = 36/361
- What is the selection probability for a nurse in institution 2?
  - (7/40.1) x (4/7) = 4/40.1 = 36/361
- All nurses have the same selection probability.

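The cancellation of the institution size can be checked with exact fractions. This is a sketch under the assumptions of the example (interval I = 361/9, four nurses taken per selected institution, and every institution having at least 4 nurses and no more than I); the constant and function names are illustrative.

```python
from fractions import Fraction

TOTAL_NURSES = 361
N_CLUSTERS = 9
TAKE = 4   # nurses interviewed per selected institution

def nurse_selection_probability(cluster_size):
    """Two-stage selection probability for one nurse in a cluster of the given size."""
    interval = Fraction(TOTAL_NURSES, N_CLUSTERS)   # I = 361/9
    p_cluster = Fraction(cluster_size) / interval   # stage 1: size / I
    p_within = Fraction(TAKE, cluster_size)         # stage 2: 4 / size
    return p_cluster * p_within

for size in (12, 7):
    print(size, nurse_selection_probability(size))
```

The size appears once in the numerator and once in the denominator, so the product is 36/361 for every institution.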
39 Why do cluster sampling instead of a simple sampling method?
- Advantages
  - reduced logistical costs (e.g., travel)
  - list of all 361 nurses may not be available (reduces listing labor)
- Disadvantages
  - estimates are less precise
  - analysis is more complicated (requires special software)

40 Design effect
- Relative increase in variance of an estimate due to the sampling design
  - “variance” = (standard error)$^2$
- Formula
  - $s_1$ = standard error under simple random sampling
  - $s_2$ = standard error under complex sampling design (e.g., cluster sampling)
  - design effect = $(s_2/s_1)^2$

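As a quick numerical illustration of the formula (the two standard errors below are hypothetical values for the same estimate and the same sample size):

```python
def design_effect(se_complex, se_srs):
    """Design effect = (s2 / s1)^2, the ratio of the two variances."""
    return (se_complex / se_srs) ** 2

# Hypothetical standard errors: 0.030 under cluster sampling, 0.020 under SRS.
deff = design_effect(se_complex=0.030, se_srs=0.020)
print(f"design effect = {deff:.2f}")
```

A design effect above 1 means the variance under the complex design is that many times the variance under simple random sampling.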
41 Design effect for cluster sampling
- For cluster sampling designs, the design effect is almost always > 1
- This means that estimates from a survey done with cluster sampling are generally less precise than corresponding estimates obtained from a survey having the same sample size done with simple random sampling

42 Cluster sizes
- Recommended “take” per cluster is for multi-purpose surveys
- Time and resource limitations will often dictate the maximum number of clusters you can include in the study
- Including more clusters improves the precision of your estimates more than a corresponding increase in sample size within the clusters already in the sample

44 Strata
- Subsets of the listing units in the population
- Set of strata must be mutually exclusive and collectively exhaustive
- Strata are often based on demographic variables
  - age
  - sex
  - race

45 Stratified sampling
- Sample from each stratum
- Often, sampling probabilities vary across strata

46 Stratified sampling
- Advantages
  - guarantees coverage across strata
  - can over-sample some strata in order to obtain precise within-stratum estimates
  - typically, design effect < 1
- Disadvantages
  - with unequal sampling probabilities, sampling weights must be included in analysis
    - more complicated
    - requires special software

47 Example: sampling breast cancer cases for the Women’s CARE Study
- Stratification variables
  - geographic site
  - race (2 races)
  - five-year age group
- Over-sampled younger women
- Over-sampled black women

48 Example: sampling households for a reproductive health survey in 11 refugee camps in Pakistan
- Selected simple random sample of households from within each of the 11 camps
- All households were selected with the same probability

50 The sampling operation
- Must be carefully controlled
  - don’t leave to discretion in the field
  - use a carefully defined procedure
- Document what you did
  - for reference during analysis
  - to defend your study

51 Sampling frames
- A list containing all listing units is great if you can get it
  - ok if it includes some ineligibles
- Problems associated with geographic location-based sampling
  - map-based sampling
  - EPI sampling

52 Sampling weights
- Inverse of the net sampling probability
- Interpretation: the sampling weight for a sampled individual is the number of individuals his/her data “represent”

53 Example: sampling weights
- There are 150 employees in a firm
  - stratum 1: 50 employees aged 18-29
  - stratum 2: 100 employees aged 30-69
- We sample 10 from each stratum
- Sampling probabilities are
  - stratum 1: 10/50 = 0.20
  - stratum 2: 10/100 = 0.10

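Taking the weight as the inverse of the sampling probability, the weights for this example can be computed as in the sketch below. The function and variable names are illustrative.

```python
def sampling_weight(n_sampled, stratum_size):
    """Inverse of the sampling probability n_sampled / stratum_size."""
    return stratum_size / n_sampled

# (number sampled, stratum size) for the two age strata in the example.
strata = {"stratum 1 (18-29)": (10, 50), "stratum 2 (30-69)": (10, 100)}
weights = {name: sampling_weight(n, size) for name, (n, size) in strata.items()}

# Each sampled employee "represents" this many employees; the total is the firm size.
total_represented = sum(n * weights[name] for name, (n, _) in strata.items())
for name, w in weights.items():
    print(name, w)
print("employees represented:", total_represented)
```

Summing the weights over all sampled employees recovers the 150 employees in the firm, which is a useful check on any set of sampling weights.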
55 What about non-response?
- 1 employee in the stratum 1 sample and 3 employees in the stratum 2 sample refuse to participate in the survey
- Net sampling probabilities
  - stratum 1: 9/50 = 0.18
  - stratum 2: 7/100 = 0.07

56 Revised sampling weights
- Sampling weights revised for non-response
  - stratum 1: 1/0.18 = 5.56
  - stratum 2: 1/0.07 = 14.29
- This computation is often done by multiplying the original sampling weights by adjustment factors to account for non-response rates

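The adjustment-factor route gives the same revised weights. In this sketch the original weights of 5 and 10 are the inverses of the sampling probabilities 0.20 and 0.10, and the adjustment factor is taken to be (number sampled) / (number responding) within each stratum, which implicitly assumes that non-respondents resemble respondents in the same stratum. The function name is illustrative.

```python
def nonresponse_adjusted_weight(base_weight, n_sampled, n_responded):
    """Multiply the original weight by the inverse of the stratum response rate."""
    adjustment = n_sampled / n_responded
    return base_weight * adjustment

w1 = nonresponse_adjusted_weight(5.0, 10, 9)    # stratum 1: 9 of 10 responded
w2 = nonresponse_adjusted_weight(10.0, 10, 7)   # stratum 2: 7 of 10 responded
print(f"stratum 1: {w1:.2f}")
print(f"stratum 2: {w2:.2f}")
```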
57 Post-stratification weighting
- Define strata, which may or may not have been used as strata in the sampling design
- Compute sampling probabilities = proportion of each stratum that was actually sampled
- Compute sampling weights from these sampling probabilities
- Allows post-hoc treatment of unequal representation of population segments in the sample

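These steps might be sketched as follows. The strata and all counts are hypothetical, and the population counts per stratum are assumed to be known from an outside source such as a census.

```python
def post_stratification_weights(population_counts, respondent_counts):
    """Weight per stratum = 1 / (proportion of the stratum actually sampled)."""
    weights = {}
    for stratum, n_pop in population_counts.items():
        n_resp = respondent_counts[stratum]
        if n_resp == 0:
            raise ValueError(f"no respondents in stratum {stratum!r}")
        sampled_proportion = n_resp / n_pop
        weights[stratum] = 1 / sampled_proportion
    return weights

# Hypothetical counts: women make up 60% of the population but 75% of the respondents.
population_counts = {"women": 600, "men": 400}
respondent_counts = {"women": 90, "men": 30}
ps_weights = post_stratification_weights(population_counts, respondent_counts)
for stratum, w in ps_weights.items():
    print(f"{stratum}: {w:.2f}")
```

The under-represented stratum receives the larger weight, so the weighted sample matches the population's composition across the chosen strata.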
58 Discussion topics
- What is the population of interest?
- Infinite populations
- Selecting random numbers
- Selecting simple random samples
  - from finite populations
  - from infinite populations
- Analysis software for complex surveys