1
**Ch 5: Cluster sampling with equal probabilities**

Ch 5: Equal probability cluster samples (Stat 804, 4/1/2017)

DEFN: A cluster is a group of observation units (or “elements”)

2
Cluster sample DEFN: A cluster sample is a probability sample in which a sampling unit is a cluster

3
**Cluster sample – 2**

1-stage cluster sampling

- Divide the population (of K elements) into N clusters (of size Mi for cluster i)
  - Cluster = group of elements
  - An element belongs to 1 and only 1 cluster
- Sampling unit
  - Cluster = group of elements = PSU = primary sampling unit
  - We’ll start by assuming an SRS of clusters (equal probability)
  - Can use any design to select clusters (STS, PPS); we’ll work with other designs in Ch 6
- Data collection
  - Collect information on ALL elements in the cluster

4
**1-stage CS vs. STS: sample of 40 elements**

- 1-stage CS: a block of cells is a cluster; SU is a cluster; don’t sample from every cluster
- STS: a block of cells is a stratum; SU is an element (or OU); sample from every stratum

5
**Cluster vs. stratified sampling**

- Cluster sample
  - Divide K elements into N clusters
  - Cluster or PSU i has Mi elements
  - Take a sample of n clusters
- Stratified sampling
  - N elements divided into H strata
  - An element belongs to 1 and only 1 stratum
  - Take a sample of n elements, consisting of nh elements from stratum h for each of the H strata

6
**Cluster sample – 3**

2-stage cluster sampling (later)

- Process
  - Select PSUs (stage 1)
  - Select elements within each sampled PSU (stage 2)
- First stage sampling unit is a PSU = primary sampling unit = cluster
- Second stage sampling unit is an SSU = secondary sampling unit = element = OU
- Only collect data on the SSUs that were sampled from the cluster

7
**1-stage vs. 2-stage cluster sampling**

1-stage cluster sample (stop here) OR Stage 1 of 2-stage cluster sample (select PSUs) Stage 2 of 2-stage cluster sample (select SSUs w/in PSUs)

8
**Why use cluster sampling?**

- May not have a list of OUs for a frame, but a list of clusters may be available
  - List of Lincoln phone numbers (= group of residents) is available, but a list of Lincoln residents is not available
  - List of all NE primary and secondary schools (= group of students) is available, but a list of all students in NE schools is not available
- May be cheaper to conduct the study if OUs are clustered
  - Occurs when cost of data collection increases with distance between elements
  - Household surveys using in-person interviews (household = cluster of people)
  - Field data collection (plot = cluster of plants, or animals)

9
**Defining clusters due to frame limitations**

A cluster (or PSU) is a group of elements corresponding to a record (row) in the frame Example Population = employees in McDonald’s franchises Element = employee Frame = list of McDonald’s stores PSU = store = cluster of employees

10
**Defining clusters to reduce travel costs**

A cluster (or PSU) is a group of nearby elements Example Population = all farms Element = farm Frame = list of sections (1 mi x 1 mi areas) in rural area PSU = section = cluster of farms

11
**Cluster samples usually lead to less precise estimates**

- Elements within clusters tend to be correlated due to exposure to similar conditions
  - Members of a household
  - Employees in a business
  - Plants or soil within a field plot
- We are getting less information than if we selected the same number of unrelated elements
- Example: select sample of city blocks (clusters of households)
  - Ask each household: Should city upgrade storm sewer system?
  - PSU (city block) 1: no storm sewer, so households will tend to say yes
  - PSU (city block) 2: new development, so households will tend to say no

12
**Defining clusters for improved precision**

- Define clusters for which within-cluster variation is high (rarely possible)
  - Make each cluster as heterogeneous as possible
  - Like making each cluster a mini-population that reflects variation in population
  - Minimizes the amount of correlation among elements in the cluster
  - Opposite of the approach to stratification (large variation among strata, homogeneous within strata)
- Define clusters that are relatively small
  - Extreme case is cluster = element
  - Decreases the number of correlated observations in the sample

13
**Example for single-stage cluster sampling w/ equal prob (CSE1)**

- Dorm has N = 100 suites (clusters)
- Each suite has Mi = 4 students (4 elements in cluster i, i = 1, 2, … , N)
  - Note that there are K = N x Mi = 100 x 4 = 400 students in the dorm
- Take SRS of n = 5 suites (clusters)
- Ask each student living in each of the 5 suites: How many nights per week do you eat dinner in the dining hall?
- Will get observations from a sample of 20 students = 5 suites x 4 students/suite

14
**Dorm example – 2**

| Student | Suite 6 | Suite 21 | Suite 28 | Suite 54 | Suite 89 |
|---|---|---|---|---|---|
| 1 to 4 | … | … | … | … | … |
| Total | 20 | 14 | 19 | 21 | 10 |

15
**Dorm example – 3**

- SRS of n = 5 dorm rooms
- Data on each cluster (all students in dorm room)
  - ti = total number of dining hall dinners for dorm room i
  - t2 = 14 dining hall dinners for 4 students in dorm room 2
- Estimated total number of dining hall nights for the dorm students
  - HT estimator of total = pop size x sample mean (of cluster totals)
  - $\hat{t} = 100 \times (20 + 14 + 19 + 21 + 10)/5 = 100 \times 16.8 = 1680$

16
**Notation**

- Indices
  - i = index for PSU i
  - j = index for SSU j in PSU i
- Number of PSUs (clusters) in the population: N clusters
- Number of SSUs (elements) in a PSU (cluster): Mi elements
- Number of SSUs (elements) in the population: $K = \sum_{i=1}^{N} M_i$
  - In Chapters 1-4, this was designated as N

17
**Notation – 2**

[Figure: population of N = 12 PSUs (i = 1, …, 12) containing K = 20 + 12 + … + 9 + 16 = 150 SSUs; M1 = 20, M2 = 12, M11 = 9, M12 = 16; two SSUs are labeled as examples, (i = 9, j = 1) and (i = 9, j = 7)]

18
**Notation – 3**

- Response variable for SSU j in PSU i: yij
  - e.g., age of j-th resident in household i
  - e.g., whether or not dorm resident j in room i owns a computer

19
**Cluster-level population parameters (for cluster i )**

- Cluster size = Mi elements
- Cluster population total: $t_i = \sum_{j=1}^{M_i} y_{ij}$
- Note that we observe the cluster population total (or mean or variance) for each sample cluster in 1-stage cluster sampling
- We will estimate cluster parameters in 2-stage cluster sampling

20
**Cluster-level population parameters (for cluster i ) – 2**

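- Cluster population mean: $\bar{y}_{iU} = t_i / M_i$, where $t_i$ is the cluster population total
- Within-cluster variance: $S_i^2 = \frac{1}{M_i - 1}\sum_{j=1}^{M_i} (y_{ij} - \bar{y}_{iU})^2$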

21
[Figure: population and 1-stage cluster sample]

22
**Cluster-level population parameters (for cluster i ) – 3**

For 1-stage cluster samples Have a complete enumeration of the cluster elements Cluster population parameters are known For 2-stage cluster samples Observe data on a sample of elements in a cluster Estimate cluster population parameters

23
**Population parameters**

- Same parameters as in previous chapters, rewritten in notation for cluster sampling
- Population size: $K = \sum_{i=1}^{N} M_i$ (K was referred to as N in previous chapters)
- Population total (sum of all cluster totals): $t = \sum_{i=1}^{N} t_i = \sum_{i=1}^{N}\sum_{j=1}^{M_i} y_{ij}$

24
**Population Parameters-2**

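- Population mean (of K elements): $\bar{y}_U = t / K$
- Population variance (among K elements): $S^2 = \frac{1}{K-1}\sum_{i=1}^{N}\sum_{j=1}^{M_i}(y_{ij}-\bar{y}_U)^2$
- Variance among N cluster totals: $S_t^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(t_i - \frac{t}{N}\right)^2$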

25
**Data from cluster samples**

Work with element and cluster-level data Element data set will have columns for Cluster id Element id within cluster Variable (y) Will also summarize this data set to generate cluster parameters (1-stage) or estimates of cluster parameters (2-stage) Cluster total (or estimate) Cluster mean (or estimate) Cluster variance (or estimate)

26
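**1-stage cluster sample**

Element data

| i | j | yij |
|---|---|---|
| 1 | 1 | y11 |
| 1 | 2 | y12 |
| 1 | 3 | y13 |
| 1 | 4 | y14 |
| 2 | 1 | y21 |
| 2 | 2 | y22 |
| 2 | 3 | y23 |
| 3 | 1 | y31 |
| … | … | … |

Cluster summary

| i | ti |
|---|---|
| 1 | t1 |
| 2 | t2 |
| 3 | t3 |
| … | … |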
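The step from the element data set to the cluster summary can be sketched in a few lines of Python. The records below are hypothetical values invented for illustration, and `summarize_clusters` is an assumed helper name, not something from the course materials.

```python
from collections import defaultdict
from statistics import mean, variance

# Hypothetical element-level records: (cluster id i, element id j, response y_ij).
element_data = [
    (1, 1, 5), (1, 2, 6), (1, 3, 4), (1, 4, 5),
    (2, 1, 2), (2, 2, 3), (2, 3, 4),
    (3, 1, 7), (3, 2, 5),
]

def summarize_clusters(records):
    """Return {cluster id: (number of elements, total, mean, variance)}."""
    by_cluster = defaultdict(list)
    for i, _j, y in records:
        by_cluster[i].append(y)
    summary = {}
    for i, ys in sorted(by_cluster.items()):
        # variance() divides by (count - 1), so it needs at least two elements.
        s2 = variance(ys) if len(ys) > 1 else 0.0
        summary[i] = (len(ys), sum(ys), mean(ys), s2)
    return summary

cluster_summary = summarize_clusters(element_data)
```

In a 1-stage sample every element of a sampled cluster is observed, so these summaries are the cluster parameters themselves. In a 2-stage sample the same computation on the sampled elements gives estimates only.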

27
**Estimation for CSE1**

- Chapter reading
  - Section covers equal sized clusters (Mi constant, read)
  - We’ll start with (unequal sized clusters, Mi varies)
  - Section covers theory
- Two types of estimators
  - Unbiased: HT estimator
  - Ratio estimation
- Equal probability sample of clusters: assume SRS of clusters

28
**CSE1 unbiased estimation under SRS – total t**

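- Estimator for population total using data collected from a 1-stage cluster sample (SRS of clusters): $\hat{t}_{unb} = \frac{N}{n}\sum_{i \in S} t_i$
- Estimator of variance of $\hat{t}_{unb}$: $\hat{V}(\hat{t}_{unb}) = N^2\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}$, where $s_t^2 = \frac{1}{n-1}\sum_{i\in S}\left(t_i - \frac{\hat{t}_{unb}}{N}\right)^2$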

29
Dorm example – 4 Estimated population total Estimated variance
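As a check on the arithmetic, here is a minimal Python sketch of the unbiased CSE1 estimator applied to the five suite totals given earlier (20, 14, 19, 21, 10). It assumes the standard SRS-of-clusters formulas, total = N x mean of cluster totals and variance = N² (1 − n/N) s_t² / n; the function name is an assumption for illustration.

```python
from math import sqrt

def cse1_unbiased_total(cluster_totals, N):
    """Unbiased estimate of the population total and its estimated variance
    for an SRS of n clusters out of N, with every element observed."""
    n = len(cluster_totals)
    t_bar = sum(cluster_totals) / n                  # sample mean of cluster totals
    t_hat = N * t_bar                                # pop size x sample mean
    s2_t = sum((t - t_bar) ** 2 for t in cluster_totals) / (n - 1)
    var_hat = N ** 2 * (1 - n / N) * s2_t / n        # includes the fpc (1 - n/N)
    return t_hat, var_hat

dorm_totals = [20, 14, 19, 21, 10]   # suite totals from the dorm example
t_hat, var_hat = cse1_unbiased_total(dorm_totals, N=100)
se = sqrt(var_hat)
```

With these totals the sample variance of the suite totals is 21.7, giving an estimated total of 1680 dinners, an estimated variance of 41,230, and a standard error of about 203.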

30
**CSE1 inclusion probability for an element**

- Two events: A and B
  - Pr{A and B both occur} = Pr{A occurs} x Pr{B occurs given A occurs}
- In our setting
  - A = sample cluster i
  - B = sample element j (in cluster i)
- Inclusion probability for element j in cluster i
  - πij = Pr{including element j and cluster i in sample} = Pr{including cluster i in sample} x Pr{including element j given cluster i has been included in sample}

31
**CSE1 inclusion probability for an element – 2**

- Need two pieces
  - Pr{including cluster i in sample} = n / N
  - Pr{including element j given cluster i has been included in sample} = 1
- Inclusion probability
  - πij = Pr{including element j and cluster i in sample} = Pr{including cluster i in sample} x Pr{including element j given cluster i has been included in sample} = (n / N) x 1 = n / N

32
**CSE1 weight for an element**

- Weight for element j in cluster i
  - Inverse element inclusion probability: wij = 1/πij = N/n
- Estimator using weights: $\hat{t} = \sum_{i \in S}\sum_{j=1}^{M_i} w_{ij}\, y_{ij}$

33
**Dorm example – 5**

- Inclusion probability for student j in dorm room i
  - N = 100 dorm rooms, n = 5 sample dorm rooms
  - Take all 4 students in dorm room
  - πij = n / N = 5/100 = 1/20 = 0.05
- Weight for student j in dorm room i
  - wij = N / n = 20 students
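The weight calculation can be written out as a small Python sketch. Because every sampled student carries the same weight N/n, the weighted sum over students reduces to the weight times the sum of the suite totals, so the suite totals from the earlier slide (20, 14, 19, 21, 10) are enough. The function name is an assumption for illustration.

```python
def cse1_weight(n, N):
    """Sampling weight of any element when n of N clusters are drawn by SRS
    and every element of a sampled cluster is observed (conditional prob. 1)."""
    return N / n

w = cse1_weight(n=5, N=100)          # each sampled student represents w students

dorm_totals = [20, 14, 19, 21, 10]   # suite totals from the dorm example
# Sum of w * y_ij over all 20 sampled students; w is constant, so it factors out.
t_hat_weighted = w * sum(dorm_totals)
```

The weighted estimate agrees with N x (sample mean of cluster totals), since both equal (N/n) times the sum of the sampled cluster totals.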

34
**CSE1 unbiased estimation under SRS – mean**

- Unbiased estimator for population mean: $\hat{\bar{y}}_{unb} = \hat{t}_{unb} / K$
  - For SRS, estimator for total divided by number of population elements (OUs)
  - Units are y-units per element

35
Dorm example – 6
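A sketch of the mean calculation for the dorm data, assuming the unbiased mean is the estimated total divided by K = 100 x 4 = 400 students and that its standard error is the standard error of the total divided by K. The suite totals are the ones given earlier; the function name is an assumption.

```python
from math import sqrt

def cse1_unbiased_mean(cluster_totals, N, K):
    """Unbiased estimate of the mean per element and its standard error
    for a 1-stage SRS of clusters, given the number of population elements K."""
    n = len(cluster_totals)
    t_bar = sum(cluster_totals) / n
    t_hat = N * t_bar
    s2_t = sum((t - t_bar) ** 2 for t in cluster_totals) / (n - 1)
    var_t_hat = N ** 2 * (1 - n / N) * s2_t / n
    # Dividing the total by the known constant K scales the SE by 1 / K.
    return t_hat / K, sqrt(var_t_hat) / K

y_bar_hat, se_y_bar = cse1_unbiased_mean([20, 14, 19, 21, 10], N=100, K=400)
```

This gives an estimated 4.2 dining hall dinners per student per week with a standard error of about 0.51.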

36
**Unbiased estimation – proportion p**

What is y ?

37
**Ratio estimation**

- Usually ti (cluster total) is correlated with Mi (cluster size)
  - As Mi (# SSUs/elements in cluster i) increases, value for ti (total of yij for cluster i) increases
  - Positive correlation between Mi and ti
  - No intercept
  - Perfect conditions for SRS ratio estimator

| Earlier ratio estimation notation | Notation of Ch 5 |
|---|---|
| yi (variable of interest) | ti (cluster total) |
| xi (auxiliary info) | Mi (cluster size) |

38
**Ratio estimation for CSE1**

- Estimator for population mean: $\hat{\bar{y}}_r = \frac{\sum_{i\in S} t_i}{\sum_{i\in S} M_i}$
- Units are y-units per element

39
**Ratio estimation for CSE1 – 2**

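- Estimator for variance of ratio estimator of population mean: $\hat{V}(\hat{\bar{y}}_r) = \left(1-\frac{n}{N}\right)\frac{1}{n\bar{M}^2}\,\frac{\sum_{i\in S}(t_i - \hat{\bar{y}}_r M_i)^2}{n-1}$
- $\bar{M} = K/N$ is the average cluster size for the population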

40
**Ratio estimation for CSE1 – 3**

- Average cluster size: $\bar{M} = K/N$
- If unknown, can estimate with the sample mean of cluster sizes, $\frac{1}{n}\sum_{i\in S} M_i$

41
Dorm example – 7 Estimated population mean Average cluster size

42
Dorm example – 8 Estimated variance
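The ratio estimator of the mean and its estimated variance can be sketched as follows for the dorm data (suite totals 20, 14, 19, 21, 10 and Mi = 4 for every suite). The formulas assumed are the standard SRS ratio-estimation forms: the mean is the sum of totals over the sum of sizes, and the variance is (1 − n/N) times the residual sum of squares over (n − 1) n M̄². The function name and the `M_bar` argument are illustrative assumptions.

```python
from math import sqrt

def cse1_ratio_mean(cluster_totals, cluster_sizes, N, M_bar=None):
    """Ratio estimate of the population mean and its estimated variance.
    If the population average cluster size M_bar is not supplied, the
    sample mean of the cluster sizes is used in its place."""
    n = len(cluster_totals)
    y_hat_r = sum(cluster_totals) / sum(cluster_sizes)
    if M_bar is None:
        M_bar = sum(cluster_sizes) / n
    # Residuals of cluster totals about the fitted line through the origin.
    resid_ss = sum((t - y_hat_r * M) ** 2
                   for t, M in zip(cluster_totals, cluster_sizes))
    var_hat = (1 - n / N) * resid_ss / ((n - 1) * n * M_bar ** 2)
    return y_hat_r, var_hat

y_hat_r, var_r = cse1_ratio_mean([20, 14, 19, 21, 10], [4, 4, 4, 4, 4], N=100, M_bar=4)
se_r = sqrt(var_r)
```

Because all suites have the same size, the ratio estimate (4.2, standard error about 0.51) coincides with the unbiased estimate; the two differ only when Mi varies.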

43
**Ratio estimation for CSE1 – 4**

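- Estimator for population total: $\hat{t}_r = K\,\hat{\bar{y}}_r$, where $\hat{\bar{y}}_r$ is the ratio estimator of the population mean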

44
Dorm example – 9 Estimated population total Estimated variance

45
**CSE1: impact of cluster size**

If cluster sizes Mi are variable across clusters, generally estimate population parameter with less precision If ti is related to Mi , then get large variation among cluster totals if Mi is variable Variance of population parameter estimator (unbiased or ratio) is a function of variation among cluster totals

46
**2-stage equal probability cluster sampling (CSE2)**

- CSE2 has 2 stages of sampling
  - Stage 1. Select SRS of n PSUs from population of N PSUs
  - Stage 2. Select SRS of mi SSUs from Mi elements in each PSU i sampled in stage 1
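The two selection stages can be sketched with Python's standard `random` module. The frame here is the dorm from the earlier example (100 suites of 4 students); the function name, the `m_for` rule for choosing mi, and the seed are assumptions for illustration.

```python
import random

def select_cse2_sample(cluster_sizes, n, m_for, seed=None):
    """cluster_sizes: {PSU id: M_i}; m_for: function mapping M_i to m_i.
    Returns {sampled PSU id: sorted element indices (1..M_i) sampled in it}."""
    rng = random.Random(seed)
    # Stage 1: SRS of n PSUs, without replacement.
    psus = rng.sample(sorted(cluster_sizes), n)
    sample = {}
    for i in psus:
        M_i = cluster_sizes[i]
        m_i = min(m_for(M_i), M_i)
        # Stage 2: SRS of m_i SSUs within sampled PSU i, without replacement.
        sample[i] = sorted(rng.sample(range(1, M_i + 1), m_i))
    return sample

dorm = {i: 4 for i in range(1, 101)}       # 100 suites, 4 students each
sample = select_cse2_sample(dorm, n=5, m_for=lambda M: 2, seed=804)
```

Passing `m_for=lambda M: M` recovers a 1-stage cluster sample, since every element of each sampled PSU is then taken.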

47
**2-stage cluster sampling**

Stage 1 of 2-stage cluster sample (select PSUs) Stage 2 of 2-stage cluster sample (select SSUs w/in PSUs)

48
**Motivation for 2-stage cluster samples**

Recall motivations for cluster sampling in general Only have access to a frame that lists clusters Reduce data collection costs by going to groups of nearby elements (cluster defined by proximity)

49
**Motivation for 2-stage cluster samples – 2**

- Likely that elements in cluster will be correlated
  - May be inefficient to observe all elements in a sample PSU
  - Extra effort required to fully enumerate a PSU does not generate that much extra information
- May be better to spend resources to sample many PSUs and a small number of SSUs per PSU
  - Possible opposing force: study costs associated with going to many clusters

50
**CSE2 unbiased estimation for population total t**

- Have a sample of elements from a cluster
- We no longer know the value of the cluster parameter, ti
- Estimate ti using data observed for mi SSUs: $\hat{t}_i = M_i\,\bar{y}_i = \frac{M_i}{m_i}\sum_{j\in S_i} y_{ij}$

51
**CSE2 unbiased estimation for population total – 2**

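- Approach is to plug estimated cluster totals into CSE1 formula
  - CSE1: $\hat{t}_{unb} = \frac{N}{n}\sum_{i\in S} t_i$
  - CSE2: $\hat{t}_{unb} = \frac{N}{n}\sum_{i\in S}\hat{t}_i$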

52
**CSE2 unbiased estimation for population total – 3**

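- The variance of $\hat{t}_{unb}$ has 2 components associated with the 2 sampling stages
  1. Variation among PSUs
  2. Variation among SSUs within PSUs
- $V(\hat{t}_{unb}) = \underbrace{N^2\left(1-\frac{n}{N}\right)\frac{S_t^2}{n}}_{\text{among PSU}} + \underbrace{\frac{N}{n}\sum_{i=1}^{N}\left(1-\frac{m_i}{M_i}\right)M_i^2\,\frac{S_i^2}{m_i}}_{\text{within PSU}}$, where $S_t^2$ is the variance among cluster totals and $S_i^2$ is the within-cluster variance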

53
**CSE2 unbiased estimation for population total – 4**

- In CSE1, we observe all elements in a cluster
  - We know ti
  - Have variance component 1, but no component 2
- In CSE2, we sample a subset of elements in a cluster
  - We estimate ti with $\hat{t}_i$
  - Component 2 is a function of the estimated variance of $\hat{t}_i$

54
**CSE2 unbiased estimation for population total – 5**

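- Estimated variance among cluster totals: $s_t^2 = \frac{1}{n-1}\sum_{i\in S}\left(\hat{t}_i - \frac{\hat{t}_{unb}}{N}\right)^2$
- Estimated variance among elements in a cluster: $s_i^2 = \frac{1}{m_i-1}\sum_{j\in S_i}(y_{ij}-\bar{y}_i)^2$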
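The two variance components can be computed with a short Python sketch. It assumes the usual CSE2 formulas: each cluster total is estimated as Mi times the sample mean in the cluster, the among-PSU term is N² (1 − n/N) s_t² / n, and the within-PSU term is (N/n) times the sum of (1 − mi/Mi) Mi² si² / mi. The data are hypothetical (3 PSUs sampled from N = 10, each with Mi = 4 and mi = 2), and the function name is an assumption.

```python
from statistics import mean, variance

def cse2_unbiased_total(samples, cluster_sizes, N):
    """samples: one list of observed y_ij per sampled PSU (each with m_i >= 2);
    cluster_sizes: the matching M_i. Returns (t_hat, among_psu, within_psu)."""
    n = len(samples)
    # Estimated cluster totals: M_i times the sample mean within PSU i.
    t_hats = [float(M) * mean(ys) for ys, M in zip(samples, cluster_sizes)]
    t_hat = N / n * sum(t_hats)
    s2_t = variance(t_hats)                      # divisor (n - 1)
    among_psu = N ** 2 * (1 - n / N) * s2_t / n
    within_psu = N / n * sum(
        (1 - len(ys) / M) * M ** 2 * variance(ys) / len(ys)
        for ys, M in zip(samples, cluster_sizes)
    )
    return t_hat, among_psu, within_psu

# Hypothetical data, for illustration only.
samples = [[5, 6], [2, 3], [4, 6]]
t_hat, among_psu, within_psu = cse2_unbiased_total(samples, [4, 4, 4], N=10)
var_hat = among_psu + within_psu
```

In this toy example the among-PSU term (about 964) dominates the within-PSU term (40), the typical pattern when elements within a cluster are alike and clusters differ.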

55
**CSE2 unbiased estimation for population total – 6**

56
**Dorm example – 10**

Stage 2: select 2 students in each room

[Table: responses of the 2 sampled students in each sampled room; the room totals are now unknown (Total = ?)]

57
**Dorm example – 11**

- Stage 1: Cluster = , N = , n = , SRS
- Stage 2: Element = , Mi =

58
**Dorm example – 12**

[Table: responses of sampled students (j) by room: Rm 6 (i=1), Rm 21 (i=2), …]

59
Dorm example – 13

60
Dorm example – 14

61
**CSE2 unbiased estimation for population mean**

62
Dorm example – 15

63
**CSE2 inclusion probability for an element**

- Two events: A and B
  - Pr{A and B both occur} = Pr{A occurs} x Pr{B occurs | A occurs}
  - “|” denotes “given” (a condition)
- In our setting
  - A = sample cluster i
  - B = sample element j
- Inclusion probability symbols
  - πij = Pr{including element j and cluster i in sample}
  - πi = Pr{including cluster i in sample}
  - πj|i = Pr{including element j | cluster i has been included in sample}

64
**CSE2 inclusion probability for an element – 2**

- Need two pieces
  - πi = Pr{including cluster i in sample} = n / N
  - πj|i = Pr{including element j | cluster i has been included in sample} = mi / Mi
- Inclusion probability for element j in cluster i
  - πij = πi x πj|i = (n / N) x (mi / Mi)

65
**CSE2 weight for an element**

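- Sampling weight for element j in cluster i: wij = 1/πij = (N / n) x (Mi / mi)
- Estimator for population total: $\hat{t}_{unb} = \sum_{i\in S}\sum_{j\in S_i} w_{ij}\,y_{ij}$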
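A small Python sketch of the CSE2 weight, using the 2-stage dorm design (n = 5 of N = 100 rooms, mi = 2 of Mi = 4 students). The weight formula assumed is the inverse of (n/N)(mi/Mi). The student responses below are hypothetical, since the slide's data table did not survive, and the function names are assumptions.

```python
def cse2_weight(n, N, m_i, M_i):
    """Inverse of the unconditional inclusion probability (n / N) * (m_i / M_i)."""
    return (N / n) * (M_i / m_i)

def weighted_total(sampled, n, N):
    """sampled: list of (M_i, [y_ij observed in PSU i]). Sums w_ij * y_ij."""
    return sum(
        cse2_weight(n, N, len(ys), M_i) * y
        for M_i, ys in sampled
        for y in ys
    )

w_dorm = cse2_weight(n=5, N=100, m_i=2, M_i=4)   # (100 / 5) * (4 / 2)

# Hypothetical responses for 2 sampled students in each of 5 rooms.
sampled = [(4, [5, 6]), (4, [2, 3]), (4, [4, 6]), (4, [3, 3]), (4, [1, 2])]
t_hat = weighted_total(sampled, n=5, N=100)
```

Each sampled student represents 40 students here, twice the 1-stage weight of 20, because only half of each sampled room is observed.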

66
**What does equal probability mean in Ch 5?**

- Clusters (PSUs) sampled using SRS
- Equal inclusion probability for stage 1 PSUs (clusters)
  - πi is the same for all i

67
**What does equal probability mean in Ch 5? – 2**

- Elements (SSUs) in a given PSU are sampled using SRS
- All elements (j) in a sample PSU (i) are selected with equal probability
  - This is a conditional probability (given PSU i)
  - For a given PSU i, πj|i is the same for all elements j

68
**What does equal probability mean in Ch 5? – 3**

- Note that equal probability at stage 1 (πi) plus equal probability at stage 2 given PSU i (πj|i) does NOT imply equal inclusion probability for an element
- In fact, element-level (unconditional) inclusion probability πij is not necessarily constant
  - Depends on cluster size Mi and sample size mi for the cluster to which the element belongs
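A quick numeric illustration of this point, assuming the unconditional inclusion probability is (n/N)(mi/Mi). The design below is hypothetical: n = 10 of N = 100 PSUs, with a fixed mi = 2 elements taken from clusters of different sizes.

```python
def cse2_inclusion_prob(n, N, m_i, M_i):
    """Unconditional inclusion probability: (n / N) * (m_i / M_i)."""
    return (n / N) * (m_i / M_i)

# Hypothetical design: same stage-1 probability, same m_i, different M_i.
cluster_sizes = [4, 8, 10]
probs = [cse2_inclusion_prob(10, 100, 2, M) for M in cluster_sizes]
```

Both stages use SRS, yet an element in the cluster of 4 is 2.5 times as likely to be sampled as an element in the cluster of 10.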

69
**CSE2 ratio estimation for population mean**

70
**CSE2 ratio estimation for population mean – 2**

71
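Dorm example – 16

[Table: responses of the 2 sampled students (j) in each of Rm 6 (i=1), Rm 21 (i=2), Rm 28 (i=3), Rm 54 (i=4), Rm 89 (i=5), with per-room sample mean, estimated total, and sample variance. Surviving values: means 5.5, 2.5, 4.5, 3.0; estimated totals 22, 10, 18, 12; variances 0.5, 2.0]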

72
Dorm example – 16

73
Dorm example – 17

74
**CSE2 ratio estimation for population total t**

75
Dorm example – 18

76
**Coots egg example**

- Target pop = American coot eggs in Minnedosa, Manitoba
- PSU / cluster = clutch (nest)
- SSU / element = egg w/in clutch
- Stage 1
  - SRS of n = 184 clutches
  - N = ??? clutches, but probably pretty large
- Stage 2
  - SRS of mi = 2 from Mi eggs in a clutch
  - Do not know K = ??? eggs in population, also large
  - Can count Mi = # eggs in sampled clutch i
- Measurement
  - yij = volume of egg j from clutch i

77
**Coots egg example – 2**

- Scatter plot of volumes vs. i (clutch id)
  - Double dot pattern: high correlation among eggs WITHIN a clutch
  - Quite a bit of clutch to clutch variation
- Implies
  - May not have very high precision unless we sample a large number of clutches
  - Certainly lower precision than if we obtained an SRS of eggs
- Could use a side-by-side plot for data with larger cluster sizes (PROC UNIVARIATE w/ BY CLUSTER and PLOTS option)

78
**Coots egg example – 3**

- Plot
  - Rank the mean egg volume for clutch i, $\bar{y}_i$
  - Plot yij vs. rank for clutch i (i sorted by $\bar{y}_i$)
  - Draw a line between yi1 and yi2 to show how close the 2 egg volumes in a clutch are
- Observations
  - Same results as Fig 5.3, but more clear
  - Small within-cluster variation
  - Large between-cluster variation
  - Also see 1 clutch with large WITHIN clutch variation: check data (i = 88)

79
**Coots egg example – 4**

- Plot si vs. $\bar{y}_i$ for clutch i
- Since volumes are always positive, might expect si to increase as $\bar{y}_i$ gets larger
  - If $\bar{y}_i$ is very small, yi1 and yi2 are likely to be very small and close, giving a small si
  - See this to a moderate degree
- Clutch 88 has large si, as noted in previous plot

80
**Coots egg example – 5**

- Estimation goal
  - Estimate $\bar{y}_U$, population mean volume per coot egg in Minnedosa, Manitoba
- What estimator?
  - Unbiased estimation: don’t know N = total number of clutches or K = total number of eggs in Minnedosa, Manitoba
  - Ratio estimation: only requires knowledge of Mi, number of eggs in selected clutch i, in addition to data collected
  - May want to plot the estimated clutch totals $\hat{t}_i$ versus Mi

81
Coots egg example – 6

82
**Coots egg example – 7**

- Don’t know $\bar{M}$ (average clutch size in the population): use the sample mean of the Mi
- Don’t know N, but assumed large: FPC ≈ 1
- 2nd term is very small, so approximate SE ignores 2nd term

83
**Coots egg example – 8**

- What is first-stage PSU inclusion probability?
- What is conditional SSU inclusion probability at second stage?
- What is unconditional SSU inclusion probability?

84
**CSE2: Unbiased vs. ratio estimation**

- Unbiased estimator can have poor precision if
  - Cluster sizes (Mi) are unequal
  - ti (cluster total) is roughly proportional to Mi (cluster size)
- Biased (ratio) estimator can be precise if ti is roughly proportional to Mi
  - This happens frequently in populations where cluster sizes (Mi) vary

85
**CSE2: Self-weighting design**

- Stage 1: Select n PSUs from N PSUs in pop using SRS
  - Inclusion probability for PSU i: πi = n / N
- Stage 2: Choose mi proportional to Mi so that mi / Mi = c is constant, use SRS to select sample
  - Inclusion probability for SSU j given PSU i: πj|i = mi / Mi = c
- Unconditional inclusion probability for SSU j in cluster i is constant for all elements: πij = πi x πj|i = (n / N) x c
- Inclusion probability may vary in practice because it may not be possible for mi / Mi to be equal to c for all clusters
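The allocation rule and its practical limitation can be sketched in Python. The cluster sizes and the sampling fraction c = 0.25 are hypothetical, and rounding c x Mi to the nearest integer (with a minimum of 1) is one assumed way of handling non-integer sample sizes, not a rule from the slides.

```python
def self_weighting_sizes(cluster_sizes, c):
    """Stage-2 sample sizes m_i = c * M_i, rounded to an integer of at least 1."""
    return [max(1, round(c * M)) for M in cluster_sizes]

# Case 1: c * M_i is an integer for every cluster, so m_i / M_i is exactly c.
sizes_even = [20, 12, 8, 16]
m_even = self_weighting_sizes(sizes_even, c=0.25)
ratios_even = [m / M for m, M in zip(m_even, sizes_even)]

# Case 2: rounding is needed, so m_i / M_i is only approximately constant.
sizes_uneven = [10, 9]
m_uneven = self_weighting_sizes(sizes_uneven, c=0.25)
ratios_uneven = [m / M for m, M in zip(m_uneven, sizes_uneven)]
```

In the first case every element has inclusion probability (n/N) x 0.25. In the second, both clusters get mi = 2, so the sampling fractions are 0.20 and about 0.22 and the design is only approximately self-weighting.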

86
**Self-weighting designs in general**

Why are self-weighting samples appealing? Are dorm student or coot egg samples self-weighting 2-stage cluster samples? What other (non-cluster) self-weighting designs have we discussed?

87
**Self-weighting designs in general – 2**

- What is the caveat for variance estimation in self-weighting samples?
  - No break on variance of estimator: must use proper formula for design
- Why are self-weighting samples appealing?
  - Simple mean estimator
  - Homogeneous weights tend to make estimates more precise

88
**Return to systematic sampling (SYS)**

- Have a frame, or list of N elements
- Determine sampling interval, k
  - k is the next integer after N/n
- Select first element in the list
  - Choose a random number, R, between 1 & k
  - R-th element is the first element to be included in the sample
- Select every k-th element after the R-th element
  - Sample includes element R, element R + k, element R + 2k, … , element R + (n-1)k
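The SYS procedure can be sketched as a short Python function. It assumes elements are numbered 1 to N, takes k as N/n rounded up, and stops at the end of the frame; the function name and the optional `R` and `seed` arguments are illustrative assumptions.

```python
import math
import random

def systematic_sample(N, n, R=None, seed=None):
    """Return the element numbers (1..N) of a systematic sample.
    k is N / n rounded up; R is drawn uniformly from 1..k unless supplied."""
    k = math.ceil(N / n)
    if R is None:
        R = random.Random(seed).randint(1, k)
    # Every k-th element starting at R, stopping at the end of the frame.
    return list(range(R, N + 1, k))

# Frame of 500 members, target of 75 calls, random start R = 5.
sys_sample = systematic_sample(N=500, n=75, R=5)
```

With N = 500 and n = 75 the interval is k = 7, and the start R = 5 yields elements 5, 12, 19, …, 495. Because k was rounded up, the frame runs out after 71 elements, short of the target of 75.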

89
**SYS example**

- Telephone survey of members in an organization about organization’s website use
  - N = 500 members
  - Have resources to do n = 75 calls
  - N / n = 500/75 = 6.67, so k = 7
- Random number table entry:
  - Rule: if pick 1, 2, …, 7, assign as R; otherwise discard #
  - Select R = 5
- Take element 5, then element 5+7 = 12, then element 12+7 = 19, 26, 33, 40, 47, …

90
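**SYS – 2**

Arrange population in rows of length k = 7

| i | R = 1 | R = 2 | R = 3 | R = 4 | R = 5 | R = 6 | R = 7 |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| 3 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| 4 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
| … | … | … | … | … | … | … | … |
| 71 | 491 | 492 | 493 | 494 | 495 | 496 | 497 |
| 72 | 498 | 499 | 500 | | | | |

Many samples have no chance of being selected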

91
**Relationship between SYS and cluster sampling**

- Design relationships
  - Element = ?
  - Cluster = ?
  - Sampling unit(s) = ?
  - Cluster sampling design = ?
- Relationship between frame ordering and expected precision of an estimate from a cluster sample?
  - Periodic, where cycle of pattern is coincident with sampling interval k
  - Ordered by X, which is correlated with response variable Y
  - Random

92
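**SYS – 3**

- Suppose X [age of member] is correlated with Y [use of org website]
- Sort list by X before selecting sample

| X | i | R = 1 | R = 2 | R = 3 | R = 4 | R = 5 | R = 6 | R = 7 |
|---|---|---|---|---|---|---|---|---|
| young | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| | 2 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| | 3 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| | 4 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
| mid | … | … | … | … | … | … | … | … |
| | 71 | 491 | 492 | 493 | 494 | 495 | 496 | 497 |
| old | 72 | 498 | 499 | 500 | | | | |

Many samples have no chance of being selected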
