
Published by Malcolm Heron. Modified over 2 years ago.

1
1 Ch 5: Cluster sampling with equal probabilities. DEFN: A cluster is a group of observation units (or elements).

2
2 Cluster sample. DEFN: A cluster sample is a probability sample in which a sampling unit is a cluster.

3
3 Cluster sample – 2. 1-stage cluster sampling: divide the population (of K elements) into N clusters (of size $M_i$ for cluster i). A cluster is a group of elements; an element belongs to one and only one cluster. Sampling unit: cluster = group of elements = PSU (primary sampling unit). We'll start by assuming an SRS of clusters (equal probability). Any design can be used to select clusters (STS, PPS); we'll work with other designs in Ch 6. Data collection: collect information on ALL elements in each sampled cluster.

4
4 1-stage CS vs. STS (sample of 40 elements in each design). 1-stage CS: a block of cells is a cluster; the SU is a cluster; don't sample from every cluster. STS: a block of cells is a stratum; the SU is an element (or OU); sample from every stratum.

5
5 Cluster vs. stratified sampling. Cluster sample: divide K elements into N clusters; cluster (PSU) i has $M_i$ elements; take a sample of n clusters. Stratified sample: N elements are divided into H strata; an element belongs to one and only one stratum; take a sample of n elements, consisting of $n_h$ elements from stratum h for each of the H strata.

6
6 Cluster sample – 3. 2-stage cluster sampling (later). Process: select PSUs (stage 1), then select elements within each sampled PSU (stage 2). The first-stage sampling unit is a PSU = primary sampling unit = cluster. The second-stage sampling unit is an SSU = secondary sampling unit = element = OU. Only collect data on the SSUs that were sampled from each sampled cluster.

7
7 1-stage vs. 2-stage cluster sampling. [Figure panels: (a) 1-stage cluster sample (stop here), or stage 1 of a 2-stage cluster sample (select PSUs); (b) stage 2 of a 2-stage cluster sample (select SSUs within PSUs).]

8
8 Why use cluster sampling? (1) There may be no list of OUs to use as a frame, but a list of clusters may be available. A list of Lincoln phone numbers (each a group of residents) is available, but a list of Lincoln residents is not. A list of all NE primary and secondary schools (each a group of students) is available, but a list of all students in NE schools is not. (2) It may be cheaper to conduct the study if OUs are clustered. This occurs when the cost of data collection increases with the distance between elements: household surveys using in-person interviews (household = cluster of people), or field data collection (plot = cluster of plants or animals).

9
9 Defining clusters due to frame limitations. A cluster (or PSU) is a group of elements corresponding to a record (row) in the frame. Example: population = employees in McDonald's franchises; element = employee; frame = list of McDonald's stores; PSU = store = cluster of employees.

10
10 Defining clusters to reduce travel costs. A cluster (or PSU) is a group of nearby elements. Example: population = all farms; element = farm; frame = list of sections (1 mi x 1 mi areas) in a rural area; PSU = section = cluster of farms.

11
11 Cluster samples usually lead to less precise estimates. Elements within clusters tend to be correlated due to exposure to similar conditions: members of a household, employees in a business, plants or soil within a field plot. We get less information than if we had selected the same number of unrelated elements. Example: select a sample of city blocks (clusters of households) and ask each household, "Should the city upgrade the storm sewer system?" In PSU (city block) 1, which has no storm sewer, households will tend to say yes. In PSU (city block) 2, a new development, households will tend to say no.

12
12 Defining clusters for improved precision. (1) Define clusters for which within-cluster variation is high (rarely possible). Make each cluster as heterogeneous as possible, like a mini-population that reflects the variation in the population; this minimizes the amount of correlation among elements in the cluster. This is the opposite of the approach to stratification, which seeks large variation among strata and homogeneity within strata. (2) Define clusters that are relatively small. The extreme case is cluster = element. This decreases the number of correlated observations in the sample.

13
13 Example for single-stage cluster sampling w/ equal prob (CSE1). A dorm has N = 100 suites (clusters). Each suite has $M_i = 4$ students (4 elements in cluster i, i = 1, 2, …, N). Note that there are K = 100 x 4 = 400 students. Take an SRS of n = 5 suites (clusters). Ask each student living in each of the 5 suites: How many nights per week do you eat dinner in the dining hall? This gives observations from a sample of 20 students = 5 suites x 4 students/suite.

14
14 Dorm example – 2. [Table: one row per student and a Total row; one column per sampled suite (Suite 6, Suite 21, Suite 28, Suite 54, and a fifth suite).]

15
15 Dorm example – 3. SRS of n = 5 dorm rooms. Data on each cluster (all students in the dorm room): $t_i$ = total number of dining hall dinners for dorm room i, e.g. $t_2 = 14$ dining hall dinners for the 4 students in dorm room 2. Estimated total number of dining hall nights for the dorm students: HT estimator of total = pop size x sample mean (of cluster totals), i.e. $\hat{t}_{unb} = (N/n) \sum_{i \in S} t_i$, summing over the n sampled clusters.

16
16 Notation. Indices: i = index for PSU i; j = index for SSU j in PSU i. Number of PSUs (clusters) in the population: N clusters. Number of SSUs (elements) in a PSU (cluster): $M_i$ elements. Number of SSUs (elements) in the population: K elements; in Chapters 1-4, this was designated as N.

17
17 Notation – 2. [Figure: a population of N = 12 PSUs containing K = … = 150 SSUs, with example cluster sizes $M_1 = 20$, $M_2 = 12$, $M_{11} = 9$, and $M_{12} = 16$ SSUs; two elements of PSU i = 9 are labeled, SSU j = 1 and SSU j = 7.]

18
18 Notation – 3. Response variable for SSU j in PSU i: $y_{ij}$, e.g., the age of the j-th resident in household i, or whether or not dorm resident j in room i owns a computer.

19
19 Cluster-level population parameters (for cluster i). Cluster size: $M_i$ elements. Cluster population total: $t_i = \sum_{j=1}^{M_i} y_{ij}$. Note that we observe the cluster population total (or mean or variance) for each sampled cluster in 1-stage cluster sampling. We will estimate cluster parameters in 2-stage cluster sampling.

20
20 Cluster-level population parameters (for cluster i) – 2. Cluster population mean: $\bar{y}_{iU} = t_i / M_i$. Within-cluster variance: $S_i^2 = \sum_{j=1}^{M_i} (y_{ij} - \bar{y}_{iU})^2 / (M_i - 1)$.

21
21 [Figure panels: Population; 1-stage cluster sample.]

22
22 Cluster-level population parameters (for cluster i) – 3. For 1-stage cluster samples: have a complete enumeration of the cluster elements, so cluster population parameters are known. For 2-stage cluster samples: observe data on a sample of elements in a cluster, so cluster population parameters must be estimated.

23
23 Population parameters. Same parameters as in previous chapters, rewritten in notation for cluster sampling. Population size: $K = \sum_{i=1}^{N} M_i$ (** K was referred to as N in previous chapters). Population total (sum of all cluster totals): $t = \sum_{i=1}^{N} t_i = \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij}$.

24
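24 Population parameters – 2. Population mean (of K elements): $\bar{y}_U = t / K$. Population variance (among K elements): $S^2 = \sum_{i=1}^{N} \sum_{j=1}^{M_i} (y_{ij} - \bar{y}_U)^2 / (K - 1)$. Variance among the N cluster totals: $S_t^2 = \sum_{i=1}^{N} (t_i - t/N)^2 / (N - 1)$.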
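The population parameters in cluster notation can be computed directly when every element is known. The sketch below is only an illustration: the tiny population is hypothetical, and the variable names (`S2`, `S2_t`, and so on) are chosen here to mirror the usual textbook symbols rather than taken from the slides.

```python
from statistics import variance

# Hypothetical population: each inner list holds the y-values of one cluster.
population = [[1, 2, 3], [4, 5], [6]]

N = len(population)                             # number of clusters (PSUs)
M = [len(cluster) for cluster in population]    # cluster sizes M_i
K = sum(M)                                      # number of elements in the population
t_i = [sum(cluster) for cluster in population]  # cluster totals t_i
t = sum(t_i)                                    # population total
y_bar_U = t / K                                 # population mean per element

all_y = [y for cluster in population for y in cluster]
S2 = variance(all_y)    # variance among the K elements (divisor K - 1)
S2_t = variance(t_i)    # variance among the N cluster totals (divisor N - 1)

print(N, K, t, y_bar_U, S2, S2_t)
```

`statistics.variance` uses the n - 1 divisor, which matches the K - 1 and N - 1 divisors used for population variances in this notation.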

25
25 Data from cluster samples. Work with element-level and cluster-level data. The element data set will have columns for: cluster id, element id within cluster, and variable (y). We will also summarize this data set to generate cluster parameters (1-stage) or estimates of cluster parameters (2-stage): cluster id, cluster total (or estimate), cluster mean (or estimate), cluster variance (or estimate).

26
26 1-stage cluster sample
Element data:
i  j  y_ij
1  1  y_11
1  2  y_12
1  3  y_13
1  4  y_14
2  1  y_21
2  2  y_22
2  3  y_23
3  1  y_31
…
Cluster summary:
i  t_i
1  t_1
2  t_2
3  t_3
…

27
27 Estimation for CSE1. Chapter reading: Section covers equal sized clusters ($M_i$ constant, read); we'll start with (unequal sized clusters, $M_i$ varies); Section covers theory. Two types of estimators: unbiased (HT estimator) and ratio estimation. Equal probability sample of clusters: assume an SRS of clusters.

28
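28 CSE1 unbiased estimation under SRS – total t. Estimator for the population total using data collected from a 1-stage cluster sample with an SRS of clusters (S = set of sampled clusters): $\hat{t}_{unb} = (N/n) \sum_{i \in S} t_i$. Estimator of the variance of $\hat{t}_{unb}$: $\hat{V}(\hat{t}_{unb}) = N^2 (1 - n/N) s_t^2 / n$, where $s_t^2 = \sum_{i \in S} (t_i - \hat{t}_{unb}/N)^2 / (n - 1)$ is the sample variance of the cluster totals.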
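A minimal sketch of the unbiased CSE1 estimator, assuming the standard SRS forms: the total is estimated by N times the sample mean of the cluster totals, and its variance by N^2 (1 - n/N) s_t^2 / n, with s_t^2 the sample variance of the cluster totals. The function name and the cluster totals below are hypothetical; they are not the dorm data from the slides.

```python
from statistics import mean, variance

def cse1_unbiased_total(cluster_totals, N):
    """Unbiased estimate of the population total from an SRS of n clusters.

    cluster_totals: observed totals t_i for the n sampled clusters
    N: number of clusters in the population
    Returns (estimated total, estimated variance of the estimator).
    """
    n = len(cluster_totals)
    t_hat = N * mean(cluster_totals)         # pop size x sample mean of cluster totals
    s2_t = variance(cluster_totals)          # sample variance among cluster totals
    var_hat = N**2 * (1 - n / N) * s2_t / n  # includes the finite population correction
    return t_hat, var_hat

# Hypothetical cluster totals for n = 5 sampled clusters out of N = 100.
t_hat, var_hat = cse1_unbiased_total([12, 14, 10, 16, 13], N=100)
print(t_hat, var_hat, var_hat ** 0.5)
```

The square root of the estimated variance gives the standard error of the estimated total.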

29
29 Dorm example – 4 Estimated population total Estimated variance

30
30 CSE1 inclusion probability for an element. Two events, A and B: Pr{A and B both occur} = Pr{A occurs} x Pr{B occurs given A occurs}. In our setting, A = sample cluster i and B = sample element j (in cluster i). Inclusion probability for element j in cluster i: $\pi_{ij}$ = Pr{including element j and cluster i in sample} = Pr{including cluster i in sample} x Pr{including element j given cluster i has been included in sample}.

31
31 CSE1 inclusion probability for an element – 2. Need two pieces: Pr{including cluster i in sample} = n/N, and Pr{including element j given cluster i has been included in sample} = 1. Inclusion probability: $\pi_{ij}$ = Pr{including element j and cluster i in sample} = Pr{including cluster i in sample} x Pr{including element j given cluster i has been included in sample} = (n/N) x 1 = n/N.

32
32 CSE1 weight for an element. Weight for element j in cluster i = inverse of the element inclusion probability: $w_{ij} = 1/\pi_{ij} = N/n$. Estimator using weights: $\hat{t}_{unb} = \sum_{i \in S} \sum_{j=1}^{M_i} w_{ij} y_{ij}$.

33
33 Dorm example – 5. Inclusion probability for student j in dorm room i: N = 100 dorm rooms, n = 5 sampled dorm rooms, and all 4 students are taken in each sampled dorm room, so $\pi_{ij} = n/N = 1/20 = 0.05$. Weight for student j in dorm room i: $w_{ij} = N/n = 20$ students.
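The weight calculation for the dorm example can be checked in a few lines. N = 100 and n = 5 come from the slides; the student responses below are hypothetical stand-ins, since the point is only that the weighted sum of element values matches N/n times the sum of the cluster totals.

```python
from fractions import Fraction

N, n = 100, 5            # dorm example: 100 rooms, 5 sampled rooms
pi_ij = Fraction(n, N)   # every student in a sampled room is observed, so pi_ij = n/N
w_ij = 1 / pi_ij         # sampling weight = inverse inclusion probability

# Hypothetical responses (dinners per week) for the 4 students in each sampled room.
rooms = [
    [5, 4, 3, 2],
    [4, 4, 3, 3],
    [2, 3, 3, 4],
    [1, 2, 3, 4],
    [5, 5, 3, 2],
]

# Weighted estimator: sum of w_ij * y_ij over all sampled elements.
t_hat_weighted = sum(w_ij * y for room in rooms for y in room)

# Same estimate written with cluster totals: (N/n) * sum of t_i.
t_hat_totals = Fraction(N, n) * sum(sum(room) for room in rooms)

print(pi_ij, w_ij, t_hat_weighted, t_hat_totals)
```

Each sampled student stands in for 20 students in the dorm, which is why the weight is reported in units of students.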

34
34 CSE1 unbiased estimation under SRS – mean. Unbiased estimator for the population mean: for an SRS of clusters, it is the estimator for the total divided by the number of population elements (OUs), $\hat{\bar{y}}_{unb} = \hat{t}_{unb} / K$. Units are y-units per element.

35
35 Dorm example – 6

36
36 Unbiased estimation – proportion p What is y ?

37
37 Ratio estimation. Usually $t_i$ (cluster total) is correlated with $M_i$ (cluster size): as $M_i$ (# SSUs/elements in cluster i) increases, the value of $t_i$ (total of $y_{ij}$ for cluster i) increases. Positive correlation between $M_i$ and $t_i$, with no intercept: perfect conditions for the SRS ratio estimator. Notation correspondence: $y_i$ (variable of interest) in Ch 3 becomes $t_i$ (cluster total) in Ch 5; $x_i$ (auxiliary info) in Ch 3 becomes $M_i$ (cluster size) in Ch 5.

38
38 Ratio estimation for CSE1. Estimator for the population mean: $\hat{\bar{y}}_r = \sum_{i \in S} t_i / \sum_{i \in S} M_i$. Units are y-units per element.

39
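39 Ratio estimation for CSE1 – 2. Estimator for the variance of the ratio estimator of the population mean: $\hat{V}(\hat{\bar{y}}_r) = (1 - n/N) \frac{1}{n \bar{M}^2} \frac{\sum_{i \in S} (t_i - \hat{\bar{y}}_r M_i)^2}{n - 1}$, where $\bar{M}$ is the average cluster size for the population.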

40
40 Ratio estimation for CSE1 – 3. Average cluster size: $\bar{M} = K/N$. If unknown, can estimate it with the sample mean of cluster sizes, $\bar{M}_S = (1/n) \sum_{i \in S} M_i$.
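A sketch of the CSE1 ratio estimator of the mean, assuming the usual SRS ratio-estimation forms: the estimate is the sum of sampled cluster totals over the sum of sampled cluster sizes, and the variance estimate is built from the residuals t_i - (estimated mean) x M_i. When the population average cluster size is not supplied, the sample mean of the cluster sizes is used in its place, as the slide suggests. The function name and data are hypothetical.

```python
def cse1_ratio_mean(cluster_totals, cluster_sizes, N, M_bar=None):
    """Ratio estimate of the population mean per element from an SRS of clusters.

    cluster_totals: totals t_i for the n sampled clusters
    cluster_sizes:  sizes M_i for the same clusters
    N:              number of clusters in the population
    M_bar:          population average cluster size K/N, if known
    Returns (estimated mean, estimated variance of the estimator).
    """
    n = len(cluster_totals)
    y_bar_r = sum(cluster_totals) / sum(cluster_sizes)
    if M_bar is None:
        M_bar = sum(cluster_sizes) / n  # estimate with the sample mean cluster size
    # Residuals of cluster totals around the fitted line through the origin.
    ss_resid = sum((t - y_bar_r * M) ** 2
                   for t, M in zip(cluster_totals, cluster_sizes))
    var_hat = (1 - n / N) * ss_resid / ((n - 1) * n * M_bar ** 2)
    return y_bar_r, var_hat

# Hypothetical sample of n = 3 clusters from N = 30.
y_bar_r, var_hat = cse1_ratio_mean([8, 14, 10], [4, 6, 5], N=30)
print(y_bar_r, var_hat)
```

If K is known, multiplying the estimated mean by K gives the corresponding ratio estimate of the population total.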

41
41 Dorm example – 7 Estimated population mean Average cluster size

42
42 Dorm example – 8 Estimated variance

43
43 Ratio estimation for CSE1 – 4. Estimator for the population total: $\hat{t}_r = K \hat{\bar{y}}_r$.

44
44 Dorm example – 9 Estimated population total Estimated variance

45
45 CSE1: impact of cluster size. If cluster sizes $M_i$ vary across clusters, we generally estimate population parameters with less precision. If $t_i$ is related to $M_i$, then variable $M_i$ produces large variation among cluster totals. The variance of a population parameter estimator (unbiased or ratio) is a function of the variation among cluster totals.

46
46 2-stage equal probability cluster sampling (CSE2). CSE2 has 2 stages of sampling. Stage 1: select an SRS of n PSUs from the population of N PSUs. Stage 2: select an SRS of $m_i$ SSUs from the $M_i$ elements in each PSU i sampled in stage 1.

47
47 2-stage cluster sampling. [Figure panels: (a) stage 1 of a 2-stage cluster sample (select PSUs); (b) stage 2 of a 2-stage cluster sample (select SSUs within PSUs).]

48
48 Motivation for 2-stage cluster samples. Recall the motivations for cluster sampling in general: we may only have access to a frame that lists clusters, and we can reduce data collection costs by going to groups of nearby elements (clusters defined by proximity).

49
49 Motivation for 2-stage cluster samples – 2. Elements in a cluster are likely to be correlated, so it may be inefficient to observe all elements in a sampled PSU: the extra effort required to fully enumerate a PSU does not generate that much extra information. It may be better to spend resources on sampling many PSUs and a small number of SSUs per PSU. Possible opposing force: the study costs associated with going to many clusters.

50
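50 CSE2 unbiased estimation for population total t. We have a sample of elements from a cluster, so we no longer know the value of the cluster parameter $t_i$. Estimate $t_i$ using the data observed for the $m_i$ sampled SSUs: $\hat{t}_i = M_i \bar{y}_i$, where $\bar{y}_i$ is the sample mean of the $m_i$ sampled elements in cluster i.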

51
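51 CSE2 unbiased estimation for population total – 2. Approach is to plug estimated cluster totals into the CSE1 formula. CSE1: $\hat{t}_{unb} = (N/n) \sum_{i \in S} t_i$. CSE2: $\hat{t}_{unb} = (N/n) \sum_{i \in S} \hat{t}_i$.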

52
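52 CSE2 unbiased estimation for population total – 3. The variance of $\hat{t}_{unb}$ has 2 components associated with the 2 sampling stages: (1) variation among PSUs and (2) variation among SSUs within PSUs. $V(\hat{t}_{unb}) = N^2 (1 - n/N) S_t^2 / n$ (among PSU) $+ (N/n) \sum_{i=1}^{N} (1 - m_i/M_i) M_i^2 S_i^2 / m_i$ (within PSU).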

53
53 CSE2 unbiased estimation for population total – 4. In CSE1, we observe all elements in a cluster: we know $t_i$, so we have variance component 1 but no component 2. In CSE2, we sample a subset of elements in a cluster: we estimate $t_i$ with $\hat{t}_i$, and component 2 is a function of the estimation variance of $\hat{t}_i$.

54
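54 CSE2 unbiased estimation for population total – 5. Estimated variance among cluster totals: $s_t^2 = \sum_{i \in S} (\hat{t}_i - \hat{t}_{unb}/N)^2 / (n - 1)$. Estimated variance among elements in a cluster: $s_i^2 = \sum_{j \in S_i} (y_{ij} - \bar{y}_i)^2 / (m_i - 1)$, where $S_i$ is the set of sampled elements in cluster i.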
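The two variance components can be put together in a short sketch of the CSE2 unbiased estimator. It assumes the standard two-stage SRS forms: each cluster total is estimated by M_i times the sample mean in that cluster, the between-PSU term uses the sample variance of the estimated cluster totals, and the within-PSU term sums (1 - m_i/M_i) M_i^2 s_i^2 / m_i over sampled clusters, scaled by N/n. The function name and data are hypothetical.

```python
from statistics import mean, variance

def cse2_unbiased_total(samples, cluster_sizes, N):
    """Unbiased estimate of the population total from a 2-stage cluster sample.

    samples:       list of lists; samples[i] holds the y-values of the m_i
                   sampled elements in sampled cluster i (needs m_i >= 2)
    cluster_sizes: M_i for each sampled cluster
    N:             number of clusters in the population
    Returns (estimated total, estimated variance of the estimator).
    """
    n = len(samples)
    # Estimated cluster totals: t_hat_i = M_i * (sample mean in cluster i).
    t_hat_i = [M * mean(y) for y, M in zip(samples, cluster_sizes)]
    t_hat = N / n * sum(t_hat_i)

    # Component 1: variation among PSUs.
    between = N**2 * (1 - n / N) * variance(t_hat_i) / n
    # Component 2: variation among SSUs within PSUs.
    within = N / n * sum(
        (1 - len(y) / M) * M**2 * variance(y) / len(y)
        for y, M in zip(samples, cluster_sizes)
    )
    return t_hat, between + within

# Hypothetical data: n = 2 sampled clusters of size 4, m_i = 2 elements each.
t_hat, var_hat = cse2_unbiased_total([[1, 3], [2, 6]], [4, 4], N=100)
print(t_hat, var_hat)
```

In this toy case the between-PSU term (156800) dominates the within-PSU term (2000), which is the usual pattern when n/N is small.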

55
55 CSE2 unbiased estimation for population total – 6

56
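56 Dorm example – 10. Stage 2: select 2 students in each room. [Table: one row per student and a Total row; one column per sampled room (Rm 6, Rm 21, Rm 28, Rm 54, and a fifth room). The room totals are shown as "?" because only 2 students per room are observed.]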

57
57 Dorm example – 11. Stage 1: Cluster = ? N = ? n = ? Design: SRS. Stage 2: Element = ? $M_i$ = ? $m_i$ = ? Design: SRS.

58
58 Dorm example – 12 Stu- dent (j) Rm 6 (i=1) Rm 21 (i=2) Rm 28 (i=3) Rm 54 (i=4) Rm 89 (i=5)

59
59 Dorm example – 13

60
60 Dorm example – 14

61
61 CSE2 unbiased estimation for population mean

62
62 Dorm example – 15

63
63 CSE2 inclusion probability for an element. Two events, A and B: Pr{A and B both occur} = Pr{A occurs} x Pr{B occurs | A occurs}, where | denotes "given" (a condition). In our setting, A = sample cluster i and B = sample element j. Inclusion probability symbols: $\pi_{ij}$ = Pr{including element j and cluster i in sample}; $\pi_i$ = Pr{including cluster i in sample}; $\pi_{j|i}$ = Pr{including element j | cluster i has been included in sample}.

64
64 CSE2 inclusion probability for an element – 2. Need two pieces: $\pi_i$ = Pr{including cluster i in sample} = n/N, and $\pi_{j|i}$ = Pr{including element j | cluster i has been included in sample} = $m_i/M_i$. Inclusion probability for element j in cluster i: $\pi_{ij} = \pi_i \pi_{j|i} = (n/N)(m_i/M_i)$.

65
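65 CSE2 weight for an element. Sampling weight for element j in cluster i: $w_{ij} = 1/\pi_{ij} = (N/n)(M_i/m_i)$. Estimator for the population total: $\hat{t}_{unb} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}$, where $S_i$ is the set of sampled elements in cluster i.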
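As a sketch of the weighting idea, the element weight in CSE2 is the inverse of the product of the two stage-wise inclusion probabilities, n/N and m_i/M_i. The helper name `cse2_weight` and the data are hypothetical; the clusters are given different sizes on purpose so that the weights differ between clusters.

```python
from fractions import Fraction

def cse2_weight(N, n, M_i, m_i):
    """Sampling weight for any sampled element in cluster i under CSE2."""
    pi_i = Fraction(n, N)              # stage 1: SRS of n clusters from N
    pi_j_given_i = Fraction(m_i, M_i)  # stage 2: SRS of m_i elements from M_i
    return 1 / (pi_i * pi_j_given_i)   # w_ij = 1 / pi_ij

# Hypothetical 2-stage sample: 2 clusters sampled out of N = 100.
N = 100
samples = [[1, 3], [2, 6, 4]]  # observed y-values per sampled cluster
cluster_sizes = [4, 9]         # M_i for the sampled clusters
n = len(samples)

weights = [cse2_weight(N, n, M, len(y)) for y, M in zip(samples, cluster_sizes)]

# Weighted estimator of the population total: sum of w_ij * y_ij.
t_hat = sum(w * sum(y) for w, y in zip(weights, samples))

print(weights, t_hat)
```

The two clusters get weights 100 and 150, so elements in different clusters do not have the same inclusion probability even though both stages use SRS.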

66
66 What does equal probability mean in Ch 5? Clusters (PSUs) are sampled using SRS, so there is an equal inclusion probability for the stage 1 PSUs (clusters): $\pi_i$ is the same for all i.

67
67 What does equal probability mean in Ch 5? – 2. Elements (SSUs) in a given PSU are sampled using SRS, so all elements (j) in a sampled PSU (i) are selected with equal probability. This is a conditional probability (given PSU i): for a given PSU i, $\pi_{j|i}$ is the same for all elements j.

68
68 What does equal probability mean in Ch 5? – 3. Note that equal probability at stage 1 ($\pi_i$) plus equal probability at stage 2 given PSU i ($\pi_{j|i}$) does NOT imply equal inclusion probability for an element. In fact, the element-level (unconditional) inclusion probability $\pi_{ij}$ is not necessarily constant: it depends on the cluster size $M_i$ and the sample size $m_i$ for the cluster to which the element belongs.

69
69 CSE2 ratio estimation for population mean

70
70 CSE2 ratio estimation for population mean – 2

71
71 Dorm example – 16 Stu- dent (j) Rm 6 (i=1) Rm 21 (i=2) Rm 28 (i=3) Rm 54 (i=4) Rm 89 (i=5)

72
72 Dorm example – 16

73
73 Dorm example – 17

74
74 CSE2 ratio estimation for population total t

75
75 Dorm example – 18

76
76 Coots egg example. Target pop = American coot eggs in Minnedosa, Manitoba. PSU / cluster = clutch (nest); SSU / element = egg w/in clutch. Stage 1: SRS of n = 184 clutches; N = ??? clutches, but probably pretty large. Stage 2: SRS of $m_i = 2$ eggs from the $M_i$ eggs in a clutch. Do not know K = ??? eggs in population, also large; can count $M_i$ = # eggs in sampled clutch i. Measurement: $y_{ij}$ = volume of egg j from clutch i.

77
77 Coots egg example – 2. Scatter plot of volumes vs. i (clutch id). Double dot pattern: high correlation among eggs WITHIN a clutch. Quite a bit of clutch-to-clutch variation. This implies we may not have very high precision unless we sample a large number of clutches, and certainly lower precision than if we had obtained an SRS of eggs. Could use a side-by-side plot for data with larger cluster sizes (PROC UNIVARIATE w/ BY CLUSTER and PLOTS option).

78
78 Coots egg example – 3. Plot: rank the mean egg volume for clutch i, $\bar{y}_i$; plot $y_{ij}$ vs. rank for clutch i (i sorted by $\bar{y}_i$); draw a line between $y_{i1}$ and $y_{i2}$ to show how close the 2 egg volumes in a clutch are. Observations: same results as Fig 5.3, but more clear; small within-cluster variation; large between-cluster variation. Also see 1 clutch with large WITHIN-clutch variation: check data (i = 88).

79
79 Coots egg example – 4. Plot $s_i$ vs. $\bar{y}_i$ for clutch i. Since volumes are always positive, we might expect $s_i$ to increase as $\bar{y}_i$ gets larger: if $\bar{y}_i$ is very small, $y_{i1}$ and $y_{i2}$ are likely to be very small and close together, giving a small $s_i$. We see this to a moderate degree. Clutch 88 has a large $s_i$, as noted in the previous plot.

80
80 Coots egg example – 5. Estimation goal: estimate $\bar{y}_U$, the population mean volume per coot egg in Minnedosa, Manitoba. What estimator? Unbiased estimation: don't know N = total number of clutches or K = total number of eggs in Minnedosa, Manitoba. Ratio estimation: only requires knowledge of $M_i$, the number of eggs in selected clutch i, in addition to the data collected. May want to plot $\hat{t}_i$ versus $M_i$.

81
81 Coots egg example – 6

82
82 Coots egg example – 7. Don't know $\bar{M}$; use the sample mean of the cluster sizes. Don't know N, but it is assumed large, so FPC ≈ 1. The 2nd term is very small, so the approximate SE ignores the 2nd term.

83
83 Coots egg example – 8 What is first-stage PSU inclusion probability? What is conditional SSU inclusion probability at second stage? What is unconditional SSU inclusion probability?

84
84 CSE2: Unbiased vs. ratio estimation. The unbiased estimator can have poor precision if cluster sizes ($M_i$) are unequal and $t_i$ (cluster total) is roughly proportional to $M_i$ (cluster size). The biased (ratio) estimator can be precise if $t_i$ is roughly proportional to $M_i$. This happens frequently in populations where cluster sizes ($M_i$) vary.

85
85 CSE2: Self-weighting design. Stage 1: select n PSUs from the N PSUs in the pop using SRS. Inclusion probability for PSU i: $\pi_i = n/N$. Stage 2: choose $m_i$ proportional to $M_i$ so that $m_i/M_i = c$ is constant, and use SRS to select the sample. Inclusion probability for SSU j given PSU i: $\pi_{j|i} = m_i/M_i = c$. The unconditional inclusion probability for SSU j in cluster i, $\pi_{ij} = \pi_i \pi_{j|i} = (n/N) c$, is constant for all elements. The inclusion probability may vary in practice because it may not be possible for $m_i/M_i$ to equal c for all clusters.
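A small sketch of the self-weighting allocation, showing why the inclusion probability can still vary in practice: m_i must be a whole number, so m_i/M_i cannot always equal the target fraction c. The cluster sizes 20, 12, 9, and 16 and N = 12 are borrowed from the notation figure earlier in the deck; treating those four PSUs as the stage 1 sample, the choice c = 1/4, and the rounding rule are all assumptions made for illustration.

```python
from fractions import Fraction

def second_stage_sizes(cluster_sizes, c):
    """Allocate m_i as close to c * M_i as whole numbers allow (at least 1)."""
    return [max(1, round(c * M)) for M in cluster_sizes]

def inclusion_probabilities(n, N, cluster_sizes, sample_sizes):
    """Unconditional element inclusion probability (n/N) * (m_i/M_i) per cluster."""
    return [Fraction(n, N) * Fraction(m, M)
            for m, M in zip(sample_sizes, cluster_sizes)]

N = 12                          # PSUs in the population
cluster_sizes = [20, 12, 9, 16]  # M_i for the (hypothetically) sampled PSUs
n = len(cluster_sizes)
c = Fraction(1, 4)              # target second-stage sampling fraction

m = second_stage_sizes(cluster_sizes, c)
pi = inclusion_probabilities(n, N, cluster_sizes, m)

print(m)   # second-stage sample sizes
print(pi)  # element inclusion probabilities, one per cluster
```

Three of the clusters hit the target exactly (inclusion probability 1/12), while the cluster of size 9 can only take 2 elements and ends up at 2/27, so the design is only approximately self-weighting.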

86
86 Self-weighting designs in general. Why are self-weighting samples appealing? Are the dorm student or coot egg samples self-weighting 2-stage cluster samples? What other (non-cluster) self-weighting designs have we discussed?

87
87 Self-weighting designs in general – 2. What is the caveat for variance estimation in self-weighting samples? There is no shortcut for the variance of the estimator: must use the proper formula for the design. Why are self-weighting samples appealing? Simple mean estimator; homogeneous weights tend to make estimates more precise.

88
88 Return to systematic sampling (SYS). Have a frame, or list, of N elements. Determine the sampling interval k: k is the next integer after N/n. Select the first element in the list: choose a random number R between 1 and k; the R-th element is the first element to be included in the sample. Then select every k-th element after the R-th element. The sample includes element R, element R + k, element R + 2k, …, element R + (n-1)k.

89
89 SYS example. Telephone survey of members of an organization about use of the organization's website. N = 500 members; have resources to do n = 75 calls. N/n = 500/75 = 6.67, so k = 7. Random number table entry: Rule: if pick 1, 2, …, 7, assign as R; otherwise discard #. Select R = 5. Take element 5, then element 5 + 7 = 12, then element 12 + 7 = 19, 26, 33, 40, 47, …
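The selection rule can be sketched in a few lines. This is an illustration only: the function name is hypothetical, "next integer after N/n" is read as rounding N/n up, and the list simply stops when it runs past element N.

```python
import random

def systematic_sample(N, n, R=None, rng=random):
    """Return the 1-based positions selected by a systematic sample.

    N: number of elements in the frame
    n: target sample size
    R: random start between 1 and k; drawn at random if not given
    """
    k = -(-N // n)             # sampling interval: N/n rounded up
    if R is None:
        R = rng.randint(1, k)  # random start, 1 <= R <= k
    # Take element R, then every k-th element until the frame is exhausted.
    return list(range(R, N + 1, k))

# Slide example: N = 500 members, n = 75 calls, random start R = 5.
sample = systematic_sample(500, 75, R=5)
print(sample[:7])   # first few selected positions
print(len(sample))  # realized sample size
```

Because k is rounded up from 6.67 to 7, the frame is exhausted before 75 elements are reached: with R = 5 the last selected element is 495 and the realized sample size is 71.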

90
90 SYS – 2. Arrange the population in rows of length k = 7. [Table: columns indexed by R, rows indexed by i.]

91
91 Relationship between SYS and cluster sampling. Design relationships: Element = ? Cluster = ? Sampling unit(s) = ? Cluster sampling design = ? What is the relationship between frame ordering and the expected precision of an estimate from a cluster sample when the frame is: periodic, where the cycle of the pattern is coincident with the sampling interval k; ordered by X, which is correlated with the response variable Y; random?

92
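92 SYS – 3. Suppose X [age of member] is correlated with Y [use of org website]. Sort the list by X before selecting the sample. [Table: the list sorted by X, running from young through mid to old, arranged in rows of length k with row index i; the last row is labeled 72.]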
