1 Sampling Research Questions
Bruce D. Spencer
Statistics Department and Institute for Policy Research, Northwestern University
SAMSI Workshop, 10/21/10
2 Introduction
At the end of the opening workshop, the group in Sampling, Modeling, and Inference raised a number of open questions related to sampling. Today I will discuss those questions, most of which are still unsolved.
3 Goal of Sample-Based Inference
What is the target of the inference?
- a stochastic model that generated a network or set of networks
- population of networks, e.g., dynamic networks
- multiple networks on a single population of edges
- single network
4 Various Network Sampling Designs
- Conventional sample design to learn about the network
  - probabilities do not depend on observed data
  - E.g., Current Population Survey
- Adaptive sample design using the network
  - probabilities may depend on observed data
  - E.g., respondent-driven sampling (RDS); ego-centric samples; link-tracing designs
- Two-phase sampling to target further investigation of missing data or measurement error
- Subsampling (?) to reduce computational burden at possible loss of efficiency
5 Conventional Sampling Design to Learn about the Network(s)
Samples of nodes or of edges, used for
- description of network(s)
- prediction of future state of network
- prediction of links/gaps/nodes
- fitting a model to the graph
6 Limitations from Sampling
- Sampling introduces random error into the estimates (and possibly bias, since E f(X) ≠ f(EX) for nonlinear f)
- Sampling variance needs to be estimated, maybe bias does too; may be problematic for small samples
- Some population characteristics may not be “estimable” from a sample
  - E.g., maximum path length between any two nodes?
  - Number of components in a general graph?
  - What does “estimable” mean?
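A small worked illustration of the bias remark (not taken from the slides): take f(x) = x² and let the estimate be the mean of n independent draws with mean μ and variance σ². The independence assumption and the symbols μ, σ², n are introduced here only for the example.

```latex
\begin{align*}
E\big[f(\bar X)\big] = E\big[\bar X^{2}\big]
  &= \operatorname{Var}(\bar X) + \big(E\,\bar X\big)^{2}
   = \frac{\sigma^{2}}{n} + \mu^{2}, \\
f\big(E\,\bar X\big) &= \mu^{2}.
\end{align*}
```

Under these assumptions the plug-in estimate of μ² is biased upward by σ²/n, which shrinks as n grows but need not be negligible for small samples.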
7 Limitations from Sampling
If elements of interest (edges/non-edges, stars, motifs, etc.) have unequal probabilities of being observed, then
- need to know the probabilities and adjust for them
- or, need to have a model that explains the population
- or, sometimes, both.
8 E.g.: Induced Graph Sampling
- Undirected parent graph (V, G)
- Sample nodes S ⊂ V
- Observe G(S) ⊂ G: observe edge/non-edge between u, v iff u, v ∈ S
- Conventional sampling with possibly unequal probabilities (including multiple-frame stratified multi-stage): probability of including u1, u2, ..., uj and excluding v1, v2, ..., vk knowable for any j, k
- Denote inclusion probabilities by
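A minimal sketch of how known inclusion probabilities can be used with an induced subgraph, here to estimate the number of edges in the parent graph. It assumes the simplest unequal-probability design, in which each node enters the sample independently with a known probability, so a dyad is observed with the product of its two node probabilities. The function name `ht_edge_total`, the dictionary `pi`, and the toy graph are hypothetical and not from the slides.

```python
from fractions import Fraction


def ht_edge_total(sample, edges, pi):
    """Horvitz-Thompson estimate of the number of edges in the parent graph.

    ASSUMPTION (not from the slides): nodes enter the sample independently,
    node u with known probability pi[u], so the dyad {u, v} is observed
    with probability pi[u] * pi[v].
    """
    s = set(sample)
    total = Fraction(0)
    for u, v in edges:
        if u in s and v in s:  # edge is visible in the induced graph G(S)
            total += 1 / (pi[u] * pi[v])  # weight by inverse dyad probability
    return total


# Hypothetical toy parent graph and node inclusion probabilities.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
pi = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(3, 4), 3: Fraction(1, 4)}

# Estimate from one possible sample of nodes.
print(ht_edge_total({0, 1, 2}, edges, pi))
```

Exact fractions are used so that unbiasedness can be checked by enumerating every possible sample of a small graph; other designs would need their own joint inclusion probabilities in place of the product.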
10 Horvitz-Thompson (H-T) Estimators of Triad Distribution
- Define Tk,u,v,w = 1 if u, v, w are distinct vertices sharing k edges, and = 0 otherwise
- Tk = number of triads in E with 0 ≤ k ≤ 3 edges
- Other totals estimated similarly, e.g., number of stars or other motifs.
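The triad totals can be estimated along the following lines. This is a sketch under a stated assumption, not necessarily the estimator shown on the slides: the sample is a simple random sample without replacement of n nodes out of N, so every node triple has inclusion probability n(n-1)(n-2) / (N(N-1)(N-2)). The function names and the toy graph are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def triad_counts(nodes, edges):
    """Count unordered node triples by their number of edges, k = 0..3."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for u, v, w in combinations(sorted(nodes), 3):
        k = sum(frozenset(p) in edge_set for p in ((u, v), (u, w), (v, w)))
        counts[k] += 1
    return counts


def ht_triad_estimates(sample, edges, N):
    """H-T estimates of (T_0, T_1, T_2, T_3) from the induced subgraph.

    ASSUMPTION: `sample` is an SRS without replacement of n >= 3 nodes from
    a parent graph with N nodes, so each triple is observed with the same
    probability pi3.
    """
    n = len(sample)
    if n < 3:
        raise ValueError("need at least 3 sampled nodes to observe a triad")
    pi3 = Fraction(n * (n - 1) * (n - 2), N * (N - 1) * (N - 2))
    s = set(sample)
    induced = [e for e in edges if set(e) <= s]  # edges with both ends sampled
    return [c / pi3 for c in triad_counts(s, induced)]


# Hypothetical toy parent graph on 6 nodes.
parent_nodes = range(6)
parent_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]

print(triad_counts(parent_nodes, parent_edges))         # population values
print(ht_triad_estimates((0, 1, 2, 3), parent_edges, 6))  # one sample's estimates
```

Any single sample can give estimates far from the population values; the H-T property is only that the estimates average to the true counts over all possible samples.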
11 Degree Distribution
- du = degree of node u (its number of edges)
- M = maximum degree in (E, G)
- Nr = number of nodes of degree r, 0 ≤ r ≤ M
- (F0, F1, …, FM) is the degree distribution, with Fr = Nr / N
- Degree distribution of the sample can differ from degree distribution of the population.
- “Subnets of Scale-Free Networks are Not Scale-Free: Sampling Properties of Networks”, Stumpf, Wiuf, May (PNAS, 2005)
12 Estimation of Degree Distribution
- Induced subgraph from SRS of size n from (E, G)
- Nr = number of nodes of degree r in parent graph
- Nr(S) = number of nodes of degree r in subgraph
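One way to relate Nr(S) to Nr is sketched below; it is an illustration under stated assumptions and may differ from the estimator on the slides. Under an SRS of n nodes from N, a sampled node of parent degree r has a hypergeometric number of sampled neighbours, so the expected sample counts are a linear (upper-triangular) function of the population counts, which can be inverted when the maximum degree M is known and M ≤ n - 1. All function names and the toy graph are hypothetical.

```python
from fractions import Fraction
from math import comb


def degree_transition(N, n, M):
    """P[j][r]: chance a sampled node of parent degree r has degree j in G(S),
    assuming an SRS without replacement of n nodes from N."""
    denom = comb(N - 1, n - 1)

    def prob(j, r):
        if j > r or n - 1 - j < 0:
            return Fraction(0)
        return Fraction(comb(r, j) * comb(N - 1 - r, n - 1 - j), denom)

    return [[prob(j, r) for r in range(M + 1)] for j in range(M + 1)]


def induced_degree_counts(sample, edges, M):
    """N_r(S) for r = 0..M: degree counts in the induced subgraph."""
    s = set(sample)
    deg = {u: 0 for u in s}
    for u, v in edges:
        if u in s and v in s:
            deg[u] += 1
            deg[v] += 1
    counts = [0] * (M + 1)
    for d in deg.values():
        counts[d] += 1
    return counts


def estimate_degree_counts(sample_counts, N, n):
    """Solve (n/N) * P * N_pop = N_sample for N_pop by back substitution.

    ASSUMPTION: the maximum parent degree M = len(sample_counts) - 1 is
    known and M <= n - 1, so the triangular system is invertible.
    """
    M = len(sample_counts) - 1
    if M > n - 1:
        raise ValueError("requires maximum degree M <= n - 1")
    P = degree_transition(N, n, M)
    est = [Fraction(0)] * (M + 1)
    for r in range(M, -1, -1):
        tail = sum(P[r][s] * est[s] for s in range(r + 1, M + 1))
        est[r] = (Fraction(N, n) * sample_counts[r] - tail) / P[r][r]
    return est


# Hypothetical toy parent graph: degrees are 2, 2, 3, 2, 1, 0, so M = 3.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
counts_S = induced_degree_counts((0, 1, 2, 3), toy_edges, M=3)
print(counts_S, estimate_degree_counts(counts_S, N=6, n=4))
```

The inversion is unbiased over repeated sampling under these assumptions, but individual estimates can be negative or highly variable when n is small relative to N.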
14 Estimation of Mean and Variance of Degree Distribution
15 Partial Recap
Using induced graph subsamples from conventional samples where joint inclusion probabilities are known, we can estimate
- population values of descriptive statistics based on totals
- degree distribution.
(Only undirected graphs at one point in time discussed.)
What about
- other descriptive statistics
- model fitting
- large variances when sample size small
- adaptive samples?
16 Approaches to Model Fitting
- You trust* your model.
- Under certain conditions** on the sample design and the model, you can ignore the way the sample was selected and treat the sample as having been generated from the model.
- The sampling mechanism needs to be carefully examined to make sure it meets the requirements, which depend on the model being used.
* Reagan and others, “trust but verify”
** Handcock and Gile (2010 AoAS) call the condition “amenability” and relate it to “ignorability” (Rubin 1976).
17 Approaches to Model Fitting
- “Model as descriptive statistic”. You do not necessarily believe the model, but you want to fit the model the way you would if you completely observed the population.
- Anathema to many social scientists...
- E.g., in ERGMs, model fitting for the population depends on sufficient statistics that are population totals. One can estimate them with H-T estimates (or alternatives) and then fit the model. (Pavel Krivitsky poster)
- I have not investigated how to implement this for other models.
- If both approaches are tried, “large” differences in fits can indicate model misspecification.
18 Adaptive Sampling
- Probabilities of observations depend on data from sampled units.
- Provides more information about network than conventional samples (Frank). Note: variances may be too large when sample is conventional but sparse.
- Probabilities of observing triads and larger structures are typically unavailable; even probabilities for dyads are known for ego-centric designs but not for link-tracing designs. (Handcock and Gile 2010)
- In order to use full data, either need to estimate unknown probabilities (hard!!) or rely on model if amenability condition can be verified and model validated.
- E.g., when using conventional unequal probability samples to estimate a population total, the amenability condition typically does not hold.
19 Model Validation
- Model validation is important, but challenging when sampling probabilities are unknown.
- At the heart of every adaptive sample is a conventional sample.
- Use conventional sample to fit model as descriptive statistic. Compare result to model fitted under assumption of ignorability/amenability for (i) conventional sample and (ii) larger and more informative adaptive sample.
20 Recap
- What is the population (network, or set of networks) from which sample is selected?
- Sample design (and inference) to learn about the network
  - Static
  - Over time
  - Description of network
  - Prediction of future state of network and prediction of links/gaps/nodes
21 Recap
- Sample design (and inference) using the network to learn about a population
  - Respondent Driven Sampling
  - Adaptive Sampling
  - Others
  - Static and over time
22 Recap
- Subsampling design (and inference) to
  - Ease computational burden
  - Target further investigation to learn about measurement error
- When can inferences be made based on sample design information to provide approx. unbiasedness whether or not model is valid?
23 Recap
- How can model inferences be made?
- What models?
  - Exponential random graph models
  - Mixed membership stochastic block models
  - Latent space models
  - Agent based models
- What network characteristics (what summary statistics)?
24 Recap
- What is effect of measurement error (and missing data, non-response) on inferences about network?
  - RDS samples
  - Others
- How to design and analyze randomized experiments when subjects are part of a static network? Dynamic?
  - Google experiments
  - Experiments on adolescents in schools (e.g., drug counseling, obesity “treatment”): effects on peers