
Computational Methods for Testing Adequacy and Quality of Massive Synthetic Proximity Social Networks Huadong Xia, Christopher Barrett, Jiangzhuo Chen,


1 Computational Methods for Testing Adequacy and Quality of Massive Synthetic Proximity Social Networks Huadong Xia, Christopher Barrett, Jiangzhuo Chen, Madhav Marathe IEEE BDSE2013 Network Dynamics and Simulation Science Laboratory Virginia Tech NDSSL TR-13-153

2 We thank our external collaborators and members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their suggestions and comments. This work has been partially supported by DTRA Grant HDTRA1-11-1-0016, DTRA CNIMS Contract HDTRA1-11-D-0016-0001, NIH MIDAS Grant 2U01GM070694-09, NSF PetaApps Grant OCI-0904844, NSF NetSE Grant CNS-1011769. Acknowledgement

3 Outline: Background and Contributions; Methods: Network Synthesis; Comparison of Large Scale Networks; Conclusions

4 Importance of Computational Epidemiological Models
Pandemics cause substantial social, economic and health impacts
– The 1918 flu pandemic killed 50-100 million people, or 3 to 5 percent of the world population.
– …
– SARS 2003, H1N1 2009, Avian flu (H7N9) 2013
Mathematical and computational models have played an important role in understanding and controlling epidemics
– Controlled experiments are not allowed, for ethical reasons.
– Models help us understand the space-time dynamics of epidemics.

5 Networked Epidemiology: Heterogeneous spatial-temporal features of populations. Massive, irregular, dynamic and unstructured. Social contact networks are usually synthesized. (Figure from the Internet)

6 The Four V’s in Networked Epidemiology
Volume (facts in Delhi): 13.85M population; 2.67M households; 2.64M locations; > 200M contacts
Velocity: interactions change every second; node status changes every second; they are modeled at minute scale
Variety: demographics; geographic; temporal features; virus infectivity; …
Veracity:
– Data: do we collect enough raw data to render a clear picture?
– Method: do we extract all useful information out of the available raw data?
[Figure: snapshots at 7am, 9am, 3pm, 8pm]

7 Social Contact Network Modeling and Analysis
The veracity of the network one makes depends on:
– The time available to make such a network (human, computational)
– The data available to make the network
– The specific question that one would like to investigate
Networks at different levels of detail may be built for the same region.
How do we evaluate networks that span large regions?
– How do we compare two networks constructed for the same population?
– When is the synthesized network adequate?

8 Contributions: We propose a number of network measurements to understand and compare urban-scale social contact networks, which are extremely large, dynamic and unstructured. We explore quantitatively the adequacy standards in modeling proximity networks.

9 Outline: Background and Contributions; Methods: Network Synthesis; Comparison of Large Scale Networks; Conclusions

10 Synthetic Populations and Their Contact Networks
Goal:
– Determine who is where, and when.
Process:
– Create a statistically accurate baseline population
– Assign each individual to a home
– Estimate their activities and where these take place
– Determine each individual’s contacts and locations throughout a day

11 Constructing Synthetic Social Contact Networks

12 What Is a Network
Networks capture the social interactions pertinent to the disease. We focus on flu-like diseases, for which the appropriate network is a social contact network based on proximity relationships.
People, vertex attributes: age; household size; gender; income; …
Locations, vertex attributes: (x, y, z); land use; …
Edge attributes: activity type (shop, work, school); (start time 1, end time 1); (start time 2, end time 2); …

13 Two Sets of Data Sources and Generation Methods for Delhi Synthetic Population and Network
Data & Methods | the coarse network | the detailed network
data: demographics | India census 2001 | India census 2001 + micro-data (India Human Development Survey - UMD)
data: geographic data | LandScan 2007 | MapMyIndia
data: activity | generic activity templates | Thane travel survey; residential contact survey
method: people | distribution | distribution/IPF
method: locations | density | real locations + home along roads
method: activity schedules | categorized templates | decision tree + templates; configuration model
method: activity locations | gravity model

14 Residential Contacts: for the Detailed Network Only Office Mall School Residential Area

15 Generation of the Residential Network
Motivation of the residential contact network:
– Approximately 40% of adults in India do not travel to work. The network models interactions among them around their homes (within the residential area).
Survey data collected:
– Age and gender of people staying at home: node label
– Contact durations/frequencies of each person near their home: edge label/node degree
Formal question: generate a random network s.t.
– The degree distribution of a set of nodes is given
– The label of each node is given
– Assumption: the network tends to be homophilous (nodes with similar labels are connected with higher probability)
Method:
– Configuration model with the added feature of node homophily.
– Refer to the next slide for details.

16 Population Synthesis [Figure: population for the coarse network vs. population for the detailed network; individuals labeled by gender and age (e.g. M47, F22); steps shown: Split into HHs, Extract individuals]

17 Metrics – Entity level: the population, built infrastructure and their layout – Collective level: validate against aggregate statistics. – Network level: structural properties – Epidemic dynamics level: policy effects How to Compare Two Networks

18 Comparison for Synthetic Populations: Individual level: age-gender structure. Household level: demographic structure. Entropy: 1.35 vs. 1.02

19 Precision of Location Distribution [Figure panels: the coarse network; the detailed network. Labels: LandScan grid, synthetic locations, real locations]

20 Activity Statistics

21 Temporal Visiting Degree in Randomly Selected Locations. Note: first row: the coarse network; second row: the detailed network

22 G_PL: Temporal and Spatial Properties [Figure panels: travel distance distribution; radius of gyration distribution]
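The slide names two spatial metrics without defining them. A minimal sketch of how they could be computed for one person's daily sequence of visited locations, assuming planar (x, y) coordinates and the common definition of radius of gyration as the root-mean-square distance from the centroid of the visited points. The function names and the sample data are hypothetical, not taken from the presentation.

```python
import math

def travel_distance(points):
    """Total straight-line distance along an ordered sequence of visited points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def radius_of_gyration(points):
    """Root-mean-square distance of the visited points from their centroid.

    Assumed definition; the slide does not state the formula it uses.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

# Hypothetical day: home -> work -> home, coordinates in arbitrary units.
visits = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
day_distance = travel_distance(visits)
day_radius = radius_of_gyration(visits)
```

Computing both values per person and histogramming them would give distributions of the kind the two panels compare.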

23 G_PL: Structural Properties. In the people-location network G_PL, the degrees of a large portion of non-home locations follow a power-law-like distribution.

24 People-People Network G_P

25 Disease Spread in a Social Network Within-host disease model: SEIR Between-host disease model: – probabilistic transmissions along edges of social contact network – from infectious people to susceptible people
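The two-level model on this slide can be illustrated with a small discrete-time simulation. This is a sketch under simplifying assumptions, not the simulator used in the study: durations are fixed rather than heterogeneous, infections are seeded only at the start, and every edge transmits with the same per-day probability. All names (`simulate_seir`, `p_transmit`, and so on) are hypothetical.

```python
import random

def simulate_seir(contacts, seeds, p_transmit, incubation_days,
                  infectious_days, days, rng):
    """Run a daily-step SEIR process on a contact network.

    contacts: dict mapping each node to a list of neighbours.
    Returns a dict mapping each node to its final state: S, E, I or R.
    """
    state = {v: "S" for v in contacts}
    timer = {}                      # days left in the current E or I state
    for v in seeds:
        state[v] = "I"
        timer[v] = infectious_days
    for _ in range(days):
        # Between-host step: infectious nodes try to infect susceptible neighbours.
        newly_exposed = []
        for v, s in state.items():
            if s == "I":
                for u in contacts[v]:
                    if state[u] == "S" and rng.random() < p_transmit:
                        newly_exposed.append(u)
        # Within-host step: advance E -> I -> R when a timer runs out.
        for v in list(timer):
            timer[v] -= 1
            if timer[v] == 0:
                if state[v] == "E":
                    state[v] = "I"
                    timer[v] = infectious_days
                else:
                    state[v] = "R"
                    del timer[v]
        for u in newly_exposed:
            if state[u] == "S":
                state[u] = "E"
                timer[u] = incubation_days
    return state

# Hypothetical three-person chain: 0 - 1 - 2.
chain = {0: [1], 1: [0, 2], 2: [1]}
final_state = simulate_seir(chain, [0], 0.3, 2, 4, 60, random.Random(7))
```

Passing the random generator explicitly keeps replicates reproducible, which matters when many runs are averaged.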

26 Epidemic Simulations to Study the Delhi Population
Disease model:
– Flu similar to H1N1 in 2009: assume R0 = 1.35, 1.40, 1.45, 1.60 (only the results for R0 = 1.35 are shown, but the others are similar)
– SEIR model: heterogeneous incubation and infectious durations
– 10 random seeds every day
Interventions:
– Vaccination: implemented at the beginning of the epidemic; compliance rate 25%
– Antiviral: implemented when 1% of the population is infectious; covers 50% of the population; effective for 15 days
– School closure: implemented when 1% of the population is infectious; compliance rate 60%; lasts for 21 days
– Work closure: implemented when 1% of the population is infectious; compliance rate 50%; lasts for 21 days
Five configurations in total (including the base case). Each configuration is simulated for 300 days with 30 replicates.

27 Comparison in Epidemic Simulations. Impact on epidemic dynamics (R0 = 1.35): – The coarse network uses generic activity schedules, in which people travel much more frequently. Therefore, the two networks show very different epidemic dynamics in the base case.

28 Epidemic Simulation Results: Interventions
Similarities of the two networks:
– Vaccination is still the most effective strategy.
– Pharmaceutical interventions are more effective than non-pharmaceutical ones.
– School closure is more effective than work closure.
Differences between the two networks:
– Severity is significantly different.
– In delaying the outbreak of disease, school closure is more effective than antiviral in the coarse network; the opposite holds in the detailed network.

29 Metrics Review
Categories | Metrics
Underlying Synthetic Population | Household Structure; Location Layout; Duration of Activities; Number of Daily Activities; Travel Distance; Radius of Gyration
G_PL | Temporal Degree of Random Locations; Degree of People-Location Graphs
G_P | Degree; Clustering Coefficient; Contact Duration; Shortest Path
Epidemic Dynamics | No Interventions; Pharmaceutical Interventions; Non-Pharmaceutical Interventions

30 Conclusions
Novel methodologies for creating a realistic social contact network for a typical urban area in developing countries.
Comparison to a coarser network suggests:
– Similarities reflect generic properties of social contact networks
– Region-specific features are captured in the detailed model
– The epidemic dynamics of the region are strongly influenced by the activity patterns and demographic structure of local residents
– A higher-resolution social contact network helps us make better public health policy
A realistic representation of social networks requires adequate empirical input. We propose the following criteria of adequacy:
– Does the new input decrease the uncertainty of the system?
– Does the new input significantly change epidemics and intervention policy?

31 END Questions?

32 EXTRA SLIDES

33 Epidemic Simulation Results: Vulnerability. R0 is calibrated to 1.35. Vulnerability is defined as the normalized number of infections over 10,000 runs of random simulations. The vulnerability distribution of the detailed network is flatter than that of the coarse network, and the detailed network is less vulnerable due to less frequent travel.
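One way to read the vulnerability definition is as a per-person quantity: the fraction of simulation runs in which that person becomes infected. A minimal sketch under that assumed reading follows; the function name and the toy data are hypothetical, and the slide's 10,000 runs are replaced by four for illustration.

```python
def vulnerability(runs, nodes):
    """Fraction of runs in which each node was infected.

    runs: list of sets, one per simulation run, holding the infected nodes.
    nodes: iterable of all node ids, so never-infected nodes get 0.0.
    """
    counts = {v: 0 for v in nodes}
    for infected in runs:
        for v in infected:
            counts[v] += 1
    return {v: counts[v] / len(runs) for v in counts}

# Toy example with four runs over four people.
toy_runs = [{1, 2}, {1}, {1, 3}, set()]
toy_vulnerability = vulnerability(toy_runs, [1, 2, 3, 4])
```

A histogram of these per-person values would give a vulnerability distribution of the kind the slide compares across the two networks.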

34 Epidemic Simulation Results. R0 is calibrated to 1.35.

35 Delhi: National Capital Territory of India
Case study:
– Delhi (NCT-I): a representative South Asian city that was never studied before.
Statistics:
– 13.85 million people in 2001; 22 million in 2011
– Most populous metropolis: 2nd in India; 4th in the world
– 573 square miles, 9 regions (refer to the picture)
– The Yamuna river runs through the urban area.
Unique socio-cultural characteristics:
– Large slum area
– Tropical weather
– Environmental hygiene

36 Two Versions of Delhi Networks
The coarse network:
– Based on very limited data
– Generic methodology applicable to any region in the world
The detailed network:
– Requires household-level micro-sample data and other detailed data, not available for all countries
Improvement in results is expected:
– to evaluate the network generation model;
– to understand the importance of different levels of detail.

37 V1: Synthetic Population Generation
Input: joint distribution of age and gender of the population in Delhi (from the India census 2001)
Algorithm:
– Normalize the counts in the joint distribution of age and gender into a joint probability table.
– Create 13.85 million individuals one by one. For each individual: randomly select a cell c according to the cell probabilities in the table, then create a person with the age and gender corresponding to cell c.
Output: 13.85 million individuals, each associated with the disaggregate attributes of gender and age.
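The V1 procedure amounts to repeated sampling from a normalized joint table. A minimal sketch follows, assuming the table is given as counts keyed by (age group, gender). The counts, the key names and the function name are made up for illustration; they are not census figures.

```python
import random

def synthesize_population(joint_counts, n_people, rng):
    """Draw n_people individuals from a joint age-gender count table."""
    cells = list(joint_counts)
    total = sum(joint_counts.values())
    # Step 1: normalize counts into a joint probability table.
    probs = [joint_counts[c] / total for c in cells]
    # Step 2: draw one cell per individual and create the person.
    drawn = rng.choices(cells, weights=probs, k=n_people)
    return [{"age_group": age, "gender": gender} for age, gender in drawn]

# Hypothetical joint counts; a real run would use the census table
# and n_people = 13_850_000.
toy_counts = {("0-17", "M"): 30, ("0-17", "F"): 28,
              ("18-59", "M"): 50, ("18-59", "F"): 47,
              ("60+", "M"): 9, ("60+", "F"): 11}
people = synthesize_population(toy_counts, 1000, random.Random(42))
```

Because every individual is drawn independently, this version carries no household structure, which is the gap the V2 method addresses.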

38 Data Input
Demographic data: basic census data + India micro-sample
– India Census 2001
– Micro-sample for household structure: India Human Development Survey 2005 by the University of Maryland and the National Council of Applied Economic Research. For each household sample it gives hh size, hh head’s age, hh income, house type, animal care; for each individual in the hh it gives demographic details, religion, work, marital status, relationship to head, etc.
Activity data: Thane travel survey + residential contacts survey
– Activity templates from 2001 Household Travel Survey statistics for Thane, India, and 2005-2009 school attendance statistics from the UNESCO Institute of Statistics (UIS)
 o Activity templates are extracted with CART and assigned to the synthetic population with a decision tree.
– Survey on residential area contacts in India, conducted by NDSSL
 o Approximately 40% of adults in India do not travel to work. The survey focused on them.
 o Collected people’s age, gender, and contact durations/frequencies near their home.
Location data: MapMyIndia data
– Ward-wise statistics for population and households.
– Coordinates for locations such as schools, shopping centers, hotels, etc.
– Infrastructure such as roads, railway stations, land use, etc.
– Boundary for each city, town and ward.

39 V2: synthetic population creation method
Same methodology as we used for US populations.
Input:
– Total # of households
– Aggregate distributions of demographic properties from the Census: hh size, householder’s age
– Household micro-samples
Output: synthetic population with household structure. Each individual is assigned an age and gender.
Algorithm:
1. Estimate the joint distribution of household size and householder’s age:
 1) construct a joint table of hh size and householder’s age: fill in the # of samples for each cell
 2) multiply the total # of households by the distributions to calculate the marginal totals for the table
 3) run IPF to get a convergent joint table
 4) normalize: divide the count in each cell by (total # of samples) to get the probability for each cell (illustrated in the next slide)
2. Create the synthetic households and population:
 1) randomly select a cell with the probability in the joint table
 2) select a household sample h uniformly at random from all samples associated with that cell
 3) create a synthetic household H, so that H has the same members as h and each member in H has the same demographic attributes as the corresponding member in h
 4) repeat steps 2.1-2.3 until the # of synthetic households equals the total # of households from the Census

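40 IPF example
Target marginals: rows 20, 30, 35, 15; columns 35, 40, 25.
Start:
           35     40     25
    20      6      6      3
    30      8     10     10
    35      9     10      9
    15      3     14      8
Iteration 1, row adjustment (column sums 29.62, 39.61, 30.76):
    20   8.00   8.00   4.00
    30   8.57  10.71  10.71
    35  11.25  12.50  11.25
    15   1.80   8.40   4.80
Iteration 1, column adjustment (row sums 20.78, 29.65, 35.06, 14.51):
    20   9.45   8.08   3.25
    30  10.13  10.82   8.71
    35  13.29  12.62   9.14
    15   2.13   8.48   3.90
Iteration 2, row adjustment (column sums 34.81, 40.09, 25.10):
    20   9.10   7.77   3.13
    30  10.25  10.95   8.81
    35  13.27  12.60   9.13
    15   2.20   8.77   4.03
Iteration 2, column adjustment (row sums 20.02, 30.00, 35.01, 14.98):
    20   9.15   7.76   3.12
    30  10.30  10.92   8.77
    35  13.34  12.57   9.09
    15   2.21   8.75   4.02
Iteration 3, row adjustment (column sums 34.99, 40.00, 25.00):
    20   9.14   7.75   3.11
    30  10.30  10.92   8.78
    35  13.34  12.57   9.09
    15   2.21   8.76   4.02
Iteration 3, column adjustment (row sums 20.00, 30.00, 35.00, 15.00): finished
    20   9.14   7.75   3.11
    30  10.30  10.92   8.77
    35  13.34  12.57   9.09
    15   2.21   8.76   4.02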
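The alternating row and column adjustments in this example can be sketched in a few lines of Python. This is an illustrative implementation of iterative proportional fitting, not the authors' code: it uses the seed table and target marginals shown on the slide, and the function name and the fixed iteration count are assumptions.

```python
def ipf(seed, row_targets, col_targets, iterations=50):
    """Iterative proportional fitting on a 2-D table given as a list of rows."""
    table = [[float(x) for x in row] for row in seed]
    for _ in range(iterations):
        # Row adjustment: scale each row so it sums to its target.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            table[i] = [x * target / row_sum for x in table[i]]
        # Column adjustment: scale each column so it sums to its target.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
    return table

seed = [[6, 6, 3],
        [8, 10, 10],
        [9, 10, 9],
        [3, 14, 8]]
fitted = ipf(seed, row_targets=[20, 30, 35, 15], col_targets=[35, 40, 25])

# Normalizing the fitted table by its total gives a probability for each cell.
total = sum(sum(row) for row in fitted)
cell_probs = [[x / total for x in row] for row in fitted]
```

A production version would normally stop on a convergence tolerance rather than a fixed number of iterations; the slide's example stabilizes to two decimals after three.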

41 V2: household distribution – a snapshot. Households are distributed along real streets/community blocks. V2 avoids placing households on rivers, lakes, green land, etc. (V1 distributes them uniformly within each 1 mile × 1 mile block.)

42 Activity templates generation. Flowchart: Generating Activity Sequences based on Thane Survey for Delhi-V2 [Flowchart labels: Data sources; Demographics of the Thane sample population, UIS stat; Frequency distribution of reported activity sequences; Frequency distribution of trips (trip start time, trip length); Commute categories; Activity sequences sampling; sampling decision tree; Outcome; 1) Demographics, 2) Act template (activity sequence, activity duration)]

43 Generation of the Residential Network
Motivation of the residential contact network:
– Approximately 40% of adults in India do not travel to work. The network models interactions among them around their homes (within the residential area).
Survey data collected:
– Age and gender of people staying at home: node label
– Contact durations/frequencies of each person near their home: edge label/node degree
Formal question: generate a random network s.t.
– The degree distribution of a set of nodes is given
– The label of each node is given
– Assumption: the network tends to be homophilous (nodes with similar labels are connected with higher probability)
Method:
– Configuration model with the added feature of node homophily.
– Refer to the next slide for details.

44 Random Network Generation: configuration model with the added feature of node homophily
For each edge type in (long-dur, mid-dur, short-dur), do:
1. Initialize each node with a degree drawn i.i.d. from the degree distribution for its label (age/gender).
2. Form a list of “stubs”: connections of nodes that haven’t been matched with neighbors. Call it stubList.
3. Pick a starting node v0 randomly.
4. For each of v0’s stubs, choose an element v1 from the stubList as follows:
 1) v1 is chosen randomly from the stubList;
 2) if v1 is the same as v0 or already connected to v0, go to 4.1);
 3) with probability p (> 0.5), test whether v1 is similar to v0; if not, go to 4.1) and repeat the selection;
 4) create an edge between v0 and v1; its duration is computed randomly based on the edge type (long, mid or short duration).
Done.
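The stub-matching loop can be sketched for a single edge type as follows. This is a hedged illustration, not the authors' implementation: "similar" is assumed to mean identical labels, target degrees are passed in rather than drawn from a surveyed distribution, edge durations are omitted, and a retry cap (`max_tries`, a hypothetical parameter) is added so the loop always terminates.

```python
import random

def homophilous_configuration_model(labels, degrees, p, rng, max_tries=200):
    """Match stubs at random, rejecting dissimilar partners with probability p.

    labels: node -> label; degrees: node -> target degree.
    Returns a set of undirected edges, each a frozenset of two nodes.
    """
    remaining = dict(degrees)                 # unmatched stubs per node
    neighbours = {v: set() for v in degrees}
    order = list(degrees)
    rng.shuffle(order)                        # random choice of starting nodes
    for v0 in order:
        while remaining[v0] > 0:
            stub_list = [v for v, d in remaining.items() for _ in range(d)]
            for _ in range(max_tries):
                v1 = rng.choice(stub_list)
                if v1 == v0 or v1 in neighbours[v0]:
                    continue                  # no self-loops or repeated edges
                if rng.random() < p and labels[v1] != labels[v0]:
                    continue                  # homophily test failed, redraw
                break
            else:
                break                         # no partner found; leave stubs unmatched
            neighbours[v0].add(v1)
            neighbours[v1].add(v0)
            remaining[v0] -= 1
            remaining[v1] -= 1
    return {frozenset((a, b)) for a in neighbours for b in neighbours[a]}

# Hypothetical input: ten people in two label groups, two stubs each.
toy_labels = {i: ("adult" if i < 5 else "child") for i in range(10)}
toy_degrees = {i: 2 for i in range(10)}
toy_edges = homophilous_configuration_model(toy_labels, toy_degrees, 0.8,
                                            random.Random(3))
```

Because of the retry cap, some stubs may stay unmatched, so realized degrees can fall below their targets. Running the function once per edge type and attaching a random duration to each edge would follow the outer loop on the slide.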

