A High-Throughput Computational Approach to Environmental Health Study Based on CyberGIS

1 A High-Throughput Computational Approach to Environmental Health Study Based on CyberGIS Xun Shi 1, Anand Padmanabhan 2, and Shaowen Wang 2 1 Department of Geography, Dartmouth College 2 Department of Geography and Geographic Information Science, National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana Champaign September, 2013

2 Basic functionality of CyberGIS. Accessibility: Making GIS capabilities accessible to a large number of users for research and education through the online cyberGIS Gateway; Computational Capability: Embedding geospatial software capabilities into advanced cyberinfrastructure environments; Interoperability: Managing heterogeneous and distributed resources and services through GISolve middleware.

4 A computational approach to spatial epidemiology: (1) Disaggregate polygon-level location data using restricted and controlled Monte Carlo (RCMC). (2) Calculate local statistics, e.g., the intensity of disease occurrence using kernel ratio estimation (KRE). (3) Estimate the statistical significance of the intensity using unrestricted and controlled Monte Carlo (UCMC).

5 Disaggregate polygon-level location data [Figure: a polygon with 23 births with defects among 1202 births; legend: birth with defect(s), normal birth; background population shaded from high to low]

6 Restricted and Controlled Monte Carlo (RCMC) for Disaggregation Assign polygon-level addresses to random locations. The randomization is restricted by the smallest polygon to which a polygon-level address belongs. The randomization is controlled by the detailed background data. The randomization is repeated many times (Monte Carlo).
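The RCMC idea on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rcmc_disaggregate`, the pixel-list representation of a polygon, and the dictionary of background weights are all hypothetical. Restriction is modeled by sampling only from pixels inside the polygon; control is modeled by weighting each pixel by its background population.

```python
import random

def rcmc_disaggregate(n_cases, polygon_pixels, background, n_iterations, seed=0):
    """Place n_cases at random pixels of one polygon, repeated n_iterations times.

    polygon_pixels: list of (row, col) pixels inside the polygon (the restriction).
    background: dict mapping (row, col) -> population weight (the control).
    Both data structures are hypothetical stand-ins for real raster data.
    """
    rng = random.Random(seed)
    weights = [background.get(pixel, 0.0) for pixel in polygon_pixels]
    if sum(weights) <= 0:
        raise ValueError("polygon has no background population")
    # One realization per Monte Carlo iteration; pixels with zero weight are never drawn.
    return [rng.choices(polygon_pixels, weights=weights, k=n_cases)
            for _ in range(n_iterations)]

# Toy polygon of four pixels; pixel (0, 1) has no population.
pixels = [(0, 0), (0, 1), (1, 0), (1, 1)]
population = {(0, 0): 50.0, (0, 1): 0.0, (1, 0): 30.0, (1, 1): 20.0}
realizations = rcmc_disaggregate(23, pixels, population, n_iterations=100)
```

Each realization is one plausible set of point locations for the 23 polygon-level cases; the spread across realizations reflects the spatial uncertainty in the data.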

7 Advantages of RCMC: Allows analyses designed for individual/precise locations to be conducted. Maximizes the utilization of available spatial information. Explicitly evaluates the spatial uncertainty caused by the imprecision in the data.

8 Kernel ratio estimation (KRE) for Estimating Local Disease Intensity [Figure legend: birth with defect(s), normal birth] Essentially, calculate the ratio between cases and cohort at every location.
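A rough sketch of the ratio for one location, assuming a kernel of fixed radius centered on the location being estimated. The quadratic distance-decay function, the function names, and the point-list inputs are assumptions made for illustration; the slides do not state which kernel function is used.

```python
import math

def kernel_weight(distance, bandwidth):
    """Quadratic distance decay (an assumed kernel); zero at and beyond the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    u = distance / bandwidth
    return 1.0 - u * u

def kernel_ratio(site, cases, cohort, bandwidth):
    """Ratio of kernel-weighted cases to kernel-weighted cohort around one site."""
    def weighted_sum(points):
        return sum(kernel_weight(math.dist(site, p), bandwidth) for p in points)

    denominator = weighted_sum(cohort)
    if denominator == 0.0:
        return None  # no cohort within the bandwidth: treat the site as nodata
    return weighted_sum(cases) / denominator

# One case among two births at the same spot gives a local intensity of 0.5.
ratio = kernel_ratio((0.0, 0.0), [(0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)], bandwidth=1.0)
```

Repeating this for every pixel of a raster yields the intensity surface; the cohort list is expected to include the cases, since the cases are a subset of all births.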

9 Settings of KRE: fixed bandwidth vs. adaptive bandwidth; site-side kernel vs. case-side kernel

10 Types of KRE: site-side fixed bandwidth; case-side fixed bandwidth; site-side adaptive bandwidth; case-side adaptive bandwidth

11 Unrestricted and Controlled Monte Carlo (UCMC) for Estimating Statistical Significance [Flow diagram: RCMC followed by KRE; UCMC followed by KRE; the two results are compared to give a P-value]
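The comparison step can be illustrated with the standard rank-based Monte Carlo P-value for a single pixel. This is a generic formulation and an assumption here; the slides do not spell out the exact comparison rule, and the function name is hypothetical.

```python
def monte_carlo_p_value(observed, simulated):
    """Rank-based Monte Carlo P-value: share of values at least as large as observed.

    The observed value counts as one member of the reference set, so the
    smallest attainable P-value is 1 / (1 + number of simulations).
    """
    n_as_extreme = sum(1 for value in simulated if value >= observed)
    return (1 + n_as_extreme) / (1 + len(simulated))

# 99 simulated intensities for one pixel (synthetic numbers for illustration only).
simulated = [i / 10 for i in range(99)]
p_middle = monte_carlo_p_value(5.0, simulated)
p_extreme = monte_carlo_p_value(100.0, simulated)
```

With 99 simulations the P-value cannot fall below 0.01, which is one reason simulation counts of the form 99 or 999 are common.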

12 Epidemiological Confounding factors 2

AGE         Males count   Males rate   Females count   Females rate
>0<=29                3       0.0000               0         0.0000
>29<=39              28       0.0003              34         0.0003
>39<=49             200       0.0013             179         0.0011
>49<=54             232       0.0054             214         0.0050
>54<=59             308       0.0098             270         0.0086
>59<=64             433       0.0188             368         0.0153
>64<=69             602       0.0303             500         0.0235
>69<=74             661       0.0395             508         0.0248
>74<=150           1031       0.0403             896         0.0203
total              3498                         2969

13 [Figure: maps of the mean P-value and the standard deviation of the P-value, with hot spots marked; legend values 0.000, 0.006, 0.020, 1.000]

14 RCMC-UCMC-based Simulated Case-Control Study for Detecting Disease-Environment Association: case location from RCMC; control location from UCMC; environmental exposure

15 Spatial variation in disease-environment association: a map of P-value [Legend: P-value from 0.0001 to 1]

16 Computational Demand I: Number of local statistic computing (e.g., KRE) iterations in RCMC and UCMC. RCMC iterations: No. of strata × No. of iterations for cases × No. of iterations for cohort, e.g., 2 × 100 × 100 = 20,000. UCMC iterations: No. of strata × No. of iterations for simulation × No. of iterations for cohort, e.g., 2 × 99 × 100 = 19,800. Scenario: stratification is needed for addressing confounding factors; case data are at the polygon level; cohort data are at the polygon level; detailed background data are available.

17 Computational Demand II: Number of layer-on-layer comparisons for estimating P-value. No. of iterations for cases × No. of iterations for simulation × No. of iterations for cohort, e.g., 100 × 99 × 100 = 990,000.
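The counts on slides 16 and 17 are simple products, which a short script can reproduce. The function and variable names are illustrative only.

```python
def local_statistic_runs(n_strata, n_outer, n_cohort):
    """Number of local-statistic (e.g. KRE) runs: strata x outer loop x cohort loop."""
    return n_strata * n_outer * n_cohort

rcmc_runs = local_statistic_runs(2, 100, 100)  # 2 strata, 100 case and 100 cohort iterations
ucmc_runs = local_statistic_runs(2, 99, 100)   # 2 strata, 99 simulations, 100 cohort iterations
comparisons = 100 * 99 * 100                   # cases x simulations x cohort

print(rcmc_runs, ucmc_runs, comparisons)  # 20000 19800 990000
```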

18 Computational Demand III: Pixel-wise statistic computing. No. of pixels that are not “nodata” pixels, e.g., about 3 million in a 1652 × 2912 raster. Major operations, using case-side adaptive bandwidth KRE as an example: expand the kernel in a spinning way; accumulate the distance-decayed kernel value for each case encountered; accumulate the cohort value; check if the threshold is met.
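The four operations listed on this slide can be sketched for a single pixel as follows. This is a simplified illustration, not the authors' algorithm: it centers the growing window on the pixel being estimated and does not try to distinguish the site-side and case-side variants, it uses a linear distance decay (an assumption), and the dictionary-based grids and the function name are hypothetical.

```python
import math

def adaptive_kernel_ratio(center, case_grid, cohort_grid, threshold, max_radius):
    """Grow a window around `center` until it holds `threshold` cohort members.

    case_grid / cohort_grid: dicts mapping (row, col) -> count; pixels missing
    from cohort_grid are treated as nodata. Returns (ratio, radius) or None
    if the threshold is not met within max_radius.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    r0, c0 = center
    # Visit offsets in order of increasing distance, i.e. expanding outward.
    offsets = sorted(
        ((dr, dc)
         for dr in range(-max_radius, max_radius + 1)
         for dc in range(-max_radius, max_radius + 1)
         if math.hypot(dr, dc) <= max_radius),
        key=lambda offset: math.hypot(offset[0], offset[1]),
    )
    visited = []        # (distance, pixel) pairs inside the growing window
    cohort_total = 0.0
    for dr, dc in offsets:
        pixel = (r0 + dr, c0 + dc)
        if pixel not in cohort_grid:  # nodata pixel: skip it
            continue
        distance = math.hypot(dr, dc)
        visited.append((distance, pixel))
        cohort_total += cohort_grid[pixel]
        # Simplification: stop as soon as the threshold is met, possibly mid-ring.
        if cohort_total >= threshold:
            bandwidth = distance + 1.0  # keeps the outermost pixel's weight above zero
            cases = sum((1.0 - d / bandwidth) * case_grid.get(p, 0.0) for d, p in visited)
            cohort = sum((1.0 - d / bandwidth) * cohort_grid[p] for d, p in visited)
            return cases / cohort, distance
    return None

# Uniform toy raster: 10 births and 1 case per pixel, so the ratio is 0.1 everywhere.
cohort_grid = {(r, c): 10.0 for r in range(5) for c in range(5)}
case_grid = {(r, c): 1.0 for r in range(5) for c in range(5)}
result = adaptive_kernel_ratio((2, 2), case_grid, cohort_grid, threshold=50, max_radius=2)
```

Because the window grows until a cohort threshold is reached, sparsely populated areas get larger bandwidths than dense ones, and the cost per pixel varies across the raster.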

19 Computational Demand IV: Memory. Number of raster layers generated during the process: No. of RCMC iterations + No. of UCMC iterations + No. of parallel comparisons, e.g., 20,000 + 19,800 + 10,000 = 49,800. Memory: size of data type × No. of columns × No. of rows × No. of raster layers, e.g., 4 bytes × 1652 × 2912 × 49,800 = 550 gigabytes.
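The memory formula on this slide is easy to evaluate directly. Taken literally over the full 1652 × 2912 raster, the product comes to about 958 GB (about 892 GiB), which is larger than the 550 gigabytes quoted. Counting only the roughly 3 million non-nodata pixels mentioned on slide 18 gives about 598 GB (about 557 GiB), which is closer; which basis the authors used is not stated, so the second calculation is an assumption.

```python
def raster_stack_bytes(bytes_per_cell, n_cells, n_layers):
    """Bytes needed to hold n_layers rasters of n_cells cells each."""
    return bytes_per_cell * n_cells * n_layers

n_layers = 20_000 + 19_800 + 10_000  # RCMC + UCMC + parallel comparisons

full_raster = raster_stack_bytes(4, 1652 * 2912, n_layers)
valid_only = raster_stack_bytes(4, 3_000_000, n_layers)  # assumed: non-nodata pixels only

print(n_layers)                        # 49800
print(round(full_raster / 1e9))        # 958 (decimal gigabytes)
print(round(valid_only / 2**30))       # 557 (binary gigabytes, GiB)
```

Either way the stack is far beyond the 32 GB of RAM on the workstation described on the next slide, which motivates distributing the layers across cyberinfrastructure resources.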

20 Mapping birth defects for New Hampshire on an HP Z800 Workstation (2 Xeon CPUs at 3.07 GHz, 32 GB RAM): 1400 birth defect cases for 2003-2009; 99,000 births for 2003-2009; 2 age categories; 220 town polygons; 100-m resolution female population raster (1652 × 2912); 100 RCMC iterations for cases; 100 RCMC iterations for cohort; 99 UCMC iterations. Run time: 40 hours.

21 Migrating to cyberGIS: Set up infrastructure (new repository created in CyberGIS SVN; establish a development environment). Define the application interface using GISolve Open Service APIs. Build and deploy the code on cyberinfrastructure resources from SVN. Publish the application. Test application execution.

22 Computation Management through GISolve Open Service APIs: Compress the input into a single zip file and make it available at a Web-accessible location. Inputs to the program include files for point cases, zone cases, cohort, background, the zone file, and associated settings needed by the application. The URL of the zip file is the single parameter to the Open Service APIs. Code execution and input/output data are put into a computation sandbox. Simply run php job-submit.php and the GISolve middleware will take care of the rest.
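The packaging step described on this slide can be sketched with Python's standard `zipfile` module. The function name and the example file names are hypothetical; uploading the archive and submitting the job are left to the GISolve tooling, whose interface is not reproduced here.

```python
import zipfile
from pathlib import Path

def package_inputs(input_paths, archive_path):
    """Write the given input files into one zip archive, stored by base name."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, input_paths):
            archive.write(path, arcname=path.name)
    return Path(archive_path)

# Example call with hypothetical file names:
# package_inputs(["point_cases.csv", "zone_cases.csv", "cohort.csv",
#                 "background.tif", "zones.shp", "settings.txt"], "inputs.zip")
```

Storing files by base name keeps the archive flat, so the application sees the same layout regardless of where the inputs lived on the submitting machine.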

23 Parallel computing through CIGI local cluster and XSEDE: Original MFC (Windows) code was extracted and adapted to run in the Linux environment. Application code has been checked into the CyberGIS SVN for co-development and deployment on a CIGI local cluster and XSEDE. Developed a set of parallel and distributed computing strategies based on a spatial computational domain construct. Optimizing computational performance of these strategies.

24 Ongoing … Accessibility: Making GIS capabilities accessible to a large number of users for research and education through the online cyberGIS Gateway; Computational Capability: Embedding geospatial software capabilities into advanced cyberinfrastructure environments; Interoperability: Managing heterogeneous and distributed resources and services through GISolve middleware.

25 Designing and constructing a secure data transport protocol and tunnel …

26 Acknowledgements: National Science Foundation (OCI-1047916; XSEDE SES070004); NIH P20RO18787; NIH P20ES018175 and EPA RD83459901; Dartmouth Neukom/IQBS CompX Faculty Grant

27 Thanks! Questions …

