
Presentation on theme: "Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters Kalyan S. Perumalla, Ph.D. Senior R&D Manager Oak Ridge National Laboratory."— Presentation transcript:

1

2 Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters — Kalyan S. Perumalla, Ph.D., Senior R&D Manager, Oak Ridge National Laboratory; Adjunct Professor, Georgia Institute of Technology. SimuTools, Malaga, Spain, March 16, 2010

3 In a Nutshell — The B2R algorithm: a unified recursive solution for executing large-scale, fine-grained agent-based models on hierarchical hardware (multi-GPU, multi-core, network), addressing the latency-spectrum challenge to yield dramatic improvements in speed. (Managed by UT-Battelle for the U.S. Department of Energy — SimuTools10 Presentation, Perumalla, ORNL)

4 Outline — ABMS: definition, examples, larger sizes, demo, time-stepped, parallel style. Computational hierarchy: multi-GPU, multi-CPU, MPI, CUDA, access times, latency problem. B2R algorithm: basic idea, hierarchical framework, analysis equations, cubic nature, implementation. Performance study: CUDA, pthreads, MPI, Lens cluster, Game of Life, Leadership, R vs. improvement. Future work: multi-GPU per node, OpenCL, more benchmarks, unstructured inter-agent graphs.

5 ABMS: Motivating Demonstrations — Agent-Based Modeling and Simulation (ABMS). Demos: Game of Life, Afghan Leadership (GOLLDR).

6 GPU-based ABMS References — Examples: K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent-Directed Simulation Symposium, 2008. R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007.

7 Hierarchical GPU System Hardware

8 Computation Kernels on each GPU (e.g., CUDA threads) — The host initiates the launch of many SIMD threads; threads get scheduled in batches on GPU hardware. CUDA claims an extremely efficient thread-launch implementation: millions of CUDA threads at once.

9 GPU Memory Types (CUDA) — GPU memory comes in several flavors: registers, local memory, shared memory, constant memory, global memory, and texture memory. An important challenge is organizing the application to make the most effective use of this hierarchy.

10 GPU Communication Latencies (CUDA)

Memory Type      Speed                       Scope   Lifetime
Registers        Fastest (4 cycles)          Thread  Kernel
Shared Memory    Very fast (4–? cycles)      Block   Thread
Global Memory    100x slower (400+ cycles)   Device  Process
Local Memory     150x slower (600 cycles)    Block   Thread
Texture Memory   Fast (10s of cycles)        Device  Process
Constant Memory  Fairly fast (read-only)     Device  Process

11 CUDA + MPI — An economical cluster solution: affordable GPUs, each providing one-node CUDA, with MPI on gigabit Ethernet for inter-node communication. It is a memory speed-constrained system: inter-memory transfers can dominate runtime, and the runtime overhead can be severe. A way to tie CUDA and MPI together is needed — an algorithmic solution that overcomes the latency challenge.

12 Analogous Networked Multi-core System

13 Parallel Execution: Conventional Method — [Figure: the simulation grid is partitioned into a 3x3 array of blocks of side B, with Block i,j assigned to processor P i,j.]

14 Latency Challenge: Conventional Method — High latency between GPU and CPU memories (CUDA inter-memory data-transfer primitives). Very high latency across CPU memories (MPI communication for data transfers). The naïve method gives a very poor computation-to-communication ratio: slow-downs instead of speedups. A latency-resilient method is needed…

15 Our Solution: B2R Method — [Figure: the same 3x3 decomposition of B-sized blocks on processors P i,j, with each block extended by a ghost region of depth R on every side.]

16 B2R Algorithm

17 Total Runtime Cost: Analytical Form — At any level in the hierarchy, the total runtime F is given by: [equation not captured in the transcript]. Most interesting aspect: cubic in R!
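The slide's equation was not captured by the transcript. As a hedged sketch of how a cubic dependence on R can arise (the symbols $c_{\mathrm{comp}}$, $c_{\mathrm{bw}}$, $c_{\mathrm{lat}}$, and $S$ are illustrative, not the paper's notation): with a 2D block of side $B$ padded by a halo of depth $R$, each super-step performs $R$ stencil sweeps over the padded $(B+2R)^2$ region and one boundary exchange, so for $S$ super-steps

```latex
F(R) \;=\; S\left[\, c_{\mathrm{comp}}\, R\,(B+2R)^2 \;+\; c_{\mathrm{bw}}\,(B+2R) \;+\; c_{\mathrm{lat}} \,\right],
\qquad
R\,(B+2R)^2 \;=\; 4R^3 + 4BR^2 + B^2 R .
```

The compute term is cubic in $R$: larger $R$ buys fewer exchanges but an ever-growing volume of redundant halo computation.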

18 Implications of Being Cubic in R — [Plot: total execution time vs. R.] Benefits with B2R are not immediately seen for small R; in fact, there is degradation for small R! Dramatic improvement is possible beyond small R — our experiments confirm this trend! Too large is too bad, too: can't profit indefinitely!

19 Sub-division Across Levels — E.g., MPI to blocks to threads: MPI: R_m; Block: R_b; Thread: R_t.

20 Hierarchy and Recursive Use of B & R — B2R can be applied at all levels! A different R can be chosen at every level, e.g., R_b for the block-level R and R_t for the thread-level R. Simple constraints exist for the possible values of R — between R and B, and between the Rs at different levels (details in our paper). E.g., the CUDA hierarchy.

21 B2R Implementation within CUDA

22 Performance — Over 100× speedup with MPI+CUDA. Speedup is relative to the naïve method with no latency hiding.

23 Multi-GPU MPI+CUDA – Game of Life

24 Multi-core MPI+pthreads – Game of Life

25 Multi-core MPI+pthreads – Game of Life

26 Multi-core MPI+pthreads – Leadership

27 Summary — The B2R algorithm applies across heterogeneous, hierarchical platforms: deep GPU hierarchies and deep CPU multi-core systems. The cubic nature of the runtime's dependence on R is a remarkable aspect: a maximum and a minimum exist, and the optimal (minimum) runtime can be dramatically low. Results show clear performance improvement: up to 150× in the best (fine-grained) case.

28 Future Work — Generate cross-platform code (e.g., implement in OpenCL). Add to the CUDA-MPI levels (multi-GPU per node). Implement and test with more benchmarks (e.g., from existing ABMS suites such as NetLogo & Repast). Generalize to unstructured inter-agent graphs (e.g., social networks). Potential to apply to other domains (e.g., stencil computations).

29 Thank you! Questions? Additional material at our webpage: Discrete Computing Systems.

