Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters Kalyan S. Perumalla, Ph.D. Senior R&D Manager Oak Ridge National Laboratory.

Presentation transcript:

Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters. Kalyan S. Perumalla, Ph.D., Senior R&D Manager, Oak Ridge National Laboratory; Adjunct Professor, Georgia Institute of Technology. SimuTools, Malaga, Spain, March 16, 2010.

2. Managed by UT-Battelle for the U.S. Department of Energy. SimuTools10 Presentation – Perumalla (ORNL).
In a Nutshell
– B2R Algorithm
– Hierarchical hardware: multi-GPU, multi-core, network
– Agent-based model execution: large scale, fine-grained
– Challenges: latency spectrum
– Unified recursive solution
– Dramatic improvements in speed

3. Outline
– ABMS: definition, examples, larger sizes, demo, time stepped, parallel style
– Computational Hierarchy: multi-GPU, multi-CPU, MPI, CUDA, access times, latency problem
– B2R Algorithm: basic idea, hierarchical framework, analysis equations, cubic nature, implementation
– Performance Study: CUDA, Pthreads, MPI, Lens cluster, Game of Life, Leadership, R vs. improvement
– Future Work: multi-GPU per node, OpenCL, more benchmarks, unstructured inter-agent graphs

4. ABMS: Motivating Demonstrations
Agent Based Modeling and Simulation (ABMS)
– Game of Life
– Afghan Leadership
– GOLLDR

5. GPU-based ABMS References
Examples:
– K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent-Directed Simulation Symposium, 2008
– R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007

6. Hierarchical GPU System Hardware

7. Computation Kernels on each GPU
E.g., CUDA threads
– Host initiates launch of many SIMD threads
– Threads get scheduled in batches on GPU hardware
– CUDA claims an extremely efficient thread-launch implementation: millions of CUDA threads at once

8. GPU Memory Types (CUDA)
GPU memory comes in several flavors:
– Registers
– Local Memory
– Shared Memory
– Constant Memory
– Global Memory
– Texture Memory
An important challenge is organizing the application to make the most effective use of this hierarchy.

9. GPU Communication Latencies (CUDA)
Memory Type      Speed                      Scope   Lifetime  Size
Registers        Fastest (4 cycles)         Thread  Kernel
Shared Memory    Very fast (4 -? cycles)    Block   Thread
Global Memory    100x slower (400- cycles)  Device  Process
Local Memory     150x slower (600 cycles)   Block   Thread
Texture Memory   Fast (10s of cycles)       Device  Process
Constant Memory  Fairly fast (read-only)    Device  Process

10. CUDA + MPI
An economical cluster solution
– Affordable GPUs, each providing one-node CUDA
– MPI on gigabit Ethernet for inter-node communication
Memory speed-constrained system
– Inter-memory transfers can dominate runtime
– Runtime overhead can be severe
Need a way to tie CUDA and MPI
– Algorithmic solution needed
– Need to overcome the latency challenge

11. Analogous Networked Multi-core System

12. Parallel Execution: Conventional Method
[Figure: 3×3 grid of blocks; Block i,j is assigned to processor P i,j for i, j = 0..2; label B.]

13. Latency Challenge: Conventional Method
High latency between GPU and CPU memories
– CUDA inter-memory data transfer primitives
Very high latency across CPU memories
– MPI communication for data transfers
Naïve method gives a very poor computation-to-communication ratio
– Slow-downs instead of speedups
Need a latency-resilient method …

14. Our Solution: B2R Method
[Figure: 3×3 grid of blocks; Block i,j is assigned to processor P i,j for i, j = 0..2; labels B, R, R.]
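The idea behind the figure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes B2R means that each block of side B is padded with R ghost layers on every side (B + 2R in total), so that R time steps can run locally between exchanges. The function names (`step_torus`, `step_padded`, `run_b2r`, `glider_grid`) are hypothetical, and the "exchange" is modeled as a copy from a shared snapshot where a real cluster would use an MPI message or a CUDA memory transfer. Setting R = 1 corresponds to the conventional method, with one exchange per step.

```python
def life_rule(alive, neighbors):
    """Conway's Game of Life update for a single cell."""
    return 1 if neighbors == 3 or (alive and neighbors == 2) else 0


def step_torus(grid):
    """Reference: one step on the whole periodic grid, no decomposition."""
    n = len(grid)
    new = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            count = sum(grid[(i + di) % n][(j + dj) % n]
                        for di in (-1, 0, 1) for dj in (-1, 0, 1)
                        if (di, dj) != (0, 0))
            new[i][j] = life_rule(grid[i][j], count)
    return new


def step_padded(tile):
    """One local step on a padded tile. The outermost layer has no complete
    neighborhood, so it becomes stale; staleness moves inward one layer
    per step."""
    s = len(tile)
    new = [[0] * s for _ in range(s)]
    for i in range(1, s - 1):
        for j in range(1, s - 1):
            count = sum(tile[i + di][j + dj]
                        for di in (-1, 0, 1) for dj in (-1, 0, 1)
                        if (di, dj) != (0, 0))
            new[i][j] = life_rule(tile[i][j], count)
    return new


def run_b2r(grid, B, R, steps):
    """Advance `grid` by `steps`, exchanging ghost data only every R steps.
    Returns the final grid and the number of block-level exchanges."""
    n = len(grid)
    assert n % B == 0 and steps % R == 0  # simplifying assumptions
    blocks_per_side = n // B
    side = B + 2 * R
    exchanges = 0
    for _round in range(steps // R):
        nxt = [[0] * n for _ in range(n)]
        for bi in range(blocks_per_side):
            for bj in range(blocks_per_side):
                # "Exchange": fetch the block plus R ghost layers (periodic wrap).
                tile = [[grid[(bi * B - R + i) % n][(bj * B - R + j) % n]
                         for j in range(side)] for i in range(side)]
                exchanges += 1
                # R local steps; afterwards only the inner B x B cells are valid.
                for _local in range(R):
                    tile = step_padded(tile)
                for i in range(B):
                    for j in range(B):
                        nxt[bi * B + i][bj * B + j] = tile[R + i][R + j]
        grid = nxt
    return grid, exchanges


def glider_grid(n):
    """An n x n grid containing a single glider."""
    grid = [[0] * n for _ in range(n)]
    for i, j in ((0, 1), (1, 2), (2, 0), (2, 1), (2, 2)):
        grid[i][j] = 1
    return grid
```

On a 12×12 grid with B = 4 and 12 steps, `run_b2r` with R = 1 performs 108 block exchanges and with R = 3 performs 36, and both results equal 12 applications of `step_torus`. The price of fewer exchanges is redundant computation in the ghost layers.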

15. B2R Algorithm

16. Total Runtime Cost: Analytical Form
At any level in the hierarchy, the total runtime F is given by: [equation not preserved in this transcript]
Most interesting aspect: cubic in R!

17. Implications of Being Cubic in R
[Plot: Total Execution Time versus R.]
Benefits with B2R are not immediately seen for small R
– In fact, degradation for small R!
Dramatic improvement is possible after small R
– Our experiments confirm this trend!
Too large is too bad too
– Can't profit indefinitely!
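The trade-off behind "too large is too bad too" can be illustrated with a toy cost model. This is not the runtime equation from the talk, which is not preserved in this transcript, and it does not reproduce the degradation at small R. It only assumes a 2-D block of side B padded with R ghost layers, a fixed latency per exchange, and a shrinking update region over the R local steps. All names and cost parameters are hypothetical. In this toy model the redundant work per round grows cubically in R, while the number of exchanges falls as 1/R.

```python
def round_cost(B, R, latency, per_cell_transfer, per_cell_compute):
    """Cost of one round: one exchange followed by R local steps."""
    # Ghost cells fetched around a B x B block with R layers per side.
    ghost_cells = (B + 2 * R) ** 2 - B ** 2
    # The m-th local step counted from the end updates a (B + 2m)^2 region;
    # summed over m = 0..R-1 this grows cubically in R.
    updated_cells = sum((B + 2 * m) ** 2 for m in range(R))
    return (latency
            + per_cell_transfer * ghost_cells
            + per_cell_compute * updated_cells)


def total_cost(B, R, steps, latency, per_cell_transfer, per_cell_compute):
    """Total cost of `steps` time steps when exchanging every R steps."""
    rounds = -(-steps // R)  # ceiling division
    return rounds * round_cost(B, R, latency,
                               per_cell_transfer, per_cell_compute)


def best_r(B, steps, candidates, **costs):
    """Candidate R with the lowest modeled total cost."""
    return min(candidates, key=lambda R: total_cost(B, R, steps, **costs))
```

With assumed values such as `B=64`, `steps=1200`, `latency=1_000_000`, `per_cell_transfer=10`, and `per_cell_compute=1`, the modeled cost at R = 10 is lower than at both R = 1 and R = 600, so the best R lies strictly between the extremes.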

18. Sub-division Across Levels
E.g., MPI to blocks to threads
– MPI: R_m
– Block: R_b
– Thread: R_t

19. Hierarchy and Recursive Use of B & R
B2R can be applied at all levels!
A different R can be chosen at every level, e.g.
– R_b for block-level R
– R_t for thread-level R
Simple constraints exist for possible values of R
– Between R and B
– Between Rs at different levels
– Details in our paper
[Figure: e.g., CUDA hierarchy.]

20. B2R Implementation within CUDA

21. Performance
Over 100× speedup with MPI+CUDA
– Speedup is relative to the naïve method with no latency hiding

22. Multi-GPU MPI+CUDA: Game of Life

23. Multi-core MPI+Pthreads: Game of Life

24. Multi-core MPI+Pthreads: Game of Life

25. Multi-core MPI+Pthreads: Leadership

26. Summary
The B2R algorithm applies across heterogeneous, hierarchical platforms
– Deep GPU hierarchies
– Deep CPU multi-core systems
The cubic nature of the runtime dependence on R is a remarkable aspect
– A maximum and a minimum exist
– The optimal (minimum) runtime can be dramatically low
Results show clear performance improvement
– Up to 150x in the best case (fine-grained)

27. Future Work
Generate cross-platform code
– E.g., implement in OpenCL
Add to CUDA-MPI levels
– Multi-GPU per node
Implement and test with more benchmarks
– E.g., from existing ABMS suites NetLogo & Repast
Generalize to unstructured inter-agent graphs
– E.g., social networks
Potential to apply to other domains
– E.g., stencil computations

Thank you! Questions? Additional material at our webpage: Discrete Computing Systems